Daily Reading 20200520

DRIT++: Diverse Image-to-Image Translation via Disentangled Representations

posted on: ECCV2018

This paper is similar to MUNIT in that it treats image-to-image translation as a one-to-many multimodal mapping with unpaired data. To generate diverse outputs from unpaired training data, the authors propose a disentangled representation framework in which input images are embedded into two spaces: a domain-invariant content space that captures information shared across domains, and a domain-specific attribute space.

1) To achieve representation disentanglement, they apply two strategies: weight sharing and a content discriminator. Weight sharing, similar to UNIT, ties the weights of the last layer of the content encoders and the first layer of the generators. To further ensure that content representations from both domains encode the same information, they introduce a content discriminator D_c and a content adversarial loss: D_c tries to distinguish the domain membership of content features, while the content encoders try to fool D_c.
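
Below is a minimal PyTorch-style sketch of this content adversarial loss, assuming the content codes z_x = E_c^X(x) and z_y = E_c^Y(y) are feature maps and D_c is a small binary classifier. All module names, shapes, and the BCE-with-0.5-target trick for the encoder step are illustrative placeholders, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentDiscriminator(nn.Module):
    """Classifies whether a content feature map came from domain X or domain Y."""
    def __init__(self, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 1),  # logit: positive -> "domain X", negative -> "domain Y"
        )

    def forward(self, z_c):
        return self.net(z_c)

def content_adv_losses(d_c, z_x, z_y):
    """Content adversarial losses on z_x = E_c^X(x) and z_y = E_c^Y(y) (a sketch)."""
    # Discriminator step: classify the domain membership of detached content codes.
    logit_x = d_c(z_x.detach())
    logit_y = d_c(z_y.detach())
    loss_d = F.binary_cross_entropy_with_logits(logit_x, torch.ones_like(logit_x)) + \
             F.binary_cross_entropy_with_logits(logit_y, torch.zeros_like(logit_y))

    # Encoder step: fool D_c by making both domains equally likely (soft target 0.5).
    logit_x_e = d_c(z_x)
    logit_y_e = d_c(z_y)
    loss_enc = F.binary_cross_entropy_with_logits(logit_x_e, torch.full_like(logit_x_e, 0.5)) + \
               F.binary_cross_entropy_with_logits(logit_y_e, torch.full_like(logit_y_e, 0.5))
    return loss_d, loss_enc
```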

2) To handle unpaired training data, they propose a cross-cycle consistency, which consists of two I2I translation stages: a forward and a backward translation. In short, the attribute representations are exchanged twice so that the original images can be reconstructed.
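
A minimal sketch of this cross-cycle consistency is given below, assuming content/attribute encoders E_c, E_a and generators G for each domain; the function and variable names are placeholders and the loss weights are omitted.

```python
import torch.nn.functional as F

def cross_cycle_loss(x, y, E_c_X, E_a_X, E_c_Y, E_a_Y, G_X, G_Y):
    """Sketch: swap attributes once to translate, swap again to reconstruct."""
    # Encode both unpaired images into content (c) and attribute (a) codes.
    c_x, a_x = E_c_X(x), E_a_X(x)
    c_y, a_y = E_c_Y(y), E_a_Y(y)

    # Forward translation: exchange attribute codes across domains.
    u = G_X(c_y, a_x)   # y's content rendered with x's attribute -> domain X
    v = G_Y(c_x, a_y)   # x's content rendered with y's attribute -> domain Y

    # Backward translation: encode the translated images and exchange back.
    c_u, a_u = E_c_X(u), E_a_X(u)
    c_v, a_v = E_c_Y(v), E_a_Y(v)
    x_rec = G_X(c_v, a_u)
    y_rec = G_Y(c_u, a_v)

    # L1 reconstruction of the original inputs closes the cycle.
    return F.l1_loss(x_rec, x) + F.l1_loss(y_rec, y)
```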

There are several other loss functions: 1) a self-reconstruction loss, which reconstructs the original image by encoding and decoding within the same domain; 2) a domain adversarial loss, which encourages G to generate realistic images in each domain; 3) a latent regression loss, inspired by BicycleGAN, which enforces reconstruction of the latent attribute vector; 4) a KL loss, which aligns the attribute representation with a prior Gaussian distribution; and 5) a mode seeking regularization, which improves diversity.
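
As a rough illustration of two of these terms, here is a sketch of the latent regression loss and the mode seeking regularization. G_Y, E_a_Y, c_x, and z are placeholder names, and the simple L1 distances used here are my assumption rather than the exact formulation in the paper.

```python
import torch

def latent_regression_loss(G_Y, E_a_Y, c_x, z):
    """BicycleGAN-style: a random attribute code z drawn from the prior should be
    recoverable from the image generated with it."""
    fake_y = G_Y(c_x, z)
    z_rec = E_a_Y(fake_y)
    return torch.mean(torch.abs(z_rec - z))

def mode_seeking_reg(G, c, z1, z2, eps=1e-5):
    """Mode seeking regularization (Mao et al., 2019), as I understand it:
    push two different latent codes to yield visibly different images by
    maximizing the image-distance / latent-distance ratio (minimize its inverse)."""
    d_img = torch.mean(torch.abs(G(c, z1) - G(c, z2)))
    d_z = torch.mean(torch.abs(z1 - z2))
    return d_z / (d_img + eps)
```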

For metrics, they adopt FID to evaluate image quality, LPIPS to evaluate diversity, and JSD & NDB to measure the similarity between the distributions of real and generated images. Their model also generalizes to multi-domain and high-resolution I2I translation.
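
For reference, LPIPS-based diversity is typically measured as the average pairwise perceptual distance among multiple outputs generated from the same input. A sketch using the `lpips` PyTorch package is below; the AlexNet backbone and the exact sampling protocol are assumptions, not necessarily the paper's evaluation setup.

```python
import itertools
import lpips  # pip install lpips

# AlexNet-based LPIPS as the perceptual distance (backbone choice is an assumption).
loss_fn = lpips.LPIPS(net='alex')

def lpips_diversity(samples):
    """Average pairwise LPIPS distance among outputs translated from one input.

    `samples`: list of at least two image tensors of shape (1, 3, H, W) in [-1, 1].
    A higher mean distance indicates more diverse translations.
    """
    dists = [loss_fn(a, b).item() for a, b in itertools.combinations(samples, 2)]
    return sum(dists) / len(dists)
```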

Pros:

  1. Although embedding images into content and attribute spaces is similar to MUNIT, their content adversarial loss explicitly pushes the content space to carry as little domain-specific information as possible, which is a more principled constraint than MUNIT's.

  2. The cross-cycle consistency loss addresses the absence of paired training data in a cyclic way.

  3. Their experiments are comprehensive and convincing.

Cons:

  1. Their user study gives no detailed information about the number of users or test images, which makes it less convincing. Also, the image quality is much worse than CycleGAN's.