Image-to-image translation for cross-domain disentanglement
posted on: NIPS 2018
In this paper, they combine image translation with representation disentanglement and propose the concept of cross-domain disentanglement. They separate the latent representation into shared and exclusive parts: the shared part contains information common to both domains, while the exclusive part contains only the factors of variation particular to each domain. Their network consists of image translation modules and cross-domain auto-encoders. The image translation modules follow an encoder-decoder architecture.
- Given an input image, the encoder outputs a latent representation, which is further separated into a shared part S and an exclusive part E. To guarantee correct disentanglement, they apply two techniques. 1) Based on the intuition that reconstructing Y-domain images from Ex should be impossible, they introduce a small decoder and apply a gradient reversal layer (GRL) at its first layers; this adversarial setup forces Ex to contain only exclusive features (see the sketch after this list). 2) To constrain the shared features of the two domains to contain similar information, they apply an L1 loss between them and add noise so that information cannot be hidden in small signal differences.
- During disentangling, higher-resolution features contain both shared and exclusive information. They therefore relax the bottleneck by enlarging the latent representation when encoding the shared part, while the exclusive part is produced by ordinary fully connected layers.
- The decoder takes as input the shared representation and a random noise vector that serves as the exclusive part. To enforce that the exclusive features and the noise follow similar distributions, they adopt a discriminator that pushes the distribution of Ex toward N(0,1). To prevent the noise from being ignored, they also reconstruct the latent representation with an L1 loss.
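The GRL in technique 1) is the standard trick from domain-adversarial training: identity in the forward pass, negated gradient in the backward pass. Below is a minimal PyTorch sketch of the GRL together with a toy encoder/adversarial-decoder pair illustrating the shared/exclusive split; all module names and layer sizes here are hypothetical choices of mine, not the paper's.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity forward, -lambda * grad backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class Encoder(nn.Module):
    """Toy encoder (hypothetical sizes, MNIST-like 1x28x28 input).
    The shared part S keeps a spatial feature map (a wider bottleneck);
    the exclusive part E is a small vector from a fully connected head."""
    def __init__(self, in_ch=1, ex_dim=8):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )  # 28x28 -> 7x7
        self.shared_head = nn.Conv2d(64, 64, 3, padding=1)  # S: 64x7x7
        self.ex_head = nn.Linear(64 * 7 * 7, ex_dim)        # E: ex_dim

    def forward(self, x):
        h = self.trunk(x)
        return self.shared_head(h), self.ex_head(h.flatten(1))

class AdvDecoder(nn.Module):
    """Small decoder that tries to reconstruct the *other* domain's image
    from E alone; the GRL in front reverses its gradient so the encoder
    is pushed to remove shared content from E."""
    def __init__(self, ex_dim=8, out_ch=1):
        super().__init__()
        self.fc = nn.Linear(ex_dim, 64 * 7 * 7)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, E):
        h = self.fc(grad_reverse(E)).view(-1, 64, 7, 7)
        return self.deconv(h)
```

The adversarial decoder minimizes its reconstruction loss with respect to its own weights, but the reversed gradient pushes the encoder in the opposite direction, stripping shared content out of E.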
The cross-domain auto-encoder takes the exchanged shared part and the original exclusive part as input and reconstructs the original image under an L1 loss, as sketched below. This gives the encoder an extra incentive to put domain-specific properties into the exclusive representation.
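The remaining objectives are simple L1 terms. Here is a hedged sketch under my own hypothetical signatures (e.g. decoder_x takes a shared code and an exclusive code); the image-level GAN losses and the discriminator matching Ex to N(0,1) are omitted.

```python
import torch
import torch.nn.functional as F

def shared_l1(Sx, Sy):
    # pull the shared codes of corresponding X/Y images together
    return F.l1_loss(Sx, Sy)

def add_noise(S, sigma=0.1):
    # noise injected into the shared code so the encoder cannot smuggle
    # exclusive information into tiny differences in S
    # (sigma is a hypothetical scale, not taken from the paper)
    return S + sigma * torch.randn_like(S)

def latent_recon_loss(encoder, fake_y, S_in, z_in):
    # re-encode a generated image and require it to reproduce the codes
    # it was generated from, so the noise z cannot simply be ignored
    S_hat, E_hat = encoder(fake_y)
    return F.l1_loss(S_hat, S_in) + F.l1_loss(E_hat, z_in)

def cross_domain_recon_loss(decoder_x, Sy, Ex, x):
    # reconstruct x from the *other* domain's shared code and x's own
    # exclusive code; this pushes domain-specific content into Ex
    return F.l1_loss(decoder_x(Sy, Ex), x)
```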
Their experiments are conducted mainly on MNIST variations. 1) Without any labels, their model can generate diverse outputs that belong to the other domain. 2) Given a reference image from the other domain, it can also perform domain-specific translation by exchanging the exclusive parts. 3) By interpolating the exclusive and shared representations, it can generate smoothly transformed images. 4) By computing Euclidean distances between features, it can perform cross-domain retrieval both semantically and stylistically (a retrieval sketch follows). All experiments demonstrate the effectiveness of their cross-domain disentanglement.
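For the retrieval experiment in 4), a minimal sketch of distance-based retrieval (the function name is hypothetical): comparing shared features retrieves semantically similar images across domains, while comparing exclusive features retrieves stylistically similar ones.

```python
import torch

def retrieve(query_feats, gallery_feats, k=5):
    # Euclidean nearest neighbours; flatten spatial feature maps first.
    # Pass shared features for semantic retrieval, exclusive features
    # for stylistic retrieval.
    d = torch.cdist(query_feats.flatten(1), gallery_feats.flatten(1))
    return d.topk(k, largest=False).indices  # (num_queries, k)
```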
Pros:
- Though their model is trained only on the simple MNIST variations, the approach could be applied to bidirectional multimodal image translation on more complex datasets.
- It is not constrained to cross-domain spatial correspondence the way pix2pix and BicycleGAN are. Their disentanglement is general and practical.
Cons:
- Though the application of the GRL to domain disentanglement is novel, the results of their ablation study indicate that it is not as useful as their analysis suggests.