Daily Reading 20200519

Pluralistic Image Completion

posted on: CVPR2019

In this paper, they try to solve image completion in a pluralistic way. That is, given a masked input, the model can generate multiple, diverse plausible outputs, which is quite different from previous methods that could only generate a single output. To have a distribution from which to sample the missing foreground, they combine ideas from CVAE and instance blind, after explaining why using either directly is infeasible: CVAE learns a low-variance prior, and instance blind training is unstable. Therefore, they propose a network with two parallel training paths: 1) the reconstructive path, similar to instance blind, tries to reconstruct the original image and obtain a smooth prior distribution for the missing foreground; 2) the generative path predicts the latent prior distribution for the missing regions conditioned on the visible pixels. During testing, only the generative path is used to infer outputs. The network is based on LS-GAN. For the loss function, they use distribution regularization (KL divergence), an appearance matching loss (applied to the whole image for the reconstructive path, but only to the missing foreground for the generative path), and an adversarial loss.
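To make the three loss terms concrete, here is a minimal PyTorch-style sketch of how they could be combined. The tensor names, masking convention, and loss weights are my own assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F


def kl_divergence(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0,
        dim=1,
    ).mean()


def total_generator_loss(pred_rec, pred_gen, target, mask,
                         mu_q, logvar_q, mu_p, logvar_p,
                         d_out_rec, d_out_gen,
                         w_kl=1.0, w_app=1.0, w_adv=1.0):
    """Hypothetical combination of the three loss terms (weights are placeholders).

    mask is 1 inside the missing foreground region and 0 for visible pixels.
    d_out_* are discriminator scores on the two generated images (LS-GAN style).
    """
    # Distribution regularization: pull the posterior toward the conditional prior.
    loss_kl = kl_divergence(mu_q, logvar_q, mu_p, logvar_p)

    # Appearance matching: whole image for the reconstructive path,
    # missing foreground only for the generative path.
    loss_app = F.l1_loss(pred_rec, target) \
             + F.l1_loss(pred_gen * mask, target * mask)

    # LS-GAN generator loss: push discriminator outputs toward 1 ("real").
    loss_adv = F.mse_loss(d_out_rec, torch.ones_like(d_out_rec)) \
             + F.mse_loss(d_out_gen, torch.ones_like(d_out_gen))

    return w_kl * loss_kl + w_app * loss_app + w_adv * loss_adv
```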

Note that they also present a short+long term attention layer, a combination of a self-attention layer and contextual flow. The short-term attention is placed within the decoder to harness distant spatial context, while the long-term attention operates between the encoder and decoder to capture feature-feature context.
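For reference, here is a simplified sketch of the short-term (self-attention) half, modeled after a standard non-local/SAGAN-style block rather than the paper's exact layer; the class name and channel-reduction factor are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShortTermAttention(nn.Module):
    """Self-attention over decoder features (simplified, non-local-block style)."""

    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned blending weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c//8)
        k = self.key(x).flatten(2)                      # (b, c//8, hw)
        attn = F.softmax(torch.bmm(q, k), dim=-1)       # (b, hw, hw) attention map
        v = self.value(x).flatten(2)                    # (b, c, hw)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                     # residual connection
```

The long-term attention would be analogous, except the keys and values would be drawn from encoder features while the queries come from decoder features, which is how the feature-feature context between encoder and decoder is captured.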

Pros:

  1. The mathematical derivations and explanations are clear and thorough (though I don’t fully understand some of them).

  2. Moving from one-to-one to multi-modal/one-to-many mappings is the trend in image-to-image translation and image inpainting.

  3. The short+long term attention layer is a good way to attend to the finer-grained features in the encoder or the more semantically generative features in the decoder.