Wenyan Cong



Daily Reading 20200525

Posted on 2020-05-25 | In Paper Note

Image-to-image translation for cross-domain disentanglement

posted on: NIPS2018

In this paper, they combine image translation and domain disentanglement and propose the concept of cross-domain disentanglement. As in related disentanglement works, they separate the latent representation into a shared part and an exclusive part: the shared part contains information common to both domains, while the exclusive part contains only the factors of variation particular to each domain. Their network contains image translation modules and cross-domain auto-encoders. The image translation modules follow an encoder-decoder architecture.

  • Given an input image, the encoder outputs a latent representation, which is further separated into a shared part S and an exclusive part E. To guarantee correct disentanglement, they use two mechanisms. 1) Based on the intuition that reconstructing domain-Y images from Ex should be impossible, they attach a small decoder with a gradient reversal layer (GRL) at its first layers; the resulting adversarial training forces Ex to contain exclusive features only (see the sketch after this list). 2) To constrain the shared features of both domains to contain similar information, they apply an L1 loss between them and add noise to avoid small signals.

  • During disentangling, since higher-resolution features contain both shared and exclusive information, they reduce the bottleneck by increasing the size of the latent representation when encoding the shared part, while the exclusive part is obtained with fully connected layers as usual.

  • The decoder takes as input the shared representation and a random noise vector that serves as the exclusive part. To enforce that the exclusive features and the noise have a similar distribution, they adopt a discriminator that pushes the distribution of Ex toward N(0,1). To avoid the noise being ignored, they also reconstruct the latent representation with an L1 loss.
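A minimal sketch of the gradient reversal layer mentioned above, written in PyTorch with my own naming (this is not the authors' code): the layer is an identity in the forward pass and flips the gradient sign in the backward pass, so the small decoder trained on top of Ex adversarially removes shared information from it.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)          # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient so the small decoder is trained
        # adversarially against the exclusive features Ex.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```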

The cross-domain auto-encoders take the exchanged shared part and the original exclusive part as input and reconstruct the original image with an L1 loss (a sketch follows). This offers an extra incentive for the encoder to put domain-specific properties in the exclusive representation.
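A sketch of that reconstruction objective, assuming hypothetical per-domain encoder/decoder modules (enc_x, enc_y, dec_x, dec_y) that split an image into shared and exclusive codes; this only illustrates the idea, not the paper's exact implementation.

```python
import torch.nn.functional as F

def cross_domain_ae_loss(x, y, enc_x, enc_y, dec_x, dec_y):
    s_x, e_x = enc_x(x)          # shared / exclusive codes for domain X
    s_y, e_y = enc_y(y)          # shared / exclusive codes for domain Y
    x_rec = dec_x(s_y, e_x)      # exchanged shared part + own exclusive part
    y_rec = dec_y(s_x, e_y)
    return F.l1_loss(x_rec, x) + F.l1_loss(y_rec, y)
```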

Their experiments are conducted mainly on MNIST variations. 1) Without any labels, the model can generate diverse outputs that belong to the other domain. 2) Given a reference from the other domain, it can also perform domain-specific translation by exchanging the exclusive parts. 3) By interpolating the exclusive and shared representations, it can generate smoothly transformed images. 4) Using the Euclidean distance between features, it can perform cross-domain retrieval both semantically and stylistically. All experiments demonstrate the effectiveness of their cross-domain disentanglement.

Pros:

  1. Though their model is trained on the simple MNIST variations, it can be applied to bidirectional multimodal image translation on more complex datasets.

  2. It is not constrained to cross-domain spatial correspondence the way pix2pix and BicycleGAN are. Their disentanglement is general and practical.

Cons:

  1. Though the application of GRL to domain disentanglement is novel, the results of their ablation study indicate that it is not as useful as their analysis suggests.

Daily Reading 20200522

Posted on 2020-05-22 | In Paper Note

Shapes and Context: In-the-Wild Image Synthesis & Manipulation

posted on: CVPR2019

In the fields of image synthesis and image manipulation, recent works are mainly learning-based parametric methods. In this paper, they propose a data-driven model with no learning for interactively synthesizing in-the-wild images from semantic label input masks. Their model is controllable and interpretable and proceeds in stages: (1) global scene context: filter the list of training examples using the labels and the pixel overlap of labels; (2) instance shape consistency: search boundaries and extract shapes with similar context; (3) local part consistency: a finer-grained constraint for what the global shape cannot capture; (4) pixel-level consistency: similar to part consistency, fill the remaining holes left after (2) and (3). A rough sketch of the first stage follows.
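The candidate filtering in stage (1) could look roughly like the following; this is only my reading of the description above, using illustrative names and numpy label masks, not the authors' implementation. Stages (2)-(4) are only indicated in comments.

```python
import numpy as np

def rank_candidates(query_mask, training_masks):
    """Keep training examples whose label set matches the query,
    ranked by pixel-wise label agreement with the query mask."""
    query_labels = set(np.unique(query_mask))
    scores = []
    for idx, mask in enumerate(training_masks):
        if set(np.unique(mask)) != query_labels:
            continue                              # global scene context filter
        agreement = (mask == query_mask).sum() / query_mask.size
        scores.append((agreement, idx))
    # (2) instance shape consistency, (3) local part consistency and
    # (4) pixel-level consistency would then composite shapes, parts and
    # pixels from the top-ranked candidates into the output canvas.
    return [idx for _, idx in sorted(scores, reverse=True)]
```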

In their quantitative comparison, they measure image realism with FID scores and image quality by comparing the segmentation outputs of the synthesized images against the originals. Compared with pix2pix and pix2pix-HD, their method generates images that are both high-quality and realistic. In their qualitative comparison, a user study indicates that their results are preferred over pix2pix. The method can also generate diverse outputs without additional effort.

Pros:

  1. Compared to parametric methods, their work has notable advantages: 1) it is not limited to a specific training dataset and distribution; 2) it performs better with more data, whereas parametric methods tend to perform worse; 3) it can generate arbitrarily high-resolution images; 4) it can generate an exponentially large set of viable synthesized images; 5) it is highly controllable and interpretable.

Cons:

  1. The synthesized images have good structural and semantic consistency, but the appearance of different instances is not consistent, which makes them visually unpleasant.

Daily Reading 20200521

Posted on 2020-05-21 | In Paper Note

GeneGAN: Learning Object Transfiguration and Attribute Subspace from Unpaired Data

posted on: BMVC2017

GeneGAN proposes a deterministic generative model that learns disentangled attribute subspaces from weakly labeled data via adversarial training. Fed with two unpaired sets of images (with and without an object), GeneGAN uses an encoder to split an image into two parts: an object-attribute subspace and a background subspace. The object attribute may be eyeglasses, a smile, a hairstyle, or a lighting condition. By swapping the object feature fed to the decoder, GeneGAN can generate different styles of the same person, for example turning a smiling face into a non-smiling one (see the sketch below). Besides the reconstruction loss and the usual adversarial loss, they also present a nulling loss to disentangle object features from background features, and a parallelogram loss that enforces a constraint in image pixel values between the children objects and the parent objects. Their experiments are conducted on aligned faces.
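A minimal sketch of the swap and the nulling idea as I understand them, with illustrative encoder/decoder modules passed in as arguments (not the authors' code): the attribute code of the "without object" image is pushed toward zero, and swapping attribute codes transfers the object between the two inputs.

```python
def swap_and_null(x_with, x_without, encoder, decoder):
    bg_a, attr_a = encoder(x_with)       # image that has the attribute (e.g. a smile)
    bg_b, attr_b = encoder(x_without)    # image without the attribute

    # nulling loss: the "without" branch should carry no object information
    null_loss = attr_b.abs().mean()

    # swap the attribute codes to transfer the object between the two faces
    x_a_to_b = decoder(bg_a, attr_b)     # attribute removed
    x_b_to_a = decoder(bg_b, attr_a)     # attribute added
    return x_a_to_b, x_b_to_a, null_loss
```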

Pros:

  1. Compared with CycleGAN, GeneGAN is simpler, with only one generator and one discriminator, and it achieves good performance on face attribute transfiguration on face images from the CelebA and Multi-PIE databases.

  2. The way of learning from weakly labeled unpaired data is inspiring: two unpaired sets of images, with and without some object, are effectively a 0/1 labeling over all training data.

Cons:

  1. The constraints they present hold only approximately, so there is potential leakage of information between the object and background parts.

  2. The object feature is not clearly defined. For eyeglasses it can be the color, type, size, etc., while for hairstyle it mainly captures the hair direction rather than any color information. Perhaps this simply follows previous works, but I still find it puzzling.

Daily Reading 20200520

Posted on 2020-05-20 | In Paper Note

DRIT++: Diverse Image-to-Image Translation via Disentangled Representations

posted on: ECCV2018

This paper is somewhat like MUNIT in that it treats image translation as a one-to-many multimodal mapping with unpaired data. To generate diverse outputs from unpaired training data, they propose a disentangled representation framework in which input images are embedded into two spaces: a domain-invariant content space capturing information shared across domains, and a domain-specific attribute space.

1) To achieve representation disentanglement, they apply two strategies: weight sharing and a content discriminator. Weight sharing, similar to UNIT, shares the weights between the last layer of the content encoders and the first layer of the generators. To further constrain the content representations of the two domains to encode the same information, they propose a content discriminator with a content adversarial loss: D_c tries to distinguish which domain a content feature comes from, while the content encoders try to fool D_c.

2) To handle unpaired training data, they propose a cross-cycle consistency that consists of two I2I translations, a forward and a backward translation. In short, they exchange the attribute representations twice and try to reconstruct the original images (a sketch follows).
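A sketch of the cross-cycle consistency as described above, with illustrative content encoder, attribute encoder, and generator modules passed as arguments (DRIT uses domain-specific encoders; I collapse them here for brevity).

```python
import torch.nn.functional as F

def cross_cycle_loss(x, y, E_c, E_a, G_x, G_y):
    c_x, a_x = E_c(x), E_a(x)
    c_y, a_y = E_c(y), E_a(y)

    # forward translation: exchange the attribute codes once
    u = G_x(c_y, a_x)            # domain-X image with y's content
    v = G_y(c_x, a_y)            # domain-Y image with x's content

    # backward translation: exchange them again and reconstruct the originals
    x_hat = G_x(E_c(v), E_a(u))
    y_hat = G_y(E_c(u), E_a(v))
    return F.l1_loss(x_hat, x) + F.l1_loss(y_hat, y)
```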

There are some other loss functions: 1) a self-reconstruction loss, which reconstructs the original image by encoding and decoding it; 2) a domain adversarial loss, which encourages G to generate realistic images in each domain; 3) a latent regression loss, inspired by BicycleGAN, which enforces reconstruction of the latent attribute vector; 4) a KL loss, which aligns the attribute representation with a prior Gaussian distribution; and 5) a mode-seeking regularization, which improves diversity (sketched below).
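Assuming the mode-seeking term follows the usual MSGAN formulation, a sketch looks like this (G, content, z1, z2 are illustrative names): the ratio of image distance to latent distance is maximized so that nearby latent codes do not collapse to the same output.

```python
import torch

def mode_seeking_loss(G, content, z1, z2, eps=1e-5):
    img1, img2 = G(content, z1), G(content, z2)
    ratio = torch.mean(torch.abs(img1 - img2)) / torch.mean(torch.abs(z1 - z2))
    return 1.0 / (ratio + eps)   # minimizing this maximizes the ratio => more diverse outputs
```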

For metrics, they adopt FID to evaluate image quality, LPIPS to evaluate diversity, and JSD & NDB to measure the similarity between the distributions of real and generated images (a usage sketch of LPIPS follows). Their model also generalizes to multi-domain and high-resolution I2I translation.
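For reference, diversity with LPIPS is typically computed as the average perceptual distance between pairs of outputs; a minimal sketch using the commonly available lpips package (not necessarily the authors' exact evaluation script).

```python
import torch
import lpips

loss_fn = lpips.LPIPS(net='alex')          # AlexNet-based perceptual distance
img0 = torch.rand(1, 3, 256, 256) * 2 - 1  # inputs are expected in [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1
d = loss_fn(img0, img1)                    # larger average distance => more diverse outputs
```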

Pros:

  1. Though embedding images into content and attribute spaces is similar to MUNIT, their content adversarial loss does more to guarantee that the content space contains no domain-specific information, which is more principled than MUNIT.

  2. The cross-cycle consistency loss addresses the absence of paired training data in a cyclic way.

  3. Their experiments are comprehensive and convincing.

Cons:

  1. In their user study, there is no detailed information about the number of users or the test images, which makes it less convincing. In addition, the image quality is much worse than CycleGAN's.

Daily Reading 20200519

Posted on 2020-05-19 | In Paper Note

Pluralistic Image Completion

posted on: CVPR2019

In this paper, they try to solve image completion in a pluralistic way: given a masked input, the model can generate multiple, diverse, plausible outputs, which is quite different from previous methods that generate only one. To obtain a distribution from which to sample the missing content, they examine CVAE and the instance-blind approach and explain why using either directly is infeasible: CVAE learns a low-variance prior, and instance-blind training is unstable. They therefore propose a network with two parallel training paths: 1) the reconstruction path, similar to instance blind, tries to reconstruct the original image and obtain a smooth prior distribution for the missing region; 2) the generative path predicts the latent prior distribution of the missing regions conditioned on the visible pixels. During testing, only the generative path is used to infer outputs. The network is based on LS-GAN. For the loss functions, they use a distribution regularization (a KL divergence, sketched below), an appearance matching loss (applied to the whole image on the reconstruction path but only to the missing region on the generative path), and an adversarial loss.
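The exact coupling of the two paths follows the paper; the sketch below only illustrates the KL term between two diagonal Gaussians, which is the form such a distribution regularization usually takes (the parameter names are mine).

```python
import torch

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1)
    return kl.sum(dim=1).mean()
```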

Note that they also present a short+long term attention layer, a combination of a self-attention layer and contextual flow. The short-term attention is placed within the decoder to harness distant spatial context, while the long-term attention is placed between the encoder and decoder to capture feature-to-feature context. A basic self-attention sketch follows.
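For context, a standard SAGAN-style self-attention block looks like the sketch below; the paper's short+long term attention builds on this idea but additionally attends across encoder and decoder features (this is background code, not the paper's layer).

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).view(b, -1, h * w).permute(0, 2, 1)   # B x HW x C'
        k = self.key(x).view(b, -1, h * w)                      # B x C' x HW
        attn = torch.softmax(torch.bmm(q, k), dim=-1)           # B x HW x HW
        v = self.value(x).view(b, -1, h * w)                    # B x C x HW
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x
```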

Pros:

  1. The mathematical derivation and explanation are clear and thorough (though I do not understand some of it).

  2. Moving from one-to-one to multimodal, one-to-many mappings is the trend in image-to-image translation and image inpainting.

  3. The short+long attention layer is a good way to attend to the finer-grained features in the encoder or the more semantically generative features in the decoder.

Daily Reading 20200515

Posted on 2020-05-15 | In Paper Note

Deep Exemplar-based Video Colorization

posted on: CVPR2019

This work is the first exemplar-based video colorization algorithm; it is similar to 'Deep Exemplar-based Colorization' except that it is extended to video. It adopts a recurrent structure and also contains two major sub-nets: a correspondence sub-net and a colorization sub-net. Compared to image colorization, temporal consistency must be considered in addition to color and semantic correspondence, which is why they adopt a recurrent structure and take the result of the previous frame as input. The loss function is also similar. Besides the perceptual loss and smoothness loss used in [1], they introduce a contextual loss and use an adversarial loss and a temporal consistency loss (sketched below). The contextual loss measures the local feature similarity between the output frame and the reference in a forward matching way. In addition, so that the method degenerates gracefully to the common case where the reference comes from the video frames themselves, they add an L1 loss to keep the output close to the ground truth. They compare against image colorization, automatic video colorization, and color propagation methods for the quantitative comparison, and they conduct a user study for the qualitative comparison.
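A typical temporal consistency term warps the previous output to the current frame with optical flow and penalizes the difference on non-occluded pixels; the generic sketch below assumes the flow and occlusion mask come from an external flow estimator, and all names are mine rather than the paper's.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (N,C,H,W) with optical flow (N,2,H,W) via grid_sample."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(img.device)      # 2 x H x W
    new = base.unsqueeze(0) + flow                                   # N x 2 x H x W
    # normalize coordinates to [-1, 1] for grid_sample
    new_x = 2.0 * new[:, 0] / max(w - 1, 1) - 1.0
    new_y = 2.0 * new[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((new_x, new_y), dim=3)                        # N x H x W x 2
    return F.grid_sample(img, grid, align_corners=True)

def temporal_consistency_loss(curr_out, prev_out, flow, mask):
    # penalize color changes between the current output and the flow-warped
    # previous output, ignoring occluded pixels (mask = 1 where flow is valid)
    return F.l1_loss(curr_out * mask, warp(prev_out, flow) * mask)
```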

Daily Reading 20200514

Posted on 2020-05-14 | In Paper Note

Multimodal Unsupervised Image-to-Image Translation

posted on: ECCV2018

Image-to-image translation is often simplified to a deterministic one-to-one mapping, which makes it difficult to generate diverse outputs from a given source-domain image. To address this problem, they extend their previous work UNIT (a one-to-one mapping) to the multimodal setting by combining it with ideas from BicycleGAN. The image representation is decomposed into a content code that is domain-invariant and a style code that captures domain-specific properties. To perform translation, they recombine the content code with a random style code sampled from the style space of the target domain. Based on the content and style codes, they propose a bidirectional reconstruction loss, including image reconstruction (encode into the two codes, then decode back to the original image) and latent reconstruction (re-encode the translated image and recover the same content code and the sampled style code); a sketch follows. Besides a user study, they also use the LPIPS and CIS (a modified version of IS) metrics.
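A sketch of the two reconstruction terms for one direction, with illustrative per-domain content encoders, style encoders, and decoders passed as arguments (the full objective is symmetric across the two domains).

```python
import torch
import torch.nn.functional as F

def bidirectional_recon_loss(x, Ec_x, Es_x, G_x, Ec_y, Es_y, G_y, style_dim=8):
    # image reconstruction within domain X: encode, then decode back
    c_x, s_x = Ec_x(x), Es_x(x)
    img_loss = F.l1_loss(G_x(c_x, s_x), x)

    # latent reconstruction: translate X -> Y with a random style, then re-encode
    s_rand = torch.randn(x.size(0), style_dim, device=x.device)
    y_fake = G_y(c_x, s_rand)
    lat_loss = F.l1_loss(Ec_y(y_fake), c_x) + F.l1_loss(Es_y(y_fake), s_rand)
    return img_loss + lat_loss
```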

Pros:

  1. Moving from 1-to-1 to 1-to-many makes the image-to-image translation task clearer and more reasonable.

  2. Their decomposition into a style code and a content code is a proper abstraction of image-to-image translation, and they use this assumption to solve a more challenging problem.

Cons:

  1. Another paper mentions that the style code lacks many details and is not that beneficial to image-to-image translation; I am not sure about this.

Daily Reading 20200513

Posted on 2020-05-13 | In Paper Note

Example-Guided Style-Consistent Image Synthesis from Semantic Labeling

posted on: CVPR2019

In this paper, they present a method for example-guided image synthesis with style consistency from general-form semantic labels, focusing mainly on face, dance, and street-view image synthesis tasks. Built on pix2pixHD, their network contains 1) a generator, which takes the semantic map x, a style example I, and its corresponding label F(I) as input and outputs a synthetic image; 2) a standard discriminator that distinguishes real images from fake ones given the conditional inputs; and 3) a style-consistency discriminator that operates on image pairs and detects whether the synthetic output and the guidance image I are style-compatible. During training, they sample style-consistent and style-inconsistent image pairs from videos to provide style awareness to the model (a sketch of this sampling follows). They also introduce style-consistency adversarial losses as well as a semantic consistency loss with adaptive weights to produce plausible results. They perform qualitative and quantitative comparisons on several applications.
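The pair sampling could be as simple as the following sketch, where frames_by_video is a hypothetical mapping from a video id to its frames (illustrative only): frames from the same video form a style-consistent pair, frames from different videos an inconsistent one.

```python
import random

def sample_pair(frames_by_video, consistent=True):
    if consistent:
        vid = random.choice(list(frames_by_video))
        return random.sample(frames_by_video[vid], 2)      # two frames, same video
    vid_a, vid_b = random.sample(list(frames_by_video), 2)
    return [random.choice(frames_by_video[vid_a]),
            random.choice(frames_by_video[vid_b])]         # frames from different videos
```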

Pros:

  1. For earlier image-to-image synthesis methods, it is difficult to tell whether the network has learned a new data distribution. Here the style image specifies different data distributions, so the network can produce outputs from different data distributions.

Cons:

  1. Though they build their network on pix2pixHD, they do not use its multi-scale architecture and limit the input size to 256x256.

  2. The results can be affected significantly by the performance of the state-of-the-art semantic labeling function F(·).

Daily Reading 20200510

Posted on 2020-05-10 | In Paper Note

Toward Multimodal Image-to-Image Translation

posted on: NIPS2017

In this paper, they observe that in image-to-image translation a single input may correspond to multiple possible outputs, making it a multimodal problem, so they propose to learn the distribution of possible outputs. They propose a hybrid model, BicycleGAN, which combines cVAE-GAN and cLR-GAN. cVAE-GAN learns the hidden distribution of the image output through a VAE and models the multi-style output distribution: it starts with the ground-truth target image B and encodes it into the latent space, and the generator then attempts to map the input image A, together with the sampled z, back to the original image B. cLR-GAN works in the other direction: it randomly samples a latent code from a known distribution, uses it with the conditional generator to map A to an output, and then attempts to reconstruct the same latent code from that output, achieving self-consistency. A sketch of the two cycles follows.
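A sketch of the two cycles with illustrative encoder/generator modules passed as arguments; only the reconstruction terms are shown, and the GAN and KL terms are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def cvae_cycle(A, B, E, G):
    # cVAE-GAN: B -> z -> B_hat, conditioned on A
    mu, logvar = E(B)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
    B_hat = G(A, z)
    return F.l1_loss(B_hat, B)

def clr_cycle(A, E, G, z_dim=8):
    # cLR-GAN: z -> B_hat -> z_hat, the sampled latent code should be recovered
    z = torch.randn(A.size(0), z_dim, device=A.device)
    B_hat = G(A, z)
    mu_hat, _ = E(B_hat)
    return F.l1_loss(mu_hat, z)
```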

They perform quantitative and qualitative comparisons. For the quantitative comparison, they measure diversity using the average LPIPS distance and realism using a real-vs-fake AMT test.

Pros:

  1. Their method re-defines the image-to-image translation problem in a multimodal way.

  2. Combining multiple objectives that encourage a bijective mapping between the latent and output spaces helps address the problem of mode collapse in image generation.

Daily Reading 20200509

Posted on 2020-05-09 | In Paper Note

Deep Exemplar-based Colorization

posted on: TOG2018

In this paper, they propose the first deep learning approach for exemplar-based local colorization, which can directly select, propagate, and predict colors from an aligned reference for a gray-scale image.

Their network contains two sub-nets. 1) The similarity sub-net, a preprocessing step that measures the semantic similarity between the reference and the target: feeding the luminance channels of the target and the reference to a gray VGG-19, they compute the cosine distance between the feature maps and output a similarity map (sketched below). 2) The colorization sub-net, which handles colorization for similar/dissimilar patch/pixel pairs: taking the gray target, the aligned reference with its chrominance channels, and the similarity map, it predicts the ab channels of the target image. It contains two branches with two different loss functions so that it predicts plausible colorization both with and without a reliable reference. The chrominance loss of the chrominance branch computes a smooth L1 distance at each pixel to selectively propagate the correct reference colors, while the perceptual loss of the perceptual branch minimizes the semantic difference between the prediction and the target image when the reference is not reliable.
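A sketch of that similarity measurement using torchvision's pretrained VGG-19 as a stand-in for the paper's gray VGG-19; the layer index and the handling of the luminance input are my own illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg19(pretrained=True).features.eval()

def similarity_map(target_lum, ref_lum, layer=21):        # layer index is illustrative
    def feats(x):
        x = x.repeat(1, 3, 1, 1)                          # luminance -> 3 channels
        for i, m in enumerate(vgg):
            x = m(x)
            if i == layer:
                return x
    with torch.no_grad():
        f_t, f_r = feats(target_lum), feats(ref_lum)
    return F.cosine_similarity(f_t, f_r, dim=1)           # per-location similarity map
```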

To recommend good references to the user, they also propose an image retrieval algorithm that finds a proper reference for a given target. They apply both a global ranking (cosine distance between the feature vectors from the first fully connected layer) and a local ranking (cosine distance between the feature maps from relu5_2, plus the correlation coefficient between the luminance histograms of two local windows) to select proper candidates.

Pros:

  1. Instead of coloring the image with user strokes or purely by learning from large-scale data, their method strikes a good balance between the controllability of interaction and the robustness of learning.

  2. For references with proper semantic correspondence, it can propagate the correct colors to the target; for improper references, it can still generate a plausible result by predicting the dominant colors. So it loosens the constraints on references.
