Cross-domain Correspondence Learning for Exemplar-based Image Translation
Published at: CVPR 2020
In this paper, the authors propose CoCosNet, an exemplar-based image translation network that learns dense cross-domain correspondence and outputs images resembling the fine structures of the exemplar at the instance level. The cross-domain correspondence and the image translation are learnt jointly with weak supervision, since the two tasks facilitate each other. Given an exemplar image, they focus on converting a semantic segmentation mask, an edge map, or pose keypoints into a photo-realistic image.
CoCosNet has two main sub-networks: 1) a cross-domain correspondence network, which transforms the inputs from different domains into an intermediate shared domain; 2) a translation network, which progressively synthesizes the output based on the warped exemplar. Take mask-to-image synthesis as an example. They first align the input semantic map and the reference style image (exemplar) through the encoder, and use the resulting features to compute the similarity between every pair of pixels in the two inputs. After obtaining the warped exemplar according to this similarity, the translation network uses positional normalization and spatially-variant denormalization (similar to AdaIN) to inject the style while generating the final image from a fixed noise z.
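To make the correspondence step concrete, below is a minimal PyTorch-style sketch (not the authors' code) of how a warped exemplar can be obtained: the shared-domain features of the semantic input and the exemplar are channel-normalized, a pixel-wise correlation matrix is computed, and the exemplar is softly warped toward the input layout via an attention-weighted sum. The function names and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def warp_exemplar(feat_input, feat_exemplar, exemplar_rgb, tau=0.01):
    """feat_input, feat_exemplar: (B, C, H, W) features in the shared domain.
    exemplar_rgb: (B, 3, H, W) downsampled exemplar image to be warped.
    tau: softmax temperature controlling the sharpness of the correspondence."""
    B, C, H, W = feat_input.shape
    # Flatten spatial dims and channel-normalize so the dot product is a cosine similarity.
    f_in = F.normalize(feat_input.view(B, C, H * W), dim=1)        # (B, C, HW)
    f_ex = F.normalize(feat_exemplar.view(B, C, H * W), dim=1)     # (B, C, HW)
    corr = torch.bmm(f_in.transpose(1, 2), f_ex)                   # (B, HW, HW) pixel-wise similarity
    attn = F.softmax(corr / tau, dim=-1)                           # weights over exemplar pixels
    ex = exemplar_rgb.view(B, 3, H * W)                            # (B, 3, HW)
    warped = torch.bmm(ex, attn.transpose(1, 2)).view(B, 3, H, W)  # attention-weighted warp
    return warped
```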
They apply a domain alignment loss and a correspondence regularization to guarantee that the inputs are aligned to the same domain and that the network learns a meaningful correspondence. They also use 1) a perceptual loss to minimize the semantic discrepancy, 2) a contextual loss to preserve the style information (color or texture) of the exemplar, 3) a feature matching loss to penalize the difference between the translation output and the ground truth for pseudo exemplar pairs, and 4) an adversarial loss to discriminate the translation output from real samples of the exemplar domain.
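As an illustration of one of these terms, here is a self-contained sketch of a VGG-based perceptual loss in the spirit of the paper; the layer choice (relu4_2) and the L1 distance are assumptions for illustration rather than the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(nn.Module):
    """L1 distance between deep VGG19 features of the output and the ground truth."""
    def __init__(self):
        super().__init__()
        # features[:23] ends at relu4_2 of VGG19 (index 22); frozen, eval mode.
        vgg = models.vgg19(weights="IMAGENET1K_V1").features[:23].eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg

    def forward(self, output, target):
        # Deep features emphasize semantics rather than exact pixel values.
        return F.l1_loss(self.vgg(output), self.vgg(target))
```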
In their experiments, they select three datasets: ADE20K for the mask-to-image subtask, CelebA-HQ for the edge-to-image subtask, and DeepFashion for the keypoints-to-image subtask. They conduct quantitative and qualitative comparisons. The quantitative comparison covers three aspects: 1) image quality, measured with two metrics, FID for the distance between feature distributions and SWD for the statistical distance of low-level patch distributions; 2) semantic consistency, measured with the relu3_2, relu4_2 and relu5_2 features of VGG19; 3) color and texture distance, measured with the relu1_2 and relu2_2 features between semantically corresponding patches.
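Below is a rough sketch of how such a semantic-consistency score could be computed, assuming it is the average cosine similarity between high-level VGG19 features of the output and the ground truth; the layer index mapping, pre-processing and aggregation are assumptions and may differ from the paper's exact protocol.

```python
import torch
import torch.nn.functional as F
from torchvision import models
from torchvision.models.feature_extraction import create_feature_extractor

# Assumed indices in vgg19.features: 13 ~ relu3_2, 22 ~ relu4_2, 31 ~ relu5_2.
_nodes = {"features.13": "relu3_2", "features.22": "relu4_2", "features.31": "relu5_2"}
_vgg = create_feature_extractor(models.vgg19(weights="IMAGENET1K_V1").eval(),
                                return_nodes=_nodes)

@torch.no_grad()
def semantic_consistency(output, ground_truth):
    """output, ground_truth: (B, 3, H, W) images already normalized with ImageNet statistics.
    Returns the mean cosine similarity of high-level VGG19 features."""
    f_out, f_gt = _vgg(output), _vgg(ground_truth)
    scores = [
        F.cosine_similarity(f_out[k].flatten(1), f_gt[k].flatten(1), dim=1).mean()
        for k in _nodes.values()
    ]
    return torch.stack(scores).mean()
```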
They also present two interesting applications of their work. One is image editing: by manipulating the segmentation layout, it is feasible to obtain a translation that reflects the same content manipulation. The other is makeup transfer.
Pros:
- It is a general framework for exemplar-based image translation. The development of image translation follows a clear trajectory, from paired supervision to unsupervised translation, and then to multi-modal translation. Since then, image translation has been further expanded toward higher resolution, higher quality, video, few-shot adaptation, etc. But two main problems remain: 1) the style of the generated image is unpredictable, and the user cannot specify the style of a specific instance; 2) the outputs of existing methods often contain obvious artifacts. Their method effectively addresses both problems.
- They model the input image and the exemplar image as belonging to two distinct domains. Translating images between distinct domains is a general formulation that can generalize to many different kinds of inputs.
- Their quantitative experiments are comprehensive, and the three aspects they consider are highly relevant to their task.
Cons:
- As the translation is based on the warped exemplar, it is essential that the exemplar contains the same semantic labels as the mask in the mask-to-image task, which limits the applications.
- In their ablation study, the feature matching loss brings little improvement, and it seems that L_feat is only applied to pseudo exemplar pairs, so I wonder whether it is necessary to include this loss.