Where and Who? Automatic Semantic-Aware Person Composition
posted on: WACV2018
This paper proposes an image composition method for person foregrounds that covers the whole pipeline: choosing a plausible location and size, selecting a suitable foreground, and producing the final composite. The method has three components: 1) a CNN that predicts a bounding box indicating the location and size of the potential person foreground; 2) person segment retrieval, which finds a segment that semantically matches the local and global context of the background; 3) alpha matting, which makes the foreground compatible with the background.
In the first step, person instances in COCO images are removed and the holes are inpainted with background. After a Faster R-CNN object detector is run to obtain object detections, the layout image and the inpainted image are fed to the CNN, which predicts the normalized coordinates (x_stand, y_stand, w, h) of the bounding box. In the second step, they build a candidate pool from COCO person instances, filter out the highly occluded ones, and manually re-segment the rest. To select suitable candidates, they compute the cosine distance between the global and local feature representations of the target segment and those of each candidate. In the last step, the selected candidate is resized to fit the predicted bounding box, and alpha matting is applied to smooth the transition between foreground and background.
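The retrieval step above can be illustrated with a minimal sketch. It assumes each segment is already described by one global and one local feature vector, and it combines the two cosine distances with a weighted sum. The weighting (`local_weight`), the function names, and the toy feature values are all hypothetical; the review does not say how the paper combines the two distances.

```python
import math


def cosine_distance(u, v):
    """Return the cosine distance 1 - cos(u, v) of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def rank_candidates(query_global, query_local, candidates, local_weight=0.5):
    """Sort candidate names from best to worst match.

    candidates: list of (name, global_feature, local_feature) tuples.
    local_weight: hypothetical mixing weight, not taken from the paper.
    """
    scored = []
    for name, global_feat, local_feat in candidates:
        d_global = cosine_distance(query_global, global_feat)
        d_local = cosine_distance(query_local, local_feat)
        total = (1.0 - local_weight) * d_global + local_weight * d_local
        scored.append((total, name))
    scored.sort()  # smallest combined distance first
    return [name for _, name in scored]


# Toy features standing in for real CNN embeddings.
query_global = [1.0, 0.0, 0.0]
query_local = [0.0, 1.0]
pool = [
    ("standing", [0.9, 0.1, 0.0], [0.1, 0.9]),
    ("sitting", [0.0, 1.0, 0.0], [1.0, 0.0]),
    ("walking", [0.7, 0.7, 0.0], [0.5, 0.5]),
]
ranking = rank_candidates(query_global, query_local, pool)
print(ranking)
```

In this toy pool the "standing" candidate is closest to the query in both feature spaces, so it is ranked first; a real system would replace the hand-written vectors with learned features.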
To evaluate the box prediction, they measure the correlation between the predicted and target histograms, for both the location histogram and the size histogram. They also run ablation studies on special cases of their method, reported both quantitatively and qualitatively.
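As a rough illustration of this metric, the sketch below bins normalized box values and compares two histograms with the Pearson correlation of their bin counts. This is one common definition of histogram correlation; whether the paper uses exactly this formula, and which bin count it uses, are assumptions here, and the sample values are made up.

```python
import math


def histogram(values, bins, lo=0.0, hi=1.0):
    """Count values into `bins` equal-width bins over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)  # clamp so v == hi lands in the last bin
        counts[idx] += 1
    return counts


def histogram_correlation(h1, h2):
    """Pearson correlation between two histograms with the same number of bins."""
    n = len(h1)
    mean1 = sum(h1) / n
    mean2 = sum(h2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(h1, h2))
    var1 = sum((a - mean1) ** 2 for a in h1)
    var2 = sum((b - mean2) ** 2 for b in h2)
    if var1 == 0.0 or var2 == 0.0:
        # Assumption: a flat histogram carries no signal, so report 0.
        return 0.0
    return cov / math.sqrt(var1 * var2)


# Made-up normalized box heights for predicted and target boxes.
predicted_h = [0.10, 0.20, 0.30, 0.35, 0.60, 0.80]
target_h = [0.15, 0.20, 0.40, 0.45, 0.55, 0.90]

pred_hist = histogram(predicted_h, bins=4)
target_hist = histogram(target_h, bins=4)
score = histogram_correlation(pred_hist, target_hist)
print(pred_hist, target_hist, score)
```

A score near 1 means the predicted boxes are distributed like the target boxes; the same function can be applied to location histograms by binning x or y coordinates instead of heights.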
Pros:
Most image composition and image harmonization methods focus on the appearance consistency between a user-selected foreground and a background image. This work instead focuses on predicting candidate person locations and retrieving person instances from a candidate pool.
Selecting the foreground by the cosine distance between the global and local feature representations of the target segment and the candidates is similar to our method.
It is a possible solution to the few-shot problem in other image editing tasks.
Cons:
It pays little attention to color and illumination consistency. Even when the foreground person is placed at a plausible location, the composites suffer from lighting inconsistency.
In the first phase, the results depend on the object detector and the subsequent location/size predictor, so the performance of the object detector significantly influences the final results.