Introduction:

Figure 1: The bidirectional similarity. Completeness (backward similarity) and coherence (forward similarity) together preserve the important information and image smoothness at the same time.

In the final project, we have implemented the content-aware image resizing algorithm of [1] and tried to improve it by applying patch dimension reduction. Unlike the popular warping-based approaches [2,3,4], this algorithm resizes image by maximizing the bidirectional similarity (see Figure 1) between the original image and the resized. The bidirectional similarity ensures that each patch in the input image is included in the output in a satisfactory manner. When being given a smaller space, the algorithm starts to work on the texture redundancy to achieve a terse summarization. This results in novel texture "re-synthesis" effects as shown in the figure. In view of the correspondence, we tried to improve their method by projecting the patches to a lower dimension space and then work on it [5]. We formulated our new algorithm and did some analysis. As a conclusion, we also provided a brief discussion on the relationship between the method and the other similar work.

Algorithm implementation & details:

We implemented our program using VS .NET C++ with OpenCV to facilitate common image processing routines. Our implementation includes: 1) the original method given in [1] and 2) our experimental algorithm with patch dimension reduction. The details of respective parts are following:

The original method:

Although the concept of [1] is quite simple, the implementation is not so straightforward as the author has omitted many details. We illustrate our comprehension of their algorithm in Figure 2, 3 and 4. To resize an image without introducing noticeable artifacts, the algorithm scales images gradually with only a little bit of change each time (outer iterations) and performs some refinement iterations to re-touch the result (inner iterations). The two kinds of iterations are shown respectively as green arrows and black arrows in Figure 2. In each of the inner iterations, the algorithm runs EM-like steps to minimize the energy function, which consists of patch searches of two directions and the update (averaging). This stages constitute the single-scale resizing algorithm.

Original Method Workflow
Figure 2: The single scale resizing workflow of the original method in [1].
TAIR_Formula
Figure 3: The bidirectional similarity energy function written in math notation.

For the long range relationship to be captured, and thus preserved by the algorithm, the author of [1] proceeds the above algorithm in a coarse-to-fine manner which suggests the modification in Figure 4. We break every outer iteration into inner iterations of different resolutions. The output image is refined at one lower resolution upon convergence or to a certain point and then upsampled to a higher resolution. The process is repeated until we finish the refinement at the original resolution. At this time the algorithm then moves on the next outer iteration. Please note that at different resolutions, we optimize the bidirectional similarity energy with respect to the original image of different sizes. We implement the algorithm with a recursive function in practice.

Original Method Workflow
Figure 4: One outer iteration of the multi-scale resizing algorithm, showing three levels of the pyramid.

One big problem of this algorithm is its speed. As the naive full-image nearest-neighbor patch search is a quadratic time operation in terms of the pixel count, it becomes intractable even for very small images. Although numerous acceleration techniques have been proposed, they could not be directly applied here because the output image is constantly changing throughtout the optimization. To alleviate this problem, the author has used the motion prediction techniques common in video compression. In practical, they store the optimal location found in the previous iteration and constrain the new search in a small nearby window. This heuristic is effective since it reduces the quadratic running time to the linear. However, it risks the opportunity of getting stuck at local minima. Our experiment also suggests that it may not be so efficient as described in [1]. In resizing the cow image to its half size, our program takes six hours to complete as opposed to the reported 5 min in [1] (we use patch size = 7*7, window size= 35*35, area scaling factor = 95%, inner iter. count = 10).

Our method:

In the pioneering work [5], the author proposed to utilize the fact that local patches of a natural image lies on a low-dimensional manifold for fast patch search. Specifically, they project the local 5*5 patches of each pixel to an 8D space and then perform texture synthesis directly on this domain (the appearance space). As the new 8D "pixels" now contain the information from nearby region instead of a single pixel, the run-time neighborhood one need to check in the search can be dramatically reduced to only the 4 diagonal neighbors without sacrificing the quality. Inspired by their success, we try to parallel the idea and see if a similar acceleration could be achieved. Our modification and their concept are summarized in Figure 5 and Figure 6. We perform the patch PCA before we start to iterate refinement steps and do the neighborhood PCA before we need to perform the patch search. We synthesize the output image by direct looking for the nearest patches in the original image instead of using PCA back-projection to prevent blurry results. To perform the nearest-neighbor search, we have used the public ANN library [15,16] for simplicity. The kd-tree is re-built each time when the output (target) image changes.

Appearance Transform Function MakeSimilar( Image ref, Image tar)
{
     If (size still too big)
     {
          Down-sample images;
          MakeSimilar( ref, tar);
          Up-sample the result;
     }
     Project images to appearance space;
     For I = 0 to L
     {
          Project images to neighborhood space;
          Build kd-trees;
          Patch search for every pixel in both
          directions using the kd-trees.

          Update tar in the appearance domain.
          If (update too small)
               Break;
     }
     Render tar by NN-search on ref.
     Return tar;
}
Figure 5: Our experimental algorithm. Figure 6: The recursive function of one outer iteration.

Results & Discussion:

Here we show the some result produced by our implementation of [1]. We compare our result with those shown in the paper to validate its correctness.

Our implementation.
The result of [1].

We then show the output of our modified algorithm on several images (most of them are either from [1] or [7]). We illustrate the location of copied pixels with a chroma-space chart. As we found larger patch size generally leads to better results, we have used 9*9 patches in all the experiments. The other parameters are appearance space dim. = 4, neighborhood space dim = 25, neighborhood size = 9*9, iinner iter. cnt = 150, ANN max visited point count = 256 and ANN priority search. Our program runs in several hours depending on the image size and content, which is comparable to our implementation of [1]. More than 95% of the time is spent on the NN-search. The resized images typically looks good up to 0.6x~0.7x scaling, but starts to show artifacts when going further. We have also compared the result with the output from our implementation of [1] and found it to be better.

Tree:

Child:

Father:

Woman:

Woman 2:

Man:

Boy:

Cow:

Tower:

The algorithm's ability of capturing meaningful structures in the image is determined by the largest patch size used in defining similarity. If we want some texture redundancy to be compressed, we have to use a patch size at least as large as the period of the repetition. However, using a larger patch size will limit the minimum size of compaction and make patches more difficult to match with each other. The following shows one example where the algorithm failed to meet our expectation by mis-matching the textures. We have used a patch size roughly equal to the window's height (indicated as the red box in the picture).

The result of [7] which uses GSMFOE to guide patch compatibility.
Ours.

We have also applied our algorithm on the problem of inverse texture synthesis [6] (see the next section). Our algorithm generally produces good-looking results but breaks down on textures with semantic structures, which is also pointed out in [5] and [13] for small neighborhood pixel-based texture synthesis.

Fabric:

Text:

Performance:

Generally speaking, we found that the neighborhood size, the number of ANN visited nodes and inner iterations required to achieve camparable result with motion-prediciton will result in a similar running time. The benefit of using a more informative space is not evident in our case, and the number of points one needs to check in the NN-search is well corresponding to the windows size of motion-prediction. From the result, we suspect that other approximation-based acceleration methods such like TSVQ [11] will not provide significant performance improvement over motion-prediction. The k-coherence search [12] may not suffer from approximation, but its pre-processing will take the same order of time as our method. A possible solution is proposed in the concurrent work [6]. Nevertheless, since its time complexity is quadratic in the size of compaction, it may actually become even slower than the method of [1] when the output size is big, which is often the case in content-aware image resizing. In short, we believe the dynamic backward search poses a hard performance bottleneck for bidirectional similarity based methods.

Related Work:

In this section, we will talk about possible directions of the problem. Some of them were touched by other recent work.

Other bidirectional similarity energy:

Probably anyone who are familiar with this topic will find an almost identical energy was also proposed independently in [6], albeit with a different goal and a different optimization scheme. The main difference in terms of optimization is that they directly go to the desired size instead of the gradually resizing in [1]. Although their result is indeed impressive, it should be noted that they can do so only because their primary goal is to do summarization but not to provide a concise visual experience for the original image. Without a reasonably good intialization, EM steps of the bidirectional similarity energy function could at best produce locally smooth epitomes [14] as in [6] or even totally miss salient regions by averaging them.

Deform first, synthesize later:

As also mentioned in [4], one possible way to deal with textures is to apply synthesis algorithms such like [8] after the warping has been done. This sounds to be a promising idea since the warping-based methods are doing quite a good job on deformation now. However, it would have difficulty in choosing feature curves to align as now we have to derive them automatically from the warping field. A simple method could be to densely sample the output warping field and run the algorithm of [8] but exactly how it should proceed will be a problem. In practice, we think the hybrid method could work to some extent, but it is unlikely to accomplish effects such like the object count reduction as seen here.

Displacement Fusion:

The last possible direction, which we thought to be the most promising, is to use some computation-friendly alternative mechanism to simulate the effect of backward similarity energy in optimization. It should be noted that one very important thing the backward energy tries to do is to make sure the MORE common patches LESS used, which is very similar to the function of a saliency map. If we could further prevent the LESS common patches from dominating the output, we will not need to resort to the time-consuming backward patch search. This could be achieved either by the exclusion term in [7] or by an explicit patch repetition-suspressing penalty.

Based on the observation that a successful resized image usually contains large continuous parts copied from the original [this report, 1, 7], a practical realization of this idea is to apply MRF optimization with offset labels as in [9, 10]. The exact energy may take the following terms:

1. A saliency map term to preserve important information.
2. A local patch compability term to ensure a visually satisfactory output.
3. An offset regularization term to retain the large-scale ordering of individual elements in the image.
4. A repetition-penalizing term to prevent strong patches from dominating the output.

We can further include a coarse-to-fine framework for efficient hypotheses generation. Another useful source of labeling may come from the slighty versions of the previously resized result.

Acknowledgements:

We have to thank Dawsen Huang for providing his personal experiences on the problem of content-aware image resizing with us.

References:

[1] D. Simakov, Y. Caspi, E. Shechtman, and M. Irani, "Summarizing Visual Data Using Bidirectional Similarity", CVPR 2008.
[2] S. Avidan and A. Shamir, "Seam Carving for Content-Aware Image Resizing", SIGGRAPH 2007.
[3] L. Wolf, M. Guttmann and D. Cohen-Or, "Non-Homogeneous Content-Driven Video-Retargeting", ICCV 2007.
[4] Y.-S. Wang, C.-L. Tai, O. Sorkine, T.-Y. Lee, "Optimized Scale-and-Stretch for Image Resizing", SIGGRAPH Asia 2008.
[5] S. Lefebvre and H. Hoppe., "Appearance-Space Texture Synthesis", SIGGRAPH 2006.
[6] L.-Y. Wei, J. Han, K. Zhou, H. Bao, B. Guo and H.-Y. Shum, "Inverse Texture Synthesis", SIGGRAPH 2008.
[7] T. S. Cho, M. Butman, S. Avidan, and W. T. Freeman, "Patch Transform and its Applications", CVPR 2008.
[8] H. Fang and John C. Hart, "Detail Preserving Shape Deformation in Image Editing", SIGGRAPH 2007.
[9] V. Kwatra, A. Schodl, I. Essa, G. Turk and A. Bobick, "Graphcut Textures: Image and Video Synthesis Using Graph Cuts", SIGGRAPH 2003.
[10] G. Ramanarayanan and K. Bala, "Constrained Texture Synthesis via Energy Minimization", IEEE TVCG 2007.
[11] L.-Y. Wei and M. Levoy, "Fast Texture Synthesis Using Tree Structured Vector Quantization", SIGGRAPH 2000.
[12] X. Tong, J. Zhang, L. Liu., X. Wang, B. Guo and H.-Y. Shum, "Synthesis of bidirectional texture functions on arbitrary surfaces", SIGGRAPH 2002.
[13] S. Lefebvre and H. Hoppe., "Parallel Controllable Texture Synthesis", SIGGRAPH 2005.
[14] N. Jojic, B. J. Frey and A. Kannan, "Epitomic Analysis of Appearance and Shape", IEEE ICCI 2003.
[15] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu, "An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions", JACM 98.
[16] ANN Library, http://www.cs.umd.edu/~mount/ANN/ .