Project 4: Auto-stitching Photo Mosaics

Part A

In the first part of this project, we will be focusing on creating some preliminary photo mosaics by stitching together our own original photos! The steps for creating a mosaic are: first, recovering a homography matrix from correspondence points between the images; then, warping the source image to the reference image; and finally, blending the images together to make the mosaic. But before we can try creating mosaics, we need to capture some images and perform warps!

Capturing Images

Here are some of the pictures that will be used throughout this project for the various subtasks.

Above, we have a few images of objects that we want to rectify. From left to right, we have a Cal Gameday sign, a poker set, and a small whiteboard. The goal with these images is to change the perspective of the photo to capture the face of the objects head on.

The remainder of the images will be used to construct mosaics, stitching the two views together into one image. Now that we have all of our images, let’s jump into the subtasks, starting with the warping step.

Recovering Homographies

Unlike the previous project, where we focused on performing an affine transform, we will be performing a homography in this case. The higher-level process is the same, but we treat our homography matrix as the transform instead. Numerically, the homography has 8 unknowns, which can be written out as the following:

Hp = p' \\ \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix}

Here, p is our set of points in the source image, which we want to transform to p' using H. To fill out these variables, we will be defining correspondence points between our two images. As a result, we have the coordinates of each pixel in the source image and the corresponding points that match in the destination/reference image. Some examples are shown below.

With this information, we can solve for our 8 unknowns in the homography matrix. To help with solving the system, we can rearrange terms into a new form that isolates those unknowns. With some manipulation, we can rewrite the system of equations as follows.

\begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -xx' & -yx' \\ 0 & 0 & 0 & x & y & 1 & -xy' & -yy' \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} x' \\ y' \end{bmatrix}

Repeating this formulation for each correspondence point and stacking the rows into a large matrix gives us the final system of equations we need to solve. However, the system is overdetermined in this scenario, so we will need to find the least squares solution to determine our homography matrix. This is another reason why we put the equation in the format shown above, since this is the standard form for least squares optimization problems. We can verify the homography calculations by performing the forward and inverse warp on the correspondence points to see if we have a 1-to-1 bijection between the two sets of correspondence points.
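As a concrete sketch of this setup, the stacked system can be built and solved with np.linalg.lstsq. The helper below reuses the computeH name mentioned later in Part B, but its exact implementation here is an assumption:

```python
import numpy as np

def computeH(src_pts, dst_pts):
    """Estimate the 3x3 homography mapping src_pts to dst_pts via least squares.

    src_pts, dst_pts: (N, 2) arrays of (x, y) correspondences, N >= 4.
    """
    A, b = [], []
    for (x, y), (xp, yp) in zip(src_pts, dst_pts):
        # Two rows per correspondence, matching the stacked system above.
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.extend([xp, yp])

    # Overdetermined system: solve for the 8 unknowns in the least-squares sense.
    h, *_ = np.linalg.lstsq(np.asarray(A, dtype=float), np.asarray(b, dtype=float), rcond=None)

    # Reassemble into a 3x3 matrix with the bottom-right entry fixed to 1.
    return np.append(h, 1).reshape(3, 3)
```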

Image Rectification

With the newly acquired homography matrix, we can now define a function to warp and rectify images. Let’s consider the case of rectification to start off. Rectification is slightly different from the example of two images we have been using so far. We only have one image, which contains an object that we want to rectify. The way we define our correspondences in this case is to 1) create a bounding box for our object and then 2) define an arbitrary bounding box for the resulting rectification. The following images show both sets of points on the same canvas.

In both cases, the blue points define the surface that we want to rectify and the yellow points show the target bounding box. The sizes of the arbitrary targets were calculated using the approximate sizes of the actual surfaces, so the rectification performs a reasonable transform. In the poker example, I chose to put the poker box at an angle to see how the rectification would perform differently from the simpler case in the sign. Once we have the correspondences defined, we can extract a homography matrix from the transform between the point sets. Now, let’s define our warp function.

For the warp, we start by creating the new canvas for the image post-warp. To do this, we take the original corners of our image and apply our calculated homography matrix, giving us a set of four transformed corners for our final image. Sometimes the transformed coordinates are negative, so we also need a translation operation to apply after the homography. We can look at the minimum x and y values of the transformed corners and shift them to be 0 in the new coordinate frame. This way the image is always “at the origin” of the new canvas instead of sitting at very large or negative coordinates. We can define this translation operation as a matrix of the form:

T = \begin{bmatrix} 1 & 0 & -x_{min} \\ 0 & 1 & -y_{min} \\ 0 & 0 & 1 \end{bmatrix}

In essence, we are applying a homography matrix H followed by a translation matrix T, which will be an important fact for later on in the warping process. For the examples above, these are the final transformed corners for the new images.
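A minimal sketch of this bookkeeping is shown below; the helper name warp_corners and its interface are made up for illustration, but it follows the corner-transform and translation steps just described:

```python
import numpy as np

def warp_corners(H, height, width):
    """Apply H to the image corners and build the translation T that shifts
    the minimum x/y to 0. Returns the combined transform and shifted corners."""
    corners = np.array([[0, 0, 1],
                        [width, 0, 1],
                        [width, height, 1],
                        [0, height, 1]], dtype=float).T
    warped = H @ corners
    warped = warped[:2] / warped[2]              # divide out w to get (x', y')

    x_min, y_min = warped.min(axis=1)
    T = np.array([[1, 0, -x_min],
                  [0, 1, -y_min],
                  [0, 0, 1.0]])
    shifted = warped - np.array([[x_min], [y_min]])
    return T @ H, shifted                        # T @ H maps source pixels onto the new canvas
```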

Once we have our transformed corners, we can interpolate the values of the original image in the new rectified image. We start by collecting all of the points contained in both the original image (which is just every pixel) and the polygon enclosed in the 4 transformed corners. Each transformed point will have the inverse of the homography matrix and translation matrix we defined before applied to it to recover the corresponding pixel in the original image. Applying this for every transformed point, we can then perform interpolation using scipy.interpolate.griddata with the transformed points, corresponding recovered points, and the RGB values from the original image. After interpolating, we get the results below.
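Here is a rough sketch of that inverse-warping step, under a few assumptions: skimage.draw.polygon gathers the destination pixels inside the warped quadrilateral, nearest-neighbor interpolation keeps the example fast (the actual warp may use linear interpolation), and the helper name inverse_warp is hypothetical:

```python
import numpy as np
from scipy.interpolate import griddata
from skimage.draw import polygon

def inverse_warp(img, H_total, out_corners, out_shape):
    """Inverse-warp img onto a new canvas. H_total is the homography followed
    by the translation (T @ H); out_corners are the shifted (x, y) corners."""
    H_inv = np.linalg.inv(H_total)

    # All destination pixels inside the warped quadrilateral.
    rr, cc = polygon(out_corners[:, 1], out_corners[:, 0], shape=out_shape[:2])
    dest = np.stack([cc, rr, np.ones_like(cc)])           # homogeneous (x, y, 1)

    # Map each destination pixel back into the source image.
    src = H_inv @ dest
    src = src[:2] / src[2]                                 # (x, y) in the original image

    # Known data: every pixel of the original image and its RGB values.
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    known = np.stack([xs.ravel(), ys.ravel()], axis=1)

    out = np.zeros(out_shape, dtype=img.dtype)
    for ch in range(img.shape[2]):
        # Nearest-neighbor sampling of each channel at the recovered locations.
        out[rr, cc, ch] = griddata(known, img[..., ch].ravel(), src.T, method='nearest')
    return out
```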

The first point to notice in both is that the perspective of the image now shows our desired targets rectified, facing the camera head-on. In the photo with the Cal sign, the result turned out really well, with a very clear definition of the shape and orientation of the sign. For the poker example, the result is much smaller (the shown image is a smaller crop of the whole transformed image) and it is much harder to tell whether it was rectified. The results of the rectification are very sensitive to small changes in the correspondence points as well as the orientation of the objects in the image, which is why the poker example turned out the way it did. Here is another example of rectification, this time performed on a whiteboard. One cool aspect of this is that the same sign we saw before is also rectified in the process, since it is on the same plane as the whiteboard!

Warp Images

Now that we have rectified images, we can perform the same operations on a set of two images to warp them to each other. This will aid us in creating our photo mosaics next. As before, we need to define our correspondences, but instead of doing it on the same image, we do it across different images.

Here, we have 16 correspondence points between the two images. Using these points, we can calculate our homography matrix like before, and then transform one of the images into the perspective of the other image. In this example, I will be warping the left image to the right one. The resulting transformed corners and final warped image are shown below.

During the warping, I saw some extra artifacts that were a byproduct of the operation, so they were removed from the result shown above. That is why the resulting transformation differs in dimensions from the projected warped corners. Let’s look at a few more examples of this warping operation!

Here are the correspondence points between two images of my living room. Applying the same steps to these images, we get the following warped image.

One point to notice here is that we see more pixelation in this scenario compared to the previous case, likely due to the scale of the warping that occurred with this image compared to the previous one. In particular, the windows seem to lose information, possibly due to the reflections along with the downsizing of the image during these operations. Let’s build some mosaics out of these images and a few others!

Photo Mosaics

To actually create a mosaic out of our images, we need to first align our images on the same canvas and then blend them together. Looking back at the forest example, we can start by transforming our correspondence points from before onto the canvas of our warped image, giving us the result below. We perform this operation using the homography and translation matrices for this warp.

We can use our correspondence points to actually align our images since they are points in the resulting image that should overlap. However, since the homography is only an approximate transformation obtained via least squares, we can’t guarantee that all of the correspondence points will be mapped exactly to the target, only very close to it. With this in mind, I defined a function to find the best alignment of the two images within a range of values. This function is very similar to the naive alignment algorithm we used in project 1, but uses the correspondence points as a reference starting point to avoid the need for down- and up-sizing our images. With our optimal alignment, we can define a new translation matrix like before to transform our second image to our first, making our first image the base for our canvas.

To define the bounds of this new canvas, we can use a similar method to what we used for defining our warped canvas, but this time we have two sets of corners. We can find the minimum and maximum x’s and y’s across all of the corners for both images, which will then set the size of the output mosaic. We can then just add both images to the new canvas, where the warped image can be added directly and the second image is translated into the canvas based on our calculated optimal alignment. To blend the images, we apply alpha blending at the pixels where the two images overlap to make sure we get a smooth stitch. The result of the alignment and stitching is as follows.
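Below is a simplified sketch of the canvas construction and blending. It assumes the second image’s offset into the canvas is non-negative and uses a simple 50/50 average wherever the two images overlap; the actual alpha blending may weight the images differently:

```python
import numpy as np

def blend_on_canvas(warped, other, offset):
    """Place the warped image and the translated second image on a shared
    canvas and average the pixels where both contribute.
    offset = (dx, dy) is the alignment of the second image (assumed >= 0)."""
    dx, dy = offset
    h = max(warped.shape[0], other.shape[0] + dy)
    w = max(warped.shape[1], other.shape[1] + dx)

    canvas = np.zeros((h, w, 3), dtype=float)
    weight = np.zeros((h, w, 1), dtype=float)

    # Accumulate each image plus a per-pixel count of how many images cover it.
    # All-zero (black) pixels are treated as empty background here.
    mask1 = (warped.sum(axis=2, keepdims=True) > 0)
    canvas[:warped.shape[0], :warped.shape[1]] += warped * mask1
    weight[:warped.shape[0], :warped.shape[1]] += mask1

    mask2 = (other.sum(axis=2, keepdims=True) > 0)
    canvas[dy:dy + other.shape[0], dx:dx + other.shape[1]] += other * mask2
    weight[dy:dy + other.shape[0], dx:dx + other.shape[1]] += mask2

    # Overlapping pixels get the average of the two images (50/50 alpha blend).
    return canvas / np.maximum(weight, 1)
```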

Looking at the living room example from before, we can get a very similar mosaic output.

Here is another example with my own bedroom. The objects were much closer to the camera in this capture and the two captures differed by a considerable number of degrees, so the result looks slightly different than the previous two. We still see some pixelation due to downsampling of the image and interpolation.

Something I noticed with this example is how the homography started to actually diverge from the reference function cv2.findHomography. This point was very interesting to me since I would assume more points allow for a more precise transformation, but when least squares is working with a lot of points, the optimization is essentially forced to accept larger residuals. I wanted to test the upper limits, so I tried 87 correspondences as my highest number.

Conclusion

I really liked the progression of this project and how it builds toward the second part, but in particular, I was really amazed by the rectification concept. I find it really cool that we are able to extract different perspectives of the same object from one still image. Learning about it in lecture compared to trying it out on my own was a very fulfilling experience, especially with the different objects that I tried out.

Part B

In this part of the project, we focus on automating the process for defining correspondences between our two images, essentially enabling auto-stitching for any two given images. The higher level overview is to start with a large set of points for each of the images and then find the best matching “corners” between the source images. From there, we can use this smaller point set as our correspondences and create our mosaics!

Detecting Features

The first step in creating a set of correspondences is gathering sets of potential correspondence points. We can collect the Harris Corners of each of our input images, which will ideally be distinct points or defining features in our images.

On the left is a subsample of all of the Harris points we get by calling the skimage.feature.corner_harris function. This function also extracts the Harris corner response for all of the pixels in the image, which we see in the right image, so we can directly take note of which features are rated to be the most prominent. However, this set currently contains far too many points to use as correspondences. To fix this, let’s implement adaptive non-maximal suppression (ANMS) to choose a specific set of interest points from both of our images for the mosaic.
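For reference, a small sketch of this detection step might look like the following; corner_peaks is used here to turn the response map into coordinates, though the project’s actual helper may differ:

```python
from skimage.feature import corner_harris, corner_peaks

def get_harris_corners(gray_img):
    """Return the per-pixel Harris response and candidate corner coordinates."""
    response = corner_harris(gray_img)               # Harris response for every pixel
    coords = corner_peaks(response, min_distance=1)  # (row, col) candidate corners
    return response, coords
```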

ANMS

ANMS gives us a measure of how “good” a point is to use as a correspondence point in a given image. The general idea is to find the points with the highest Harris responses that are also the most distant from other strong points; in other words, the most defining features of the image. To perform this algorithm, we start by flattening our Harris image responses and sorting them in descending order. Now, for each of the Harris corners we extracted, we want to calculate the euclidean distance between that coordinate and all of the coordinates with a larger Harris response score than this point. In the paper, there is a factor c_{robust} in the formulation, which I am setting to 1 in this use case. Out of these points, we find the one located closest to this point and set the i-th entry in an array r to be the distance between the two. After following this process across all points, we can threshold on a certain euclidean distance to extract the most prominent points (largest r_i values) from our original set.
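A sketch of this procedure is below; it follows the description above with c_{robust} = 1 (so a point is only suppressed by points with a strictly larger response), though the original implementation may differ in details. Here, responses would be the Harris response sampled at each coordinate, e.g. response[coords[:, 0], coords[:, 1]]:

```python
import numpy as np

def anms(coords, responses, threshold):
    """Adaptive non-maximal suppression: keep points whose suppression radius
    r_i (distance to the nearest stronger corner) exceeds the threshold."""
    n = len(coords)
    r = np.full(n, np.inf)
    for i in range(n):
        stronger = responses > responses[i]          # points that can suppress point i
        if np.any(stronger):
            dists = np.linalg.norm(coords[stronger] - coords[i], axis=1)
            r[i] = dists.min()                       # distance to the nearest stronger point
    return coords[r > threshold]
```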

The results of applying the ANMS algorithm with a threshold value of 5 reduce the number of points drastically, from 20K in the original Harris corner output to 1K in the following image.

The main difference between this point set and the original is that most of the points are now located in areas with large response values. Let’s apply this algorithm with slightly different parameters (threshold = 10) to both images to reduce the number of points even more.

Extracting Feature Descriptors

After applying ANMS, we need to match points between the two images to make our final correspondence point set. However, to actually match the points, we need a representation of the image for a given point. In this section, we want to assign a feature descriptor to each point in the extracted point sets for both images.

The way we will be creating a feature descriptor is to extract a unique portion of the image for that pixel. As an arbitrary size, let’s look at the 40 x 40 grid around a pixel. While we could just use this representation directly, it would get very expensive to compute as the number of points increases. In other words, we need a way to summarize the image around a pixel. Staying with the 40 x 40 grid example, we can use a stride of 5 to extract a total of 64 pixels from the 40 x 40 grid, making an 8 x 8 feature descriptor. This captures the same area around a pixel while having a much more lightweight computation requirement. To help with anti-aliasing, we will be blurring the image using ksize = 30 and sigma = 5. Here are a few feature descriptors for the first image in the top row and the second image in the bottom row.
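A sketch of this descriptor extraction is shown below. Two assumptions to note: cv2.GaussianBlur requires an odd kernel size, so 31 stands in for the ksize = 30 mentioned above, and a bias/gain normalization step is included even though it is not described explicitly:

```python
import cv2
import numpy as np

def extract_descriptors(gray_img, points, patch=40, stride=5, sigma=5):
    """Sample an 8 x 8 descriptor from the blurred 40 x 40 window around each
    (row, col) point. Returns the kept points and their flattened descriptors."""
    blurred = cv2.GaussianBlur(gray_img, (31, 31), sigma)   # odd ksize required by OpenCV
    half = patch // 2
    kept, descriptors = [], []
    for y, x in points:
        window = blurred[y - half:y + half, x - half:x + half]
        if window.shape != (patch, patch):                   # skip points too close to the border
            continue
        desc = window[::stride, ::stride]                    # 8 x 8 subsample of the window
        desc = (desc - desc.mean()) / (desc.std() + 1e-8)    # bias/gain normalization (assumed)
        kept.append((y, x))
        descriptors.append(desc.ravel())                     # length-64 feature vector
    return np.array(kept), np.array(descriptors)
```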

Using these feature descriptors, we can now match correspondence points across images!

Matching Features

The matching process will be relatively straightforward with the generated descriptors: for any given point in the first image, we want to find the descriptor with the highest similarity in the second image and quantify that similarity with a score. In other words, we implement a nearest neighbors algorithm between descriptors in each image. Many of the points we have won’t have a matching point in the other image, which means we want to filter out those points somehow. One way of doing this is taking the ratio of the similarity of the nearest neighbor to that of the second nearest neighbor. This is called Lowe’s thresholding, in which we discard any points that fall below a certain value for this ratio.

For the algorithm, I start by computing the dot product between feature descriptors flattened into length-64 vectors, yielding an n x m matrix where n and m are the total number of points in the first and second image’s ANMS datasets respectively. Any (i, j)-th entry in the matrix is the similarity score between the i-th point in the first image and the j-th point in the second image. We sort each row of this matrix to find the first and second nearest neighbors to each point in the first image. With this setup, we are now able to calculate Lowe’s ratio for all of the points to see which ones can be kept as correspondence points and which ones likely don’t have a match.
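Here is a sketch of this matching step using the similarity matrix and Lowe’s ratio as described; the default threshold of 1.2 mirrors the matching threshold reported later, and the helper name is hypothetical:

```python
import numpy as np

def match_features(desc1, desc2, lowe_thresh=1.2):
    """Nearest-neighbor matching by dot-product similarity with Lowe's ratio
    thresholding. Returns a list of (i, j) index pairs."""
    sim = desc1 @ desc2.T                        # (n, m) similarity matrix
    order = np.argsort(-sim, axis=1)             # per row, columns sorted by descending similarity
    best, second = order[:, 0], order[:, 1]

    matches = []
    for i in range(sim.shape[0]):
        # Keep the match only if the 1-NN is sufficiently better than the 2-NN.
        ratio = sim[i, best[i]] / sim[i, second[i]]
        if ratio >= lowe_thresh:
            matches.append((i, best[i]))
    return matches
```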

After applying the matching algorithm, we get the following result for the above two images.

We are only left with 43 points in total, which is a more reasonable number of correspondences. However, even with this matching algorithm, we still have quite a few points that aren’t aligned between the two images because the contents of the image produce very similar feature descriptors. This can cause issues with the process since we will have outliers that skew the final homography matrix we use. To get around this issue, we can apply a RANSAC algorithm to detect and remove the outlier points.

Robust Homographies

RANSAC is an iterative algorithm that tries to identify all of the outliers in a dataset over a large number of runs. Our implementation of RANSAC will be 4-point RANSAC, meaning that we will be sampling 4 points from our dataset to define our generalization for the whole dataset. In our case, we are trying to find outliers to our homography operation, so we will randomly sample 4 points from the first image’s point set. With these 4 points, we will use our computeH function from earlier to get a possible homography from this point set to that of the second image. Using this homography matrix, we calculate the estimated points in the second image using our first image’s point set. Computing the euclidean distance between the estimated point set and the matched point set gives us a metric to see how good this specific homography matrix is. To quantify it more clearly, we calculate how many inliers we have: the number of points that are transformed to within a threshold distance of the second image’s target set. We keep a running subset of the best points and the maximum number of inliers, so that on any given iteration, we can check if the current homography matrix yields more inliers. If it does, we update the running best subset of points for both images. After n iterations, we would ideally have found a subset of points that best fits the overall trend of the original set of points.
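A sketch of this 4-point RANSAC loop is below; it assumes a computeH(src, dst) helper like the one sketched in Part A and (x, y) point arrays, and returns the inlier subsets of both point sets:

```python
import numpy as np

def ransac_homography(pts1, pts2, n_iters=10000, threshold=1.0):
    """4-point RANSAC over matched correspondences pts1 <-> pts2 ((N, 2) arrays)."""
    best_inliers = np.zeros(len(pts1), dtype=bool)
    for _ in range(n_iters):
        # Sample 4 correspondences and fit a candidate homography.
        idx = np.random.choice(len(pts1), 4, replace=False)
        H = computeH(pts1[idx], pts2[idx])

        # Project every point from image 1 and measure the error in image 2.
        proj = H @ np.hstack([pts1, np.ones((len(pts1), 1))]).T
        proj = (proj[:2] / proj[2]).T
        errors = np.linalg.norm(proj - pts2, axis=1)

        inliers = errors < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers               # keep the largest consensus set
    return pts1[best_inliers], pts2[best_inliers]
```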

After RANSAC, we are only left with 17 points for the correspondences between the two images. For these points, we can notice that most if not all of the points lie at the same physical spot in both images, meaning our algorithm was successful. With our correspondence points, we can apply the same mosaic process from part A to our images!

The part A forest mosaic is on the left and the auto-stitched mosaic is on the right. The hyperparameters used for the auto-stitched mosaic are an ANMS threshold of 10, a matching threshold of 1.2, a RANSAC threshold of 1, and the number of RANSAC iterations equal to 10000. For the same sets of images from before, here are the results of the auto-stitching process.

The part A living room mosaic is on the left and the auto-stitched mosaic is on the right. The hyperparameters used for the above mosaic are an ANMS threshold of 15, a matching threshold of 1.1, a RANSAC threshold of 1, and the number of RANSAC iterations equal to 10000.

Lastly, the part A bedroom mosaic is on the left and the auto-stitched mosaic is on the right. The above mosaic uses the same hyperparameters as the forest mosaic. For all of the images, we see very similar results between the auto-stitched mosaics and those from part A, meaning our homography matrices were very similar even if our point sets were completely different.

Reflection

Before starting Part B, I was looking forward to learning how to do RANSAC since this is a term I have seen and an algorithm I have used countless times in my research lab. However, I never knew about the actual inner workings of the algorithm, just how to use it to clean up a point cloud representation. While we weren’t working in 3D, being able to step through the algorithm and apply it from scratch was definitely the most fulfilling part of the project. At the same time, I did feel a little disappointed since I didn’t realize how simple it was to implement on our own, even if it was a simplified version.

As a result, the coolest new thing I learned throughout both parts of this project has to be ANMS, but more generally the feature point detection process. ANMS is something I had never heard of or used myself, so learning about the logic of the framework through the paper was definitely an awesome experience. Feature detection is such a big part of computer vision, yet I had never dived into the space as much as we did in this project, which I was really thankful for. My perspective of detection has been solely focused on learning-based techniques, so seeing other perspectives on the problem is always a great learning experience. Thank you for taking a look through my project page!