Introduction

In this project we explore using Generative Adversarial Networks (GANs) for city-to-city image translation. GAN models such as Pix2Pix and CycleGAN have proven effective on similar image translation tasks, so we investigate how to apply them to translating images between cities.

Unpaired Image Translation with CycleGAN

CycleGAN results: [128x128] [256x256]
CycleGAN result on 128x128 images. Top is from Paris to Venice and then back. Bottom is from Venice to Paris and then back.

There are countless cities around the world and many of them have characteristics so distinct that people can easily tell them apart; however, most cities are also structurally similar since they all have the same components, such as buildings, transportation, and pedestrians. What separates them from each other is usually the appearance of these components. In other words, cities share the same underlying semantic representation, much like images of horses and zebras or maps and satellite images.

It is nearly impossible to collect, at scale, images of two cities that are pair-wise pose-aligned, so we decided to use CycleGAN to perform unpaired, unsupervised translation between cities. The two candidates we chose are Paris and Burano, an island of Venice often called the most colorful place in the world. They are ideal for this task because they have very distinct building styles and, more importantly, Burano has canals in place of the roads of a conventional city, so it is very interesting to see the translation between roads and canals. We collected the data by scraping approximately 1,500 images of each city from Flickr. Beyond changing streets to canals and vice versa, some interesting observations can be made from the results. For example, it becomes obvious that Paris has far more trees than Burano: the network often adds trees going from Burano to Paris, while in the other direction trees are all turned into houses. We can also see that Paris has much taller buildings, so the upper parts of buildings are erased going from Paris to Burano and, likewise, buildings are grown taller going from Burano to Paris.
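The key ingredient that makes this unpaired setup work is CycleGAN's cycle-consistency loss: an image translated to the other city and back should come out unchanged. As a rough sketch (not our actual training code), the term looks something like this in PyTorch, with tiny placeholder generators standing in for the real CycleGAN architectures:

```python
import torch
import torch.nn as nn

# Tiny placeholder generators; the real CycleGAN uses ResNet-based generators.
G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # Paris -> Burano (stand-in)
F = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))  # Burano -> Paris (stand-in)

l1 = nn.L1Loss()

def cycle_loss(paris_batch, burano_batch, lam=10.0):
    """Translate each batch to the other city and back, then penalize
    the L1 difference between the reconstruction and the original."""
    rec_paris = F(G(paris_batch))    # Paris -> Burano -> Paris
    rec_burano = G(F(burano_batch))  # Burano -> Paris -> Burano
    return lam * (l1(rec_paris, paris_batch) + l1(rec_burano, burano_batch))

# During training this term is added to the usual adversarial losses.
print(cycle_loss(torch.randn(4, 3, 128, 128), torch.randn(4, 3, 128, 128)))
```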

CycleGAN result on 256x256 images. Top is from Paris to Venice and then back. Bottom is from Venice to Paris and then back.

During our experiments, we made some surprising findings about CycleGAN. The network itself was trained on 128x128 images, but since it is fully convolutional, we were able to apply it to 256x256 images as well. Perhaps counter-intuitively, it also worked at 256x256. The image at the top of this page was produced by running a 256x256 photo through a CycleGAN trained on 128x128 photos. This may imply that many features learned by CycleGAN are invariant to scale, which may explain why it excels at translating textures but has a hard time morphing shapes.
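This scale behavior follows from the generator being fully convolutional: no layer hard-codes the input resolution, so the same weights can be applied to larger images. A minimal illustration with a toy generator (not our trained network):

```python
import torch
import torch.nn as nn

# Toy fully convolutional generator: no fully connected layers, so the
# spatial size of the input is not baked into the weights.
toy_generator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, padding=3),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 3, kernel_size=7, padding=3),
    nn.Tanh(),
)

# The same network accepts both resolutions.
out_128 = toy_generator(torch.randn(1, 3, 128, 128))
out_256 = toy_generator(torch.randn(1, 3, 256, 256))
print(out_128.shape, out_256.shape)  # (1, 3, 128, 128) and (1, 3, 256, 256)
```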

Images from Street View Depth Maps

Google Street View panoramas include depth maps, which give us a depth value for each pixel of the panorama. We became interested in using GANs to translate from these depth maps to RGB images using the Street View data. By applying the learned models to depth maps from a different city, we could then achieve the city-to-city image translation goal.

Example Street View panorama (right) with corresponding depth map (left).

Modifying some existing scraping tools, we implemented a scraper to densely sample these panoramas and depth maps in a given region. The scraper starts from a central latitude and longitude position and then performs a depth-first search over the panorama graph (each panorama links to its neighboring panoramas).
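In pseudocode, the crawl is a standard depth-first search over the panorama graph; `fetch_metadata` and `save_panorama_and_depth` below are hypothetical stand-ins for the parts of our scraper that query the Street View endpoints and write the imagery and depth maps to disk:

```python
def fetch_metadata(pano_id):
    """Hypothetical stub: returns the panorama's metadata, including the IDs
    of linked neighboring panoramas and URLs for its tiles and depth map."""
    raise NotImplementedError

def save_panorama_and_depth(metadata):
    """Hypothetical stub: downloads and stores the panorama and depth map."""
    raise NotImplementedError

def crawl(start_pano_id, max_panos=5000):
    """Depth-first search over the panorama graph, starting from the
    panorama nearest the chosen latitude/longitude."""
    visited, stack = set(), [start_pano_id]
    while stack and len(visited) < max_panos:
        pano_id = stack.pop()
        if pano_id in visited:
            continue
        visited.add(pano_id)
        meta = fetch_metadata(pano_id)
        save_panorama_and_depth(meta)
        # Each panorama links to its neighbors; push unvisited ones.
        stack.extend(n for n in meta["links"] if n not in visited)
    return visited
```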

Our scraper at work.

After scraping thousands of panoramas from Paris and Manhattan, we tried to train GANs directly on the spherical panoramas, but this did not produce great results. Instead, we created projected views by projecting the spherical panorama to a cube map. We then trained a Pix2Pix model on the projected images. Example test results of our trained models are shown below:
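The projection itself is simple: for each pixel of a cube face, cast a ray, convert it to latitude/longitude, and sample the equirectangular panorama at that angle. A simplified sketch for the front face (nearest-neighbor sampling, numpy only; the other five faces follow by rotating the ray directions):

```python
import numpy as np

def equirect_to_front_face(pano, face_size=256):
    """Project the front face of a cube map out of an equirectangular
    panorama `pano` of shape (H, W, 3)."""
    h, w, _ = pano.shape
    # Pixel grid on the front face, in [-1, 1] x [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1, 1, face_size),
                         np.linspace(-1, 1, face_size), indexing="ij")
    zs = np.ones_like(xs)
    # Ray direction -> spherical coordinates.
    lon = np.arctan2(xs, zs)                       # [-pi, pi]
    lat = np.arctan2(-ys, np.sqrt(xs**2 + zs**2))  # [-pi/2, pi/2]
    # Spherical coordinates -> pixel coordinates in the panorama.
    u = ((lon / (2 * np.pi) + 0.5) * (w - 1)).astype(int)
    v = ((0.5 - lat / np.pi) * (h - 1)).astype(int)
    return pano[v, u]  # nearest-neighbor sampling

# Example with a random "panorama"; a real one comes from the scraper above.
face = equirect_to_front_face(np.random.rand(512, 1024, 3))
print(face.shape)  # (256, 256, 3)
```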

Depth Map to Image Results

More results on random test data: [Depth2Paris] [Depth2Manhattan]
Example test results of our trained depth map to image model. Each row is a different sample.

Street View to Depth to Street View

Using two separate models to achieve translation of images from one city to another.

Using the Street View data, we also trained a model in the other direction for each city, taking RGB input and producing depth maps. We can then do city image translation with the following process: take an image from City A, produce a depth map using our RGB->depth model trained on City A, and then feed that depth map to our depth->RGB model trained on City B.

Although each model is trained with depth maps, the final combined model does not require depth input and performs city image translation with RGB data only. Example test results of our trained models are shown above.
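Conceptually, the combined model is just the composition of the two trained generators, with depth as the intermediate representation. A sketch with placeholder networks standing in for our trained Pix2Pix checkpoints:

```python
import torch
import torch.nn as nn

# Placeholders for the two trained Pix2Pix generators; in practice these are
# loaded from the Manhattan RGB->depth and Paris depth->RGB checkpoints.
rgb_to_depth_manhattan = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))
depth_to_rgb_paris = nn.Sequential(nn.Conv2d(1, 3, 3, padding=1))

@torch.no_grad()
def manhattan_to_paris(rgb):
    """Manhattan RGB -> depth (City A model) -> Paris RGB (City B model)."""
    depth = rgb_to_depth_manhattan(rgb)  # intermediate depth map
    return depth_to_rgb_paris(depth)     # final Paris-style image

print(manhattan_to_paris(torch.randn(1, 3, 256, 256)).shape)  # (1, 3, 256, 256)
```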

Image-to-Image Translation through Depth Results

More results on random test data: [Manhattan2Depth2Paris]
Manhattan picture (left), intermediate depth map output (middle), final Paris output (right)