Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images
Xindi Wu 1* KwunFung Lau 1 Francesco Ferroni 2 Aljoša Ošep 1 Deva Ramanan 1, 2
Carnegie Mellon University 1
Argo AI 2
* work done while at CMU, now at Princeton University
CVPR 2023

Abstract
Self-driving vehicles rely on urban street maps for autonomous navigation. In this paper, we introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images, as needed to continually update and expand existing maps. This is a challenging task, as we need to infer a complex urban road topology directly from raw image data. The main insight of this paper is that this problem can be posed as cross-modal retrieval by learning a joint, cross-modal embedding space for images and existing maps, represented as discrete graphs that encode the topological layout of the visual surroundings. We conduct our experimental evaluation using the Argoverse dataset and show that it is indeed possible to accurately retrieve street maps corresponding to both seen and unseen roads solely from image data. Moreover, we show that our retrieved maps can be used to update or expand existing maps and even show proof-of-concept results for visual localization and image retrieval from spatial graphs.

Task overview
Problem: Infer topological road maps from images.
Challenges: Learning to map continuous images to discrete bird's-eye-view (BEV) graphs (maps) with varying numbers of nodes and varying topology is difficult.
Prior works: (jointly) learn a non-linear mapping from image pixels to BEV, and estimate the road layout by generating a discrete spatial graph from detected lane markings.

Our approach:
Via cross-modal retrieval, Pix2Map returns the street-map graph whose embedding is most similar to that of the input image, as sketched below.
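A minimal retrieval sketch under assumed interfaces (the embedding bank, dimensions, and function names below are illustrative placeholders, not the released Pix2Map API): given a query image embedding and a bank of precomputed graph embeddings, return the candidate graph with the highest cosine similarity.

```python
import torch
import torch.nn.functional as F

def retrieve_graph(image_feat: torch.Tensor,
                   graph_feats: torch.Tensor) -> int:
    """image_feat: (D,) embedding of the ego-view image stack.
    graph_feats: (N, D) embeddings of N candidate street-map graphs.
    Returns the index of the best-matching graph."""
    image_feat = F.normalize(image_feat, dim=-1)     # unit-norm query
    graph_feats = F.normalize(graph_feats, dim=-1)   # unit-norm candidates
    sims = graph_feats @ image_feat                  # (N,) cosine similarities
    return int(sims.argmax())
```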

Pix2Map: The graph encoder (bottom) computes a graph embedding vector \( \phi_{\text{graph}} \) for each street map in a batch. The image encoder (top) outputs an image embedding \( \phi_{\text{image}} \) for each corresponding image stack. We then build a similarity matrix for the batch that contrasts the image and graph embeddings. We highlight that the adjacency matrix of a given graph is used as the attention mask for our transformer-based graph encoder.
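The sketch below illustrates the two ingredients described in the caption: a symmetric contrastive (CLIP-style) loss over the batch image-graph similarity matrix, and a transformer layer whose attention is restricted by the graph's adjacency matrix. Layer sizes, temperature, and class/function names are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(img_emb, graph_emb, temperature=0.07):
    """img_emb, graph_emb: (B, D) embeddings for matching image/graph pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    graph_emb = F.normalize(graph_emb, dim=-1)
    logits = img_emb @ graph_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2g = F.cross_entropy(logits, targets)           # image -> graph direction
    loss_g2i = F.cross_entropy(logits.t(), targets)       # graph -> image direction
    return 0.5 * (loss_i2g + loss_g2i)

class MaskedGraphLayer(nn.Module):
    """One transformer encoder layer whose attention follows the graph edges."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, nodes, adjacency):
        # nodes: (B, N, dim) lane-node features; adjacency: (B, N, N) boolean matrix.
        eye = torch.eye(adjacency.size(-1), device=adjacency.device, dtype=torch.bool)
        allowed = adjacency.bool() | eye                   # always allow self-attention
        mask = ~allowed                                    # True = position is masked out
        mask = mask.repeat_interleave(self.attn.num_heads, dim=0)  # (B*heads, N, N)
        x = self.norm1(nodes + self.attn(nodes, nodes, nodes, attn_mask=mask)[0])
        return self.norm2(x + self.ffn(x))
```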


Results
Baseline comparisons. For a fair comparison with prior art [1], in this experiment we (i) train Pix2Map using frontal \(50m \times 50m\) road-graphs (as opposed to our default setting of predicting the surrounding \(40m \times 40m\) area). Moreover, we (ii) train Pix2Map with a single frontal view (Pix2Map-Single) to ensure a consistent comparison to baselines. Importantly, even in this setting, our method still outperforms baselines by a large margin, obtaining a Chamfer distance of 2.6819 compared to 3.0140 for the closest competitor, TOPO-TR [1].
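For reference, a generic symmetric Chamfer distance between two BEV point sets is sketched below; the exact evaluation protocol (e.g., how lane nodes are sampled or the distances aggregated) is defined by the paper and may differ from this minimal version.

```python
import torch

def chamfer_distance(pred_pts: torch.Tensor, gt_pts: torch.Tensor) -> torch.Tensor:
    """pred_pts: (N, 2), gt_pts: (M, 2) node coordinates in meters (BEV)."""
    dists = torch.cdist(pred_pts, gt_pts)           # (N, M) pairwise Euclidean distances
    pred_to_gt = dists.min(dim=1).values.mean()     # each predicted node -> nearest GT node
    gt_to_pred = dists.min(dim=0).values.mean()     # each GT node -> nearest predicted node
    return pred_to_gt + gt_to_pred
```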

Video presentation