InLoc: Indoor Visual Localization with Dense Matching and View Synthesis
Abstract
We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map. The contributions of this work are three-fold. First, we develop a new large-scale visual localization method targeted for indoor environments. The method proceeds along three steps: (i) efficient retrieval of candidate poses that ensures scalability to large-scale environments, (ii) pose estimation using dense matching rather than local features to deal with textureless indoor scenes, and (iii) pose verification by virtual view synthesis to cope with significant changes in viewpoint, scene layout, and occluders. Second, we collect a new dataset with reference 6DoF poses for large-scale indoor localization. Query photographs are captured by mobile phones at a different time than the reference 3D map, thus presenting a realistic indoor localization scenario. Third, we demonstrate that our method significantly outperforms current state-of-the-art indoor localization approaches on this new challenging data.
Community
Proposes InLoc: localise (6DoF VPR) a query image in an indoor environment in three steps: retrieval of candidate poses, pose estimation using dense matching instead of local features (for textureless scenes), and pose verification by virtual view synthesis. Also proposes a dataset for indoor localisation (VPR) and achieves SOTA on it.
The InLoc dataset contains RGBD database images (panoramic 3D scans) registered to a floor map (plan) and RGB query images (6DoF poses annotated in the map); five scenes (one per floor). Reference poses are generated using P3P-RANSAC and bundle adjustment, then inspected visually and quantitatively by projecting database/map 3D points onto the query images using the estimated pose (checking the overlap with edges).
Uses multi-scale dense CNN features for image description and matching (to tackle the lack of sparse local features); coarse-to-fine dense feature matching with geometric verification and camera pose estimation using P3P-RANSAC (to tackle large image changes with viewpoint); counts negative evidence (to tackle self-similarity in indoor scenes) by verifying the estimate with view synthesis.
Retrieval: NetVLAD global descriptors give the N best database images for a query (ranked by normalized L2 distance); the poses of those database images become the candidates.
Pose estimation: dense feature matching with VGG-16, layer 5 for coarse and layer 3 for fine matches; each query pixel is matched to a database pixel (and so to a 3D point in the map, since the database has depth), then P3P-RANSAC gives the pose.
Pose verification: render an image from the predicted pose and compare DenseRootSIFT descriptors of the query and the rendered image; matching and non-matching regions are both counted (median descriptor distance, ignoring missing map segments) to rerank the pose estimates.
Implementation: the VGG-16 model for NetVLAD is trained on Pitts30k; poses are computed using P3P-LO-RANSAC.
Results: better than DisLoc (BoVW with Hamming embedding) and NetVLAD with sparse pose estimation on the InLoc, Matterport3D, and 7-Scenes datasets.
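A rough sketch of the first and last steps summarised above (candidate retrieval and verification-based reranking), in plain Python. This is not the authors' code: the function names, the list-of-lists descriptor layout, and the use of `None` to mark pixels with no map data are all assumptions made for illustration. The descriptors themselves (NetVLAD for retrieval, DenseRootSIFT for verification) are assumed to be computed elsewhere.

```python
import math
from statistics import median


def l2_normalize(v):
    """Scale a descriptor to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]


def top_n_candidates(query_desc, db_descs, n):
    """Indices of the n database descriptors closest to the query.

    Distance is plain L2 between L2-normalised global descriptors.
    """
    q = l2_normalize(query_desc)
    dists = [math.dist(q, l2_normalize(d)) for d in db_descs]
    return sorted(range(len(db_descs)), key=lambda i: dists[i])[:n]


def verification_score(query_descs, rendered_descs):
    """Median descriptor distance between the query and a rendered view.

    rendered_descs holds None wherever the map has no data (hypothetical
    convention); those positions are ignored. Lower is better. Returns
    infinity when the two images share no valid position.
    """
    dists = [
        math.dist(q, r)
        for q, r in zip(query_descs, rendered_descs)
        if r is not None
    ]
    return median(dists) if dists else math.inf


def rerank(candidates, query_descs, rendered_by_candidate):
    """Sort candidate pose ids by ascending verification score."""
    return sorted(
        candidates,
        key=lambda c: verification_score(query_descs, rendered_by_candidate[c]),
    )


if __name__ == "__main__":
    db = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    print(top_n_candidates([2.0, 0.0], db, 2))  # → [0, 2]
```

Because the median is taken over every valid position, regions that disagree with the rendering pull the score up, which is one simple way to let negative evidence count against a self-similar but wrong candidate.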
The appendix has sample database and query images, more qualitative results, and failure cases (many people or highly dynamic scenes). From Tokyo Institute of Technology, ETH Zurich, CTU Prague, Microsoft, Inria.
Links: website, PapersWithCode, GitHub