LoTIS: Learning to Localize Reference Trajectories in Image-Space for Visual Navigation


arXiv Project Page HuggingFace

Finn Busch, Matti Vahs, Quantao Yang, Jesús Gerardo Ortega Peimbert, Yixi Cai, Jana Tumova, Olov Andersson

Division of Robotics, Perception, and Learning at KTH Royal Institute of Technology

LoTIS is a model for visual navigation that provides robot-agnostic image-space guidance by localizing a reference RGB trajectory in the robot's current view. Given a reference trajectory (a sequence of RGB images) and a query image from the robot's current viewpoint, LoTIS predicts the 2D image-space coordinates, visibility, and relative distance of each trajectory pose as it would appear in the query view.
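To make the output concrete, here is a hypothetical sketch of what such a per-pose prediction contains and how it could be consumed. The names below (PosePrediction, visible_indices, closest_frame) are illustrative stand-ins, not the actual LoTIS API:

```python
from dataclasses import dataclass

@dataclass
class PosePrediction:
    # Hypothetical per-pose output; field names are illustrative.
    u: float           # image-space x coordinate (pixels) in the query view
    v: float           # image-space y coordinate (pixels) in the query view
    visibility: float  # score for whether the pose is visible in the query view
    distance: float    # predicted relative distance to the query viewpoint

def visible_indices(preds, threshold=0.5):
    """Indices of trajectory poses whose visibility exceeds the threshold."""
    return [i for i, p in enumerate(preds) if p.visibility >= threshold]

def closest_frame(preds, threshold=0.5):
    """Index of the closest visible trajectory pose, or None if none is visible."""
    idxs = visible_indices(preds, threshold)
    return min(idxs, key=lambda i: preds[i].distance) if idxs else None

preds = [
    PosePrediction(120.0, 80.0, 0.9, 3.2),
    PosePrediction(200.0, 95.0, 0.7, 1.1),
    PosePrediction(310.0, 90.0, 0.2, 0.4),  # low visibility: occluded / out of view
]
print(visible_indices(preds))  # -> [0, 1]
print(closest_frame(preds))    # -> 1
```

This mirrors the closest_frame() and visible_indices() helpers exposed by the Python API below.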

Setup

1. Clone this repository and install dependencies

git clone https://github.com/KTH-RPL/lotis.git
cd lotis

Dependencies are managed with uv; uv run installs everything automatically on first use. To install manually:

uv sync

2. Clone the DINOv3 repository

LoTIS uses DINOv3 as a frozen backbone. Clone Meta's repository into the project root:

git clone https://github.com/facebookresearch/dinov3.git

3. Download DINOv3 weights

Request access to DINOv3 pretrained weights at ai.meta.com/resources/models-and-libraries/dinov3-downloads. Once approved, you will receive an email with download URLs. Download the ViT-B/16 pretrain weights (dinov3_vitb16_pretrain.pth) and set the path:

export DINOV3_WEIGHTS=/path/to/dinov3_vitb16_pretrain.pth

Or pass it directly via --dinov3-weights when running scripts.
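Since the path can come from either the flag or the environment variable, a script presumably resolves it with the flag taking precedence. A minimal sketch of that resolution logic (illustrative only, not the actual infer.py code):

```python
import os

def resolve_dinov3_weights(cli_path=None):
    """Resolve the DINOv3 weights path.

    Sketch of the assumed precedence: an explicit --dinov3-weights value
    wins over the DINOV3_WEIGHTS environment variable.
    """
    path = cli_path or os.environ.get("DINOV3_WEIGHTS")
    if not path:
        raise SystemExit(
            "DINOv3 weights not found: pass --dinov3-weights or set DINOV3_WEIGHTS"
        )
    return path

os.environ["DINOV3_WEIGHTS"] = "/weights/dinov3_vitb16_pretrain.pth"
print(resolve_dinov3_weights())              # falls back to the env var
print(resolve_dinov3_weights("/tmp/w.pth"))  # explicit flag wins
```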

4. Download LoTIS model weights

huggingface-cli download fnnBsch/lotis final_model.pth final_config.yaml --local-dir .

Running inference

Localize a reference trajectory in a query input (an image, video, or image directory), visualized live in Rerun:

uv run python infer.py \
    --trajectory examples/00_KTH_Campus/Courtyard/trajectory.mp4 \
    --query examples/00_KTH_Campus/Courtyard/queries/Forward/query.mp4

Live webcam — encode a trajectory once, then localize from your camera in real time:

uv run python infer.py \
    --trajectory path/to/trajectory.mp4 \
    --usb-cam

Use --cam-id 1 to select a different camera. All infer.py options:

--trajectory PATH       Reference trajectory (video, image dir, or single image)
--query PATH            Query input (video, image dir, or single image)
--usb-cam               Use webcam as live query  [mutually exclusive with --query]
--cam-id INT            Camera device ID (default: 0)
--checkpoint PATH       LoTIS checkpoint (default: final_model.pth)
--config PATH           LoTIS config (default: final_config.yaml)
--dinov3-weights PATH   DINOv3 weights file (or set DINOV3_WEIGHTS)
--dinov3-repo PATH      Path to cloned dinov3 repo (default: ./dinov3)
--device cuda|cpu       Inference device (auto-detected if not set)
--vis-threshold FLOAT   Visibility threshold for displaying points (default: 0.5)

Rerun also accepts additional flags (e.g. --serve to stream to a remote viewer). Run python infer.py --help for the full list.

Remote visualization

If running on a remote machine, start the Rerun viewer locally and forward the port:

# On your local machine
rerun

# Forward the Rerun port from remote to local
ssh -R 9876:localhost:9876 <remote-host>

# On the remote machine
uv run python infer.py --trajectory ... --query ... --connect

Gradio demo

A full interactive demo is available at the HuggingFace Space. To run it locally:

uv run python app.py

Set DINOV3_WEIGHTS (and optionally DINOV3_REPO) before running.

Python API

from lotis import TrajectoryLocalizer, TrajectoryEncoding

localizer = TrajectoryLocalizer.from_checkpoint(
    checkpoint_path="final_model.pth",
    config_path="final_config.yaml",
    dinov3_weights="/path/to/dinov3_vitb16_pretrain.pth",
)

# Encode a trajectory — do this once and reuse
encoding = localizer.encode_trajectory("path/to/trajectory.mp4")

# Localize a query image
result = localizer.localize("query.jpg", encoding)

print(f"Closest trajectory frame: {result.closest_frame()}")
print(f"Visible frames: {result.visible_indices()}")

# Save and reload the trajectory encoding
import torch
torch.save(encoding.to_dict(), "encoding.pt")
encoding = TrajectoryEncoding.from_dict(torch.load("encoding.pt"))

localize() accepts single images or lists, and single encodings or lists — see lotis/localizer.py for the full batching API.
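Accepting "single or list" inputs usually means normalizing everything to paired lists internally. A hypothetical sketch of that dispatch pattern (not the actual lotis/localizer.py code; see that file for the real batching API):

```python
def normalize_batch(queries, encodings):
    """Pair query inputs with trajectory encodings, accepting singles or lists.

    Hypothetical sketch of a batching convention: a single encoding is
    broadcast across all queries; otherwise lengths must match.
    """
    if not isinstance(queries, list):
        queries = [queries]
    if not isinstance(encodings, list):
        encodings = [encodings]
    if len(encodings) == 1 and len(queries) > 1:
        encodings = encodings * len(queries)  # broadcast one encoding
    if len(queries) != len(encodings):
        raise ValueError("number of queries and encodings must match")
    return list(zip(queries, encodings))

print(normalize_batch("q1.jpg", "enc"))               # single query, single encoding
print(normalize_batch(["q1.jpg", "q2.jpg"], "enc"))   # one encoding broadcast to both
```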

Roadmap

  • Inference code + Gradio Demo
  • Evaluation code
  • Training code

Citation

@misc{busch2026learninglocalizereferencetrajectories,
      title={Learning to Localize Reference Trajectories in Image-Space for Visual Navigation}, 
      author={Finn Lukas Busch and Matti Vahs and Quantao Yang and Jesús Gerardo Ortega Peimbert and Yixi Cai and Jana Tumova and Olov Andersson},
      year={2026},
      eprint={2602.18803},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.18803}, 
}
