MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles

Chalmers University of Technology

arXiv Dataset and models Code Hugging Face

Abstract

Micromobility is a growing mode of transportation, raising new challenges for traffic safety and planning due to increased interactions in areas where vulnerable road users (VRUs) share the infrastructure with micromobility, including parked micromobility vehicles (MMVs). Approaches to support traffic safety and planning increasingly rely on detecting road users in images, a computer-vision task that depends heavily on the quality of the training images. However, existing open image datasets for training such models lack focus and diversity in VRUs and MMVs, for instance, by categorizing both pedestrians and MMV riders as “person”, or by not including new MMVs like e-scooters. Furthermore, datasets are often captured from a car perspective and lack data from areas where only VRUs travel (sidewalks, cycle paths). To help close this gap, we introduce the MicroVision dataset: an open image dataset and annotations for training and evaluating models for detecting the most common VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters), from a VRU perspective. The dataset, recorded in Gothenburg (Sweden), consists of more than 8,000 anonymized, full-HD images with more than 30,000 carefully annotated VRUs and MMVs, captured over an entire year and drawn from almost 2,000 unique interaction scenes. Along with the dataset, we provide first benchmark object-detection models based on state-of-the-art architectures, which achieved a mean average precision of up to 0.726 on an unseen test set. The dataset and models can support traffic-safety systems in distinguishing between different VRUs and MMVs, or help monitoring systems identify the use of micromobility.

Dataset

The dataset and model weights are hosted by the Swedish National Data Service (SND) and can be accessed by submitting a request: Link to the dataset.

File Structure

Below is the directory and file layout for the MicroVision dataset and model weights.

├── images.zip                  # All blurred scene frames (JPG)
├── labels.zip                  # YOLO format annotation files (TXT)
├── meta.csv                    # Dataset split and reproduction metadata
├── microvision_yolo11.pt       # YOLO11-X model weights
├── microvision_fasterrcnn.pth  # Detectron2 Faster R-CNN model weights
└── microvision_rfdetr.pth      # RF-DETR Large model weights

`images.zip`

Contains all images.

- Format: JPG
- Faces and license plates were blurred using BrighterAI’s Precision Blur technology
- File naming convention:

S{X}_{Y}.jpg

Where:

X = scene number
Y = frame number within the scene
Both indices start from 0
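
The naming convention can be turned into scene and frame numbers with a few lines of Python. This is a minimal sketch, not part of the dataset tooling: the helper name `parse_frame_name` and the `FrameId` tuple are hypothetical, and the pattern assumes that both indices are plain non-negative integers without padding rules.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple


class FrameId(NamedTuple):
    """Scene and frame index parsed from a name such as S12_3.jpg."""
    scene: int
    frame: int


# S{X}_{Y}: X = scene number, Y = frame number, both starting from 0.
_NAME_PATTERN = re.compile(r"^S(\d+)_(\d+)$")


def parse_frame_name(name: str) -> FrameId:
    """Parse 'S{X}_{Y}', with or without a directory and extension."""
    stem = PurePosixPath(name).stem  # drops the directory and '.jpg' / '.txt'
    match = _NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"not a MicroVision-style name: {name!r}")
    return FrameId(scene=int(match.group(1)), frame=int(match.group(2)))


# Sorting by the parsed tuple gives numeric order (S2 before S10),
# which a plain string sort would not.
names = ["S10_0.jpg", "S2_1.jpg", "S2_0.jpg"]
ordered = sorted(names, key=parse_frame_name)
```

Because the labels use the same naming convention, the same helper works for `.txt` files, which makes it straightforward to pair each image with its annotation file.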

`labels.zip`

Contains all labels in YOLO format.

- Format: TXT
- Includes:
  - Object bounding box coordinates
  - Class labels (1 = pedestrian, 2 = bicycle, 3 = cyclist, 4 = e-scooter, 5 = e-scooterist)
- Uses the same file naming convention as the images
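
To illustrate how such a label line can be read, the sketch below converts one annotation into a pixel-space box. It rests on several assumptions that should be checked against the actual files: that each line follows the common YOLO text layout `class x_center y_center width height` with coordinates normalized to [0, 1], that full-HD means 1920 × 1080 pixels, and that the class ids are exactly those listed above (many YOLO tools expect zero-based ids, so an offset may be needed). The function name `parse_yolo_line` is hypothetical.

```python
# Class ids as listed in the dataset description above.
CLASS_NAMES = {
    1: "pedestrian",
    2: "bicycle",
    3: "cyclist",
    4: "e-scooter",
    5: "e-scooterist",
}


def parse_yolo_line(line, img_w=1920, img_h=1080):
    """Return (class_id, (x_min, y_min, x_max, y_max)) in pixels.

    Assumes the line is 'class x_center y_center width height' with
    normalized coordinates; the default image size is an assumption.
    """
    parts = line.split()
    class_id = int(parts[0])
    x_c, y_c, w, h = (float(value) for value in parts[1:5])
    x_min = (x_c - w / 2) * img_w
    y_min = (y_c - h / 2) * img_h
    x_max = (x_c + w / 2) * img_w
    y_max = (y_c + h / 2) * img_h
    return class_id, (x_min, y_min, x_max, y_max)


# Made-up annotation line for illustration, not taken from the dataset.
class_id, box = parse_yolo_line("3 0.5 0.5 0.25 0.5")
label = CLASS_NAMES[class_id]
```

For this made-up line, the box is centered in the image and spans a quarter of its width and half of its height.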

`meta.csv`

Contains metadata required to reproduce dataset splits used for benchmarking.

Format: CSV

Column	Description
`scene`	Scene identifier
`img`	Image/label name (without extension)
`split`	Dataset split (`train`, `val`, or `test`)
`split_notest`	Split used when training without a test set (`train` or `val`)
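
The splits can be reproduced by grouping the `img` column by one of the two split columns. The sketch below uses only the Python standard library; it assumes a comma-delimited file with a header row matching the column names above, and the sample rows are invented purely to make the example self-contained. The function name `images_by_split` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def images_by_split(csv_file, split_column="split"):
    """Group image names by the value of `split_column`.

    `csv_file` is any open text file (or file-like object) holding meta.csv.
    Use split_column="split_notest" for the train/val-only variant.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[split_column]].append(row["img"])
    return dict(groups)


# Invented rows for illustration; the real values come from meta.csv.
SAMPLE = (
    "scene,img,split,split_notest\n"
    "0,S0_0,train,train\n"
    "0,S0_1,train,train\n"
    "1,S1_0,test,val\n"
)

with_test = images_by_split(io.StringIO(SAMPLE))
without_test = images_by_split(io.StringIO(SAMPLE), split_column="split_notest")
```

With the real file, the call would look like `images_by_split(open("meta.csv", newline=""))`; appending `.jpg` or `.txt` to each name then yields the matching image and label files.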

Model Weights

We trained three different object detection models on the MicroVision dataset, and the weights for all three models are available for download. The table below provides an overview of the available model weights. You can try out the models on our Hugging Face Space.

File	Description
`microvision_yolo11.pt`	Weights for the YOLO11-X model
`microvision_fasterrcnn.pth`	Weights for the Detectron2 Faster R-CNN model
`microvision_rfdetr.pth`	Weights for the RF-DETR Large model (resolution: `1232px`)

Model inference example
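
As a rough sketch of how the YOLO11 weights might be used, the snippet below runs a single image through the Ultralytics Python package. It assumes that `ultralytics` is installed, that `microvision_yolo11.pt` has been downloaded and loads as a regular Ultralytics checkpoint, and that the class names are stored in the checkpoint; none of this is confirmed here. The helper names `detect` and `format_detection` and the default confidence threshold are illustrative choices.

```python
def format_detection(class_name, confidence, box_xyxy):
    """Format one detection as 'name score [x1, y1, x2, y2]' in pixels."""
    x1, y1, x2, y2 = (round(value) for value in box_xyxy)
    return f"{class_name} {confidence:.2f} [{x1}, {y1}, {x2}, {y2}]"


def detect(weights_path, image_path, conf=0.25):
    """Run one image through the model and return formatted detections.

    The import is done lazily so that this module can be loaded even
    when the ultralytics package is not installed.
    """
    from ultralytics import YOLO  # assumption: pip install ultralytics

    model = YOLO(weights_path)
    result = model(image_path, conf=conf)[0]  # one result per input image
    boxes = result.boxes
    return [
        format_detection(result.names[int(cls_id)], score, xyxy)
        for cls_id, score, xyxy in zip(
            boxes.cls.tolist(), boxes.conf.tolist(), boxes.xyxy.tolist()
        )
    ]
```

A call such as `detect("microvision_yolo11.pt", "S0_0.jpg")` would then return one line per detected object. The Faster R-CNN and RF-DETR weights need their own libraries and loading code, which are not covered by this sketch.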

Benchmark model performance on the test set

The evaluation metric is mAP@[0.5:0.95] (Lin et al., 2014). S, M, and L denote small, medium, and large objects, respectively. Bold values indicate the best performance per class and object size. More details can be found in the paper linked at the top.

Class	Model	S	M	L	All
Pedestrian	YOLO11	**0.442**	0.645	0.769	0.597
Pedestrian	Faster R-CNN	0.279	0.559	0.726	0.499
Pedestrian	RF-DETR	0.438	**0.673**	**0.869**	**0.629**
Bicycle	YOLO11	0.110	0.373	0.724	0.518
Bicycle	Faster R-CNN	0.131	0.280	0.633	0.431
Bicycle	RF-DETR	**0.233**	**0.411**	**0.818**	**0.597**
Cyclist	YOLO11	**0.353**	0.700	0.883	0.766
Cyclist	Faster R-CNN	0.093	0.570	0.818	0.669
Cyclist	RF-DETR	0.332	**0.725**	**0.932**	**0.813**
E-scooter	YOLO11	**0.232**	**0.609**	0.837	0.692
E-scooter	Faster R-CNN	0.052	0.408	0.694	0.510
E-scooter	RF-DETR	0.226	0.583	**0.879**	**0.702**
E-scooterist	YOLO11	**0.574**	0.787	0.927	0.865
E-scooterist	Faster R-CNN	0.280	0.671	0.880	0.789
E-scooterist	RF-DETR	0.499	**0.816**	**0.950**	**0.889**
All classes	YOLO11	0.342	0.623	0.828	0.687
All classes	Faster R-CNN	0.167	0.497	0.750	0.580
All classes	RF-DETR	**0.346**	**0.641**	**0.890**	**0.726**
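
For readers unfamiliar with the metric in the table above, the sketch below shows its two building blocks: the intersection over union (IoU) of two boxes, and the ten IoU thresholds from 0.50 to 0.95 over which average precision is averaged. The size buckets use the COCO area limits of 32² and 96² pixels; applying them to the bounding-box area, and to this benchmark at all, is an assumption, since the exact definition is given in the paper rather than here. All function names are hypothetical, and this is not the evaluation code used for the benchmark.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# mAP@[0.5:0.95] averages AP over these ten IoU thresholds.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_matched(iou_value):
    """Number of thresholds at which a detection with this IoU is a match."""
    return sum(iou_value >= threshold for threshold in IOU_THRESHOLDS)


def size_bucket(box):
    """COCO-style size class from the box area in pixels (assumed here)."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# Two 10 x 10 boxes overlapping by half: intersection 50, union 150.
example_iou = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

In this example the IoU is one third, so the detection would not count as a match at any of the ten thresholds, which illustrates how strict the averaged metric is compared with a single threshold of 0.5.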

BibTeX citation

@misc{microvision2026,
      title={MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles},
      author={Alexander Rasch and Rahul Rajendra Pai},
      year={2026},
      eprint={2603.18192},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.18192},
}

References

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V (pp. 740–755).