MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles

Abstract

Micromobility is a growing mode of transportation, raising new challenges for traffic safety and planning due to increased interactions in areas where vulnerable road users (VRUs) share the infrastructure with micromobility, including parked micromobility vehicles (MMVs). Approaches to support traffic safety and planning increasingly rely on detecting road users in images, a computer-vision task whose performance depends heavily on the quality of the training images. However, existing open image datasets for training such models lack focus and diversity in VRUs and MMVs, for instance, by categorizing both pedestrians and MMV riders as “person”, or by not including new MMVs like e-scooters. Furthermore, datasets are often captured from a car perspective and lack data from areas where only VRUs travel (sidewalks, cycle paths). To help close this gap, we introduce the MicroVision dataset: an open image dataset with annotations for training and evaluating models that detect the most common VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters) from a VRU perspective. The dataset, recorded in Gothenburg (Sweden), consists of more than 8,000 anonymized, full-HD images with more than 30,000 carefully annotated VRUs and MMVs, captured over an entire year and drawn from almost 2,000 unique interaction scenes. Along with the dataset, we provide first benchmark object-detection models based on state-of-the-art architectures, which achieved a mean average precision of up to 0.726 on an unseen test set. The dataset and models can support traffic-safety applications that need to distinguish between different VRUs and MMVs, or help monitoring systems identify the use of micromobility.

Dataset

The dataset and model weights are hosted by the Swedish National Data Service (SND) and can be accessed by submitting a request: Link to the dataset.

[Figures: raw image (anonymized); annotated image]

File Structure

Below is the directory and file layout for the MicroVision dataset and model weights.

├── images.zip # All blurred scene frames (JPG)
├── labels.zip # YOLO format annotation files (TXT)
├── meta.csv # Dataset split and reproduction metadata
├── microvision_yolo11.pt # YOLO11-X model weights
├── microvision_fasterrcnn.pth # Detectron2 Faster R-CNN model weights
└── microvision_rfdetr.pth # RF-DETR Large model weights

images.zip

Contains all images.

S{X}_{Y}.jpg

Where:


labels.zip

Contains all labels in YOLO format.
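
As a minimal sketch of how these label files can be read: each line of a YOLO-format TXT file describes one object as `class_id x_center y_center width height`, with the four box values normalized to [0, 1] by the image width and height. The snippet below converts such a line to pixel corner coordinates. It assumes landscape full-HD frames (1920×1080); the function names and the sample line are illustrative only, and because the mapping from class IDs to the five MicroVision classes is not stated here, the IDs are kept as plain integers.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def parse_yolo_line(line: str, img_w: int = 1920, img_h: int = 1080) -> Box:
    """Convert one YOLO label line to pixel corner coordinates.

    img_w and img_h default to full HD, which is an assumption about
    the frame orientation and size.
    """
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    class_id = int(parts[0])
    xc, yc, w, h = (float(p) for p in parts[1:])
    return Box(
        class_id,
        (xc - w / 2) * img_w,
        (yc - h / 2) * img_h,
        (xc + w / 2) * img_w,
        (yc + h / 2) * img_h,
    )


def parse_yolo_labels(text: str, img_w: int = 1920, img_h: int = 1080) -> List[Box]:
    """Parse the full contents of one label file, skipping blank lines."""
    return [
        parse_yolo_line(line, img_w, img_h)
        for line in text.splitlines()
        if line.strip()
    ]


# Illustrative line, not taken from the dataset.
example = parse_yolo_line("2 0.5 0.5 0.25 0.5")
```

An image without any annotated object typically has an empty label file (or none at all), in which case `parse_yolo_labels` returns an empty list.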


meta.csv

Contains metadata required to reproduce dataset splits used for benchmarking.

| Column | Description |
|---|---|
| scene | Scene identifier |
| img | Image/label name (without extension) |
| split | Dataset split (train, val, or test) |
| split_notest | Split used when training without a test set (train or val) |
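
One way to recover the benchmark splits from this file is sketched below. It assumes `meta.csv` is comma-delimited with a header row that uses exactly the column names listed above; the sample rows, including the scene values, are hypothetical and only stand in for the real file.

```python
import csv
import io
from collections import defaultdict


def group_by_split(csv_file, split_column="split"):
    """Map each split name to the list of image names assigned to it."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[split_column]].append(row["img"])
    return dict(groups)


# Hypothetical rows for illustration only. For the real file, pass
# open("meta.csv", newline="") instead of an io.StringIO object.
sample = io.StringIO(
    "scene,img,split,split_notest\n"
    "S1,S1_1,train,train\n"
    "S1,S1_2,train,train\n"
    "S2,S2_1,test,val\n"
)

splits = group_by_split(sample)

# The same helper works for the split without a test set.
sample.seek(0)
splits_notest = group_by_split(sample, split_column="split_notest")
```

Appending `.jpg` or `.txt` to each returned name gives the matching file in `images.zip` or `labels.zip`, respectively.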

Model Weights

We trained three different object detection models on the MicroVision dataset, and the weights for all three models are available for download. The table below provides an overview of the available model weights. You can try out the models on our Hugging Face Space.

| File | Description |
|---|---|
| microvision_yolo11.pt | Weights for the YOLO11-X model |
| microvision_fasterrcnn.pth | Weights for the Detectron2 Faster R-CNN model |
| microvision_rfdetr.pth | Weights for the RF-DETR Large model (resolution: 1232px) |

Model inference example

Benchmark model performance on the test set

The evaluation metric is mAP@[0.5:0.95] (Lin et al., 2014). S, M, and L denote small, medium, and large objects, respectively. Bold values indicate the best performance per class and object size. More details can be found in the paper linked at the top.
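
To make the metric concrete: mAP@[0.5:0.95] averages the average precision over ten intersection-over-union (IoU) thresholds, from 0.50 to 0.95 in steps of 0.05, as in the COCO evaluation protocol. The sketch below computes the IoU of two boxes and lists the thresholds at which a detection would count as a match. It is not the evaluation code used for the benchmark, and the function names are illustrative. The size helper assumes the COCO area limits (32² and 96² pixels) for small, medium, and large objects, which this README does not confirm.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """Thresholds at which `pred` would count as a match for `truth`."""
    score = iou(pred, truth)
    return [t for t in IOU_THRESHOLDS if score >= t]


def size_bucket(box):
    """Assign S/M/L by pixel area, assuming the COCO area limits."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "S"
    if area < 96 ** 2:
        return "M"
    return "L"
```

A prediction that overlaps its ground-truth box with an IoU of about 0.72, for example, counts as correct at the five thresholds from 0.50 to 0.70 and as a miss at the remaining five, which is why tightly localized boxes score higher under this metric.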

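| Class | Model | S | M | L | All |
|---|---|---|---|---|---|
| Pedestrian | YOLO11 | **0.442** | 0.645 | 0.769 | 0.597 |
| | Faster R-CNN | 0.279 | 0.559 | 0.726 | 0.499 |
| | RF-DETR | 0.438 | **0.673** | **0.869** | **0.629** |
| Bicycle | YOLO11 | 0.110 | 0.373 | 0.724 | 0.518 |
| | Faster R-CNN | 0.131 | 0.280 | 0.633 | 0.431 |
| | RF-DETR | **0.233** | **0.411** | **0.818** | **0.597** |
| Cyclist | YOLO11 | **0.353** | 0.700 | 0.883 | 0.766 |
| | Faster R-CNN | 0.093 | 0.570 | 0.818 | 0.669 |
| | RF-DETR | 0.332 | **0.725** | **0.932** | **0.813** |
| E-scooter | YOLO11 | **0.232** | **0.609** | 0.837 | 0.692 |
| | Faster R-CNN | 0.052 | 0.408 | 0.694 | 0.510 |
| | RF-DETR | 0.226 | 0.583 | **0.879** | **0.702** |
| E-scooterist | YOLO11 | **0.574** | 0.787 | 0.927 | 0.865 |
| | Faster R-CNN | 0.280 | 0.671 | 0.880 | 0.789 |
| | RF-DETR | 0.499 | **0.816** | **0.950** | **0.889** |
| All classes | YOLO11 | 0.342 | 0.623 | 0.828 | 0.687 |
| | Faster R-CNN | 0.167 | 0.497 | 0.750 | 0.580 |
| | RF-DETR | **0.346** | **0.641** | **0.890** | **0.726** |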
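
The "All classes" rows appear to be the unweighted mean of the five per-class values. The short check below, using the "All" column copied from the table above, is a sketch of that arithmetic rather than the authors' evaluation code; small differences are expected because the per-class values are already rounded to three decimals.

```python
from statistics import mean

# "All" column per class, in table order: Pedestrian, Bicycle, Cyclist,
# E-scooter, E-scooterist.
PER_CLASS_ALL = {
    "YOLO11": [0.597, 0.518, 0.766, 0.692, 0.865],
    "Faster R-CNN": [0.499, 0.431, 0.669, 0.510, 0.789],
    "RF-DETR": [0.629, 0.597, 0.813, 0.702, 0.889],
}

# Values reported in the "All classes" rows.
REPORTED = {"YOLO11": 0.687, "Faster R-CNN": 0.580, "RF-DETR": 0.726}

# Unweighted (macro) average over the five classes for each model.
macro = {model: mean(values) for model, values in PER_CLASS_ALL.items()}

for model, value in macro.items():
    print(f"{model}: computed {value:.4f}, reported {REPORTED[model]:.3f}")
```

For all three models the computed mean agrees with the reported value to within 0.001.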

BibTeX citation

@misc{microvision2026,
title={MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles},
author={Alexander Rasch and Rahul Rajendra Pai},
year={2026},
eprint={2603.18192},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.18192},
}

References

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V (pp. 740–755).