Abstract
Micromobility is a growing mode of transportation, raising new challenges for traffic safety and planning due to increased interactions in areas where vulnerable road users (VRUs) share the infrastructure with micromobility, including parked micromobility vehicles (MMVs). Approaches to support traffic safety and planning increasingly rely on detecting road users in images, a computer-vision task that depends heavily on the quality of the training images. However, existing open image datasets for training such models lack focus on and diversity in VRUs and MMVs, for instance by categorizing both pedestrians and MMV riders as “person” or by omitting newer MMVs such as e-scooters. Furthermore, datasets are often captured from a car perspective and lack data from areas where only VRUs travel (sidewalks, cycle paths). To help close this gap, we introduce the MicroVision dataset: an open image dataset with annotations for training and evaluating models that detect the most common VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters) from a VRU perspective. The dataset, recorded in Gothenburg (Sweden), consists of more than 8,000 anonymized, full-HD images with more than 30,000 carefully annotated VRUs and MMVs, captured over an entire year and drawn from almost 2,000 unique interaction scenes. Along with the dataset, we provide initial benchmark object-detection models based on state-of-the-art architectures, which achieved a mean average precision of up to 0.726 on an unseen test set. The dataset and models can support traffic-safety applications that need to distinguish between different VRUs and MMVs, and can help monitoring systems identify the use of micromobility.
Dataset
The dataset and model weights are hosted by the Swedish National Data Service (SND) and can be accessed by submitting a request: Link to the dataset.
File Structure
Below is the directory and file layout for the MicroVision dataset and model weights.
├── images.zip                    # All blurred scene frames (JPG)
├── labels.zip                    # YOLO format annotation files (TXT)
├── meta.csv                      # Dataset split and reproduction metadata
├── microvision_yolo11.pt         # YOLO11-X model weights
├── microvision_fasterrcnn.pth    # Detectron2 Faster R-CNN model weights
└── microvision_rfdetr.pth        # RF-DETR Large model weights

images.zip
Contains all images.
- Format: JPG
- Faces and license plates were blurred using BrighterAI’s Precision Blur technology
- File naming convention: S{X}_{Y}.jpg, where:
  - X = scene number
  - Y = frame number within the scene
  - Both indices start from 0
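The naming convention above can be parsed with a small helper. This is a minimal sketch, not part of the dataset tooling: the function names `parse_name` and `label_name_for` are hypothetical, and the pattern assumes that X and Y are plain decimal integers.

```python
import re
from pathlib import Path

# Pattern for the S{X}_{Y} convention; the .jpg/.txt extension is optional
# so the same helper also works on the extension-free names in meta.csv.
_NAME_RE = re.compile(r"S(\d+)_(\d+)(?:\.(?:jpg|txt))?")


def parse_name(name):
    """Return (scene, frame) as integers for names such as 'S12_3.jpg'."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a MicroVision file name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def label_name_for(image_name):
    """Map an image file name to the label file name with the same stem."""
    parse_name(image_name)  # validate the name first
    return Path(image_name).with_suffix(".txt").name
```

For example, `parse_name("S12_3.jpg")` yields scene 12 and frame 3, and `label_name_for("S12_3.jpg")` yields `S12_3.txt`.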
labels.zip
Contains all labels in YOLO format.
- Format: TXT
- Includes:
  - Object bounding box coordinates
  - Class labels (1 = pedestrian, 2 = bicycle, 3 = cyclist, 4 = e-scooter, 5 = e-scooterist)
- Uses the same file naming convention as the images
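To illustrate how a label line might be read, here is a minimal sketch. It assumes the conventional YOLO text layout (`class x_center y_center width height`, with coordinates normalized to the range 0 to 1) and a full-HD image size of 1920x1080 pixels; neither is confirmed by the files themselves, so check both against the actual data. The class mapping is copied from the list above. If the files turn out to use zero-based class ids, as is common for YOLO training, the mapping needs to be shifted by one. The function name `parse_yolo_line` is hypothetical.

```python
# Class ids as listed in this README (assumed to match the label files).
CLASS_NAMES = {
    1: "pedestrian",
    2: "bicycle",
    3: "cyclist",
    4: "e-scooter",
    5: "e-scooterist",
}


def parse_yolo_line(line, img_w=1920, img_h=1080):
    """Convert one YOLO label line to (class_name, (x_min, y_min, x_max, y_max)).

    Assumes normalized center/size coordinates; returns pixel coordinates.
    """
    class_id, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
    box = (
        (xc - w / 2) * img_w,
        (yc - h / 2) * img_h,
        (xc + w / 2) * img_w,
        (yc + h / 2) * img_h,
    )
    name = CLASS_NAMES.get(int(class_id), f"class_{class_id}")
    return name, box
```

A line such as `3 0.5 0.5 0.25 0.5` would then describe a cyclist whose box is centered in the image, a quarter of the image wide and half of it tall.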
meta.csv
Contains metadata required to reproduce dataset splits used for benchmarking.
- Format: CSV

| Column | Description |
|---|---|
| scene | Scene identifier |
| img | Image/label name (without extension) |
| split | Dataset split (train, val, or test) |
| split_notest | Split used when training without a test set (train or val) |
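The benchmark splits could be reproduced by grouping image names on one of the split columns. The sketch below assumes that meta.csv is comma-separated with a header row using the column names from the table above; the rows in `sample` are made up for illustration, and `images_by_split` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict


def images_by_split(csv_file, column="split"):
    """Group image names by the value of a split column.

    `csv_file` is any open text file or file-like object; pass
    column="split_notest" for the train/val-only variant.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[column]].append(row["img"])
    return dict(groups)


# Made-up rows that only mimic the documented columns.
sample = io.StringIO(
    "scene,img,split,split_notest\n"
    "0,S0_0,train,train\n"
    "0,S0_1,train,train\n"
    "1,S1_0,test,val\n"
)
splits = images_by_split(sample)
```

With the real file, the same function would be called as `images_by_split(open("meta.csv", newline=""))`, ideally inside a `with` block.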
Model Weights
We trained three different object detection models on the MicroVision dataset, and the weights for all three models are available for download. The table below provides an overview of the available model weights. You can try out the models on our Hugging Face Space.
| File | Description |
|---|---|
| microvision_yolo11.pt | Weights for the YOLO11-X model |
| microvision_fasterrcnn.pth | Weights for the Detectron2 Faster R-CNN model |
| microvision_rfdetr.pth | Weights for the RF-DETR Large model (resolution: 1232px) |
Model inference example
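A minimal sketch of running the YOLO11 weights on a single image is shown here. It assumes the `ultralytics` Python package is installed and that `microvision_yolo11.pt` has been downloaded to the working directory; the helper names `summarize_detections` and `run_yolo11` and the confidence threshold of 0.5 are illustrative choices, not settings taken from the benchmark. The Faster R-CNN and RF-DETR weights need their own frameworks and are not covered.

```python
def summarize_detections(class_ids, confidences, names, min_conf=0.5):
    """Count detections per class name at or above a confidence threshold."""
    counts = {}
    for class_id, conf in zip(class_ids, confidences):
        if conf >= min_conf:
            label = names[int(class_id)]
            counts[label] = counts.get(label, 0) + 1
    return counts


def run_yolo11(image_path, weights="microvision_yolo11.pt", min_conf=0.5):
    """Run the YOLO11 model on one image and count detections per class."""
    # Imported lazily so the helper above works without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(weights)
    result = model.predict(source=str(image_path), verbose=False)[0]
    return summarize_detections(
        result.boxes.cls.tolist(),   # predicted class ids
        result.boxes.conf.tolist(),  # confidence scores
        result.names,                # id-to-name mapping stored in the weights
        min_conf,
    )
```

Calling `run_yolo11("S0_0.jpg")` would return a dictionary of class names and counts, using whatever class names are stored in the weights file.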
Benchmark model performance on the test set
The evaluation metric is mAP@[0.5:0.95] (Lin et al., 2014). S, M, and L denote small, medium, and large objects, respectively. Bold values indicate the best performance per class and object size. More details can be found in the paper linked at the top.
| Class | Model | S | M | L | All |
|---|---|---|---|---|---|
| Pedestrian | YOLO11 | **0.442** | 0.645 | 0.769 | 0.597 |
| Pedestrian | Faster R-CNN | 0.279 | 0.559 | 0.726 | 0.499 |
| Pedestrian | RF-DETR | 0.438 | **0.673** | **0.869** | **0.629** |
| Bicycle | YOLO11 | 0.110 | 0.373 | 0.724 | 0.518 |
| Bicycle | Faster R-CNN | 0.131 | 0.280 | 0.633 | 0.431 |
| Bicycle | RF-DETR | **0.233** | **0.411** | **0.818** | **0.597** |
| Cyclist | YOLO11 | **0.353** | 0.700 | 0.883 | 0.766 |
| Cyclist | Faster R-CNN | 0.093 | 0.570 | 0.818 | 0.669 |
| Cyclist | RF-DETR | 0.332 | **0.725** | **0.932** | **0.813** |
| E-scooter | YOLO11 | **0.232** | **0.609** | 0.837 | 0.692 |
| E-scooter | Faster R-CNN | 0.052 | 0.408 | 0.694 | 0.510 |
| E-scooter | RF-DETR | 0.226 | 0.583 | **0.879** | **0.702** |
| E-scooterist | YOLO11 | **0.574** | 0.787 | 0.927 | 0.865 |
| E-scooterist | Faster R-CNN | 0.280 | 0.671 | 0.880 | 0.789 |
| E-scooterist | RF-DETR | 0.499 | **0.816** | **0.950** | **0.889** |
| All classes | YOLO11 | 0.342 | 0.623 | 0.828 | 0.687 |
| All classes | Faster R-CNN | 0.167 | 0.497 | 0.750 | 0.580 |
| All classes | RF-DETR | **0.346** | **0.641** | **0.890** | **0.726** |
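For readers less familiar with the mAP@[0.5:0.95] metric used in the table above: it averages average precision over ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the IoU part, that is, how many of those thresholds a single predicted box would satisfy against a ground-truth box. It is not the full metric and not the evaluation code used for the benchmark; the function names are hypothetical.

```python
# The ten IoU thresholds 0.50, 0.55, ..., 0.95 of the COCO-style metric.
IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def thresholds_passed(pred_box, true_box):
    """Count the thresholds at which the prediction counts as a match."""
    overlap = iou(pred_box, true_box)
    return sum(overlap >= t for t in IOU_THRESHOLDS)
```

A prediction that overlaps the ground truth with an IoU of 0.8 counts as a match at seven of the ten thresholds, which is why tightly localized boxes are rewarded more than loosely placed ones.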
BibTeX citation
@misc{microvision2026,
  title={MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles},
  author={Alexander Rasch and Rahul Rajendra Pai},
  year={2026},
  eprint={2603.18192},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.18192},
}