nanoVLM: Minimal PyTorch VLM

Build a Vision-Language Model in under 750 lines of pure PyTorch, train it for six hours on a single H100, and run it on free Google Colab.

Overview

I’ll be presenting nanoVLM, a minimal, open-source PyTorch library for training Vision-Language Models (VLMs) from scratch in roughly 750 lines of code. Inspired by nanoGPT, nanoVLM is simple, readable, and efficient: it reaches competitive performance (35.3% on MMStar) after 6 hours of training on a single H100 GPU. It combines a SigLIP-ViT encoder with a LLaMA-style decoder, and is light enough to run in a free Google Colab notebook.
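To make the encoder-plus-decoder layout concrete, here is a toy sketch of the general pattern: image patches are encoded, projected into the language model's embedding space, and prepended to the text tokens before a causal decoder. This is not nanoVLM's code. The class name TinyVLM, every dimension, and the use of generic nn.TransformerEncoder blocks in place of SigLIP and a LLaMA-style decoder are assumptions made purely for illustration, and positional embeddings are left out for brevity.

```python
import torch
import torch.nn as nn


class TinyVLM(nn.Module):
    """Toy VLM (hypothetical): patch encoder -> projector -> causal decoder."""

    def __init__(self, vocab_size=1000, dim=64, patch=16, n_heads=4, n_layers=2):
        super().__init__()
        # Stand-in for a ViT: a strided conv turns each image patch into one embedding.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.vision_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Projector maps image features into the decoder's embedding space.
        self.projector = nn.Linear(dim, dim)
        self.token_embed = nn.Embedding(vocab_size, dim)
        # Stand-in for a decoder-only LM: self-attention blocks used with a causal mask.
        dec_layer = nn.TransformerEncoderLayer(dim, n_heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, images, input_ids):
        # (B, 3, H, W) -> (B, num_patches, dim)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)
        image_tokens = self.projector(self.vision_encoder(patches))
        text_tokens = self.token_embed(input_ids)
        # Image tokens come first, so every text position can attend to the image.
        seq = torch.cat([image_tokens, text_tokens], dim=1)
        seq_len = seq.size(1)
        # Causal mask: -inf above the diagonal blocks attention to later positions.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.decoder(seq, mask=causal)
        # Return next-token logits for the text positions only.
        return self.lm_head(hidden[:, image_tokens.size(1):])


model = TinyVLM()
images = torch.randn(2, 3, 64, 64)           # two random 64x64 RGB "images"
input_ids = torch.randint(0, 1000, (2, 5))   # two prompts of five token ids each
logits = model(images, input_ids)            # shape: (2, 5, 1000)
```

A real model would load pretrained backbones rather than random weights; the sketch only shows how the two modalities end up in one token sequence.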

Links

https://github.com/huggingface/nanoVLM
Lightweight PyTorch repository for finetuning small VLMs using SigLIP/SmolLM2 backbones.

Tech stack