TextTeacher: What Can Language Teach About Images?

Introduction
What if a language model could teach a vision model — and then step aside?
TextTeacher adds a frozen text encoder and a lightweight alignment loss to standard image classification training. Image captions provide per-instance semantic targets that shape the visual representation during training. At inference, everything text-related is discarded. The deployed model is a plain, fast vision network — no extra parameters, no added latency.
The result: up to +2.7 p.p. on ImageNet, a +1.0 p.p. average transfer gain across six fine-grained benchmarks, and +8.4 p.p. under 50% label noise, all with negligible compute overhead and without any multimodal pretraining of the target model. TextTeacher distills knowledge 33% more efficiently than traditional vision distillation.
How it works

A lightweight text head projects the backbone’s CLS token into the embedding space of a frozen text encoder. An auxiliary contrastive loss aligns each image’s projection with the text embedding of its caption, pulling image features toward a pre-organised semantic manifold. The text head and encoder are dropped after training.
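A minimal sketch (PyTorch) of what such a training step might look like, assuming a ViT-style backbone that exposes a CLS token, a symmetric InfoNCE form for the alignment loss, and precomputed caption embeddings from the frozen text encoder. Names such as `TextHead`, `alignment_loss`, the temperature, and the loss weight `lam` are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextHead(nn.Module):
    """Lightweight projection from the backbone's CLS token into the
    frozen text encoder's embedding space; discarded after training."""
    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(cls_token), dim=-1)

def alignment_loss(img_emb: torch.Tensor,
                   txt_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: each image projection is pulled toward the
    embedding of its own caption and pushed away from the other
    captions in the batch."""
    logits = img_emb @ txt_emb.t() / temperature            # (B, B)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def training_step(backbone, classifier, text_head,
                  images, labels, caption_emb, lam: float = 0.5):
    """One step: standard cross-entropy plus the auxiliary alignment term.
    `caption_emb` holds precomputed embeddings from the frozen text encoder."""
    cls_token = backbone(images)                             # (B, vision_dim)
    loss_cls = F.cross_entropy(classifier(cls_token), labels)
    loss_align = alignment_loss(text_head(cls_token),
                                F.normalize(caption_emb, dim=-1))
    return loss_cls + lam * loss_align
```

At inference, only `backbone` and `classifier` are kept; `text_head` and the frozen text encoder never run, so there are no extra parameters or latency in the deployed model.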
Results
TextTeacher on ImageNet

Comparison of Knowledge Distillation Methods

TextTeacher runs at nearly the same per-epoch cost as baseline classification (32 min/epoch vs. 31 min/epoch), while online knowledge distillation from DINOv2-L costs ~48 min/epoch. In a compute-matched setting, TextTeacher (79.1%) matches DINOv2-L online distillation trained for the full 100 epochs (79.1%) while using only ~66% of the wall-clock time. Even after including the full preprocessing cost (captioning and embedding the training set), TextTeacher still saves ≈6 GPU-hours compared to 300-epoch online distillation.
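The reason the per-epoch overhead stays small is that the text side can be precomputed once, offline. A hedged sketch of that preprocessing pass is below: embed every caption with a frozen text encoder and cache the result, so training only looks up a tensor per batch. The encoder checkpoint, pooling choice, and file layout are placeholder assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

@torch.no_grad()
def embed_captions(captions: list[str],
                   model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
                   batch_size: int = 256) -> torch.Tensor:
    """Embed all captions once with a frozen text encoder (mean pooling)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name).eval()
    chunks = []
    for i in range(0, len(captions), batch_size):
        batch = tokenizer(captions[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state           # (B, T, D)
        mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
        chunks.append((hidden * mask).sum(1) / mask.sum(1))    # mean pooling
    return torch.cat(chunks)                                   # (N, D)

# Cache once; each training batch then just indexes into this tensor.
# torch.save(embed_captions(train_captions), "caption_embeddings.pt")
```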
Analysis: How does TextTeacher work?


TextTeacher acts primarily as an early-phase preconditioner: most of the benefit is realised in the first 30–50 epochs, where it accelerates the formation of semantically organised features, especially in deeper layers.
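As an illustration only: one way to exploit (or probe) this early-phase behaviour is to switch the auxiliary loss off after the early epochs and check that accuracy is retained. The cutoff and schedule below are assumptions for the sake of the sketch, not a procedure described in the paper.

```python
def alignment_weight(epoch: int, lam: float = 0.5, cutoff: int = 50) -> float:
    """Full alignment weight during the early phase, zero afterwards."""
    return lam if epoch < cutoff else 0.0

# Usage inside the training loop (see `training_step` above):
# loss = loss_cls + alignment_weight(epoch) * loss_align
```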
Contributions
- We pose and investigate a fundamental question: Can the semantic knowledge of a language model efficiently improve a vision model? — operationalised via a decoupled auxiliary signal from a frozen text encoder.
- We introduce TextTeacher, a general approach that efficiently injects textual knowledge into standard vision backbones using an auxiliary alignment loss, requiring no text at inference.
- We provide extensive experiments showing improvements in accuracy and transfer, and demonstrate that language-based guidance outperforms visual guidance and knowledge distillation in a compute-matched setting.
Citation
If you use this work, please cite our paper:
@inproceedings{Nauen2026TextTeacher,
  author = {Nauen, Tobias Christian and Frolov, Stanislav and Moser, Brian B. and Raue, Federico and Anwar, Ahmed and Dengel, Andreas},
  title  = {TextTeacher: What Can Language Teach About Images?},
  month  = {May},
  year   = {2026},
}
