A behavioral interpretability study using probing classifiers to ask whether visually grounded training (CLIP) produces text embeddings that encode spatial relations better than purely distributional training (Sentence-BERT) — and which spatial relation types benefit most.
Methods & Tools
Language models are trained on text alone. Bender & Koller (2020) argue this means they learn form without meaning — statistical patterns over symbols without any grounding in the world those symbols describe. Spatial language is where this gap is most visible: words like above, left of, and behind are learned from co-occurrence patterns in sentences, never from the experience of actually being in space.
As a result, LLMs fail at spatial reasoning tasks that are trivial for humans. Benchmark datasets like VSR (Visual Spatial Reasoning) and StepGame make this failure systematic and measurable — they describe scenes and ask models to reason about where things are.
CLIP was trained on 400 million image-caption pairs using a contrastive objective — meaning its text encoder has been shaped by visual experience, even if only 2D photographic experience. Does this leave a residue in its text embedding space? Are spatial relations more linearly decodable from CLIP embeddings than from Sentence-BERT — and if so, which ones?
The methodology is probing classifiers: lightweight logistic regression models trained on frozen embeddings to predict spatial relation labels. If a probe trained on model A's embeddings achieves higher F1 than one trained on model B's, that suggests model A's representations make more information about that spatial relation linearly accessible; the probe itself can't create structure that isn't already there.
Probes are kept intentionally simple — no kernel tricks, weak regularization — so that positive results are attributable to the representation, not the probe. Results are reported as macro F1 per relation type, not just aggregate accuracy, because some relation types are much rarer than others in the dataset.
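A minimal sketch of one such probe, using scikit-learn, with random arrays standing in for the frozen embeddings and True/False labels (in the study these would come from Sentence-BERT or CLIP run over VSR captions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: 1000 examples of 768-dim frozen embeddings,
# with a binary label (1 = caption matches image, 0 = it doesn't).
X = rng.normal(size=(1000, 768))
y = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A deliberately simple probe: linear, no kernel, default L2 penalty.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Macro F1 weights both classes equally, which matters when labels are skewed.
macro_f1 = f1_score(y_te, probe.predict(X_te), average="macro")
print(f"macro F1: {macro_f1:.3f}")
```

In the full analysis this loop would run once per relation type, producing the per-relation F1 breakdown described above.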
Models compared:

- Sentence-BERT (all-mpnet-base-v2), distributional: trained on NLI and semantic similarity from text, with no visual signal. Learns spatial language from word co-occurrence alone.
- CLIP, text only (clip-vit-base-patch32), visually grounded: the text encoder from CLIP, used without images. Shaped by contrastive image-text training; does that grounding persist in text-only inference?
- CLIP, text + image (clip-vit-base-patch32), multimodal: fuses CLIP's text and image encodings directly, using the actual image from each VSR example. Serves as a ceiling for what visual grounding can provide.
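The fusion step for the text + image condition might look like the sketch below. Concatenating L2-normalized features is one simple choice, not necessarily the final design; random arrays stand in for real CLIP features, which would come from `CLIPModel.get_text_features` and `CLIPModel.get_image_features` in HuggingFace transformers (both 512-dim for clip-vit-base-patch32):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, as CLIP does before its contrastive loss."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def fuse(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """One simple fusion: L2-normalize each modality, then concatenate."""
    return np.concatenate(
        [l2_normalize(text_emb), l2_normalize(image_emb)], axis=-1
    )

# Placeholder features standing in for real CLIP text/image outputs.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 512))
image_emb = rng.normal(size=(4, 512))

fused = fuse(text_emb, image_emb)
print(fused.shape)  # (4, 1024)
```

Normalizing each modality before concatenation keeps either encoder from dominating the probe purely through feature magnitude.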
Primary Dataset
VSR — Visual Spatial Reasoning (Liu et al., 2022)
~10,000 image-caption pairs labeled with 66 spatial relation types including on, under, above, below, left of, right of, in front of, behind, inside, near, and more. Each example has a True/False label: does the caption correctly describe the spatial relationship in the image? The fine-grained relation type breakdown is the key feature; it lets the analysis ask not just "does visual grounding help?" but "for which concepts?"
A secondary analysis uses Representational Similarity Analysis (RSA) — a technique borrowed from computational neuroscience — to compare the geometry of the embedding spaces across models, independent of any labels. High correlation between two models' RDMs (representational dissimilarity matrices) means they structure the embedding space similarly; low correlation suggests fundamentally different representational organization.
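Concretely, the RSA comparison can be sketched as below, with random matrices standing in for two models' embeddings of the same sentences (the linear transform of `emb_a` plays the role of a similarly organized space; `emb_c` is unrelated):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(embeddings: np.ndarray) -> np.ndarray:
    """RDM as pairwise cosine distances, in condensed upper-triangle form."""
    return pdist(embeddings, metric="cosine")

def rsa(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Spearman correlation between two models' RDMs over the same items.

    Rank correlation is standard in RSA because the two spaces have
    different dimensionalities and incomparable distance scales.
    """
    rho, _ = spearmanr(rdm(emb_a), rdm(emb_b))
    return rho

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(50, 768))           # e.g. Sentence-BERT embeddings
emb_b = emb_a @ rng.normal(size=(768, 512))  # a linear view of the same geometry
emb_c = rng.normal(size=(50, 512))           # an unrelated embedding space

print(f"related:   {rsa(emb_a, emb_b):.2f}")
print(f"unrelated: {rsa(emb_a, emb_c):.2f}")
```

Because the comparison operates on dissimilarity patterns rather than raw coordinates, it needs no labels and no alignment between the two embedding spaces.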
Prior work (Tong et al., 2024; Lewis et al., 2022; Thrush et al., 2022) suggests CLIP struggles specifically with spatial compositionality — distinguishing A is left of B from B is left of A. But these studies test task performance, not representational structure. The probing approach asks a more targeted question: even if the model can't answer a spatial question correctly, is the spatial information encoded in its embeddings at all?
The most interesting expected result is the vertical vs. horizontal asymmetry: above/below should benefit more from visual grounding than left/right, because gravity makes vertical relations consistent across photographs while horizontal relations flip with camera angle. Left and right are viewpoint-dependent — 2D visual training may not be sufficient to ground them.
If 2D visual grounding is insufficient for viewpoint-dependent relations like left/right, the natural follow-on is 3D grounding. Study 2 will extend this methodology to representations trained on 3D scene data via LangSplat — testing whether richer spatial grounding closes the gaps that Study 1 identifies.
Results and write-up will be linked here when complete.