A behavioral interpretability study using probing classifiers to ask whether visually grounded training (CLIP) produces text embeddings that encode spatial relations better than purely distributional training (Sentence-BERT) — and which spatial relation types benefit most.
Methods & Tools
Language models are trained on text alone. Bender & Koller (2020) argue this means they learn form without meaning — statistical patterns over symbols without any grounding in the world those symbols describe. Spatial language is where this gap is most visible: words like above, left of, and behind are learned from co-occurrence patterns in sentences, never from the experience of actually being in space.
As a result, LLMs fail at spatial reasoning tasks that are trivial for humans. Benchmark datasets like VSR (Visual Spatial Reasoning) and StepGame make this failure systematic and measurable — they describe scenes and ask models to reason about where things are.
CLIP was trained on 400 million image-caption pairs using a contrastive objective — meaning its text encoder has been shaped by visual experience, even if only 2D photographic experience. Does this leave a residue in its text embedding space? Are spatial relations more linearly decodable from CLIP embeddings than from Sentence-BERT — and if so, which ones?
The methodology is probing classifiers: lightweight logistic regression models trained on frozen embeddings to predict spatial relation labels. If a probe trained on model A's embeddings achieves higher F1 than one trained on model B's, that suggests model A's representations make more information about that spatial relation linearly accessible; the probe itself can't create structure that isn't already there.
Probes are kept intentionally simple — no kernel tricks, weak regularization — so that positive results are attributable to the representation, not the probe. Results are reported as macro F1 per relation type, not just aggregate accuracy, because some relation types are much rarer than others in the dataset.
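A minimal sketch of one such probe, using scikit-learn, with random arrays standing in for the frozen embeddings and True/False labels (in the study these would come from Sentence-BERT or CLIP run over VSR captions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: 1000 examples of 768-dim frozen embeddings,
# with a binary label (1 = caption matches image, 0 = it doesn't).
X = rng.normal(size=(1000, 768))
y = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# A deliberately simple probe: linear, no kernel, default L2 penalty.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Macro F1 weights both classes equally, which matters when labels are skewed.
macro_f1 = f1_score(y_te, probe.predict(X_te), average="macro")
print(f"macro F1: {macro_f1:.3f}")
```

In the full analysis this loop would run once per relation type, producing the per-relation F1 breakdown described above.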
Models compared:

- Sentence-BERT (all-mpnet-base-v2), distributional: trained on NLI and semantic similarity from text, with no visual signal. Learns spatial language from word co-occurrence alone.
- CLIP, text only (clip-vit-base-patch32), visually grounded: the text encoder from CLIP, used without images. Shaped by contrastive image-text training; does that grounding persist in text-only inference?
- CLIP, text + image (clip-vit-base-patch32), multimodal: fuses CLIP's text and image encodings directly, using the actual image from each VSR example. Serves as a ceiling for what visual grounding can provide.
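The fusion step for the text + image condition might look like the sketch below. Concatenating L2-normalized features is one simple choice, not necessarily the final design; random arrays stand in for real CLIP features, which would come from `CLIPModel.get_text_features` and `CLIPModel.get_image_features` in HuggingFace transformers (both 512-dim for clip-vit-base-patch32):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, as CLIP does before its contrastive loss."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def fuse(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """One simple fusion: L2-normalize each modality, then concatenate."""
    return np.concatenate(
        [l2_normalize(text_emb), l2_normalize(image_emb)], axis=-1
    )

# Placeholder features standing in for real CLIP text/image outputs.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 512))
image_emb = rng.normal(size=(4, 512))

fused = fuse(text_emb, image_emb)
print(fused.shape)  # (4, 1024)
```

Normalizing each modality before concatenation keeps either encoder from dominating the probe purely through feature magnitude.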
Primary Dataset
VSR — Visual Spatial Reasoning (Liu et al., 2022)
~10,000 image-caption pairs labeled with 66 spatial relation types including on, under, above, below, left of, right of, in front of, behind, inside, near, and more. Each example has a True/False label: does the caption correctly describe the spatial relationship in the image? The fine-grained relation type breakdown is the key feature; it lets the analysis ask not just "does visual grounding help?" but "for which concepts?"
A secondary analysis uses Representational Similarity Analysis (RSA) — a technique borrowed from computational neuroscience — to compare the geometry of the embedding spaces across models, independent of any labels. High correlation between two models' RDMs (representational dissimilarity matrices) means they structure the embedding space similarly; low correlation suggests fundamentally different representational organization.
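Concretely, the RSA comparison can be sketched as below, with random matrices standing in for two models' embeddings of the same sentences (the linear transform of `emb_a` plays the role of a similarly organized space; `emb_c` is unrelated):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(embeddings: np.ndarray) -> np.ndarray:
    """RDM as pairwise cosine distances, in condensed upper-triangle form."""
    return pdist(embeddings, metric="cosine")

def rsa(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Spearman correlation between two models' RDMs over the same items.

    Rank correlation is standard in RSA because the two spaces have
    different dimensionalities and incomparable distance scales.
    """
    rho, _ = spearmanr(rdm(emb_a), rdm(emb_b))
    return rho

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(50, 768))           # e.g. Sentence-BERT embeddings
emb_b = emb_a @ rng.normal(size=(768, 512))  # a linear view of the same geometry
emb_c = rng.normal(size=(50, 512))           # an unrelated embedding space

print(f"related:   {rsa(emb_a, emb_b):.2f}")
print(f"unrelated: {rsa(emb_a, emb_c):.2f}")
```

Because the comparison operates on dissimilarity patterns rather than raw coordinates, it needs no labels and no alignment between the two embedding spaces.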
Prior work (Tong et al., 2024; Lewis et al., 2022; Thrush et al., 2022) suggests CLIP struggles specifically with spatial compositionality — distinguishing A is left of B from B is left of A. But these studies test task performance, not representational structure. The probing approach asks a more targeted question: even if the model can't answer a spatial question correctly, is the spatial information encoded in its embeddings at all?
The most interesting expected result is the vertical vs. horizontal asymmetry: above/below should benefit more from visual grounding than left/right, because gravity makes vertical relations consistent across photographs while horizontal relations flip with camera angle. Left and right are viewpoint-dependent — 2D visual training may not be sufficient to ground them.
If 2D visual grounding is insufficient for viewpoint-dependent relations like left/right, the natural follow-on is 3D grounding. Study 2 will extend this methodology to representations trained on 3D scene data via LangSplat — testing whether richer spatial grounding closes the gaps that Study 1 identifies.
Results and write-up will be linked here when complete.