Three Weeks on the Wrong Approach

January 16, 2026 · Christopher Daly · 4 min read
Embeddings · V-JEPA · CLIP · Failure

I tried to fine-tune V-JEPA2 to align with CLIP's text embeddings. It didn't work. Three weeks is a lot of time to spend learning what not to do, but I suppose that's one way to make progress: ruling out approaches until you find one that works. At least now I know this particular path is a dead end, which is worth something even if it doesn't feel like it.

What I Was Trying to Do

Sophia reasons in embedding space. Visual observations become coordinates. Linguistic concepts become coordinates. The dream is a unified space where video and language live together, where you can move fluidly between modalities without explicit translation.

Imagine seeing a video of people kicking a ball. You query by video similarity and find nodes with similar motion patterns. You pull their text embeddings. You query by text. You discover "soccer," "game," "sport." Concepts that would be invisible in video space alone become accessible through cross-modal navigation.

Beautiful vision. Obvious implementation: take V-JEPA2's video embeddings, fine-tune them to align with CLIP's text embeddings. Standard contrastive learning. Should work. It didn't.
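For concreteness, the navigation I was picturing looks roughly like the sketch below. The node store, concept vocabulary, helper names, and the 512-dimensional shared space are all hypothetical stand-ins, not Sophia's actual API; the sketch only illustrates the video-to-text hop, assuming both embeddings already live in one space.

```python
# Hypothetical sketch of the cross-modal lookup described above. The node store,
# concept vocabulary, and 512-d shared space are placeholders, not Sophia's real code.
import numpy as np

rng = np.random.default_rng(0)

# Per-node video embeddings and a small concept vocabulary with text embeddings,
# both assumed to live in the same 512-d space (the whole point of the exercise).
node_video_embs = rng.standard_normal((100, 512))
node_text_embs = rng.standard_normal((100, 512))
concept_words = ["soccer", "game", "sport", "kick", "field"]
concept_text_embs = rng.standard_normal((5, 512))

def cosine_sim(query: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a bank of vectors."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=-1, keepdims=True)
    return b @ q

def nodes_like_this_clip(clip_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Step 1: find nodes whose motion patterns resemble the query video."""
    return np.argsort(-cosine_sim(clip_emb, node_video_embs))[:k]

def concepts_near_nodes(node_ids: np.ndarray, k: int = 3) -> list[str]:
    """Step 2: average those nodes' text embeddings and hop to nearby concepts."""
    centroid = node_text_embs[node_ids].mean(axis=0)
    top = np.argsort(-cosine_sim(centroid, concept_text_embs))[:k]
    return [concept_words[i] for i in top]

# Video in, words out: "soccer", "game", "sport" become reachable from raw motion.
query_clip = rng.standard_normal(512)
print(concepts_near_nodes(nodes_like_this_clip(query_clip)))
```

The entire plan hinged on the assumption baked into that sketch: a single space in which both lookups are meaningful.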

Why It Seemed Reasonable

V-JEPA2 is excellent at video understanding. CLIP is excellent at text-image alignment. Both produce embeddings in high-dimensional vector spaces. Both are well-documented with accessible weights. The projection approach has worked for other modality alignment problems. I had notebooks. I had a training loop. I had loss curves going down. Everything looked fine. The downstream performance metrics told a different story.

Why It Didn't Work

V-JEPA2 produces 1024-dimensional embeddings. CLIP produces 768-dimensional embeddings. Already a mismatch, but projection layers exist precisely to handle dimension mismatches, so this shouldn't be fatal. The deeper issue is that the two spaces encode fundamentally different things. V-JEPA2 optimizes for temporal prediction: "Given this frame sequence, predict the next." It learns motion patterns, dynamics, temporal structure. The embedding space is organized around what happens next, not around what things mean semantically. CLIP optimizes for image-text matching: "Does this image match this caption?" It learns semantic correspondence between static visuals and descriptions. The embedding space is organized around meaning, around which images go with which words.

When you force V-JEPA2 embeddings into CLIP's space, you're asking the model to make its temporal dynamics predictions live in a space optimized for static image-caption pairs. The model cooperates. It learns to satisfy the contrastive loss. But the mapping destroys the temporal structure that made V-JEPA2 useful in the first place.

The Technical Details

Setup: freeze V-JEPA2 encoder, add projection head (1024 → 768), train with InfoNCE loss against CLIP text embeddings, float16 mixed precision, careful gradient management. Results: loss decreases (good sign), alignment metrics improve (good sign), downstream performance collapses (bad sign). The projected embeddings aligned with CLIP. But they lost the temporal structure that distinguished "reaching for object" from "withdrawing from object." The motion patterns got compressed away because they weren't relevant to the alignment objective.
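Here is a stripped-down reconstruction of that setup, assuming PyTorch. The projection head architecture, learning rate, and temperature are placeholder choices for illustration, not the exact values from my notebooks, and the frozen encoders are assumed to have already produced the batch of embeddings.

```python
# Rough sketch of the failed setup: frozen encoders, trainable projection head,
# symmetric InfoNCE against CLIP text embeddings, float16 mixed precision.
# Hyperparameters and helper names are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Map frozen 1024-d V-JEPA2 embeddings into CLIP's 768-d text space."""
    def __init__(self, in_dim: int = 1024, out_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.GELU(),
            nn.Linear(in_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)

def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Matched (video, caption) pairs are positives; the rest of the batch are negatives."""
    logits = video_emb @ text_emb.T / temperature
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

head = ProjectionHead()
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # float16 mixed precision

def train_step(vjepa_emb: torch.Tensor, clip_text_emb: torch.Tensor) -> float:
    """One step: both encoders stay frozen; only the projection head gets gradients."""
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = info_nce(head(vjepa_emb), F.normalize(clip_text_emb, dim=-1))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

The loss here only rewards agreement with CLIP's text embeddings, which is exactly why the temporal structure was free to be compressed away: nothing in the objective asks the projection to preserve it.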

What I Should Have Known

Post-hoc alignment is essentially translation between languages that encode different concepts. German has words English lacks. You can build translation systems, but you lose nuance. V-JEPA2 and CLIP aren't languages with overlapping concepts expressed differently. They're encoding entirely different aspects of perception. V-JEPA2 cares about temporal dynamics. CLIP cares about semantic content. No projection layer recovers what wasn't encoded to begin with. This is probably obvious in retrospect. It wasn't obvious to me when I started.

The Actual Solution

Models designed for unified spaces from the start. VL-JEPA (Vision-Language JEPA). VLA models (Vision-Language-Action). RT-2, OpenVLA, Octo. These architectures encode video and language in the same space natively. They were trained from the beginning to put both modalities in a shared representation. No post-hoc projection needed because there's nothing to project. I've submitted an access request for VL-JEPA from Meta, but I'm not just waiting: I've been building my own implementations based on the paper. The right architecture exists. I just have to stop trying to force incompatible systems together and build toward the right approach instead.

The Meta-Lesson

The obvious approach (align existing models) felt productive. I was writing code, running experiments, seeing loss decrease. Progress! Except the approach was fundamentally wrong. The correct move would have been to recognize the incompatibility, survey alternatives, and wait for the right tool. Instead I spent three weeks doing busy-work that felt like progress. "Busy" and "productive" aren't synonyms. Sometimes the productive path is doing nothing until the right infrastructure exists. Three weeks of wasted effort taught me that lesson, at least. And even the failure gave me insight into how these embedding spaces actually work, which I'll need when I do get access to the right models.


Lessons learned, post 1.