Why You Can't Project V-JEPA Into CLIP (And What VL-JEPA Does Instead)
I spent three weeks trying to align V-JEPA2's video embeddings with CLIP's text embeddings. The loss went down, the alignment metrics improved, and the downstream performance collapsed. This post is the technical explanation of why that happened and what the actual solution looks like.
The Goal
Sophia reasons in geometric embedding space. For that to work with both vision and language, I need a unified space where video observations and linguistic concepts can coexist - where you can query by visual similarity and get semantically related text, or query by text and find visually similar observations. The dream is fluid cross-modal navigation without explicit translation steps.
The Naive Approach
V-JEPA2 produces excellent video embeddings. CLIP produces excellent text embeddings (and image embeddings aligned with them). Both output high-dimensional vectors. The obvious approach: learn a projection from V-JEPA2's space into CLIP's space. Freeze V-JEPA2, add a projection head, train with contrastive loss against CLIP text embeddings. Standard transfer learning pattern.
The setup:
- V-JEPA2 encoder (frozen): video → 1024-dim embedding
- Projection head (trainable): 1024-dim → 768-dim
- CLIP text encoder (frozen): text → 768-dim embedding
- Loss: InfoNCE contrastive loss between projected video and CLIP text
- Training: paired video-caption data, float16 mixed precision
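In code, the setup was roughly the following. This is a minimal sketch rather than my actual training script: the dummy tensors stand in for batches coming out of the frozen V-JEPA2 and CLIP text encoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Trainable map from V-JEPA2's 1024-dim space into CLIP's 768-dim text space."""
    def __init__(self, in_dim=1024, out_dim=768):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.proj(x)

def info_nce(video_emb, text_emb, temperature=0.07):
    """One-directional InfoNCE: each projected video should match its own caption."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)

head = ProjectionHead()
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

# Dummy batch standing in for frozen-encoder outputs over paired video-caption data:
# V-JEPA2 video embeddings (B, 1024) and CLIP text embeddings (B, 768), both detached.
video_emb = torch.randn(8, 1024)
text_emb = torch.randn(8, 768)

loss = info_nce(head(video_emb), text_emb)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```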
Loss curves looked great. Alignment metrics improved steadily. I was optimistic.
What Actually Happened
The projected embeddings aligned beautifully with CLIP's text space. Cosine similarity between video embeddings and their corresponding captions went up. But when I tried to use these embeddings for downstream tasks - distinguishing "reaching for object" from "withdrawing from object," tracking motion patterns, any of the temporal reasoning that V-JEPA2 excels at - performance was worse than random.
The projection had successfully mapped V-JEPA2 into CLIP's space. It had also destroyed everything that made V-JEPA2 useful.
Why This Was Inevitable
V-JEPA2 and CLIP optimize for fundamentally different objectives, and those objectives determine what information gets encoded in the embedding space.
V-JEPA2's objective: temporal prediction. Given a sequence of frames with masked regions, predict what the masked regions will look like. This forces the model to learn motion patterns, dynamics, temporal structure. The embedding space is organized around what happens next. Two videos that look different but have similar motion dynamics will be nearby. Two videos that look similar but have different dynamics will be far apart.
CLIP's objective: image-text matching. Given an image and a caption, determine if they match. This forces the model to learn semantic correspondence between static visuals and linguistic descriptions. The embedding space is organized around what things mean. Two images with similar semantic content will be nearby regardless of visual appearance. Two images that look similar but mean different things will be far apart.
These aren't just different objectives - they encode orthogonal aspects of perception. V-JEPA2 cares about temporal dynamics that CLIP ignores entirely (CLIP works on static images). CLIP cares about semantic categories that V-JEPA2 has no reason to learn (V-JEPA2 never sees text during training).
The Math
V-JEPA2's training objective minimizes prediction error in latent space:

$$\mathcal{L}_{\text{pred}} = \left\lVert \hat{s}_y - s_y \right\rVert_1$$

where $s_y$ is the encoding of the masked video regions and $\hat{s}_y$ is the prediction from the unmasked context. The encoder learns to produce embeddings where temporal relationships are predictable - meaning temporal dynamics get encoded because they're needed for prediction.
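In code, this is just a regression in latent space. A tiny sketch with made-up shapes (the V-JEPA papers use an L1 regression against the target encoder's output, which is what I've written here):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: B videos, N masked patch tokens, D-dim latents.
s_y = torch.randn(8, 16, 1024)      # target-encoder output for the masked regions
s_y_hat = torch.randn(8, 16, 1024)  # predictor output from the unmasked context

# Regression in latent space; the targets are detached so gradients only
# flow through the predictor/context path.
loss = F.l1_loss(s_y_hat, s_y.detach())
```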
CLIP's training objective is a symmetric contrastive loss:

$$\mathcal{L}_{\text{CLIP}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(v_i, t_j)/\tau)} + \log\frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(v_j, t_i)/\tau)}\right]$$

where $v_i$ and $t_i$ are the image and text embeddings for pair $i$, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, and $\tau$ is a temperature parameter. The encoder learns to produce embeddings where matching image-text pairs have high cosine similarity - meaning semantic correspondence gets encoded because it's needed for matching.
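The same loss as code, a minimal sketch (real CLIP also learns the temperature as a parameter rather than fixing it):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: image-to-text and text-to-image, averaged."""
    i = F.normalize(image_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = i @ t.T / temperature                  # (N, N) cosine similarities
    targets = torch.arange(i.size(0), device=i.device)
    loss_i2t = F.cross_entropy(logits, targets)     # each image matches its caption
    loss_t2i = F.cross_entropy(logits.T, targets)   # each caption matches its image
    return (loss_i2t + loss_t2i) / 2
```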
My projection approach added a linear transformation $W \in \mathbb{R}^{768 \times 1024}$ on top of the frozen V-JEPA2 embedding $z_v$ and trained with InfoNCE:

$$\mathcal{L}_{\text{proj}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\mathrm{sim}(W z_{v_i}, t_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(W z_{v_i}, t_j)/\tau)}$$

The problem is visible in the math. The projection $W$ is a linear map from 1024 to 768 dimensions - it cannot be injective. The null space of $W$ contains all the information that gets discarded. The training objective only cares about $\mathrm{sim}(W z_{v_i}, t_j)$ - any component of $z_{v_i}$ that doesn't contribute to this dot product can be mapped to zero without increasing the loss.

V-JEPA2 embeddings encode temporal dynamics in directions that are orthogonal to semantic content (because V-JEPA2 was never trained on semantic tasks). CLIP embeddings are organized entirely around semantic content. The projection learns to discard the temporal directions (they don't help with alignment) and keep whatever accidental semantic signal exists. The null space of $W$ ends up containing exactly the information I needed.
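You can see the mechanism with a toy example: any component of an embedding that happens to lie in the projection's null space vanishes after projection, no matter how much information it carried. A minimal sketch with a random (untrained) matrix standing in for the learned $W$:

```python
import torch

torch.manual_seed(0)
W = torch.randn(768, 1024)        # stand-in for the learned projection

# A 1024 -> 768 linear map has a null space of dimension at least 256.
# The last right-singular vectors from the SVD span it.
_, _, Vh = torch.linalg.svd(W)
null_direction = Vh[-1]           # a direction W maps (numerically) to zero

z = torch.randn(1024)                  # pretend this is a V-JEPA2 embedding
z_temporal = z + 5.0 * null_direction  # same embedding plus a large "temporal"
                                       # component that lives in the null space

# After projection the two embeddings are indistinguishable:
print((W @ z - W @ z_temporal).norm())   # ~0
```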
The Projection Problem
When you train a projection from V-JEPA2 to CLIP, you're asking the model to map temporal dynamics onto semantic categories. The contrastive loss rewards projections where videos land near their caption embeddings. But captions describe semantic content, not temporal dynamics. "A person reaching for a cup" and "a person withdrawing from a cup" might have nearly identical CLIP text embeddings - the semantic content is similar even though the temporal dynamics are opposite.
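This is easy to sanity-check. A sketch using the Hugging Face transformers CLIP text encoder; the checkpoint name is just one choice and the exact number will vary, but it shows how to measure how close those two captions land:

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

model_name = "openai/clip-vit-large-patch14"   # any CLIP checkpoint works
model = CLIPModel.from_pretrained(model_name)
tokenizer = CLIPTokenizer.from_pretrained(model_name)

captions = [
    "a person reaching for a cup",
    "a person withdrawing from a cup",
]
inputs = tokenizer(captions, padding=True, return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**inputs)   # (2, 768) for this checkpoint

text_emb = F.normalize(text_emb, dim=-1)
print((text_emb[0] @ text_emb[1]).item())  # cosine similarity of the two captions
```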
The projection learns to ignore the temporal information (because it doesn't help with the alignment objective) and amplify whatever weak semantic signal exists in V-JEPA2's embeddings (because that's what CLIP's space is organized around). The result: embeddings that align well with CLIP but have lost the temporal structure that made V-JEPA2 valuable.
No projection layer can recover what wasn't preserved. You can't un-compress information that was discarded during the mapping.
What VL-JEPA Does Instead
VL-JEPA (Vision-Language JEPA) solves this by never separating vision and language in the first place. Instead of training a video encoder and a text encoder separately, then trying to align them post-hoc, VL-JEPA trains a single encoder that processes both modalities from the start.
The architecture uses a shared latent space where both video patches and text tokens get projected. The training objective combines multiple losses:

$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \mathcal{L}_{\text{cross}} + \mathcal{L}_{\text{align}}$$

where:
- $\mathcal{L}_{\text{pred}}$ is the masked video prediction loss (preserves dynamics)
- $\mathcal{L}_{\text{cross}}$ predicts text from video context and video from text context
- $\mathcal{L}_{\text{align}}$ is a contrastive loss between video and text embeddings
The key difference from my projection approach: all three losses operate on the same encoder during training. The encoder cannot discard temporal dynamics to satisfy alignment, because temporal dynamics are still needed to minimize $\mathcal{L}_{\text{pred}}$. The gradients from all objectives flow through the same parameters, forcing the model to find representations that satisfy all constraints simultaneously.
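To make the contrast with the projection approach concrete, here's a schematic training step in which all three losses backpropagate through one encoder. Everything in it is a placeholder (a tiny MLP encoder, random tensors standing in for video patches and text tokens, a single-direction cross-modal term); it's a sketch of the gradient flow, not the VL-JEPA architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
predictor = nn.Linear(dim, dim)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def contrastive(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)

# Dummy batch: "context" and "target" video features plus paired text features.
video_ctx_raw = torch.randn(8, dim)
video_tgt_raw = torch.randn(8, dim)
text_raw = torch.randn(8, dim)

video_ctx = encoder(video_ctx_raw)
video_tgt = encoder(video_tgt_raw)
text_emb = encoder(text_raw)          # same encoder handles both modalities

l_pred = F.l1_loss(predictor(video_ctx), video_tgt.detach())  # masked prediction (dynamics)
l_cross = F.l1_loss(predictor(text_emb), video_tgt.detach())  # cross-modal prediction
l_align = contrastive(video_ctx, text_emb)                    # video-text alignment

loss = l_pred + l_cross + l_align
loss.backward()   # gradients from all three objectives reach the same encoder weights
optimizer.step()
optimizer.zero_grad()
```

Because the alignment term shares parameters with the prediction term, the encoder can't zero out temporal structure to win at alignment without paying for it on the prediction loss, which is exactly the failure mode the frozen-encoder projection allowed.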
Because both modalities are present during training, the model learns representations that preserve temporal dynamics and semantic content and the correspondence between them. The space is organized around both what happens next and what things mean, because both objectives are active simultaneously.
This is why post-hoc alignment fails and joint training succeeds. Post-hoc alignment tries to retrofit correspondence onto spaces that were organized around different principles. Joint training builds the correspondence into the space from the beginning.
The General Lesson
This pattern generalizes beyond V-JEPA and CLIP. Whenever you have two embedding spaces trained with different objectives, post-hoc projection will sacrifice information from one space to satisfy alignment with the other. The only way to get a unified space that preserves what both original spaces encoded is to train them together.
This is why I'm building my own VL-JEPA implementations from the paper rather than trying to hack together existing models. The architecture matters. You can't shortcut around the fundamental constraint that embeddings encode what their training objective required and nothing else.
Practical Implications for Sophia
For Sophia's unified embedding space to work, the components that produce embeddings need to be trained together or at least with compatible objectives. I can't take an off-the-shelf video encoder and an off-the-shelf language model and expect projection to give me a coherent unified space. The models need to have learned, from the beginning, that both modalities live together.
This is one of the reasons VL-JEPA access matters. Not because I can't build something myself - I can and I am - but because Meta has the compute budget to train these models at scale that I can't match. My home-brewed implementations are good enough to validate the architecture, but production-quality unified embeddings require training resources I don't have.
In the meantime, I keep building infrastructure and running smaller-scale experiments. The three weeks I spent on the failed projection approach weren't entirely wasted - they gave me a much deeper understanding of what these embedding spaces actually encode and why the unified space has to be built right from the start.
Lessons learned, post 2. Previous: Three Weeks on the Wrong Approach