
[논문리뷰] When Does LeJEPA Learn a World Model?

도도걸만단 2026. 6. 12. 01:14

들어가기 앞서 이 논문은..

JEPA/LeJEPA류 self-supervised representation이 latent world structure를 linearly recover할 수 있는 조건을 증명한 이론 논문

1. 한 줄 요약

세계의 실제 latent variable이 독립 Gaussian이고, positive pair가 stationary additive-noise transition으로 생성된다면, LeJEPA의 alignment loss + Gaussian regularization은 관측 이미지 뒤에 숨어 있는 진짜 latent variable을 rotation까지만 애매하게 남기고 선형적으로 복원한다. 하지만 이 보장은 Gaussian latent일 때만 성립한다.

When Does LeJEPA Learn a World Model?

latent world가 Gaussian 구조를 가지고 있고, positive pair transition이 적절한 stationary additive-noise 형태일 때.

2. 이 논문이 풀려는 문제

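The JEPA family has long claimed to learn good representations from images and video, but an important question remained open.

“Has this representation really learned the latent structure of the world, or is it just an embedding that happens to fit downstream tasks?”

The paper recasts this as a linear identifiability problem. Observations are modeled as $x = g(z)$, where the true latent variable $z$ passes through an unknown nonlinear function $g$, and an encoder $f$ maps $x$ to a representation. The overall mapping is then $h = f \circ g$, and ideally $h(z)$ should recover the original latent $z$ linearly. The paper formalizes this target as $h(z) = Qz$ with $Q$ an orthogonal matrix: a rotation is allowed, but a nonlinear distortion is not.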

3. LeJEPA objective를 어떻게 해석하나

논문에서 LeJEPA 학습은 크게 두 부분으로 모델링됨.

첫째, **alignment loss**는 positive pair의 embedding을 가깝게 만듦. 예를 들어 $z$와 $z'$가 같은 세계 상태의 두 view이거나 가까운 시간의 상태라면, $h(z)$와 $h(z')$가 가까워지도록 함.

둘째, **Gaussian regularization**, 구체적으로 SIGReg는 embedding distribution이 isotropic Gaussian이 되도록 만듦. 이는 collapse를 막기 위한 장치인데, 이 논문에서는 단순한 collapse prevention을 넘어서 identifiability 보장의 핵심 조건으로 작동한다고 봄.

논문에서 objective는 대략 다음처럼 해석됨.

$$
\min_h \mathbb{E}\left[\|h(z') - h(z)\|^2\right]
\quad
\text{s.t.}
\quad
h(z) \sim \mathcal{N}(0, I)
$$

즉, **embedding은 Gaussian처럼 퍼져 있어야 하고, 동시에 positive pair끼리는 최대한 가까워야 한다**는 조건임.

---

## 4. 가장 중요한 이론 결과

### Theorem 1: Gaussian world에서는 LeJEPA가 latent를 선형적으로 복원함

논문은 latent가 다음과 같다고 가정함.

$$
z \sim \mathcal{N}(0, I)
$$

또한 positive pair가 Ornstein–Uhlenbeck transition 형태로 생성된다고 가정함.

$$
z' = \rho z + \sqrt{1-\rho^2}\eta,
\quad
\eta \sim \mathcal{N}(0, I)
$$

이때 LeJEPA objective의 global optimum은 반드시 다음 형태가 된다고 증명함.

$$
h(z) = Qz
$$

여기서 $Q$는 orthogonal matrix임. 즉, LeJEPA가 nonlinear observation $x = g(z)$만 보고도 원래 latent를 회전된 형태로 복원한다는 뜻임.
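The setup of Theorem 1 can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names `sample_ou_pairs` and `alignment_loss` and the values of `dim` and `rho` are illustrative choices, not taken from the paper. It draws positive pairs from the Ornstein–Uhlenbeck transition above, checks that $z'$ is again approximately $\mathcal{N}(0, I)$, and compares the alignment loss of a linear map $h(z) = Qz$ with the closed-form value $2n(1-\rho)$ that follows from the transition.

```python
import numpy as np

def sample_ou_pairs(num_samples, dim, rho, seed=0):
    """Draw (z, z') with z ~ N(0, I) and z' = rho * z + sqrt(1 - rho^2) * eta."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((num_samples, dim))
    eta = rng.standard_normal((num_samples, dim))
    z_next = rho * z + np.sqrt(1.0 - rho**2) * eta
    return z, z_next

def alignment_loss(h, h_next):
    """Monte Carlo estimate of E[||h(z') - h(z)||^2]."""
    return float(np.mean(np.sum((h_next - h) ** 2, axis=1)))

dim, rho = 4, 0.9  # illustrative values
z, z_next = sample_ou_pairs(200_000, dim, rho)

# Stationarity: z' should again have (approximately) identity covariance.
cov_next = np.cov(z_next, rowvar=False)

# A random orthogonal Q obtained from a QR decomposition (illustrative choice).
q_mat, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((dim, dim)))
loss_linear = alignment_loss(z @ q_mat.T, z_next @ q_mat.T)

# For h(z) = Qz the loss equals E||z' - z||^2 = 2 * dim * (1 - rho).
expected = 2.0 * dim * (1.0 - rho)
print(loss_linear, expected)
```

Because $Q$ is orthogonal, the loss does not depend on which rotation is drawn. This is the rotation ambiguity that the theorem leaves open.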

발표에서 쉽게 말하면 다음과 같음.

> LeJEPA는 positive pair를 가깝게 만들고 embedding을 Gaussian으로 유지함. Gaussian world에서는 nonlinear하게 꼬인 representation보다 원래 latent를 선형적으로 복원하는 representation이 alignment loss를 가장 잘 줄임.

---

### Theorem 2: Gaussian이 유일함

더 강한 주장은 다음과 같음.

**이런 linear identifiability 보장은 Gaussian latent에서만 성립함.**

논문은 Gaussian이 아닌 latent distribution에서는 alignment + Gaussianity constraint만으로는 진짜 latent를 선형적으로 복원한다는 보장을 할 수 없다고 주장함.

실험에서도 generalized normal distribution의 shape parameter를 바꿔가며 확인했을 때, Gaussian에 해당하는 $\alpha = 2$에서 linear recovery가 가장 높게 나옴.

이 부분은 리뷰 발표에서 굉장히 중요한 포인트임. 왜냐하면 논문이 “LeJEPA는 항상 world model을 배운다”고 말하는 것이 아니라, **상당히 강한 조건 아래에서만 보장됨**을 말하기 때문임.

---

### Theorem 3: 근사적으로도 보장이 유지됨

현실에서는 Gaussian regularization이나 alignment가 완벽하게 만족되지 않음. 그래서 논문은 완벽한 optimum이 아니더라도, whitening error와 alignment gap이 작으면 recovery error도 작다는 approximate identifiability bound를 제시함.

실험에서는 여러 run에서 실제 recovery error가 theoretical bound 아래에 있는지를 확인함.

발표에서는 다음처럼 말하면 됨.

> 이론은 population-level global optimum에 대한 결과지만, 저자들은 approximate bound를 통해 실제 training loss가 identifiability의 proxy로 작동할 수 있다고 주장함.

---

### Theorem 4: Linear identifiability는 planning에 도움이 됨

World Model이라고 부르려면 단순히 representation이 예쁜 것만으로는 부족하고, 그 latent space에서 planning이 가능해야 함.

논문은 다음처럼 선형적으로 식별 가능한 representation에서는 rotation-invariant cost를 쓰는 planning 문제가 true latent space에서의 planning과 동일해진다고 보임.

$$
h(z) = Qz
$$

실험적으로도 Reacher 환경에서 Gaussian OU pair로 학습한 encoder는 oracle과 유사한 planning trajectory를 만들지만, 실제 RL trajectory로 학습한 non-Gaussian encoder는 더 휘어진 trajectory와 높은 control cost를 보임.

---

## 5. 실험 구성

실험은 이론을 검증하기 위한 controlled experiment 중심임.

첫 번째는 **2D Gaussian latent를 nonlinear mixing한 뒤 LeJEPA가 다시 원래 Gaussian latent structure를 복원하는지** 보는 실험임. Parabolic shear, sinusoidal shear, RealNVP coupling 같은 nonlinear mixing을 적용한 뒤, LeJEPA embedding이 다시 isotropic Gaussian 구조를 회복하는 모습을 보여줌.

The second is a **high-dimensional scaling experiment**. The latent dimension is increased from 2 to 1024, and SIGReg, VICReg, and InfoNCE are compared. SIGReg and VICReg maintain linear identifiability at $R^2 > 0.999$ up to 1024 dimensions, whereas InfoNCE with a fixed kernel width degrades as the dimension grows.

세 번째는 **non-Gaussian latent sweep**임. Generalized normal distribution에서 shape parameter $\alpha$를 바꾸면서 Gaussian일 때와 아닐 때를 비교함. Linear recovery는 Gaussian에 해당하는 $\alpha = 2$에서 sharp하게 peak를 보임.

네 번째는 **DMC Reacher 기반 pixel RL trajectory 실험**임. 같은 Reacher system이라도 Gaussian OU sampling을 쓰면 latent recovery가 잘 되지만, 실제 RL policy trajectory는 non-Gaussian, anisotropic transition, joint-limit wrapping 등의 이유로 linear identifiability가 깨짐.
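For the shape-parameter sweep in the third experiment, it helps to see how the generalized normal family interpolates between familiar distributions. The sketch below assumes NumPy and uses a standard Gamma-based sampler; the function names are hypothetical and this is not the paper's code. It draws unit-variance samples for a given shape $\alpha$ and evaluates the closed-form excess kurtosis, which is zero exactly at the Gaussian case $\alpha = 2$ and positive for the Laplace case $\alpha = 1$.

```python
import math
import numpy as np

def sample_generalized_normal(num_samples, alpha, seed=0):
    """Unit-variance samples with density proportional to exp(-|x / s|^alpha)."""
    rng = np.random.default_rng(seed)
    # If G ~ Gamma(1/alpha, 1), then sign * G**(1/alpha) has density
    # proportional to exp(-|x|^alpha).
    g = rng.gamma(shape=1.0 / alpha, scale=1.0, size=num_samples)
    sign = rng.choice([-1.0, 1.0], size=num_samples)
    x = sign * g ** (1.0 / alpha)
    # Before rescaling, Var[x] = Gamma(3/alpha) / Gamma(1/alpha).
    std = math.sqrt(math.gamma(3.0 / alpha) / math.gamma(1.0 / alpha))
    return x / std

def excess_kurtosis(alpha):
    """Closed-form excess kurtosis of the generalized normal family."""
    g1, g3, g5 = (math.gamma(k / alpha) for k in (1.0, 3.0, 5.0))
    return g5 * g1 / g3**2 - 3.0

sample_variances = {}
for alpha in (1.0, 2.0, 8.0):
    x = sample_generalized_normal(200_000, alpha)
    sample_variances[alpha] = float(x.var())
    print(alpha, sample_variances[alpha], excess_kurtosis(alpha))
```

All three settings share the same variance, so any difference in linear recovery across the sweep comes from the shape of the distribution and not from its scale.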

---

## 6. 발표에서 강조하면 좋은 기여점

이 논문의 기여는 크게 세 가지로 잡으면 좋음.

첫째, **JEPA/LeJEPA에 대해 처음으로 명확한 identifiability guarantee를 제시했다**는 점임. 기존 JEPA는 representation이 좋다는 empirical claim이 강했는데, 이 논문은 어떤 world assumption 아래에서 정말 latent structure를 recover하는지 수학적으로 보여줌.

둘째, **Gaussian regularization의 의미를 새롭게 해석했다**는 점임. SIGReg는 단순히 collapse를 막는 regularizer가 아니라, latent를 linearly identifiable하게 만드는 핵심 constraint로 작동함.

Third, it **connects world models to planning**. When the representation corresponds linearly to the true latent, i.e. $h(z) = Qz$ with orthogonal $Q$, planning in the learned latent space with a rotation-invariant cost becomes equivalent to planning in the true latent space.

---

## 7. 비판적으로 봐야 할 한계

리뷰 발표에서 가장 중요하게 짚어야 할 한계는 **가정이 강하다**는 점임.

첫째, **real-world latent가 정말 Gaussian인가?** 논문도 이를 limitation으로 인정함. Gaussian은 maximum entropy prior라서 가장 적은 가정을 하는 분포라고 설명하지만, 실제 세계의 semantic factor, object identity, action, contact dynamics, discrete state 등이 Gaussian이라고 보기 어려움.

Second, **the encoder output dimension is assumed to equal the true latent dimension.** The paper treats $m = n$ as the default case. In practical self-supervised learning, however, the embedding dimension can be much larger or smaller than the number of real world factors. The paper itself lists as open problems which subspace is selected when $m < n$, and what happens to redundancy or collapse when $m > n$.

셋째, **population-level global optimum 결과임.** 실제 deep network training에서 finite sample, optimization dynamics, architecture bias가 어떻게 작동하는지는 완전히 해결하지 않음. 논문은 approximate bound를 제시하지만, sample complexity나 training dynamics까지 보장하지는 않음.

넷째, **action-conditioned world model까지 완성한 것은 아님.** 논문은 state representation 쪽 identifiability를 다루고, action-conditioned transition model은 별도로 학습되어야 한다고 말함. 즉, “world model의 state side”를 다룬 논문이지, 완전한 model-based RL system 전체를 증명한 논문은 아님.

0. Abstract

세계의 true degrees of freedom을 뒤섞는 representation은 reliable planning이나 compositional generalization을 지원할 수 없음.

우리는 LeJEPA, 즉 alignment plus Gaussian regularization이 nonlinear observations로부터 세계의 latent variables를 선형적으로 복원함을 증명함.

이 성질은 linear identifiability로 알려져 있으며, latents가 stationary, additive-noise transitions 아래에서 진화하는 넓은 종류의 worlds에서 성립함.

우리의 main result는 이러한 모든 worlds 중에서, Gaussian이 이 보장이 성립하는 유일한 latent distribution이라는 것임.

Forward direction은 spectral decomposition에 기반하며, 여기서 각 degree of nonlinearity는 alignment에 의해 엄격하게 penalize되므로 linear map이 optimum이 됨.

반대로 converse는 모든 non-Gaussian alternative를 배제함.

또한 우리는 보장이 점진적으로 약화되는 approximate identifiability result를 증명하고, linear, orthogonal identifiability가 optimal latent-space planning을 가능하게 함을 보임.

우리는 2D examples부터 1024-dimensional latents까지 이어지는 experiments를 통해 이 이론을 검증하며, 여기에는 distributional ablations와 pixel-based robotic control이 포함됨.

우리의 이론은 경험적으로 성공한 recipe를 mathematical guarantee로 전환하며, world의 structure를 provably recover하는 World Models를 구축하기 위한 foundation을 제공함.

Abstract 이해를 위한 문장별 해설

1. “세계의 true degrees of freedom을 뒤섞는 representation은 reliable planning이나 compositional generalization을 지원할 수 없음.”

true degrees of freedom은 세계를 실제로 결정하는 핵심 요인들을 의미함.
- 예를 들어 로봇 팔이라면 joint angle, joint velocity, target position 등이 true degrees of freedom임.
- 이미지라면 object의 위치, 색, 조명, 크기, pose 등이 이에 해당할 수 있음.
- 즉, 겉으로 보이는 pixel 자체가 아니라, 그 pixel을 만들어내는 근본적인 원인 변수들임.
true degrees of freedom을 뒤섞는다는 것은 서로 다른 의미의 요인들이 엉켜 있는 상태를 말함.
- 예를 들어 object의 위치 정보와 색 정보가 representation 안에서 복잡하게 섞여 있으면, 위치만 바꾸거나 색만 바꾸는 것이 어려워짐.
- 로봇 예시에서는 위치와 속도, 관절 각도와 배경 정보가 뒤섞이면, 다음 상태를 예측하거나 행동을 계획하기 어려워짐.
reliable planning은 믿을 수 있는 계획 수립을 의미함.
- 로봇이 목표 지점까지 가기 위해 어떤 action을 해야 하는지 계산하는 것이 planning임.
- representation이 잘못되어 있으면, 모델은 latent space에서는 직선으로 움직인다고 생각했는데 실제 세계에서는 이상하게 휘어진 경로가 될 수 있음.
- 따라서 planning이 안정적으로 되려면 representation이 실제 세계의 구조를 잘 보존해야 함.
compositional generalization은 배운 요소들을 새롭게 조합해도 잘 일반화하는 능력을 의미함.
- 예를 들어 모델이 “빨간 공”, “파란 큐브”를 봤다면, “빨간 큐브”도 이해할 수 있어야 함.
- 이를 위해서는 색, 모양, 위치 같은 factor들이 representation 안에서 어느 정도 분리되어 있어야 함.
- factor들이 뒤섞여 있으면 새로운 조합에 약해짐.
이 문장 핵심:
- 모델의 representation이 세계의 핵심 latent factor들을 엉망으로 섞어버리면, 그 representation은 보기에는 좋아 보여도 실제 planning이나 일반화에는 쓸모가 부족함.
- World Model이라고 부르려면 단순히 feature가 예쁜 것이 아니라, 세계를 구성하는 실제 변수들을 잘 보존해야 함.

2. “우리는 LeJEPA, 즉 alignment plus Gaussian regularization이 nonlinear observations로부터 세계의 latent variables를 선형적으로 복원함을 증명함.”

LeJEPA는 JEPA 계열의 self-supervised learning 방법임.
- JEPA는 Joint-Embedding Predictive Architecture의 약자임.
- pixel을 직접 예측하는 대신, representation space에서 예측하거나 비슷하게 맞추는 방식임.
- 즉, 이미지의 모든 세부 pixel을 맞추기보다 중요한 feature를 맞추는 방향임.
alignment는 positive pair의 representation을 가깝게 만드는 것임.
- 예를 들어 같은 장면의 두 augmentation, 비디오에서 가까운 두 frame, 로봇 trajectory에서 가까운 두 state는 서로 관련된 pair임.
- 이런 pair의 embedding이 비슷해지도록 학습하는 것이 alignment임.
- 쉽게 말해 “비슷한 세계 상태는 비슷한 representation을 가져야 한다”는 조건임.
Gaussian regularization은 representation의 분포가 Gaussian, 즉 정규분포처럼 되도록 강제하는 것임.
- 단순히 모든 sample이 같은 representation으로 collapse되는 것을 막기 위한 장치임.
- 이 논문에서는 Gaussian regularization이 collapse 방지를 넘어, representation이 실제 latent variable을 복원하게 만드는 핵심 조건이라고 봄.
nonlinear observations는 실제 latent variable이 복잡한 과정을 거쳐 관측된 결과를 의미함.
- 예를 들어 실제 세계의 latent variable이 물체 위치, 색, 조명이라고 해도, 우리가 보는 것은 이들이 복잡하게 섞인 pixel image임.
- 3D scene이 camera rendering을 거쳐 2D image가 되는 과정은 매우 nonlinear함.
- 즉, observation은 latent variable을 그대로 보여주는 것이 아니라, 복잡하게 뒤섞은 결과임.
latent variables는 관측 데이터 뒤에 숨어 있는 원인 변수임.
- 예를 들어 이미지의 pixel 뒤에는 object position, pose, color, lighting 같은 원인들이 있음.
- 로봇 데이터 뒤에는 joint angle, velocity, target position 등이 있음.
- 이 논문은 모델이 이런 숨은 변수들을 representation 안에서 되살릴 수 있는지 묻고 있음.
선형적으로 복원함은 representation이 true latent variable과 linear transformation 관계에 있다는 뜻임.
- 논문에서는 대략 다음 형태를 목표로 함.

$$
h(z) = Qz
$$

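Here $z$ is the true latent variable and $h(z)$ is the representation learned by the model.
$Q$ is an orthogonal transformation such as a rotation or reflection.
In other words, the coordinates need not be identical, but they may differ only by a rotated coordinate frame.
A nonlinear distortion is not allowed.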
이 문장의 핵심은 다음과 같음.
- LeJEPA는 alignment와 Gaussian regularization을 함께 사용함.
- 이 조합이 nonlinear하게 관측된 데이터에서 원래 latent variable을 선형적으로 복원할 수 있음을 이론적으로 증명했다는 주장임.

3. “이 성질은 linear identifiability로 알려져 있으며, latents가 stationary, additive-noise transitions 아래에서 진화하는 넓은 종류의 worlds에서 성립함.”

linear identifiability는 hidden latent variable을 선형 변환까지 허용해서 복원할 수 있다는 성질임.
- “정확히 원래 좌표와 100% 같아야 한다”는 뜻은 아님.
- rotation이나 reflection처럼 단순한 변환은 허용함.
- 하지만 비선형적으로 뒤틀린 representation은 허용하지 않음.
예를 들어 true latent가 다음과 같다고 하자.

$$
z = [\text{position}, \text{velocity}]
$$

좋은 representation은 다음처럼 단순한 선형 변환 관계여야 함.

$$
h(z) = Qz
$$

나쁜 representation은 position과 velocity가 복잡한 nonlinear function으로 뒤섞인 형태임.

$$
h(z) = [\sin(\text{position}) + \text{velocity}^2, \dots]
$$

이런 경우 linear probe로 원래 latent를 깔끔하게 복원하기 어려움.
stationary transition은 시간이 지나도 분포가 유지되는 transition을 의미함.
- For example, moving from $z$ to $z'$ leaves the marginal distribution unchanged: $z$ and $z'$ follow the same distribution.
- 즉, process 자체가 시간에 따라 변하지 않음.
- 비디오나 trajectory에서 가까운 두 state를 positive pair로 볼 때 이런 가정이 사용됨.
additive-noise transition은 다음 상태가 현재 상태의 함수에 noise가 더해진 형태라는 뜻임.

$$
z' = m(z) + \eta
$$

Here $m(z)$ is the deterministic part of the change and $\eta$ is random noise.
Examples of such noise include sensor noise, Brownian motion, and augmentation jitter.
이 문장의 핵심은 다음과 같음.
- 이 논문은 아무 세계에서나 LeJEPA가 성공한다고 말하지 않음.
- latent variable이 일정한 분포를 유지하면서, noise가 더해지는 방식으로 변화하는 world class를 가정함.
- 그 안에서 linear identifiability가 성립한다고 주장함.

4. “우리의 main result는 이러한 모든 worlds 중에서, Gaussian이 이 보장이 성립하는 유일한 latent distribution이라는 것임.”

이 문장은 논문의 가장 강한 주장 중 하나임.
- LeJEPA가 latent variable을 선형적으로 복원한다는 보장은 아무 분포에서나 성립하지 않음.
- 이 논문은 그 보장이 Gaussian latent distribution에서만 성립한다고 주장함.
Gaussian distribution은 정규분포를 의미함.
- 평균과 분산만 정해졌을 때 가장 적은 추가 가정을 넣는 maximum-entropy distribution으로 볼 수 있음.
- 많은 작은 요인들이 합쳐진 변수는 central limit theorem에 의해 Gaussian에 가까워질 수 있음.
이 논문에서 중요한 점은 Gaussian이 단순한 예시가 아니라는 것임.
- The claim is not merely “Gaussian is enough” but “within this class of worlds, it works only when the latent is Gaussian”.
AI 관점에서 보면 의미가 큼.
- 보통 representation learning에서는 “좋은 objective를 쓰면 좋은 feature를 배운다”고 말함.
- 하지만 이 논문은 objective만으로 충분하지 않고, data-generating process의 latent distribution도 중요하다고 말함.
이 문장의 핵심은 다음과 같음.
- LeJEPA의 linear identifiability guarantee는 Gaussian latent world에서만 성립함.
- 따라서 이 논문은 LeJEPA가 항상 world model을 배운다는 논문이 아니라, 어떤 조건에서 가능한지를 정확히 밝히는 논문임.

5. “Forward direction은 spectral decomposition에 기반하며, 여기서 각 degree of nonlinearity는 alignment에 의해 엄격하게 penalize되므로 linear map이 optimum이 됨.”

Forward direction은 “Gaussian이면 LeJEPA가 linear identifiability를 달성함”을 보이는 방향임.
- 즉, 가정이 Gaussian world라면 결과로 linear recovery가 나온다는 증명임.
spectral decomposition은 함수를 여러 성분으로 나누어 분석하는 방법임.
- Gaussian world에서는 Hermite polynomial이라는 basis로 함수를 나눌 수 있음.
- Fourier transform에서 신호를 여러 frequency로 나누는 것과 비슷하게 이해할 수 있음.
이 논문에서는 representation function을 다음처럼 나눠서 봄.
- linear part
- quadratic part
- cubic part
- higher-order nonlinear part
degree of nonlinearity는 함수가 얼마나 비선형적인지를 나타냄.
- degree 1은 linear임.
- degree 2는 quadratic임.
- degree 3 이상은 더 복잡한 nonlinear component임.
논문의 핵심 수식적 직관은 다음과 같음.
- positive pair 사이의 correlation은 linear component에서 가장 크게 유지됨.
- nonlinear component로 갈수록 correlation이 더 빠르게 줄어듦.

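$$
w_1 \rho + w_2 \rho^2 + w_3 \rho^3 + \cdots \leq \rho
$$

Here $w_1$ is the weight of the linear component and $w_2, w_3, \dots$ are the weights of the nonlinear components; the inequality relies on the weights being nonnegative and summing to at most 1, as they do for a unit-variance embedding.
Since $0 < \rho < 1$, the following holds.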

$$
\rho > \rho^2 > \rho^3 > \cdots
$$

따라서 nonlinear component가 섞일수록 alignment score가 나빠짐.
LeJEPA는 positive pair를 가깝게 만들려고 하므로, nonlinear component는 objective에서 손해를 봄.
그래서 optimum은 nonlinear component가 없는 linear map이 됨.
이 문장의 핵심은 다음과 같음.
- Gaussian world에서는 representation이 비선형적으로 꼬일수록 alignment loss가 나빠짐.
- 따라서 가장 좋은 해는 true latent를 선형적으로 복원하는 representation임.
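The decay $\rho > \rho^2 > \rho^3$ can be observed directly. The following is a minimal sketch, assuming NumPy and a one-dimensional Gaussian latent with an OU positive pair; the normalized Hermite polynomials are written out by hand and the value of `rho` is illustrative. It estimates the correlation between $He_k(z)$ and $He_k(z')$ for degrees 1 to 3, which should be close to $\rho^k$, and converts it to the alignment loss $2(1 - \text{correlation})$ of a unit-variance feature.

```python
import numpy as np

rho = 0.8  # illustrative value
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)
z_next = rho * z + np.sqrt(1.0 - rho**2) * rng.standard_normal(z.shape)

# Normalized probabilists' Hermite polynomials of degree 1..3
# (zero mean and unit variance under N(0, 1)).
hermite = {
    1: lambda t: t,
    2: lambda t: (t**2 - 1.0) / np.sqrt(2.0),
    3: lambda t: (t**3 - 3.0 * t) / np.sqrt(6.0),
}

# Correlation across the positive pair for each degree.
correlations = {k: float(np.mean(f(z) * f(z_next))) for k, f in hermite.items()}

# Alignment loss of a zero-mean, unit-variance feature is 2 * (1 - correlation).
losses = {k: 2.0 * (1.0 - c) for k, c in correlations.items()}

for k in hermite:
    print(k, correlations[k], rho**k, losses[k])
```

Every feature here has the same variance, so the whitening side of the constraint cannot tell them apart; only the alignment term separates them, and it favors degree 1.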

6. “반대로 converse는 모든 non-Gaussian alternative를 배제함.”

converse는 반대 방향의 증명임.
- Forward direction은 “Gaussian이면 linear identifiability가 됨”을 보임.
- Converse는 “linear identifiability가 되려면 Gaussian이어야 함”을 보임.
즉, 논문은 다음 두 방향을 모두 보이려는 것임.

$$
\text{Gaussian world}
\Rightarrow
\text{linear identifiability}
$$

$$
\text{linear identifiability}
\Rightarrow
\text{Gaussian world}
$$

두 방향을 합치면 다음처럼 말할 수 있음.

$$
\text{linear identifiability}
\Longleftrightarrow
\text{Gaussian world}
$$

non-Gaussian alternative를 배제함은 Laplace, uniform, heavy-tailed distribution 등 다른 분포에서는 같은 보장을 할 수 없다는 뜻임.
- 즉, Gaussian이 많은 가능한 분포 중 하나가 아니라, 이 이론에서 유일하게 가능한 분포라는 주장임.
이 문장의 핵심은 다음과 같음.
- 이 논문은 Gaussian case 하나를 보여주는 데서 끝나지 않음.
- 다른 non-Gaussian latent distribution에서는 같은 linear recovery guarantee가 성립하지 않는다고 증명함.

7. “또한 우리는 보장이 점진적으로 약화되는 approximate identifiability result를 증명하고, linear, orthogonal identifiability가 optimal latent-space planning을 가능하게 함을 보임.”

이 문장은 두 가지 내용을 담고 있음.

7-1. approximate identifiability

앞선 이론은 이상적인 상황을 가정함.
- infinite data
- perfect optimization
- perfect Gaussian regularization
- exact alignment optimum
하지만 실제 deep learning에서는 이런 조건이 완벽하게 성립하지 않음.
- training loss가 완전히 minimum이 아닐 수 있음.
- embedding distribution이 완벽한 Gaussian이 아닐 수 있음.
- finite sample noise가 존재함.
그래서 논문은 approximate identifiability를 보임.
- objective를 조금 덜 만족하면 recovery도 조금 나빠진다는 뜻임.
- 즉, 갑자기 보장이 완전히 깨지는 것이 아니라, error가 점진적으로 증가함.
논문에서는 recovery error를 alignment gap과 whitening error로 bound함.

$$
\mathbb{E}\left[\|h(z) - Qz\|^2\right]
\leq
D + (\varepsilon + D)^2
$$

Here $\varepsilon$ is the whitening error.
- It measures how far the representation covariance deviates from the identity matrix.
$D$ is the term that comes from the alignment gap.
- It measures how far the positive pairs are from being aligned.
이 의미는 실용적으로 중요함.
- training loss가 낮고 Gaussian regularization이 잘 되면, 실제 latent recovery도 좋을 가능성이 높다는 뜻임.

7-2. linear, orthogonal identifiability와 planning

orthogonal identifiability는 representation이 true latent와 rotation/reflection 정도만 다르다는 뜻임.

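$$
h(z) = Qz
$$

When $Q$ is an orthogonal matrix, structure such as distances and angles is preserved.
That is, the distance between two points in the learned latent space has the same meaning as in the true latent space.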
이것이 planning에 중요한 이유는 다음과 같음.
- planning은 “현재 상태에서 목표 상태까지 어떻게 움직일지”를 계산하는 문제임.
- learned latent space가 true latent space를 회전한 것에 불과하다면, learned latent에서 짠 계획은 true world에서도 의미가 유지됨.
- 반대로 representation이 비선형적으로 뒤틀려 있으면, latent에서 직선 경로가 실제 세계에서는 휘어진 비효율적 경로가 될 수 있음.
이 문장의 핵심은 다음과 같음.
- 이론적 guarantee가 완벽한 경우뿐 아니라 approximate case에서도 어느 정도 유지됨.
- 또한 linear and orthogonal identifiability가 있으면 learned latent space에서의 planning이 실제 세계에서도 올바른 planning으로 이어질 수 있음.
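The geometric point behind the planning argument can be illustrated with a few lines of NumPy. This is a sketch under stated assumptions: the random orthogonal matrix comes from a QR decomposition, and `warp` is a hypothetical nonlinear encoder used only as a counterexample; neither comes from the paper. It checks that an orthogonal map preserves the start-to-goal distance and maps the midpoint of a straight path to the midpoint of the mapped endpoints, while the nonlinear map does not.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3
q_mat, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # random orthogonal matrix

start, goal = rng.standard_normal(dim), rng.standard_normal(dim)

def warp(z):
    """Hypothetical nonlinear encoder, used only as a counterexample."""
    return np.sin(z) + z**2

# Distances between start and goal under each encoder.
dist_true = np.linalg.norm(goal - start)
dist_rotated = np.linalg.norm(q_mat @ goal - q_mat @ start)
dist_warped = np.linalg.norm(warp(goal) - warp(start))

# Midpoint of the straight line in the true latent space, pushed through each
# encoder and compared with the midpoint of the encoded endpoints.
mid = 0.5 * (start + goal)
linear_gap = np.linalg.norm(q_mat @ mid - 0.5 * (q_mat @ start + q_mat @ goal))
warped_gap = np.linalg.norm(warp(mid) - 0.5 * (warp(start) + warp(goal)))

print(dist_true, dist_rotated, dist_warped)
print(linear_gap, warped_gap)
```

A nonzero `warped_gap` means that a straight line in the true latent space is no longer straight after encoding, which is the curved-trajectory effect described for the non-Gaussian Reacher encoder.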

8. “우리는 2D examples부터 1024-dimensional latents까지 이어지는 experiments를 통해 이 이론을 검증하며, 여기에는 distributional ablations와 pixel-based robotic control이 포함됨.”

이 문장은 실험 검증 범위를 설명함.
2D examples는 가장 단순한 toy experiment임.
- 2차원 Gaussian latent를 만든 뒤 nonlinear transformation으로 일부러 꼬아놓음.
- 이후 LeJEPA가 이를 다시 원래 latent 구조로 복원하는지 확인함.
- 시각화가 가능하므로 이론을 직관적으로 보여주기 좋음.
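To make the 2D setup concrete, here is a minimal sketch of a nonlinear mixing and its inverse, assuming NumPy. The specific parabolic shear and its `strength` parameter are assumptions for illustration; the paper's exact mixing functions may differ. The observation `x` is no longer isotropic Gaussian, yet an ideal encoder (here the exact inverse) recovers the latent.

```python
import numpy as np

def parabolic_shear(z, strength=0.5):
    """Hypothetical mixing g: (z1, z2) -> (z1, z2 + strength * z1^2)."""
    x = z.copy()
    x[:, 1] = z[:, 1] + strength * z[:, 0] ** 2
    return x

def inverse_parabolic_shear(x, strength=0.5):
    """Exact inverse of the shear above, i.e. an ideal encoder for this g."""
    z = x.copy()
    z[:, 1] = x[:, 1] - strength * x[:, 0] ** 2
    return z

rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, 2))
x = parabolic_shear(z)

# The second observed coordinate picks up extra variance from the z1^2 term,
# so the observation covariance is no longer the identity.
cov_x = np.cov(x, rowvar=False)
z_recovered = inverse_parabolic_shear(x)

print(cov_x)
print(np.max(np.abs(z_recovered - z)))
```

In the actual experiment the encoder is learned from positive pairs and never sees the inverse; the sketch only shows what "undoing the mixing" means.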
1024-dimensional latents는 고차원에서도 이론이 작동하는지 확인하는 실험임.
- 실제 representation learning에서는 latent dimension이 매우 큼.
- 따라서 2D toy example만으로는 부족함.
- 논문은 1024차원까지 확장해도 SIGReg나 VICReg가 linear recovery를 잘 유지함을 보임.
Distributional ablations are experiments that vary the latent distribution and compare performance, for example across:
- Gaussian
- Laplace
- heavy-tailed distributions
- uniform distribution
이 실험은 Theorem 2와 연결됨.
- Gaussian일 때만 linear identifiability가 강하게 나타나는지 확인하는 목적임.
pixel-based robotic control은 실제 pixel image를 입력으로 사용하는 로봇 제어 실험임.
- 단순히 latent vector를 직접 주는 것이 아니라, image로부터 representation을 학습함.
- DMC Reacher 같은 환경에서 robot joint state를 latent로 보고, pixel observation으로부터 이를 복원할 수 있는지 확인함.
- 이는 이론이 실제 control setting과 어느 정도 연결될 수 있음을 보여주기 위한 실험임.
이 문장의 핵심은 다음과 같음.
- 논문은 이론만 제시하지 않고, toy example, high-dimensional latent, distribution shift, robotic control까지 실험적으로 확인함.
- 다만 여전히 controlled experiment 중심이며, real-world 대규모 video world model까지 완전히 증명한 것은 아님.

9. “우리의 이론은 경험적으로 성공한 recipe를 mathematical guarantee로 전환하며, world의 structure를 provably recover하는 World Models를 구축하기 위한 foundation을 제공함.”

경험적으로 성공한 recipe는 LeJEPA의 학습 방식이 실험적으로 잘 작동했다는 뜻임.
- Alignment를 사용함.
- Gaussian regularization을 사용함.
- Collapse를 막으면서 좋은 representation을 학습함.
- 이러한 방식은 이미 image, video, control 등에서 좋은 결과를 보였음.
하지만 경험적으로 잘 된다는 것과 이론적으로 왜 되는지 아는 것은 다름.
- 기존에는 “잘 되더라”에 가까웠다면,
- 이 논문은 “어떤 조건에서는 왜 되는지 증명할 수 있다”고 말함.
mathematical guarantee는 수학적 보장을 의미함.
- 특정 assumptions가 만족되면, LeJEPA가 true latent를 linear하게 recover한다는 것을 증명함.
- 즉, 단순한 empirical observation을 theorem으로 끌어올림.
world의 structure를 provably recover한다는 것은 세계를 구성하는 latent structure를 증명 가능하게 복원한다는 뜻임.
- 예를 들어 image나 sensor data 뒤에 있는 position, velocity, angle 같은 변수들을 representation에서 선형적으로 되살릴 수 있다는 의미임.
- 물론 이 보장은 Gaussian latent, stationary additive-noise transition, dimension matching 같은 강한 조건 아래에서 성립함.
foundation을 제공함은 이 논문이 완성된 World Model system을 제시했다기보다, World Model을 이론적으로 이해하기 위한 기초를 제공한다는 뜻임.
- 특히 “언제 representation이 world model이라고 부를 수 있는가?”라는 질문에 대해 하나의 수학적 기준을 제시함.
- 그 기준이 linear identifiability임.
이 문장의 핵심은 다음과 같음.
- LeJEPA는 경험적으로 잘 되는 SSL recipe였음.
- 이 논문은 그 recipe가 특정 조건에서 실제 world latent structure를 복원한다는 이론적 근거를 제공함.
- 따라서 World Model 연구에서 representation learning의 이론적 foundation으로 볼 수 있음.

전체 요약

이 논문은 LeJEPA가 언제 World Model을 학습한다고 말할 수 있는지 분석함.
핵심 기준은 linear identifiability임.
Linear identifiability는 learned representation이 true latent variable을 rotation/reflection 정도만 허용하고 선형적으로 복원한다는 뜻임.
LeJEPA의 alignment는 positive pair를 가깝게 만들고, Gaussian regularization은 representation distribution을 isotropic Gaussian으로 유지함.
Gaussian latent world에서는 nonlinear representation이 alignment에서 손해를 보기 때문에, optimum은 linear representation이 됨.
반대로 non-Gaussian latent distribution에서는 같은 보장이 성립하지 않음.
Approximate case에서도 training loss와 whitening error가 작으면 recovery error도 작게 bound됨.
Linear and orthogonal identifiability가 있으면 learned latent space에서의 planning이 true world planning과 같은 의미를 가질 수 있음.
따라서 이 논문은 LeJEPA가 항상 world model을 배운다고 주장하는 것이 아니라, 어떤 조건에서 world structure를 provably recover할 수 있는지 밝히는 이론 논문임.

1. Introduction

Self-supervised learning (SSL)의 약속은 labeled data 없이도, 단지 관찰하고 예측하는 것만으로 세계의 useful representations를 학습할 수 있다는 데 있음.

Joint-Embedding Predictive Architectures [JEPAs, 1]는 같은 input의 related views에 대해 비슷한 embeddings를 생성하도록 representation을 학습시키고, regularizer가 representation이 trivial constant로 collapse되는 것을 막도록 함으로써 이 비전을 추구함 [2, 3].

The resulting representations have proven highly effective across image [4], video [5, 6], and latent-space planning [7, 8, 9].

그러나 더 깊은 질문이 남아 있음.

Learned representation은 언제 World Model, 즉 세계의 latent structure에 대한 faithful map이라고 할 수 있는가?

우리의 답은 다음과 같음. 세계의 latent variables를 선형적으로 복원할 때임 (Fig. 1).

Fig. 1. Left: the world has mutually independent Gaussian latent variables. Middle: an unknown nonlinear process mixes these latent variables into the data we observe. Right: LeJEPA recovers the latent variables up to a rotation. We prove that this is the unique optimum.

A representation that entangles unrelated latent variables, mixing an object's position with its color or its velocity with its texture, may score well on narrow tasks but fails when the world changes [10].

우리에게 필요한 것은 linear identifiability, 즉 learned representation이 underlying latent variables를 simple symmetries까지만 허용한 채 복원한다는 mathematical guarantee임 [11].

Practitioners는 이미 이를 routinely test하고 있음.

Representation이 linear probing [12]으로 평가될 때마다, 암묵적으로 묻는 질문은 model이 latent variables의 linear representation을 학습했는지 여부임 [13, 14, 15, 16, 17, 18].

이것이 없으면 linear probes는 latent variables를 정확히 복원할 수 없음.

따라서 linear identifiability는 faithful linear probing을 위한 necessary condition임. 다만 sufficient condition은 아님 [19].

1. “우리에게 필요한 것은 linear identifiability…”

여기서 linear identifiability는 다음 뜻임.

모델이 배운 representation이 원래 latent variables를 선형 변환 정도의 차이만 두고 복원할 수 있음이 보장되는 성질임.

예를 들어 실제 세계의 latent variable이 다음과 같다고 하자.

$$
z = [\text{object position}, \text{object color}, \text{object velocity}]
$$

When the model sees an image and produces a representation $h(z)$, the good case looks like this.

$$
h(z) = Qz
$$

That is, the original latent $z$ is laid out linearly inside the representation.

완전히 같은 좌표일 필요는 없음. 좌표축이 회전되거나 부호가 바뀐 정도는 괜찮음. 이걸 논문에서는 simple symmetries라고 부름.

2. “simple symmetries까지만 허용한 채 복원한다”

이 말이 좀 어려운데, 쉽게 말하면:

원래 latent와 learned representation이 완전히 똑같을 필요는 없지만, 회전이나 반사처럼 구조를 망가뜨리지 않는 변환 정도만 허용한다는 뜻임.

예를 들어 2D latent가 있다고 하자.

$$
z = \begin{bmatrix} \text{position} \\ \text{velocity} \end{bmatrix}
$$

It is fine if the model recovers it like this.

$$
h(z) = Qz
$$

If $Q$ is a rotation matrix, only the coordinate frame has been rotated. Distances between points, angles, and overall structure are preserved.

A recovery like the following, however, is bad.

$$
h(z) = \begin{bmatrix} \sin(\text{position}) + \text{velocity}^2 \\ \text{position} \cdot \text{velocity} \end{bmatrix}
$$

Here position and velocity are mixed nonlinearly. It is hard to read the latent back out of such a representation.

3. “Practitioners는 이미 이를 routinely test하고 있음”

이 말은 연구자들이 이미 실험에서 이걸 자주 확인하고 있다는 뜻임.

대표적인 방법이 linear probing임.

Linear probing은 보통 이렇게 함.

- Freeze the pretrained encoder.
- Take the features produced by that encoder.
- Train only a very simple linear classifier or linear regressor on top of them.
- If performance is good, conclude that the representation contains the task-relevant information.

예를 들어 이미지 encoder를 학습했다고 하자. 그 encoder의 feature 위에 linear classifier 하나만 붙였는데 object class를 잘 맞히면, 사람들은 “이 encoder가 object semantics를 잘 배웠다”고 해석함.

4. “Representation이 linear probing으로 평가될 때마다…”

이 문장은 linear probing의 숨은 가정을 말하는 것임.

Linear probing은 단순한 선형 모델만 사용함.

$$
\hat{z} = W h(z) + b
$$

Here $h(z)$ is the learned representation and $W, b$ are the linear probe.

If the latent variable sits linearly inside $h(z)$, a linear probe can extract it easily.

For example, in the good case:

$$
h(z) = Qz
$$

the linear probe can learn something close to $Q^{-1}$ and recover the original $z$.

$$
z \approx W h(z)
$$

But in the bad case:

$$
h(z) = \begin{bmatrix} \sin(z_1) \\ z_2^2 \end{bmatrix}
$$

the representation is twisted nonlinearly, so a simple linear probe cannot accurately recover the original $z_1, z_2$.

5. “이것이 없으면 linear probes는 latent variables를 정확히 복원할 수 없음”

여기서 이것은 linear identifiability를 말함.

즉, learned representation이 true latent를 선형적으로 담고 있지 않으면, linear probe는 아무리 학습해도 원래 latent를 정확히 꺼낼 수 없다는 뜻임.

예를 들어 true latent가 position이라고 하자.

좋은 representation:

$$
h(z) = 2z + 1
$$

In this case a linear probe recovers $z$ easily.

Bad representation:

$$
h(z) = z^2
$$

Here $z = 1$ and $z = -1$ both give $h(z) = 1$, so a linear probe cannot tell whether the original $z$ was 1 or -1.

As a more complex example, suppose position and velocity are mixed like this:

$$
h(z) = \begin{bmatrix} \text{position} + \text{velocity}^2 \\ \sin(\text{position}) \end{bmatrix}
$$

linear probe는 position만 깔끔하게 꺼내기 어려움.
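The good and bad cases above can be compared with an actual linear probe. The following is a minimal sketch, assuming NumPy and a 2D Gaussian latent; `linear_probe_r2` is a hypothetical helper that fits $\hat{z} = W h(z) + b$ by least squares and reports in-sample $R^2$. It probes a rotated representation $h(z) = Qz$ and the entangled representation $[\sin(z_1), z_2^2]$ used earlier.

```python
import numpy as np

def linear_probe_r2(features, targets):
    """Fit targets ~ features @ W + b by least squares and return pooled R^2."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    residual = targets - design @ weights
    total = targets - targets.mean(axis=0)
    return 1.0 - float(np.sum(residual**2) / np.sum(total**2))

rng = np.random.default_rng(0)
z = rng.standard_normal((50_000, 2))

theta = 0.7  # arbitrary rotation angle
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# Good case: h(z) = Qz, so the probe can learn the inverse rotation.
r2_rotated = linear_probe_r2(z @ rotation.T, z)

# Bad case: h(z) = [sin(z1), z2^2], a nonlinear entanglement of the latent.
entangled = np.column_stack([np.sin(z[:, 0]), z[:, 1] ** 2])
r2_entangled = linear_probe_r2(entangled, z)

print(r2_rotated, r2_entangled)
```

The $z_2^2$ coordinate is uncorrelated with $z_2$ under a symmetric latent, so the probe recovers essentially nothing about $z_2$ even though the information about its magnitude is present in the representation.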

Does linear identifiability alone guarantee faithful linear probing?

The answer is no.

왜냐하면 linear probing이 잘 작동하려면 linear identifiability 말고도 다른 조건들이 필요하기 때문임.

예를 들어:

probe를 학습할 labeled data가 충분해야 함
probe optimization이 잘 되어야 함
downstream label이 실제 latent와 선형적으로 관련되어 있어야 함
noise가 너무 크지 않아야 함
evaluation task가 representation이 담은 latent와 맞아야 함

하지만 JEPA에 대한 identifiability result는 아직 존재하지 않음.

이는 prior methods가 unspecified embedding distributions를 가진 implicit collapse prevention [20, 21, 22]에 의존했기 때문에 지금까지 해결하기 어려웠음.

최근의 변화가 이 상황을 바꿈.

LeJEPA [2]는 Sketched Isotropic Gaussian Regularization (SIGReg)를 통해 embedding distribution을 isotropic Gaussian 쪽으로 명시적으로 regularize함으로써 collapse를 방지함.

Alignment loss와 함께 사용되면, 이는 raw pixels로부터 stable end-to-end training을 가능하게 함 [9].

그러나 이렇게 얻어진 representation이 실제로 세계의 latent variables를 복원하는지는 여전히 open question으로 남아 있음.

기존 JEPA의 한계와 LeJEPA가 필요한 이유

1. JEPA에 대한 identifiability result는 아직 존재하지 않음

JEPA는 Joint-Embedding Predictive Architecture의 약자임.
핵심 아이디어는 pixel 자체를 예측하는 것이 아니라, representation space에서 예측하거나 맞추는 것임.
예를 들어 비디오의 현재 frame과 미래 frame이 있을 때, 미래 frame의 pixel을 하나하나 예측하는 대신, 두 frame의 embedding이 잘 맞도록 학습함.
이렇게 하면 texture, background noise, pixel-level detail 같은 불필요한 정보에 capacity를 덜 쓰고, 더 추상적인 representation을 배울 수 있음.
하지만 중요한 문제가 있음.
JEPA가 좋은 representation을 학습한다는 empirical result는 있었지만, 그 representation이 실제 world의 latent variables를 복원한다는 이론적 보장은 없었음.
여기서 identifiability result란 다음과 같은 보장을 의미함.
“모델이 학습한 representation이 데이터 뒤에 숨어 있는 실제 latent variables를 복원할 수 있음.”
예를 들어 로봇 팔 이미지가 있다고 할 때, pixel 뒤에는 실제 joint angle, velocity, target position 같은 latent variables가 있음.
좋은 World Model이라면 단순히 이미지를 구분하는 feature를 만드는 것에서 끝나면 안 됨.
이미지 뒤에 숨어 있는 실제 상태 변수들을 representation 안에 잘 담아야 함.
그런데 기존 JEPA에 대해서는 이런 의미의 identifiability theorem이 없었음.

2. 기존 방법들은 implicit collapse prevention에 의존했기 때문에 분석이 어려웠음

Self-supervised learning에서 가장 큰 문제 중 하나는 collapse임.
Collapse는 모든 input이 거의 같은 representation으로 mapping되는 현상임.
예를 들어 어떤 이미지를 넣어도 encoder가 항상 같은 vector를 출력하면, representation은 아무 정보도 담지 못함.
그런데 alignment loss만 사용하면 이런 collapse가 쉽게 생길 수 있음.
- 왜냐하면 positive pair를 가깝게 만들라는 목표만 있으면, 모든 sample을 같은 vector로 보내는 것이 loss를 줄이는 쉬운 방법이 될 수 있기 때문임.
그래서 기존 SSL 방법들은 collapse를 막기 위한 여러 장치를 사용했음.
예를 들어 stop-gradient teacher, momentum encoder, predictor, covariance regularization 같은 방법들이 있음.
하지만 이런 방법들은 collapse를 “간접적으로” 막는 경우가 많았음.
이것을 implicit collapse prevention이라고 부름.
implicit이라는 말은 명시적으로 “embedding distribution은 이런 형태여야 한다”고 정하지 않는다는 뜻임.
즉, collapse는 막지만, 최종 embedding distribution이 정확히 어떤 분포를 가져야 하는지는 불명확함.
이 점이 이론 분석을 어렵게 만듦.
Identifiability를 증명하려면 learned representation의 분포가 어떤 구조를 갖는지 알아야 함.
그런데 prior methods는 embedding distribution이 unspecified되어 있었음.
즉, embedding이 Gaussian인지, uniform인지, sphere 위에 놓이는지, cluster를 이루는지 명확하지 않았음.
그래서 JEPA류 방법이 latent variables를 실제로 복원한다고 수학적으로 증명하기 어려웠음.

3. LeJEPA는 이 상황을 바꿈

최근 LeJEPA가 등장하면서 이론 분석이 가능해질 수 있는 구조가 생김.
LeJEPA의 핵심은 alignment loss와 Gaussian regularization을 함께 사용하는 것임.
Alignment loss는 positive pair의 representation을 가깝게 만듦.
Gaussian regularization은 전체 embedding distribution이 isotropic Gaussian에 가까워지도록 만듦.
즉, LeJEPA는 단순히 collapse를 막는 데서 끝나지 않고, embedding distribution의 모양을 명시적으로 정함.
이 점이 기존 JEPA/SSL 방법들과 중요한 차이임.

4. SIGReg는 embedding distribution을 isotropic Gaussian으로 regularize함

LeJEPA는 Sketched Isotropic Gaussian Regularization, 즉 SIGReg를 사용함.
SIGReg의 목표는 embedding distribution이 isotropic Gaussian에 가까워지도록 만드는 것임.
Isotropic Gaussian은 평균이 0이고 covariance가 identity matrix인 정규분포임.

$$
\mathcal{N}(0, I)
$$

여기서 isotropic이라는 말은 모든 방향으로 균일하게 퍼져 있다는 뜻임.
예를 들어 2D에서는 원형 Gaussian처럼 생각할 수 있음.
특정 방향으로 길게 늘어나 있거나, 어떤 축끼리 강하게 correlation되어 있는 상태가 아님.
SIGReg는 embedding들이 한 점으로 collapse되지 않게 만들면서, 전체적으로 Gaussian처럼 잘 퍼지도록 유도함.
이처럼 embedding distribution을 명시적으로 Gaussian에 맞추면, 이론적으로 다루기 쉬워짐.
왜냐하면 Gaussian distribution은 수학적으로 매우 잘 분석되는 분포이고, rotation invariance 같은 좋은 성질을 갖기 때문임.
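Rotation invariance means that every one-dimensional projection of $\mathcal{N}(0, I)$ onto a unit direction is again $\mathcal{N}(0, 1)$, which suggests checking Gaussianity along random directions. The sketch below, assuming NumPy, is a simplified moment-based stand-in for that idea and is explicitly not the SIGReg implementation: `sliced_moment_gap` is a hypothetical function that compares the mean, variance, and kurtosis of random projections with their Gaussian values.

```python
import numpy as np

def sliced_moment_gap(emb, num_dirs=64, seed=0):
    """Average squared deviation of projected moments from those of N(0, 1).

    Simplified illustration only; this is not the statistic used by SIGReg.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((emb.shape[1], num_dirs))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)  # unit directions
    proj = emb @ dirs                                     # shape (N, num_dirs)
    mean = proj.mean(axis=0)
    var = proj.var(axis=0)
    kurt = ((proj - mean) ** 4).mean(axis=0) / (var**2 + 1e-12)
    return float(np.mean(mean**2 + (var - 1.0) ** 2 + (kurt - 3.0) ** 2))

rng = np.random.default_rng(1)
num_samples, dim = 100_000, 2

gaussian_emb = rng.standard_normal((num_samples, dim))
# Uniform on [-sqrt(3), sqrt(3)]: zero mean and identity covariance, not Gaussian.
uniform_emb = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(num_samples, dim))

gap_gaussian = sliced_moment_gap(gaussian_emb)
gap_uniform = sliced_moment_gap(uniform_emb)
print(gap_gaussian, gap_uniform)
```

Both embeddings have the same mean and covariance, so only the kurtosis term separates them. A check that stops at second moments would treat them as identical.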

5. Alignment loss와 SIGReg를 함께 쓰면 raw pixels에서도 stable end-to-end training이 가능함

Alignment loss는 related views의 embedding을 가깝게 만드는 역할을 함.
예를 들어 같은 이미지의 두 augmentation, 비디오의 가까운 두 frame, trajectory의 가까운 두 state는 positive pair가 될 수 있음.
Alignment loss는 이런 pair들이 representation space에서 가까워지도록 만듦.

$$
\mathcal{L}_{\text{align}} = \mathbb{E}\left[\|h(z') - h(z)\|^2\right]
$$

하지만 alignment loss만 있으면 collapse가 생길 수 있음.
모든 input을 같은 embedding으로 보내면 positive pair가 항상 가까워지기 때문임.
이때 SIGReg가 전체 embedding distribution을 Gaussian처럼 퍼지게 만들면서 collapse를 막음.
따라서 alignment loss와 SIGReg는 서로 보완적인 역할을 함.
Alignment loss는 관련 있는 sample들을 가깝게 만들고, SIGReg는 전체 representation이 collapse하지 않게 만듦.
이 조합 덕분에 raw pixels에서부터 stable end-to-end training이 가능해짐.
Raw pixels는 사람이 미리 만든 feature가 아니라 이미지의 원래 pixel 값을 의미함.
End-to-end training은 input image부터 encoder output까지 전체 neural network를 한 번에 학습한다는 뜻임.
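The complementary roles of the two terms show up in a toy comparison. This is a minimal sketch, assuming NumPy; `whitening_error` is a covariance-only proxy introduced here for illustration and is weaker than the full Gaussianity regularizer. A collapsed encoder achieves zero alignment loss but a large whitening error, while the identity encoder on Gaussian OU pairs pays a small alignment loss and keeps the covariance near the identity.

```python
import numpy as np

def alignment_loss(h, h_next):
    """Monte Carlo estimate of E[||h(z') - h(z)||^2]."""
    return float(np.mean(np.sum((h_next - h) ** 2, axis=1)))

def whitening_error(h):
    """Frobenius distance between the embedding covariance and the identity."""
    cov = np.cov(h, rowvar=False)
    return float(np.linalg.norm(cov - np.eye(h.shape[1])))

rng = np.random.default_rng(0)
dim, rho = 3, 0.9  # illustrative values
z = rng.standard_normal((100_000, dim))
z_next = rho * z + np.sqrt(1.0 - rho**2) * rng.standard_normal(z.shape)

# Collapsed encoder: every input is mapped to the same vector.
collapsed, collapsed_next = np.zeros_like(z), np.zeros_like(z_next)

align_collapsed = alignment_loss(collapsed, collapsed_next)
white_collapsed = whitening_error(collapsed)

# Identity encoder: h(z) = z.
align_identity = alignment_loss(z, z_next)
white_identity = whitening_error(z)

print(align_collapsed, white_collapsed)
print(align_identity, white_identity)
```

Alignment alone prefers the collapsed solution, which is why some constraint on the embedding distribution has to accompany it.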

6. 그래도 representation이 실제 latent variables를 복원하는지는 여전히 open question임

LeJEPA는 좋은 representation을 안정적으로 학습할 수 있음.
하지만 이것만으로 “world의 latent variables를 복원했다”고 말할 수는 없음.
좋은 representation을 학습하는 것과, 실제 세계의 hidden state를 복원하는 것은 다른 문제임.
예를 들어 image classification에 좋은 feature를 배웠다고 해서, 그 feature가 object position, velocity, lighting, pose 같은 실제 latent factor를 정확히 담고 있다고 보장되지는 않음.
이 논문이 던지는 핵심 질문은 바로 이것임.
“LeJEPA로 학습한 representation은 언제 실제 world의 latent variables를 recover한다고 말할 수 있는가?”
논문은 이 질문에 linear identifiability라는 기준으로 답하려고 함.
즉, learned representation이 true latent variables를 rotation/reflection 정도의 simple symmetry만 허용한 채 선형적으로 복원할 수 있는지를 분석함.

Contributions.

우리는 JEPAs에 대한 첫 번째 identifiability result를 제시함으로써 이 gap을 메움 (Fig. 2).

우리는 Gaussian latent variables를 가지며, independent, additive-noise transitions를 갖는 stationary process로부터 positive pairs가 생성되는 넓은 종류의 worlds를 고려함.

이는 maximum-entropy choices임 [23].

우리는 latent variables가 Gaussian일 때 그리고 오직 그때에만, LeJEPA가 linearly identifiable representation을 학습함을 증명함.

Forward direction은 모든 degree of nonlinearity를 penalize하는 spectral decomposition에 기반함.

Converse는 모든 non-Gaussian alternative를 배제함.

또한 objectives가 근사적으로만 만족될 때 identifiability가 점진적으로 약화됨을 증명하고, linear identifiability가 optimal latent-space planning에 충분함을 보임.

우리는 2D mixings, 1024-dimensional latents, distributional ablations, pixel-based robotic control, 그리고 approximate bound 전반에 걸쳐 이를 empirically validate함 (Sec. 6).

Fig. 2. Left: the world has a clean latent structure with correlated positive pairs; this latent structure is Gaussian and disentangled. Middle: an unknown nonlinear process generates the observations we actually see, mixing the latent structure in the process. Right: LeJEPA learns a representation with two objectives. One pulls positive pairs together (attract), and the other keeps the embedding distribution Gaussian to prevent collapse (SIGReg). We prove that the learned representation must be a rotation of the true latents, i.e. the representation is forced to learn the correct World Model. This corresponds to Theorem 1.

2 Related Work

Representation Learning.

JEPAs [1, 4, 5, 6]는 pixel space가 아니라 representation space에서 예측함으로써, irrelevant detail에 capacity가 낭비되는 것을 피함.

더 일반적으로, SSL은 같은 content의 positive views를 서로 가깝게 당기면서 collapse를 방지함.

Contrastive methods [24, 25]는 negatives를 명시적으로 사용함 [InfoNCE, 26].

Non-contrastive methods는 stop-gradient teachers [21, 20], covariance regularization [VICReg, 3, 27], 또는 feature clustering을 이용한 self-distillation [22, 28]으로 이를 대체함.

LeJEPA [2]는 explicit Gaussianity regularizer인 SIGReg를 추가함.

LeWorldModel [9]은 이 recipe를 action-conditioned control로 scale up함.

InfoNCE, VICReg, LeJEPA는 implicit [29, 30, 31, 32] constraint에서 second-moment constraint를 거쳐 full Gaussianity constraint에 이르는 Gaussianity constraints의 hierarchy를 이룸.

우리는 이 세 가지를 모두 test함 (Tab. 1, App. H.7).

This sentence compares InfoNCE, VICReg, and LeJEPA by how Gaussian-like each one makes the embedding distribution. InfoNCE spreads embeddings only indirectly, by pushing negative samples apart, so it can be read as an implicit Gaussianity constraint. VICReg controls variance and covariance directly, so it corresponds to a second-moment constraint. LeJEPA uses SIGReg to push the embedding distribution itself toward an isotropic Gaussian, so it is a full Gaussianity constraint. The three methods therefore form a hierarchy in which the way collapse is prevented moves toward an increasingly explicit and strong Gaussianity constraint.

In other words, all three try to keep embeddings spread out instead of collapsed, and they are ranked here by how strongly they force the embedding distribution to look Gaussian.

1. Why a Gaussianity constraint shows up at all

In self-supervised learning, the representation must not collapse to a single point.

For example, if every image is mapped to the same embedding:

cat  → [0, 0, 0]
dog  → [0, 0, 0]
car  → [0, 0, 0]

the positive pairs look perfectly aligned, but the representation carries no information.

So SSL methods include some mechanism that keeps embeddings spread out in space.
The paper interprets these mechanisms broadly as Gaussianity constraints.
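The collapse failure mode above can be checked numerically. The following is a minimal sketch, assuming a toy 3-dimensional setting; the names `collapsed_encoder` and `alignment_loss` and the noise scale are illustrative assumptions, not anything defined in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 1000 positive pairs in a 3-dimensional space.
z = rng.standard_normal((1000, 3))
z_pos = z + 0.1 * rng.standard_normal((1000, 3))

def collapsed_encoder(x):
    """Maps every input to the same point, like cat/dog/car -> [0, 0, 0]."""
    return np.zeros_like(x)

def alignment_loss(a, b):
    """Mean squared distance between the two embeddings of each positive pair."""
    return float(np.mean(np.sum((a - b) ** 2, axis=1)))

emb, emb_pos = collapsed_encoder(z), collapsed_encoder(z_pos)
loss = alignment_loss(emb, emb_pos)
total_variance = float(emb.var(axis=0).sum())

print("alignment loss:", loss)            # alignment looks perfect
print("total variance:", total_variance)  # but the embedding carries no information
```

The alignment term alone is minimized by this useless encoder, which is why every method discussed below adds some term that keeps the embeddings spread out.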

2. InfoNCE: implicit Gaussianity constraint

InfoNCE is the loss used in contrastive learning.

The core idea is to make:

positive pairs close
negative pairs far apart

For example:

augmentations of the same image → close
samples from different images → far

With this, embeddings cannot collapse to a single point,
because negative samples have to move away from each other.

However, InfoNCE does not directly enforce

$$
h(z) \sim \mathcal{N}(0, I)
$$

Pushing negative samples apart simply ends up spreading the embeddings through space to some degree.
That is why it is called an implicit constraint.

InfoNCE does not make the embedding distribution Gaussian directly; it spreads embeddings indirectly through contrastive repulsion.
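To make the repulsion effect concrete, here is a small NumPy sketch of an InfoNCE-style loss. It assumes cosine similarity, a temperature of 0.1, and in-batch negatives; these choices and the function name `info_nce` are illustrative assumptions, not the setup used in the paper.

```python
import numpy as np

def info_nce(a, b, temperature=0.1):
    """InfoNCE with in-batch negatives: row i of `a` is positive with row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                        # pairwise cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
batch, dim = 64, 16

# Spread-out embeddings whose positives are small perturbations of themselves.
spread = rng.standard_normal((batch, dim))
spread_pos = spread + 0.05 * rng.standard_normal((batch, dim))

# Collapsed embeddings: every sample maps to the same vector.
collapsed = np.ones((batch, dim))

loss_spread = info_nce(spread, spread_pos)
loss_collapsed = info_nce(collapsed, collapsed)

print("spread:", loss_spread)
print("collapsed:", loss_collapsed, "log(batch):", np.log(batch))
```

When everything collapses, positives and negatives are indistinguishable and the loss sits at log(batch size), so the loss penalizes collapse without ever naming a target distribution. That is the sense in which the constraint is implicit.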

3. VICReg: second-moment constraint

VICReg is a non-contrastive SSL method.
It tries to prevent collapse without negative samples.

Its core idea is to control the variance and covariance of the representation.

So rather than matching the full shape of the embedding distribution to a Gaussian, it mainly matches second-order statistics, i.e. second moments.

Second moments here roughly mean:

the variance of each feature dimension
the covariance between feature dimensions

That is, VICReg asks that:

each dimension varies enough
different dimensions are not too redundant
the covariance is pushed toward the identity

This is how it prevents collapse.

In formula form, it roughly encourages

$$
\text{Cov}(h(z)) \approx I
$$

But this is still not a full Gaussian,

because a distribution can have identity covariance and still be non-Gaussian.

For example, a uniform distribution can match the covariance, and so can a heavy-tailed distribution.

In short:

VICReg controls second-order statistics such as variance and covariance, but it does not force the whole distribution to be Gaussian.
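A quick numerical illustration of that gap, as a sketch under simple assumptions: a 1-D uniform variable scaled to unit variance passes a second-moment check but is clearly non-Gaussian in its fourth moment. The helper `excess_kurtosis` is an illustrative name, and kurtosis is used here only as one convenient higher-order statistic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

gauss = rng.standard_normal(n)
# Uniform on [-sqrt(3), sqrt(3)] has mean 0 and variance 1, like N(0, 1).
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), n)

def excess_kurtosis(x):
    """Fourth standardized moment minus 3; equals 0 for a Gaussian."""
    x = (x - x.mean()) / x.std()
    return float(np.mean(x ** 4) - 3.0)

print("variance  gauss / uniform:", gauss.var(), unif.var())
print("kurtosis  gauss / uniform:", excess_kurtosis(gauss), excess_kurtosis(unif))
```

Both samples look identical to a regularizer that only sees mean and covariance; only a constraint on the full distribution tells them apart.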

4. LeJEPA: full Gaussianity constraint

LeJEPA goes further.

LeJEPA's SIGReg regularizes the embedding distribution toward an isotropic Gaussian.

So it does not just match the covariance; it pushes the whole distribution toward

$$
h(z) \sim \mathcal{N}(0, I)
$$

This is the full Gaussianity constraint.

That is, it aims for:

mean ≈ 0
covariance ≈ I
higher-order distribution shape also Gaussian-like

In practice it does not enforce a perfect Gaussian; it adds a sketching-based regularizer that matches Gaussianity efficiently.

5. What "hierarchy" means here

The hierarchy in the paper is this ordering.

InfoNCE
↓
implicit Gaussianity constraint

VICReg
↓
second-moment constraint

LeJEPA
↓
full Gaussianity constraint

The strength with which Gaussianity is enforced increases down the list.

| Method | How it handles Gaussianity | Strength |
|---|---|---|
| InfoNCE | Spreads embeddings indirectly via negative samples | Weak / implicit |
| VICReg | Matches second moments such as variance and covariance | Medium |
| LeJEPA | Pushes the whole embedding distribution toward $\mathcal{N}(0, I)$ | Strong / explicit |

World Models.

The internal-model concept spans cognitive science, control theory, and modern ML.

Cognitive scientists have viewed mental simulation as the substrate of reasoning, perception, and motor control [33, 34, 35, 36, 37], and the free-energy [38] and probabilistic-program [39] formalizations interpret the brain as a generative model.

Cybernetics extended this to engineered systems [40, 41], and classical control developed dynamics over a given state space [42, 43].

Neural world models started from recurrent controller-model pairs [44, 45, 46] and progressed to latent dynamics from pixels [47, 48, 49, 50, 51], value-equivalent prediction [52], generative [53] and joint-embedding predictive [1, 9] architectures, and large generative video simulators [54, 55].

JEPA's claim is that pixel-perfect prediction wastes capacity [1].

We address the encoder side of this picture.

Classical control applies in the learned coordinates (Thm. 4).

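Notes on the World Models paragraph

1. "The internal-model concept spans cognitive science, control theory, and modern ML."

This sentence says that the World Model idea did not suddenly appear with recent deep learning.

An internal model is the model of the world that an agent or system holds internally. Put simply, it is a structure that lets the agent simulate, in its head or inside the system, how the world will change if it takes a certain action.

For example, before throwing a ball a person can predict "if I throw with this much force, it will fly over there." A robot likewise has to predict internally where its arm will end up after a given action.

This notion of an internal model has appeared across several fields.

In cognitive science, it is used to explain how people think and predict.
In control theory, it appears as the use of a dynamics model to control a system.
In modern ML, it leads to World Models in which a neural network learns the latent state or dynamics of the world.

So this sentence is background: the World Model is not specific to deep learning but an old idea that runs through cognitive science, control theory, and machine learning.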

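2. "Cognitive scientists have viewed mental simulation as the substrate of reasoning, perception, and motor control [33, 34, 35, 36, 37], and the free-energy [38] and probabilistic-program [39] formalizations interpret the brain as a generative model."

This sentence explains how the World Model concept has been understood in cognitive science.

Mental simulation means simulation in the head. People predict outcomes before acting. For example, before grasping a cup, a person internally simulates "if I move my hand like this, I can grab the cup."

The three terms in the sentence are:

reasoning
The process of thinking through and judging what will happen.
perception
Not just passively receiving information from the eyes or ears, but the brain interpreting "what state the world is in right now."
motor control
The process of deciding and adjusting how to move the body or arm.

Substrate means basis or foundation. So the claim is that cognitive scientists see mental simulation as the foundation of reasoning, perception, and motor control.

The free-energy formalization is tied to the view that the brain operates to reduce prediction error. Put simply, the brain holds a generative model that predicts the external world and tries to reduce the gap between actual sensory input and its predictions.

The probabilistic-program formalization views the brain as a probabilistic generative program. The brain holds a probabilistic model of how the world is generated and infers hidden causes from the observed sensory input.

The key point of this sentence:

Cognitive science has long held that the brain works like a generative model that internally simulates and predicts the world. The World Model idea is therefore connected to core concepts used to explain human perception, reasoning, and action control.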

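3. "Cybernetics extended this to engineered systems [40, 41], and classical control developed dynamics over a given state space [42, 43]."

This sentence says that the World Model concept carried over from cognitive science into engineered systems and control theory.

Cybernetics studies feedback and control in biological, mechanical, and social systems. Put simply, it deals with how a system observes its environment and corrects its behavior toward a goal.

A thermostat is a good example.

Observe the current temperature
Compare it with the target temperature
If it is too low, turn the heater on
If it is too high, turn the heater off

This feedback control loop is the basic example in cybernetics.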
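The thermostat loop above can be written out as a few lines of code. This is a toy sketch: the temperature dynamics (+0.4 per step when heating, -0.2 otherwise), the 0.5-degree dead band, and the function names are all assumptions made up for illustration.

```python
def thermostat_step(current_temp, target_temp, heater_on, band=0.5):
    """One feedback decision: observe, compare with the target, switch the heater."""
    if current_temp < target_temp - band:
        return True            # too cold: turn the heater on
    if current_temp > target_temp + band:
        return False           # too warm: turn the heater off
    return heater_on           # inside the dead band: keep the current setting

def simulate(initial_temp, target_temp, steps=50):
    """Run the closed loop with assumed toy dynamics and return the temperatures."""
    temp, heater_on = initial_temp, False
    history = []
    for _ in range(steps):
        heater_on = thermostat_step(temp, target_temp, heater_on)
        temp += 0.4 if heater_on else -0.2   # assumed heating / cooling rates
        history.append(temp)
    return history

history = simulate(initial_temp=15.0, target_temp=20.0)
print([round(t, 1) for t in history[-10:]])
```

Note that the controller here reads the temperature directly. That is the "given state space" assumption of classical control discussed next: the state variable is handed to the controller rather than learned from raw observations.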

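"Extended to engineered systems" means that the internal-model concept for people and brains was also used to design artificial systems such as machines and robots.

Classical control is the traditional branch of control theory. It deals with how to drive systems such as robots, aircraft, cars, and motors to a desired state.

The important phrase here is "given state space." Classical control usually assumes that what counts as the system's state is already given.

For a robot arm, for example, the state is taken as given in terms of:

joint angle
joint velocity
end-effector position

Dynamics are then modeled on top of this state.

That is:

There is a current state
An action is applied
Dynamics predict the next state
Actions are chosen so that the target state is reached

The key point of this sentence:

In cybernetics and classical control, the internal-model idea was extended to the problem of controlling real mechanical systems. Classical control, however, usually assumes that the state space is already given and handles dynamics and planning on top of it.

This connects to the paper because the paper asks exactly this: "can that state space be learned as a neural representation?"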

4. "Neural world models started from recurrent controller-model pairs [44, 45, 46] and progressed to latent dynamics from pixels [47, 48, 49, 50, 51], value-equivalent prediction [52], generative [53] and joint-embedding predictive [1, 9] architectures, and large generative video simulators [54, 55]."

This sentence summarizes how World Model research developed within deep learning.

Neural world models are methods that learn a world model with a neural network, i.e. the network is trained to predict the state of the environment or how it will change.

recurrent controller-model pairs

Early neural world models used a recurrent neural network to learn the environment's dynamics, with a controller acting on top of that model.

Put simply, there are two modules.

model: predicts how the world will change
controller: chooses actions based on that prediction

Recurrent means the model handles sequences in which state carries over time. It is used where time matters, as in video or robot trajectories.

latent dynamics from pixels

The next step was learning latent dynamics from pixels.

In practice the robot's true state, such as joint angle or velocity, is not always directly available. Sometimes only pixel observations such as camera images are.

So instead of working on the pixel image directly, the model first compresses the image into a latent representation.

image pixel
encoder
latent state
dynamics model
next latent state prediction

The latent state is built from pixels, and dynamics are learned in that latent space.

value-equivalent prediction

Value-equivalent prediction is the idea of making predictions that preserve the value needed for decision making, rather than predicting future observations exactly.

The world model does not need to predict pixels perfectly. What matters is preserving the information the agent needs to choose good actions.

For example, without predicting every pixel of a game screen, it can be enough to judge which action increases the score.

generative architectures

Generative architectures are world models that generate future observations or trajectories.

For example, a model that generates the next frame from the current frame, or a future state from the current state and action.

joint-embedding predictive architectures

Joint-embedding predictive architectures, i.e. the JEPA family, predict the target in representation space rather than predicting pixels.

This is the part directly connected to the paper. JEPA tries to learn a World Model by matching the representation of the target view rather than by pixel-perfect prediction.

large generative video simulators

Recently, large-scale video generation models have also been interpreted as World Models. If a model can generate a plausible continuation of a video given the current scene and an action or prompt, it can be seen as having learned the world's dynamics to some degree.

The key point of this sentence:

Neural world model research started from early recurrent dynamics models and moved toward learning latent dynamics from pixel observations, preserving the value needed for decision making, generative models, JEPA-style representation prediction, and large-scale video simulators.

5. "JEPA's claim is that pixel-perfect prediction wastes capacity [1]."

This sentence states JEPA's core philosophy.

A generative world model may try to predict future images or video exactly at the pixel level. But pixel-perfect prediction is a very hard problem.

When predicting the next frame, for example, a lot of capacity is wasted if the model tries to get all of the following right.

small changes in background texture
shadows
lighting noise
exact pixel-level detail
small movements unrelated to the task

But these pixel-level details may not be what planning or understanding actually needs. In robot control, what matters is where the arm is, where the goal is, and how to move next, not perfectly matching the fine pixels of the background texture.

JEPA's claim is:

A World Model does not have to predict pixels perfectly. Targeting pixel-perfect prediction can waste model capacity on unnecessary detail. Predicting in a representation space that holds the important semantic and structural information can be more efficient.

This is why JEPA performs prediction in representation space rather than pixel space.

6. "We address the encoder side of this picture."

This sentence says that the paper does not cover the whole World Model, only its encoder side.

A World Model can be broken down into the following components.

an encoder that turns observations into a state representation
a transition model that predicts how the state changes under actions
a planner that uses the transition model to find an action sequence
a controller that executes the actual actions

The paper mainly analyzes the first one: whether the representation produced by the encoder is a good state.

In formulas, this is the part:

true latent state: $z$
observation: $x = g(z)$
learned representation: $h(z) = f(g(z))$

The question the paper asks is:

Does the encoder's $h(z)$ linearly recover the true latent $z$?

So the paper does not directly learn an action-conditioned transition model or propose a new planning algorithm. It analyzes under which conditions the state representation needed by a World Model is learned correctly.

7. "Classical control applies in the learned coordinates (Thm. 4)."

This sentence connects classical control, mentioned earlier, to the paper's result.

Classical control assumes the state space is given. If the robot's joint angles and velocities are known, for example, dynamics are set up on that state space and planning or control is carried out there.

In this paper, however, the state is not known directly; the learned representation produced by the encoder is used instead.

So if classical control is thought of as operating on the true state $z$, there are two spaces:

true state space: $z$
learned representation space: $h(z)$

The paper says that if $h(z)$ and $z$ satisfy

$$
h(z) = Qz
$$

then the logic of classical control can also be applied in the learned representation space.

This is because an orthogonal matrix $Q$ preserves distances and geometry. The learned coordinates are just the true coordinates rotated or reflected.

Planning in the learned latent space can therefore carry the same meaning as planning in the true latent space.

This is what Theorem 4 means.
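The distance-preservation argument can be verified directly. The sketch below assumes a 4-dimensional latent, a random orthogonal $Q$ obtained from a QR decomposition, and a hypothetical "pick the nearest goal" rule standing in for a planner; none of these specifics come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Random orthogonal matrix Q from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

def h(z):
    """Learned coordinates h(z) = Qz, applied to row vectors."""
    return z @ Q.T

z_state = rng.standard_normal(dim)        # current true state
z_goals = rng.standard_normal((5, dim))   # hypothetical candidate goal states

# Distances measured in the true space and in the learned space.
d_true = np.linalg.norm(z_goals - z_state, axis=1)
d_learned = np.linalg.norm(h(z_goals) - h(z_state), axis=1)

print("same distances:", np.allclose(d_true, d_learned))
print("nearest goal (true / learned):", d_true.argmin(), d_learned.argmin())
```

Any decision rule that depends only on Euclidean distances or inner products gives the same answer in both coordinate systems, which is the intuition behind applying classical control in the learned coordinates.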

The key point of this sentence:

Classical control is a theory that operates on the true state space. If the LeJEPA representation preserves the true state linearly, as $h(z) = Qz$, the learned representation space can be treated as a new coordinate system and classical control can be applied there. The paper thus shows theoretically under which conditions a learned representation becomes a state usable for control.

Overall summary

The World Models paragraph explains which line of research this paper sits on.

The main flow is:

Cognitive science views the brain as working like a generative model that internally simulates the world.
Cybernetics and classical control extended this internal-model concept to mechanical systems and control problems.
Modern ML developed research on learning world models with neural networks.
Existing neural world models learn latent dynamics from pixels, generate the future, or predict in representation space.
JEPA holds that pixel-perfect prediction wastes capacity on unnecessary detail and uses representation-space prediction.
This paper analyzes the encoder side of the World Model, i.e. the part that turns observations into a state representation usable for planning.
If the learned representation preserves the true latent state as $h(z) = Qz$, classical control and planning can be applied on the learned coordinates.

Identifiability.

Nonlinear ICA is unidentifiable without additional structure [56, 57].

Identifiability becomes possible when the world provides structure that enables it.

Examples include non-stationarity [58], temporal dependence [59, 60], auxiliary variables [61, 62], contrastive learning [63], augmentation [64], mechanism sparsity [65], interventions [66, 67], and supervision [68, 69].

These results typically restrict the representation to a smooth diffeomorphism.

In contrast, our Hermite approach works for arbitrary measurable maps.

The most closely related work is slow feature analysis (SFA) [70], which JEPA objectives recover empirically [71]. App. F compares the differences from SFA theory [72] in depth.

Similar spectral analysis in representation space has been used before to characterize other SSL objectives [73], and our Thm. 3 connects to recent quantitative stability work [74, 75].

The common lesson is that identifiability is always a joint statement about the World, i.e. the data generating process, and the Learner, i.e. the learning objective, here LeJEPA.

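A basis is the set of building blocks used to express a complex object. Just as a vector is written as a combination of coordinate axes such as $e_1, e_2$, a function can be written as a combination of basic functions. For functions of a Gaussian variable, the natural building blocks are the Hermite polynomial basis. Just as Fourier analysis splits a complex signal into sine/cosine components, a Hermite decomposition splits a function such as $h(z)$ into components by degree: linear, quadratic, cubic, and so on. The paper uses this decomposition to analyze how much nonlinear content the LeJEPA representation contains. Under the Gaussian OU transition, the linear component is preserved best between positive pairs, and nonlinear components decay faster. The alignment loss therefore penalizes nonlinear components, and the optimum ends up being a linear map such as $h(z) = Qz$.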
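The claim that higher-degree Hermite components decay faster can be checked with a Monte Carlo sketch. It assumes a 1-D latent, the OU transition $z' = \rho z + \sqrt{1-\rho^2}\,\eta$ from the review's Theorem 1 setup, and $\rho = 0.8$ as an arbitrary choice. For Gaussian pairs with correlation $\rho$, the unit-variance degree-$k$ Hermite feature has correlation $\rho^k$ across the pair (a standard property of Hermite polynomials), so its alignment loss is $2(1 - \rho^k)$.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(0)
n, rho = 1_000_000, 0.8   # rho is an assumed value for illustration

# OU positive pairs: z' = rho * z + sqrt(1 - rho^2) * eta
z = rng.standard_normal(n)
eta = rng.standard_normal(n)
z_pos = rho * z + np.sqrt(1 - rho**2) * eta

def hermite_feature(x, k):
    """Degree-k probabilists' Hermite polynomial He_k, scaled to unit variance."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return hermeval(x, coeffs) / np.sqrt(factorial(k))

correlations = {}
for k in (1, 2, 3):
    corr = float(np.mean(hermite_feature(z, k) * hermite_feature(z_pos, k)))
    correlations[k] = corr
    # Empirical correlation, theoretical rho^k, and implied alignment loss 2(1 - corr).
    print(k, round(corr, 3), round(rho**k, 3), round(2 * (1 - corr), 3))
```

The linear feature ($k = 1$) keeps the highest correlation and hence the lowest alignment loss, while quadratic and cubic features pay progressively more. This is the mechanism by which the alignment term disfavors nonlinear components.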

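Notes on the Identifiability paragraph

1. "Nonlinear ICA is unidentifiable without additional structure."

This sentence is the starting point of the identifiability paragraph.

ICA stands for Independent Component Analysis. Put simply, it assumes the observed data was produced by mixing several hidden cause variables and tries to separate those hidden variables again.

For example, suppose several people speak at the same time and the mixed sound enters one microphone. ICA is like trying to separate each speaker's voice from that mixed signal.

Latent variables are the hidden cause variables. In images they can be object position, color, lighting, pose, and so on; in robots they can be joint angle, velocity, and so on.

Linear ICA assumes the observed data is a linear mixture of the latent variables.

For example,

$$
x = Az
$$

means the observation $x$ is produced by multiplying the latent $z$ by a matrix $A$.

Nonlinear ICA instead assumes the observed data is a nonlinear transformation of the latent variables.

$$
x = g(z)
$$

If $g$ is a complex nonlinear function, recovering the original $z$ from $x$ alone is generally impossible,

because several different latent coordinate systems can explain the same observation distribution.

So from the visible data alone it is hard to pin down "the true latent is this one."

"Nonlinear ICA is unidentifiable without additional structure" therefore means:

When observations are a nonlinear transformation of the latents, recovering the original hidden variables from data alone is generally impossible. Some additional cue or assumption is needed for identifiability.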

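2. "Identifiability becomes possible when the world provides structure that enables it."

This sentence gives the direction of the solution.

If nonlinear ICA is generally impossible, how can it be made possible? The answer is that the world has to provide additional structure.

World here means the data generating process, i.e. the structure of how the data actually comes about.

For example, from a single image it can be hard to tell how object position and camera angle are mixed. In a video that moves over time, however, cues appear: position changes continuously while object identity stays relatively fixed.

So the latents can be recovered better if the world provides structure such as:

how time flows
how the environment changes
some variables changing slowly and others quickly
interventions that change only specific variables
labels or auxiliary information

The key point of this sentence:

To recover latent variables, a clever learner is not enough. The world that generates the data must itself have structure that helps identification.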

3. "Examples include non-stationarity, temporal dependence, auxiliary variables, contrastive learning, augmentation, mechanism sparsity, interventions, and supervision."

This sentence lists the kinds of additional structure that make identifiability possible. One by one:

non-stationarity

Non-stationarity means the data distribution changes with time or conditions.

For example, data may be brightly lit during the day and dark at night. Or certain objects may appear often in one time window and others in another.

A changing distribution can actually be a cue for separating latents, because one can observe how each latent factor changes under each condition.

So non-stationarity is not just inconvenient noise; it can be extra information that helps tell latent factors apart.

temporal dependence

Temporal dependence means samples close in time are related.

Consecutive frames in a video are not fully independent. An object's position does not jump randomly; it usually moves continuously.

This temporal continuity is a cue for identifying latents.

For example, object identity persists across many frames while position changes a little at a time. Such patterns help separate which factor is identity and which is motion.

auxiliary variables

Auxiliary variables are side information given in addition to the observed data.

Examples of auxiliary variables:

time index
domain label
environment label
weak information resembling a class label
camera information
action information

Such side information helps separate latent variables.

For example, with auxiliary information that the same object is seen under several lighting conditions, object identity and the lighting factor become easier to separate.

contrastive learning

Contrastive learning trains with positive and negative pairs.

For example, two augmentations of the same image form a positive pair, and samples from different images are negatives.

Pulling positives together and pushing negatives apart is a cue that keeps the representation from collapsing and makes it hold useful factors.

So contrastive learning also supplies structure about "what is the same and what is different," through the world or through how the data is constructed.

augmentation

Augmentation means artificially transforming the data.

For example, crop, color jitter, or blur can be applied to an image.

Augmentation tells the model "this kind of change does not alter the essential meaning."

For example, if the object identity is considered unchanged under color jitter, the model can learn a representation that is less sensitive to color changes and holds object identity more stably.

So augmentation is structure that tells the model which factors to treat as invariant and which can be ignored.

mechanism sparsity

Mechanism sparsity is the assumption that the causal mechanisms making up the world are sparse.

Put simply, a change in one cause does not change everything at once; it selectively changes only some factors.

For example, changing the lighting changes the lighting factor but not the object shape. Moving the object changes the position but not the object identity.

When changes happen sparsely like this, latent factors are easier to separate.

interventions

An intervention directly acts on a variable to change it.

For example, giving a robot a particular action changes a joint angle. Or an experimenter can change only the lighting and leave the object as is.

With interventions one can observe "which latent factor this action changes," which helps identifiability a lot.

For example, if an action changes only position and leaves color untouched, the position and color factors are easier to separate.

supervision

Supervision uses labels or ground-truth information.

It is the most direct kind of structure. With object class labels, position labels, or segmentation labels, latent factors can be identified far more easily.

This paper, however, is interested in the self-supervised setting, so it asks under which conditions identifiability is possible without full supervision.

The key point of this sentence:

Recovering nonlinearly mixed latents from data alone is hard; additional cues such as temporal structure, augmentation, intervention, labels, or contrastive pairs are needed. Prior identifiability work has secured identifiability by exploiting these kinds of structure.

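4. "These results typically restrict the representation to a smooth diffeomorphism."

This sentence describes the assumption made by existing identifiability theories.

A smooth diffeomorphism is, put simply, a "smooth and invertible transformation."

More precisely, it is a function with the following properties.

smooth
It does not bend abruptly or jump discontinuously.
invertible
Inputs and outputs correspond one to one, so the input can be recovered from the output.
smooth inverse
The inverse function is smooth as well.

So prior work often does not treat the representation function as arbitrary; it assumes a function with fairly nice properties.

For example, even if

$$
h(z)
$$

transforms $z$ in a complicated way, it is at least taken to be a smooth, invertible change of coordinates.

This assumption is convenient for mathematical analysis. But there is no guarantee that the representation a neural network actually learns is always a smooth diffeomorphism.

The key point of this sentence:

Existing identifiability results often assume the representation is a nice function that is smooth and invertible. This is convenient for analysis but can be a rather strong restriction for real neural representations.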

5. "In contrast, our Hermite approach works for arbitrary measurable maps."

This sentence claims that the paper's approach handles a more general function class than prior work.

The Hermite approach analyzes functions under a Gaussian distribution by decomposing them in the Hermite polynomial basis.

In a Gaussian world, an arbitrary function can be split into components of different degrees.

linear component
quadratic component
cubic component
higher-order nonlinear component

The paper uses this decomposition to show that nonlinear components are at a disadvantage under the alignment objective.

Arbitrary measurable maps form a very broad function class.

Measurable can be read as the minimal condition under which probabilities and expectations are defined. The function need not be smooth or invertible.

So the paper does not assume $h$ must be a smooth diffeomorphism. Whatever complicated measurable function the neural network learns, the paper analyzes what the optimum looks like under the LeJEPA objective.

The key point of this sentence:

Where prior work often assumed a smooth, invertible structure for the representation map, this paper claims that in a Gaussian world, Hermite decomposition lets it handle a much broader function class, namely arbitrary measurable maps.

6. "The most closely related work is slow feature analysis (SFA), which JEPA objectives recover empirically."

This sentence says the paper's idea is related to SFA.

Slow Feature Analysis (SFA) is a method for finding features that change slowly over time.

In a video, for example, pixels can change quickly. Lighting, texture, and background noise can change from frame to frame. But important information such as object identity or scene structure changes relatively slowly.

The idea behind SFA is:

Features that change slowly over time are likely to be important underlying factors.

The JEPA objective also keeps the representation similar across a positive pair. For example, aligning the representations of two nearby frames favors features that stay stable over fast-changing noise.

That is why the JEPA objective is said to empirically recover slow features similar to those of SFA.

The key point of this sentence:

JEPA's alignment objective is related to SFA in that it prefers features that stay unchanged or predictable across positive pairs. The paper aims to explain this connection more rigorously from the viewpoint of a Gaussian world and spectral analysis.

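7. "App. F compares the differences from SFA theory in depth."

This sentence says the appendix covers the differences between SFA and this paper in detail.

SFA and this paper are similar in that both look at "features that are stable across positive pairs."

But there are differences.

SFA is traditionally a method for finding features that change slowly in time, and it often imposes specific smoothness or variance constraints.

This paper analyzes the LeJEPA objective, i.e. alignment plus Gaussian regularization, and addresses whether linear identifiability holds in a Gaussian latent world.

So the two are related but not the same theory.

The key point of this sentence:

The paper's spectral analysis is related to SFA, but it is distinguished from existing SFA theory in that it deals with LeJEPA's Gaussian regularization and linear identifiability.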

8. "Similar spectral analysis in representation space has been used before to characterize other SSL objectives, and our Thm. 3 connects to recent quantitative stability work."

This sentence makes two points.

First, spectral analysis is not a tool that appears only in this paper.

Spectral analysis studies an operator or objective in terms of its eigenfunctions and eigenvalues. In representation learning, earlier work has used it to analyze which features an SSL objective prefers.

This paper similarly uses the transition operator and Hermite decomposition to analyze which representation the LeJEPA objective makes optimal.

Second, Theorem 3 connects to quantitative stability work.

Theorem 3 is a result on approximate identifiability. Even when the objective is not satisfied perfectly, if the alignment gap and whitening error are small, the recovery error is bounded to be small as well.

Quantitative stability can be read as the study of how stable a result remains, in quantitative terms, when conditions deviate slightly from the ideal.

So rather than only giving an exact theorem, it bounds the answer to

If the loss gets a little worse, how much worse does recovery get?

The key point of this sentence:

The paper connects to the existing line of SSL spectral analysis, and its approximate identifiability bound relates to recent stability analysis. It analyzes how much of the guarantee survives not only at the exact optimum but also in situations with small errors, as in real training.

9. "The common lesson is that identifiability is always a joint statement about the World, i.e. the data generating process, and the Learner, i.e. the learning objective, here LeJEPA."

This sentence is the conclusion of the paragraph.

The point is that identifiability is not determined by one side alone.

Whether latent variables can be recovered depends on the following two things fitting together.

World

The World is the data generating process, i.e. the assumptions about how the data is produced.

On the world side, the paper assumes:

the latent variables are independent
the transition is stationary
the transition has additive-noise form
the latent distribution is Gaussian

Learner

The Learner is the training objective, i.e. what the model optimizes.

On the learner side, the paper uses the LeJEPA objective.

align positive pairs
regularize the embedding distribution toward a Gaussian

Identifiability is possible when these two fit together.

For example, even if the world is Gaussian, the latents may not be recovered if the learner does not use a suitable objective.

Conversely, even if the learner uses the LeJEPA objective, the same guarantee can break if the world is non-Gaussian or the transition assumption does not hold.

So the paper's central message is:

Identifiability is determined neither by the properties of the data alone nor by the model objective alone. The world that generated the data and the learner objective that is used must fit together.

Summary of the whole paragraph

The overall flow of the Identifiability paragraph is:

In nonlinear ICA, recovering the latent variables is generally impossible.
To obtain identifiability, the world must therefore provide additional structure.
Prior work secured identifiability with structures such as non-stationarity, temporal dependence, augmentation, intervention, and supervision.
But many existing theories restrict the representation map to a nice function class such as smooth diffeomorphisms.
This paper uses Hermite decomposition in a Gaussian world to cover arbitrary measurable maps.
The approach is related to slow feature analysis, since JEPA also prefers features that are stable across positive pairs.
It also connects to the existing lines of spectral analysis and approximate stability analysis.
The final message of the paragraph is that identifiability is a joint statement about the World and the Learner.
The paper takes Gaussian latents and a stationary additive-noise transition as the World-side conditions and the LeJEPA objective as the Learner-side condition, and claims that linear identifiability holds when the two are combined.

3. The World and the Learner

What does it mean to Learn the World Model?

Assume the world has latent variables $z \in \mathbb{R}^n$.

Examples are position, velocity, color, and lighting.

These degrees of freedom are called latent variables / sources in the ICA literature and factors of variation [76] in representation learning.

We do not observe $z$ directly.

Instead, an unknown process $g$ generates the data we see.

That is, we set $x = g(z)$.

$g$ can be thought of as rendering a 3D scene into an image, or mapping physical states to sensor readings.

The process $g$ can be highly nonlinear and scrambles the clean latent structure into complex, entangled observations.

We learn a representation $f$ that maps observations back to representations.

That is, $y = f(x)$.

The ideal outcome is that $f$ undoes $g$.

That is, the composed map $h = f \circ g$ should recover the original latent variables $z$.

Of course, perfect recovery is too much to expect,

because there are symmetries that cannot be resolved, such as the rotation invariance of the Gaussian.

We will show that $h$ must be a linear function of the true latents.

That is, $h(z) = Qz$.

This is a necessary condition for linear probes to work.

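This part sets up the paper's problem.
The core is the following sentence.

The data $x$ we see is the result of the true world state $z$ being transformed in a complicated way, and the model $f$ turns that data $x$ back into a representation in an attempt to recover the original $z$.

Drawn as a diagram, the flow is:

true world state z
    ↓  g: observation / rendering process
data we see x
    ↓  f: learned encoder
model's representation y

In formulas:

$$
x = g(z), \quad y = f(x)
$$

Combining the two:

$$
y = f(g(z))
$$

The paper calls this whole mapping

$$
h = f \circ g
$$

That is,

$$
h(z) = f(g(z))
$$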

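1. What "the world has latent variables $z$" means

We assume the world has true state variables that we would like to see directly.

For a robot, for example:

z = [joint angle, velocity, target position]

For an image:

z = [object position, color, pose, lighting]

These are the latent variables.

Latent means hidden.
They exist in reality but we cannot observe them directly.

For example, looking at a single image we see the pixels, but the exact 3D position, lighting value, and object pose behind them are not visible as numbers.

2. What "we do not observe $z$ directly" means

The model usually does not receive the true latent $z$ directly.

For example, even if the robot's actual state is its joint angles, in a pixel-based setting the model does not receive the joint-angle numbers; it can only see camera images.

So what the model sees is $x$, not $z$.

What the model sees directly: image x
What the model really wants to know: latent state z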

3. What "an unknown process $g$ generates the data" means

$g$ is the process that turns the true latent $z$ into the data $x$ we can see.

$$
x = g(z)
$$

For example:

3D object position, color, lighting
→ rendering process
→ 2D image

Or for a robot:

joint angle, velocity
→ camera / sensor process
→ image observation

Here $g$ is a complicated process that we do not know exactly.
Camera, lighting, rendering, and sensor noise can all be mixed in.

So $g$ is usually highly nonlinear.

4. What "$g$ scrambles the latent structure into entangled observations" means

The true latent $z$ can be relatively clean.

For example, it separates into meaningful parts such as:

position
color
lighting
velocity

In the image $x$, however, this information is mixed in a complicated way.

For example, the value of a single pixel is not determined by object color alone.

pixel value = object color + lighting + shadow + camera angle + texture + background ...

So in the image, position, color, lighting, and pose are all mixed together.

That is why $x = g(z)$ is viewed as the "clean latent $z$" turned into a "complicated, scrambled observation $x$."

5. What "we learn a representation $f$" means

The model takes the observation $x$ as input and outputs a representation $y$.

$$
y = f(x)
$$

Here $f$ can be read as the encoder.

For example:

image x
→ ViT / CNN encoder f
→ feature vector y

This $y$ is the representation the model has learned.

6. What "the ideal outcome is that $f$ undoes $g$" means

$g$ is the process that scrambles the latent $z$ into the image $x$.

$$
z \xrightarrow{g} x
$$

A good encoder $f$ should then reverse this.

$$
x \xrightarrow{f} y
$$

So overall:

$$
z \xrightarrow{g} x \xrightarrow{f} y
$$

It is good if $y$ is equal or nearly equal to the original $z$.

That is why the paper says $f$ should undo $g$.

Put simply:

The world has twisted the latents into an image, and the encoder has to untangle the latent structure from that image.

7. What the "composed map $h = f \circ g$" means

$h = f \circ g$ means applying $g$ first and then $f$.

That is,

$$
h(z) = f(g(z))
$$

The flow is:

z
→ g(z) = x
→ f(x) = y

So we can write

h(z) = y

That is, $h$ is the "whole function that starts from the true latent $z$ and ends at the final representation $y$."

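8. What "perfect recovery is hard" means

In theory the best case would be

$$
h(z) = z
$$

That is, the model representation would be exactly equal to the original latent.

But there are ambiguities that cannot be resolved in practice, or even mathematically.

The standard example is the rotation invariance of the Gaussian.

If

$$
z \sim \mathcal{N}(0, I)
$$

then rotating $z$ leaves the distribution unchanged.

$$
Qz \sim \mathcal{N}(0, I)
$$

So from data alone one cannot tell whether the original coordinate system was $z$ or the rotated $Qz$.

The paper therefore does not demand exactly $h(z) = z$ and allows a rotation or reflection.

9. So the target becomes $h(z) = Qz$

The target the paper sets out to prove is:

$$
h(z) = Qz
$$

Here $Q$ is an orthogonal matrix.

So $h(z)$ does not have to equal the original latent $z$ exactly.
But it may differ only by a rotation or reflection.

This is acceptable because distances and angles are preserved.

The geometry of the latent space is not damaged.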

10. How this connects to linear probes

A linear probe attaches a simple linear model on top of the representation $y$ to read out the latents or labels.

For example:

$$
\hat{z} = Wy + b
$$

If

$$
y = h(z) = Qz
$$

then $y$ is a linear transformation of $z$,
and a linear probe can read $z$ back out.

But if the representation has a nonlinear twist such as

$$
h(z) = \sin(z) + z^2
$$

then recovering the original $z$ exactly with a linear probe is hard.

That is why the paper says $h(z) = Qz$ is a necessary condition for linear probes to work well.
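The contrast between the two cases can be reproduced with an ordinary least-squares probe. This is a minimal sketch: the 3-dimensional Gaussian latent, the random orthogonal $Q$, and the helper name `probe_r2` are illustrative assumptions, and the warped representation is the $\sin(z) + z^2$ example from the text applied elementwise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 5000, 3

z = rng.standard_normal((n_samples, dim))              # true latents
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))   # random orthogonal matrix

y_linear = z @ Q.T               # h(z) = Qz
y_warped = np.sin(z) + z ** 2    # nonlinear example from the text

def probe_r2(y, target):
    """Fit z_hat = W y + b by least squares and return the overall R^2."""
    design = np.hstack([y, np.ones((len(y), 1))])   # append a bias column
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ weights
    return float(1.0 - residual.var() / target.var())

r2_linear = probe_r2(y_linear, z)
r2_warped = probe_r2(y_warped, z)
print("R^2 with h(z) = Qz:", r2_linear)
print("R^2 with h(z) = sin(z) + z^2:", r2_warped)
```

With $h(z) = Qz$ the probe only has to learn $W = Q^\top$, so recovery is exact up to numerical error. With the warped representation, much of the information about $z$ (for instance its sign) is not linearly accessible, and the fit stays poor.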

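The whole thing in plain terms

This paragraph can be understood as follows.

The world has true latent variables $z$ such as object position, velocity, color, and lighting. We do not see $z$ directly; we only see the images or sensor values $x$ that $z$ produces through a complicated observation process $g$. The model's encoder $f$ turns the observation $x$ into a representation $y$. Ideally the encoder reverses the observation process $g$ so that the original latent $z$ is recovered inside $y$. Written as one mapping, this is $h(z) = f(g(z))$. With Gaussian latents, ambiguities such as rotation or reflection cannot be resolved, so the paper does not require exactly $h(z) = z$ and allows $h(z) = Qz$. In other words, it is enough for the learned representation to recover the true latents up to such a linear transformation. This property is linear identifiability, and it is the key condition for using the representation for linear probes or planning.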

3.1 The World

The World specifies a joint distribution $p(z, z')$ over positive pairs $(g(z), g(z'))$.

In SSL these are two views of the same underlying content.

Examples are two frames of a video, two augmentations of an image, or two adjacent time steps of a trajectory.

A broad class of worlds is defined by three assumptions.

Assumptions (World)

(i) Independence. For all $i \neq j$, $p(z_i) \perp p(z_j)$ and the transitions satisfy $p(z'_i \mid z_i) \perp p(z'_j \mid z_j)$.

(ii) Stationarity. The two views share the same marginal.

That is, $p(z) = p(z')$.

(iii) Additive noise. $z'_i = m_i(z_i) + \eta_i$, where $\eta_i$ is independent of $z_i$.

Independent latent variables are a standard assumption shared by ICA and disentangled representation learning [11, 76].

Stationarity means the generative process does not change between views.

Additive noise is the simplest perturbation model.

That is, random fluctuations are added on top of a deterministic signal, as with sensor noise, Brownian motion, or data augmentation through jitter.

Together, these assumptions define a general class of World Models.

Our forward result (Sec. 5.1) will specialize this class by choosing a specific latent variable distribution, the Gaussian.

Our converse result (Sec. 5.2) shows that the Gaussian is the only choice in this class of worlds that yields linear identifiability.
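The three assumptions can be made concrete by sampling from one world in this class. The sketch below uses the Gaussian OU transition from the review's Theorem 1 setup, i.e. $m_i(z_i) = \rho z_i$ with noise scaled by $\sqrt{1-\rho^2}$; the dimension, sample size, and $\rho = 0.9$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim, rho = 100_000, 3, 0.9

# (i) Independence: every latent coordinate is drawn independently.
z = rng.standard_normal((n_samples, dim))

# (iii) Additive noise: z'_i = m_i(z_i) + eta_i with m_i(z_i) = rho * z_i,
# where eta_i is drawn independently of z_i and of the other coordinates.
eta = np.sqrt(1 - rho**2) * rng.standard_normal((n_samples, dim))
z_pos = rho * z + eta

# (ii) Stationarity: z' should have the same N(0, I) marginal as z.
cov_pos = np.cov(z_pos, rowvar=False)

# Positive pairs stay correlated coordinate by coordinate.
pair_corr = np.array(
    [np.corrcoef(z[:, i], z_pos[:, i])[0, 1] for i in range(dim)]
)

print("cov(z'):\n", cov_pos.round(3))
print("corr(z_i, z'_i):", pair_corr.round(3))
```

The noise scale $\sqrt{1-\rho^2}$ is what makes the transition stationary here: the variance shrunk by $\rho^2$ is exactly replaced by the added noise, so $z'$ keeps unit variance.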

This part defines "what kind of world the paper assumes."
Where the previous part explained the flow $z \rightarrow x = g(z) \rightarrow h(z)$, this part fixes the relationship between the positive pair $z, z'$.

The core is this.

LeJEPA/SSL trains on related pairs, i.e. $x$ and $x'$.
For a theoretical analysis, though, one has to specify how the latents $z$ and $z'$ behind $x$ and $x'$ are related.
The paper defines that relationship through three assumptions: independence, stationarity, and additive noise.

1. "The World specifies a joint distribution $p(z, z')$ over positive pairs $(g(z), g(z'))$."

여기서 말이 어려운데, 쉽게 풀면 이거임.

SSL에서는 두 개의 related view를 사용함.

예를 들어:

같은 이미지의 augmentation 두 개
비디오에서 가까운 두 frame
로봇 trajectory에서 인접한 두 state

이 두 view를 보통 positive pair라고 부름.

In the paper's notation the true latents are $z$ and $z'$, and the observations we actually see are

\[
x = g(z), \qquad x' = g(z')
\]

so at the observation level the positive pair is

\[
(g(z), g(z')).
\]

The positive pair is not drawn arbitrarily: $z$ and $z'$ are sampled with a specific probabilistic relationship, which is expressed by

\[
p(z, z').
\]

That is, $p(z, z')$ is the joint distribution describing how the current latent $z$ and its related latent $z'$ are distributed together.

쉽게 말하면:

이 world에서는 어떤 (z)와 어떤 (z')가 positive pair로 묶이는지를 (p(z,z'))가 정함.

2. “SSL에서 이들은 같은 underlying content의 두 views임.”

이 말은 (z)와 (z')가 완전히 무관한 두 sample이 아니라는 뜻임.

예를 들어 같은 고양이 이미지에 augmentation을 두 번 적용하면, 두 이미지는 pixel은 다를 수 있지만 underlying content는 같음.

view 1: crop된 고양이 이미지
view 2: color jitter된 고양이 이미지

둘은 다르게 보일 수 있지만, 둘 다 같은 고양이라는 content를 공유함.

비디오에서는:

frame t
frame t+1

두 frame은 약간 다르지만 같은 scene이나 같은 object motion을 공유함.

trajectory에서는:

state t
state t+1

두 state는 약간 움직였지만 같은 system의 연속된 상태임.

이런 것들이 positive pair임.

3. “넓은 종류의 worlds는 세 가지 assumptions로 정의됨.”

이제 논문은 (z)와 (z')가 어떤 관계를 가져야 하는지 세 가지 조건을 둠.

세 가지는 다음임.

Independence
Stationarity
Additive noise

이 셋은 논문에서 “우리가 다룰 world class”를 정의하는 조건임.

Assumption 1. Independence

Original statement:

\[
p(z_i) \perp p(z_j), \qquad p(z'_i \mid z_i) \perp p(z'_j \mid z_j)
\]

for all $i \neq j$.

4. Independence가 무슨 뜻인가?

(z)는 여러 latent factor로 이루어진 vector임.

예를 들어:

[
z = [z_1, z_2, z_3]
]

라고 하자.

각각이 다음 의미를 가질 수 있음.

z_1 = object position
z_2 = color
z_3 = lighting

Independence는 이 latent factors들이 서로 독립이라고 가정하는 것임.

즉,

position이 어떤 값이라고 해서 color가 자동으로 정해지는 것이 아니고,
color가 어떤 값이라고 해서 lighting이 자동으로 정해지는 것이 아니다

라는 뜻임.

예를 들어 object position과 color가 독립이면, 물체가 왼쪽에 있다고 해서 반드시 빨간색일 필요는 없음.

5. transition independence는 또 뭐냐?

두 번째 조건은 transition도 factor별로 독립이라는 뜻임.

즉, (z_i)가 (z'_i)로 변하는 과정과, (z_j)가 (z'_j)로 변하는 과정이 서로 독립이라는 말임.

예를 들어:

position 변화는 position 자체에 의해 결정됨
color 변화는 color 자체에 의해 결정됨

그리고 position 변화가 color 변화에 직접 영향을 주지 않는다고 보는 것임.

수식으로는:

[
p(z'_i \mid z_i) \perp p(z'_j \mid z_j)
]

임.

쉽게 말하면:

각 latent dimension은 자기 나름대로 independently evolve한다고 가정함.

이건 ICA나 disentangled representation learning에서 자주 쓰는 표준 가정임. 왜냐하면 disentanglement 자체가 보통 “서로 독립적인 factor들을 분리한다”는 생각에서 출발하기 때문임.

Assumption 2. Stationarity

원문:

[
p(z) = p(z')
]

6. Stationarity가 무슨 뜻인가?

Stationarity는 두 view가 같은 marginal distribution을 가진다는 뜻임.

즉, (z)와 (z')가 개별적으로는 다를 수 있지만, 전체 분포는 같아야 함.

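The important point is that this does not mean

\[
z = z'.
\]

It means

\[
p(z) = p(z').
\]

In other words, the current and next states need not take the same value. What must match is the distribution of current states collected over many samples and the distribution of next states collected over many samples.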

7. 예시로 이해하기

예를 들어 (z)가 다음 분포를 따른다고 하자.

[
z \sim \mathcal{N}(0, I)
]

transition 이후 (z')도 다음을 따른다면 stationary임.

[
z' \sim \mathcal{N}(0, I)
]

개별 sample은 움직일 수 있음.

예를 들어:

[
z = 0.2,\quad z' = 0.4
]

또 다른 sample은:

[
z = -0.7,\quad z' = -0.5
]

처럼 바뀔 수 있음.

하지만 전체적으로 모아보면 (z)도 (\mathcal{N}(0,I)), (z')도 (\mathcal{N}(0,I))이면 stationarity임.

8. Stationarity가 깨지는 경우

처음에는 평균 0 근처에 있었는데, transition 이후 모두 오른쪽으로 이동한다고 하자.

[
z \sim \mathcal{N}(0, I)
]

[
z' \sim \mathcal{N}(3, I)
]

이 경우는 stationary하지 않음.
왜냐하면 (z)와 (z')의 분포가 달라졌기 때문임.

즉, stationarity는 “세계가 view 사이에서 같은 방식으로 유지된다”는 가정임.

Assumption 3. Additive noise

Original statement:

\[
z'_i = m_i(z_i) + \eta_i
\]

where $\eta_i$ is independent of $z_i$.

9. Additive noise가 무슨 뜻인가?

Additive noise는 다음 상태가 두 부분으로 만들어진다는 뜻임.

현재 상태에서 deterministic하게 정해지는 부분
random noise

As an equation:

\[
z'_i = m_i(z_i) + \eta_i
\]

where

$z_i$: the current latent factor
$z'_i$: the latent factor of the next or related view
$m_i(z_i)$: the deterministic change predictable from the current state
$\eta_i$: random noise added on top

So $z'_i$ is not drawn completely at random given $z_i$; it is a change driven by $z_i$ with noise added.

10. 예시

예를 들어 로봇 joint angle이 있다고 하자.

Current angle:

\[
z_i = 30^\circ
\]

The next angle will be roughly near the current one:

\[
m_i(z_i) = 32^\circ
\]

Sensor noise or a small wobble then adds noise:

\[
\eta_i = 0.5^\circ
\]

so that

\[
z'_i = 32^\circ + 0.5^\circ = 32.5^\circ.
\]

이미지 augmentation에서도 비슷하게 생각할 수 있음. 원래 position이나 color에 jitter noise가 더해져 약간 변형된 view가 만들어짐.

11. 왜 additive noise가 “가장 단순한 perturbation model”인가?

Perturbation은 원래 상태를 약간 흔드는 변화를 말함.

Additive noise는 그중 가장 단순한 형태임.

새 상태 = 기존 상태의 변화 + noise

예를 들어:

sensor noise
Brownian motion
jitter augmentation
measurement error

모두 이런 additive noise로 생각할 수 있음.

12. “이 assumptions를 함께 사용하면 일반적인 World Models의 class가 정의됨.”

이 말은 논문이 앞으로 다룰 world의 범위를 정했다는 뜻임.

아무 world나 다루는 것이 아니라, 다음 조건을 만족하는 world만 다룸.

latent factor들이 서로 independent함
transition 이후에도 전체 분포가 유지됨
다음 상태는 현재 상태의 함수 + 독립 noise로 생성됨

이 조건들을 만족하는 world class 안에서 LeJEPA가 latent를 복원할 수 있는지 분석하는 것임.

13. “Forward result는 Gaussian을 선택함으로써 이 class를 specialize함.”

여기서 forward result는 Theorem 1을 말함.

앞에서 정의한 world class는 아직 latent distribution이 정확히 무엇인지 정하지 않았음.

즉, (z)가 Gaussian일 수도 있고, Laplace일 수도 있고, uniform일 수도 있음.

Forward result에서는 이 중에서 특정하게 Gaussian을 선택함.

[
z \sim \mathcal{N}(0,I)
]

Under that choice, LeJEPA is shown to achieve linear identifiability. That is, the paper proves

\[
\text{Gaussian world} \;\Rightarrow\; h(z) = Qz.
\]

14. “Converse result는 Gaussian이 유일한 선택임을 보임.”

Converse result는 Theorem 2를 말함.

이는 반대 방향임.

That is, within this world class, if LeJEPA is to guarantee linear identifiability, the latent distribution must be Gaussian. Put differently, the direction is

\[
h(z) = Qz \text{ is guaranteed} \;\Rightarrow\; z \text{ is Gaussian}.
\]

그래서 이 논문은 단순히 “Gaussian이면 된다”가 아니라,

이 world class 안에서는 Gaussian만이 linear identifiability guarantee를 가능하게 한다

고 주장함.

전체를 쉽게 다시 쓰면

이 문단은 이렇게 이해하면 됨.

LeJEPA는 positive pair를 사용해서 학습함. 그래서 이론 분석을 하려면 positive pair 뒤에 있는 latent (z)와 (z')가 어떤 관계인지 정해야 함. 논문은 이를 위해 세 가지 world assumption을 둠. 첫째, latent factor들은 서로 independent함. 둘째, 두 view (z)와 (z')는 같은 marginal distribution을 가져야 함. 즉, 개별 상태는 변해도 전체 분포는 유지되어야 함. 셋째, (z')는 (z)의 deterministic 변화에 independent noise가 더해진 additive-noise 형태로 만들어짐. 이 세 조건은 논문이 다루는 world class를 정의함. 이후 Theorem 1에서는 이 world class 중 Gaussian latent를 선택하면 LeJEPA가 (h(z)=Qz)를 학습함을 보이고, Theorem 2에서는 이 class 안에서 그런 linear identifiability 보장을 주는 유일한 latent distribution이 Gaussian임을 보임.

3.1.1 The Gaussian World

이제 위 framework 안에서 specific distributional choice를 함.

우리는 Gaussian latents, 즉 \[ z \sim \mathcal{N}(0, I_n) \] 를 가정함.

이는 주어진 mean과 covariance에 대해 maximum-entropy distribution임 [23].

즉, 가능한 한 적은 structure를 가정함.

또한 task-relevant latents는 보통 많은 micro-variables의 aggregate이며, central limit theorem에 의해 Gaussianity를 향하는 경향이 있음.

Gaussian latents와 Assumptions (3.1)은 Gaussian transitions를 함의함.

Stationarity는 \[ z' \sim \mathcal{N}(0, I_n) \] 을 요구하며, distribution을 보존하는 Gaussian의 유일한 additive-noise perturbation은 Ornstein–Uhlenbeck (OU) transition [77, 78]임.

\[ z' = \rho z + \sqrt{1 - \rho^2}\eta, \quad \eta \sim \mathcal{N}(0, I_n), \quad \eta \perp z, \tag{1} \] 여기서 $\rho \in (0, 1)$는 views 사이의 correlation을 제어함.

다음을 확인할 수 있음.

\[ \mathbb{E}[z'] = 0, \qquad \mathrm{Var}(z') = \rho^2 I_n + (1 - \rho^2) I_n = I_n, \qquad \mathrm{Cov}(z, z') = \rho I_n \]

The Gaussian is the only distribution whose marginal is preserved by a channel of this form.

이는 Gaussian이 rescaling까지 허용했을 때 convolution의 fixed point라는 사실에서 따라옴.

Independence assumption도 만족됨.

$z$의 components가 independent이고 noise $\eta$가 diagonal covariance를 가지므로, 각 $(z_i, z'_i)$가 independent하게 evolve하기 때문임.

이 부분은 앞에서 정의한 world assumptions 안에서, 이제 latent distribution을 Gaussian으로 구체화하는 파트임.

핵심은 다음임.

앞에서는 (z)와 (z')가 independent, stationary, additive-noise transition을 따른다고만 했음.
이제 저자들은 그중에서도 (z)가 Gaussian이라고 선택함.
그러면 (z')도 같은 Gaussian 분포를 유지하려면 transition이 자연스럽게 OU transition 형태가 되어야 함.

1. “specific distributional choice를 한다”는 말

앞의 3.1에서는 world에 대한 일반 조건만 말했음.

즉, 아직 (z)가 어떤 분포인지는 정하지 않았음.

가능한 후보는 많음.

[
z \sim \text{Gaussian}
]

일 수도 있고,

[
z \sim \text{Uniform}
]

일 수도 있고,

[
z \sim \text{Laplace}
]

일 수도 있음.

그런데 3.1.1에서는 그중에서 Gaussian latent를 선택함.

[
z \sim \mathcal{N}(0, I_n)
]

즉, latent variable (z)가 평균 0, covariance identity인 (n)-dimensional Gaussian을 따른다고 가정함.

2. (z \sim \mathcal{N}(0,I_n))가 무슨 뜻인가

[
z \sim \mathcal{N}(0,I_n)
]

이 말은 (z)가 (n)-차원 표준 정규분포를 따른다는 뜻임.

여기서:

평균은 0임
각 latent dimension의 variance는 1임
서로 다른 latent dimension들은 correlation이 없음
모든 방향으로 균일하게 퍼져 있음

즉, isotropic Gaussian임.

예를 들어 (z)가 2차원이라면, 점들이 원형으로 퍼진 Gaussian cloud처럼 생각할 수 있음.

3. “maximum-entropy distribution”이라는 말

논문은 Gaussian을 선택한 이유 중 하나로 maximum entropy를 말함.

이 말은 다음 뜻임.

평균과 covariance만 정해졌을 때, Gaussian은 그 외에 가장 적은 추가 구조를 가정하는 분포임.

쉽게 말하면, 평균과 분산만 알고 있을 때 가장 “덜 편향된” 선택이 Gaussian이라는 뜻임.

예를 들어 평균 0, variance 1이라는 조건만 주어졌다고 하자.
그 조건을 만족하는 분포는 많음.

Gaussian
Uniform
Laplace
heavy-tailed distribution

그중 Gaussian은 entropy가 가장 큼.
즉, 특정한 모양이나 추가적인 structure를 가장 적게 넣은 분포로 볼 수 있음.

그래서 저자들은 Gaussian을 “least-assumption prior”처럼 정당화함.

4. “task-relevant latents는 많은 micro-variables의 aggregate”라는 말

이 문장은 Gaussian을 선택하는 또 다른 직관적 이유임.

실제 task-relevant latent는 하나의 단순한 원인에서 오는 게 아니라, 많은 작은 요인들이 합쳐진 결과일 수 있음.

예를 들어 “object pose”나 “motion state”도 실제로는 여러 물리적 요인, 센서 요인, 환경 요인들이 합쳐진 결과일 수 있음.

이때 central limit theorem에 따르면, 많은 독립적인 작은 변수들이 합쳐지면 분포가 Gaussian에 가까워지는 경향이 있음.

그래서 저자들은 이렇게 말하는 것임.

실제 latent factor들이 완전히 Gaussian이라고 보장할 수는 없지만, task에 중요한 latent들이 많은 micro-variable의 aggregate라면 Gaussian에 가까워질 수 있다.

다만 이건 직관적 정당화이지, 실제 world latent가 반드시 Gaussian이라는 증거는 아님.

5. “Gaussian latents와 Assumptions는 Gaussian transitions를 함의함”이 무슨 말인가

앞에서 world assumption에는 stationarity가 있었음.

Stationarity는:

[
p(z)=p(z')
]

라는 뜻이었음.

이제 (z)를 Gaussian으로 선택했음.

[
z \sim \mathcal{N}(0,I_n)
]

그러면 stationarity 때문에 (z')도 같은 분포를 가져야 함.

[
z' \sim \mathcal{N}(0,I_n)
]

즉, (z)에서 (z')로 transition이 일어나도 전체 분포는 그대로 Gaussian이어야 함.

그래서 필요한 transition은 다음 조건을 만족해야 함.

(z')는 (z)와 관련되어 있어야 함
noise가 더해져야 함
하지만 전체 분포는 여전히 (\mathcal{N}(0,I_n))이어야 함

이 조건을 만족하는 대표적인 transition이 OU transition임.

6. OU transition이 뭔가

The paper defines $z'$ as

\[
z' = \rho z + \sqrt{1-\rho^2}\,\eta
\]

where

\[
\eta \sim \mathcal{N}(0, I_n)
\]

and $\eta$ is independent of $z$.

In plain terms, $z'$ keeps part of the current state $z$ and fills the rest with fresh Gaussian noise:

$\rho z$: the part retained from $z$
$\sqrt{1-\rho^2}\,\eta$: the newly injected noise

7. (\rho)는 무엇인가

[
\rho \in (0,1)
]

는 (z)와 (z')가 얼마나 비슷한지를 조절함.

(\rho)가 1에 가까우면

[
z' \approx z
]

임.

즉, 두 view가 매우 비슷함.
비디오로 치면 거의 연속된 frame처럼 볼 수 있음.

(\rho)가 0에 가까우면

[
z' \approx \eta
]

임.

즉, (z')는 (z)와 거의 관련이 없는 새로운 noise에 가까워짐.

그래서 (\rho)는 positive pair 사이의 correlation을 조절하는 값임.

8. 왜 (\sqrt{1-\rho^2})가 붙는가

이 부분이 중요함.

For $z'$ to keep unit variance, the variance contributed by $z$ and the variance contributed by the noise must sum to 1.

In

\[
z' = \rho z + \sqrt{1-\rho^2}\,\eta
\]

both $z$ and $\eta$ have unit variance and they are independent, so the variance of $z'$ is

\[
\mathrm{Var}(z') = \rho^2 I_n + (1-\rho^2) I_n = I_n.
\]

Using $\rho z$ alone would shrink the variance to $\rho^2$. Multiplying the noise by $\sqrt{1-\rho^2}$ lets it supply exactly the missing variance.

As a result, $z'$ still follows $\mathcal{N}(0, I_n)$.

9. Checking the math: $\mathbb{E}[z'] = 0$

Since

\[
z' = \rho z + \sqrt{1-\rho^2}\,\eta
\]

and

\[
\mathbb{E}[z] = 0, \quad \mathbb{E}[\eta] = 0,
\]

linearity of expectation gives

\[
\mathbb{E}[z'] = \rho\,\mathbb{E}[z] + \sqrt{1-\rho^2}\,\mathbb{E}[\eta] = 0.
\]

So the mean of $z'$ is also 0.

10. Checking the math: $\mathrm{Var}(z') = I_n$

\[
\mathrm{Var}(z') = \mathrm{Var}\left(\rho z + \sqrt{1-\rho^2}\,\eta\right)
\]

Because $z$ and $\eta$ are independent, the variances add:

\[
\mathrm{Var}(z') = \rho^2\,\mathrm{Var}(z) + (1-\rho^2)\,\mathrm{Var}(\eta)
\]

Both have covariance $I_n$, so

\[
\mathrm{Var}(z') = \rho^2 I_n + (1-\rho^2) I_n = I_n.
\]

So $z'$ has the same covariance as $z$.

11. Checking the math: $\mathrm{Cov}(z, z') = \rho I_n$

Since

\[
z' = \rho z + \sqrt{1-\rho^2}\,\eta,
\]

we have

\[
\mathrm{Cov}(z, z') = \mathrm{Cov}\left(z,\ \rho z + \sqrt{1-\rho^2}\,\eta\right) = \rho\,\mathrm{Cov}(z, z) + \sqrt{1-\rho^2}\,\mathrm{Cov}(z, \eta).
\]

Because $z$ and $\eta$ are independent,

\[
\mathrm{Cov}(z, \eta) = 0,
\]

and $\mathrm{Cov}(z, z) = I_n$, so

\[
\mathrm{Cov}(z, z') = \rho I_n.
\]

So the correlation between $z$ and $z'$ is controlled by $\rho$.
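The three moment checks above can also be verified numerically. The following is a minimal sketch using only the Python standard library; the function names, sample size, seed, and the choice $\rho = 0.8$ are illustrative assumptions, not anything taken from the paper.

```python
import math
import random


def ou_step(z, rho, rng):
    """One OU positive-pair transition, applied per component:
    z'_i = rho * z_i + sqrt(1 - rho^2) * eta_i, with eta_i ~ N(0, 1)."""
    scale = math.sqrt(1.0 - rho ** 2)
    return [rho * zi + scale * rng.gauss(0.0, 1.0) for zi in z]


def sample_pairs(num_pairs, dim, rho, seed=0):
    """Draw (z, z') pairs with z ~ N(0, I_dim) and z' from the OU step."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        pairs.append((z, ou_step(z, rho, rng)))
    return pairs


def empirical_moments(pairs, i=0):
    """Mean and variance of z'_i, and covariance between z_i and z'_i."""
    n = len(pairs)
    mean_z = sum(z[i] for z, _ in pairs) / n
    mean_zp = sum(zp[i] for _, zp in pairs) / n
    var_zp = sum((zp[i] - mean_zp) ** 2 for _, zp in pairs) / n
    cov = sum((z[i] - mean_z) * (zp[i] - mean_zp) for z, zp in pairs) / n
    return mean_zp, var_zp, cov


rho = 0.8  # illustrative choice
pairs = sample_pairs(num_pairs=50_000, dim=2, rho=rho)
mean_zp, var_zp, cov = empirical_moments(pairs)
print(f"mean={mean_zp:.3f}  var={var_zp:.3f}  cov={cov:.3f}")
```

Up to Monte Carlo error, the printed mean should be close to 0, the variance close to 1, and the covariance close to the chosen $\rho$.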

12. “marginal을 보존한다”는 말

여기서 marginal을 보존한다는 말은:

transition 이후에도 (z')의 분포가 (z)와 같게 유지된다는 뜻임.

즉,

[
z \sim \mathcal{N}(0,I_n)
]

이고 transition 후에도

[
z' \sim \mathcal{N}(0,I_n)
]

이면 marginal이 보존된 것임.

이게 stationarity와 연결됨.

13. “Gaussian이 convolution의 fixed point”라는 말

이 말은 조금 수학적인데, 직관적으로는 다음 의미임.

Gaussian에 independent Gaussian noise를 더하고 적절히 rescale하면, 결과도 다시 Gaussian이 됨.

예를 들어:

[
z \sim \mathcal{N}(0,I)
]

[
\eta \sim \mathcal{N}(0,I)
]

이면,

[
\rho z + \sqrt{1-\rho^2}\eta
]

도 다시 Gaussian임.

즉, Gaussian은 noise를 더하고 scale을 조절해도 Gaussian 형태를 유지함.

그래서 OU transition은 Gaussian distribution을 보존할 수 있음.

이걸 “Gaussian이 convolution under rescaling의 fixed point”라고 말하는 것임.

In short:

A Gaussian is a stable distribution: adding independent Gaussian noise and rescaling to match the variance yields a Gaussian again. This is why the OU transition arises naturally as the additive-noise transition that preserves stationarity.

14. Independence assumption도 왜 만족되는가

앞에서 world assumption 중 independence가 있었음.

Gaussian (z)는 (I_n) covariance를 가지므로 각 component가 independent임.

[
z = (z_1,\dots,z_n)
]

에서 (z_i)들이 서로 독립임.

또 noise (\eta)도

[
\eta \sim \mathcal{N}(0,I_n)
]

이므로 component들이 서로 독립임.

그리고 transition은 각 dimension별로:

[
z'_i = \rho z_i + \sqrt{1-\rho^2}\eta_i
]

처럼 작동함.

즉, (i)-번째 factor는 (z_i)와 (\eta_i)만 사용해서 (z'_i)가 됨.
다른 dimension (z_j)에는 의존하지 않음.

그래서 각 ((z_i,z'_i))가 independent하게 evolve함.

전체를 쉽게 다시 쓰면

이 파트는 다음처럼 이해하면 됨.

앞에서는 world가 independence, stationarity, additive noise를 만족한다고만 했음. 이제 논문은 그 world class 안에서 latent distribution을 Gaussian으로 선택함. 즉, (z\sim\mathcal{N}(0,I_n))라고 가정함. Gaussian은 평균과 covariance만 주어졌을 때 가장 적은 추가 구조를 넣는 maximum-entropy distribution이고, 많은 작은 요인의 합이 Gaussian에 가까워질 수 있다는 central limit theorem 관점에서도 자연스럽다고 설명함.

Stationarity 때문에 (z')도 같은 (\mathcal{N}(0,I_n))를 따라야 함. 이를 만족하면서 (z)와 (z')가 correlation (\rho)를 갖도록 만드는 transition이 (z'=\rho z+\sqrt{1-\rho^2}\eta)임. 여기서 (z)의 일부를 유지하고, 나머지는 independent Gaussian noise로 채움. 이렇게 하면 (z')의 평균은 0, covariance는 (I_n), (z)와 (z')의 covariance는 (\rho I_n)이 됨. 즉, positive pair는 서로 관련되어 있지만, 전체 분포는 그대로 Gaussian으로 유지됨. 이 Gaussian OU transition이 이후 Theorem 1에서 LeJEPA가 (h(z)=Qz)를 학습한다는 결과의 기본 setting이 됨.

3.2 The Learner: LeJEPA

Representation은 composed map \[ h = f \circ g : \mathbb{R}^n \rightarrow \mathbb{R}^n \] 으로 특징지어짐.

여기서 $g$는 world의 unknown generative function이고, $f$는 우리가 학습한 representation임.

LeJEPA [2] training의 두 components는 positive pairs를 서로 당기는 invariance loss와, collapse를 방지하기 위해 embedding distribution의 shape를 조정하는 regularizer임.

즉, $h$, 더 구체적으로 $h = f \circ g$에서의 $f$는 다음을 만족하도록 학습됨.

\[ \min_h \mathcal{L}(h) = \mathbb{E} \left[ \|h(z') - h(z)\|^2 \right] \quad \text{s.t.} \quad h(z) \sim \mathcal{N}(0, I_n) \tag{2} \] 여기서 첫 번째 항은 Alignment이고, constraint는 Gaussianity임.

우리는 SIGReg가 성공한 상황, 즉 $h(z)$가 target Gaussian과 match되는 상황을 model함.

실제로는 이것이 approximate하게 성립함 (Sec. 5.3).

Expectations가 정의되도록 하기 위해 measurability 외에는 $h$에 대해 아무것도 요구하지 않음.

따라서 이 결과는 어떤 neural network에도 적용되며, output dimension은 전체 논문에서 latent dimension $n$과 같다고 둠. Mismatched regimes $m \neq n$은 Sec. 7에서 논의됨.

또한 whitening, 즉 \[ \mathrm{Cov}(h(z)) = I_n \] 은 다음을 고정함.

\[ \mathbb{E}[\|h(z)\|^2] = \mathbb{E}[\|h(z')\|^2] = n \]

Therefore

\[ \mathcal{L}(h) = 2n - 2 \sum_{i=1}^{n} \mathbb{E}[h_i(z')h_i(z)] \tag{3} \]

따라서 distance를 minimize하는 것은 두 views 사이의 correlation을 maximize하는 것과 동치임.

질문은 다음과 같아짐.

$h(z) \sim \mathcal{N}(0, I_n)$를 만족하는 모든 measure-preserving maps $h$ 중에서, positive pairs $h(z), h(z')$ 사이의 correlation을 가장 높게 달성하는 것은 무엇인가?

3.2 The Learner: LeJEPA 설명

이 파트는 앞에서 정의한 World에 이어서, Learner, 즉 LeJEPA가 무엇을 학습하는지를 설명하는 부분임.

앞에서는 세계가 latent variable z를 가지고 있고, 우리가 관측하는 데이터는 x = g(z)처럼 생성된다고 했음.

여기서 g는 latent state를 observation으로 바꾸는 unknown generative process임. 예를 들어 3D scene을 image로 rendering하거나, physical state를 sensor reading으로 바꾸는 과정임.

이제 모델은 observation x를 입력받아 representation을 만듦. 이를 y = f(x)라고 쓸 수 있음.

여기서 f는 우리가 학습하는 encoder 또는 representation function임.

따라서 latent z에서 시작해서 최종 representation까지 가는 전체 mapping은 h = f composed with g라고 쓸 수 있음. 즉, h(z) = f(g(z))임.

흐름으로 쓰면 다음과 같음.

true latent z → observation process g → observation x → encoder f → representation h(z)

여기서 중요한 점은, 모델은 실제로 z를 직접 보지 않는다는 것임. 모델이 보는 것은 x = g(z)임. 하지만 이론 분석에서는 z에서 시작해서 h(z)까지 가는 전체 과정을 하나의 함수 h로 묶어서 봄.

따라서 이 논문에서 LeJEPA가 잘 학습된다는 것은, 결국 h(z)가 원래 latent z의 구조를 잘 보존한다는 뜻임.

1. LeJEPA training의 두 구성요소

LeJEPA training은 크게 두 가지 component로 이루어짐.

첫 번째는 positive pairs를 서로 가깝게 만드는 alignment 또는 invariance loss임.

Positive pair는 서로 관련된 두 view임. 예를 들어 같은 이미지의 두 augmentation, 비디오의 가까운 두 frame, trajectory에서 인접한 두 state가 positive pair가 될 수 있음.

latent level에서는 이 pair를 z와 z'라고 쓰고, representation level에서는 h(z)와 h(z')라고 쓸 수 있음.

LeJEPA는 이 둘이 가까워지도록 학습함. 즉, ||h(z') - h(z)||^2를 작게 만들고자 함.

이 값이 작아지면, related views의 representation이 서로 가까워진다는 뜻임.

두 번째는 collapse를 막기 위한 Gaussian regularization임.

Alignment loss만 사용하면 모든 input을 같은 vector로 보내는 collapse가 발생할 수 있음. 모든 sample의 representation이 같으면 positive pair는 항상 가까워지기 때문임.

그래서 LeJEPA는 embedding distribution이 isotropic Gaussian에 가까워지도록 regularization을 추가함. 즉, h(z) ~ N(0, I_n)이 되도록 유도함.

이는 representation들이 한 점으로 몰리지 않고, 평균 0과 covariance I_n을 갖는 Gaussian처럼 잘 퍼지도록 만드는 것임.

2. LeJEPA objective

논문에서는 LeJEPA objective를 다음처럼 볼 수 있음.

minimize over h: L(h) = E[||h(z') - h(z)||^2]

subject to: h(z) ~ N(0, I_n)

이 식은 두 가지를 동시에 말함.

첫 번째 항인 E[||h(z') - h(z)||^2]는 alignment loss임. 이는 positive pair의 representation이 평균적으로 얼마나 떨어져 있는지를 측정함.

여기서 E를 쓰는 이유는 positive pair가 하나만 있는 것이 아니라, data distribution에서 여러 positive pair가 계속 샘플링되기 때문임. 즉, 특정 pair 하나의 거리가 아니라 전체 positive pair에 대한 평균적인 거리를 최소화하는 것임.

이 값을 최소화한다는 것은 h(z)와 h(z')를 평균적으로 가깝게 만들겠다는 뜻임.

두 번째 조건인 h(z) ~ N(0, I_n)은 Gaussianity constraint임. 이는 전체 representation distribution이 isotropic Gaussian을 따르도록 만든다는 뜻임. 다시 말해 representation이 collapse되지 않고 전체 공간에 잘 퍼져 있어야 함.

따라서 LeJEPA objective는 쉽게 말하면, related views는 representation space에서 가까워야 하지만 전체 representation은 한 점에 몰리지 않고 Gaussian처럼 퍼져 있어야 한다는 의미임.

3. SIGReg가 성공한 상황을 model한다는 말

논문에서는 SIGReg가 성공했다고 가정함.

SIGReg는 Sketched Isotropic Gaussian Regularization의 약자이며, embedding distribution을 isotropic Gaussian에 가깝게 만드는 regularizer임.

즉, 논문은 h(z) ~ N(0, I_n) 조건이 잘 만족된다고 보는 것임.

실제 training에서는 이 조건이 완벽하게 성립하지는 않을 수 있음. 그래서 논문은 뒤의 Sec. 5.3에서 approximate case도 다룸. 즉, Gaussianity가 완벽하지 않고 alignment도 완벽하지 않을 때 recovery error가 얼마나 커지는지를 bound함.

하지만 이 파트에서는 먼저 이상적인 경우, 즉 SIGReg가 잘 작동해서 representation distribution이 target Gaussian과 match되는 상황을 분석함.

4. h에 대해 measurability 외에는 요구하지 않는다는 말

논문은 h에 대해 매우 강한 함수 조건을 두지 않음.

기존 identifiability 연구에서는 representation map이 smooth하거나 invertible해야 한다는 가정을 두는 경우가 많음. 하지만 이 논문은 h가 꼭 smooth하거나 invertible할 필요는 없다고 봄.

여기서 필요한 최소 조건은 measurability임.

Measurable하다는 것은 쉽게 말해, expectation 같은 확률적 계산이 가능할 정도의 함수라는 뜻임. 즉, E[||h(z') - h(z)||^2] 같은 값이 정의될 수 있으면 됨.

그래서 이 결과는 특정한 neural network architecture에만 적용되는 것이 아니라, expectation이 정의되는 넓은 class의 representation function에 대해 적용될 수 있다고 보는 것임.

다만 논문 전체에서는 output dimension이 latent dimension과 같다고 둠. 즉, h는 R^n에서 R^n으로 가는 함수임.

latent dimension과 representation dimension이 다른 경우, 즉 m != n인 경우는 Sec. 7의 limitation에서 논의됨.

5. Whitening이란 무엇인가?

논문은 Gaussianity 조건에서 특히 covariance가 identity가 되는 성질을 사용함.

Gaussianity 조건이 h(z) ~ N(0, I_n)이면, h(z)의 covariance는 Cov(h(z)) = I_n임.

이를 whitening이라고 부름.

Whitening은 representation의 각 dimension이 적절히 퍼져 있고, 서로 correlation이 없도록 만드는 것임.

쉽게 말하면 다음과 같음.

각 feature dimension의 variance가 1임.

서로 다른 feature dimension끼리 correlation이 없음.

representation이 특정 방향으로만 길게 늘어나 있지 않음.

모든 방향으로 균일하게 퍼져 있음.

이 whitening 조건 때문에 E[||h(z)||^2] = n이 성립함.

그리고 stationarity 때문에 z와 z'의 marginal distribution이 같으므로, h(z')도 같은 distribution을 가짐.

따라서 E[||h(z')||^2] = n도 성립함.

6. 왜 E[||h(z)||^2] = n인가?

h(z)가 n-dimensional vector라고 하자.

즉, h(z) = [h_1(z), h_2(z), ..., h_n(z)]임.

그리고 h(z) ~ N(0, I_n)이면 각 dimension은 평균 0, variance 1을 가짐.

즉, E[h_i(z)^2] = 1임.

벡터 norm의 제곱은 각 component 제곱의 합임.

||h(z)||^2 = sum from i=1 to n of h_i(z)^2

따라서 expectation을 취하면 다음과 같음.

E[||h(z)||^2] = sum from i=1 to n of E[h_i(z)^2]

각 E[h_i(z)^2]가 1이므로,

E[||h(z)||^2] = 1 + 1 + ... + 1 = n

즉, n-dimensional standard Gaussian vector의 squared norm의 평균은 n임.

7. Alignment loss를 전개하면 correlation maximization이 됨

LeJEPA alignment loss는 E[||h(z') - h(z)||^2]임.

이제 벡터 제곱 거리 공식을 사용함.

일반적으로 두 벡터 a, b에 대해 다음이 성립함.

||a - b||^2 = ||a||^2 + ||b||^2 - 2 a^T b

여기서 a = h(z'), b = h(z)라고 두면 다음과 같음.

||h(z') - h(z)||^2 = ||h(z')||^2 + ||h(z)||^2 - 2 h(z')^T h(z)

Expectation을 취하면 다음과 같음.

L(h) = E[||h(z')||^2] + E[||h(z)||^2] - 2E[h(z')^T h(z)]

앞에서 whitening 때문에 E[||h(z')||^2] = n이고 E[||h(z)||^2] = n이라고 했음.

따라서 다음과 같이 정리됨.

L(h) = 2n - 2E[h(z')^T h(z)]

또한 inner product는 component별 곱의 합으로 쓸 수 있음.

h(z')^T h(z) = sum from i=1 to n of h_i(z')h_i(z)

그래서 최종적으로 다음 식이 됨.

L(h) = 2n - 2 sum from i=1 to n of E[h_i(z')h_i(z)]

8. 왜 distance minimization이 correlation maximization이 되는가?

위 식을 보면 다음과 같음.

L(h) = 2n - 2 sum from i=1 to n of E[h_i(z')h_i(z)]

여기서 2n은 constant임. 즉, h를 바꿔도 whitening 조건 아래에서는 고정되어 있음.

따라서 L(h)를 줄이려면, 오른쪽의 두 번째 항인 sum from i=1 to n of E[h_i(z')h_i(z)]를 크게 만들어야 함.

이 항은 h(z)와 h(z')가 얼마나 같은 방향으로 움직이는지를 나타냄. 평균적으로 두 representation이 비슷하면 이 값이 커짐.

그래서 논문은 distance를 minimize하는 것이 positive pair 사이의 correlation을 maximize하는 것과 같다고 말함.

쉽게 말하면, whitening 때문에 각 representation의 크기는 이미 고정되어 있음. 그러면 두 representation 사이의 거리를 줄이는 유일한 방법은 서로 더 같은 방향을 바라보게 만드는 것임. 이것이 correlation maximization임.
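The expansion of the alignment loss into a constant minus a correlation term can be checked on samples. The sketch below assumes a 2D Gaussian OU world and takes $h$ to be a plain rotation (one particular $h(z) = Qz$); the rotation angle, $\rho = 0.8$, sample size, and all function names are illustrative assumptions.

```python
import math
import random


def rotate(z, theta=0.7):
    """h(z) = Qz for a 2D rotation Q by angle theta (an illustrative choice)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * z[0] - s * z[1], s * z[0] + c * z[1])


def alignment_terms(h, rho, num_pairs=50_000, seed=0):
    """Monte Carlo estimates of E||h(z')-h(z)||^2, E[h(z')^T h(z)],
    and E||h(z')||^2 + E||h(z)||^2 under the OU transition."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho ** 2)
    sq_dist = inner = sq_norms = 0.0
    for _ in range(num_pairs):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        z_next = tuple(rho * zi + scale * rng.gauss(0.0, 1.0) for zi in z)
        a, b = h(z_next), h(z)
        sq_dist += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        inner += sum(ai * bi for ai, bi in zip(a, b))
        sq_norms += sum(ai * ai for ai in a) + sum(bi * bi for bi in b)
    return sq_dist / num_pairs, inner / num_pairs, sq_norms / num_pairs


rho, n = 0.8, 2
loss, correlation, norm_sum = alignment_terms(rotate, rho)
print(f"loss={loss:.3f}  2n-2*corr={2 * n - 2 * correlation:.3f}  "
      f"2(1-rho)n={2 * (1 - rho) * n:.3f}")
```

The first identity, loss = (sum of squared norms) minus twice the inner product, holds per sample; replacing the squared norms by $2n$ is where whitening enters, so the second and third printed values agree only up to sampling error.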

9. 최종 질문: 어떤 h가 가장 높은 correlation을 만드는가?

이제 논문의 질문은 다음처럼 바뀜.

처음 질문은 다음이었음.

positive pair h(z), h(z')를 가깝게 만들면서, embedding distribution을 Gaussian으로 유지하는 h는 무엇인가?

수식으로 쓰면 다음과 같음.

minimize over h: E[||h(z') - h(z)||^2]

subject to: h(z) ~ N(0, I_n)

그런데 whitening 조건을 이용해 loss를 전개하면, 질문은 다음으로 바뀜.

h(z) ~ N(0, I_n)를 만족하는 모든 map h 중에서, positive pair h(z)와 h(z')의 correlation을 가장 크게 만드는 h는 무엇인가?

논문의 핵심 theorem은 이 질문에 대한 답임.

Gaussian OU world에서는 그 답이 결국 linear map임. 즉, h(z) = Qz임.

따라서 LeJEPA objective를 가장 잘 만족하는 representation은 true latent z를 rotation/reflection 정도만 허용한 채 선형적으로 복원하는 representation이라는 것임.

Symbol Definitions

z: latent state
observation 뒤에 숨은 true world state

z': positive-pair latent state
z와 관련된 두 번째 latent view

x: observation
모델이 실제로 보는 data

g: unknown generative / observation process
latent z를 observation x로 바꾸는 함수

f: learned encoder / representation function
observation x를 representation y로 바꾸는 모델

y: learned representation
encoder output

h: composed latent-to-representation map
h = f composed with g, 즉 h(z) = f(g(z))

h_i(z): i-th component of representation
representation vector h(z)의 i번째 값

L(h): LeJEPA alignment loss
positive pair representations 사이의 squared distance

E[...]: expectation
positive pair distribution에 대한 평균

N(0, I_n): standard n-dimensional Gaussian
평균 0, covariance I_n인 Gaussian distribution

I_n: n by n identity matrix
standard Gaussian의 covariance

n: latent / representation dimension
이 논문에서는 latent dimension과 representation dimension이 같다고 가정함

Cov(h(z)): covariance of representation
representation distribution의 covariance

whitening: covariance를 identity로 만드는 조건
여기서는 Cov(h(z)) = I_n

measure-preserving map: distribution을 보존하는 map
여기서는 z를 h(z)로 보냈을 때 h(z) ~ N(0, I_n)가 되게 하는 map

4 Spectral Analysis of the World

무언가를 증명하기 전에, main results 뒤에 있는 mathematical tools에 대한 intuition을 먼저 구축함.

Transition Operator.

World는 $z$에서 $z'$로의 transition을 정의함 (3.1).

이 transition은 functions 위의 operator를 유도함.

임의의 scalar function $h_i(z)$가 주어졌을 때, transition operator $T$를 다음과 같이 정의함.

\[ (Th_i)(z) = \mathbb{E}[h_i(z') \mid z] \]

이는 current state가 주어졌을 때 next view에서의 $h_i$의 expected value임.

이는 linear operator이며 spectral decomposition을 가짐.

즉, 다음을 만족하는 eigenfunctions $\phi_k$의 집합이 존재함.

\[ T\phi_k = \lambda_k \phi_k \] 여기서 eigenvalues는 다음을 만족함.

\[ 1 = \lambda_0 > \lambda_1 \geq \lambda_2 \geq \cdots \geq 0 \] 가장 큰 eigenvalues를 가진 eigenfunctions는 positive pairs 사이에서 가장 correlated되어 있음.

즉, latent variables의 가장 predictable features임.

이 spectral perspective는 우리의 forward result와 converse result 모두를 관통하는 common thread임.

Gaussian World: Hermite Polynomials.

Gaussian worlds (Sec. 3.1.1)의 경우, eigenfunctions는 closed form으로 알려져 있음. 이들은 Hermite polynomials $\{\mathrm{He}_k\}_{k \geq 0}$이며, Gaussian variables의 functions에 대한 natural orthogonal basis임. 이는 periodic functions에 대한 Fourier modes와 유사함.

Degree-$d$ Hermite polynomial의 eigenvalue는 정확히 $\rho^d$이며, 이는 Mehler’s formula [79]의 결과임.

이는 zero mean과 unit variance를 가진 임의의 function $h_i(z)$가 linear part (degree 1), quadratic part (degree 2), cubic part (degree 3), 그리고 그 이후의 higher-degree parts로 decomposed될 수 있음을 의미함. 각 part의 variance fractions $w_1, w_2, w_3, \ldots$는 합이 1임.

Positive pairs 사이의 correlation은 다음과 같이 decomposed됨. \[ \mathbb{E}[h_i(z')h_i(z)] = w_1 \cdot \rho + w_2 \cdot \rho^2 + w_3 \cdot \rho^3 + \cdots \leq \rho \tag{4} \] Equality는 $w_1 = 1$일 때 그리고 오직 그때에만 성립함.

즉, $h_i$가 linear일 때임. 말로 풀면, representation의 어떤 nonlinear distortion도 positive pairs 사이의 correlation을 엄격하게 감소시킴.

이것이 key intuition임.
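The claim that the degree-$d$ Hermite polynomial has eigenvalue $\rho^d$ can be probed with a small simulation. This is a sketch under stated assumptions: a one-dimensional OU pair, probabilists' Hermite polynomials built from the standard recurrence $\mathrm{He}_{j+1}(x) = x\,\mathrm{He}_j(x) - j\,\mathrm{He}_{j-1}(x)$, and normalization by $k!$ (since $\mathbb{E}[\mathrm{He}_k(z)^2] = k!$). The names, seed, and sample size are hypothetical.

```python
import math
import random


def hermite_he(k, x):
    """Probabilists' Hermite polynomial He_k(x) via the three-term recurrence."""
    prev, cur = 1.0, x
    if k == 0:
        return prev
    for j in range(1, k):
        prev, cur = cur, x * cur - j * prev
    return cur


def hermite_pair_correlation(k, rho, num_pairs=200_000, seed=0):
    """Estimate E[He_k(z') He_k(z)] / k! for an OU pair with correlation rho."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho ** 2)
    total = 0.0
    for _ in range(num_pairs):
        z = rng.gauss(0.0, 1.0)
        z_next = rho * z + scale * rng.gauss(0.0, 1.0)
        total += hermite_he(k, z) * hermite_he(k, z_next)
    return total / (num_pairs * math.factorial(k))


rho = 0.8  # illustrative choice
correlations = {k: hermite_pair_correlation(k, rho) for k in (1, 2, 3)}
for k, value in correlations.items():
    print(f"degree {k}: estimate={value:.3f}  rho**k={rho ** k:.3f}")
```

Higher degrees have heavier-tailed products, so their estimates are noisier; the pattern to look for is that each extra degree multiplies the correlation by roughly another factor of $\rho$, which is why any weight moved off the linear part lowers the total in Eq. (4).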

General Case: Sturm-Liouville Theory.

Constant diffusion 아래에서 evolve하는 일반적인 latent variable distribution, 즉 반드시 Gaussian일 필요는 없는 distribution의 경우, transition operator의 eigenfunctions는 classical Sturm–Liouville (SL) equation [72]으로 characterize됨.

첫 번째 non-constant eigenfunction $\phi_1$은 항상 monotonic하며, 이는 monotonic transformation까지의 identifiability를 제공함.

그러나 linear identifiability는 $\phi_1$이 affine일 것을 요구하며, 이는 latent variable distribution에 강한 constraint를 부과함. 이것이 우리의 converse result (Sec. 5.2)의 핵심 engine임.

우리는 Gaussian만이 이 constraint를 만족함을 보임. 자세한 내용은 App. A.2 (Gauss/Hermite)와 App. F (SL connection)에 있음.

5 Theory

우리는 LeJEPA가 언제 World Model을 학습하는지를 함께 characterize하는 네 가지 결과를 제시함.

Thm. 1은 Gaussian world가 LeJEPA를 통해 linearly identifiable함을 보임.

Thm. 2는 Sec. 3.1에서 정의한 class of worlds 안에서, Gaussian이 이 성질을 갖는 유일한 distribution임을 확립함.

Thm. 3은 objectives가 완전히 만족되지 않을 때 recovery error를 bound함.

Thm. 4는 linear identifiability가 latent space에서 optimal planning을 가능하게 함을 보임.

모든 proofs는 literature에서 axiomatized된 standard background lemmas를 제외하고 Lean 4 theorem prover에서 verify됨 (App. G).

5.1 Forward Direction: LeJEPA Learns the World Model

Theorem 1 (LeJEPA Linear Identifiability) Gaussian world (Sec. 3.1.1)를 고려함.

$h : \mathbb{R}^n \rightarrow \mathbb{R}^n$가 $h(z) \sim \mathcal{N}(0, I_n)$를 만족하는 임의의 measurable map이라고 하자.

그러면 \[ \mathcal{L}(h) \geq 2(1 - \rho)n \] 이며, equality는 어떤 orthogonal $Q \in O(n)$에 대해 \[ h(z) = Qz \] 일 때 그리고 오직 그때에만 성립함.

이러한 optimum에서는 \[ h(z') \mid h(z) \sim \mathcal{N}(\rho h(z), (1 - \rho^2)I_n) \] 가 성립함.

Proof Sketch. $\mathcal{L}$을 minimize하는 것은 식 (3)의 \[ \sum_i \mathbb{E}[h_i(z')h_i(z)] \] 를 maximize하는 것과 동치임.

Spectral bound (4)는 각 component에 대해 \[ \mathbb{E}[h_i(z')h_i(z)] \leq \rho \] 를 제공하며, equality는 $h_i$가 linear일 때 그리고 오직 그때에만 성립함.

Equality에서 $h(z) = Qz$가 되며, 여기서 $Q$는 unit-norm rows를 가진 matrix임.

Gaussianity는 \[ QQ^\top = I_n \] 을 강제하므로, $Q$는 orthogonal임. 그러면 transition은 direct substitution으로부터 따라옴.

\[ h(z') = \rho h(z) + \sqrt{1-\rho^2} Q\eta \] 이고, $Q$가 orthogonal이므로 $Q\eta \sim \mathcal{N}(0, I_n)$이며 $h(z)$와 independent함.

Proof는 App. A에 있음. Interpretation. Representation은 full World Model을 학습할 수밖에 없음.

두 LeJEPA objectives를 만족하는 어떤 representation이든 true latent variables와 true transition dynamics의 rotation/reflection을 복원해야 함. 남는 유일한 ambiguity는 global rotation이며, 이는 isotropic Gaussian에 본질적으로 존재하는 ambiguity임.

우리는 App. E에서 Dirichlet energy와 Mazur–Ulam theorem을 이용한 alternative proof도 제공하며, 이는 더 theoretical한 $\rho \rightarrow 1$ regime에서 작동함.
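To make the lower bound of Thm. 1 concrete, the sketch below compares two maps in one dimension ($n = 1$) that both send $\mathcal{N}(0, 1)$ to $\mathcal{N}(0, 1)$: the identity, and a hypothetical "tail flip" that negates $z$ whenever $|z| \geq 1$ (Gaussian-preserving by symmetry, measurable, but neither linear nor continuous). The tail-flip map, $\rho = 0.8$, and the sample size are illustrative assumptions, not an experiment from the paper.

```python
import math
import random


def linear_map(z):
    """h(z) = z, the n = 1 case of h(z) = Qz with Q = 1."""
    return z


def tail_flip(z):
    """Nonlinear map that still pushes N(0, 1) to N(0, 1) by symmetry."""
    return z if abs(z) < 1.0 else -z


def alignment_loss(h, rho, num_pairs=100_000, seed=0):
    """Monte Carlo estimate of E[(h(z') - h(z))^2] under the OU transition."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho ** 2)
    total = 0.0
    for _ in range(num_pairs):
        z = rng.gauss(0.0, 1.0)
        z_next = rho * z + scale * rng.gauss(0.0, 1.0)
        total += (h(z_next) - h(z)) ** 2
    return total / num_pairs


rho = 0.8
lower_bound = 2 * (1 - rho) * 1  # 2(1 - rho) n with n = 1
loss_linear = alignment_loss(linear_map, rho)
loss_flip = alignment_loss(tail_flip, rho)
print(f"bound={lower_bound:.3f}  linear={loss_linear:.3f}  tail_flip={loss_flip:.3f}")
```

The linear map should sit at the bound up to sampling error, while the tail flip satisfies the Gaussianity constraint yet pays a clearly larger alignment loss, which is the behavior the theorem predicts for any nonlinear distortion.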

5.2 Converse Direction: The Gaussian is Unique

Thm. 1은 $h(z)$가 whitened covariance, 즉 \[ \mathrm{Cov}(h(z)) = I_n \] 를 갖는다는 사실만 사용함.

이는 SIGReg에 의해 함의됨 (Fig. 10).

Does the specific choice of a Gaussian matter, then, or would any white distribution work? The optimal representation extracts the slowest features of the latent process [71]. Under additive noise, Sturm–Liouville theory orders these as increasingly oscillatory eigenfunctions [72]. The first eigenfunction is always monotonic, which gives identifiability up to a monotonic transformation for any latent distribution. Linear identifiability requires this eigenfunction to be affine, and only the Gaussian satisfies that.

Theorem 2 (Gaussian Uniqueness) Consider any world satisfying Assumptions 3.1. Suppose every minimizer of Eq. (3) with $\mathrm{Cov}(h(z)) = I_n$ is linear, i.e.

\[ h(z) = Qz. \]

Then $z$ is Gaussian.

Proof Sketch. Requiring an affine eigenfunction forces the score function $(\log p)'$ of the latent distribution to be linear. If

\[ \phi(z_i) = az_i + b \]

is an eigenfunction, then $\phi'$ is constant and the eigenvalue equation collapses to a linear ODE for $(\log p)'$ in $z_i$. Solving it gives

\[ \log p(z_i) \propto -(z_i - \mu)^2, \]

where the sign is fixed by normalizability; this is a Gaussian. The argument applies component by component, and independence of the latent variables extends the conclusion to the joint distribution. The proof is in App. B.
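One way to see why an affine eigenfunction forces a Gaussian is to write the eigenvalue equation out explicitly. The following is only a sketch under assumptions that this post does not spell out: a one-dimensional reversible diffusion with constant diffusion coefficient and stationary density $p$, whose generator is $\mathcal{A}\phi = \tfrac{1}{2}\phi'' + \tfrac{1}{2}(\log p)'\phi'$, and a hypothetical eigenvalue symbol $\kappa$. The paper's own proof in App. B may be organized differently.

```latex
\begin{align}
\mathcal{A}\phi &= -\kappa\,\phi, \qquad \phi(z_i) = a z_i + b,\quad a \neq 0 \\
\phi' = a,\ \phi'' = 0 \;&\Rightarrow\; \tfrac{1}{2}\,a\,(\log p)'(z_i) = -\kappa\,(a z_i + b) \\
&\Rightarrow\; (\log p)'(z_i) = -2\kappa\,(z_i - \mu), \qquad \mu := -b/a \\
&\Rightarrow\; \log p(z_i) = -\kappa\,(z_i - \mu)^2 + c
\end{align}
```

For $p$ to be normalizable, $\kappa$ must be positive, which gives a Gaussian with mean $\mu$ and variance $1/(2\kappa)$. This matches the form $\log p(z_i) \propto -(z_i - \mu)^2$ stated in the proof sketch.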

5.3 Approximate Identifiability

앞선 results는 exact optimality를 가정함. 실제로는 두 objectives 모두 approximate하게만 만족됨. Alignment는 minimum과 같지는 않지만 가까운 값에 도달하고, regularizer는 approximate whitening을 강제함.

Identifiability는 이 두 quantity에 대해 점진적으로 약화됨. Theorem 3 (Approximate Identifiability) Gaussian world (3.1.1)를 고려함.

$h : \mathbb{R}^n \rightarrow \mathbb{R}^n$가 measurable이고 $\mathbb{E}[h(z)] = 0$이며, 다음을 만족한다고 하자.

Approximate alignment:

\[ \mathcal{L}(h) \leq 2(1-\rho)\,\mathrm{tr}(\mathrm{Cov}(h(z))) + \delta \]

Approximate whitening:

\[ \|\mathrm{Cov}(h(z)) - I_n\|_F \leq \varepsilon \]

Define

\[ D = \frac{\delta}{2\rho(1-\rho)} \]

Then there exists $Q \in O(n)$ such that

\[ \mathbb{E}\left[\|h(z) - Qz\|^2\right] \leq D + (\varepsilon + D)^2 \tag{5} \]

Interpretation. The bound has two parts.

하나는 $h$가 linear에서 얼마나 멀리 있는지를 측정하는 nonlinearity term $D$이고, 다른 하나는 linear part가 orthogonal에서 얼마나 멀리 있는지를 측정하는 distortion term $(\varepsilon + D)^2$임.

Exact case, 즉 $\delta = \varepsilon = 0$에서는 둘 다 사라지며 Thm. 1이 복원됨.

실제로는 첫 번째 항이 dominate하므로, recovery error는 \[ \frac{\delta}{2\rho(1-\rho)} \] 처럼 scale됨. 즉, alignment는 어렵고 whitening은 사실상 free임.

Proof는 App. C에 있음.
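The bound in Eq. (5) is easy to evaluate for given slack values. The helper below is a small sketch; the function name and the example numbers are hypothetical and only illustrate how $\delta$, $\varepsilon$, and $\rho$ enter the formula.

```python
def recovery_error_bound(delta, eps, rho):
    """Return (D, D + (eps + D)^2) from Eq. (5), with D = delta / (2 rho (1 - rho))."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    if delta < 0.0 or eps < 0.0:
        raise ValueError("delta and eps must be non-negative")
    nonlinearity = delta / (2.0 * rho * (1.0 - rho))
    return nonlinearity, nonlinearity + (eps + nonlinearity) ** 2


# Exact case: both slacks vanish, so the bound is zero and Thm. 1 is recovered.
exact = recovery_error_bound(delta=0.0, eps=0.0, rho=0.8)

# Illustrative slack values (not taken from the paper's experiments).
d_term, bound = recovery_error_bound(delta=0.032, eps=0.1, rho=0.8)
print(exact, d_term, bound)
```

Note that $D$ grows as $\rho$ approaches 0 or 1 for fixed $\delta$, so the same alignment gap translates into a looser guarantee when the two views are either nearly independent or nearly identical.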

5.4 Optimal Latent Planning

Theorems 1–3는 encoder가 무엇을 복원하는지를 characterize함.

이제 orthogonal identifiability가 World Models의 핵심 motivation 중 하나인 latent space에서의 action planning에 무엇을 제공하는지 명시함.

Theorem 4 (Optimal Latent Planning) $h(z) = Qz$이고 $Q \in O(n)$라고 하자.

Write $\hat{z} := h(z)$ and consider any finite-horizon optimal control problem whose stage cost $\ell$ and terminal cost $\ell_T$ are $O(n)$-invariant in the state argument:

\[ \ell(Rz, a) = \ell(z, a), \quad \ell_T(Rz) = \ell_T(z), \quad \forall R \in O(n) \tag{6} \]

Let $\hat{p}(\cdot \mid \hat{z}_t, a_t)$ denote the pushforward of $p$ under $h$.

또한 $\hat{V}^*$, $\hat{a}^*_{1:T}$가 $p$를 $\hat{p}$로 대체하고 costs는 그대로 둠으로써 정의된 $\hat{z}$-space problem의 value function과 optimal plan을 나타낸다고 하자.

그러면 다음이 성립함. \[ \hat{V}^*(h(z_0)) = V^*(z_0) \] 그리고 \[ \hat{a}^*_{1:T}(h(z_0)) = a^*_{1:T}(z_0) \] 임.

Interpretation. Linear identifiability는 learned representation을 planning에 유용한 World Model로 바꿈.

Learned latent에서 계획된 trajectories는 true world에서 계획된 trajectories와 수학적으로 동일하며, 같은 actions와 같은 value를 가짐.

Proof는 App. D에 있음.
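A toy example can illustrate what Thm. 4 asserts. Everything in the sketch below is a hypothetical setup chosen for illustration: deterministic 2D dynamics $z_{t+1} = z_t + a_t u$ with a fixed direction $u$, a three-element action set, horizon 3, and the rotation-invariant costs $\ell(z, a) = \|z\|^2 + 0.1a^2$ and $\ell_T(z) = \|z\|^2$. The learned space uses $h(z) = Qz$ with a fixed rotation $Q$, and planning is done by brute-force enumeration.

```python
import itertools
import math

THETA = 0.7  # hypothetical rotation angle defining Q
Q = ((math.cos(THETA), -math.sin(THETA)),
     (math.sin(THETA), math.cos(THETA)))
U = (1.0, 0.5)  # hypothetical control direction in the true latent space


def apply(mat, v):
    """2x2 matrix-vector product."""
    return (mat[0][0] * v[0] + mat[0][1] * v[1],
            mat[1][0] * v[0] + mat[1][1] * v[1])


QU = apply(Q, U)


def step_true(z, a):
    """True dynamics z_{t+1} = z_t + a * U."""
    return (z[0] + a * U[0], z[1] + a * U[1])


def step_learned(z_hat, a):
    """Pushforward of step_true under h(z) = Qz: Q(z + aU) = z_hat + a * QU."""
    return (z_hat[0] + a * QU[0], z_hat[1] + a * QU[1])


def stage_cost(state, a):
    """O(n)-invariant in the state: depends on it only through its norm."""
    return state[0] ** 2 + state[1] ** 2 + 0.1 * a * a


def terminal_cost(state):
    return state[0] ** 2 + state[1] ** 2


def best_plan(start, step, horizon=3, actions=(-1, 0, 1)):
    """Enumerate every action sequence and return (optimal value, optimal plan)."""
    best = None
    for plan in itertools.product(actions, repeat=horizon):
        state, total = start, 0.0
        for a in plan:
            total += stage_cost(state, a)
            state = step(state, a)
        total += terminal_cost(state)
        if best is None or total < best[0]:
            best = (total, plan)
    return best


z0 = (2.0, 1.0)
value_true, plan_true = best_plan(z0, step_true)
value_hat, plan_hat = best_plan(apply(Q, z0), step_learned)
print(plan_true, plan_hat, value_true, value_hat)
```

Because the costs see the state only through its norm and $Q$ preserves norms, every plan has the same cost in both spaces, so the optimal plan and value coincide. The theorem covers stochastic transitions as well; this sketch only covers the deterministic special case.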

6 Experiments

우리는 각 result를 validate함.

Gaussian latent variables에서의 linear identifiability (Sec. 6.1, Thm. 1), latent-distribution sweep과 RL-policy [9] latents에서의 converse (Sec. 6.2, Thm. 2), 모든 runs에 걸친 approximate bound (Sec. 6.3, Thm. 3), 그리고 near-optimal latent planning [80] (Sec. 6.4, Thm. 4)을 검증함.

6.1 Forward: Linear Identifiability from Gaussian Latent Variables

먼저 controlled 2D setting에서 Thm. 1을 검증함.

Latents \[ z \sim \mathcal{N}(0, I_2) \] 를 sample하고 네 가지 nonlinear mixing functions를 적용함 (Figures 1, 3).

첫 번째는 norm-dependent rotation으로, \[ g(z) = R(\pi\|z\|^2)z \] 이며 spiral을 생성함 [c.f. 74].

두 번째는 sinusoidal shear로, \[ g(z_1, z_2) = (z_1 + \sin(1.5z_2), z_2) \] 임.

세 번째는 parabolic shear로, \[ g(z_1, z_2) = (z_1, z_2 + z_1^2) \] 임.

네 번째는 RealNVP-style coupling layer [81]임.

네 함수 모두 diffeomorphisms임.

추가로 spiral은 measure-preserving이므로, observations의 Gaussianity만으로는 identifiability에 충분하지 않음을 보여줌. Alignment가 essential함.
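The three closed-form mixing functions listed above are simple to write down. The sketch below implements them on 2D points using the formulas as stated; the inverse helpers and the sample point are hypothetical additions for illustration, and the RealNVP-style coupling layer is omitted because its parameters are not given here.

```python
import math


def spiral(z):
    """Norm-dependent rotation g(z) = R(pi * ||z||^2) z."""
    angle = math.pi * (z[0] ** 2 + z[1] ** 2)
    c, s = math.cos(angle), math.sin(angle)
    return (c * z[0] - s * z[1], s * z[0] + c * z[1])


def sinusoidal_shear(z):
    """g(z1, z2) = (z1 + sin(1.5 * z2), z2)."""
    return (z[0] + math.sin(1.5 * z[1]), z[1])


def parabolic_shear(z):
    """g(z1, z2) = (z1, z2 + z1^2)."""
    return (z[0], z[1] + z[0] ** 2)


def sinusoidal_shear_inverse(x):
    """Explicit inverse: the second coordinate is untouched by the shear."""
    return (x[0] - math.sin(1.5 * x[1]), x[1])


def parabolic_shear_inverse(x):
    """Explicit inverse: the first coordinate is untouched by the shear."""
    return (x[0], x[1] - x[0] ** 2)


z = (0.6, -1.1)  # arbitrary sample point
x_spiral = spiral(z)
print(math.hypot(*z), math.hypot(*x_spiral))
print(sinusoidal_shear_inverse(sinusoidal_shear(z)))
print(parabolic_shear_inverse(parabolic_shear(z)))
```

The spiral rotates each point by an angle that depends only on its norm, so norms are unchanged; combined with the rotational symmetry of $\mathcal{N}(0, I_2)$, this is why the spiral output is still Gaussian and Gaussianity of the observations alone cannot identify $z$.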

우리는 LeJEPA loss를 사용해 4-layer MLP를 학습함.

\[ \mathcal{L} = \lambda \mathcal{L}_{SIG} + (1-\lambda)\mathcal{L}_{inv} \]

여기서 positive pairs는 식 (1)의 OU transition으로 생성됨.

Points are colored by the polar angle and radius of the ground-truth latent variables $z \sim \mathcal{N}(0, I_2)$, in the same way as Fig. 1. The observations on the left show $x = g(z)$ after nonlinear mixing, covering the parabolic shear, the sinusoidal shear, and the RealNVP coupling layer. The learned embeddings on the right show that LeJEPA recovers the isotropic Gaussian structure up to rotation, consistent with Theorem 1.

Fig. 3은 learned representation이 각 nonlinear mixing을 rotation까지 허용한 채 invert함을 보여주며, 이는 Thm. 1과 일치함.

$\lambda$와 $\rho$에 대한 grid search (App. H.6)는 다음을 보여줌.

Gaussianity가 너무 강하면, 즉 $\lambda = 0.5$이면 representation이 collapse되고, best recovery는 낮은 $\lambda$와 높은 $\rho$에서 발생함.

Scaling to High Dimensions.

Next we sweep the latent dimension \[ N \in \{2^1, \ldots, 2^{10}\} \].

This range extends beyond, for example, the 768 embedding dimensions of DINOv3.

We use RealNVP mixing and a matched encoder [81], so that any failure would be due to optimization rather than encoder expressivity.

We test all three Gaussianity-enforcing classes in the same setup.

These are SIGReg [2], VICReg [3], and InfoNCE [26].

Table 1 shows that the batch-statistic estimators SIGReg and VICReg maintain \[ R^2 > 0.999 \] up to $N=1024$.

InfoNCE degrades as the scale grows under a fixed kernel width.

Thm. 1 is therefore empirically supported across mixing functions, dimensions, and hyperparameters whenever the Gaussian-latent assumptions hold.

The practical gap between methods appears to emerge only when departing from the theoretical ideal.

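Put simply, the role of the Experiments part is to check whether the theorems proved earlier also hold in actual controlled settings.

The paper organizes the experiments into four blocks.

Does LeJEPA achieve linear identifiability with Gaussian latents?
Does linear identifiability really break with non-Gaussian latents?
Does the approximate bound hold even away from the exact optimum?
Is linear identifiability connected to actual planning performance?

The paper itself states in the first sentence of Sec. 6 that the experiments validate Thm. 1, Thm. 2, Thm. 3, and Thm. 4 respectively. (arXiv)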

6.1 Forward: Linear Identifiability from Gaussian Latent Variables

This experiment checks Theorem 1, i.e. the result that “in a Gaussian world, LeJEPA linearly recovers the true latent”.

The experiment first samples a 2D Gaussian latent.

z ~ N(0, I_2)

Then this clean latent is deliberately twisted in a nonlinear way.

The nonlinear mixings used in the paper are:

spiral / norm-dependent rotation
sinusoidal shear
parabolic shear
RealNVP-style coupling layer

In other words, points that were cleanly spread as a Gaussian in the original latent space are turned into a complicated, distorted shape in observation space. LeJEPA is then trained to see whether the learned representation recovers the original latent structure. In this 2D setting the paper trains a 4-layer MLP with the LeJEPA loss, and Fig. 3 shows the learned representation undoing the nonlinear mixing up to a rotation. (arXiv)

The key point is this.

Even if the observations are nonlinearly twisted, as long as the underlying latent is Gaussian and the positive pairs follow the OU transition, LeJEPA can learn a representation that is linearly identifiable.

The experiment does not stop at 2D. The paper increases the latent dimension from 2 to 1024 to check the high-dimensional setting as well. It compares SIGReg, VICReg, and InfoNCE, and reports that SIGReg and VICReg maintain R² > 0.999 up to N=1024, while InfoNCE degrades as the scale grows under a fixed kernel width. So when the Gaussian latent assumption holds, the theory matches practice across a fairly wide range of dimensions and mixings. (arXiv)
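
One plausible way to compute the linear recovery score is sketched below. It assumes that $R^2$ here means the coefficient of determination of an affine least-squares regression from the embedding $h$ to the true latent $z$, averaged over latent dimensions; the paper's exact evaluation protocol is not given in this section, and `linear_r2` is a hypothetical helper name. The "warped" embedding is a made-up stand-in for an encoder that failed to undo the sinusoidal shear.

```python
import numpy as np

def linear_r2(h, z):
    """Average R^2 of an affine least-squares fit z ~ h @ W + b."""
    h_aug = np.hstack([h, np.ones((len(h), 1))])      # append a bias column
    coef, *_ = np.linalg.lstsq(h_aug, z, rcond=None)
    residual = z - h_aug @ coef
    ss_res = np.sum(residual**2, axis=0)
    ss_tot = np.sum((z - z.mean(axis=0)) ** 2, axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))

rng = np.random.default_rng(0)
z = rng.standard_normal((5_000, 2))

# Ideal case: the embedding is a pure rotation of the latent, h = Q z.
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
h_rotated = z @ Q.T

# Failure case (illustrative): the embedding still carries the sinusoidal shear.
h_warped = np.stack([z[:, 0] + np.sin(1.5 * z[:, 1]), z[:, 1]], axis=1)

r2_rotated = linear_r2(h_rotated, z)
r2_warped = linear_r2(h_warped, z)
```

A rotation is perfectly recoverable by a linear probe, so `r2_rotated` is essentially 1, while the residual nonlinearity in `h_warped` pulls the score visibly below 1.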

In a presentation, you can put it like this.

The first experiment validates Theorem 1. The authors create Gaussian latents and deliberately apply nonlinear mixing so that the observations are heavily distorted. They then train LeJEPA and check whether the learned embedding recovers the original Gaussian latent structure up to a rotation. Fig. 3 shows this result visually, and in the scaling experiment SIGReg and VICReg maintain very high linear recovery up to 1024 dimensions.

6.2 Converse: Non-Gaussian Latent Variables Break Linear Identifiability

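This experiment checks Theorem 2, i.e. the claim that “the Gaussian is unique”.

If Theorem 1 says “Gaussian is enough”, Theorem 2 gives the other direction: “for linear identifiability to be guaranteed, the latent must be Gaussian”.

To check this, the paper varies the latent distribution within the generalized normal family.

Changing the shape parameter alpha changes the shape of the distribution.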

alpha → 0: heavy-tailed
alpha = 1: Laplace
alpha = 2: Gaussian
alpha → infinity: uniform

For each distribution, the paper measures how well LeJEPA and the related objectives linearly recover the latent. Linear recovery peaks sharply at alpha = 2, which corresponds to the Gaussian. That is, recovery is best for the Gaussian and linear identifiability weakens as the latent moves away from it. (arXiv)
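
A sketch of how latents from this family could be sampled for such a sweep, using NumPy only. It assumes the density is proportional to $\exp(-|x|^\alpha)$ and rescales samples to unit variance so that only the shape changes with $\alpha$; whether the paper standardizes the same way is an assumption, and the helper names are hypothetical. The sampler relies on the fact that $|x|^\alpha$ is then Gamma$(1/\alpha, 1)$ distributed.

```python
import math
import numpy as np

def sample_generalized_normal(n_samples, dim, alpha, rng):
    """Unit-variance samples with density proportional to exp(-|x / s|^alpha)."""
    g = rng.gamma(shape=1.0 / alpha, scale=1.0, size=(n_samples, dim))
    sign = rng.choice([-1.0, 1.0], size=(n_samples, dim))
    x = sign * g ** (1.0 / alpha)                     # |x|^alpha ~ Gamma(1/alpha, 1)
    std = math.sqrt(math.gamma(3.0 / alpha) / math.gamma(1.0 / alpha))
    return x / std                                    # rescale to unit variance

def kurtosis(x):
    """Plain (non-excess) kurtosis, averaged over dimensions; 3 for a Gaussian."""
    x = x - x.mean(axis=0)
    return float(np.mean(np.mean(x**4, axis=0) / np.mean(x**2, axis=0) ** 2))

rng = np.random.default_rng(0)
z_laplace = sample_generalized_normal(200_000, 2, alpha=1.0, rng=rng)   # Laplace
z_gauss = sample_generalized_normal(200_000, 2, alpha=2.0, rng=rng)     # Gaussian
z_flat = sample_generalized_normal(200_000, 2, alpha=8.0, rng=rng)      # near-uniform

k_laplace, k_gauss, k_flat = kurtosis(z_laplace), kurtosis(z_gauss), kurtosis(z_flat)
```

Kurtosis gives a quick sanity check on the shape: it is above 3 for the heavier-tailed Laplace, 3 for the Gaussian, and approaches the uniform value 1.8 as `alpha` grows.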

The paper does not stop there; it also looks at a pixel-based robotic control setting.

It uses the DMC Reacher environment. The robot arm in this environment has two joints, so the latent state can be taken to be the two joint angles.

z = (theta_0, theta_1)

The paper compares two conditions.

The first is the OU condition. Here Gaussian samples are drawn and positive pairs are generated with the OU transition. This is a controlled setting that matches the theoretical assumptions well.

The second is the Trajectory condition. Here joint-angle pairs come from actual RL policy trajectories. In this case the trajectory distribution may not be Gaussian, the amount of change can be anisotropic across joints, and issues such as joint-limit wrapping arise.

As a result, latent recovery works well in the OU condition, whereas in the real RL trajectory condition R² is low and anisotropic. The paper explains that the Trajectory condition violates several theoretical assumptions at once, including non-Gaussian marginals, anisotropic correlation, and joint-limit wrapping. (arXiv)

In a presentation, you can put it like this.

The second experiment checks Theorem 2. The authors move the latent distribution from Gaussian toward Laplace, heavy-tailed, and uniform, and measure linear recovery. Recovery is highest at alpha=2, which corresponds to the Gaussian. In addition, when real RL trajectories from DMC Reacher are used, the distribution becomes non-Gaussian and anisotropic, so linear identifiability drops. This shows that the Gaussian assumption is not a mere convenience but an important condition for the guarantee.

6.3 Approximation Bound and Loss Predict Identifiability

This experiment checks Theorem 3, i.e. the approximate identifiability bound.

Theorem 1 above is the ideal case, in which the LeJEPA objective is perfectly optimized and Gaussianity holds exactly.

In actual training this is not attainable.

So Theorem 3 states the following.

If the alignment gap is small and the whitening error is small, then the recovery error is bounded to be small as well.

Three quantities matter here.

The alignment gap measures how far the alignment loss is above its theoretical optimum.

The whitening error measures how far Cov(h(z)) deviates from I.

The recovery error measures how different h(z) is from Qz.

For each run the paper computes epsilon, delta, the theoretical bound, and the actual recovery error. Fig. 4a shows that the bound largely holds across the grid search, the 2D mixings, the scaling experiment, and the latent-distribution sweep. From this the paper concludes that the training loss can be used as a proxy for identifiability. (arXiv)
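
The three quantities can be estimated from samples roughly as follows. This is a sketch under stated assumptions: the formulas for $\varepsilon$ and $\delta$ follow the definitions quoted in the paper's Sec. 6.3 text, the minimization over orthogonal $Q$ is solved with the standard orthogonal Procrustes solution (an SVD of the cross-moment matrix), and all function names are hypothetical. The bound itself is not computed because its constant $D$ is not defined in this section.

```python
import numpy as np

def whitening_error(h):
    """epsilon = ||Cov(h) - I||_F, estimated from the rows of h."""
    cov = np.cov(h, rowvar=False)
    return float(np.linalg.norm(cov - np.eye(h.shape[1]), ord="fro"))

def alignment_gap(h, h_next, rho):
    """delta = E||h' - h||^2 - 2 (1 - rho) tr(Cov(h))."""
    loss = np.mean(np.sum((h_next - h) ** 2, axis=1))
    trace = np.trace(np.cov(h, rowvar=False))
    return float(loss - 2.0 * (1.0 - rho) * trace)

def recovery_error(h, z):
    """min over orthogonal Q of E||h - Q z||^2 via orthogonal Procrustes."""
    cross = h.T @ z / len(z)                 # sample estimate of E[h z^T]
    u, _, vt = np.linalg.svd(cross)
    q = u @ vt                               # optimal orthogonal matrix
    err = np.mean(np.sum((h - z @ q.T) ** 2, axis=1))
    return float(err), q

rng = np.random.default_rng(0)
rho = 0.9
z = rng.standard_normal((100_000, 2))
z_next = rho * z + np.sqrt(1.0 - rho**2) * rng.standard_normal(z.shape)

# Ideal encoder h(z) = Q z: all three quantities should be close to zero.
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
h, h_next = z @ Q.T, z_next @ Q.T
eps = whitening_error(h)
delta = alignment_gap(h, h_next, rho)
err, q_hat = recovery_error(h, z)

# Illustrative non-ideal encoder that keeps a sinusoidal shear.
h_bad = np.stack([z[:, 0] + np.sin(1.5 * z[:, 1]), z[:, 1]], axis=1)
eps_bad = whitening_error(h_bad)
err_bad, _ = recovery_error(h_bad, z)
```

For the ideal encoder, `q_hat` reproduces `Q` and the errors vanish up to sampling noise; for the sheared encoder both the whitening error and the recovery error are clearly nonzero.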

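In a presentation, you can put it like this.

The third experiment asks whether the theory is only meaningful at the exact optimum, or whether it still holds when there is actual training error. The authors measure the alignment gap and the whitening error, and compute an upper bound on the recovery error from these values. Fig. 4a checks whether the actual recovery error lies below this bound. The bound largely holds across the experimental conditions, which means that when the training loss is low and whitening is good, the representation is likely to recover the true latent well.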

6.4 Linear Identifiability Enables Latent-Space Planning

This experiment checks Theorem 4, i.e. whether linear identifiability helps planning.

The intuition behind Theorem 4 is as follows.

If h(z) ≈ Qz, the learned latent space is close to a rotation/reflection of the true latent space.
Planning in the learned latent can then carry the same meaning as planning in the true latent.

The paper uses a goal-reaching task. The most natural trajectory to the goal state is a straight line in latent space. If the encoder satisfies h(z) ≈ Qz, a straight line in the learned latent space should decode to a near-straight path in the true latent space as well.

As a result, the straight-latent plan of the Gaussian encoder was close to an oracle-quality joint-space trajectory, while the Trajectory encoder produced a higher control cost. Fig. 4c,d and Fig. 5 also show that control cost is related to the linear identifiability R². The paper interprets this as linear identifiability being the structural property that turns a faithful World Model into a useful planner. (arXiv)
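
The straight-line argument can be illustrated with a toy computation. Everything here is an assumption made for illustration: the "control cost" is taken to be the length of the decoded path divided by the straight-line distance in the true latent (so 1 is ideal), the encoders are hand-written invertible maps rather than trained networks, and `plan_cost` is a hypothetical helper.

```python
import numpy as np

def path_length(points):
    """Total length of a piecewise-linear path given as rows of points."""
    return float(np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1)))

def plan_cost(encode, decode, z_start, z_goal, n_steps=200):
    """Plan a straight line in the learned latent, decode it, and return
    decoded path length divided by the true straight-line distance."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    h_start = encode(z_start[None])[0]
    h_goal = encode(z_goal[None])[0]
    latent_plan = (1.0 - t) * h_start + t * h_goal
    decoded = decode(latent_plan)
    return path_length(decoded) / float(np.linalg.norm(z_goal - z_start))

# Linearly identifiable encoder: h(z) = Q z, decoded with Q^T.
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
cost_rotation = plan_cost(lambda z: z @ Q.T, lambda h: h @ Q,
                          np.array([-1.0, -1.5]), np.array([1.0, 1.5]))

# Non-identifiable encoder (illustrative): a sinusoidal shear and its inverse.
def shear(z):
    return np.stack([z[:, 0] + np.sin(1.5 * z[:, 1]), z[:, 1]], axis=1)

def unshear(h):
    return np.stack([h[:, 0] - np.sin(1.5 * h[:, 1]), h[:, 1]], axis=1)

cost_shear = plan_cost(shear, unshear,
                       np.array([-1.0, -1.5]), np.array([1.0, 1.5]))
```

With the rotation encoder the decoded plan is the true straight line, so the ratio is 1; with the sheared encoder the same latent plan decodes to a curved path and the ratio rises above 1.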

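In a presentation, you can put it like this.

The fourth experiment asks whether linear identifiability is actually connected to planning. If h(z)=Qz, the learned latent space is nothing more than a rotation of the true state space, so a path planned as a straight line in the learned latent should also be a meaningful path in the actual state space. In the experiment, the Gaussian OU encoder produces a path almost identical to the oracle trajectory, while the encoder trained on real RL trajectories shows a higher control cost. Linear identifiability is therefore not just a condition for the representation looking nice, but a condition for it to be a state representation usable for planning.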

Summary of the Experiments part

The Experiments part is structured to check the theorems one at a time.

6.1 Forward experiment
Gaussian latent + nonlinear mixing
→ LeJEPA recovers latent structure
→ supports Thm. 1

6.2 Converse experiment
non-Gaussian latent sweep + RL trajectory
→ linear recovery drops outside Gaussian
→ supports Thm. 2

6.3 Approximate bound
alignment gap + whitening error
→ bound actual recovery error
→ supports Thm. 3

6.4 Planning
linear identifiability improves latent-space planning
→ supports Thm. 4

In one sentence:

The experiments in this paper are not of the "we improved performance" kind; they test whether the four theorems presented earlier hold in actual controlled settings. With Gaussian latents LeJEPA recovers the latent well, with non-Gaussian latents recovery weakens, the bound holds under approximate training, and finally linear identifiability is shown to be connected to planning cost.

6.2 Converse: Non-Gaussian Latent Variables Break Linear Identifiability

Next we validate the predictions of Thm. 2, namely that non-Gaussian latents break linear identifiability.

Latent-Distribution Sweep.

We sweep the latent variable along the generalized normal family with shape parameter $\alpha$.

$\alpha \rightarrow 0$ is heavy-tailed, $\alpha = 1$ is Laplace, $\alpha = 2$ is Gaussian, and $\alpha \rightarrow \infty$ is uniform.

Linear recovery peaks sharply at $\alpha = 2$ for all three objectives (Fig. 4b, App. H.7), which illustrates Thm. 2.

Fig. 4. (a) Bound Verification. SIGReg runs from the grid search, the 2D simulation, the scaling experiment, and the generalized normal distribution with $\alpha = 2$ lie below the diagonal, which confirms Theorem 3. The two outliers near zero reflect finite-sample noise. (b) Gaussian Optimality. Linear recovery $R^2(h \rightarrow z)$ peaks at the Gaussian, which illustrates Theorem 2. The Gaussianization of $h$ by SIGReg is more robust to non-Gaussian latent variable distributions than whitening. (c) Control cost. Control cost is measured over $K = 30$ random start-goal pairs. The path length is at least 1, and the ideal value is 1. The Gaussian encoder is statistically indistinguishable from the oracle, whereas the Trajectory encoder shows a bias toward higher cost. (d) Control cost decreases with linear identifiability. Control cost decreases as the linear identifiability $R^2$ increases, which supports Theorem 4.

SIGReg and InfoNCE maintain a wider plateau than VICReg on heavy-tailed latents.

Pixel-Based RL Trajectories.

DMC Reacher [80] has two joints and therefore a 2D latent state \[ z = (\theta_0, \theta_1) \] (Fig. 11).

We train a CNN encoder with LeJEPA (App. H.11). The two conditions share the same rendering pipeline but differ in their distributions. The first is the OU condition, which uses Gaussian samples \[ z \sim \mathcal{N}(0, I_2) \] with the OU transition of Eq. (1), as before.

The second is the Trajectory condition, which uses joint-angle pairs $(z_t, z_{t+\delta})$ separated by $\delta$ frames, taken from 10k RL episodes [9].

Table 2 (left) shows that OU pairs achieve \[ R^2 = 0.95 \] at $\rho = 0.99$, with both joint dimensions linearly recovered. In contrast, real trajectories break the Gaussian assumption (App. H.11, Figs. 12, 13). Table 2 (right) shows that the per-dimension $R^2$ is anisotropic and that the total $R^2$ never exceeds 0.5, consistent with Thm. 2. The Trajectory condition violates several theory assumptions at once: non-Gaussian marginals, anisotropic $\rho_0 \neq \rho_1$, and joint-limit wrapping.

Table 1: Scaling Comparison Across Regularizers (mean ± std, 5 seeds). Three Gaussianity-enforcing objectives are compared under a shared RealNVP mixing and a matched encoder. SIGReg and VICReg maintain $R^2 > 0.999$ up to $N=1024$. InfoNCE matches them at low $N$ but degrades as the scale grows under a fixed kernel width $\sigma = 1$. Per-method details are in Tabs. 5–7.

6.3 Approximation Bound and Loss Predict Identifiability

Thm. 3 bounds the recovery error in terms of the whitening error \[ \varepsilon = \|\mathrm{Cov}(h(z)) - I\|_F \] and the alignment gap \[ \delta = \mathcal{L}(h) - 2(1-\rho)\mathrm{tr}(\mathrm{Cov}(h(z))) \]. For each run we compute $\varepsilon$, $\delta$, the bound \[ D + (\varepsilon + D)^2 \], and the actual recovery error \[ \min_{Q \in O(n)} \mathbb{E}\left[\|h(z) - Qz\|^2\right] \] (Fig. 4a). The bound holds across the grid search, the 2D mixings, the scaling experiment, and the latent-distribution sweep, which empirically supports Thm. 3. As a practical corollary, the training loss is a reliable proxy for identifiability (App. H.9).

6.4 Linear Identifiability Enables Latent-Space Planning

Thm. 4 predicts that any planner using a rotation-invariant cost achieves identical performance in the learned latent and in the true latent.

We test a natural instance, goal-reaching, where the straight line \[ \hat{z}_0 \rightarrow \hat{z}^{*} \] is the cost-minimizing trajectory.

For an encoder satisfying $h(z) \approx Qz$, this latent straight line decodes to a near-straight path in the true latent. Otherwise, the same plan induces a curved path (Fig. 17). Figs. 4c,d and 5 confirm both. The straight-latent plans of the Gaussian encoder decode to oracle-quality joint-space trajectories (left, middle). The Trajectory encoder, in contrast, increases the control cost.

Across all models, control cost tracks the linear identifiability $R^2$ monotonically (right). Linear identifiability is therefore the structural property that turns a faithful World Model into a useful planner.

7 Limitations

Are the latents Gaussian? The Gaussian is the maximum-entropy distribution for a given mean and covariance [23], which makes it the least-assumption prior. Whether real-world latents are Gaussian cannot be determined from observations alone.

However, the same holds for the non-Gaussianity assumptions of classical ICA [82]. Perhaps a structuralist perspective [83] is more agnostic. There is also a scale-of-description argument in support of this: individual micro-variables may be non-Gaussian, but task-relevant latent variables are often aggregates, which tend toward Gaussianity by the central limit theorem.

Table 2: Linear Identifiability on RL Trajectories. Mean ± standard deviation over 3 seeds. (left) Gaussian OU pairs recover the true latents, with $R^2$ increasing monotonically. (right) For RL-policy trajectories [9], identifiability decreases because of anisotropy, i.e. $\rho_0 \neq \rho_1$, and non-Gaussian transitions (Figs. 12, 13).

What if the dimension is wrong? Our theorems assume that the encoder output dimension matches the true latent dimension.

That is, $m = n$. When $m < n$, the Gaussianity constraint does not determine which subspace is selected, or whether the system relies on superposition [15, 84, 18].

When $m > n$, the extra dimensions must either collapse or encode redundancy.

Understanding this interaction is an important open problem with direct consequences for JEPA design [18].

Finite samples and optimization. Our result is a population-level statement about the global optimum.

Thm. 3 shows that the guarantee degrades continuously in the alignment gap and the covariance deviation, but it does not address how this scales with sample size [85] or training dynamics [86]. The few bound violations we observe empirically (Fig. 4a) are consistent with finite-sample estimation noise in $\varepsilon$ and $\delta$.

8 Discussion

Outlook. Linear identifiability addresses the state side of a World Model.

The action-conditioned transition \[ \hat{p}(\hat{z}' \mid \hat{z}, a) \] still has to be learned.

Extending identifiability to this setting connects naturally to interventional causal representation learning.

There, actions act as interventions on the latent variables [66, 67]. Going further, this also connects to causal graphs [87].

Our assumptions may not hold exactly in practice, but Thm. 3 shows that the guarantee degrades gracefully.

Practical Implications. There are two implications, one about data and one about the objective. On the data side, the Reacher result shows that the same physical system supports identifiability when it is sampled isotropically, i.e. under OU, but not under a goal-directed policy, because the marginals of a goal-directed policy collapse onto a low-entropy region of the latent space (Tab. 2). For self-supervised pretraining, exploration close to an isotropic random walk keeps the data inside the regime our theory covers. On the objective side, SIGReg, VICReg, and InfoNCE all yield linear identifiability when the assumptions hold. Our experiments also suggest that they fail in different ways: pair-based estimators are sensitive to the kernel choice at scale (Tab. 1), and moment-based estimators are sensitive to non-Gaussian latents (App. H.7).

Which estimator is preferable in practice, and how this choice interacts with batch size, architecture, and data regime, is left to future work.

A Theoretical Foundation for World Models.

Our theory inverts the classical ICA narrative. In linear ICA, the Gaussian is the only distribution for which source separation fails [82]. In our nonlinear setting, by contrast, the Gaussian is exactly the distribution that enables source separation.

The payoff is structural. Linear identifiability turns the learned representation into a state that a control system can use, so any orthogonally-invariant cost transfers from the true world to the learned latent without modification. Simple planning (Thm. 4) and linear probing also follow naturally. This is what it means to provably learn a World Model.
