Figure 2: CLIP-based Contrastive Learning Architecture for fMRI-to-Image Embedding
The system employs contrastive learning to align fMRI neural representations with visual features in CLIP's embedding space. The fMRI encoder learns to map brain signals to the same 512-dimensional space as CLIP image embeddings, enabling cross-modal retrieval and reconstruction tasks through similarity-based matching.
Inputs
fMRI Data: 3092 voxels per sample (Miyawaki dataset)
Visual Stimuli: paired 28×28 digit images
fMRI Processing Pipeline
Preprocessing: StandardScaler normalization (z-score standardization)
Linear Layer 1: 3092 → 4096, ReLU + Dropout(0.5) + BatchNorm
Linear Layer 2: 4096 → 2048, ReLU + Dropout(0.5) + BatchNorm
Linear Layer 3: 2048 → 1024, ReLU + Dropout(0.5) + BatchNorm
Projection Layer: 1024 → 512, L2 normalization
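
A minimal PyTorch sketch of this encoder; the class and variable names (FMRIEncoder, block, fmri_batch) are illustrative rather than taken from the original code:

import torch
import torch.nn as nn

class FMRIEncoder(nn.Module):
    # Maps 3092-voxel fMRI vectors to L2-normalized 512-d embeddings in CLIP space.
    def __init__(self, n_voxels=3092, embed_dim=512, dropout=0.5):
        super().__init__()
        def block(d_in, d_out):
            # Linear -> ReLU -> Dropout(0.5) -> BatchNorm, following the pipeline above
            return nn.Sequential(
                nn.Linear(d_in, d_out),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.BatchNorm1d(d_out),
            )
        self.backbone = nn.Sequential(
            block(n_voxels, 4096),
            block(4096, 2048),
            block(2048, 1024),
        )
        self.projection = nn.Linear(1024, embed_dim)

    def forward(self, x):
        h = self.backbone(x)
        z = self.projection(h)
        # L2 normalization puts the embeddings on the unit hypersphere, matching CLIP's
        return nn.functional.normalize(z, dim=-1)

encoder = FMRIEncoder()
fmri_batch = torch.randn(32, 3092)   # toy batch of preprocessed fMRI vectors
emb = encoder(fmri_batch)            # shape (32, 512), unit-norm rows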
Image Processing Pipeline
Image Preprocessing: 28×28 → 224×224 resize, RGB conversion, CLIP normalization
CLIP ViT-B/32: pre-trained Vision Transformer encoder (frozen)
Image Embeddings: 512-dimensional features, L2 normalized
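
The frozen CLIP branch could be driven with the openai/clip package roughly as follows; the random digit array and its PIL conversion are stand-ins for the actual stimulus loading:

import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # pre-trained, kept frozen
model.eval()

digit = (np.random.rand(28, 28) * 255).astype(np.uint8)    # stand-in for a 28x28 stimulus
pil = Image.fromarray(digit).convert("RGB")                 # grayscale -> RGB
image = preprocess(pil).unsqueeze(0).to(device)             # resize to 224x224 + CLIP normalization

with torch.no_grad():                                        # no gradients through the frozen encoder
    feats = model.encode_image(image)                        # (1, 512) for ViT-B/32
    feats = feats / feats.norm(dim=-1, keepdim=True)         # L2 normalize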
Contrastive Learning Alignment
fMRI Embeddings (ℝ⁵¹², learned) ↔ CLIP Embeddings (ℝ⁵¹², fixed)
Contrastive Loss Function:
logits = (fMRI_emb @ CLIP_emb.T) / temperature
L = (CE(logits, labels) + CE(logits.T, labels)) / 2
Temperature: τ = 0.07
Objective: maximize similarity for matched fMRI-image pairs while pushing apart mismatched pairs within each batch
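
Written out in PyTorch, the symmetric loss above might look like this (assuming both embedding matrices are already L2-normalized and row i of each forms a matched pair):

import torch
import torch.nn.functional as F

def contrastive_loss(fmri_emb, clip_emb, temperature=0.07):
    # fmri_emb, clip_emb: (B, 512), L2-normalized; row i of each is a matched pair
    logits = fmri_emb @ clip_emb.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(fmri_emb.size(0), device=fmri_emb.device)
    loss_f2i = F.cross_entropy(logits, labels)            # fMRI -> image direction
    loss_i2f = F.cross_entropy(logits.T, labels)          # image -> fMRI direction
    return (loss_f2i + loss_i2f) / 2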
fMRI Encoder Architecture
Layer         Input → Output   Parameters
Hidden 1      3092 → 4096      12,668,928
Hidden 2      4096 → 2048       8,390,656
Hidden 3      2048 → 1024       2,098,176
Projection    1024 → 512          524,800
Total         -                23,682,560
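
The per-layer counts follow in_features × out_features + out_features for a linear layer with bias (BatchNorm parameters are not counted here); a quick arithmetic check:

layers = [("Hidden 1", 3092, 4096), ("Hidden 2", 4096, 2048),
          ("Hidden 3", 2048, 1024), ("Projection", 1024, 512)]
total = 0
for name, d_in, d_out in layers:
    n = d_in * d_out + d_out          # weights + bias
    total += n
    print(f"{name}: {n:,}")
print(f"Total: {total:,}")            # 23,682,560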
Training Configuration
Optimizer: Adam
Learning Rate: 1e-3
Weight Decay: 1e-4
Batch Size: 32
Epochs: 100
Scheduler: CosineAnnealingLR
Dropout Rate: 0.5
Temperature: 0.07
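
Putting the configuration together, a sketch of the training loop; FMRIEncoder and contrastive_loss refer to the sketches above, and train_loader is a placeholder for a DataLoader yielding matched (fMRI, CLIP-embedding) batches of size 32:

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

model = FMRIEncoder()                                       # encoder sketched earlier
optimizer = Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=100)         # anneal over the 100 epochs

for epoch in range(100):
    for fmri_batch, clip_batch in train_loader:             # placeholder loader; clip_batch holds precomputed frozen CLIP embeddings
        optimizer.zero_grad()
        fmri_emb = model(fmri_batch)
        loss = contrastive_loss(fmri_emb, clip_batch, temperature=0.07)
        loss.backward()
        optimizer.step()
    scheduler.step()                                         # one scheduler step per epoch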
Evaluation Metrics: Top-k Retrieval Accuracy
k values: [1, 5, 10]
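
Top-k retrieval accuracy asks, for each fMRI embedding, whether the matching image embedding is among its k nearest neighbours by cosine similarity; a possible implementation:

import torch

def topk_retrieval_accuracy(fmri_emb, clip_emb, ks=(1, 5, 10)):
    # fmri_emb, clip_emb: (N, 512), L2-normalized; row i of each is a matched pair
    sims = fmri_emb @ clip_emb.T                     # (N, N) cosine similarities
    targets = torch.arange(sims.size(0), device=sims.device)
    ranks = sims.argsort(dim=1, descending=True)     # candidate images, best first
    # position of the true image within each row's ranking
    hit_pos = (ranks == targets.unsqueeze(1)).float().argmax(dim=1)
    return {k: (hit_pos < k).float().mean().item() for k in ks}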