Figure 2: CLIP-based Contrastive Learning Architecture for fMRI-to-Image Embedding
The system employs contrastive learning to align fMRI neural representations with visual features in CLIP's embedding space. The fMRI encoder learns to map brain signals to the same 512-dimensional space as CLIP image embeddings, enabling cross-modal retrieval and reconstruction tasks through similarity-based matching.
Inputs
fMRI Data: 3092 voxels per sample (Miyawaki dataset)
Visual Stimuli: paired 28×28 digit images
fMRI Processing Pipeline
Preprocessing: StandardScaler normalization (z-score standardization)
Linear Layer 1: 3092 → 4096, ReLU + Dropout(0.5) + BatchNorm
Linear Layer 2: 4096 → 2048, ReLU + Dropout(0.5) + BatchNorm
Linear Layer 3: 2048 → 1024, ReLU + Dropout(0.5) + BatchNorm
Projection Layer: 1024 → 512, L2 normalization
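
A minimal PyTorch sketch of this encoder; the class and variable names (FMRIEncoder, block, fmri_batch) are illustrative rather than taken from the original code:

import torch
import torch.nn as nn

class FMRIEncoder(nn.Module):
    # Maps 3092-voxel fMRI vectors to L2-normalized 512-d embeddings in CLIP space.
    def __init__(self, n_voxels=3092, embed_dim=512, dropout=0.5):
        super().__init__()
        def block(d_in, d_out):
            # Linear -> ReLU -> Dropout(0.5) -> BatchNorm, following the pipeline above
            return nn.Sequential(
                nn.Linear(d_in, d_out),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.BatchNorm1d(d_out),
            )
        self.backbone = nn.Sequential(
            block(n_voxels, 4096),
            block(4096, 2048),
            block(2048, 1024),
        )
        self.projection = nn.Linear(1024, embed_dim)

    def forward(self, x):
        h = self.backbone(x)
        z = self.projection(h)
        # L2 normalization puts the embeddings on the unit hypersphere, matching CLIP's
        return nn.functional.normalize(z, dim=-1)

encoder = FMRIEncoder()
fmri_batch = torch.randn(32, 3092)   # toy batch of preprocessed fMRI vectors
emb = encoder(fmri_batch)            # shape (32, 512), unit-norm rows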
Image Processing Pipeline
Image Preprocessing: 28×28 → 224×224 resize, RGB conversion, CLIP normalization
CLIP ViT-B/32: pre-trained Vision Transformer encoder (frozen)
Image Embeddings: 512-dimensional features, L2 normalized
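
The frozen CLIP branch could be driven with the openai/clip package roughly as follows; the random digit array and its PIL conversion are stand-ins for the actual stimulus loading:

import clip
import numpy as np
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # pre-trained, kept frozen
model.eval()

digit = (np.random.rand(28, 28) * 255).astype(np.uint8)    # stand-in for a 28x28 stimulus
pil = Image.fromarray(digit).convert("RGB")                 # grayscale -> RGB
image = preprocess(pil).unsqueeze(0).to(device)             # resize to 224x224 + CLIP normalization

with torch.no_grad():                                        # no gradients through the frozen encoder
    feats = model.encode_image(image)                        # (1, 512) for ViT-B/32
    feats = feats / feats.norm(dim=-1, keepdim=True)         # L2 normalize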
Contrastive Learning Alignment
fMRI Embeddings (ℝ⁵¹², learned) ↔ CLIP Embeddings (ℝ⁵¹², fixed)
Contrastive Loss Function:
logits = (fMRI_emb @ CLIP_emb.T) / temperature
L = (CE(logits, labels) + CE(logits.T, labels)) / 2
Temperature: τ = 0.07
Objective: maximize similarity for matched fMRI-image pairs while pushing apart mismatched pairs within each batch
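
Written out in PyTorch, the symmetric loss above might look like this (assuming both embedding matrices are already L2-normalized and row i of each forms a matched pair):

import torch
import torch.nn.functional as F

def contrastive_loss(fmri_emb, clip_emb, temperature=0.07):
    # fmri_emb, clip_emb: (B, 512), L2-normalized; row i of each is a matched pair
    logits = fmri_emb @ clip_emb.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(fmri_emb.size(0), device=fmri_emb.device)
    loss_f2i = F.cross_entropy(logits, labels)            # fMRI -> image direction
    loss_i2f = F.cross_entropy(logits.T, labels)          # image -> fMRI direction
    return (loss_f2i + loss_i2f) / 2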
fMRI Encoder Architecture
Layer         Input → Output   Parameters
Hidden 1      3092 → 4096      12,668,928
Hidden 2      4096 → 2048       8,390,656
Hidden 3      2048 → 1024       2,098,176
Projection    1024 → 512          524,800
Total         -                23,682,560
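
The per-layer counts follow in_features × out_features + out_features for a linear layer with bias (BatchNorm parameters are not counted here); a quick arithmetic check:

layers = [("Hidden 1", 3092, 4096), ("Hidden 2", 4096, 2048),
          ("Hidden 3", 2048, 1024), ("Projection", 1024, 512)]
total = 0
for name, d_in, d_out in layers:
    n = d_in * d_out + d_out          # weights + bias
    total += n
    print(f"{name}: {n:,}")
print(f"Total: {total:,}")            # 23,682,560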
Training Configuration
Optimizer: Adam
Learning Rate: 1e-3
Weight Decay: 1e-4
Batch Size: 32
Epochs: 100
Scheduler: CosineAnnealingLR
Dropout Rate: 0.5
Temperature: 0.07
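
Putting the configuration together, a sketch of the training loop; FMRIEncoder and contrastive_loss refer to the sketches above, and train_loader is a placeholder for a DataLoader yielding matched (fMRI, CLIP-embedding) batches of size 32:

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

model = FMRIEncoder()                                       # encoder sketched earlier
optimizer = Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=100)         # anneal over the 100 epochs

for epoch in range(100):
    for fmri_batch, clip_batch in train_loader:             # placeholder loader; clip_batch holds precomputed frozen CLIP embeddings
        optimizer.zero_grad()
        fmri_emb = model(fmri_batch)
        loss = contrastive_loss(fmri_emb, clip_batch, temperature=0.07)
        loss.backward()
        optimizer.step()
    scheduler.step()                                         # one scheduler step per epoch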
Evaluation Metrics: Top-k Retrieval Accuracy
k values: [1, 5, 10]
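
Top-k retrieval accuracy asks, for each fMRI embedding, whether the matching image embedding is among its k nearest neighbours by cosine similarity; a possible implementation:

import torch

def topk_retrieval_accuracy(fmri_emb, clip_emb, ks=(1, 5, 10)):
    # fmri_emb, clip_emb: (N, 512), L2-normalized; row i of each is a matched pair
    sims = fmri_emb @ clip_emb.T                     # (N, N) cosine similarities
    targets = torch.arange(sims.size(0), device=sims.device)
    ranks = sims.argsort(dim=1, descending=True)     # candidate images, best first
    # position of the true image within each row's ranking
    hit_pos = (ranks == targets.unsqueeze(1)).float().argmax(dim=1)
    return {k: (hit_pos < k).float().mean().item() for k in ks}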