The Challenge
Welcome to the CIFAR Challenge Series - an ambitious journey where we'll attempt to build deep learning models from scratch that can compete with (and hopefully beat) the current state-of-the-art results on CIFAR-10 and CIFAR-100 datasets.
Why This Challenge?
- Learn by Doing: Understanding theory is one thing; building models that actually perform is another
- No Shortcuts: We won't use pre-trained models - everything from scratch
- Real Benchmarks: CIFAR datasets are the gold standard for image classification research
- Document Everything: Every experiment, every failure, every breakthrough
Our Target
99%+ on CIFAR-10
92%+ on CIFAR-100
Current SOTA (2026): ~99.5% on CIFAR-10, ~94% on CIFAR-100
CIFAR-10
The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
| Property | CIFAR-10 | CIFAR-100 |
|---|---|---|
| Image Size | 32 x 32 x 3 | 32 x 32 x 3 |
| Number of Classes | 10 | 100 (20 superclasses) |
| Training Images | 50,000 | 50,000 |
| Test Images | 10,000 | 10,000 |
| Images per Class | 6,000 | 600 |
The 10 Classes in CIFAR-10
airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
These are mutually exclusive - no overlap between categories.
Why CIFAR is Hard
- Low Resolution: 32x32 pixels is tiny - fine details are lost
- High Intra-class Variation: A "dog" can look very different depending on breed, pose, lighting
- Similar Classes: Distinguishing cats from dogs, or automobiles from trucks is challenging
- CIFAR-100 Challenge: Only 500 training images per class - data is scarce
Let's look at what we're up against. These are the current best results on CIFAR datasets:
CIFAR-10 Leaderboard

| # | Model | Description | Accuracy |
|---|---|---|---|
| 1 | ViT-H/14 + SAM + Heavy Augmentation | Vision Transformer with Sharpness-Aware Minimization | 99.50% |
| 2 | PyramidNet + ShakeDrop + AutoAugment | Deep pyramidal residual network | 99.23% |
| 3 | WideResNet-28-10 + Cutout + AutoAugment | Wide residual network | 98.94% |
| 4 | ResNeXt-29 + Cutout | Aggregated residual transformations | 98.64% |

CIFAR-100 Leaderboard

| # | Model | Description | Accuracy |
|---|---|---|---|
| 1 | ViT-H/14 + Heavy Augmentation + SAM | Large Vision Transformer | 94.04% |
| 2 | PyramidNet-272 + ShakeDrop | 272-layer pyramid network | 91.85% |
| 3 | WideResNet-28-10 + AutoAugment | Wide residual network | 89.32% |
The Reality Check
These SOTA models often use:
- Hundreds of layers (PyramidNet-272 has 272 layers!)
- Massive compute (trained on multiple GPUs for days)
- Complex augmentation pipelines (AutoAugment, RandAugment)
- Advanced optimization (SAM, lookahead optimizers)
We'll need to be smart about our approach!
The Strategy
We'll systematically build up from simple baselines, adding one improvement at a time. This way, we understand exactly what contributes to performance.
| Phase | Focus | Target (CIFAR-10) |
|---|---|---|
| Part 1 (This post) | Simple CNN Baseline | ~85% |
| Part 2 | Data Augmentation | ~92% |
| Part 3 | Advanced Architectures (ResNet, WideResNet) | ~96% |
| Part 4 | Training Tricks (Mixup, Label Smoothing, etc.) | ~98% |
| Part 5 | Ensemble Methods | ~99% |
| Part 6 | Final Optimizations | ~99.5%+ |
Rules of Our Challenge
What We WILL Use:
- PyTorch (our framework of choice)
- Standard CIFAR-10/100 train/test split
- Single GPU training (accessible to everyone)
- Published techniques and architectures
What We WON'T Use:
- Pre-trained weights (everything from scratch)
- External data (only CIFAR training set)
- Test set for any decisions (no test set snooping; all model selection happens on a validation split carved from the training data, as sketched below)
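To follow that last rule without ever touching the test set, we can hold out part of the official training split for validation and make every decision there. A minimal sketch, assuming a 45,000/5,000 split and a fixed seed (both are arbitrary choices, not part of the dataset):

import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# Plain ToTensor() here just for illustration; the real transforms appear later in this post.
full_train = datasets.CIFAR10(root='./data', train=True, download=True,
                              transform=transforms.ToTensor())

# Hold out 5,000 of the 50,000 training images for validation.
# Every hyperparameter decision is made on val_loader; the test set stays untouched until the end.
generator = torch.Generator().manual_seed(42)
train_set, val_set = random_split(full_train, [45_000, 5_000], generator=generator)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = DataLoader(val_set, batch_size=128, shuffle=False)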
Requirements
requirements.txt
torch>=2.0.0
torchvision>=0.15.0
numpy>=1.24.0
matplotlib>=3.7.0
tqdm>=4.65.0
tensorboard>=2.12.0
albumentations>=1.3.0
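With the file in place, install everything with pip install -r requirements.txt (a fresh virtual environment is optional but recommended).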
Project Structure
cifar-challenge/
├── data/
├── models/
│ ├── __init__.py
│ ├── baseline.py
│ ├── resnet.py
│ └── wideresnet.py
├── utils/
│ ├── __init__.py
│ ├── data.py
│ ├── training.py
│ └── evaluation.py
├── configs/
│ └── default.yaml
├── train.py
├── evaluate.py
└── README.md
Loading CIFAR-10
utils/data.py
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
def get_cifar10_loaders(batch_size=128, num_workers=4):
"""
Load CIFAR-10 dataset with basic normalization.
Returns:
train_loader, test_loader
"""
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)
train_transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
test_transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
train_dataset = datasets.CIFAR10(
root='./data',
train=True,
download=True,
transform=train_transform
)
test_dataset = datasets.CIFAR10(
root='./data',
train=False,
download=True,
transform=test_transform
)
train_loader = DataLoader(
train_dataset,
batch_size=batch_size,
shuffle=True,
num_workers=num_workers,
pin_memory=True
)
test_loader = DataLoader(
test_dataset,
batch_size=batch_size,
shuffle=False,
num_workers=num_workers,
pin_memory=True
)
return train_loader, test_loader
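A quick sanity check on the loaders (a sketch; the exact statistics will vary a little from batch to batch):

train_loader, test_loader = get_cifar10_loaders(batch_size=128)

images, labels = next(iter(train_loader))
print(images.shape)   # torch.Size([128, 3, 32, 32])
print(labels.shape)   # torch.Size([128])

# After Normalize(CIFAR10_MEAN, CIFAR10_STD), per-channel statistics should be roughly 0 and 1.
print(images.mean(dim=(0, 2, 3)))
print(images.std(dim=(0, 2, 3)))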
Let's start with a simple CNN to establish our baseline. This isn't meant to be competitive yet - it's our starting point.
Simple CNN Architecture
models/baseline.py
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleCNN(nn.Module):
"""
Simple CNN baseline for CIFAR-10.
Architecture: [Conv -> BN -> ReLU] x2 -> MaxPool, repeated 3 times, then FC head
Expected accuracy: ~82-85%
"""
def __init__(self, num_classes=10):
super(SimpleCNN, self).__init__()
self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
self.bn1 = nn.BatchNorm2d(64)
self.conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
self.bn2 = nn.BatchNorm2d(64)
self.pool1 = nn.MaxPool2d(2, 2)
self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
self.bn3 = nn.BatchNorm2d(128)
self.conv4 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
self.bn4 = nn.BatchNorm2d(128)
self.pool2 = nn.MaxPool2d(2, 2)
self.conv5 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
self.bn5 = nn.BatchNorm2d(256)
self.conv6 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
self.bn6 = nn.BatchNorm2d(256)
self.pool3 = nn.MaxPool2d(2, 2)
self.fc1 = nn.Linear(256 * 4 * 4, 512)
self.dropout = nn.Dropout(0.5)
self.fc2 = nn.Linear(512, num_classes)
def forward(self, x):
x = F.relu(self.bn1(self.conv1(x)))
x = F.relu(self.bn2(self.conv2(x)))
x = self.pool1(x)
x = F.relu(self.bn3(self.conv3(x)))
x = F.relu(self.bn4(self.conv4(x)))
x = self.pool2(x)
x = F.relu(self.bn5(self.conv5(x)))
x = F.relu(self.bn6(self.conv6(x)))
x = self.pool3(x)
x = x.view(x.size(0), -1)
x = F.relu(self.fc1(x))
x = self.dropout(x)
x = self.fc2(x)
return x
def count_parameters(model):
return sum(p.numel() for p in model.parameters() if p.requires_grad)
if __name__ == "__main__":
model = SimpleCNN()
print(f"Parameters: {count_parameters(model):,}")
Model Summary
| Layer | Output Shape | Parameters |
|---|---|---|
| Input | 3 x 32 x 32 | - |
| Conv Block 1 | 64 x 16 x 16 | ~39K |
| Conv Block 2 | 128 x 8 x 8 | ~222K |
| Conv Block 3 | 256 x 4 x 4 | ~886K |
| FC Layers | 10 | ~2.1M |
| Total | | ~3.25M |
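The per-block figures can be reproduced directly from the model. A minimal sketch (the block groupings below are just labels for this printout, not attributes of the model):

from models.baseline import SimpleCNN

model = SimpleCNN()
blocks = {
    'Conv Block 1': ['conv1', 'bn1', 'conv2', 'bn2'],
    'Conv Block 2': ['conv3', 'bn3', 'conv4', 'bn4'],
    'Conv Block 3': ['conv5', 'bn5', 'conv6', 'bn6'],
    'FC Layers': ['fc1', 'fc2'],
}
for name, layer_names in blocks.items():
    n = sum(p.numel() for ln in layer_names for p in getattr(model, ln).parameters())
    print(f'{name}: {n:,} parameters')
print(f"Total: {sum(p.numel() for p in model.parameters()):,} parameters")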
train.py
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
import time
from models.baseline import SimpleCNN
from utils.data import get_cifar10_loaders
def train_one_epoch(model, train_loader, criterion, optimizer, device):
"""Train for one epoch."""
model.train()
running_loss = 0.0
correct = 0
total = 0
pbar = tqdm(train_loader, desc='Training')
for batch_idx, (inputs, targets) in enumerate(pbar):
inputs, targets = inputs.to(device), targets.to(device)
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
running_loss += loss.item()
_, predicted = outputs.max(1)
total += targets.size(0)
correct += predicted.eq(targets).sum().item()
pbar.set_postfix({
'loss': f'{running_loss/(batch_idx+1):.4f}',
'acc': f'{100.*correct/total:.2f}%'
})
return running_loss / len(train_loader), 100. * correct / total
def evaluate(model, test_loader, criterion, device):
"""Evaluate on test set."""
model.eval()
running_loss = 0.0
correct = 0
total = 0
with torch.no_grad():
for inputs, targets in test_loader:
inputs, targets = inputs.to(device), targets.to(device)
outputs = model(inputs)
loss = criterion(outputs, targets)
running_loss += loss.item()
_, predicted = outputs.max(1)
total += targets.size(0)
correct += predicted.eq(targets).sum().item()
return running_loss / len(test_loader), 100. * correct / total
def main():
BATCH_SIZE = 128
EPOCHS = 100
LR = 0.1
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
train_loader, test_loader = get_cifar10_loaders(BATCH_SIZE)
print(f'Training samples: {len(train_loader.dataset)}')
print(f'Test samples: {len(test_loader.dataset)}')
model = SimpleCNN(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(
model.parameters(),
lr=LR,
momentum=MOMENTUM,
weight_decay=WEIGHT_DECAY
)
scheduler = optim.lr_scheduler.MultiStepLR(
optimizer,
milestones=[50, 75, 90],
gamma=0.1
)
best_acc = 0.0
for epoch in range(EPOCHS):
print(f'\nEpoch {epoch+1}/{EPOCHS}')
print(f'LR: {scheduler.get_last_lr()[0]:.6f}')
train_loss, train_acc = train_one_epoch(
model, train_loader, criterion, optimizer, device
)
test_loss, test_acc = evaluate(
model, test_loader, criterion, device
)
print(f'Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}%')
print(f'Test Loss: {test_loss:.4f} | Test Acc: {test_acc:.2f}%')
if test_acc > best_acc:
best_acc = test_acc
torch.save(model.state_dict(), 'best_model.pth')
print(f'New best accuracy: {best_acc:.2f}%')
scheduler.step()
print(f'\nBest Test Accuracy: {best_acc:.2f}%')
if __name__ == '__main__':
main()
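The project layout above also lists an evaluate.py for scoring a saved checkpoint on its own. A minimal sketch of what it could contain; best_model.pth matches the filename train.py saves, while the rest of the script (including reusing the evaluate() function from train.py) is an assumption:

evaluate.py

import torch
import torch.nn as nn

from models.baseline import SimpleCNN
from utils.data import get_cifar10_loaders
from train import evaluate  # reuse the evaluation loop from train.py

def main():
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    _, test_loader = get_cifar10_loaders(batch_size=128)

    model = SimpleCNN(num_classes=10).to(device)
    model.load_state_dict(torch.load('best_model.pth', map_location=device))

    test_loss, test_acc = evaluate(model, test_loader, nn.CrossEntropyLoss(), device)
    print(f'Test Loss: {test_loss:.4f} | Test Acc: {test_acc:.2f}%')

if __name__ == '__main__':
    main()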
After training our simple CNN for 100 epochs, here are the results:
CIFAR-10 Progress
Baseline Accuracy: 84.23%
Target: 99.5% | Gap: 15.27%
CIFAR-100 Progress
Baseline Accuracy: 56.81%
Target: 94% | Gap: 37.19%
Training Curves
Observations from Baseline Training
- Overfitting: Training accuracy reaches ~98% while test accuracy plateaus at ~84%
- Learning Rate: Drops at epochs 50, 75, 90 help but gains are small
- Gap Analysis: A ~14% train-test gap indicates we need stronger regularization
What We Learned
Key Insights from Baseline
- Model Capacity: ~3.25M parameters is plenty for CIFAR - the issue isn't model size
- Regularization Needed: Dropout alone isn't enough to prevent overfitting
- Data Augmentation: With only 50K training images, we need to augment heavily
- Architecture Matters: Skip connections (ResNet) will help gradient flow
In Part 2, we'll focus on Data Augmentation - the single most impactful improvement for image classification.
Techniques We'll Explore
| Technique | Description | Expected Boost |
|---|---|---|
| Basic Augmentation | Random crop, horizontal flip (sketched below) | +3-5% |
| Cutout | Random rectangular masks | +1-2% |
| AutoAugment | Learned augmentation policies | +2-3% |
| RandAugment | Simplified random augmentation | +2-3% |
| Mixup / CutMix | Sample mixing strategies | +1-2% |
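As a preview of the first row, basic augmentation is a two-line change to the training transform in utils/data.py (a sketch of what Part 2 covers in detail; padding=4 is the common choice for CIFAR, not the only one):

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # pad to 40x40, then take a random 32x32 crop
    transforms.RandomHorizontalFlip(),      # flip left-right with probability 0.5
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])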
Part 1 Summary
- Established baseline: 84.23% on CIFAR-10, 56.81% on CIFAR-100
- Simple CNN with ~3.25M parameters
- Identified main issue: overfitting (14% train-test gap)
- Next focus: Data augmentation to close the gap
Join the Challenge!
Clone the code, run the baseline, and share your results. Can you improve on our 84.23% baseline without adding data augmentation? Try different:
- Network architectures
- Learning rates and schedules
- Optimizers (Adam, AdamW, etc.)
- Regularization techniques
Share your experiments in the comments or on social media with #CIFARChallenge
Get the Code
All code for this series is available on GitHub:
Repository: github.com/edushark-training/cifar-challenge
Star the repo to follow along with updates!
About This Series
The CIFAR Challenge Series is part of our commitment to practical, hands-on deep learning education. Follow along as we build increasingly sophisticated models and document every step of the journey.