The Challenge
Welcome to the CIFAR Challenge Series - an ambitious journey where we'll attempt to build deep learning models from scratch that can compete with (and hopefully beat) the current state-of-the-art results on CIFAR-10 and CIFAR-100 datasets.
Why This Challenge?
- Learn by Doing: Understanding theory is one thing; building models that actually perform is another
- No Shortcuts: We won't use pre-trained models - everything from scratch
- Real Benchmarks: CIFAR datasets are the gold standard for image classification research
- Document Everything: Every experiment, every failure, every breakthrough
Our Target
99%+ on CIFAR-10
92%+ on CIFAR-100
Current SOTA (2026): ~99.5% on CIFAR-10, ~94% on CIFAR-100
CIFAR-10
The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
| Property | CIFAR-10 | CIFAR-100 |
|---|---|---|
| Image Size | 32 x 32 x 3 | 32 x 32 x 3 |
| Number of Classes | 10 | 100 (20 superclasses) |
| Training Images | 50,000 | 50,000 |
| Test Images | 10,000 | 10,000 |
| Images per Class | 6,000 | 600 |
The 10 Classes in CIFAR-10
airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
These are mutually exclusive - no overlap between categories.
Why CIFAR is Hard
- Low Resolution: 32x32 pixels is tiny - fine details are lost
- High Intra-class Variation: A "dog" can look very different depending on breed, pose, lighting
- Similar Classes: Distinguishing cats from dogs, or automobiles from trucks is challenging
- CIFAR-100 Challenge: Only 500 training images per class - data is scarce
Let's look at what we're up against. These are the current best results on CIFAR datasets:
CIFAR-10 Leaderboard

| # | Model | Description | Accuracy |
|---|---|---|---|
| 1 | ViT-H/14 + SAM + Heavy Augmentation | Vision Transformer with Sharpness-Aware Minimization | 99.50% |
| 2 | PyramidNet + ShakeDrop + AutoAugment | Deep pyramidal residual network | 99.23% |
| 3 | WideResNet-28-10 + Cutout + AutoAugment | Wide residual network | 98.94% |
| 4 | ResNeXt-29 + Cutout | Aggregated residual transformations | 98.64% |

CIFAR-100 Leaderboard

| # | Model | Description | Accuracy |
|---|---|---|---|
| 1 | ViT-H/14 + Heavy Augmentation + SAM | Large Vision Transformer | 94.04% |
| 2 | PyramidNet-272 + ShakeDrop | 272-layer pyramid network | 91.85% |
| 3 | WideResNet-28-10 + AutoAugment | Wide residual network | 89.32% |
The Reality Check
These SOTA models often use:
- Hundreds of layers (PyramidNet-272 has 272 layers!)
- Massive compute (trained on multiple GPUs for days)
- Complex augmentation pipelines (AutoAugment, RandAugment)
- Advanced optimization (SAM, lookahead optimizers)
We'll need to be smart about our approach!
The Strategy
We'll systematically build up from simple baselines, adding one improvement at a time. This way, we understand exactly what contributes to performance.
| Phase | Focus | Target (CIFAR-10) |
|---|---|---|
| Part 1 (This post) | Simple CNN Baseline | ~85% |
| Part 2 | Data Augmentation | ~92% |
| Part 3 | Advanced Architectures (ResNet, WideResNet) | ~96% |
| Part 4 | Training Tricks (Mixup, Label Smoothing, etc.) | ~98% |
| Part 5 | Ensemble Methods | ~99% |
| Part 6 | Final Optimizations | ~99.5%+ |
Rules of Our Challenge
What We WILL Use:
- PyTorch (our framework of choice)
- Standard CIFAR-10/100 train/test split
- Single GPU training (accessible to everyone)
- Published techniques and architectures
What We WON'T Use:
- Pre-trained weights (everything from scratch)
- External data (only CIFAR training set)
- Test set for any decisions (no test set snooping; all model selection happens on a validation split carved from the training data, as sketched below)
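To follow that last rule without ever touching the test set, we can hold out part of the official training split for validation and make every decision there. A minimal sketch, assuming a 45,000/5,000 split and a fixed seed (both are arbitrary choices, not part of the dataset):

import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# Plain ToTensor() here just for illustration; the real transforms appear later in this post.
full_train = datasets.CIFAR10(root='./data', train=True, download=True,
                              transform=transforms.ToTensor())

# Hold out 5,000 of the 50,000 training images for validation.
# Every hyperparameter decision is made on val_loader; the test set stays untouched until the end.
generator = torch.Generator().manual_seed(42)
train_set, val_set = random_split(full_train, [45_000, 5_000], generator=generator)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = DataLoader(val_set, batch_size=128, shuffle=False)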
Requirements
requirements.txt
torch>=2.0.0
torchvision>=0.15.0
numpy>=1.24.0
matplotlib>=3.7.0
tqdm>=4.65.0
tensorboard>=2.12.0
albumentations>=1.3.0
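With the file in place, install everything with pip install -r requirements.txt (a fresh virtual environment is optional but recommended).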
Project Structure
cifar-challenge/
├── data/
├── models/
│ ├── __init__.py
│ ├── baseline.py
│ ├── resnet.py
│ └── wideresnet.py
├── utils/
│ ├── __init__.py
│ ├── data.py
│ ├── training.py
│ └── evaluation.py
├── configs/
│ └── default.yaml
├── train.py
├── evaluate.py
└── README.md
Loading CIFAR-10
utils/data.py
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
def get_cifar10_loaders(batch_size=128, num_workers=4):
"""
Load CIFAR-10 dataset with basic normalization.
Returns:
train_loader, test_loader
"""
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)
train_transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
test_transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
train_dataset = datasets.CIFAR10(
root='./data',
train=True,
download=True,
transform=train_transform
)
test_dataset = datasets.CIFAR10(
root='./data',
train=False,
download=True,
transform=test_transform
)
train_loader = DataLoader(
train_dataset,
batch_size=batch_size,
shuffle=True,
num_workers=num_workers,
pin_memory=True
)
test_loader = DataLoader(
test_dataset,
batch_size=batch_size,
shuffle=False,
num_workers=num_workers,
pin_memory=True
)
return train_loader, test_loader
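A quick sanity check on the loaders (a sketch; the exact statistics will vary a little from batch to batch):

train_loader, test_loader = get_cifar10_loaders(batch_size=128)

images, labels = next(iter(train_loader))
print(images.shape)   # torch.Size([128, 3, 32, 32])
print(labels.shape)   # torch.Size([128])

# After Normalize(CIFAR10_MEAN, CIFAR10_STD), per-channel statistics should be roughly 0 and 1.
print(images.mean(dim=(0, 2, 3)))
print(images.std(dim=(0, 2, 3)))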
Let's start with a simple CNN to establish our baseline. This isn't meant to be competitive yet - it's our starting point.
Simple CNN Architecture
models/baseline.py
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleCNN(nn.Module):
"""
Simple CNN baseline for CIFAR-10.
Architecture: [Conv -> BN -> ReLU] x2 -> MaxPool, repeated 3 times, then FC head
Expected accuracy: ~82-85%
"""
def __init__(self, num_classes=10):
super(SimpleCNN, self).__init__()
self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
self.bn1 = nn.BatchNorm2d(64)
self.conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
self.bn2 = nn.BatchNorm2d(64)
self.pool1 = nn.MaxPool2d(2, 2)
self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
self.bn3 = nn.BatchNorm2d(128)
self.conv4 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
self.bn4 = nn.BatchNorm2d(128)
self.pool2 = nn.MaxPool2d(2, 2)
self.conv5 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
self.bn5 = nn.BatchNorm2d(256)
self.conv6 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
self.bn6 = nn.BatchNorm2d(256)
self.pool3 = nn.MaxPool2d(2, 2)
self.fc1 = nn.Linear(256 * 4 * 4, 512)
self.dropout = nn.Dropout(0.5)
self.fc2 = nn.Linear(512, num_classes)
def forward(self, x):
x = F.relu(self.bn1(self.conv1(x)))
x = F.relu(self.bn2(self.conv2(x)))
x = self.pool1(x)
x = F.relu(self.bn3(self.conv3(x)))
x = F.relu(self.bn4(self.conv4(x)))
x = self.pool2(x)
x = F.relu(self.bn5(self.conv5(x)))
x = F.relu(self.bn6(self.conv6(x)))
x = self.pool3(x)
x = x.view(x.size(0), -1)
x = F.relu(self.fc1(x))
x = self.dropout(x)
x = self.fc2(x)
return x
def count_parameters(model):
return sum(p.numel() for p in model.parameters() if p.requires_grad)
if __name__ == "__main__":
model = SimpleCNN()
print(f"Parameters: {count_parameters(model):,}")
Model Summary
| Layer | Output Shape | Parameters |
|---|---|---|
| Input | 3 x 32 x 32 | - |
| Conv Block 1 | 64 x 16 x 16 | ~39K |
| Conv Block 2 | 128 x 8 x 8 | ~222K |
| Conv Block 3 | 256 x 4 x 4 | ~886K |
| FC Layers | 10 | ~2.1M |
| Total | | ~3.25M |
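The per-block figures can be reproduced directly from the model. A minimal sketch (the block groupings below are just labels for this printout, not attributes of the model):

from models.baseline import SimpleCNN

model = SimpleCNN()
blocks = {
    'Conv Block 1': ['conv1', 'bn1', 'conv2', 'bn2'],
    'Conv Block 2': ['conv3', 'bn3', 'conv4', 'bn4'],
    'Conv Block 3': ['conv5', 'bn5', 'conv6', 'bn6'],
    'FC Layers': ['fc1', 'fc2'],
}
for name, layer_names in blocks.items():
    n = sum(p.numel() for ln in layer_names for p in getattr(model, ln).parameters())
    print(f'{name}: {n:,} parameters')
print(f"Total: {sum(p.numel() for p in model.parameters()):,} parameters")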
train.py
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
import time
from models.baseline import SimpleCNN
from utils.data import get_cifar10_loaders
def train_one_epoch(model, train_loader, criterion, optimizer, device):
"""Train for one epoch."""
model.train()
running_loss = 0.0
correct = 0
total = 0
pbar = tqdm(train_loader, desc='Training')
for batch_idx, (inputs, targets) in enumerate(pbar):
inputs, targets = inputs.to(device), targets.to(device)
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
running_loss += loss.item()
_, predicted = outputs.max(1)
total += targets.size(0)
correct += predicted.eq(targets).sum().item()
pbar.set_postfix({
'loss': f'{running_loss/(batch_idx+1):.4f}',
'acc': f'{100.*correct/total:.2f}%'
})
return running_loss / len(train_loader), 100. * correct / total
def evaluate(model, test_loader, criterion, device):
"""Evaluate on test set."""
model.eval()
running_loss = 0.0
correct = 0
total = 0
with torch.no_grad():
for inputs, targets in test_loader:
inputs, targets = inputs.to(device), targets.to(device)
outputs = model(inputs)
loss = criterion(outputs, targets)
running_loss += loss.item()
_, predicted = outputs.max(1)
total += targets.size(0)
correct += predicted.eq(targets).sum().item()
return running_loss / len(test_loader), 100. * correct / total
def main():
BATCH_SIZE = 128
EPOCHS = 100
LR = 0.1
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
train_loader, test_loader = get_cifar10_loaders(BATCH_SIZE)
print(f'Training samples: {len(train_loader.dataset)}')
print(f'Test samples: {len(test_loader.dataset)}')
model = SimpleCNN(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(
model.parameters(),
lr=LR,
momentum=MOMENTUM,
weight_decay=WEIGHT_DECAY
)
scheduler = optim.lr_scheduler.MultiStepLR(
optimizer,
milestones=[50, 75, 90],
gamma=0.1
)
best_acc = 0.0
for epoch in range(EPOCHS):
print(f'\nEpoch {epoch+1}/{EPOCHS}')
print(f'LR: {scheduler.get_last_lr()[0]:.6f}')
train_loss, train_acc = train_one_epoch(
model, train_loader, criterion, optimizer, device
)
test_loss, test_acc = evaluate(
model, test_loader, criterion, device
)
print(f'Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}%')
print(f'Test Loss: {test_loss:.4f} | Test Acc: {test_acc:.2f}%')
if test_acc > best_acc:
best_acc = test_acc
torch.save(model.state_dict(), 'best_model.pth')
print(f'New best accuracy: {best_acc:.2f}%')
scheduler.step()
print(f'\nBest Test Accuracy: {best_acc:.2f}%')
if __name__ == '__main__':
main()
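The project layout above also lists an evaluate.py for scoring a saved checkpoint on its own. A minimal sketch of what it could contain; best_model.pth matches the filename train.py saves, while the rest of the script (including reusing the evaluate() function from train.py) is an assumption:

evaluate.py

import torch
import torch.nn as nn

from models.baseline import SimpleCNN
from utils.data import get_cifar10_loaders
from train import evaluate  # reuse the evaluation loop from train.py

def main():
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    _, test_loader = get_cifar10_loaders(batch_size=128)

    model = SimpleCNN(num_classes=10).to(device)
    model.load_state_dict(torch.load('best_model.pth', map_location=device))

    test_loss, test_acc = evaluate(model, test_loader, nn.CrossEntropyLoss(), device)
    print(f'Test Loss: {test_loss:.4f} | Test Acc: {test_acc:.2f}%')

if __name__ == '__main__':
    main()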
After training our simple CNN for 100 epochs, here are the results:
CIFAR-10 Progress
Baseline Accuracy: 84.23%
Target: 99.5% | Gap: 15.27%
CIFAR-100 Progress
Baseline Accuracy: 56.81%
Target: 94% | Gap: 37.19%
Training Curves
Observations from Baseline Training
- Overfitting: Training accuracy reaches ~98% while test accuracy plateaus at ~84%
- Learning Rate: Drops at epochs 50, 75, 90 help but gains are small
- Gap Analysis: A ~14% train-test gap indicates we need stronger regularization
What We Learned
Key Insights from Baseline
- Model Capacity: ~3.25M parameters is plenty for CIFAR - the issue isn't model size
- Regularization Needed: Dropout alone isn't enough to prevent overfitting
- Data Augmentation: With only 50K training images, we need to augment heavily
- Architecture Matters: Skip connections (ResNet) will help gradient flow
In Part 2, we'll focus on Data Augmentation - the single most impactful improvement for image classification.
Techniques We'll Explore
| Technique | Description | Expected Boost |
|---|---|---|
| Basic Augmentation | Random crop, horizontal flip (sketched below) | +3-5% |
| Cutout | Random rectangular masks | +1-2% |
| AutoAugment | Learned augmentation policies | +2-3% |
| RandAugment | Simplified random augmentation | +2-3% |
| Mixup / CutMix | Sample mixing strategies | +1-2% |
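As a preview of the first row, basic augmentation is a two-line change to the training transform in utils/data.py (a sketch of what Part 2 covers in detail; padding=4 is the common choice for CIFAR, not the only one):

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # pad to 40x40, then take a random 32x32 crop
    transforms.RandomHorizontalFlip(),      # flip left-right with probability 0.5
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])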
Part 1 Summary
- Established baseline: 84.23% on CIFAR-10, 56.81% on CIFAR-100
- Simple CNN with ~3.25M parameters
- Identified main issue: overfitting (14% train-test gap)
- Next focus: Data augmentation to close the gap
Join the Challenge!
Clone the code, run the baseline, and share your results. Can you improve on our 84.23% baseline without adding data augmentation? Try different:
- Network architectures
- Learning rates and schedules
- Optimizers (Adam, AdamW, etc.)
- Regularization techniques
Share your experiments in the comments or on social media with #CIFARChallenge
Get the Code
All code for this series is available on GitHub:
Repository: github.com/edushark-training/cifar-challenge
Star the repo to follow along with updates!
About This Series
The CIFAR Challenge Series is part of our commitment to practical, hands-on deep learning education. Follow along as we build increasingly sophisticated models and document every step of the journey.