Challenge Progress
Previous: 84.23% | Target This Part: 92%
Why Data Augmentation?
In Part 1, we observed a 14% gap between training accuracy (98%) and test accuracy (84%). This is classic overfitting - our model memorized the training data instead of learning generalizable features.
Data augmentation is one of the most effective techniques for combating overfitting. By artificially expanding our training set through transformations, we:
- Increase effective dataset size - 50K images become millions of variations
- Teach invariances - A flipped cat is still a cat
- Reduce overfitting - Model sees different versions each epoch
- Improve generalization - Better performance on unseen data
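The "millions of variations" claim is easy to sanity-check against the standard CIFAR-10 recipe (pad-4 random crop plus horizontal flip). A back-of-the-envelope count (the variable names below are mine, not from the series' code):

```python
# Count the distinct views that pad-4 crop + horizontal flip can produce
# for a single 32x32 CIFAR image.
IMAGE_SIZE = 32
PADDING = 4

padded = IMAGE_SIZE + 2 * PADDING                  # 40x40 after padding
crop_positions = (padded - IMAGE_SIZE + 1) ** 2    # 9 * 9 = 81 placements
flip_states = 2                                    # original or mirrored

variants_per_image = crop_positions * flip_states  # 162
total_variants = 50_000 * variants_per_image       # across the train set

print(variants_per_image)  # 162
print(total_variants)      # 8100000
```

Add color jittering or rotation and the count becomes effectively unbounded, since those sample continuous parameters.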
| Technique | Description | Typical Gain |
|---|---|---|
| Basic Augmentation | Flip, Crop, Rotate | +5-7% |
| Cutout | Random occlusion | +1-2% |
| AutoAugment | Learned policies | +2-3% |
| Mixup/CutMix | Sample mixing | +1-2% |
Let's start with the fundamentals that every image classifier should use.
1.1 Random Horizontal Flip
The simplest augmentation - flip images horizontally with 50% probability. This works because most objects look the same when mirrored.
augmentations/basic.py
import torch
import torchvision.transforms as T
import numpy as np
from PIL import Image
class RandomHorizontalFlip:
"""
Flip image horizontally with probability p.
Why it works:
- A cat facing left is still a cat facing right
- Doubles effective dataset size
- No information loss
When NOT to use:
- Text recognition (letters become mirrored)
- Directional data (traffic signs with arrows)
"""
def __init__(self, p=0.5):
self.p = p
def __call__(self, img):
if np.random.random() < self.p:
return img.transpose(Image.FLIP_LEFT_RIGHT)
return img
flip_transform = T.RandomHorizontalFlip(p=0.5)
1.2 Random Crop with Padding
Pad the image and then randomly crop back to original size. This simulates small translations and teaches position invariance.
augmentations/basic.py (continued)
class RandomCropWithPadding:
"""
Pad image and randomly crop to original size.
For CIFAR (32x32):
- Pad by 4 pixels on each side → 40x40
- Randomly crop back to 32x32
- This gives 81 possible positions (9x9 grid)
Effect: Teaches translation invariance
"""
def __init__(self, size=32, padding=4, fill=0):
self.size = size
self.padding = padding
self.fill = fill
def __call__(self, img):
img = np.array(img)
padded = np.pad(
img,
((self.padding, self.padding),
(self.padding, self.padding),
(0, 0)),
mode='constant',
constant_values=self.fill
)
h, w = padded.shape[:2]
top = np.random.randint(0, h - self.size + 1)
left = np.random.randint(0, w - self.size + 1)
cropped = padded[top:top+self.size, left:left+self.size]
return Image.fromarray(cropped)
crop_transform = T.RandomCrop(32, padding=4, padding_mode='reflect')
1.3 Color Jittering
Randomly adjust brightness, contrast, saturation, and hue. This helps the model become robust to lighting variations.
augmentations/basic.py (continued)
class ColorJitter:
"""
Randomly change brightness, contrast, saturation, hue.
Parameters (typical ranges for CIFAR):
- brightness: 0.2 (±20% brightness change)
- contrast: 0.2 (±20% contrast change)
- saturation: 0.2 (±20% saturation change)
- hue: 0.1 (±10% hue shift)
Be careful: Too much jittering can destroy important color information
"""
def __init__(self, brightness=0.2, contrast=0.2,
saturation=0.2, hue=0.1):
self.brightness = brightness
self.contrast = contrast
self.saturation = saturation
self.hue = hue
def __call__(self, img):
brightness_factor = 1 + np.random.uniform(-self.brightness, self.brightness)
img = T.functional.adjust_brightness(img, brightness_factor)
contrast_factor = 1 + np.random.uniform(-self.contrast, self.contrast)
img = T.functional.adjust_contrast(img, contrast_factor)
saturation_factor = 1 + np.random.uniform(-self.saturation, self.saturation)
img = T.functional.adjust_saturation(img, saturation_factor)
hue_factor = np.random.uniform(-self.hue, self.hue)
img = T.functional.adjust_hue(img, hue_factor)
return img
color_transform = T.ColorJitter(
brightness=0.2,
contrast=0.2,
saturation=0.2,
hue=0.1
)
1.4 Random Rotation
augmentations/basic.py (continued)
class RandomRotation:
"""
Rotate image by random angle within range.
For CIFAR, use small angles (±15°) to avoid:
- Black corners from rotation
- Unrealistic orientations
Note: Some classes are rotation-sensitive (e.g., the digits 6 vs 9 in MNIST); CIFAR-10's object classes tolerate small rotations
"""
def __init__(self, degrees=15):
self.degrees = degrees
def __call__(self, img):
angle = np.random.uniform(-self.degrees, self.degrees)
return img.rotate(angle, resample=Image.BILINEAR, fillcolor=0)
rotation_transform = T.RandomRotation(degrees=15, fill=0)
1.5 Complete Basic Pipeline
augmentations/basic.py (complete)
import torchvision.transforms as T
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)
def get_basic_augmentation():
"""
Basic augmentation pipeline for CIFAR-10.
Expected improvement: +5-7% accuracy
"""
train_transform = T.Compose([
T.RandomCrop(32, padding=4, padding_mode='reflect'),
T.RandomHorizontalFlip(p=0.5),
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
test_transform = T.Compose([
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
return train_transform, test_transform
def get_strong_basic_augmentation():
"""
Stronger basic augmentation with color jittering.
"""
train_transform = T.Compose([
T.RandomCrop(32, padding=4, padding_mode='reflect'),
T.RandomHorizontalFlip(p=0.5),
T.ColorJitter(
brightness=0.2,
contrast=0.2,
saturation=0.2,
hue=0.1
),
T.RandomRotation(degrees=15),
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
test_transform = T.Compose([
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
return train_transform, test_transform
Basic Augmentation Results: 89.47% (+5.24% over baseline)
Cutout (DeVries & Taylor, 2017) randomly masks out square regions of the input image during training. This forces the network to focus on multiple parts of the image rather than relying on a single discriminative region.
Why Cutout Works
- Prevents over-reliance on specific features (e.g., just the eyes of a cat)
- Simulates occlusion - objects in real world are often partially hidden
- Acts as regularization - similar effect to dropout but in input space
2.1 Cutout Implementation
augmentations/cutout.py
import torch
import numpy as np
class Cutout:
"""
Randomly mask out one or more patches from an image.
Args:
n_holes (int): Number of patches to cut out
length (int): Length (in pixels) of each square patch
For CIFAR-10 (32x32 images):
- Recommended: n_holes=1, length=16
- This masks up to 25% of the image (about 19% on average, since patches near the border are clipped)
Paper: "Improved Regularization of Convolutional Neural Networks with Cutout"
https://arxiv.org/abs/1708.04552
"""
def __init__(self, n_holes=1, length=16):
self.n_holes = n_holes
self.length = length
def __call__(self, img):
"""
Args:
img (Tensor): Tensor image of size (C, H, W)
Returns:
Tensor: Image with n_holes of dimension length x length cut out
"""
h = img.size(1)
w = img.size(2)
mask = np.ones((h, w), np.float32)
for _ in range(self.n_holes):
y = np.random.randint(h)
x = np.random.randint(w)
y1 = np.clip(y - self.length // 2, 0, h)
y2 = np.clip(y + self.length // 2, 0, h)
x1 = np.clip(x - self.length // 2, 0, w)
x2 = np.clip(x + self.length // 2, 0, w)
mask[y1:y2, x1:x2] = 0
mask = torch.from_numpy(mask)
mask = mask.expand_as(img)
img = img * mask
return img
class CutoutPIL:
"""
Cutout for PIL images (before ToTensor).
Fills with mean color instead of zero.
"""
def __init__(self, n_holes=1, length=16, fill_color=(125, 123, 114)):
self.n_holes = n_holes
self.length = length
self.fill_color = fill_color
def __call__(self, img):
import PIL.ImageDraw as ImageDraw
img = img.copy()
w, h = img.size
draw = ImageDraw.Draw(img)
for _ in range(self.n_holes):
y = np.random.randint(h)
x = np.random.randint(w)
y1 = np.clip(y - self.length // 2, 0, h)
y2 = np.clip(y + self.length // 2, 0, h)
x1 = np.clip(x - self.length // 2, 0, w)
x2 = np.clip(x + self.length // 2, 0, w)
draw.rectangle([x1, y1, x2, y2], fill=self.fill_color)
return img
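One subtlety worth knowing: because patch centers are sampled uniformly and the mask is clipped at the image border, a length-16 patch covers 25% of a 32x32 image only in the best case. A short exact computation (pure Python, my own variable names) gives the average:

```python
# Expected Cutout coverage with border clipping. The center coordinate is
# uniform over [0, size); each axis clips independently, so the expected
# masked area factorizes into (expected extent per axis)^2.
size, length = 32, 16

covered = [
    min(c + length // 2, size) - max(c - length // 2, 0)
    for c in range(size)
]
avg_1d = sum(covered) / size          # expected extent along one axis: 14.0

avg_fraction = (avg_1d / size) ** 2   # expected masked fraction
print(f"{avg_fraction:.1%}")          # 19.1%
```

So the effective occlusion is closer to a fifth of the image than a quarter.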
2.2 Using Cutout in Training Pipeline
augmentations/cutout.py (continued)
import torchvision.transforms as T
def get_cutout_augmentation(cutout_length=16):
"""
Training pipeline with Cutout augmentation.
Pipeline order matters:
1. Spatial transforms (crop, flip) - on PIL image
2. ToTensor - convert to tensor
3. Normalize - standardize values
4. Cutout - mask after normalization
Why Cutout after normalization?
- Masking with 0 after normalization corresponds to the dataset mean color - a "neutral" patch
- Before normalization, 0 would be pure black, a strong dark signal
"""
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)
train_transform = T.Compose([
T.RandomCrop(32, padding=4, padding_mode='reflect'),
T.RandomHorizontalFlip(),
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
Cutout(n_holes=1, length=cutout_length),
])
test_transform = T.Compose([
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
return train_transform, test_transform
def cutout_ablation_study():
"""
Cutout length ablation for CIFAR-10:
Length | % Masked | Accuracy
-------|----------|----------
8 | 6.25% | 90.12%
12 | 14.1% | 90.89%
16 | 25% | 91.23% ← Best
20 | 39% | 90.78%
24 | 56% | 89.45%
Optimal: mask ~25% of image (length=16 for 32x32)
"""
lengths = [8, 12, 16, 20, 24]
for length in lengths:
percent_masked = (length ** 2) / (32 ** 2) * 100
print(f"Length {length}: {percent_masked:.1f}% masked")
Basic + Cutout Results: 91.23% (+1.76% over basic augmentation)
AutoAugment (Cubuk et al., 2018) uses reinforcement learning to search for the best augmentation policy. Instead of manually designing augmentations, let the algorithm find what works best!
3.1 Understanding AutoAugment Policies
An AutoAugment policy consists of:
- Sub-policies: 25 in total; one is selected at random for each image
- Operations: each sub-policy applies 2 operations in sequence
- Parameters: each operation has its own probability and magnitude
augmentations/autoaugment.py
import torch
import torchvision.transforms as T
from PIL import Image, ImageOps, ImageEnhance
import numpy as np
import random
class AutoAugmentOperations:
"""
Collection of augmentation operations used in AutoAugment.
Each operation takes a PIL image and magnitude (0-10).
"""
@staticmethod
def shear_x(img, magnitude):
"""Shear image along x-axis"""
magnitude = magnitude * 0.3 / 10
if random.random() > 0.5:
magnitude = -magnitude
return img.transform(
img.size, Image.AFFINE,
(1, magnitude, 0, 0, 1, 0),
resample=Image.BILINEAR
)
@staticmethod
def shear_y(img, magnitude):
"""Shear image along y-axis"""
magnitude = magnitude * 0.3 / 10
if random.random() > 0.5:
magnitude = -magnitude
return img.transform(
img.size, Image.AFFINE,
(1, 0, 0, magnitude, 1, 0),
resample=Image.BILINEAR
)
@staticmethod
def translate_x(img, magnitude):
"""Translate image horizontally"""
magnitude = magnitude * img.size[0] * 0.45 / 10
if random.random() > 0.5:
magnitude = -magnitude
return img.transform(
img.size, Image.AFFINE,
(1, 0, magnitude, 0, 1, 0),
resample=Image.BILINEAR
)
@staticmethod
def translate_y(img, magnitude):
"""Translate image vertically"""
magnitude = magnitude * img.size[1] * 0.45 / 10
if random.random() > 0.5:
magnitude = -magnitude
return img.transform(
img.size, Image.AFFINE,
(1, 0, 0, 0, 1, magnitude),
resample=Image.BILINEAR
)
@staticmethod
def rotate(img, magnitude):
"""Rotate image"""
magnitude = magnitude * 30 / 10
if random.random() > 0.5:
magnitude = -magnitude
return img.rotate(magnitude, resample=Image.BILINEAR)
@staticmethod
def auto_contrast(img, magnitude):
"""Apply auto contrast"""
return ImageOps.autocontrast(img)
@staticmethod
def invert(img, magnitude):
"""Invert colors"""
return ImageOps.invert(img)
@staticmethod
def equalize(img, magnitude):
"""Histogram equalization"""
return ImageOps.equalize(img)
@staticmethod
def solarize(img, magnitude):
"""Solarize: invert pixels above a threshold (lower threshold = stronger)"""
threshold = 256 - int((magnitude / 10) * 256)
return ImageOps.solarize(img, threshold)
@staticmethod
def posterize(img, magnitude):
"""Reduce bits per channel (fewer bits = stronger)"""
bits = 8 - int((magnitude / 10) * 4)
return ImageOps.posterize(img, bits)
@staticmethod
def contrast(img, magnitude):
"""Adjust contrast"""
factor = 1 + (magnitude / 10) * 0.9
if random.random() > 0.5:
factor = 1 / factor
return ImageEnhance.Contrast(img).enhance(factor)
@staticmethod
def brightness(img, magnitude):
"""Adjust brightness"""
factor = 1 + (magnitude / 10) * 0.9
if random.random() > 0.5:
factor = 1 / factor
return ImageEnhance.Brightness(img).enhance(factor)
@staticmethod
def sharpness(img, magnitude):
"""Adjust sharpness"""
factor = 1 + (magnitude / 10) * 0.9
if random.random() > 0.5:
factor = 1 / factor
return ImageEnhance.Sharpness(img).enhance(factor)
@staticmethod
def color(img, magnitude):
"""Adjust color saturation"""
factor = 1 + (magnitude / 10) * 0.9
if random.random() > 0.5:
factor = 1 / factor
return ImageEnhance.Color(img).enhance(factor)
@staticmethod
def identity(img, magnitude):
"""Return unchanged image"""
return img
3.2 CIFAR-10 Policy (From Paper)
augmentations/autoaugment.py (continued)
CIFAR10_POLICY = [
[('Invert', 0.1, 7), ('Contrast', 0.2, 6)],
[('Rotate', 0.7, 2), ('TranslateX', 0.3, 9)],
[('Sharpness', 0.8, 1), ('Sharpness', 0.9, 3)],
[('ShearY', 0.5, 8), ('TranslateY', 0.7, 9)],
[('AutoContrast', 0.5, 8), ('Equalize', 0.9, 2)],
[('ShearY', 0.2, 7), ('Posterize', 0.3, 7)],
[('Color', 0.4, 3), ('Brightness', 0.6, 7)],
[('Sharpness', 0.3, 9), ('Brightness', 0.7, 9)],
[('Equalize', 0.6, 5), ('Equalize', 0.5, 1)],
[('Contrast', 0.6, 7), ('Sharpness', 0.6, 5)],
[('Color', 0.7, 7), ('TranslateX', 0.5, 8)],
[('Equalize', 0.3, 7), ('AutoContrast', 0.4, 8)],
[('TranslateY', 0.4, 3), ('Sharpness', 0.2, 6)],
[('Brightness', 0.9, 6), ('Color', 0.2, 8)],
[('Solarize', 0.5, 2), ('Invert', 0.0, 3)],
[('Equalize', 0.2, 0), ('AutoContrast', 0.6, 0)],
[('Equalize', 0.2, 8), ('Equalize', 0.6, 4)],
[('Color', 0.9, 9), ('Equalize', 0.6, 6)],
[('AutoContrast', 0.8, 4), ('Solarize', 0.2, 8)],
[('Brightness', 0.1, 3), ('Color', 0.7, 0)],
[('Solarize', 0.4, 5), ('AutoContrast', 0.9, 3)],
[('TranslateY', 0.9, 9), ('TranslateY', 0.7, 9)],
[('AutoContrast', 0.9, 2), ('Solarize', 0.8, 3)],
[('Equalize', 0.8, 8), ('Invert', 0.1, 3)],
[('TranslateY', 0.7, 9), ('AutoContrast', 0.9, 1)],
]
class CIFAR10AutoAugment:
"""
Apply AutoAugment policy for CIFAR-10.
How it works:
1. Randomly select one of 25 sub-policies
2. Apply first operation with its probability and magnitude
3. Apply second operation with its probability and magnitude
"""
def __init__(self):
self.policy = CIFAR10_POLICY
self.ops = AutoAugmentOperations()
self.op_dict = {
'ShearX': self.ops.shear_x,
'ShearY': self.ops.shear_y,
'TranslateX': self.ops.translate_x,
'TranslateY': self.ops.translate_y,
'Rotate': self.ops.rotate,
'AutoContrast': self.ops.auto_contrast,
'Invert': self.ops.invert,
'Equalize': self.ops.equalize,
'Solarize': self.ops.solarize,
'Posterize': self.ops.posterize,
'Contrast': self.ops.contrast,
'Brightness': self.ops.brightness,
'Sharpness': self.ops.sharpness,
'Color': self.ops.color,
}
def __call__(self, img):
sub_policy = random.choice(self.policy)
for op_name, prob, magnitude in sub_policy:
if random.random() < prob:
op_func = self.op_dict[op_name]
img = op_func(img, magnitude)
return img
3.3 Using AutoAugment (Easy Way with torchvision)
augmentations/autoaugment.py (continued)
import torchvision.transforms as T
from augmentations.cutout import Cutout
def get_autoaugment_transforms():
"""
Use torchvision's built-in AutoAugment (PyTorch >= 1.10)
"""
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)
train_transform = T.Compose([
T.RandomCrop(32, padding=4, padding_mode='reflect'),
T.RandomHorizontalFlip(),
T.AutoAugment(policy=T.AutoAugmentPolicy.CIFAR10),
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
Cutout(n_holes=1, length=16),
])
test_transform = T.Compose([
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
return train_transform, test_transform
AutoAugment + Cutout Results: 92.31% (+1.08% over Cutout alone)
RandAugment (Cubuk et al., 2019) simplifies AutoAugment by removing the need for a learned policy. Instead, it uses only 2 hyperparameters: N (number of operations) and M (magnitude).
Why RandAugment over AutoAugment?
- Simpler: Only 2 hyperparameters, versus AutoAugment's enormous learned-policy search space
- No search required: Works well with N=2, M=9
- Often better: Surprisingly matches or beats AutoAugment
augmentations/randaugment.py
import random
import numpy as np
import torchvision.transforms as T
from PIL import Image, ImageOps, ImageEnhance
from augmentations.cutout import Cutout
class RandAugment:
"""
RandAugment: Practical automated data augmentation with reduced search space.
Paper: https://arxiv.org/abs/1909.13719
Args:
n (int): Number of augmentation operations to apply (default: 2)
m (int): Magnitude of augmentations (0-30, default: 9)
Key insight: All augmentations share the same magnitude M,
eliminating per-operation tuning.
"""
def __init__(self, n=2, m=9):
self.n = n
self.m = m
self.augment_list = [
# (name, value at m=0, value at m=30). Signed geometric ops use a
# [0, max] range here; __call__ flips the sign at random.
('Identity', 0, 1),
('AutoContrast', 0, 1),
('Equalize', 0, 1),
('Rotate', 0, 30),
('Solarize', 256, 0),
('Color', 0.1, 1.9),
('Posterize', 8, 4),
('Contrast', 0.1, 1.9),
('Brightness', 0.1, 1.9),
('Sharpness', 0.1, 1.9),
('ShearX', 0, 0.3),
('ShearY', 0, 0.3),
('TranslateX', 0, 0.45),
('TranslateY', 0, 0.45),
]
def _apply_op(self, img, op_name, magnitude):
"""Apply a single augmentation operation."""
if op_name == 'Identity':
return img
elif op_name == 'AutoContrast':
return ImageOps.autocontrast(img)
elif op_name == 'Equalize':
return ImageOps.equalize(img)
elif op_name == 'Rotate':
return img.rotate(magnitude, resample=Image.BILINEAR)
elif op_name == 'Solarize':
return ImageOps.solarize(img, int(magnitude))
elif op_name == 'Color':
return ImageEnhance.Color(img).enhance(magnitude)
elif op_name == 'Posterize':
return ImageOps.posterize(img, int(magnitude))
elif op_name == 'Contrast':
return ImageEnhance.Contrast(img).enhance(magnitude)
elif op_name == 'Brightness':
return ImageEnhance.Brightness(img).enhance(magnitude)
elif op_name == 'Sharpness':
return ImageEnhance.Sharpness(img).enhance(magnitude)
elif op_name == 'ShearX':
return img.transform(
img.size, Image.AFFINE,
(1, magnitude, 0, 0, 1, 0),
resample=Image.BILINEAR
)
elif op_name == 'ShearY':
return img.transform(
img.size, Image.AFFINE,
(1, 0, 0, magnitude, 1, 0),
resample=Image.BILINEAR
)
elif op_name == 'TranslateX':
pixels = magnitude * img.size[0]
return img.transform(
img.size, Image.AFFINE,
(1, 0, pixels, 0, 1, 0),
resample=Image.BILINEAR
)
elif op_name == 'TranslateY':
pixels = magnitude * img.size[1]
return img.transform(
img.size, Image.AFFINE,
(1, 0, 0, 0, 1, pixels),
resample=Image.BILINEAR
)
return img
def __call__(self, img):
"""
Apply n random augmentations with magnitude m.
"""
ops = random.choices(self.augment_list, k=self.n)
for op_name, min_val, max_val in ops:
magnitude = (self.m / 30) * (max_val - min_val) + min_val
if random.random() > 0.5 and op_name in [
'Rotate', 'ShearX', 'ShearY', 'TranslateX', 'TranslateY'
]:
magnitude = -magnitude
img = self._apply_op(img, op_name, magnitude)
return img
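The magnitude mapping in __call__ is plain linear interpolation between each operation's two range endpoints. A quick check for the enhancement ops (Color, Contrast, Brightness, Sharpness), which all use the range (0.1, 1.9); `scale` is my name for the inline expression:

```python
# RandAugment's shared magnitude m (0-30) is interpolated into each
# operation's own range, exactly as inside RandAugment.__call__.
def scale(m, min_val, max_val):
    return (m / 30) * (max_val - min_val) + min_val

# Enhancement factor 1.0 means "leave the image unchanged":
print(round(scale(0, 0.1, 1.9), 6))   # 0.1 -> strong reduction
print(round(scale(15, 0.1, 1.9), 6))  # 1.0 -> identity
print(round(scale(30, 0.1, 1.9), 6))  # 1.9 -> strong boost
```

At the recommended m=9 the factor is 0.64, a moderate reduction; magnitudes above 15 push the factor past 1.0 into boosts.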
def get_randaugment_transforms(n=2, m=9):
"""
Get transforms with RandAugment.
Recommended settings:
- CIFAR-10: n=2, m=9
- CIFAR-100: n=2, m=14
"""
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)
train_transform = T.Compose([
T.RandomCrop(32, padding=4, padding_mode='reflect'),
T.RandomHorizontalFlip(),
RandAugment(n=n, m=m),
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
Cutout(n_holes=1, length=16),
])
test_transform = T.Compose([
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
return train_transform, test_transform
These techniques mix samples together, creating new training examples from combinations of existing ones.
5.1 Mixup
Mixup (Zhang et al., 2017) creates virtual training examples by taking convex combinations of pairs of examples and their labels.
augmentations/mixup.py
import torch
import numpy as np
class Mixup:
"""
Mixup: Beyond Empirical Risk Minimization
Paper: https://arxiv.org/abs/1710.09412
For two samples (x_i, y_i) and (x_j, y_j):
x_new = lambda * x_i + (1 - lambda) * x_j
y_new = lambda * y_i + (1 - lambda) * y_j
where lambda ~ Beta(alpha, alpha)
Args:
alpha (float): Beta distribution parameter (default: 1.0)
Higher alpha = more mixing
alpha=1.0 gives uniform distribution
"""
def __init__(self, alpha=1.0):
self.alpha = alpha
def __call__(self, batch_x, batch_y):
"""
Args:
batch_x: Tensor of shape (batch_size, C, H, W)
batch_y: Tensor of shape (batch_size,) - class indices
Returns:
mixed_x: Mixed input
y_a, y_b: Original labels for loss computation
lam: Mixing coefficient
"""
if self.alpha > 0:
lam = np.random.beta(self.alpha, self.alpha)
else:
lam = 1
batch_size = batch_x.size(0)
index = torch.randperm(batch_size).to(batch_x.device)
mixed_x = lam * batch_x + (1 - lam) * batch_x[index, :]
y_a, y_b = batch_y, batch_y[index]
return mixed_x, y_a, y_b, lam
def mixup_criterion(criterion, pred, y_a, y_b, lam):
"""
Compute mixup loss.
Loss = lam * CE(pred, y_a) + (1 - lam) * CE(pred, y_b)
"""
return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)
def train_with_mixup(model, train_loader, optimizer, criterion, device, alpha=1.0):
"""Example training loop with Mixup."""
model.train()
mixup = Mixup(alpha=alpha)
for inputs, targets in train_loader:
inputs, targets = inputs.to(device), targets.to(device)
mixed_inputs, targets_a, targets_b, lam = mixup(inputs, targets)
outputs = model(mixed_inputs)
loss = mixup_criterion(criterion, outputs, targets_a, targets_b, lam)
optimizer.zero_grad()
loss.backward()
optimizer.step()
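Why is mixup_criterion correct? Cross-entropy is linear in the target distribution, so the loss against the soft mixed label equals the lam-weighted sum of the two hard-label losses. A NumPy check with toy numbers (the values are illustrative, not from the tutorial):

```python
import numpy as np

# A toy 3-class prediction (already softmaxed) and the two mixed labels.
p = np.array([0.6, 0.3, 0.1])   # predicted class probabilities
y_a, y_b = 0, 1                 # hard labels of the two mixed samples
lam = 0.7

# Cross-entropy against the soft mixed label...
y_mixed = np.zeros(3)
y_mixed[y_a] += lam
y_mixed[y_b] += 1 - lam
ce_soft = -np.sum(y_mixed * np.log(p))

# ...equals the lam-weighted sum of the two hard-label losses,
# which is exactly what mixup_criterion computes.
ce_weighted = lam * -np.log(p[y_a]) + (1 - lam) * -np.log(p[y_b])

print(np.isclose(ce_soft, ce_weighted))  # True
```

This is why the training loop never has to materialize soft labels: two standard CrossEntropyLoss calls suffice.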
5.2 CutMix
CutMix (Yun et al., 2019) cuts and pastes patches among training images, with ground truth labels mixed proportionally to the area of patches.
augmentations/cutmix.py
import torch
import numpy as np
class CutMix:
"""
CutMix: Regularization Strategy to Train Strong Classifiers
Paper: https://arxiv.org/abs/1905.04899
Unlike Mixup (which blends entire images), CutMix:
1. Cuts a rectangular region from one image
2. Pastes it onto another image
3. Mixes labels proportionally to the area
This encourages the model to identify objects from partial views.
Args:
alpha (float): Beta distribution parameter (default: 1.0)
prob (float): Probability of applying CutMix (default: 0.5)
"""
def __init__(self, alpha=1.0, prob=0.5):
self.alpha = alpha
self.prob = prob
def _rand_bbox(self, size, lam):
"""
Generate a random bounding box covering (1 - lam) of the image area.
Args:
size: (batch, channels, height, width)
lam: Mixing ratio
Returns:
Bounding box coordinates (y1, x1, y2, x2)
"""
H = size[2]
W = size[3]
cut_rat = np.sqrt(1. - lam)
cut_h = int(H * cut_rat)
cut_w = int(W * cut_rat)
cy = np.random.randint(H)
cx = np.random.randint(W)
bby1 = np.clip(cy - cut_h // 2, 0, H)
bbx1 = np.clip(cx - cut_w // 2, 0, W)
bby2 = np.clip(cy + cut_h // 2, 0, H)
bbx2 = np.clip(cx + cut_w // 2, 0, W)
return bby1, bbx1, bby2, bbx2
def __call__(self, batch_x, batch_y):
"""
Apply CutMix to a batch.
Returns:
mixed_x: Mixed images
y_a, y_b: Original labels
lam: Actual mixing ratio (based on actual box size)
"""
if np.random.random() > self.prob:
return batch_x, batch_y, batch_y, 1.0
lam = np.random.beta(self.alpha, self.alpha)
batch_size = batch_x.size(0)
index = torch.randperm(batch_size).to(batch_x.device)
y_a, y_b = batch_y, batch_y[index]
bby1, bbx1, bby2, bbx2 = self._rand_bbox(batch_x.size(), lam)
batch_x[:, :, bby1:bby2, bbx1:bbx2] = batch_x[index, :, bby1:bby2, bbx1:bbx2]
lam = 1 - ((bby2 - bby1) * (bbx2 - bbx1) /
(batch_x.size(-1) * batch_x.size(-2)))
return batch_x, y_a, y_b, lam
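The lam returned by __call__ can be sanity-checked with a small worked example: a box of side sqrt(1 - lam) * S placed fully inside an S x S image covers exactly (1 - lam) of it, so recomputing lam from the box area recovers the sampled value (toy numbers, my names):

```python
import math

# CutMix box geometry for CIFAR-sized images.
S = 32        # image side
lam = 0.75    # drawn from Beta(alpha, alpha) during training

cut = int(S * math.sqrt(1.0 - lam))      # box side: sqrt(0.25) * 32 = 16
area_fraction = (cut * cut) / (S * S)    # 256 / 1024 = 0.25
lam_recomputed = 1.0 - area_fraction     # back to 0.75

print(cut, area_fraction, lam_recomputed)  # 16 0.25 0.75
```

Near the border the box is clipped, which is exactly why the implementation recomputes lam from the clipped coordinates instead of trusting the sampled value.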
class MixupCutMix:
"""
Randomly choose between Mixup and CutMix.
Used in many modern training recipes (e.g., DeiT, ConvNeXt).
"""
def __init__(self, mixup_alpha=1.0, cutmix_alpha=1.0,
mixup_prob=0.5, cutmix_prob=0.5):
self.mixup = Mixup(alpha=mixup_alpha)
self.cutmix = CutMix(alpha=cutmix_alpha, prob=1.0)
self.mixup_prob = mixup_prob
self.cutmix_prob = cutmix_prob
def __call__(self, batch_x, batch_y):
r = np.random.random()
if r < self.mixup_prob:
return self.mixup(batch_x, batch_y)
elif r < self.mixup_prob + self.cutmix_prob:
return self.cutmix(batch_x, batch_y)
else:
return batch_x, batch_y, batch_y, 1.0
Let's put everything together into a production-ready augmentation pipeline.
augmentations/pipeline.py
import torch
import torchvision.transforms as T
from torchvision import datasets
from torch.utils.data import DataLoader
from augmentations.cutout import Cutout
from augmentations.randaugment import RandAugment
from augmentations.mixup import Mixup, mixup_criterion
from augmentations.cutmix import CutMix, MixupCutMix
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)
CIFAR100_MEAN = (0.5071, 0.4867, 0.4408)
CIFAR100_STD = (0.2675, 0.2565, 0.2761)
def get_cifar10_transforms(augmentation_level='strong'):
"""
Get CIFAR-10 transforms based on augmentation level.
Levels:
'none': No augmentation (baseline)
'basic': Flip + Crop only
'medium': Basic + Cutout
'strong': Basic + RandAugment + Cutout (recommended)
'autoaugment': Basic + AutoAugment + Cutout
Returns:
train_transform, test_transform
"""
test_transform = T.Compose([
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
if augmentation_level == 'none':
train_transform = test_transform
elif augmentation_level == 'basic':
train_transform = T.Compose([
T.RandomCrop(32, padding=4, padding_mode='reflect'),
T.RandomHorizontalFlip(),
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
elif augmentation_level == 'medium':
train_transform = T.Compose([
T.RandomCrop(32, padding=4, padding_mode='reflect'),
T.RandomHorizontalFlip(),
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
Cutout(n_holes=1, length=16),
])
elif augmentation_level == 'strong':
train_transform = T.Compose([
T.RandomCrop(32, padding=4, padding_mode='reflect'),
T.RandomHorizontalFlip(),
RandAugment(n=2, m=14),
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
Cutout(n_holes=1, length=16),
])
elif augmentation_level == 'autoaugment':
train_transform = T.Compose([
T.RandomCrop(32, padding=4, padding_mode='reflect'),
T.RandomHorizontalFlip(),
T.AutoAugment(policy=T.AutoAugmentPolicy.CIFAR10),
T.ToTensor(),
T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
Cutout(n_holes=1, length=16),
])
else:
raise ValueError(f"Unknown augmentation level: {augmentation_level}")
return train_transform, test_transform
def get_dataloaders(dataset='cifar10', batch_size=128,
augmentation_level='strong', num_workers=4):
"""
Get complete data loaders with augmentation.
"""
if dataset == 'cifar10':
train_transform, test_transform = get_cifar10_transforms(augmentation_level)
train_dataset = datasets.CIFAR10(
root='./data', train=True,
download=True, transform=train_transform
)
test_dataset = datasets.CIFAR10(
root='./data', train=False,
download=True, transform=test_transform
)
elif dataset == 'cifar100':
# get_cifar100_transforms is assumed to mirror get_cifar10_transforms,
# built with CIFAR100_MEAN / CIFAR100_STD above
train_transform, test_transform = get_cifar100_transforms(augmentation_level)
train_dataset = datasets.CIFAR100(
root='./data', train=True,
download=True, transform=train_transform
)
test_dataset = datasets.CIFAR100(
root='./data', train=False,
download=True, transform=test_transform
)
else:
raise ValueError(f"Unknown dataset: {dataset}")
train_loader = DataLoader(
train_dataset, batch_size=batch_size, shuffle=True,
num_workers=num_workers, pin_memory=True, drop_last=True
)
test_loader = DataLoader(
test_dataset, batch_size=batch_size, shuffle=False,
num_workers=num_workers, pin_memory=True
)
return train_loader, test_loader
Complete Training Script with All Augmentations
train_augmented.py
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
from models.baseline import SimpleCNN
from augmentations.pipeline import get_dataloaders
from augmentations.mixup import mixup_criterion
from augmentations.cutmix import MixupCutMix
def train_one_epoch(model, loader, criterion, optimizer, device, mixup_fn=None):
model.train()
running_loss = 0.0
correct = 0
total = 0
pbar = tqdm(loader, desc='Training')
for inputs, targets in pbar:
inputs, targets = inputs.to(device), targets.to(device)
if mixup_fn is not None:
inputs, targets_a, targets_b, lam = mixup_fn(inputs, targets)
optimizer.zero_grad()
outputs = model(inputs)
if mixup_fn is not None:
loss = mixup_criterion(criterion, outputs, targets_a, targets_b, lam)
else:
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
running_loss += loss.item() * targets.size(0)
_, predicted = outputs.max(1)
total += targets.size(0)
if mixup_fn is not None:
correct += (lam * predicted.eq(targets_a).sum().float() +
(1 - lam) * predicted.eq(targets_b).sum().float()).item()
else:
correct += predicted.eq(targets).sum().item()
pbar.set_postfix({'loss': f'{running_loss/total:.4f}'})
return running_loss / total, 100. * correct / total
def evaluate(model, loader, criterion, device):
model.eval()
running_loss = 0.0
correct = 0
total = 0
with torch.no_grad():
for inputs, targets in loader:
inputs, targets = inputs.to(device), targets.to(device)
outputs = model(inputs)
loss = criterion(outputs, targets)
running_loss += loss.item()
_, predicted = outputs.max(1)
total += targets.size(0)
correct += predicted.eq(targets).sum().item()
return running_loss / len(loader), 100. * correct / total
def main():
config = {
'batch_size': 128,
'epochs': 200,
'lr': 0.1,
'momentum': 0.9,
'weight_decay': 5e-4,
'augmentation': 'strong',
'use_mixup': True,
'mixup_alpha': 0.2,
'cutmix_alpha': 1.0,
}
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
train_loader, test_loader = get_dataloaders(
dataset='cifar10',
batch_size=config['batch_size'],
augmentation_level=config['augmentation']
)
model = SimpleCNN(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(
model.parameters(),
lr=config['lr'],
momentum=config['momentum'],
weight_decay=config['weight_decay']
)
scheduler = optim.lr_scheduler.CosineAnnealingLR(
optimizer, T_max=config['epochs']
)
mixup_fn = None
if config['use_mixup']:
mixup_fn = MixupCutMix(
mixup_alpha=config['mixup_alpha'],
cutmix_alpha=config['cutmix_alpha'],
mixup_prob=0.5,
cutmix_prob=0.5
)
best_acc = 0.0
for epoch in range(config['epochs']):
print(f'\nEpoch {epoch+1}/{config["epochs"]}')
train_loss, train_acc = train_one_epoch(
model, train_loader, criterion, optimizer, device, mixup_fn
)
test_loss, test_acc = evaluate(model, test_loader, criterion, device)
print(f'Train Loss: {train_loss:.4f} | Test Acc: {test_acc:.2f}%')
if test_acc > best_acc:
best_acc = test_acc
torch.save(model.state_dict(), 'best_augmented_model.pth')
print(f'New best: {best_acc:.2f}%')
scheduler.step()
print(f'\nFinal Best Accuracy: {best_acc:.2f}%')
if __name__ == '__main__':
main()
| Configuration | CIFAR-10 Accuracy | Improvement |
|---|---|---|
| Baseline (no augmentation) | 84.23% | - |
| + Basic (flip, crop) | 89.47% | +5.24% |
| + Cutout | 91.23% | +1.76% |
| + AutoAugment | 92.31% | +1.08% |
| + RandAugment + Mixup | 92.87% | +0.56% |
| Final (all augmentations) | 93.12% | +8.89% total |
Updated Progress
Current: 93.12% | Target: 99.5%
Gap remaining: 6.38%
Part 2 Key Takeaways
- Basic augmentation is essential - Random crop + flip alone gives +5%
- Cutout is simple but effective - Forces model to use multiple features
- RandAugment is practical - Works as well as AutoAugment with 2 hyperparameters
- Mixup/CutMix helps generalization - Especially useful with longer training
- Order matters - Spatial transforms → ToTensor → Normalize → Cutout
Next: Part 3 - Advanced Architectures
With our augmentation pipeline delivering 93.12%, we've hit the limits of our simple CNN. In Part 3, we'll implement:
- ResNet - Skip connections for deeper networks
- WideResNet - Wider layers instead of deeper
- PyramidNet - Gradually increasing channels
- DenseNet - Dense connections between layers