Blog of Silas Bempong

Pretraining vs Data Augmentation: What Matters More for Semantic Segmentation on a Laptop?

TL;DR: On a small semantic segmentation task (Oxford-IIIT Pet) trained on a MacBook, ImageNet pretraining clearly helped more than heavier data augmentation. Pretraining gave me ~+3.8 mIoU points on average with essentially the same training time. Heavy augmentation was almost a wash.

Why I Ran This Experiment

Semantic segmentation models are usually trained on big GPUs with big datasets. But most of us don’t live inside a datacenter — we live on laptops.

I wanted to answer a simple, practical question:

If you only have a small GPU (in my case: an M1 MacBook Pro) and a few hours, what should you spend your limited budget on: pretraining or heavier data augmentation?

Both are classic “performance knobs”: initializing the encoder with ImageNet-pretrained weights, and training with heavier data augmentation.

Rather than chasing state-of-the-art numbers, I framed this as a trade-off question under tight compute. I picked a small dataset, a lightweight model, and four simple configurations, then measured both accuracy (mIoU) and training time.

Experimental Setup

Task & Dataset: Oxford-IIIT Pet Segmentation

I used the Oxford-IIIT Pet dataset, which has photos of 37 cat and dog breeds. Each image comes with a pixel-wise trimap mask (background/pet/boundary), making it a nice, compact segmentation benchmark.

For this experiment:

This keeps the problem interesting (non-trivial shapes and textures) but small enough to run comfortably on a MacBook.

Hardware & Framework

Everything runs on an M1 MacBook Pro with PyTorch (via segmentation_models_pytorch), so the full 4-config grid fits into a few hours of wall-clock time.
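For what it's worth, the only Mac-specific bit of code is picking the compute device; a minimal sketch using PyTorch's standard MPS check:

```python
import torch

# Use Apple's Metal (MPS) backend when available, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"Training on: {device}")
```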

Model: Tiny U-Net with ResNet18 Encoder

To avoid spending time reinventing architectures, I used segmentation_models_pytorch and built a lightweight U-Net:

The key part for this experiment is that the encoder can be either initialized with ImageNet-pretrained weights or trained from scratch with random initialization.

Everything else about the model stays fixed.
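Roughly, the model construction looks like this with segmentation_models_pytorch (a sketch; the decoder settings are library defaults, and the three output classes are the trimap labels):

```python
import segmentation_models_pytorch as smp

def build_model(pretrained: bool) -> smp.Unet:
    """U-Net with a ResNet18 encoder; only the encoder initialization changes."""
    return smp.Unet(
        encoder_name="resnet18",
        # ImageNet weights for the pretrained configs, random init for scratch.
        encoder_weights="imagenet" if pretrained else None,
        in_channels=3,   # RGB input
        classes=3,       # background / pet / boundary
    )
```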

Loss Function & Metric

The loss is a combination of Dice loss and cross-entropy: Dice tends to help with class imbalance and improves boundary quality a bit, while cross-entropy is a solid default. I didn't tune this combination; it's the same for all runs. The metric throughout is mean IoU (mIoU), averaged over the three trimap classes.
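As a sketch, the combined loss can be written like this (I'm showing an equal-weighted sum; the weighting is an illustrative choice, not something I tuned):

```python
import torch
import torch.nn as nn
from segmentation_models_pytorch.losses import DiceLoss

ce_loss = nn.CrossEntropyLoss()
dice_loss = DiceLoss(mode="multiclass")

def combined_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Equal-weighted sum of cross-entropy and Dice loss.

    logits: (N, C, H, W) raw model outputs; target: (N, H, W) integer class labels.
    """
    return ce_loss(logits, target) + dice_loss(logits, target)
```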

Training Hyperparameters

Same for all 4 configs:

I tracked two numbers per run: final test mIoU and wall-clock training time.
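For mIoU, a minimal sketch with torchmetrics (any equivalent per-class IoU average works the same way):

```python
import torch
from torchmetrics import JaccardIndex

# Macro-averaged Jaccard index over 3 classes == mIoU for the trimap labels.
miou_metric = JaccardIndex(task="multiclass", num_classes=3)

# Toy example: logits of shape (N, 3, H, W), masks of shape (N, H, W).
logits = torch.randn(4, 3, 64, 64)
masks = torch.randint(0, 3, (4, 64, 64))
print(f"mIoU: {miou_metric(logits.argmax(dim=1), masks):.3f}")
```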

What I Varied: Pretraining vs Data Augmentation

I ran a 2×2 grid over two binary choices: pretraining and augmentation strength.

Axis 1: Pretraining

The ResNet18 encoder either starts from ImageNet-pretrained weights or from random initialization; nothing else about the model changes.

Axis 2: Data Augmentation

I defined two simple augmentation pipelines.

Light Augmentation

Applied to the training images (geometric changes applied equally to image and mask): random horizontal flips and random crops.

Heavy Augmentation

Everything from Light, plus stronger geometric and photometric transforms, e.g. rotations and color jitter.

The goal wasn’t to design the perfect augmentation policy, just a clearly “lighter” vs “heavier” pair.
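To make the light-vs-heavy split concrete, here's roughly what the two pipelines look like with albumentations; the specific transforms and parameters below are illustrative rather than the exact values from my runs:

```python
import albumentations as A

CROP_SIZE = 256  # illustrative; use whatever resolution you train at

# Light: cheap geometric transforms, applied identically to image and mask.
light_aug = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomCrop(height=CROP_SIZE, width=CROP_SIZE),
])

# Heavy: everything from light, plus stronger geometric and photometric changes.
heavy_aug = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomCrop(height=CROP_SIZE, width=CROP_SIZE),
    A.Rotate(limit=30, p=0.5),
    A.ColorJitter(p=0.5),
])

# albumentations keeps image and mask aligned: out = light_aug(image=img, mask=mask)
```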

The Four Configurations

Putting these together, I get four configurations:

Everything else about the training is identical.
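In code, the grid is just two booleans (the names here are mine, matching the C1–C4 labels used below):

```python
from itertools import product

# C1: scratch + light, C2: scratch + heavy, C3: pretrained + light, C4: pretrained + heavy
configs = {
    f"C{i}": {"pretrained": pretrained, "heavy_aug": heavy}
    for i, (pretrained, heavy) in enumerate(product([False, True], [False, True]), start=1)
}
print(configs)
```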

Results

Test mIoU by Configuration

miou_by_config.png

Here are the final test mIoU scores for each config:

A few immediate observations:

Ablation: Pretraining vs Heavy Augmentation

To isolate the effect of each knob, I took simple averages.
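Concretely, "simple averages" means marginal means over the 2×2 grid; a small helper makes that explicit (plug in the per-config mIoU values from the table above):

```python
def marginal_effects(miou: dict[str, float]) -> tuple[float, float]:
    """Return (pretraining effect, heavy-augmentation effect) as mIoU differences."""
    def avg(keys):
        return sum(miou[k] for k in keys) / len(keys)

    pretraining = avg(["C3", "C4"]) - avg(["C1", "C2"])  # pretrained minus scratch
    heavy_aug = avg(["C2", "C4"]) - avg(["C1", "C3"])    # heavy minus light
    return pretraining, heavy_aug

# Usage: marginal_effects({"C1": ..., "C2": ..., "C3": ..., "C4": ...})
```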

Effect of Pretraining

So pretraining gave me roughly +0.038 mIoU (3.8 absolute points) on average.

Effect of Heavy Augmentation

Heavy augmentation was basically a wash; on average it came out slightly worse than light.

ablation_view.png

Training Time: How Much Compute Did This Cost?

Each configuration took about the same time to train.

training_time_by_config.png

A couple of things worth noting:

Bang for Buck: mIoU vs Training Time

bang-for-buck.png

I plotted test mIoU vs training time. Each point is one config (C1–C4).

A clean story emerges:

If you care about accuracy per minute of training, C3 is the clear winner.
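If you want to recreate this view from your own runs, here's a minimal matplotlib sketch (the numbers come from your logs, not from this snippet):

```python
import matplotlib.pyplot as plt

def plot_bang_for_buck(runs: dict[str, tuple[float, float]]) -> None:
    """Scatter each config as (training minutes, test mIoU), labelled by name."""
    for name, (minutes, miou) in runs.items():
        plt.scatter(minutes, miou)
        plt.annotate(name, (minutes, miou), textcoords="offset points", xytext=(5, 5))
    plt.xlabel("Training time (minutes)")
    plt.ylabel("Test mIoU")
    plt.title("Bang for buck: accuracy vs training time")
    plt.show()

# Usage: plot_bang_for_buck({"C1": (t1, m1), "C2": (t2, m2), "C3": (t3, m3), "C4": (t4, m4)})
```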

Qualitative Results

Numbers are nice, but segmentation is a visual task, so I looked at predictions on a small test set.

qualitative-comparison.png

The plot shows, for six different examples:

  1. The input image
  2. The ground truth mask
  3. C1 prediction (Scratch + Light)
  4. C2 prediction (Scratch + Heavy)
  5. C3 prediction (Pretrained + Light)
  6. C4 prediction (Pretrained + Heavy)
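For reference, the comparison grid is just a rows-by-columns subplot; a sketch of one way to lay it out (assuming each panel is already an HxW or HxWx3 array):

```python
import matplotlib.pyplot as plt

def show_comparison(rows) -> None:
    """rows: list of 6-tuples (image, ground truth, C1, C2, C3, C4 predictions)."""
    titles = ["Input", "Ground truth", "C1", "C2", "C3", "C4"]
    fig, axes = plt.subplots(len(rows), 6, figsize=(14, 2.2 * len(rows)), squeeze=False)
    for r, panels in enumerate(rows):
        for c, panel in enumerate(panels):
            axes[r][c].imshow(panel)
            axes[r][c].axis("off")
            if r == 0:
                axes[r][c].set_title(titles[c])
    plt.tight_layout()
    plt.show()
```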

A few qualitative patterns I noticed:

Overall, the visuals match the metrics: pretraining is the big lever, while my version of “heavy augmentation” didn’t give a clear win.

Takeaways for Practitioners on Low Compute

If you’re training a semantic segmentation model on a modest machine (like a MacBook, gaming laptop, or small cloud instance), here’s what I’d do based on this experiment.

1. Always Start with a Pretrained Encoder

If you can get a pretrained backbone, use it: in this experiment it added roughly +3.8 mIoU points on average at essentially no extra training-time cost.

In other words: pretraining is basically free performance.

2. Use Simple, Robust Augmentation Before Getting Fancy

Basic geometric augmentations (flip + crop) are cheap and safe. In this experiment, going from light to “heavy” augmentation didn’t help and sometimes slightly hurt.

That doesn’t mean heavy augmentation is bad; it just means:

If you’re on tight compute, I’d start with a pretrained encoder and just the basic geometric augmentations (flips and crops).

Only then would I layer in rotations, color jitter, or more exotic methods if I still see overfitting.

3. Don’t Overthink the Model (At First)

A simple U-Net with a good encoder gave solid performance here. Before trying to squeeze out tiny gains with custom architectures, you’ll often get more value from:

4. This Whole Experiment Fits on a Laptop

One neat side effect: this 4-config grid was completely feasible on an M1 MacBook Pro, with the whole grid fitting comfortably into a few hours of wall-clock time.

You don’t need a massive GPU cluster to do meaningful empirical work in computer vision.

Limitations & Next Steps

This is a small, focused experiment, and there are plenty of things I didn’t explore.

Some obvious caveats:

If I extend this project, I’d like to:

  1. Vary the amount of labeled data and compare pretraining vs augmentation when data is scarce.
  2. Try different resolutions to see how much performance you lose by shrinking images to go faster.
  3. Experiment with more advanced augmentations (Mixup/CutMix/Copy-Paste) and see if they matter more at higher resolutions or with bigger models.

Conclusion

For this small, laptop-friendly segmentation experiment, the answer to my original question is pretty clear:

If you only have time to turn one knob, turn on pretraining.

ImageNet pretraining gave a solid, consistent boost in test mIoU with almost no downsides. My heavier augmentation recipe, on the other hand, was mostly neutral.

The broader lesson: even under tight compute, you can still do meaningful, well-scoped experiments in computer vision. And if you frame them around practical trade-offs — like “accuracy per minute of training” — the results can be immediately useful for anyone training models on a single machine.

Try a similar setup on another dataset or with a different backbone using the code in my repository. I’d be very curious to see whether pretraining still dominates, or if augmentation plays a bigger role there.