Cross-Architecture Knowledge Distillation & 4-Bit Quantization for Real-Time Edge Inference
In the era of "Bigger is Better," Project Nano-Vision takes the opposite approach. We tackle the challenge of deploying state-of-the-art Deep Learning models on resource-constrained hardware ("Potatoes").
By implementing Knowledge Distillation (KD), we transfer the "dark knowledge" of a heavy, scratch-built ResNet-50 (Teacher) into a lightweight MobileNetV3-Small (Student). To further bridge the gap between research and production, we apply 4-bit Quantization and ONNX Graph Optimization, achieving real-time inference on standard consumer CPUs.
Teacher (ResNet-50): designed specifically for 32×32 inputs (CIFAR-100), avoiding the spatial resolution loss found in standard ImageNet-centric architectures.
- Key Feature: Bottleneck blocks with Identity Shortcuts.
- Accuracy Target: ~78% on CIFAR-100.
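To make the resolution argument concrete, here is a small sketch that tracks feature-map size through a ResNet-style network for a 32×32 image. The ImageNet stem (7×7 stride-2 conv followed by a 3×3 stride-2 max-pool) is the standard one. The CIFAR stem shown here (a single 3×3 stride-1 conv, no pooling) is an assumption about a common adaptation, not necessarily the exact stem used in this project.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a square conv/pool layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def imagenet_stem(size):
    """Standard ImageNet ResNet stem: 7x7 conv (stride 2) + 3x3 max-pool (stride 2)."""
    size = conv_out(size, kernel=7, stride=2, padding=3)
    return conv_out(size, kernel=3, stride=2, padding=1)


def cifar_stem(size):
    """Assumed CIFAR-style stem: one 3x3 conv with stride 1 and no pooling."""
    return conv_out(size, kernel=3, stride=1, padding=1)


def stage_sizes(size, strides=(1, 2, 2, 2)):
    """Feature-map side length after each of the four residual stages."""
    sizes = []
    for stride in strides:
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print("ImageNet stem:", stage_sizes(imagenet_stem(32)))  # [8, 4, 2, 1]
    print("CIFAR stem:   ", stage_sizes(cifar_stem(32)))     # [32, 16, 8, 4]
```

With the ImageNet stem, the last stage operates on a single 1×1 position, while the assumed CIFAR stem still leaves a 4×4 map to pool over.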
Student (MobileNetV3-Small): a high-efficiency model that uses depthwise separable convolutions to minimize FLOPs.
- Key Feature: Low-latency architecture optimized for CPU-bound environments.
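The FLOP savings from depthwise separable convolutions can be estimated with a quick multiply-accumulate (MAC) count. The layer shape below (16×16 feature map, 64 to 128 channels, 3×3 kernel) is an illustrative choice, not a layer taken from this project's student model.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a dense k x k convolution producing an h x w output."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a k x k depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 conv mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    dense = standard_conv_macs(16, 16, 64, 128, 3)
    separable = depthwise_separable_macs(16, 16, 64, 128, 3)
    print(dense, separable, round(dense / separable, 2))
```

For this shape the dense layer needs 18,874,368 MACs and the separable version 2,244,608, roughly 8.4x fewer. In general the ratio is 1 / (1/c_out + 1/k²).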
We don't just train on hard labels. We use Temperature-Scaled KL Divergence to capture the inter-class relationships learned by the teacher.
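As a framework-free illustration of that idea, the sketch below computes a distillation loss for a single example using only the standard library. The weighting `alpha`, the temperature default of 4.0, and the `T²` scaling follow the common Hinton-style formulation and are assumptions here. The project's actual hyperparameters and training code may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = temperature ** 2 * kl_divergence(soft_teacher, soft_student)
    hard_term = -math.log(softmax(student_logits)[label])  # hard-label CE at T = 1
    return alpha * soft_term + (1 - alpha) * hard_term


if __name__ == "__main__":
    teacher = [6.0, 3.5, 0.5]   # hypothetical logits for three classes
    student = [4.0, 1.0, 1.5]
    print(softmax(teacher, 1.0))  # sharply peaked on class 0
    print(softmax(teacher, 4.0))  # softer: class 1 is visibly "closer" than class 2
    print(kd_loss(student, teacher, label=0))
```

Raising the temperature flattens the teacher's distribution, so the relative ordering of the wrong classes contributes to the gradient instead of being rounded away.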
The Loss Function:
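The exact expression is not reproduced here. A commonly used form of temperature-scaled distillation loss, given as an assumption rather than the project's confirmed objective, is:

$$
\mathcal{L}_{\text{KD}} = \alpha \, T^{2} \, \mathrm{KL}\!\left( \sigma\!\left(\frac{z_t}{T}\right) \,\Big\|\, \sigma\!\left(\frac{z_s}{T}\right) \right) + (1 - \alpha) \, \mathrm{CE}\!\left( y, \sigma(z_s) \right)
$$

where $z_t$ and $z_s$ are the teacher and student logits, $\sigma$ is the softmax, $T$ is the temperature, $y$ is the hard label, and $\alpha$ balances the soft and hard terms. The $T^{2}$ factor keeps the gradient magnitude of the soft term comparable as $T$ changes.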
- Logit-based Distillation: Training the student to mimic the teacher's softened probability distribution (softmax at temperature T > 1).
- Quantization-Aware Training (QAT): Simulating low-precision math during training to maintain accuracy after "crushing" weights to INT8/INT4.
- ONNX Runtime: Exporting to a hardware-agnostic format to leverage SIMD instructions on local CPUs.
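The "simulated low-precision math" in QAT is usually a quantize-dequantize step applied to weights in the forward pass, with gradients passed through unchanged (a straight-through estimator). Below is a minimal sketch of that forward step. The symmetric, per-tensor scheme is an assumption, since the scheme actually used in this project is not specified.

```python
def fake_quantize(weights, bits=4):
    """Symmetric per-tensor quantize-dequantize ("fake quantization").

    Returns the dequantized weights, the integer codes, and the scale.
    """
    qmax = 2 ** (bits - 1) - 1    # 7 for INT4, 127 for INT8
    qmin = -(2 ** (bits - 1))     # -8 for INT4, -128 for INT8
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code, then clamp to the representable range.
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    dequantized = [c * scale for c in codes]
    return dequantized, codes, scale


if __name__ == "__main__":
    weights = [0.7, -0.32, 0.11, 0.0]   # hypothetical weight values
    for bits in (8, 4):
        deq, codes, scale = fake_quantize(weights, bits)
        err = max(abs(a - b) for a, b in zip(weights, deq))
        print(f"INT{bits}: codes={codes} max_error={err:.4f}")
```

Training against the dequantized values lets the network adapt to the rounding error before the weights are actually stored as integers, which is why INT4 (only 16 levels) benefits most from QAT.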
| Metric | Teacher (ResNet-50) | Student (Quantized) | Improvement |
|---|---|---|---|
| Model Size | ~95 MB | ~2.8 MB | ~34x smaller |
| Latency (CPU) | 120 ms / image | 8 ms / image | 15x faster |
| Throughput | ~8 FPS | ~120 FPS | ~15x higher |
| Accuracy | 78.4% | 76.1% | 2.3-point drop |
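The ratios can be recomputed directly from the raw numbers in the table. This short sketch only uses the figures reported above and assumes throughput is measured one image at a time (batch size 1).

```python
# Figures taken from the table above.
teacher_mb, student_mb = 95.0, 2.8
teacher_ms, student_ms = 120.0, 8.0
teacher_acc, student_acc = 78.4, 76.1

size_ratio = teacher_mb / student_mb       # compression factor
speedup = teacher_ms / student_ms          # latency improvement
teacher_fps = 1000.0 / teacher_ms          # frames per second at batch size 1
student_fps = 1000.0 / student_ms
accuracy_drop = teacher_acc - student_acc  # in percentage points

print(f"size: {size_ratio:.1f}x, speed: {speedup:.0f}x")
print(f"FPS: {teacher_fps:.1f} -> {student_fps:.0f}")
print(f"accuracy drop: {accuracy_drop:.1f} points")
```

An 8 ms latency corresponds to an upper bound of 125 FPS, so the ~120 FPS in the table is consistent with a small amount of per-frame overhead.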
```bash
conda create -n <env_name> python=3.12 -y
conda activate <env_name>
pip install uv
uv pip install -e ".[train,deploy,dev]"
```