Fully-Connected layers decomposition
Overview
FC Layer Decomposition uses Singular Value Decomposition (SVD) to factorize large fully-connected layers into smaller, more efficient layers. This is particularly effective for models with large FC layers like VGG.
How It Works
A weight matrix \(W \in \mathbb{R}^{m \times n}\) is decomposed as: \[W \approx U \cdot S \cdot V^T\]
By keeping only the top-\(k\) singular values (so that \(U \in \mathbb{R}^{m \times k}\), \(S \in \mathbb{R}^{k \times k}\), \(V \in \mathbb{R}^{n \times k}\)), we replace one large layer with two smaller layers:
- Original: Linear(n → m) with \(m \times n\) parameters
- Decomposed: Linear(n → k) + Linear(k → m) with \(k \times (m + n)\) parameters
When \(k \ll \min(m, n)\), this significantly reduces the parameter count.
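To make the arithmetic concrete, here is a minimal sketch of factorizing a single Linear layer with a truncated SVD in plain PyTorch (an illustration of the idea, not fasterai's implementation):
import torch
import torch.nn as nn
def decompose_linear(layer: nn.Linear, k: int) -> nn.Sequential:
    # Truncated SVD of the weight matrix W (shape: out_features x in_features)
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    U_k, S_k, Vh_k = U[:, :k], S[:k], Vh[:k, :]
    # First factor maps n -> k (no bias); second maps k -> m and keeps the original bias
    first = nn.Linear(layer.in_features, k, bias=False)
    second = nn.Linear(k, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vh_k            # shape (k, n)
    second.weight.data = U_k * S_k      # shape (m, k), singular values folded into U
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)
layer = nn.Linear(4096, 4096)
approx = decompose_linear(layer, k=2048)
x = torch.randn(8, 4096)
print((layer(x) - approx(x)).abs().max())  # approximation error from dropping singular values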
When to Use FC Decomposition
| Model Type | FC Layer Size | Recommendation |
|---|---|---|
| VGG-style | Very large (4096×4096) | ✅ Highly effective |
| ResNet-style | Small (512×classes) | ❌ Not needed |
| Transformers | Medium (hidden×4×hidden) | ⚠️ May help |
Best for: Models where FC layers dominate the parameter count (e.g., VGG has ~90% of parameters in FC layers).
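As a quick sanity check of that claim, the classifier's share of parameters can be computed directly from the torchvision vgg16_bn architecture (no pretrained weights needed):
from torchvision.models import vgg16_bn
model = vgg16_bn()  # architecture only
total = sum(p.numel() for p in model.parameters())
fc = sum(p.numel() for p in model.classifier.parameters())
print(f"FC share of parameters: {fc / total:.0%}")  # roughly 90% for VGG16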
1. Setup and Data
We start by loading the Oxford-IIIT Pets dataset and building the dataloaders used throughout:
from fastai.vision.all import *
from fasterai.misc.all import *
path = untar_data(URLs.PETS)
files = get_image_files(path/"images")
def label_func(f): return f[0].isupper()
dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(64))
2. Train the Model
We use VGG16 with batch normalization, a model with very large FC layers:
from torchvision.models import vgg16_bn
learn = Learner(dls, vgg16_bn(num_classes=2), metrics=accuracy)
learn.fit_one_cycle(5, 1e-5)
| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.695412 | 0.605576 | 0.673207 | 00:03 |
| 1 | 0.663228 | 0.597482 | 0.673207 | 00:02 |
| 2 | 0.629274 | 0.574708 | 0.691475 | 00:03 |
| 3 | 0.601648 | 0.564869 | 0.700271 | 00:03 |
| 4 | 0.590151 | 0.561631 | 0.709743 | 00:03 |
3. Apply FC Decomposition
Use FC_Decomposer to factorize the fully-connected layers:
fc = FC_Decomposer()
new_model = fc.decompose(learn.model)
Notice how each FC layer is now replaced by a Sequential of two smaller layers. For example:
- Original: Linear(25088 → 4096) = 102M parameters
- Decomposed: Linear(25088 → 2048) + Linear(2048 → 4096) = 59M parameters
new_model
VGG(
(features): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(9): ReLU(inplace=True)
(10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(12): ReLU(inplace=True)
(13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(14): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(15): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(16): ReLU(inplace=True)
(17): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(18): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(19): ReLU(inplace=True)
(20): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(21): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(22): ReLU(inplace=True)
(23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(24): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(25): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(26): ReLU(inplace=True)
(27): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(28): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(29): ReLU(inplace=True)
(30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(31): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(32): ReLU(inplace=True)
(33): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(35): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(36): ReLU(inplace=True)
(37): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(38): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(39): ReLU(inplace=True)
(40): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(41): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(42): ReLU(inplace=True)
(43): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
(classifier): Sequential(
(0): Sequential(
(0): Linear(in_features=25088, out_features=2048, bias=False)
(1): Linear(in_features=2048, out_features=4096, bias=True)
)
(1): ReLU(inplace=True)
(2): Dropout(p=0.5, inplace=False)
(3): Sequential(
(0): Linear(in_features=4096, out_features=2048, bias=False)
(1): Linear(in_features=2048, out_features=4096, bias=True)
)
(4): ReLU(inplace=True)
(5): Dropout(p=0.5, inplace=False)
(6): Sequential(
(0): Linear(in_features=4096, out_features=1, bias=False)
(1): Linear(in_features=1, out_features=2, bias=True)
)
)
)
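The per-layer numbers quoted above can be checked directly from the printed shapes (a back-of-the-envelope count, including a bias only where the printout shows one):
orig = 25088 * 4096 + 4096               # Linear(25088 -> 4096) with bias
dec = 25088 * 2048 + 2048 * 4096 + 4096  # Linear(25088 -> 2048, no bias) + Linear(2048 -> 4096) with bias
print(f"{orig / 1e6:.1f}M -> {dec / 1e6:.1f}M parameters")  # ~102.8M -> ~59.8M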
4. Compare Results
Parameter Reduction
count_parameters(learn.model)
134277186
count_parameters(new_model)
91281476
A reduction of ~43 million parameters (~32% smaller)!
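If count_parameters is not already in scope, a minimal equivalent (assuming the helper simply tallies trainable parameters) is:
def count_parameters(model):
    # Total number of trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)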
Accuracy Trade-off
SVD decomposition is an approximation, so some accuracy loss is expected. The accuracy depends on how many singular values are retained:
new_learn = Learner(dls, new_model, metrics=accuracy)
new_learn.validate()
[0.5803729295730591, 0.6833558678627014]
Accuracy drops from ~71% after training to ~68% after decomposition. To recover accuracy, fine-tune the decomposed model:
new_learn = Learner(dls, new_model, metrics=accuracy)
new_learn.fit_one_cycle(5, 1e-4)  # Fine-tune with small learning rate
5. Parameter Reference
FC_Decomposer Parameters
| Parameter | Default | Description |
|---|---|---|
| rank_ratio | 0.5 | Fraction of singular values to keep (0-1). Lower = more compression, more accuracy loss |
Choosing rank_ratio
| rank_ratio | Compression | Accuracy Impact |
|---|---|---|
| 0.8 | Low | Minimal |
| 0.5 | Medium | Moderate |
| 0.25 | High | Significant (requires fine-tuning) |
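The shapes in the printed model above are consistent with the rank being chosen as k = ⌊rank_ratio × min(in_features, out_features)⌋; assuming that rule, the trade-off for the first VGG16 classifier layer, Linear(25088 → 4096), works out as follows:
m, n = 4096, 25088
for rank_ratio in (0.8, 0.5, 0.25):
    k = int(rank_ratio * min(m, n))
    original = m * n
    decomposed = k * (m + n)
    print(f"rank_ratio={rank_ratio}: k={k}, {original / 1e6:.1f}M -> {decomposed / 1e6:.1f}M parameters")
At rank_ratio=0.5 this reproduces the ~102M → ~59M reduction seen in section 3.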
Summary
| Metric | Original VGG16 | Decomposed | Change |
|---|---|---|---|
| Parameters | 134M | 91M | -32% |
| FC Layer Params | ~120M | ~77M | -36% |
| Accuracy (before fine-tune) | ~71% | ~68% | Needs fine-tuning |
Recommended Workflow
from fasterai.misc.all import *
# 1. Train model
learn.fit_one_cycle(5)
# 2. Decompose FC layers
fc = FC_Decomposer(rank_ratio=0.5)
new_model = fc.decompose(learn.model)
# 3. Fine-tune to recover accuracy
new_learn = Learner(dls, new_model, metrics=accuracy)
new_learn.fit_one_cycle(3, 1e-4)
# 4. (Optional) Apply other compressions
# - Pruning, sparsification, quantization
See Also
- BN Folding - Combine with BN folding for more optimization
- Pruner - Apply structured pruning after decomposition
- ONNX Export - Export optimized model for deployment
- Sparsifier - Add sparsity for further compression