Fully-Connected Layer Decomposition

Factorize your FC layers into smaller low-rank pairs

Overview

FC Layer Decomposition uses Singular Value Decomposition (SVD) to factorize large fully-connected layers into smaller, more efficient layers. This is particularly effective for models with large FC layers like VGG.

How It Works

A weight matrix \(W \in \mathbb{R}^{m \times n}\) has the exact singular value decomposition: \[W = U \cdot S \cdot V^T\]

Keeping only the top-\(k\) singular values yields a rank-\(k\) approximation \(W \approx U_k S_k V_k^T\), with \(U_k \in \mathbb{R}^{m \times k}\) and \(V_k \in \mathbb{R}^{n \times k}\). This lets us replace one large layer with two smaller ones:

- Original: Linear(n → m) with \(m \times n\) parameters
- Decomposed: Linear(n → k) + Linear(k → m) with \(k \times (m + n)\) parameters

When \(k \ll \min(m, n)\), this significantly reduces the parameter count.
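
To make the mechanics concrete, here is a minimal PyTorch sketch of the factorization. It is illustrative only: FC_Decomposer applies this to every FC layer automatically, and the helper name svd_decompose is made up for this example.

import torch
import torch.nn as nn

def svd_decompose(layer, k):
    "Replace one nn.Linear with two low-rank nn.Linear layers (illustrative sketch)."
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    first  = nn.Linear(layer.in_features, k, bias=False)                    # n -> k
    second = nn.Linear(k, layer.out_features, bias=layer.bias is not None)  # k -> m
    first.weight.data  = Vh[:k]            # V_k^T, shape (k, n)
    second.weight.data = U[:, :k] * S[:k]  # U_k . diag(S_k), shape (m, k)
    if layer.bias is not None:
        second.bias.data = layer.bias.data # the original bias moves to the second factor
    return nn.Sequential(first, second)

Folding \(S_k\) into the second factor keeps the pair mathematically equal to \(U_k S_k V_k^T\); with \(k = 2048\), this produces exactly the Linear(25088 → 2048) + Linear(2048 → 4096) shape shown below for VGG16's first FC layer.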

When to Use FC Decomposition

Model Type     FC Layer Size              Recommendation
VGG-style      Very large (4096×4096)     ✅ Highly effective
ResNet-style   Small (512×classes)        ❌ Not needed
Transformers   Medium (hidden×4×hidden)   ⚠️ May help

Best for: Models where FC layers dominate the parameter count (e.g., VGG has ~90% of parameters in FC layers).
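
To check whether your own model is a good candidate, you can measure how much of it lives in Linear layers. A small sketch (fc_fraction is a hypothetical helper, not part of any library):

import torch.nn as nn

def fc_fraction(model):
    "Fraction of a model's parameters that live in nn.Linear layers."
    total = sum(p.numel() for p in model.parameters())
    fc = sum(p.numel() for m in model.modules() if isinstance(m, nn.Linear)
             for p in m.parameters())
    return fc / total

For torchvision's vgg16_bn this comes out around 0.89, in line with the ~90% figure above.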

1. Setup and Data

from fastai.vision.all import *

path = untar_data(URLs.PETS)
files = get_image_files(path/"images")

# In the Oxford-IIIT Pets filenames, cat breeds are capitalized and dog breeds are not
def label_func(f): return f[0].isupper()

dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(64))

2. Train the Model

We use VGG16 with batch normalization, a model with very large FC layers:

from torchvision.models import vgg16_bn

learn = Learner(dls, vgg16_bn(num_classes=2), metrics=accuracy)
learn.fit_one_cycle(5, 1e-5)
epoch train_loss valid_loss accuracy time
0 0.695412 0.605576 0.673207 00:03
1 0.663228 0.597482 0.673207 00:02
2 0.629274 0.574708 0.691475 00:03
3 0.601648 0.564869 0.700271 00:03
4 0.590151 0.561631 0.709743 00:03

3. Apply FC Decomposition

Use FC_Decomposer to factorize the fully-connected layers:

fc = FC_Decomposer()
new_model = fc.decompose(learn.model)  # returns a new model; learn.model is left unchanged

Notice how each FC layer is now replaced by a Sequential of two smaller layers. With the default rank_ratio of 0.5, the first FC layer keeps \(k = 0.5 \times \min(25088, 4096) = 2048\) singular values:

- Original: Linear(25088 → 4096) = 102M parameters
- Decomposed: Linear(25088 → 2048) + Linear(2048 → 4096) = 59M parameters

new_model
VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): ReLU(inplace=True)
    (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (12): ReLU(inplace=True)
    (13): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (14): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (16): ReLU(inplace=True)
    (17): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (19): ReLU(inplace=True)
    (20): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (21): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (22): ReLU(inplace=True)
    (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (24): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (26): ReLU(inplace=True)
    (27): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (28): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (29): ReLU(inplace=True)
    (30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (31): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (32): ReLU(inplace=True)
    (33): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (35): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (36): ReLU(inplace=True)
    (37): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (38): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (39): ReLU(inplace=True)
    (40): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (41): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (42): ReLU(inplace=True)
    (43): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
  (classifier): Sequential(
    (0): Sequential(
      (0): Linear(in_features=25088, out_features=2048, bias=False)
      (1): Linear(in_features=2048, out_features=4096, bias=True)
    )
    (1): ReLU(inplace=True)
    (2): Dropout(p=0.5, inplace=False)
    (3): Sequential(
      (0): Linear(in_features=4096, out_features=2048, bias=False)
      (1): Linear(in_features=2048, out_features=4096, bias=True)
    )
    (4): ReLU(inplace=True)
    (5): Dropout(p=0.5, inplace=False)
    (6): Sequential(
      (0): Linear(in_features=4096, out_features=1, bias=False)
      (1): Linear(in_features=1, out_features=2, bias=True)
    )
  )
)
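
As a quick sanity check before comparing sizes, you can measure how far the decomposed logits drift from the originals on one batch. This is an illustrative snippet; it assumes both models and the batch sit on the same device:

import torch

xb, _ = dls.one_batch()
learn.model.eval(); new_model.eval()  # disable dropout for a deterministic comparison
with torch.no_grad():
    gap = (learn.model(xb) - new_model(xb)).abs().max()
print(gap)  # the residual comes from the discarded singular values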

4. Compare Results

Parameter Reduction
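
This section assumes a count_parameters helper is in scope; if your environment does not provide one, this one-liner matches how it is used below:

def count_parameters(model):
    "Total number of parameters in a model."
    return sum(p.numel() for p in model.parameters())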

count_parameters(learn.model)
134277186
count_parameters(new_model)
91281476

A reduction of ~43 million parameters (~32% smaller)!

Accuracy Trade-off

Truncated SVD is an approximation, so some accuracy loss is expected; how much depends on how many singular values are retained:

new_learn = Learner(dls, new_model, metrics=accuracy)
new_learn.validate()
[0.5803729295730591, 0.6833558678627014]

Here accuracy drops from ~71% to ~68%; more aggressive rank_ratio settings widen the gap. To recover accuracy, fine-tune the decomposed model:

new_learn.fit_one_cycle(5, 1e-4)  # Fine-tune with a small learning rate

5. Parameter Reference

FC_Decomposer Parameters

Parameter    Default   Description
rank_ratio   0.5       Fraction of singular values to keep (0-1). Lower = more compression, more accuracy loss

Choosing rank_ratio

rank_ratio   Compression   Accuracy Impact
0.8          Low           Minimal
0.5          Medium        Moderate
0.25         High          Significant (requires fine-tuning)
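
A practical way to choose a value is to sweep a few candidates and validate each one. This sketch assumes decompose() accepts a rank_ratio keyword; the exact signature may differ in your installed version, so adjust as needed:

for ratio in (0.8, 0.5, 0.25):
    candidate = FC_Decomposer().decompose(learn.model, rank_ratio=ratio)  # assumed kwarg
    params = count_parameters(candidate)
    acc = Learner(dls, candidate, metrics=accuracy).validate()[1]
    print(f"rank_ratio={ratio}: {params:,} params, accuracy={acc:.3f}")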

Summary

Metric                        Original VGG16   Decomposed   Change
Parameters                    134M             91M          -32%
FC Layer Params               ~120M            ~77M         -36%
Accuracy (before fine-tune)   ~71%             ~68%         Recoverable via fine-tuning

See Also

  • BN Folding - Combine with BN folding for more optimization
  • Pruner - Apply structured pruning after decomposition
  • ONNX Export - Export optimized model for deployment
  • Sparsifier - Add sparsity for further compression