KnowledgeDistillationCallback

How to apply knowledge distillation with fasterai

Overview

Knowledge Distillation transfers knowledge from a large, accurate “teacher” model to a smaller, faster “student” model. The student learns not just from ground truth labels, but also from the teacher’s soft predictions—capturing the teacher’s learned relationships between classes.
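
To make “soft predictions” concrete: instead of a one-hot label, the teacher provides a full probability distribution over classes, usually softened with a temperature. The sketch below uses made-up logits and a temperature of 4.0 purely for illustration:

import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([4.0, 2.5, 0.5])    # hypothetical scores for 3 classes
hard_label     = torch.tensor([1.0, 0.0, 0.0])    # a one-hot label carries no class-similarity information

T = 4.0                                           # temperature > 1 softens the distribution
soft_targets = F.softmax(teacher_logits / T, dim=-1)
print(soft_targets)                               # ≈ tensor([0.48, 0.33, 0.20])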

Why Use Knowledge Distillation?

Approach                   Model Size   Training Data                Accuracy
Train small model alone    Small        Labels only                  Lower
Distillation               Small        Labels + Teacher knowledge   Higher

Key Benefits

  • Smaller deployment models - Student can be much smaller than teacher
  • Better than training from scratch - Teacher provides richer supervision
  • No additional labeled data needed - Uses existing training set
  • Flexible loss functions - Soft targets, attention transfer, feature matching

In this tutorial, we’ll distill a ResNet-34 (teacher) into a ResNet-18 (student).
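
For a sense of the size gap, a quick parameter count of the plain torchvision definitions (this check is not part of the tutorial's training code) shows the student is roughly half the size of the teacher:

from torchvision.models import resnet18, resnet34

count_params = lambda m: sum(p.numel() for p in m.parameters())
print(f"ResNet-34 (teacher): {count_params(resnet34()):,} parameters")   # about 21.8M
print(f"ResNet-18 (student): {count_params(resnet18()):,} parameters")   # about 11.7M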

1. Setup and Data

from fastai.vision.all import *
from fasterai.distill.all import *

path = untar_data(URLs.PETS)
files = get_image_files(path/"images")

# In the Oxford-IIIT Pets dataset, cat breeds are capitalized and dog breeds are not
def label_func(f): return f[0].isupper()

dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(64))

2. Train the Teacher Model

First, we train the larger teacher model (ResNet-34) to achieve good accuracy on our dataset:

teacher = vision_learner(dls, resnet34, metrics=accuracy)
teacher.unfreeze()
teacher.fit_one_cycle(10, 1e-3)
epoch train_loss valid_loss accuracy time
0 0.663302 0.382650 0.881597 00:02
1 0.444977 1.731543 0.723951 00:02
2 0.456336 0.390448 0.847091 00:02
3 0.463871 0.314980 0.864005 00:02
4 0.399526 0.548000 0.845061 00:03
5 0.267582 0.222926 0.903248 00:02
6 0.177511 0.180466 0.933694 00:02
7 0.121694 0.195583 0.927605 00:02
8 0.077676 0.192459 0.936401 00:02
9 0.047532 0.180056 0.936401 00:02
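
As an optional sanity check (not part of the original recipe), you can confirm the teacher's validation metrics before distilling; fastai's Learner.validate returns the validation loss followed by the metrics:

# should report roughly the final-epoch numbers from the table above
teacher.validate()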

3. Baseline: Student Without Distillation

Let’s train a ResNet-18 student model without distillation to establish a baseline:

Training from scratch with only ground truth labels:

student = Learner(dls, resnet18(num_classes=2), metrics=accuracy)
student.fit_one_cycle(10, 1e-3)
epoch train_loss valid_loss accuracy time
0 0.611359 0.660552 0.676590 00:02
1 0.565523 0.669257 0.704330 00:02
2 0.537007 0.567621 0.728011 00:02
3 0.498747 0.541553 0.741543 00:02
4 0.449077 0.455508 0.783491 00:02
5 0.399169 0.393245 0.828823 00:02
6 0.342478 0.369859 0.834912 00:02
7 0.272756 0.334547 0.853857 00:02
8 0.187447 0.346933 0.859269 00:02
9 0.147805 0.358428 0.859946 00:02

4. Student With Knowledge Distillation

Now let’s train the same architecture with help from the teacher using SoftTarget loss:

The SoftTarget loss combines:

  • Classification loss - cross-entropy with the ground-truth labels
  • Distillation loss - KL divergence between the student's and the teacher's soft predictions
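
Conceptually, the combined objective looks something like the sketch below. This is only an illustration of the idea, not fasterai's actual SoftTarget implementation; the temperature T and mixing weight alpha are arbitrary example values:

import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    # Distillation term: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradients comparable across temperatures (Hinton et al.)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction='batchmean') * T * T
    # Classification term: standard cross-entropy with the ground-truth labels
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * ce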

student = Learner(dls, resnet18(num_classes=2), metrics=accuracy)
kd = KnowledgeDistillationCallback(teacher.model, SoftTarget, schedule=cos)
student.fit_one_cycle(10, 1e-3, cbs=kd)
epoch train_loss valid_loss accuracy time
0 0.622423 0.658045 0.692828 00:03
1 0.654330 1.211342 0.677267 00:02
2 0.736943 0.757770 0.736807 00:03
3 0.830559 0.949577 0.698241 00:02
4 0.882739 0.915873 0.793640 00:03
5 0.890884 0.799081 0.824763 00:02
6 0.817516 1.475584 0.737483 00:02
7 0.687356 0.730070 0.866035 00:02
8 0.523237 0.718984 0.866035 00:03
9 0.452811 0.703519 0.870771 00:03

With teacher guidance, the student reaches about 87.1% accuracy, roughly one point above the 86.0% baseline trained on labels alone.

5. Advanced: Attention Transfer

Beyond soft targets, fasterai supports more sophisticated distillation losses such as Attention Transfer from “Paying More Attention to Attention” (Zagoruyko & Komodakis). Here, the student learns to replicate the teacher’s attention maps at intermediate layers.
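
An attention map here is a spatial summary of a layer's activations, typically the channel-wise mean (or sum) of squared feature values, L2-normalized and flattened. A minimal sketch of the per-layer loss following Zagoruyko & Komodakis (not fasterai's exact code) might look like this:

import torch.nn.functional as F

def attention_map(fmap):
    # fmap: (batch, channels, H, W) -> normalized spatial map of shape (batch, H*W)
    return F.normalize(fmap.pow(2).mean(dim=1).flatten(1), dim=1)

def attention_loss(student_fmap, teacher_fmap):
    # L2 distance between the two attention maps (assumes matching spatial sizes)
    return (attention_map(student_fmap) - attention_map(teacher_fmap)).pow(2).mean()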

To use intermediate layer losses, specify which layers to match using their string names. Use get_model_layers to discover available layers.
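
For example (assuming get_model_layers simply takes a model and returns its named modules; the output shown in the comments is illustrative):

print(get_model_layers(student.model))   # e.g. 'conv1', 'layer1', 'layer1.0.conv1', ...
print(get_model_layers(teacher.model))   # layer names of the wrapped teacher body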

Here we match attention maps after each residual block. The plain ResNet-18 student exposes these blocks as layer1 through layer4, while the vision_learner-wrapped ResNet-34 teacher exposes them as '0.4' through '0.7' inside its body:

student = Learner(dls, resnet18(num_classes=2), metrics=accuracy)
kd = KnowledgeDistillationCallback(teacher.model, Attention, ['layer1', 'layer2', 'layer3', 'layer4'], ['0.4', '0.5', '0.6', '0.7'], weight=0.9)
student.fit_one_cycle(10, 1e-3, cbs=kd)
epoch train_loss valid_loss accuracy time
0 0.092506 0.091555 0.678620 00:03
1 0.083053 0.084819 0.648173 00:03
2 0.071733 0.073612 0.705007 00:02
3 0.062212 0.059138 0.815291 00:03
4 0.055396 0.053225 0.827470 00:03
5 0.047694 0.052672 0.821380 00:03
6 0.041354 0.048255 0.860622 00:03
7 0.031322 0.042128 0.874831 00:03
8 0.024217 0.042546 0.879567 00:03
9 0.019581 0.042967 0.886333 00:03

6. Parameter Guide

KnowledgeDistillationCallback Parameters

Parameter        Description
teacher          The trained teacher model
loss             Distillation loss function (SoftTarget, Attention, FitNet, etc.)
student_layers   (For intermediate losses) Layers in student to extract features from
teacher_layers   (For intermediate losses) Corresponding layers in teacher
weight           Weight of distillation loss vs classification loss
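
Putting these together: for an output-level loss only the first two arguments are needed, and weight balances the two loss terms (the value below is illustrative, not a recommended setting):

kd = KnowledgeDistillationCallback(teacher.model, SoftTarget, weight=0.9)
student.fit_one_cycle(10, 1e-3, cbs=kd)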

Available Loss Functions

Loss         Type            Description
SoftTarget   Output          Match teacher’s softened predictions
Attention    Intermediate    Match attention maps (spatial activation patterns)
FitNet       Intermediate    Directly match feature maps (requires same dimensions)
RKD          Relational      Match distance/angle relationships between samples
PKT          Probabilistic   Match probability distributions in feature space

Summary

Concept                         Description
Knowledge Distillation          Training a small student to mimic a large teacher
KnowledgeDistillationCallback   fastai callback for distillation during training
SoftTarget                      Basic distillation using teacher’s soft predictions
Attention Transfer              Advanced distillation using intermediate attention maps
Typical Benefit                 1-3% accuracy improvement over training student alone

See Also

  • Distillation Losses - All available distillation loss functions
  • Pruner - Combine distillation with pruning for even smaller models
  • Sparsifier - Add sparsity to distilled models