KnowledgeDistillation Callback

How to apply knowledge distillation with fasterai

We’ll illustrate how to use Knowledge Distillation to distill the knowledge of a Resnet34 (the teacher) into a Resnet18 (the student).

Let’s grab some data.

from fastai.vision.all import *
# Knowledge-distillation utilities from fasterai (KnowledgeDistillationCallback,
# SoftTarget, Attention); the exact module path may differ across fasterai versions.
from fasterai.distill.all import *

path = untar_data(URLs.PETS)
files = get_image_files(path/"images")

def label_func(f): return f[0].isupper()

dls = ImageDataLoaders.from_name_func(path, files, label_func, item_tfms=Resize(64))

The first step is to train the teacher model. We’ll start from a pretrained model to ensure good results on our dataset.

teacher = vision_learner(dls, resnet34, metrics=accuracy)
teacher.unfreeze()
teacher.fit_one_cycle(10, 1e-3)
epoch train_loss valid_loss accuracy time
0 0.712487 0.780329 0.826116 00:04
1 0.426159 0.454067 0.895129 00:04
2 0.306519 0.290588 0.897158 00:04
3 0.196592 0.349783 0.875507 00:04
4 0.172939 0.191000 0.925575 00:04
5 0.131154 0.193276 0.926252 00:04
6 0.107619 0.802561 0.884980 00:04
7 0.070899 0.199201 0.936401 00:04
8 0.039612 0.191167 0.937754 00:04
9 0.024194 0.194976 0.937077 00:04

Without KD

We’ll now train a Resnet18 from scratch, without any help from the teacher model, to serve as a baseline.

student = Learner(dls, resnet18(num_classes=2), metrics=accuracy)
# Alternatively, start from a pretrained student:
# student = vision_learner(dls, resnet18, metrics=accuracy)
student.fit_one_cycle(10, 1e-3)
epoch train_loss valid_loss accuracy time
0 0.602736 0.784080 0.682003 00:04
1 0.582019 0.629800 0.644790 00:04
2 0.547411 0.521493 0.725981 00:04
3 0.490268 0.669058 0.740189 00:04
4 0.448316 0.446682 0.778078 00:03
5 0.403792 0.668784 0.759811 00:03
6 0.350714 0.409201 0.815291 00:04
7 0.279282 0.392315 0.815968 00:04
8 0.197490 0.415861 0.837618 00:03
9 0.157046 0.403317 0.834235 00:04

With KD

And now we train the same model, but with the help of the teacher. The chosen loss is a combination of the regular classification loss (cross-entropy) and a distillation loss that pushes the student to match the teacher’s predictions.
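To give an intuition of what the soft-target part computes, here is a minimal sketch of the classic soft-target distillation term from Hinton et al. (2015). It is an illustration only, not fasterai’s actual SoftTarget implementation, and the temperature value T is an assumption:

import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, T=3.0):
    # Soften both distributions with a temperature T, then measure how far
    # the student is from the teacher with a KL divergence.
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean'
    ) * T**2

The callback adds a term of this kind to the regular cross-entropy loss, so all we need to pass it is the teacher model and the loss to use.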

student = Learner(dls, resnet18(num_classes=2), metrics=accuracy)
#student = vision_learner(dls, resnet18, metrics=accuracy)
kd = KnowledgeDistillationCallback(teacher.model, SoftTarget)
student.fit_one_cycle(10, 1e-3, cbs=kd)
epoch train_loss valid_loss accuracy time
0 2.874970 2.434021 0.709066 00:04
1 2.619885 2.321189 0.737483 00:04
2 2.381633 2.690866 0.730041 00:04
3 2.101448 1.772370 0.771313 00:04
4 1.824600 1.707633 0.793640 00:04
5 1.588555 1.433752 0.814614 00:04
6 1.273060 1.264489 0.843708 00:04
7 0.979666 1.169676 0.849120 00:04
8 0.768508 1.047257 0.862652 00:04
9 0.630613 1.043255 0.861976 00:04

When helped by the teacher, the student model performs better: 86.2% final accuracy versus 83.4% for the baseline.

There exist more sophisticated KD losses, such as the one from Paying More Attention to Attention, where the student tries to replicate the teacher’s attention maps at intermediate layers.
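As a rough sketch of such an attention-transfer term (assuming, as in the paper, that the attention map is the channel-wise sum of squared activations, normalised before comparison):

import torch.nn.functional as F

def attention_map(fmap):
    # (B, C, H, W) -> (B, H*W): sum the squared activations over channels,
    # flatten spatially, and L2-normalise each map.
    return F.normalize(fmap.pow(2).sum(dim=1).flatten(1), dim=1)

def attention_loss(student_fmap, teacher_fmap):
    # Mean squared distance between the normalised attention maps.
    # Assumes matching spatial sizes, which holds for the paired resnet blocks used below.
    return (attention_map(student_fmap) - attention_map(teacher_fmap)).pow(2).mean()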

Using such a loss requires specifying the layers whose attention maps we want to replicate. We refer to them by their string names, which can be obtained with the get_model_layers function.
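For instance, to inspect the available names on both models (here we assume get_model_layers, provided by fasterai, simply takes a model and returns its dotted module names):

# Dotted module names we can pass to the callback, e.g. 'layer1' for the plain
# torchvision resnet18 and '0.4' for the body of the fastai vision_learner.
print(get_model_layers(student.model))
print(get_model_layers(teacher.model))
# Plain PyTorch gives the same information: [n for n, _ in teacher.model.named_modules()]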

For example, we apply the loss after each residual block of our models ('layer1' to 'layer4' in the student, '0.4' to '0.7' in the teacher’s body):

student = Learner(dls, resnet18(num_classes=2), metrics=accuracy)
kd = KnowledgeDistillationCallback(teacher.model, Attention, ['layer1', 'layer2', 'layer3', 'layer4'], ['0.4', '0.5', '0.6', '0.7'], weight=0.9)
student.fit_one_cycle(10, 1e-3, cbs=kd)
epoch train_loss valid_loss accuracy time
0 0.090439 0.085387 0.684709 00:04
1 0.080193 0.081185 0.702300 00:04
2 0.071975 0.068845 0.769959 00:04
3 0.063899 0.062546 0.773342 00:04
4 0.056500 0.057492 0.793640 00:04
5 0.049420 0.055552 0.815968 00:04
6 0.040951 0.051518 0.841678 00:04
7 0.034474 0.047924 0.843708 00:04
8 0.026169 0.049825 0.855210 00:04
9 0.021952 0.050935 0.855210 00:04