Prune Transformers

This example code is taken from the fastai docs.

pretrained_weights = 'gpt2'
tokenizer = GPT2TokenizerFast.from_pretrained(pretrained_weights)
model = GPT2LMHeadModel.from_pretrained(pretrained_weights)
path = untar_data(URLs.WIKITEXT_TINY)
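The Learner below expects a dls DataLoaders object and a DropOutput callback, both built as in the fastai transformers tutorial and omitted here. A minimal sketch of those two pieces, following that tutorial (the batch size and sequence length are assumptions):

# Wrap the HuggingFace tokenizer in a fastai Transform
class TransformersTokenizer(Transform):
    def __init__(self, tokenizer): self.tokenizer = tokenizer
    def encodes(self, x):
        toks = self.tokenizer.tokenize(x)
        return tensor(self.tokenizer.convert_tokens_to_ids(toks))
    def decodes(self, x): return TitledStr(self.tokenizer.decode(x.cpu().numpy()))

# Build language-model DataLoaders from the WikiText csv files
df_train = pd.read_csv(path/'train.csv', header=None)
df_valid = pd.read_csv(path/'test.csv', header=None)
all_texts = np.concatenate([df_train[0].values, df_valid[0].values])
splits = [range_of(df_train), list(range(len(df_train), len(all_texts)))]
tls = TfmdLists(all_texts, TransformersTokenizer(tokenizer), splits=splits, dl_type=LMDataLoader)
dls = tls.dataloaders(bs=8, seq_len=256)  # assumed batch size / sequence length

# GPT2LMHeadModel returns a tuple; keep only the logits for the loss
class DropOutput(Callback):
    def after_pred(self): self.learn.pred = self.pred[0]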
Let's create our fastai Learner.

learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), cbs=[DropOutput], metrics=Perplexity())

And let's try to extend a given prompt with the pretrained model.
prompt = "\n = Unicorn = \n \n A unicorn is a magical creature with a rainbow tail and a horn"
prompt_ids = tokenizer.encode(prompt)
inp = tensor(prompt_ids)[None]
preds = learn.model.generate(inp, max_length=40, num_beams=5, temperature=1.5)
tokenizer.decode(preds[0].cpu().numpy())

Let's evaluate the pretrained model on the validation set, then fine-tune it for one epoch.

learn.validate()
learn.fit_one_cycle(1, 1e-4)

And let's generate from the same prompt with the fine-tuned model.

prompt_ids = tokenizer.encode(prompt)
inp = tensor(prompt_ids)[None]
preds = learn.model.generate(inp.cuda(), max_length=40, num_beams=5, temperature=1.5)
tokenizer.decode(preds[0].cpu().numpy())

Make it sparse!

Let's now retrain our model, this time introducing sparsity.
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), cbs=[DropOutput], metrics=Perplexity())

Unfortunately, the transformer model uses a custom layer, Conv1D, which is not part of PyTorch. To overcome this problem, we have to add this layer to our Granularities class, so that it knows what to sparsify.
Here, Conv1D behaves like a Linear layer, i.e. its weights are defined by a single 2-D matrix of dimension (nx, nf), stored transposed with respect to nn.Linear.
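To make that equivalence concrete, here is a small sanity check (a sketch; it assumes Conv1D has been imported from transformers, e.g. from transformers.pytorch_utils in recent versions):

import torch
import torch.nn as nn

nf, nx = 8, 4
conv = Conv1D(nf, nx)            # maps nx inputs to nf outputs, weight stored as (nx, nf)
lin  = nn.Linear(nx, nf)         # same mapping, weight stored as (nf, nx)
lin.weight.data = conv.weight.data.t()   # copy the (transposed) weights
lin.bias.data   = conv.bias.data

x = torch.randn(2, nx)
print(torch.allclose(conv(x), lin(x), atol=1e-6))  # expected: True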
doc(Conv1D)

We can thus add the Conv1D granularity by using the add_granularity method, indicating the target module and the corresponding granularities it can handle (the same as for Linear, so we can reuse them).
Granularities.add_granularity(Conv1D, Granularities._granularities_Linear)

Let's now define our SparsifyCallback. Let's say we want to make our model 30% sparse, by removing the weights of lowest norm in each Conv1D layer.
sp_cb = SparsifyCallback(sparsity=30, granularity='weight', context='local', criteria=large_final, schedule=one_cycle, layer_type=Conv1D)

We now only have to pass our callback to fastai when fitting.
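For example (a sketch; the number of epochs and learning rate are assumptions, mirroring the earlier fine-tuning run):

learn.fit_one_cycle(1, 1e-4, cbs=sp_cb)

During training, the callback progressively masks weights following the one_cycle schedule until the 30% sparsity target is reached.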
And we can check the prediction for the same prompt as before.
prompt_ids = tokenizer.encode(prompt)
inp = tensor(prompt_ids)[None]
preds = learn.model.generate(inp.cuda(), max_length=40, num_beams=5, temperature=1.5)
tokenizer.decode(preds[0].cpu().numpy())

That's it! You now have a sparse Transformer that performs on par with the dense model. However, this model is currently not more efficient in terms of speed and storage. To obtain such speed-ups, I suggest you take a look at the granularity section.
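If you want to verify the sparsity that was actually reached, a quick illustrative check is to count the zeroed weights in the Conv1D layers:

total, zeros = 0, 0
for m in learn.model.modules():
    if isinstance(m, Conv1D):
        total += m.weight.numel()
        zeros += (m.weight == 0).sum().item()
print(f'Conv1D sparsity: {100 * zeros / total:.1f}%')  # should be close to the 30% target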