In https://alanbuxton.wordpress.com/2021/09/21/transformers-for-use-oriented-entity-extraction/ I wrote about an experience training a custom transformer-based model to do a type of entity extraction. I tried training from scratch because the source text happened to have been preprocessed / lemmatized in such a way to include over 20 custom tokens that the RoBERTa model wouldn’t know about. My assumption was that this text would be so different to normal English that you may as well treat it as its own language.

Once I saw the results I decided to test this hypothesis somewhat by comparing the results of the preprocessed/lemmatized text with the custom model vs a raw version of the text on a fine-tuned out of the box roberta-base model.

Turns out that, for me, the fine-tuned RoBERTa model always outperformed the model trained from scratch, though the difference in performance becomes pretty minimal once you’re in the millions of sentences.

Conclusion – when working in this space, don’t make assumptions. Stand on the shoulders of as many giants as possible.

Approx number of
sentences for fine-tuning
F1 Score –
RoBERTa from scratch
F1 Score –
Fine-tuned roberta-base
1,5000.391820.52233
15,0000.752940.97764
40,0000.926390.99494
65,0000.972600.99627
125,0000.991050.99776
300,0000.996700.99797
600,0000.997710.99866
960,0000.997830.99865
1,400,0000.998100.99888

One thought on “Is it worth training an NLP Transformer from scratch?

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.