Machine Learning, Software Development

Simplified history of NLP Transformers

(Some notes I made recently, posted here in case they are of interest to others – see the tables below)

The transformers story was kicked off by the “Attention is all you need” paper, published in mid-2017 (see the “Key Papers” section below). This eventually led to use cases like Google implementing transformers to improve its search in 2019/2020 and Microsoft implementing transformers to simplify writing code in 2021 (see the “Real-world use of Transformers” section below).

For the rest of us, Huggingface has been producing some great code libraries for working with transformers. These were under heavy development in 2018-2019, including being renamed twice – an indicator of how in flux this area was at the time – but it’s fair to say that things have stabilised a lot over the past year. See the “Major Huggingface Releases” section below.

Another recent data point – Coursera’s Deep Learning Specialisation was based around using Google Brain’s Trax (https://github.com/google/trax). As of October 2021, Coursera has announced that (in addition to doing some of the course with Trax) the transformers part now uses Huggingface.

It feels like transformers are now at a level of maturity where it makes sense to embed them into more real-world use cases. We will inevitably have to go through the Gartner Hype Cycle phases of inflated expectations leading to despair, so it’s important not to let expectations get too far ahead of reality. But even with that caveat in mind, now is a great time to be doing some experimentation with Huggingface’s transformers.
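By way of illustration, getting a first result out of the transformers library takes only a few lines. This is a minimal sketch using the pipeline API; with no model specified, it downloads a default pretrained sentiment-analysis model the first time it runs, and the example sentence and printed output are illustrative.

from transformers import pipeline

# With no model argument, pipeline() fetches a default pretrained
# sentiment-analysis model the first time it runs.
classifier = pipeline("sentiment-analysis")

print(classifier("Transformers are maturing nicely."))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]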

Key Papers

Jun 2017 – “Attention is all you need” published. https://arxiv.org/abs/1706.03762
Oct 2018 – “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” published. https://arxiv.org/abs/1810.04805
Jul 2019 – “RoBERTa: A Robustly Optimized BERT Pretraining Approach” published. https://arxiv.org/abs/1907.11692
May 2020 – “Language Models are Few-Shot Learners” published, describing GPT-3. https://arxiv.org/abs/2005.14165

Real-world use of Transformers

Nov 2018 – Google open-sources BERT code. https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
Oct 2019 – Google starts rolling out BERT in search. https://searchengineland.com/faq-all-about-the-bert-algorithm-in-google-search-324193
May 2020 – OpenAI introduces GPT-3. https://en.wikipedia.org/wiki/GPT-3
Oct 2020 – Google is using BERT on “almost every English-language query”. https://searchengineland.com/google-bert-used-on-almost-every-english-query-342193
May 2021 – Microsoft introduces GPT-3 into Power Apps. https://powerapps.microsoft.com/en-us/blog/introducing-power-apps-ideas-ai-powered-assistance-now-helps-anyone-create-apps-using-natural-language/

Major Huggingface Releases

Nov 2018 – Initial 0.1.2 release of pytorch-pretrained-bert. https://github.com/huggingface/transformers/releases/tag/v0.1.2
Jul 2019 – v1.0 of pytorch-transformers (including the name change from pytorch-pretrained-bert to pytorch-transformers). https://github.com/huggingface/transformers/releases/tag/v1.0.0
Sep 2019 – v2.0, this time including the name change from pytorch-transformers to, simply, transformers. https://github.com/huggingface/transformers/releases/tag/v2.0.0
Jun 2020 – v3.0 of transformers. https://github.com/huggingface/transformers/releases/tag/v3.0.0
Nov 2020 – v4.0 of transformers. https://github.com/huggingface/transformers/releases/tag/v4.0.0
Machine Learning, Supply Chain Management

Transformers vs Spend Classification

In recent posts I’ve written about the use of Transformers in Natural Language Processing.

A friend working in the procurement space asked about their application in combating decepticons, er, unruly spend data. Specifically, could they help speed up classifying spend data?

So I fine-tuned a DistilBERT model using publicly-available data from the TheyBuyForYou project to map text to CPV codes. It took a bit of poking around but the upshot is pretty promising. See the classification results below, where the model can distinguish amongst these types of spend items:

'mobile phone' => Radio, television, communication, telecommunication and related equipment (score = 0.9999891519546509)
'mobile app' => Software package and information systems (score = 0.9995172023773193)
'mobile billboard' => Advertising and marketing services (score = 0.5554304122924805)
'mobile office' => Construction work (score = 0.9570050835609436)
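For the curious, inference with a model like this boils down to a few lines with the Huggingface pipeline API. This is a minimal sketch rather than the exact code from the repo: the local model path is a placeholder for wherever the fine-tuned DistilBERT checkpoint (with its CPV label names) has been saved.

from transformers import pipeline

# The path is a placeholder for a locally-saved fine-tuned DistilBERT
# CPV classifier (model weights plus tokenizer and label names).
classifier = pipeline("text-classification", model="./distilbert-cpv-classifier")

for item in ["mobile phone", "mobile app", "mobile billboard", "mobile office"]:
    prediction = classifier(item)[0]
    print(f"{item!r} => {prediction['label']} (score = {prediction['score']})")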

Usual disclaimers apply: this is a toy example that I played around with until it looked good for a specific use case. In reality you would need to apply domain expertise and understanding of the business. But the key point is that transformers are a lot more capable than older machine learning techniques that I’ve seen in spend classification.

The code is all on GitHub and made available under the Creative Commons BY-NC-SA 4.0 License. It doesn’t include the model itself, as the model is too big for GitHub and I haven’t had a chance to try out Git Large File Storage. If people are interested, I’m more than happy to do that and share the model.

Machine Learning, Software Development

Is it worth training an NLP Transformer from scratch?

In https://alanbuxton.wordpress.com/2021/09/21/transformers-for-use-oriented-entity-extraction/ I wrote about an experience training a custom transformer-based model to do a type of entity extraction. I tried training from scratch because the source text happened to have been preprocessed/lemmatized in such a way as to include over 20 custom tokens that the RoBERTa model wouldn’t know about. My assumption was that this text would be so different to normal English that you may as well treat it as its own language.

Once I saw the results, I decided to test this hypothesis by comparing two approaches: the custom model trained from scratch on the preprocessed/lemmatized text, versus an out-of-the-box roberta-base model fine-tuned on a raw version of the text.
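For illustration, here is roughly what the two setups look like in Huggingface code. This is a minimal sketch rather than the actual training script: it assumes the entity extraction is framed as token classification, and NUM_LABELS and the custom tokenizer path are placeholders.

from transformers import (AutoTokenizer, RobertaConfig,
                          RobertaForTokenClassification)

NUM_LABELS = 5  # placeholder for the number of entity labels

# Option A: fine-tune the pretrained checkpoint on the raw text
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
finetuned_model = RobertaForTokenClassification.from_pretrained(
    "roberta-base", num_labels=NUM_LABELS)

# Option B: same architecture but randomly initialised, trained from scratch
# on the preprocessed/lemmatized text with its own tokenizer (the path below
# is a placeholder for a tokenizer trained on that corpus, including the
# 20+ custom tokens)
custom_tokenizer = AutoTokenizer.from_pretrained("./custom-tokenizer")
config = RobertaConfig(vocab_size=len(custom_tokenizer), num_labels=NUM_LABELS)
scratch_model = RobertaForTokenClassification(config)

# Both models can then be trained and evaluated with the same Trainer setup.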

Turns out that, for me, the fine-tuned RoBERTa model always outperformed the model trained from scratch, though the difference in performance becomes pretty minimal once you’re in the millions of sentences.

Conclusion – when working in this space, don’t make assumptions. Stand on the shoulders of as many giants as possible.

Approx. number of sentences for fine-tuning | F1 score – RoBERTa from scratch | F1 score – Fine-tuned roberta-base
1,500     | 0.39182 | 0.52233
15,000    | 0.75294 | 0.97764
40,000    | 0.92639 | 0.99494
65,000    | 0.97260 | 0.99627
125,000   | 0.99105 | 0.99776
300,000   | 0.99670 | 0.99797
600,000   | 0.99771 | 0.99866
960,000   | 0.99783 | 0.99865
1,400,000 | 0.99810 | 0.99888