Machine Learning, Software Development

Simplified history of NLP Transformers

(Some notes I made recently, posted here in case they are of interest to others – see the tables below)

The transformers story was kicked off by the “Attention is all you need” paper, published in mid-2017 (see the “Key Papers” section below). This eventually led to use cases like Google implementing transformers to improve its search in 2019/2020 and Microsoft implementing transformers to simplify writing code in 2021 (see the “Real-world use of Transformers” section below).

For the rest of us, Huggingface has been producing some great code libraries for working with transformers. These were under heavy development in 2018-2019, including being renamed twice – an indicator of how in flux this area was at the time – but it’s fair to say that things have stabilised a lot over the past year. See the “Major Huggingface Releases” section below.

Another recent data point – Coursera’s Deep Learning Specialisation was based around using Google Brain’s Trax (https://github.com/google/trax). As of October 2021, Coursera has announced that (in addition to doing some of the course with Trax) the transformers part now uses Huggingface.

It feels like transformers are now at a level of maturity where it makes sense to embed them into more real-world use cases. We will inevitably have to go through the Gartner Hype Cycle phases of inflated expectations leading to despair, so it’s important not to let expectations get too far ahead of reality. But even with that caveat in mind, now is a great time to be doing some experimentation with Huggingface’s transformers.
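By way of illustration, getting a first result out of the transformers library takes only a few lines. This is a minimal sketch using the pipeline API; with no model specified, it downloads a default pretrained sentiment-analysis model the first time it runs, and the example sentence and printed output are illustrative.

from transformers import pipeline

# With no model argument, pipeline() fetches a default pretrained
# sentiment-analysis model the first time it runs.
classifier = pipeline("sentiment-analysis")

print(classifier("Transformers are maturing nicely."))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]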

Key Papers

Jun 2017 – “Attention is all you need” published. https://arxiv.org/abs/1706.03762
Oct 2018 – “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” published. https://arxiv.org/abs/1810.04805
Jul 2019 – “RoBERTa: A Robustly Optimized BERT Pretraining Approach” published. https://arxiv.org/abs/1907.11692
May 2020 – “Language Models are Few-Shot Learners” published, describing GPT-3. https://arxiv.org/abs/2005.14165

Real-world use of Transformers

Nov 2018 – Google open-sources BERT code. https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
Oct 2019 – Google starts rolling out BERT in search. https://searchengineland.com/faq-all-about-the-bert-algorithm-in-google-search-324193
May 2020 – OpenAI introduces GPT-3. https://en.wikipedia.org/wiki/GPT-3
Oct 2020 – Google is using BERT on “almost every English-language query”. https://searchengineland.com/google-bert-used-on-almost-every-english-query-342193
May 2021 – Microsoft introduces GPT-3 into Power Apps. https://powerapps.microsoft.com/en-us/blog/introducing-power-apps-ideas-ai-powered-assistance-now-helps-anyone-create-apps-using-natural-language/

Major Huggingface Releases

Nov 2018 – Initial 0.1.2 release of pytorch-pretrained-bert. https://github.com/huggingface/transformers/releases/tag/v0.1.2
Jul 2019 – v1.0 of pytorch-transformers (including the name change from pytorch-pretrained-bert to pytorch-transformers). https://github.com/huggingface/transformers/releases/tag/v1.0.0
Sep 2019 – v2.0, this time including the name change from pytorch-transformers to, simply, transformers. https://github.com/huggingface/transformers/releases/tag/v2.0.0
Jun 2020 – v3.0 of transformers. https://github.com/huggingface/transformers/releases/tag/v3.0.0
Nov 2020 – v4.0 of transformers. https://github.com/huggingface/transformers/releases/tag/v4.0.0
Machine Learning, Supply Chain Management

Transformers vs Spend Classification

In recent posts I’ve written about the use of Transformers in Natural Language Processing.

A friend working in the procurement space asked about their application in combating decepticons, er, unruly spend data. Specifically, could they help speed up classifying spend data?

So I fine-tuned a DistilBERT model using publicly-available data from the TheyBuyForYou project to map text to CPV codes. It took a bit of poking around but the upshot is pretty promising. See the classification results below, where the model can distinguish amongst these types of spend items:

'mobile phone' => Radio, television, communication, telecommunication and related equipment (score = 0.9999891519546509)
'mobile app' => Software package and information systems (score = 0.9995172023773193)
'mobile billboard' => Advertising and marketing services (score = 0.5554304122924805)
'mobile office' => Construction work (score = 0.9570050835609436)
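For the curious, inference with a model like this boils down to a few lines with the Huggingface pipeline API. This is a minimal sketch rather than the exact code from the repo: the local model path is a placeholder for wherever the fine-tuned DistilBERT checkpoint (with its CPV label names) has been saved.

from transformers import pipeline

# The path is a placeholder for a locally-saved fine-tuned DistilBERT
# CPV classifier (model weights plus tokenizer and label names).
classifier = pipeline("text-classification", model="./distilbert-cpv-classifier")

for item in ["mobile phone", "mobile app", "mobile billboard", "mobile office"]:
    prediction = classifier(item)[0]
    print(f"{item!r} => {prediction['label']} (score = {prediction['score']})")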

Usual disclaimers apply: this is a toy example that I played around with until it looked good for a specific use case. In reality you would need to apply domain expertise and understanding of the business. But the key point is that transformers are a lot more capable than older machine learning techniques that I’ve seen in spend classification.

The code is all on GitHub and made available under the Creative Commons BY-NC-SA 4.0 License. It doesn’t include the model itself, as the model is too big for GitHub and I haven’t had a chance to try out Git Large File Storage. If people are interested, I’m more than happy to do that and share the model.

Machine Learning, Software Development

Is it worth training an NLP Transformer from scratch?

In https://alanbuxton.wordpress.com/2021/09/21/transformers-for-use-oriented-entity-extraction/ I wrote about an experience training a custom transformer-based model to do a type of entity extraction. I tried training from scratch because the source text happened to have been preprocessed/lemmatized in such a way as to include over 20 custom tokens that the RoBERTa model wouldn’t know about. My assumption was that this text would be so different to normal English that you may as well treat it as its own language.

Once I saw the results, I decided to test this hypothesis by comparing two approaches: the custom model trained from scratch on the preprocessed/lemmatized text, versus an out-of-the-box roberta-base model fine-tuned on a raw version of the text.
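For illustration, here is roughly what the two setups look like in Huggingface code. This is a minimal sketch rather than the actual training script: it assumes the entity extraction is framed as token classification, and NUM_LABELS and the custom tokenizer path are placeholders.

from transformers import (AutoTokenizer, RobertaConfig,
                          RobertaForTokenClassification)

NUM_LABELS = 5  # placeholder for the number of entity labels

# Option A: fine-tune the pretrained checkpoint on the raw text
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
finetuned_model = RobertaForTokenClassification.from_pretrained(
    "roberta-base", num_labels=NUM_LABELS)

# Option B: same architecture but randomly initialised, trained from scratch
# on the preprocessed/lemmatized text with its own tokenizer (the path below
# is a placeholder for a tokenizer trained on that corpus, including the
# 20+ custom tokens)
custom_tokenizer = AutoTokenizer.from_pretrained("./custom-tokenizer")
config = RobertaConfig(vocab_size=len(custom_tokenizer), num_labels=NUM_LABELS)
scratch_model = RobertaForTokenClassification(config)

# Both models can then be trained and evaluated with the same Trainer setup.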

Turns out that, for me, the fine-tuned RoBERTa model always outperformed the model trained from scratch, though the difference in performance becomes pretty minimal once you’re in the millions of sentences.

Conclusion – when working in this space, don’t make assumptions. Stand on the shoulders of as many giants as possible.

Approx. number of sentences for fine-tuning | F1 score – RoBERTa from scratch | F1 score – Fine-tuned roberta-base
1,500     | 0.39182 | 0.52233
15,000    | 0.75294 | 0.97764
40,000    | 0.92639 | 0.99494
65,000    | 0.97260 | 0.99627
125,000   | 0.99105 | 0.99776
300,000   | 0.99670 | 0.99797
600,000   | 0.99771 | 0.99866
960,000   | 0.99783 | 0.99865
1,400,000 | 0.99810 | 0.99888