Machine Learning

Entity extraction powered by Flan

A Thanksgiving update on my side project. See here for an outline of the problem. In short, existing natural language processing techniques are good at generic entity extraction, but not at really getting to the core of the story.

I call it the ‘Bloomberg problem’. Imagine this text: “Bloomberg reported that Foo Inc has bought Bar Corp”. Bloomberg is not relevant in this story. But it is relevant in this one: “Bloomberg has just announced a new version of their Terminal”.

I wrote about my first attempt to address this problem, and then followed it up in July. I’ve been doing some more finessing since then and am pretty happy with the results. There is still some tidying up to do but I’m pretty confident that the building blocks are all there.

The big changes since July are:

  1. Replacing a lot of the post-processing logic with a model trained on more data. This was heartbreaking (throw away work, sad face emoji) but at the same time exhilarating (it works a lot better with less code in, big smile emoji).
  2. Implementing Flan T5 to help with some of the more generic areas.

At a high level this is how it works (a rough code sketch follows the list):

  1. The model
    • Approx 400 tagged docs (in total, across train, val and test sets)
    • Some judicious data synthesis
    • Trained a Named Entity Recognition model based on roberta-base
  2. Post-processing is a combination of
    • Benepar for constituency parsing to identify the relationships between entities for most cases
    • FlanT5 to help with the less obvious relationships.
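To make the two stages concrete, here is a minimal, illustrative sketch. It is not the project's actual code: the model path and example sentence are placeholders, and it assumes the Huggingface token-classification pipeline plus Benepar's spaCy integration.

```python
# Illustrative sketch only, not the project's actual code.
import spacy
import benepar
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

MODEL_DIR = "path/to/fine-tuned-roberta-ner"  # placeholder: the fine-tuned roberta-base NER model

# Stage 1: run the fine-tuned NER model over the text.
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForTokenClassification.from_pretrained(MODEL_DIR)
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Wolters Kluwer appoints Kevin Hay as Vice President of Sales for FRR"
entities = ner(text)  # e.g. [{"entity_group": "ORG", "word": "Wolters Kluwer", ...}, ...]

# Stage 2: constituency-parse the sentence with Benepar to work out how the entities relate.
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp(text)
for sent in doc.sents:
    print(sent._.parse_string)  # walk this tree to link person, role and organization
```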

Next steps are going to be to start representing this as a knowledge graph, which is a more natural way of exploring the data.
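As a flavour of what that could look like, here is a hypothetical sketch using NetworkX; the node and edge attributes are made up for illustration rather than taken from the real data model.

```python
# Hypothetical sketch: loading extracted appointment triples into a graph.
import networkx as nx

G = nx.MultiDiGraph()

# One extracted appointment, using illustrative attribute names.
G.add_node("Kevin Hay", type="person")
G.add_node("Wolters Kluwer", type="organization")
G.add_edge("Kevin Hay", "Wolters Kluwer", relation="appointed_to",
           role="Vice President of Sales")

# Exploring the data then becomes a graph query, e.g. all appointments at one organization.
for person, org, data in G.in_edges("Wolters Kluwer", data=True):
    print(person, data["relation"], org, "as", data["role"])
```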

See below for a screenshot of the appointment topics extracted recently. These are available online at https://syracuse-1145.herokuapp.com/appointments

And below are the URLs for these appointment topics.

The Native Antigen Company Strengthens Senior Team to Support Growing Product Portfolio

In this example, we have a number of companies listed – some relate to the company that is appointing these two new individuals, and some are companies where the individuals used to work. Not all the company records are equally relevant.

Wolters Kluwer appoints Kevin Hay as Vice President of Sales for FRR

The Native Antigen Company Strengthens Senior Team to Support Growing Product Portfolio (Business Wire version)

Kering boosted by report Gucci’s creative director Michele to step down

Broadcat Announces Appointment of Director of Operations

Former MediaTek General Counsel Dr. Hsu Wei-Fu Joins ProLogium Technology to Reinforce Solid-State Battery IP Protection and Patent Portfolio Strategy

HpVac Appoints Joana Vitte, MD, PhD, as Chief Scientific Officer

Highview Power Appoints Sandra Redding to its Leadership Team

Highview Power Appoints Sandra Redding to its Leadership Team (Business Wire version)

ASML Supervisory Board changes announced

Recommendation from Equinor’s nomination committee

I’m pretty impressed with this one. There are a lot of organizations mentioned in this document, with one person joining and one person leaving. The system has correctly identified the relevant individuals and organization. There is some redundancy: Board Member and Board Of Directors are effectively the same role, but that is something that can easily be cleaned up with some more post-processing.

SG Analytics appoints Rob Mitchell as the new Advisory Board Member

Similarly, this article includes the organization that Rob has been appointed to and the names of organizations where he has worked before.

SG Analytics appoints Rob Mitchell as the new Advisory Board Member (Business Wire version)

Machine Learning, Software Development

Analyzing a WhatsApp group chat with Seaborn, NetworkX and Transformers

We had a company shutdown recently. Simfoni operates ‘Anytime Anywhere’, which means that anyone can work whatever hours they feel are appropriate from wherever they want to. Every quarter we mandate a full company shutdown over a long weekend to make sure that we all take time away from work at the same time and come back to a clear inbox.

For me this meant a bunch of playing with my kids and hanging out in the garden.

But it also meant playing with some fun tech, courtesy of a brief challenge I was set: what insights could I generate quickly from a WhatsApp group chat?

I had a go using some of my favourite tools: Seaborn for easy data visualization, Huggingface Transformers for ML insights, and NetworkX for graph analysis.

You can find the repo here: https://github.com/alanbuxton/whatsapp-analysis
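The repo has the full detail; purely as a flavour of the approach, here is a hedged sketch (not the repo's actual code) of the kind of thing involved. The filename and the export-format regex are assumptions, since WhatsApp's export format varies by locale.

```python
# Hedged sketch, not the repo's actual code: parse a WhatsApp export, chart message
# counts with Seaborn, run a Transformers pipeline, and build a NetworkX graph.
import re
import networkx as nx
import pandas as pd
import seaborn as sns
from transformers import pipeline

# Exports look roughly like "12/05/2022, 09:15 - Alice: Hello" (format varies by locale).
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}), (\d{2}:\d{2}) - ([^:]+): (.*)$")

rows = []
with open("chat.txt", encoding="utf-8") as f:  # placeholder filename
    for line in f:
        m = LINE_RE.match(line.strip())
        if m:
            date, time, sender, message = m.groups()
            rows.append({"date": date, "time": time, "sender": sender, "message": message})
df = pd.DataFrame(rows)

# Seaborn: who sends the most messages?
sns.countplot(data=df, y="sender", order=df["sender"].value_counts().index)

# Transformers: quick-and-dirty sentiment over a handful of messages.
sentiment = pipeline("sentiment-analysis")
print(sentiment(df["message"].head(10).tolist()))

# NetworkX: a simple interaction graph linking consecutive senders (one of many possible choices).
G = nx.DiGraph()
for a, b in zip(df["sender"], df["sender"].shift(-1)):
    if isinstance(b, str) and a != b:
        G.add_edge(a, b, weight=G.get_edge_data(a, b, {"weight": 0})["weight"] + 1)
```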

Machine Learning, Software Development

Simplified history of NLP Transformers

(Some notes I made recently, posted here in case they are of interest to others – see the tables below.)

The transformers story was kicked off by the “Attention is all you need” paper published in mid 2017. (See “Key Papers” section below). This eventually led to use cases like Google implementing transformers to improve its search in 2019/2020 and Microsoft implementing transformers to simplify writing code in 2021 (See “Real-world use of Transformers” section below).

For the rest of us, Huggingface has been producing some great code libraries for working with transformers. This was under heavy development in 2018-2019, including being renamed twice – an indicator of how in flux this area was at the time – but it’s fair to say that this has stabilised a lot over the past year. See “Major Huggingface releases” section below.

Another recent data point – Coursera’s Deep Learning Specialisation was based around using Google Brain’s Trax (https://github.com/google/trax). As of October 2021 Coursera has now announced that (in addition to doing some of the course with Trax) the transformers part now uses Huggingface.

Feels like transformers are now at a level of maturity where it makes sense to embed them into more real-world use cases. We will inevitably have to go through the Gartner Hype Cycle phases of inflated expectations leading to despair, so it’s important not to let expectations get too far ahead of reality. But even with that caveat in mind, now is a great time to be doing some experimentation with Huggingface’s transformers.
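As an illustration of how low the barrier to entry now is, a zero-shot experiment with the pipeline API takes only a few lines. This is a toy example, not part of any of the projects above, and it uses whatever default model the library picks:

```python
# Toy example: zero-shot classification with the Huggingface pipeline API.
from transformers import pipeline

classifier = pipeline("zero-shot-classification")  # downloads a default pretrained model on first use
print(classifier(
    "Microsoft introduces GPT-3 into Power Apps",
    candidate_labels=["software", "sport", "cooking"],
))
```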

Key papers

Jun 2017 | “Attention is all you need” published | https://arxiv.org/abs/1706.03762
Oct 2018 | “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” published | https://arxiv.org/abs/1810.04805
Jul 2019 | “RoBERTa: A Robustly Optimized BERT Pretraining Approach” published | https://arxiv.org/abs/1907.11692
May 2020 | “Language Models are Few-Shot Learners” published, describing use of GPT-3 | https://arxiv.org/abs/2005.14165

Real-world use of Transformers

Nov 2018 | Google open sources BERT code | https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
Oct 2019 | Google starts rolling out BERT implementation for search | https://searchengineland.com/faq-all-about-the-bert-algorithm-in-google-search-324193
May 2020 | OpenAI introduces GPT-3 | https://en.wikipedia.org/wiki/GPT-3
Oct 2020 | Google is using BERT on “almost every English-language query” | https://searchengineland.com/google-bert-used-on-almost-every-english-query-342193
May 2021 | Microsoft introduces GPT-3 into Power Apps | https://powerapps.microsoft.com/en-us/blog/introducing-power-apps-ideas-ai-powered-assistance-now-helps-anyone-create-apps-using-natural-language/

Major Huggingface Releases

Nov 2018 | Initial 0.1.2 release of pytorch-pretrained-bert | https://github.com/huggingface/transformers/releases/tag/v0.1.2
Jul 2019 | v1.0 of their pytorch-transformers library (including change of name from pytorch-pretrained-bert to pytorch-transformers) | https://github.com/huggingface/transformers/releases/tag/v1.0.0
Sep 2019 | v2.0, this time including name change from pytorch-transformers to, simply, transformers | https://github.com/huggingface/transformers/releases/tag/v2.0.0
Jun 2020 | v3.0 of transformers | https://github.com/huggingface/transformers/releases/tag/v3.0.0
Nov 2020 | v4.0 of transformers | https://github.com/huggingface/transformers/releases/tag/v4.0.0

Machine Learning, Supply Chain Management

Transformers vs Spend Classification

In recent posts I’ve written about the use of Transformers in Natural Language Processing.

A friend working in the procurement space asked about their application in combating decepticons (sorry, unruly spend data). Specifically, could they help speed up classifying spend data?

So I fine-tuned a Distilbert model using publicly-available data from the TheyBuyForYou project to map text to CPV codes. It took a bit of poking around but the upshot is pretty promising. See the classification results below, where the model distinguishes amongst these types of spend items:

'mobile phone' => Radio, television, communication, telecommunication and related equipment (score = 0.9999891519546509)
'mobile app' => Software package and information systems (score = 0.9995172023773193)
'mobile billboard' => Advertising and marketing services (score = 0.5554304122924805)
'mobile office' => Construction work (score = 0.9570050835609436)

Usual disclaimers apply: this is a toy example that I played around with until it looked good for a specific use case. In reality you would need to apply domain expertise and understanding of the business. But the key point is that transformers are a lot more capable than older machine learning techniques that I’ve seen in spend classification.
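For anyone curious what such a fine-tune looks like in outline, here is a hedged sketch of a DistilBERT sequence-classification setup. It is not the repo's actual code: the toy dataset and label ids are placeholders standing in for the TheyBuyForYou-derived training data.

```python
# Hedged sketch, not the actual repo code: fine-tuning DistilBERT to map spend
# descriptions to CPV categories, assuming a labelled dataset of (text, label) pairs.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Placeholder training data; in practice this comes from the TheyBuyForYou data.
train = Dataset.from_dict({
    "text": ["mobile phone", "mobile app"],
    "label": [0, 1],  # integer ids standing in for CPV categories
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpv-classifier", num_train_epochs=3),
    train_dataset=train,
)
trainer.train()
# Scores like the ones above come from a softmax over the classifier's logits at inference time.
```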

The code is all on GitHub and made available under the Creative Commons BY-NC-SA 4.0 License. It doesn’t include the model itself, as the model is too big for GitHub and I haven’t had a chance to try out Git Large File Storage. If people are interested I’m more than happy to look into it.