Transformers for Use-Oriented Entity Extraction

The internet is full of text information. We’re drowning in it. The only way to make sense of it is to use computers to interpret the text for us.

Consider this text:

Foo Inc announced it has acquired Bar Corp. The transaction closed yesterday, reported the Boston Globe.

This is a story about a company called ‘Foo’ buying a company called ‘Bar’. (I’m just using Foo and Bar as generic placeholder names; these aren’t real companies.)

I first looked at Dandelion in 2018, and I was curious to see how the state of the art for pulling these key bits of information out of text has evolved since then.

TL;DR – existing Natural Language services vary from terrible to tolerable. But recent advances in language models, specifically transformers, point towards huge leaps in this kind of language processing.

Dandelion

Demo site: https://dandelion.eu/semantic-text/entity-extraction-demo/

While Dandelion was pretty impressive in 2018, its quality on this type of sentence is pretty poor. The only entity it identified was the Boston Globe, and it tagged that as a “Work” (i.e. a work of art or literature). As I allowed more flexibility in finding entities, it also found that the terms “Inc” and “Corp” usually relate to a Corporation, and it found a Toni Braxton song. Nul points.

Link to video

Explosion.ai

Demo site: https://explosion.ai/demos/displacy-ent

This organisation uses pretty standard named entity recognition. It successfully identified that there were three entities in this text. Pretty solid performance at extracting named entities, but not much help for my use case because the Boston Globe entity is not relevant to the key points of the story.

Link to video

Microsoft

Demo site: https://aidemos.microsoft.com/text-analytics

Thought I’d give Microsoft’s text analytics demo a whirl. Completely incomprehensible results. Worse than Dandelion.

Link to video

Completely WTF

Expert.ai

Demo site: https://try.expert.ai/analysis-and-classification

With Microsoft’s effort out of the way, time to look at a serious contender.

This one did a pretty good job. It identified Foo Inc and Bar Corp as businesses. It identified The Boston Globe as a different kind of entity. There was also some good inference that Foo had made an announcement and that something had acquired Bar Corp. But it didn’t go so far as to join the dots and work out that Foo was the buyer.

In this example, labelling The Boston Globe as Mass Media is helpful. It means I can ignore it unless I specifically want to know who is reporting which story. But this helpfulness can go too far. When I changed the name “Bar Corp” to “Reuters Corp”, the entity extraction found only one business entity: Foo Inc. The other two entities, Reuters Corp and The Boston Globe, were both tagged as Mass Media.

Long story short – Expert.ai is the best so far, but a user would still need to implement a fair bit of post-processing to be able to extract the key elements from this text.

Link to video.

Expert.ai identifies entities based on the nature of each entity, not on the role it plays in the text. The relations are handled separately. I was looking for something that combined the relevant information from both the entities and their relations. I’ll call it ‘use-oriented entity extraction’, following Wittgenstein’s advice that, if you want to understand language: “Don’t look for the meaning, look for the use”. In other words, the meaning of a word in some text can differ depending on how the word is used. In one sentence, Reuters might be the media company reporting a story. In another sentence, Reuters might be the business at the centre of the story.
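To make that distinction concrete, here is a minimal sketch of the two kinds of label sitting side by side. Everything in it is an assumption for illustration: the Mention class, the role names (SPENDER, TARGET, REPORTER) and the hand-written annotations are hypothetical, and none of it is the output format of Expert.ai or any other service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    entity_type: str  # what the thing is (type-oriented label)
    role: str         # what the thing is doing in this sentence (use-oriented label)


# Hand-written, hypothetical annotations for two versions of the story.
# In story_a Reuters is the company being bought; in story_b it is the reporter.
story_a = [
    Mention("Foo Inc", "Business", "SPENDER"),
    Mention("Reuters Corp", "Mass Media", "TARGET"),
    Mention("Boston Globe", "Mass Media", "REPORTER"),
]
story_b = [
    Mention("Foo Inc", "Business", "SPENDER"),
    Mention("Bar Corp", "Business", "TARGET"),
    Mention("Reuters", "Mass Media", "REPORTER"),
]


def key_parties(mentions):
    """Keep only the mentions whose role puts them at the centre of the deal."""
    return {m.role: m.text for m in mentions if m.role in {"SPENDER", "TARGET"}}


print(key_parties(story_a))
print(key_parties(story_b))
```

Filtering on entity_type would drop Reuters Corp from the first story because it is a media company. Filtering on role keeps it, because in that sentence it is the business being bought.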

Enter Transformers

I wondered how Transformers would do with the challenge of identifying the different entities depending on how the words are used in the text. So I trained a custom RoBERTa model using a relatively small base set of text and some judicious pre-processing. I was blown away by the results. When I first saw all the 9s appearing in the F1 score my initial reaction was “this has to be a bug, no way is this really this accurate”. Turns out it wasn’t a bug.
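For anyone unfamiliar with the F1 score: it is the harmonic mean of precision (how many of the predicted entities were right) and recall (how many of the real entities were found). The sketch below computes it at the entity level over (text, role) pairs. That is an assumption on my part for illustration; token-level scoring is another common choice, and the example data and role names are made up.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall and F1 over sets of (text, role) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the model finds both real parties,
# but also wrongly labels the newspaper as a party to the deal.
gold = {("Foo Inc", "SPENDER"), ("Bar Corp", "TARGET")}
predicted = {
    ("Foo Inc", "SPENDER"),
    ("Bar Corp", "TARGET"),
    ("Boston Globe", "TARGET"),
}

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

One spurious entity out of three predictions is enough to pull F1 down to 0.80 here, which is why a score full of 9s looked suspicious at first.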

I’ve called the prototype “Napoli” because I like coastal locations and Napoli includes the consonants N, L and P. This is a super-simple proof of concept and would have a long way to go to become production-ready, but even these early results were pretty amazing:

  1. It could tell me that Foo Inc is the spending party that bought Bar Corp.
  2. If I changed ‘Bar’ to ‘Reuters’ it could tell me that Foo Inc bought Reuters Corp.
  3. If I changed the word “acquired” to “sold” it would tell me that Foo Inc is the receiving party that sold Reuters Corp (or Bar Corp etc).
  4. It didn’t get confused by the irrelevant fact that the Boston Globe was doing the reporting.
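One common way to get results like these out of a transformer is token classification: the model assigns a tag to every token, and a small decoding step groups the tags back into spans. The sketch below shows only that decoding step, using the widely used BIO scheme. The tags are hand-written stand-ins for model output, and the role names SPENDER and TARGET are hypothetical; this is not a description of how Napoli actually works.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (text, role) spans."""
    spans = []
    current_tokens, current_role = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, role = tag.partition("-")
        if prefix == "B" or (prefix == "I" and role != current_role):
            # Start a new span (also tolerates an I- tag with no preceding B-).
            if current_tokens:
                spans.append((" ".join(current_tokens), current_role))
            current_tokens, current_role = [token], role
        elif prefix == "I":
            # Continue the span that is already open.
            current_tokens.append(token)
        else:
            # "O" tag: close any open span.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_role))
            current_tokens, current_role = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_role))
    return spans


tokens = [
    "Foo", "Inc", "announced", "it", "has", "acquired", "Bar", "Corp", ".",
    "The", "transaction", "closed", "yesterday", ",", "reported", "the",
    "Boston", "Globe", ".",
]
# Hand-written tags standing in for model predictions.
# The Boston Globe tokens are tagged "O" because the newspaper plays no part in the deal.
tags = ["B-SPENDER", "I-SPENDER"] + ["O"] * 4 + ["B-TARGET", "I-TARGET"] + ["O"] * 11

print(decode_bio(tokens, tags))
```

In this framing, swapping “acquired” for “sold” would change the tags the model predicts for Foo Inc, while the decoding step stays exactly the same.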

Link to video