A Thanksgiving update on my side project. See here for an outline of the problem. In short, existing natural language processing techniques are good at generic entity extraction, but not at really getting to the core of the story.
I call it the ‘Bloomberg problem’. Imagine this text: “Bloomberg reported that Foo Inc has bought Bar Corp”. Bloomberg is not relevant in this story. But it is relevant in this one: “Bloomberg has just announced a new version of their Terminal”.
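To make the distinction concrete, here is a toy sketch. It is not the approach used in this project: the organization list, the reporting verbs, and the function names are all hypothetical, and a regular expression stands in for what a trained model would have to learn. It only shows why listing every organization is not enough.

```python
import re

# Hypothetical lookup list; a real system would use an NER model instead.
KNOWN_ORGS = ["Bloomberg", "Foo Inc", "Bar Corp"]

# Hypothetical verbs that mark an organization as the source of a story.
REPORTING_VERBS = ("reported", "said", "wrote")


def extract_orgs(text):
    """Generic extraction: every known organization, in order of appearance."""
    found = [(text.find(org), org) for org in KNOWN_ORGS if org in text]
    return [org for _, org in sorted(found)]


def relevant_orgs(text):
    """Drop organizations that only appear as 'X reported that ...'."""
    verbs = "|".join(REPORTING_VERBS)
    relevant = []
    for org in extract_orgs(text):
        pattern = re.escape(org) + r"\s+(?:" + verbs + r")\s+that\b"
        if not re.search(pattern, text):
            relevant.append(org)
    return relevant


story_1 = "Bloomberg reported that Foo Inc has bought Bar Corp"
story_2 = "Bloomberg has just announced a new version of their Terminal"

print(extract_orgs(story_1))   # ['Bloomberg', 'Foo Inc', 'Bar Corp']
print(relevant_orgs(story_1))  # ['Foo Inc', 'Bar Corp']
print(relevant_orgs(story_2))  # ['Bloomberg']
```

A rule like this breaks as soon as the phrasing changes, which is why the real work is in the model rather than in hand-written patterns.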
I wrote about my first attempt to address this problem, and then followed it up in July. I’ve been doing some more finessing since then and am pretty happy with the results. There is still some tidying up to do but I’m pretty confident that the building blocks are all there.
The big changes since July are:
- Replacing a lot of the post-processing logic with a model trained on more data. This was heartbreaking (throwaway work, sad face emoji) but at the same time exhilarating (it works a lot better with less code, big smile emoji).
- Using Flan-T5 to help with some of the more generic areas.
At a high level this is how it works:
- The model
  - Approx 400 tagged docs (in total, across train, val and test sets)
  - Some judicious data synthesis
  - Trained a Named Entity Recognition model based on roberta-base
- Post-processing is a combination of
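As a rough illustration of what a "tagged doc" can look like for a token-classification model, here is a minimal sketch that turns entity spans into BIO tags, a common input format for NER training. The label names (`ORG`, `PERSON`, `ROLE`) and the span format are assumptions for the example, not the tag set actually used in this project.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans into one BIO tag per token.

    tokens: list of words
    spans:  list of (start, end, label) token indexes, end exclusive
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags


tokens = "Wolters Kluwer appoints Kevin Hay as Vice President of Sales for FRR".split()

# Hypothetical annotations for this headline.
spans = [(0, 2, "ORG"), (3, 5, "PERSON"), (6, 10, "ROLE")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each token ends up with exactly one tag, so the tags can be fed to a token-classification head on top of a base model such as roberta-base once they are aligned to its subword tokens.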
The next step is to start representing this as a knowledge graph, which is a more natural way of exploring the data.
See below for a screenshot of the appointment topics extracted recently. These are available online at https://syracuse-1145.herokuapp.com/appointments
And below are the URLs for these appointment topics:
The Native Antigen Company Strengthens Senior Team to Support Growing Product Portfolio
In this example, we have a number of companies listed: some are the company that is appointing these two new individuals and some are companies where the individuals used to work. Not all the company records are equally relevant.
Wolters Kluwer appoints Kevin Hay as Vice President of Sales for FRR
The Native Antigen Company Strengthens Senior Team to Support Growing Product Portfolio (Business Wire version)
Kering boosted by report Gucci’s creative director Michele to step down
Broadcat Announces Appointment of Director of Operations
HpVac Appoints Joana Vitte, MD, PhD, as Chief Scientific Officer
Highview Power Appoints Sandra Redding to its Leadership Team
Highview Power Appoints Sandra Redding to its Leadership Team (Business Wire version)
ASML Supervisory Board changes announced
Recommendation from Equinor’s nomination committee
I’m pretty impressed with this one. There are a lot of organizations mentioned in this document with one person joining and one person leaving. The system has correctly identified the relevant individuals and organization. There is some redundancy: Board Member and Board Of Directors are identified as the same role, but that’s something that can easily be cleaned up in some more post-processing.
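The kind of clean-up I have in mind could look something like the sketch below. It assumes the redundancy means that both labels show up for the same appointment. The alias table and function names are hypothetical and are not part of the current code.

```python
# Hypothetical table mapping role variants to one canonical name.
ROLE_ALIASES = {
    "board member": "Board Member",
    "board of directors": "Board Member",
}


def normalize_role(role):
    """Map a role string to its canonical name, ignoring case and extra spaces."""
    key = " ".join(role.lower().split())
    return ROLE_ALIASES.get(key, role.strip())


def dedupe_roles(roles):
    """Normalize each role and keep the first occurrence of each canonical name."""
    seen = []
    for role in roles:
        canonical = normalize_role(role)
        if canonical not in seen:
            seen.append(canonical)
    return seen


print(dedupe_roles(["Board Member", "Board Of Directors"]))  # ['Board Member']
```

Roles that are not in the table pass through unchanged, so the table can start small and grow as new duplicates turn up.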
SG Analytics appoints Rob Mitchell as the new Advisory Board Member
Similarly, this article includes the organization that Rob has been appointed to and the names of organizations where he has worked before.
SG Analytics appoints Rob Mitchell as the new Advisory Board Member (Business Wire version)