Machine Learning

Evaluating Syracuse – part 2

I recently wrote about the results of trying out my M&A entity extraction project, which is smart enough to create simple graphs of which company has done what with which other company.

For a side project very much in alpha, it stood up pretty well against the best of the other offerings out there, at least in the first case I looked at. Here are two more complex examples, chosen at random.

Test 1 – M&A activity with multiple participants

Article: Searchlight Capital Partners Completes the Acquisition of the Operations and Assets of Frontier Communications in the Northwest of the U.S. to form Ziply Fiber

Syracuse

It shows which organizations have been involved in the purchase, which organization sold the assets (Frontier) and the fact that the target entity is an organization called Ziply Fiber.

To improve, it could make it clearer that Ziply is a new entity being created, rather than an entity already called Ziply being purchased from Frontier, and it could identify that the deal relates to Frontier’s assets in the Northwest of the US. But otherwise pretty good.

Expert.ai

As before, it’s really good at identifying all the organizations in the text, even the ones that aren’t relevant to the story, e.g. Royal Canadian Mounted Police.

The relations piece is patchy. From the headline it determines that Searchlight Capital Partners is completing an acquisition of some operations, and also that there is a relationship between the verb ‘complete’ and the assets of Frontier Communications. A pretty good result from this sentence, but it’s not completely clear that there is an acquisition of assets.

The next sentence has a really good catch: that Searchlight is forming Ziply.

It only identifies one of the other parties involved in the transaction. It doesn’t tie the ‘it’ to Searchlight – you’d have to infer that from another relationship. And it doesn’t flag any of the other participants.

Test 2 – Digest Article

Article: Deals of the day-Mergers and acquisitions

Syracuse

It identifies 7 distinct stories. There are 8 bullet points in the Reuters story – one of which is about something that isn’t happening. Syracuse picks up all of the real stories. It messes up Takeaway.com’s takeover of Just Eat by separating Takeaway and com into two different organizations, but apart from that it looks pretty good.

I’m particularly gratified by how it flags Exor as the spender and Agnelli as another kind of participant in the story about Exor raising its stake in GEDI. Agnelli is the family behind Exor, so they are involved, but strictly speaking the company doing the buying is Exor.

Expert.ai

Most of the entities are extracted correctly. A couple of notable errors:

  1. It finds a company called ‘Buyout’ (really this is the description of a type of firm, not the name of the firm)
  2. It also gets Takeaway.com wrong – but where Syracuse split this into two entities, Expert.ai flags it as a URL rather than a company (in yellow in the second image below)

The relationship piece is also pretty impressive from an academic point of view, but from a practical point of view it’s hard to piece together what is really going on. Take the first story, about Mediaset, as an example and look at the relationships that Expert.ai identifies in the 4 graphs below. The first identifies that Mediaset belongs to Italy and is saying something. The other 3 talk about an ‘it’ doing various things, but don’t tie this ‘it’ back to Mediaset.

Conclusion

Looking pretty good for Syracuse, if I say so myself :D.


Revisiting Entity Extraction

In September 2021 I wrote about the difficulties of getting anything beyond basic named entity recognition. You could easily get the names of companies mentioned in a news article, but not whether one company was acquiring another or whether two companies were forming a joint venture, etc. Not to mention the perennial “Bloomberg problem”: Bloomberg is named in loads of different stories. Usually it is referenced as the company reporting the story, sometimes as the owner of the Bloomberg Terminal. Only a tiny proportion of mentions of Bloomberg are about actions that the Bloomberg company itself has taken.

These were very real problems that a team I was involved in was facing around 2017, and they were still not fixed in 2021. I figured I’d see if more recent ML technologies, specifically Transformers, could help solve them. I’ve made a simple Heroku app, called Syracuse, to showcase the results. It’s very alpha, but the quality is not too bad right now.
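To make that concrete, here is roughly all that generic named entity recognition gives you (a minimal sketch using the Hugging Face transformers pipeline; the model named here is just a stand-in, not what Syracuse uses):

```python
from transformers import pipeline

# Generic, off-the-shelf NER – the model choice here is illustrative only.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Bloomberg reported that Foo Inc has bought Bar Corp"
for entity in ner(text):
    print(entity["entity_group"], entity["word"])

# You would typically get ORG tags for Bloomberg, Foo Inc and Bar Corp alike:
# nothing distinguishes the news outlet from the buyer and the target,
# and nothing captures the acquisition relationship itself.
```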

Meanwhile, the state of the art has moved on leaps and bounds over the past year. So I’m going to compare Syracuse with the winner from my 2021 comparison, Expert.ai’s Document Analysis Tool, and with ChatGPT – the new kid on the NLP block.

A Simple Test

Article: Avalara Acquires Artificial Intelligence Technology and Expertise from Indix to Aggregate, Structure and Deliver Global Product and Tax Information

The headline says it all: Avalara has acquired some Tech and Expertise from Indix.

Expert.AI

It is very comprehensive. For my purposes, too comprehensive. It identifies 3 companies: Avalara, ICR and Indix IP. The story is about Avalara acquiring IP from Indix. ICR is the communications company issuing the press release; ICR appearing in this list is an example of the “Bloomberg Problem” in action. It is also incorrect to call Indix IP a company – the company is Indix. The relevant sentence in the article mentions Indix’s IP, not a company called Indix IP: “Avalara believes its ability to collect, organize, and structure this content is accelerated with the acquisition of the Indix IP.”

It also identifies many geographic locations, but many of them are irrelevant to the story as they are just lists of where Avalara has offices. If you wanted to search a database of UK-based M&A activity you would not want this story to come up.

Expert.AI’s relationship extraction is really impressive, but again, overly comprehensive. This first graph shows that Avalara gets expertise, technology and structure from Indix IP to aggregate things.

But there are also many many other graphs which are less useful, e.g:

Conclusion: Very powerful. Arguably too powerful. It reminds me of the age-old Google problem – I don’t want 1,487,585 results in 0.2 seconds. I’m already drowning in information, I want something that surfaces the answer quickly.

ChatGPT

I tried a few different prompts. First I included the background text and then added a simple prompt:

I’m blown away by the quality of the summary here (no mention of ICR, LLC, so it’s not suffering from the Bloomberg Problem). But it’s not structured. Let’s try another prompt.

Again, it’s an impressive summary, but it’s not structured data.

Expert.ai + ChatGPT

I wondered what the results would be from combining a ChatGPT summary with Expert.AI document analysis. It turns out: not much use.

Syracuse

Link to data: https://syracuse-1145.herokuapp.com/m_and_as/1

Anyone looking at the URLs will recognise that this is the first entry in the database. This is the first example that I tried as an unseen test case (no cherry-picking here).

It shows the key information in a more concise graph, as below: Avalara is a spender, Indix is receiving some kind of payment, and the relevant target is some Indix technology (the downward triangle represents something that is not an organization).
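In data terms, the graph boils down to something like this (a hypothetical rendering – the field names are mine, not Syracuse’s actual schema):

```python
# Hypothetical rendering of the relationships shown in the graph above.
# Field names are illustrative, not Syracuse's actual schema.
avalara_indix = {
    "activity": "acquisition",
    "spender": {"name": "Avalara", "is_organization": True},
    "receiver": {"name": "Indix", "is_organization": True},
    "target": {"name": "Indix AI technology", "is_organization": False},
    "source": "https://syracuse-1145.herokuapp.com/m_and_as/1",
}
```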

I’m pretty happy with this result. It shows that, however impressive Expert.AI and ChatGPT are, they have limitations when applied to more specific problems like this one. Fortunately there are other open-source ML technologies out there that can help, though it’s a job of work to stitch them together appropriately to get a decent result.

In future posts I’ll share more comparisons of more complex articles and share some insights into what I’ve learned about large language models through this process (spoiler – there are no silver bullets).


Entity extraction powered by Flan

A Thanksgiving update on my side project. See here for an outline of the problem. In short, existing natural language processing techniques are good at generic entity extraction, but not at really getting to the core of the story.

I call it the ‘Bloomberg problem’. Imagine this text: “Bloomberg reported that Foo Inc has bought Bar Corp”. Bloomberg is not relevant in this story. But it is relevant in this one: “Bloomberg has just announced a new version of their Terminal”.

I wrote about my first attempt to address this problem, and then followed it up in July. I’ve been doing some more finessing since then and am pretty happy with the results. There is still some tidying up to do but I’m pretty confident that the building blocks are all there.

The big changes since July are:

  1. Replacing a lot of the post-processing logic with a model trained on more data. This was heartbreaking (throw away work, sad face emoji) but at the same time exhilarating (it works a lot better with less code in, big smile emoji).
  2. Implementing Flan T5 to help with some of the more generic areas.

At a high level this is how it works (a rough code sketch follows the list):

  1. The model
    • Approx 400 tagged docs (in total, across train, val and test sets)
    • Some judicious data synthesis
    • Trained a Named Entity Recognition model based on roberta-base
  2. Post-processing is a combination of
    • Benepar for constituency parsing to identify the relationships between entities for most cases
    • FlanT5 to help with the less obvious relationships.
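Stitched together, the pipeline looks roughly like the sketch below. It is a simplified illustration rather than the actual Syracuse code: the fine-tuned model path and the Flan-T5 prompt wording are placeholders.

```python
import benepar
import spacy
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

sentence = "Wolters Kluwer appoints Kevin Hay as Vice President of Sales for FRR"

# 1. Entities: roberta-base fine-tuned on the ~400 tagged docs
#    (the model path below is a placeholder).
ner = pipeline("ner", model="./roberta-base-finetuned", aggregation_strategy="simple")
entities = ner(sentence)

# 2. Most relationships: a Benepar constituency parse shows how the entity
#    spans hang off the verb. Requires `python -m spacy download en_core_web_md`
#    and benepar.download("benepar_en3") to have been run once.
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})
parse_tree = list(nlp(sentence).sents)[0]._.parse_string

# 3. The less obvious relationships: ask Flan-T5 directly
#    (the prompt wording is illustrative).
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
flan = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
inputs = tokenizer(
    f"Which organization is making the appointment? {sentence}",
    return_tensors="pt",
)
answer = tokenizer.decode(
    flan.generate(**inputs, max_new_tokens=16)[0], skip_special_tokens=True
)

print(entities)
print(parse_tree)
print(answer)
```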

Next steps are going to be to start representing this as a knowledge graph, which is a more natural way of exploring the data.
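As a flavour of what that might look like (purely illustrative, using networkx – this isn’t something Syracuse does yet), each extracted topic becomes a handful of nodes and labelled edges:

```python
import networkx as nx

# Purely illustrative: one appointment topic as a tiny knowledge graph.
kg = nx.DiGraph()
kg.add_node("Kevin Hay", kind="person")
kg.add_node("Wolters Kluwer", kind="organization")
kg.add_edge(
    "Kevin Hay",
    "Wolters Kluwer",
    relationship="appointed to",
    role="Vice President of Sales for FRR",
)

print(list(kg.edges(data=True)))
```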

See below for a screenshot of the appointment topics extracted recently. These are available online at https://syracuse-1145.herokuapp.com/appointments

And below are the URLs for these appointment topics:

The Native Antigen Company Strengthens Senior Team to Support Growing Product Portfolio

In this example, we have a number of companies listed – one is the company that is appointing these two new individuals, and the others are companies where the individuals used to work. Not all of the company records are equally relevant.

Wolters Kluwer appoints Kevin Hay as Vice President of Sales for FRR

The Native Antigen Company Strengthens Senior Team to Support Growing Product Portfolio (Business Wire version)

Kering boosted by report Gucci’s creative director Michele to step down

Broadcat Announces Appointment of Director of Operations

Former MediaTek General Counsel Dr. Hsu Wei-Fu Joins ProLogium Technology to Reinforce Solid-State Battery IP Protection and Patent Portfolio Strategy

HpVac Appoints Joana Vitte, MD, PhD, as Chief Scientific Officer

Highview Power Appoints Sandra Redding to its Leadership Team

Highview Power Appoints Sandra Redding to its Leadership Team (Business Wire version)

ASML Supervisory Board changes announced

Recommendation from Equinor’s nomination committee

I’m pretty impressed with this one. There are a lot of organizations mentioned in this document with one person joining and one person leaving. The system has correctly identified the relevant individuals and organization. There is some redundancy: Board Member and Board Of Directors are identified as the same role, but that’s something that can easily be cleaned up in some more post-processing.

SG Analytics appoints Rob Mitchell as the new Advisory Board Member

Similarly, this article includes the organization that Rob has been appointed to and the names of organizations where he has worked before.

SG Analytics appoints Rob Mitchell as the new Advisory Board Member (Business Wire version)