Machine Learning

Evaluating Syracuse – part 2

I recently wrote about the results of trying out my M&A entity extraction project that is smart enough to create simple graphs of which company has done what with which other company.

For a side project very much in alpha it stood up pretty well against the best of the other offerings out there – at least in the first case I looked at. Here are two more complex examples, chosen at random.

Test 1 – M&A activity with multiple participants

Article: Searchlight Capital Partners Completes the Acquisition of the Operations and Assets of Frontier Communications in the Northwest of the U.S. to form Ziply Fiber

Syracuse

It shows which organizations have been involved in the purchase, which organization sold the assets (Frontier) and the fact that the target entity is an organization called Ziply Fiber.

To improve, it could make it clearer that Ziply is a new entity being created rather than the purchase of an entity already called Ziply from Frontier. It could also identify that this story relates to assets in the North West of the US. But otherwise pretty good.

Expert.ai

As before, it’s really good at identifying all the organizations in the text, even the ones that aren’t relevant to the story, e.g. Royal Canadian Mounted Police.

The relations piece is patchy. From the headline it determines that Searchlight Capital Partners is completing an acquisition of some operations, and also that there is a relationship between the verb ‘complete’ and the assets of Frontier Communications. A pretty good result from this sentence, but it’s not completely clear that there is an acquisition of assets.

The next sentence has a really good catch that Searchlight is forming Ziply.

It only identifies one of the other parties involved in the transaction. It doesn’t tie the ‘it’ to Searchlight – you’d have to infer that from another relationship. And it doesn’t flag any of the other participants.

Test 2 – Digest Article

Article: Deals of the day-Mergers and acquisitions

Syracuse

It’s identifying 7 distinct stories. There are 8 bullet points in the Reuters story – one of which is about something that isn’t happening. Syracuse picks all of the real stories. It messes up Takeaway.com’s takeover of Just Eat by separating out Takeaway and com as two different organizations, but apart from that looks pretty good.

I’m particularly gratified by how it flags Exor as the spender and Agnelli as another kind of participant in the story about Exor raising its stake in GEDI. Agnelli is the family behind Exor, so they are involved, but strictly speaking the company doing the buying is Exor.

Expert.ai

Most of the entities are extracted correctly. A couple of notable errors:

  1. It finds a company called ‘Buyout’ (really this is the description of a type of firm, not the name of the firm)
  2. It also gets Takeaway.com wrong – but where Syracuse split this into two entities, Expert.ai flags it as a URL rather than a company (in yellow in the second image below)

The relationship piece is also pretty impressive from an academic point of view, but it’s hard to piece together what is really going on from a practical point of view. Take the first story about Mediaset as an example and look at the relationships that Expert.ai identifies in the 4 graphs below. The first one identifies that Mediaset belongs to Italy and is saying something. The other 3 talk about an “it” doing various things, but don’t tie this ‘it’ back to Mediaset.

Conclusion

Looking pretty good for Syracuse, if I say so myself :D.

Machine Learning

Revisiting Entity Extraction

In September 2021 I wrote about the difficulties of getting anything beyond basic named entity recognition. You could easily get the names of companies mentioned in a news article, but not whether one company was acquiring another or whether two companies were forming a joint venture, etc. Not to mention the perennial “Bloomberg problem”: Bloomberg is named in loads of different stories. Usually it is referenced as the company reporting the story, sometimes as the owner of the Bloomberg Terminal. Only a tiny proportion of mentions of Bloomberg are about actions that the Bloomberg company itself has taken.

These were very real problems that a team I was involved in was facing around 2017, and they were still not fixed in 2021. I figured I’d see if more recent ML technologies, specifically Transformers, could help solve these problems. I’ve made a simple Heroku app, called Syracuse, to showcase the results. It’s very alpha, but the quality is not too bad right now.

Meanwhile, the state of the art has moved on in leaps and bounds over the past year. So I’m going to compare Syracuse with the winner from my 2021 comparison, Expert.ai‘s Document Analysis Tool, and with ChatGPT – the new kid on the NLP block.

A Simple Test

Article: Avalara Acquires Artificial Intelligence Technology and Expertise from Indix to Aggregate, Structure and Deliver Global Product and Tax Information

The headline says it all: Avalara has acquired some Tech and Expertise from Indix.

Expert.AI

It is very comprehensive. For my purposes, too comprehensive. It identifies 3 companies: Avalara, ICR and Indix IP. The story is about Avalara acquiring IP from Indix. ICR is the communications company that is making the press release; ICR appearing in this list is an example of the “Bloomberg Problem” in action. Also, it’s incorrect to call Indix IP a company – the company is Indix. The relevant sentence in the article mentions Indix’s IP, not a company called Indix IP: “Avalara believes its ability to collect, organize, and structure this content is accelerated with the acquisition of the Indix IP.”

It also identifies many geographic locations, but many of them are irrelevant to the story as they are just lists of where Avalara has offices. If you wanted to search a database of UK-based M&A activity you would not want this story to come up.

Expert.AI’s relationship extraction is really impressive, but again, overly comprehensive. This first graph shows that Avalara gets expertise, technology and structure from Indix IP to aggregate things.

But there are also many many other graphs which are less useful, e.g:

Conclusion: Very powerful. Arguably too powerful. It reminds me of the age-old Google problem – I don’t want 1,487,585 results in 0.2 seconds. I’m already drowning in information, I want something that surfaces the answer quickly.

ChatGPT

I tried a few different prompts. First I included the background text then added a simple prompt:

I’m blown away by the quality of the summary here (no mention of ICR, LLC, so it’s not suffering from the Bloomberg Problem). But it’s not structured. Let’s try another prompt.

Again, it’s an impressive summary, but it’s not structured data.

Expert.ai + ChatGPT

I wondered what the results would be from combining a ChatGPT summary with Expert.AI document analysis. Turns out, not much use.

Syracuse

Link to data: https://syracuse-1145.herokuapp.com/m_and_as/1

Anyone looking at the URLs will recognise that this is the first entry in the database. This is the first example that I tried as an unseen test case (no cherry-picking here).

It shows the key information in a more concise graph, as below. Avalara is a spender, Indix is receiving some kind of payment and the relevant target is some Indix technology (the downward triangle represents something that is not an organization).

I’m pretty happy with this result. It shows that however impressive something like Expert.AI or ChatGPT is, they have limitations when applied to more specific problems like this one. Fortunately there are other open source ML technologies out there that can help, though it’s a job of work to stitch them together appropriately to get a decent result.

In future posts I’ll share more comparisons of more complex articles and share some insights into what I’ve learned about large language models through this process (spoiler – there are no silver bullets).

Machine Learning

Entity extraction powered by Flan

A Thanksgiving update on my side project. See here for an outline of the problem. In short, existing natural language processing techniques are good at generic entity extraction, but not at really getting to the core of the story.

I call it the ‘Bloomberg problem’. Imagine this text: “Bloomberg reported that Foo Inc has bought Bar Corp”. Bloomberg is not relevant in this story. But it is relevant in this one: “Bloomberg has just announced a new version of their Terminal”.
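To make the problem concrete, here is a minimal sketch using the transformers library’s default NER pipeline – my choice for illustration, not the project’s actual stack – showing how generic entity extraction treats both sentences identically:

# Generic NER tags Bloomberg as an ORG in both sentences, with nothing to
# distinguish the reporter from the protagonist.
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")

for text in [
    "Bloomberg reported that Foo Inc has bought Bar Corp.",
    "Bloomberg has just announced a new version of their Terminal.",
]:
    # Bloomberg comes back as ORG both times -- the 'Bloomberg problem'.
    print([(e["word"], e["entity_group"]) for e in ner(text)])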

I wrote about my first attempt to address this problem, and then followed it up in July. I’ve been doing some more finessing since then and am pretty happy with the results. There is still some tidying up to do but I’m pretty confident that the building blocks are all there.

The big changes since July are:

  1. Replacing a lot of the post-processing logic with a model trained on more data. This was heartbreaking (throwing away work, sad face emoji) but at the same time exhilarating (it works a lot better with less code, big smile emoji).
  2. Implementing Flan T5 to help with some of the more generic areas.

At a high level this is how it works:

  1. The model
    • Approx 400 tagged docs (in total, across train, val and test sets)
    • Some judicious data synthesis
    • Trained a Named Entity Recognition model based on roberta-base
  2. Post-processing is a combination of
    • Benepar for constituency parsing to identify the relationships between entities for most cases
    • FlanT5 to help with the less obvious relationships (a sketch of both pieces follows this list).
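Here is a minimal sketch of the two post-processing pieces. The model names (benepar_en3, google/flan-t5-base) and the prompt wording are my assumptions for illustration, not necessarily what is running in the app:

import benepar
import spacy
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Constituency parsing with Benepar (first run: benepar.download("benepar_en3")
# and python -m spacy download en_core_web_md)
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})
sent = list(nlp("Foo Inc announced it has acquired Bar Corp.").sents)[0]
print(sent._.parse_string)  # bracketed tree to walk for entity relationships

# Flan-T5 for the relationships the parse tree doesn't settle
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
prompt = "Which organization is the buyer? Foo Inc announced it has acquired Bar Corp."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))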

Next steps are going to be to start representing this as a knowledge graph, which is a more natural way of exploring the data.
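As a sketch of what that might look like (my illustration, not the app’s actual schema), using the Wolters Kluwer appointment listed below:

import networkx as nx

G = nx.DiGraph()
G.add_node("Kevin Hay", type="person")
G.add_node("Wolters Kluwer", type="organization")
G.add_node("Vice President of Sales", type="role")
G.add_edge("Wolters Kluwer", "Kevin Hay", relation="appointed")
G.add_edge("Kevin Hay", "Vice President of Sales", relation="has_role")

# Exploring the data becomes graph traversal, e.g. everyone appointed by an org:
for _, person, data in G.out_edges("Wolters Kluwer", data=True):
    if data["relation"] == "appointed":
        print(person)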

See below for a screenshot of the appointment topics extracted recently. These are available online at https://syracuse-1145.herokuapp.com/appointments

And below are the URLs for these appointment topics.

The Native Antigen Company Strengthens Senior Team to Support Growing Product Portfolio

In this example, we have a number of companies listed – one is the company that is appointing these two new individuals and some are companies where the individuals used to work. Not all the company records are equally relevant.

Wolters Kluwer appoints Kevin Hay as Vice President of Sales for FRR

The Native Antigen Company Strengthens Senior Team to Support Growing Product Portfolio (Business Wire version)

Kering boosted by report Gucci’s creative director Michele to step down

Broadcat Announces Appointment of Director of Operations

Former MediaTek General Counsel Dr. Hsu Wei-Fu Joins ProLogium Technology to Reinforce Solid-State Battery IP Protection and Patent Portfolio Strategy

HpVac Appoints Joana Vitte, MD, PhD, as Chief Scientific Officer

Highview Power Appoints Sandra Redding to its Leadership Team

Highview Power Appoints Sandra Redding to its Leadership Team (Business Wire version)

ASML Supervisory Board changes announced

Recommendation from Equinor’s nomination committee

I’m pretty impressed with this one. There are a lot of organizations mentioned in this document, with one person joining and one person leaving. The system has correctly identified the relevant individuals and organization. There is some redundancy – Board Member and Board Of Directors amount to the same role – but that’s something that can easily be cleaned up in some more post-processing.

SG Analytics appoints Rob Mitchell as the new Advisory Board Member

Similarly, this article includes the organization that Rob has been appointed to and the names of organizations where he has worked before.

SG Analytics appoints Rob Mitchell as the new Advisory Board Member (Business Wire version)

Machine Learning

ML Topic Extraction Update

This is an update to https://alanbuxton.wordpress.com/2022/01/19/first-steps-in-natural-language-topic-understanding. It’s scratching an itch I have about using machine learning to pick out useful information from text articles on topics like: who is being appointed to a new senior role in a company; what companies are launching new products in new regions etc. My first try, and a review of the various existing approaches out there, was first summarised here: https://alanbuxton.wordpress.com/2021/09/21/transformers-for-use-oriented-entity-extraction/.

After this recent nonsense about whether language models are sentient or not, I’ve decided to use language that doesn’t imply any level of consciousness or intelligence. So I’m not going to be using the word “understanding” any more. The algorithm clearly doesn’t understand the text it is being given in the same way that a human understands text.

Since the previous version of the topic extraction system I implemented logic that uses constituency parsing and graphs in networkx to better model the relationships amongst the different entities. It went a long way to improving the quality of the results, but the Appointment topic extraction, for example, still struggles in two particular use cases:

  • When lots of people are being appointed to one role (e.g. a lot of people being announced as partners)
  • When one person is taking on a new role that someone else is leaving (e.g. “Jane Smith is taking on the CEO role that Peter Franklin has stepped down from”)

At this point the post-processing is pretty complex. Instead of going further down this path I’m going back to square one. I once saw a maxim along the lines of “once your rules get complex, it’s best to replace them with machine learning”. This will mean throwing away a lot of code, so emotionally it’s hard to do. And there is an open question of how much more labelled data the algorithm will need to learn these relationships accurately. But it will be fun to find out.

A simplified version of the app, covering Appointments (senior hires and fires) and Locations (setting up a new HQ, launching in a new location) is available on Heroku at https://syracuse-1145.herokuapp.com/. Feedback more than welcome.

Enterprise Software, Machine Learning, Product Management

First steps in Natural Language Topic Understanding

In https://alanbuxton.wordpress.com/2021/09/21/transformers-for-use-oriented-entity-extraction/ I showed how transformers allowed me to build something more advanced than the generic entity extraction systems that are publicly available out there.

Next step was to see if I can do something useful with this. In past lives customers have told me about the importance of tracking certain signals or events in a company’s lifecycle, e.g. making an acquisition, expanding to a new territory, making a new senior hire etc.

So I gave it a go, initially looking purely at whether I could train an algorithm to pick out key staffing changes. Results below are 20 random topics pulled from my first attempt, showing the good, the bad and the ugly. The numbers are the confidence scores that the algorithm chose for each entity in the topic.

I’ll give myself a B for a decent first prototype.

I do wonder who else out there is working on this sort of thing. From what I can see in the market ML is used to classify articles (e.g. “this article is about a new hire”) but I couldn’t see any commercial offering that goes to the level of “which org hired who into what role”.

If I were to take this further I would be training specialist models on each different type of topic. I wonder if there is something like a T5-style model to rule them all that can handle all this kind of intelligent detailed topic understanding?

Title: OSE Immunotherapeutics Announces the Appointment of Dominique Costantini as Interim CEO Following the Departure of Alexis Peyroles
Url: https://www.businesswire.com/news/home/20220116005013/en/OSE-Immunotherapeutics-Announces-the-Appointment-of-Dominique-Costantini-as-Interim-CEO-Following-the-Departure-of-Alexis-Peyroles

Who | What | Role | Org | Effective When
Alexis Peyroles (0.9846990705) | departure (0.943598628) | Chief Executive Officer (0.9995111823) | OSE Immunotherapeutics SA (0.9983804822) | immediately (0.9876502156)
Dominique Costantini (0.9990960956) | appointed (0.9998416901) | interim Chief Executive Officer (0.9983062148) | OSE Immunotherapeutics SA (0.9983804822) | immediately (0.9876502156)
Alexis Peyroles (0.993326962) | departure (0.9623697996) | Chief Executive Officer (0.9994782805) | OSE Immunotherapeutics SA (0.9968072176) |
Dominique Costantini (0.9989916682) | appointed (0.9993845224) | interim Chief Executive Officer (0.9982660413) | OSE Immunotherapeutics SA (0.9968072176) |

Assessment: The topic is duplicated without the ‘effective immediately’ piece – it should only keep the most granular topics.

Title: Barclays appoints managing directors for Australia investment banking unit
Url: https://www.reuters.com/markets/funds/barclays-appoints-managing-directors-australia-investment-banking-unit-2022-01-17/

Who | What | Role | Org | Effective When
Duncan Connellan (0.988427639) | appointed (0.9996656179) | managing directors (0.9994463921) | Britain ‘s Barclays Plc (0.9851405621) |
Duncan Beattie (0.9959402084) | appointed (0.9996656179) | managing directors (0.9994463921) | Britain ‘s Barclays Plc (0.9851405621) |

Assessment: Pulled out the two key items, but it didn’t do a great job of the Entity (“Britain’s Barclays Plc” was treated as one entity) and it doesn’t handle the pluralised role name. The model was not trained to look for where the role is based, so it hasn’t identified that these roles are specifically in Australia.

Title: Trulioo Appoints Michael Ramsbacker as Chief Product Officer
Url: https://www.prweb.com/releases/trulioo_appoints_michael_ramsbacker_as_chief_product_officer/prweb18439306.htm

Who | What | Role | Org | Effective When
Michael Ramsbacker (0.999671936) | appointment (0.9997799993) | Chief Product Officer (0.9999740124) | Trulioo (0.9999925494) |

Assessment: Got it right.

Title: Elastrin Therapeutics Announces Newly Formed Scientific Advisory Board
Url: https://www.businesswire.com/news/home/20220117005220/en/Elastrin-Therapeutics-Announces-Newly-Formed-Scientific-Advisory-Board

Who | What | Role | Org | Effective When
Dr. Pedro M. Quintana Diez (0.9665058851) | | chairman (0.9933767915) | Elastrin Therapeutics Inc. (0.9841426611) |
Dr. Pedro M. Quintana Diez (0.9665058851) | | Scientific Advisory Board (0.9952206612) | Elastrin Therapeutics Inc. (0.9841426611) |

Assessment: Correctly extracts the key info that Dr Quintana Diez is chairman of the new Scientific Advisory Board, but treats these as two roles rather than one.

Title: Toshiba Appoints Andrew McDaniel to Lead Its European Retail Business
Url: https://www.businesswire.com/news/home/20220117005027/en/Toshiba-Appoints-Andrew-McDaniel-to-Lead-Its-European-Retail-Business

Who | What | Role | Org | Effective When
Andrew McDaniel (0.9996804595) | | senior vice president of Europe (0.9983366132) | Toshiba Global Commerce Solutions (0.9999386668) | January 15 , 2022 (0.9999966621)
Andrew McDaniel (0.9996804595) | | managing director (0.9998098612) | Toshiba Global Commerce Solutions (0.9999386668) | January 15 , 2022 (0.9999966621)

Assessment: Got it right.

Title: Cairn Real Estate Holdings Appoints Mark Johnson President of JPAR® – Real Estate
Url: https://www.prweb.com/releases/cairn_real_estate_holdings_appoints_mark_johnson_president_of_jpar_real_estate/prweb18437732.htm

Who | What | Role | Org | Effective When
Mark Johnson (0.9998755455) | appointment (0.955047369) | | JPAR® – Real Estate (0.9999427795) |

Assessment: Correctly pulls out the appointment but doesn’t identify the role.

Title: Fiona Macfarlane and Andrea Nicholls appointed to HSBC Bank Canada Board of Directors
Url: https://www.businesswire.com/news/home/20220117005321/en/Fiona-Macfarlane-and-Andrea-Nicholls-appointed-to-HSBC-Bank-Canada-Board-of-Directors

Who | What | Role | Org | Effective When
Fiona Macfarlane (0.9959855676) | appointed (0.9996260405) | non-executive directors (0.9942650795) | HSBC Bank Canada Board of Directors (0.9947710037) |
Andrea Nicholls (0.9999670982) | appointed (0.9996260405) | non-executive directors (0.9942650795) | HSBC Bank Canada Board of Directors (0.9947710037) |

Assessment: Got it right.

Title: Digital Mountain Announces Industry Veteran Calvin Weeks Joining Team as Director of Digital Forensics & Cybersecurity
Url: https://www.prweb.com/releases/2022/1/prweb18416336.htm

Who | What | Role | Org | Effective When
Calvin Weeks (0.999994576) | | Director , Digital Forensics & Cybersecurity (0.999989152) | Digital Mountain , Inc. (0.9999924898) |

Assessment: Got the role right but didn’t get the ‘what’.

Title: MiniCo Insurance Announces Two Strategic Leadership Promotions
Url: https://www.prweb.com/releases/minico_insurance_announces_two_strategic_leadership_promotions/prweb18437565.htm

Who | What | Role | Org | Effective When
Rick Krouner (0.9899243116) | named (0.9960696697) | President (0.9988073111) | MiniCo Insurance Agency ( MiniCo ) (0.9878121018) |
Jim Henry (0.9995553493) | named (0.9960696697) | Specialty Programs division (0.9527196288) | MiniCo Insurance Agency ( MiniCo ) (0.9878121018) |
Jim Henry (0.9995553493) | named (0.9960696697) | National Programs division (0.9757707119) | MiniCo Insurance Agency ( MiniCo ) (0.9878121018) |
Jim Henry (0.9995553493) | named (0.9960696697) | President (0.9988151789) | MiniCo Insurance Agency ( MiniCo ) (0.9878121018) |

Assessment: Similar to the Elastrin story, it pulls out the title and the division but treats them as different roles; it also only assigns one of the found roles to Mr Krouner. It is also a bit ‘greedy’ at identifying the Org – the part in parentheses is redundant.

Title: Stertil-Koni Names Supply Chain Sales Pro Scott Steinhardt as Vice President of Sales
Url: https://www.prweb.com/releases/stertil_koni_names_supply_chain_sales_pro_scott_steinhardt_as_vice_president_of_sales/prweb18430929.htm

Who | What | Role | Org | Effective When
Scott Steinhardt (0.9999918938) | joined (0.9999970198) | Vice President of Sales (0.9999983311) | Stertil-Koni (0.9999969602) |

Assessment: Got it right.
Machine Learning, Software Development

Is it worth training an NLP Transformer from scratch?

In https://alanbuxton.wordpress.com/2021/09/21/transformers-for-use-oriented-entity-extraction/ I wrote about an experience training a custom transformer-based model to do a type of entity extraction. I tried training from scratch because the source text happened to have been preprocessed/lemmatized in such a way as to include over 20 custom tokens that the RoBERTa model wouldn’t know about. My assumption was that this text would be so different to normal English that you might as well treat it as its own language.

Once I saw the results I decided to test this hypothesis by comparing the custom model trained on the preprocessed/lemmatized text against an out-of-the-box roberta-base model fine-tuned on a raw version of the text.

Turns out that, for me, the fine-tuned RoBERTa model always outperformed the model trained from scratch, though the difference in performance becomes pretty minimal once you’re in the millions of sentences.

Conclusion – when working in this space, don’t make assumptions. Stand on the shoulders of as many giants as possible.

Approx number of sentences for fine-tuning | F1 Score – RoBERTa from scratch | F1 Score – Fine-tuned roberta-base
1,500 | 0.39182 | 0.52233
15,000 | 0.75294 | 0.97764
40,000 | 0.92639 | 0.99494
65,000 | 0.97260 | 0.99627
125,000 | 0.99105 | 0.99776
300,000 | 0.99670 | 0.99797
600,000 | 0.99771 | 0.99866
960,000 | 0.99783 | 0.99865
1,400,000 | 0.99810 | 0.99888
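For anyone facing the same choice, the fine-tuning route copes fine with custom tokens. A minimal sketch (the token names and label count are placeholders, not my real preprocessing tokens):

from transformers import RobertaForTokenClassification, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForTokenClassification.from_pretrained("roberta-base", num_labels=5)

# Add the custom preprocessing tokens to the pretrained vocabulary rather than
# training a new tokenizer and model from scratch.
new_tokens = ["<custom_token_1>", "<custom_token_2>"]  # placeholders for the 20+ real ones
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
# ...then fine-tune on the tagged sentences as usual.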

Machine Learning, Software Development, Technology Adoption

Transformers for Use-Oriented Entity Extraction

The internet is full of text information. We’re drowning in it. The only way to make sense of it is to use computers to interpret the text for us.

Consider this text:

Foo Inc announced it has acquired Bar Corp. The transaction closed yesterday, reported the Boston Globe.

This is a story about a company called ‘Foo’ buying a company called ‘Bar’. (I’m just using Foo and Bar as generic tech words, these aren’t real companies).

I was curious to see how the state of the art has evolved for pulling out these key bits of information from the text since I first looked at Dandelion in 2018.

TL;DR – existing Natural Language services vary from terrible to tolerable. But recent advances in language models, specifically transformers, point towards huge leaps in this kind of language processing.

Dandelion

Demo site: https://dandelion.eu/semantic-text/entity-extraction-demo/

While it was pretty impressive in 2018, the quality for this type of sentence is pretty poor. It only identified that the Boston Globe is an entity, and Dandelion tagged that entity as a “Work” (i.e. a work of art or literature). As I allowed more flexibility in finding entities, it also found that the terms “Inc” and “Corp” usually relate to a corporation, and it found a Toni Braxton song. Nul points.

Link to video

Explosion.ai

Demo site: https://explosion.ai/demos/displacy-ent

This organisation uses pretty standard named entity recognition. It successfully identified that there were three entities in this text. Pretty solid performance at extracting named entities, but not much help for my use case because the Boston Globe entity is not relevant to the key points of the story.

Link to video

Microsoft

Demo site: https://aidemos.microsoft.com/text-analytics

Thought I’d give Microsoft’s text analytics demo a whirl. Completely incomprehensible results. Worse than Dandelion.

Link to video

Completely WTF

Expert.ai

Demo site: https://try.expert.ai/analysis-and-classification

With Microsoft’s effort out of the way, time to look at a serious contender.

This one did a pretty good job. It identified Foo Inc and Bar Corp as businesses, and it identified The Boston Globe as a different kind of entity. There was also some good inference that Foo had made an announcement and that something had acquired Bar Corp. But it didn’t go so far as to join the dots that Foo was the buyer.

In this example, labelling The Boston Globe as Mass Media is helpful. It means I can ignore it unless I specifically want to know who is reporting which story. But this helpfulness can go too far. When I changed the name “Bar Corp” to “Reuters Corp” then the entity extraction only found one business entity: Foo Inc. The other two entities were now tagged as Mass Media.

Long story short – Expert.ai is the best so far, but a user would still need to implement a fair bit of post-processing to be able to extract the key elements from this text.

Link to video.

Expert.ai is identifying entities based on the nature of that entity, not based on the role that they are playing in the text. The relations are handled separately. I was looking for something that combined the relevant information from both the entities and their relations. I’ll call it ‘use-oriented entity extraction’ following Wittgenstein‘s quote that, if you want to understand language: “Don’t look for the meaning, look for the use”. In other words, the meaning of a word in some text can differ depending on how the word is used. In one sentence, Reuters might be the media company reporting a story. In another sentence, Reuters might be the business at the centre of the story.

Enter Transformers

I wondered how Transformers would do with the challenge of identifying the different entities depending on how the words are used in the text. So I trained a custom RoBERTa using a relatively small base set of text and some judicious pre-processing. I was blown away with the results. When I first saw all the 9’s appearing in the F1 score my initial reaction was “this has to be a bug, no way is this really this accurate”. Turns out it wasn’t a bug.

I’ve called the prototype “Napoli” because I like coastal locations and Napoli includes the consonants N, L and P. This is a super-simple proof of concept and would have a long way to go to become production-ready, but even these early results were pretty amazing (see the sketch after this list):

  1. It could tell me that Foo Inc is the spending party that bought Bar Corp
  2. If I changed ‘Bar’ to ‘Reuters’ it could tell me that Foo Inc bought Reuters Corp
  3. If I changed the word “acquired” to “sold” it would tell me that Foo Inc is the receiving party that sold Reuters Corp (or Bar Corp etc).
  4. It didn’t get confused by the irrelevant fact that Boston Globe was doing the reporting.
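A sketch of the shape of output I mean. The role-based labels (SPENDER, TARGET) and the model path are my illustration of the idea, not Napoli’s actual tag set:

from transformers import pipeline

# Assumes a token-classification model fine-tuned with use-oriented labels.
extractor = pipeline(
    "token-classification",
    model="path/to/napoli-model",  # hypothetical path
    aggregation_strategy="simple",
)

for entity in extractor("Foo Inc announced it has acquired Bar Corp."):
    print(entity["entity_group"], entity["word"])
# Desired output -- roles, not just entity types:
#   SPENDER  Foo Inc
#   TARGET   Bar Corp
# ...and no role at all for the Boston Globe when it is merely reporting.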

Link to video

Software Development

Re-learning regexes to help with NLP

TL;DR Don’t forget about boring old-school solutions when you’re working with shiny new tech approaches. Dusty old tech might still have a part to play.

I’ve been experimenting with entity recognition. The use case is to identify company names in text. I’ve been using the fab service from dandelion.eu as my gold standard of what should be achievable.

Overall it is pretty impressive. Consider the following phrases:

Takeda is a gerbil

Dandelion recognises that this phrase is about a gerbil.

Takeda eats barley from Shire

Dandelion recognises that this is about barley.

Takeda buys goods from Shire

A subtle change of words means this sentence is probably about a company called Takeda and a company called Shire.

Very cool.

But what about this sentence:

Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million).

Still pretty impressive but it has made one big fumble. It thinks that CPS stands for Canon, whereas in reality CPS.WA is the stock ticker for Cyfrowy Polsat.

Is there an alternative approach?

In this sort of document, company names are often abbreviated, or referenced by their stock ticker. When they are abbreviated they use the same kind of convention. Sounds like a job for a regex.

In case you think that something as old-school as regexes can’t possibly have a role to play alongside cutting-edge code, bear in mind they are heavily used in NLP, so Dandelion must be using them somewhere under the covers. Or have a look at Scikit-learn’s CountVectorizer, which uses a very simple regex as the token_pattern it uses for splitting up text into different terms.
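For instance, CountVectorizer’s default token_pattern is just a two-or-more-word-characters regex, exposed as a parameter you can override:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()  # default token_pattern=r"(?u)\b\w\w+\b"
print(vec.build_tokenizer()("Takeda buys goods from Shire"))
# ['Takeda', 'buys', 'goods', 'from', 'Shire']

# Keep single-letter tokens too by overriding the pattern:
vec_short = CountVectorizer(token_pattern=r"(?u)\b\w+\b")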

I can’t remember the last time I used a regex in anger (see also this great Stack Overflow question and answer on the topic). I don’t like just copy-pasting from Stack Overflow, so I broke the task down into a few steps to make sure I fully understood what was going on. The iterations I went through are below (paste this into a Python console to see the results).

sent = "Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million)."
import regex as re
re.findall(r"\(.+\)",sent)                       # greedy: one giant match
re.findall(r"\(.+?\)",sent)                      # lazy: each bracketed term
re.findall(r"\([A-Z.]+?\)",sent)                 # only capitals and dots inside the brackets
re.findall(r"[A-Z]\w+\s\([A-Z.]+?\)",sent)       # one capitalised word before the brackets
re.findall(r"(?:[A-Z]\w+\s)+\([A-Z.]+?\)",sent)  # several capitalised words before the brackets
re.findall(r"((?:[A-Z]\w+\s*)*)\s\(([A-Z.]+?)\)",sent)            # capture name and abbreviation separately
re.findall(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+)\)",sent)  # allow lower-case parts in the name

Going through the iterations one by one, here’s what I re-learnt at each stage.

>>> re.findall(r"\(.+\)",sent)
['(CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million)']

Find one or more characters inside brackets. Brackets have special significance in a regex so they need to be escaped. This simple version finds the longest possible match (it’s “greedy”), so we need to change it to a “lazy” search.

>>> re.findall(r"\(.+?\)",sent)
['(CPS.WA)', '(ESN)', '($44.48 million)']

Better, but the only real initials are combinations of capitals and dots, so specify that:

>>> re.findall(r"\([A-Z.]+?\)",sent)
['(CPS.WA)', '(ESN)']

Next I need to find the text before the initials. This would be a capital letter followed by one or more word characters, followed by a space:

>>> re.findall(r"[A-Z]\w+\s\([A-Z.]+?\)",sent)
['Polsat (CPS.WA)']

Getting somewhere, but really we need multiple words before the initials. So put brackets around the part of the regex that matches a capitalised word, and match this one or more times:

>>> re.findall(r"([A-Z]\w+\s)+\([A-Z.]+?\)",sent)
['Polsat ']

WTF? Makes no sense. Except it does: the brackets create a capture group, and this changes findall’s behaviour. findall now returns the results of the capture group rather than the overall match. See below how using the captures() method returns the whole match.

>>> re.search(r"([A-Z]\w+\s)+\([A-Z.]+?\)",sent).captures()
['Cyfrowy Polsat (CPS.WA)']

Solution is to turn this new group that we created into a non-capturing group using ?:

>>> re.findall(r"(?:[A-Z]\w+\s)+\([A-Z.]+?\)",sent)
['Cyfrowy Polsat (CPS.WA)', '(ESN)']

A bit better, but it would be nice now if we could separate out the name from the abbreviation, so we return two capture groups in the regex:

>>> re.findall(r"((?:[A-Z]\w+\s*)*)\s\(([A-Z.]+?)\)",sent)
[('Cyfrowy Polsat', 'CPS.WA'), ('', 'ESN')]

At this point the regex is only looking for capitalised words before the abbreviation. Eleven Sports Network has some lower-case terms in its name, so the regex needs a bit more tuning. The following version does the job in this particular case: it looks for a capitalised word, then a space, a capital letter and then some other text until it gets to what looks like an abbreviation in brackets:

>>> re.findall(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+)\)",sent)
[('Cyfrowy Polsat', 'CPS.WA'), ('Eleven Sports Network Sp.z o.o.', 'ESN')]

You can see this regex in action on the fab regex101.com site. Let’s break this down:

( [A-Z]\w+ \s [A-Z] (?:[\w.\s]+) ) \s \( ( [A-Z.]+ ) \)

  • Capturing Group 1:
    1. [A-Z] a capital letter
    2. \w+ one or more word characters (to complete the word)
    3. \s a space (either to start another word, or just before the abbreviation)
    4. [A-Z] a capital letter, to start a second capitalised word
    5. (?:[\w.\s]+) a non-capturing group of one or more word characters, full stops (periods) or spaces
  • \s\( a space and then an open bracket
  • Capturing Group 2:
    1. [A-Z.]+ one or more capital letters or full stops
  • \) a close bracket

This regex does the job but it didn’t take long for it to become incomprehensible. If ever there were a use case for copious unit tests, it’s regexes.

Also, it doesn’t generalise well. It won’t identify “Acme & Bob Corp (ABC)” or “ABC and Partners Ltd (ABC)” or “X.Y.Z. Property Co (XYZ)” or “ABC Enterprises, Inc. (‘ABC’)” properly. Writing one regex to handle all of these types of string would quickly become very brittle and hard to understand. In reality I would end up using a series of regexes rather than trying to write one regex to rule them all.
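In sketch form, a series of regexes is just an ordered list of patterns where the first hit wins. The two patterns here are illustrative, not an exhaustive set:

import regex as re

PATTERNS = [
    # Multi-word names that may contain lower-case or dotted parts
    re.compile(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+)\)"),
    # Fallback: whatever capitalised words sit directly before the brackets
    re.compile(r"((?:[A-Z]\w+\s)+)\(([A-Z.]+)\)"),
]

def find_company_abbreviations(text):
    # Return (name, abbreviation) pairs from the first pattern that matches.
    for pattern in PATTERNS:
        matches = pattern.findall(text)
        if matches:
            return matches
    return []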

Nevertheless I hope it’s clear how a boring old piece of tech can help plug gaps in the cool new stuff. It’s never too late to (re-)learn the old ways. And it’s ok to put your pride aside and go back to basics.

Software Development

Adventures in upgrading a Python text summarisation library

A reflection on what it took to upgrade a simple Python lib to support Python 3. The lib in question is PyTeaser and the final result is at PyTeaserPython3.

TL;DR

The moral of the story is:

  1. Don’t try to upgrade something unless you really need to. It rarely goes well. Even a simple library like this one can throw up all kinds of challenges.
  2. Automated tests really are crucial to enable work like future upgrades. In this case I had some tests that seemed to work but that didn’t give me the full picture.
  3. Ultimately your program is most probably about turning one set of data into another set of data. Make sure your tests cover those use cases. In this case I was lucky: there was a demo script in the project directory that I could use to manually compare results between the Python 2 and Python 3 versions.
  4. Even if all goes perfectly well you can end up with surprising results when the behaviour of an underlying library changes in subtle ways. So it can be worth having tests that check that expected behaviour happens, even when you are using standard, out-of-the-box features.

Step By Step

Run the tests:

alan@dg04:~/PyTeaserPython3$ python -m tests
Traceback (most recent call last):
 File "/home/alan/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
 "__main__", mod_spec)
 File "/home/alan/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
 exec(code, run_globals)
 File "/home/alan/PyTeaserPython3/tests.py", line 2, in <module>
 from pyteaser import Summarize, SummarizeUrl
 File "/home/alan/PyTeaserPython3/pyteaser.py", line 72
 print 'IOError'
 ^
SyntaxError: Missing parentheses in call to 'print'

That didn’t go too well. print is a function in Python 3, not a statement.

There’s a utility called 2to3 that will automatically update the code.

alan@dg04:~/PyTeaserPython3$ 2to3 -wn .
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
...
RefactoringTool: ./goose/utils/encoding.py
RefactoringTool: ./goose/videos/extractors.py
RefactoringTool: ./goose/videos/videos.py

alan@dg04:~/PyTeaserPython3$ git diff --stat
 demo.py | 6 +++---
 goose/__init__.py | 2 +-
 goose/article.py | 18 +++++++++---------
 goose/extractors.py | 8 ++++----
 goose/images/extractors.py | 6 +++---
 goose/images/image.py | 4 ++--
 goose/images/utils.py | 6 +++---
 goose/network.py | 16 ++++++++--------
 goose/outputformatters.py | 2 +-
 goose/parsers.py | 6 +++---
 goose/text.py | 4 ++--
 goose/utils/__init__.py | 10 +++++-----
 goose/utils/encoding.py | 28 ++++++++++++++--------------
 pyteaser.py | 16 ++++++++--------
 tests.py | 12 ++++++------
 15 files changed, 72 insertions(+), 72 deletions(-)

It’s obviously done some work – how does this affect the tests?

alan@dg04:~/PyTeaserPython3$ python -m tests
.E
======================================================================
ERROR: testURLs (__main__.TestSummarize)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs
 summaries = SummarizeUrl(url)
 File "/home/alan/PyTeaserPython3/pyteaser.py", line 70, in SummarizeUrl
 article = grab_link(url)
...
 File "/home/alan/PyTeaserPython3/goose/text.py", line 88, in <module>
 class StopWords(object):
 File "/home/alan/PyTeaserPython3/goose/text.py", line 90, in StopWords
 PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]")
 File "/home/alan/anaconda3/lib/python3.6/re.py", line 233, in compile
 return _compile(pattern, flags)
 File "/home/alan/anaconda3/lib/python3.6/re.py", line 301, in _compile
 p = sre_compile.compile(pattern, flags)
 ...
 File "/home/alan/anaconda3/lib/python3.6/sre_parse.py", line 526, in _parse
 code1 = _class_escape(source, this)
 File "/home/alan/anaconda3/lib/python3.6/sre_parse.py", line 336, in _class_escape
 raise source.error('bad escape %s' % escape, len(escape))
sre_constants.error: bad escape \p at position 2

----------------------------------------------------------------------
Ran 2 tests in 0.054s

FAILED (errors=1)

Some progress. One of the two tests passed.

Root of the error in the failing test is this line: PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]"). Looks like it isn’t used anywhere:

alan@dg04:~/PyTeaserPython3$ grep -nrI PUNCTUATION
goose/text.py:90: PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]")
alan@dg04:~/PyTeaserPython3$
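As an aside, the pattern only fails because the standard library’s re module doesn’t support \p{...} Unicode property classes. The third-party regex module compiles it unchanged, which would have been an option if the line were actually used anywhere:

import regex

PUNCTUATION = regex.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]")
print(PUNCTUATION.sub("", "Hello, world!"))  # -> Hello world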

Comment out that line and try again

alan@dg04:~/PyTeaserPython3$ python -m tests
.E
======================================================================
ERROR: testURLs (__main__.TestSummarize)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs
summaries = SummarizeUrl(url)
...
File "/home/alan/PyTeaserPython3/goose/text.py", line 91, in StopWords
TRANS_TABLE = string.maketrans('', '')
AttributeError: module 'string' has no attribute 'maketrans'

----------------------------------------------------------------------
Ran 2 tests in 0.061s

FAILED (errors=1)
alan@dg04:~/PyTeaserPython3$

I admit it was a bit optimistic to think that commenting out one line would do the trick. Now the problem arises when TRANS_TABLE is defined, and this is used elsewhere in the code.

alan@dg04:~/PyTeaserPython3$ grep -nrI TRANS_TABLE
goose/text.py:91: TRANS_TABLE = string.maketrans('', '')
goose/text.py:107: return content.translate(self.TRANS_TABLE, string.punctuation).decode('utf-8')
alan@dg04:~/PyTeaserPython3$

Fortunately someone put a useful comment into this method, so I can search Stack Overflow and find out how to do the same thing in Python 3.

def remove_punctuation(self, content):
    # code taken form
    # http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
    if isinstance(content, str):
        content = content.encode('utf-8')
    return content.translate(self.TRANS_TABLE, string.punctuation).decode('utf-8')

Sure enough there is an equivalent question and answer on StackOverflow so I can edit the method accordingly:

def remove_punctuation(self, content):
    # code taken form
    # https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate
    translator = str.maketrans('','',string.punctuation)
    return content.translate(translator)

And now I can remove the reference to TRANS_TABLE from line 91 and run the tests again.

alan@dg04:~/PyTeaserPython3$ python -m tests
.E
======================================================================
ERROR: testURLs (__main__.TestSummarize)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs
summaries = SummarizeUrl(url)
...
return URLHelper.get_parsing_candidate(crawl_candidate.url)
File "/home/alan/PyTeaserPython3/goose/utils/__init__.py", line 104, in get_parsing_candidate
link_hash = '%s.%s' % (hashlib.md5(final_url).hexdigest(), time.time())
TypeError: Unicode-objects must be encoded before hashing

----------------------------------------------------------------------
Ran 2 tests in 0.273s

FAILED (errors=1)
alan@dg04:~/PyTeaserPython3$

A bit of digging to fix this and the tests now pass.

alan@dg04:~/PyTeaserPython3$ python -m tests
..
----------------------------------------------------------------------
Ran 2 tests in 0.273s

OK
alan@dg04:~/PyTeaserPython3$

Let’s see how the demo works.

alan@dg04:~/PyTeaserPython3$ python demo.py
None
None
None
alan@dg04:~

Hmm, that doesn’t seem right. Ah, I’m not connected to the internet, duh.

Connect and try again.

alan@dg04:~/PyTeaserPython3$ python demo.py
None
None
None
alan@dg04:~

Compare with the results from the python2 version.

[u"Twitter's move is the latest response from U.S. Internet firms following disclosures by former spy agency contractor Edward Snowden about widespread, classified U.S. government surveillance programs.",
u'"Since then, it has become clearer and clearer how important that step was to protecting our users\' privacy."',
u'"A year and a half ago, Twitter was first served completely over HTTPS," the company said in a blog posting.',
...

Something isn’t working. I track it down to this code in the definition of get_html in goose/network.py.

    try:
        result = urllib.request.urlopen(request).read()
    except:
        return None

In Python 3 the encoding of the URL causes a urllib.error.URLError. The try fails and so None is returned. I fix the encoding, but now there is a ValueError being thrown, from grab_link in pyteaser.py:

    try:
        article = Goose().extract(url=inurl)
        return article
    except ValueError:
        print('Goose failed to extract article from url')
        return None

A bit more digging – this is due to the fact that the string ‘10.0’ can’t be converted to an int. I edit the code to use a float instead of an int in this case.
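Pieced together, the two fixes look roughly like this (my reconstruction of the idea, not the exact goose code):

from urllib.parse import quote
from urllib.request import Request, urlopen

def get_html(url):
    # Python 3's urllib rejects URLs containing non-ASCII characters,
    # so percent-encode them while leaving the URL structure intact.
    safe_url = quote(url, safe=":/?&=")
    try:
        return urlopen(Request(safe_url)).read()
    except Exception:
        return None

# And the ValueError: int('10.0') fails, so parse via float instead.
version = float('10.0')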

Seems to be working now.

alan@dg04:~/PyTeaserPython3$ python demo.py
["Twitter's move is the latest response from U.S. Internet firms following "
'disclosures by former spy agency contractor Edward Snowden about widespread, '
'classified U.S. government surveillance programs.',
'"Since then, it has become clearer and clearer how important that step was '
'to protecting our users\' privacy."',

Let’s just double-check the tests.

alan@dg04:~/PyTeaserPython3$ python -m tests
./home/alan/PyTeaserPython3/goose/outputformatters.py:65: DeprecationWarning: The unescape method is deprecated and will be removed in 3.5, use html.unescape() instead.
txt = HTMLParser().unescape(txt)
.
----------------------------------------------------------------------
Ran 2 tests in 3.282s

OK
alan@dg04:~/PyTeaserPython3$

The previous passing tests didn’t give me the full story. Let’s fix the deprecation warning in goose/outputformatters.py and re-run the tests.

alan@dg04:~/PyTeaserPython3$ python -m tests
..
----------------------------------------------------------------------
Ran 2 tests in 3.653s

OK

Better.

Finally, I want to double-check the output of demo.py to make sure I am getting the same results. It turns out that the summary for the second URL in the demo was producing a different result between Python 2 and Python 3. See the keywords function in pyteaser.py (beginning at line 177). The culprit is line 184, where Counter holds its items in a different order between the two versions. It seems the ordering logic changed subtly between Python versions.

In Python 3 version:

Counter({'montevrain': 6, 'animal': 5, 'tiger': 3, 'town': 3, 'officials': 2, 'dog': 2, 'big': 2, 'cat': 2, 'outside': 2, 'woman': 2, 'supermarket': 2, 'car': 2, 'park': 2, 'search': 2, 'called': 2, 'local': 2, 'schools': 2, 'kept': 2, 'parisien': 2, 'mayors': 2, 'office': 2

In Python2 version

Counter({'montevrain': 6, 'animal': 5, 'tiger': 3, 'town': 3, 'office': 2, 'local': 2, 'mayors': 2, 'woman': 2, 'big': 2, 'schools': 2, 'officials': 2, 'outside': 2, 'supermarket': 2, 'search': 2, 'parisien': 2, 'park': 2, 'car': 2, 'cat': 2, 'called': 2, 'dog': 2, 'kept': 2

So when line 187 picks the 10 most common keywords, the Python 2 and Python 3 versions end up with different lists. A subtle change in the logic of Counter made quite a difference to the end result of running PyTeaser.
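One way to guard against that kind of drift (my suggestion, not the fix I applied here): don’t rely on Counter’s tie-breaking at all, and sort ties explicitly before taking the top 10.

from collections import Counter

counts = Counter(['montevrain'] * 6 + ['animal'] * 5 + ['tiger', 'town'] * 3
                 + ['dog', 'cat', 'car'] * 2)

# most_common() breaks ties in insertion order, which can differ between
# Python versions; sorting by (count desc, word asc) is deterministic.
top_10 = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:10]
print(top_10)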