Re-learning regexes to help with NLP

TL;DR Don’t forget about boring old-school solutions when you’re working with shiny new tech approaches. Dusty old tech might still have a part to play.

I’ve been experimenting with entity recognition. The use case is to identify company names in text. I’ve been using the fab service from dandelion.eu as my gold standard of what should be achievable.

Overall it is pretty impressive. Consider the following phrases:

Takeda is a gerbil

Dandelion recognises that this phrase is about a gerbil.

Takeda eats barley from Shire

Dandelion recognises that this is about barley.

Takeda buys goods from Shire

A subtle change of words means this sentence is probably about a company called Takeda and a company called Shire.

Very cool.

But what about this sentence:

Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million).

Still pretty impressive but it has made one big fumble. It thinks that CPS stands for Canon, whereas in reality CPS.WA is the stock ticker for Cyfrowy Polsat.

Is there an alternative approach?

In this sort of document, company names are often abbreviated, or referenced by their stock ticker. When they are, the text tends to follow the same convention: the full name first, then the short form in brackets straight after it. Sounds like a job for a regex.

In case you think that something as old-school as regexes can’t possibly have a role to play with cutting-edge code, bear in mind they are heavily used in NLP, so Dandelion must be using them somewhere under the covers. Or have a look at scikit-learn’s CountVectorizer, which uses a very simple regex as its token_pattern for splitting text into terms.
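To see how simple that regex is, here is a minimal sketch that applies the default token_pattern documented for CountVectorizer, (?u)\b\w\w+\b, with Python’s built-in re module rather than scikit-learn itself. The sentence is one of the examples above; the variable names are only for illustration.

```python
import re

# Default token_pattern documented for scikit-learn's CountVectorizer:
# a token is two or more word characters between word boundaries.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

tokens = re.findall(TOKEN_PATTERN, "Takeda is a gerbil")
print(tokens)  # ['Takeda', 'is', 'gerbil'] (the single-letter "a" is dropped)
```

This only shows the splitting step. CountVectorizer does more than this (lower-casing and counting, for example), which the sketch leaves out.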

I can’t remember the last time I used a regex in anger. (See also this great Stack Overflow question and answer on the topic). I don’t like just copy-pasting from Stack Overflow, so I broke the task down into a few steps to make sure I fully understood what was going on. The iterations I went through are below (paste this into a Python console to see the results; the import needs the third-party regex package).

import regex as re  # third-party module; captures() below is not in the standard-library re
sent = "Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million)."
re.findall(r"\(.+\)",sent)
re.findall(r"\(.+?\)",sent)
re.findall(r"\([A-Z.]+?\)",sent)
re.findall(r"[A-Z]\w+\s\([A-Z.]+?\)",sent)
re.findall(r"([A-Z]\w+\s)+\([A-Z.]+?\)",sent)
re.search(r"([A-Z]\w+\s)+\([A-Z.]+?\)",sent).captures()
re.findall(r"(?:[A-Z]\w+\s)*\([A-Z.]+?\)",sent)
re.findall(r"((?:[A-Z]\w+\s*)*)\s\(([A-Z.]+?)\)",sent)
re.findall(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+)\)",sent)

Here are the iterations one by one, with what I re-learnt at each stage.

>>> re.findall(r"\(.+\)",sent)
['(CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million)']

Find one or more characters inside a bracket. Brackets have special significance so need to be escaped. This simple regex finds the longest match (it’s “greedy”) so we need to change it to a “lazy” search.

>>> re.findall(r"\(.+?\)",sent)
['(CPS.WA)', '(ESN)', '($44.48 million)']

Better, but the only real initials are made up of capital letters and dots, so specify that:

>>> re.findall(r"\([A-Z.]+?\)",sent)
['(CPS.WA)', '(ESN)']

Next I need to find the text before the initials. This would be a capital letter followed by one or more word characters, followed by a space:

>>> re.findall(r"[A-Z]\w+\s\([A-Z.]+?\)",sent)
['Polsat (CPS.WA)']

Getting somewhere, but really we need multiple words before the initials. So put brackets around the part of the regex that matches a capitalised word, and match this one or more times:

>>> re.findall(r"([A-Z]\w+\s)+\([A-Z.]+?\)",sent)
['Polsat ']

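WTF? Makes no sense. Except it does: the brackets create a capture group, and this changes findall’s behaviour. findall now returns the contents of the capture group rather than the overall match, and because the group is repeated, only its last repetition (“Polsat ”) is kept. See below how the captures() method (an extra that the regex module adds to match objects; the standard-library re doesn’t have it) returns the whole match.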

>>> re.search(r"([A-Z]\w+\s)+\([A-Z.]+?\)",sent).captures()
['Cyfrowy Polsat (CPS.WA)']
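As an aside, the same behaviour can be reproduced with Python’s built-in re module. The sketch below is only an illustration on a shortened string: it shows what findall returns with no group, one repeated group and two groups, and uses finditer with group(0) as a standard-library stand-in for captures(). All variable names are made up for the example.

```python
import re

text = "Cyfrowy Polsat (CPS.WA)"

# No groups: findall returns each whole match.
no_group = re.findall(r"[A-Z]\w+\s\([A-Z.]+\)", text)

# One capturing group, repeated: findall returns only that group,
# and a repeated group keeps just its final repetition.
one_group = re.findall(r"([A-Z]\w+\s)+\([A-Z.]+\)", text)

# The whole match is still available as group 0 of each match object.
whole = [m.group(0) for m in re.finditer(r"([A-Z]\w+\s)+\([A-Z.]+\)", text)]

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"([A-Z]\w+)\s\(([A-Z.]+)\)", text)

print(no_group)    # ['Polsat (CPS.WA)']
print(one_group)   # ['Polsat ']
print(whole)       # ['Cyfrowy Polsat (CPS.WA)']
print(two_groups)  # [('Polsat', 'CPS.WA')]
```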

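The solution is to turn this new group into a non-capturing group using ?:. Note the quantifier is now * (zero or more) rather than +, so an abbreviation with no capitalised word directly in front of it, like (ESN), still matches:

>>> re.findall(r"(?:[A-Z]\w+\s)*\([A-Z.]+?\)",sent)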
['Cyfrowy Polsat (CPS.WA)', '(ESN)']

A bit better, but it would be nice to separate the name from the abbreviation, so the regex now has two capture groups, one for each:

>>> re.findall(r"((?:[A-Z]\w+\s*)*)\s\(([A-Z.]+?)\)",sent)
[('Cyfrowy Polsat', 'CPS.WA'), ('', 'ESN')]

At this point the regex is only looking for capitalised words before the abbreviation. Eleven Sports Network has some lower case terms in the name, so the regex needs a bit more tuning. The following example does the job in this particular case. It looks for a capitalised word then a space, a capital letter and then some other text until it gets to what looks like an abbreviation in brackets:

>>> re.findall(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+)\)",sent)
[('Cyfrowy Polsat', 'CPS.WA'), ('Eleven Sports Network Sp.z o.o.', 'ESN')]

You can see this regex in action on the fab regex101.com site. Let’s break this down:

(                           )\s\((       )\)
 [A-Z]\w+\s[A-Z]                  [A-Z.]+
                (?:[\w.\s]+)

  • Capturing Group 1
    1. [A-Z] a capital letter
    2. \w+ one or more word characters (letters, digits or underscores) to complete the word
    3. \s a space, before the second word
    4. [A-Z] a capital letter, to start a second capitalised word
    5. (?:[\w.\s]+) a non-capturing group of one or more word characters, full stops (periods) or spaces. This is the part that lets lower-case terms such as “o.o.” into the name
  • \s\( a space and then an open bracket
  • Capturing Group 2:
    1. [A-Z.]+ one or more capital letters or full stops
  • \) Close bracket
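One way to keep a regex like this readable is to write the breakdown into the pattern itself using the re.VERBOSE flag, which ignores whitespace in the pattern and allows comments. The following is a sketch with Python’s built-in re, not the code used above: the group names name and abbr and the constant name are invented for the example, and the redundant (?: ) wrapper is dropped, which doesn’t change what matches.

```python
import re

sent = "Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million)."

NAME_AND_ABBREVIATION = re.compile(
    r"""
    (?P<name>           # capturing group 1: the company name
        [A-Z]\w+        #   a capitalised word
        \s              #   a space
        [A-Z]           #   the capital that starts a second word
        [\w.\s]+        #   the rest of the name: word characters, dots, spaces
    )
    \s\(                # a space, then an open bracket
    (?P<abbr>[A-Z.]+)   # capturing group 2: capital letters and dots
    \)                  # a close bracket
    """,
    re.VERBOSE,
)

pairs = [(m.group("name"), m.group("abbr"))
         for m in NAME_AND_ABBREVIATION.finditer(sent)]
print(pairs)
# [('Cyfrowy Polsat', 'CPS.WA'), ('Eleven Sports Network Sp.z o.o.', 'ESN')]
```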

This regex does the job but it didn’t take long for it to become incomprehensible. If ever there were a use case for copious unit tests it’s using regexes.
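As a rough idea of what those tests might look like, here is a small sketch written as plain test_ functions of the kind pytest collects. The helper find_companies and the test names are hypothetical, and it uses the built-in re module. The last test pins down a known gap rather than desired behaviour.

```python
import re

PATTERN = re.compile(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+)\)")


def find_companies(text):
    """Return (name, abbreviation) pairs found in text."""
    return PATTERN.findall(text)


def test_two_word_name_with_ticker():
    text = "telecoms group Cyfrowy Polsat (CPS.WA) said"
    assert find_companies(text) == [("Cyfrowy Polsat", "CPS.WA")]


def test_name_with_lower_case_parts():
    text = "distributor Eleven Sports Network Sp.z o.o. (ESN) for"
    assert find_companies(text) == [("Eleven Sports Network Sp.z o.o.", "ESN")]


def test_amount_in_brackets_is_ignored():
    assert find_companies("38 million euros ($44.48 million)") == []


def test_known_gap_ampersand_truncates_name():
    # Documents current behaviour: "&" can't follow the first word,
    # so the name loses "Acme &" and starts at "Bob".
    assert find_companies("Acme & Bob Corp (ABC)") == [("Bob Corp", "ABC")]
```

Saved in a file that pytest can discover, these would run with a plain pytest command. Each new naming convention then becomes one more small test instead of one more thing to hold in your head.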

Also, it doesn’t generalise well. It won’t identify “Acme & Bob Corp (ABC)” or “ABC and Partners Ltd (ABC)” or “X.Y.Z. Property Co (XYZ)” or “ABC Enterprises, Inc. (‘ABC’)” properly. Writing one regex to handle all of these types of string would quickly become very brittle and hard to understand. In reality I would end up using a series of regexes rather than trying to code one regex to rule them all.
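To make “a series of regexes” a little more concrete, here is one possible sketch, and only one of many ways to split the work. A first small regex finds the bracketed abbreviation (optionally wrapped in quote marks), then a second looks backwards from it and collects name-like words. The names ABBREVIATION, NAME_TOKEN, NAME_TAIL and find_companies are all invented for this example, and it uses the built-in re module.

```python
import re

# Step 1: an abbreviation in brackets, optionally wrapped in quote marks.
ABBREVIATION = re.compile(r"\(['\u2018\u2019]?([A-Z][A-Z.]*)['\u2018\u2019]?\)")

# Step 2: what may follow the first word of a name: "&", "and", a
# capitalised word (optional trailing comma) or a dotted lower-case
# abbreviation such as "o.o."
NAME_TOKEN = r"(?:&|and|[A-Z][\w.]*,?|(?:[a-z]\.)+)"

# A run of name tokens that starts with a capitalised word and reaches
# the end of the text in front of the bracket.
NAME_TAIL = re.compile(r"(?<!\w)[A-Z][\w.]*,?(?:\s" + NAME_TOKEN + r")*$")


def find_companies(text):
    """Return (name, abbreviation) pairs, one per bracketed abbreviation."""
    pairs = []
    for abbr in ABBREVIATION.finditer(text):
        before = text[:abbr.start()].rstrip()
        name = NAME_TAIL.search(before)
        if name:
            pairs.append((name.group(0), abbr.group(1)))
    return pairs


print(find_companies("Acme & Bob Corp (ABC)"))
# [('Acme & Bob Corp', 'ABC')]
print(find_companies("ABC Enterprises, Inc. (\u2018ABC\u2019)"))
# [('ABC Enterprises, Inc.', 'ABC')]
```

Each regex stays short enough to read, but this is still far from general. A capitalised word that happens to sit right before the name, as in “Yesterday Acme Corp (AC)”, would be swallowed into it.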

Nevertheless I hope it’s clear how a boring old piece of tech can help plug gaps in the cool new stuff. It’s never too late to (re-)learn the old ways. And it’s ok to put your pride aside and go back to basics.