python ‘in’ set vs list

Came across a comment on this Stackoverflow question which concerned converting a list to a set before looking for items in that set.

The point of set is to convert in from an O(n) operator to an O(1) operator. In this case it doesn’t really matter, but for larger data sets it would be inefficient to have multiple linear scans over a list.

Which led me to look a bit more at time complexity in python functions in general.

In the past I would only convert a list to a set if I wanted to remove duplicates. But evidently it’s a good habit to get into if speed of looking in that set is important. Though, obviously, converting list to set is O(n) in the first place.
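A rough way to see the difference for yourself is a small timeit sketch. The list size, the needle and the repeat count below are arbitrary choices of mine rather than anything from the Stack Overflow thread, and the absolute timings will vary by machine:

```python
import timeit

# Arbitrary test data: 100,000 integers, searched for the last one,
# which is the worst case for a linear scan over the list.
items = list(range(100_000))
as_set = set(items)  # the one-off O(n) conversion
needle = 99_999

# Time 100 membership tests against each container.
list_time = timeit.timeit(lambda: needle in items, number=100)
set_time = timeit.timeit(lambda: needle in as_set, number=100)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

The conversion only pays off if you do several lookups against the same set; for a single lookup, building the set costs about as much as one scan of the list.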

Python: bind method outside loop to reduce overhead

Came across this interesting pattern in some of the sklearn codebase:

        # bind method outside of loop to reduce overhead
        ngrams_append = ngrams.append

        for n in range(min_n, min(max_n + 1, text_len + 1)):
            for i in range(text_len - n + 1):
                ngrams_append(text_document[i: i + n])
        return ngrams


I wonder, at what scale does this really start to make a meaningful performance difference?

Even if it doesn’t make a huge difference, I do find it pretty elegant and easy to follow.
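One way to get a feel for the scale question is to time the two variants side by side. This is only a sketch: the function names, the token list and the repeat count are my own, not taken from sklearn, and the gap depends on the Python version you run it on.

```python
import timeit

def ngrams_plain(tokens, n):
    """Look up out.append on every iteration."""
    out = []
    for i in range(len(tokens) - n + 1):
        out.append(tokens[i:i + n])
    return out

def ngrams_bound(tokens, n):
    """Bind out.append once, outside the loop."""
    out = []
    out_append = out.append
    for i in range(len(tokens) - n + 1):
        out_append(tokens[i:i + n])
    return out

tokens = list(range(10_000))  # arbitrary test data
for fn in (ngrams_plain, ngrams_bound):
    elapsed = timeit.timeit(lambda: fn(tokens, 2), number=50)
    print(fn.__name__, round(elapsed, 4))
```

Both functions return the same result, so any difference in the printed timings comes down to the attribute lookup inside the loop.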

Comparing Python vs Ruby’s handling of empty Lists/Arrays

TIL Ruby is more forgiving when handling empty arrays than Python is with empty lists. If you try to access a non-existent index in Ruby you get a nil back. In Python you get an IndexError. I prefer the Python approach. It’s more straightforward.

Though Python also has a curious way that allows you to work around IndexError by trying to access a range of items in the List. And Ruby does also have an approach that will give you an appropriate error if the index is out of bounds. So who knows what the logic behind all of this is supposed to be ….

Python Version

Compare the following Python commands

empty = []
arr = ["here","are","some","words"]
" ".join(empty[0:1])
" ".join(arr[0:1])

Results of the Python commands:

>>> empty = []
>>> arr = ["here","are","some","words"]
>>> empty[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> arr[0]
'here'
>>> empty[0:1]
[]
>>> arr[0:1]
['here']
>>> " ".join(empty[0:1])
''
>>> " ".join(arr[0:1])
'here'

Ruby version

See the following Ruby commands

empty = []
arr = ["here","are","some","words"]
empty[0..0].join(" ")
arr[0..0].join(" ")

Results of the Ruby commands:

irb(main):007:0> empty = []
=> []
irb(main):008:0> arr = ["here","are","some","words"]
=> ["here", "are", "some", "words"]
irb(main):009:0> empty[0]
=> nil
irb(main):010:0> arr[0]
=> "here"
irb(main):011:0> empty[0..0]
=> []
irb(main):012:0> arr[0..0]
=> ["here"]
irb(main):013:0> empty[0..0].join(" ")
=> ""
irb(main):014:0> arr[0..0].join(" ")
=> "here"


Python is pretty unforgiving if you try to access a non-existent element. Ruby is forgiving enough to give you a nil if the element is non-existent. Downside of the Ruby approach is that you don’t know if the nil means “there was a nil there” or “there was nothing there”.

If you want the less forgiving approach in Ruby you can use fetch like so:

irb(main):031:0> empty
=> []
irb(main):032:0> empty.fetch(0)
Traceback (most recent call last):
5: from /snap/ruby/132/bin/irb:23:in `<main>'
4: from /snap/ruby/132/bin/irb:23:in `load'
3: from /snap/ruby/132/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
2: from (irb):32
1: from (irb):32:in `fetch'
IndexError (index 0 outside of array bounds: 0...0)


Fun with True and False in Python

TIL: In Python, True == 1 and False == 0

>>> var0 = 0
>>> var1 = 1
>>> var0 == True
False
>>> var0 == False
True
>>> var1 == True
True
>>> var1 == False
False

The number 2, however, is equal to neither True nor False. But it is truthy.

>>> var2 = 2
>>> var2 == True
False
>>> var2 == False
False
>>> bool(var2)
True

Which has the following interesting side effects when testing conditions

>>> if (var1 == True): print("***  var1 is true ***")
... else: print("*** var1 is not true ***")
***  var1 is true ***

>>> if (var2 == True): print("*** var2 is true ***")
... else: print("*** var2 is not true ***")
*** var2 is not true ***

>>> if (bool(var2)): print("*** var2 is truthy ***")
... else: print("*** var2 is not truthy ***")
*** var2 is truthy ***

Admittedly, this is a bit artificial because in reality if you wanted to test whether var2 is truthy you’d just do:

>>> if var2: print("*** var2 is truthy ***")
... else: print("*** var2 is not truthy ***")
*** var2 is truthy ***

But I did enjoy discovering this little nugget as it means you can easily count up the number of Trues in a list like so:

>>> sum([True,False,True,True,False,False,True])
4
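The reason this works is that bool is a subclass of int. A small sketch of the same trick applied to a condition (the word list and the length threshold are just illustrative values of mine):

```python
# bool is a subclass of int, so True and False behave as 1 and 0 in arithmetic.
print(isinstance(True, int))  # True
print(True + True)            # 2

# Count how many items satisfy a condition by summing the comparison results.
words = ["here", "are", "some", "words"]
long_words = sum(len(w) > 3 for w in words)
print(long_words)             # 3 ("here", "some" and "words")
```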

Using local settings in a Scrapy project

TL;DR Seeing a pattern used in one framework can help you address similar problems in a different framework.

I was working on a scrapy project that would save pages into a local database. We wanted to be able to test it using a test database, just as you would with Rails.

We could see how to configure a database connection string in the config file, but this didn’t help us switch between development and test databases.

Stackoverflow wasn’t much help, so we ended up rolling our own:

  1. Edit the file so it would read from additional settings files depending on a SCRAPY_ENV environment variable
  2. Move all the settings files to a separate config directory (and change scrapy.cfg so it knew where to look)
  3. Edit .gitignore so that local files wouldn’t get committed to the repo (and then added some .sample files)

Git repo is here.

The magic happens at the end of

from importlib import import_module
from scrapy.utils.log import configure_logging
import logging
import os

# Which environment to load settings for, e.g. development or test
SCRAPY_ENV = os.environ.get('SCRAPY_ENV')
if SCRAPY_ENV is None:
    raise ValueError("Must set SCRAPY_ENV environment var")
logger = logging.getLogger(__name__)
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

# Load if file exists; incorporate any names started with an
# uppercase letter into globals()
def load_extra_settings(fname):
    if not os.path.isfile("config/%s.py" % fname):
        logger.warning("Couldn't find %s, skipping" % fname)
        return
    mdl = import_module("config.%s" % fname)
    names = [x for x in mdl.__dict__ if x[0].isupper()]
    globals().update({k: getattr(mdl, k) for k in names})

load_extra_settings("secrets_%s" % SCRAPY_ENV)
load_extra_settings("settings_%s" % SCRAPY_ENV)

It feels a bit hacky, but it does the job, so I would love to learn a more pythonic way to address this issue.
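One direction that might feel a little less hacky is to collect the overrides into an explicit dict rather than writing into globals(). The sketch below is only an illustration of that idea, not how Scrapy itself wants settings loaded: uppercase_settings is a hypothetical helper, the setting names are made up, and a throwaway module object stands in for a real settings file.

```python
import types

def uppercase_settings(module):
    """Return the names in a module that start with an uppercase letter."""
    return {
        name: getattr(module, name)
        for name in vars(module)
        if name[:1].isupper()
    }

# Stand-in for an imported per-environment settings module (hypothetical names).
fake_module = types.ModuleType("settings_test")
fake_module.DATABASE_URL = "sqlite:///test.db"
fake_module.helper = lambda: None  # lowercase, so it is ignored
fake_module._PRIVATE = 1           # leading underscore, so it is ignored

# Base settings, then environment-specific overrides applied on top.
settings = {"DATABASE_URL": "sqlite:///dev.db", "BOT_NAME": "demo"}
settings.update(uppercase_settings(fake_module))
print(settings)
```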

Re-learning regexes to help with NLP

TL;DR Don’t forget about boring old-school solutions when you’re working with shiny new tech approaches. Dusty old tech might still have a part to play.

I’ve been experimenting with entity recognition. The use case is to identify company names in text. I’ve been using the fab Dandelion service as my gold standard of what should be achievable.

Overall it is pretty impressive. Consider the following phrases:

Takeda is a gerbil

Dandelion recognises that this phrase is about a gerbil.

Takeda eats barley from Shire

Dandelion recognises that this is about barley

Takeda buys goods from Shire

A subtle change of words means this sentence is probably about a company called Takeda and a company called Shire.

Very cool.

But what about this sentence:

Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million).

Still pretty impressive but it has made one big fumble. It thinks that CPS stands for Canon, whereas in reality CPS.WA is the stock ticker for Cyfrowy Polsat.

Is there an alternative approach?

In this sort of document, company names are often abbreviated, or referenced by their stock ticker. When they are abbreviated they use the same kind of convention. Sounds like a job for a regex.

In case you think that something as old-school as regexes can’t possibly have a role to play in cutting-edge code, bear in mind they are heavily used in NLP, so Dandelion must be using them somewhere under the covers. Or have a look at Scikit-learn’s CountVectorizer, which uses a very simple regex for the token_pattern that splits text into terms.

I can’t remember the last time I used a regex in anger. (See also this great stackoverflow question and answer on the topic). I don’t like just copy pasting from stack overflow so I broke the task down into a few steps to make sure I fully understood what was going on. The iterations I went through are below (paste this into a python console to see the results).

sent = "Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million)."
import regex as re

Here are the iterations one by one, with what I re-learnt at each stage.

>>> re.findall(r"\(.+\)",sent)
['(CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million)']

Find one or more characters inside a bracket. Brackets have special significance so need to be escaped. This simple regex finds the longest match (it’s “greedy”) so we need to change it to a “lazy” search.

>>> re.findall(r"\(.+?\)",sent)
['(CPS.WA)', '(ESN)', '($44.48 million)']

Better, but the only real initials are combinations of capital letters and dots, so specify that:

>>> re.findall(r"\([A-Z.]+?\)",sent)
['(CPS.WA)', '(ESN)']

Next I need to find the text before the initials. This would be a Capital letter followed by one or more other characters followed by a space:

>>> re.findall(r"[A-Z]\w+\s\([A-Z.]+?\)",sent)
['Polsat (CPS.WA)']

Getting somewhere, but really we need multiple words before the initials. So put brackets around the part of the regex that includes a capitalised word and match this one or more times

>>> re.findall(r"([A-Z]\w+\s)+\([A-Z.]+?\)",sent)
['Polsat ']

WTF? Makes no sense. Except it does: The brackets create a capture group. This changes findall’s behaviour. findall now returns the results of the capture group rather than the overall match. See below how using the captures() method returns the whole match.

['Cyfrowy Polsat (CPS.WA)']
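The same behaviour shows up in the standard library re module, which has no captures() method. As a sketch (using a shortened version of the sentence, which is my own trimming), finditer with group(0) is one way to get at the whole match even when the pattern contains a capture group:

```python
import re  # standard library re, not the third-party regex module

short_sent = "telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday"
pattern = r"([A-Z]\w+\s)+\([A-Z.]+?\)"

# findall returns the capture group (its last repetition), not the whole match.
print(re.findall(pattern, short_sent))
# ['Polsat ']

# group(0) on each match object is always the whole match.
print([m.group(0) for m in re.finditer(pattern, short_sent)])
# ['Cyfrowy Polsat (CPS.WA)']
```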

Solution is to turn this new group that we created into a non-capturing group using ?:

>>> re.findall(r"(?:[A-Z]\w+\s)*\([A-Z.]+?\)",sent)
['Cyfrowy Polsat (CPS.WA)', '(ESN)']

A bit better, but it would be nice now if we can separate out the name from the abbreviation, so we return two capture groups in the regex:

>>> re.findall(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+?)\)",sent)
[('Cyfrowy Polsat', 'CPS.WA'), ('', 'ESN')]

At this point the regex is only looking for capitalised words before the abbreviation. Eleven Sports Network has some lower case terms in the name, so the regex needs a bit more tuning. The following example does the job in this particular case. It looks for a capitalised word then a space, a capital letter and then some other text until it gets to what looks like an abbreviation in brackets:

>>> re.findall(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+)\)",sent)
[('Cyfrowy Polsat', 'CPS.WA'), ('Eleven Sports Network Sp.z o.o.', 'ESN')]

You can see this regex in action on the fab site. Let’s break this down:

(                           )\s\((       )\)
 [A-Z]\w+\s[A-Z]                  [A-Z.]+

  • Capturing Group 1
    1. [A-Z] a capital letter
    2. \w+ one or more letters (to complete a word)
    3. \s a space (either to start another word, or just before the abbreviation)
    4. [A-Z] then a capital letter, to start a 2nd capitalised word
      1. (?:[\w.\s]+) a non-capturing group of one or more letters or full stops (periods) or spaces.
  • \s\( a space and then an open bracket
  • Capturing Group 2:
    1. [A-Z.]+ one or more capital letters or full stops.
  • \) Close bracket

This regex does the job but it didn’t take long for it to become incomprehensible. If ever there were a use case for copious unit tests it’s using regexes.

Also, it doesn’t generalise well. It won’t identify “Acme & Bob Corp (ABC)” or “ABC and Partners Ltd (ABC)” or “X.Y.Z. Property Co (XYZ)” or “ABC Enterprises, Inc. (‘ABC’)” properly. Writing one regex to handle all of these types of string would quickly become very brittle and hard to understand. In reality I would end up using a series of regexes rather than trying to code one regex to rule them all.
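As a minimal sketch of what those unit tests might look like, here is the final regex wrapped in a function with a couple of assertions. The names COMPANY_ABBREV and find_companies are mine, and I am using the standard library re here since this pattern doesn’t need anything beyond it:

```python
import re

# The final pattern from above: (company name) followed by (abbreviation).
COMPANY_ABBREV = re.compile(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+)\)")

def find_companies(text):
    """Return a list of (name, abbreviation) tuples found in text."""
    return COMPANY_ABBREV.findall(text)

sent = ("Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late "
        "Thursday it agreed to buy a controlling stake in sports content "
        "producer and distributor Eleven Sports Network Sp.z o.o. (ESN) "
        "for 38 million euros ($44.48 million).")

# The case the regex was tuned for.
assert find_companies(sent) == [
    ("Cyfrowy Polsat", "CPS.WA"),
    ("Eleven Sports Network Sp.z o.o.", "ESN"),
]

# A known limitation, pinned down so it is at least documented:
# the "Acme &" part of the name is lost.
assert find_companies("Acme & Bob Corp (ABC)") == [("Bob Corp", "ABC")]
```

Pinning the known failures down in tests as well as the successes makes it obvious when a later tweak to the pattern changes behaviour.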

Nevertheless I hope it’s clear how a boring old piece of tech can help plug gaps in the cool new stuff. It’s never too late to (re-)learn the old ways. And it’s ok to put your pride aside and go back to basics.

Adventures in upgrading a Python text summarisation library

A reflection on what it took to upgrade a simple Python lib to support Python 3. The lib in question is PyTeaser and the final result is at PyTeaserPython3.


The moral of the story is:

  1. Don’t try to upgrade something unless you really need to. It rarely goes well. Even a simple library like this one can throw up all kinds of challenges.
  2. Automated tests really are crucial to allow work like future upgrades. In this case I had some tests that seemed to work but that didn’t give me the full picture.
  3. Ultimately your program is most probably about turning one set of data into another set of data. Make sure your tests cover those use cases. In this case I was lucky: there was a demo script in the project directory that I could use to manually compare results between the Python 2 and Python 3 version.
  4. Even if all goes perfectly well you can end up with surprising results when the behaviour of an underlying library changes in subtle ways. So it can be worth having tests that check expected behaviour happens even when you are using standard, out of the box features.

Step By Step

Run the tests:

alan@dg04:~/PyTeaserPython3$ python -m tests
Traceback (most recent call last):
 File "/home/alan/anaconda3/lib/python3.6/", line 193, in _run_module_as_main
 "__main__", mod_spec)
 File "/home/alan/anaconda3/lib/python3.6/", line 85, in _run_code
 exec(code, run_globals)
 File "/home/alan/PyTeaserPython3/", line 2, in 
 from pyteaser import Summarize, SummarizeUrl
 File "/home/alan/PyTeaserPython3/", line 72
 print 'IOError'
SyntaxError: Missing parentheses in call to 'print'

That didn’t go too well. Print is a function in Python3.

There’s a utility called 2to3 that will automatically update the code.

alan@dg04:~/PyTeaserPython3$ 2to3 -wn .
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
RefactoringTool: ./goose/utils/
RefactoringTool: ./goose/videos/
RefactoringTool: ./goose/videos/

alan@dg04:~/PyTeaserPython3$ git diff --stat
 | 6 +++---
 goose/ | 2 +-
 goose/ | 18 +++++++++---------
 goose/ | 8 ++++----
 goose/images/ | 6 +++---
 goose/images/ | 4 ++--
 goose/images/ | 6 +++---
 goose/ | 16 ++++++++--------
 goose/ | 2 +-
 goose/ | 6 +++---
 goose/ | 4 ++--
 goose/utils/ | 10 +++++-----
 goose/utils/ | 28 ++++++++++++++--------------
 | 16 ++++++++--------
 | 12 ++++++------
 15 files changed, 72 insertions(+), 72 deletions(-)

It’s obviously done some work – how does this affect the tests?

alan@dg04:~/PyTeaserPython3$ python -m tests
ERROR: testURLs (__main__.TestSummarize)
Traceback (most recent call last):
 File "/home/alan/PyTeaserPython3/", line 20, in testURLs
 summaries = SummarizeUrl(url)
 File "/home/alan/PyTeaserPython3/", line 70, in SummarizeUrl
 article = grab_link(url)
 File "/home/alan/PyTeaserPython3/goose/", line 88, in 
 class StopWords(object):
 File "/home/alan/PyTeaserPython3/goose/", line 90, in StopWords
 PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]")
 File "/home/alan/anaconda3/lib/python3.6/", line 233, in compile
 return _compile(pattern, flags)
 File "/home/alan/anaconda3/lib/python3.6/", line 301, in _compile
 p = sre_compile.compile(pattern, flags)
 File "/home/alan/anaconda3/lib/python3.6/", line 526, in _parse
 code1 = _class_escape(source, this)
 File "/home/alan/anaconda3/lib/python3.6/", line 336, in _class_escape
 raise source.error('bad escape %s' % escape, len(escape))
sre_constants.error: bad escape \p at position 2

Ran 2 tests in 0.054s

FAILED (errors=1)

Some progress. One of the two tests passed.

Root of the error in the failing test is this line: PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]"). Looks like it isn’t used anywhere:

alan@dg04:~/PyTeaserPython3$ grep -nrI PUNCTUATION
goose/ PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]")
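For context, the \p{...} Unicode property classes are not something the standard library re module understands, which is why the compile blows up. If the pattern had turned out to be needed, a rough stand-in might look like the sketch below. Note that [^\w\s] is only an approximation of the original character class (\w covers a few more Unicode categories than the ones listed), so treat it as an assumption rather than a drop-in replacement:

```python
import re

# The stdlib re module has no \p{...} support; on the Python 3.6 used here
# this raises re.error ("bad escape \p").
try:
    re.compile(r"[^\p{Ll}\p{Lu}\s]")
except re.error as exc:
    print("stdlib re rejected the pattern:", exc)

# Approximate stand-in: anything that is neither a word character nor whitespace.
PUNCTUATION = re.compile(r"[^\w\s]")
print(PUNCTUATION.sub("", "Hello, world!"))  # Hello world
```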

Comment out that line and try again

alan@dg04:~/PyTeaserPython3$ python -m tests
ERROR: testURLs (__main__.TestSummarize)
Traceback (most recent call last):
File "/home/alan/PyTeaserPython3/", line 20, in testURLs
summaries = SummarizeUrl(url)
File "/home/alan/PyTeaserPython3/goose/", line 91, in StopWords
TRANS_TABLE = string.maketrans('', '')
AttributeError: module 'string' has no attribute 'maketrans'

Ran 2 tests in 0.061s

FAILED (errors=1)

I admit it was a bit optimistic to think that commenting out one line would do the trick. Now the problem arises when TRANS_TABLE is defined, and this is used elsewhere in the code.

alan@dg04:~/PyTeaserPython3$ grep -nrI TRANS_TABLE
goose/ TRANS_TABLE = string.maketrans('', '')
goose/ return content.translate(self.TRANS_TABLE, string.punctuation).decode('utf-8')

Fortunately someone put a useful comment into this method so I can google StackOverflow and find out how to do the same thing in Python3.

def remove_punctuation(self, content):
    # code taken form
    if isinstance(content, str):
        content = content.encode('utf-8')
    return content.translate(self.TRANS_TABLE, string.punctuation).decode('utf-8')

Sure enough there is an equivalent question and answer on StackOverflow so I can edit the method accordingly:

def remove_punctuation(self, content):
    # code taken form
    translator = str.maketrans('','',string.punctuation)
    return content.translate(translator)
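As a standalone check of the new approach, here is the same idea as a plain function rather than a method (the sample string is my own). Note that string.punctuation only covers ASCII punctuation:

```python
import string

def remove_punctuation(content):
    # Three-argument str.maketrans: the third argument lists characters to delete.
    translator = str.maketrans('', '', string.punctuation)
    return content.translate(translator)

print(remove_punctuation("Hello, world! (It's 2018.)"))  # Hello world Its 2018
```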

And now I can remove the reference to TRANS_TABLE from line 91 and run the tests again.

alan@dg04:~/PyTeaserPython3$ python -m tests
ERROR: testURLs (__main__.TestSummarize)
Traceback (most recent call last):
File "/home/alan/PyTeaserPython3/", line 20, in testURLs
summaries = SummarizeUrl(url)
return URLHelper.get_parsing_candidate(crawl_candidate.url)
File "/home/alan/PyTeaserPython3/goose/utils/", line 104, in get_parsing_candidate
link_hash = '%s.%s' % (hashlib.md5(final_url).hexdigest(), time.time())
TypeError: Unicode-objects must be encoded before hashing

Ran 2 tests in 0.273s

FAILED (errors=1)

A bit of digging to fix this and the tests now pass.

alan@dg04:~/PyTeaserPython3$ python -m tests
Ran 2 tests in 0.273s
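For the record, the usual way to deal with this TypeError is to encode the string to bytes before hashing. The sketch below assumes UTF-8 and uses a made-up URL; it shows the general shape of the fix rather than the exact change made in the library:

```python
import hashlib
import time

final_url = "http://example.com/article"  # hypothetical URL for illustration

# In Python 3, hashlib works on bytes, so passing a str raises TypeError.
try:
    hashlib.md5(final_url)
except TypeError as exc:
    print("TypeError:", exc)

# Encoding first (UTF-8 is an assumption here) gives hashlib the bytes it wants.
digest = hashlib.md5(final_url.encode('utf-8')).hexdigest()
link_hash = '%s.%s' % (digest, time.time())
print(link_hash)
```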


Let’s see how the demo works.

alan@dg04:~/PyTeaserPython3$ python

Hmm, that doesn’t seem right. Ah, I’m not connected to the internet, duh.

Connect and try again.

alan@dg04:~/PyTeaserPython3$ python

Compare with the results from the python2 version.

[u"Twitter's move is the latest response from U.S. Internet firms following disclosures by former spy agency contractor Edward Snowden about widespread, classified U.S. government surveillance programs.",
u'"Since then, it has become clearer and clearer how important that step was to protecting our users\' privacy."',
u'"A year and a half ago, Twitter was first served completely over HTTPS," the company said in a blog posting.',

Something isn’t working. I track it down to this code in the definition of get_html in goose/

        result = urllib.request.urlopen(request).read()
        return None

In Python3 the encoding of the URL causes this error: urllib.error.URLError: . The try fails and so None is returned. I fix the encoding but there is now a ValueError being thrown. From grab_link in

    try:
        article = Goose().extract(url=inurl)
        return article
    except ValueError:
        print('Goose failed to extract article from url')
        return None

A bit more digging – this is due to the fact that the string ‘10.0’ can’t be converted to an int. I edit the code to use a float instead of an int in this case.
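The underlying behaviour is easy to reproduce in isolation. In this sketch, to_int is a hypothetical helper of mine showing one alternative (going via float and truncating), for cases where an int really is needed:

```python
# int() refuses a string that contains a decimal point...
try:
    int('10.0')
except ValueError as exc:
    print("ValueError:", exc)

# ...whereas float() is happy with it.
print(float('10.0'))  # 10.0

def to_int(value):
    """Hypothetical helper: parse via float, then truncate to int."""
    return int(float(value))

print(to_int('10.0'))  # 10
```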

Seems to be working now.

alan@dg04:~/PyTeaserPython3$ python
["Twitter's move is the latest response from U.S. Internet firms following "
'disclosures by former spy agency contractor Edward Snowden about widespread, '
'classified U.S. government surveillance programs.',
'"Since then, it has become clearer and clearer how important that step was '
'to protecting our users\' privacy."',

Let’s just double-check the tests.

alan@dg04:~/PyTeaserPython3$ python -m tests
./home/alan/PyTeaserPython3/goose/ DeprecationWarning: The unescape method is deprecated and will be removed in 3.5, use html.unescape() instead.
txt = HTMLParser().unescape(txt)
Ran 2 tests in 3.282s


The previous passing tests didn’t give me the full story. Let’s fix the deprecation warning in goose/ and re-run the tests.

alan@dg04:~/PyTeaserPython3$ python -m tests
Ran 2 tests in 3.653s
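The replacement suggested by the warning is a one-liner from the standard library html module. A quick sketch with a sample string of my own:

```python
import html

# html.unescape replaces HTMLParser().unescape from the deprecation warning.
txt = html.unescape("Tom &amp; Jerry &lt;3 &quot;cheese&quot;")
print(txt)  # Tom & Jerry <3 "cheese"
```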



Finally, I want to double check the outputs of the demo script to make sure I am getting the same results. It turns out that the summary for the second URL in the demo differed between Python 2 and Python 3. See the keywords function (beginning at line 177). The culprit is line 184, where Counter holds its items in a different order in the two versions. It seems the ordering logic has changed subtly between the two versions.

In Python 3 version:

Counter({'montevrain': 6, 'animal': 5, 'tiger': 3, 'town': 3, 'officials': 2, 'dog': 2, 'big': 2, 'cat': 2, 'outside': 2, 'woman': 2, 'supermarket': 2, 'car': 2, 'park': 2, 'search': 2, 'called': 2, 'local': 2, 'schools': 2, 'kept': 2, 'parisien': 2, 'mayors': 2, 'office': 2

In Python2 version

Counter({'montevrain': 6, 'animal': 5, 'tiger': 3, 'town': 3, 'office': 2, 'local': 2, 'mayors': 2, 'woman': 2, 'big': 2, 'schools': 2, 'officials': 2, 'outside': 2, 'supermarket': 2, 'search': 2, 'parisien': 2, 'park': 2, 'car': 2, 'cat': 2, 'called': 2, 'dog': 2, 'kept': 2

So when line 187 picks the 10 most common keywords, the Python 2 and Python 3 versions end up with different lists. A subtle change in the logic of Counter made quite a difference to the end result of running PyTeaser.
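Looking at the two outputs, the words that move around all have the same count, so the difference is in how ties are ordered. As far as I can tell this comes from the underlying dict: recent Python 3 versions keep insertion order, whereas Python 2 dict order was arbitrary. One way to make the result independent of that is to break ties explicitly. The sketch below uses a hypothetical top_keywords helper and a made-up word list, and breaks ties alphabetically, which is my own choice rather than anything PyTeaser does:

```python
from collections import Counter

def top_keywords(words, n):
    """Return the n most common words, breaking ties alphabetically."""
    counts = Counter(words)
    # Sort by descending count, then by word, so the order of ties is
    # the same on every Python version and for every input order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

words = ["tiger", "town", "dog", "cat", "tiger", "town", "dog", "cat", "tiger"]
print(top_keywords(words, 3))
# [('tiger', 3), ('cat', 2), ('dog', 2)]

# Same words in a different order give the same answer.
print(top_keywords(list(reversed(words)), 3) == top_keywords(words, 3))  # True
```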