Software Development

Adventures in upgrading a Python text summarisation library

A reflection on what it took to upgrade a simple Python lib to support Python 3. The lib in question is PyTeaser and the final result is at PyTeaserPython3.

TL;DR

The moral of the story is:

  1. Don’t try to upgrade something unless you really need to. It rarely goes well. Even a simple library like this one can throw up all kinds of challenges.
  2. Automated tests really are crucial to allow work like future upgrades. In this case I had some tests that seemed to work but that didn’t give me the full picture.
  3. Ultimately your program is most probably about turning one set of data into another set of other data. Make sure your tests cover those use cases. In this case I was lucky: there was a demo script in the project directory that I could use to manually compare results between the Python 2 and Python 3 version.
  4. Even if all goes perfectly well you can end up with surprising results when the behaviour of an underlying library changes in subtle ways. So it can be worth having tests that check expected behaviour happens even when you are using standard, out of the box features.

Step By Step

Run the tests:

alan@dg04:~/PyTeaserPython3$ python -m tests
Traceback (most recent call last):
 File "/home/alan/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
 "__main__", mod_spec)
 File "/home/alan/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
 exec(code, run_globals)
 File "/home/alan/PyTeaserPython3/tests.py", line 2, in 
 from pyteaser import Summarize, SummarizeUrl
 File "/home/alan/PyTeaserPython3/pyteaser.py", line 72
 print 'IOError'
 ^
SyntaxError: Missing parentheses in call to 'print'

That didn’t go too well. Print is a function in Python3.

There’s a utility called 2to3 that will automatically update the code.

alan@dg04:~/PyTeaserPython3$ 2to3 -wn .
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
...
RefactoringTool: ./goose/utils/encoding.py
RefactoringTool: ./goose/videos/extractors.py
RefactoringTool: ./goose/videos/videos.py

alan@dg04:~/PyTeaserPython3$ git diff --stat
 demo.py | 6 +++---
 goose/__init__.py | 2 +-
 goose/article.py | 18 +++++++++---------
 goose/extractors.py | 8 ++++----
 goose/images/extractors.py | 6 +++---
 goose/images/image.py | 4 ++--
 goose/images/utils.py | 6 +++---
 goose/network.py | 16 ++++++++--------
 goose/outputformatters.py | 2 +-
 goose/parsers.py | 6 +++---
 goose/text.py | 4 ++--
 goose/utils/__init__.py | 10 +++++-----
 goose/utils/encoding.py | 28 ++++++++++++++--------------
 pyteaser.py | 16 ++++++++--------
 tests.py | 12 ++++++------
 15 files changed, 72 insertions(+), 72 deletions(-)

It’s obviously done some work – how does this affect the tests?

alan@dg04:~/PyTeaserPython3$ python -m tests
.E
======================================================================
ERROR: testURLs (__main__.TestSummarize)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs
 summaries = SummarizeUrl(url)
 File "/home/alan/PyTeaserPython3/pyteaser.py", line 70, in SummarizeUrl
 article = grab_link(url)
...
 File "/home/alan/PyTeaserPython3/goose/text.py", line 88, in 
 class StopWords(object):
 File "/home/alan/PyTeaserPython3/goose/text.py", line 90, in StopWords
 PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]")
 File "/home/alan/anaconda3/lib/python3.6/re.py", line 233, in compile
 return _compile(pattern, flags)
 File "/home/alan/anaconda3/lib/python3.6/re.py", line 301, in _compile
 p = sre_compile.compile(pattern, flags)
 ...
 File "/home/alan/anaconda3/lib/python3.6/sre_parse.py", line 526, in _parse
 code1 = _class_escape(source, this)
 File "/home/alan/anaconda3/lib/python3.6/sre_parse.py", line 336, in _class_escape
 raise source.error('bad escape %s' % escape, len(escape))
sre_constants.error: bad escape \p at position 2

----------------------------------------------------------------------
Ran 2 tests in 0.054s

FAILED (errors=1)

Some progress. One of the two tests passed.

Root of the error in the failing test is this line: PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]"). Looks like it isn’t used anywhere:

alan@dg04:~/PyTeaserPython3$ grep -nrI PUNCTUATION
goose/text.py:90: PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]")
alan@dg04:~/PyTeaserPython3$

Comment out that line and try again

alan@dg04:~/PyTeaserPython3$ python -m tests
.E
======================================================================
ERROR: testURLs (__main__.TestSummarize)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs
summaries = SummarizeUrl(url)
...
File "/home/alan/PyTeaserPython3/goose/text.py", line 91, in StopWords
TRANS_TABLE = string.maketrans('', '')
AttributeError: module 'string' has no attribute 'maketrans'

----------------------------------------------------------------------
Ran 2 tests in 0.061s

FAILED (errors=1)
alan@dg04:~/PyTeaserPython3$

I admit it was a bit optimistic to think that commenting out one line would do the trick. Now the problem arises when TRANS_TABLE is defined, and this is used elsewhere in the code.

alan@dg04:~/PyTeaserPython3$ grep -nrI TRANS_TABLE
goose/text.py:91: TRANS_TABLE = string.maketrans('', '')
goose/text.py:107: return content.translate(self.TRANS_TABLE, string.punctuation).decode('utf-8')
alan@dg04:~/PyTeaserPython3$

Fortunately someone put a useful comment into this method so I can google StackOverflow and find out how to do the same thing in Python3.

def remove_punctuation(self, content):
    # code taken form
    # http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
    if isinstance(content, str):
        content = content.encode('utf-8')
    return content.translate(self.TRANS_TABLE, string.punctuation).decode('utf-8')

Sure enough there is an equivalent question and answer on StackOverflow so I can edit the method accordingly:

def remove_punctuation(self, content):
    # code taken form
    # https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate
    translator = str.maketrans('','',string.punctuation)
    return content.translate(translator)

And now I can remove the reference to TRANS_TABLE from line 91 and run the tests again.

alan@dg04:~/PyTeaserPython3$ python -m tests
.E
======================================================================
ERROR: testURLs (__main__.TestSummarize)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs
summaries = SummarizeUrl(url)
...
return URLHelper.get_parsing_candidate(crawl_candidate.url)
File "/home/alan/PyTeaserPython3/goose/utils/__init__.py", line 104, in get_parsing_candidate
link_hash = '%s.%s' % (hashlib.md5(final_url).hexdigest(), time.time())
TypeError: Unicode-objects must be encoded before hashing

----------------------------------------------------------------------
Ran 2 tests in 0.273s

FAILED (errors=1)
alan@dg04:~/PyTeaserPython3$

A bit of digging to fix this and the tests now pass.

alan@dg04:~/PyTeaserPython3$ python -m tests
..
----------------------------------------------------------------------
Ran 2 tests in 0.273s

OK
alan@dg04:~/PyTeaserPython3$

Let’s see how the demo works.

alan@dg04:~/PyTeaserPython3$ python demo.py
None
None
None
alan@dg04:~

Hmm, that doesn’t seem right. Ah, I’m not connected to the internet, duh.

Connect and try again.

alan@dg04:~/PyTeaserPython3$ python demo.py
None
None
None
alan@dg04:~

Compare with the results from the python2 version.

[u"Twitter's move is the latest response from U.S. Internet firms following disclosures by former spy agency contractor Edward Snowden about widespread, classified U.S. government surveillance programs.",
u'"Since then, it has become clearer and clearer how important that step was to protecting our users\' privacy."',
u'"A year and a half ago, Twitter was first served completely over HTTPS," the company said in a blog posting.',
...

Something isn’t working. I track it down to this code in the definition of get_html in goose/network.py.

    try:
        result = urllib.request.urlopen(request).read()
    except:
        return None

In Python3 the encoding of the URL causes this error: urllib.error.URLError: . The try fails and so None is returned. I fix the encoding but there is now a ValueError being thrown. From grab_link in pyteaser.py.

    try:
        article = Goose().extract(url=inurl)
        return article
    except ValueError:
        print('Goose failed to extract article from url')
        return None

A bit more digging – this is due to the fact that the string ‘10.0’ can’t be converted to an int. I edit the code to use a float instead of an int in this case.

Seems to be working now.

alan@dg04:~/PyTeaserPython3$ python demo.py
["Twitter's move is the latest response from U.S. Internet firms following "
'disclosures by former spy agency contractor Edward Snowden about widespread, '
'classified U.S. government surveillance programs.',
'"Since then, it has become clearer and clearer how important that step was '
'to protecting our users\' privacy."',

Let’s just double-check the tests.

alan@dg04:~/PyTeaserPython3$ python -m tests
./home/alan/PyTeaserPython3/goose/outputformatters.py:65: DeprecationWarning: The unescape method is deprecated and will be removed in 3.5, use html.unescape() instead.
txt = HTMLParser().unescape(txt)
.
----------------------------------------------------------------------
Ran 2 tests in 3.282s

OK
alan@dg04:~/PyTeaserPython3$

The previous passing tests didn’t give me the full story. Let’s fix the deprecation warning in goose/outputformatters.py and re-run the tests.

alan@dg04:~/PyTeaserPython3$ python -m tests
..
----------------------------------------------------------------------
Ran 2 tests in 3.653s

OK

Better.

Finally, I want to double check the outputs of demo.py just to make sure I am getting the same output. It turns out that the summary for the second URL in the demo was producing a different result between Python 2 and Python 3. See the keywords function in pyteaser.py (beginning line 177). The culprit is line 184 where Counter has items in a different order between the different versions. Seems the ordering logic has changed subtly between the two versions.

In Python 3 version:

Counter({'montevrain': 6, 'animal': 5, 'tiger': 3, 'town': 3, 'officials': 2, 'dog': 2, 'big': 2, 'cat': 2, 'outside': 2, 'woman': 2, 'supermarket': 2, 'car': 2, 'park': 2, 'search': 2, 'called': 2, 'local': 2, 'schools': 2, 'kept': 2, 'parisien': 2, 'mayors': 2, 'office': 2

In Python2 version

Counter({'montevrain': 6, 'animal': 5, 'tiger': 3, 'town': 3, 'office': 2, 'local': 2, 'mayors': 2, 'woman': 2, 'big': 2, 'schools': 2, 'officials': 2, 'outside': 2, 'supermarket': 2, 'search': 2, 'parisien': 2, 'park': 2, 'car': 2, 'cat': 2, 'called': 2, 'dog': 2, 'kept': 2

So when line 187 picks the 10 most common keywords the Python 2 and Python 3 version end up with a different list. A subtle change in the logic of Counter made quite a difference to the end result of running PyTeaser.