A reflection on what it took to upgrade a simple Python lib to support Python 3. The lib in question is PyTeaser and the final result is at PyTeaserPython3.
TL;DR
The moral of the story is:
- Don’t try to upgrade something unless you really need to. It rarely goes well. Even a simple library like this one can throw up all kinds of challenges.
- Automated tests really are crucial to allow work like future upgrades. In this case I had some tests that seemed to work but that didn’t give me the full picture.
- Ultimately your program is most probably about turning one set of data into another set of other data. Make sure your tests cover those use cases. In this case I was lucky: there was a demo script in the project directory that I could use to manually compare results between the Python 2 and Python 3 version.
- Even if all goes perfectly well you can end up with surprising results when the behaviour of an underlying library changes in subtle ways. So it can be worth having tests that check expected behaviour happens even when you are using standard, out of the box features.
Step By Step
Run the tests:
alan@dg04:~/PyTeaserPython3$ python -m tests Traceback (most recent call last): File "/home/alan/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main "__main__", mod_spec) File "/home/alan/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code exec(code, run_globals) File "/home/alan/PyTeaserPython3/tests.py", line 2, in from pyteaser import Summarize, SummarizeUrl File "/home/alan/PyTeaserPython3/pyteaser.py", line 72 print 'IOError' ^ SyntaxError: Missing parentheses in call to 'print'
That didn’t go too well. Print is a function in Python3.
There’s a utility called 2to3 that will automatically update the code.
alan@dg04:~/PyTeaserPython3$ 2to3 -wn . RefactoringTool: Skipping optional fixer: buffer RefactoringTool: Skipping optional fixer: idioms RefactoringTool: Skipping optional fixer: set_literal ... RefactoringTool: ./goose/utils/encoding.py RefactoringTool: ./goose/videos/extractors.py RefactoringTool: ./goose/videos/videos.py alan@dg04:~/PyTeaserPython3$ git diff --stat demo.py | 6 +++--- goose/__init__.py | 2 +- goose/article.py | 18 +++++++++--------- goose/extractors.py | 8 ++++---- goose/images/extractors.py | 6 +++--- goose/images/image.py | 4 ++-- goose/images/utils.py | 6 +++--- goose/network.py | 16 ++++++++-------- goose/outputformatters.py | 2 +- goose/parsers.py | 6 +++--- goose/text.py | 4 ++-- goose/utils/__init__.py | 10 +++++----- goose/utils/encoding.py | 28 ++++++++++++++-------------- pyteaser.py | 16 ++++++++-------- tests.py | 12 ++++++------ 15 files changed, 72 insertions(+), 72 deletions(-)
It’s obviously done some work – how does this affect the tests?
alan@dg04:~/PyTeaserPython3$ python -m tests .E ====================================================================== ERROR: testURLs (__main__.TestSummarize) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs summaries = SummarizeUrl(url) File "/home/alan/PyTeaserPython3/pyteaser.py", line 70, in SummarizeUrl article = grab_link(url) ... File "/home/alan/PyTeaserPython3/goose/text.py", line 88, in class StopWords(object): File "/home/alan/PyTeaserPython3/goose/text.py", line 90, in StopWords PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]") File "/home/alan/anaconda3/lib/python3.6/re.py", line 233, in compile return _compile(pattern, flags) File "/home/alan/anaconda3/lib/python3.6/re.py", line 301, in _compile p = sre_compile.compile(pattern, flags) ... File "/home/alan/anaconda3/lib/python3.6/sre_parse.py", line 526, in _parse code1 = _class_escape(source, this) File "/home/alan/anaconda3/lib/python3.6/sre_parse.py", line 336, in _class_escape raise source.error('bad escape %s' % escape, len(escape)) sre_constants.error: bad escape \p at position 2 ---------------------------------------------------------------------- Ran 2 tests in 0.054s FAILED (errors=1)
Some progress. One of the two tests passed.
Root of the error in the failing test is this line: PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]")
. Looks like it isn’t used anywhere:
alan@dg04:~/PyTeaserPython3$ grep -nrI PUNCTUATION goose/text.py:90: PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]") alan@dg04:~/PyTeaserPython3$
Comment out that line and try again
alan@dg04:~/PyTeaserPython3$ python -m tests .E ====================================================================== ERROR: testURLs (__main__.TestSummarize) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs summaries = SummarizeUrl(url) ... File "/home/alan/PyTeaserPython3/goose/text.py", line 91, in StopWords TRANS_TABLE = string.maketrans('', '') AttributeError: module 'string' has no attribute 'maketrans' ---------------------------------------------------------------------- Ran 2 tests in 0.061s FAILED (errors=1) alan@dg04:~/PyTeaserPython3$
I admit it was a bit optimistic to think that commenting out one line would do the trick. Now the problem arises when TRANS_TABLE
is defined, and this is used elsewhere in the code.
alan@dg04:~/PyTeaserPython3$ grep -nrI TRANS_TABLE goose/text.py:91: TRANS_TABLE = string.maketrans('', '') goose/text.py:107: return content.translate(self.TRANS_TABLE, string.punctuation).decode('utf-8') alan@dg04:~/PyTeaserPython3$
Fortunately someone put a useful comment into this method so I can google StackOverflow and find out how to do the same thing in Python3.
def remove_punctuation(self, content): # code taken form # http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python if isinstance(content, str): content = content.encode('utf-8') return content.translate(self.TRANS_TABLE, string.punctuation).decode('utf-8')
Sure enough there is an equivalent question and answer on StackOverflow so I can edit the method accordingly:
def remove_punctuation(self, content): # code taken form # https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate translator = str.maketrans('','',string.punctuation) return content.translate(translator)
And now I can remove the reference to TRANS_TABLE
from line 91 and run the tests again.
alan@dg04:~/PyTeaserPython3$ python -m tests .E ====================================================================== ERROR: testURLs (__main__.TestSummarize) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs summaries = SummarizeUrl(url) ... return URLHelper.get_parsing_candidate(crawl_candidate.url) File "/home/alan/PyTeaserPython3/goose/utils/__init__.py", line 104, in get_parsing_candidate link_hash = '%s.%s' % (hashlib.md5(final_url).hexdigest(), time.time()) TypeError: Unicode-objects must be encoded before hashing ---------------------------------------------------------------------- Ran 2 tests in 0.273s FAILED (errors=1) alan@dg04:~/PyTeaserPython3$
A bit of digging to fix this and the tests now pass.
alan@dg04:~/PyTeaserPython3$ python -m tests .. ---------------------------------------------------------------------- Ran 2 tests in 0.273s OK alan@dg04:~/PyTeaserPython3$
Let’s see how the demo works.
alan@dg04:~/PyTeaserPython3$ python demo.py None None None alan@dg04:~
Hmm, that doesn’t seem right. Ah, I’m not connected to the internet, duh.
Connect and try again.
alan@dg04:~/PyTeaserPython3$ python demo.py None None None alan@dg04:~
Compare with the results from the python2 version.
[u"Twitter's move is the latest response from U.S. Internet firms following disclosures by former spy agency contractor Edward Snowden about widespread, classified U.S. government surveillance programs.", u'"Since then, it has become clearer and clearer how important that step was to protecting our users\' privacy."', u'"A year and a half ago, Twitter was first served completely over HTTPS," the company said in a blog posting.', ...
Something isn’t working. I track it down to this code in the definition of get_html
in goose/network.py
.
try: result = urllib.request.urlopen(request).read() except: return None
In Python3 the encoding of the URL causes this error: urllib.error.URLError:
. The try fails and so None is returned. I fix the encoding but there is now a ValueError
being thrown. From grab_link
in pyteaser.py
.
try: article = Goose().extract(url=inurl) return article except ValueError: print('Goose failed to extract article from url') return None
A bit more digging – this is due to the fact that the string ‘10.0’ can’t be converted to an int. I edit the code to use a float instead of an int in this case.
Seems to be working now.
alan@dg04:~/PyTeaserPython3$ python demo.py ["Twitter's move is the latest response from U.S. Internet firms following " 'disclosures by former spy agency contractor Edward Snowden about widespread, ' 'classified U.S. government surveillance programs.', '"Since then, it has become clearer and clearer how important that step was ' 'to protecting our users\' privacy."',
Let’s just double-check the tests.
alan@dg04:~/PyTeaserPython3$ python -m tests ./home/alan/PyTeaserPython3/goose/outputformatters.py:65: DeprecationWarning: The unescape method is deprecated and will be removed in 3.5, use html.unescape() instead. txt = HTMLParser().unescape(txt) . ---------------------------------------------------------------------- Ran 2 tests in 3.282s OK alan@dg04:~/PyTeaserPython3$
The previous passing tests didn’t give me the full story. Let’s fix the deprecation warning in goose/outputformatters.py
and re-run the tests.
alan@dg04:~/PyTeaserPython3$ python -m tests .. ---------------------------------------------------------------------- Ran 2 tests in 3.653s OK
Better.
Finally, I want to double check the outputs of demo.py just to make sure I am getting the same output. It turns out that the summary for the second URL in the demo was producing a different result between Python 2 and Python 3. See the keywords
function in pyteaser.py
(beginning line 177). The culprit is line 184 where Counter
has items in a different order between the different versions. Seems the ordering logic has changed subtly between the two versions.
In Python 3 version:
Counter({'montevrain': 6, 'animal': 5, 'tiger': 3, 'town': 3, 'officials': 2, 'dog': 2, 'big': 2, 'cat': 2, 'outside': 2, 'woman': 2, 'supermarket': 2, 'car': 2, 'park': 2, 'search': 2, 'called': 2, 'local': 2, 'schools': 2, 'kept': 2, 'parisien': 2, 'mayors': 2, 'office': 2
In Python2 version
Counter({'montevrain': 6, 'animal': 5, 'tiger': 3, 'town': 3, 'office': 2, 'local': 2, 'mayors': 2, 'woman': 2, 'big': 2, 'schools': 2, 'officials': 2, 'outside': 2, 'supermarket': 2, 'search': 2, 'parisien': 2, 'park': 2, 'car': 2, 'cat': 2, 'called': 2, 'dog': 2, 'kept': 2
So when line 187 picks the 10 most common keywords the Python 2 and Python 3 version end up with a different list. A subtle change in the logic of Counter
made quite a difference to the end result of running PyTeaser.