Using local settings in a Scrapy project

TL;DR Seeing a pattern used in one framework can help you address similar problems in a different framework.

I was working on a scrapy project that would save pages into a local database. We wanted to be able to test it using a test database, just as you would with Rails.

Wec could see how to configure a database connection string in the settings.py config file, but this didn’t help switch between development and test database

Stackoverflow wasn’t much help, so we ended up rolling our own:

  1. Edit the settings.py file so it would read from additional settings files depending on a SCRAPY_ENV environment variable
  2. Move all the settings files to a separate config directory (and change scrapy.cfg so it knew where to look
  3. Edit .gitignore so that local files wouldn’t get committed to the repo (and then added some .sample files)

Git repo is here.

The magic happens at the end of settings.py

from importlib import import_module
from scrapy.utils.log import configure_logging
import logging
import os

SCRAPY_ENV=os.environ.get('SCRAPY_ENV',None)
if SCRAPY_ENV == None:
    raise ValueError("Must set SCRAPY_ENV environment var")
logger = logging.getLogger(__name__)
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

# Load if file exists; incorporate any names started with an
# uppercase letter into globals()
def load_extra_settings(fname):
    if not os.path.isfile("config/%s.py" % fname):
        logger.warning("Couldn't find %s, skipping" % fname)
        return
    mdl=import_module("config.%s" % fname)
    names = [x for x in mdl.__dict__ if x[0].isupper()]
    globals().update({k: getattr(mdl,k) for k in names})

load_extra_settings("secrets")
load_extra_settings("secrets_%s" % SCRAPY_ENV)
load_extra_settings("settings_%s" % SCRAPY_ENV)

It feels a bit hacky, but it does the job, so I would love to learn a more pythonic way to address this issue.

Advertisements

Re-learning regexes to help with NLP

TL;DR Don’t forget about boring old-school solutions when you’re working with shiny new tech approaches. Dusty old tech might still have a part to play.

I’ve been experimenting with entity recognition. The use case is to identify company names in text. I’ve been using the fab service from dandelion.eu as my gold standard of what should be achievable.

Overall it is pretty impressive. Consider the following phrases:

Takeda is a gerbil

Dandelion recognises that this phrase is about a gerbil.

Takeda eats barley from Shire

Dandelion recognises that this is about barley

Takeda buys goods from Shire

A subtle change of words means this sentence is probably about a company called Takeda and a company called Shire.

Very cool.

But what about this sentence:

Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million).

Still pretty impressive but it has made one big fumble. It thinks that CPS stands for Canon, whereas in reality CPS.WA is the stock ticker for Cyfrowy Polsat.

Is there an alternative approach?

In this sort of document, company names are often abbreviated, or referenced by their stock ticker. When they are abbreviated they use the same kind of convention. Sounds like a job for a regex.

In case you think that something as old-school as regexes can’t possibly have a role to play with cutting edge code, bear in mind they are heavily used in NLP, so Dandelion must be using them somewhere under the covers. Or have a look at Scikit-learn’s CountVectorizer which uses a very simply regex for the token-pattern that it uses for splitting up text into different terms.

I can’t remember the last time I used a regex in anger. (See also this great stackoverflow question and answer on the topic). I don’t like just copy pasting from stack overflow so I broke the task down into a few steps to make sure I fully understood what was going on. The iterations I went through are below (paste this into a python console to see the results).

sent = "Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million)."
import regex as re
re.findall(r"\(.+\)",sent)
re.findall(r"\(.+?\)",sent)
re.findall(r"\([A-Z.]+?\)",sent)
re.findall(r"[A-Z]\w+\s\([A-Z.]+?\)",sent)
re.findall(r"(?:[A-Z]\w+\s)*\([A-Z.]+?\)",sent)
re.findall(r"((?:[A-Z]\w+\s*)*)\s\(([A-Z.]+?)\)",sent)
re.findall(r"((?:[A-Z]\w+\s(?:[A-Z][A-Za-z.\s]+\s?)*))\s\(([A-Z.]+)\)",sent)

To go through the iterations one by one with what I re-learnt at each stage.

>>> re.findall(r"\(.+\)",sent)
['(CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million)']

Find one or more characters inside a bracket. Brackets have special significance so need to be escaped. This simple regex finds the longest match (it’s “greedy”) so we need to change it to a “lazy” search.

>>> re.findall(r"\(.+?\)",sent)
['(CPS.WA)', '(ESN)', '($44.48 million)']

Better, but the only real initials are combinations of a capital and a dot, so specify that:

>>> re.findall(r"\([A-Z.]+?\)",sent)
['(CPS.WA)', '(ESN)']

Next I need to find the text before the initials. This would be a Capital letter followed by one or more other characters followed by a space:

>>> re.findall(r"[A-Z]\w+\s\([A-Z.]+?\)",sent)
['Polsat (CPS.WA)']

Getting somewhere, but really we need multiple words before the initials. So put brackets around the part of the regex that includes a capitalised word and match this one or more times

>>> re.findall(r"([A-Z]\w+\s)+\([A-Z.]+?\)",sent)
['Polsat ']

WTF? Makes no sense. Except it does: The brackets create a capture group. This changes findall’s behaviour. findall now returns the results of the capture group rather than the overall match. See below how using the captures() method returns the whole match.

>>> re.search(r"([A-Z]\w+\s)+\([A-Z.]+?\)",sent).captures()
['Cyfrowy Polsat (CPS.WA)']

Solution is to turn this new group that we created into a non-capturing group using ?:

>>> re.findall(r"(?:[A-Z]\w+\s)+\([A-Z.]+?\)",sent)
['Cyfrowy Polsat (CPS.WA)', '(ESN)']

A bit better, but it would be nice now if we can separate out the name from the abbreviation, so we return two capture groups in the regex:

>>> re.findall(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+?)\)",sent)
[('Cyfrowy Polsat', 'CPS.WA'), ('', 'ESN')]

At this point the regex is only looking for capitalised words before the abbreviation. Eleven Sports Network has some lower case terms in the name, so the regex needs a bit more tuning. The following example does the job in this particular case. It looks for a capitalised word then a space, a capital letter and then some other text until it gets to what looks like an abbreviation in brackets:

>>> re.findall(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+)\)",sent)
[('Cyfrowy Polsat', 'CPS.WA'), ('Eleven Sports Network Sp.z o.o.', 'ESN')]

You can see this regex in action on the fab regex101.com site. Let’s break this down:

(                           )\s\((       )\)
 [A-Z]\w+\s[A-Z]                  [A-Z.]+
                (?:[\w.\s]+)

  • Capturing Group 1
    1. [A-Z] a capital letter
    2. \w+ one or more letters (to complete a word)
    3. \s a space (either to start another word, or just before the abbreviation)
    4. [A-Z] then a capital letter, to start a 2nd capitalised word
      1. (?:[\w.\s]+) a non-capturing group of one or more letters or full stops (periods) or spaces.
  • \s\( a space and then an open bracket
  • Capturing Group 2:
    1. [A-Z.]+? one or more capital letters or full stops.
  • \) Close bracket

This regex does the job but it didn’t take long for it to become incomprehensible. If ever there were a use case for copious unit tests it’s using regexes.

Also, it doesn’t generalise well. It won’t identify “Acme & Bob Corp (ABC)” or “ABC and Partners Ltd (ABC)” or “X.Y.Z. Property Co (XYZ)” or “ABC Enterprises, Inc. (‘ABC’)” properly. Writing one regex to handle all of these types of string would quickly become very brittle and hard to understand. In reality I would end up using a serious of regexes rather than trying to code one regex to rule the all.

Nevertheless I hope it’s clear how a boring old piece of tech can help plug gaps in the cool new stuff. It’s never too late to (re-)learn the old ways. And it’s ok to put your pride aside and go back to basics.

How long will this feature take?

Be clear on someone’s definition of done when trying to communicate what it will take to build a feature.

Planning software development timescales is hard. As an industry we have moved away from the detailed Gantt charts and their illusion of total clarity and control. Basecamp have recently been talking about the joys of using Hill Charts to better communicate project statuses. The folk at ProdPad have been championing, for a long time, the idea of a flexible, transparent roadmap instead of committing to timelines.

That’s all well and good if you are preaching to the choir. But if you are working in a startup and the CEO needs to make a build or buy decision for a piece of software, you need to make sure that you have some way of weighing up the likely costs and efforts of any new feature you commit to build. It’s not good enough to just prioritise requests and drop them in the backlog.

The excellent Programmer Time Translation Table is a surprisingly accurate way of interpreting developer time estimates. My own rule of thumb is similar to Anders’ project manager. I usually triple anything a developer tells me because you build everything at least 3 times: once based on the developer’s interpretation of the requirements; once to convert that into what the product owner wanted; and once into what the end users will use. But even these approaches only look at things from the developer’s point of view, based on a developer’s “definition of done”. The overall puzzle can be much bigger than that.

For example, the startup CEO who is trying to figure out if we should invest in Feature X probably has a much longer range “definition of done” than “when can we release a beta version”. For example: “When will this make an impact on my revenues” or “when will this improve my user churn rates”. Part of the CTO job is to help make that decision from the business point of view in addition to what seems to be interesting from a tech angle.

For example, consider these two answers to the same question “When will the feature be done?”.

  1. The dev team is working on it in the current iteration, assuming testing goes well it will be released in 2 weeks.
  2. The current set of requirements is currently on target to release in 2 weeks. We will then need to do some monitoring over the next month or two so that we can iron out any issues that we spot in production and build any high priority enhancements that the users need. After that we will need to keep working on enhancements, customer feedback and scalability/security improvements so probably should expect to dedicate X effort on an ongoing basis over the next year.

Two examples from my experience:

A B2B system used primarily by internal staff. It took us about 6 weeks to release the first version from initial brainstorming on it. Then it took about another two months to get the first person to use it live. Within 2 years it was contributing 20% of our revenues, and people couldn’t live without it.

An end user feature that we felt would differentiate us from the competition. This was pretty technically involved so the backend work kept someone busy for a couple months. After some user testing we realised that the UI was going to need some imaginative work to get right. Eventually it got released. Two months after release the take-up was pretty unimpressive. But 5 years later that feature was fully embedded and it seems that everyone is using it.

What is the right “definition of done” for both of these projects? Depends on who is asking. It’s as well to be clear on what definition they are using before you answer. The right answer might be in the range of months or years, not hours or weeks.

Memory and Mining in Ethereum – a look in Geth internals

There are a few comments on my guide to getting started with Ethereum private networks about mining taking ages to run, or apparently not running at all. The consensus seems to be that you need to be on a 64 bit OS and using plenty of memory. In this post I’ll compare mining in a private network with 1Gb, 2Gb and 4Gb virtual machine configurations. This was an academic exercise I went through for my own interest, but hopefully will be of some interest to people who enjoy understanding a bit more about how the code works.

Here are the steps, collected from various docs, that I went through on Ubuntu. The developer’s guide was my starting point.

Install Go:

sudo add-apt-repository ppa:gophers/archive
sudo apt-get update
sudo apt-get install golang-1.9

Set GOPATH:

mkdir -p ~/go; echo "export GOPATH=$HOME/go" >> ~/.bashrc
echo "export PATH=$PATH:$HOME/go/bin:/usr/local/go/bin" >> ~/.bashrc
source ~/.bashrc

A few more commands so Go can be found in the GOPATH

mkdir $GOPATH/bin
sudo ln -s /usr/lib/go-1.9/bin/go $GOPATH/bin/go

Clone geth repo and build geth:

git clone git@github.com:ethereum/go-ethereum.git $GOPATH/src/github.com/ethereum/go-ethereum
cd $GOPATH/src/github.com/ethereum/go-ethereum
go install -v ./cmd/geth

You’ve probably got two geths installed now, a system-wide one and the one that you just installed, as below.

ethuser@eth-host:~$ geth version | grep '^Version:'
Version: 1.7.3-stable
ethuser@eth-host:~$ $GOPATH/bin/geth version | grep '^Version:'
Version: 1.8.0-unstable

Before we do anything else, let’s just double check that you can run the newly installed geth. Assuming you used the guide in the previous post:

cd ~
$GOPATH/bin/geth --datadir node1 --networkid 98765 console

Add some logging to geth. See this commit which provides additional logging during the mining process:

https://github.com/alanbuxton/go-ethereum/commit/3990f15fab13d32f9b9f0224367295f285f7f9db#diff-d48da17e6237684b832b97db3107179f

Apply these changes to your local version, or if you prefer you can clone the branch in my fork that has these changes already in.

git clone -b extra_logging --depth 1 https://github.com/alanbuxton/go-ethereum.git $GOPATH/src/github.com/ethereum/go-ethereum

Rebuild your local version using go install -v ./cmd/geth and now you’ll get some more logging when you mine in your private network. See below for some examples with extra logging with 4Gb, 2Gb and 1Gb virtual machine configurations:

4Gb

> miner.start(1);admin.sleepBlocks(1);miner.stop()
INFO [01-26|09:37:47] Updated mining threads threads=1
INFO [01-26|09:37:47] Transaction pool price threshold updated price=18000000000
INFO [01-26|09:37:47] Starting mining operation 
INFO [01-26|09:37:47] Commit new mining work number=34 txs=0 uncles=0 elapsed=125.978µs
INFO [01-26|09:37:47] Started ethash search for new nonces miner=0 seed=7377726478179259701
INFO [01-26|09:38:05] Still mining miner=0 attempts=1000 duration=18s
INFO [01-26|09:38:06] Still mining miner=0 attempts=2000 duration=19s
INFO [01-26|09:38:06] Still mining miner=0 attempts=4000 duration=19s
INFO [01-26|09:38:06] Still mining miner=0 attempts=8000 duration=19s
INFO [01-26|09:38:07] Still mining miner=0 attempts=16000 duration=20s
INFO [01-26|09:38:08] Still mining miner=0 attempts=32000 duration=21s
INFO [01-26|09:38:09] Finished mining miner=0 attempts=42332 duration=22s
INFO [01-26|09:38:09] Successfully sealed new block number=34 hash=c53c2f…45286c
INFO [01-26|09:38:09] 🔨 mined potential block number=34 hash=c53c2f…45286c

2 Gb

> miner.start(1);admin.sleepBlocks(1);miner.stop()
INFO [01-26|10:32:19] Updated mining threads threads=1
INFO [01-26|10:32:19] Transaction pool price threshold updated price=18000000000
INFO [01-26|10:32:19] Starting mining operation 
INFO [01-26|10:32:19] Commit new mining work number=37 txs=0 uncles=0 elapsed=129.734µs
INFO [01-26|10:32:19] Started ethash search for new nonces miner=0 seed=4187772785547710899
INFO [01-26|10:32:40] Still mining miner=0 attempts=1000 duration=21s
INFO [01-26|10:32:46] Still mining miner=0 attempts=2000 duration=27s
INFO [01-26|10:32:49] Still mining miner=0 attempts=4000 duration=30s
INFO [01-26|10:32:49] Still mining miner=0 attempts=8000 duration=30s
INFO [01-26|10:32:50] Still mining miner=0 attempts=16000 duration=31s
INFO [01-26|10:32:51] Still mining miner=0 attempts=32000 duration=32s
INFO [01-26|10:32:53] Finished mining miner=0 attempts=57306 duration=34s
INFO [01-26|10:32:53] Successfully sealed new block number=37 hash=068096…c31777
INFO [01-26|10:32:53] 🔨 mined potential block number=37 hash=068096…c31777

1Gb

I cancelled after over half an hour…

> miner.start(1);admin.sleepBlocks(1);miner.stop()
INFO [01-26|09:51:58] Updated mining threads threads=1
INFO [01-26|09:51:58] Transaction pool price threshold updated price=18000000000
INFO [01-26|09:51:58] Starting mining operation 
INFO [01-26|09:51:58] Commit new mining work number=37 txs=0 uncles=0 elapsed=243.956µs
INFO [01-26|09:51:58] Started ethash search for new nonces miner=0 seed=1420748052411692923
INFO [01-26|09:53:11] Still mining miner=0 attempts=1000 duration=1m13s
INFO [01-26|09:54:23] Still mining miner=0 attempts=2000 duration=2m25s
INFO [01-26|09:56:52] Still mining miner=0 attempts=4000 duration=4m54s
INFO [01-26|10:01:53] Still mining miner=0 attempts=8000 duration=9m55s
INFO [01-26|10:11:49] Still mining miner=0 attempts=16000 duration=19m51s
INFO [01-26|10:30:27] Still mining miner=0 attempts=32000 duration=38m29s

Adventures in upgrading a Python text summarisation library

A reflection on what it took to upgrade a simple Python lib to support Python 3. The lib in question is PyTeaser and the final result is at PyTeaserPython3.

TL;DR

The moral of the story is:

  1. Don’t try to upgrade something unless you really need to. It rarely goes well. Even a simple library like this one can throw up all kinds of challenges.
  2. Automated tests really are crucial to allow work like future upgrades. In this case I had some tests that seemed to work but that didn’t give me the full picture.
  3. Ultimately your program is most probably about turning one set of data into another set of other data. Make sure your tests cover those use cases. In this case I was lucky: there was a demo script in the project directory that I could use to manually compare results between the Python 2 and Python 3 version.
  4. Even if all goes perfectly well you can end up with surprising results when the behaviour of an underlying library changes in subtle ways. So it can be worth having tests that check expected behaviour happens even when you are using standard, out of the box features.

Step By Step

Run the tests:

alan@dg04:~/PyTeaserPython3$ python -m tests
Traceback (most recent call last):
 File "/home/alan/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
 "__main__", mod_spec)
 File "/home/alan/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
 exec(code, run_globals)
 File "/home/alan/PyTeaserPython3/tests.py", line 2, in 
 from pyteaser import Summarize, SummarizeUrl
 File "/home/alan/PyTeaserPython3/pyteaser.py", line 72
 print 'IOError'
 ^
SyntaxError: Missing parentheses in call to 'print'

That didn’t go too well. Print is a function in Python3.

There’s a utility called 2to3 that will automatically update the code.

alan@dg04:~/PyTeaserPython3$ 2to3 -wn .
RefactoringTool: Skipping optional fixer: buffer
RefactoringTool: Skipping optional fixer: idioms
RefactoringTool: Skipping optional fixer: set_literal
...
RefactoringTool: ./goose/utils/encoding.py
RefactoringTool: ./goose/videos/extractors.py
RefactoringTool: ./goose/videos/videos.py

alan@dg04:~/PyTeaserPython3$ git diff --stat
 demo.py | 6 +++---
 goose/__init__.py | 2 +-
 goose/article.py | 18 +++++++++---------
 goose/extractors.py | 8 ++++----
 goose/images/extractors.py | 6 +++---
 goose/images/image.py | 4 ++--
 goose/images/utils.py | 6 +++---
 goose/network.py | 16 ++++++++--------
 goose/outputformatters.py | 2 +-
 goose/parsers.py | 6 +++---
 goose/text.py | 4 ++--
 goose/utils/__init__.py | 10 +++++-----
 goose/utils/encoding.py | 28 ++++++++++++++--------------
 pyteaser.py | 16 ++++++++--------
 tests.py | 12 ++++++------
 15 files changed, 72 insertions(+), 72 deletions(-)

It’s obviously done some work – how does this affect the tests?

alan@dg04:~/PyTeaserPython3$ python -m tests
.E
======================================================================
ERROR: testURLs (__main__.TestSummarize)
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs
 summaries = SummarizeUrl(url)
 File "/home/alan/PyTeaserPython3/pyteaser.py", line 70, in SummarizeUrl
 article = grab_link(url)
...
 File "/home/alan/PyTeaserPython3/goose/text.py", line 88, in 
 class StopWords(object):
 File "/home/alan/PyTeaserPython3/goose/text.py", line 90, in StopWords
 PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]")
 File "/home/alan/anaconda3/lib/python3.6/re.py", line 233, in compile
 return _compile(pattern, flags)
 File "/home/alan/anaconda3/lib/python3.6/re.py", line 301, in _compile
 p = sre_compile.compile(pattern, flags)
 ...
 File "/home/alan/anaconda3/lib/python3.6/sre_parse.py", line 526, in _parse
 code1 = _class_escape(source, this)
 File "/home/alan/anaconda3/lib/python3.6/sre_parse.py", line 336, in _class_escape
 raise source.error('bad escape %s' % escape, len(escape))
sre_constants.error: bad escape \p at position 2

----------------------------------------------------------------------
Ran 2 tests in 0.054s

FAILED (errors=1)

Some progress. One of the two tests passed.

Root of the error in the failing test is this line: PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]"). Looks like it isn’t used anywhere:

alan@dg04:~/PyTeaserPython3$ grep -nrI PUNCTUATION
goose/text.py:90: PUNCTUATION = re.compile("[^\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}\\p{Pc}\\s]")
alan@dg04:~/PyTeaserPython3$

Comment out that line and try again

alan@dg04:~/PyTeaserPython3$ python -m tests
.E
======================================================================
ERROR: testURLs (__main__.TestSummarize)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs
summaries = SummarizeUrl(url)
...
File "/home/alan/PyTeaserPython3/goose/text.py", line 91, in StopWords
TRANS_TABLE = string.maketrans('', '')
AttributeError: module 'string' has no attribute 'maketrans'

----------------------------------------------------------------------
Ran 2 tests in 0.061s

FAILED (errors=1)
alan@dg04:~/PyTeaserPython3$

I admit it was a bit optimistic to think that commenting out one line would do the trick. Now the problem arises when TRANS_TABLE is defined, and this is used elsewhere in the code.

alan@dg04:~/PyTeaserPython3$ grep -nrI TRANS_TABLE
goose/text.py:91: TRANS_TABLE = string.maketrans('', '')
goose/text.py:107: return content.translate(self.TRANS_TABLE, string.punctuation).decode('utf-8')
alan@dg04:~/PyTeaserPython3$

Fortunately someone put a useful comment into this method so I can google StackOverflow and find out how to do the same thing in Python3.

def remove_punctuation(self, content):
    # code taken form
    # http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python
    if isinstance(content, str):
        content = content.encode('utf-8')
    return content.translate(self.TRANS_TABLE, string.punctuation).decode('utf-8')

Sure enough there is an equivalent question and answer on StackOverflow so I can edit the method accordingly:

def remove_punctuation(self, content):
    # code taken form
    # https://stackoverflow.com/questions/34293875/how-to-remove-punctuation-marks-from-a-string-in-python-3-x-using-translate
    translator = str.maketrans('','',string.punctuation)
    return content.translate(translator)

And now I can remove the reference to TRANS_TABLE from line 91 and run the tests again.

alan@dg04:~/PyTeaserPython3$ python -m tests
.E
======================================================================
ERROR: testURLs (__main__.TestSummarize)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/alan/PyTeaserPython3/tests.py", line 20, in testURLs
summaries = SummarizeUrl(url)
...
return URLHelper.get_parsing_candidate(crawl_candidate.url)
File "/home/alan/PyTeaserPython3/goose/utils/__init__.py", line 104, in get_parsing_candidate
link_hash = '%s.%s' % (hashlib.md5(final_url).hexdigest(), time.time())
TypeError: Unicode-objects must be encoded before hashing

----------------------------------------------------------------------
Ran 2 tests in 0.273s

FAILED (errors=1)
alan@dg04:~/PyTeaserPython3$

A bit of digging to fix this and the tests now pass.

alan@dg04:~/PyTeaserPython3$ python -m tests
..
----------------------------------------------------------------------
Ran 2 tests in 0.273s

OK
alan@dg04:~/PyTeaserPython3$

Let’s see how the demo works.

alan@dg04:~/PyTeaserPython3$ python demo.py
None
None
None
alan@dg04:~

Hmm, that doesn’t seem right. Ah, I’m not connected to the internet, duh.

Connect and try again.

alan@dg04:~/PyTeaserPython3$ python demo.py
None
None
None
alan@dg04:~

Compare with the results from the python2 version.

[u"Twitter's move is the latest response from U.S. Internet firms following disclosures by former spy agency contractor Edward Snowden about widespread, classified U.S. government surveillance programs.",
u'"Since then, it has become clearer and clearer how important that step was to protecting our users\' privacy."',
u'"A year and a half ago, Twitter was first served completely over HTTPS," the company said in a blog posting.',
...

Something isn’t working. I track it down to this code in the definition of get_html in goose/network.py.

    try:
        result = urllib.request.urlopen(request).read()
    except:
        return None

In Python3 the encoding of the URL causes this error: urllib.error.URLError: . The try fails and so None is returned. I fix the encoding but there is now a ValueError being thrown. From grab_link in pyteaser.py.

    try:
        article = Goose().extract(url=inurl)
        return article
    except ValueError:
        print('Goose failed to extract article from url')
        return None

A bit more digging – this is due to the fact that the string ‘10.0’ can’t be converted to an int. I edit the code to use a float instead of an int in this case.

Seems to be working now.

alan@dg04:~/PyTeaserPython3$ python demo.py
["Twitter's move is the latest response from U.S. Internet firms following "
'disclosures by former spy agency contractor Edward Snowden about widespread, '
'classified U.S. government surveillance programs.',
'"Since then, it has become clearer and clearer how important that step was '
'to protecting our users\' privacy."',

Let’s just double-check the tests.

alan@dg04:~/PyTeaserPython3$ python -m tests
./home/alan/PyTeaserPython3/goose/outputformatters.py:65: DeprecationWarning: The unescape method is deprecated and will be removed in 3.5, use html.unescape() instead.
txt = HTMLParser().unescape(txt)
.
----------------------------------------------------------------------
Ran 2 tests in 3.282s

OK
alan@dg04:~/PyTeaserPython3$

The previous passing tests didn’t give me the full story. Let’s fix the deprecation warning in goose/outputformatters.py and re-run the tests.

alan@dg04:~/PyTeaserPython3$ python -m tests
..
----------------------------------------------------------------------
Ran 2 tests in 3.653s

OK

Better.

Finally, I want to double check the outputs of demo.py just to make sure I am getting the same output. It turns out that the summary for the second URL in the demo was producing a different result between Python 2 and Python 3. See the keywords function in pyteaser.py (beginning line 177). The culprit is line 184 where Counter has items in a different order between the different versions. Seems the ordering logic has changed subtly between the two versions.

In Python 3 version:

Counter({'montevrain': 6, 'animal': 5, 'tiger': 3, 'town': 3, 'officials': 2, 'dog': 2, 'big': 2, 'cat': 2, 'outside': 2, 'woman': 2, 'supermarket': 2, 'car': 2, 'park': 2, 'search': 2, 'called': 2, 'local': 2, 'schools': 2, 'kept': 2, 'parisien': 2, 'mayors': 2, 'office': 2

In Python2 version

Counter({'montevrain': 6, 'animal': 5, 'tiger': 3, 'town': 3, 'office': 2, 'local': 2, 'mayors': 2, 'woman': 2, 'big': 2, 'schools': 2, 'officials': 2, 'outside': 2, 'supermarket': 2, 'search': 2, 'parisien': 2, 'park': 2, 'car': 2, 'cat': 2, 'called': 2, 'dog': 2, 'kept': 2

So when line 187 picks the 10 most common keywords the Python 2 and Python 3 version end up with a different list. A subtle change in the logic of Counter made quite a difference to the end result of running PyTeaser.

A CTO perspective on developers

I wrote a series of posts called “Impress Your CTO”.

The TL;DR running through all of those is:

  1. Learn your tooling. We have all made the error of selecting all rows of a database table and then counting them when we should just done <code>select count(*) from table</code>. There are plenty of other instances where something can be very simply and robustly done in your datastore, or even at operating system level, rather than at the application level.
  2. Be aware that the requirements you get from users only scratch the surface of what they really want to be able to do. Great developers are those who have a sufficient understanding of the users and so intuitively understand what makes sense or not, and what has been left unsaid by the product owner.

The whole series, for your reading pleasure, is re-capped here:

 

First steps with Ethereum Private Networks and Smart Contracts on Ubuntu 16.04

Ethereum is still in that “move fast and break things” phase. The docs for contracts[1] are very out of date, even the docs for mining have some out of date content in[2].

I wanted a very simple guide to setting up a small private network for testing smart contracts. I couldn’t find a simple one that worked. After much trial and error and digging around on Stackexchange, see below the steps I eventually settled on to get things working with a minimum of copy/paste. Hope it will prove useful for other noobs out there and that more experienced people might help clear things up that I have misunderstood.

I’ll do 3 things:

  1. Set up my first node and do some mining on it
  2. Add a very simple contract
  3. Add a second node to the network

First make sure you have installed ethereum (geth) and solc, for example with:

sudo apt-get install software-properties-common
sudo add-apt-repository -y ppa:ethereum/ethereum
sudo apt-get update
sudo apt-get install ethereum solc

Set up first node and do some mining on it

Create a genesis file – the below is about as simple as I could find. Save it as genesis.json in your working directory.

{
  "config": {
    "chainId": 1907,
    "homesteadBlock": 0,
    "eip155Block": 0,
    "eip158Block": 0
  },
  "difficulty": "10",
  "gasLimit": "2100000",
  "alloc": {}
}

Make a new data directory for the first node and set up two accounts to be used by that node. (Obviously your addresses will differ to the examples you see below).

ethuser@host01:~$ mkdir node1
ethuser@host01:~$ geth --datadir node1 account new
WARN [07-19|14:16:22] No etherbase set and no accounts found as default 
Your new account is locked with a password. Please give a password. Do not forget this password.
Passphrase: 
Repeat passphrase: 
Address: {f74afb1facd5eb2dd69feb589213c12be9b38177}
ethuser@host01:~$ geth --datadir node1 account new
Your new account is locked with a password. Please give a password. Do not forget this password.
Passphrase: 
Repeat passphrase: 
Address: {f0a3cf66cc2806a1e9626e11e5324360ee97f968}

Choose a networkid for your private network and initiate the first node:

ethuser@host01:~$ geth --datadir node1 init genesis.json
INFO [07-19|14:21:44] Allocated cache and file handles         database=/home/ethuser/node1/geth/chaindata cache=16 handles=16
....
INFO [07-19|14:21:44] Successfully wrote genesis state         database=lightchaindata                          hash=dd3f8d…707d0d

Now launch a geth console

ethuser@host01:~$ geth --datadir node1 --networkid 98765 console
INFO [07-19|14:22:42] Starting peer-to-peer node               instance=Geth/v1.6.7-stable-ab5646c5/linux-amd64/go1.8.1
...
 datadir: /home/ethuser/node1
 modules: admin:1.0 debug:1.0 eth:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0

> 

The first account you created is set as eth.coinbase. This will earn ether through mining. It does not have any ether yet[3], so we need to mine some blocks:

> eth.coinbase
"0xf74afb1facd5eb2dd69feb589213c12be9b38177"
> eth.getBalance(eth.coinbase)
0
> miner.start(1)

First time you run this it will create the DAG. This will take some time. Once the DAG is completed, leave the miner running for a while until it mines a few blocks. When you are ready to stop it, stop it with miner.stop().

.....
INFO [07-19|14:40:03] 🔨 mined potential block                  number=13 hash=188f37…47ef07
INFO [07-19|14:40:03] Commit new mining work                   number=14 txs=0 uncles=0 elapsed=196.079µs
> miner.stop()
> eth.getBalance(eth.coinbase)
65000000000000000000
> eth.getBalance(eth.accounts[0])
65000000000000000000

The first account in the account list is the one that has been earning the ether for its mining, so all we prove above is that eth.coinbase == eth.accounts[0]. Now we’ve got some ether in the first account, let’s send it to the 2nd account we created. The source account has to be unlocked before it can send a transaction.

> eth.getBalance(eth.accounts[1])
0
> personal.unlockAccount(eth.accounts[0])
Unlock account 0xf74afb1facd5eb2dd69feb589213c12be9b38177
Passphrase: 
true
> eth.sendTransaction({from: eth.accounts[0], to: eth.accounts[1], value: web3.toWei(3,"ether")})
INFO [07-19|14:49:12] Submitted transaction                    fullhash=0xa69d3fdf5672d2a33b18af0a16e0b56da3cbff5197898ad8c37ced9d5506d8a8 recipient=0xf0a3cf66cc2806a1e9626e11e5324360ee97f968
"0xa69d3fdf5672d2a33b18af0a16e0b56da3cbff5197898ad8c37ced9d5506d8a8"

For this transaction to register it has to be mined into a block, so let’s mine one more block:

> miner.start(1)
INFO [07-19|14:50:14] Updated mining threads                   threads=1
INFO [07-19|14:50:14] Transaction pool price threshold updated price=18000000000
null
> INFO [07-19|14:50:14] Starting mining operation 
INFO [07-19|14:50:14] Commit new mining work                   number=14 txs=1 uncles=0 elapsed=507.975µs
INFO [07-19|14:51:39] Successfully sealed new block            number=14 hash=f77345…f484c9
INFO [07-19|14:51:39] 🔗 block reached canonical chain          number=9  hash=2e7186…5fbd96
INFO [07-19|14:51:39] 🔨 mined potential block                  number=14 hash=f77345…f484c9

> miner.stop()
true
> eth.getBalance(eth.accounts[1])
3000000000000000000

One small point: the docs talk about miner.hashrate. This no longer exists, you have to use eth.hashrate if you want to see mining speed.

Add a very simple contract

The example contract is based on an example in the Solidity docs. There is no straightforward way to compile a contract into geth. Browser-solidity is a good online resource but I want to stick to the local server as much as possible for this posting. Save the following contract into a text file called simple.sol

pragma solidity ^0.4.13;

contract Simple {
  function arithmetics(uint _a, uint _b) returns (uint o_sum, uint o_product) {
    o_sum = _a + _b;
    o_product = _a * _b;
  }

  function multiply(uint _a, uint _b) returns (uint) {
    return _a * _b;
  }
}

And compile it as below:

ethuser@host01:~/contract$ solc -o . --bin --abi simple.sol
ethuser@host01:~/contract$ ls
Simple.abi  Simple.bin  simple.sol

The .abi file holds the contract interface and the .bin file holds the compiled code. There is apparently no neat way to load these files into geth so we will need to edit those files into scripts that can be loaded. Edit the files so they look like the below:

ethuser@host01:~/contract$ cat Simple.abi
var simpleContract = eth.contract([{"constant":false,"inputs":[{"name":"_a","type":"uint256"},{"name":"_b","type":"uint256"}],"name":"multiply","outputs":[{"name":"","type":"uint256"}],"payable":false,"type":"function"},{"constant":false,"inputs":[{"name":"_a","type":"uint256"},{"name":"_b","type":"uint256"}],"name":"arithmetics","outputs":[{"name":"o_sum","type":"uint256"},{"name":"o_product","type":"uint256"}],"payable":false,"type":"function"}])

and

ethuser@host01:~/contract$ cat Simple.bin
personal.unlockAccount(eth.accounts[0])

var simple = simpleContract.new(
{ from: eth.accounts[0],
data: "0x6060604052341561000f57600080fd5b5b6101178061001f6000396000f30060606040526000357c0100000000000000000000000000000000000000000000000000000000900463ffffffff168063165c4a161460475780638c12d8f0146084575b600080fd5b3415605157600080fd5b606e600480803590602001909190803590602001909190505060c8565b6040518082815260200191505060405180910390f35b3415608e57600080fd5b60ab600480803590602001909190803590602001909190505060d6565b604051808381526020018281526020019250505060405180910390f35b600081830290505b92915050565b600080828401915082840290505b92509290505600a165627a7a72305820389009d0e8aec0e9007e8551ca12061194d624aaaf623e9e7e981da7e69b2e090029",
gas: 500000
}
)

Two things in particular to notice:

  1. In the .bin file you need to ensure that the from account is unlocked
  2. The code needs to be enclosed in quotes and begin with 0x

Launch geth as before, load the contract scripts, mine them into a block and then interact with the contract. We won’t be able to do anything useful with the contract until it’s mined, as you can see below.

ethuser@host01:~$ geth --datadir node1 --networkid 98765 console
INFO [07-19|16:33:02] Starting peer-to-peer node instance=Geth/v1.6.7-stable-ab5646c5/linux-amd64/go1.8.1
....
> loadScript("contract/Simple.abi")
true
> loadScript("contract/Simple.bin")
Unlock account 0xf74afb1facd5eb2dd69feb589213c12be9b38177
Passphrase: 
INFO [07-19|16:34:16] Submitted contract creation              fullhash=0x318caec477b1b5af4e36b277fe9a9b054d86744f2ee12e22c12a7d5e16f9a022 contract=0x2994da3a52a6744aafb5be2adb4ab3246a0517b2
true
> simple
{
....
  }],
  address: undefined,
  transactionHash: "0x318caec477b1b5af4e36b277fe9a9b054d86744f2ee12e22c12a7d5e16f9a022"
}
> simple.multiply
undefined
> miner.start(1)
INFO [07-19|16:36:07] Updated mining threads                   threads=1
...
INFO [07-19|16:36:21] 🔨 mined potential block                  number=15 hash=ac3991…83b9ac
...
> miner.stop()
true
> simple.multiply
function()
> simple.multiply.call(5,6)
30
> simple.arithmetics.call(8,9)
[17, 72]

Set up a second node in the network

I’ll run the second node on the same virtual machine to keep things simple. The steps to take are:

  1. Make sure the existing geth node is running
  2. Create an empty data directory for the second node
  3. Add accounts for the second geth node as before
  4. Initialise the second geth node using the same genesis block as before
  5. Launch the second geth node setting bootnodes to point to the existing node

The second geth node will need to run on a non-default port.

Find the enode details from the existing geth node:

> admin.nodeInfo
{  enode: "enode://08993401988acce4cd85ef46a8af10d1cacad39652c98a9df4d5785248d1910e51d7f3d330f0a96053001264700c7e94c4ac39d30ed5a5f79758774208adaa1f@[::]:30303", 
...

We will need to substitute [::] with the IP address of the host, in this case 127.0.0.1

To set up the second node:

ethuser@host01:~$ mkdir node2
ethuser@host01:~$ geth --datadir node2 account new
WARN [07-19|16:55:52] No etherbase set and no accounts found as default 
Your new account is locked with a password. Please give a password. Do not forget this password.
Passphrase: 
Repeat passphrase: 
Address: {00163ea9bd7c371f92ecc3020cfdc69a32f70250}
ethuser@host01:~$ geth --datadir node2 init genesis.json
INFO [07-19|16:56:14] Allocated cache and file handles         database=/home/ethuser/node2/geth/chaindata cache=16 handles=16
...
INFO [07-19|16:56:14] Writing custom genesis block 
INFO [07-19|16:56:14] Successfully wrote genesis state         database=lightchaindata                          hash=dd3f8d…707d0d
ethuser@host01:~$ geth --datadir node2 --networkid 98765 --port 30304 --bootnodes "enode://08993401988acce4cd85ef46a8af10d1cacad39652c98a9df4d5785248d1910e51d7f3d330f0a96053001264700c7e94c4ac39d30ed5a5f79758774208adaa1f@127.0.0.1:30303" console

Wait a little and you will see block synchronisation taking place

> INFO [07-19|16:59:36] Block synchronisation started 
INFO [07-19|16:59:36] Imported new state entries               count=1 flushed=0 elapsed=118.503µs processed=1 pending=4 retry=0 duplicate=0 unexpected=0
INFO [07-19|16:59:36] Imported new state entries               count=3 flushed=2 elapsed=339.353µs processed=4 pending=3 retry=0 duplicate=0 unexpected=0

To check that things have fully synced, run eth.getBlock('latest') on each of the nodes. If things aren’t looking right then use admin.peers on each node to make sure that each node has peered with the other node.

Now you can run the miner on one node and run transactions on the other node.

Notes

[1]  https://github.com/ethereum/go-ethereum/issues/3793:

Compiling via RPC has been removed in #3740 (see ethereum/EIPs#209 for why). We will bring it back under a different method name if there is sufficient user demand. You’re the second person to complain about it within 2 days, so it looks like there is demand.

[2] The official guide to Contracts is out of date. I spotted some out of date material on Mining and submitted an issue but updating the official docs doesn’t seem much of a priority so I figured I would collect my learnings here for now.
[3] You can pre-allocate ether in the genesis.json if you prefer, but that would mean a little more cut and paste which I am doing my best to minimise here.

 

Edited 26 Aug 2017

No need to specify networkid when initialising the node.

Edited 25 Jan 2018

Decreased difficulty to 10 per this comment from bglmmz, thanks for the suggestion.