Python ‘in’ set vs list

Came across a comment on this Stack Overflow question about converting a list to a set before looking for items in that set.

The point of the set is to turn ‘in’ from an O(n) operation into an O(1) operation. In this case it doesn’t matter much, but for larger data sets it would be inefficient to do multiple linear scans over a list.

Which led me to look a bit more at the time complexity of Python operations in general.

In the past I would only convert a list to a set if I wanted to remove duplicates. But evidently it’s a good habit to get into whenever the speed of membership tests matters. Though, obviously, converting the list to a set is O(n) in the first place.
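
As a rough illustration (my own sketch, not from the original discussion; timings will vary by machine), you can see the difference with timeit:

import timeit

items = list(range(100_000))
as_set = set(items)

# Worst case for the list: the item is at the end, so 'in' scans all 100,000 elements
print(timeit.timeit(lambda: 99_999 in items, number=1000))
# The set does a hash lookup, regardless of size
print(timeit.timeit(lambda: 99_999 in as_set, number=1000))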

Python: bind method outside loop to reduce overhead

Came across this interesting pattern in some of the sklearn codebase (ngrams here is a plain list built up earlier in the same method):

        # bind method outside of loop to reduce overhead
        ngrams_append = ngrams.append

        for n in range(min_n, min(max_n + 1, text_len + 1)):
            for i in range(text_len - n + 1):
                ngrams_append(text_document[i: i + n])
        return ngrams


I wonder, at what scale does this really start to make a meaningful performance difference?

Even if it doesn’t make a huge difference, I do find it pretty elegant and easy to follow.
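
Out of curiosity, here is a rough way to measure it (my own sketch, not from the sklearn codebase). I’d expect the saving only to show up in very hot loops:

import timeit

def plain_append(n=100_000):
    out = []
    for i in range(n):
        out.append(i)  # attribute lookup on every iteration
    return out

def bound_append(n=100_000):
    out = []
    out_append = out.append  # bind method outside of loop
    for i in range(n):
        out_append(i)
    return out

print(timeit.timeit(plain_append, number=100))
print(timeit.timeit(bound_append, number=100))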

Python script to handle company abbreviations

A while back I was doing some tasks to clean up company names. Wikipedia has a useful page on company abbreviations, but I couldn’t find a simple way to use that information in a Python script. So after downloading and wrangling the data from that Wikipedia page, here is a link to the code I ended up with. Posting it here in case it is of use to someone else out there.

Some useful factoids I picked up along the way:

  1. Take a lot of care with Unicode. Characters that look the same (depending on font) might be completely different characters. For example: 'KT vs КТ'.lower() == 'kt vs кт', where the second pair is Cyrillic, not Latin (see the sketch after this list).
  2. Some abbreviations might appear at the beginning of a name, not just at the end. For example ENEL RUSSIA PJSC vs PJSC “AEROFLOT”.
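
To make the first point concrete, here is a small sketch (mine, not part of the linked script) using the standard library’s unicodedata module:

import unicodedata

latin, cyrillic = "KT", "КТ"
print(latin == cyrillic)  # False, even though they can render identically
for char in latin + cyrillic:
    print(char, unicodedata.name(char))
# K LATIN CAPITAL LETTER K
# T LATIN CAPITAL LETTER T
# К CYRILLIC CAPITAL LETTER KA
# Т CYRILLIC CAPITAL LETTER TE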


Comparing Python vs Ruby’s handling of empty Lists/Arrays

TIL Ruby is more forgiving when handling empty arrays than Python is with empty lists. If you try to access a non-existent index in Ruby you get a nil back. In Python you get an IndexError. I prefer the Python approach. It’s more straightforward.

Though Python also has a curious way to work around the IndexError: accessing a range of items (a slice) never raises, even when the range is out of bounds. And Ruby also has a method that will give you an appropriate error if the index is out of bounds. So who knows what the logic behind all of this is supposed to be…

Python version

Compare the following Python commands

empty = []
arr = ["here","are","some","words"]
empty[0]
arr[0]
empty[0:1]
arr[0:1]
" ".join(empty[0:1])
" ".join(arr[0:1])

Results of the Python commands:

>>> empty = []
>>> arr = ["here","are","some","words"]
>>> empty[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> arr[0]
'here'
>>> empty[0:1]
[]
>>> arr[0:1]
['here']
>>> " ".join(empty[0:1])
''
>>> " ".join(arr[0:1])
'here'

Ruby version

See the following Ruby commands

empty = []
arr = ["here","are","some","words"]
empty[0]
arr[0]
empty[0..0]
arr[0..0]
empty[0..0].join(" ")
arr[0..0].join(" ")

Results of the Ruby commands:

irb(main):007:0> empty = []
=> []
irb(main):008:0> arr = ["here","are","some","words"]
=> ["here", "are", "some", "words"]
irb(main):009:0> empty[0]
=> nil
irb(main):010:0> arr[0]
=> "here"
irb(main):011:0> empty[0..0]
=> []
irb(main):012:0> arr[0..0]
=> ["here"]
irb(main):013:0> empty[0..0].join(" ")
=> ""
irb(main):014:0> arr[0..0].join(" ")
=> "here"

Comparison

Python is pretty unforgiving if you try to access a non-existent element. Ruby is forgiving enough to give you a nil if the element is non-existent. The downside of the Ruby approach is that you can’t tell whether the nil means “there was a nil there” or “there was nothing there”.

If you want the less forgiving approach in Ruby you can use fetch like so:

irb(main):031:0> empty
=> []
irb(main):032:0> empty.fetch(0)
Traceback (most recent call last):
5: from /snap/ruby/132/bin/irb:23:in `<main>'
4: from /snap/ruby/132/bin/irb:23:in `load'
3: from /snap/ruby/132/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
2: from (irb):32
1: from (irb):32:in `fetch'
IndexError (index 0 outside of array bounds: 0...0)
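
Going the other way, if you wanted Ruby’s forgiving behaviour in Python, a tiny helper would do it (fetch_or_none is a made-up name, not part of the standard library):

def fetch_or_none(lst, index):
    # Mimic Ruby's arr[i]: return None instead of raising IndexError
    try:
        return lst[index]
    except IndexError:
        return None

empty = []
arr = ["here", "are", "some", "words"]
print(fetch_or_none(empty, 0))  # None
print(fetch_or_none(arr, 0))    # here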


Learning how Cypress interacts with LocalStorage

I asked some developer friends a while back about whether people still like Watir for testing or if people are going more wholesale down the Selenium route. Their answer: “None of the above, try Cypress instead.”

I love it when I ask a question and get an answer completely out of left field. Turns out they were absolutely right. Cypress has proved to be excellent. It makes writing browser automation tests fun. There was a bit of a learning curve, though, so I want to share two things that took me a while to wrap my head around:

  1. Understanding Cypress’s async nature, and the need to rely on then for certain use cases.
  2. Understanding that if you follow along the tests in the UI then what you see isn’t always what Cypress sees.

To illustrate this, see this gist that accesses localStorage under different circumstances:

The first thing to note here is that Cypress’s async nature means that, for example, accessing localStorage makes sense from inside a then or inside an afterEach, but not from inside the main body of the test. See below the log results from the first two tests in the gist: [a] opens a page and clicks a link in it; [b] directly opens the same page.

You can see that the logging from the main body of the test doesn’t produce any output (row 7 from the first test and 4 from the second test).

The second thing to note is that what is seen is not always what is real.

In the third test, when following along with dev tools, I could see that localStorage was being set, but even the afterEach claimed there was no localStorage set (see row 1 of the afterEach).

Cypress, localStorage and redirection


I suspect this is because when an interactive user goes to https://docs.cypress.io, they end up redirected to a different page, and localStorage gets set during this redirection. What you are seeing in Dev tools is what is set as part of the redirection process (see how the URL in the image above is different to the one the test lands on), but what Cypress reports is what it found prior to the redirection.



Fun with True and False in Python

TIL: In Python, True == 1 and False == 0

>>> var0 = 0
>>> var1 = 1
>>> var0 == True
False
>>> var0 == False
True
>>> var1 == True
True
>>> var1 == False
False

The number 2, however, is neither True nor False. But it is truthy.

>>> var2 = 2
>>> var2 == True
False
>>> var2 == False
False
>>> bool(var2)
True

Which has the following interesting side effects when testing conditions:

>>> if (var1 == True): print("***  var1 is true ***")
... else: print("*** var1 is not true ***")
... 
***  var1 is true ***

>>> if (var2 == True): print("*** var2 is true ***")
... else: print("*** var2 is not true ***")
... 
*** var2 is not true ***

>>> if (bool(var2)): print("*** var2 is truthy ***")
... else: print("*** var2 is not truthy ***")
... 
*** var2 is truthy ***

Admittedly, this is a bit artificial because in reality if you wanted to test the truthiness of var2 you’d just do:

>>> if var2: print("*** var2 is truthy ***")
... else: print("*** var2 is not truthy ***")
...
*** var2 is truthy ***

But I did enjoy discovering this little nugget as it means you can easily count up the number of Trues in a list like so:

>>> sum([True,False,True,True,False,False,True])
4
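
The reason this works is that bool is a subclass of int, which is easy to verify:

>>> issubclass(bool, int)
True
>>> isinstance(True, int)
True
>>> True + True
2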

Using local settings in a Scrapy project

TL;DR Seeing a pattern used in one framework can help you address similar problems in a different framework.

I was working on a Scrapy project that would save pages into a local database. We wanted to be able to test it using a test database, just as you would with Rails.

We could see how to configure a database connection string in the settings.py config file, but this didn’t help with switching between a development and a test database.

Stack Overflow wasn’t much help, so we ended up rolling our own:

  1. Edit the settings.py file so it reads from additional settings files depending on a SCRAPY_ENV environment variable
  2. Move all the settings files to a separate config directory (and change scrapy.cfg so it knows where to look)
  3. Edit .gitignore so that local settings files don’t get committed to the repo (and add some .sample files)

Git repo is here.

The magic happens at the end of settings.py

from importlib import import_module
from scrapy.utils.log import configure_logging
import logging
import os

SCRAPY_ENV = os.environ.get('SCRAPY_ENV')
if SCRAPY_ENV is None:
    raise ValueError("Must set SCRAPY_ENV environment var")
logger = logging.getLogger(__name__)
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

# Load the module if its file exists; incorporate any names starting
# with an uppercase letter into globals()
def load_extra_settings(fname):
    if not os.path.isfile("config/%s.py" % fname):
        logger.warning("Couldn't find %s, skipping" % fname)
        return
    mdl = import_module("config.%s" % fname)
    names = [x for x in mdl.__dict__ if x[0].isupper()]
    globals().update({k: getattr(mdl, k) for k in names})

load_extra_settings("secrets")
load_extra_settings("secrets_%s" % SCRAPY_ENV)
load_extra_settings("settings_%s" % SCRAPY_ENV)

It feels a bit hacky, but it does the job, so I would love to learn a more Pythonic way to address this issue.

Re-learning regexes to help with NLP

TL;DR Don’t forget about boring old-school solutions when you’re working with shiny new tech approaches. Dusty old tech might still have a part to play.

I’ve been experimenting with entity recognition. The use case is to identify company names in text. I’ve been using the fab service from dandelion.eu as my gold standard of what should be achievable.

Overall it is pretty impressive. Consider the following phrases:

Takeda is a gerbil

Dandelion recognises that this phrase is about a gerbil.

Takeda eats barley from Shire

Dandelion recognises that this is about barley.

Takeda buys goods from Shire

A subtle change of words means this sentence is probably about a company called Takeda and a company called Shire.

Very cool.

But what about this sentence:

Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million).

Still pretty impressive but it has made one big fumble. It thinks that CPS stands for Canon, whereas in reality CPS.WA is the stock ticker for Cyfrowy Polsat.

Is there an alternative approach?

In this sort of document, company names are often abbreviated, or referenced by their stock ticker. When they are abbreviated they use the same kind of convention. Sounds like a job for a regex.

In case you think that something as old-school as regexes can’t possibly have a role to play alongside cutting-edge code, bear in mind that they are heavily used in NLP, so Dandelion must be using them somewhere under the covers. Or have a look at scikit-learn’s CountVectorizer, which uses a very simple regex (the default token_pattern is r"(?u)\b\w\w+\b") for splitting text up into terms.

I can’t remember the last time I used a regex in anger. (See also this great Stack Overflow question and answer on the topic.) I don’t like just copy-pasting from Stack Overflow, so I broke the task down into a few steps to make sure I fully understood what was going on. The iterations I went through are below (paste them into a Python console to see the results).

sent = "Polish media and telecoms group Cyfrowy Polsat (CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million)."
import regex as re
re.findall(r"\(.+\)",sent)
re.findall(r"\(.+?\)",sent)
re.findall(r"\([A-Z.]+?\)",sent)
re.findall(r"[A-Z]\w+\s\([A-Z.]+?\)",sent)
re.findall(r"(?:[A-Z]\w+\s)*\([A-Z.]+?\)",sent)
re.findall(r"((?:[A-Z]\w+\s*)*)\s\(([A-Z.]+?)\)",sent)
re.findall(r"((?:[A-Z]\w+\s(?:[A-Z][A-Za-z.\s]+\s?)*))\s\(([A-Z.]+)\)",sent)

Going through the iterations one by one, here’s what I re-learnt at each stage.

>>> re.findall(r"\(.+\)",sent)
['(CPS.WA) said late Thursday it agreed to buy a controlling stake in sports content producer and distributor Eleven Sports Network Sp.z o.o. (ESN) for 38 million euros ($44.48 million)']

Find one or more characters inside brackets. Brackets have special significance in a regex so they need to be escaped. This simple regex finds the longest possible match (it’s “greedy”), so we need to change it to a “lazy” search.

>>> re.findall(r"\(.+?\)",sent)
['(CPS.WA)', '(ESN)', '($44.48 million)']

Better, but the only real initials are combinations of capitals and dots, so specify that:

>>> re.findall(r"\([A-Z.]+?\)",sent)
['(CPS.WA)', '(ESN)']

Next I need to find the text before the initials. This would be a capital letter followed by one or more word characters, followed by a space:

>>> re.findall(r"[A-Z]\w+\s\([A-Z.]+?\)",sent)
['Polsat (CPS.WA)']

Getting somewhere, but really we want to allow multiple words before the initials. So put brackets around the part of the regex that matches a capitalised word, and match this one or more times:

>>> re.findall(r"([A-Z]\w+\s)+\([A-Z.]+?\)",sent)
['Polsat ']

WTF? Makes no sense. Except it does: the brackets create a capture group, and this changes findall’s behaviour. findall now returns the result of the capture group (its last repetition) rather than the overall match. See below how using the captures() method from the regex module returns the whole match.

>>> re.search(r"([A-Z]\w+\s)+\([A-Z.]+?\)",sent).captures()
['Cyfrowy Polsat (CPS.WA)']

The solution is to turn this new group into a non-capturing group using ?:. And note that “(ESN)” has no capitalised word immediately before it, so the repetition also needs to become * (zero or more) rather than + for it to match:

>>> re.findall(r"(?:[A-Z]\w+\s)+\([A-Z.]+?\)",sent)
['Cyfrowy Polsat (CPS.WA)', '(ESN)']

A bit better, but it would be nice if we could separate the name from the abbreviation, so we use two capture groups in the regex:

>>> re.findall(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+?)\)",sent)
[('Cyfrowy Polsat', 'CPS.WA'), ('', 'ESN')]

At this point the regex is only looking for capitalised words before the abbreviation. Eleven Sports Network has some lower case terms in the name, so the regex needs a bit more tuning. The following example does the job in this particular case. It looks for a capitalised word then a space, a capital letter and then some other text until it gets to what looks like an abbreviation in brackets:

>>> re.findall(r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+)\)",sent)
[('Cyfrowy Polsat', 'CPS.WA'), ('Eleven Sports Network Sp.z o.o.', 'ESN')]

You can see this regex in action on the fab regex101.com site. Let’s break this down:

(                           )\s\((       )\)
 [A-Z]\w+\s[A-Z]                  [A-Z.]+
                (?:[\w.\s]+)

  • Capturing Group 1
    1. [A-Z] a capital letter
    2. \w+ one or more letters (to complete a word)
    3. \s a space (either to start another word, or just before the abbreviation)
    4. [A-Z] then a capital letter, to start a 2nd capitalised word
    5. (?:[\w.\s]+) a non-capturing group of one or more letters, full stops (periods) or spaces
  • \s\( a space and then an open bracket
  • Capturing Group 2:
    1. [A-Z.]+ one or more capital letters or full stops.
  • \) Close bracket

This regex does the job but it didn’t take long for it to become incomprehensible. If ever there were a use case for copious unit tests it’s using regexes.
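
For example, a handful of tests along these lines (a sketch using pytest, not from the original post) would at least pin down the intended behaviour:

import pytest
import regex as re

PATTERN = r"([A-Z]\w+\s[A-Z](?:[\w.\s]+))\s\(([A-Z.]+)\)"

@pytest.mark.parametrize("text,expected", [
    ("Cyfrowy Polsat (CPS.WA) said", [("Cyfrowy Polsat", "CPS.WA")]),
    ("distributor Eleven Sports Network Sp.z o.o. (ESN) for",
     [("Eleven Sports Network Sp.z o.o.", "ESN")]),
    ("38 million euros ($44.48 million)", []),  # money, not an abbreviation
])
def test_finds_company_abbreviations(text, expected):
    assert re.findall(PATTERN, text) == expected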

Also, it doesn’t generalise well. It won’t identify “Acme & Bob Corp (ABC)” or “ABC and Partners Ltd (ABC)” or “X.Y.Z. Property Co (XYZ)” or “ABC Enterprises, Inc. (‘ABC’)” properly. Writing one regex to handle all of these types of string would quickly become very brittle and hard to understand. In reality I would end up using a series of regexes rather than trying to code one regex to rule them all.

Nevertheless I hope it’s clear how a boring old piece of tech can help plug gaps in the cool new stuff. It’s never too late to (re-)learn the old ways. And it’s ok to put your pride aside and go back to basics.

How long will this feature take?

Be clear on someone’s definition of done when trying to communicate what it will take to build a feature.

Planning software development timescales is hard. As an industry we have moved away from detailed Gantt charts and their illusion of total clarity and control. Basecamp have recently been talking about the joys of using Hill Charts to better communicate project status. The folks at ProdPad have long been championing the idea of a flexible, transparent roadmap instead of committed timelines.

That’s all well and good if you are preaching to the choir. But if you are working in a startup and the CEO needs to make a build or buy decision for a piece of software, you need to make sure that you have some way of weighing up the likely costs and efforts of any new feature you commit to build. It’s not good enough to just prioritise requests and drop them in the backlog.

The excellent Programmer Time Translation Table is a surprisingly accurate way of interpreting developer time estimates. My own rule of thumb is similar to Anders’ project manager. I usually triple anything a developer tells me because you build everything at least 3 times: once based on the developer’s interpretation of the requirements; once to convert that into what the product owner wanted; and once into what the end users will use. But even these approaches only look at things from the developer’s point of view, based on a developer’s “definition of done”. The overall puzzle can be much bigger than that.

For example, the startup CEO who is trying to figure out whether to invest in Feature X probably has a much longer-range “definition of done” than “when can we release a beta version”: something like “when will this make an impact on my revenues” or “when will this improve my user churn rates”. Part of the CTO’s job is to help make that decision from the business point of view, in addition to what seems interesting from a tech angle.

For example, consider these two answers to the same question “When will the feature be done?”.

  1. The dev team is working on it in the current iteration, assuming testing goes well it will be released in 2 weeks.
  2. The current set of requirements is on target to be released in 2 weeks. We will then need to do some monitoring over the next month or two so that we can iron out any issues that we spot in production and build any high-priority enhancements that the users need. After that we will need to keep working on enhancements, customer feedback and scalability/security improvements, so we should probably expect to dedicate X effort on an ongoing basis over the next year.

Two examples from my experience:

A B2B system used primarily by internal staff. It took us about 6 weeks from initial brainstorming to releasing the first version. Then it took about another two months to get the first person to use it live. Within 2 years it was contributing 20% of our revenues, and people couldn’t live without it.

An end-user feature that we felt would differentiate us from the competition. This was pretty technically involved, so the backend work kept someone busy for a couple of months. After some user testing we realised that the UI was going to need some imaginative work to get right. Eventually it got released. Two months after release the take-up was pretty unimpressive. But 5 years later that feature was fully embedded and it seems that everyone is using it.

What is the right “definition of done” for both of these projects? Depends on who is asking. It’s as well to be clear on what definition they are using before you answer. The right answer might be in the range of months or years, not hours or weeks.

Memory and Mining in Ethereum – a look at Geth internals

There are a few comments on my guide to getting started with Ethereum private networks about mining taking ages, or apparently not running at all. The consensus seems to be that you need to be on a 64-bit OS with plenty of memory. In this post I’ll compare mining in a private network with 1GB, 2GB and 4GB virtual machine configurations. This was an academic exercise I went through for my own interest, but hopefully it will be of some interest to people who enjoy understanding a bit more about how the code works.

Here are the steps, collected from various docs, that I went through on Ubuntu. The developer’s guide was my starting point.

Install Go:

sudo add-apt-repository ppa:gophers/archive
sudo apt-get update
sudo apt-get install golang-1.9

Set GOPATH:

mkdir -p ~/go; echo "export GOPATH=$HOME/go" >> ~/.bashrc
echo "export PATH=$PATH:$HOME/go/bin:/usr/local/go/bin" >> ~/.bashrc
source ~/.bashrc

A few more commands so Go can be found in the GOPATH:

mkdir $GOPATH/bin
sudo ln -s /usr/lib/go-1.9/bin/go $GOPATH/bin/go

Clone geth repo and build geth:

git clone git@github.com:ethereum/go-ethereum.git $GOPATH/src/github.com/ethereum/go-ethereum
cd $GOPATH/src/github.com/ethereum/go-ethereum
go install -v ./cmd/geth

You’ve probably got two geths installed now, a system-wide one and the one that you just installed, as below.

ethuser@eth-host:~$ geth version | grep '^Version:'
Version: 1.7.3-stable
ethuser@eth-host:~$ $GOPATH/bin/geth version | grep '^Version:'
Version: 1.8.0-unstable

Before we do anything else, let’s double-check that you can run the newly installed geth. Assuming you used the guide in the previous post:

cd ~
$GOPATH/bin/geth --datadir node1 --networkid 98765 console

Add some logging to geth. See this commit which provides additional logging during the mining process:

https://github.com/alanbuxton/go-ethereum/commit/3990f15fab13d32f9b9f0224367295f285f7f9db#diff-d48da17e6237684b832b97db3107179f

Apply these changes to your local version, or if you prefer you can clone the branch in my fork that already has these changes:

git clone -b extra_logging --depth 1 https://github.com/alanbuxton/go-ethereum.git $GOPATH/src/github.com/ethereum/go-ethereum

Rebuild your local version using go install -v ./cmd/geth and you’ll now get some more logging when you mine in your private network. See below for example output with the extra logging under 4GB, 2GB and 1GB virtual machine configurations:

4GB

> miner.start(1);admin.sleepBlocks(1);miner.stop()
INFO [01-26|09:37:47] Updated mining threads threads=1
INFO [01-26|09:37:47] Transaction pool price threshold updated price=18000000000
INFO [01-26|09:37:47] Starting mining operation 
INFO [01-26|09:37:47] Commit new mining work number=34 txs=0 uncles=0 elapsed=125.978µs
INFO [01-26|09:37:47] Started ethash search for new nonces miner=0 seed=7377726478179259701
INFO [01-26|09:38:05] Still mining miner=0 attempts=1000 duration=18s
INFO [01-26|09:38:06] Still mining miner=0 attempts=2000 duration=19s
INFO [01-26|09:38:06] Still mining miner=0 attempts=4000 duration=19s
INFO [01-26|09:38:06] Still mining miner=0 attempts=8000 duration=19s
INFO [01-26|09:38:07] Still mining miner=0 attempts=16000 duration=20s
INFO [01-26|09:38:08] Still mining miner=0 attempts=32000 duration=21s
INFO [01-26|09:38:09] Finished mining miner=0 attempts=42332 duration=22s
INFO [01-26|09:38:09] Successfully sealed new block number=34 hash=c53c2f…45286c
INFO [01-26|09:38:09] 🔨 mined potential block number=34 hash=c53c2f…45286c

2GB

> miner.start(1);admin.sleepBlocks(1);miner.stop()
INFO [01-26|10:32:19] Updated mining threads threads=1
INFO [01-26|10:32:19] Transaction pool price threshold updated price=18000000000
INFO [01-26|10:32:19] Starting mining operation 
INFO [01-26|10:32:19] Commit new mining work number=37 txs=0 uncles=0 elapsed=129.734µs
INFO [01-26|10:32:19] Started ethash search for new nonces miner=0 seed=4187772785547710899
INFO [01-26|10:32:40] Still mining miner=0 attempts=1000 duration=21s
INFO [01-26|10:32:46] Still mining miner=0 attempts=2000 duration=27s
INFO [01-26|10:32:49] Still mining miner=0 attempts=4000 duration=30s
INFO [01-26|10:32:49] Still mining miner=0 attempts=8000 duration=30s
INFO [01-26|10:32:50] Still mining miner=0 attempts=16000 duration=31s
INFO [01-26|10:32:51] Still mining miner=0 attempts=32000 duration=32s
INFO [01-26|10:32:53] Finished mining miner=0 attempts=57306 duration=34s
INFO [01-26|10:32:53] Successfully sealed new block number=37 hash=068096…c31777
INFO [01-26|10:32:53] 🔨 mined potential block number=37 hash=068096…c31777

1GB

I cancelled after over half an hour…

> miner.start(1);admin.sleepBlocks(1);miner.stop()
INFO [01-26|09:51:58] Updated mining threads threads=1
INFO [01-26|09:51:58] Transaction pool price threshold updated price=18000000000
INFO [01-26|09:51:58] Starting mining operation 
INFO [01-26|09:51:58] Commit new mining work number=37 txs=0 uncles=0 elapsed=243.956µs
INFO [01-26|09:51:58] Started ethash search for new nonces miner=0 seed=1420748052411692923
INFO [01-26|09:53:11] Still mining miner=0 attempts=1000 duration=1m13s
INFO [01-26|09:54:23] Still mining miner=0 attempts=2000 duration=2m25s
INFO [01-26|09:56:52] Still mining miner=0 attempts=4000 duration=4m54s
INFO [01-26|10:01:53] Still mining miner=0 attempts=8000 duration=9m55s
INFO [01-26|10:11:49] Still mining miner=0 attempts=16000 duration=19m51s
INFO [01-26|10:30:27] Still mining miner=0 attempts=32000 duration=38m29s