Rip it up and start again, without ripping it up and starting again

Posted on August 5, 2023 by alanbuxton

Time for another update on my side project, Syracuse: http://syracuse-1145.herokuapp.com/

I got some feedback that the structure of the relationships was a bit unintuitive. Fair enough, let’s update it to make more sense.

Previously the code was doing the following:

1. Use RoBERTa-based LLM for entity extraction (the code is pretty old but works well)
2. Use Benepar dependency parsing to link up relevant entities with each other
3. Use FlanT5 LLM to dig out some more useful content from the text
4. Build up an RDF representation of the data
5. Clean up the RDF

Step 5 had got quite complex.

Also, I had a look at import/export of RDF in graphs – specifically Neo4J, but couldn’t see much excitement about RDF. I even made a PR to update some of the Neo4J / RDF documentation. It’s been stuck for 2+ months.

I wondered if a better approach would be to start again using a different set of technologies. Specifically,

Falcon7B instead of FlanT5
Just building the representation in a graph rather than using RDF

Falcon7B was very exciting to get a chance to try out. But in my use case it wasn’t any more useful than FlanT5.

Going down the graph route was a bit of fun for a while. I’ve used networkx quite a bit in the past so thought I’d try with that first. But, guess what, it turned out more complicated than I needed. Also I do like the simplicity and elegance of RDF, even if it makes me seem a bit old.

So the final choice was to rip up all my post-processing and turn it into pre-processing, and then generate the RDF. It was heart-breaking to throw away a lot of code, but, as programmers, I think we know when we’ve built something that is just too brittle and needs some heavy refactoring. It worked well in the end, see the git stats below:

code: 6 files changed, 449 insertions, 729 deletions
tests: 81 files changed, 3618 insertions, 1734 deletions

Yes, a lot of tests. It’s a data-heavy application so there are a lot of tests to make sure that data is transformed as expected. Whenever it doesn’t work, I add that data (or enough of it) as a test case and then fix it. Most of this test data was just changed with global find/replace so it’s not a big overhead to maintain. But having all those tests was crucial for doing any meaningful refactoring.

On the code side, it was very satisfying to remove more code than I was adding. It just showed how brittle and convoluted the codebase had become. As I discovered more edge cases I added more logic to deal with them. Eventually this ended up as lots of complexity. The new code is “cleaner”. I put clean in quotes because there is still a lot of copy/paste in there and similar functions doing similar things. This is because I like to follow “Make it work, make it right, make it fast”. Code that works but isn’t super elegant is going to be easier to maintain/fix/refactor later than code that is super-abstracted.

Some observations on the above:

Tests are your friend (obviously)
Expect to need major refactoring in the future. However well you capture all the requirements now, there will be plenty that have not yet been captured, and plenty of need for change
Shiny new toys aren’t always going to help – approach with caution
Sometimes the simple old-fashioned technologies are just fine
However bad you think an app is, there is probably still 80% in there that is good, so beware completely starting from scratch.

See below for the RDF as it stands now compared to before:

Current version

@prefix ns1: <http://example.org/test/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/test/abc/Core_Scientific> a org:Organization ;
    ns1:basedInRawLow "USA" ;
    ns1:buyer <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> ;
    ns1:description "Artificial Intelligence and Blockchain technologies" ;
    ns1:foundName "Core Scientific" ;
    ns1:industry "Artificial Intelligence and Blockchain technologies" ;
    ns1:name "Core Scientific" .

<http://example.org/test/abc/Stax_Digital> a org:Organization ;
    ns1:basedInRawLow "LLC" ;
    ns1:description "blockchain mining" ;
    ns1:foundName "Stax Digital, LLC",
        "Stax Digital." ;
    ns1:industry "blockchain mining" ;
    ns1:name "Stax Digital" .

<http://example.org/test/abc/Stax_Digital_Assets_Acquisition> a ns1:CorporateFinanceActivity ;
    ns1:activityType "acquisition" ;
    ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
    ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
        "Core Scientific completes acquisition of Stax Digital." ;
    ns1:foundName "acquired",
        "acquisition" ;
    ns1:name "acquired",
        "acquisition" ;
    ns1:status "completed" ;
    ns1:targetDetails "assets" ;
    ns1:targetEntity <http://example.org/test/abc/Stax_Digital> ;
    ns1:targetName "Stax Digital" ;
    ns1:whereRaw "llc" .

Previous version

@prefix ns1: <http://example.org/test/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/test/abc/Core_Scientific> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInLow "USA" ;
    ns1:description "Artificial Intelligence and Blockchain technologies" ;
    ns1:foundName "Core Scientific" ;
    ns1:industry "Artificial Intelligence and Blockchain technologies" ;
    ns1:name "Core Scientific" ;
    ns1:spender <http://example.org/test/abc/Purchase_Stax_Digital> .

<http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> a ns1:TargetDetails ;
    ns1:label "Assets" ;
    ns1:name "Acquired Assets Stax Digital, LLC" ;
    ns1:nextEntity "Stax Digital, LLC" ;
    ns1:previousEntity "acquired" ;
    ns1:targetEntity <http://example.org/test/abc/Stax_Digital> .

<http://example.org/test/abc/Purchase_Stax_Digital> a ns1:Activity ;
    ns1:activityType "Purchase" ;
    ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
    ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
        "Core Scientific completes acquisition of Stax Digital." ;
    ns1:label "Acquired",
        "Acquisition" ;
    ns1:name "Purchase Stax Digital" ;
    ns1:targetDetails <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> ;
    ns1:whenRaw "has happened, no date available" .

<http://example.org/test/abc/Stax_Digital> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInLow "LLC" ;
    ns1:description "blockchain mining" ;
    ns1:foundName "Stax Digital, LLC",
        "Stax Digital." ;
    ns1:industry "blockchain mining" ;
    ns1:name "Stax Digital" .

DIFF
1a2
> @prefix org: <http://www.w3.org/ns/org#> .
4,5c5,7
< <http://example.org/test/abc/Core_Scientific> a <http://www.w3.org/ns/org#Organization> ;
<     ns1:basedInLow "USA" ;
---
> <http://example.org/test/abc/Core_Scientific> a org:Organization ;
>     ns1:basedInRawLow "USA" ;
>     ns1:buyer <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> ;
9,10c11
<     ns1:name "Core Scientific" ;
<     ns1:spender <http://example.org/test/abc/Purchase_Stax_Digital> .
---
>     ns1:name "Core Scientific" .
12,31c13,14
< <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> a ns1:TargetDetails ;
<     ns1:label "Assets" ;
<     ns1:name "Acquired Assets Stax Digital, LLC" ;
<     ns1:nextEntity "Stax Digital, LLC" ;
<     ns1:previousEntity "acquired" ;
<     ns1:targetEntity <http://example.org/test/abc/Stax_Digital> .
< 
< <http://example.org/test/abc/Purchase_Stax_Digital> a ns1:Activity ;
<     ns1:activityType "Purchase" ;
<     ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
<     ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
<         "Core Scientific completes acquisition of Stax Digital." ;
<     ns1:label "Acquired",
<         "Acquisition" ;
<     ns1:name "Purchase Stax Digital" ;
<     ns1:targetDetails <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> ;
<     ns1:whenRaw "has happened, no date available" .
< 
< <http://example.org/test/abc/Stax_Digital> a <http://www.w3.org/ns/org#Organization> ;
<     ns1:basedInLow "LLC" ;
---
> <http://example.org/test/abc/Stax_Digital> a org:Organization ;
>     ns1:basedInRawLow "LLC" ;
36a20,34
> 
> <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> a ns1:CorporateFinanceActivity ;
>     ns1:activityType "acquisition" ;
>     ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
>     ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
>         "Core Scientific completes acquisition of Stax Digital." ;
>     ns1:foundName "acquired",
>         "acquisition" ;
>     ns1:name "acquired",
>         "acquisition" ;
>     ns1:status "completed" ;
>     ns1:targetDetails "assets" ;
>     ns1:targetEntity <http://example.org/test/abc/Stax_Digital> ;
>     ns1:targetName "Stax Digital" ;
>     ns1:whereRaw "llc" .

Software Development

Sometimes I miss Haskell’s immutability

Posted on February 3, 2023 (updated February 5, 2023) by alanbuxton

I have ‘fond’ memories of tracking down a particularly PITA bug in some Python code. It was down to whether using = makes two references to the same underlying object, or whether it makes a copy of the object in question. And equally fond memories of debugging a Ruby function that changed a hash unexpectedly.

This sort of thing:

>>> arr = [1,2,3]
>>> arr2 = arr
>>> arr
[1,2,3]
>>> arr2
[1,2,3]
>>> arr == arr2
True # Values are equivalent, objects are also the same
>>> id(arr)
4469618560 # the id for both arr and arr2 is the same: they are the same object
>>> id(arr2)
4469618560
>>> arr2.append(4)
>>> arr
[1,2,3,4]
>>> arr2
[1,2,3,4]

compared to

>>> arr = [1,2,3]
>>> arr2 = [1,2,3] + [] # arr2 is now a different object
>>> arr
[1,2,3]
>>> arr2
[1,2,3]
>>> arr == arr2
True # Values are the same even though the object is different
>>> id(arr)
4469618560
>>> id(arr2)
4469669184 # value is different
>>> arr2.append(4)
>>> arr
[1,2,3]
>>> arr2
[1,2,3,4]

Or, in a case that you’re more likely to see in the wild

>>> arr = [1,2,3]
>>> d1 = {"foo":arr, "bar": "baz"}
>>> d2 = d1
>>> d3 = dict(d1) # Creates new dict, with its own id
>>> id(d1)
140047012964608
>>> id(d2)
140047012964608
>>> id(d3)
140047012965120
>>> arr.append(4)
>>> d1["bar"] += "x"
>>> d1
{'foo': [1, 2, 3, 4], 'bar': 'bazx'}
>>> d2
{'foo': [1, 2, 3, 4], 'bar': 'bazx'}
>>> d3
{'foo': [1, 2, 3, 4], 'bar': 'baz'}

d1 and d2 are the same dict so when you change the value of bar, both have the same result, but d3 still has the old value. But, even though the dicts are different, the arr in them is the same one. So anything that mutates that list will change it in all the dicts.

This sort of behaviour has its logic but it is a logic you have to learn, and a logic you often have to learn the painful way. I particularly ‘enjoy’ cases where you’ve gone to great lengths to make sure the object is a copy and then find that something nested inside it still gets mutated.
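As a small illustration of that last trap, here is a sketch using the standard library's copy module. The variable names shallow and deep are just for this example: dict(d1) only copies the outer dict, whereas copy.deepcopy also copies the list inside it.

```python
import copy

arr = [1, 2, 3]
d1 = {"foo": arr, "bar": "baz"}

shallow = dict(d1)        # new outer dict, but "foo" still refers to arr
deep = copy.deepcopy(d1)  # new outer dict and a new, independent list

arr.append(4)

print(shallow["foo"])         # [1, 2, 3, 4]
print(deep["foo"])            # [1, 2, 3]
print(shallow["foo"] is arr)  # True
print(deep["foo"] is arr)     # False
```

deepcopy walks the whole structure, so it costs more than a shallow copy. It is worth reaching for only when nested mutable values really do need to be independent.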

It doesn’t have to be like this.

In what is probably the most beautiful language I’ve ever used (though sadly only in personal projects, not for work), Haskell, everything is immutable.

If you’ve never tried Haskell this might sound incomprehensible, but it’s just a different approach to programming. Two good links about this:

Immutability is Awesome

What does immutable variable in Haskell mean

This language feature eliminates a whole class of bugs related to changing an object unexpectedly. So I’d encourage any software developers to try to get their heads around a language like Haskell. It opened my eyes to a whole different approach to writing code (e.g. focussing on individual functions rather than starting with a db schema and working out from there).

Software Development

Coding with ChatGPT

Posted on January 21, 2023 by alanbuxton

I’ve been using ChatGPT to help with some coding problems. In all the cases I’ve tried it has been wrong but has given me useful ideas. I’ve seen some extremely enthusiastic people who are saying that ChatGPT writes all their code for them. I can only assume that they mean it is applying common patterns for them and saving boilerplate work. Here is a recent example of an interaction I had with ChatGPT as an illustration.

The initial prompt:

Hi, I want to write a python function that will find common subsets that can be extracted from a list of sets. A common subset is one where several elements always appear together.

For example with the following sets:
s1 = {"a","b","c"}
s2 = {"a","b","c"}
s3 = {"c"}
s4 = {"d","e"}
s5 = {"d","e","f"}
s6 = {"d","e","f","g"}

The function should return
[{"a","b"},{"d","e"}]

What I liked about using it:

It forced me to think about an individual function that can be tested in isolation
It forced me to think really explicitly in terms of the inputs and outputs of the function
The answers it provided, specifically using itertools.combinations, gave me a good direction to try out.

What I didn’t like about using it:

The code didn’t work
It gives the impression of knowing what it’s talking about. I have to keep reminding myself that it’s just producing a wall of text/code based on what it has seen in the past. Some of the stuff it generates is complete nonsense (see below when it tries to explain what output a particular function would create, or whether a set is a subset of another set).
It’s a walled garden. Whereas in another life I might have posted my question on StackOverflow and got an answer that others could also benefit from, this discussion is just between me and ChatGPT. Not that StackOverflow is a panacea – I’ve still got an unanswered question on there from 3+ years ago, but, nevertheless, there is an important principle at stake.

See below the screenshots of the whole interaction:

Needless to say, this final piece of code didn’t return the expected output either.

Full code of the above conversation is below for anyone who wants to do some cut and paste.

Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> def common_subsets(sets):
...     # Initialize an empty set to store the common subsets
...     common = set()
...     # Iterate through all possible subsets of the first set
...     for subset in itertools.chain.from_iterable(itertools.combinations(sets[0], r) for r in range(len(sets[0])+1)):
...         subset = set(subset)
...         # Check if the current subset is present in all other sets
...         if all(subset <= s for s in sets[1:]):
...             common.add(subset)
...     return list(common)
... 
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in common_subsets
NameError: name 'itertools' is not defined
>>> import itertools
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 9, in common_subsets
TypeError: unhashable type: 'set'
>>> def common_subsets(sets):
...     # Initialize an empty set to store the common subsets
...     common = set()
...     # Iterate through all possible subsets of the first set
...     for subset in itertools.chain.from_iterable(itertools.combinations(sets[0], r) for r in range(len(sets[0])+1)):
...         subset = frozenset(subset)
...         # Check if the current subset is present in all other sets
...         if all(subset <= s for s in sets[1:]):
...             common.add(subset)
...     return list(common)
... 
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[frozenset()]
>>> from typing import List
>>> 
>>> def common_subsets(sets:List[set]) -> List[set]:
...     common = set()
...     for subset in itertools.chain.from_iterable(itertools.combinations(sets[0], r) for r in range(1,len(sets[0])+1)):
...         subset = frozenset(subset)
...         if all(subset <= s for s in sets[1:]):
...             common.add(subset)
...     return list(common)
... 
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[]
>>> from typing import List
>>> 
>>> def common_subsets(sets:List[set]) -> List[set]:
...     subsets = [set(itertools.combinations(s, r)) for s in sets for r in range(1, len(s) + 1)]
...     common = set.intersection(*subsets)
...     return list(common)
... 
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[]
>>> from typing import List
>>> 
>>> def common_subsets(sets:List[set]) -> List[set]:
...     # find all subsets of all sets
...     subsets = [set(itertools.combinations(s, r)) for s in sets for r in range(1, len(s) + 1)]
...     # find subsets that are in all subsets list
...     common = set.intersection(*subsets)
...     # check if subsets are present in all sets
...     common = [subset for subset in common if all(subset <= s for s in sets)]
...     return list(common)
... 
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[]

And at the risk of embarrassing myself in front of the whole world, and for the benefit of any future ChatGPT iteration that is reading this post, below is the solution I did come up with. Improvements welcome 😀

import itertools
from typing import List

def get_all_set_combinations(sets: List):
    all_combinations = set()
    for s in sets:
        if len(s) == 1:
            continue
        for l in range(2,len(s)+1):
            combos = itertools.combinations(s,l)
            for x in combos:
                all_combinations.add(tuple(x))
    return all_combinations

def find_extractable_subsets(sets: List):
    combos = get_all_set_combinations(sets)
    matching = set()
    for combo in sorted(combos, key=len, reverse=True):
        combo_set = set(combo)
        if not is_candidate_set_extractable(combo_set, sets):
            continue
        addable = True
        for x in matching:
            if combo_set & set(x) == combo_set:
                addable = False
                break
        if addable:
            matching.add(combo)
    return matching

def is_candidate_set_extractable(candidate, sets):
    for s in sets:
        # the candidate survives only if each set either contains it fully or shares nothing with it
        if (candidate & s) == candidate or (candidate & s) == set():
            continue
        else:
            return False
    return True


### And can be tested with:
s1 = {"a","b","c"}
s2 = {"a","b","c"}
s3 = {"c"}
s4 = {"d","e"}
s5 = {"d","e","f"}
s6 = {"d","e","f","g"}
find_extractable_subsets([s1,s2,s3,s4,s5,s6])

# With the expected result:
# {('b', 'a'), ('e', 'd')}

# it only picks the longest matching subsets, e.g.
find_extractable_subsets([s1,s2,s4,s5,s6])

# produces expected result:
# {('e', 'd'), ('b', 'c', 'a')}
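
Since improvements are welcome, here is one possible alternative. It is a sketch, not the code from the post. The idea is that two elements "always appear together" exactly when they occur in the same collection of sets, so each element can be labelled with the indexes of the sets it appears in and elements sharing a label can be grouped. The function name find_extractable_subsets_by_signature is made up for this example, and it returns frozensets rather than tuples.

```python
from collections import defaultdict
from typing import FrozenSet, List, Set


def find_extractable_subsets_by_signature(sets: List[set]) -> Set[FrozenSet]:
    # Map each element to the indexes of the sets it appears in.
    membership = defaultdict(set)
    for index, s in enumerate(sets):
        for element in s:
            membership[element].add(index)

    # Elements with an identical membership "signature" always appear together.
    groups = defaultdict(set)
    for element, indexes in membership.items():
        groups[frozenset(indexes)].add(element)

    # A group of one is not an interesting subset, so keep groups of two or more.
    return {frozenset(group) for group in groups.values() if len(group) > 1}


s1 = {"a", "b", "c"}
s2 = {"a", "b", "c"}
s3 = {"c"}
s4 = {"d", "e"}
s5 = {"d", "e", "f"}
s6 = {"d", "e", "f", "g"}

print(find_extractable_subsets_by_signature([s1, s2, s3, s4, s5, s6]))
print(find_extractable_subsets_by_signature([s1, s2, s4, s5, s6]))
```

On the two test cases above this gives the same groupings, {a, b} and {d, e} for the first and {a, b, c} and {d, e} for the second. It avoids enumerating every combination, which grows quickly as the sets get larger.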

Machine Learning, Software Development, Supply Chain Management

Comparison of Transformers vs older ML architectures in Spend Classification

Posted on June 11, 2022 by alanbuxton

I recently wrote a piece for my company blog about why Transformers are a better machine learning technology to use in your spend classification projects compared to older ML techniques.

Transformers: More Than Meets the Eye (in Spend Classification)

That was a theoretical post that discussed things like sub-word tokenization and self-attention and how these architectural features should be expected to deliver improvements over older ML approaches.

During the Jubilee Weekend, I thought I’d have a go at doing some real-world tests. I wanted to do a simple test to see how much of a difference this all really makes in the spend classification use case. The code is here: https://github.com/alanbuxton/tbfy-cpv-classifier-poc

TL;DR – Bidirectional LSTM is a world away from Support Vector Machines but Transformers have the edge over Bi-LSTM. In particular they are more tolerant of spelling inconsistencies.

This is an update of the code I did for this post: https://alanbuxton.wordpress.com/2021/10/25/transformers-vs-spend-classification/ in which I trained the Transformer for 20 epochs. In this case it was 15 epochs. FWIW the 20-epoch version was better at handling the ‘mobile office’ example. This does indicate that better results will be achieved with more training. But for the purposes of the current blog post there wasn’t any need to go further.

Machine Learning, Software Development

Analyzing a WhatsApp group chat with Seaborn, NetworkX and Transformers

Posted on May 30, 2022 by alanbuxton

We had a company shutdown recently. Simfoni operates ‘Anytime Anywhere’, which means that anyone can work whatever hours they feel are appropriate from wherever they want to. Every quarter we mandate a full company shutdown over a long weekend to make sure that we all take time away from work at the same time and come back to a clear inbox.

For me this meant a bunch of playing with my kids and hanging out in the garden.

But it also meant playing with some fun tech courtesy of a brief challenge I was set: what insights could I generate quickly from a WhatsApp chat list.

I had a go using some of my favourite tools: Seaborn for easy data visualization; Huggingface Transformers for ML insights and Networkx for graph analysis.

You can find the repo here: https://github.com/alanbuxton/whatsapp-analysis

Software Development

python ‘in’ set vs list

Posted on January 11, 2020 by alanbuxton

Came across a comment on this Stackoverflow question which concerned converting a list to a set before looking for items in that set.

The point of set is to convert in from an O(n) operator to an O(1) operator. In this case it doesn’t really matter, but for larger data sets it would be inefficient to have multiple linear scans over a list.

Which led me to look a bit more at time complexity in python functions in general.

In the past I would only convert a list to a set if I wanted to remove duplicates. But evidently it’s a good habit to get into if speed of looking in that set is important. Though, obviously, converting list to set is O(n) in the first place.
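
A rough way to see the difference is to time it. The sketch below uses the standard library's timeit. The size n and the number of repetitions are arbitrary choices for illustration, and the actual timings will vary by machine.

```python
import timeit

n = 100_000
items_list = list(range(n))
items_set = set(items_list)  # one-off O(n) conversion
needle = n - 1               # worst case for the list: the last element

# Each list lookup scans the whole list; each set lookup is a hash lookup.
list_time = timeit.timeit(lambda: needle in items_list, number=100)
set_time = timeit.timeit(lambda: needle in items_set, number=100)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```

The conversion only pays for itself if the set is searched more than a handful of times. For a single lookup, building the set costs about as much as the scan it replaces.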

Software Development

Python: bind method outside loop to reduce overhead

Posted on December 13, 2019 by alanbuxton

Came across this interesting pattern in some of the sklearn codebase:

        # bind method outside of loop to reduce overhead
        ngrams_append = ngrams.append

        for n in range(min_n, min(max_n + 1, text_len + 1)):
            for i in range(text_len - n + 1):
                ngrams_append(text_document[i: i + n])
        return ngrams

I wonder, at what scale does this really start to make a meaningful performance difference?

Even if it doesn’t make a huge difference, I do find it pretty elegant and easy to follow.
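
One way to answer that question is to measure it. The sketch below is not the sklearn code: ngrams_plain and ngrams_bound are simplified stand-ins written for this comparison, and the input text and repetition count are arbitrary.

```python
import timeit


def ngrams_plain(text, min_n, max_n):
    ngrams = []
    text_len = len(text)
    for n in range(min_n, min(max_n + 1, text_len + 1)):
        for i in range(text_len - n + 1):
            ngrams.append(text[i: i + n])  # attribute lookup on every iteration
    return ngrams


def ngrams_bound(text, min_n, max_n):
    ngrams = []
    ngrams_append = ngrams.append  # look the bound method up once
    text_len = len(text)
    for n in range(min_n, min(max_n + 1, text_len + 1)):
        for i in range(text_len - n + 1):
            ngrams_append(text[i: i + n])
    return ngrams


text = "the quick brown fox " * 200

plain_time = timeit.timeit(lambda: ngrams_plain(text, 1, 3), number=50)
bound_time = timeit.timeit(lambda: ngrams_bound(text, 1, 3), number=50)

print(f"plain: {plain_time:.4f}s  bound: {bound_time:.4f}s")
```

How big the gap is depends on the interpreter version and the workload, and on recent CPython releases it may be small, so it is worth running something like this against realistic input before relying on the trick.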

Software Development

Comparing Python vs Ruby’s handling of empty Lists/Arrays

Posted on April 8, 2019 (updated April 9, 2019) by alanbuxton

TIL Ruby is more forgiving when handling empty arrays than Python is with empty lists. If you try to access a non-existent index in Ruby you get a nil back. In Python you get an IndexError. I prefer the Python approach. It’s more straightforward.

Though Python also has a curious way that allows you to work around IndexError by trying to access a range of items in the List. And Ruby does also have an approach that will give you an appropriate error if the index is out of bounds. So who knows what the logic behind all of this is supposed to be ….

Python Version

Compare the following Python commands

empty = []
arr = ["here","are","some","words"]
empty[0]
arr[0]
empty[0:1]
arr[0:1]
" ".join(empty[0:1])
" ".join(arr[0:1])

Results of the Python commands:

>>> empty = []
>>> arr = ["here","are","some","words"]
>>> empty[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>> arr[0]
'here'
>>> empty[0:1]
[]
>>> arr[0:1]
['here']
>>> " ".join(empty[0:1])
''
>>> " ".join(arr[0:1])
'here'

Ruby version

See the following Ruby commands

empty = []
arr = ["here","are","some","words"]
empty[0]
arr[0]
empty[0..0]
arr[0..0]
empty[0..0].join(" ")
arr[0..0].join(" ")

Results of the Ruby commands:

irb(main):007:0> empty = []
=> []
irb(main):008:0> arr = ["here","are","some","words"]
=> ["here", "are", "some", "words"]
irb(main):009:0> empty[0]
=> nil
irb(main):010:0> arr[0]
=> "here"
irb(main):011:0> empty[0..0]
=> []
irb(main):012:0> arr[0..0]
=> ["here"]
irb(main):013:0> empty[0..0].join(" ")
=> ""
irb(main):014:0> arr[0..0].join(" ")
=> "here"

Comparison

Python is pretty unforgiving if you try to access a non-existent element. Ruby is forgiving enough to give you a nil if the element is non-existent. Downside of the Ruby approach is that you don’t know if the nil means “there was a nil there” or “there was nothing there”.

If you want the less forgiving approach in Ruby you can use fetch like so:

irb(main):031:0> empty
=> []
irb(main):032:0> empty.fetch(0)
Traceback (most recent call last):
5: from /snap/ruby/132/bin/irb:23:in `<main>'
4: from /snap/ruby/132/bin/irb:23:in `load'
3: from /snap/ruby/132/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
2: from (irb):32
1: from (irb):32:in `fetch'
IndexError (index 0 outside of array bounds: 0...0)

Software Development

Fun with True and False in Python

Posted on January 30, 2019 (updated April 9, 2019) by alanbuxton

TIL: In Python, True == 1 and False == 0

>>> var0 = 0
>>> var1 = 1
>>> var0 == True
False
>>> var0 == False
True
>>> var1 == True
True
>>> var1 == False
False

The number 2, however, is equal to neither True nor False. But it is truthy.

>>> var2 = 2
>>> var2 == True
False
>>> var2 == False
False
>>> bool(var2)
True

Which has the following interesting side effects when testing conditions

>>> if (var1 == True): print("***  var1 is true ***")
... else: print("*** var1 is not true ***")
... 
***  var1 is true ***

>>> if (var2 == True): print("*** var2 is true ***")
... else: print("*** var2 is not true ***")
... 
*** var2 is not true ***

>>> if (bool(var2)): print("*** var2 is truthy ***")
... else: print("*** var2 is not truthy ***")
... 
*** var2 is truthy ***

Admittedly, this is a bit artificial because in reality if you wanted to test whether var2 is truthy you’d just do:

>>> if var2: print("*** var2 is truthy ***")
... else: print("*** var2 is not truthy ***")
...
*** var2 is truthy ***

But I did enjoy discovering this little nugget as it means you can easily count up the number of Trues in a list like so:

>>> sum([True,False,True,True,False,False,True])
4
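
The reason this works is that bool is a subclass of int in Python, so True and False behave as 1 and 0 in arithmetic. A short sketch of what that allows (the words list is just example data):

```python
print(isinstance(True, int))  # True
print(True + True)            # 2

flags = [True, False, True, True, False, False, True]
print(sum(flags))             # 4

# The same trick counts how many items satisfy a condition.
words = ["here", "are", "some", "words"]
print(sum(len(w) > 3 for w in words))  # 3
```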

Software Development

Using local settings in a Scrapy project

Posted on October 9, 2018 (updated December 9, 2018) by alanbuxton

TL;DR Seeing a pattern used in one framework can help you address similar problems in a different framework.

I was working on a scrapy project that would save pages into a local database. We wanted to be able to test it using a test database, just as you would with Rails.

We could see how to configure a database connection string in the settings.py config file, but this didn’t help switch between the development and test databases.

Stackoverflow wasn’t much help, so we ended up rolling our own:

Edit the settings.py file so it would read from additional settings files depending on a SCRAPY_ENV environment variable
Move all the settings files to a separate config directory (and change scrapy.cfg so it knows where to look)
Edit .gitignore so that local files won’t get committed to the repo (and then add some .sample files)

Git repo is here.

The magic happens at the end of settings.py:

from importlib import import_module
from scrapy.utils.log import configure_logging
import logging
import os

SCRAPY_ENV = os.environ.get('SCRAPY_ENV')
if SCRAPY_ENV is None:
    raise ValueError("Must set SCRAPY_ENV environment var")
logger = logging.getLogger(__name__)
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

# Load if file exists; incorporate any names started with an
# uppercase letter into globals()
def load_extra_settings(fname):
    if not os.path.isfile("config/%s.py" % fname):
        logger.warning("Couldn't find %s, skipping" % fname)
        return
    mdl=import_module("config.%s" % fname)
    names = [x for x in mdl.__dict__ if x[0].isupper()]
    globals().update({k: getattr(mdl,k) for k in names})

load_extra_settings("secrets")
load_extra_settings("secrets_%s" % SCRAPY_ENV)
load_extra_settings("settings_%s" % SCRAPY_ENV)

It feels a bit hacky, but it does the job. I would still love to learn a more pythonic way to address this issue.
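
One possible variation, offered as a sketch rather than a definitive answer, is to load each extra settings file by path and return its settings as a dict, leaving the caller to decide what to do with them. The function name load_uppercase_settings is made up for this example. Unlike the code above, it keeps only names that are entirely upper-case, which is an assumption about how the settings are named.

```python
import importlib.util
from pathlib import Path


def load_uppercase_settings(path):
    """Return the upper-case names defined in the Python file at `path`.

    Returns an empty dict when the file does not exist.
    """
    path = Path(path)
    if not path.is_file():
        return {}
    # Load the file as a throwaway module without touching sys.path.
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return {name: value for name, value in vars(module).items() if name.isupper()}
```

In settings.py this could then be used along the lines of globals().update(load_uppercase_settings("config/settings_%s.py" % SCRAPY_ENV)), which keeps the file-loading logic testable on its own.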

Be good, work hard, get lucky


Tag: python
