Software Development

Hacker News wisdom on dealing with a huge, crappy codebase

Hmmm, is it really that crappy? It’s supporting 20 million USD of revenue with just 3 developers. Yes, ok, I buy that the code is terrible and makes you feel ill whenever you have to look at it.

The comments are heartening to read. In general, people recommend against a major rewrite. Instead, they recommend writing tests to pin down how current behaviour works and then refactoring piece by piece. It’s along the lines of the Strangler Fig pattern: https://martinfowler.com/bliki/StranglerFigApplication.html

If you read the HN thread and still think that the best approach is to start again from scratch, I refer you to https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/

And if you still think a clean slate is the way to go, I give you this story: In 1971 the UK changed from using pounds, shillings and pence to using decimal currency. In 1995 I was training as a programmer. The teacher told us that two of the four main banks in the UK still calculated interest payments in pounds, shillings and pence. They had just put a wrapper that converted to/from decimal currency around this legacy code.

If you are dealing with poorly-understood code that you can’t really touch without risking everything falling to pieces, you are definitely not alone. It is fixable, given enough time and money. But it’s very unlikely that it can be fixed in one fresh rewrite. The most realistic approach is to address it piece by piece.

In this vein, I found this comment particularly interesting because it represents a small, concrete first step towards solving the problem:

You need to develop a small app that will handle authentication/authorization. Next time a feature comes in, you implement that page in the new stack; the rest of the production pages still run on the old code base.

That’s it.

A concurrent small migration to the new system, without changing the whole system at once.

Why does it work? New systems often fail to encapsulate all the complexity, and running two systems duplicates your workload until you decide to drop the new system because of the first problem.

Finally, get stats from nginx and figure out which routes haven’t been used in a month. Try disabling some and see how many dead routes you can find and clean up.

https://news.ycombinator.com/item?id=32887439

Will it work? Maybe, maybe not, but this sort of mindset – looking for specific problems that can be fixed, and not getting overwhelmed by the extent of all the problems in the codebase – is the best way to address this kind of challenge.
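That last suggestion – finding dead routes from web server stats – is easy to prototype. Here is a minimal sketch in Python, assuming nginx’s default combined log format; the log path and the list of known routes are illustrative, not real:

        import re
        from collections import Counter

        # Matches the request portion of nginx's default "combined" log format,
        # e.g. '"GET /reports?page=2 HTTP/1.1"'
        request_re = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/')

        hits = Counter()
        with open("access.log") as logfile:  # a month's worth of access logs
            for line in logfile:
                match = request_re.search(line)
                if match:
                    hits[match["path"].split("?")[0]] += 1  # ignore query strings

        known_routes = {"/", "/login", "/reports"}  # hypothetical: your app's route list
        print("Routes with no hits this month:", known_routes - set(hits))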

Machine Learning, Software Development, Supply Chain Management

Comparison of Transformers vs older ML architectures in Spend Classification

I recently wrote a piece for my company blog about why Transformers are a better machine learning technology to use in your spend classification projects compared to older ML techniques.

That was a theoretical post that discussed things like sub-word tokenization and self-attention and how these architectural features should be expected to deliver improvements over older ML approaches.

During the Jubilee Weekend, I thought I’d have a go at a simple real-world test to see how much of a difference this all really makes in the spend classification use case. The code is here: https://github.com/alanbuxton/tbfy-cpv-classifier-poc

TL;DR – Bidirectional LSTM is a world away from Support Vector Machines, but Transformers have the edge over Bi-LSTM. In particular, they are more tolerant of spelling inconsistencies.
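To make the spelling point concrete, here is a minimal sketch of sub-word tokenization, using the off-the-shelf roberta-base tokenizer (the words, and the exact splits in the comments, are illustrative). A misspelt word still decomposes into mostly familiar sub-word pieces, whereas a word-level model would see a single unknown token:

        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("roberta-base")

        # A misspelling still yields mostly familiar sub-word pieces, so the model
        # degrades gracefully where a word-level model would see one unknown token
        print(tokenizer.tokenize("stationery"))   # e.g. ['station', 'ery']
        print(tokenizer.tokenize("stationnery"))  # e.g. ['station', 'n', 'ery']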

This is an update of the code I wrote for this post: https://alanbuxton.wordpress.com/2021/10/25/transformers-vs-spend-classification/, in which I trained the Transformer for 20 epochs; this time it was 15 epochs. FWIW the 20-epoch version was better at handling the ‘mobile office’ example, which does suggest that more training would deliver better results. But for the purposes of the current blog post there was no need to go further.

Machine Learning, Software Development

Analyzing a WhatsApp group chat with Seaborn, NetworkX and Transformers

We had a company shutdown recently. Simfoni operates ‘Anytime Anywhere’, which means that anyone can work whatever hours they feel are appropriate from wherever they want to. Every quarter we mandate a full company shutdown over a long weekend to make sure that we all take time away from work at the same time and come back to a clear inbox.

For me this meant a bunch of playing with my kids and hanging out in the garden.

But it also meant playing with some fun tech, courtesy of a brief challenge I was set: what insights could I generate quickly from a WhatsApp group chat?

I had a go using some of my favourite tools: Seaborn for easy data visualization, Huggingface Transformers for ML insights and NetworkX for graph analysis.

You can find the repo here: https://github.com/alanbuxton/whatsapp-analysis
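For a flavour of the approach (the real code is in the repo), here is a minimal sketch: parse the exported chat, run an off-the-shelf sentiment pipeline over the messages, and build a naive who-replies-to-whom graph. The export-format regex is an assumption – WhatsApp’s export format varies by locale – and the reply heuristic is deliberately crude:

        import re
        import networkx as nx
        from transformers import pipeline

        # Parse export lines like "01/05/2022, 10:15 - Alice: Hello!" (format assumption)
        line_re = re.compile(r"^\d{2}/\d{2}/\d{4}, \d{2}:\d{2} - (?P<sender>[^:]+): (?P<text>.+)$")

        messages = []
        with open("chat.txt", encoding="utf-8") as f:
            for line in f:
                m = line_re.match(line.strip())
                if m:
                    messages.append((m["sender"], m["text"]))

        # ML insight: sentiment per message via an off-the-shelf pipeline
        sentiment = pipeline("sentiment-analysis")
        scores = sentiment([text for _, text in messages[:100]])

        # Graph analysis: naively assume each message is a reply to the previous one
        g = nx.DiGraph()
        for (prev_sender, _), (sender, _) in zip(messages, messages[1:]):
            if prev_sender != sender:
                g.add_edge(sender, prev_sender)
        print(nx.degree_centrality(g))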

Product Management, Software Development

The Now/Next/Later Roadmap changed my life

I love ideas that seem so obvious in retrospect. The Now/Next/Later roadmap is one of those: https://www.prodpad.com/blog/how-to-build-a-product-roadmap-everyone-understands/

The traditional roadmap is a group of chevrons or bars set out in quarters, one or two years into the future. It says: in Q1 we will deliver features A and B; in Q2, features C and D; and so on.

If you can plan this far ahead then fantastic. But many of us need to be more nimble, adjusting priorities in the face of changing needs and opportunities in the marketplace. For us, this kind of roadmap is too rigid a way to communicate future product development plans.

Its problems stem from the way it combines prioritisation and timelines into one view. It allows different stakeholders to have very different interpretations of what done looks like for any of these future developments. And it obscures whether we need to do any work now for something that is planned three quarters away.

Now/Next/Later fixes an important part of this by separating out relative priorities from timelines. At Simfoni, implementing Now/Next/Later has helped me have these sorts of conversations:

  1. Which of the items in now is genuinely something we should be working on now in support of company OKRs? And is everything in now really well-defined enough for us to be working on it right now?
  2. For the stuff that isn’t in now, which pieces should be in next and which pieces should be in later?
  3. What requirements/analysis/user testing do we need to do now on items that are in next and later so that we can be ready to build them when the time comes?

I even love the language of Now/Next/Later. Now clearly means something we should be working on right now. The less you have in now the more focussed you can be and the sooner you can get something properly done. Next means we need to get ready for something that we will be working on shortly, so we better firm up exactly what we need to do. And later is altogether more speculative and might need some early stage proof of concept work but it’s not something we should be spending much time discussing at the moment.

How to then apply timelines to the results of the now/next/later work is another topic. But so far I am finding that it helps to make the problem space smaller: we should be able to have pretty firm time-boxed estimates for work in now, and then things will get less precise as we move out into next and later.

All credit to Janna Bastow, who invented it; see https://www.prodpad.com/blog/the-birth-of-the-modern-roadmap/ for the history.

Machine Learning, Software Development

Simplified history of NLP Transformers

(Some notes I made recently and posting here in case of interest to others – see the tables below)

The transformers story was kicked off by the “Attention is all you need” paper, published in mid-2017 (see the “Key Papers” section below). This eventually led to use cases like Google implementing transformers to improve its search in 2019/2020 and Microsoft implementing transformers to simplify writing code in 2021 (see the “Real-world use of Transformers” section below).

For the rest of us, Huggingface has been producing some great code libraries for working with transformers. The library was under heavy development in 2018-2019, being renamed twice along the way – an indicator of how in flux this area was at the time – but it’s fair to say that it has stabilised a lot over the past year. See the “Major Huggingface Releases” section below.

Another recent data point: Coursera’s Deep Learning Specialisation was based around using Google Brain’s Trax (https://github.com/google/trax). As of October 2021, Coursera has announced that (in addition to doing some of the course with Trax) the transformers part now uses Huggingface.

It feels like transformers have now reached a level of maturity where it makes sense to embed them into more real-world use cases. We will inevitably have to go through the Gartner Hype Cycle’s peak of inflated expectations and trough of disillusionment, so it’s important not to let expectations get too far ahead of reality. But even with that caveat in mind, now is a great time to be experimenting with Huggingface’s transformers.
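The barrier to entry for that experimentation is genuinely low now. As a minimal sketch (the model downloads on first run, and the exact predictions will vary):

        from transformers import pipeline

        # An off-the-shelf pretrained masked language model - no training required
        unmasker = pipeline("fill-mask", model="bert-base-uncased")
        print(unmasker("Transformers are a [MASK] technology."))
        # Returns the top candidates for the masked word, with confidence scores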

Key Papers

Jun 2017: “Attention is all you need” published. https://arxiv.org/abs/1706.03762
Oct 2018: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” published. https://arxiv.org/abs/1810.04805
Jul 2019: “RoBERTa: A Robustly Optimized BERT Pretraining Approach” published. https://arxiv.org/abs/1907.11692
May 2020: “Language Models are Few-Shot Learners” published, describing the use of GPT-3. https://arxiv.org/abs/2005.14165

Real-world use of Transformers

Nov 2018: Google open-sources the BERT code. https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
Oct 2019: Google starts rolling out its BERT implementation for search. https://searchengineland.com/faq-all-about-the-bert-algorithm-in-google-search-324193
May 2020: OpenAI introduces GPT-3. https://en.wikipedia.org/wiki/GPT-3
Oct 2020: Google is using BERT on “almost every English-language query”. https://searchengineland.com/google-bert-used-on-almost-every-english-query-342193
May 2021: Microsoft introduces GPT-3 into Power Apps. https://powerapps.microsoft.com/en-us/blog/introducing-power-apps-ideas-ai-powered-assistance-now-helps-anyone-create-apps-using-natural-language/

Major Huggingface Releases

Nov 2018: Initial 0.1.2 release of pytorch-pretrained-bert. https://github.com/huggingface/transformers/releases/tag/v0.1.2
Jul 2019: v1.0 of the library, including a rename from pytorch-pretrained-bert to pytorch-transformers. https://github.com/huggingface/transformers/releases/tag/v1.0.0
Sep 2019: v2.0, this time with a rename from pytorch-transformers to, simply, transformers. https://github.com/huggingface/transformers/releases/tag/v2.0.0
Jun 2020: v3.0 of transformers. https://github.com/huggingface/transformers/releases/tag/v3.0.0
Nov 2020: v4.0 of transformers. https://github.com/huggingface/transformers/releases/tag/v4.0.0

Machine Learning, Software Development

Is it worth training an NLP Transformer from scratch?

In https://alanbuxton.wordpress.com/2021/09/21/transformers-for-use-oriented-entity-extraction/ I wrote about training a custom transformer-based model to do a type of entity extraction. I tried training from scratch because the source text had been preprocessed/lemmatized in such a way as to include over 20 custom tokens that the RoBERTa model wouldn’t know about. My assumption was that this text would be so different from normal English that you might as well treat it as its own language.

Once I saw the results, I decided to test this hypothesis by comparing the custom model on the preprocessed/lemmatized text against an out-of-the-box roberta-base model fine-tuned on the raw version of the text.

Turns out that, for me, the fine-tuned RoBERTa model always outperformed the model trained from scratch, though the difference in performance becomes pretty minimal once you’re in the millions of sentences.

Conclusion – when working in this space, don’t make assumptions. Stand on the shoulders of as many giants as possible.

Approx number of sentences for fine-tuning | F1 score: RoBERTa from scratch | F1 score: fine-tuned roberta-base
1,500     | 0.39182 | 0.52233
15,000    | 0.75294 | 0.97764
40,000    | 0.92639 | 0.99494
65,000    | 0.97260 | 0.99627
125,000   | 0.99105 | 0.99776
300,000   | 0.99670 | 0.99797
600,000   | 0.99771 | 0.99866
960,000   | 0.99783 | 0.99865
1,400,000 | 0.99810 | 0.99888
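For reference, the fine-tuning side of the comparison looks something like the sketch below. This is not the actual experiment code – the real task was entity extraction, and here a toy two-label sequence-classification task stands in for it – but it shows the “shoulders of giants” pattern of starting from pretrained roberta-base weights:

        import torch
        from torch.utils.data import Dataset
        from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                                  Trainer, TrainingArguments)

        tokenizer = AutoTokenizer.from_pretrained("roberta-base")
        # Pretrained weights plus a fresh classification head for our labels
        model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

        # Toy stand-in for the real labelled training sentences
        texts = ["The model performed brilliantly.", "The results were disappointing."]
        labels = [1, 0]
        encodings = tokenizer(texts, truncation=True, padding=True)

        class ToyDataset(Dataset):
            def __len__(self):
                return len(labels)

            def __getitem__(self, idx):
                item = {k: torch.tensor(v[idx]) for k, v in encodings.items()}
                item["labels"] = torch.tensor(labels[idx])
                return item

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="finetune-out", num_train_epochs=3),
            train_dataset=ToyDataset(),
        )
        trainer.train()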

Machine Learning, Software Development, Technology Adoption

Transformers for Use-Oriented Entity Extraction

The internet is full of text information. We’re drowning in it. The only way to make sense of it is to use computers to interpret the text for us.

Consider this text:

Foo Inc announced it has acquired Bar Corp. The transaction closed yesterday, reported the Boston Globe.

This is a story about a company called ‘Foo’ buying a company called ‘Bar’. (I’m just using Foo and Bar as generic tech words; these aren’t real companies.)

I was curious to see how the state of the art has evolved for pulling out these key bits of information from the text since I first looked at Dandelion in 2018.

TL;DR – existing Natural Language services vary from terrible to tolerable. But recent advances in language models, specifically transformers, point towards huge leaps in this kind of language processing.

Dandelion

Demo site: https://dandelion.eu/semantic-text/entity-extraction-demo/

While it was pretty impressive in 2018, the quality for this type of sentence is pretty poor. It only identified one entity, the Boston Globe, and it tagged that entity as a “Work” (i.e. a work of art or literature). As I allowed more flexibility in finding entities, it also found that the terms “Inc” and “Corp” usually relate to a corporation, and it found a Toni Braxton song. Nul points.

Link to video

Explosion.ai

Demo site: https://explosion.ai/demos/displacy-ent

This organisation uses pretty standard named entity recognition. It successfully identified that there were three entities in this text. Pretty solid performance at extracting named entities, but not much help for my use case because the Boston Globe entity is not relevant to the key points of the story.

Link to video

Microsoft

Demo site: https://aidemos.microsoft.com/text-analytics

Thought I’d give Microsoft’s text analytics demo a whirl. Completely incomprehensible results. Worse than Dandelion.

Link to video

Completely WTF

Expert.ai

Demo site: https://try.expert.ai/analysis-and-classification

With Microsoft’s effort out of the way, time to look at a serious contender.

This one did a pretty good job. It identified Foo Inc and Bar Corp as businesses, and it identified The Boston Globe as a different kind of entity. There was also some good inference that Foo had made an announcement and that something had acquired Bar Corp. But it didn’t go so far as to join the dots that Foo was the buyer.

In this example, labelling The Boston Globe as Mass Media is helpful. It means I can ignore it unless I specifically want to know who is reporting which story. But this helpfulness can go too far. When I changed the name “Bar Corp” to “Reuters Corp”, the entity extraction only found one business entity: Foo Inc. The other two entities were now tagged as Mass Media.

Long story short: Expert.ai is the best so far, but a user would still need to implement a fair bit of post-processing to extract the key elements from this text.

Link to video.

Expert.ai is identifying entities based on the nature of that entity, not based on the role that they are playing in the text. The relations are handled separately. I was looking for something that combined the relevant information from both the entities and their relations. I’ll call it ‘use-oriented entity extraction’ following Wittgenstein‘s quote that, if you want to understand language: “Don’t look for the meaning, look for the use”. In other words, the meaning of a word in some text can differ depending on how the word is used. In one sentence, Reuters might be the media company reporting a story. In another sentence, Reuters might be the business at the centre of the story.

Enter Transformers

I wondered how Transformers would do with the challenge of identifying the different entities depending on how the words are used in the text. So I trained a custom RoBERTa using a relatively small base set of text and some judicious pre-processing. I was blown away with the results. When I first saw all the 9’s appearing in the F1 score my initial reaction was “this has to be a bug, no way is this really this accurate”. Turns out it wasn’t a bug.

I’ve called the prototype “Napoli” because I like coastal locations and Napoli includes the consonants N, L and P. This is a super-simple proof of concept and would have a long way to go to become production-ready, but even these early results were pretty amazing:

  1. It could tell me that Foo Inc is the spending party that bought Bar Corp
  2. If I changed ‘Bar’ to ‘Reuters’ it could tell me that Foo Inc bought Reuters Corp
  3. If I changed the word “acquired” to “sold” it would tell me that Foo Inc is the receiving party that sold Reuters Corp (or Bar Corp etc).
  4. It didn’t get confused by the irrelevant fact that Boston Globe was doing the reporting.

Link to video
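To illustrate what “use-oriented” output looks like, here is a hypothetical sketch. The model name and label scheme are invented for illustration; the point is that the labels encode the role each company plays in the sentence, not what kind of entity it is:

        from transformers import pipeline

        # "napoli-model" is a hypothetical fine-tuned token-classification model whose
        # labels encode roles (who bought, who sold) rather than entity types
        extractor = pipeline("ner", model="napoli-model", aggregation_strategy="simple")
        print(extractor("Foo Inc announced it has acquired Bar Corp."))
        # Desired output, roughly:
        #   [{'entity_group': 'SPENDING_PARTY', 'word': 'Foo Inc', ...},
        #    {'entity_group': 'RECEIVING_PARTY', 'word': 'Bar Corp', ...}]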

Software Development

python ‘in’ set vs list

Came across a comment on this Stack Overflow question, which concerned converting a list to a set before looking for items in that set.

The point of set is to convert in from an O(n) operation to an O(1) operation. In this case it doesn’t really matter, but for larger data sets it would be inefficient to have multiple linear scans over a list.

Which led me to look a bit more at time complexity in python functions in general.

In the past I would only convert a list to a set if I wanted to remove duplicates. But evidently it’s a good habit to get into whenever the speed of membership testing matters. Though, obviously, converting the list to a set is O(n) in the first place, so it only pays off if you do multiple lookups.
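A quick, unscientific way to see the difference (absolute numbers will vary by machine):

        import timeit

        data_list = list(range(1_000_000))
        data_set = set(data_list)  # the one-off O(n) conversion

        # Worst case for the list: the target is at the end, forcing a full scan
        print(timeit.timeit(lambda: 999_999 in data_list, number=100))  # O(n) per lookup
        print(timeit.timeit(lambda: 999_999 in data_set, number=100))   # O(1) per lookup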

Software Development

Python: bind method outside loop to reduce overhead

Came across this interesting pattern in some of the sklearn codebase – it’s from the character n-gram extraction behind CountVectorizer, wrapped here in a function so it runs standalone:

    def char_ngrams(text_document, min_n, max_n):
        """Extract all character n-grams of length min_n to max_n from a string."""
        text_len = len(text_document)
        ngrams = []

        # bind method outside of loop to reduce overhead
        ngrams_append = ngrams.append

        for n in range(min_n, min(max_n + 1, text_len + 1)):
            for i in range(text_len - n + 1):
                ngrams_append(text_document[i: i + n])
        return ngrams


I wonder, at what scale does this really start to make a meaningful performance difference?
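One way to find out is to measure it. A rough sketch – on CPython the bound-method version is typically only a few percent faster, so it only pays off in very hot loops:

        import timeit

        def with_attribute_lookup(n):
            out = []
            for i in range(n):
                out.append(i)  # looks up .append on every iteration
            return out

        def with_bound_method(n):
            out = []
            out_append = out.append  # bind once, outside the loop
            for i in range(n):
                out_append(i)
            return out

        print(timeit.timeit(lambda: with_attribute_lookup(100_000), number=100))
        print(timeit.timeit(lambda: with_bound_method(100_000), number=100))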

Even if it doesn’t make a huge difference, I do find it pretty elegant and easy to follow.