Product Management, Software Development

Bonfire of the Best Practices

As we get stuck into 2024, I am wondering how many of the product and engineering best practices that have developed over the past 10-15 years will turn out to be ZIRPs (“zero interest rate phenomena”).

Probably quite a few. Two reasons I can see:

The first reason is pretty obvious. When money was cheap the objective was to raise as much investment as possible and use that money to build an aggressive roadmap fast. When time is of the essence and cost isn’t a big deal, the first reaction of most managers is going to be to hire more people to do more stuff. It’s not surprising that many companies over-hired – see, for example, https://qz.com/from-overhiring-to-optionality-what-we-can-learn-from-1850267076 and https://stripe.com/gb/newsroom/news/ceo-patrick-collisons-email-to-stripe-employees

Burning loads of VC cash to hire loads of people to build more stuff brings its own set of challenges, and the processes and best practices that evolve in that environment are tuned to those challenges. When there is less investment money floating around you need to be leaner and more intentional about where you spend your time. The processes that work best in a leaner environment won’t necessarily be the same as those that worked in the sorts of VC-backed outfits we’ve seen over the last decade.

It’s a common trope that you shouldn’t just pick a process that worked in a company whose brand you admire and expect it to work in your company. What I’m saying here is a bit stronger than that. I’m saying that any product or engineering advice coming out of the VC-backed cash-burning-machine era might need to be chucked in the bin in these leaner times.

It’s not just that a bigger company’s processes might not work for where your company is right now. The sorts of processes that worked in a ZIRP era business model might just not make sense in a leaner era.

Here comes the second reason that makes it even more important for you to question ZIRP-era process advice.

As a company gets bigger, the people who have the time and inclination to write about processes get further and further removed from the reality on the ground. Even in @simfoni it’s difficult for me to know what an individual product iteration really involves for the people doing the work. If you’re reading a blog post by a tech leader of a company with 200 or 1,000 engineers, how confident should you be that the blog reflects reality? Or does it reflect what the writer thinks is happening, or how they would like things to be happening?

A great example of this is the Spotify Model. The Spotify model is (was?) an organizational structure that inspired a lot of imitators until it turned out that even Spotify didn’t use the Spotify model: https://www.agility11.com/blog/2020/6/22/spotify-doesnt-use-the-spotify-model  It was as much a leadership aspiration as it was a statement of reality.

To sum up, I see 2 things combining here:

  1. Processes developed in ZIRP-era, VC-backed companies might fundamentally not work in leaner times
  2. The processes as described in blog posts and analyst reports maybe never even existed in real life

So a 2024 resolution for myself, and a call to arms for anyone reading, is to “question all the blog posts”. Even more than you normally would.

Except for this one, of course. This one is bang on the money 😀

Software Development

Styling with django-allauth

I recently added django-allauth to Syracuse.

I wanted to add some simple styling to it. It was surprisingly tricky to piece together the relevant information from various Stack Overflow and Medium posts, so here is what I ended up with, in case it’s useful for others.

The app source code including allauth is in the allauth_v01 tag. The relevant commit that included allauth is here.

Briefly, the necessary changes are:

settings.py

  1. Install allauth per the quickstart docs. I didn’t add the SOCIALACCOUNT_PROVIDERS piece as those can be set up via the admin interface as shown in this tutorial.
  2. The DIRS entry in the TEMPLATES setting tells Django where to look locally for templates first. If it doesn’t find a template there it falls back to other, more magical, locations, which include the templates bundled with the allauth library – roughly as sketched below.
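
For reference, here is a minimal sketch of the relevant settings.py pieces. It follows the allauth quickstart rather than Syracuse’s exact settings, and the “templates” directory name is an assumption:

from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent

INSTALLED_APPS = [
    # ... Django defaults ...
    "django.contrib.sites",
    "allauth",
    "allauth.account",
    "allauth.socialaccount",
    # plus the provider apps you want, e.g. "allauth.socialaccount.providers.google"
]

# Newer allauth versions also need "allauth.account.middleware.AccountMiddleware" in MIDDLEWARE.

TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        # DIRS is searched first; anything not found here falls back to the
        # templates bundled with installed apps, including allauth's own.
        "DIRS": [BASE_DIR / "templates"],
        "APP_DIRS": True,
        "OPTIONS": {
            "context_processors": [
                # allauth needs the request context processor
                "django.template.context_processors.request",
            ],
        },
    },
]

AUTHENTICATION_BACKENDS = [
    "django.contrib.auth.backends.ModelBackend",
    "allauth.account.auth_backends.AuthenticationBackend",
]

SITE_ID = 1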

Then it’s just a case of finding which template you need from the allauth library. In my case I wanted to apply the basic formatting from the app’s other pages to the allauth pages, so I only had to copy https://github.com/pennersr/django-allauth/blob/main/allauth/templates/allauth/layouts/base.html to the equivalent location within my local templates directory and add one line at the top:

{% include 'layouts/main-styling.html' %}

Simple when you know how, right?
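
In other words, the local templates directory ends up looking roughly like this (the exact paths in Syracuse may differ slightly):

templates/
    allauth/
        layouts/
            base.html          <-- copied from django-allauth, with the extra include line at the top
    layouts/
        main-styling.html      <-- the shared styling snippet that the include pulls in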

Up until this point the styling file had been living in the relevant app directory in the project, so I moved it to a shared location in templates where any part of the project can access it. The rest of the updates in the commit are:

  • moving this file and changing the rest of Syracuse to use the new file location
  • implementing a little snippet to show an appropriate message and links depending on whether the user is logged in or not

Software Development

More software development lessons from my side project

Last month I wrote about migrating the syracuse codebase to Neo4j and changing the hosting from Heroku to Digital Ocean.

Since then I’ve finished adding to the UI the remaining types of content that I have, so you can now see information on corporate finance activities, senior appointment activities and location-related activities (e.g. adding a new site, exiting a territory). This is all part of building up a picture of how an organization evolves over time, using information extracted from unstructured data sources.

I want to write about two things that came up while I was doing this, which reminded me why some things are good to do and some aren’t!

CSV, CSV, CSV

The bad news was that adding the location and appointment activities into the UI showed up some inconsistencies in how the different types were represented in the data. The good news was that the inconsistencies weren’t too hard to fix. All the data was stored as RDF triples in JSON-LD format, which made it pretty trivial to regenerate. It would have been a lot harder if the data had been stored in a structured database: once you start getting data into a database, even the smallest schema change can get very complicated to handle. So I’m glad I follow the advice of one of the smartest developers I ever worked with: don’t assume you need a database; if a CSV file can handle your requirements then go with that.

In fact, one feature I implemented does use a CSV file as its storage – easier than using a database for now (a rough sketch of the idea is below the list). My order of preference for data storage is:

  1. CSV
  2. JSON
  3. Database
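
As an illustration of the CSV-first idea – this is a generic sketch, not the actual Syracuse feature, and the file name and field names are made up:

import csv
from pathlib import Path

DATA_FILE = Path("activity_log.csv")  # hypothetical file name
FIELDS = ["date", "organization", "activity_type"]

def append_row(row: dict):
    # Write a header the first time, then just keep appending rows.
    is_new = not DATA_FILE.exists()
    with DATA_FILE.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

def load_rows() -> list[dict]:
    if not DATA_FILE.exists():
        return []
    with DATA_FILE.open(newline="") as f:
        return list(csv.DictReader(f))

append_row({"date": "2022-11-28", "organization": "Core Scientific", "activity_type": "acquisition"})
print(load_rows())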

Test, Test, Test

Adding these new data types into the graph made the already quite messy code even messier: it was ripe for a refactor. I then had to do a fair amount of work to get things working properly, which involved a lot of refreshing a web page, tweaking some code, and repeating.

My initial instinct was to think that I didn’t have time to write any tests.

But guess what… it was only when I finally wrote some tests that I got to the bottom of the problems and fixed them all. All it took was two integration tests to pin the issues down quickly. You don’t need 100% code coverage to make testing worthwhile – even the process of putting together test data for these integration tests helped identify where some of the bugs were. A sketch of what I mean by an integration test is below.
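
For illustration, this is roughly the shape of test I mean – the URL, page content and class names here are hypothetical rather than Syracuse’s actual tests, and the real versions load test data into the graph first:

from django.test import TestCase

class GraphPageIntegrationTest(TestCase):
    # Exercise a whole page end-to-end rather than testing individual functions.

    def test_organization_page_shows_linked_activity(self):
        # In the real tests the relevant topic data is loaded into the
        # graph database before making the request; that setup is omitted here.
        response = self.client.get("/organizations/core-scientific/")  # hypothetical URL
        self.assertEqual(response.status_code, 200)
        self.assertContains(response, "Stax Digital")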

The app is currently live at https://syracuse.1145.am – I hope you enjoy it. The web app is running on a Digital Ocean droplet and the backend database is in Neo4j’s AuraDB free tier. So there is a fair amount of traffic going backwards and forwards, which means the app isn’t super fast. But hopefully it gives a flavor.

Software Development

Syracuse update: Django and Neo4j

Time for another update on my side project, following on from https://alanbuxton.wordpress.com/2023/08/05/rip-it-up-and-start-again-without-ripping-it-up-and-starting-again/

Since that post I’ve implemented a new backend with Neo4j and created an app for accessing the data [1]. It’s here: https://github.com/alanbuxton/syracuse-neo. The early commits have decent messages showing my baby steps in getting from one capability to the next [2].

The previous app stored each topic collection separately in a Postgres database. A topic collection is the stories taken from one article, and there could be several stories within one article. This was ok as a starting point, but the point of this project is to connect the dots between different entities based on NLPing articles, so really I needed a Graph database to plug things together.
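
As an aside, one common way to talk to Neo4j from Django is neomodel. Here is a tiny sketch of what a node definition can look like – illustrative only, not the actual syracuse-neo models:

from neomodel import StructuredNode, StringProperty, RelationshipTo

class Organization(StructuredNode):
    # Hypothetical model: one node per mention of an organization,
    # linked to other mentions judged to be the same organization.
    name = StringProperty()
    industry = StringProperty()
    same_as = RelationshipTo("Organization", "SAME_AS")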

Heroku doesn’t support Neo4j so I’ve moved the app to a Digital Ocean VM that uses Neo4j’s free tier AuraDB. It’s hosted here http://syracuse.1145.am and has just enough data in it to fit into the AuraDB limits.

The graph visualization is done with vis.js. This turned out to be pretty straightforward to code: you just need some JavaScript for the nodes and some JavaScript for the edges, and so long as your ids are all unique it all works nicely.
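
Roughly speaking, the data ends up in a shape like this – sketched here as the Python structures a Django view might serialize to JSON for vis.js (the ids and labels are made up):

import json

# Each node needs a unique id; edges reference those ids via "from" and "to".
nodes = [
    {"id": "org_core_scientific", "label": "Core Scientific"},
    {"id": "act_stax_acquisition", "label": "Acquisition of Stax Digital"},
]
edges = [
    {"from": "org_core_scientific", "to": "act_stax_acquisition", "label": "buyer"},
]

# Passed to the template and loaded into vis.js DataSets on the page.
context = {"nodes_json": json.dumps(nodes), "edges_json": json.dumps(edges)}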

Visualizing this sort of data in a graph makes it a lot more immediate than before. I just want to share a few entertaining images to show one feature I worked on.

The underlying data has a node for every time an entity (e.g. an organization) is mentioned. This is intentional, because when processing an article you can’t tell whether the name of a company in one article refers to the same company as a similarly-named company in a different article [3]. So each node in the graph is a mention of an organization, and there is some separate logic to figure out whether two nodes are the same organization or not. For example, if two mentions have a similar name and the same industry then they are probably the same organization – something like the sketch below.
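
A crude illustration of that kind of heuristic – this is just the shape of the idea, not the real Syracuse logic, and the threshold is made up:

from difflib import SequenceMatcher

def probably_same_org(mention_a: dict, mention_b: dict, threshold: float = 0.8) -> bool:
    # Treat two mention nodes as the same organization if their names are
    # very similar and they are in the same industry.
    name_similarity = SequenceMatcher(
        None, mention_a["name"].lower(), mention_b["name"].lower()
    ).ratio()
    return (name_similarity >= threshold
            and mention_a.get("industry") == mention_b.get("industry"))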

All these mention nodes sometimes led to a ball of lines that looks like Pig-Pen’s hair.

On the plus side, this does make for a soothing visual as the graph library tries to move the nodes about into a sensible shape. With a bit of ambient sound this could make a good relaxation video.

But, pretty though this may be, it’s hard to read. So I implemented an ‘uber node’ that is the result of clubbing together all the “same as” nodes. A lot more readable, see below:

Below is an example of the same graph after all the Accel Partners nodes had been combined together.

Next steps:

  1. Implement the other types of topic collections into this graph (e.g. people appointments, opening new locations)
  2. Implement a feature to easily flag any incorrect relationships or entities (which can then feed back into the ML training)

Thanks for reading!

Notes

  1. With thanks to https://github.com/neo4j-examples/paradise-papers-django for the example app, which gave me a starting point for working with graph data in Neo4j.
  2. For example, git checkout b2791fdb439c18026585bced51091f6c6dcd4f72 is a good one for complete newbies to see some basic interactions between Django and Neo4j.
  3. Also, this type of reconciliation is a difficult problem – I have some experience of it from this project: https://theybuyforyou.eu/business-cases/ – so it’s safer to process the articles first and then have a separate process for combining the topics together.

Software Development

Rip it up and start again, without ripping it up and starting again

Time for another update on my side project, Syracuse: http://syracuse-1145.herokuapp.com/

I got some feedback that the structure of the relationships was a bit unintuitive. Fair enough, let’s update it to make more sense.

Previously the code was doing the following:

  1. Use a RoBERTa-based LLM for entity extraction (the code is pretty old but works well)
  2. Use Benepar dependency parsing to link up relevant entities with each other
  3. Use the FlanT5 LLM to dig out some more useful content from the text
  4. Build up an RDF representation of the data (see the sketch after this list)
  5. Clean up the RDF
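
For a flavour of step 4, here is a minimal rdflib sketch along the lines of the RDF shown later in this post – it is not the actual pipeline code, and the URIs and properties are simplified:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

NS1 = Namespace("http://example.org/test/")
ORG = Namespace("http://www.w3.org/ns/org#")

g = Graph()
g.bind("ns1", NS1)
g.bind("org", ORG)

# Build up triples for one organization extracted from an article
org_uri = NS1["abc/Core_Scientific"]
g.add((org_uri, RDF.type, ORG["Organization"]))
g.add((org_uri, NS1["name"], Literal("Core Scientific")))
g.add((org_uri, NS1["industry"], Literal("Artificial Intelligence and Blockchain technologies")))

print(g.serialize(format="turtle"))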

Step 5 had got quite complex.

Also, I had a look at import/export of RDF in graphs – specifically Neo4J, but couldn’t see much excitement about RDF. I even made a PR to update some of the Neo4J / RDF documentation. It’s been stuck for 2+ months.

I wondered if a better approach would be to start again using a different set of technologies. Specifically,

  1. Falcon7B instead of FlanT5
  2. Just building the representation in a graph rather than using RDF

It was very exciting to get a chance to try out Falcon7B, but in my use case it wasn’t any more useful than FlanT5.

Going down the graph route was a bit of fun for a while. I’ve used networkx quite a bit in the past so I thought I’d try that first. But, guess what, it turned out more complicated than I needed. Also, I do like the simplicity and elegance of RDF, even if it makes me seem a bit old.

So the final choice was to rip up all my post-processing and turn it into pre-processing, and then generate the RDF. It was heart-breaking to throw away a lot of code, but, as programmers, I think we know when we’ve built something that is just too brittle and needs some heavy refactoring. It worked well in the end, see the git stats below:

  • code: 6 files changed, 449 insertions, 729 deletions
  • tests: 81 files changed, 3618 insertions, 1734 deletions

Yes, a lot of tests. It’s a data-heavy application so there are a lot of tests to make sure that data is transformed as expected. Whenever it doesn’t work, I add that data (or enough of it) as a test case and then fix it. Most of this test data was just changed with global find/replace so it’s not a big overhead to maintain. But having all those tests was crucial for doing any meaningful refactoring.

On the code side, it was very satisfying to remove more code than I was adding. It just showed how brittle and convoluted the codebase had become. As I discovered more edge cases I added more logic to deal with them. Eventually this ended up as lots of complexity. The new code is “cleaner”. I put clean in quotes because there is still a lot of copy/paste in there and similar functions doing similar things. This is because I like to follow “Make it work, make it right, make it fast“. Code that works but isn’t super elegant is going to be easier to maintain/fix/re-factor later than code that is super-abstracted.

Some observations on the above:

  1. Tests are your friend (obviously)
  2. Expect to need major refactoring in the future. However well you capture all the requirements now, there will be plenty that have not yet been captured, and plenty of need for change
  3. Shiny new toys aren’t always going to help – approach with caution
  4. Sometimes the simple old-fashioned technologies are just fine
  5. However bad you think an app is, there is probably still 80% in there that is good, so beware completely starting from scratch.

See below for the RDF as it stands now compared to before:

Current version

@prefix ns1: <http://example.org/test/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/test/abc/Core_Scientific> a org:Organization ;
    ns1:basedInRawLow "USA" ;
    ns1:buyer <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> ;
    ns1:description "Artificial Intelligence and Blockchain technologies" ;
    ns1:foundName "Core Scientific" ;
    ns1:industry "Artificial Intelligence and Blockchain technologies" ;
    ns1:name "Core Scientific" .

<http://example.org/test/abc/Stax_Digital> a org:Organization ;
    ns1:basedInRawLow "LLC" ;
    ns1:description "blockchain mining" ;
    ns1:foundName "Stax Digital, LLC",
        "Stax Digital." ;
    ns1:industry "blockchain mining" ;
    ns1:name "Stax Digital" .

<http://example.org/test/abc/Stax_Digital_Assets_Acquisition> a ns1:CorporateFinanceActivity ;
    ns1:activityType "acquisition" ;
    ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
    ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
        "Core Scientific completes acquisition of Stax Digital." ;
    ns1:foundName "acquired",
        "acquisition" ;
    ns1:name "acquired",
        "acquisition" ;
    ns1:status "completed" ;
    ns1:targetDetails "assets" ;
    ns1:targetEntity <http://example.org/test/abc/Stax_Digital> ;
    ns1:targetName "Stax Digital" ;
    ns1:whereRaw "llc" .

Previous version

@prefix ns1: <http://example.org/test/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/test/abc/Core_Scientific> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInLow "USA" ;
    ns1:description "Artificial Intelligence and Blockchain technologies" ;
    ns1:foundName "Core Scientific" ;
    ns1:industry "Artificial Intelligence and Blockchain technologies" ;
    ns1:name "Core Scientific" ;
    ns1:spender <http://example.org/test/abc/Purchase_Stax_Digital> .

<http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> a ns1:TargetDetails ;
    ns1:label "Assets" ;
    ns1:name "Acquired Assets Stax Digital, LLC" ;
    ns1:nextEntity "Stax Digital, LLC" ;
    ns1:previousEntity "acquired" ;
    ns1:targetEntity <http://example.org/test/abc/Stax_Digital> .

<http://example.org/test/abc/Purchase_Stax_Digital> a ns1:Activity ;
    ns1:activityType "Purchase" ;
    ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
    ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
        "Core Scientific completes acquisition of Stax Digital." ;
    ns1:label "Acquired",
        "Acquisition" ;
    ns1:name "Purchase Stax Digital" ;
    ns1:targetDetails <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> ;
    ns1:whenRaw "has happened, no date available" .

<http://example.org/test/abc/Stax_Digital> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInLow "LLC" ;
    ns1:description "blockchain mining" ;
    ns1:foundName "Stax Digital, LLC",
        "Stax Digital." ;
    ns1:industry "blockchain mining" ;
    ns1:name "Stax Digital" .

DIFF
1a2
> @prefix org: <http://www.w3.org/ns/org#> .
4,5c5,7
< <http://example.org/test/abc/Core_Scientific> a <http://www.w3.org/ns/org#Organization> ;
<     ns1:basedInLow "USA" ;
---
> <http://example.org/test/abc/Core_Scientific> a org:Organization ;
>     ns1:basedInRawLow "USA" ;
>     ns1:buyer <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> ;
9,10c11
<     ns1:name "Core Scientific" ;
<     ns1:spender <http://example.org/test/abc/Purchase_Stax_Digital> .
---
>     ns1:name "Core Scientific" .
12,31c13,14
< <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> a ns1:TargetDetails ;
<     ns1:label "Assets" ;
<     ns1:name "Acquired Assets Stax Digital, LLC" ;
<     ns1:nextEntity "Stax Digital, LLC" ;
<     ns1:previousEntity "acquired" ;
<     ns1:targetEntity <http://example.org/test/abc/Stax_Digital> .
< 
< <http://example.org/test/abc/Purchase_Stax_Digital> a ns1:Activity ;
<     ns1:activityType "Purchase" ;
<     ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
<     ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
<         "Core Scientific completes acquisition of Stax Digital." ;
<     ns1:label "Acquired",
<         "Acquisition" ;
<     ns1:name "Purchase Stax Digital" ;
<     ns1:targetDetails <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> ;
<     ns1:whenRaw "has happened, no date available" .
< 
< <http://example.org/test/abc/Stax_Digital> a <http://www.w3.org/ns/org#Organization> ;
<     ns1:basedInLow "LLC" ;
---
> <http://example.org/test/abc/Stax_Digital> a org:Organization ;
>     ns1:basedInRawLow "LLC" ;
36a20,34
> 
> <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> a ns1:CorporateFinanceActivity ;
>     ns1:activityType "acquisition" ;
>     ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
>     ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
>         "Core Scientific completes acquisition of Stax Digital." ;
>     ns1:foundName "acquired",
>         "acquisition" ;
>     ns1:name "acquired",
>         "acquisition" ;
>     ns1:status "completed" ;
>     ns1:targetDetails "assets" ;
>     ns1:targetEntity <http://example.org/test/abc/Stax_Digital> ;
>     ns1:targetName "Stax Digital" ;
> ns1:whereRaw "llc" .

Software Development

Getting started with Neo4J and Neosemantics

This is based on Mark Needham’s excellent blog post on creating a Covid-19 graph, with the commands updated to work with Neo4J 5.5.0.

Disclaimer – I don’t know anything about infectious diseases, so apologies if I’ve misunderstood any terminology. Please refer back to Mark’s post for the in-depth analysis of what the various Cypher queries mean, and just use this post to get started more quickly.

Pre-requisites: You are using Neo4J Desktop with APOC and Neosemantics installed, so your Neo4J Desktop looks something like this:

First set the config for the graph

CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE;
CALL n10s.graphconfig.init({handleVocabUris: "MAP"});

And then set up the mappings for the wikidata namespaces:

CALL n10s.nsprefixes.add("wdt","http://www.wikidata.org/prop/direct/");
CALL n10s.mapping.add("http://www.wikidata.org/prop/direct/P171","CHILD_OF");
CALL n10s.nsprefixes.add("rdfs","http://www.w3.org/2000/01/rdf-schema#");
CALL n10s.mapping.add("http://www.w3.org/2000/01/rdf-schema#label","name");

Now run the code to insert the virus data into your graph

WITH '
CONSTRUCT {
  ?cat rdfs:label ?catName .
  ?subCat rdfs:label ?subCatName ;
          wdt:P171 ?parentCat .
  }
WHERE {
  ?cat rdfs:label "Riboviria"@en .
  ?cat rdfs:label ?catName .
  filter(lang(?catName) = "en") .
  ?subCat wdt:P171+ ?cat ;
          wdt:P171 ?parentCat;
          rdfs:label ?subCatName
          filter(lang(?subCatName) = "en") .
}
' AS query
CALL n10s.rdf.import.fetch(
  "https://query.wikidata.org/sparql?query=" + apoc.text.urlencode(query),
  "JSON-LD",
  { headerParams: { Accept: "application/ld+json"}})
YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo
RETURN terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo;

You should get a result that looks something like:

If there were any errors they would appear in the extraInfo field.

Let’s just check what nodes we have in the graph

MATCH (n)
RETURN n

The green node is the GraphConfig node. The virus definitions are just flagged with the generic ‘Resource’ label. To make things more useful later on, when we have other types of data in the graph, we can add a new label to the virus nodes:

MATCH (n:Resource)
SET n:Virus
RETURN *

The 300 items are the ones matched by this Cypher statement; they now all have the Virus label and a fetching yellow colour.

Mark’s blog post talks about cleaning up some cases where there is a shortcut between two CHILD_OF relationships. It’s a fun example of variable-length pattern matching, so it’s worth a bit of a play.

In my dataset the following Cypher query shows one such example:

MATCH (v:Virus {name: "Rotavirus"})-[co:CHILD_OF]->(p:Virus)
RETURN *

Rotavirus is a child of Sederovirinae (top left), Sedoreoviridae (top right) and Reoviridae (bottom left). Sederovirinae (top left) is a child of Reoviridae. So the CHILD_OF from Rotavirus to Reoviridae is a shortcut path that we want to remove.

To identify all the shortcuts we need to find all the cases where Virus v1 is a direct child of Virus v2 and, at the same time, the same Virus v1 has another path of at least two CHILD_OF steps to v2. That can be captured in the following Cypher query:

MATCH (v2:Virus)<-[shortcut:CHILD_OF]-(v1:Virus)-[co:CHILD_OF*2..]->(v2:Virus) 
RETURN *

To delete these shortcuts

MATCH (v2:Virus)<-[shortcut:CHILD_OF]-(v1:Virus)-[co:CHILD_OF*2..]->(v2:Virus) 
DELETE shortcut

This will return the number of rows deleted. To satisfy yourself that these are, indeed, deleted, you can re-run the Rotavirus query above, which will now only show that Rotavirus is a child of Sederovirinae and Sedoreoviridae.

To continue in Mark’s footsteps, let’s add some other entities to the graph. The syntax for adding mappings has changed, so we need to tweak the older command. In this case we added a prefix for "http://www.wikidata.org/prop/direct/" earlier on, so we can just go ahead and add the mapping without adding a prefix first.

CALL n10s.mapping.add("http://www.wikidata.org/prop/direct/P2975","HOST");

Then load the data with the query below. Warning: this query will take a long time to run because it makes one request per virus to get the host information. In my case: Started streaming 7139 records after 22 ms and completed after 19590 ms, displaying first 1000 rows.

MATCH (r:Virus)
WITH n10s.rdf.getIRILocalName(r.uri) AS virus, r
WITH 'prefix schema: <http://schema.org/>

CONSTRUCT {
  wd:' + virus + ' wdt:P2975 ?host.
  ?host rdfs:label ?hostName ;
        rdf:type schema:Host

}
WHERE {
  OPTIONAL {
    wd:' + virus + ' wdt:P2975 ?host.
    ?host rdfs:label ?hostName.
    filter(lang(?hostName) = "en")
  }
}' AS query, r
CALL n10s.rdf.import.fetch("https://query.wikidata.org/sparql?query=" + apoc.text.urlencode(query),
        "JSON-LD",
        { headerParams: { Accept: "application/ld+json"}})

YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo
RETURN r.name, terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo;

In this case we’re using rdf:type schema:Host to specify that this is a resource of type Host. This graph is using the default graphConfig handleRDFTypes setting, which means that the RDF type is assigned as a label to the node for you.

Let’s see how many hosts we have

MATCH (n:Host)
RETURN  *

And a quick query to see the relationships between Virus and Host

MATCH (v:Virus)-[:HOST]->(h:Host)
RETURN *

From here you should be able to continue working on the more involved queries in Mark’s blog post.

Software Development

Syracuse update

I’ve now converted all the topics to RDF so they can be represented natively in graph form. I essentially made up the RDF (with some inspiration from ChatGPT, naturally) so I’m sure it can be hugely improved.

The next steps are [a] to make the RDF more in line with appropriate standards and then [b] to stitch together topics that relate to the same entity/location to create a timeline.

Non-cherry-picked results of the first example of each type (M&A, Appointment and Location) below. The demo is still running at http://syracuse-1145.herokuapp.com/ and I’d love to hear from anyone who finds the tool interesting.

M&A – Canadian buyer aims to improve Pornhub owner’s reputation

see http://syracuse-1145.herokuapp.com/topics/493555.

RDF for this:

@prefix ns1: <http://1145.am/db/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://1145.am/db/1941166/Ethical_Capital_Partners> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInHigh "Canada" ;
    ns1:description "Canadian private equity firm" ;
    ns1:foundName "Ethical Capital Partners" ;
    ns1:industry "private equity" ;
    ns1:name "Ethical Capital Partners" ;
    ns1:spender <http://1145.am/db/1941166/Act_Mindgeek> .

<http://1145.am/db/1941166/Act_Mindgeek> a ns1:Activity ;
    ns1:activityType "Act" ;
    ns1:documentDate "2023-03-17T22:01:07.679000+00:00"^^xsd:dateTime ;
    ns1:documentExtract "Ethical Capital Partners (ECP) this week bought Pornhub owner MindGeek, its first ever acquisition, with the aim of improving the internet platform's reputation, the young Canadian private equity firm's partners said on Friday." ;
    ns1:label "activity" ;
    ns1:name "Act MindGeek" ;
    ns1:targetEntity <http://1145.am/db/1941166/Mindgeek> ;
    ns1:whenRaw "has happened, no date available" .

<http://1145.am/db/1941166/Mindgeek> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInLow "Canadian" ;
    ns1:description "Pornhub owner" ;
    ns1:foundName "MindGeek" ;
    ns1:name "MindGeek" .

Appointment – Gantt named chancellor at Southern University-Shreveport

http://syracuse-1145.herokuapp.com/topics/493562

RDF for this:

@prefix ns1: <http://1145.am/db/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://1145.am/db/1940774/Aubra_Gantt> a <http://xmlns.com/foaf/0.1/Person> ;
    ns1:foundName "Aubra Gantt" ;
    ns1:name "Aubra Gantt" ;
    ns1:roleActivity <http://1145.am/db/1940774/Aubra_Gantt-Starting-Chancellor> .

<http://1145.am/db/1940774/Southern_University_Shreveport> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInHigh "Shreveport" ;
    ns1:description "Southern University System" ;
    ns1:foundName "Southern University Shreveport" ;
    ns1:hasRole <http://1145.am/db/1940774/Southern_University_Shreveport-Chancellor> ;
    ns1:name "Southern University Shreveport" .

<http://1145.am/db/1940774/Aubra_Gantt-Starting-Chancellor> a ns1:RoleActivity ;
    ns1:activityType "starting" ;
    ns1:documentDate "2023-03-17T23:18:37+00:00"^^xsd:dateTime ;
    ns1:documentExtract "Shreveport native Aubra Gantt was chosen Friday as the next chancellor at the Southern University Shreveport campus." ;
    ns1:foundName "chosen" ;
    ns1:name "Chosen" ;
    ns1:orgFoundName "The Southern University System" ;
    ns1:role <http://1145.am/db/1940774/Southern_University_Shreveport-Chancellor> ;
    ns1:roleFoundName "chancellor" ;
    ns1:roleHolderFoundName "Aubra Gantt" .

<http://1145.am/db/1940774/Southern_University_Shreveport-Chancellor> a <http://www.w3.org/ns/org#Role> ;
    ns1:foundName "chancellor" ;
    ns1:name "Chancellor" ;
    ns1:orgFoundName "Southern University Shreveport" .

Location – UMC Circular Economy & Recycling Innovation Center Breaks Ground for a Zero Waste Future

http://syracuse-1145.herokuapp.com/topics/493512

RDF for this:

@prefix ns1: <http://1145.am/db/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://1145.am/db/1940440/United_Microelectronics_Corporation> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInHigh "Taiwan" ;
    ns1:description "NYSE: UMC; TWSE: 2303" ;
    ns1:foundName "United Microelectronics Corporation" ;
    ns1:industry "semiconductor manufacturing" ;
    ns1:locationAdded <http://1145.am/db/1940440/United_Microelectronics_Corporation-Tainan_Taiwan> ;
    ns1:name "United Microelectronics Corporation" .

<http://1145.am/db/1940440/Tainan_Taiwan> a <http://www.w3.org/ns/org#Site> ;
    ns1:foundName "Tainan, Taiwan" ;
    ns1:name "Tainan, Taiwan" .

<http://1145.am/db/1940440/United_Microelectronics_Corporation-Tainan_Taiwan> a ns1:SiteAddedActivity ;
    ns1:actionFoundName "groundbreaking" ;
    ns1:documentDate "2023-03-17T09:07:00+00:00"^^xsd:dateTime ;
    ns1:documentExtract "United Microelectronics Corporation (NYSE: UMC; TWSE: 2303) (\"UMC\") today held a groundbreaking ceremony for its Circular Economy & Recycling Innovation Center, which will be established at its Fab 12A in Tainan, Taiwan." ;
    ns1:foundName "Tainan, Taiwan" ;
    ns1:location <http://1145.am/db/1940440/Tainan_Taiwan> ;
    ns1:locationFoundName "Tainan, Taiwan" ;
    ns1:locationPurpose "Circular Economy & Recycling Innovation Center" ;
    ns1:name "Tainan, Taiwan" ;
    ns1:orgFoundName "United Microelectronics Corporation" .

Product Management, Software Development

Agile Software Development 22 years on

The Agile Manifesto came about in Feb 2001. That makes it 22 years old now. There are people commercially writing code today who weren’t even born when it came out.

Goodness, I feel old.

The values that underpin the Agile Manifesto remain sound and are still the best way to approach your software development. But I have seen many times that, in an effort to ‘do agile’, teams lose the ability to ‘be agile’.

If you want to avoid this trap, bear in mind the context that the Agile Manifesto came out of.

Conventional wisdom at the time was, for example: spend 1 month scoping the project, 3 months documenting requirements, 12 months building, 3 months testing and 6 months rolling out. That sort of order of things, with checkpoints at each step along the way.

There was a whole generation of project managers who prided themselves on delivering projects on time, on budget, on scope. I was one of them. Yet at the same time, the companies investing in tech would complain of over-runs in all of these factors.

Why the disconnect? Because of the focus on scope (building features) rather than impact (solving problems). As a project manager working in one of these projects you could build a successful career by agreeing a scope, writing it down and delivering it. Then, when the inevitable changes arose you’d handle those as individual ‘change orders’, each adding more scope which you would then deliver on time, on budget and so forth (and get paid more for doing). But the users would end up paying many times as much as initially expected over a much longer timescale before they saw the benefits that they had been looking for.

Agile changed all that by, basically, saying “let’s talk more between the programmers and the users and let’s deliver in smaller increments so we can keep focus on the most important things”.

That was the kernel and the genius of Agile.

At some point things started to go wrong. In my view it’s when Agile became equated with Scrum; anything non-Scrum got labelled Waterfall and was therefore never to be mentioned again.

Agile is not a methodology. Agile isn’t something you do. Agile is something that following those 4 simple values can help you be.

So as we look back on the arrival of Agile on the scene all those years ago, I urge you to forget about your Kanban and Scrum and Roadmaps and Mob Programming for a moment and just remind yourself of the 4 key values that can sit underneath all these attempts to become more agile in how we build software:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

https://agilemanifesto.org/

Software Development

Sometimes I miss Haskell’s immutability

I have ‘fond’ memories of tracking down a particular PITA of a bug in some Python code. It came down to whether using = makes two references to the same underlying object, or whether it makes a copy of the object in question. And equally fond memories of debugging a Ruby function that changed a hash unexpectedly.

This sort of thing:

>>> arr = [1,2,3]
>>> arr2 = arr
>>> arr
[1,2,3]
>>> arr2
[1,2,3]
>>> arr == arr2
True # Values are equivalent, objects are also the same
>>> id(arr)
4469618560 # the id for both arr and arr2 is the same: they are the same object
>>> id(arr2)
4469618560
>>> arr2.append(4)
>>> arr
[1,2,3,4]
>>> arr2
[1,2,3,4]

compared to

>>> arr = [1,2,3]
>>> arr2 = [1,2,3] + [] # arr2 is now a different object
>>> arr
[1,2,3]
>>> arr2
[1,2,3]
>>> arr == arr2
True # Values are the same even though the object is different
>>> id(arr)
4469618560
>>> id(arr2)
4469669184 # a different id: arr2 is a separate object
>>> arr2.append(4)
>>> arr
[1,2,3]
>>> arr2
[1,2,3,4]

Or, in a case that you’re more likely to see in the wild

>>> arr = [1,2,3]
>>> d1 = {"foo":arr, "bar": "baz"}
>>> d2 = d1
>>> d3 = dict(d1) # Creates new dict, with its own id
>>> id(d1)
140047012964608
>>> id(d2)
140047012964608
>>> id(d3)
140047012965120
>>> arr.append(4)
>>> d1["bar"] += "x"
>>> d1
{'foo': [1, 2, 3, 4], 'bar': 'bazx'}
>>> d2
{'foo': [1, 2, 3, 4], 'bar': 'bazx'}
>>> d3
{'foo': [1, 2, 3, 4], 'bar': 'baz'}

d1 and d2 are the same dict, so when you change the value of bar both show the change, while d3 still has the old value. But even though d3 is a separate dict, the arr inside it is the same list object, so anything that mutates that list changes it in all three dicts.

This sort of behaviour has its logic, but it is a logic you have to learn, and one you often have to learn the painful way. I particularly ‘enjoy’ cases where you’ve gone to great lengths to make sure an object is a copy and then find that something inside that copy still gets mutated.
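
A small illustration of that trap, and the copy.deepcopy escape hatch:

import copy

d1 = {"foo": [1, 2, 3], "bar": "baz"}
shallow = dict(d1)        # a new dict, but the inner list is still the same object
deep = copy.deepcopy(d1)  # copies the inner list too

d1["foo"].append(4)
print(shallow["foo"])     # [1, 2, 3, 4] -- mutated via the shared list
print(deep["foo"])        # [1, 2, 3]    -- unaffected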

It doesn’t have to be like this.

In what is probably the most beautiful language I’ve ever used (though sadly only in personal projects, not for work), Haskell, everything is immutable.

If you’ve never tried Haskell this might sound incomprehensible, but it’s just a different approach to programming. Two good links about this:

Immutability is Awesome

What does immutable variable in Haskell mean

This language feature eliminates a whole class of bugs related to changing an object unexpectedly. So I’d encourage any software developers to try to get their heads around a language like Haskell. It opened my eyes to a whole different approach to writing code (e.g. focussing on individual functions rather than starting with a db schema and working out from there).

Software Development

Coding with ChatGPT

I’ve been using ChatGPT to help with some coding problems. In all the cases I’ve tried it has been wrong but has given me useful ideas. I’ve seen some extremely enthusiastic people who are saying that ChatGPT writes all their code for them. I can only assume that they mean it is applying common patterns for them and saving boilerplate work. Here is a recent example of an interaction I had with ChatGPT as an illustration.

The initial prompt:

Hi, I want to write a python function that will find common subsets that can be extracted from a list of sets. A common subset is one where several elements always appear together.

For example with the following sets:
s1 = {“a”,”b”,”c”}
s2 = {“a”,”b”,”c”}
s3 = {“c”}
s4 = {“d”,”e”}
s5 = {“d”,”e”,”f”}
s6 = {“d”,”e”,”f”,”g”}

The function should return
[{“a”,”b”},{“d”,”e”}]

What I liked about using it:

  1. It forced me to think about an individual function that can be tested in isolation
  2. It forced me to think really explicitly in terms of the inputs and outputs of the function
  3. The answers it provided, specifically using itertools.combinations, gave me a good direction to try out.

What I didn’t like about using it:

  1. The code didn’t work
  2. It gives the impression of knowing what it’s talking about. I have to keep reminding myself that it’s just producing a wall of text/code based on what it has seen in the past. Some of the stuff it generates is complete nonsense (see below when it tries to explain what output a particular function would create, or whether a set is a subset of another set).
  3. It’s a walled garden. Whereas in another life I might have posted my question on Stack Overflow and got an answer that others could also benefit from, this discussion is just between me and ChatGPT. Not that Stack Overflow is a panacea – I’ve still got an unanswered question on there from 3+ years ago – but, nevertheless, there is an important principle at stake.

See below the screenshots of the whole interaction:

Needless to say, this final piece of code didn’t return the expected output either.

Full code of the above conversation is below for anyone who wants to do some cut and paste.

Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> def common_subsets(sets):
...     # Initialize an empty set to store the common subsets
...     common = set()
...     # Iterate through all possible subsets of the first set
...     for subset in itertools.chain.from_iterable(itertools.combinations(sets[0], r) for r in range(len(sets[0])+1)):
...         subset = set(subset)
...         # Check if the current subset is present in all other sets
...         if all(subset <= s for s in sets[1:]):
...             common.add(subset)
...     return list(common)
... 
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in common_subsets
NameError: name 'itertools' is not defined
>>> import itertools
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 9, in common_subsets
TypeError: unhashable type: 'set'
>>> def common_subsets(sets):
...     # Initialize an empty set to store the common subsets
...     common = set()
...     # Iterate through all possible subsets of the first set
...     for subset in itertools.chain.from_iterable(itertools.combinations(sets[0], r) for r in range(len(sets[0])+1)):
...         subset = frozenset(subset)
...         # Check if the current subset is present in all other sets
...         if all(subset <= s for s in sets[1:]):
...             common.add(subset)
...     return list(common)
... 
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[frozenset()]
>>> from typing import List
>>> 
>>> def common_subsets(sets:List[set]) -> List[set]:
...     common = set()
...     for subset in itertools.chain.from_iterable(itertools.combinations(sets[0], r) for r in range(1,len(sets[0])+1)):
...         subset = frozenset(subset)
...         if all(subset <= s for s in sets[1:]):
...             common.add(subset)
...     return list(common)
... 
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[]
>>> from typing import List
>>> 
>>> def common_subsets(sets:List[set]) -> List[set]:
...     subsets = [set(itertools.combinations(s, r)) for s in sets for r in range(1, len(s) + 1)]
...     common = set.intersection(*subsets)
...     return list(common)
... 
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[]
>>> from typing import List
>>> 
>>> def common_subsets(sets:List[set]) -> List[set]:
...     # find all subsets of all sets
...     subsets = [set(itertools.combinations(s, r)) for s in sets for r in range(1, len(s) + 1)]
...     # find subsets that are in all subsets list
...     common = set.intersection(*subsets)
...     # check if subsets are present in all sets
...     common = [subset for subset in common if all(subset <= s for s in sets)]
...     return list(common)
... 
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>> 
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[]

And at the risk of embarrassing myself in front of the whole world, and for the benefit of any future ChatGPT iteration that is reading this post, below is the solution I did come up with. Improvements welcome 😀

import itertools
from typing import List

def get_all_set_combinations(sets: List):
    all_combinations = set()
    for s in sets:
        if len(s) == 1:
            continue
        for l in range(2,len(s)+1):
            combos = itertools.combinations(s,l)
            for x in combos:
                all_combinations.add(tuple(x))
    return all_combinations

def find_extractable_subsets(sets: List):
    combos = get_all_set_combinations(sets)
    matching = set()
    for combo in sorted(combos, key=len, reverse=True):
        combo_set = set(combo)
        if not is_candidate_set_extractable(combo_set, sets):
            continue
        addable = True
        for x in matching:
            if combo_set & set(x) == combo_set:
                addable = False
                break
        if addable:
            matching.add(combo)
    return matching

def is_candidate_set_extractable(candidate, sets):
    for s in sets:
        # this candidate is extractable from this set if it is either fully
        # included in the set or doesn't overlap it at all
        if (candidate & s) == candidate or (candidate & s) == set():
            continue
        else:
            return False
    return True


### And can be tested with:
s1 = {"a","b","c"}
s2 = {"a","b","c"}
s3 = {"c"}
s4 = {"d","e"}
s5 = {"d","e","f"}
s6 = {"d","e","f","g"}
find_extractable_subsets([s1,s2,s3,s4,s5,s6])

# With the expected result:
# {('b', 'a'), ('e', 'd')}

# it only picks the longest matching subsets, e.g.
find_extractable_subsets([s1,s2,s4,s5,s6])

# produces expected result:
# {('e', 'd'), ('b', 'c', 'a')}