As we get stuck into 2024, I am wondering how many of the product and engineering best practices that we’ve seen develop over the past 10-15 years will turn out to be ZIRPs (“zero interest rate phenomena”).
Burning loads of VC cash to hire loads of people to build more stuff brings its own set of challenges. The processes and best practices that evolve will work well with these sorts of challenges. When there is less investment money floating around you need to be leaner and more intentional in where you spend your time. The processes that work best in a leaner environment won’t necessarily be the same as those that worked in the sorts of VC-backed outfits we’ve seen over the last decade.
It’s a common trope that you shouldn’t just pick a process that worked in a company whose brand you admire and expect it to work in your company. What I’m saying here is a bit stronger than that. I’m saying that any product or engineering advice coming out of the VC-backed cash-burning-machine era might need to be chucked in the bin in these leaner times.
It’s not just that a bigger company’s processes might not work for where your company is right now. The sorts of processes that worked in a ZIRP era business model might just not make sense in a leaner era.
Here comes the second reason that makes it even more important for you to question ZIRP-era process advice.
As a company gets bigger, the people who have the time and inclination to write about processes get further and further removed from the reality on the ground. Even in @simfoni it’s difficult for me to know what an individual product iteration really involves for the people doing the work. If you’re reading a blog post by a tech leader of a company with 200 or 1,000 engineers, how confident should you be that the blog reflects reality? Or does it reflect what the writer thinks is happening, or how they would like things to be happening?
A great example of this is the Spotify Model. The Spotify model is (was?) an organizational structure that inspired a lot of imitators until it turned out that even Spotify didn’t use the Spotify model: https://www.agility11.com/blog/2020/6/22/spotify-doesnt-use-the-spotify-model It was as much a leadership aspiration as it was a statement of reality.
To sum up, I see 2 things combining here:
Processes developed in ZIRP-era VC-backed companies might fundamentally not work in leaner times
The processes as described in blog posts and analyst reports maybe never even existed in real life
So a 2024 resolution for myself, and a call to arms for anyone reading, is to “question all the blog posts”. Even more than you normally would.
Except for this one, of course. This one is bang on the money 😀
I wanted to implement some simple styling. It was surprisingly tricky to piece together the relevant information from various Stack Overflow and Medium posts, so here is what I ended up with in case it’s useful for others.
The DIRS section tells Django where to look locally for templates. If it doesn’t find a template there, it falls back to other, more magical places, which include the templates that come with the allauth library.
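For reference, the relevant part of settings.py looks something like this (the path in DIRS is illustrative; a real settings.py derives BASE_DIR from __file__):

```python
from pathlib import Path

BASE_DIR = Path(".").resolve()  # a real settings.py derives this from __file__

TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        # Django searches DIRS first; templates not found here are picked up
        # by the app template loaders, including allauth's bundled templates
        "DIRS": [BASE_DIR / "templates"],
        "APP_DIRS": True,
        "OPTIONS": {"context_processors": []},
    },
]
```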
Then it’s just a case of finding which template you need from the allauth library. In my case I just wanted to apply some basic formatting from the other pages in the app to the allauth pages. So I just had to copy https://github.com/pennersr/django-allauth/blob/main/allauth/templates/allauth/layouts/base.html to an equivalent location within my local templates directory and add one line at the top:
{% include 'layouts/main-styling.html' %}
Simple when you know how, right?
Up until this point the styling file was living in the relevant app directory in the project, so I moved it to a shared location in templates where it can be accessed by any part of the project. The rest of the updates in the commit are:
moving this file and changing the rest of Syracuse to use the new file location
implementing a little snippet to show an appropriate message/links depending on whether users are logged in or not
Since then I’ve finished adding the remaining types of content that I have to the UI, so you can now see information on corporate finance activities, senior appointment activities and location-related activities (e.g. adding a new site, exiting a territory). This is all part of building up a picture of how an organization evolves over time using information extracted from unstructured data sources.
I want to write about two things that came up while I was doing this, which reminded me of why some things are good to do and some aren’t!
CSV, CSV, CSV
The bad news was that adding the location and appointment activities into the UI showed that there were some inconsistencies in how the different types were represented in the data. The good news was that the inconsistencies weren’t too hard to fix. All the data was stored as RDF triples in JSON-LD format, which made it pretty trivial to regenerate. It would have been a lot harder to do if the data had been stored in a structured database. Once you start getting data into a database, even the smallest schema change can get very complicated to handle. So I’m glad I follow the advice of one of the smartest developers I ever worked with: don’t assume you need a database; if a CSV file can handle your requirements then go with that.
In fact one feature I implemented does use a CSV file as its storage. Easier than using a database for now. My preferred approach for data handling is:
CSV
JSON
Database
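To make that concrete, here is a minimal sketch of the CSV-first approach (the file name and fields are made up for illustration, not taken from the project):

```python
import csv
from pathlib import Path

DATA_FILE = Path("activities.csv")  # hypothetical file name

def save_rows(rows):
    # Plain CSV: trivial to inspect, diff, and regenerate; no schema migrations
    with DATA_FILE.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["org", "activity"])
        writer.writeheader()
        writer.writerows(rows)

def load_rows():
    with DATA_FILE.open(newline="") as f:
        return list(csv.DictReader(f))

save_rows([{"org": "Core Scientific", "activity": "acquisition"}])
```

If the requirements outgrow this, swapping in JSON or a database later is a deliberate step rather than a default.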
Test, Test, Test
Adding these new data types into the graph made the already quite messy code even messier. It was ripe for a refactor. I then had to do a fair amount of work to get things working correctly, which involved a lot of refreshing a web page, tweaking some code, then repeating.
My initial instinct was to think that I didn’t have time to write any tests.
But guess what… it was only when I finally wrote some tests that I got to the bottom of the problems and fixed them all. All it took was 2 integration tests and I quickly fixed the issues. You don’t need 100% code coverage to make testing worthwhile. Even the process of putting together some test data for these integration tests helped to identify where some of the bugs were.
The app is currently live at https://syracuse.1145.am, I hope you enjoy it. The web app is running on a Digital Ocean droplet and the backend database is in Neo4j’s AuraDB free tier. So there is a fair amount of traffic going backwards and forwards, which means the app isn’t super fast. But hopefully it gives a flavor.
Since that post I’ve implemented a new backend with Neo4j and created an app for accessing the data[1]. It’s here: https://github.com/alanbuxton/syracuse-neo. The early commits have decent messages showing my baby steps in getting from one capability to the next[2].
The previous app stored each topic collection separately in a Postgres database. A topic collection is the stories taken from one article, and there could be several stories within one article. This was ok as a starting point, but the point of this project is to connect the dots between different entities based on NLPing articles, so really I needed a graph database to plug things together.
Heroku doesn’t support Neo4j, so I’ve moved the app to a Digital Ocean VM that uses Neo4j’s free-tier AuraDB. It’s hosted here http://syracuse.1145.am and has just enough data in it to fit within the AuraDB limits.
The graph visualization is done with vis.js. This turned out to be pretty straightforward to code: you just need some javascript for the nodes and some javascript for the edges. So long as your ids are all unique everything seems pretty straightforward.
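For illustration, the data ends up being roughly this shape (the node and edge contents here are made up); in the browser the whole thing is handed to vis.js with something like `new vis.Network(container, {nodes, edges}, options)`:

```javascript
// Illustrative data in the shape vis.js expects: ids must be unique
const nodes = [
  { id: "org-1", label: "Core Scientific" },
  { id: "org-2", label: "Stax Digital" },
  { id: "act-1", label: "Acquisition" },
];
const edges = [
  { from: "org-1", to: "act-1", label: "buyer" },
  { from: "act-1", to: "org-2", label: "target" },
];

// Sanity-check the uniqueness constraint mentioned above
const ids = new Set(nodes.map((n) => n.id));
console.assert(ids.size === nodes.length, "duplicate node ids");

// In the browser this gets handed to vis.js, e.g.:
// new vis.Network(document.getElementById("graph"), { nodes, edges }, {});
```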
Visualizing this sort of data in a graph makes it a lot more immediate than before. I just want to share a few entertaining images to show one feature I worked on.
The underlying data has a node for every time an entity (e.g. organization) is mentioned. This is intentional, because when processing an article you can’t tell whether the name of a company in one article refers to the same company as a similarly-named company in a different article[3]. So each node in the graph is a mention of an organization, and there is some separate logic to figure out whether two nodes are the same organization or not. For example, if two mentions have a similar name and industry then it’s likely that they are the same organization.
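As a purely hypothetical sketch of what that separate logic can look like (the function name, fields and matching rules here are mine, not the project’s actual implementation):

```python
def probably_same_org(a: dict, b: dict) -> bool:
    """Heuristic: two organization mentions are likely the same entity
    if their names match loosely and they share an industry.
    Illustrative only; real entity resolution is much more involved."""
    name_a, name_b = a["name"].lower(), b["name"].lower()
    # loose name match: identical, or one name contained in the other
    similar_name = name_a == name_b or name_a in name_b or name_b in name_a
    return similar_name and a.get("industry") == b.get("industry")
```

So "Stax Digital" and "Stax Digital, LLC" in the same industry would be merged, while two similarly-named companies in different industries would not.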
This sometimes led to a ball of lines that looks like Pig-Pen’s hair.
On the plus side, this does make for a soothing visual as the graph library tries to move the nodes into a sensible shape. With a bit of ambient sound this could make a good relaxation video.
But, pretty though this may be, it’s hard to read. So I implemented an ‘uber node’ that is the result of clubbing together all the “same as” nodes. A lot more readable, see below:
Below is an example of the same graph after all the Accel Partners nodes had been combined together.
Next steps:
Implement the other types of topic collections into this graph (e.g. people appointments, opening new locations)
Implement a feature to easily flag any incorrect relationships or entities (which can then feed back into the ML training)
[2] For example, git checkout b2791fdb439c18026585bced51091f6c6dcd4f72 is a good one for complete newbies to see some basic interactions between Django and Neo4j. ↩︎
[3] Also, this type of reconciliation is a difficult problem – I have some experience of it from this project: https://theybuyforyou.eu/business-cases/ – so it’s safer to process the articles and then have a separate process for combining the topics together. ↩︎
I got some feedback that the structure of the relationships was a bit unintuitive. Fair enough, let’s update it to make more sense.
Previously code was doing the following:
Use RoBERTa-based LLM for entity extraction (the code is pretty old but works well)
Use Benepar dependency parsing to link up relevant entities with each other
Use FlanT5 LLM to dig out some more useful content from the text
Build up an RDF representation of the data
Clean up the RDF
Step 5 had got quite complex.
Also, I had a look at import/export of RDF in graphs – specifically Neo4j – but couldn’t see much excitement about RDF. I even made a PR to update some of the Neo4j / RDF documentation. It’s been stuck for 2+ months.
I wondered if a better approach would be to start again using a different set of technologies. Specifically,
Just building the representation in a graph rather than using RDF
Falcon7B was very exciting to get a chance to try out. But in my use case it wasn’t any more useful than FlanT5.
Going down the graph route was a bit of fun for a while. I’ve used networkx quite a bit in the past so thought I’d try that first. But, guess what, it turned out more complicated than I needed. Also, I do like the simplicity and elegance of RDF, even if it makes me seem a bit old.
So the final choice was to rip up all my post-processing and turn it into pre-processing, and then generate the RDF. It was heart-breaking to throw away a lot of code, but, as programmers, I think we know when we’ve built something that is just too brittle and needs some heavy refactoring. It worked well in the end, see the git stats below:
Yes, a lot of tests. It’s a data-heavy application so there are a lot of tests to make sure that data is transformed as expected. Whenever it doesn’t work, I add that data (or enough of it) as a test case and then fix it. Most of this test data was just changed with global find/replace so it’s not a big overhead to maintain. But having all those tests was crucial for doing any meaningful refactoring.
On the code side, it was very satisfying to remove more code than I was adding. It just showed how brittle and convoluted the codebase had become. As I discovered more edge cases I added more logic to deal with them, and eventually this ended up as a lot of complexity. The new code is “cleaner”. I put clean in quotes because there is still a lot of copy/paste in there and similar functions doing similar things. This is because I like to follow “Make it work, make it right, make it fast”. Code that works but isn’t super elegant is going to be easier to maintain/fix/refactor later than code that is super-abstracted.
Some observations on the above:
Tests are your friend (obviously)
Expect to need major refactoring in the future. However well you capture all the requirements now, there will be plenty that have not yet been captured, and plenty of need for change
Shiny new toys aren’t always going to help – approach with caution
Sometimes the simple old-fashioned technologies are just fine
However bad you think an app is, there is probably still 80% in there that is good, so beware of completely starting from scratch.
See below for the RDF as it stands now compared to before:
Current version
@prefix ns1: <http://example.org/test/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/test/abc/Core_Scientific> a org:Organization ;
    ns1:basedInRawLow "USA" ;
    ns1:buyer <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> ;
    ns1:description "Artificial Intelligence and Blockchain technologies" ;
    ns1:foundName "Core Scientific" ;
    ns1:industry "Artificial Intelligence and Blockchain technologies" ;
    ns1:name "Core Scientific" .

<http://example.org/test/abc/Stax_Digital> a org:Organization ;
    ns1:basedInRawLow "LLC" ;
    ns1:description "blockchain mining" ;
    ns1:foundName "Stax Digital, LLC",
        "Stax Digital." ;
    ns1:industry "blockchain mining" ;
    ns1:name "Stax Digital" .

<http://example.org/test/abc/Stax_Digital_Assets_Acquisition> a ns1:CorporateFinanceActivity ;
    ns1:activityType "acquisition" ;
    ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
    ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
        "Core Scientific completes acquisition of Stax Digital." ;
    ns1:foundName "acquired",
        "acquisition" ;
    ns1:name "acquired",
        "acquisition" ;
    ns1:status "completed" ;
    ns1:targetDetails "assets" ;
    ns1:targetEntity <http://example.org/test/abc/Stax_Digital> ;
    ns1:targetName "Stax Digital" ;
    ns1:whereRaw "llc" .
Previous version
@prefix ns1: <http://example.org/test/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/test/abc/Core_Scientific> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInLow "USA" ;
    ns1:description "Artificial Intelligence and Blockchain technologies" ;
    ns1:foundName "Core Scientific" ;
    ns1:industry "Artificial Intelligence and Blockchain technologies" ;
    ns1:name "Core Scientific" ;
    ns1:spender <http://example.org/test/abc/Purchase_Stax_Digital> .

<http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> a ns1:TargetDetails ;
    ns1:label "Assets" ;
    ns1:name "Acquired Assets Stax Digital, LLC" ;
    ns1:nextEntity "Stax Digital, LLC" ;
    ns1:previousEntity "acquired" ;
    ns1:targetEntity <http://example.org/test/abc/Stax_Digital> .

<http://example.org/test/abc/Purchase_Stax_Digital> a ns1:Activity ;
    ns1:activityType "Purchase" ;
    ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
    ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
        "Core Scientific completes acquisition of Stax Digital." ;
    ns1:label "Acquired",
        "Acquisition" ;
    ns1:name "Purchase Stax Digital" ;
    ns1:targetDetails <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> ;
    ns1:whenRaw "has happened, no date available" .

<http://example.org/test/abc/Stax_Digital> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInLow "LLC" ;
    ns1:description "blockchain mining" ;
    ns1:foundName "Stax Digital, LLC",
        "Stax Digital." ;
    ns1:industry "blockchain mining" ;
    ns1:name "Stax Digital" .
DIFF
1a2
> @prefix org: <http://www.w3.org/ns/org#> .
4,5c5,7
< <http://example.org/test/abc/Core_Scientific> a <http://www.w3.org/ns/org#Organization> ;
< ns1:basedInLow "USA" ;
---
> <http://example.org/test/abc/Core_Scientific> a org:Organization ;
> ns1:basedInRawLow "USA" ;
> ns1:buyer <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> ;
9,10c11
< ns1:name "Core Scientific" ;
< ns1:spender <http://example.org/test/abc/Purchase_Stax_Digital> .
---
> ns1:name "Core Scientific" .
12,31c13,14
< <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> a ns1:TargetDetails ;
< ns1:label "Assets" ;
< ns1:name "Acquired Assets Stax Digital, LLC" ;
< ns1:nextEntity "Stax Digital, LLC" ;
< ns1:previousEntity "acquired" ;
< ns1:targetEntity <http://example.org/test/abc/Stax_Digital> .
<
< <http://example.org/test/abc/Purchase_Stax_Digital> a ns1:Activity ;
< ns1:activityType "Purchase" ;
< ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
< ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
< "Core Scientific completes acquisition of Stax Digital." ;
< ns1:label "Acquired",
< "Acquisition" ;
< ns1:name "Purchase Stax Digital" ;
< ns1:targetDetails <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> ;
< ns1:whenRaw "has happened, no date available" .
<
< <http://example.org/test/abc/Stax_Digital> a <http://www.w3.org/ns/org#Organization> ;
< ns1:basedInLow "LLC" ;
---
> <http://example.org/test/abc/Stax_Digital> a org:Organization ;
> ns1:basedInRawLow "LLC" ;
36a20,34
>
> <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> a ns1:CorporateFinanceActivity ;
> ns1:activityType "acquisition" ;
> ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
> ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
> "Core Scientific completes acquisition of Stax Digital." ;
> ns1:foundName "acquired",
> "acquisition" ;
> ns1:name "acquired",
> "acquisition" ;
> ns1:status "completed" ;
> ns1:targetDetails "assets" ;
> ns1:targetEntity <http://example.org/test/abc/Stax_Digital> ;
> ns1:targetName "Stax Digital" ;
> ns1:whereRaw "llc" .
Disclaimer – I don’t know anything about infectious diseases, so apologies if I’ve misunderstood any terminology. Please refer back to Mark’s post for the in-depth analysis of what the various cypher queries mean and just use this post to get started quicker.
Pre-requisites: You are using Neo4J Desktop with APOC and Neosemantics installed, so your Neo4J Desktop looks something like this:
First set the config for the graph
CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE;
CALL n10s.graphconfig.init({handleVocabUris: "MAP"});
And then set up the prefix mappings for the wikidata namespaces. The calls will look something like this (the wdt prefix for http://www.wikidata.org/prop/direct/ is the one we rely on later in this post):
CALL n10s.nsprefixes.add("wdt", "http://www.wikidata.org/prop/direct/");
CALL n10s.nsprefixes.add("wd", "http://www.wikidata.org/entity/");
You should get a result that looks something like:
If there were any errors they would appear in the extraInfo field.
Let’s just check what nodes we have in the graph
MATCH (n)
RETURN n
The green node is the GraphConfig node. The virus definitions are just flagged with the generic ‘Resource’ label. To make things more useful later on, when we have other types of data in the graph, we can add a new label to the Virus nodes:
MATCH (n:Resource)
SET n:Virus
RETURN *
The 300 items are the ones matched by this Cypher statement, and they now all have the Virus label and a fetching yellow colour.
Mark’s blog post talks about cleaning up some cases where there is a shortcut between two CHILD_OF relationships. It’s a fun example of Variable-length pattern matching so worth a bit of a play with.
In my dataset the following Cypher query shows one such example:
MATCH (v:Virus {name: "Rotavirus"})-[co:CHILD_OF]->(p:Virus)
RETURN *
Rotavirus is a child of Sederovirinae (top left), Sedoreoviridae (top right) and Reoviridae (bottom left). Sederovirinae (top left) is a child of Reoviridae. So the CHILD_OF from Rotavirus to Reoviridae is a shortcut path that we want to remove.
To identify all the shortcuts we need to find all the cases where Virus v1 is a direct child of Virus v2 and, at the same time, the same Virus v1 has another path of at least 2 CHILD_OF steps to v2. That can be captured in the following Cypher query:
MATCH (v2:Virus)<-[shortcut:CHILD_OF]-(v1:Virus)-[co:CHILD_OF*2..]->(v2:Virus)
RETURN *
To delete these shortcuts
MATCH (v2:Virus)<-[shortcut:CHILD_OF]-(v1:Virus)-[co:CHILD_OF*2..]->(v2:Virus)
DELETE shortcut
This will return the number of rows deleted. To satisfy yourself that these are, indeed, deleted, you can re-run the Rotavirus query above which will now only show that Rotavirus is a child of Sederovirinae and Sedoreoviridae
To continue in Mark’s footsteps, let’s add some other entities to the graph. The syntax for adding mappings has changed, so we need to tweak the older command. In this case we added a prefix for “http://www.wikidata.org/prop/direct/” earlier on so we can just go ahead and add the mappings without adding a prefix first.
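With current Neosemantics versions the tweaked mapping call looks something like this, linking the wikidata “host” property (wdt:P2975, the one used in the query below) to a HOST relationship; the relationship name itself is a choice, so check the n10s docs for your version:

```cypher
CALL n10s.mapping.add("http://www.wikidata.org/prop/direct/P2975", "HOST");
```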
Then load the data with the query below. Warning: this query will take a long time to run because it makes one request per virus to get the host information. In my case: Started streaming 7139 records after 22 ms and completed after 19590 ms, displaying first 1000 rows.
MATCH (r:Virus)
WITH n10s.rdf.getIRILocalName(r.uri) AS virus, r
WITH 'prefix schema: <http://schema.org/>
CONSTRUCT {
wd:' + virus + ' wdt:P2975 ?host.
?host rdfs:label ?hostName ;
rdf:type schema:Host
}
WHERE {
OPTIONAL {
wd:' + virus + ' wdt:P2975 ?host.
?host rdfs:label ?hostName.
filter(lang(?hostName) = "en")
}
}' AS query, r
CALL n10s.rdf.import.fetch("https://query.wikidata.org/sparql?query=" + apoc.text.urlencode(query),
"JSON-LD",
{ headerParams: { Accept: "application/ld+json"}})
YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo
RETURN r.name, terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo;
In this case we’re using rdf:type schema:Host to specify that this is a resource of type Host. This graph is using the default graphConfig handleRDFTypes setting, which means that the rdf type is assigned as a label to the node for you.
Let’s see how many hosts we have
MATCH (n:Host)
RETURN *
And a quick query to see the relationships between Virus and Host
MATCH (v:Virus)-[:HOST]->(h:Host)
RETURN *
From here you should be able to continue working on the more involved queries in Mark’s blog post.
I’ve now converted all the topics to RDF format so they can be represented natively in graph format. I essentially made up the RDF (with some inspiration from ChatGPT, naturally) so I’m sure it can be hugely improved.
Next steps I want to do are [a] to make the RDF more in line with appropriate standards and then [b] stitch together topics that relate to the same entity/location to create a timeline.
Non-cherry-picked results of the first example of each type (M&A, Appointment and Location) below. The demo is still running at http://syracuse-1145.herokuapp.com/ and I’d love to hear from anyone who finds the tool interesting.
The Agile Manifesto came about in Feb 2001. That makes it 22 years old now. There are people commercially writing code today who weren’t even born when it came out.
Goodness, I feel old.
The values that underpin the Agile Manifesto remain sound and the best way to approach your software development. But I have seen many times that in an effort to ‘do agile’, teams lose the ability to ‘be agile‘.
If you want to avoid this trap, bear in mind the context that the Agile Manifesto came out of.
Conventional wisdom at the time was, for example: spend 1 month scoping the project, 3 months documenting requirements, 12 months building, 3 months testing and 6 months rolling out. This sort of order of things, with checkpoints at each step along the way.
There was a whole generation of project managers who prided themselves on delivering projects on time, on budget, on scope. I was one of them. Yet at the same time, the companies investing in tech would complain of over-runs in all of these factors.
Why the disconnect? Because of the focus on scope (building features) rather than impact (solving problems). As a project manager working in one of these projects you could build a successful career by agreeing a scope, writing it down and delivering it. Then, when the inevitable changes arose you’d handle those as individual ‘change orders’, each adding more scope which you would then deliver on time, on budget and so forth (and get paid more for doing). But the users would end up paying many times as much as initially expected over a much longer timescale before they saw the benefits that they had been looking for.
Agile changed all that by, basically, saying “let’s talk more between the programmers and the users and let’s deliver in smaller increments so we can keep focus on the most important things”.
That was the kernel and the genius of Agile.
At some point things started to go wrong. In my view it’s when Agile became equated with Scrum; anything non-scrum became called Waterfall and therefore not to ever be mentioned again.
Agile is not a methodology. Agile isn’t something you do. Agile is something that following those 4 simple values can help you be.
So as we look back on the arrival of Agile on the scene all those years ago, I urge you to forget about your Kanban and Scrum and Roadmaps and Mob Programming for a moment and just remind yourself of the 4 key values that can sit underneath all these attempts to become more agile in how we build software:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
>>> arr = [1,2,3]
>>> arr2 = arr
>>> arr
[1,2,3]
>>> arr2
[1,2,3]
>>> arr == arr2
True # Values are equivalent, objects are also the same
>>> id(arr)
4469618560 # the id for both arr and arr2 is the same: they are the same object
>>> id(arr2)
4469618560
>>> arr2.append(4)
>>> arr
[1,2,3,4]
>>> arr2
[1,2,3,4]
compared to
>>> arr = [1,2,3]
>>> arr2 = [1,2,3] + [] # arr2 is now a different object
>>> arr
[1,2,3]
>>> arr2
[1,2,3]
>>> arr == arr2
True # Values are the same even though the object is different
>>> id(arr)
4469618560
>>> id(arr2)
4469669184 # the id is different: arr2 is a separate object
>>> arr2.append(4)
>>> arr
[1,2,3]
>>> arr2
[1,2,3,4]
Or, in a case that you’re more likely to see in the wild
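A minimal reconstruction of the kind of snippet meant here (assuming d1 and d2 alias one dict, d3 is a shallow copy, and arr is a list stored inside them; the exact values are illustrative):

```python
arr = [1, 2, 3]
d1 = {"foo": "hello", "bar": "world", "arr": arr}
d2 = d1        # d2 is the very same dict object as d1
d3 = dict(d1)  # d3 is a new dict, but a *shallow* copy: it shares the inner list

d2["bar"] = "everyone"
# d1["bar"] is now "everyone" too; d3["bar"] still has the old value "world"

d2["arr"].append(4)
# the list is shared by all three dicts, so d1["arr"], d2["arr"] and
# d3["arr"] are all [1, 2, 3, 4]
```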
d1 and d2 are the same dict, so when you change the value of bar through one of them, both show the same result, but d3 still has the old value. And even though d3 is a different dict, the arr inside it is the same list object. So anything that mutates that list will change it in all the dicts.
This sort of behaviour has its logic but it is a logic you have to learn, and a logic you often have to learn the painful way. I particularly ‘enjoy’ cases where you’ve gone to great lengths to make sure the object is a copy and then find that something inside that list gets mutated.
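For the record, the stdlib’s copy module is where the shallow/deep distinction bites:

```python
import copy

arr = [1, 2, 3]
d1 = {"arr": arr}
d2 = copy.copy(d1)      # shallow copy: a new dict, but the inner list is shared
d3 = copy.deepcopy(d1)  # deep copy: the inner list is copied as well

arr.append(4)
# d2["arr"] sees the mutation; d3["arr"] is still [1, 2, 3]
```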
It doesn’t have to be like this.
In what is probably the most beautiful language I’ve ever used (though sadly only in personal projects, not for work), Haskell, everything is immutable.
If you’ve never tried Haskell this might sound incomprehensible, but it’s just a different approach to programming. Two good links about this:
This language feature eliminates a whole class of bugs related to changing an object unexpectedly. So I’d encourage any software developers to try to get their heads around a language like Haskell. It opened my eyes to a whole different approach to writing code (e.g. focussing on individual functions rather than starting with a db schema and working out from there).
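For anyone who hasn’t seen Haskell, a tiny flavour of what immutability means there: there is no way to mutate xs below, only to build new values from it.

```haskell
main :: IO ()
main = do
  let xs = [1, 2, 3]
      ys = xs ++ [4]  -- builds a brand new list; xs can never change
  print xs            -- [1,2,3]
  print ys            -- [1,2,3,4]
```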
I’ve been using ChatGPT to help with some coding problems. In all the cases I’ve tried it has been wrong but has given me useful ideas. I’ve seen some extremely enthusiastic people who are saying that ChatGPT writes all their code for them. I can only assume that they mean it is applying common patterns for them and saving boilerplate work. Here is a recent example of an interaction I had with ChatGPT as an illustration.
The initial prompt:
Hi, I want to write a python function that will find common subsets that can be extracted from a list of sets. A common subset is one where several elements always appear together.
For example with the following sets: s1 = {"a","b","c"} s2 = {"a","b","c"} s3 = {"c"} s4 = {"d","e"} s5 = {"d","e","f"} s6 = {"d","e","f","g"}
The function should return [{"a","b"},{"d","e"}]
What I liked about using it:
It forced me to think about an individual function that can be tested in isolation
It forced me to think really explicitly in terms of the inputs and outputs of the function
The answers it provided, specifically using itertools.combinations, gave me a good direction to try out.
What I didn’t like about using it:
The code didn’t work
It gives the impression of knowing what it’s talking about. I have to keep reminding myself that it’s just producing a wall of text/code based on what it has seen in the past. Some of the stuff it generates is complete nonsense (see below when it tries to explain what output a particular function would create, or whether a set is a subset of another set).
It’s a walled garden. Whereas in another life I might have posted my question on Stack Overflow and got an answer that others could also benefit from, this discussion is just between me and ChatGPT. Not that Stack Overflow is a panacea – I’ve still got an unanswered question on there from 3+ years ago – but, nevertheless, there is an important principle at stake.
See below the screenshots of the whole interaction:
Needless to say, this final piece of code didn’t return the expected output either.
Full code of the above conversation is below for anyone who wants to do some cut and paste.
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> def common_subsets(sets):
...     # Initialize an empty set to store the common subsets
...     common = set()
...     # Iterate through all possible subsets of the first set
...     for subset in itertools.chain.from_iterable(itertools.combinations(sets[0], r) for r in range(len(sets[0])+1)):
...         subset = set(subset)
...         # Check if the current subset is present in all other sets
...         if all(subset <= s for s in sets[1:]):
...             common.add(subset)
...     return list(common)
...
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>>
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in common_subsets
NameError: name 'itertools' is not defined
>>> import itertools
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>>
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 9, in common_subsets
TypeError: unhashable type: 'set'
>>> def common_subsets(sets):
...     # Initialize an empty set to store the common subsets
...     common = set()
...     # Iterate through all possible subsets of the first set
...     for subset in itertools.chain.from_iterable(itertools.combinations(sets[0], r) for r in range(len(sets[0])+1)):
...         subset = frozenset(subset)
...         # Check if the current subset is present in all other sets
...         if all(subset <= s for s in sets[1:]):
...             common.add(subset)
...     return list(common)
...
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>>
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[frozenset()]
>>> from typing import List
>>>
>>> def common_subsets(sets:List[set]) -> List[set]:
...     common = set()
...     for subset in itertools.chain.from_iterable(itertools.combinations(sets[0], r) for r in range(1,len(sets[0])+1)):
...         subset = frozenset(subset)
...         if all(subset <= s for s in sets[1:]):
...             common.add(subset)
...     return list(common)
...
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>>
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[]
>>> from typing import List
>>>
>>> def common_subsets(sets:List[set]) -> List[set]:
...     subsets = [set(itertools.combinations(s, r)) for s in sets for r in range(1, len(s) + 1)]
...     common = set.intersection(*subsets)
...     return list(common)
...
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>>
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[]
>>> from typing import List
>>>
>>> def common_subsets(sets:List[set]) -> List[set]:
...     # find all subsets of all sets
...     subsets = [set(itertools.combinations(s, r)) for s in sets for r in range(1, len(s) + 1)]
...     # find subsets that are in all subsets list
...     common = set.intersection(*subsets)
...     # check if subsets are present in all sets
...     common = [subset for subset in common if all(subset <= s for s in sets)]
...     return list(common)
...
>>> s1 = {"a","b","c"}
>>> s2 = {"a","b","c"}
>>> s3 = {"c"}
>>> s4 = {"d","e"}
>>> s5 = {"d","e","f"}
>>> s6 = {"d","e","f","g"}
>>>
>>> print(common_subsets([s1,s2,s3,s4,s5,s6]))
[]
And at the risk of embarrassing myself in front of the whole world, and for the benefit of any future ChatGPT iteration that is reading this post, below is the solution I did come up with. Improvements welcome 😀
import itertools
from typing import List
def get_all_set_combinations(sets: List):
    all_combinations = set()
    for s in sets:
        if len(s) == 1:
            continue
        for l in range(2, len(s)+1):
            combos = itertools.combinations(s, l)
            for x in combos:
                all_combinations.add(tuple(x))
    return all_combinations

def find_extractable_subsets(sets: List):
    combos = get_all_set_combinations(sets)
    matching = set()
    for combo in sorted(combos, key=len, reverse=True):
        combo_set = set(combo)
        if not is_candidate_set_extractable(combo_set, sets):
            continue
        addable = True
        for x in matching:
            if combo_set & set(x) == combo_set:
                addable = False
                break
        if addable:
            matching.add(combo)
    return matching

def is_candidate_set_extractable(candidate, sets):
    for s in sets:
        # if this candidate is fully included in a set then it's a candidate to be extractable
        if (candidate & s) == candidate or (candidate & s) == set():
            continue
        else:
            return False
    return True
### And can be tested with:
s1 = {"a","b","c"}
s2 = {"a","b","c"}
s3 = {"c"}
s4 = {"d","e"}
s5 = {"d","e","f"}
s6 = {"d","e","f","g"}
find_extractable_subsets([s1,s2,s3,s4,s5,s6])
# With the expected result:
# {('b', 'a'), ('e', 'd')}
# it only picks the longest matching subsets, e.g.
find_extractable_subsets([s1,s2,s4,s5,s6])
# produces expected result:
# {('e', 'd'), ('b', 'c', 'a')}