
More software development lessons from my side project

Last month I wrote about migrating the syracuse codebase to Neo4j and changing the hosting from Heroku to Digital Ocean.

Since then I’ve finished adding the remaining types of content that I have to the UI, so you can now see information on corporate finance activities, senior appointment activities and location-related activities (e.g. adding a new site, exiting a territory). This is all part of building up a picture of how an organization evolves over time using information extracted from unstructured data sources.

I want to write about two things that came up while I was doing this which reminded me of why some things are good to do and some aren’t!

CSV, CSV, CSV

The bad news was that adding the location and appointment activities to the UI exposed some inconsistencies in how the different types were represented in the data. The good news was that the inconsistencies weren’t too hard to fix. All the data was stored as RDF triples in JSON-LD format, which made it pretty trivial to regenerate. It would have been a lot harder if the data had been stored in a structured database: once data is in a database, even the smallest schema change can get very complicated to handle. So I’m glad I follow the advice of one of the smartest developers I ever worked with: don’t assume you need a database; if a CSV file can handle your requirements then go with that.

In fact one feature I implemented does use a CSV file as its storage, which is easier than a database for now (there’s a rough sketch of the idea after the list below). My order of preference for data handling is:

  1. CSV
  2. JSON
  3. Database
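To be concrete, the CSV-backed feature boils down to something like the sketch below. This is just an illustration of the pattern rather than the actual code; the file name and columns are made up.

import csv
from pathlib import Path

CSV_PATH = Path("tracked_items.csv")  # hypothetical file name, for illustration only
FIELDS = ["uri", "user", "added_at"]  # illustrative columns

def load_items():
    # Return every row as a dict; a missing file just means no items yet
    if not CSV_PATH.exists():
        return []
    with CSV_PATH.open(newline="") as f:
        return list(csv.DictReader(f))

def add_item(uri, user, added_at):
    # Append one row, writing the header if the file is brand new
    is_new = not CSV_PATH.exists()
    with CSV_PATH.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"uri": uri, "user": user, "added_at": added_at})

When the querying needs grow beyond “read it all and filter in Python”, that’s usually the point to graduate to JSON or a database.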

Test, Test, Test

Adding these new data types into the graph made the already quite messy code even messier. It was ripe for a refactor. I then had to do a fair amount of work to get things working right, which involved a lot of refreshing a web page, tweaking some code, then repeating.

My initial instinct was to think that I didn’t have time to write any tests.

But guess what: it was only when I finally wrote some tests that I got to the bottom of the problems and fixed them all. All it took was two integration tests. You don’t need 100% code coverage to make testing worthwhile. Even the process of putting together some test data for these integration tests helped to identify where some of the bugs were.
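For flavour, the tests were roughly of this shape. This is a hedged sketch assuming a Django test client hitting a page that renders activity data; the URL name, fixture loading and assertions are hypothetical, not the actual tests in the repo.

from django.test import TestCase
from django.urls import reverse

class ActivityGraphIntegrationTests(TestCase):

    @classmethod
    def setUpTestData(cls):
        # In the real tests, this is where a handful of known topic
        # collections would be loaded into the test database.
        pass

    def test_location_activity_page_renders(self):
        # "organization-graph" is a made-up URL name for illustration
        url = reverse("organization-graph", args=["Core_Scientific"])
        response = self.client.get(url)
        self.assertEqual(response.status_code, 200)
        self.assertContains(response, "location")

The point is less the specific assertions and more that a couple of end-to-end checks over known input data flushed out the bugs far faster than refreshing the page ever did.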

The app is currently live at https://syracuse.1145.am; I hope you enjoy it. The web app is running on a Digital Ocean droplet and the backend database is in Neo4j’s AuraDB free tier. So there is a fair amount of traffic going backwards and forwards, which means the app isn’t super fast. But hopefully it gives a flavor.


Syracuse update: Django and Neo4j

Time for another update in my side project, following on from https://alanbuxton.wordpress.com/2023/08/05/rip-it-up-and-start-again-without-ripping-it-up-and-starting-again/

Since that post I’ve implemented a new backend with Neo4j and created an app for accessing the data[1]. It’s here: https://github.com/alanbuxton/syracuse-neo. The early commits have decent messages showing my baby steps in getting from one capability to the next[2].

The previous app stored each topic collection separately in a Postgres database. A topic collection is the set of stories taken from one article, and there could be several stories within one article. This was fine as a starting point, but the aim of this project is to connect the dots between different entities based on NLPing articles, so really I needed a graph database to plug things together.

Heroku doesn’t support Neo4j, so I’ve moved the app to a Digital Ocean VM that uses Neo4j’s AuraDB free tier. It’s hosted at http://syracuse.1145.am and has just enough data in it to fit into the AuraDB limits.

The graph visualization is done with vis.js. This turned out to be pretty straightforward to code: you just need some JavaScript for the nodes and some for the edges, and so long as your ids are all unique everything pretty much just works.
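For anyone curious what that looks like from the Django side, the shape of the data is roughly as follows. This is a sketch with made-up entities and a hypothetical view; the dictionary keys (id, label, group, from, to) are the ones vis.js expects, everything else is illustrative.

from django.http import JsonResponse

def graph_data(request, org_uri):
    # Hypothetical stand-in data; in the real app this comes from Neo4j
    entities = [("uri1", "Core Scientific", "Organization"),
                ("uri2", "Stax Digital", "Organization")]
    relationships = [("uri1", "uri2", "buyer")]

    # vis.js wants a list of nodes with unique ids, and a list of edges
    nodes = [{"id": uri, "label": name, "group": kind}
             for uri, name, kind in entities]
    edges = [{"from": src, "to": tgt, "label": rel}
             for src, tgt, rel in relationships]
    return JsonResponse({"nodes": nodes, "edges": edges})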

Visualizing this sort of data in a graph makes it a lot more immediate than before. I just want to share a few entertaining images to show one feature I worked on.

The underlying data has a node for every time an entity (e.g. an organization) is mentioned. This is intentional: when processing an article you can’t tell whether the name of a company in one article refers to the same company as a similarly-named company in a different article[3]. So each node in the graph is a mention of an organization, and there is some separate logic to figure out whether two nodes are the same organization or not. For example, if they have a similar name and the same industry then it’s likely that the two are the same organization.
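Conceptually the “same organization” check is something like the sketch below. The threshold and the exact comparison are my own illustration, not the logic in the repo.

from difflib import SequenceMatcher

def probably_same_org(name1, industry1, name2, industry2, threshold=0.8):
    # Treat two mentions as the same organization if the names are
    # near-identical and the industries match. The threshold is illustrative.
    name_similarity = SequenceMatcher(None, name1.lower(), name2.lower()).ratio()
    return name_similarity >= threshold and industry1.lower() == industry2.lower()

# e.g. probably_same_org("Accel Partners", "Venture Capital",
#                        "Accel Partners LLP", "Venture Capital") -> True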

All these separate mention nodes sometimes led to a ball of lines that looked like Pig-Pen’s hair.

On the plus side, this does make for a soothing visual as the graph library tries to move the nodes about into a sensible shape. With a bit of ambient sound this could make a good relaxation video.

But, pretty though this may be, it’s hard to read. So I implemented an ‘uber node’ that is the result of clubbing together all the “same as” nodes. A lot more readable, see below:

Below is an example of the same graph after all the Accel Partners nodes had been combined together.
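Under the hood, building the uber node is, in spirit, just finding connected components over the “same as” relationships and collapsing each component into one node. A rough sketch of that idea (not the repo’s actual implementation):

def cluster_same_as(node_ids, same_as_pairs):
    # Union-find over "same as" edges: every connected component of
    # mentions gets collapsed into a single uber node.
    parent = {n: n for n in node_ids}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in same_as_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for n in node_ids:
        clusters.setdefault(find(n), []).append(n)
    return clusters

# e.g. cluster_same_as(["accel_1", "accel_2", "accel_3"],
#                      [("accel_1", "accel_2"), ("accel_2", "accel_3")])
# puts all three Accel Partners mentions into one cluster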

Next steps:

  1. Implement the other types of topic collections into this graph (e.g. people appointments, opening new locations)
  2. Implement a feature to easily flag any incorrect relationships or entities (which can then feed back into the ML training)

Thanks for reading!

Notes

  1. With thanks to https://github.com/neo4j-examples/paradise-papers-django, whose example app gave me a starting point for working with graph data in Neo4j.
  2. For example, git checkout b2791fdb439c18026585bced51091f6c6dcd4f72 is a good one for complete newbies to see some basic interactions between Django and Neo4j.
  3. Also, this type of reconciliation is a difficult problem (I have some experience of it from this project: https://theybuyforyou.eu/business-cases/), so it’s safer to process the articles first and then have a separate process for combining the topics together.

Rip it up and start again, without ripping it up and starting again

Time for another update on my side project, Syracuse: http://syracuse-1145.herokuapp.com/

I got some feedback that the structure of the relationships was a bit unintuitive. Fair enough, let’s update it to make more sense.

Previously the code was doing the following:

  1. Use RoBERTa-based LLM for entity extraction (the code is pretty old but works well)
  2. Use Benepar dependency parsing to link up relevant entities with each other
  3. Use FlanT5 LLM to dig out some more useful content from the text
  4. Build up an RDF representation of the data (a rough sketch of this step follows the list)
  5. Clean up the RDF
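For context, step 4 is conceptually along the lines of the sketch below, using rdflib. The namespaces and values just echo the example output further down this post; the real pipeline derives them from the extracted entities.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

NS1 = Namespace("http://example.org/test/")
ORG = Namespace("http://www.w3.org/ns/org#")

def build_example_graph():
    g = Graph()
    g.bind("ns1", NS1)
    g.bind("org", ORG)

    buyer = NS1["abc/Core_Scientific"]
    g.add((buyer, RDF.type, ORG.Organization))
    g.add((buyer, NS1["name"], Literal("Core Scientific")))
    g.add((buyer, NS1["industry"],
           Literal("Artificial Intelligence and Blockchain technologies")))
    return g

print(build_example_graph().serialize(format="turtle"))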

Step 5 had got quite complex.

Also, I had a look at import/export of RDF in graphs, specifically Neo4j, but couldn’t see much enthusiasm for RDF. I even made a PR to update some of the Neo4j / RDF documentation; it’s been stuck for 2+ months.

I wondered if a better approach would be to start again using a different set of technologies. Specifically,

  1. Falcon7B instead of FlanT5
  2. Just building the representation in a graph rather than using RDF

Falcon7B was very exciting to get a chance to try out. But in my use case it wasn’t any more useful than FlanT5.

Going down the graph route was a bit of fun for a while. I’ve used networkx quite a bit in the past so thought I’d try that first. But, guess what, it turned out to be more complicated than I needed. Also, I do like the simplicity and elegance of RDF, even if it makes me seem a bit old.

So the final choice was to rip up all my post-processing and turn it into pre-processing, and then generate the RDF. It was heart-breaking to throw away a lot of code, but, as programmers, I think we know when we’ve built something that is just too brittle and needs some heavy refactoring. It worked well in the end, see the git stats below:

  • code: 6 files changed, 449 insertions, 729 deletions
  • tests: 81 files changed, 3618 insertions, 1734 deletions

Yes, a lot of tests. It’s a data-heavy application so there are a lot of tests to make sure that data is transformed as expected. Whenever it doesn’t work, I add that data (or enough of it) as a test case and then fix it. Most of this test data was just changed with global find/replace so it’s not a big overhead to maintain. But having all those tests was crucial for doing any meaningful refactoring.

On the code side, it was very satisfying to remove more code than I was adding. It just showed how brittle and convoluted the codebase had become: as I discovered more edge cases I added more logic to deal with them, and eventually this ended up as lots of complexity. The new code is “cleaner”. I put clean in quotes because there is still a lot of copy/paste in there and similar functions doing similar things. This is because I like to follow “Make it work, make it right, make it fast”. Code that works but isn’t super elegant is going to be easier to maintain/fix/refactor later than code that is super-abstracted.

Some observations on the above:

  1. Tests are your friend (obviously)
  2. Expect to need major refactoring in the future. However well you capture all the requirements now, there will be plenty that have not yet been captured, and plenty of need for change
  3. Shiny new toys aren’t always going to help – approach with caution
  4. Sometimes the simple old-fashioned technologies are just fine
  5. However bad you think an app is, there is probably still 80% in there that is good, so beware completely starting from scratch.

See below for the RDF as it stands now compared to before:

Current version

@prefix ns1: <http://example.org/test/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/test/abc/Core_Scientific> a org:Organization ;
    ns1:basedInRawLow "USA" ;
    ns1:buyer <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> ;
    ns1:description "Artificial Intelligence and Blockchain technologies" ;
    ns1:foundName "Core Scientific" ;
    ns1:industry "Artificial Intelligence and Blockchain technologies" ;
    ns1:name "Core Scientific" .

<http://example.org/test/abc/Stax_Digital> a org:Organization ;
    ns1:basedInRawLow "LLC" ;
    ns1:description "blockchain mining" ;
    ns1:foundName "Stax Digital, LLC",
        "Stax Digital." ;
    ns1:industry "blockchain mining" ;
    ns1:name "Stax Digital" .

<http://example.org/test/abc/Stax_Digital_Assets_Acquisition> a ns1:CorporateFinanceActivity ;
    ns1:activityType "acquisition" ;
    ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
    ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
        "Core Scientific completes acquisition of Stax Digital." ;
    ns1:foundName "acquired",
        "acquisition" ;
    ns1:name "acquired",
        "acquisition" ;
    ns1:status "completed" ;
    ns1:targetDetails "assets" ;
    ns1:targetEntity <http://example.org/test/abc/Stax_Digital> ;
    ns1:targetName "Stax Digital" ;
    ns1:whereRaw "llc" .

Previous version

@prefix ns1: <http://example.org/test/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/test/abc/Core_Scientific> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInLow "USA" ;
    ns1:description "Artificial Intelligence and Blockchain technologies" ;
    ns1:foundName "Core Scientific" ;
    ns1:industry "Artificial Intelligence and Blockchain technologies" ;
    ns1:name "Core Scientific" ;
    ns1:spender <http://example.org/test/abc/Purchase_Stax_Digital> .

<http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> a ns1:TargetDetails ;
    ns1:label "Assets" ;
    ns1:name "Acquired Assets Stax Digital, LLC" ;
    ns1:nextEntity "Stax Digital, LLC" ;
    ns1:previousEntity "acquired" ;
    ns1:targetEntity <http://example.org/test/abc/Stax_Digital> .

<http://example.org/test/abc/Purchase_Stax_Digital> a ns1:Activity ;
    ns1:activityType "Purchase" ;
    ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
    ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
        "Core Scientific completes acquisition of Stax Digital." ;
    ns1:label "Acquired",
        "Acquisition" ;
    ns1:name "Purchase Stax Digital" ;
    ns1:targetDetails <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> ;
    ns1:whenRaw "has happened, no date available" .

<http://example.org/test/abc/Stax_Digital> a <http://www.w3.org/ns/org#Organization> ;
    ns1:basedInLow "LLC" ;
    ns1:description "blockchain mining" ;
    ns1:foundName "Stax Digital, LLC",
        "Stax Digital." ;
    ns1:industry "blockchain mining" ;
    ns1:name "Stax Digital" .

DIFF
1a2
> @prefix org: <http://www.w3.org/ns/org#> .
4,5c5,7
< <http://example.org/test/abc/Core_Scientific> a <http://www.w3.org/ns/org#Organization> ;
<     ns1:basedInLow "USA" ;
---
> <http://example.org/test/abc/Core_Scientific> a org:Organization ;
>     ns1:basedInRawLow "USA" ;
>     ns1:buyer <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> ;
9,10c11
<     ns1:name "Core Scientific" ;
<     ns1:spender <http://example.org/test/abc/Purchase_Stax_Digital> .
---
>     ns1:name "Core Scientific" .
12,31c13,14
< <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> a ns1:TargetDetails ;
<     ns1:label "Assets" ;
<     ns1:name "Acquired Assets Stax Digital, LLC" ;
<     ns1:nextEntity "Stax Digital, LLC" ;
<     ns1:previousEntity "acquired" ;
<     ns1:targetEntity <http://example.org/test/abc/Stax_Digital> .
< 
< <http://example.org/test/abc/Purchase_Stax_Digital> a ns1:Activity ;
<     ns1:activityType "Purchase" ;
<     ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
<     ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
<         "Core Scientific completes acquisition of Stax Digital." ;
<     ns1:label "Acquired",
<         "Acquisition" ;
<     ns1:name "Purchase Stax Digital" ;
<     ns1:targetDetails <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> ;
<     ns1:whenRaw "has happened, no date available" .
< 
< <http://example.org/test/abc/Stax_Digital> a <http://www.w3.org/ns/org#Organization> ;
<     ns1:basedInLow "LLC" ;
---
> <http://example.org/test/abc/Stax_Digital> a org:Organization ;
>     ns1:basedInRawLow "LLC" ;
36a20,34
> 
> <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> a ns1:CorporateFinanceActivity ;
>     ns1:activityType "acquisition" ;
>     ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
>     ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
>         "Core Scientific completes acquisition of Stax Digital." ;
>     ns1:foundName "acquired",
>         "acquisition" ;
>     ns1:name "acquired",
>         "acquisition" ;
>     ns1:status "completed" ;
>     ns1:targetDetails "assets" ;
>     ns1:targetEntity <http://example.org/test/abc/Stax_Digital> ;
>     ns1:targetName "Stax Digital" ;
>     ns1:whereRaw "llc" .

Getting started with Neo4J and Neosemantics

This is based on Mark Needham’s excellent blog post on creating a Covid 19 graph, with commands updated to work with Neo4j 5.5.0.

Disclaimer – I don’t know anything about infectious diseases, so apologies if I’ve misunderstood any terminology. Please refer back to Mark’s post for the in-depth analysis of what the various cypher queries mean and just use this post to get started quicker.

Pre-requisites: you are using Neo4j Desktop with APOC and Neosemantics installed, so your Neo4j Desktop looks something like this:

First set the config for the graph

CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE;
CALL n10s.graphconfig.init({handleVocabUris: "MAP"});

And then set up the mappings for the wikidata namespaces:

CALL n10s.nsprefixes.add("wdt","http://www.wikidata.org/prop/direct/");
CALL n10s.mapping.add("http://www.wikidata.org/prop/direct/P171","CHILD_OF");
CALL n10s.nsprefixes.add("rdfs","http://www.w3.org/2000/01/rdf-schema#");
CALL n10s.mapping.add("http://www.w3.org/2000/01/rdf-schema#label","name");

Now run the code to insert the virus data into your graph

WITH '
CONSTRUCT {
  ?cat rdfs:label ?catName .
  ?subCat rdfs:label ?subCatName ;
          wdt:P171 ?parentCat .
  }
WHERE {
  ?cat rdfs:label "Riboviria"@en .
  ?cat rdfs:label ?catName .
  filter(lang(?catName) = "en") .
  ?subCat wdt:P171+ ?cat ;
          wdt:P171 ?parentCat;
          rdfs:label ?subCatName
          filter(lang(?subCatName) = "en") .
}
' AS query
CALL n10s.rdf.import.fetch(
  "https://query.wikidata.org/sparql?query=" + apoc.text.urlencode(query),
  "JSON-LD",
  { headerParams: { Accept: "application/ld+json"}})
YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo
RETURN terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo;

You should get a result that looks something like:

If there were any errors they would appear in the extraInfo field.

Let’s just check what nodes we have in the graph

MATCH (n)
RETURN n

The green node is the GraphConfig node. The virus definitions are just flagged with the generic ‘Resource’ label. To make things more useful later on, when we have other types of data in the graph, we can add a new label to the Virus nodes:

MATCH (n:Resource)
SET n:Virus
RETURN *

The 300 items are the ones matched by this Cypher statement, and they now all have the Virus label and a fetching yellow colour.

Mark’s blog post talks about cleaning up some cases where there is a shortcut between two CHILD_OF relationships. It’s a fun example of Variable-length pattern matching so worth a bit of a play with.

In my dataset the following Cypher query shows one such example:

MATCH (v:Virus {name: "Rotavirus"})-[co:CHILD_OF]->(p:Virus)
RETURN *

Rotavirus is a child of Sederovirinae (top left), Sedoreoviridae (top right) and Reoviridae (bottom left). Sederovirinae (top left) is a child of Reoviridae. So the CHILD_OF from Rotavirus to Reoviridae is a shortcut path that we want to remove.

To identify all the shortcuts we need to find all the cases where Virus v1 is a direct child of Virus v2 and, at the same time, the same Virus v1 has another path of at least 2 CHILD_OF steps to v2. That can be captured in the following Cypher query:

MATCH (v2:Virus)<-[shortcut:CHILD_OF]-(v1:Virus)-[co:CHILD_OF*2..]->(v2:Virus) 
RETURN *

To delete these shortcuts

MATCH (v2:Virus)<-[shortcut:CHILD_OF]-(v1:Virus)-[co:CHILD_OF*2..]->(v2:Virus) 
DELETE shortcut

This will return the number of rows deleted. To satisfy yourself that these are, indeed, deleted, you can re-run the Rotavirus query above, which will now only show that Rotavirus is a child of Sederovirinae and Sedoreoviridae.

To continue in Mark’s footsteps, let’s add some other entities to the graph. The syntax for adding mappings has changed, so we need to tweak the older command. In this case we added a prefix for "http://www.wikidata.org/prop/direct/" earlier on, so we can just go ahead and add the mapping without adding a prefix first.

CALL n10s.mapping.add("http://www.wikidata.org/prop/direct/P2975","HOST");

Then load the data with the query below. Warning: this query will take a long time to run because it will make one request per virus to get the host information. In my case: “Started streaming 7139 records after 22 ms and completed after 19590 ms, displaying first 1000 rows.”

MATCH (r:Virus)
WITH n10s.rdf.getIRILocalName(r.uri) AS virus, r
WITH 'prefix schema: <http://schema.org/>

CONSTRUCT {
  wd:' + virus + ' wdt:P2975 ?host.
  ?host rdfs:label ?hostName ;
        rdf:type schema:Host

}
WHERE {
  OPTIONAL {
    wd:' + virus + ' wdt:P2975 ?host.
    ?host rdfs:label ?hostName.
    filter(lang(?hostName) = "en")
  }
}' AS query, r
CALL n10s.rdf.import.fetch("https://query.wikidata.org/sparql?query=" + apoc.text.urlencode(query),
        "JSON-LD",
        { headerParams: { Accept: "application/ld+json"}})

YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo
RETURN r.name, terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo;

In this case we’re using rdf:type schema:Host to specify that this is a resource of type Host. This graph is using the default graphConfig handleRDFTypes setting, which means that the rdf:type is assigned as a label on the node for you.

Let’s see how many hosts we have

MATCH (n:Host)
RETURN  *

And a quick query to see the relationships between Virus and Host

MATCH (v:Virus)-[:HOST]->(h:Host)
RETURN *

From here you should be able to continue working on the more involved queries in Mark’s blog post.