Time for another update on my side project, Syracuse: http://syracuse-1145.herokuapp.com/
I got some feedback that the structure of the relationships was a bit unintuitive. Fair enough, let’s update it to make more sense.
Previously the code was doing the following:
- Use RoBERTa-based LLM for entity extraction (the code is pretty old but works well)
- Use Benepar dependency parsing to link up relevant entities with each other
- Use FlanT5 LLM to dig out some more useful content from the text
- Build up an RDF representation of the data
- Clean up the RDF
Step 5 (cleaning up the RDF) had got quite complex.
Also, I had a look at import/export of RDF in graph databases, specifically Neo4j, but couldn’t see much excitement about RDF there. I even made a PR to update some of the Neo4j / RDF documentation. It’s been stuck for 2+ months.
I wondered if a better approach would be to start again using a different set of technologies. Specifically,
- Falcon7B instead of FlanT5
- Just building the representation in a graph rather than using RDF
Falcon7B was very exciting to get a chance to try out. But in my use case it wasn’t any more useful than FlanT5.
Going down the graph route was a bit of fun for a while. I’ve used networkx quite a bit in the past so thought I’d try with that first. But, guess what, it turned out more complicated than I needed. Also, I do like the simplicity and elegance of RDF, even if it makes me seem a bit old.
So the final choice was to rip up all my post-processing and turn it into pre-processing, and then generate the RDF. It was heart-breaking to throw away a lot of code, but, as programmers, I think we know when we’ve built something that is just too brittle and needs some heavy refactoring. It worked well in the end; see the git stats below:
- code: 6 files changed, 449 insertions, 729 deletions
- tests: 81 files changed, 3618 insertions, 1734 deletions
Yes, a lot of tests. It’s a data-heavy application so there are a lot of tests to make sure that data is transformed as expected. Whenever it doesn’t work, I add that data (or enough of it) as a test case and then fix it. Most of this test data was just changed with global find/replace so it’s not a big overhead to maintain. But having all those tests was crucial for doing any meaningful refactoring.
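To illustrate the "add the failing data as a test case, then fix it" loop, here is a minimal sketch of a data-driven test. It is not the actual Syracuse test suite: the function `activity_type_for` and its lookup table are hypothetical stand-ins for whichever transformation is being tested, and the cases are kept in memory rather than in data files to keep the example self-contained.

```python
import unittest

# Hypothetical lookup: maps a verb or noun found in the text to an activity type.
ACTIVITY_TYPES = {
    "acquired": "acquisition",
    "acquisition": "acquisition",
}


def activity_type_for(found_name: str):
    """Return the activity type for a found word, or None if it is unknown."""
    return ACTIVITY_TYPES.get(found_name.strip().lower())


class ActivityTypeTest(unittest.TestCase):
    # Each time real data breaks the transformation, append it here and then fix the code.
    CASES = [
        ("acquired", "acquisition"),
        ("Acquisition", "acquisition"),
        ("  acquired ", "acquisition"),
        ("unrelated", None),
    ]

    def test_cases(self):
        for found_name, expected in self.CASES:
            # subTest reports every failing case instead of stopping at the first one.
            with self.subTest(found_name=found_name):
                self.assertEqual(activity_type_for(found_name), expected)
```

Saved as a module, this can be run with `python -m unittest`. Because the cases are plain data, a global find/replace across them stays cheap when the expected output format changes.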
On the code side, it was very satisfying to remove more code than I was adding. It just showed how brittle and convoluted the codebase had become. As I discovered more edge cases I added more logic to deal with them. Eventually this ended up as lots of complexity. The new code is “cleaner”. I put clean in quotes because there is still a lot of copy/paste in there and similar functions doing similar things. This is because I like to follow “Make it work, make it right, make it fast”. Code that works but isn’t super elegant is going to be easier to maintain/fix/refactor later than code that is super-abstracted.
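To make the "pre-process first, then generate the RDF" idea concrete, here is a minimal sketch. It is not the actual Syracuse code: the `Activity` record, the `raw` dictionary keys and the helper names are all assumptions for illustration, and it writes N-Triples lines by hand using only the standard library rather than an RDF library. The point is only that the clean-up happens on plain Python data, so the RDF needs no fixing afterwards.

```python
from dataclasses import dataclass

NS1 = "http://example.org/test/"       # predicate namespace, as in the example data
BASE = "http://example.org/test/abc/"  # entity namespace, as in the example data


@dataclass
class Activity:
    """Hypothetical cleaned record; the field names are illustrative."""
    buyer: str
    target: str
    activity_type: str
    target_details: str
    status: str


def clean_name(value: str) -> str:
    """Drop a trailing full stop and a legal suffix such as ', LLC'."""
    value = value.strip().rstrip(".")
    for suffix in (", LLC", " LLC"):
        if value.endswith(suffix):
            value = value[: -len(suffix)]
    return value.strip()


def preprocess(raw: dict) -> Activity:
    """Normalise the raw extraction output before any RDF is built."""
    return Activity(
        buyer=clean_name(raw["buyer"]),
        target=clean_name(raw["target"]),
        activity_type=raw["activity_type"].strip().lower(),
        target_details=raw["target_details"].strip().lower(),
        status=raw["status"].strip().lower(),
    )


def uri(name: str) -> str:
    """Build an entity URI by replacing spaces with underscores."""
    return BASE + name.replace(" ", "_")


def to_ntriples(a: Activity) -> list[str]:
    """Emit N-Triples lines straight from the cleaned record.

    Literals are not escaped here; fine for this example, not for arbitrary text.
    """
    act = uri(f"{a.target} {a.target_details.title()} {a.activity_type.title()}")
    return [
        f"<{uri(a.buyer)}> <{NS1}buyer> <{act}> .",
        f"<{act}> <{NS1}targetEntity> <{uri(a.target)}> .",
        f'<{act}> <{NS1}activityType> "{a.activity_type}" .',
        f'<{act}> <{NS1}targetDetails> "{a.target_details}" .',
        f'<{act}> <{NS1}status> "{a.status}" .',
    ]


raw = {
    "buyer": "Core Scientific",
    "target": "Stax Digital, LLC",
    "activity_type": "Acquisition",
    "target_details": "Assets",
    "status": "Completed",
}
lines = to_ntriples(preprocess(raw))
print("\n".join(lines))
```

With this shape, an awkward input such as "Stax Digital, LLC" is handled once in `clean_name`, and the triple-generation step stays a dumb, easily tested mapping.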
Some observations on the above:
- Tests are your friend (obviously)
- Expect to need major refactoring in the future. However well you capture all the requirements now, there will be plenty that have not yet been captured, and plenty of need for change
- Shiny new toys aren’t always going to help – approach with caution
- Sometimes the simple old-fashioned technologies are just fine
- However bad you think an app is, there is probably still 80% in there that is good, so beware of starting completely from scratch.
See below for the RDF as it stands now compared to before:
Current version
@prefix ns1: <http://example.org/test/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/test/abc/Core_Scientific> a org:Organization ;
ns1:basedInRawLow "USA" ;
ns1:buyer <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> ;
ns1:description "Artificial Intelligence and Blockchain technologies" ;
ns1:foundName "Core Scientific" ;
ns1:industry "Artificial Intelligence and Blockchain technologies" ;
ns1:name "Core Scientific" .
<http://example.org/test/abc/Stax_Digital> a org:Organization ;
ns1:basedInRawLow "LLC" ;
ns1:description "blockchain mining" ;
ns1:foundName "Stax Digital, LLC",
"Stax Digital." ;
ns1:industry "blockchain mining" ;
ns1:name "Stax Digital" .
<http://example.org/test/abc/Stax_Digital_Assets_Acquisition> a ns1:CorporateFinanceActivity ;
ns1:activityType "acquisition" ;
ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
"Core Scientific completes acquisition of Stax Digital." ;
ns1:foundName "acquired",
"acquisition" ;
ns1:name "acquired",
"acquisition" ;
ns1:status "completed" ;
ns1:targetDetails "assets" ;
ns1:targetEntity <http://example.org/test/abc/Stax_Digital> ;
ns1:targetName "Stax Digital" ;
ns1:whereRaw "llc" .
Previous version
@prefix ns1: <http://example.org/test/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<http://example.org/test/abc/Core_Scientific> a <http://www.w3.org/ns/org#Organization> ;
ns1:basedInLow "USA" ;
ns1:description "Artificial Intelligence and Blockchain technologies" ;
ns1:foundName "Core Scientific" ;
ns1:industry "Artificial Intelligence and Blockchain technologies" ;
ns1:name "Core Scientific" ;
ns1:spender <http://example.org/test/abc/Purchase_Stax_Digital> .
<http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> a ns1:TargetDetails ;
ns1:label "Assets" ;
ns1:name "Acquired Assets Stax Digital, LLC" ;
ns1:nextEntity "Stax Digital, LLC" ;
ns1:previousEntity "acquired" ;
ns1:targetEntity <http://example.org/test/abc/Stax_Digital> .
<http://example.org/test/abc/Purchase_Stax_Digital> a ns1:Activity ;
ns1:activityType "Purchase" ;
ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
"Core Scientific completes acquisition of Stax Digital." ;
ns1:label "Acquired",
"Acquisition" ;
ns1:name "Purchase Stax Digital" ;
ns1:targetDetails <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> ;
ns1:whenRaw "has happened, no date available" .
<http://example.org/test/abc/Stax_Digital> a <http://www.w3.org/ns/org#Organization> ;
ns1:basedInLow "LLC" ;
ns1:description "blockchain mining" ;
ns1:foundName "Stax Digital, LLC",
"Stax Digital." ;
ns1:industry "blockchain mining" ;
ns1:name "Stax Digital" .
DIFF
1a2
> @prefix org: <http://www.w3.org/ns/org#> .
4,5c5,7
< <http://example.org/test/abc/Core_Scientific> a <http://www.w3.org/ns/org#Organization> ;
< ns1:basedInLow "USA" ;
---
> <http://example.org/test/abc/Core_Scientific> a org:Organization ;
> ns1:basedInRawLow "USA" ;
> ns1:buyer <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> ;
9,10c11
< ns1:name "Core Scientific" ;
< ns1:spender <http://example.org/test/abc/Purchase_Stax_Digital> .
---
> ns1:name "Core Scientific" .
12,31c13,14
< <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> a ns1:TargetDetails ;
< ns1:label "Assets" ;
< ns1:name "Acquired Assets Stax Digital, LLC" ;
< ns1:nextEntity "Stax Digital, LLC" ;
< ns1:previousEntity "acquired" ;
< ns1:targetEntity <http://example.org/test/abc/Stax_Digital> .
<
< <http://example.org/test/abc/Purchase_Stax_Digital> a ns1:Activity ;
< ns1:activityType "Purchase" ;
< ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
< ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
< "Core Scientific completes acquisition of Stax Digital." ;
< ns1:label "Acquired",
< "Acquisition" ;
< ns1:name "Purchase Stax Digital" ;
< ns1:targetDetails <http://example.org/test/abc/Acquired_Assets_Stax_Digital_Llc> ;
< ns1:whenRaw "has happened, no date available" .
<
< <http://example.org/test/abc/Stax_Digital> a <http://www.w3.org/ns/org#Organization> ;
< ns1:basedInLow "LLC" ;
---
> <http://example.org/test/abc/Stax_Digital> a org:Organization ;
> ns1:basedInRawLow "LLC" ;
36a20,34
>
> <http://example.org/test/abc/Stax_Digital_Assets_Acquisition> a ns1:CorporateFinanceActivity ;
> ns1:activityType "acquisition" ;
> ns1:documentDate "2022-11-28T05:06:07.000008"^^xsd:dateTime ;
> ns1:documentExtract "Core Scientific (www.corescientific.com) has acquired assets of Stax Digital, LLC, a specialist blockchain mining company with extensive product development experience and a strong track record of developing enterprise mining solutions for GPUs.",
> "Core Scientific completes acquisition of Stax Digital." ;
> ns1:foundName "acquired",
> "acquisition" ;
> ns1:name "acquired",
> "acquisition" ;
> ns1:status "completed" ;
> ns1:targetDetails "assets" ;
> ns1:targetEntity <http://example.org/test/abc/Stax_Digital> ;
> ns1:targetName "Stax Digital" ;
> ns1:whereRaw "llc" .
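For anyone wanting to produce a similar comparison from Python, here is a small sketch using the standard library's `difflib`. It is an assumption on my part that this is a convenient route; the diff above is in the classic `diff` command format, whereas `difflib.unified_diff` emits the unified format, so the output will look different. The file names `previous.ttl` and `current.ttl` are hypothetical labels, and only the prefix lines are used to keep the example short.

```python
import difflib

# First lines of the previous serialisation (from the example above).
previous = """@prefix ns1: <http://example.org/test/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
""".splitlines()

# First lines of the current serialisation (from the example above).
current = """@prefix ns1: <http://example.org/test/> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
""".splitlines()

# lineterm="" because the input lines carry no trailing newlines.
diff_lines = list(
    difflib.unified_diff(
        previous,
        current,
        fromfile="previous.ttl",
        tofile="current.ttl",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

A line-based diff like this only works well when both serialisations are written in a stable order; otherwise comparing the two sets of triples is likely to be more reliable.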