Time for another update on my side project, following on from https://alanbuxton.wordpress.com/2023/08/05/rip-it-up-and-start-again-without-ripping-it-up-and-starting-again/
Since that post I’ve implemented a new backend with Neo4j and created an app for accessing the data[1]. It’s here: https://github.com/alanbuxton/syracuse-neo. The early commits have decent messages showing my baby steps in getting from one capability to the next[2].
The previous app stored each topic collection separately in a Postgres database. A topic collection is the set of stories taken from one article, and there could be several stories within one article. This was OK as a starting point, but the aim of this project is to connect the dots between different entities by running NLP over articles, so I really needed a graph database to plug things together.
Heroku doesn’t support Neo4j, so I’ve moved the app to a DigitalOcean VM that uses the free tier of Neo4j’s AuraDB. It’s hosted at http://syracuse.1145.am and has just enough data in it to fit within the AuraDB limits.
The graph visualization is done with vis.js. This turned out to be pretty straightforward to code: you just need some JavaScript for the nodes and some JavaScript for the edges. So long as your ids are all unique, everything seems to just work.
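To give a flavour of what vis.js needs, here is a minimal Python sketch (not the actual syracuse-neo code) that builds the node and edge lists on the server side so they can be handed to the page as JSON. vis.js nodes take an `id` and a `label`, and edges take `from` and `to`. The function name `to_vis_payload`, the shape of the input data and the sample values are all assumptions made for illustration.

```python
import json


def to_vis_payload(mentions, relationships):
    """Build vis.js-style node and edge lists (hypothetical helper).

    mentions: iterable of dicts with "id" and "name" keys (assumed shape).
    relationships: iterable of (source_id, target_id, label) tuples (assumed shape).
    """
    nodes = []
    seen_ids = set()
    for mention in mentions:
        node_id = mention["id"]
        # Node ids must be unique, so fail early on duplicates.
        if node_id in seen_ids:
            raise ValueError(f"duplicate node id: {node_id}")
        seen_ids.add(node_id)
        nodes.append({"id": node_id, "label": mention["name"]})

    edges = []
    for source_id, target_id, label in relationships:
        # Only keep edges whose endpoints are both present as nodes.
        if source_id in seen_ids and target_id in seen_ids:
            edges.append({"from": source_id, "to": target_id, "label": label})

    return {"nodes": nodes, "edges": edges}


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    payload = to_vis_payload(
        [
            {"id": "org-1", "name": "Accel Partners"},
            {"id": "org-2", "name": "Example Startup"},
        ],
        [("org-1", "org-2", "investor")],
    )
    # The JSON string can be embedded in a template and passed to vis.js.
    print(json.dumps(payload, indent=2))
```

Checking for duplicate ids before anything reaches the browser is a design choice in this sketch: it turns a confusing rendering problem into an obvious server-side error.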
Visualizing this sort of data in a graph makes it a lot more immediate than before. I just want to share a few entertaining images to show one feature I worked on.
The underlying data has a node for every time an entity (e.g. an organization) is mentioned. This is intentional because when processing an article you can’t tell whether a company named in one article is the same company as a similarly-named company in a different article[3]. So each node in the graph is a mention of an organization, and there is some separate logic to figure out whether two nodes are the same organization or not. For example, if two mentions have a similar name and industry then it’s likely that they are the same organization.
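The “similar name and industry” rule can be sketched roughly as below. This is an illustration rather than the logic the app actually uses: it swaps in `difflib.SequenceMatcher` from the Python standard library for name similarity, and the function name, the dict shape and the threshold value are all assumptions.

```python
from difflib import SequenceMatcher


def probably_same_org(mention_a, mention_b, name_threshold=0.85):
    """Guess whether two mentions refer to the same organization.

    Each mention is a dict with "name" and "industry" keys (assumed shape).
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    name_a = mention_a["name"].strip().lower()
    name_b = mention_b["name"].strip().lower()
    # ratio() returns a similarity score between 0.0 and 1.0.
    name_similarity = SequenceMatcher(None, name_a, name_b).ratio()

    industry_a = mention_a["industry"].strip().lower()
    industry_b = mention_b["industry"].strip().lower()
    same_industry = industry_a == industry_b

    # Both conditions must hold before the two mentions are treated as one.
    return name_similarity >= name_threshold and same_industry
```

Requiring both signals keeps the rule conservative: two similarly-named companies in different industries stay as separate nodes.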
Having a node per mention sometimes led to a ball of lines that looked like Pig-Pen’s hair.
On the plus side, this does make for a soothing visual as the graph library tries to move the nodes about into a sensible shape. With some ambient sound this could make a good relaxation video.
But, pretty though this may be, it’s hard to read. So I implemented an ‘uber node’ that is the result of clubbing together all the “same as” nodes. The result is a lot more readable:
Below is an example of the same graph after all the Accel Partners nodes have been combined.
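Clubbing “same as” nodes together is a connected-components problem, and one common way to solve it is union-find. The sketch below is an assumption about how this could be done, not a description of the code in the repo; the function name `group_same_as` and the input shapes are made up for illustration.

```python
def group_same_as(node_ids, same_as_pairs):
    """Group node ids into clusters linked by "same as" pairs (union-find).

    node_ids: list of hashable ids.
    same_as_pairs: iterable of (id_a, id_b) tuples, both present in node_ids.
    Returns a list of lists; each inner list would become one uber node.
    """
    parent = {node_id: node_id for node_id in node_ids}

    def find(node_id):
        # Walk up to the root, shortening the path along the way.
        while parent[node_id] != node_id:
            parent[node_id] = parent[parent[node_id]]
            node_id = parent[node_id]
        return node_id

    for id_a, id_b in same_as_pairs:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node_id in node_ids:
        groups.setdefault(find(node_id), []).append(node_id)
    return list(groups.values())
```

Each returned group can then be drawn as a single node, with any edge that touched a member redirected to the group. Because the grouping is transitive, one wrong “same as” link will merge two unrelated clusters, so it pays to keep the matching rule conservative.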
Next steps:
- Add the other types of topic collection to this graph (e.g. people appointments, opening new locations)
- Implement a feature to easily flag any incorrect relationships or entities (which can then feed back into the ML training)
Thanks for reading!
Notes
1. With thanks to https://github.com/neo4j-examples/paradise-papers-django for their example app, which gave a starting point for working with graph data in Neo4j.
2. For example, git checkout b2791fdb439c18026585bced51091f6c6dcd4f72 is a good one for complete newbies to see some basic interactions between Django and Neo4j.
3. Also, this type of reconciliation is a difficult problem (I have some experience of it from this project: https://theybuyforyou.eu/business-cases/), so it’s safer to process the articles and then have a separate process for combining the topics together.