Getting started with Neo4J and Neosemantics

This is based on Mark Needham’s excellent blog post on creating a Covid-19 graph, with the commands updated to work with Neo4J 5.5.0.

Disclaimer – I don’t know anything about infectious diseases, so apologies if I’ve misunderstood any terminology. Please refer back to Mark’s post for the in-depth analysis of what the various Cypher queries mean, and just use this post to get started more quickly.

Pre-requisites: You are using Neo4J Desktop with APOC and Neosemantics installed, so your Neo4J Desktop looks something like this:

First set the config for the graph

CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE;
CALL n10s.graphconfig.init({handleVocabUris: "MAP"});
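
To confirm the configuration took effect, you can list the current Neosemantics settings. This is just a quick sanity check, assuming the init call above completed without errors:

```cypher
// List the active Neosemantics graph configuration as param/value rows.
// The handleVocabUris row should reflect the "MAP" setting chosen above.
CALL n10s.graphconfig.show();
```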

And then set up the mappings for the wikidata namespaces:

CALL n10s.nsprefixes.add("wdt","http://www.wikidata.org/prop/direct/");
CALL n10s.mapping.add("http://www.wikidata.org/prop/direct/P171","CHILD_OF");
CALL n10s.nsprefixes.add("rdfs","http://www.w3.org/2000/01/rdf-schema#");
CALL n10s.mapping.add("http://www.w3.org/2000/01/rdf-schema#label","name");
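
If you want to double-check what has been registered so far, the two listing procedures below show the prefixes and the mappings. This is an optional check rather than a required step:

```cypher
// Show the registered namespace prefixes (wdt and rdfs should be among them)
CALL n10s.nsprefixes.list();

// Show the URI-to-name mappings (CHILD_OF and name should be among them)
CALL n10s.mapping.list();
```

Depending on your Neo4J Browser settings you may need to run the two statements one at a time.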

Now run the code to insert the virus data into your graph

WITH '
CONSTRUCT {
  ?cat rdfs:label ?catName .
  ?subCat rdfs:label ?subCatName ;
          wdt:P171 ?parentCat .
  }
WHERE {
  ?cat rdfs:label "Riboviria"@en .
  ?cat rdfs:label ?catName .
  filter(lang(?catName) = "en") .
  ?subCat wdt:P171+ ?cat ;
          wdt:P171 ?parentCat;
          rdfs:label ?subCatName
          filter(lang(?subCatName) = "en") .
}
' AS query
CALL n10s.rdf.import.fetch(
  "https://query.wikidata.org/sparql?query=" + apoc.text.urlencode(query),
  "JSON-LD",
  { headerParams: { Accept: "application/ld+json"}})
YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo
RETURN terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo;

You should get a result that looks something like:

If there were any errors they would appear in the extraInfo field.

Let’s just check what nodes we have in the graph

MATCH (n)
RETURN n

The green node is the GraphConfig node. The virus definitions are just flagged with the generic ‘Resource’ label. To make things more useful later on, when we have other types of data in the graph, we can add a new label to the virus nodes:

MATCH (n:Resource)
SET n:Virus
RETURN *

The 300 items are the ones matched by this Cypher statement; they now all have the Virus label and a fetching yellow colour.
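
A count per label is a quick way to confirm the totals independently of what the visualisation draws. This is a minimal sketch; the exact labels and numbers will depend on your own import:

```cypher
// Count nodes for each combination of labels present in the graph
MATCH (n)
RETURN labels(n) AS labels, count(*) AS nodes
ORDER BY nodes DESC;
```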

Mark’s blog post talks about cleaning up some cases where there is a shortcut between two CHILD_OF relationships. It’s a fun example of variable-length pattern matching, so it’s worth having a bit of a play with.

In my dataset the following Cypher query shows one such example:

MATCH (v:Virus {name: "Rotavirus"})-[co:CHILD_OF]->(p:Virus)
RETURN *

Rotavirus is a child of Sedoreovirinae (top left), Sedoreoviridae (top right) and Reoviridae (bottom left). Sedoreovirinae (top left) is a child of Reoviridae. So the CHILD_OF from Rotavirus to Reoviridae is a shortcut path that we want to remove.

To identify all the shortcuts we need to find every case where Virus v1 is a direct child of Virus v2 and, at the same time, v1 has another path of at least two CHILD_OF steps to v2. That can be captured in the following Cypher query:

MATCH (v2:Virus)<-[shortcut:CHILD_OF]-(v1:Virus)-[co:CHILD_OF*2..]->(v2:Virus) 
RETURN *
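
Before changing anything, it can be handy to count the shortcuts rather than draw them. The sketch below reuses the same pattern and only reads from the graph; the alias shortcuts is just an illustrative name:

```cypher
// Count distinct shortcut relationships without modifying the graph.
// DISTINCT matters because one shortcut can be matched via several longer paths.
MATCH (v2:Virus)<-[shortcut:CHILD_OF]-(v1:Virus)-[:CHILD_OF*2..]->(v2)
RETURN count(DISTINCT shortcut) AS shortcuts;
```

Running the same count again after the delete step is expected to report 0, assuming the CHILD_OF hierarchy has no cycles.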

To delete these shortcuts

MATCH (v2:Virus)<-[shortcut:CHILD_OF]-(v1:Virus)-[co:CHILD_OF*2..]->(v2:Virus) 
DELETE shortcut

This will report the number of relationships deleted. To satisfy yourself that these are, indeed, deleted, you can re-run the Rotavirus query above, which will now only show that Rotavirus is a child of Sedoreovirinae and Sedoreoviridae.

To continue in Mark’s footsteps, let’s add some other entities to the graph. The syntax for adding mappings has changed, so we need to tweak the older command. In this case we added a prefix for “http://www.wikidata.org/prop/direct/” earlier on, so we can just go ahead and add the mappings without adding a prefix first.

CALL n10s.mapping.add("http://www.wikidata.org/prop/direct/P2975","HOST");

Then load the data with the query below. Warning: this query will take a long time to run because it makes one request per virus to get the host information. In my case: Started streaming 7139 records after 22 ms and completed after 19590 ms, displaying first 1000 rows.

MATCH (r:Virus)
WITH n10s.rdf.getIRILocalName(r.uri) AS virus, r
WITH 'prefix schema: <http://schema.org/>

CONSTRUCT {
  wd:' + virus + ' wdt:P2975 ?host.
  ?host rdfs:label ?hostName ;
        rdf:type schema:Host

}
WHERE {
  OPTIONAL {
    wd:' + virus + ' wdt:P2975 ?host.
    ?host rdfs:label ?hostName.
    filter(lang(?hostName) = "en")
  }
}' AS query, r
CALL n10s.rdf.import.fetch("https://query.wikidata.org/sparql?query=" + apoc.text.urlencode(query),
        "JSON-LD",
        { headerParams: { Accept: "application/ld+json"}})

YIELD terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo
RETURN r.name, terminationStatus, triplesLoaded, triplesParsed, namespaces, extraInfo;

In this case we’re using rdf:type schema:Host to specify that this is a resource of type Host. This graph is using the default graphConfig setting for handleRDFTypes, which means that the rdf type is assigned as a label to the node for you.
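
Because the load makes one request per virus, a dry run over a handful of nodes can be a cheap way to see which Wikidata identifiers get spliced into the SPARQL string. This sketch makes no requests to Wikidata; the limit of 5 and the alias wikidataId are arbitrary choices for illustration:

```cypher
// Preview the identifier extracted from each virus URI for a few nodes only
MATCH (r:Virus)
WITH r LIMIT 5
RETURN r.name AS name, n10s.rdf.getIRILocalName(r.uri) AS wikidataId;
```

The same `WITH r LIMIT 5` line could be added after the MATCH in the load query itself to try the import on a small sample first.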

Let’s see how many hosts we have

MATCH (n:Host)
RETURN *
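
If you only want the number rather than the nodes themselves, a count is enough. A minimal variant of the query above:

```cypher
// Return just the number of Host nodes instead of drawing them all
MATCH (h:Host)
RETURN count(h) AS hosts;
```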

And a quick query to see the relationships between Virus and Host

MATCH (v:Virus)-[:HOST]->(h:Host)
RETURN *
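
As a small next step, the relationships can be summarised per host. This is only a sketch: it assumes hosts carry the name property produced by the rdfs:label mapping set up earlier, and the limit of 10 is arbitrary:

```cypher
// Rank hosts by how many distinct viruses are linked to them
MATCH (v:Virus)-[:HOST]->(h:Host)
RETURN h.name AS host, count(DISTINCT v) AS viruses
ORDER BY viruses DESC
LIMIT 10;
```

Note the colon in `[:HOST]`: without it, Cypher treats HOST as a variable name and matches relationships of any type.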

From here you should be able to continue working on the more involved queries in Mark’s blog post.