Getting Started with Neo4J Cypher Query Language

Neo4J’s Cypher is a powerful query language for graph databases. The language shares a heritage with traditional database tools like Structured Query Language (SQL), but it occupies a distinct branch within that heritage.

To differentiate the two languages, one might liken them to two related yet largely no-longer-mutually-intelligible natural languages, such as Dutch and English.

Whereas SQL is the classic tool for table-oriented data storage, retrieval, and manipulation, Cypher is not based on that same notion of a table.

Cypher, and the databases it allows you to query, like AuraDB, are instead concerned with the components of graphs as we know them in the traditional computer science sense:

  1. Nodes, also known as vertices
  2. Relationships, also known as edges, which connect pairs of nodes

This crash course will attempt to cover a similar curriculum for Cypher Query Language (CQL?) as I covered in the past for Kusto Query Language (KQL). However, this one will diverge somewhat from the KQL course since the nature of this language is a bit different.

Also, the KQL course contains a mostly flavor-agnostic and in-depth introduction to regex; we will discuss regex in this course, but I’m not reproducing that section specifically because it would be almost entirely duplicate content. Check out the KQL course for more on regex, and you can even follow along with the KQL course’s regex introduction in Cypher using a setup for regex testing we develop later on.

In this course, I’m using a free AuraDB instance with the Northwind dataset — in case you want to follow along.

Setups for Running Cypher Queries

There are a few different ways of running Neo4J Cypher queries.

The most normal among you will probably just use the web interface the Neo4J folks give you for running queries against your databases. Of all the options, this approach is perhaps the most well-adjusted.

Another option is to use the Cypher Shell tool, which allows you to interact with your Neo4J databases from text-based interfaces like your terminal. If you want to do this, I’m not going to stop you.

But these options, in my view, are not accessible to everyone. Not everyone has regular access to computing devices which are amenable to running those tools.

In that vein, I’m providing here a small Node.js/JavaScript framework for running Cypher queries. I happen to run this on Replit since they have an easy-to-use, mobile IDE available for free, and it works great with Node.js.

Another benefit is this allows you to dip your toe into the world of writing a server which can communicate with a Neo4J database 🤓.

The first thing your setup needs is a few environment variables containing the connection details for your Neo4J DataBase Management System (DBMS). These secrets should be provided to you in a simple text file when you spin up your instance. On Replit, you add these in the Secrets tab. These are the names of the secrets you need to add:

  1. NEO4J_URI
  2. NEO4J_USERNAME
  3. NEO4J_PASSWORD

Next, you need to install the official neo4j-driver package from NPM. On Replit, you do this in the Packages tab of your Node.js REPL.

We’re going to make for ourselves a simple config file called query.config.json in the root directory:

{
  "outputMode": "csv",
  "permissibleOutputModeValues": [
    "console",
    "json",
    "csv"
  ],
  "driverOptions": {
    "maxConnectionLifetime": 3000000,
    "disableLosslessIntegers": true
  }
}

The outputMode value, when read by the script, will determine the method used to output the query’s result set.

permissibleOutputModeValues is just there to remind you what options there are.

And finally, the driverOptions are optional and just a quality-of-life preference of mine; if you need lossless, high-precision integer support, set disableLosslessIntegers to false (its default) instead of true as shown above.

Now we’re ready to write our script. In index.js:

const neo4j = require('neo4j-driver')
const fs = require('fs/promises')

const neo4jURI = process.env['NEO4J_URI']
const neo4jUsername = process.env['NEO4J_USERNAME']
const neo4jPassword = process.env['NEO4J_PASSWORD']

;(async function(){
  let driver, session, before
  try {
    const query = await fs.readFile(
      'query.cypher', 
      'utf8'
    )
    const {
      outputMode,
      driverOptions
    } = JSON.parse(
      await fs.readFile(
        "query.config.json",
        "utf8"
      )
    )
    driver = neo4j.driver(
      neo4jURI,
      neo4j.auth.basic(
        neo4jUsername, 
        neo4jPassword
      ),
      driverOptions
    );
    session = driver.session()
    console.log(
`
+----------------+
| Session Opened |
+----------------+
`
    )
    before = performance.now()
    const result = await session.run(
      query
    )
    const after = performance.now()
    const durationMS = after - before
    console.log(
      `Query Duration (milliseconds): ${
        durationMS
      }`
    )
    if(
      typeof outputMode != "undefined" && 
      outputMode === "json"
    ){
      await fs.writeFile(
        "results.json",
        JSON.stringify(
          result.records.map(
            record => record.toObject()
          )
        ),
        {
          encoding: "utf8"
        }
      )
    } else if(
      typeof outputMode != "undefined" && 
      outputMode === "csv"
    ){
      await fs.writeFile(
        "results.csv",
        [
          !result.records.length ? "" : result.records[0].keys.sort().map(
            k => `"${
              k.replace(
                "\"", 
                "\"\""
              )
            }"`
          ).join(","),
          "\n",
          result.records.map(
            record => record.keys.sort().map(
              key => `"${
                String(
                  record.get(key)
                ).replace(
                  "\"", 
                  "\"\""
                )
              }"`
            ).join(",")
          ).join("\n") 
        ].join(""),
        {
          encoding: "utf8"
        }
      )
    } else {
      result.records.forEach(
        record => {
          console.log(
    `
    +---------+
    | Record: |
    +---------+`
          )
          record.keys.forEach(
            key => {
              console.log(
                `\nKey: ${
                  key
                } | Value: ${
                  JSON.stringify(record.get(
                    key
                  ))
                }`
              )
            }
          )
        }
      )
    }
  } catch (e) {
    const after = performance.now()
    // `before` is undefined if an error occurred before the query ran
    const durationMS = typeof before === "undefined" ? 0 : after - before
    console.log(
      `Duration (milliseconds): ${
        durationMS
      }`
    )
    console.log(
`

+---------------+
| Caught Error: |
+---------------+
`
    )
    console.log(e)
  } finally {
    // session/driver may be undefined if setup failed before they were assigned
    if (session) await session.close()
    console.log(
`

+----------------+
| Session Closed |
+----------------+
`
    )
    if (driver) await driver.close()
  }
})()

Again, if you don’t care, move along to the next section and query the DB however you want — no judgement. But, if you’re curious about the mobile-compatible and JavaScript-infused detour we’ve taken, keep reading.

First we require the neo4j-driver and fs/promises, the latter for file reading and writing.

We then read the environment variables we set earlier for connecting to our Neo4J DB.

And then we are off to the races. The business of querying the database and handling the output happens within the async IIFE (Immediately Invoked Function Expression). For convenience, I’d like to use async/await syntax for JavaScript promises, so we define an asynchronous function and immediately invoke that function definition. Inside that function’s scope we can await promises to our heart’s content.

Naturally, the behaviors we implement inside the async function are encapsulated within a try/catch/finally setup. try holds the business logic of querying the database and handling its responses. When those behaviors fail, we catch those errors and log them clearly. And finally, we always make a good-faith effort to clean up the resources we created for querying.

We get started by reading the query file (query.cypher), where our Cypher queries will be stored, as well as the query.config.json file. Using those contents, we instantiate the driver, open a session with it, and fire off a query.

With the query results, we take one of three avenues (which is determined by our config) for rendering the output.

My personal favorite is the console output, and you can see I enjoyed some simple ASCII rendering.

JSON is easily supported, since JavaScript has in-built support for serializing objects to JSON.

The CSV rendering is slightly trickier since JavaScript has no native support for CSV. We tackle this by first CSV-escaping each field, replacing every double quote " (if any) with two double quotes ""; then, the resulting field value is surrounded in double quotes. The double-quoted fields are joined by commas to form records, and finally we join that array of records with newlines.

Now you’re ready to pop a query in the query file and tap the run button!

Enumerating Databases and Capabilities

The first thing you may be curious about is which databases are available to you with the DBMS connection info you provided.

To see those databases, run this cypher command:

SHOW DATABASES;

You’ll notice one or more databases you’re interested in, plus a so-called system database which controls all the other databases under its care.

Great. How do you know what your actual capabilities are with the connection info you’ve used?

To understand this properly, you’ll need a way to enumerate these details. If this sounds boring to you, I admonish you to not skip this section. Many of the secrets of how Neo4J and Cypher work will be unlocked when you solve this problem.

Cypher is willing to deliver to you lists of functions and procedures you can utilize in your queries. It’s a kind of dictionary pertaining to the Cypher language, and you can use Cypher itself to query this dictionary!

First, what is the difference between procedures and functions in Cypher? The general theme is that functions are more broadly usable, as you can use them within any expression. Procedures, on the other hand, are more limited in terms of the types of Cypher clauses within which they can be invoked, mainly CALL clauses.
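
To make the distinction concrete, here’s a quick sketch that uses the built-in toUpper function inside an expression and invokes the db.labels procedure (which we’ll meet again later) via CALL:

// A function can appear inside any expression:
RETURN toUpper("graph") AS Shouted;

// A procedure must be invoked with a CALL clause:
CALL db.labels() YIELD label
RETURN label LIMIT 5;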

Let’s see an example of a query you can use to list invokable procedures:

SHOW PROCEDURES EXECUTABLE BY CURRENT USER
YIELD * 
SKIP 0 LIMIT 10;

The first line is self-explanatory. The second line means, for each record representing a procedure, YIELD all fields in that record. You could provide an allowlist of fields in place of the asterisk. Finally, the third line is what allows you to page through the results. Since you probably don’t know how many procedures are available or want to read them all, it might be better to grab, say, ten at a time to read about. Set the LIMIT to whatever you want, and when you’re ready to read the next “page” of results, increment the SKIP value by the quantity you chose for your LIMIT.
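
For example, after reading the first page of ten above, the next page would be:

SHOW PROCEDURES EXECUTABLE BY CURRENT USER
YIELD * 
SKIP 10 LIMIT 10;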

The above approach is great for acquainting yourself with the language. But let’s say you’re looking for a particular kind of procedure. Now the paging strategy alone is not likely to satisfy you, as it will be a linear search through a potentially large quantity of result pages.

Actually, you can utilize a WHERE clause like in many query languages to narrow down the returned results. It only requires a few modifications to the query:

SHOW PROCEDURES EXECUTABLE BY CURRENT USER
YIELD * 
WHERE name CONTAINS "graph" OR description CONTAINS "graph"
RETURN *
SKIP 0 LIMIT 10;

We are searching for procedures where the name or description fields contain the text graph. After filtering and before the paging at the end of the query, we have to RETURN * for the query to work. This is similar to YIELD, but we will learn more about the details of this syntax later.

Let’s modify the above to search for functions instead. And this time, instead of a simple CONTAINS, we’ll try a regex filter to search for anything that fits one of a number of functionalities we might be interested in:

SHOW FUNCTIONS EXECUTABLE BY CURRENT USER
YIELD * 
WHERE name =~ "(?i).*(?:regex|rex|ip|coalesce|json|xml|jpath|xpath|parse|bin|float|int|decimal|double|rand|extract).*"
RETURN COUNT(*);

Here we’re just fishing through the function dictionary for any hits based on popular terms from other languages we might already know. And we don’t come up dry!

Instead of returning the fields directly, we returned the result of our first aggregation function, count. When I ran this, it told me there are 52 functions that matched! So let’s page through these 52 records and see all the functions that matched:

SHOW FUNCTIONS EXECUTABLE BY CURRENT USER
YIELD * 
WHERE name =~ "(?i).*(?:regex|rex|ip|coalesce|json|xml|jpath|xpath|parse|bin|float|int|decimal|double|rand|extract).*"
RETURN name, description
SKIP 0 LIMIT 10;

What a cornucopia of tools we unearthed with a bit of regex. If you don’t understand the regex pattern — don’t worry. What we did with regex can be done via the CONTAINS syntax shown earlier by chaining multiple CONTAINS predicates together with OR as shown with procedures.
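
For instance, a CONTAINS-only version covering a few of those terms might look like this; note that CONTAINS is case-sensitive, hence the toLower calls:

SHOW FUNCTIONS EXECUTABLE BY CURRENT USER
YIELD * 
WHERE toLower(name) CONTAINS "json" OR toLower(name) CONTAINS "xml" OR toLower(name) CONTAINS "parse"
RETURN name, description
SKIP 0 LIMIT 10;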

We will cover regex in more detail later. However, for those of you who are curious about the regex above (Cypher uses the Java flavor of regex, by the way), here’s what we did in that pattern:

  1. (?i) is an inline flag that makes the whole pattern case-insensitive.
  2. The leading and trailing .* permit an arbitrary, optional prefix and suffix around the matched term.
  3. (?:regex|rex|ip|…|extract) is a non-capturing group of alternatives, at least one of which must appear in the name.

Overall, we want the pattern to match one member from a set of search terms, where the matched search term can have an arbitrary, optional prefix and an arbitrary, optional suffix. You can reuse this pattern to search with your own search terms, as long as you’re mindful of regex syntax constraints.

Exploring a Database

At last, you’d like to learn how to start exploring the contents of your graph database.

In a traditional, table-oriented database, this would probably start with pulling a list of tables from which you’re able to SELECT.

As mentioned earlier, Neo4J is a graph-based database and so is not based on the same notion of tables. Instead of tables of records, we store nodes and the relationships between them.

So how do we categorize these entities if not with tables? We do so in this way:

  1. Nodes are categorized with one or more labels (for example, Employee or Region).
  2. Relationships are categorized with exactly one type (for example, IN_TERRITORY).

In fact, there are procedures you can CALL to enumerate the labels and types accessible to your queries within the database:

CALL db.labels()
YIELD label
RETURN *
ORDER BY label
SKIP 0 LIMIT 100;

This enumerates labels you can use to filter nodes in queries of your database graph.

The same kind of enumeration, but of types you can use to filter relationships in queries of your database graph:

CALL db.relationshipTypes()
YIELD relationshipType
RETURN relationshipType
ORDER BY relationshipType
SKIP 0 LIMIT 100;

Databases always use some form of data indexing in order to support fast retrieval as well as various types of constraints, like uniqueness of stored values. Even though you don’t explicitly choose which indexes your query uses in Cypher, it can be very important to understand what indexes are in place before you start querying certain types of databases.

To page through the different indexes your Neo4J database has configured:

SHOW INDEXES
YIELD *
RETURN *
SKIP 0 LIMIT 20

I’ll draw attention to a few things:

  1. Each index lists the query which was used to create it. This bolsters your ability to understand the functionality the index has — and also to understand the kind of language used to create indexes in Cypher.
  2. Indexes each have a state, where you want this value to be ONLINE. If an index is not in that state, the query planner can’t use it to help your query be fast.
  3. Indexes each have a listed type and provider, which tell you what kind of index it is and which implementation backs it.
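
Regarding the first point, the creation language looks roughly like this sketch, which would build a range index over the orderDate property of Order nodes (the index name is made up for illustration):

CREATE INDEX Order_orderDate IF NOT EXISTS
FOR (o:Order) ON (o.orderDate);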

In my Northwind database, I only have a series of Range indexes (the most general-purpose type) and two Token Lookup indexes. The latter are critical indexes that come with every database to enable you to quickly look up nodes and relationships by their labels and types; hence, under ordinary circumstances these two indexes should never be dropped.

Finally, we are ready for the MATCH game.

MATCH statements are like SQL SELECT statements, but more graph-oriented. They offer a delightfully nerdy ASCII-art syntax for declaratively matching sub-graphs within your database.

Here’s a very simple MATCH query:

MATCH (n)
RETURN n
LIMIT 2

We request to match at most two nodes. We know we are matching nodes because of the pair of parentheses. Within the parentheses, we specify n as the variable name, or key, to which each matched node is bound; this allows us to reference the matched nodes within the RETURN clause by way of that key. In this case, we just return n instead of transforming it (more on that later); owing to that, you’ll see each record’s key in the returned result set is also n.

You can rename that key in your RETURN clause:

MATCH (n)
RETURN n AS MyNode
LIMIT 2

A slightly more verbose but similar syntax allows you to pull two relationships:

MATCH ()-[r]-()
RETURN r
LIMIT 2

The empty parentheses say to match nodes without binding them to a variable for later use in the query.

Under ordinary circumstances, relationships always have two nodes associated with them, a start and an end node. This does mean that all relationships in Neo4J are “directed”. A directed graph is one where the edges or relationships are like one-way streets or lanes.

You can try modifying the above query to also return to you the nodes alongside the connecting relationship so you can figure out which is the start and which is the end:

MATCH (one)-[rel]-(two)
RETURN one, rel, two
LIMIT 2

This is a more “investigative” approach to determining which node is the start and which is the end. The match syntax may instead be modified with a left or right angle bracket to indicate which name is assigned the start node and which the end node:

MATCH (start)-[rel]->(end)
RETURN start, rel, end
LIMIT 2

You can flip the syntax the other direction just as well:

MATCH (end)<-[rel]-(start)
RETURN start, rel, end
LIMIT 2

If you wish, you can assign a name to the entire ASCII path you’ve defined and return that:

MATCH MyPath=(end)<-[rel]-(start)
RETURN MyPath
LIMIT 2

Your paths can easily span more than 2 nodes:

MATCH (start)-[rel1]->(mid)-[rel2]->(end)
RETURN start, rel1, mid, rel2, end
LIMIT 2

So far we are just returning the names we gave to the matched nodes and relationships. You’ll see that, for each name you return, the response contains a key with that name set to an object with multiple properties.

However, the RETURN clause allows you to use all kinds of expressions (and functions) to derive the ultimate form of the response:

MATCH (start)-[rel1]->(mid)-[rel2]->(end)
RETURN id(start) AS StartId, 
  labels(start) AS StartLabels, 
  type(rel1) AS Relation1Type, 
  id(mid) AS MidId, 
  labels(mid) AS MidLabels, 
  type(rel2) AS Relation2Type, 
  id(end) AS EndId, 
  labels(end) AS EndLabels
LIMIT 2

Here we have enumerated some labels and types that are associated with parts of the path matched by the query.

With that knowledge in hand, we can refine the query further so the MATCH statement ensures matched elements have particular labels or types. When that happens, we can more reliably dig into the properties associated with a particular element and selectively return them:

MATCH (start:Employee)-[rel1:IN_TERRITORY]->(mid:Territory)-[rel2:IN_REGION]->(end:Region)
RETURN apoc.text.join(
  [
    start.titleOfCourtesy,
    start.firstName,
    start.lastName
  ], " "
) AS FullName, 
  mid.territoryDescription AS Territory, 
  end.regionDescription AS Region
LIMIT 10

We added label filters to the nodes by appending, within the parentheses, a colon after our chosen variable name and then the label by which we are filtering. For instance, for the node we are calling start, we only allow matches where the node has the label Employee.

The relationship type filters are a similar syntax, but of course within the square brackets.

We return the result of a more complex expression than we have seen before, one built from a list with three entries.

The first, FullName is constructed by utilizing the apoc.text.join function, which takes two arguments:

  1. a list of strings which we wish to concatenate, end-to-end; the entries in this list are derived by using the dot . accessor syntax to access name properties of the node with name start.
  2. a delimiter string to stick between the strings we are concatenating; in this case, we use a space character so there is proper spacing in the produced name (a standalone example follows this list).
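
Here is apoc.text.join in isolation, with made-up name strings:

RETURN apoc.text.join(
  ["Dr.", "Jane", "Doe"], 
  " "
) AS FullName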

The remaining two fields are more simply defined by using the dot accessor syntax and a renaming operation.

It can seem redundant to assign names to the matched relationships if those names are not used later in the query. Feel free to drop those names if they are not used:

MATCH (start:Employee)-[:IN_TERRITORY]->(mid:Territory)-[:IN_REGION]->(end:Region)
RETURN apoc.text.join(
  [
    start.titleOfCourtesy,
    start.firstName,
    start.lastName
  ], " "
) AS FullName, 
  mid.territoryDescription AS Territory, 
  end.regionDescription AS Region
LIMIT 10

The MATCH clause also allows you to declaratively match nodes and relationships based on the values of properties they may have.

Let’s say you want a match mostly like what we have been matching in the last few queries, but you also want all of the matched Region nodes to have a regionDescription property equal to Eastern:

MATCH (start:Employee)-[:IN_TERRITORY]->(mid:Territory)-[:IN_REGION]->(end:Region { regionDescription: "Eastern" })
RETURN apoc.text.join(
  [
    start.titleOfCourtesy,
    start.firstName,
    start.lastName
  ], " "
) AS FullName, 
  mid.territoryDescription AS Territory, 
  end.regionDescription AS Region
LIMIT 10

Now we only have Employees from that Region.

This syntax is convenient, but limited to exact-matches. Remember WHERE statements from before? Those allow you to perform various kinds of partial matching, like prefix, suffix, substring, and regex. Here’s the above example modified to use an additional prefix filter:

MATCH (start:Employee)-[:IN_TERRITORY]->(mid:Territory)-[:IN_REGION]->(end:Region { regionDescription: "Eastern" })
WHERE mid.territoryDescription STARTS WITH "B"
RETURN apoc.text.join(
  [
    start.titleOfCourtesy,
    start.firstName,
    start.lastName
  ], " "
) AS FullName, 
  mid.territoryDescription AS Territory, 
  end.regionDescription AS Region
LIMIT 10

Now we have only Employee nodes which are in a Territory that starts with the letter “B” in the “Eastern” Region.

Transforming Results and Calculating Statistics

After your MATCH statement, you’ll commonly want to transform your results. So far we have done this to a limited extent within the confines of the RETURN statement. Yet, you might miss the fluent pipelines of transformations you composed with languages like Splunk Processing Language (SPL) and Kusto Query Language (KQL).

Fret not! Cypher also possesses this pipeline-esque capability. The key to accomplishing this is the WITH statement.

Basically, WITH statements are much like RETURN statements in functionality, with the difference being that RETURN must be the final stop. You can chain as many or as few WITH statements as you like (even with other types of statements in between), as long as there is a RETURN statement at the end. In case you are familiar with KQL, these WITH and RETURN statements function a bit like the KQL project operator.

Here’s an example which chains several of these statements together, producing some statistical summarizations along the way:

MATCH (o:Order) 
WITH toFloat(o.freight) AS Freight, 
  o.shipCountry AS ShipCountry
WITH avg(Freight) AS AvgFreight, 
  stdev(Freight) AS StdevFreight, 
  ShipCountry
RETURN StdevFreight/AvgFreight AS CVFreight,
  ShipCountry
ORDER BY CVFreight

The final ordering statement impacts the order of the returned results, so they are sorted in the manner you specify.

Here’s an expanded version of that query which additionally includes calculation of percentiles, followed by taking ratios of them to the calculated averages:

MATCH (o:Order) 
WITH toFloat(o.freight) AS Freight, 
  o.shipCountry AS ShipCountry 
WITH avg(Freight) AS AvgFreight, 
  stdev(Freight) AS StdevFreight, 
  percentileCont(Freight,0.05) AS P05Freight, 
  percentileCont(Freight,0.95) AS P95Freight, 
  ShipCountry
RETURN StdevFreight/AvgFreight AS CVFreight, 
  P05Freight/AvgFreight AS P5ToAvgFreight, 
  P95Freight/AvgFreight AS P95ToAvgFreight, 
  ShipCountry
ORDER BY CVFreight

Now we are at a level of calculating summary statistics and performing simple table projections.

From Summaries To Time Series

What if we want to produce a time series? This is a table where we calculate statistical aggregation functions once for each span or bin of time (say, 5 minutes) across a larger range of time (say, 24 hours).

This requires a type of transformation where we truncate the timestamps associated with particular nodes in order to remove a specified degree of precision from them.

Here’s a query that demonstrates that:

MATCH (o:Order) 
WITH apoc.text.regexGroups(
  o.orderDate, 
  "^([0-9]+)[-]([0-9]+)[-]([0-9]+).*$"
) AS DateParts, o.orderDate AS OgOrderDate
WITH DateParts[0][1] AS Year, DateParts[0][2] AS Month, DateParts[0][3] AS Day, OgOrderDate
WITH datetime({ 
  year: toInteger(Year), 
  month: toInteger(Month), 
  day: toInteger(Day)
}) AS DateTime, OgOrderDate
WITH toString(
  datetime.truncate(
    "month", 
    DateTime
  )
) AS Truncated, OgOrderDate
RETURN COUNT(*), Truncated
ORDER BY Truncated

The purpose of this query is to return a separate count of orders for each month where there are orders in the graph.

It follows this logic:

  1. Use apoc.text.regexGroups to split the orderDate string into its year, month, and day parts.
  2. Build a proper datetime value from those parts using the datetime constructor and toInteger.
  3. Truncate that datetime to month precision with datetime.truncate and stringify it to form the bin label.
  4. Count the records falling into each bin and order the output by the bin label.

Binning Timestamps Based on Recurring Windows of Time

I’ve written about this before in my Epic Kusto Query Language (KQL) Crash Course. People often schedule events on a recurring basis. For example, a paycheck may commonly be received from an employer every other week on a Thursday or Friday.

Building upon the paycheck example, it may be more useful to you to know which days of the week tend to have the most paychecks issued, rather than a timechart of daily paycheck volumes going back several months or years. This is an example of trying to comprehend the seasonality of a phenomenon.

To do this, you only need to manipulate your time range bin labels a bit more:

MATCH (o:Order) 
WITH apoc.text.regexGroups(
  o.orderDate, 
  "^([0-9]+)[-]([0-9]+)[-]([0-9]+).*$"
) AS DateParts, o.orderDate AS OgOrderDate
WITH DateParts[0][1] AS Year, 
  DateParts[0][2] AS Month, 
  DateParts[0][3] AS Day, 
  OgOrderDate
WITH datetime({ 
  year: toInteger(Year), 
  month: toInteger(Month), 
  day: toInteger(Day)
}) AS DateTime, OgOrderDate
WITH DateTime.weekDay AS WeekDay, 
  DateTime.week AS Week, OgOrderDate
WITH apoc.text.lpad(
  toString(Week), 
  2, 
  "0"
) + " " + toString(WeekDay) AS Week_WeekDay,  
  OgOrderDate
RETURN COUNT(*) AS Count, Week_WeekDay
ORDER BY Count

Here we take our earlier-produced DateTime field and separate out two of its properties: weekDay and week. The first is an integer referring to which of the 7 days of the week the DateTime lands on. The second is a number referring to which week of the year (usually 1 through 52) the DateTime lands on. We convert these integers to strings, padding the week value with a 0 on the left in cases where the week number is less than 10. Finally, we produce a count of orders binned by our week and weekday label field.

This gives us an idea of the highest volume days for this business, regardless of how many years of data exist in the dataset. It also allows you to start noticing trends, like high volume days being clustered in particular weeks or if a particular day of the week tends to be high volume regardless of which week of the year contains it.
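
For reference, here is the padding step from that query in isolation:

RETURN apoc.text.lpad(toString(7), 2, "0") AS Padded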

Winding and Unwinding Multi-Value Fields

Properties are not limited to holding scalar values, like integers. They can also hold multi-valued data types, like lists. As in other languages, Cypher provides facilities for “expanding” these multi-valued datatypes so that each item within the multi-valued field becomes its own record.

First, run this query:

WITH [
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyy"
] AS X
RETURN X

Notice how it returns a single record with a single property set to an array of strings.

Now run this modified version:

WITH [
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyy"
] AS X
UNWIND X AS UX
WITH apoc.text.regexGroups(
  UX, 
  "^([xXoO]{75,200}?)"
)[0][1] AS EX
RETURN SIZE(EX) AS Length

Notice how this returns 3 separate records. This is owing to the UNWIND command. The regex pattern demonstrates a lazy match inside a capturing group. Feel free to re-use this setup to test out various regex patterns you come up with.
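
For example, swapping in your own sample strings and pattern is as simple as:

WITH ["abc123", "no digits here", "x9"] AS Samples
UNWIND Samples AS Sample
RETURN Sample, Sample =~ ".*[0-9].*" AS HasDigit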

Let’s further modify the lazy-match query from above to “wind” the result back up into a single array by using the collect function:

WITH [
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyy"
] AS X
UNWIND X AS UX
WITH apoc.text.regexGroups(
  UX, 
  "^([xXoO]{75,200}?)"
)[0][1] AS EX
WITH SIZE(EX) AS Length
RETURN collect(Length) AS Lengths

Parsing JSON

Taking an example from my Kusto Query Language course:

WITH [
  '{"id": 1, "brand": "TechMaster", "model": "X9000", "cpu": "QuadCore 3.0GHz", "ram": "16GB", "storage": "512GB SSD", "graphics": "NVIDIA GTX 1080", "price": "$1299.99"}',
  '{"id": 2, "brand": "QuantumTech", "model": "UltraCore", "cpu": "OctaCore 2.5GHz", "ram": "32GB", "storage": "1TB HDD", "graphics": "AMD RX 5700", "price": "$1499.99"}',
  '{"id": 3, "brand": "CyberFusion", "model": "StealthX", "cpu": "HexaCore 2.8GHz", "ram": "64GB", "storage": "256GB NVMe", "graphics": "GeForce RTX 3060", "price": "$1699.99"}',
  '{"id": 4, "brand": "SysGenius", "model": "ElitePro", "cpu": "QuadCore 3.2GHz", "ram": "8GB", "storage": "1TB SSD", "graphics": "Intel UHD Graphics", "price": "$1099.99"}',
  '{"id": 5, "brand": "MegaByte", "model": "PowerSpeed", "cpu": "OctaCore 3.5GHz", "ram": "128GB", "storage": "2TB HDD", "graphics": "NVIDIA RTX 3080", "price": "$1999.99"}',
  '{"id": 6, "brand": "NeoGaming", "model": "TitanX", "cpu": "HexaCore 3.0GHz", "ram": "16GB", "storage": "512GB NVMe", "graphics": "AMD RX 5600 XT", "price": "$1299.99"}',
  '{"id": 7, "brand": "Futurix", "model": "FusionWave", "cpu": "QuadCore 2.7GHz", "ram": "32GB", "storage": "1TB SSD", "graphics": "GeForce GTX 1660", "price": "$1399.99"}',
  '{"id": 8, "brand": "TechWave", "model": "SwiftForce", "cpu": "OctaCore 2.2GHz", "ram": "64GB", "storage": "256GB SSD", "graphics": "NVIDIA GTX 1070", "price": "$1599.99"}',
  '{"id": 9, "brand": "QuantumPulse", "model": "Xplorer", "cpu": "HexaCore 2.5GHz", "ram": "8GB", "storage": "512GB HDD", "graphics": "AMD RX 550", "price": "$1099.99"}',
  '{"id": 10, "brand": "EpicGen", "model": "ThunderBlaze", "cpu": "QuadCore 3.5GHz", "ram": "128GB", "storage": "2TB SSD", "graphics": "Intel Iris Xe", "price": "$1899.99"}',
  '{"id": 11, "brand": "CyberPulse", "model": "NovaSpeed", "cpu": "OctaCore 3.2GHz", "ram": "16GB", "storage": "1TB NVMe", "graphics": "NVIDIA RTX 3070", "price": "$1799.99"}',
  '{"id": 12, "brand": "TechFusion", "model": "VelocityX", "cpu": "HexaCore 2.8GHz", "ram": "32GB", "storage": "512GB SSD", "graphics": "AMD RX 570", "price": "$1299.99"}',
  '{"id": 13, "brand": "InnoTech", "model": "BlazePro", "cpu": "QuadCore 2.5GHz", "ram": "64GB", "storage": "1TB HDD", "graphics": "GeForce GTX 1650", "price": "$1499.99"}',
  '{"id": 14, "brand": "FusionTech", "model": "SwiftNova", "cpu": "OctaCore 3.0GHz", "ram": "8GB", "storage": "256GB NVMe", "graphics": "NVIDIA GTX 1060", "price": "$1199.99"}',
  '{"id": 15, "brand": "MegaTech", "model": "TurboWave", "cpu": "HexaCore 2.2GHz", "ram": "128GB", "storage": "2TB SSD", "graphics": "AMD RX 580", "price": "$1699.99"}',
  '{"id": 16, "brand": "QuantumFusion", "model": "RapidX", "cpu": "QuadCore 3.2GHz", "ram": "16GB", "storage": "512GB HDD", "graphics": "Intel UHD Graphics", "price": "$1299.99"}',
  '{"id": 17, "brand": "EpicPulse", "model": "BlitzTech", "cpu": "OctaCore 2.8GHz", "ram": "32GB", "storage": "1TB SSD", "graphics": "NVIDIA RTX 3060", "price": "$1499.99"}',
  '{"id": 18, "brand": "TechNova", "model": "XtremeSpeed", "cpu": "HexaCore 3.5GHz", "ram": "64GB", "storage": "256GB SSD", "graphics": "GeForce GTX 1660", "price": "$1599.99"}',
  '{"id": 19, "brand": "CyberBlaze", "model": "ThunderX", "cpu": "QuadCore 2.2GHz", "ram": "8GB", "storage": "512GB SSD", "graphics": "AMD RX 5600 XT", "price": "$1199.99"}',
  '{"id": 20, "brand": "FutureTech", "model": "SwiftFusion", "cpu": "HexaCore 3.0GHz", "ram": "128GB", "storage": "1TB NVMe", "graphics": "NVIDIA GTX 1070", "price": "$1799.99"}'
] AS JSONs
UNWIND JSONs AS JSON
RETURN JSON

This is a set of JSON-encoded data pertaining to an inventory of computers. It’s not part of our actual Neo4J database, but a kind of in-memory table of strings we conjured up to practice Cypher manipulations. Think of it as a small sample of the kind of language you could find in much larger quantities in a real database.

I’ll say at the outset, ordinarily you would want to keep parsing logic, as much as possible, outside of the database tier. Some say it’s a better task for application servers. Nonetheless, there are plenty of situations where doing this within the database may be your best option.

We want to separate out this data into proper fields, the kinds which we could use to perform valid statistical calculations. Of course you can pull these values out with some regexes, but this is ill-advised. In Cypher, we have ready-made JSON parsing functionality, and we should use it.

Here’s the entirety of the Cypher script we are looking for, with explanations following:

WITH [
  '{"id": 1, "brand": "TechMaster", "model": "X9000", "cpu": "QuadCore 3.0GHz", "ram": "16GB", "storage": "512GB SSD", "graphics": "NVIDIA GTX 1080", "price": "$1299.99"}',
  '{"id": 2, "brand": "QuantumTech", "model": "UltraCore", "cpu": "OctaCore 2.5GHz", "ram": "32GB", "storage": "1TB HDD", "graphics": "AMD RX 5700", "price": "$1499.99"}',
  '{"id": 3, "brand": "CyberFusion", "model": "StealthX", "cpu": "HexaCore 2.8GHz", "ram": "64GB", "storage": "256GB NVMe", "graphics": "GeForce RTX 3060", "price": "$1699.99"}',
  '{"id": 4, "brand": "SysGenius", "model": "ElitePro", "cpu": "QuadCore 3.2GHz", "ram": "8GB", "storage": "1TB SSD", "graphics": "Intel UHD Graphics", "price": "$1099.99"}',
  '{"id": 5, "brand": "MegaByte", "model": "PowerSpeed", "cpu": "OctaCore 3.5GHz", "ram": "128GB", "storage": "2TB HDD", "graphics": "NVIDIA RTX 3080", "price": "$1999.99"}',
  '{"id": 6, "brand": "NeoGaming", "model": "TitanX", "cpu": "HexaCore 3.0GHz", "ram": "16GB", "storage": "512GB NVMe", "graphics": "AMD RX 5600 XT", "price": "$1299.99"}',
  '{"id": 7, "brand": "Futurix", "model": "FusionWave", "cpu": "QuadCore 2.7GHz", "ram": "32GB", "storage": "1TB SSD", "graphics": "GeForce GTX 1660", "price": "$1399.99"}',
  '{"id": 8, "brand": "TechWave", "model": "SwiftForce", "cpu": "OctaCore 2.2GHz", "ram": "64GB", "storage": "256GB SSD", "graphics": "NVIDIA GTX 1070", "price": "$1599.99"}',
  '{"id": 9, "brand": "QuantumPulse", "model": "Xplorer", "cpu": "HexaCore 2.5GHz", "ram": "8GB", "storage": "512GB HDD", "graphics": "AMD RX 550", "price": "$1099.99"}',
  '{"id": 10, "brand": "EpicGen", "model": "ThunderBlaze", "cpu": "QuadCore 3.5GHz", "ram": "128GB", "storage": "2TB SSD", "graphics": "Intel Iris Xe", "price": "$1899.99"}',
  '{"id": 11, "brand": "CyberPulse", "model": "NovaSpeed", "cpu": "OctaCore 3.2GHz", "ram": "16GB", "storage": "1TB NVMe", "graphics": "NVIDIA RTX 3070", "price": "$1799.99"}',
  '{"id": 12, "brand": "TechFusion", "model": "VelocityX", "cpu": "HexaCore 2.8GHz", "ram": "32GB", "storage": "512GB SSD", "graphics": "AMD RX 570", "price": "$1299.99"}',
  '{"id": 13, "brand": "InnoTech", "model": "BlazePro", "cpu": "QuadCore 2.5GHz", "ram": "64GB", "storage": "1TB HDD", "graphics": "GeForce GTX 1650", "price": "$1499.99"}',
  '{"id": 14, "brand": "FusionTech", "model": "SwiftNova", "cpu": "OctaCore 3.0GHz", "ram": "8GB", "storage": "256GB NVMe", "graphics": "NVIDIA GTX 1060", "price": "$1199.99"}',
  '{"id": 15, "brand": "MegaTech", "model": "TurboWave", "cpu": "HexaCore 2.2GHz", "ram": "128GB", "storage": "2TB SSD", "graphics": "AMD RX 580", "price": "$1699.99"}',
  '{"id": 16, "brand": "QuantumFusion", "model": "RapidX", "cpu": "QuadCore 3.2GHz", "ram": "16GB", "storage": "512GB HDD", "graphics": "Intel UHD Graphics", "price": "$1299.99"}',
  '{"id": 17, "brand": "EpicPulse", "model": "BlitzTech", "cpu": "OctaCore 2.8GHz", "ram": "32GB", "storage": "1TB SSD", "graphics": "NVIDIA RTX 3060", "price": "$1499.99"}',
  '{"id": 18, "brand": "TechNova", "model": "XtremeSpeed", "cpu": "HexaCore 3.5GHz", "ram": "64GB", "storage": "256GB SSD", "graphics": "GeForce GTX 1660", "price": "$1599.99"}',
  '{"id": 19, "brand": "CyberBlaze", "model": "ThunderX", "cpu": "QuadCore 2.2GHz", "ram": "8GB", "storage": "512GB SSD", "graphics": "AMD RX 5600 XT", "price": "$1199.99"}',
  '{"id": 20, "brand": "FutureTech", "model": "SwiftFusion", "cpu": "HexaCore 3.0GHz", "ram": "128GB", "storage": "1TB NVMe", "graphics": "NVIDIA GTX 1070", "price": "$1799.99"}'
] AS JSONs
UNWIND JSONs AS JSON
WITH apoc.convert.fromJsonMap(JSON) as parsed
WITH parsed.id AS id,
  parsed.brand AS brand,
  parsed.model AS model,
  tail(
    apoc.text.regexGroups(
      parsed.cpu,
      "^([Qq][Uu][Aa][Dd]|[Hh][Ee][Xx][Aa]|[Oo][Cc][Tt][Aa])[Cc][Oo][Rr][Ee][ ](\\d+[.]?\\d*)GHz$"
    )[0]
  ) AS cpuSpecs,
  toInteger(
    apoc.text.regexGroups(
      parsed.ram,
      "^([0-9]+)[Gg][Bb]$"
    )[0][1]
  ) AS ramGB,
  tail(
    apoc.text.regexGroups(
      parsed.storage,
      "^([0-9]+)([GgTt][Bb])[ ]([A-Za-z0-9_.-]+)$"
    )[0]
  ) AS storage,
  parsed.graphics AS graphics,
  toFloat(
    apoc.text.regexGroups(
      parsed.price,
      "^[$](.*)$"
    )[0][1]
  ) AS priceDollars
WITH id, brand, model, graphics, priceDollars, 
  CASE cpuSpecs[0]
    WHEN "Quad" THEN 4
    WHEN "Hexa" THEN 6
    WHEN "Octa" THEN 8
    ELSE -1
  END AS cpuCores, toFloat(
    cpuSpecs[1]
  ) AS cpuGHz, ramGB,
  toInteger(
    storage[0]
  ) AS storageSize, storage[1] AS storageUnit, storage[2] AS storageType
RETURN id, brand, model, graphics, 
  priceDollars, cpuCores, cpuGHz, ramGB,
  CASE storageUnit
    WHEN "GB" THEN storageSize
    ELSE storageSize * 1000
  END AS storageGB

What a drastic transformation! It shows what kind of transformations you can accomplish with the assistance of JSON parsing utilities.

Custom File Formats

This is yet another example derived from my KQL course. It’s true these fictional documents are quite contrived, but this is a good practice problem for learning how to construct your own rudimentary parsing routines in a language like Cypher.

Following is the initial table of strings. I’m not going to copy and paste the table again as we proceed, so just remember that the subsequent commands in this section are operating on the result of this initial set of strings:

WITH [
  "computer = server016; <batches><batch>{\"DurationMS\": 90, \"Payload\": \"- Name: John Doe\\n  DOB: 1990-05-15\\n  Email: john.doe@email.com\\n- Name: Jane Smith\\n  DOB: 1985-08-22\\n  Email: jane.smith@email.com\\n- Name: Alex Brown\\n  DOB: 1995-11-10\\n  Email: alex.brown@email.com\"}</batch></batches>",
    "computer = server081; <batches><batch>{\"DurationMS\": 81, \"Payload\": \"- name: Emily White\\n  dob: 1992-09-18\\n  email: emily.white@email.com\\n- name: Mark Johnson\\n  dob: 1980-03-05\\n  email: mark.johnson@email.com\\n- name: Sarah Miller\\n  dob: 1995-11-30\\n  email: sarah.miller@email.com\"}</batch><batch>{\"DurationMS\": 65, \"Payload\": \"- name: Alex Turner\\n  dob: 1987-06-24\\n  email: alex.turner@email.com\\n- name: Jessica Adams\\n  dob: 1994-12-15\\n  email: jessica.adams@email.com\\n- name: Brian Parker\\n  dob: 1982-08-03\\n  email: brian.parker@email.com\\n- name: Rachel Lee\\n  dob: 1998-02-09\\n  email: rachel.lee@email.com\\n- name: Kevin Scott\\n  dob: 1989-05-17\\n  email: kevin.scott@email.com\"}</batch></batches>",
    "computer = server056; <batches><batch>{\"DurationMS\": 72, \"Payload\": \"- name: Olivia Jenkins\\n  dob: 1991-10-07\\n  email: olivia.jenkins@email.com\\n- name: Marcus Turner\\n  dob: 1985-04-18\\n  email: marcus.turner@email.com\"}</batch></batches>",
    "computer = server081; <batches><batch>{\"DurationMS\": 15, \"Payload\": \"- name: Laura Davis\\n  dob: 1993-08-26\\n  email: laura.davis@email.com\\n- name: Ethan Wilson\\n  dob: 1984-12-03\\n  email: ethan.wilson@email.com\\n- name: Amanda Parker\\n  dob: 1990-02-14\\n  email: amanda.parker@email.com\\n- name: Tyler Mitchell\\n  dob: 1988-06-11\\n  email: tyler.mitchell@email.com\"}</batch><batch>{\"DurationMS\": 48, \"Payload\": \"- name: Grace Roberts\\n  dob: 1997-03-29\\n  email: grace.roberts@email.com\"}</batch><batch>{\"DurationMS\": 7, \"Payload\": \"- name: Benjamin Hall\\n  dob: 1981-09-22\\n  email: benjamin.hall@email.com\\n- name: Natalie Adams\\n  dob: 1996-11-15\\n  email: natalie.adams@email.com\\n- name: Christopher Hill\\n  dob: 1986-07-01\\n  email: christopher.hill@email.com\"}</batch></batches>",
    "computer = server056; <batches><batch>{\"DurationMS\": 33, \"Payload\": \"- name: Lily Moore\\n  dob: 1999-01-12\\n  email: lily.moore@email.com\\n- name: Jackson Brown\\n  dob: 1983-05-04\\n  email: jackson.brown@email.com\\n- name: Chloe Taylor\\n  dob: 1995-09-17\\n  email: chloe.taylor@email.com\"}</batch><batch>{\"DurationMS\": 13, \"Payload\": \"- name: Daniel Wright\\n  dob: 1989-12-30\\n  email: daniel.wright@email.com\\n- name: Sophia Smith\\n  dob: 1992-04-23\\n  email: sophia.smith@email.com\"}</batch></batches>",
    "computer = server016; <batches><batch>{\"DurationMS\": 9, \"Payload\": \"- name: Ryan Miller\\n  dob: 1987-11-08\\n  email: ryan.miller@email.com\"}</batch></batches>",
    "computer = server091; <batches><batch>{\"DurationMS\": 8, \"Payload\": \"- name: Emma Turner\\n  dob: 1994-06-20\\n  email: emma.turner@email.com\\n- name: Noah Johnson\\n  dob: 1980-02-01\\n  email: noah.johnson@email.com\\n- name: Madison Scott\\n  dob: 1998-07-14\\n  email: madison.scott@email.com\\n- name: Owen Harris\\n  dob: 1982-10-27\\n  email: owen.harris@email.com\"}</batch></batches>"
] AS Documents

Next, we unwind the document strings so each record has one Document string:

UNWIND Documents as Document

We use regex to extract the computer name listed at the start, as well as the XML document that takes up most of the rest of the string. We use tail as we have before to skip the initial element in the regex matches array, which is the entire matched string itself. This is all assigned to the DocParts field:

WITH tail(
    apoc.text.regexGroups(
      Document,
      "^[cC]omputer[ ]*[=][ ]*([^;]+)[;][ ]*([\\s\\S]+)$"
    )[0]
  ) AS DocParts

A random UUID is produced and assigned to DocumentID. The ComputerName and XML are separated out from the DocParts field into their own fields:

WITH randomUUID() AS DocumentID, DocParts[0] AS ComputerName, DocParts[1] AS XML

We parse the XML document field into a Cypher map:

WITH DocumentID, ComputerName, apoc.xml.parse(
  XML
) as parsed

We UNWIND the _children array within the parsed field which was produced from the XML parsing. These represent the batch elements from the XML documents. We assign the result to the BatchJSON field:

UNWIND parsed._children AS BatchJSON

We reach into the BatchJSON field and pull out the _text field, parsing the JSON it contains into a Cypher map which we call Batch:

WITH DocumentID, 
  ComputerName, 
  apoc.convert.fromJsonMap(
    BatchJSON._text
  ) AS Batch

Due to the earlier unwinding, some of the DocumentID values are now repeated across multiple records. We create another UUID for each record and call it BatchID. We also split out the DurationMS integer value and the Payload (YAML-formatted) string value from the Batch field:

WITH DocumentID, 
  ComputerName, 
  randomUUID() AS BatchID,
  Batch.DurationMS AS DurationMS,
  Batch.Payload AS Payload

We split the Payload string on newline characters to separate the YAML records. This produces a list of strings we call YAMLs:

WITH DocumentID, 
  BatchID,
  ComputerName, 
  DurationMS,
  split(
    Payload,
    "\n"
  ) as YAMLs
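
For reference, split in isolation behaves like this (\n denotes a newline inside a Cypher string):

RETURN split("a\nb\nc", "\n") AS Parts
// ["a", "b", "c"]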

In addition to the list of YAML strings, we require a companion array which contains the indices of each element in the YAML list. We accomplish this with the range function, which creates a list of integers ranging from 0 to one less than the length of the list of YAML strings:

WITH DocumentID, 
  BatchID,
  ComputerName, 
  DurationMS,
  YAMLs,
  range(
    0, 
    size(YAMLs) - 1
  ) AS YAMLsIndices
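
Here is range in isolation:

RETURN range(0, 4) AS Indices
// [0, 1, 2, 3, 4]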

Now we need to unwind the list of YAML indices:

UNWIND YAMLsIndices As YAMLIndex

We filter all records so that only the ones with a YAML index divisible by 3 remain, as these are the indices of the start of each YAML record (each one spans 3 lines):

WITH DocumentID, 
  BatchID,
  ComputerName, 
  DurationMS,
  YAMLs,
  YAMLIndex
WHERE YAMLIndex % 3 = 0

Wrapping up, we create a unique UUID to serve as a record ID for the final records. Additionally, we split out the YAML strings into dedicated fields by way of our index value and some arithmetic, subjecting the values to some final post-processing with regex (optional):

RETURN DocumentID, 
  BatchID,
  randomUUID() AS RecordID,
  ComputerName, 
  DurationMS,
  apoc.text.regexGroups(
    YAMLs[YAMLIndex],
    "^[-][ ]name[:][ ](.*)$"
  )[0][1] AS Name,
  apoc.text.regexGroups(
    YAMLs[YAMLIndex + 1],
    "^[ ]{2}dob[:][ ](.*)$"
  )[0][1] AS DOB,
  apoc.text.regexGroups(
    YAMLs[YAMLIndex + 2],
    "^[ ]{2}email[:][ ](.*)$"
  )[0][1] AS Email

Inserting

The primary way to create new nodes and relationships in your database is with the CREATE command. The syntax is similar to the syntax we’ve used with the MATCH command.

Let’s add a new type of node to our Northwind database:

CREATE (:Reviewer { name: "Mr. Smithers" }), (:Reviewer:Approver { name: "Mr. Mann" })

We added two nodes to our database graph, each with a label of Reviewer and each having its own name property. We also gave Mr. Mann an additional label of Approver.
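
You can verify the result with a quick match:

MATCH (r:Reviewer)
RETURN r.name AS Name, labels(r) AS Labels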

We can create entire paths with this syntax which includes relationships. Additionally, we can match existing nodes first, assign them to variables, and reference those matched nodes in a path we are creating.

For example:

MATCH (MrSmithers:Reviewer { name: "Mr. Smithers" }), (MrMann:Approver { name: "Mr. Mann" })
CREATE (MrMann)-[:SUPERVISES]->(MrSmithers)

This last command first retrieves nodes we created earlier and assigns them to variables. Then, we refer to those variables to specify the nodes that are being connected by the SUPERVISES relationship we are creating. Of course, we could also have created the relationship in the first command that we ran.
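
For the curious, that single combined command would have looked something like this sketch (don’t run it now, though; CREATE always creates, so it would produce a second copy of each node):

CREATE (:Reviewer { name: "Mr. Smithers" })<-[:SUPERVISES]-(:Reviewer:Approver { name: "Mr. Mann" })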

A different method of creating nodes is available. This method, called MERGE, will do the same thing as CREATE if the pattern described doesn’t exist; if it does exist, the pattern will simply be MATCHed and bound to the specified variables. One caution: MERGE considers the entire pattern, so if any part of it is missing, the whole pattern is created, including fresh copies of nodes that already exist elsewhere in the graph. To avoid accidental duplicates, MERGE nodes first and then MERGE the relationship between the bound variables.

MERGE MyPath=(MrSmithers:Reviewer { name: "Mr. Smithers" })<-[:EMPLOYS]-(MrMann:Approver { name: "Mr. Mann" })
RETURN MyPath

Updating

A simple way of updating details of your graph elements is with the SET command. It can update node labels and properties, and for relationships it can update properties.

MERGE MyPath=(MrSmithers:Reviewer { name: "Mr. Smithers" })<-[e:EMPLOYS]-(MrMann:Approver { name: "Mr. Mann" })
SET MrSmithers.Rating = 85, MrMann.Rating = 75, e.since = "1995-03-15"
RETURN MyPath
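
Setting a label works through the same command. For example, this sketch would give Mr. Smithers the additional Approver label:

MATCH (MrSmithers:Reviewer { name: "Mr. Smithers" })
SET MrSmithers:Approver
RETURN labels(MrSmithers) AS Labels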

Conclusion

That should be enough to get you started exploring the Matrix with Neo4J Cypher. Thanks for reading, hope this was useful.