June 28, 2017

PostgreSQL To Neo4j

Recently, I was asked to make a bit of PostgreSQL data available to a Neo4j graph database, and I thought it worth sharing my experience extracting the data from PostgreSQL and loading it into Neo4j.

First, a bit of background regarding the application…

It’s used to track the many systems (and components within them) deployed across the many environments within our local data center and at AWS. It gives us visibility into our system topology and helps us answer questions like…

  • What version of system X is deployed to QA? Pre-Production? Production?
  • What systems and/or libraries does system X depend upon at build time?
  • What systems does system X depend upon at runtime?
  • Much more!

And it is a big part of our orchestration and provisioning efforts here as we move more and more systems to AWS.

Here’s a quick rundown of the data (nodes) and relationships I was asked to make available in Neo4j…

  • Environments
  • Hosts
  • Systems
  • Components
  • Tags
  • Deployments

  • Host-Environments
  • System-Components
  • Component-Tags
  • Component-Deployments
  • Deployment-Hosts
  • Runtime-Dependencies

And now my experience…

I wrote the queries and executed them in pgAdmin. Here’s one of the simpler queries…

SELECT e.name 
FROM apps.st_environment e 
INNER JOIN apps.st_environment_family ef ON e.st_environment_family_uid=ef.st_environment_family_uid 
WHERE ef.name='PRD' 
ORDER BY e.precedence 

I extracted the results of each query using pgAdmin’s File->Export feature. Many clicks later I had the 12 .csv files that I needed, but I certainly did NOT have a repeatable process… I needed a script!

Building the script, however, wasn’t without trial and error. You see, I first attempted to use pgAdmin and, in particular, the PostgreSQL COPY command to copy the query results to .csv files…

COPY (
SELECT e.name
FROM apps.st_environment e
INNER JOIN apps.st_environment_family ef ON e.st_environment_family_uid=ef.st_environment_family_uid
WHERE ef.name='PRD'
ORDER BY e.precedence
) TO 'environments.csv' WITH csv header;

I quickly discovered that I was not a superuser…

ERROR:  must be superuser to COPY to or from a file
HINT:  Anyone can COPY to stdout or from stdin. psql's \copy command also works for anyone.

Thanks to the PostgreSQL team for the hint… it was very helpful!

Given that granting superuser access to my user was not a viable option, I then turned to using psql, the terminal-based front-end to PostgreSQL, and, in particular, the \copy meta-command to copy query results to .csv files…

START TRANSACTION;

\copy (
SELECT e.name 
FROM apps.st_environment e 
INNER JOIN apps.st_environment_family ef ON e.st_environment_family_uid=ef.st_environment_family_uid 
WHERE ef.name='PRD' 
ORDER BY e.precedence
) TO 'environments.csv' WITH csv header;
.
.
.
ROLLBACK;

But I quickly discovered that psql meta-commands like \copy do not support newline characters (the entire command must fit on a single line), so I changed my script, collapsing each query onto one line…

START TRANSACTION;

\copy (select e.name from apps.st_environment e inner join apps.st_environment_family ef on e.st_environment_family_uid=ef.st_environment_family_uid where ef.name='PRD' order by e.precedence) TO 'environments.csv' WITH csv header;
.
.
.
ROLLBACK;

This was a little less than ideal given that some of my queries are rather complex but it was a small price to pay… I had a script and a repeatable means to extract the data from my PostgreSQL database.
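
In hindsight, one way to keep the more complex SQL readable while still satisfying \copy’s one-line rule might be to define each query as a temporary view first, since SQL statements (unlike meta-commands) can span lines in psql. A sketch, not part of my actual script, using a view name of my own invention…

CREATE TEMP VIEW prd_environments AS
SELECT e.name
FROM apps.st_environment e
INNER JOIN apps.st_environment_family ef ON e.st_environment_family_uid=ef.st_environment_family_uid
WHERE ef.name='PRD'
ORDER BY e.precedence;

\copy (SELECT * FROM prd_environments) TO 'environments.csv' WITH csv header;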

Here’s the psql command I used to invoke my script (extracting data from the topology database)…

$ psql --host=<hostname> --port=<port> --username=<username> topology
topology=> \i export.sql
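
Worth noting: psql can also run the script non-interactively via its --file option, which turns the whole export into a single shell command; the relative .csv paths in each \copy resolve against the directory psql is invoked from…

$ psql --host=<hostname> --port=<port> --username=<username> --file=export.sql topology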

Next, I launched an EC2 (CentOS) instance and installed Java 8 and the latest Neo4j Community download, 3.2.1… the details of which are beyond the scope of this post. I then copied my .csv files (using scp) from my local development machine to $NEO4J_HOME/import on the EC2 instance. Please note, file URLs referenced within the Cypher script are resolved relative to the $NEO4J_HOME/import directory.
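
The copy step was along these lines (the key, host, and install path are placeholders, assuming key-based ssh access to the instance)…

$ scp -i <key.pem> *.csv centos@<ec2-host>:<neo4j-home>/import/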

And then I wrote a Cypher script, import.cypher, to create my nodes, constraints, and relationships in Neo4j…

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///environments.csv" AS row
CREATE (:Environment {environmentName: row.name, location: row.location});
.
.
.
CREATE CONSTRAINT ON (e:Environment) ASSERT e.environmentName IS UNIQUE;
.
.
.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///host-environments.csv" AS row
MATCH (h:Host {hostName: row.host_name})
MATCH (e:Environment {environmentName: row.environment_name})
MERGE (h)-[:BELONGS_TO]->(e);
.
.
.
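
One detail worth calling out: the MATCH lookups in the relationship loads are only fast when the matched property is backed by an index, and the uniqueness constraints above provide exactly that. For any matched property that doesn’t warrant a constraint, a plain index (Neo4j 3.x syntax) does the same job; hostName here is just an illustration…

CREATE INDEX ON :Host(hostName);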

And I piped the script to the cypher-shell command…

cat import.cypher | $NEO4J_HOME/bin/cypher-shell -u <username> -p <password> -a bolt://0.0.0.0:7687 --fail-fast
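
A quick sanity check after the import (a one-off count, not part of the script above) can be piped to cypher-shell the same way…

$ echo "MATCH (n) RETURN labels(n) AS label, count(*) AS nodes;" | $NEO4J_HOME/bin/cypher-shell -u <username> -p <password> -a bolt://0.0.0.0:7687 --fail-fast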

But it took me a little while to get the script just right. I got good at re-creating my database…

$ $NEO4J_HOME/bin/neo4j stop
$ rm -rf $NEO4J_HOME/data/databases/graph.db
$ $NEO4J_HOME/bin/neo4j start

Of course, it goes without saying but I’ll say it anyway… don’t delete a database that’s actually in use.

Finally, my data was moved! But given that Neo4j was running on an EC2 instance, I had to tweak its configuration, $NEO4J_HOME/conf/neo4j.conf, so that I could access Neo4j from a browser on my development machine…

# With default configuration Neo4j only accepts local connections.
# To accept non-local connections, uncomment this line:
dbms.connectors.default_listen_address=<ip-address>
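
A typical value here is 0.0.0.0, which tells Neo4j to listen on all interfaces; if you go that route, restrict access with EC2 security groups…

dbms.connectors.default_listen_address=0.0.0.0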

And when I restarted Neo4j once more, I was finally able to query Neo4j…

MATCH (s:System {systemName:"Adjustments"}) RETURN s;
MATCH (s:System)-[:HAS]-(c:Component) RETURN s, c;
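
And the real payoff of the graph shape: multi-hop questions become short queries. For example, transitive runtime dependencies (the DEPENDS_ON relationship type is illustrative here; substitute whatever type the import script created for Runtime-Dependencies)…

MATCH (s:System {systemName:"Adjustments"})-[:DEPENDS_ON*1..3]->(d:System) RETURN DISTINCT d;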

Tom Muldoon

Software Architect