What Is Neo4j?

Neo4j is an open-source graph database implemented in Java by Neo4j, Inc. It is accessible from software written in other languages using the Cypher Query Language (CQL), either through a transactional HTTP endpoint or through the binary ‘bolt’ protocol. The developers describe Neo4j as an ACID-compliant (Atomicity, Consistency, Isolation, Durability) transactional database with native graph storage and processing. At the time of writing, Neo4j is the most popular graph database.

It is dual-licensed: GPLv3 and AGPLv3 / commercial. Neo4j comes in three editions:

  • Community: free, but limited to running on a single node because it lacks clustering; it also has no hot backups
  • Enterprise: requires buying a license unless the application built on top of it is open-sourced; does not have the limitations of the Community edition; allows clustering, hot backups, and monitoring
  • Government: extends the Enterprise edition with additional government-specific services, including FISMA-related certification and accreditation support

 

Components of Neo4j

Neo4j stores data as nodes, edges (relationships), and attributes (properties). Each node and edge can have any number of attributes. Nodes carry labels and edges carry a type, and both can be used to narrow searches.

  • Nodes: The entities in which data is stored. A node is comparable to a record (row) in an RDBMS, and its label, e.g. Asset or Customer, plays a role similar to a table name.
  • Relationships: Directed, named connections between two nodes. E.g. ‘Jim’ KNOWS ‘Sherry’; ‘Tom Hanks’ ACTED_IN ‘Sully’.
  • Properties: Key-value pairs that can be attached to both nodes and relationships; they hold the actual data. E.g. the node ‘Asset’ can have properties like ‘Height’, ‘Manufacturing Date’ etc.
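
As a minimal sketch of how these three components look in Cypher, the statement below creates two labelled nodes with a property each and connects them with a relationship that carries its own property. The `name` and `since` properties and their values are illustrative assumptions and are not part of the data model used later in this article.

```cypher
// Two nodes with the label Person, each holding a 'name' property,
// joined by a KNOWS relationship that has its own 'since' property.
CREATE (jim:Person {name: 'Jim'})-[:KNOWS {since: 2015}]->(sherry:Person {name: 'Sherry'})
RETURN jim, sherry;
```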

 

Use Case: Employee Skills DB

We wanted to build a database in Neo4j to understand the internal relationships among people through their personal skillsets, certifications, domain expertise, clients and projects. We also wanted a simple interface that would help HR, the leadership team and project managers allocate the right resources based on the technical requirements of a project. It would let users look up the skills of every person in the organization, search for the right employee based on a list of skillsets, find connections (shortest path) between two people, and so on.

To host Neo4j, we launched a new EC2 instance in a VPC on AWS and installed Neo4j Community Edition v3.2.6 on that server. Once Neo4j was installed, we created a new database instance and built the Neo4j nodes and relationships. On the server itself, the Neo4j interface is accessed through the URL http://localhost:7474 or http://127.0.0.1:7474, using the default or configured Neo4j login credentials.

Picture 1 – the AWS EC2 instance we launched for hosting Neo4j

Picture 2 – Launching the Neo4j custom database

 

Creating the Data Model

For our use case, we needed to create nodes for the master data and relationships that connect them. The table below explains the data model.

| SL # | Node Name | Type | Description | Field Names |
|---|---|---|---|---|
| 1 | Person | Node | Master list of all people | Alias, First_Name, Last_Name, Email, Designation, Joining_Date |
| 2 | Skill | Node | Master list of all skills and skillsets | Skill_Alias, SKill_Detail, Parent_Skill_Alias |
| 3 | Certification | Node | Master list of all industry certifications | Certification_Name, Certifying_Company |
| 4 | Industry | Node | Master list of all industries | Industry_Name, Industry_Description |
| 5 | Domain | Node | Master list of all business functions | Domain_Name, Domain_Description |
| 6 | Reports_To | Relationship | Reporting hierarchy of the organization, team or project | Source_Person_Alias, Target_Person_Alias, Role |
| 7 | Knows | Relationship | Skillset expertise list for people | Alias, Skill_Alias, Knowledge_Level |
| 8 | Certified_As | Relationship | Certification list for people | Alias, Certification_Name |
| 9 | Worked_For | Relationship | People and Industry relationship | Alias, Industry_Name |

Table 1 – Data Details

 

We used the commands below to create the data model. First, we cleared all existing nodes and relationships (if any).

MATCH (n)
OPTIONAL MATCH (n)-[r]-()
DELETE n,r
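
On Neo4j 3.x releases such as the one used here, the same cleanup can also be written with DETACH DELETE, which removes each node together with any relationships attached to it. This is offered as an alternative sketch rather than the command we ran; on very large graphs, deleting everything in one transaction may need to be split into batches.

```cypher
// Delete every node and, with it, every relationship attached to that node.
MATCH (n)
DETACH DELETE n;
```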

Next, the nodes are created. Data can also be loaded through individual CREATE commands, but we chose to load it from CSV files since that is faster and easier to manage in future. To import CSV files located on the local computer (the EC2 machine in our case), we need to put them inside the ‘import’ folder at the database location (shown in Picture 2 above). If the ‘import’ folder is not present, we need to create it.
We added a uniqueness constraint on Alias and created an index on First_Name for fast searching on the Person node.

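// Alias uniquely identifies a person.
CREATE CONSTRAINT ON (p:Person) ASSERT p.Alias IS UNIQUE;

// Columns are read by position: Alias, First_Name, Last_Name, Email, Designation, DOJ.
LOAD CSV FROM "file:///people.csv" AS row
CREATE (:Person {Alias:row[0], First_Name:row[1], Last_Name:row[2], Email:row[3], Designation:row[4], DOJ:row[5]});

CREATE INDEX ON :Person(First_Name);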
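
If the CSV file starts with a header row, LOAD CSV WITH HEADERS lets the columns be addressed by name instead of by position. The sketch below assumes a header line of `Alias,First_Name,Last_Name,Email,Designation,DOJ` in people.csv; the actual file layout is not shown in this article, so treat the column names as assumptions. It is an alternative to the positional load above and is not meant to be run in addition to it.

```cypher
// Assumes people.csv begins with the header line:
// Alias,First_Name,Last_Name,Email,Designation,DOJ
LOAD CSV WITH HEADERS FROM "file:///people.csv" AS row
// MERGE on the unique Alias makes the load safe to repeat.
MERGE (p:Person {Alias: row.Alias})
SET p.First_Name  = row.First_Name,
    p.Last_Name   = row.Last_Name,
    p.Email       = row.Email,
    p.Designation = row.Designation,
    p.DOJ         = row.DOJ;
```

Using MERGE instead of CREATE means a second run updates the existing Person nodes rather than failing on the uniqueness constraint.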

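We added a uniqueness constraint on Skill_Alias, which also provides an index for fast searching on the Skill node.

CREATE CONSTRAINT ON (sk:Skill) ASSERT sk.Skill_Alias IS UNIQUE;

LOAD CSV FROM "file:///skill.csv" AS row
CREATE (:Skill {Skill_Alias:row[0], SKill_Detail:row[1]});

// The uniqueness constraint above already backs Skill_Alias with an index,
// so a separate CREATE INDEX ON :Skill(Skill_Alias) is redundant and is rejected by Neo4j.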

The Certification nodes were also created from a source CSV file.

LOAD CSV FROM "file:///certif.csv" AS row
CREATE (:Certification {Certificate_Alias:row[0], Certification_Name:row[1]});

Since there would be many relationship records, we decided to load them from CSV files as well, for easier management.
‘org_structure.csv’ contains the reporting relationships between persons, matched through the ‘Alias’ field of the Person node.

USING PERIODIC COMMIT
LOAD CSV FROM "file:///org_structure.csv" AS row
MATCH (p1:Person {Alias:row[0]}), (p2:Person {Alias:row[1]})
CREATE (p1)-[:REPORTS_TO]->(p2);

‘person_skill.csv’ contains the relationships between persons and their skills, matched through the ‘Alias’ field of the Person node and the ‘Skill_Alias’ field of the Skill node.

USING PERIODIC COMMIT
LOAD CSV FROM "file:///person_skill.csv" AS row
MATCH (p1:Person {Alias:row[0]}), (p2:Skill {Skill_Alias:row[1]})
CREATE (p1)-[:KNOWS]->(p2);
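
Table 1 lists a Knowledge_Level field for the Knows relationship, which the command above does not load. A possible variant, assuming that person_skill.csv carries the level in a third column (an assumption, since the file layout is not shown), stores it as a property on the relationship:

```cypher
USING PERIODIC COMMIT
LOAD CSV FROM "file:///person_skill.csv" AS row
MATCH (p:Person {Alias: row[0]}), (s:Skill {Skill_Alias: row[1]})
// MERGE avoids duplicate KNOWS relationships if the file is loaded twice.
MERGE (p)-[k:KNOWS]->(s)
// row[2] is assumed to hold the knowledge level.
SET k.Knowledge_Level = row[2];
```

This would replace the plain CREATE version rather than run alongside it.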

‘person_certif.csv’ contains the relationships between persons and their professional certifications, matched through the ‘Alias’ field of the Person node and the ‘Certification_Name’ field of the Certification node.

USING PERIODIC COMMIT
LOAD CSV FROM "file:///person_certif.csv" AS row
MATCH (p1:Person {Alias:row[0]}), (p2:Certification {Certification_Name:row[1]})
CREATE (p1)-[:CERTIFIED_AS]->(p2);
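
Table 1 also lists the Industry node and the Worked_For relationship, which are not covered by the commands shown here. Following the same pattern, they could be loaded roughly as sketched below. The file names industry.csv and person_industry.csv and their column order are hypothetical, chosen only to mirror the field names in Table 1.

```cypher
// industry.csv (hypothetical): Industry_Name, Industry_Description
LOAD CSV FROM "file:///industry.csv" AS row
CREATE (:Industry {Industry_Name: row[0], Industry_Description: row[1]});

// person_industry.csv (hypothetical): Alias, Industry_Name
USING PERIODIC COMMIT
LOAD CSV FROM "file:///person_industry.csv" AS row
MATCH (p:Person {Alias: row[0]}), (i:Industry {Industry_Name: row[1]})
CREATE (p)-[:WORKED_FOR]->(i);
```

As with the other loads, each statement is meant to be run on its own, one after the other.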

Once all these commands have run successfully, we can validate the data with sample commands like the ones below. They show graph output limited to at most 25 records from the node or relationship.

MATCH (n:<NODE NAME>) RETURN n LIMIT 25
MATCH p=()-[r:<RELATIONSHIP NAME>]->() RETURN p LIMIT 25
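
Besides eyeballing a sample, a quick count per label and per relationship type can be compared against the row counts of the source CSV files. This is a small sketch using only standard Cypher functions; the column aliases are arbitrary.

```cypher
// Number of nodes per label combination.
MATCH (n)
RETURN labels(n) AS labels, count(*) AS nodes
ORDER BY nodes DESC;

// Number of relationships per type.
MATCH ()-[r]->()
RETURN type(r) AS relationship, count(*) AS total
ORDER BY total DESC;
```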

 

Query to Know Employee / Team Hierarchy

To see the graph of the team hierarchy, we can either click on the ‘REPORTS_TO’ relationship or run the command below.

MATCH p=()-[r:REPORTS_TO]->() RETURN p
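
To list everyone who reports to a given manager, directly or through intermediate managers, a variable-length pattern can be used. In this sketch the alias ‘John-Doe’ simply stands in for any manager, and the upper bound of 5 levels is an arbitrary assumption about how deep the hierarchy goes.

```cypher
// Follow 1 to 5 REPORTS_TO hops upwards until the chosen manager is reached.
MATCH path = (e:Person)-[:REPORTS_TO*1..5]->(m:Person {Alias: "John-Doe"})
RETURN e.Alias AS employee, length(path) AS levels_below
ORDER BY levels_below, employee;
```

A value of 1 in levels_below marks a direct report; higher values mark people further down the chain.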

 

Query to Know All Employees’ Skillset

We wanted to know all the skills of the employees. The command below shows the graph output with all the skills for all people from the ‘KNOWS’ relationship. It shows both the person alias and their associated skills.

MATCH p=()-[r:KNOWS]->() RETURN p
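
When a tabular list is more convenient than a graph, the same relationship can be aggregated into one row per person. A minimal sketch, using the property names from the load commands above:

```cypher
// One row per person, with all of that person's skill aliases gathered into a list.
MATCH (p:Person)-[:KNOWS]->(s:Skill)
RETURN p.Alias AS person, collect(s.Skill_Alias) AS skills
ORDER BY person;
```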

 

Query to Know All Employees’ Certifications

We wanted to know all the certifications of the employees. The command below shows the graph output with all the certifications for all people from the ‘CERTIFIED_AS’ relationship. It shows both the person alias and their associated certifications.

MATCH p=()-[r:CERTIFIED_AS]->() RETURN p
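
A related question is how many people hold each certification. The following sketch aggregates the same relationship into a table; the column aliases are arbitrary.

```cypher
// Count the certified people per certification, most common first.
MATCH (p:Person)-[:CERTIFIED_AS]->(c:Certification)
RETURN c.Certification_Name AS certification, count(p) AS certified_people
ORDER BY certified_people DESC;
```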

 

Query to Find Who Has a Particular Skillset

We wanted to know the list of people who have ‘AWS’ as a skillset for a particular project requirement. The command below shows the graph output with all the people whose skills contain ‘AWS’, taken from the ‘KNOWS’ relationship. It shows the persons together with the matching skills. If we want to show only the persons, we need to use RETURN a.

MATCH (a:Person)-[:KNOWS]->(b:Skill)
WHERE b.Skill_Alias CONTAINS "AWS"
RETURN a,b
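
The use case also calls for searching by a list of skillsets. One way to sketch this is to count how many of the required skills each person knows and keep only those who know all of them. The skill aliases ‘AWS’ and ‘Python’ are placeholders; unlike the CONTAINS search above, this version expects them to match Skill_Alias values exactly.

```cypher
// Skills required for the project (placeholder values).
WITH ["AWS", "Python"] AS required
MATCH (a:Person)-[:KNOWS]->(b:Skill)
WHERE b.Skill_Alias IN required
// Count the distinct required skills each person knows.
WITH a, required, count(DISTINCT b) AS matched
// Keep only people who know every skill in the list.
WHERE matched = size(required)
RETURN a.Alias AS person, a.First_Name AS first_name, a.Last_Name AS last_name;
```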

 

Query to Find What Skillset a Person Has

We wanted to know what skills ‘John-Doe’ has (‘John-Doe’ is the value of the Alias field). The command below shows the graph output with all the skills that are associated with ‘John-Doe’ through the ‘KNOWS’ relationship.

MATCH (a:Person)-[:KNOWS]->(b:Skill)
WHERE a.Alias = "John-Doe"
RETURN a,b
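
The use case also mentions finding the connection (shortest path) between two persons. Cypher's built-in shortestPath function can sketch this. The second alias, ‘Jane-Roe’, is a hypothetical value, and the limit of 6 hops is an arbitrary assumption to keep the search bounded.

```cypher
// Two different people, looked up by their unique Alias.
MATCH (a:Person {Alias: "John-Doe"}), (b:Person {Alias: "Jane-Roe"})
// Shortest chain of any relationship type, in either direction, up to 6 hops.
MATCH p = shortestPath((a)-[*..6]-(b))
RETURN p;
```

Because the pattern does not restrict the relationship type, the path may run through a shared skill or certification as well as through the reporting hierarchy.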

 

Conclusion

Neo4j is highly efficient at managing data with many interconnecting relationships. Its data model doesn’t usually require a predefined schema: unlike with a traditional DBMS, we don’t need to create the database structure before loading the data. Neo4j is a “schema-optional” DBMS, where the data is the structure.
We wanted to create a basic employee database and quickly query it through Neo4j to understand the functionality of this popular graph database. The whole exercise took less than 2 hours. If you want to build a complete end-to-end application, Neo4j provides plenty of functionality to support more advanced requirements.
Neo4j is extremely well suited to social networking applications like Facebook, Twitter, etc., but there are many other areas where it excels. Here are some of the areas that Neo4j can be used for:

  • Social Networks
  • Real-time Product Recommendations
  • Network diagrams
  • Fraud Detection
  • Access Management
  • Graph Based Search of Digital Assets
  • Master Data Management

 
