What is Neo4j
Neo4j is an open-source graph database implemented in Java and developed by Neo4j, Inc. It is accessible from software written in other languages using the Cypher Query Language (CQL), either through a transactional HTTP endpoint or through the binary Bolt protocol. The developers describe Neo4j as an ACID-compliant (Atomicity, Consistency, Isolation, Durability) transactional database with native graph storage and processing. At the time of writing, Neo4j is the most popular graph database according to the DB-Engines ranking.
It is dual-licensed under GPLv3 and AGPLv3 / commercial terms. Neo4j comes in three editions:
- Community: free, but limited to a single node because it lacks clustering; it also has no hot backups
- Enterprise: requires a purchased license unless the application built on top of it is open-sourced; removes the Community Edition limitations and adds clustering, hot backups, and monitoring
- Government: extends the Enterprise Edition with government-specific services, including FISMA-related certification and accreditation support
Components of Neo4j
Neo4j stores data as nodes, edges (relationships), and attributes (properties). Each node and edge can have any number of attributes. Both nodes and edges can be labelled, and labels can be used to narrow searches.
- Nodes: The entities in which data is stored. A node is comparable to a record in an RDBMS, and its label (e.g. Asset, Customer) plays a role similar to a table name.
- Relationships: Directed connections between two nodes. E.g. ‘Jim’ KNOWS ‘Sherry’; ‘Tom Hanks’ ACTED-IN ‘Sully’.
- Properties: Key-value pairs that can be attached to both nodes and relationships; they hold the actual data. E.g. the node ‘Asset’ can have properties such as ‘Height’ and ‘Manufacturing Date’.
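To make the three components concrete, here is a minimal Cypher sketch that creates two labelled nodes, a relationship between them, and a property on each. The `name` and `since` properties and their values are illustrative assumptions, not part of the data model used later in this post.

```cypher
// Hypothetical sample data: property names and values are illustrative only.
CREATE (jim:Person {name: 'Jim'})
CREATE (sherry:Person {name: 'Sherry'})
CREATE (jim)-[:KNOWS {since: 2015}]->(sherry)
RETURN jim, sherry;

// The Person label narrows the search to nodes of that kind.
MATCH (p:Person {name: 'Jim'})-[:KNOWS]->(friend)
RETURN friend.name;
```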
Use Case: Employee Skills DB
We wanted to build a database in Neo4j to understand how people in the organization are connected through their skillsets, certifications, domain expertise, clients, and projects. We also wanted a simple interface that would help HR, the leadership team, and project managers allocate the right resources based on the technical requirements of a project. It would let users look up the skills of every person in the organization, search for the right employee based on a list of skillsets, find connections (shortest path) between two people, and so on.
To host Neo4j, we launched a new EC2 instance in a VPC on AWS and installed Neo4j Community Edition v3.2.6 on that server. Once Neo4j was installed, we created a new database instance and built the nodes and relationships. The Neo4j Browser interface is available on the server at http://localhost:7474 or http://127.0.0.1:7474; log in with the default or configured Neo4j credentials.
Picture 1 – the AWS EC2 instance we launched for hosting Neo4j
Picture 2 – Launching the Neo4j custom database
Creating the Data Model
For our use case, we needed nodes for the master data and relationships to connect them. The table below describes the data model.
SL # | Name | Type | Description | Field Names
1 | Person | Node | Master list of all people | Alias, First_Name, Last_Name, Email, Designation, Joining_Date
2 | Skill | Node | Master list of all skills and skillsets | Skill_Alias, Skill_Detail, Parent_Skill_Alias
3 | Certification | Node | Master list of all industry certifications | Certification_Name, Certifying_Company
4 | Industry | Node | Master list of all industries | Industry_Name, Industry_Description
5 | Domain | Node | Master list of all business functions | Domain_Name, Domain_Description
6 | Reports_To | Relationship | Reporting hierarchy of the organization, team, or project | Source_Person_Alias, Target_Person_Alias, Role
7 | Knows | Relationship | Skillset expertise list for people | Alias, Skill_Alias, Knowledge_Level
8 | Certified_As | Relationship | Certification list for people | Alias, Certification_Name
9 | Worked_For | Relationship | People and Industry relationship | Alias, Industry_Name
Table 1 – Data Details
We used the commands below to create the data model. First, we cleared any existing nodes and relationships.
MATCH (n) DETACH DELETE n;
Next, we created the nodes. Data can also be loaded through individual CREATE commands, but we chose to load it from CSV files since that is faster and easier to manage in the future. To import CSV files located on the local computer (the EC2 machine in our case), the files must be placed inside the ‘import’ folder at the database location (shown in Picture 2 above). If the ‘import’ folder does not exist, it has to be created.
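For comparison, loading a single record with an individual CREATE command might look like the sketch below. The property values are made-up sample data; only the property names come from Table 1.

```cypher
// One node per statement; the values are hypothetical sample data.
CREATE (:Person {
  Alias: 'John-Doe',
  First_Name: 'John',
  Last_Name: 'Doe',
  Designation: 'Engineer'
});
```

This is manageable for a handful of records, but it means one statement per row, which is why a CSV import scales better.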
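For the Person node, we added a uniqueness constraint on Alias, loaded the data, and created an index on First_Name for fast searching.
CREATE CONSTRAINT ON (p:Person) ASSERT p.Alias IS UNIQUE;
// Assumption: the CSV columns follow the field order listed in Table 1.
LOAD CSV FROM "file:///people.csv" AS row
CREATE (:Person {Alias: row[0], First_Name: row[1], Last_Name: row[2], Email: row[3], Designation: row[4], Joining_Date: row[5]});
CREATE INDEX ON :Person(First_Name);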
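For the Skill node, the uniqueness constraint on Skill_Alias also creates the index used for fast searching, so a separate CREATE INDEX on the same property is not needed (Neo4j rejects it because the constraint already provides one).
CREATE CONSTRAINT ON (sk:Skill) ASSERT sk.Skill_Alias IS UNIQUE;
// Assumption: the CSV columns follow the field order listed in Table 1.
LOAD CSV FROM "file:///skill.csv" AS row
CREATE (:Skill {Skill_Alias: row[0], Skill_Detail: row[1], Parent_Skill_Alias: row[2]});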
The Certification node was also created from a source CSV file.
LOAD CSV FROM "file:///certif.csv" AS row
CREATE (:Certification {Certificate_Alias: row[0], Certification_Name: row[1]});
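If the CSV file carries a header row, columns can be referenced by name instead of by position. The following is a sketch under that assumption: the header names `Certification_Name` and `Certifying_Company` are taken from Table 1, and may not match the actual file. It also uses MERGE instead of CREATE, so re-running the import does not duplicate nodes.

```cypher
// Assumption: certif.csv starts with a header row containing these column names.
LOAD CSV WITH HEADERS FROM "file:///certif.csv" AS row
MERGE (c:Certification {Certification_Name: row.Certification_Name})
SET c.Certifying_Company = row.Certifying_Company;
```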
Since the relationships would contain many records, we loaded them from CSV files as well for easier management.
‘org_structure.csv’ contains the reporting relationships between people, keyed by the ‘Alias’ field of the Person node.
USING PERIODIC COMMIT
LOAD CSV FROM "file:///org_structure.csv" AS row
MATCH (p1:Person {Alias: row[0]}), (p2:Person {Alias: row[1]})
CREATE (p1)-[:REPORTS_TO]->(p2);
‘person_skill.csv’ contains the relationships between people and their skills, keyed by the ‘Alias’ field of the Person node and the ‘Skill_Alias’ field of the Skill node.
USING PERIODIC COMMIT
LOAD CSV FROM "file:///person_skill.csv" AS row
MATCH (p1:Person {Alias: row[0]}), (p2:Skill {Skill_Alias: row[1]})
CREATE (p1)-[:KNOWS]->(p2);
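Table 1 lists a Knowledge_Level field for the Knows relationship, which the person_skill.csv load does not set. One possible way to capture it is sketched below, assuming the file has a third column holding the knowledge level (that column is an assumption, not something the original import shows). MERGE is used so that re-running the load updates the existing relationship rather than creating a duplicate.

```cypher
// Assumption: person_skill.csv has a third column with the knowledge level.
USING PERIODIC COMMIT
LOAD CSV FROM "file:///person_skill.csv" AS row
MATCH (p:Person {Alias: row[0]}), (s:Skill {Skill_Alias: row[1]})
MERGE (p)-[k:KNOWS]->(s)
SET k.Knowledge_Level = row[2];
```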
‘person_certif.csv’ contains the relationships between people and their professional certifications, keyed by the ‘Alias’ field of the Person node and the ‘Certification_Name’ field of the Certification node.
USING PERIODIC COMMIT
LOAD CSV FROM "file:///person_certif.csv" AS row
MATCH (p1:Person {Alias: row[0]}), (p2:Certification {Certification_Name: row[1]})
CREATE (p1)-[:CERTIFIED_AS]->(p2);
Once all these commands have run successfully, the data can be validated with sample commands like the ones below. Each shows the graph output for only the first 25 records of the given node label or relationship type.
MATCH (n:<NODE NAME>) RETURN n LIMIT 25;
MATCH p=()-[r:<RELATIONSHIP NAME>]->() RETURN p LIMIT 25;
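Beyond eyeballing a sample, a quick count per label and per relationship type can be compared against the row counts of the source CSV files. This is a sketch of one way to do that, not a step from the original exercise.

```cypher
// Count nodes per label combination.
MATCH (n)
RETURN labels(n) AS labels, count(*) AS total
ORDER BY total DESC;

// Count relationships per type.
MATCH ()-[r]->()
RETURN type(r) AS relationship, count(*) AS total
ORDER BY total DESC;
```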
Query to Know Employee / Team Hierarchy
To see the graph of the team hierarchy, we can either click on the ‘REPORTS_TO’ relationship or run the command below.
MATCH p=()-[r:REPORTS_TO]->() RETURN p;
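To restrict the hierarchy to one manager's team, a variable-length pattern can follow REPORTS_TO across any number of levels. The sketch below assumes a manager with the alias ‘John-Doe’ exists in the data.

```cypher
// Everyone who reports to John-Doe, directly or indirectly.
MATCH path = (e:Person)-[:REPORTS_TO*]->(m:Person {Alias: "John-Doe"})
RETURN e.Alias AS employee, length(path) AS levels_below
ORDER BY levels_below, employee;
```

Here `levels_below` is 1 for direct reports, 2 for their reports, and so on.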
Query to Know All Employees’ Skillset
We wanted to see the skills of all employees. The command below shows the graph output for the ‘KNOWS’ relationship, with every person and their associated skills.
MATCH p=()-[r:KNOWS]->() RETURN p;
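The same relationship can also be summarized as a table rather than a graph, for example to see how many people know each skill. A minimal sketch:

```cypher
// Number of people per skill, most common skills first.
MATCH (:Person)-[:KNOWS]->(s:Skill)
RETURN s.Skill_Alias AS skill, count(*) AS people
ORDER BY people DESC;
```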
Query to Know All Employees’ Certifications
We wanted to see the certifications of all employees. The command below shows the graph output for the ‘CERTIFIED_AS’ relationship, with every person and their associated certifications.
MATCH p=()-[r:CERTIFIED_AS]->() RETURN p;
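When a report-style listing is more useful than a graph, the certifications can be collected into one row per person. This is a sketch that relies on the Alias and Certification_Name properties loaded earlier.

```cypher
// One row per person with a list of their certifications.
MATCH (p:Person)-[:CERTIFIED_AS]->(c:Certification)
RETURN p.Alias AS person, collect(c.Certification_Name) AS certifications
ORDER BY person;
```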
Query to Find Who Has a Particular Skillset
We wanted to find the people who have ‘AWS’ as a skill for a particular project requirement. The command below uses the ‘KNOWS’ relationship and shows the graph output with those people together with their matching skill nodes. To show only the people, use RETURN a instead.
MATCH (a:Person)-[:KNOWS]->(b:Skill) WHERE b.Skill_Alias CONTAINS "AWS" RETURN a, b;
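The use case also called for searching by a list of skillsets. One way to sketch that is to count how many of the required skills each person matches and keep only those who match all of them. The skill aliases ‘AWS’ and ‘Python’ are hypothetical examples and must match Skill_Alias values exactly.

```cypher
// People who know every skill in the required list.
WITH ["AWS", "Python"] AS required
MATCH (p:Person)-[:KNOWS]->(s:Skill)
WHERE s.Skill_Alias IN required
WITH p, required, count(DISTINCT s) AS matched
WHERE matched = size(required)
RETURN p.Alias AS person;
```

Changing the last filter to `matched >= 1` would return anyone who has at least one of the listed skills.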
Query to Find What Skillset a Person Has
We wanted to know which skills ‘John-Doe’ has (‘John-Doe’ is the value of the Alias field). The command below shows the graph output with all the skills associated with ‘John-Doe’ through the ‘KNOWS’ relationship.
MATCH (a:Person)-[:KNOWS]->(b:Skill) WHERE a.Alias = "John-Doe" RETURN a, b;
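Finding the connection between two people, mentioned in the use case, can be sketched with Cypher's built-in shortestPath function. The alias ‘Jane-Roe’ is a hypothetical second person, and the limit of six hops is an arbitrary assumption to keep the search bounded.

```cypher
// Shortest connection between two people over any relationship type.
MATCH (a:Person {Alias: "John-Doe"}), (b:Person {Alias: "Jane-Roe"}),
      path = shortestPath((a)-[*..6]-(b))
RETURN path;
```

Because the pattern is undirected and untyped, the path may run through shared skills or certifications as well as the reporting hierarchy.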
Conclusion
Neo4j is highly efficient at managing data with many interconnecting relationships. Its data model doesn’t usually require a predefined schema. Unlike a traditional DBMS, we don’t need to create the database structure before loading the data. Neo4j is a “schema-optional” DBMS, where the data is the structure.
We wanted to create a basic employee database and query it quickly through Neo4j to understand what this popular graph database can do. The whole exercise took less than 2 hours. If you want to build a complete end-to-end application, Neo4j provides a lot of functionality to support more advanced requirements.
Neo4j is extremely well suited to social networking applications like Facebook and Twitter, but there are many other areas where it excels. Here are some of the areas Neo4j can be used for:
- Social Networks
- Real-time Product Recommendations
- Network diagrams
- Fraud Detection
- Access Management
- Graph Based Search of Digital Assets
- Master Data Management