Using PostgreSQL as a Graph Database: A Simple Approach for Beginners
Graph databases are great for representing complex relationships between entities, but they can be expensive. As an alternative, we can use PostgreSQL to achieve similar functionality at a lower cost. This guide explains how to do it, even if you’re not very technical.
Concept Overview
In a graph database, data is stored as nodes and edges. Nodes represent entities (like people, accounts, or products), and edges represent relationships between these entities (like friendships, ownerships, or transactions).
We’ll use two PostgreSQL tables to store this data:
- Entity Table: Holds the nodes.
- Relationship Table: Holds the edges.
We’ll refer to entities as “nodes” and relationships as “edges.”
Why Use PostgreSQL?
Graph databases like Neo4j can be expensive. PostgreSQL is a powerful and cost-effective alternative that can be used to create a similar structure by leveraging its JSONB support and indexing capabilities.
Entity Table
The entity
table in PostgreSQL is designed to store entities with a flexible schema, accommodating various properties without the need to predefine a fixed set of columns for all possible attributes. This approach uses JSONB to store dynamic properties, while frequently queried attributes are stored in dedicated columns for better performance.
CREATE TABLE public.entity (
id text PRIMARY KEY,
_type text NOT NULL,
_class text NOT NULL,
metadata jsonb DEFAULT '{}'::jsonb
);
CREATE INDEX entity_class ON entity (_class);
CREATE INDEX entity_type ON entity (_type);
- id: Unique identifier (e.g., account_id, commit_sha).
- _type: Type of the entity (e.g., github_account, snyk_account).
- _class: Label for the entity (e.g., Account, User).
- metadata: Other data in JSON format.
Frequently queried attributes like _type
and _class
are stored in dedicated columns and indexed, improving query performance.
Relationship Table
The relationship table is a key component for modeling directed relationships between entities in a graph-like structure within PostgreSQL. It captures how entities (nodes) are connected through specific relationships (edges).
CREATE TABLE public.relationship (
source_entity_id text NOT NULL,
target_entity_id text NOT NULL,
_class text,
PRIMARY KEY (source_entity_id, target_entity_id),
CHECK (source_entity_id != target_entity_id),
CONSTRAINT relationship_source_entity_id_fkey FOREIGN KEY (source_entity_id) REFERENCES public.entity (id),
CONSTRAINT relationship_target_entity_id_fkey FOREIGN KEY (target_entity_id) REFERENCES public.entity (id)
);
CREATE INDEX relationship_class ON relationship (_class);
- source_entity_id: ID of the source entity.
- target_entity_id: ID of the target entity.
- _class: Label for the relationship (e.g., HAS, REVIEWED).
Constraints and Indexes
PRIMARY KEY (source_entity_id, target_entity_id):
This composite primary key ensures the uniqueness of each relationship, preventing duplicate entries between the same pair of entities.CHECK (source_entity_id != target_entity_id):
This constraint prevents self-referencing relationships, ensuring an entity cannot have a relationship with itself. This avoids potential infinite loops and maintains the integrity of the graph structure.
Preventing Cycles
Cycles in a graph can lead to infinite loops and inconsistent data, making it crucial to detect and prevent them when managing relationships between entities. Here’s a detailed explanation on how to create and use a function in PostgreSQL to detect cycles.
Why Prevent Cycles?
In a graph database, cycles occur when a path forms a loop, meaning you can start at a node and return to it by following the edges. This can cause issues in data processing and querying, especially in recursive operations.
Creating a Function to Detect Cycles
To prevent cycles, we can create a recursive function in PostgreSQL that traverses the graph and checks for the presence of cycles before adding new relationships. Here’s how you can implement it:
- Create the Function
This function,has_cycle
, will take two parameters:input_source_node_id
andinput_target_node_id
. It will returnTRUE
if a cycle is detected andFALSE
otherwise.
CREATE OR REPLACE FUNCTION has_cycle(input_source_node_id text, input_target_node_id text)
RETURNS BOOLEAN
LANGUAGE plpgsql AS $$
DECLARE
rec RECORD;
BEGIN
FOR rec IN
WITH RECURSIVE traversed AS (
SELECT
ARRAY[input_source_node_id] AS path,
input_target_node_id AS target_node_id
UNION ALL
SELECT
traversed.path || relationship.source_entity_id,
relationship.target_entity_id
FROM traversed
JOIN relationship ON relationship.source_entity_id = traversed.target_node_id
)
SELECT * FROM traversed
LOOP
IF rec.target_node_id = ANY(rec.path) THEN
RETURN TRUE; -- Early return, stop looking when first cycle is detected
END IF;
END LOOP;
RETURN FALSE;
END;
$$;
- Using the Function
Integrate thehas_cycle
function into your relationship management process to ensure no cycles are created. You can add a check constraint to therelationship
table to prevent inserting a relationship that would cause a cycle.
ALTER TABLE relationship
ADD CONSTRAINT check_no_cycles CHECK (NOT has_cycle(source_entity_id, target_entity_id));
This constraint ensures that before any new relationship is inserted into the relationship
table, the has_cycle
function is called to verify that the new relationship won't create a cycle. If a cycle is detected, the insertion is aborted, maintaining the integrity of your graph.
Adding Data to the Entity and Relationship Tables
To populate the entity
and relationship
tables with data, we'll follow these steps:
- Add Entities: Insert data representing various entities such as accounts, users, repositories, and pull requests.
- Add Relationships: Define the directed relationships between these entities.
Adding Entities
Here we insert multiple entities into the entity
table. Each entity has an ID, type, class, and metadata.
INSERT INTO public.entity (id, _type, _class, metadata)
VALUES
('account_1', 'github_account', 'Account', '{"name": "Account 1"}'),
('account_2', 'github_account', 'Account', '{"name": "Account 2"}'),
('user_1', 'github_user', 'User', '{"name": "User 1"}'),
('user_2', 'github_user', 'User', '{"name": "User 2"}'),
('user_3', 'github_user', 'User', '{"name": "User 3"}'),
('repository_1', 'github_repository', 'CodeRepo', '{"name": "Repository 1"}'),
('repository_2', 'github_repository', 'CodeRepo', '{"name": "Repository 2"}'),
('repository_3', 'github_repository', 'CodeRepo', '{"name": "Repository 3"}'),
('pr_1', 'github_pullRequest', 'PullRequest', '{"name": "Pull Request 1"}'),
('pr_2', 'github_pullRequest', 'PullRequest', '{"name": "Pull Request 2"}'),
('pr_3', 'github_pullRequest', 'PullRequest', '{"name": "Pull Request 3"}')
ON CONFLICT (id)
DO NOTHING;
- Purpose: This command inserts records for GitHub accounts, users, repositories, and pull requests into the
entity
table. - ON CONFLICT (id) DO NOTHING: Ensures that if an entity with the same ID already exists, it will not be inserted again, preventing duplicate entries.
Adding Relationships
Next, we define the relationships between these entities and insert them into the relationship
table.
INSERT INTO public.relationship (source_entity_id, target_entity_id, _class)
VALUES
('account_1', 'user_1', 'HAS'),
('account_1', 'user_2', 'HAS'),
('account_2', 'user_3', 'HAS'),
('account_1', 'repository_1', 'HAS'),
('account_1', 'repository_2', 'HAS'),
('account_2', 'repository_3', 'HAS'),
('repository_1', 'pr_2', 'HAS'),
('repository_2', 'pr_1', 'HAS'),
('repository_3', 'pr_3', 'HAS'),
('user_2', 'pr_1', 'APPROVED'),
('user_2', 'pr_2', 'OPENED')
ON CONFLICT (source_entity_id, target_entity_id)
DO NOTHING;
- Purpose: This command establishes directed relationships between entities. For example, an account “HAS” users or repositories, and users “APPROVED” or “OPENED” pull requests.
- ON CONFLICT (source_entity_id, target_entity_id) DO NOTHING: Ensures that if a relationship with the same source and target entity IDs already exists, it will not be inserted again, preventing duplicate entries.
Querying the Data
To retrieve data from the entity
and relationship
tables, you can use various SQL queries to fetch specific entities or relationships, and even visualize hierarchical relationships using recursive queries.
SELECT * FROM entity;
SELECT * FROM entity WHERE _class = 'User';
To find entities directly related to others:
SELECT * FROM relationship WHERE _class = 'HAS';
Recursive Queries
Recursive queries are useful for visualizing and navigating hierarchical relationships between entities.
Visualizing Relationships Recursively:
The following query recursively finds all relationships starting from any entity and joins the related entities:
WITH RECURSIVE AccountRelationships AS (
SELECT
*,
r._class AS relationship_class
FROM
public.entity e
JOIN
public.relationship r ON e.id = r.source_entity_id
)
SELECT
ar.id,
ar._type,
ar.metadata,
ar.relationship_class,
e.*
FROM
AccountRelationships ar
JOIN
public.entity e ON ar.target_entity_id = e.id;
- WITH RECURSIVE AccountRelationships AS (…): This CTE (Common Table Expression) creates a temporary result set called
AccountRelationships
which recursively finds relationships. - SELECT … JOIN public.entity e ON ar.target_entity_id = e.id: This part joins the
AccountRelationships
result set with theentity
table to retrieve detailed information about the target entities.
Use Case: Find Pull Requests for an Account
To find all pull requests related to a specific account using a recursive query, you can use the following query. This will navigate through the relationships starting from the account and find related pull requests.
WITH RECURSIVE AccountRelationships AS (
SELECT
*,sq
r._class AS relationship_class
FROM
public.entity e
JOIN
public.relationship r ON e.id = r.source_entity_id
WHERE
e._class = 'Account' AND e.id = 'account_1'
)
SELECT
ar.id,
ar._type,
ar.metadata,
ar.relationship_class,
e.*
FROM
AccountRelationships ar
JOIN
public.entity e ON ar.target_entity_id = e.id
WHERE
e._class = 'CodeRepo'
- WITH RECURSIVE AccountRelationships AS (…): The CTE starts with the specified account (
account_1
). - UNION ALL: Recursively joins the
entity
andrelationship
tables to find all connected entities. - WHERE e._class = ‘PullRequest’;: Filters the results to only include pull requests.
Conclusion
Using PostgreSQL as a graph database enables efficient storage and querying of complex relationships. By leveraging PostgreSQL’s capabilities in JSONB storage and indexing, you can create a flexible and powerful data model without incurring the costs of specialized graph databases. This approach allows you to:
- Store Entities and Relationships: Capture various types of entities and their relationships.
- Query Data Efficiently: Use SQL queries to fetch and analyze data.
- Visualize Relationships: Utilize recursive queries to navigate and visualize hierarchical data.
By structuring your data in this manner, you can achieve optimal performance and flexibility for a wide range of applications.