A research knowledge graph tracks the relationships between papers, authors, institutions, topics, and funding. These relationships are what make academic data valuable — citation networks reveal influence, co-authorship reveals collaboration patterns, and funding links reveal the flow of resources through the research ecosystem. This tutorial builds a complete research knowledge graph in Ferrosa and queries it with both CQL and Cypher.

Note
Complete the 3-Node Cluster Setup tutorial first. Your cluster must run with FERROSA_GRAPH_ENABLED=true, which that tutorial already sets.

The Use Case

University research offices, funding agencies, and academic publishers all need to answer relationship questions about scholarly output:

  • Which researchers collaborate across institutions?

  • What is the citation chain between two papers?

  • Which topics are covered by papers funded by a specific grant?

  • Who are the most influential researchers in a subfield, measured by transitive citation impact?

  • Which institutions share the most topic overlap?

In a traditional Cassandra deployment, each of these questions requires multiple round trips and application-side join logic. A separate graph database (Neo4j, Neptune) adds ETL complexity and consistency headaches. Ferrosa eliminates both problems: the same CQL tables serve key-value lookups and graph traversals, with no data duplication.
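
As a taste of what this buys you, the cross-institution collaboration question above collapses into a single Cypher traversal. This sketch assumes the schema defined later in this tutorial, and assumes vertex columns are exposed as Cypher properties:

```cypher
// Researchers who collaborate while affiliated with different institutions
MATCH (a:Researcher)-[:COLLABORATES]->(b:Researcher),
      (a)-[:AFFILIATED_WITH]->(ia:Institution),
      (b)-[:AFFILIATED_WITH]->(ib:Institution)
WHERE ia <> ib
RETURN a.name, ia.name, b.name, ib.name;
```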

The Data Model

Our knowledge graph has five entity types (vertices) and six relationship types (edges):

| Vertex | Description | Key properties |
| --- | --- | --- |
| Researcher | An academic with publications and metrics | name, h_index, specializations (set), social_links (map) |
| Institution | A university or research lab | name, country, ranking, departments (set) |
| Paper | A published research paper | title, venue, year, citation_count, keywords (set), urls (map) |
| Topic | A research area or subfield | name, parent_topic, paper_count |
| Grant | A funding award | title, funder, amount (decimal), start_date, end_date |

| Edge | Connects | Properties |
| --- | --- | --- |
| AUTHORED | Researcher → Paper | author_position, corresponding (boolean) |
| CITES | Paper → Paper | context, section |
| AFFILIATED_WITH | Researcher → Institution | department, role, start_year, end_year |
| COVERS | Paper → Topic | relevance (float) |
| FUNDED_BY | Paper → Grant | role |
| COLLABORATES | Researcher → Researcher | paper_count, first_collab_year, last_collab_year |

Schema Design

Ferrosa’s graph engine works on top of ordinary CQL tables. You annotate vertex tables with graph.type: vertex and edge tables with graph.type: edge, plus metadata that tells the engine how to resolve cross-table joins.

Vertex tables

Each vertex table has a single UUID primary key. The graph.label extension maps the CQL table to a Cypher label — MATCH (r:Researcher) references the table with graph.label = 'Researcher'.

This schema exercises several CQL data types: uuid, text, int, float, decimal, date, timestamp, boolean, set<text>, and map<text, text>.

Edge tables

Edge tables use a composite primary key of (source_id, target_id). The graph.source and graph.target extensions tell the engine which columns hold the vertex UUIDs, and graph.source_label / graph.target_label resolve which vertex table each UUID belongs to.

This design means CQL can efficiently answer "which papers did Alice author?" (partition key lookup on source_id), while the graph engine’s reverse adjacency index also supports "who authored this paper?" without ALLOW FILTERING.
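
Concretely, the forward direction is a plain single-partition CQL read against the authored table defined below (the UUID is Alice's id from the sample dataset), while the reverse direction is served by the graph engine:

```sql
-- Forward: which papers did Alice author? Partition-key lookup on source_id.
SELECT target_id, author_position, corresponding
FROM knowledge.authored
WHERE source_id = 11111111-0000-0000-0000-000000000001;

-- Reverse: who authored a given paper? Answered via the reverse adjacency
-- index, e.g. in Cypher (assumes the primary key is exposed as a property):
--   MATCH (r:Researcher)-[:AUTHORED]->(p:Paper)
--   WHERE p.paper_id = 33333333-0000-0000-0000-000000000001
--   RETURN r.name;
```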

Secondary indexes

Secondary indexes on frequently queried columns (name, venue, year, country) enable CQL to filter without full table scans. Ferrosa’s secondary indexes are backed by local token-range scans and the graph adjacency index, making them far more efficient than Cassandra’s legacy secondary indexes.
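
With the indexes defined in the schema below in place, filters on those columns work without a partition key and without ALLOW FILTERING. For example:

```sql
-- Served by the index on paper (venue)
SELECT title, citation_count FROM knowledge.paper WHERE venue = 'NeurIPS';

-- Served by the index on institution (country)
SELECT name, city FROM knowledge.institution WHERE country = 'United States';
```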

The full schema

CREATE KEYSPACE IF NOT EXISTS knowledge WITH replication = {
    'class': 'SimpleStrategy', 'replication_factor': 3
};

USE knowledge;

-- ============================================================
-- Vertex tables (entities)
-- ============================================================

-- Researcher: an academic with publications, metrics, and social links
CREATE TABLE IF NOT EXISTS researcher (
    researcher_id uuid PRIMARY KEY,
    name text,
    email text,
    title text,
    h_index int,
    citations_total int,
    orcid text,
    joined_at timestamp,
    active boolean,
    specializations set<text>,
    social_links map<text, text>
) WITH extensions = {
    'graph.type': 'vertex',
    'graph.label': 'Researcher'
};

-- Institution: a university or research lab
CREATE TABLE IF NOT EXISTS institution (
    institution_id uuid PRIMARY KEY,
    name text,
    country text,
    city text,
    institution_type text,
    ranking int,
    founded_year int,
    departments set<text>,
    metadata map<text, text>
) WITH extensions = {
    'graph.type': 'vertex',
    'graph.label': 'Institution'
};

-- Paper: a published research paper
CREATE TABLE IF NOT EXISTS paper (
    paper_id uuid PRIMARY KEY,
    title text,
    abstract_text text,
    doi text,
    venue text,
    year int,
    month int,
    citation_count int,
    paper_type text,
    keywords set<text>,
    urls map<text, text>
) WITH extensions = {
    'graph.type': 'vertex',
    'graph.label': 'Paper'
};

-- Topic: a research area or subfield
CREATE TABLE IF NOT EXISTS topic (
    topic_id uuid PRIMARY KEY,
    name text,
    parent_topic text,
    description text,
    paper_count int
) WITH extensions = {
    'graph.type': 'vertex',
    'graph.label': 'Topic'
};

-- Grant: a funding award from a government or private funder
CREATE TABLE IF NOT EXISTS grant_award (
    grant_id uuid PRIMARY KEY,
    title text,
    funder text,
    amount decimal,
    currency text,
    start_date date,
    end_date date,
    status text
) WITH extensions = {
    'graph.type': 'vertex',
    'graph.label': 'Grant'
};

-- ============================================================
-- Edge tables (relationships)
-- ============================================================

-- Researcher AUTHORED Paper
CREATE TABLE IF NOT EXISTS authored (
    source_id uuid,
    target_id uuid,
    author_position int,
    corresponding boolean,
    PRIMARY KEY (source_id, target_id)
) WITH extensions = {
    'graph.type': 'edge',
    'graph.label': 'AUTHORED',
    'graph.source': 'source_id',
    'graph.target': 'target_id',
    'graph.source_label': 'Researcher',
    'graph.target_label': 'Paper'
};

-- Paper CITES Paper
CREATE TABLE IF NOT EXISTS cites (
    source_id uuid,
    target_id uuid,
    context text,
    section text,
    PRIMARY KEY (source_id, target_id)
) WITH extensions = {
    'graph.type': 'edge',
    'graph.label': 'CITES',
    'graph.source': 'source_id',
    'graph.target': 'target_id',
    'graph.source_label': 'Paper',
    'graph.target_label': 'Paper'
};

-- Researcher AFFILIATED_WITH Institution
CREATE TABLE IF NOT EXISTS affiliated_with (
    source_id uuid,
    target_id uuid,
    department text,
    role text,
    start_year int,
    end_year int,
    PRIMARY KEY (source_id, target_id)
) WITH extensions = {
    'graph.type': 'edge',
    'graph.label': 'AFFILIATED_WITH',
    'graph.source': 'source_id',
    'graph.target': 'target_id',
    'graph.source_label': 'Researcher',
    'graph.target_label': 'Institution'
};

-- Paper COVERS Topic
CREATE TABLE IF NOT EXISTS covers_topic (
    source_id uuid,
    target_id uuid,
    relevance float,
    PRIMARY KEY (source_id, target_id)
) WITH extensions = {
    'graph.type': 'edge',
    'graph.label': 'COVERS',
    'graph.source': 'source_id',
    'graph.target': 'target_id',
    'graph.source_label': 'Paper',
    'graph.target_label': 'Topic'
};

-- Paper FUNDED_BY Grant
CREATE TABLE IF NOT EXISTS funded_by (
    source_id uuid,
    target_id uuid,
    role text,
    PRIMARY KEY (source_id, target_id)
) WITH extensions = {
    'graph.type': 'edge',
    'graph.label': 'FUNDED_BY',
    'graph.source': 'source_id',
    'graph.target': 'target_id',
    'graph.source_label': 'Paper',
    'graph.target_label': 'Grant'
};

-- Researcher COLLABORATES with Researcher
CREATE TABLE IF NOT EXISTS collaborates_with (
    source_id uuid,
    target_id uuid,
    paper_count int,
    first_collab_year int,
    last_collab_year int,
    PRIMARY KEY (source_id, target_id)
) WITH extensions = {
    'graph.type': 'edge',
    'graph.label': 'COLLABORATES',
    'graph.source': 'source_id',
    'graph.target': 'target_id',
    'graph.source_label': 'Researcher',
    'graph.target_label': 'Researcher'
};

-- ============================================================
-- Secondary indexes for CQL queries
-- ============================================================

CREATE INDEX IF NOT EXISTS ON researcher (name);
CREATE INDEX IF NOT EXISTS ON researcher (active);
CREATE INDEX IF NOT EXISTS ON paper (venue);
CREATE INDEX IF NOT EXISTS ON paper (year);
CREATE INDEX IF NOT EXISTS ON paper (doi);
CREATE INDEX IF NOT EXISTS ON institution (country);
CREATE INDEX IF NOT EXISTS ON institution (name);
CREATE INDEX IF NOT EXISTS ON topic (name);
CREATE INDEX IF NOT EXISTS ON grant_award (funder);
CREATE INDEX IF NOT EXISTS ON grant_award (status);

Loading Data

Our sample dataset includes:

  • 10 researchers across 4 institutions (MIT, Stanford, Oxford, ETH Zurich)

  • 4 institutions with departments and metadata

  • 12 papers with realistic titles in ML, distributed systems, NLP, computer vision, databases, and security

  • 6 topics forming a hierarchy (NLP and Computer Vision are subtopics of Machine Learning)

  • 3 grants from NSF, DARPA, and ERC

  • 27 authorship edges (multiple authors per paper, with position and corresponding author flags)

  • 12 citation edges (creating cycles and chains in the citation graph)

  • 11 affiliation edges (including one cross-institutional: Alice was at Stanford before MIT)

  • 24 topic coverage edges (with relevance scores)

  • 6 funding edges

  • 11 collaboration edges

We use readable UUIDs with distinct prefixes so you can follow the data through queries:

  • Researchers: 11111111-0000-0000-0000-00000000000N

  • Institutions: 22222222-0000-0000-0000-00000000000N

  • Papers: 33333333-0000-0000-0000-00000000000N

  • Topics: 44444444-0000-0000-0000-00000000000N

  • Grants: 55555555-0000-0000-0000-00000000000N

In production you would use uuid() to generate them.
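
For example, a production insert would look like this (the name is a placeholder):

```sql
-- uuid() generates a random type-4 UUID server-side
INSERT INTO knowledge.researcher (researcher_id, name, active)
VALUES (uuid(), 'New Researcher', true);
```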

USE knowledge;

-- ============================================================
-- Institutions (4)
-- ============================================================

INSERT INTO institution (institution_id, name, country, city, institution_type, ranking, founded_year, departments, metadata) VALUES
  (22222222-0000-0000-0000-000000000001, 'MIT', 'United States', 'Cambridge',
   'University', 1, 1861,
   {'CSAIL', 'EECS', 'Mathematics', 'Brain and Cognitive Sciences'},
   {'website': 'https://www.mit.edu', 'endowment_usd': '27.4B'});

INSERT INTO institution (institution_id, name, country, city, institution_type, ranking, founded_year, departments, metadata) VALUES
  (22222222-0000-0000-0000-000000000002, 'Stanford', 'United States', 'Stanford',
   'University', 2, 1885,
   {'Computer Science', 'Statistics', 'AI Lab', 'HAI'},
   {'website': 'https://www.stanford.edu', 'endowment_usd': '36.3B'});

INSERT INTO institution (institution_id, name, country, city, institution_type, ranking, founded_year, departments, metadata) VALUES
  (22222222-0000-0000-0000-000000000003, 'Oxford', 'United Kingdom', 'Oxford',
   'University', 3, 1096,
   {'Computer Science', 'Mathematics', 'Engineering Science'},
   {'website': 'https://www.ox.ac.uk', 'endowment_gbp': '7.1B'});

INSERT INTO institution (institution_id, name, country, city, institution_type, ranking, founded_year, departments, metadata) VALUES
  (22222222-0000-0000-0000-000000000004, 'ETH Zurich', 'Switzerland', 'Zurich',
   'University', 7, 1855,
   {'Computer Science', 'Information Technology and Electrical Engineering', 'Mathematics'},
   {'website': 'https://ethz.ch', 'type': 'public'});

-- ============================================================
-- Researchers (10)
-- ============================================================

-- MIT researchers
INSERT INTO researcher (researcher_id, name, email, title, h_index, citations_total, orcid, joined_at, active, specializations, social_links) VALUES
  (11111111-0000-0000-0000-000000000001, 'Alice Chen', 'achen@mit.edu',
   'Associate Professor', 45, 12800, '0000-0001-2345-6789',
   '2015-09-01T00:00:00Z', true,
   {'Deep Learning', 'Natural Language Processing', 'Transformers'},
   {'scholar': 'https://scholar.google.com/alice', 'twitter': '@alicechen_ml'});

INSERT INTO researcher (researcher_id, name, email, title, h_index, citations_total, orcid, joined_at, active, specializations, social_links) VALUES
  (11111111-0000-0000-0000-000000000002, 'Bob Martinez', 'bmartinez@mit.edu',
   'Assistant Professor', 28, 4200, '0000-0002-3456-7890',
   '2019-01-15T00:00:00Z', true,
   {'Distributed Systems', 'Consensus Protocols', 'Cloud Computing'},
   {'scholar': 'https://scholar.google.com/bob', 'github': 'bobmartinez'});

INSERT INTO researcher (researcher_id, name, email, title, h_index, citations_total, orcid, joined_at, active, specializations, social_links) VALUES
  (11111111-0000-0000-0000-000000000003, 'Carol Washington', 'cwash@mit.edu',
   'PhD Student', 8, 620, '0000-0003-4567-8901',
   '2021-09-01T00:00:00Z', true,
   {'Computer Vision', 'Medical Imaging'},
   {'scholar': 'https://scholar.google.com/carol'});

-- Stanford researchers
INSERT INTO researcher (researcher_id, name, email, title, h_index, citations_total, orcid, joined_at, active, specializations, social_links) VALUES
  (11111111-0000-0000-0000-000000000004, 'David Kim', 'dkim@stanford.edu',
   'Professor', 62, 31500, '0000-0004-5678-9012',
   '2008-07-01T00:00:00Z', true,
   {'Machine Learning', 'Statistical Learning Theory', 'Optimization'},
   {'scholar': 'https://scholar.google.com/dkim', 'twitter': '@davidkim_ai'});

INSERT INTO researcher (researcher_id, name, email, title, h_index, citations_total, orcid, joined_at, active, specializations, social_links) VALUES
  (11111111-0000-0000-0000-000000000005, 'Eva Schmidt', 'eschmidt@stanford.edu',
   'Associate Professor', 38, 9100, '0000-0005-6789-0123',
   '2014-01-10T00:00:00Z', true,
   {'Database Systems', 'Query Optimization', 'Distributed Databases'},
   {'scholar': 'https://scholar.google.com/eva', 'website': 'https://eva-schmidt.dev'});

INSERT INTO researcher (researcher_id, name, email, title, h_index, citations_total, orcid, joined_at, active, specializations, social_links) VALUES
  (11111111-0000-0000-0000-000000000006, 'Frank Okafor', 'fokafor@stanford.edu',
   'Postdoctoral Fellow', 15, 1800, '0000-0006-7890-1234',
   '2022-06-01T00:00:00Z', true,
   {'Federated Learning', 'Privacy-Preserving ML', 'Security'},
   {'github': 'frankokafor'});

-- Oxford researchers
INSERT INTO researcher (researcher_id, name, email, title, h_index, citations_total, orcid, joined_at, active, specializations, social_links) VALUES
  (11111111-0000-0000-0000-000000000007, 'Grace Okonkwo', 'gokonkwo@cs.ox.ac.uk',
   'Professor', 55, 22000, '0000-0007-8901-2345',
   '2005-10-01T00:00:00Z', true,
   {'Formal Verification', 'Distributed Consensus', 'Security'},
   {'scholar': 'https://scholar.google.com/grace', 'twitter': '@graceokonkwo'});

INSERT INTO researcher (researcher_id, name, email, title, h_index, citations_total, orcid, joined_at, active, specializations, social_links) VALUES
  (11111111-0000-0000-0000-000000000008, 'Henry Patel', 'hpatel@cs.ox.ac.uk',
   'Lecturer', 22, 3100, '0000-0008-9012-3456',
   '2018-03-15T00:00:00Z', true,
   {'Natural Language Processing', 'Knowledge Graphs', 'Information Retrieval'},
   {'scholar': 'https://scholar.google.com/henry'});

-- ETH Zurich researchers
INSERT INTO researcher (researcher_id, name, email, title, h_index, citations_total, orcid, joined_at, active, specializations, social_links) VALUES
  (11111111-0000-0000-0000-000000000009, 'Ingrid Johansson', 'ijohansson@ethz.ch',
   'Assistant Professor', 19, 2400, '0000-0009-0123-4567',
   '2020-04-01T00:00:00Z', true,
   {'Distributed Systems', 'Storage Engines', 'Cloud-Native Databases'},
   {'website': 'https://ingridj.ethz.ch', 'github': 'ingridj'});

INSERT INTO researcher (researcher_id, name, email, title, h_index, citations_total, orcid, joined_at, active, specializations, social_links) VALUES
  (11111111-0000-0000-0000-000000000010, 'Jun Tanaka', 'jtanaka@ethz.ch',
   'PhD Student', 6, 380, '0000-0010-1234-5678',
   '2023-02-01T00:00:00Z', true,
   {'Computer Vision', 'Self-Supervised Learning'},
   {'github': 'juntanaka'});

-- ============================================================
-- Topics (6)
-- ============================================================

INSERT INTO topic (topic_id, name, parent_topic, description, paper_count) VALUES
  (44444444-0000-0000-0000-000000000001, 'Machine Learning', null,
   'Algorithms that improve automatically through experience and data', 45000);

INSERT INTO topic (topic_id, name, parent_topic, description, paper_count) VALUES
  (44444444-0000-0000-0000-000000000002, 'Distributed Systems', null,
   'Computing systems with components on networked computers that coordinate actions', 18000);

INSERT INTO topic (topic_id, name, parent_topic, description, paper_count) VALUES
  (44444444-0000-0000-0000-000000000003, 'Natural Language Processing', 'Machine Learning',
   'Computational techniques for analyzing and generating human language', 22000);

INSERT INTO topic (topic_id, name, parent_topic, description, paper_count) VALUES
  (44444444-0000-0000-0000-000000000004, 'Computer Vision', 'Machine Learning',
   'Extraction of high-dimensional information from images and video', 28000);

INSERT INTO topic (topic_id, name, parent_topic, description, paper_count) VALUES
  (44444444-0000-0000-0000-000000000005, 'Databases', null,
   'Systems for storing, querying, and managing structured data', 15000);

INSERT INTO topic (topic_id, name, parent_topic, description, paper_count) VALUES
  (44444444-0000-0000-0000-000000000006, 'Security', null,
   'Protection of computer systems and networks from attack and unauthorized access', 12000);

-- ============================================================
-- Papers (12)
-- ============================================================

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000001,
   'Attention Is All You Need 2.0: Sparse Transformers at Scale',
   'We present an evolution of the transformer architecture that uses sparse attention patterns to achieve linear-time complexity while maintaining quality on long-context tasks.',
   '10.1234/neurips.2024.001', 'NeurIPS', 2024, 12, 142,
   'Conference',
   {'transformers', 'sparse attention', 'scaling laws', 'deep learning'},
   {'pdf': 'https://arxiv.org/pdf/2024.00001', 'code': 'https://github.com/sparse-tf'});

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000002,
   'Raft Revisited: Consensus for Disaggregated Storage',
   'We extend the Raft consensus protocol to support disaggregated storage architectures where compute and storage are separated, achieving 3x throughput improvement.',
   '10.1234/osdi.2024.002', 'OSDI', 2024, 7, 87,
   'Conference',
   {'raft', 'consensus', 'disaggregated storage', 'distributed systems'},
   {'pdf': 'https://arxiv.org/pdf/2024.00002'});

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000003,
   'Vision Transformers for Medical Image Segmentation',
   'We adapt vision transformers for volumetric medical image segmentation, outperforming CNN-based methods on three benchmark datasets.',
   '10.1234/miccai.2024.003', 'MICCAI', 2024, 10, 63,
   'Conference',
   {'vision transformers', 'medical imaging', 'segmentation'},
   {'pdf': 'https://arxiv.org/pdf/2024.00003', 'dataset': 'https://data.medseg.org'});

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000004,
   'Federated Learning with Differential Privacy Guarantees',
   'A framework for training ML models across institutions without sharing raw data, providing formal epsilon-delta differential privacy bounds.',
   '10.1234/icml.2024.004', 'ICML', 2024, 7, 95,
   'Conference',
   {'federated learning', 'differential privacy', 'distributed ML'},
   {'pdf': 'https://arxiv.org/pdf/2024.00004', 'code': 'https://github.com/fedpriv'});

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000005,
   'LSM-Tree Compaction Strategies for Cloud-Native Databases',
   'We survey and benchmark compaction strategies for LSM-tree storage engines running on cloud object stores, proposing a new tiered-leveled hybrid.',
   '10.1234/vldb.2024.005', 'VLDB', 2024, 8, 41,
   'Conference',
   {'LSM-tree', 'compaction', 'cloud storage', 'databases'},
   {'pdf': 'https://arxiv.org/pdf/2024.00005'});

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000006,
   'Formal Verification of Distributed Consensus Protocols',
   'We apply TLA+ and Isabelle/HOL to mechanically verify safety and liveness properties of three consensus protocols including Raft and Paxos.',
   '10.1234/popl.2023.006', 'POPL', 2023, 1, 112,
   'Conference',
   {'formal verification', 'TLA+', 'consensus', 'Raft', 'Paxos'},
   {'pdf': 'https://arxiv.org/pdf/2023.00006', 'proofs': 'https://github.com/consensus-proofs'});

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000007,
   'Knowledge Graph Embeddings for Scientific Literature',
   'We learn dense vector representations of entities in a scientific knowledge graph, enabling link prediction for citation recommendation.',
   '10.1234/emnlp.2023.007', 'EMNLP', 2023, 11, 78,
   'Conference',
   {'knowledge graphs', 'embeddings', 'citation prediction', 'NLP'},
   {'pdf': 'https://arxiv.org/pdf/2023.00007'});

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000008,
   'Self-Supervised Contrastive Learning for Object Detection',
   'We propose a contrastive pre-training objective that improves downstream object detection accuracy by 4.2% on COCO without additional labeled data.',
   '10.1234/cvpr.2023.008', 'CVPR', 2023, 6, 156,
   'Conference',
   {'self-supervised learning', 'contrastive learning', 'object detection'},
   {'pdf': 'https://arxiv.org/pdf/2023.00008', 'weights': 'https://huggingface.co/ssl-det'});

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000009,
   'Query Optimization in Heterogeneous Distributed Databases',
   'A cost-based optimizer for queries spanning multiple storage backends including row stores, column stores, and object stores.',
   '10.1234/sigmod.2023.009', 'SIGMOD', 2023, 6, 54,
   'Conference',
   {'query optimization', 'distributed databases', 'cost model'},
   {'pdf': 'https://arxiv.org/pdf/2023.00009'});

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000010,
   'Adversarial Robustness in Federated Learning Systems',
   'We analyze attack surfaces in federated learning and propose Byzantine-resilient aggregation methods with provable convergence guarantees.',
   '10.1234/sp.2024.010', 'IEEE S&P', 2024, 5, 38,
   'Conference',
   {'adversarial ML', 'federated learning', 'Byzantine fault tolerance', 'security'},
   {'pdf': 'https://arxiv.org/pdf/2024.00010'});

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000011,
   'A Survey of Large Language Models for Code Generation',
   'We survey 47 large language models evaluated on code generation benchmarks, analyzing scaling behavior, training data, and safety implications.',
   '10.1234/acmcs.2024.011', 'ACM Computing Surveys', 2024, 3, 210,
   'Journal',
   {'large language models', 'code generation', 'survey', 'NLP'},
   {'pdf': 'https://arxiv.org/pdf/2024.00011'});

INSERT INTO paper (paper_id, title, abstract_text, doi, venue, year, month, citation_count, paper_type, keywords, urls) VALUES
  (33333333-0000-0000-0000-000000000012,
   'Consistent Hashing with Bounded Loads for Elastic Storage',
   'We extend consistent hashing to bound per-node load during elastic scaling, proving O(1/n) maximum load deviation with high probability.',
   '10.1234/disc.2024.012', 'DISC', 2024, 10, 29,
   'Conference',
   {'consistent hashing', 'load balancing', 'elastic storage', 'distributed systems'},
   {'pdf': 'https://arxiv.org/pdf/2024.00012'});

-- ============================================================
-- Grants (3)
-- ============================================================

INSERT INTO grant_award (grant_id, title, funder, amount, currency, start_date, end_date, status) VALUES
  (55555555-0000-0000-0000-000000000001,
   'Scalable Privacy-Preserving Machine Learning',
   'NSF', 1200000.00, 'USD', '2023-01-01', '2026-12-31', 'active');

INSERT INTO grant_award (grant_id, title, funder, amount, currency, start_date, end_date, status) VALUES
  (55555555-0000-0000-0000-000000000002,
   'Next-Generation Cloud-Native Database Architectures',
   'DARPA', 3500000.00, 'USD', '2022-06-01', '2025-05-31', 'active');

INSERT INTO grant_award (grant_id, title, funder, amount, currency, start_date, end_date, status) VALUES
  (55555555-0000-0000-0000-000000000003,
   'Trustworthy AI for Healthcare',
   'ERC', 2000000.00, 'EUR', '2024-01-01', '2028-12-31', 'active');

-- ============================================================
-- Edges: AUTHORED (researchers -> papers)
-- ============================================================

-- Paper 1: Sparse Transformers (Alice + David)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000001, 33333333-0000-0000-0000-000000000001, 1, true);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000004, 33333333-0000-0000-0000-000000000001, 2, false);

-- Paper 2: Raft Revisited (Bob + Ingrid)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000002, 33333333-0000-0000-0000-000000000002, 1, true);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000009, 33333333-0000-0000-0000-000000000002, 2, false);

-- Paper 3: Medical ViTs (Carol + Alice)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000003, 33333333-0000-0000-0000-000000000003, 1, true);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000001, 33333333-0000-0000-0000-000000000003, 2, false);

-- Paper 4: Federated Privacy (Frank + David + Grace)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000006, 33333333-0000-0000-0000-000000000004, 1, true);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000004, 33333333-0000-0000-0000-000000000004, 2, false);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000007, 33333333-0000-0000-0000-000000000004, 3, false);

-- Paper 5: LSM Compaction (Eva + Ingrid)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000005, 33333333-0000-0000-0000-000000000005, 1, true);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000009, 33333333-0000-0000-0000-000000000005, 2, false);

-- Paper 6: Consensus Verification (Grace + Bob)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000007, 33333333-0000-0000-0000-000000000006, 1, true);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000002, 33333333-0000-0000-0000-000000000006, 2, false);

-- Paper 7: KG Embeddings (Henry + Alice)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000008, 33333333-0000-0000-0000-000000000007, 1, true);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000001, 33333333-0000-0000-0000-000000000007, 2, false);

-- Paper 8: Contrastive Learning (Jun + Carol + David)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000010, 33333333-0000-0000-0000-000000000008, 1, false);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000003, 33333333-0000-0000-0000-000000000008, 2, false);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000004, 33333333-0000-0000-0000-000000000008, 3, true);

-- Paper 9: Query Optimization (Eva)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000005, 33333333-0000-0000-0000-000000000009, 1, true);

-- Paper 10: Adversarial FL (Frank + Grace)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000006, 33333333-0000-0000-0000-000000000010, 1, true);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000007, 33333333-0000-0000-0000-000000000010, 2, false);

-- Paper 11: LLM Survey (Alice + Henry + David)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000001, 33333333-0000-0000-0000-000000000011, 1, true);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000008, 33333333-0000-0000-0000-000000000011, 2, false);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000004, 33333333-0000-0000-0000-000000000011, 3, false);

-- Paper 12: Consistent Hashing (Bob + Ingrid + Eva)
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000002, 33333333-0000-0000-0000-000000000012, 1, true);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000009, 33333333-0000-0000-0000-000000000012, 2, false);
INSERT INTO authored (source_id, target_id, author_position, corresponding) VALUES
  (11111111-0000-0000-0000-000000000005, 33333333-0000-0000-0000-000000000012, 3, false);

-- ============================================================
-- Edges: CITES (paper -> paper)
-- ============================================================

-- Paper 1 (Sparse Transformers) cites Paper 8 (Contrastive Learning)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000001, 33333333-0000-0000-0000-000000000008,
   'Building on self-supervised representations from contrastive pre-training', 'Related Work');

-- Paper 2 (Raft Revisited) cites Paper 6 (Consensus Verification)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000002, 33333333-0000-0000-0000-000000000006,
   'Our protocol satisfies the safety properties verified in prior work', 'Correctness');

-- Paper 3 (Medical ViTs) cites Paper 1 (Sparse Transformers)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000003, 33333333-0000-0000-0000-000000000001,
   'We adopt the sparse attention mechanism for volumetric data', 'Method');

-- Paper 3 (Medical ViTs) cites Paper 8 (Contrastive Learning)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000003, 33333333-0000-0000-0000-000000000008,
   'Pre-training with contrastive objectives improves segmentation', 'Experiments');

-- Paper 4 (Federated Privacy) cites Paper 10 (Adversarial FL)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000004, 33333333-0000-0000-0000-000000000010,
   'Prior work identified Byzantine attack vectors we address', 'Threat Model');

-- Paper 5 (LSM Compaction) cites Paper 9 (Query Optimization)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000005, 33333333-0000-0000-0000-000000000009,
   'Query-aware compaction complements optimizer cost models', 'Discussion');

-- Paper 5 (LSM Compaction) cites Paper 12 (Consistent Hashing)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000005, 33333333-0000-0000-0000-000000000012,
   'Data placement during elastic scaling affects compaction', 'System Design');

-- Paper 7 (KG Embeddings) cites Paper 11 (LLM Survey)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000007, 33333333-0000-0000-0000-000000000011,
   'LLMs can generate candidate entity descriptions for embedding', 'Approach');

-- Paper 10 (Adversarial FL) cites Paper 4 (Federated Privacy)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000010, 33333333-0000-0000-0000-000000000004,
   'Differential privacy mechanisms interact with robustness guarantees', 'Analysis');

-- Paper 11 (LLM Survey) cites Paper 1 (Sparse Transformers)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000011, 33333333-0000-0000-0000-000000000001,
   'Sparse attention is key to scaling context length in LLMs', 'Architecture');

-- Paper 12 (Consistent Hashing) cites Paper 2 (Raft Revisited)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000012, 33333333-0000-0000-0000-000000000002,
   'Consensus is needed to agree on the hash ring configuration', 'Protocol');

-- Paper 12 (Consistent Hashing) cites Paper 5 (LSM Compaction)
INSERT INTO cites (source_id, target_id, context, section) VALUES
  (33333333-0000-0000-0000-000000000012, 33333333-0000-0000-0000-000000000005,
   'Load-balanced placement reduces compaction amplification', 'Evaluation');

-- ============================================================
-- Edges: AFFILIATED_WITH (researcher -> institution)
-- ============================================================

-- MIT affiliations
INSERT INTO affiliated_with (source_id, target_id, department, role, start_year, end_year) VALUES
  (11111111-0000-0000-0000-000000000001, 22222222-0000-0000-0000-000000000001,
   'CSAIL', 'Associate Professor', 2015, null);
INSERT INTO affiliated_with (source_id, target_id, department, role, start_year, end_year) VALUES
  (11111111-0000-0000-0000-000000000002, 22222222-0000-0000-0000-000000000001,
   'CSAIL', 'Assistant Professor', 2019, null);
INSERT INTO affiliated_with (source_id, target_id, department, role, start_year, end_year) VALUES
  (11111111-0000-0000-0000-000000000003, 22222222-0000-0000-0000-000000000001,
   'EECS', 'PhD Student', 2021, null);

-- Stanford affiliations
INSERT INTO affiliated_with (source_id, target_id, department, role, start_year, end_year) VALUES
  (11111111-0000-0000-0000-000000000004, 22222222-0000-0000-0000-000000000002,
   'Computer Science', 'Professor', 2008, null);
INSERT INTO affiliated_with (source_id, target_id, department, role, start_year, end_year) VALUES
  (11111111-0000-0000-0000-000000000005, 22222222-0000-0000-0000-000000000002,
   'Computer Science', 'Associate Professor', 2014, null);
INSERT INTO affiliated_with (source_id, target_id, department, role, start_year, end_year) VALUES
  (11111111-0000-0000-0000-000000000006, 22222222-0000-0000-0000-000000000002,
   'Computer Science', 'Postdoctoral Fellow', 2022, null);

-- Oxford affiliations
INSERT INTO affiliated_with (source_id, target_id, department, role, start_year, end_year) VALUES
  (11111111-0000-0000-0000-000000000007, 22222222-0000-0000-0000-000000000003,
   'Computer Science', 'Professor', 2005, null);
INSERT INTO affiliated_with (source_id, target_id, department, role, start_year, end_year) VALUES
  (11111111-0000-0000-0000-000000000008, 22222222-0000-0000-0000-000000000003,
   'Computer Science', 'Lecturer', 2018, null);

-- ETH Zurich affiliations
INSERT INTO affiliated_with (source_id, target_id, department, role, start_year, end_year) VALUES
  (11111111-0000-0000-0000-000000000009, 22222222-0000-0000-0000-000000000004,
   'Computer Science', 'Assistant Professor', 2020, null);
INSERT INTO affiliated_with (source_id, target_id, department, role, start_year, end_year) VALUES
  (11111111-0000-0000-0000-000000000010, 22222222-0000-0000-0000-000000000004,
   'Computer Science', 'PhD Student', 2023, null);

-- Cross-institutional affiliation: Alice was a PhD student at Stanford before MIT
INSERT INTO affiliated_with (source_id, target_id, department, role, start_year, end_year) VALUES
  (11111111-0000-0000-0000-000000000001, 22222222-0000-0000-0000-000000000002,
   'AI Lab', 'PhD Student', 2010, 2015);

-- ============================================================
-- Edges: COVERS (paper -> topic)
-- ============================================================

-- Paper 1 covers ML and NLP
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000001, 44444444-0000-0000-0000-000000000001, 0.95);
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000001, 44444444-0000-0000-0000-000000000003, 0.70);

-- Paper 2 covers Distributed Systems
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000002, 44444444-0000-0000-0000-000000000002, 0.95);

-- Paper 3 covers ML and Computer Vision
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000003, 44444444-0000-0000-0000-000000000001, 0.80);
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000003, 44444444-0000-0000-0000-000000000004, 0.90);

-- Paper 4 covers ML and Security
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000004, 44444444-0000-0000-0000-000000000001, 0.85);
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000004, 44444444-0000-0000-0000-000000000006, 0.75);

-- Paper 5 covers Databases and Distributed Systems
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000005, 44444444-0000-0000-0000-000000000005, 0.95);
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000005, 44444444-0000-0000-0000-000000000002, 0.60);

-- Paper 6 covers Distributed Systems and Security
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000006, 44444444-0000-0000-0000-000000000002, 0.90);
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000006, 44444444-0000-0000-0000-000000000006, 0.65);

-- Paper 7 covers NLP and ML
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000007, 44444444-0000-0000-0000-000000000003, 0.90);
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000007, 44444444-0000-0000-0000-000000000001, 0.70);

-- Paper 8 covers Computer Vision and ML
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000008, 44444444-0000-0000-0000-000000000004, 0.95);
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000008, 44444444-0000-0000-0000-000000000001, 0.80);

-- Paper 9 covers Databases
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000009, 44444444-0000-0000-0000-000000000005, 0.95);

-- Paper 10 covers Security and ML
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000010, 44444444-0000-0000-0000-000000000006, 0.90);
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000010, 44444444-0000-0000-0000-000000000001, 0.65);

-- Paper 11 covers NLP and ML
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000011, 44444444-0000-0000-0000-000000000003, 0.95);
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000011, 44444444-0000-0000-0000-000000000001, 0.90);

-- Paper 12 covers Distributed Systems and Databases
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000012, 44444444-0000-0000-0000-000000000002, 0.90);
INSERT INTO covers_topic (source_id, target_id, relevance) VALUES
  (33333333-0000-0000-0000-000000000012, 44444444-0000-0000-0000-000000000005, 0.70);

-- ============================================================
-- Edges: FUNDED_BY (paper -> grant)
-- ============================================================

-- Papers funded by NSF privacy grant
INSERT INTO funded_by (source_id, target_id, role) VALUES
  (33333333-0000-0000-0000-000000000004, 55555555-0000-0000-0000-000000000001, 'primary');
INSERT INTO funded_by (source_id, target_id, role) VALUES
  (33333333-0000-0000-0000-000000000010, 55555555-0000-0000-0000-000000000001, 'secondary');

-- Papers funded by DARPA database grant
INSERT INTO funded_by (source_id, target_id, role) VALUES
  (33333333-0000-0000-0000-000000000005, 55555555-0000-0000-0000-000000000002, 'primary');
INSERT INTO funded_by (source_id, target_id, role) VALUES
  (33333333-0000-0000-0000-000000000009, 55555555-0000-0000-0000-000000000002, 'primary');
INSERT INTO funded_by (source_id, target_id, role) VALUES
  (33333333-0000-0000-0000-000000000012, 55555555-0000-0000-0000-000000000002, 'secondary');

-- Papers funded by ERC healthcare grant
INSERT INTO funded_by (source_id, target_id, role) VALUES
  (33333333-0000-0000-0000-000000000003, 55555555-0000-0000-0000-000000000003, 'primary');

-- ============================================================
-- Edges: COLLABORATES (researcher <-> researcher)
-- ============================================================

-- Alice collaborates with David (3 papers together)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year) VALUES
  (11111111-0000-0000-0000-000000000001, 11111111-0000-0000-0000-000000000004, 3, 2023, 2024);

-- Alice collaborates with Carol (1 paper)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year) VALUES
  (11111111-0000-0000-0000-000000000001, 11111111-0000-0000-0000-000000000003, 1, 2024, 2024);

-- Alice collaborates with Henry (2 papers)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year) VALUES
  (11111111-0000-0000-0000-000000000001, 11111111-0000-0000-0000-000000000008, 2, 2023, 2024);

-- Bob collaborates with Ingrid (2 papers)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year) VALUES
  (11111111-0000-0000-0000-000000000002, 11111111-0000-0000-0000-000000000009, 2, 2024, 2024);

-- Bob collaborates with Grace (1 paper)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year) VALUES
  (11111111-0000-0000-0000-000000000002, 11111111-0000-0000-0000-000000000007, 1, 2023, 2023);

-- Eva collaborates with Ingrid (1 paper)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year) VALUES
  (11111111-0000-0000-0000-000000000005, 11111111-0000-0000-0000-000000000009, 1, 2024, 2024);

-- Eva collaborates with Bob (1 paper)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year) VALUES
  (11111111-0000-0000-0000-000000000005, 11111111-0000-0000-0000-000000000002, 1, 2024, 2024);

-- Frank collaborates with Grace (2 papers)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year) VALUES
  (11111111-0000-0000-0000-000000000006, 11111111-0000-0000-0000-000000000007, 2, 2024, 2024);

-- Frank collaborates with David (1 paper)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year) VALUES
  (11111111-0000-0000-0000-000000000006, 11111111-0000-0000-0000-000000000004, 1, 2024, 2024);

-- Jun collaborates with Carol (1 paper)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year) VALUES
  (11111111-0000-0000-0000-000000000010, 11111111-0000-0000-0000-000000000003, 1, 2023, 2023);

-- Jun collaborates with David (1 paper)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year) VALUES
  (11111111-0000-0000-0000-000000000010, 11111111-0000-0000-0000-000000000004, 1, 2023, 2023);

CQL Queries

With the data loaded, CQL handles direct lookups and secondary index queries efficiently. This section exercises a broad range of CQL syntax: SELECT, INSERT, UPDATE, DELETE, COUNT, LIMIT, ALLOW FILTERING, IF NOT EXISTS, USING TTL, USING TIMESTAMP, collection operations on sets and maps, and multi-column WHERE clauses.

What CQL handles well

Primary key lookups are the fastest possible query — single-partition, single-row:

SELECT name, h_index, specializations FROM researcher
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

Secondary index queries scan the local token range efficiently:

SELECT title, year, citation_count FROM paper WHERE venue = 'NeurIPS';
SELECT name, city, ranking FROM institution WHERE country = 'United States';

Edge table lookups use the partition key to find all relationships from a source:

SELECT target_id, author_position, corresponding FROM authored
WHERE source_id = 11111111-0000-0000-0000-000000000001;

Collection operations let you add and remove elements from sets and maps in place:

UPDATE researcher SET specializations = specializations + {'Reinforcement Learning'}
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

UPDATE researcher SET social_links['linkedin'] = 'https://linkedin.com/in/alicechen'
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

Lightweight transactions (LWT) provide conditional inserts:

INSERT INTO researcher (researcher_id, name, email, active)
VALUES (11111111-0000-0000-0000-000000000099, 'New Researcher', 'new@test.com', true)
IF NOT EXISTS;
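LWTs also guard updates with an IF condition on a regular column. A sketch, assuming Ferrosa follows Cassandra's LWT semantics (the h_index values here are illustrative; the server returns an [applied] column indicating whether the condition held):

```sql
-- Optimistic update: applied only if h_index still holds the value we read
UPDATE researcher SET h_index = 46
WHERE researcher_id = 11111111-0000-0000-0000-000000000001
IF h_index = 45;
```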

Where CQL hits a wall

These queries are impossible to express in a single CQL statement:

  • Citation chains: "What papers cite papers that cite this paper?" requires fetching each hop separately

  • Cross-institution collaboration: "Which researchers at MIT collaborate with researchers at Stanford?" requires joining three tables

  • Topic overlap: "Which topics are shared between two institutions?" requires a 4-hop traversal

  • Path finding: "What is the shortest collaboration chain from Alice to Jun?" is not expressible at all

  • Mutual citations: "Find papers that cite each other" requires a self-join on the cites table
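The citation-chain case makes the round-trip cost concrete: each hop is a separate query, and the second can only be issued after the first returns, with the application assembling the IDs in between. A sketch using papers from our dataset:

```sql
-- Hop 1: papers cited by Paper 3 (Medical ViTs)
SELECT target_id FROM cites
WHERE source_id = 33333333-0000-0000-0000-000000000003;

-- Hop 2: issued by the application for each ID returned by hop 1.
-- Here, papers cited by Paper 1 (Sparse Transformers), one of hop 1's results:
SELECT target_id FROM cites
WHERE source_id = 33333333-0000-0000-0000-000000000001;
```

A 3-hop chain triples this, and the fan-out at each hop multiplies the number of statements the application must issue.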

Full CQL query reference

USE knowledge;

-- ============================================================
-- Basic lookups by primary key
-- ============================================================

-- Full row for a single researcher
SELECT * FROM researcher WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

-- Project specific columns
SELECT name, h_index, specializations FROM researcher
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

-- Look up a paper by ID
SELECT title, venue, year, citation_count FROM paper
WHERE paper_id = 33333333-0000-0000-0000-000000000001;

-- Look up a grant
SELECT title, funder, amount, currency FROM grant_award
WHERE grant_id = 55555555-0000-0000-0000-000000000001;

-- ============================================================
-- Secondary index queries
-- ============================================================

-- Papers at a specific venue
SELECT title, year, citation_count FROM paper WHERE venue = 'NeurIPS';

-- Papers from a specific year
SELECT title, venue, citation_count FROM paper WHERE year = 2024;

-- Institutions in a country
SELECT name, city, ranking FROM institution WHERE country = 'United States';

-- Find a topic by name
SELECT * FROM topic WHERE name = 'Machine Learning';

-- Grants from a specific funder (no status filter here)
SELECT title, amount, currency, start_date, end_date FROM grant_award
WHERE funder = 'NSF';

-- Active grants by status
SELECT title, funder, amount FROM grant_award WHERE status = 'active';

-- Papers by DOI
SELECT title, venue, year FROM paper WHERE doi = '10.1234/neurips.2024.001';

-- ============================================================
-- Collection operations
-- ============================================================

-- Read a set column
SELECT specializations FROM researcher
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

-- Read a map column
SELECT social_links FROM researcher
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

-- Read keywords set from a paper
SELECT keywords FROM paper
WHERE paper_id = 33333333-0000-0000-0000-000000000001;

-- Read URLs map from a paper
SELECT urls FROM paper
WHERE paper_id = 33333333-0000-0000-0000-000000000001;

-- Read departments from institution
SELECT departments FROM institution
WHERE institution_id = 22222222-0000-0000-0000-000000000001;

-- ============================================================
-- Count and aggregation
-- ============================================================

SELECT COUNT(*) FROM paper;
SELECT COUNT(*) FROM researcher;
SELECT COUNT(*) FROM institution;
SELECT COUNT(*) FROM authored;
SELECT COUNT(*) FROM cites;

-- ============================================================
-- LIMIT
-- ============================================================

SELECT title, venue, year FROM paper LIMIT 5;
SELECT name, h_index FROM researcher LIMIT 3;

-- ============================================================
-- Edge table queries (clustering key traversals)
-- ============================================================

-- All papers authored by Alice
SELECT target_id, author_position, corresponding FROM authored
WHERE source_id = 11111111-0000-0000-0000-000000000001;

-- All citations from Paper 1 (Sparse Transformers)
SELECT target_id, context, section FROM cites
WHERE source_id = 33333333-0000-0000-0000-000000000001;

-- All citations from Paper 3 (Medical ViTs) -- multiple citations
SELECT target_id, context, section FROM cites
WHERE source_id = 33333333-0000-0000-0000-000000000003;

-- All topics covered by Paper 4 (Federated Privacy)
SELECT target_id, relevance FROM covers_topic
WHERE source_id = 33333333-0000-0000-0000-000000000004;

-- All papers funded by the DARPA grant (target_id is a clustering column,
-- so filtering on it without a partition key requires ALLOW FILTERING)
SELECT source_id, role FROM funded_by
WHERE target_id = 55555555-0000-0000-0000-000000000002 ALLOW FILTERING;

-- Alice's collaborators
SELECT target_id, paper_count, first_collab_year, last_collab_year FROM collaborates_with
WHERE source_id = 11111111-0000-0000-0000-000000000001;

-- ============================================================
-- Multiple WHERE conditions (partition + clustering)
-- ============================================================

-- Specific affiliation: Alice at MIT
SELECT department, role, start_year, end_year FROM affiliated_with
WHERE source_id = 11111111-0000-0000-0000-000000000001
  AND target_id = 22222222-0000-0000-0000-000000000001;

-- Specific authorship: Alice authored Paper 1
SELECT author_position, corresponding FROM authored
WHERE source_id = 11111111-0000-0000-0000-000000000001
  AND target_id = 33333333-0000-0000-0000-000000000001;

-- Specific citation: Paper 12 cites Paper 2
SELECT context, section FROM cites
WHERE source_id = 33333333-0000-0000-0000-000000000012
  AND target_id = 33333333-0000-0000-0000-000000000002;

-- ============================================================
-- UPDATE: collection manipulation
-- ============================================================

-- Add a specialization to a researcher's set
UPDATE researcher SET specializations = specializations + {'Reinforcement Learning'}
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

-- Add a social link to a researcher's map
UPDATE researcher SET social_links['linkedin'] = 'https://linkedin.com/in/alicechen'
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

-- Add a keyword to a paper
UPDATE paper SET keywords = keywords + {'efficiency'}
WHERE paper_id = 33333333-0000-0000-0000-000000000001;

-- Update a URL in a paper's map
UPDATE paper SET urls['slides'] = 'https://neurips.cc/slides/sparse-tf'
WHERE paper_id = 33333333-0000-0000-0000-000000000001;

-- Update a scalar column
UPDATE paper SET citation_count = 156
WHERE paper_id = 33333333-0000-0000-0000-000000000001;

-- Update h_index
UPDATE researcher SET h_index = 47
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

-- Add a department to an institution
UPDATE institution SET departments = departments + {'Linguistics'}
WHERE institution_id = 22222222-0000-0000-0000-000000000001;

-- Verify the updates
SELECT specializations, social_links FROM researcher
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;
SELECT citation_count, keywords, urls FROM paper
WHERE paper_id = 33333333-0000-0000-0000-000000000001;
SELECT departments FROM institution
WHERE institution_id = 22222222-0000-0000-0000-000000000001;

-- ============================================================
-- DELETE: collection element removal
-- ============================================================

-- Remove a specialization from a set
UPDATE researcher SET specializations = specializations - {'Reinforcement Learning'}
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

-- Remove a social link from a map
DELETE social_links['linkedin'] FROM researcher
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;

-- Remove a keyword from a paper
UPDATE paper SET keywords = keywords - {'efficiency'}
WHERE paper_id = 33333333-0000-0000-0000-000000000001;

-- Remove a URL from a paper's map
DELETE urls['slides'] FROM paper
WHERE paper_id = 33333333-0000-0000-0000-000000000001;

-- Remove a department from an institution
UPDATE institution SET departments = departments - {'Linguistics'}
WHERE institution_id = 22222222-0000-0000-0000-000000000001;

-- ============================================================
-- INSERT IF NOT EXISTS (lightweight transactions)
-- ============================================================

-- Conditionally insert a new researcher (only if UUID doesn't exist)
INSERT INTO researcher (researcher_id, name, email, active)
VALUES (11111111-0000-0000-0000-000000000099, 'New Researcher', 'new@test.com', true)
IF NOT EXISTS;

-- Conditionally insert a duplicate (should return applied=false)
INSERT INTO researcher (researcher_id, name, email, active)
VALUES (11111111-0000-0000-0000-000000000001, 'Duplicate Alice', 'dup@test.com', true)
IF NOT EXISTS;

-- Conditionally insert a new topic
INSERT INTO topic (topic_id, name, description, paper_count)
VALUES (44444444-0000-0000-0000-000000000099, 'Quantum Computing',
        'Computation using quantum mechanical phenomena', 3200)
IF NOT EXISTS;

-- ============================================================
-- USING TTL (time-to-live for expiring data)
-- ============================================================

-- Insert a preprint that expires after 24 hours
INSERT INTO paper (paper_id, title, year, paper_type)
VALUES (33333333-0000-0000-0000-000000000099, 'Preprint Draft: Preliminary Results', 2024, 'Preprint')
USING TTL 86400;

-- Insert a temporary collaboration record (30-day TTL)
INSERT INTO collaborates_with (source_id, target_id, paper_count, first_collab_year, last_collab_year)
VALUES (11111111-0000-0000-0000-000000000099, 11111111-0000-0000-0000-000000000001, 0, 2024, 2024)
USING TTL 2592000;
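To confirm an expiry is set, the TTL() function reads back a column's remaining lifetime in seconds (a standard CQL built-in, which we assume Ferrosa supports):

```sql
-- Remaining lifetime of the expiring preprint; counts down toward deletion
SELECT title, TTL(title) FROM paper
WHERE paper_id = 33333333-0000-0000-0000-000000000099;
```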

-- ============================================================
-- USING TIMESTAMP (explicit write timestamps)
-- ============================================================

-- Insert a topic with an explicit microsecond timestamp
INSERT INTO topic (topic_id, name, description, paper_count)
VALUES (44444444-0000-0000-0000-000000000098, 'Quantum Computing',
        'Computation using quantum mechanical phenomena', 3200)
USING TIMESTAMP 1704067200000000;

-- Update with an explicit timestamp
UPDATE researcher USING TIMESTAMP 1704067200000000
SET citations_total = 13000
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;
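The read-side companion to USING TIMESTAMP is WRITETIME(), which returns the stored write timestamp of a column in microseconds since the epoch (again a standard CQL built-in, assumed supported):

```sql
-- Read back the explicit timestamp we just wrote
SELECT WRITETIME(citations_total) FROM researcher
WHERE researcher_id = 11111111-0000-0000-0000-000000000001;
```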

-- ============================================================
-- DELETE operations
-- ============================================================

-- Delete a specific row
DELETE FROM collaborates_with
WHERE source_id = 11111111-0000-0000-0000-000000000099
  AND target_id = 11111111-0000-0000-0000-000000000001;

-- Delete a specific column value
DELETE email FROM researcher
WHERE researcher_id = 11111111-0000-0000-0000-000000000099;

-- Clean up test data
DELETE FROM researcher WHERE researcher_id = 11111111-0000-0000-0000-000000000099;
DELETE FROM paper WHERE paper_id = 33333333-0000-0000-0000-000000000099;
DELETE FROM topic WHERE topic_id = 44444444-0000-0000-0000-000000000099;
DELETE FROM topic WHERE topic_id = 44444444-0000-0000-0000-000000000098;

-- ============================================================
-- ALLOW FILTERING (full-table scans with non-indexed columns)
-- ============================================================

-- Find researchers by title (no index on title)
SELECT name, title, h_index FROM researcher WHERE title = 'Professor' ALLOW FILTERING;

-- Find papers of a specific type
SELECT title, venue, year FROM paper WHERE paper_type = 'Journal' ALLOW FILTERING;

-- Find corresponding authors
SELECT source_id, target_id FROM authored WHERE corresponding = true ALLOW FILTERING;

-- Find high-relevance topic coverage
SELECT source_id, target_id, relevance FROM covers_topic WHERE relevance > 0.9 ALLOW FILTERING;

Cypher Graph Queries

Ferrosa exposes a graph HTTP endpoint on port 7474. The same data you inserted via CQL is available for Cypher queries with no ETL pipeline and no separate store to keep in sync: the graph index is updated asynchronously as CQL writes land.

Basic node queries

Find a researcher by name, returning scalar and collection properties:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (r:Researcher {name: \"Alice Chen\"}) RETURN r.name, r.h_index, r.specializations", "keyspace": "knowledge"}'

List all institutions ordered by ranking:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (i:Institution) RETURN i.name, i.country, i.ranking ORDER BY i.ranking", "keyspace": "knowledge"}'

Single-hop traversals

Find all papers authored by a researcher:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (r:Researcher {name: \"Alice Chen\"})-[:AUTHORED]->(p:Paper) RETURN p.title, p.year, p.venue ORDER BY p.year DESC", "keyspace": "knowledge"}'

Find all researchers at MIT:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (r:Researcher)-[:AFFILIATED_WITH]->(i:Institution {name: \"MIT\"}) RETURN r.name, r.h_index, r.title ORDER BY r.h_index DESC", "keyspace": "knowledge"}'

Multi-hop traversals

This is where the graph engine outperforms application-side joins.

Citation chains — trace influence through the literature, 2 hops deep:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (p1:Paper)-[:CITES]->(p2:Paper)-[:CITES]->(p3:Paper) RETURN p1.title AS citing, p2.title AS via, p3.title AS cited LIMIT 10", "keyspace": "knowledge"}'

Researcher’s topics via papers — traverse through authorship to topic coverage:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (r:Researcher {name: \"Alice Chen\"})-[:AUTHORED]->(p:Paper)-[:COVERS]->(t:Topic) RETURN DISTINCT t.name", "keyspace": "knowledge"}'

Cross-institution collaborations — find researchers who collaborate across institutional boundaries:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (r1:Researcher)-[:AFFILIATED_WITH]->(i1:Institution), (r2:Researcher)-[:AFFILIATED_WITH]->(i2:Institution), (r1)-[:COLLABORATES]->(r2) WHERE i1.name <> i2.name RETURN r1.name, i1.name AS inst1, r2.name, i2.name AS inst2", "keyspace": "knowledge"}'

Funding chain — trace the full path from funder through grant to paper to topic:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (p:Paper)-[:FUNDED_BY]->(g:Grant), (p)-[:COVERS]->(t:Topic) RETURN g.funder, g.title, p.title, t.name ORDER BY g.funder", "keyspace": "knowledge"}'

Path queries

Shortest path between researchers — find the shortest connection through any relationship type:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH path = shortestPath((a:Researcher {name: \"Alice Chen\"})-[*]-(b:Researcher {name: \"Eva Schmidt\"})) RETURN path", "keyspace": "knowledge"}'

Shortest collaboration chain — restricted to COLLABORATES edges:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH path = shortestPath((a:Researcher {name: \"Alice Chen\"})-[:COLLABORATES*]-(b:Researcher {name: \"Jun Tanaka\"})) RETURN [n IN nodes(path) | n.name] AS chain", "keyspace": "knowledge"}'

Citation depth traversal — follow citations up to 3 hops from a starting paper:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH path = (p:Paper)-[:CITES*1..3]->(end:Paper) WHERE p.title = \"Vision Transformers for Medical Image Segmentation\" RETURN [n IN nodes(path) | n.title] AS chain LIMIT 5", "keyspace": "knowledge"}'

Aggregation

Cypher supports COUNT, COLLECT, AVG, SUM, and SIZE for graph-aware aggregations.

Papers per topic:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (p:Paper)-[:COVERS]->(t:Topic) RETURN t.name, COUNT(p) AS paper_count ORDER BY paper_count DESC", "keyspace": "knowledge"}'

Institution publication count:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (r:Researcher)-[:AFFILIATED_WITH]->(i:Institution), (r)-[:AUTHORED]->(p:Paper) RETURN i.name, COUNT(DISTINCT p) AS papers ORDER BY papers DESC", "keyspace": "knowledge"}'

Average citations per venue:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (p:Paper) RETURN p.venue, AVG(p.citation_count) AS avg_citations, COUNT(p) AS papers ORDER BY avg_citations DESC", "keyspace": "knowledge"}'

Run all queries as a test suite

The following bash script runs every Cypher query from this tutorial in sequence and verifies that each request returns HTTP 200:

#!/usr/bin/env bash
set -euo pipefail

# Cypher graph queries for the research knowledge graph tutorial.
# Requires a running Ferrosa cluster with FERROSA_GRAPH_ENABLED=true.

FERROSA_GRAPH_HOST="${FERROSA_GRAPH_HOST:-localhost}"
FERROSA_GRAPH_PORT="${FERROSA_GRAPH_PORT:-7474}"
BASE_URL="http://${FERROSA_GRAPH_HOST}:${FERROSA_GRAPH_PORT}"

PASS=0
FAIL=0

run_query() {
    local label="$1"
    local endpoint="$2"
    local method="${3:-GET}"
    local body="${4:-}"

    printf "%-65s " "${label}..."

    # "|| true" keeps set -e from aborting the whole suite when a request
    # fails outright (e.g. connection refused); curl's -w still prints "000"
    # in that case, which is counted below as a FAIL.
    if [ "${method}" = "GET" ]; then
        HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "${BASE_URL}${endpoint}" || true)
    else
        HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
            -X "${method}" "${BASE_URL}${endpoint}" \
            -H "Content-Type: application/json" \
            -d "${body}" || true)
    fi

    if [ "${HTTP_CODE}" -eq 200 ]; then
        echo "PASS (${HTTP_CODE})"
        PASS=$((PASS + 1))
    else
        echo "FAIL (${HTTP_CODE})"
        FAIL=$((FAIL + 1))
    fi
}

echo "========================================"
echo "  Research Knowledge Graph Cypher Tests"
echo "  Host: ${FERROSA_GRAPH_HOST}:${FERROSA_GRAPH_PORT}"
echo "========================================"
echo ""

# ------------------------------------------------------------------
# Health and schema
# ------------------------------------------------------------------

run_query "Graph health check" "/graph/health"
run_query "Graph schema introspection" "/graph/schema"

echo ""
echo "--- Basic Node Queries ---"

# ------------------------------------------------------------------
# Basic node queries
# ------------------------------------------------------------------

run_query "Find researcher by name" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher {name: \"Alice Chen\"}) RETURN r.name, r.h_index, r.specializations", "keyspace": "knowledge"}'

run_query "All institutions ordered by ranking" \
    "/graph/query" "POST" \
    '{"query": "MATCH (i:Institution) RETURN i.name, i.country, i.ranking ORDER BY i.ranking", "keyspace": "knowledge"}'

run_query "Papers published in 2024" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p:Paper) WHERE p.year = 2024 RETURN p.title, p.venue ORDER BY p.citation_count DESC", "keyspace": "knowledge"}'

run_query "Topics ordered by paper count" \
    "/graph/query" "POST" \
    '{"query": "MATCH (t:Topic) RETURN t.name, t.paper_count ORDER BY t.paper_count DESC", "keyspace": "knowledge"}'

run_query "Active grants with amounts" \
    "/graph/query" "POST" \
    '{"query": "MATCH (g:Grant) WHERE g.status = \"active\" RETURN g.title, g.funder, g.amount ORDER BY g.amount DESC", "keyspace": "knowledge"}'

echo ""
echo "--- Single-Hop Relationship Traversals ---"

# ------------------------------------------------------------------
# Single-hop relationship traversals
# ------------------------------------------------------------------

run_query "Papers authored by Alice Chen" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher {name: \"Alice Chen\"})-[:AUTHORED]->(p:Paper) RETURN p.title, p.year, p.venue ORDER BY p.year DESC", "keyspace": "knowledge"}'

run_query "Coauthors of Sparse Transformers paper" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher)-[:AUTHORED]->(p:Paper {title: \"Attention Is All You Need 2.0: Sparse Transformers at Scale\"}) RETURN r.name, r.title ORDER BY r.name", "keyspace": "knowledge"}'

run_query "Researchers at MIT" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher)-[:AFFILIATED_WITH]->(i:Institution {name: \"MIT\"}) RETURN r.name, r.h_index, r.title ORDER BY r.h_index DESC", "keyspace": "knowledge"}'

run_query "Papers citing Consensus Verification" \
    "/graph/query" "POST" \
    '{"query": "MATCH (citing:Paper)-[:CITES]->(p:Paper {title: \"Formal Verification of Distributed Consensus Protocols\"}) RETURN citing.title, citing.year", "keyspace": "knowledge"}'

run_query "Papers funded by DARPA" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p:Paper)-[:FUNDED_BY]->(g:Grant {funder: \"DARPA\"}) RETURN p.title, g.title AS grant_title", "keyspace": "knowledge"}'

run_query "Alice'\''s direct collaborators" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher {name: \"Alice Chen\"})-[:COLLABORATES]->(other:Researcher) RETURN other.name, other.title", "keyspace": "knowledge"}'

echo ""
echo "--- Multi-Hop Traversals ---"

# ------------------------------------------------------------------
# Multi-hop traversals
# ------------------------------------------------------------------

run_query "Citation chain (2 hops)" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p1:Paper)-[:CITES]->(p2:Paper)-[:CITES]->(p3:Paper) RETURN p1.title AS citing, p2.title AS via, p3.title AS cited LIMIT 10", "keyspace": "knowledge"}'

run_query "Researcher topics via papers" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher {name: \"Alice Chen\"})-[:AUTHORED]->(p:Paper)-[:COVERS]->(t:Topic) RETURN DISTINCT t.name", "keyspace": "knowledge"}'

run_query "Cross-institution collaborations" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r1:Researcher)-[:AFFILIATED_WITH]->(i1:Institution), (r2:Researcher)-[:AFFILIATED_WITH]->(i2:Institution), (r1)-[:COLLABORATES]->(r2) WHERE i1.name <> i2.name RETURN r1.name, i1.name AS inst1, r2.name, i2.name AS inst2", "keyspace": "knowledge"}'

run_query "Funding chain: funder -> grant -> paper -> topic" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p:Paper)-[:FUNDED_BY]->(g:Grant), (p)-[:COVERS]->(t:Topic) RETURN g.funder, g.title, p.title, t.name ORDER BY g.funder", "keyspace": "knowledge"}'

run_query "Researcher -> paper -> citation -> coauthor" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher {name: \"Bob Martinez\"})-[:AUTHORED]->(p:Paper)-[:CITES]->(cited:Paper)<-[:AUTHORED]-(coauthor:Researcher) WHERE coauthor.name <> \"Bob Martinez\" RETURN DISTINCT coauthor.name, cited.title", "keyspace": "knowledge"}'

echo ""
echo "--- Aggregation Queries ---"

# ------------------------------------------------------------------
# Aggregation queries
# ------------------------------------------------------------------

run_query "Most cited papers (top 5)" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p:Paper) RETURN p.title, p.citation_count ORDER BY p.citation_count DESC LIMIT 5", "keyspace": "knowledge"}'

run_query "Papers per topic" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p:Paper)-[:COVERS]->(t:Topic) RETURN t.name, COUNT(p) AS paper_count ORDER BY paper_count DESC", "keyspace": "knowledge"}'

run_query "Researcher publication count" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher)-[:AUTHORED]->(p:Paper) RETURN r.name, COUNT(p) AS publications ORDER BY publications DESC", "keyspace": "knowledge"}'

run_query "Collaboration count per researcher" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher)-[:COLLABORATES]->(other:Researcher) RETURN r.name, COUNT(other) AS collabs ORDER BY collabs DESC", "keyspace": "knowledge"}'

run_query "Average citations per venue" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p:Paper) RETURN p.venue, AVG(p.citation_count) AS avg_citations, COUNT(p) AS papers ORDER BY avg_citations DESC", "keyspace": "knowledge"}'

run_query "Institution publication count" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher)-[:AFFILIATED_WITH]->(i:Institution), (r)-[:AUTHORED]->(p:Paper) RETURN i.name, COUNT(DISTINCT p) AS papers ORDER BY papers DESC", "keyspace": "knowledge"}'

run_query "Grant impact (papers per grant)" \
    "/graph/query" "POST" \
    '{"query": "MATCH (g:Grant)<-[:FUNDED_BY]-(p:Paper) RETURN g.title, g.funder, COUNT(p) AS papers ORDER BY papers DESC", "keyspace": "knowledge"}'

echo ""
echo "--- Path Queries ---"

# ------------------------------------------------------------------
# Path queries
# ------------------------------------------------------------------

run_query "Shortest path between Alice and Eva" \
    "/graph/query" "POST" \
    '{"query": "MATCH path = shortestPath((a:Researcher {name: \"Alice Chen\"})-[*]-(b:Researcher {name: \"Eva Schmidt\"})) RETURN path", "keyspace": "knowledge"}'

run_query "All paths to a paper (max 3 hops)" \
    "/graph/query" "POST" \
    '{"query": "MATCH path = (r:Researcher)-[*1..3]->(p:Paper {title: \"Attention Is All You Need 2.0: Sparse Transformers at Scale\"}) RETURN path LIMIT 10", "keyspace": "knowledge"}'

run_query "Shortest collaboration path (Alice to Jun)" \
    "/graph/query" "POST" \
    '{"query": "MATCH path = shortestPath((a:Researcher {name: \"Alice Chen\"})-[:COLLABORATES*]-(b:Researcher {name: \"Jun Tanaka\"})) RETURN [n IN nodes(path) | n.name] AS chain", "keyspace": "knowledge"}'

run_query "Citation path depth 3" \
    "/graph/query" "POST" \
    '{"query": "MATCH path = (p:Paper)-[:CITES*1..3]->(end:Paper) WHERE p.title = \"Vision Transformers for Medical Image Segmentation\" RETURN [n IN nodes(path) | n.title] AS chain LIMIT 5", "keyspace": "knowledge"}'

echo ""
echo "--- Filtering and Predicates ---"

# ------------------------------------------------------------------
# Filtering and predicates
# ------------------------------------------------------------------

run_query "High h-index researchers (>40)" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher) WHERE r.h_index > 40 RETURN r.name, r.h_index, r.title ORDER BY r.h_index DESC", "keyspace": "knowledge"}'

run_query "Highly cited papers (>100)" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p:Paper) WHERE p.citation_count > 100 RETURN p.title, p.citation_count, p.venue ORDER BY p.citation_count DESC", "keyspace": "knowledge"}'

run_query "Funded research with amounts" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p:Paper)-[:FUNDED_BY]->(g:Grant) RETURN p.title, g.funder, g.amount, g.currency ORDER BY g.amount DESC", "keyspace": "knowledge"}'

run_query "Papers with keyword filter" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p:Paper) WHERE \"consensus\" IN p.keywords RETURN p.title, p.venue", "keyspace": "knowledge"}'

run_query "Researchers with specific specialization" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher) WHERE \"Distributed Systems\" IN r.specializations RETURN r.name, r.title, r.h_index", "keyspace": "knowledge"}'

run_query "Corresponding authors only" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher)-[a:AUTHORED]->(p:Paper) WHERE a.corresponding = true RETURN r.name, p.title ORDER BY r.name", "keyspace": "knowledge"}'

echo ""
echo "--- Complex Graph Analytics ---"

# ------------------------------------------------------------------
# Complex graph analytics
# ------------------------------------------------------------------

run_query "Research influence (papers citing my papers)" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher {name: \"Alice Chen\"})-[:AUTHORED]->(p:Paper)<-[:CITES]-(citing:Paper) RETURN p.title AS original, citing.title AS cited_by", "keyspace": "knowledge"}'

run_query "Collaboration network (collect)" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher)-[:COLLABORATES]->(other:Researcher) RETURN r.name, COLLECT(other.name) AS collaborators ORDER BY SIZE(collaborators) DESC", "keyspace": "knowledge"}'

run_query "Topic overlap between MIT and Stanford" \
    "/graph/query" "POST" \
    '{"query": "MATCH (i1:Institution {name: \"MIT\"})<-[:AFFILIATED_WITH]-(r1:Researcher)-[:AUTHORED]->(p1:Paper)-[:COVERS]->(t:Topic)<-[:COVERS]-(p2:Paper)<-[:AUTHORED]-(r2:Researcher)-[:AFFILIATED_WITH]->(i2:Institution {name: \"Stanford\"}) RETURN DISTINCT t.name AS shared_topic", "keyspace": "knowledge"}'

run_query "Mutual citation detection" \
    "/graph/query" "POST" \
    '{"query": "MATCH (a:Paper)-[:CITES]->(b:Paper)-[:CITES]->(a) RETURN a.title, b.title", "keyspace": "knowledge"}'

run_query "Researcher reach (papers at distance 2)" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher {name: \"Grace Okonkwo\"})-[:AUTHORED]->(p1:Paper)-[:CITES]->(p2:Paper) RETURN r.name, p1.title AS authored, p2.title AS influences", "keyspace": "knowledge"}'

run_query "Co-citation analysis" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p1:Paper)-[:CITES]->(common:Paper)<-[:CITES]-(p2:Paper) WHERE id(p1) < id(p2) RETURN p1.title, p2.title, common.title AS co_cited LIMIT 10", "keyspace": "knowledge"}'

run_query "Topic-spanning researchers" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher)-[:AUTHORED]->(p:Paper)-[:COVERS]->(t:Topic) WITH r, COLLECT(DISTINCT t.name) AS topics WHERE SIZE(topics) >= 2 RETURN r.name, topics ORDER BY SIZE(topics) DESC", "keyspace": "knowledge"}'

run_query "Unfunded highly-cited papers" \
    "/graph/query" "POST" \
    '{"query": "MATCH (p:Paper) WHERE p.citation_count > 50 AND NOT (p)-[:FUNDED_BY]->(:Grant) RETURN p.title, p.citation_count ORDER BY p.citation_count DESC", "keyspace": "knowledge"}'

run_query "Institution collaboration graph" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r1:Researcher)-[:AFFILIATED_WITH]->(i1:Institution), (r2:Researcher)-[:AFFILIATED_WITH]->(i2:Institution), (r1)-[:COLLABORATES]->(r2) WHERE i1.name < i2.name RETURN i1.name, i2.name, COUNT(*) AS collaboration_count", "keyspace": "knowledge"}'

run_query "Topic expertise by institution" \
    "/graph/query" "POST" \
    '{"query": "MATCH (i:Institution)<-[:AFFILIATED_WITH]-(r:Researcher)-[:AUTHORED]->(p:Paper)-[:COVERS]->(t:Topic) RETURN i.name, t.name, COUNT(DISTINCT p) AS papers ORDER BY i.name, papers DESC", "keyspace": "knowledge"}'

echo ""
echo "--- WITH Chaining and Subqueries ---"

# ------------------------------------------------------------------
# WITH chaining and subqueries
# ------------------------------------------------------------------

run_query "Top researcher per institution" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher)-[:AFFILIATED_WITH]->(i:Institution) WITH i, r ORDER BY r.h_index DESC WITH i, COLLECT(r)[0] AS top RETURN i.name, top.name, top.h_index", "keyspace": "knowledge"}'

run_query "Research pipeline: grant -> paper -> topic" \
    "/graph/query" "POST" \
    '{"query": "MATCH (g:Grant)<-[:FUNDED_BY]-(p:Paper)-[:COVERS]->(t:Topic) WITH g, COLLECT(DISTINCT t.name) AS topics, COUNT(DISTINCT p) AS papers RETURN g.title, g.funder, papers, topics", "keyspace": "knowledge"}'

run_query "Prolific collaborators (>= 2 collabs)" \
    "/graph/query" "POST" \
    '{"query": "MATCH (r:Researcher)-[c:COLLABORATES]->(other:Researcher) WITH r, COUNT(c) AS total_collabs WHERE total_collabs >= 2 RETURN r.name, total_collabs ORDER BY total_collabs DESC", "keyspace": "knowledge"}'

echo ""
echo "========================================"
echo "Results: ${PASS} passed, ${FAIL} failed"
echo "========================================"

if [ "${FAIL}" -gt 0 ]; then
    exit 1
fi
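
To run the suite, save it to a file (the name cypher-tests.sh below is illustrative), make it executable, and use the environment overrides the script already reads to target any node:

chmod +x cypher-tests.sh
./cypher-tests.sh                                      # defaults to localhost:7474
FERROSA_GRAPH_HOST=node2 FERROSA_GRAPH_PORT=7474 ./cypher-tests.sh

The non-zero exit code on failure means the script drops into a CI pipeline without any extra wrapping.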

Advanced analytics

The real power of a knowledge graph emerges when you combine traversals with aggregation and filtering to answer complex analytical questions.

Citation network analysis

Research influence — find all papers that cite papers authored by a specific researcher:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (r:Researcher {name: \"Alice Chen\"})-[:AUTHORED]->(p:Paper)<-[:CITES]-(citing:Paper) RETURN p.title AS original, citing.title AS cited_by", "keyspace": "knowledge"}'

Mutual citations — detect pairs of papers that cite each other (a possible signal of citation rings):

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (a:Paper)-[:CITES]->(b:Paper)-[:CITES]->(a) RETURN a.title, b.title", "keyspace": "knowledge"}'

Co-citation analysis — find papers frequently cited together, revealing hidden thematic links:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (p1:Paper)-[:CITES]->(common:Paper)<-[:CITES]-(p2:Paper) WHERE id(p1) < id(p2) RETURN p1.title, p2.title, common.title AS co_cited LIMIT 10", "keyspace": "knowledge"}'

Collaboration patterns

Collaboration network — list each researcher’s collaborators:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (r:Researcher)-[:COLLABORATES]->(other:Researcher) RETURN r.name, COLLECT(other.name) AS collaborators ORDER BY SIZE(collaborators) DESC", "keyspace": "knowledge"}'

Institution collaboration graph — aggregate individual collaborations into inter-institutional links:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (r1:Researcher)-[:AFFILIATED_WITH]->(i1:Institution), (r2:Researcher)-[:AFFILIATED_WITH]->(i2:Institution), (r1)-[:COLLABORATES]->(r2) WHERE i1.name < i2.name RETURN i1.name, i2.name, COUNT(*) AS collaboration_count", "keyspace": "knowledge"}'

Topic-spanning researchers — find researchers who publish across multiple research areas:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (r:Researcher)-[:AUTHORED]->(p:Paper)-[:COVERS]->(t:Topic) WITH r, COLLECT(DISTINCT t.name) AS topics WHERE SIZE(topics) >= 2 RETURN r.name, topics ORDER BY SIZE(topics) DESC", "keyspace": "knowledge"}'

Topic clustering and overlap

Topic overlap between institutions — find shared research interests:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (i1:Institution {name: \"MIT\"})<-[:AFFILIATED_WITH]-(r1:Researcher)-[:AUTHORED]->(p1:Paper)-[:COVERS]->(t:Topic)<-[:COVERS]-(p2:Paper)<-[:AUTHORED]-(r2:Researcher)-[:AFFILIATED_WITH]->(i2:Institution {name: \"Stanford\"}) RETURN DISTINCT t.name AS shared_topic", "keyspace": "knowledge"}'

Topic expertise by institution — break down each institution’s publication focus:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (i:Institution)<-[:AFFILIATED_WITH]-(r:Researcher)-[:AUTHORED]->(p:Paper)-[:COVERS]->(t:Topic) RETURN i.name, t.name, COUNT(DISTINCT p) AS papers ORDER BY i.name, papers DESC", "keyspace": "knowledge"}'

Funding impact

Unfunded highly-cited papers — identify impactful research that was not grant-funded:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (p:Paper) WHERE p.citation_count > 50 AND NOT (p)-[:FUNDED_BY]->(:Grant) RETURN p.title, p.citation_count ORDER BY p.citation_count DESC", "keyspace": "knowledge"}'

Research pipeline — trace the full funding-to-outcome pipeline using WITH chaining:

curl -X POST http://localhost:7474/graph/query \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (g:Grant)<-[:FUNDED_BY]-(p:Paper)-[:COVERS]->(t:Topic) WITH g, COLLECT(DISTINCT t.name) AS topics, COUNT(DISTINCT p) AS papers RETURN g.title, g.funder, papers, topics", "keyspace": "knowledge"}'

What you learned

In this tutorial you:

  1. Designed a research knowledge graph schema with 5 vertex types and 6 edge types, using WITH extensions to annotate standard CQL tables for graph traversal.

  2. Loaded rich sample data covering 10 researchers, 4 institutions, 12 papers, 6 topics, and 3 grants — all with realistic relationships and properties spanning multiple CQL data types (uuid, text, int, float, decimal, date, timestamp, boolean, set, map).

  3. Queried with CQL using primary key lookups, secondary indexes, collection operations, lightweight transactions, TTL, explicit timestamps, and ALLOW FILTERING — and saw where CQL’s single-table model breaks down for relationship queries.

  4. Queried with Cypher using node matching, relationship traversals (1-hop, 2-hop, N-hop), shortestPath, variable-length patterns, WHERE predicates, ORDER BY, LIMIT, and aggregation functions (COUNT, COLLECT, AVG, SIZE).

  5. Analyzed citation networks to detect influence chains, mutual citations, and co-citation patterns.

  6. Mapped collaboration patterns across institutions, identifying topic-spanning researchers and inter-institutional research links.

  7. Traced funding impact from grant to paper to topic, finding both funded and unfunded high-impact research.

The core insight: same data, two query paradigms. CQL for speed and scale. Cypher for relationships and patterns. No data duplication. No separate database. No ETL.

Next steps