IMDB Movie Data

Full Recipe¶

Explore a standard graph data set using Quine to combine data from separate sources, then generates a new event stream from the combined data.

IMDB Movie Data Recipe

Download Recipe

Scenario¶

In this scenario, Quine combines data from multiple CSV files into one graph. As the graph is formed, a standing query matches and reports every instance of when a person is both the actor and director of a movie.

Sample Data¶

The sample data for this recipe is provided in two csv files that were exported from a relational database.

File 1: movieData.csv contains the Person, Movie, and Join rows.

Download movieData.csv

File 2: ratingData.csv contains rows of ratings.

Download ratingData.csv

Click on the buttons above to download the sample data into the same directory where Quine will be run.

How it Works¶

This recipe parses the CSV files using ingest streams. We used multiple ingests streams to parse the movieData.csv file to highlight how the Cypher acts on each data structure. Creating a single ingest stream for the movie data would be more efficient.

The recipe ingests the CSV files to create the following graph shape.

Movie and Genre Nodes¶

The first ingest stream matches the movieData.csv file's rows containing Movie entities. These rows are parsed, turned into Movie and Genre nodes, and filled with properties. Genre nodes are connected to Movie nodes in the graph.

YAMLJSON

- type: FileIngest
  path: $movie_file
  format:
    type: CypherCsv
    headers: true
    query: |-
      WITH $that AS row
      MATCH (m) 
      WHERE row.Entity = 'Movie' 
        AND id(m) = idFrom("Movie", row.movieId)
      SET
        m:Movie,
        m.tmdbId = row.tmdbId,
        m.imdbId = row.imdbId,
        m.imdbRating = toFloat(row.imdbRating),
        m.released = row.released,
        m.title = row.title,
        m.year = toInteger(row.year),
        m.poster = row.poster,
        m.runtime = toInteger(row.runtime),
        m.countries = split(coalesce(row.countries,""), "|"),
        m.imdbVotes = toInteger(row.imdbVotes),
        m.revenue = toInteger(row.revenue),
        m.plot = row.plot,
        m.url = row.url,
        m.budget = toInteger(row.budget),
        m.languages = split(coalesce(row.languages,""), "|"),
        m.movieId = row.movieId
      WITH m,split(coalesce(row.genres,""), "|") AS genres
      UNWIND genres AS genre
      WITH m, genre
      MATCH (g) 
      WHERE id(g) = idFrom("Genre", genre)
      SET g.genre = genre, g:Genre
      CREATE (m:Movie)-[:IN_GENRE]->(g:Genre)

POST /api/v1/ingest/INGEST-1

{
  "type": "FileIngest",
  "path": "$movie_file",
  "format": {
    "type": "CypherCsv",
    "headers": true,
    "query": "WITH $that AS row\nMATCH (m) \nWHERE row.Entity = 'Movie' \n  AND id(m) = idFrom(\"Movie\", row.movieId)\nSET\n  m:Movie,\n  m.tmdbId = row.tmdbId,\n  m.imdbId = row.imdbId,\n  m.imdbRating = toFloat(row.imdbRating),\n  m.released = row.released,\n  m.title = row.title,\n  m.year = toInteger(row.year),\n  m.poster = row.poster,\n  m.runtime = toInteger(row.runtime),\n  m.countries = split(coalesce(row.countries,\"\"), \"|\"),\n  m.imdbVotes = toInteger(row.imdbVotes),\n  m.revenue = toInteger(row.revenue),\n  m.plot = row.plot,\n  m.url = row.url,\n  m.budget = toInteger(row.budget),\n  m.languages = split(coalesce(row.languages,\"\"), \"|\"),\n  m.movieId = row.movieId\nWITH m,split(coalesce(row.genres,\"\"), \"|\") AS genres\nUNWIND genres AS genre\nWITH m, genre\nMATCH (g) \nWHERE id(g) = idFrom(\"Genre\", genre)\nSET g.genre = genre, g:Genre\nCREATE (m:Movie)-[:IN_GENRE]->(g:Genre)"
  }
}

Person Nodes¶

The second ingest stream matches the movieData.csv file's rows containing Person entities. These rows are parsed, turned into Person nodes, and filled with properties.

YAMLJSON

- type: FileIngest
  path: $movie_file
  format:
    type: CypherCsv
    headers: true
    query: |-
      WITH $that AS row
      MATCH (p) 
      WHERE row.Entity = "Person" 
        AND id(p) = idFrom("Person", row.tmdbId)
      SET
        p:Person,
        p.imdbId = row.imdbId,
        p.bornIn = row.bornIn,
        p.name = row.name,
        p.bio = row.bio,
        p.poster = row.poster,
        p.url = row.url,
        p.born = row.born,
        p.died = row.died,
        p.tmdbId = row.tmdbId,
        p.born = CASE row.born WHEN "" THEN null ELSE datetime(row.born + "T00:00:00Z") END,
        p.died = CASE row.died WHEN "" THEN null ELSE datetime(row.died + "T00:00:00Z") END

POST /api/v1/ingest/INGEST-2

{
  "type": "FileIngest",
  "path": "$movie_file",
  "format": {
    "type": "CypherCsv",
    "headers": true,
    "query": "WITH $that AS row\nMATCH (p) \nWHERE row.Entity = \"Person\" \n  AND id(p) = idFrom(\"Person\", row.tmdbId)\nSET\n  p:Person,\n  p.imdbId = row.imdbId,\n  p.bornIn = row.bornIn,\n  p.name = row.name,\n  p.bio = row.bio,\n  p.poster = row.poster,\n  p.url = row.url,\n  p.born = row.born,\n  p.died = row.died,\n  p.tmdbId = row.tmdbId,\n  p.born = CASE row.born WHEN \"\" THEN null ELSE datetime(row.born + \"T00:00:00Z\") END,\n  p.died = CASE row.died WHEN \"\" THEN null ELSE datetime(row.died + \"T00:00:00Z\") END"
  }
}

Role Nodes¶

The third ingest stream matches the movieData.csv file's rows containing Join entities that have Acting in the Work column. These rows are parsed, turned into Role nodes, filled with properties, and connected to the graph. Additionally, the ACTED_IN relationship is set between the Person and Movie nodes.

YAMLJSON

- type: FileIngest
  path: $movie_file
  format:
    type: CypherCsv
    headers: true
    query: |-
      WITH $that AS row
      WITH row 
      WHERE row.Entity = "Join" 
        AND row.Work = "Acting"
      MATCH (p), (m), (r) 
      WHERE id(p) = idFrom("Person", row.tmdbId)
        AND id(m) = idFrom("Movie", row.movieId)
        AND id(r) = idFrom("Role", row.tmdbId, row.movieId, row.role)
      SET 
        r.role = row.role, 
        r.movie = row.movieId, 
        r.tmdbId = row.tmdbId, 
        r:Role
      CREATE (p:Person)-[:PLAYED]->(r:Role)<-[:HAS_ROLE]-(m:Movie)
      CREATE (p:Person)-[:ACTED_IN]->(m:Movie)

POST /api/v1/ingest/INGEST-3

{
  "type": "FileIngest",
  "path": "$movie_file",
  "format": {
    "type": "CypherCsv",
    "headers": true,
    "query": "WITH $that AS row\nWITH row \nWHERE row.Entity = \"Join\" \n  AND row.Work = \"Acting\"\nMATCH (p), (m), (r) \nWHERE id(p) = idFrom(\"Person\", row.tmdbId)\n  AND id(m) = idFrom(\"Movie\", row.movieId)\n  AND id(r) = idFrom(\"Role\", row.tmdbId, row.movieId, row.role)\nSET \n  r.role = row.role, \n  r.movie = row.movieId, \n  r.tmdbId = row.tmdbId, \n  r:Role\nCREATE (p:Person)-[:PLAYED]->(r:Role)<-[:HAS_ROLE]-(m:Movie)\nCREATE (p:Person)-[:ACTED_IN]->(m:Movie)"
  }
}

Directed Nodes¶

The fourth ingest stream matches the movieData.csv file's rows containing Join entities that have Directing in the Work column. These rows are parsed and the DIRECTED relationship is created between the Person and Movie nodes.

YAMLJSON

- type: FileIngest
  path: $movie_file
  format:
    type: CypherCsv
    headers: true
    query: |-
      WITH $that AS row
      WITH row WHERE row.Entity = "Join" AND row.Work = "Directing"
      MATCH (p), (m) 
      WHERE id(p) = idFrom("Person", row.tmdbId)
        AND id(m) = idFrom("Movie", row.movieId)
      CREATE (p:Person)-[:DIRECTED]->(m:Movie)

POST /api/v1/ingest/INGEST-4

{
  "type": "FileIngest",
  "path": "$movie_file",
  "format": {
    "type": "CypherCsv",
    "headers": true,
    "query": "WITH $that AS row\nWITH row WHERE row.Entity = \"Join\" AND row.Work = \"Directing\"\nMATCH (p), (m) \nWHERE id(p) = idFrom(\"Person\", row.tmdbId)\n  AND id(m) = idFrom(\"Movie\", row.movieId)\nCREATE (p:Person)-[:DIRECTED]->(m:Movie)"
  }
}

Rating Nodes¶

The fifth ingest stream matches rows from the ratingsData.csv file to create User and Rating nodes, fill them with parameters, and connect them into the graph.

YAMLJSON

- type: FileIngest
  path: $rating_file
  format:
    type: CypherCsv
    headers: true
    query: |-
      WITH $that AS row
      MATCH (m), (u), (rtg) 
      WHERE id(m) = idFrom("Movie", row.movieId)
        AND id(u) = idFrom("User", row.userId)
        AND id(rtg) = idFrom("Rating", row.movieId, row.userId, row.rating)
      SET u.name = row.name, u:User
      SET rtg.rating = row.rating,
        rtg.timestamp = toInteger(row.timestamp),
        rtg:Rating
      CREATE (u:User)-[:SUBMITTED]->(rtg:Rating)<-[:HAS_RATING]-(m:Movie)
      CREATE (u:User)-[:RATED]->(m:Movie)

POST /api/v1/ingest/INGEST-5

{
  "type": "FileIngest",
  "path": "$rating_file",
  "format": {
    "type": "CypherCsv",
    "headers": true,
    "query": "WITH $that AS row\nMATCH (m), (u), (rtg) \nWHERE id(m) = idFrom(\"Movie\", row.movieId)\n  AND id(u) = idFrom(\"User\", row.userId)\n  AND id(rtg) = idFrom(\"Rating\", row.movieId, row.userId, row.rating)\nSET u.name = row.name, u:User\nSET rtg.rating = row.rating,\n  rtg.timestamp = toInteger(row.timestamp),\n  rtg:Rating\nCREATE (u:User)-[:SUBMITTED]->(rtg:Rating)<-[:HAS_RATING]-(m:Movie)\nCREATE (u:User)-[:RATED]->(m:Movie)"
  }
}

Acted and Directed¶

A standing query detects when an actor (Person) has both the ACTED_IN and DIRECTED relationship to the same Movie.

When a pattern match is found, the ActedDirected relationship is created between the Person and Movie nodes in the graph, and an alert is written into the ActorDirector.jsonl file.

YAMLJSON

- pattern:
    type: Cypher
    mode: MultipleValues
    query: |-
      MATCH (a:Movie)<-[:ACTED_IN]-(p:Person)-[:DIRECTED]->(m:Movie) 
      WHERE id(a) = id(m)
      RETURN id(m) as movieId, id(p) as personId
  outputs:
    set-ActedDirected:
      type: CypherQuery
      query: |-
        MATCH (m),(p)
        WHERE id(m) = $that.data.movieId 
          AND id(p) = $that.data.personId
        WITH *
        CREATE (p:Person)-[:ActedDirected]->(m:Movie)
        RETURN id(m) as movieId, m.title as Movie, id(p) as personId, p.name as Actor
      andThen:
        type: WriteToFile
        path: "ActorDirector.jsonl"

/api/v1/query/standing/STANDING-1

{
  "pattern": {
    "type": "Cypher",
    "mode": "MultipleValues",
    "query": "MATCH (a:Movie)<-[:ACTED_IN]-(p:Person)-[:DIRECTED]->(m:Movie) \nWHERE id(a) = id(m)\nRETURN id(m) as movieId, id(p) as personId"
  },
  "outputs": {
    "set-ActedDirected": {
      "type": "CypherQuery",
      "query": "MATCH (m),(p)\nWHERE id(m) = $that.data.movieId \n  AND id(p) = $that.data.personId\nWITH *\nCREATE (p:Person)-[:ActedDirected]->(m:Movie)\nRETURN id(m) as movieId, m.title as Movie, id(p) as personId, p.name as Actor",
      "andThen": {
        "type": "WriteToFile",
        "path": "ActorDirector.jsonl"
      }
    }
  }
}

The output object contains information about the match and the nodes matching the query.

{
       "meta": {
              "isPositiveMatch": true,
              "resultId": "d2008617-cc5c-4f81-8472-f3db277f8da2"
       },
       "data": {
              "Actor": "Clint Eastwood",
              "Movie": "Unforgiven",
              "movieId": "4a6d64c8-9c90-3362-b443-4d2e7b2fb9d1",
              "personId": "4638a820-3b68-3fc7-9fa7-341e876b701e"
       }
}

Running the Recipe¶

java \
-jar quine-1.9.3.jar -r movieData.yaml\
--recipe-value movie_file=movieData.csv \
--recipe-value rating_file=ratingData.csv

Tip

This recipe will create an ActorDirector.jsonl file in the local directory that you should remove before each run. We found it easier to launch Quine using the following shell script so that we didn't forget to clean up output from previous runs.

run-recipe.sh

#!/bin/bash

[ -f ActorDirector.jsonl ] && rm ActorDirector.jsonl

java \
-jar quine-1.9.3.jar -r movieData.yaml\
--recipe-value movie_file=movieData.csv \
--recipe-value rating_file=ratingData.csv

Launching Quine directly or using run-recipe.sh will produce output similar to this.

❯ ./run_recipe.sh
Graph is ready
Running Recipe: Ingesting CSV Files
Using 6 node appearances
Using 8 sample queries
Running Standing Query STANDING-1
Running Ingest Stream INGEST-1
Running Ingest Stream INGEST-2
Running Ingest Stream INGEST-3
Running Ingest Stream INGEST-4
Running Ingest Stream INGEST-5
Quine web server available at http://localhost:8080
INGEST-1 status is completed and ingested 74090
INGEST-2 status is completed and ingested 74090
INGEST-3 status is completed and ingested 74090
INGEST-4 status is completed and ingested 74090
INGEST-5 status is completed and ingested 100005

 | => STANDING-1 count 491

Summary¶

Be sure to open Quine in your browser using the URL provided in your terminal window. Several sample queries are ready for you to use in the Exploration UI query bar. Click on the query bar and launch the queries by pressing the Query button.

Tip

Submit text queries with Shift+Enter to avoid the Exploration UI sending back an error.

Sample Query	Type	Description
Sample of Nodes	Node	Return a sample of nodes from the graph
Count Nodes	Text	Count the types of nodes in the graph
Count Relationships	Text	Count the types of relationships in the graph
Movie Genres	Node	Return the movie genres nodes from the graph
Person Acted In a movie	Node	Return all of the actor nodes from the graph
Person Directed a movie	Node	Return all of the director nodes from the graph
Person Acted In and Directed a movie	Node	Return all of the actor and movie nodes where the actor also directed
User Rated a movie	Node	Return all of the rating nodes from the graph

The results from the Person Acted In and Directed a movie create an interesting shape you can explore further with your queries.