Disaster Recovery

Quine Enterprise persists all durable state (graph data, standing queries, and ingest definitions) in Cassandra. Quine Enterprise itself keeps no durable state of its own; everything it needs to resume lives in the persistence layer. Disaster recovery therefore reduces to two steps: recover Cassandra, then redeploy Quine Enterprise on top of it.

Cassandra Is the Recovery Target

All backup, replication, and restore operations target Cassandra. Follow Apache Cassandra's recommended practices for your deployment model:

  • Snapshots: Use nodetool snapshot to create point-in-time backups. Copy snapshots to durable storage (S3, GCS, or equivalent) outside your primary failure domain. You do not need to stop Quine Enterprise before taking a snapshot. Cassandra snapshots are crash-consistent, and Quine Enterprise's event-sourced journals ensure incomplete writes are recovered on restart.
  • Cross-zone replication: Configure NetworkTopologyStrategy with replicas across availability zones or datacenters so Cassandra can survive the loss of an entire zone. Maintain replication factor 3 with LOCAL_QUORUM consistency for both reads and writes.
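
As a rough illustration of the two items above, the following Python sketch assembles a nodetool snapshot command with a timestamped tag, builds an ALTER KEYSPACE statement that uses NetworkTopologyStrategy, and computes the quorum size for a given replication factor. The keyspace name quine, the datacenter name dc1, and the dr- tag prefix are assumptions made for illustration; substitute the values from your own deployment. nodetool snapshot acts on the node it is run against, so a complete backup means running it on every node.

```python
import shlex
from datetime import datetime, timezone


def snapshot_command(keyspace: str, when: datetime) -> list[str]:
    """Build a `nodetool snapshot` invocation with a timestamped tag."""
    tag = when.strftime("dr-%Y%m%dT%H%M%SZ")  # tag format is an assumption
    return ["nodetool", "snapshot", "-t", tag, keyspace]


def replication_cql(keyspace: str, replicas_per_dc: dict[str, int]) -> str:
    """Build an ALTER KEYSPACE statement using NetworkTopologyStrategy."""
    dcs = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {dcs}}};"
    )


def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a (LOCAL_)QUORUM read or write."""
    return replication_factor // 2 + 1


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(shlex.join(snapshot_command("quine", now)))  # run on every node
    print(replication_cql("quine", {"dc1": 3}))
    print("quorum for RF=3:", quorum(3))
```

With a replication factor of 3 the quorum is 2, which is why LOCAL_QUORUM reads and writes can keep succeeding while one local replica is unavailable.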

See the Apache Cassandra documentation on replication and backup and restore for detailed procedures.

What Needs Attention After Recovery

Once Cassandra is restored, deploy Quine Enterprise using your existing infrastructure-as-code (Helm, Terraform, or equivalent) pointed at the recovered Cassandra cluster. See Helm Quick Start or Terraform on AWS.

If the recovery environment is identical to the original (same source endpoints, same output destinations, same cluster size), Quine Enterprise will resume processing with no further action. Graph data, standing queries, and ingests all come back from Cassandra.

If the recovery environment differs, some configurations may need updating:

  • Ingest source addresses. If Kafka brokers, Kinesis streams, or SQS queues have different endpoints in the recovery region, update or recreate the affected ingest definitions via the REST API.
  • Standing query output destinations. If webhooks, Kafka topics, or other output targets have different addresses, update those output configurations.
  • Cluster size. If the recovery cluster has a different target-size, ingests must be deleted and recreated because they are defined per-member. See Cluster Scaling.
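
The decision logic in the list above can be sketched as a small comparison between a description of the original environment and the recovery environment. This is only a planning aid under stated assumptions: the Environment class, its field names, and the action strings are hypothetical and are not part of any Quine Enterprise API.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Hypothetical description of a deployment, for planning purposes only."""
    ingest_sources: dict[str, str] = field(default_factory=dict)       # ingest name -> endpoint
    output_destinations: dict[str, str] = field(default_factory=dict)  # output name -> address
    target_size: int = 1                                               # cluster target-size


def recovery_actions(original: Environment, recovery: Environment) -> list[str]:
    """List the follow-up work implied by differences between environments."""
    actions: list[str] = []
    if recovery.target_size != original.target_size:
        # Ingests are defined per-member, so a size change affects all of them.
        actions.append("delete and recreate all ingests (target-size changed)")
    else:
        for name, endpoint in original.ingest_sources.items():
            if recovery.ingest_sources.get(name) != endpoint:
                actions.append(f"update or recreate ingest '{name}'")
    for name, address in original.output_destinations.items():
        if recovery.output_destinations.get(name) != address:
            actions.append(f"update standing query output '{name}'")
    return actions
```

An empty result corresponds to the identical-environment case, where no further action is needed after redeploying.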

After recovery, ingest streams resume from their last committed offsets. Depending on how long the outage lasted and whether the sources retained the data (Kafka retention, Kinesis stream retention period), there may be a backlog to work through. Monitor ingest.{name}.count and consumer group lag until the cluster catches up.
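
For a rough sense of how long catching up might take, the sketch below applies simple arithmetic: the backlog is the outage duration times the arrival rate, and it drains at the difference between the processing rate and the arrival rate. It assumes both rates are constant, which real workloads rarely are, so treat the result as an order-of-magnitude estimate. The function and parameter names are illustrative.

```python
def catch_up_seconds(outage_seconds: float,
                     arrival_rate: float,
                     processing_rate: float) -> float:
    """Estimate seconds needed to drain the backlog built up during an outage.

    Rates are in records per second and are assumed constant.
    """
    if processing_rate <= arrival_rate:
        # The backlog never shrinks if ingest cannot outpace new arrivals.
        return float("inf")
    backlog = outage_seconds * arrival_rate
    return backlog / (processing_rate - arrival_rate)
```

For example, a one-hour outage at 1,000 records per second leaves a backlog of 3.6 million records; at a processing rate of 3,000 records per second the backlog shrinks by 2,000 records per second and clears in about 30 minutes.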

Delivery Guarantees During Recovery

Ingest provides at-least-once delivery, so some records may be reprocessed on recovery. This is safe as long as ingest queries are idempotent, for example by using idFrom() to derive deterministic node IDs so that a replayed record updates the same node instead of creating a duplicate. Standing query outputs are best-effort: any results that were queued but not yet delivered at the time of failure may be lost. See Delivery Guarantees for details on both.
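
To see why deterministic IDs make replay safe, consider this minimal Python simulation. The id_from function here is only a stand-in that hashes its inputs; it is not Quine's idFrom() algorithm, and the in-memory dictionary is a toy substitute for the graph. The record fields are made up for the example.

```python
import json
import uuid


def id_from(*parts: str) -> uuid.UUID:
    """Illustrative stand-in for idFrom(): same inputs always give the same ID.

    This is NOT Quine's algorithm, only a deterministic hash for the demo.
    """
    return uuid.uuid5(uuid.NAMESPACE_URL, json.dumps(list(parts)))


def apply_record(graph: dict, record: dict) -> None:
    """Upsert a node keyed by a deterministic ID and overwrite its property."""
    node_id = id_from("user", record["user_id"])
    graph.setdefault(node_id, {})["email"] = record["email"]


def replay(records: list) -> dict:
    """Apply records in order to an empty toy graph and return the result."""
    graph: dict = {}
    for record in records:
        apply_record(graph, record)
    return graph


records = [
    {"user_id": "u1", "email": "a@example.com"},
    {"user_id": "u2", "email": "b@example.com"},
]

# Reprocessing the tail of the stream leaves the graph unchanged.
assert replay(records) == replay(records + records[1:])
```

The same would not hold for an ingest query that creates a fresh node per record or increments a counter, since each replayed record would then change the result.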

When Disaster Recovery Is Not Needed

A single Quine Enterprise member failure is handled by the cluster's built-in resilience. If hot spares are provisioned, a spare assumes the failed member's position automatically. If all Quine Enterprise members fail but Cassandra is unaffected, simply redeploy Quine Enterprise; no Cassandra restore is needed. See Cluster Resilience for details. Full disaster recovery (restoring Cassandra and redeploying Quine Enterprise) is only necessary when the persistence layer itself is lost.

Planning Checklist

  • Cassandra replication spans failure domains (NetworkTopologyStrategy across availability zones or regions)
  • Cassandra snapshots are taken on a regular schedule and copied to durable storage outside the primary failure domain
  • Infrastructure-as-code for Quine Enterprise is tested and ready to deploy in the recovery environment
  • Ingest source endpoints and standing query output destinations are documented, with recovery-environment equivalents identified
  • Kafka/Kinesis/SQS retention periods are long enough to cover your target recovery time
  • Recovery procedure has been tested end-to-end in a non-production environment
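
The retention item in the checklist lends itself to a simple automated check. The sketch below flags sources whose retention is shorter than the target recovery time multiplied by a safety factor. The safety factor, the function name, and the source names in the usage example are assumptions for illustration; the retention values themselves must come from your own Kafka, Kinesis, or SQS configuration.

```python
from datetime import timedelta


def sources_at_risk(retention: dict[str, timedelta],
                    target_recovery_time: timedelta,
                    safety_factor: float = 2.0) -> list[str]:
    """Return names of sources that may expire data before recovery completes.

    safety_factor is an assumed margin on top of the target recovery time.
    """
    required = target_recovery_time * safety_factor
    return sorted(name for name, kept in retention.items() if kept < required)
```

For instance, with a target recovery time of 18 hours and the default safety factor of 2, a source that retains data for 24 hours would be flagged, while one that retains data for 7 days would not.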