databricks-spark-structured-streaming


Comprehensive guide to Spark Structured Streaming for production workloads. Use when building streaming pipelines, implementing real-time data processing, handling stateful operations, or optimizing streaming performance.

1.3k stars
26.4k downloads
Updated 2 weeks ago

SKILL.md

Spark Structured Streaming

Production-ready streaming pipelines with Spark Structured Streaming. This skill provides navigation to detailed patterns and best practices.

Quick Start

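from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# Placeholder payload schema: replace with the fields of your Kafka messages
schema = StructType([
    StructField("id", StringType()),
    StructField("event_time", TimestampType()),
])

# Basic Kafka to Delta streaming: parse the JSON value into typed columns
df = (spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "topic")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Append to Delta; the checkpoint records progress so a restart resumes where it stopped
query = (df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/Volumes/catalog/checkpoints/stream")
    .trigger(processingTime="30 seconds")
    .start("/delta/target_table")
)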

Core Patterns

| Pattern | Description | Reference |
|---|---|---|
| Kafka Streaming | Kafka to Delta, Kafka to Kafka, Real-Time Mode | See kafka-streaming.md |
| Stream Joins | Stream-stream joins, stream-static joins | See stream-stream-joins.md, stream-static-joins.md |
| Multi-Sink Writes | Write to multiple tables, parallel merges | See multi-sink-writes.md |
| Merge Operations | MERGE performance, parallel merges, optimizations | See merge-operations.md |
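
As a rough illustration of the multi-sink pattern, the sketch below builds a `foreachBatch` function that appends each micro-batch to several Delta paths. It assumes the Delta writer options `txnAppId` and `txnVersion` (the same ones the production checklist mentions) so that a retried batch is skipped rather than written twice. The names `make_multi_sink_writer`, `app_id`, and `sink_paths` are hypothetical and not taken from multi-sink-writes.md.

```python
from typing import Callable, Dict


def make_multi_sink_writer(app_id: str, sink_paths: Dict[str, str]) -> Callable:
    """Build a foreachBatch function that appends a micro-batch to every sink.

    app_id and sink_paths are hypothetical names chosen for this sketch.
    """
    def write_batch(batch_df, batch_id: int) -> None:
        # Cache once so the micro-batch is not recomputed for each sink.
        batch_df.persist()
        try:
            for path in sink_paths.values():
                (batch_df.write
                    .format("delta")
                    .mode("append")
                    # Delta ignores a write whose (txnAppId, txnVersion) pair
                    # was already committed to that table.
                    .option("txnAppId", app_id)
                    .option("txnVersion", batch_id)
                    .save(path))
        finally:
            # Release the cache even if one of the writes fails.
            batch_df.unpersist()

    return write_batch


# Usage (requires a running stream `df`; all paths are placeholders):
# df.writeStream \
#     .foreachBatch(make_multi_sink_writer("orders", {"raw": "/delta/raw", "audit": "/delta/audit"})) \
#     .option("checkpointLocation", "/Volumes/catalog/checkpoints/multi_sink") \
#     .start()
```

The sinks are written sequentially here for simplicity; a parallel variant would need its own error handling.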

Configuration

| Topic | Description | Reference |
|---|---|---|
| Checkpoints | Checkpoint management and best practices | See checkpoint-best-practices.md |
| Stateful Operations | Watermarks, state stores, RocksDB configuration | See stateful-operations.md |
| Trigger & Cost | Trigger selection, cost optimization, RTM | See trigger-and-cost-optimization.md |
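
Trigger selection is largely a latency-versus-cost choice. A minimal sketch, assuming only two options are in play: a fixed `processingTime` interval for an always-on stream, or `availableNow` for a scheduled job that drains the backlog and stops. The helper name `trigger_kwargs` and its parameter are hypothetical; the keyword names match `DataStreamWriter.trigger()`.

```python
from typing import Dict, Optional, Union


def trigger_kwargs(interval_seconds: Optional[int]) -> Dict[str, Union[str, bool]]:
    """Return keyword arguments for DataStreamWriter.trigger().

    None means "process everything currently available, then stop", which
    suits a scheduled job; a positive number means a fixed micro-batch interval.
    """
    if interval_seconds is None:
        return {"availableNow": True}
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {"processingTime": f"{interval_seconds} seconds"}


# Usage (requires a running stream `df`):
# df.writeStream.trigger(**trigger_kwargs(30))    # always-on, 30-second batches
# df.writeStream.trigger(**trigger_kwargs(None))  # scheduled run, stops when caught up
```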

Best Practices

| Topic | Description | Reference |
|---|---|---|
| Production Checklist | Comprehensive best practices | See streaming-best-practices.md |

Production Checklist

  • Checkpoint location is persistent (UC volumes, not DBFS)
  • Unique checkpoint per stream
  • Fixed-size cluster (no autoscaling for streaming)
  • Monitoring configured (input rate, lag, batch duration)
  • Exactly-once verified (txnVersion/txnAppId)
  • Watermark configured for stateful operations
  • Left joins for stream-static (not inner)
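
The monitoring item in the checklist above can be sketched as a small check over a query's progress record. This assumes the record is dict-like, as returned by `query.lastProgress` in PySpark, and uses its `durationMs.triggerExecution`, `inputRowsPerSecond`, and `processedRowsPerSecond` fields. The function name `check_progress` and the warning rules are illustrative, not taken from streaming-best-practices.md.

```python
from typing import Any, Dict, List


def check_progress(progress: Dict[str, Any], trigger_interval_ms: int) -> List[str]:
    """Return human-readable warnings for one streaming progress record."""
    warnings = []

    # Batch duration: a batch slower than the trigger interval means the
    # stream cannot keep its schedule.
    batch_ms = progress.get("durationMs", {}).get("triggerExecution", 0)
    if batch_ms > trigger_interval_ms:
        warnings.append(
            f"batch took {batch_ms} ms, longer than the {trigger_interval_ms} ms trigger"
        )

    # Lag: rows arriving faster than they are processed means a growing backlog.
    input_rate = progress.get("inputRowsPerSecond") or 0.0
    processed_rate = progress.get("processedRowsPerSecond") or 0.0
    if input_rate > processed_rate:
        warnings.append(
            f"input rate {input_rate}/s exceeds processing rate {processed_rate}/s"
        )

    return warnings


# Usage (requires a running StreamingQuery `query`):
# for message in check_progress(query.lastProgress, trigger_interval_ms=30_000):
#     print(message)
```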

Install

Requires askill CLI v1.0+

AI Quality Score

78/100 (analyzed 2/24/2026)

A well-structured technical reference skill for Spark Structured Streaming that serves as a navigation hub to detailed patterns. Includes clear when-to-use guidance, actionable code examples, and a production checklist. Located in a dedicated skills folder with tags for discoverability. Acts as a high-quality reference document that points to external detailed files, making it appropriately complete for a reference skill. Scores high on clarity, safety, and reusability within the Spark streaming domain.


Metadata

License: unknown
Version: -
Updated: 2 weeks ago
Publisher: databricks-solutions

Tags

database, observability