ML Streaming Inference Pattern
Classification
- Domain: Computer Science, AI/ML
- Category: ML System Design Patterns
- Novelty: 7/10 (modern pattern for real-time systems)
- Practitioner Evidence: 10/10 (Kafka, Flink, production-validated)
Mental Model
Streaming inference processes predictions on events as they arrive in real-time data streams, rather than batching or waiting for requests. Like a conveyor belt factory where each item gets inspected immediately as it passes, versus collecting items into boxes for later inspection. Data flows continuously through the model with sub-second latency.
When to Use
- Real-time event processing (fraud detection, anomaly detection, IoT sensor monitoring)
- Predictions must happen on data in motion before storage (filter/route/enrich streams)
- Low-latency requirements (milliseconds to seconds) with high throughput (thousands/second)
- Continuous data streams from Kafka, Kinesis, Pub/Sub, or IoT sources
- Predictions inform downstream stream processing (feature engineering, alerting, routing)
Core Framework
1. Streaming Architecture Selection
Choose deployment pattern for model in stream processing pipeline
Option A: Embedded Model Pattern
- Deploy model directly inside stream processor (Kafka Streams, Flink)
- Load model in-memory within application code (TensorFlow, PyTorch, ONNX)
- Process events using stream processing DSL with model calls
- Best for: Simple models, low latency requirements, tight coupling needs
Option B: Model Server Pattern
- Deploy dedicated model serving infrastructure (TensorFlow Serving, Seldon, KServe)
- Stream processor makes RPC calls to model server (HTTP/REST or gRPC)
- Model server handles versioning, A/B testing, scaling independently
- Best for: Complex models, shared across services, versioning needs
2. Stream Processor Setup
Configure stream processing engine for real-time inference
Kafka Streams Approach:
- Define topology: source topic → transform → model inference → sink topic
- Configure processing guarantees (at-least-once vs. exactly-once)
- Set parallelism (number of stream threads = topic partitions)
- Implement stateful processing if predictions need context (windowed aggregations)
Apache Flink Approach:
- Create DataStream from Kafka/Kinesis source
- Map/FlatMap functions call model for predictions
- Configure checkpointing for fault tolerance (every 60-300 seconds)
- Use AsyncIO for non-blocking model server calls (maintain throughput)
3. Model Loading & Initialization
Optimize model deployment for streaming performance
- Load model once during processor initialization (avoid per-event loading)
- Use model serialization formats optimized for inference (ONNX, TorchScript, SavedModel)
- Pre-warm model with dummy predictions (avoid cold-start latency on first event)
- Configure batch inference within streams (micro-batches of 10-100 events for throughput)
4. Feature Engineering in Streams
Extract features from streaming events for model input
- Parse event payload into feature vector (JSON → numerical/categorical features)
- Enrich events with lookup data (joins with reference tables, caches, feature stores)
- Apply stateful transformations (rolling windows, session aggregations, counters)
- Handle missing features with defaults/imputation matching training pipeline
5. Inference Execution
Perform prediction on streaming events with low latency
- For embedded models: Direct function call within stream processor
- For model servers: Async HTTP/gRPC request with timeout (100-500ms)
- Implement micro-batching: Accumulate 10-50 events, batch predict, distribute results
- Handle prediction failures with fallback logic (default scores, retry, dead letter queue)
6. Output & Downstream Integration
Route predictions to consumers and storage
- Publish predictions to output Kafka topic (prediction_id, features, score, timestamp)
- Trigger actions based on prediction thresholds (fraud alert if score > 0.9)
- Enrich original event with prediction (merge input + output streams)
- Sink to databases for serving (low-latency KV stores) or analytics (data warehouse)
7. Monitoring & Observability
Track streaming inference performance and model health
- Latency metrics: End-to-end latency (event arrival → prediction output), model inference time
- Throughput metrics: Events/second processed, predictions/second generated
- Model metrics: Prediction distribution, confidence scores, drift detection
- Error handling: Prediction failures, timeout rate, dead letter queue size
Practical Application
Real-Time Fraud Detection (Credit Card Transactions)
Problem: Detect fraudulent transactions within 100ms to block before authorization Streaming Solution:
- Transaction events stream into Kafka topic (card_id, amount, merchant, location, timestamp)
- Flink job enriches with stateful features (transaction velocity last 5 min, merchant history)
- XGBoost model embedded in Flink scores each transaction (fraud_score: 0-1)
- High-risk transactions (score > 0.85) published to fraud_alerts topic → blocks authorization
- All predictions logged to data warehouse for model monitoring and retraining Result: 50ms p99 latency, 100K transactions/second throughput
IoT Anomaly Detection (Manufacturing Sensors)
Problem: Detect machine failures from 10K sensor streams in real-time Streaming Solution:
- Sensor data streams from devices to AWS Kinesis (temperature, vibration, pressure every 1 second)
- Kafka Streams aggregates 10-second windows per machine (mean, std, max, min)
- Isolation Forest model (embedded ONNX) scores each window for anomaly (anomaly_score)
- Anomalies (score > threshold) trigger alerts to maintenance team via SNS
- Normal predictions stored in TimescaleDB for trend analysis Result: 2-second end-to-end latency, early detection 30 minutes before failure
Content Recommendation (Social Media Feed)
Problem: Score feed posts in real-time as users scroll Streaming Solution:
- User scroll events stream to Kafka (user_id, post_id, scroll_position, timestamp)
- Kafka Streams calls TensorFlow Serving via gRPC (async) with user/post embeddings
- Ranking model returns relevance scores for candidate posts
- Top-scored posts returned to client within 200ms
- User interactions (clicks, likes) feedback to training pipeline via Kafka
Edge Cases & Nuances
Backpressure & Rate Limiting: Inference slower than event arrival rate
- Use micro-batching to increase throughput (trade small latency for higher QPS)
- Scale horizontally: Add stream processor instances (Kafka partitions, Flink parallelism)
- Implement load shedding: Drop low-priority events during overload (sample 10% of events)
Model Update Without Downtime: Deploying new model version
- Blue-green deployment: Run old + new versions, gradually shift traffic (canary release)
- For embedded models: Rolling restart stream processors with new model binary
- For model servers: Update server behind load balancer, test before full rollout
Event Ordering & Exactly-Once Processing: Preventing duplicate predictions
- Use Kafka exactly-once semantics (EOS) with transactional producers/consumers
- Implement idempotent predictions with deduplication keys (event_id tracking)
- Handle late-arriving events with watermarks (Flink) or grace periods
Cold Start & Stateful Processing: New stream processor instance initialization
- Restore state from checkpoints (Flink savepoints, Kafka changelog topics)
- Pre-populate caches/lookup tables before processing events (initialization phase)
- Use state TTL to prevent unbounded state growth (expire old entries after N hours)
Anti-Patterns
Synchronous Blocking Calls: Calling slow external APIs synchronously in stream processing Stateless Processing of Temporal Patterns: Ignoring event history when model needs context Over-Sized Models: Running 10GB deep learning model with 500ms latency in millisecond-latency streams No Backpressure Handling: Letting event queue grow unbounded during processing slowdowns
Trade-offs
Embedded Model vs. Model Server:
- Embedded: Lower latency (no RPC), tighter coupling, harder to version/update, duplicated models
- Model Server: Higher latency (network call), loose coupling, easy versioning, centralized serving
Micro-Batching vs. Per-Event Inference:
- Micro-batching: Higher throughput (batch efficiency), slightly higher latency (accumulation delay)
- Per-event: Lower latency (immediate processing), lower throughput (overhead per event)
At-Least-Once vs. Exactly-Once:
- At-least-once: Simpler, higher throughput, possible duplicate predictions (idempotency needed)
- Exactly-once: Complex, lower throughput (coordination overhead), no duplicates
Related Frameworks
- Batch Processing Pattern: Pre-compute predictions offline (complements streaming for hybrid systems)
- Online Learning Pattern: Update model continuously from streaming data (streaming training)
- Lambda Architecture: Batch layer + speed layer combining batch and streaming predictions
- Kappa Architecture: Pure streaming architecture (stream processing for all data)
- Feature Store: Consistent feature engineering for batch and streaming (avoid training/serving skew)
Practitioner Sources
- Kafka ML Systems (Kai Waehner): Real-time inference with Kafka + Flink, architecture patterns
- Google ML Design Patterns: Streaming inference patterns, deployment strategies
- Confluent ML Blog: Machine learning in Kafka applications, best practices
- Apache Flink ML: Streaming ML pipelines, stateful inference, checkpointing strategies
- TensorFlow Serving: Model serving for production inference, gRPC APIs, versioning
