Real-Time Data Pipeline Architecture for Trading Systems

9 minute read (2,274 words)

May 5th, 2026

[Figure: architecture diagram showing data flow from exchanges through processing layers to trading strategies]

Your trading strategy needs real-time market data. Not "real-time" as in "updated every minute"—real-time as in microseconds matter.

We've seen trades go wrong because of a 50ms delay that seemed insignificant. We've seen million-dollar strategies fail because a feed went stale and nobody noticed for 3 minutes. Building infrastructure that reliably delivers market data from exchange to strategy, with minimal latency and maximum reliability, is a foundational challenge in quantitative trading.

This post covers the architecture patterns, technology choices, and trade-offs involved in building real-time data pipelines for trading systems.

Why Real-Time Matters

Different use cases have different latency requirements:

High-frequency trading: Microseconds matter. Every microsecond of added latency is a competitive disadvantage.

Medium-frequency strategies: Milliseconds matter. Stale data means missed opportunities or adverse fills.

Lower-frequency strategies: Seconds to minutes acceptable. But even here, data freshness enables better execution and risk management.

Risk monitoring: Sub-second updates critical. You need to know your exposure in real-time, not from a batch job.

Even if your strategy doesn't require ultra-low latency, your risk systems do. And reliable data delivery matters at every timescale.

The Data Flow

A real-time data pipeline has five layers:

Sources → Ingestion → Processing → Storage → Consumption

Sources

Where the data comes from:

  • Exchanges: Direct feeds, co-located connections
  • Data vendors: Consolidated feeds, normalized data
  • Brokers: Proprietary data, execution feeds
  • Alternative data: News, sentiment, satellite imagery

Each source has different characteristics:

  • Latency (direct feeds faster than vendor feeds)
  • Coverage (vendors aggregate multiple exchanges)
  • Format (FIX protocol, JSON, binary protocols)
  • Reliability (redundancy, failover)

Ingestion

Getting data into your system:

  • Connection management: Maintain persistent connections, handle reconnects
  • Protocol handling: Parse exchange-specific formats
  • Timestamping: Record when data arrived (not just exchange timestamp)
  • Initial buffering: Handle bursts without dropping data
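As a rough sketch of the last two points, here is a minimal, stdlib-only buffer that stamps each message with its local arrival time and absorbs bursts up to a bounded size. The message shape and sizes are hypothetical; a production connector would add reconnect and protocol-parsing logic around this:

```python
import time
from collections import deque

class IngestBuffer:
    """Bounded buffer that stamps each message with its local arrival time."""

    def __init__(self, max_size=100_000):
        self.queue = deque(maxlen=max_size)  # oldest message drops when full
        self.dropped = 0

    def on_message(self, raw_msg: bytes, exchange_ts_ns: int):
        # Record our own receive time; never rely on the exchange timestamp alone.
        recv_ts_ns = time.time_ns()
        if len(self.queue) == self.queue.maxlen:
            self.dropped += 1  # count drops so monitoring can alert on them
        self.queue.append((recv_ts_ns, exchange_ts_ns, raw_msg))

    def drain(self):
        """Hand buffered messages to the processing layer in arrival order."""
        while self.queue:
            yield self.queue.popleft()
```

Counting drops explicitly matters: a silently lossy buffer is one of the failure modes discussed later.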

Processing

Transforming raw data into usable form:

  • Normalization: Convert to common format
  • Validation: Check data quality
  • Enrichment: Add derived fields
  • Aggregation: Build bars, compute features
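The normalization and validation steps might look like the following sketch. The raw field names (`sym`, `px`, `qty`, `ts`) are invented vendor fields, not any real feed format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Tick:
    """Common internal format, independent of the source protocol."""
    symbol: str
    price: float
    size: int
    exchange_ts_ns: int

def normalize(raw: dict) -> Optional[Tick]:
    """Convert one (invented) vendor JSON message into a Tick, or reject it."""
    try:
        tick = Tick(
            symbol=raw["sym"].upper(),
            price=float(raw["px"]),
            size=int(raw["qty"]),
            exchange_ts_ns=int(raw["ts"]),
        )
    except (KeyError, ValueError, TypeError):
        return None  # malformed: in practice, route to a dead-letter topic
    # Validation: refuse impossible values instead of passing them downstream.
    if tick.price <= 0 or tick.size <= 0:
        return None
    return tick
```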

Storage

Persisting data for different use cases:

  • Hot storage: In-memory for real-time access
  • Warm storage: Recent history for analysis
  • Cold storage: Full history for backtesting

Consumption

Delivering data to consumers:

  • Push: Stream to subscribers
  • Pull: Query on demand
  • Hybrid: Subscribe to updates, query for history
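A toy in-process illustration of the hybrid pattern, with push and pull served from the same store. A real system would back this with Redis pub/sub or a message bus rather than Python lists:

```python
from collections import defaultdict

class MarketDataHub:
    """Hybrid delivery: push live updates to subscribers, serve history on demand."""

    def __init__(self):
        self.history = defaultdict(list)      # symbol -> ordered updates
        self.subscribers = defaultdict(list)  # symbol -> callbacks

    def subscribe(self, symbol, callback):
        self.subscribers[symbol].append(callback)

    def publish(self, symbol, update):
        self.history[symbol].append(update)  # retained for pull queries
        for cb in self.subscribers[symbol]:  # pushed to live consumers
            cb(update)

    def query(self, symbol, last_n=100):
        """Pull interface: a late joiner backfills history, then subscribes."""
        return self.history[symbol][-last_n:]
```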

Architecture Patterns

Simple Pipeline

For straightforward requirements:

Exchange → Connector → Kafka → Consumer → Strategy
                          ↓
                     TimescaleDB

Components:

  • Exchange connector handles protocol-specific parsing
  • Kafka provides buffering and durability
  • Consumers process and route to destinations
  • TimescaleDB stores time-series history

Pros: Simple, few moving parts
Cons: Limited flexibility, all-or-nothing consumption

Fan-Out Pipeline

For multiple consumers with different needs:

Exchange → Connector → Kafka → [Topic per asset class]
                                    ↓
                    ├─→ Real-time Consumer → Redis (hot)
                    ├─→ Analytics Consumer → TimescaleDB (warm)
                    └─→ Archive Consumer → S3/Parquet (cold)

Pros: Consumers can process independently
Cons: More complexity, potential consistency issues

Event-Driven Architecture

For complex processing requirements:

Exchange → Connector → Event Bus
                          ↓
            ├─→ Normalizer → Normalized Event Bus
            │                     ↓
            │       ├─→ Feature Calculator → Feature Store
            │       ├─→ Risk Calculator → Risk Engine
            │       └─→ Signal Generator → Order Manager
            │
            └─→ Raw Archiver → Cold Storage

Pros: Highly decoupled, each component can scale independently
Cons: Complex, harder to debug, potential latency accumulation

Technology Choices

Message Queues

Apache Kafka:

  • High throughput, durable
  • Good for most trading workloads
  • Typical latency: 1-10ms

NATS:

  • Lower latency than Kafka
  • Less durable (but NATS JetStream adds persistence)
  • Simpler operations

ZeroMQ:

  • Very low latency (microseconds)
  • No broker (peer-to-peer)
  • No durability by default

Aeron:

  • Ultra-low latency (low single-digit microseconds)
  • Designed for trading systems
  • Requires more expertise to operate

For most teams, Kafka is the right default. It's well-understood, widely deployed, and fast enough for all but the most latency-sensitive strategies.

Stream Processing

Kafka Streams:

  • Tight Kafka integration
  • Good for simple transformations
  • Exactly-once semantics

Apache Flink:

  • Powerful windowing and stateful processing
  • Lower latency than batch alternatives
  • More operational complexity

Custom processing:

  • For maximum control and minimum latency
  • More development effort
  • No framework overhead

For trading systems, custom processing often wins. Framework overhead matters when microseconds count.

Storage

Hot tier (in-memory):

  • Redis: Simple key-value, pub/sub
  • Aerospike: Higher throughput, persistence
  • Custom in-process: Lowest latency

Warm tier (recent history):

  • TimescaleDB: PostgreSQL-based, SQL interface
  • QuestDB: Column-oriented, very fast queries
  • InfluxDB: Purpose-built for time-series

Cold tier (full history):

  • Parquet on S3: Columnar, cost-effective
  • Delta Lake: ACID transactions on object storage
  • Apache Iceberg: Modern table format

Design Considerations

Latency vs Throughput

You can optimize a single path for one or the other, but not both simultaneously.

Low latency design:

  • Process immediately, don't batch
  • In-memory everything
  • Direct connections, no intermediaries
  • Single-threaded to avoid lock contention
  • Kernel bypass networking (DPDK, Solarflare OpenOnload)
  • Busy-polling instead of interrupt-driven I/O

High throughput design:

  • Batch for efficiency
  • Compress data
  • Parallelize processing
  • Trade latency for volume

The tension is real: batching improves throughput but adds latency. Compression saves bandwidth but costs CPU cycles. Parallelization increases throughput but introduces coordination overhead.

Most trading systems need both—low latency for the hot path (signal generation, order submission) and high throughput for the warm path (analytics, storage). Design separate paths rather than trying to optimize one path for both.
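One concrete piece of the warm path is micro-batching, which makes the trade-off explicit. In this sketch, `max_batch` buys throughput and `max_wait_s` caps the latency that batching adds; both thresholds are illustrative:

```python
import time

class MicroBatcher:
    """Warm-path batching: flush when the batch fills or a deadline passes.

    Larger max_batch raises throughput; smaller max_wait_s bounds added latency.
    """

    def __init__(self, flush_fn, max_batch=500, max_wait_s=0.050):
        self.flush_fn = flush_fn
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.batch = []
        self.oldest_ts = None  # when the current batch started filling

    def add(self, msg, now=None):
        now = time.monotonic() if now is None else now
        if not self.batch:
            self.oldest_ts = now
        self.batch.append(msg)
        if len(self.batch) >= self.max_batch or now - self.oldest_ts >= self.max_wait_s:
            self.flush()

    def flush(self):
        if self.batch:
            self.flush_fn(self.batch)
            self.batch = []
```

A real implementation would also flush on a timer, so a lull in traffic cannot strand a partial batch indefinitely.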

Reliability

Data loss is expensive. A missed tick during a flash crash could mean a missed trading opportunity—or worse, a risk system that doesn't know your true exposure.

Redundancy: Multiple connections to data sources. Not just primary/backup, but ideally from different network paths. If your primary and backup both go through the same switch, you haven't solved the problem.

Persistence: Don't rely on in-memory only. But understand the latency cost of persistence. Write-ahead logs add microseconds. Synchronous replication adds milliseconds. Know which data must be durable and which can be rebuilt.

Monitoring: Know immediately when something fails. "Immediately" means seconds, not minutes. If your alerting latency is longer than your data staleness tolerance, you'll learn about problems from traders, not dashboards.

Recovery: Ability to replay from checkpoint. This requires careful design—your consumers must be idempotent, your timestamps must be deterministic, and your replay must not affect live trading.
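One way to make replay safe is to track applied sequence numbers so duplicates are skipped. A simplified in-memory sketch; a real system would persist the applied set alongside the checkpoint:

```python
class IdempotentConsumer:
    """Replay-safe consumer: applies each sequence number at most once."""

    def __init__(self, apply_fn):
        self.apply_fn = apply_fn
        self.applied = set()  # in practice: persisted with the checkpoint
        self.checkpoint = 0

    def consume(self, seq: int, msg):
        if seq in self.applied:
            return False  # replay overlap: safe to skip
        self.apply_fn(msg)
        self.applied.add(seq)
        self.checkpoint = max(self.checkpoint, seq)
        return True

    def replay_from(self, log, start_seq):
        """Re-deliver everything from a checkpoint; duplicates are harmless."""
        for seq, msg in log:
            if seq >= start_seq:
                self.consume(seq, msg)
```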

Failure modes to design for:

  • Network partition between data center and exchange
  • Vendor feed going stale (sending data, but old data)
  • Upstream system sending malformed messages
  • Clock drift causing timestamp inconsistencies
  • Memory pressure causing garbage collection pauses
  • Disk full preventing persistence

Scalability

As data volume grows, your pipeline needs to scale. But scaling a real-time system is different from scaling a batch system.

Horizontal scaling: Add more consumers. But this only works if your data is partitionable. Order book updates for a single symbol cannot be parallelized—they must be processed in order.

Partitioning strategies:

  • By symbol: Most common, works well for independent instruments
  • By exchange: Good for multi-venue strategies
  • By asset class: Useful when processing logic differs
  • By client: For multi-tenant systems

The partition key determines your parallelism ceiling. Choose carefully—repartitioning a live system is painful.
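A minimal symbol partitioner, using a stable digest (Python's built-in `hash` is seeded per process, so it cannot be used for routing). Note how the modulus bakes the partition count into every routing decision:

```python
import hashlib

def partition_for(symbol: str, num_partitions: int) -> int:
    """Route all updates for one symbol to one partition, preserving per-symbol order.

    A stable digest is used because Python's hash() varies between processes.
    """
    digest = hashlib.md5(symbol.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Changing `num_partitions` remaps almost every symbol, which is one concrete reason repartitioning a live system is painful.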

Backpressure: Handle bursts without dropping data. Market opens, economic announcements, and flash crashes all produce traffic spikes of 10-100x normal volume. Your system must either buffer (adding latency) or shed load intelligently (dropping less important data first).
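A sketch of priority-based shedding with a bounded buffer: when full, the lowest-priority message is dropped first. The priority scheme (trades above depth updates) is illustrative:

```python
import heapq

class SheddingBuffer:
    """Bounded buffer that sheds the lowest-priority message when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []    # min-heap on priority: lowest priority at the root
        self.counter = 0  # tie-breaker keeps FIFO order within a priority

    def offer(self, priority: int, msg) -> bool:
        item = (priority, self.counter, msg)
        self.counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
            return True
        if priority > self.heap[0][0]:          # newcomer outranks the weakest
            heapq.heapreplace(self.heap, item)  # shed the lowest-priority message
            return True
        return False                            # shed the newcomer instead

    def drain(self):
        """Return surviving messages in arrival order."""
        return [m for _, _, m in sorted(self.heap, key=lambda t: t[1])]
```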

Data Volume

Typical volumes for different data types:

Data Type          Volume per Day     Storage per Year
Daily bars         MB                 GB
Minute bars        GB                 TB
Tick data          Tens of GB         Tens of TB
Full order book    Hundreds of GB     PB

Plan your storage accordingly. Full order book data for US equities alone can exceed 5TB per day uncompressed. Most firms keep full depth for recent history (days to weeks) and progressively downsample older data.

The economics matter: storing a year of tick data costs roughly $10-50K in cloud storage. Storing a year of full order book data costs 10-100x more. Factor in egress costs if you're running backtests that scan historical data.

Operational Realities

What Actually Goes Wrong

In theory, data flows smoothly from exchange to strategy. In practice:

Vendor feeds go stale. The connection stays up, messages keep arriving, but the timestamps stop advancing. Your system thinks it's receiving live data when it's actually receiving delayed or replayed data. We've seen a feed replay 10-minute-old data during a flash crash—the system happily traded on prices that no longer existed. Detection requires comparing feed timestamps to wall clock time—and handling the legitimate case where markets are simply quiet.
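Detection can be sketched as a monitor that compares the newest feed timestamp to wall-clock time. The threshold here is an assumption and would need to vary with market hours:

```python
import time

class StalenessMonitor:
    """Flags a feed whose timestamps stop advancing while messages keep arriving.

    max_staleness_s is an assumed threshold; a real system must also allow for
    legitimately quiet markets (e.g. widen the threshold outside trading hours).
    """

    def __init__(self, max_staleness_s=2.0):
        self.max_staleness_s = max_staleness_s
        self.last_feed_ts = None

    def on_message(self, feed_ts: float):
        # Track the newest feed timestamp seen, even if messages arrive out of order.
        self.last_feed_ts = max(self.last_feed_ts or feed_ts, feed_ts)

    def is_stale(self, now=None) -> bool:
        now = time.time() if now is None else now
        if self.last_feed_ts is None:
            return False  # nothing received yet: a different alarm covers that
        return now - self.last_feed_ts > self.max_staleness_s
```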

Exchanges send bad data. Erroneous prints, crossed markets, trades at impossible prices. Your pipeline needs to either filter these (risking filtering legitimate data) or pass them through with quality flags (requiring downstream systems to handle bad data).

Sequence gaps appear. You receive message 1000, then message 1002. Is message 1001 lost forever, or just delayed? The answer determines whether you should wait, request retransmission, or proceed without it.
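A small tracker for exactly this situation distinguishes gaps that stay open from gaps filled by stragglers:

```python
class GapTracker:
    """Tracks sequence gaps and whether late messages eventually fill them."""

    def __init__(self):
        self.expected = None   # next sequence number we expect
        self.open_gaps = set() # sequence numbers never seen
        self.filled_late = 0   # gaps closed by out-of-order arrivals

    def on_message(self, seq: int):
        if self.expected is None:
            self.expected = seq + 1
            return
        if seq == self.expected:
            self.expected += 1
        elif seq > self.expected:
            # Jumped ahead: everything in between is, for now, missing.
            self.open_gaps.update(range(self.expected, seq))
            self.expected = seq + 1
        elif seq in self.open_gaps:
            self.open_gaps.remove(seq)  # a straggler, not a loss
            self.filled_late += 1
        # else: duplicate, ignore
```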

Timestamps lie. Exchange timestamps reflect when the event occurred at the exchange. Your receipt timestamps reflect when you received it. The difference varies by milliseconds to seconds depending on network conditions. Reconciling these for accurate latency measurement is surprisingly difficult.

Bursts overwhelm buffers. Market opens produce 100x normal message rates. FOMC announcements are worse—we've seen 500x normal traffic in the 100ms after a rate decision. If your buffer fills, you either drop messages (bad) or block upstream (also bad, and may cascade). Proper backpressure design is essential but rarely implemented correctly the first time.

Monitoring That Actually Helps

Generic infrastructure monitoring (CPU, memory, disk) isn't enough. You need domain-specific observability:

Message rates by symbol and exchange. A sudden drop might indicate a feed problem—or a trading halt. You need context to distinguish.

Latency percentiles, not averages. P99 latency matters more than mean latency. A system with 1ms mean latency but 100ms P99 will cause problems that don't show up in average-based dashboards.
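The point is easy to demonstrate with a nearest-rank percentile over synthetic latencies: one slow message in a hundred barely moves the mean but dominates P99:

```python
def percentile(samples, p):
    """Nearest-rank percentile: p=0.99 gives P99."""
    ranked = sorted(samples)
    idx = min(len(ranked) - 1, int(p * len(ranked)))
    return ranked[idx]

# 99 fast messages and one slow one: the mean hides what P99 reveals.
latencies_ms = [1.0] * 99 + [100.0]
mean = sum(latencies_ms) / len(latencies_ms)  # ~2 ms: looks fine
p99 = percentile(latencies_ms, 0.99)          # 100 ms: the real problem
```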

Sequence gap tracking. How many gaps per hour? How long until gaps are filled? Trending upward is a warning sign.

Cross-feed divergence. If you have multiple feeds for the same instruments, they should agree. Divergence indicates a problem with at least one feed.

Consumer lag. How far behind real-time is each consumer? Lag that grows over time indicates a consumer that can't keep up with production rate.

This is a core area where our observability platform adds value: real-time visibility into the pipeline health metrics that matter for trading. But the tuning is the hard part, and it is ongoing work: the metrics must distinguish operational issues from normal market behavior. What counts as abnormal latency changes as your infrastructure evolves. What counts as a suspicious throughput drop depends on time of day and market conditions. The platform provides the visibility; your team provides the judgment about what the numbers mean and when thresholds need adjustment.

Common Architectures

For Small Teams

Vendor Feed → Python Connector → Redis → Strategy
                    ↓
               PostgreSQL

Simple, maintainable, good enough for many use cases. Use a vendor feed to avoid exchange connectivity complexity. Redis for real-time, PostgreSQL for history.

For Medium Teams

Exchange Feeds → Go Connector → Kafka → Consumer Group
                                            ↓
                          ├─→ Redis (latest quotes)
                          ├─→ TimescaleDB (bars, history)
                          └─→ S3 (tick archive)

Multiple exchanges, robust message bus, separate storage tiers. Go or Rust connectors for performance.

For Large Teams

Multiple Exchanges → Custom FPGAs → Low-Latency Bus
                                         ↓
                         ├─→ Strategy Engines (co-located)
                         ├─→ Risk Systems
                         └─→ Archival Pipeline → Data Lake

Hardware-accelerated where latency matters, sophisticated infrastructure throughout. Significant engineering investment.

The Build vs Buy Decision

When Custom Makes Sense

Build custom infrastructure when:

Latency is your edge. If you're competing on speed, every component in your stack is a potential optimization target. Generic solutions optimize for the general case, not your specific case.

Your requirements are unusual. Multi-asset strategies, exotic instruments, or unique data sources may not fit vendor assumptions.

You have the team. Building and operating real-time infrastructure requires specific expertise. If you don't have it, buying buys you time to develop it.

When Vendor Solutions Win

Buy when:

Time-to-market matters more than optimization. A vendor solution that's 80% as good but available today often beats a custom solution that's perfect but takes 18 months.

Your edge is elsewhere. If your alpha comes from better signals, not faster execution, infrastructure is a cost center. Minimize it.

Operational burden exceeds value. Running Kafka, TimescaleDB, and Redis in production requires on-call rotations, upgrade planning, and incident response. Managed services transfer that burden.

The Hybrid Approach

Most mature firms end up with a hybrid: vendor solutions for commodity infrastructure (message queues, databases), custom code for the hot path (connectors, signal generation, order routing).

The key is knowing which is which. Don't build a message queue. Don't buy a trading strategy.

Conclusion

Real-time data pipeline architecture is foundational to trading systems. The choices you make here constrain everything downstream—latency, reliability, scalability.

Start simple. Measure everything. Optimize where it matters.

Most teams over-engineer initially and under-monitor. Build something that works, instrument it thoroughly, and iterate based on data. The teams that succeed are not the ones with the most sophisticated initial architecture—they're the ones who can see what's happening in their pipeline and respond quickly when things go wrong.

That's the value of systematic observability: not eliminating operational work, but changing it from reactive debugging to proactive refinement. When you can see latency percentiles, message rates, and consumer lag in real-time, you catch problems before they cascade. When you have historical baselines, you can tune alerting thresholds based on data rather than guesswork. The infrastructure requires ongoing attention—but instrumented infrastructure tells you where to focus that attention.

If you need help designing data infrastructure for trading systems—or building the observability layer to understand what's actually happening—reach out. We've built these systems at multiple scales and can help you make the right trade-offs for your situation.

Frequently Asked Questions

What is the best message queue for trading systems?
For most teams, Apache Kafka is the right default—it's well-understood, widely deployed, and fast enough for all but ultra-low-latency strategies (1-10ms typical latency). For sub-millisecond requirements, consider ZeroMQ (microseconds) or Aeron (low single-digit microseconds, designed for trading systems). NATS offers lower latency than Kafka with simpler operations.
How do you handle market data bursts at market open?
Market opens can produce 100x normal message rates; FOMC announcements can spike to 500x. Design for backpressure: either buffer (adding latency) or shed load intelligently (dropping less important data first). Your system must handle bursts without dropping critical data or blocking upstream.
What latency should I target for trading data pipelines?
It depends on your strategy. High-frequency trading requires microseconds. Medium-frequency strategies need milliseconds. Lower-frequency strategies can tolerate seconds. Risk monitoring systems need sub-second updates. Even if your strategy doesn't require ultra-low latency, your risk systems do.
Should I build or buy trading data infrastructure?
Build custom when latency is your competitive edge, your requirements are unusual, or you have the specialized team. Buy when time-to-market matters more than optimization, your alpha comes from signals not execution, or operational burden exceeds value. Most mature firms use a hybrid: vendor solutions for commodity infrastructure, custom code for the hot path.