Market Data Hygiene Part 1: Statistical Methods for Detecting Bad Data
10 minute read (2,546 words)
May 9th, 2026

A single bad print in your tick data can cascade through feature calculations, corrupt a backtest, or trigger an erroneous live trade. We've seen a single erroneous tick—a price that printed 5% away from the market for one trade before reverting—flow through a momentum feature and generate a signal that cost real money. The exchange later cancelled the trade, but our system had already acted on it. The challenge isn't writing code to filter data—it's knowing what to filter and why.
This three-part series covers the methods for detecting problematic market data:
- Part 1 (this post): Statistical methods for detecting point anomalies and systematic errors
- Part 2: Cross-Validation and Contextual Analysis: Cross-asset checks, time-based patterns, venue considerations, and multi-source triangulation
- Part 3: Reference Data and Historical Integrity: Corporate actions, point-in-time correctness, and building a validation framework
These are also the methods we're building into the market data monitoring layer of our observability platform. But we want to be clear upfront: data hygiene isn't a problem you solve once. It requires ongoing tuning, domain-specific calibration, and human judgment to distinguish legitimate anomalies from errors.
That's precisely why systematizing these checks matters. Without a platform, data quality work is reactive—you discover problems when a backtest fails or a trade goes wrong. With systematic monitoring, you shift from firefighting to continuous improvement. Your team (or your agents) can focus on tuning thresholds, investigating edge cases, and improving detection logic rather than manually scanning for obvious errors. The goal isn't to eliminate human judgment—it's to make that judgment count.
The Problem With Naive Approaches
The obvious approach to outlier detection—flag anything more than N standard deviations from the mean—fails immediately with market data.
Prices are non-stationary. NVDA went from $49 to $144 in 2024. That's not an outlier; it's a trend. Computing z-scores on raw prices is meaningless.
Volatility clusters. Financial returns exhibit heteroskedasticity—volatility itself is volatile. A 5% daily move in SPY during March 2020? Normal. A 5% move in August 2023? That would have been front-page news. Static thresholds miss this entirely.
The distribution isn't normal. Returns have fat tails. The October 1987 crash was a 20+ sigma event under Gaussian assumptions—something that shouldn't happen once in the lifetime of the universe. It happened on a Monday afternoon. Z-scores based on normal distributions will either flag too much or too little.
Effective data hygiene requires methods that account for these properties.
Detecting Bad Prints
A "bad print" is an erroneous trade price—typically from a reporting error, a fat-finger trade that was later cancelled, or exchange system issues. They appear as spikes that immediately revert. We've seen them range from the subtle (a $100.05 print when the market was at $100.00) to the absurd (Accenture briefly trading at $0.01 during the 2010 Flash Crash before the exchange cancelled the trades).
The Tick Test
Compare each trade to its immediate neighbors. A bad print typically:
- Deviates significantly from both the previous and next trade
- Reverts immediately (the next trade is back near the previous level)
- Often occurs in isolation (single trade, not a sequence)
The basic structure: flag trade t if it deviates sharply from P(t-1) but P(t+1) returns near P(t-1). The reversion condition distinguishes a bad print from a legitimate gap where prices stay at the new level.
Threshold selection matters. A fixed price threshold fails across instruments with different price levels and volatilities. Practical implementations typically use:
- Percentage of price (e.g., 1-2% deviation)
- Multiple of recent volatility (e.g., 5× the rolling standard deviation of tick-to-tick returns)
- Multiple of the current bid-ask spread (e.g., 10× the spread)
Volatility-adjusted thresholds adapt to market conditions—what's anomalous in a quiet market may be normal during a selloff.
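Here's a minimal sketch of a volatility-scaled tick test in Python. The constants (k, the 20-observation minimum history, and the reversion rule of "net move under half the spike") are illustrative starting points, not recommendations:

```python
import numpy as np

def flag_bad_prints(prices: np.ndarray, k: float = 5.0, window: int = 100) -> np.ndarray:
    """Flag spike-and-revert prints: trade i jumps more than k rolling
    sigmas away from trade i-1, and trade i+1 returns near trade i-1."""
    log_p = np.log(prices)
    returns = np.diff(log_p)                       # returns[j] = log(P[j+1] / P[j])
    flags = np.zeros(len(prices), dtype=bool)
    for i in range(1, len(prices) - 1):
        hist = returns[max(0, i - 1 - window): i - 1]  # returns before the move into trade i
        if len(hist) < 20:
            continue                               # not enough history to estimate sigma
        sigma = hist.std()
        if sigma == 0:
            continue
        jump = abs(returns[i - 1])                 # move into trade i
        net = abs(log_p[i + 1] - log_p[i - 1])     # net move once the next trade prints
        if jump > k * sigma and net < 0.5 * jump:  # large spike that mostly reverts
            flags[i] = True
    return flags
```

Note the circularity: the rolling sigma is itself contaminated by the outliers you're hunting, which is one argument for the robust estimators discussed later in this post.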
Bid-ask bounce creates false positives. A trade at the ask followed by one at the bid looks like a spike-and-revert pattern, but both trades are legitimate. Filtering by spread multiples rather than absolute deviations helps, but the cleanest solution is comparing trades to concurrent quotes rather than to each other.
Not all single trades are errors. Large institutional block trades, opening auction prints, and closing crosses can legitimately appear as isolated trades at prices away from the prior tick. Context matters: trade size, venue (is this from a block facility?), and time of day (is this the opening print?) all inform whether isolation is suspicious or expected.
The tick test is a useful first-pass filter, but it's not sufficient alone. Combine it with the NBBO comparison, trade clustering analysis, and time-of-day awareness described below.
Comparison to NBBO
For US equities, compare trade prices to the National Best Bid and Offer (NBBO) at the time of the trade. Trades executing significantly outside the NBBO are suspicious:
- Trades above the ask by more than a few ticks
- Trades below the bid by more than a few ticks
- Trades during locked or crossed markets (requires additional context)
This requires synchronized quote data, which adds complexity but dramatically improves detection accuracy.
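A sketch of the check, assuming trades and quotes are both sorted by timestamp so pd.merge_asof can attach the prevailing NBBO to each trade (column names are illustrative):

```python
import pandas as pd

def flag_outside_nbbo(trades: pd.DataFrame, quotes: pd.DataFrame,
                      tick_size: float = 0.01, max_ticks: int = 3) -> pd.Series:
    """Flag trades printing more than max_ticks outside the prevailing NBBO.
    Assumes both frames are sorted by a 'ts' column; quotes carries the
    NBBO as 'bid'/'ask', trades carries 'price'."""
    merged = pd.merge_asof(trades, quotes, on="ts")  # quote prevailing at each trade
    tol = max_ticks * tick_size
    above = merged["price"] > merged["ask"] + tol
    below = merged["price"] < merged["bid"] - tol
    crossed = merged["bid"] > merged["ask"]          # crossed markets: handle separately
    return (above | below) & ~crossed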
Trade Clustering Analysis
Legitimate large price moves typically involve multiple trades. A price spike supported by a single trade, followed by trades at the prior level, is almost certainly erroneous.
Look at the volume and trade count around suspected anomalies:
- Single trade at anomalous price → likely bad
- Cluster of trades walking up/down → likely legitimate
- Anomalous price with no volume → data error, not a trade
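These heuristics reduce to counting trades and volume near the suspect price. A sketch, assuming a trades frame with a sorted DatetimeIndex and price/size columns (the window and level tolerance are illustrative):

```python
import pandas as pd

def cluster_stats(trades: pd.DataFrame, idx: int, window: str = "1s",
                  level_tol: float = 0.001) -> dict:
    """Summarize activity around a suspected anomaly at trades.iloc[idx].
    A lone trade with little other volume near its price level is more
    likely a bad print than a cluster walking to a new level."""
    t = trades.index[idx]
    price = trades["price"].iloc[idx]
    nearby = trades.loc[t - pd.Timedelta(window): t + pd.Timedelta(window)]
    at_level = nearby[(nearby["price"] / price - 1).abs() < level_tol]
    return {
        "trades_at_level": len(at_level) - 1,    # excluding the suspect trade itself
        "volume_at_level": at_level["size"].sum() - trades["size"].iloc[idx],
    }
```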
Cross-Venue Validation
If a stock trades at $50 on NYSE but shows $55 on NASDAQ at the same millisecond, one of them is wrong. Cross-venue comparison catches errors that single-venue analysis misses.
This is particularly useful for:
- Detecting vendor-specific data issues
- Identifying stale quotes from slow venues
- Catching consolidation/aggregation errors in multi-venue feeds
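A minimal version of the check, assuming per-venue price series already resampled onto a common time grid (that upstream alignment step is where most of the real work lives):

```python
import pandas as pd

def cross_venue_disagreement(venue_prices: dict, tolerance: float = 0.002) -> pd.Series:
    """Flag timestamps where venues disagree by more than `tolerance`
    (as a fraction of price). venue_prices maps venue name -> price
    series on a shared time grid."""
    panel = pd.DataFrame(venue_prices)
    spread = (panel.max(axis=1) - panel.min(axis=1)) / panel.median(axis=1)
    return spread > tolerance
```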
Our platform automates these cross-venue checks, flagging discrepancies for review. But automation surfaces candidates—it doesn't make final judgments. A human still needs to determine whether a flagged discrepancy is a data error, a legitimate but unusual trade, or a venue-specific timing artifact. The value is in reducing the haystack, not in eliminating the need for judgment.
Working With Returns, Not Prices
Most statistical methods work better on returns than prices. Returns are (approximately) stationary, making standard statistical techniques applicable.
Log Returns vs Simple Returns
Log returns are additive across time and symmetric around zero, making them preferable for statistical analysis:
r(t) = ln(P(t) / P(t-1))
For short intervals and small moves, log returns ≈ simple returns (a 1% simple return is a 0.995% log return). For longer intervals or larger moves, the difference matters: a 50% gain is a simple return of 0.50 but a log return of ln(1.5) ≈ 0.405.
Return-Based Outlier Detection
Once you're working with returns, you can apply outlier detection more sensibly. But standard deviation still isn't the right measure.
Median Absolute Deviation (MAD) resists outlier influence better than standard deviation:
MAD = median(|r(i) - median(r)|)
Flag returns where |r - median| > k × MAD. Unlike standard deviation, MAD isn't inflated by the very outliers you're trying to detect—but this robustness has limits.
When MAD breaks down. MAD assumes outliers are a small fraction of the sample. Its formal breakdown point is high (50%), but in practice clustered contamination well below that, say 10% of bad prints on the same side, inflates the MAD itself, raising the threshold and masking the very outliers you want to catch. For severely corrupted datasets, you may need iterative approaches: detect and remove the obvious outliers first, then recompute MAD on the cleaned data.
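A minimal MAD filter, with the conventional 1.4826 factor that rescales MAD to match the standard deviation under normality, so k is roughly comparable to a z-score threshold (choosing k is discussed next):

```python
import numpy as np

def mad_outliers(returns: np.ndarray, k: float = 4.0) -> np.ndarray:
    """Flag returns more than k robust standard deviations from the median.
    Tails are fatter than normal, so treat k as a tunable knob, not a
    probability statement."""
    med = np.median(returns)
    mad = np.median(np.abs(returns - med))
    if mad == 0:                       # degenerate: over half the sample is identical
        return np.zeros(len(returns), dtype=bool)
    robust_z = (returns - med) / (1.4826 * mad)
    return np.abs(robust_z) > k
```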
Choosing k is a tradeoff, not a formula. The threshold multiplier k controls sensitivity. Lower k (e.g., 3) catches more anomalies but flags more legitimate extreme moves. Higher k (e.g., 5+) misses fewer real moves but lets more bad data through. There's no universally correct value.
The right k depends on your downstream use case. For risk calculations, false negatives (missing bad data) are dangerous—use lower k and accept more false positives. For signal generation, false positives (discarding real moves) destroy alpha—use higher k and accept some bad data.
Tail thickness varies by asset class. Large-cap equities have relatively well-behaved tails. Small-caps are wilder. Crypto and frontier markets have fat tails that make even k=5 aggressive. A move that would be a once-per-millennium, 10-sigma event for Treasury bonds happens in BTC multiple times per year. Calibrate k to your asset class, not to a universal constant.
Interquartile Range (IQR) methods have similar properties. Flag returns outside [Q1 - k×IQR, Q3 + k×IQR]. The same caveats apply: IQR resists moderate contamination but fails under heavy contamination, and k must be calibrated to the specific asset's return distribution.
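The IQR version is equally short. The classic k=1.5 "boxplot" rule is far too aggressive for fat-tailed returns; the default below is illustrative only:

```python
import numpy as np

def iqr_outliers(returns: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Flag returns outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(returns, [25, 75])
    iqr = q3 - q1
    return (returns < q1 - k * iqr) | (returns > q3 + k * iqr)
```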
Volatility-Adjusted Thresholds
A 3% daily move means different things in different volatility regimes. Normalize by recent realized volatility:
z(t) = r(t) / σ(t-1)
Where σ(t-1) is estimated from recent data. This flags moves that are unusual relative to current conditions, not unusual relative to all history.
Volatility estimation options, in order of complexity:
Rolling standard deviation (e.g., 20-day window) is the simplest approach. It's transparent and stable, but treats all observations in the window equally and adapts slowly to regime changes. A volatility spike takes 20 days to fully enter the estimate.
Exponentially weighted moving average (EWMA) variance weights recent observations more heavily, adapting faster to changing conditions. The RiskMetrics standard uses λ=0.94 for daily data: σ²(t) = λσ²(t-1) + (1-λ)r²(t). EWMA is nearly as simple as rolling windows but responds more quickly to volatility shifts.
GARCH models capture volatility clustering more formally, but introduce practical friction often glossed over:
- Parameter estimation requires sufficient history and can be sensitive to the very outliers you're trying to detect—a circularity problem
- Model specification choices (GARCH(1,1) vs. other variants) affect results
- Estimation can be unstable for assets with limited history or unusual dynamics
For data quality filtering, EWMA often hits the sweet spot: more responsive than rolling windows, simpler than GARCH, and no parameter estimation beyond choosing λ. Save GARCH for applications where the additional modeling precision justifies the complexity.
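A sketch of EWMA-normalized scores with no lookahead: each return is divided by volatility estimated from prior data only. The warmup length is an arbitrary choice, and the seed window is itself contaminated by any outliers inside it, the same circularity noted above:

```python
import numpy as np

def ewma_zscores(returns: np.ndarray, lam: float = 0.94, warmup: int = 20) -> np.ndarray:
    """Score each return against EWMA volatility estimated from prior
    data only. lam=0.94 is the RiskMetrics daily value."""
    z = np.full(len(returns), np.nan)
    var = np.var(returns[:warmup])          # seed the variance from a warmup window
    for t in range(warmup, len(returns)):
        z[t] = returns[t] / np.sqrt(var) if var > 0 else np.nan  # var reflects data through t-1
        var = lam * var + (1 - lam) * returns[t] ** 2            # fold in today's return
    return z
```

Flag |z| above your chosen threshold, with the same asset-class calibration caveats as k for MAD.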
Detecting Systematic Errors
The methods above focus on point anomalies—individual bad prints or outlier returns. But some data quality problems are systematic: they affect many data points in a consistent, biased way. These are often more damaging than point anomalies because they don't trigger outlier detection and can persist unnoticed for extended periods.
Stale Data Detection
A feed that stops updating but continues delivering the last known value is dangerous. Your system thinks it has live data when it actually has frozen data. We've seen this happen at the worst possible times—a feed freezing during a volatility spike, with the system happily trading on 10-minute-old prices while the market moved 2%.
Timestamp analysis: Compare the data timestamp to wall clock time. If the gap grows consistently, the feed is falling behind. If the timestamp stops advancing entirely, the feed is stale.
Price staleness: If a normally active instrument shows the same price for an unusually long period, investigate. For liquid equities during market hours, unchanged prices for more than a few seconds may indicate staleness. For less liquid instruments, calibrate expectations accordingly.
Quote staleness: Bid and ask should update frequently for active instruments. A quote that hasn't changed while the underlying market moved is suspect.
Cross-feed comparison: If one feed shows price movement while another shows flat prices for the same instrument, one of them is stale. This is one of the strongest staleness detection methods.
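Of these, price staleness is the easiest to mechanize: it reduces to finding runs where a series stays flat for too long. A sketch, assuming observations on a sorted DatetimeIndex, with the limit calibrated per instrument:

```python
import pandas as pd

def stale_price_runs(prices: pd.Series, max_unchanged: pd.Timedelta) -> pd.DataFrame:
    """Find runs where a price series stays flat longer than max_unchanged
    (seconds for liquid names, much longer for illiquid ones)."""
    run_id = prices.ne(prices.shift()).cumsum()   # increments whenever the price changes
    runs = prices.groupby(run_id).agg(
        start=lambda s: s.index[0],
        end=lambda s: s.index[-1],
    )
    runs["duration"] = runs["end"] - runs["start"]
    return runs[runs["duration"] > max_unchanged]
```

Usage would look like stale_price_runs(px, pd.Timedelta("5s")) for a liquid equity during market hours.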
Bid-Ask Inversion and Flip Errors
In valid market data, bid < ask (or bid ≤ ask in locked markets). Persistent bid > ask indicates a data error—either the labels are swapped, or one side is stale.
Detection: Flag any period where bid exceeds ask by more than a small tolerance for more than a brief moment. Momentary crosses can occur legitimately during fast markets, but persistent inversion is a data problem.
Root causes: Common sources include field mapping errors in feed handlers, timezone/timestamp issues causing misaligned quotes, or stale quotes on one side.
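A sketch of the persistence check, assuming quotes on a sorted DatetimeIndex; the 100ms default is illustrative and should be tuned to the instrument:

```python
import pandas as pd

def crossed_quote_runs(quotes: pd.DataFrame, tolerance: float = 0.0,
                       min_duration: str = "100ms") -> pd.DataFrame:
    """Find sustained bid > ask periods. Momentary crosses in fast markets
    pass; persistent inversions are returned for investigation."""
    crossed = quotes["bid"] > quotes["ask"] + tolerance
    run_id = crossed.ne(crossed.shift()).cumsum()
    out = []
    for _, grp in quotes[crossed].groupby(run_id[crossed]):
        duration = grp.index[-1] - grp.index[0]
        if duration >= pd.Timedelta(min_duration):
            out.append({"start": grp.index[0], "end": grp.index[-1], "duration": duration})
    return pd.DataFrame(out)
```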
Systematic Rounding and Truncation
Some feeds round prices or sizes in ways that introduce bias:
Price rounding: A feed that rounds or truncates to fewer decimal places than the instrument actually trades introduces systematic error. If the true price is $10.125 but your feed truncates to $10.12, you have a consistent negative bias.
Size truncation: Some feeds truncate large sizes or round to lot boundaries. A 15,000 share trade reported as 10,000 understates volume systematically.
Detection: Compare your feed to a reference source at high precision. Systematic differences in the least significant digits indicate rounding. Volume comparisons across sources reveal truncation.
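Both checks fall out of the distribution of feed-minus-reference differences. A sketch, assuming the two series are already aligned on the same trades:

```python
import numpy as np
import pandas as pd

def precision_diagnostics(feed: pd.Series, reference: pd.Series) -> dict:
    """Compare feed prices to a higher-precision reference. Truncation
    shows up as a one-sided (negative) difference; symmetric rounding
    shows a zero-mean but quantized difference."""
    diff = (feed - reference).dropna()
    return {
        "mean_diff": diff.mean(),                           # non-zero -> systematic bias
        "frac_negative": (diff < 0).mean(),                 # near 1.0 -> truncation
        "distinct_diffs": np.sort(diff.round(6).unique()),  # quantized difference levels
    }
```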
Timestamp Drift and Ordering Errors
Clock drift: If your data source's clock drifts relative to true time, timestamps become unreliable. Events that appear simultaneous may have actually occurred seconds apart, or vice versa.
Detection: Compare timestamps across multiple sources for the same events. If source A consistently timestamps events 500ms before source B, one of them has clock issues.
Ordering errors: Some feeds deliver messages out of order, especially under load. A trade that appears to execute before its triggering quote indicates an ordering problem.
Sequence gaps: Missing sequence numbers indicate dropped messages. A feed that frequently gaps under load has reliability problems that may correlate with high-volatility periods—exactly when data quality matters most.
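For clock drift specifically, here's a sketch of the cross-source comparison, assuming both sources' timestamps can be keyed to the same events (say, by exchange sequence number; that join is an assumption of this example):

```python
import pandas as pd

def relative_clock_offset(ts_a: pd.Series, ts_b: pd.Series,
                          window: int = 1000) -> pd.Series:
    """Estimate relative clock offset between two sources timestamping
    the same events. Both Series hold timestamps and share an index
    keyed by event."""
    aligned = pd.concat({"a": ts_a, "b": ts_b}, axis=1).dropna()
    offset = (aligned["a"] - aligned["b"]).dt.total_seconds()
    # A stable non-zero median is a fixed offset; a trending one is drift.
    return offset.rolling(window, min_periods=window // 10).median()
```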
Detecting Systematic Bias Statistically
For biases that affect prices or returns consistently:
Mean comparison: If your feed's prices are systematically higher or lower than a reference, calculate the mean difference over a representative period. A non-zero mean indicates bias.
Correlation breakdown: Instruments that track each other closely (e.g., SPY and ES futures) should show return correlations near 1.0. Lower correlation may indicate data quality issues in one or both feeds.
Distribution comparison: Plot the distribution of differences between your feed and a reference. It should be centered at zero with symmetric tails. Skewness or non-zero center indicates systematic bias.
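These three checks fit in one small diagnostic, assuming feed and reference return series aligned on timestamps; acceptable bounds for each metric need per-instrument, per-feed tuning:

```python
import pandas as pd

def bias_report(feed_returns: pd.Series, ref_returns: pd.Series) -> dict:
    """Basic bias diagnostics between a feed and a reference."""
    aligned = pd.concat({"feed": feed_returns, "ref": ref_returns}, axis=1).dropna()
    diff = aligned["feed"] - aligned["ref"]
    return {
        "mean_diff": diff.mean(),                                # level bias
        "return_correlation": aligned["feed"].corr(aligned["ref"]),
        "diff_skew": diff.skew(),                                # asymmetric errors
    }
```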
Our platform monitors for these systematic patterns and tracks bias metrics over time. But "expected bounds" must be defined per-instrument and per-feed, tuned based on historical behavior, and updated as market conditions change. This isn't set-and-forget—it's ongoing calibration.
Summary
This post covered statistical methods for detecting both point anomalies and systematic errors in market data:
- Bad print detection: Tick tests, NBBO comparison, trade clustering, cross-venue validation
- Return-based outlier detection: MAD, IQR, and volatility-adjusted thresholds—with their limitations
- Systematic error detection: Staleness, bid-ask inversion, rounding, timestamp drift, and statistical bias tests
In Part 2, we'll cover cross-asset validation, time-based patterns, venue-specific considerations, and multi-source triangulation—the contextual checks that catch errors statistical methods miss.
If you need help building data quality infrastructure for your trading systems, contact us. This is core to what we do.