Architecture¶
Pipeline stages¶
ob-analytics turns order event streams and authoritative trade records into structured analytics:
| Stage | What happens |
|---|---|
| Load & normalize | Parse Bitstamp CSV or LOBSTER message file into a uniform event DataFrame |
| Build trades | Bitstamp: read companion trades.csv (live capture). LOBSTER: extract type 4/5 executions from the events frame via LobsterTradeReader |
| Classify orders | Label each order as market, resting-limit, flashed-limit, market-limit, or unknown |
| Depth & metrics | Track price-level volume, best bid/ask, spread, and liquidity in configurable BPS bins. LOBSTER can use the official orderbook file for ground-truth depth |
| Flow toxicity (post-run) | VPIN, Kyle's lambda, order-flow imbalance computed from result.trades |
| Visualize / export | Depth heatmaps, event maps, trade charts, flow-toxicity plots, HTML galleries. Matplotlib (default) or Plotly backend. Parquet and LOBSTER round-trip I/O |
Design decisions¶
- DataFrames end to end. Pandas for speed; the column-list constants and
validate_*functions inschemas.pydocument the column contracts. - Two API levels —
Pipelinefor one-line runs; individual classes (BitstampLoader,BitstampTradeReader, etc.) for step-by-step control. - Pluggable everything — any object with the right method signature works;
no inheritance required (structural typing via
Protocol). - Per-run parameters live on
RunContext, notFormat. Construction-time parameters (schema choice, fixed venue config) belong on theFormatctor; per-session parameters (trading date, write-time level cap) live onRunContextand are passed toPipeline(format=..., ctx=...). This keeps formats reusable across multiple runs without re-instantiation and avoids baking session state into long-lived objects.
Scale envelope¶
ob-analytics keeps the full event, depth, and trade tables in memory (pandas), so
the working set scales with the event count. Peak RSS grows roughly linearly at
~1 GiB per 1M events, with the depth stages (price_level_volume →
depth_metrics) dominating both memory and time:
| events | peak RSS | depth stages |
|---|---|---|
| 314 k | ~0.43 GiB | ~14 s |
| 628 k | ~0.73 GiB | ~25 s |
| 942 k | ~1.02 GiB | ~38 s |
| 1.26 M | ~1.32 GiB | ~51 s |
(Measured with scripts/bench_scale.py: the bundled
~314k-event sample tiled to each size, each size run in its own process, peak RSS
via getrusage. Slightly conservative — tiling adds some transient overhead.)
Comfortable ceiling ≈ 5M events (~5 GiB) on a typical 16 GB machine — i.e. session-scale data, a few hours of a single liquid instrument. For larger inputs (a full NASDAQ MBO day is 10–100M events, well past this), the recommended pattern is to pre-slice by time window and process each slice independently, concatenating the per-slice outputs. ob-analytics deliberately ships no streaming / out-of-core machinery; pre-slicing keeps the in-memory model simple and predictable (a chunked-run helper could automate it later if real workloads demand it).
Class diagram¶
The package combines protocol-based components with format descriptors that bundle venue-specific defaults.
classDiagram
class Pipeline {
+config: PipelineConfig
+loader: EventLoader
+trade_source: TradeSource
+writer: DataWriter | None
+run(source) PipelineResult
+from_format(name, **kwargs)$ Pipeline
}
class PipelineConfig {
+price_decimals: int
+volume_decimals: int
+price_divisor: int
+timestamp_unit: str
+depth_bps: int
+depth_bins: int
}
class Format {
<<Protocol>>
+name: str
+create_loader(config, ctx) EventLoader
+create_trade_source(config, ctx) TradeSource
+create_writer(config, ctx) DataWriter | None
+compute_depth(events, config, source, ctx) tuple | None
+config_defaults() dict
}
class RunContext {
+trading_date: object | None
}
class BitstampFormat
class LobsterFormat
class EventLoader {
<<Protocol>>
+load(source) DataFrame
}
class TradeSource {
<<Protocol>>
+load(events, source) DataFrame
}
class DataWriter {
<<Protocol>>
+write(data, dest, **kwargs)
}
class PipelineResult {
+events: DataFrame
+trades: DataFrame
+depth: DataFrame
+depth_summary: DataFrame
+config: PipelineConfig
}
Pipeline --> PipelineConfig
Pipeline --> EventLoader
Pipeline --> TradeSource
Pipeline --> DataWriter
Pipeline --> PipelineResult
Pipeline ..> Format : optional
Pipeline ..> RunContext : per-run params
Format <|.. BitstampFormat
Format <|.. LobsterFormat
Data formats¶
| Format | Entry point | Trades |
|---|---|---|
| Bitstamp CSV | Pipeline() (default) |
Companion trades.csv next to orders.csv (e.g. scripts/collect_bitstamp_btcusd.py) |
| LOBSTER | Pipeline(format=LobsterFormat(), ctx=RunContext(trading_date=...)) |
Embedded execution rows (types 4/5) in the message file |
The bundled sample under ob_analytics/_sample_data/ is a modern BTC/USD
capture (orders.csv + trades.csv).
Module map¶
ob_analytics/
├── __init__.py # Public API surface + format registration + sample_csv_path()
├── _sample_data/ # Bundled Bitstamp sample (orders.csv + trades.csv)
├── pipeline.py # Pipeline, PipelineResult, register_format
├── config.py # PipelineConfig (frozen Pydantic model)
├── protocols.py # EventLoader, TradeSource, DataWriter, Format
├── schemas.py # column constants + validators (validate_events_df, …)
├── exceptions.py # ObAnalyticsError hierarchy
├── cli.py # CLI entry point (process, gallery, bitstamp-demo, lobster-demo, capture)
│
├── bitstamp.py # BitstampLoader, BitstampTradeReader, BitstampWriter, BitstampFormat
├── lobster.py # LobsterLoader, LobsterTradeReader, LobsterWriter, LobsterFormat
├── analytics.py # order_aggressiveness, trade_impacts, set_order_types, order_book
├── depth.py # DepthMetricsEngine, price_level_volume, depth_metrics, get_spread
├── data.py # save_data, load_data, writer registry
├── flow_toxicity.py # compute_vpin, compute_kyle_lambda, order_flow_imbalance, KyleLambdaResult
├── _utils.py # Validation, numerics, timestamp conversion helpers
│
├── live/ # Optional live order-book capture ([live] extra)
│ ├── __init__.py # registry: register_capturer, list_capturers, get_capturer
│ ├── _base.py # LiveCapturer protocol, CaptureConfig, CaptureResult, CaptureSink
│ ├── _runner.py # Generic asyncio driver + FileCaptureSink
│ └── bitstamp.py # BitstampCapturer (built-in)
│
└── visualization/ # Plotting subsystem
├── __init__.py # plot() dispatcher + RENDERERS registry, PlotTheme, save_figure
├── gallery.py # HTML gallery generation
├── _data.py # Shared data prep for plot backends
├── _matplotlib.py # Matplotlib renderers
└── _plotly.py # Plotly renderers
ob_analytics.live is an optional async sub-package (install with
pip install "ob-analytics[live]") that captures live exchange data into
the same CSV schema the pipeline reads. Implement the LiveCapturer
protocol -- three async-iterator methods (snapshot, stream,
shutdown_synthetic_events) -- and call register_capturer to add a
new venue; the registry pattern mirrors Format and register_writer.
The runner (run_capturer) handles persistence, raw-frame archival,
signal handling, and meta.json finalisation so capturer authors only
write the per-venue parser.