Starting a Dota Replay Action Decoder

Published March 2026 · 10 min read

I am starting a package for decoding detailed player actions from Dota 2 .dem replay files. Existing parsers like Clarity are strong on broad match summaries, but my target is different: per-player action sequences with rich context.

The end goal is a large-scale dataset for behavior analysis and learning hero pick/ban strategy, eventually pushing toward match result prediction from drafting plus early-game signals.

I have wanted this for a while because most public replay pipelines answer "what happened?" but not "why did this player decide to do that here?" I want the second question to be first-class.

What I want this project to do

Architecture I am building toward

The architecture is intentionally layered so low-level parsing and high-level feature extraction can evolve independently.

The IR contract (the important part)

The IR is the contract between decoding and higher-level logic. It needs to be stable, versioned, and expressive enough that future features rarely force decoder rewrites.

Replay file
  -> Rust decoder/parsing core
  -> IR stream (metadata, draft, ticks, entities, events, user messages)
  -> action extraction modules
  -> context enrichment
  -> Parquet/Arrow tables
  -> Python analysis + modeling
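To make the IR stage concrete, here is one way the stream could be typed as a tagged union of record kinds. This is a sketch only; the record names and fields are my assumptions, not a final schema.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical IR record kinds mirroring the stream stages above.
# Names and fields are illustrative, not the final contract.

@dataclass(frozen=True)
class MatchMetadata:
    match_id: int
    game_build: int

@dataclass(frozen=True)
class DraftEvent:
    tick: int
    is_pick: bool        # pick vs. ban
    team_id: int
    hero_id: int

@dataclass(frozen=True)
class EntityUpdate:
    tick: int
    entity_id: int
    fields: dict         # changed properties only (delta-style update)

# A decoded replay becomes an ordered stream of IR records.
IRRecord = Union[MatchMetadata, DraftEvent, EntityUpdate]
```

Frozen dataclasses keep IR records immutable, which makes it safe for several downstream feature modules to read the same stream.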

Core IR entities:

Unified PlayerAction schema

Every derived event maps into a shared PlayerAction record:

PlayerAction {
  match_id
  tick
  time_seconds
  player_slot
  team_id
  hero_id
  action_type
  action_args
  context_version
}
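A direct Python rendering of this schema might look like the following. The field types are my assumptions; the Radiant/Dire team IDs follow Dota 2's usual conventions.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)
class PlayerAction:
    match_id: int
    tick: int
    time_seconds: float
    player_slot: int                  # 0-9 across both teams
    team_id: int                      # 2 = Radiant, 3 = Dire by Dota 2 convention
    hero_id: int
    action_type: str                  # e.g. "ability_cast", "item_use"
    action_args: dict[str, Any] = field(default_factory=dict)
    context_version: int = 1          # bumped whenever enrichment semantics change
```

Keeping `action_args` as an open dict lets each action module attach its own payload without changing the shared record shape, while `context_version` makes enrichment changes auditable.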

Initial action modules include draft actions, ability casts, movement decisions, item usage, ward interactions, and camera/click patterns. These modules are independent and configurable.
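One way to keep the modules independent and configurable is a small registry: each module registers an extractor, and a run step invokes only the enabled ones. A minimal sketch, with the module name, record shape, and `extract` signature all hypothetical:

```python
# Registry of independent action modules; only the configured subset runs.
ACTION_MODULES = {}

def action_module(name):
    """Register an extractor that turns IR records into action dicts."""
    def register(fn):
        ACTION_MODULES[name] = fn
        return fn
    return register

@action_module("ability_casts")
def extract_ability_casts(ir_records):
    # Illustrative record shape: plain dicts with a "kind" tag.
    for rec in ir_records:
        if rec.get("kind") == "ability_cast":
            yield {"action_type": "ability_cast",
                   "tick": rec["tick"],
                   "action_args": {"ability_id": rec["ability_id"]}}

def run_modules(ir_records, enabled=("ability_casts",)):
    """Run only the configured modules over one replay's IR stream."""
    records = list(ir_records)   # materialize once so every module sees the full stream
    for name in enabled:
        yield from ACTION_MODULES[name](records)
```

Adding a new action type then means registering one new extractor, with no changes to the pipeline core.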

How I want to store the data

The core dataset should be analytics-native and language-interoperable from day one.
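Independent of the serializer, the row-to-column transpose that Arrow tables and Parquet row groups are built on can be sketched with plain dicts:

```python
def to_columns(actions):
    """Transpose row-oriented action dicts into column arrays,
    the layout that Arrow tables and Parquet row groups expect."""
    if not actions:
        return {}
    columns = {key: [] for key in actions[0]}
    for row in actions:
        for key, value in row.items():
            columns[key].append(value)
    return columns

rows = [
    {"match_id": 1, "tick": 300, "action_type": "ability_cast"},
    {"match_id": 1, "tick": 450, "action_type": "item_use"},
]
# to_columns(rows) -> {"match_id": [1, 1], "tick": [300, 450],
#                      "action_type": ["ability_cast", "item_use"]}
```

In the real pipeline a Parquet/Arrow library would do this transpose internally; the point is that a flat, fixed-field PlayerAction record maps onto columnar storage with no restructuring.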

How I will keep it extensible

Tech stack choices

Per-match processing flow

  1. Ingestion queues a replay path.
  2. Rust decoder parses .dem into IR entities.
  3. Feature modules emit PlayerAction records.
  4. Enrichers attach context to each action.
  5. Serializer writes table outputs (Parquet/Arrow).
  6. Python notebooks/scripts train and evaluate models.
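The numbered flow above can be collapsed into one per-match function where every stage is a pluggable callable. This is an orchestration sketch; all stage names and signatures are placeholders, and the real decoder would be the Rust core behind a binding.

```python
def process_match(replay_path, decoder, modules, enrichers, writer):
    """Run one replay through the pipeline; each argument is a pluggable stage."""
    ir_records = list(decoder(replay_path))       # steps 1-2: parse .dem into IR
    actions = [a for m in modules                 # step 3: emit PlayerAction records
                 for a in m(ir_records)]
    for enrich in enrichers:                      # step 4: attach context
        actions = [enrich(a) for a in actions]
    writer(actions)                               # step 5: Parquet/Arrow output
    return actions                                # step 6 happens downstream in Python
```

Because each stage is injected, tests can run the flow end to end with stub stages before the Rust decoder exists.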

Modeling direction: draft to result

I will start with draft-only baselines and then add early-game action-context features to quantify predictive lift. This should make it possible to separate what is explained by composition from what is explained by player decisions in specific scenarios.
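A draft-only baseline needs little more than a multi-hot hero encoding per team. A sketch, where `HERO_POOL_SIZE` is a placeholder constant to be replaced with the real hero count for the patch:

```python
HERO_POOL_SIZE = 130   # placeholder; use the actual hero-ID range for the patch

def draft_features(radiant_hero_ids, dire_hero_ids):
    """Multi-hot draft encoding: one slot per hero per team.
    Early-game action-context features would be concatenated later."""
    vec = [0.0] * (2 * HERO_POOL_SIZE)
    for hid in radiant_hero_ids:
        vec[hid] = 1.0
    for hid in dire_hero_ids:
        vec[HERO_POOL_SIZE + hid] = 1.0
    return vec
```

Training the same model on this vector alone, and then on this vector plus action-context features, gives a direct measurement of the predictive lift described above.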

Success is not just accuracy; interpretability matters too. I want to identify which pick/ban patterns and early actions move win probability the most.

The next step is implementation notes and the first real outputs from replay batches. Once the pipeline is stable, I can compare models trained on draft-only inputs against models trained on draft plus player-action context and measure the gap directly.