TracePipe¶

Row-level data lineage tracking for pandas pipelines.

TracePipe automatically tracks what happens to every row and cell in your DataFrame: drops, transformations, merges, and value changes. Your existing pandas code runs unchanged.

Quick Start: Get up and running in 5 minutes
User Guide: Learn the core concepts and features
API Reference: Complete function documentation
Examples: Real-world usage patterns

The Problem¶

Data pipelines are black boxes. When something goes wrong, you're left asking:

"Where did row X go?" — Dropped somewhere, but which step?
"Why is this value wrong?" — It was fine in the source, what changed it?
"How did these rows get merged?" — Which parent records combined?
"Why are there nulls here?" — When did they appear?

import pandas as pd

df = pd.read_csv("customers.csv")
regions = pd.read_csv("regions.csv")      # hypothetical lookup table with a "zip" column
df = df.dropna(subset=["age"])            # Some rows disappear
df["income"] = df["income"].fillna(0)     # Values change silently
df = df.merge(regions, on="zip")          # New rows appear, some vanish
df = df[df["age"] >= 18]                  # More rows gone
# What actually happened to customer C-789?

Traditional debugging means print() statements, manual diffs, and guesswork.
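For comparison, here is a minimal sketch of that manual approach in plain pandas, with no TracePipe involved. It reruns the pipeline one step at a time and reports the first step after which a given `customer_id` is missing. The sample data, the `regions` table, and the `find_drop_step` helper are all made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for customers.csv and a regions lookup.
customers = pd.DataFrame({
    "customer_id": ["C-1", "C-2", "C-789"],
    "zip": ["10001", "94105", "10001"],
    "age": [34.0, None, 16.0],
    "income": [52000.0, 61000.0, None],
})
regions = pd.DataFrame({"zip": ["10001", "94105"], "region": ["East", "West"]})

# Each step is a (label, function) pair so the IDs can be checked after every one.
steps = [
    ("dropna(age)", lambda d: d.dropna(subset=["age"])),
    ("fillna(income)", lambda d: d.assign(income=d["income"].fillna(0))),
    ("merge(regions)", lambda d: d.merge(regions, on="zip")),
    ("age >= 18", lambda d: d[d["age"] >= 18]),
]

def find_drop_step(df, steps, key, value):
    """Return the label of the first step after which `value` is gone from `key`."""
    for label, step in steps:
        df = step(df)
        if value not in set(df[key]):
            return label
    return None  # the record survived every step

print(find_drop_step(customers, steps, "customer_id", "C-789"))  # age >= 18
print(find_drop_step(customers, steps, "customer_id", "C-2"))    # dropna(age)
print(find_drop_step(customers, steps, "customer_id", "C-1"))    # None
```

It answers the "where did the row go" question, but only after restructuring the pipeline into a list of steps, and it says nothing about which values changed along the way.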

The Solution¶

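import tracepipe as tp
import pandas as pd

tp.enable(mode="debug", watch=["income"])

df = pd.read_csv("customers.csv")
regions = pd.read_csv("regions.csv")      # hypothetical lookup table with a "zip" column
df = df.dropna(subset=["age"])            # null incomes are kept so fillna has work to do
df["income"] = df["income"].fillna(0)
df = df.merge(regions, on="zip")
df = df[df["age"] >= 18]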

# What actually happened to customer C-789?
print(tp.trace(df, where={"customer_id": "C-789"}))

Row 789 Journey:
  Status: [DROPPED]
  Dropped by: DataFrame.__getitem__[mask] (step 5)

  Events:
    [SURVIVED] DataFrame.dropna
    [MODIFIED] DataFrame.fillna: income (None → 0)
    [SURVIVED] DataFrame.merge
    [DROPPED]  DataFrame.__getitem__[mask]  ← age filter

Now you know: C-789 had null income (filled to 0), survived the merge, but was dropped by the age filter.

# Pipeline health overview
print(tp.check(df))

TracePipe Check: [OK] Pipeline healthy
  Mode: debug

Retention: 847/1000 (84.7%)
Dropped: 153 rows
  • DataFrame.dropna: 42
  • DataFrame.__getitem__[mask]: 111

Value changes: 23 cells modified
  • DataFrame.fillna: 23 (income)

One import, one `tp.enable()` call, and you have a complete audit trail.
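The figures in the sample summary are tied together by simple arithmetic: the per-operation drop counts add up to the total dropped, and retention is surviving rows divided by input rows. The sketch below reproduces those numbers. `retention_summary` is an illustrative helper, not part of TracePipe, and it assumes that only the listed operations changed the row count.

```python
def retention_summary(rows_in, drops_by_step):
    """Compute total dropped rows, surviving rows, and retention percentage."""
    dropped = sum(drops_by_step.values())
    rows_out = rows_in - dropped
    return {
        "rows_out": rows_out,
        "dropped": dropped,
        "retention_pct": round(100 * rows_out / rows_in, 1),
    }

# Drop counts copied from the sample tp.check() output.
summary = retention_summary(1000, {
    "DataFrame.dropna": 42,
    "DataFrame.__getitem__[mask]": 111,
})
print(summary)
# {'rows_out': 847, 'dropped': 153, 'retention_pct': 84.7}
```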

Key Features¶

| Feature | Description |
|---|---|
| Zero-Code Instrumentation | Works with existing pandas code unchanged |
| Row-Level Tracking | Know exactly where each row went |
| Cell Provenance | See before/after values for every change |
| Merge Parent Tracking | Understand which rows combined |
| Data Contracts | Validate retention rates and uniqueness |
| HTML Reports | Generate visual pipeline audits |
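To make the Data Contracts row concrete, the following plain-pandas sketch shows the kind of checks such a contract expresses: a minimum retention rate and uniqueness of a key column. `check_contract` and its parameters are hypothetical and written only for illustration. TracePipe's own contract interface may look quite different, so treat the API Reference as the authority.

```python
import pandas as pd

def check_contract(before, after, key, min_retention=0.8):
    """Return a list of human-readable violations (empty if the contract holds)."""
    violations = []
    retention = len(after) / len(before) if len(before) else 1.0
    if retention < min_retention:
        violations.append(f"retention {retention:.1%} below {min_retention:.0%}")
    if after[key].duplicated().any():
        violations.append(f"column {key!r} is not unique")
    return violations

before = pd.DataFrame({"id": [1, 2, 3, 4, 5], "age": [25, 17, 40, 15, 33]})
after = before[before["age"] >= 18]  # keeps 3 of 5 rows

print(check_contract(before, after, key="id", min_retention=0.5))  # []
print(check_contract(before, after, key="id", min_retention=0.9))  # one retention violation
```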

Installation¶

pip install tracepipe

For optional features:

pip install "tracepipe[arrow]"   # Parquet/Arrow support
pip install "tracepipe[all]"     # All optional dependencies (quotes keep zsh from globbing the brackets)
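If you are unsure which packages made it into your environment, a quick standard-library check can tell you. This sketch assumes the `arrow` extra pulls in the `pyarrow` package, which is a guess based on the extra's description and not something stated here.

```python
from importlib.util import find_spec

def is_installed(module_name):
    """Return True if the top-level module can be found in this environment."""
    return find_spec(module_name) is not None

# "pyarrow" is an assumed dependency of the arrow extra.
for name in ("pandas", "tracepipe", "pyarrow"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```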

Quick Example¶

import tracepipe as tp
import pandas as pd

# Enable tracking
tp.enable(mode="debug", watch=["price"])

# Your normal pandas code
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "price": [10.0, None, 30.0]
})
df = df.dropna()
df["price"] = df["price"] * 1.1

# Inspect what happened
print(tp.check(df))      # Health summary
print(tp.trace(df, 0))   # Row 0's journey
print(tp.why(df, "price", 0))  # Why price changed
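For reference, this is what the pandas part of the example does on its own, independent of TracePipe: the row for product B is dropped by `dropna`, and the remaining prices are scaled by 1.1. Running it without tracking is a simple way to cross-check what the reports should say. The names `cleaned`, `dropped_index`, and `result` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "price": [10.0, None, 30.0],
})

cleaned = df.dropna()  # drops index 1 (product B, null price)
dropped_index = df.index.difference(cleaned.index)

# .assign returns a new DataFrame, so `cleaned` keeps the pre-change prices.
result = cleaned.assign(price=cleaned["price"] * 1.1)

print(dropped_index.tolist())             # [1]
print(result["product"].tolist())         # ['A', 'C']
print(result["price"].round(2).tolist())  # [11.0, 33.0]
```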

What's Tracked¶

| Operation | Tracking | Completeness |
|---|---|---|
| `dropna`, `query`, `df[mask]` | Dropped row IDs | Full |
| `drop_duplicates` | Dropped→kept mapping (debug mode) | Full |
| `head`, `tail`, `sample` | Dropped row IDs | Full |
| `fillna`, `replace` | Cell diffs (watched cols) | Full |
| `loc[]=`, `iloc[]=`, `at[]=` | Cell diffs | Full |
| `merge`, `join` | Parent tracking | Full |
| `pd.concat(axis=0)` | Row IDs + source DataFrame | Full |
| `pd.concat(axis=1)` | Row IDs (if aligned) | Partial |
| `groupby().agg()` | Group membership | Full |
| `apply`, `pipe` | Output tracked | Partial |
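As an illustration of what a "dropped→kept mapping" means for `drop_duplicates`, here is a plain-pandas sketch that pairs each discarded duplicate with the row that was kept in its place. It shows the concept only and makes no claim about how TracePipe records this internally. The sample data and variable names are invented.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "email": ["a@x.com", "b@x.com", "a@x.com", "a@x.com"],
        "plan": ["free", "pro", "pro", "free"],
    },
    index=[10, 11, 12, 13],
)

# duplicated(keep="first") flags every repeat after the first occurrence,
# which is exactly the set of rows drop_duplicates would remove.
dup_mask = df.duplicated(subset=["email"], keep="first")
kept, dropped = df[~dup_mask], df[dup_mask]

# Look up the surviving row's index label for each duplicated key.
kept_label_by_email = dict(zip(kept["email"].tolist(), kept.index.tolist()))
dropped_to_kept = {
    label: kept_label_by_email[email]
    for label, email in zip(dropped.index.tolist(), dropped["email"].tolist())
}

print(dropped_to_kept)  # {12: 10, 13: 10}
```

Rows 12 and 13 were both dropped in favor of row 10, which also shows that the discarded `plan` value on row 12 differed from the one that survived.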

License¶

TracePipe is released under the MIT License.