TracePipe
Row-level data lineage tracking for pandas pipelines.
TracePipe automatically tracks what happens to every row and cell in your DataFrame: drops, transformations, merges, and value changes. Your existing pandas code runs unchanged: import TracePipe and call tp.enable() once.
- Get up and running in 5 minutes
- Learn the core concepts and features
- Complete function documentation
- Real-world usage patterns
The Problem
Data pipelines are black boxes. When something goes wrong, you're left asking:
- "Where did row X go?" — Dropped somewhere, but which step?
- "Why is this value wrong?" — It was fine in the source, what changed it?
- "How did these rows get merged?" — Which parent records combined?
- "Why are there nulls here?" — When did they appear?
df = pd.read_csv("customers.csv")
df = df.dropna(subset=["age", "zip"]) # Some rows disappear
df = df.merge(regions, on="zip") # New rows appear, some vanish
df["income"] = df["income"].fillna(0) # Values change silently
df = df[df["age"] >= 18] # More rows gone
# What actually happened to customer C-789?
Traditional debugging means print() statements, manual diffs, and guesswork.
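To make the manual approach concrete, here is a minimal sketch in plain pandas that re-runs a pipeline step by step and checks whether one customer is still present after each step. The `follow_row` helper, the toy data, and the step names are hypothetical illustrations, not part of TracePipe, and the merge step is left out to keep the example self-contained.

```python
import pandas as pd


def follow_row(df, steps, key, value):
    """Apply each (name, func) step and record whether the row with df[key] == value survives."""
    journey = []
    for name, func in steps:
        df = func(df)
        present = bool((df[key] == value).any())
        journey.append((name, "SURVIVED" if present else "DROPPED"))
        if not present:
            break  # Nothing left to follow once the row is gone
    return journey


# Toy stand-in for customers.csv (hypothetical values)
customers = pd.DataFrame({
    "customer_id": ["C-788", "C-789", "C-790"],
    "age": [34.0, 16.0, None],
    "income": [52000.0, None, 41000.0],
})

steps = [
    ("dropna(age)", lambda d: d.dropna(subset=["age"])),
    ("fillna(income)", lambda d: d.assign(income=d["income"].fillna(0))),
    ("age >= 18", lambda d: d[d["age"] >= 18]),
]

journey = follow_row(customers, steps, "customer_id", "C-789")
for name, status in journey:
    print(f"[{status}] {name}")
```

Every step has to be wrapped and registered by hand, and the helper only reports presence, not which values changed along the way.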
The Solution
import tracepipe as tp
import pandas as pd
tp.enable(mode="debug", watch=["income"])
df = pd.read_csv("customers.csv")
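df = df.dropna(subset=["age", "zip"]) # Rows with null income stay so fillna can fill them
df["income"] = df["income"].fillna(0)
df = df.merge(regions, on="zip") # regions: a lookup DataFrame keyed by zip, loaded elsewhere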
df = df[df["age"] >= 18]
# What actually happened to customer C-789?
print(tp.trace(df, where={"customer_id": "C-789"}))
Row 789 Journey:
Status: [DROPPED]
Dropped by: DataFrame.__getitem__[mask] (step 5)
Events:
[SURVIVED] DataFrame.dropna
[MODIFIED] DataFrame.fillna: income (None → 0)
[SURVIVED] DataFrame.merge
[DROPPED] DataFrame.__getitem__[mask] ← age filter
Now you know: C-789 had null income (filled to 0), survived the merge, but was dropped by the age filter.
For a pipeline-wide health summary, print(tp.check(df)) reports:
TracePipe Check: [OK] Pipeline healthy
Mode: debug
Retention: 847/1000 (84.7%)
Dropped: 153 rows
• DataFrame.dropna: 42
• DataFrame.__getitem__[mask]: 111
Value changes: 23 cells modified
• DataFrame.fillna: 23 (income)
One import. Complete audit trail.
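The figures in the health summary are internally consistent, which can be verified with a few lines of arithmetic. This sketch assumes the pipeline started with 1000 rows and that the merge neither added nor removed rows, as the summary suggests; the variable names are illustrative only.

```python
# Per-step drop counts copied from the summary above
dropped_by_step = {
    "DataFrame.dropna": 42,
    "DataFrame.__getitem__[mask]": 111,
}
initial_rows = 1000

total_dropped = sum(dropped_by_step.values())
retained = initial_rows - total_dropped
retention = retained / initial_rows

print(f"Retention: {retained}/{initial_rows} ({retention:.1%})")
print(f"Dropped: {total_dropped} rows")
```

If a merge duplicated rows, comparing final and initial row counts would no longer equal the share of original rows that survived, which is one reason to count drops per row ID rather than per row count.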
Key Features
| Feature | Description |
|---|---|
| Zero-Code Instrumentation | Works with existing pandas code unchanged |
| Row-Level Tracking | Know exactly where each row went |
| Cell Provenance | See before/after values for every change |
| Merge Parent Tracking | Understand which rows combined |
| Data Contracts | Validate retention rates and uniqueness |
| HTML Reports | Generate visual pipeline audits |
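To illustrate what a data contract on retention and uniqueness checks, here is a minimal sketch in plain pandas. The `check_contract` function and its parameters are hypothetical and are not TracePipe's contract API; they only show the two conditions named in the table.

```python
import pandas as pd


def check_contract(before, after, min_retention, unique_on):
    """Return a list of violation messages; an empty list means the contract holds."""
    violations = []
    # Retention: ratio of output row count to input row count
    retention = len(after) / len(before) if len(before) else 1.0
    if retention < min_retention:
        violations.append(f"retention {retention:.1%} is below {min_retention:.1%}")
    # Uniqueness: no two output rows may share the same key values
    if after.duplicated(subset=unique_on).any():
        violations.append(f"duplicate rows for key {unique_on}")
    return violations


raw = pd.DataFrame({"order_id": [1, 2, 2, 3], "amount": [10.0, 20.0, 20.0, None]})
cleaned = raw.dropna()  # Drops the row with the missing amount

violations = check_contract(raw, cleaned, min_retention=0.8, unique_on=["order_id"])
print(violations)
```

Here both conditions fail: only 3 of 4 rows survive, and `order_id` 2 still appears twice.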
Installation
pip install tracepipe
For optional features:
pip install "tracepipe[arrow]" # Parquet/Arrow support
pip install "tracepipe[all]" # All optional dependencies
Quick Example
import tracepipe as tp
import pandas as pd
# Enable tracking
tp.enable(mode="debug", watch=["price"])
# Your normal pandas code
df = pd.DataFrame({
"product": ["A", "B", "C"],
"price": [10.0, None, 30.0]
})
df = df.dropna()
df["price"] = df["price"] * 1.1
# Inspect what happened
print(tp.check(df)) # Health summary
print(tp.trace(df, 0)) # Row 0's journey
print(tp.why(df, "price", 0)) # Why price changed
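For intuition about what a cell-level diff on a watched column contains, the sketch below computes before/after values for `price` by hand in plain pandas. It is an illustration under the same toy data, not a description of how TracePipe records changes internally; the `cell_diffs` name is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "price": [10.0, None, 30.0],
})
clean = df.dropna()  # Row 1 (product B) is dropped here

# Snapshot the watched column, then compute the transformed values
before = clean["price"]
after = before * 1.1

# A cell counts as changed if the values differ and are not both missing
changed = before.ne(after) & ~(before.isna() & after.isna())
cell_diffs = pd.DataFrame({"before": before[changed], "after": after[changed]})
print(cell_diffs)
```

The diff is keyed by the original index labels, so rows 0 and 2 can still be matched to products A and C after the drop.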
What's Tracked
| Operation | Tracking | Completeness |
|---|---|---|
| `dropna`, `query`, `df[mask]` | Dropped row IDs | Full |
| `drop_duplicates` | Dropped→kept mapping (debug mode) | Full |
| `head`, `tail`, `sample` | Dropped row IDs | Full |
| `fillna`, `replace` | Cell diffs (watched cols) | Full |
| `loc[]=`, `iloc[]=`, `at[]=` | Cell diffs | Full |
| `merge`, `join` | Parent tracking | Full |
| `pd.concat(axis=0)` | Row IDs + source DataFrame | Full |
| `pd.concat(axis=1)` | Row IDs (if aligned) | Partial |
| `groupby().agg()` | Group membership | Full |
| `apply`, `pipe` | Output tracked | Partial |
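Two of the tracking kinds in the table, dropped row IDs and the dropped→kept mapping for duplicates, can be approximated by hand in plain pandas. The sketch below assumes unique index labels and the default `keep="first"` behaviour of `drop_duplicates`; the helper names are hypothetical and this is not how TracePipe is implemented.

```python
import pandas as pd


def dropped_row_ids(before, after):
    """Index labels present in `before` but missing from `after`."""
    return before.index.difference(after.index).tolist()


def duplicate_mapping(df, subset):
    """Map each later duplicate's label to the label of the first row with the same key."""
    first_seen = {}
    mapping = {}
    keys = df[subset].itertuples(index=False, name=None)
    for label, key in zip(df.index, keys):
        if key in first_seen:
            mapping[label] = first_seen[key]
        else:
            first_seen[key] = label
    return mapping


df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "a@example.com"],
    "age": [30, 17, 30],
})

adults = df[df["age"] >= 18]
mask_drops = dropped_row_ids(df, adults)  # Labels removed by the boolean mask

deduped = df.drop_duplicates(subset=["email"])
dup_drops = dropped_row_ids(df, deduped)  # Labels removed as duplicates
dup_map = duplicate_mapping(df, ["email"])  # Dropped label -> kept label

print(mask_drops, dup_drops, dup_map)
```

Index comparison only works while labels stay stable; operations that reset or rebuild the index, such as a merge, need a separate row identifier carried through the pipeline.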
License
TracePipe is released under the MIT License.