# Row Tracing

Trace the complete journey of any row through your pipeline.

## Basic Usage

```python
# By row index
trace = tp.trace(df, row=0)
print(trace)

# By business key
trace = tp.trace(df, where={"customer_id": "C-12345"})
print(trace)
```
Output:
!!! note "Event Recording"
    TracePipe records `MODIFIED` events for cells that change in watched columns. Rows that pass through an operation unchanged are not recorded as separate events (they implicitly "survived"). `DROPPED` events are recorded for filtered rows.
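To make this event model concrete, here is a minimal, self-contained sketch in plain pandas (no TracePipe) of how a diff-based recorder emits only `MODIFIED` and `DROPPED` events. The `diff_events` helper and the event-dict shape are illustrative, not TracePipe's internals.

```python
import pandas as pd

def diff_events(before, after, watched):
    """Record MODIFIED events for changed cells in watched columns and
    DROPPED events for rows missing from `after`. Rows that are present
    and unchanged produce no event at all."""
    events = []
    for rid in before.index:
        if rid not in after.index:
            events.append({"row_id": rid, "type": "DROPPED"})
            continue
        for col in watched:
            old, new = before.at[rid, col], after.at[rid, col]
            if old != new and not (pd.isna(old) and pd.isna(new)):
                events.append({"row_id": rid, "type": "MODIFIED",
                               "column": col, "old": old, "new": new})
    return events

before = pd.DataFrame({"age": [25, -1, 40]}, index=[0, 1, 2])
after = before.copy()
after.loc[1, "age"] = 0      # one modified cell
after = after.drop(index=2)  # one dropped row

events = diff_events(before, after, watched=["age"])
# Row 0 is unchanged, so it produces no event; rows 1 and 2 do.
```

The key point the sketch demonstrates: the unchanged row generates no event, which is why survival stays implicit.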
## The TraceResult Object

```python
trace = tp.trace(df, row=0)

# Access fields
trace.row_id           # int: internal row ID
trace.status           # str: "alive" or "dropped"
trace.is_alive         # bool: True if the row still exists
trace.events           # list[dict]: all events for this row

# For dropped rows
trace.dropped_by       # str: operation that dropped the row
trace.dropped_at_step  # int: step number

# Provenance (v0.4+)
trace.origin           # dict: {"type": "concat"|"merge", ...} or None
trace.representative   # dict: for dedup-dropped rows, which row was kept

# Export
trace.to_dict()        # dict representation
```
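For readers who want to see how these fields hang together, here is a hypothetical stand-in class mirroring the documented attributes. The real `TraceResult` is constructed internally by `tp.trace()`; this dataclass only illustrates the shape.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class TraceResultSketch:
    # Illustrative stand-in; field names follow the documented API,
    # but this is NOT TracePipe's actual implementation.
    row_id: int
    status: str                      # "alive" or "dropped"
    events: list = field(default_factory=list)
    dropped_by: Optional[str] = None
    dropped_at_step: Optional[int] = None
    origin: Optional[dict] = None
    representative: Optional[dict] = None

    @property
    def is_alive(self) -> bool:
        return self.status == "alive"

    def to_dict(self) -> dict:
        return asdict(self)

t = TraceResultSketch(row_id=7, status="dropped",
                      dropped_by="query", dropped_at_step=3)
```

Note that `is_alive` is derived from `status` rather than stored, so the two can never disagree.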
## Finding Rows

### By Index

```python
# Current DataFrame index
tp.trace(df, row=0)   # First row in current df
tp.trace(df, row=-1)  # Last row in current df
```

### By Business Key

```python
# Single key
tp.trace(df, where={"email": "alice@example.com"})

# Multiple keys (AND condition)
tp.trace(df, where={"region": "US", "status": "active"})

# Find a row with a null value
tp.trace(df, where={"email": None})
```
!!! tip "Use Business Keys"
    Business keys are more stable than row indices, which shift as rows are filtered.
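A short plain-pandas demonstration of why this matters: after a filter, positional index 0 points at a different customer, while a business-key lookup still finds the same logical row. (No TracePipe calls here; this is just pandas behavior.)

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C-1", "C-2", "C-3"],
    "status": ["inactive", "active", "active"],
})

filtered = df[df["status"] == "active"].reset_index(drop=True)

# Positional row 0 now refers to a different customer than before...
before_row0 = df.loc[0, "customer_id"]       # "C-1"
after_row0 = filtered.loc[0, "customer_id"]  # "C-2"

# ...but a business-key lookup still finds the same logical row
match = filtered[filtered["customer_id"] == "C-2"].iloc[0]
```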
## Event Types

| Event Type | Description |
|---|---|
| `MODIFIED` | One or more cells changed in watched columns |
| `DROPPED` | Row was removed by a filter operation |
!!! note "Design Note"
    TracePipe does not record explicit "SURVIVED" events because they would create excessive noise in most pipelines. Instead, rows that exist in the final DataFrame are implicitly considered to have survived all operations. If you need to know which operations a row passed through, check the steps list via `tp.debug.inspect().steps`.
## Tracing Dropped Rows

You can also trace rows that were dropped:

```python
dbg = tp.debug.inspect()

# Get IDs of dropped rows
dropped_ids = dbg.dropped_rows()

# Trace the first few dropped rows
for rid in list(dropped_ids)[:5]:
    trace = dbg.explain_row(rid)
    print(f"Row {rid}: dropped by {trace.dropped_by}")
```
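Once you have traces for many dropped rows, a common next step is to summarize which operation drops the most. The sketch below works on plain dicts in the shape `trace.to_dict()` documents above; the sample data is hypothetical.

```python
from collections import Counter

# Hypothetical trace dicts, following the documented TraceResult fields
traces = [
    {"row_id": 7,  "status": "dropped", "dropped_by": "dropna"},
    {"row_id": 9,  "status": "dropped", "dropped_by": "query"},
    {"row_id": 12, "status": "dropped", "dropped_by": "dropna"},
]

# Count drops per operation, then find the worst offender
drops_by_op = Counter(t["dropped_by"] for t in traces
                      if t["status"] == "dropped")
worst_op, n_dropped = drops_by_op.most_common(1)[0]
```

This kind of aggregation is a quick way to spot a filter that is discarding far more rows than expected.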
## Merge Parent Tracking

For rows created by merges, TracePipe tracks their parents:

```python
result = df1.merge(df2, on="id")
trace = tp.trace(result, row=0)

# In debug mode, you can see parent rows
if trace.merge_parents:
    print(f"Left parent: {trace.merge_parents.left}")
    print(f"Right parent: {trace.merge_parents.right}")
```
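To see the bookkeeping that parent tracking requires, here is a plain-pandas sketch: tagging each side with its own row ID before merging lets every result row be traced back to a left and right parent. The `_left_rid`/`_right_rid` columns are illustrative scaffolding, not TracePipe's internal representation.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
right = pd.DataFrame({"id": [2, 3], "score": [10, 20]})

# Tag each side with its own row ID before merging
left = left.assign(_left_rid=range(len(left)))
right = right.assign(_right_rid=range(len(right)))

merged = left.merge(right, on="id")

# Each result row now carries the IDs of both parents
parents = merged[["_left_rid", "_right_rid"]].iloc[0]
```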
## Concat Origin Tracking (v0.4+)

When rows come from concatenated DataFrames, TracePipe tracks their source via `trace.origin`:

```python
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})
result = pd.concat([df1, df2])

# Trace a row that came from df2
trace = tp.trace(result, row=2)
print(trace.origin)
# {"type": "concat", "source_df": 1, "step_id": 5}
```

The `.origin` property returns a unified dict with:

- `type`: `"concat"`, `"merge"`, or `None` (for original rows)
- `source_df`: index in the concat list (0 = first DataFrame, 1 = second, etc.)
- `step_id`: which pipeline step produced the row
Row IDs are preserved through `pd.concat(axis=0)`, so lineage chains correctly:

```python
# Transform df1 before concat
df1["a"] = df1["a"].fillna(0)
result = pd.concat([df1, df2])

# Rows from df1 still carry their fillna history
trace = tp.trace(result, row=0)  # Shows the fillna event from df1
```
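Pandas itself offers a lightweight view of the same idea: `pd.concat(keys=...)` builds a MultiIndex that records which source frame each row came from, which is conceptually what `source_df` encodes. This example uses only pandas, no TracePipe.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})

# keys= labels each source; the first index level names the origin frame
result = pd.concat([df1, df2], keys=["df1", "df2"])

source_of_third_row = result.index[2][0]  # the third row came from df2
```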
## Duplicate Representative Tracking (v0.4+)

When `drop_duplicates` removes rows, TracePipe tracks which row "won" via `trace.representative`:

```python
df = pd.DataFrame({
    "key": ["A", "A", "B"],
    "value": [100, 200, 300],
})
df = df.drop_duplicates(subset=["key"], keep="first")

# Trace the dropped row (value=200); dropped-row IDs are available
# via tp.debug.inspect().dropped_rows()
trace = tp.trace(df, row=dropped_row_id)
print(trace.representative)
# {"kept_rid": 42, "subset": ["key"], "keep": "first"}
```
The `.representative` property is set only for rows dropped by `drop_duplicates`:

| `keep` Strategy | `.representative` |
|---|---|
| `keep='first'` | `{"kept_rid": 42, ...}` — first occurrence kept |
| `keep='last'` | `{"kept_rid": 45, ...}` — last occurrence kept |
| `keep=False` | `{"kept_rid": None, ...}` — all duplicates removed |
This answers "why did this row disappear?" — it wasn't deleted, it was deduplicated.
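The mapping from each dropped duplicate to its kept representative can be reproduced in plain pandas, which clarifies what `kept_rid` means. This sketch mirrors the concept only; TracePipe computes it internally during `drop_duplicates`.

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "A", "B"], "value": [100, 200, 300]})

# First occurrence of each key: the row a keep="first" dedup retains
first_rid_per_key = {k: g.index[0] for k, g in df.groupby("key")}

kept = df.drop_duplicates(subset=["key"], keep="first")
dropped = df.index.difference(kept.index)

# Map each dropped row to the representative kept in its place
representative = {rid: first_rid_per_key[df.loc[rid, "key"]]
                  for rid in dropped}
```

Here the row with `value=200` (index 1) maps to the kept first occurrence of key `"A"` (index 0).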
## Performance Considerations

- Row tracing in CI mode is limited (no individual row IDs)
- For large DataFrames, use `where=` with indexed columns for faster lookups
- Tracing many rows? Use `tp.debug.inspect()` for batch access