# Row Tracing

Trace the complete journey of any row through your pipeline.

## Basic Usage

```python
# By row index
trace = tp.trace(df, row=0)
print(trace)

# By business key
trace = tp.trace(df, where={"customer_id": "C-12345"})
print(trace)
```

Output:

```
Row 42 Journey:
  Status: [OK] Alive

  Events: 1
    [MODIFIED] DataFrame.fillna: income
```

## Event Recording

TracePipe records `MODIFIED` events for cells that change in watched columns. Rows that pass through operations unchanged are not recorded as separate events (they are implicitly "survived"). `DROPPED` events are recorded for filtered rows.
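The recording rule can be illustrated with plain pandas. This is a conceptual sketch of what "a cell changed" means, not TracePipe internals:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50000.0, np.nan, 72000.0]})

# Snapshot the watched column, apply an operation, then diff.
before = df["income"].copy()
df["income"] = df["income"].fillna(0)

# A cell counts as modified when old and new values differ;
# NaN -> 0 differs, and NaN-to-NaN is treated as unchanged.
modified = ~(before.eq(df["income"]) | (before.isna() & df["income"].isna()))
print(modified[modified].index.tolist())  # -> [1]
```

Only row 1 produces a `MODIFIED` event here; rows 0 and 2 pass through unchanged and are implicitly "survived".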

## The TraceResult Object

```python
trace = tp.trace(df, row=0)

# Access fields
trace.row_id           # int: internal row ID
trace.status           # str: "alive" or "dropped"
trace.is_alive         # bool: True if row still exists
trace.events           # list[dict]: all events for this row

# For dropped rows
trace.dropped_by       # str: operation that dropped the row
trace.dropped_at_step  # int: step number

# Provenance (v0.4+)
trace.origin           # dict: {"type": "concat"|"merge", ...} or None
trace.representative   # dict: for dedup-dropped rows, which row was kept

# Export
trace.to_dict()        # dict representation
```
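For a concrete picture of this shape, here is a hypothetical dataclass mirroring the fields above. It is a sketch for orientation, not TracePipe's actual class:

```python
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class TraceResultSketch:
    """Stand-in with the same field shape as TraceResult (illustrative only)."""
    row_id: int
    status: str                        # "alive" or "dropped"
    events: list = field(default_factory=list)
    dropped_by: Optional[str] = None
    dropped_at_step: Optional[int] = None
    origin: Optional[dict] = None      # e.g. {"type": "concat", ...}
    representative: Optional[dict] = None

    @property
    def is_alive(self) -> bool:
        return self.status == "alive"

    def to_dict(self) -> dict:
        return asdict(self)

t = TraceResultSketch(row_id=42, status="alive",
                      events=[{"type": "MODIFIED", "column": "income"}])
print(t.is_alive, t.to_dict()["row_id"])  # -> True 42
```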

## Finding Rows

### By Index

```python
# Current DataFrame index
tp.trace(df, row=0)      # First row in current df
tp.trace(df, row=-1)     # Last row in current df
```

### By Business Key

```python
# Single key
tp.trace(df, where={"email": "alice@example.com"})

# Multiple keys (AND condition)
tp.trace(df, where={"region": "US", "status": "active"})

# Find row with null value
tp.trace(df, where={"email": None})
```
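Conceptually, a `where=` lookup is an AND of per-column equality filters, with null values matched via `isna()`. A plain-pandas sketch of the equivalent selection (`build_mask` is an illustrative helper, not TracePipe API):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["US", "EU", "US"],
    "status": ["active", "active", "inactive"],
    "email":  ["a@x.com", None, "c@x.com"],
})

def build_mask(frame: pd.DataFrame, where: dict) -> pd.Series:
    """AND together one equality filter per key in `where`."""
    mask = pd.Series(True, index=frame.index)
    for col, val in where.items():
        # Equality never matches NaN, so null lookups need isna()
        mask &= frame[col].isna() if val is None else frame[col].eq(val)
    return mask

print(df[build_mask(df, {"region": "US", "status": "active"})].index.tolist())  # -> [0]
print(df[build_mask(df, {"email": None})].index.tolist())                       # -> [1]
```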

> **Use business keys.** Business keys are more stable than row indices, which change as rows are filtered.

## Event Types

| Event Type | Description |
| --- | --- |
| `MODIFIED` | One or more cells changed in watched columns |
| `DROPPED` | Row was removed by a filter operation |

### Design Note

TracePipe does not explicitly record "SURVIVED" events because they would create excessive noise for most pipelines. Instead, rows that exist in the final DataFrame are implicitly considered to have survived all operations.

If you need to know which operations a row passed through, check the steps list via `tp.debug.inspect().steps`.

## Tracing Dropped Rows

You can trace rows that were dropped:

```python
dbg = tp.debug.inspect()

# Get IDs of dropped rows
dropped_ids = dbg.dropped_rows()

# Trace a specific dropped row
for rid in list(dropped_ids)[:5]:
    trace = dbg.explain_row(rid)
    print(f"Row {rid}: dropped by {trace.dropped_by}")
```
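When many rows were dropped, a per-operation tally is often more useful than reading traces one by one. A small sketch over plain dicts shaped like the trace fields above (the trace data here is made up for illustration, not real TracePipe output):

```python
from collections import Counter

# Stand-in for a list of trace.to_dict() results for dropped rows
traces = [
    {"row_id": 3, "status": "dropped", "dropped_by": "DataFrame.dropna"},
    {"row_id": 7, "status": "dropped", "dropped_by": "DataFrame.query"},
    {"row_id": 9, "status": "dropped", "dropped_by": "DataFrame.dropna"},
]

# Tally dropped rows by the operation that removed them
by_op = Counter(t["dropped_by"] for t in traces)
for op, n in by_op.most_common():
    print(f"{op}: {n} rows dropped")
# -> DataFrame.dropna: 2 rows dropped
#    DataFrame.query: 1 rows dropped
```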

## Merge Parent Tracking

For rows created by merges, TracePipe tracks their parents:

```python
result = df1.merge(df2, on="id")
trace = tp.trace(result, row=0)

# In debug mode, you can see parent rows
if trace.merge_parents:
    print(f"Left parent: {trace.merge_parents.left}")
    print(f"Right parent: {trace.merge_parents.right}")
```
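Pandas itself offers a coarse version of merge provenance via `indicator=True`, which records which input each output row came from; `merge_parents` refines this to specific parent rows. A plain-pandas sketch:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "y": [10, 20, 30]})

# indicator=True adds a _merge column tagging each row's source
result = df1.merge(df2, on="id", how="outer", indicator=True)
print(result[["id", "_merge"]])
# _merge is "left_only", "right_only", or "both" per row
```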

## Concat Origin Tracking (v0.4+)

When rows come from concatenated DataFrames, TracePipe tracks their source via `trace.origin`:

```python
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})
result = pd.concat([df1, df2])

# Trace a row that came from df2
trace = tp.trace(result, row=2)
print(trace.origin)
# {"type": "concat", "source_df": 1, "step_id": 5}
```

The `.origin` property returns a unified dict with:

- `type`: `"concat"`, `"merge"`, or `None` (for original rows)
- `source_df`: index in the concat list (0 = first DataFrame, 1 = second, etc.)
- `step_id`: which pipeline step

Row IDs are preserved through `pd.concat(axis=0)`, so lineage chains correctly:

```python
# Transform df1 before concat
df1["a"] = df1["a"].fillna(0)

result = pd.concat([df1, df2])

# Rows from df1 still have their fillna history
trace = tp.trace(result, row=0)  # Shows fillna event from df1
```
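A pandas-native analogue of concat origin tracking is `keys=`, which labels each input frame in a MultiIndex so a row's source stays visible without any tracing library:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})

# keys= tags each input frame's rows with a source label
result = pd.concat([df1, df2], keys=["df1", "df2"])
print(result.loc["df2"])   # rows that came from the second frame
print(result.index[2])     # -> ('df2', 0)
```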

## Duplicate Representative Tracking (v0.4+)

When `drop_duplicates` removes rows, TracePipe tracks which row "won" via `trace.representative`:

```python
df = pd.DataFrame({
    "key": ["A", "A", "B"],
    "value": [100, 200, 300]
})
df = df.drop_duplicates(subset=["key"], keep="first")

# Trace the dropped row (value=200); its ID comes from
# tp.debug.inspect().dropped_rows()
dropped_row_id = next(iter(tp.debug.inspect().dropped_rows()))
trace = tp.trace(df, row=dropped_row_id)
print(trace.representative)
# {"kept_rid": 42, "subset": ["key"], "keep": "first"}
```

The `.representative` property is only set for rows dropped by `drop_duplicates`:

| `keep` strategy | `.representative` |
| --- | --- |
| `keep='first'` | `{"kept_rid": 42, ...}` — first occurrence kept |
| `keep='last'` | `{"kept_rid": 45, ...}` — last occurrence kept |
| `keep=False` | `{"kept_rid": None, ...}` — all duplicates removed |

This answers "why did this row disappear?" — it wasn't deleted, it was deduplicated.
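The three `keep=` strategies can be checked in plain pandas, using the same frame as above, to see which occurrence survives in each case:

```python
import pandas as pd

df = pd.DataFrame({"key": ["A", "A", "B"], "value": [100, 200, 300]})

first = df.drop_duplicates(subset=["key"], keep="first")
last = df.drop_duplicates(subset=["key"], keep="last")
neither = df.drop_duplicates(subset=["key"], keep=False)

print(first["value"].tolist())    # -> [100, 300]  (first "A" kept)
print(last["value"].tolist())     # -> [200, 300]  (last "A" kept)
print(neither["value"].tolist())  # -> [300]       (all "A" duplicates removed)
```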

## Performance Considerations

- Row tracing in CI mode is limited (no individual row IDs)
- For large DataFrames, use `where=` with indexed columns for faster lookups
- Tracing many rows? Use `tp.debug.inspect()` for batch access