Data Lineage Explained: Why It Matters and How to Implement It
Data lineage — the ability to trace where data came from, how it was transformed, and where it flows — is one of the most universally agreed-upon requirements in data management, and one of the least often implemented. Here is why it matters and how to get started.
What lineage actually means
Data lineage answers three questions: where did this data come from (source), what happened to it along the way (transformation), and where does it go from here (downstream consumption). Column-level lineage is more specific — it traces individual fields through transformations, not just datasets.
The practical value of lineage lies in two scenarios: impact analysis (if I change this source column, which downstream reports will break?) and root cause analysis (this number in this dashboard looks wrong — where exactly did it come from?). Both scenarios cost significant time without lineage and become trivial with it.
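Both questions are graph traversals over the same lineage edges, just in opposite directions. A minimal sketch, using hypothetical table names and the standard library only — impact analysis walks the graph downstream, root cause analysis walks it upstream:

```python
from collections import defaultdict, deque

# Hypothetical table-level lineage edges: (source, downstream consumer).
EDGES = [
    ("crm.customers", "staging.stg_customers"),
    ("staging.stg_customers", "marts.dim_customers"),
    ("marts.dim_customers", "reports.churn_dashboard"),
    ("billing.invoices", "marts.fct_revenue"),
    ("marts.fct_revenue", "reports.churn_dashboard"),
]

downstream = defaultdict(set)
upstream = defaultdict(set)
for src, dst in EDGES:
    downstream[src].add(dst)
    upstream[dst].add(src)

def reachable(start, graph):
    """Breadth-first traversal: every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Impact analysis: what breaks if crm.customers changes?
impact = reachable("crm.customers", downstream)
# Root cause analysis: where does the dashboard's data come from?
sources = reachable("reports.churn_dashboard", upstream)
```

The same edge list serves both questions, which is why a single well-maintained lineage graph pays off twice.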
How lineage is captured
Lineage can be captured in three ways: automatically from pipeline metadata (tools like dbt, Airflow and Spark emit lineage as a byproduct of execution), by parsing SQL and transformation code to extract dependencies, or manually through documentation. Automatic capture is the only approach that scales.
dbt is particularly valuable here — every model transformation is explicitly documented, and dbt can generate lineage graphs automatically from the code. If you are building transformation pipelines and not using dbt, you are likely missing significant lineage visibility.
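After every run, dbt writes its dependency graph to `target/manifest.json`, where each node lists its parents under `depends_on.nodes`. A sketch of turning that into lineage edges, using a heavily simplified inline manifest (real manifests carry many more fields):

```python
# Simplified stand-in for dbt's target/manifest.json: each node
# records its upstream dependencies under depends_on.nodes.
manifest = {
    "nodes": {
        "model.shop.stg_orders": {
            "depends_on": {"nodes": ["source.shop.raw.orders"]}
        },
        "model.shop.fct_orders": {
            "depends_on": {"nodes": ["model.shop.stg_orders"]}
        },
    }
}

# Flatten the manifest into (parent, child) lineage edges.
edges = [
    (parent, name)
    for name, node in manifest["nodes"].items()
    for parent in node["depends_on"]["nodes"]
]
```

In practice you would load the real file with `json.load` and include dbt's `sources` section as well; the point is that the lineage already exists as a byproduct of the transformation code.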
Implementing column-level lineage
Table-level lineage (dataset A feeds dataset B) is relatively straightforward. Column-level lineage (column X from dataset A becomes column Y in dataset B after this transformation) is significantly more valuable and harder to implement.
Start with your most critical downstream reports. Trace the key metrics in those reports back through every transformation to the source systems. Document this manually if needed — even imperfect lineage for your most important data is better than no lineage at all.
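Even manual column-level lineage can be kept in a form a script can traverse. A sketch with hypothetical datasets and columns — each record maps a column to its upstream column plus a note on the transformation, and a recursive walk traces a metric back to source:

```python
# Hypothetical manually documented column-level lineage:
# (dataset, column) -> list of (upstream dataset, upstream column, note).
COLUMN_LINEAGE = {
    ("reports.revenue", "monthly_revenue"): [
        ("marts.fct_orders", "net_amount", "SUM grouped by month")
    ],
    ("marts.fct_orders", "net_amount"): [
        ("raw.orders", "gross_amount", "minus discounts and refunds")
    ],
}

def trace(dataset, column):
    """Return the full upstream path of one column, recursively."""
    path = []
    for src_ds, src_col, note in COLUMN_LINEAGE.get((dataset, column), []):
        path.append((src_ds, src_col, note))
        path.extend(trace(src_ds, src_col))
    return path

trace("reports.revenue", "monthly_revenue")
```

Starting from one critical report and working backwards like this keeps the documentation effort proportional to the value of the data.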
Making lineage useful day-to-day
Lineage information locked in a documentation tool that nobody opens is worthless. Lineage needs to be accessible at the point of use — visible when an analyst is working with a dataset, searchable when a developer is making a change, available when a compliance officer is conducting an audit.
The key is integration with your existing workflows. Lineage that appears in your data catalogue alongside the dataset definition, and that is searchable by column name, gets used. Lineage that lives in a separate system requires deliberate navigation and tends to fall out of date.
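"Searchable by column name" can be as simple as a filter over catalogue records. A sketch with a hypothetical record shape — one row per column-level lineage edge, matched on either end of the edge:

```python
# Hypothetical catalogue records: one row per column-level lineage edge.
RECORDS = [
    {"dataset": "marts.dim_customers", "column": "customer_id",
     "source": "crm.customers.id"},
    {"dataset": "marts.fct_orders", "column": "customer_id",
     "source": "raw.orders.customer_id"},
]

def search_by_column(name):
    """Return every lineage record that mentions the column name,
    whether as the derived column or as the source column."""
    return [
        r for r in RECORDS
        if r["column"] == name or r["source"].endswith("." + name)
    ]

search_by_column("customer_id")
```

The implementation is trivial; the value comes from surfacing the result inside the catalogue page the analyst is already looking at.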
DataLens tracks lineage automatically at the column level — every transformation is recorded and the lineage graph is always current.