What is Data Shapley? Measuring the True Value of Data

Published: October 15, 2025

We often focus on model architectures — but what if the most valuable part of your ML system is your data?
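Data Shapley assigns a contribution score to each training point: its average marginal effect on a performance metric, such as validation accuracy, when the point is added to subsets of the rest of the training data.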
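To make the definition concrete, here is a minimal sketch that computes exact Data Shapley values for a toy problem by averaging each point's marginal gain over every ordering of the training set. Everything in it is an assumption chosen for illustration: the 1-D dataset, the 1-nearest-neighbour model, validation accuracy as the utility, and the function names. Enumerating all orderings is only feasible for a handful of points.

```python
from itertools import permutations

def one_nn_predict(train, x):
    """Predict the label of x with a 1-nearest-neighbour rule on 1-D features."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def utility(train, val):
    """Validation accuracy of 1-NN fitted on `train`; 0.0 for an empty set."""
    if not train:
        return 0.0
    correct = sum(one_nn_predict(train, x) == y for x, y in val)
    return correct / len(val)

def exact_data_shapley(train, val):
    """Average each training point's marginal utility gain over all orderings."""
    n = len(train)
    values = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        subset = []
        prev = utility(subset, val)
        for i in order:
            subset.append(train[i])
            cur = utility(subset, val)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    return [v / len(orderings) for v in values]

# Toy data as (feature, label) pairs; the last training point is deliberately mislabeled.
train = [(0.0, 0), (1.0, 0), (4.0, 1), (0.9, 1)]
val = [(0.2, 0), (0.8, 0), (3.8, 1), (4.2, 1)]

scores = exact_data_shapley(train, val)
for point, score in zip(train, scores):
    print(point, round(score, 4))
```

In this toy setup the mislabeled point receives the lowest score, and it is negative, while the scores sum to the accuracy of the model trained on the full set.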

In my ongoing project, I use TreeExplainer and validation-based importance computation to approximate Shapley values efficiently.
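For context on what "approximating" means here, the sketch below shows a different and commonly used estimator: Monte Carlo sampling of random orderings, with optional truncation once the partial model is close to the full-data score. It is not the TreeExplainer-based pipeline from my project. The nearest-centroid utility, the toy data, and all function and parameter names are hypothetical and chosen only to keep the example self-contained.

```python
import random

def nearest_centroid_accuracy(train, val):
    """Hypothetical utility: validation accuracy of a 1-D nearest-centroid classifier."""
    if not train:
        return 0.0
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    correct = 0
    for x, y in val:
        pred = min(centroids, key=lambda c: abs(centroids[c] - x))
        correct += pred == y
    return correct / len(val)

def monte_carlo_data_shapley(train, val, utility, n_permutations=200,
                             tolerance=0.0, seed=0):
    """Estimate Data Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    n = len(train)
    full_score = utility(train, val)
    totals = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset, prev = [], utility([], val)
        for i in order:
            # Truncation: once the prefix scores within `tolerance` of the full
            # dataset, treat the remaining marginal gains as zero and stop early.
            if tolerance > 0 and abs(full_score - prev) <= tolerance:
                break
            subset.append(train[i])
            cur = utility(subset, val)
            totals[i] += cur - prev
            prev = cur
    return [t / n_permutations for t in totals]

# Toy data as (feature, label) pairs; the last training point is deliberately mislabeled.
train = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1), (4.5, 0)]
val = [(0.5, 0), (1.5, 0), (4.2, 1), (4.8, 1)]

estimates = monte_carlo_data_shapley(train, val, nearest_centroid_accuracy)
print([round(e, 3) for e in estimates])
```

More sampled orderings reduce the variance of the estimates at the cost of more model fits; a positive `tolerance` trades a small bias for fewer fits per ordering.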

Why it matters

Knowing which data points help or hurt your model enables:

Smarter dataset curation
Better fairness and robustness
Insights into which samples actually matter

Imagine debugging a biased model not by tweaking hyperparameters — but by identifying the “toxic” data points.

Subhajit Bag
