Data Exploration
Data exploration focuses on identifying useful, noisy, or problematic samples while training.
Core ideas
Wrap your dataset/dataloader so sample IDs are tracked.
Annotate data with tags.
Temporarily discard problematic samples.
Query subsets for review or targeted retraining.
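To make the four ideas above concrete, here is a minimal pure-Python sketch of the bookkeeping involved. It is not how weightslab implements them: the `SampleRegistry` class and its method names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class SampleRegistry:
    """Hypothetical in-memory store for per-sample tags and discard flags."""

    tags: Dict[int, Set[str]] = field(default_factory=dict)
    discarded: Set[int] = field(default_factory=set)

    def tag(self, sample_ids: Iterable[int], name: str) -> None:
        # Attach a tag to each sample ID, creating the tag set on first use.
        for sid in sample_ids:
            self.tags.setdefault(sid, set()).add(name)

    def set_discarded(self, sample_ids: Iterable[int], discarded: bool = True) -> None:
        # Discarding is only a flag, so it can be undone later.
        if discarded:
            self.discarded.update(sample_ids)
        else:
            self.discarded.difference_update(sample_ids)

    def with_tag(self, name: str) -> List[int]:
        # Query the subset of sample IDs carrying a given tag.
        return sorted(sid for sid, names in self.tags.items() if name in names)

    def active(self, all_ids: Iterable[int]) -> List[int]:
        # Sample IDs that are not currently discarded.
        return [sid for sid in all_ids if sid not in self.discarded]


registry = SampleRegistry()
registry.tag([10, 42, 77], "hard_examples")
registry.set_discarded([5, 9])
print(registry.with_tag("hard_examples"))
print(registry.active(range(12)))
```

Because a discard is stored as a flag rather than a deletion, the discarded samples stay available for later review.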
Minimal example
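import weightslab as wl

# train_dataset is your existing dataset object.
# Wrap it so that sample IDs are tracked.
train_loader = wl.watch_or_edit(
    train_dataset,
    flag="data",
    loader_name="train_loader",
    batch_size=16,
    shuffle=True,
    compute_hash=True,
)

# Annotate samples with a tag.
wl.tag_samples([10, 42, 77], "hard_examples", mode="add")

# Temporarily discard problematic samples.
wl.discard_samples([5, 9], discarded=True)

# Query subsets for review or targeted retraining.
hard_ids = wl.get_samples_by_tag("hard_examples", origin="train")
discarded_ids = wl.get_discarded_samples(origin="train")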
Workflow pattern
flowchart TD
    A[Run Training] --> B[Inspect Signals]
    B --> C{Sample Quality?}
    C -- Poor --> D[Tag / Discard]
    C -- Good --> E[Keep]
    D --> F[Retrain]
    E --> F
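The "Sample Quality?" decision in the flowchart can be sketched as a simple threshold on a per-sample signal. The sketch below assumes the signal is a per-sample loss and that a fixed threshold separates poor from good samples. Both are illustrative assumptions, and `split_by_quality` is a hypothetical helper, not a weightslab function.

```python
from typing import Dict, List, Tuple


def split_by_quality(
    per_sample_loss: Dict[int, float], threshold: float
) -> Tuple[List[int], List[int]]:
    """Split sample IDs into (poor, good) by comparing loss to a threshold."""
    poor = sorted(sid for sid, loss in per_sample_loss.items() if loss > threshold)
    good = sorted(sid for sid, loss in per_sample_loss.items() if loss <= threshold)
    return poor, good


# Illustrative values only: sample ID -> loss observed during training.
losses = {0: 0.12, 1: 2.90, 2: 0.35, 3: 4.10}

poor_ids, good_ids = split_by_quality(losses, threshold=1.0)
print("tag or discard:", poor_ids)
print("keep:", good_ids)
```

The poor IDs are the candidates to pass to `wl.tag_samples` or `wl.discard_samples` before retraining.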
Recommendations
Start with a small tag vocabulary (for example: hard_examples, noisy_label).
Keep discard operations reversible by tracking them in your experiment notes.
Re-evaluate discarded sets periodically after model improvements.
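One way to keep discards reversible is to record each one in a small structured log alongside your experiment notes. The following is a sketch under that assumption. The helpers `log_discard` and `ids_to_restore` and the log format are hypothetical and are not part of weightslab.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def log_discard(log: List[Dict], sample_ids: Iterable[int], reason: str) -> Dict:
    """Append one discard operation to the log and return the new entry."""
    entry = {
        "ids": sorted(sample_ids),
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    log.append(entry)
    return entry


def ids_to_restore(log: List[Dict]) -> List[int]:
    """Collect every sample ID that appears in the log, without duplicates."""
    return sorted({sid for entry in log for sid in entry["ids"]})


discard_log: List[Dict] = []
log_discard(discard_log, [5, 9], "suspected label noise")

# The log is plain JSON, so it can be saved next to other experiment notes.
print(json.dumps(discard_log, indent=2))
print(ids_to_restore(discard_log))
```

When re-evaluating a discarded set, the logged IDs give you the list to restore. Presumably this means calling `wl.discard_samples` with `discarded=False`, though that reverse call is an assumption based on the signature in the minimal example.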