Every robot failure is a training signal. Most robotics teams have the logs — they don't have the pipeline to find the incidents, surface what happened, and get structured labels into the hands of the ML team. We build that pipeline: MCAP ingestion, multi-signal incident detection, Claude-assisted triage, FiftyOne curation, and CVAT annotation handoff. 90 days from raw logs to a queryable failure dataset.
A warehouse robot running 12 hours a day generates hundreds of gigabytes of sensor data. Most of it is uneventful. Somewhere in that stream is the 90-second window where localization degraded, recovery behaviors triggered, and the mission aborted — and that window contains more information about model failure modes than a thousand hours of clean operation.
The problem isn't data volume. It's that the failure signal is buried: AMCL covariance spikes on one topic, path deviation appears on another, and the behavior tree abort is logged on a third. Nobody is correlating them manually across a fleet of 50 robots. The incidents that reach the ML team are the ones dramatic enough that a human noticed, which means the subtle, high-signal failures go unlabeled.
The Physical AI Data Flywheel is the pipeline that surfaces the subtle ones.
We ingest MCAP files (the default rosbag2 storage format in recent ROS 2 releases, and the format Foxglove reads natively) and parse them into one structured dataframe per topic. The target topics vary by robot type. For navigation failures they are AMCL pose, lidar scans, behavior tree logs, odometry, and velocity commands. We handle compressed images, point clouds, and custom message types. The ingestion layer normalizes timestamps and joins topics into a coherent incident view.
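The timestamp join can be sketched as follows. This is a minimal illustration, not the production ingestion layer: it assumes messages have already been decoded from MCAP into `(timestamp_ns, value)` tuples per topic, and the function names, sample values, and 50 ms tolerance are hypothetical.

```python
from bisect import bisect_left

def nearest_value(samples, times, t_ns, tolerance_ns):
    """Return the value whose timestamp is closest to t_ns, or None if too far away."""
    if not samples:
        return None
    i = bisect_left(times, t_ns)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = samples[max(i - 1, 0):i + 1]
    best_t, best_v = min(candidates, key=lambda s: abs(s[0] - t_ns))
    return best_v if abs(best_t - t_ns) <= tolerance_ns else None

def join_topics(topics, base_topic, tolerance_ns=50_000_000):
    """Build one row per base-topic message with the nearest sample from each other topic."""
    # Normalize timestamps to seconds since the earliest message in any topic.
    t0 = min(samples[0][0] for samples in topics.values() if samples)
    times = {name: [t for t, _ in samples] for name, samples in topics.items()}
    rows = []
    for t_ns, value in topics[base_topic]:
        row = {"t_s": (t_ns - t0) / 1e9, base_topic: value}
        for name, samples in topics.items():
            if name != base_topic:
                row[name] = nearest_value(samples, times[name], t_ns, tolerance_ns)
        rows.append(row)
    return rows

# Illustrative values only: each topic is a time-sorted list of (timestamp_ns, scalar).
topics = {
    "/amcl_pose": [(1_000_000_000, 0.02), (1_100_000_000, 0.31)],
    "/odom": [(1_010_000_000, 0.50), (1_090_000_000, 0.48)],
    "/cmd_vel": [(1_400_000_000, 0.0)],
}
rows = join_topics(topics, "/amcl_pose")
```

Topics with no sample inside the tolerance come back as `None` rather than being silently forward-filled, so gaps in a sensor stream stay visible in the joined view.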
We monitor three or more independent signals simultaneously. An incident window is confirmed when at least two signals fire within a configurable interval of each other, which filters out most single-sensor noise and the false positives it causes. For navigation failures, the signals are: localization covariance spike, plan deviation beyond threshold, and Nav2 abort events. The combination is what matters; any single signal in isolation may be routine.
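A minimal sketch of the two-of-N confirmation rule, assuming each detector has already emitted timestamped events. The `SignalEvent` type, the signal names, and the 5-second default are illustrative assumptions, not the production detector.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SignalEvent:
    signal: str  # e.g. "covariance_spike", "plan_deviation", "nav2_abort"
    t_s: float   # event time in seconds

def confirm_incidents(events, window_s=5.0, min_signals=2):
    """Return (start_s, end_s, signals) for each group of events in which at least
    min_signals distinct signals fire within window_s of the first event."""
    events = sorted(events, key=lambda e: e.t_s)
    incidents = []
    i = 0
    while i < len(events):
        # Collect every event within window_s of the anchor event.
        j = i
        while j < len(events) and events[j].t_s - events[i].t_s <= window_s:
            j += 1
        group = events[i:j]
        signals = {e.signal for e in group}
        if len(signals) >= min_signals:
            incidents.append((group[0].t_s, group[-1].t_s, sorted(signals)))
            i = j  # skip past the confirmed window
        else:
            i += 1  # a lone signal is treated as routine
    return incidents

events = [
    SignalEvent("covariance_spike", 10.0),  # isolated, so not an incident
    SignalEvent("covariance_spike", 40.0),
    SignalEvent("plan_deviation", 42.5),
    SignalEvent("nav2_abort", 44.0),
    SignalEvent("plan_deviation", 90.0),    # isolated, so not an incident
]
incidents = confirm_incidents(events)
```

In this example only the cluster between 40.0 s and 44.0 s is confirmed; the two isolated events are dropped.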
We send structured telemetry metrics from the incident window to Claude: covariance values, scan statistics, diagnostic messages, the sequence and timing of signal events. Claude returns a triage hypothesis — what most likely caused this incident, confidence level with justification, investigation steps for the robotics engineer, and data gaps that would increase diagnostic confidence.
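The structured exchange might look like the sketch below. It builds the JSON context for the prompt and validates the reply; the model call itself is deliberately left out. All field names, the incident ID, and the sample reply are hypothetical placeholders, not real model output or the production schema.

```python
import json

# Assumed response schema, mirroring the four outputs described above.
REQUIRED_FIELDS = {"hypothesis", "confidence", "justification",
                   "investigation_steps", "data_gaps"}

def build_triage_payload(incident_id, signal_events, metrics, diagnostics):
    """Serialize the incident window into a JSON string to include in the prompt."""
    payload = {
        "incident_id": incident_id,
        # Order events by time so the sequence and timing are explicit.
        "signal_events": sorted(signal_events, key=lambda e: e["t_s"]),
        "metrics": metrics,
        "diagnostics": diagnostics,
    }
    return json.dumps(payload, indent=2, sort_keys=True)

def parse_triage_response(text):
    """Parse the JSON reply and reject replies that omit a required field."""
    result = json.loads(text)
    missing = REQUIRED_FIELDS - set(result)
    if missing:
        raise ValueError(f"triage response missing fields: {sorted(missing)}")
    return result

payload = build_triage_payload(
    "inc-0001",
    [{"signal": "nav2_abort", "t_s": 44.0}, {"signal": "covariance_spike", "t_s": 40.0}],
    {"max_covariance": 0.31, "scan_dropout_rate": 0.4},
    ["placeholder diagnostic message"],
)

# Hand-written stand-in for a reply, used only to exercise the validator.
sample_reply = json.dumps({
    "hypothesis": "placeholder hypothesis",
    "confidence": "medium",
    "justification": "placeholder justification",
    "investigation_steps": ["placeholder step"],
    "data_gaps": ["placeholder gap"],
})
triage = parse_triage_response(sample_reply)
```

Validating the reply before caching it keeps malformed output from reaching the dashboard.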
This is not a chatbot. Claude receives structured telemetry context and returns structured output that gets cached and served from the dashboard. The language is intentional: "triage hypothesis" and "investigation steps." The system augments engineer judgment; it doesn't replace it.
We extract 50-100 frames from the MCAP spanning the incident window: baseline frames from before the incident, frames across each failure phase, and post-incident frames. Each frame is tagged with its phase (normal, lidar_degradation, localization_uncertainty, recovery_behavior, abort) and annotated with numeric metadata from the telemetry (covariance value, path deviation, scan dropout rate at that timestamp).
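A minimal sketch of phase-aware frame selection, assuming phase boundaries have already been derived from the detection signals. The interval times, telemetry values, and function names are illustrative. Post-incident frames would be handled as one more interval.

```python
from bisect import bisect_right

def sample_frames(phases, frames_per_phase):
    """Pick evenly spaced timestamps inside each phase and tag them with the phase name."""
    frames = []
    for start_s, end_s, phase in phases:
        step = (end_s - start_s) / frames_per_phase
        for k in range(frames_per_phase):
            # Sample sub-interval midpoints so adjacent phases never share a timestamp.
            frames.append({"t_s": start_s + (k + 0.5) * step, "phase": phase})
    return frames

def attach_metadata(frames, telemetry):
    """Annotate each frame with the latest telemetry row at or before its timestamp."""
    times = [row["t_s"] for row in telemetry]
    for frame in frames:
        i = bisect_right(times, frame["t_s"]) - 1
        if i >= 0:
            frame.update({k: v for k, v in telemetry[i].items() if k != "t_s"})
    return frames

# Hypothetical 90-second incident window split into the five phases named above.
phases = [
    (0.0, 30.0, "normal"),
    (30.0, 50.0, "lidar_degradation"),
    (50.0, 70.0, "localization_uncertainty"),
    (70.0, 85.0, "recovery_behavior"),
    (85.0, 90.0, "abort"),
]
# Illustrative telemetry rows, sorted by time.
telemetry = [
    {"t_s": 0.0, "covariance": 0.02, "scan_dropout_rate": 0.0},
    {"t_s": 30.0, "covariance": 0.05, "scan_dropout_rate": 0.35},
    {"t_s": 50.0, "covariance": 0.40, "scan_dropout_rate": 0.40},
]
frames = attach_metadata(sample_frames(phases, frames_per_phase=12), telemetry)
```

Sampling a fixed number of frames per phase gives short phases such as the abort the same representation as long ones. A production version might weight phases differently.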
We compute CLIP embeddings for every extracted frame. This enables visual similarity search ("find frames that look like this failure frame") and surfaces clusters of visually similar incidents across the dataset. Teams that run this on months of historical logs regularly find incident patterns they didn't know existed.
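The similarity search reduces to a nearest-neighbour lookup over embedding vectors. The sketch below uses cosine similarity on three-dimensional toy vectors. Real CLIP embeddings have hundreds of dimensions, and the frame IDs and values here are placeholders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def most_similar(query_id, embeddings, k=3):
    """Return the k frame IDs whose embeddings are closest to the query frame."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), frame_id)
        for frame_id, vec in embeddings.items()
        if frame_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [frame_id for _, frame_id in scored[:k]]

# Toy stand-ins for per-frame embeddings keyed by a hypothetical frame ID.
embeddings = {
    "inc1_f40": [0.9, 0.1, 0.0],
    "inc2_f12": [0.8, 0.2, 0.1],
    "inc3_f03": [0.0, 0.1, 0.9],
    "inc1_f02": [0.1, 0.9, 0.1],
}
neighbours = most_similar("inc1_f40", embeddings, k=2)
```

A linear scan like this is fine for a few thousand frames. At fleet scale an approximate nearest-neighbour index would be the usual substitute.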
The FiftyOne dataset exports to the CVAT for images 1.1 format, producing a structured annotation task ready for the labeling team. Frame-level tags from the pipeline become CVAT labels. The labeling team opens structured tasks, not a folder of unlabeled images. This is where the flywheel closes: labeled data flows back to model training, which improves the behaviors that generated the incidents.
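To show how frame-level tags can map onto CVAT labels, here is a simplified sketch that writes image-level tags into a CVAT-style XML document using only the standard library. It approximates the layout and omits the metadata block that real exports carry. In practice FiftyOne's own exporter would write the file. The filenames and image sizes are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_cvat_xml(frames):
    """Build a simplified CVAT-style XML string with one image-level tag per frame."""
    root = ET.Element("annotations")
    ET.SubElement(root, "version").text = "1.1"
    for i, frame in enumerate(frames):
        image = ET.SubElement(
            root, "image",
            id=str(i),
            name=frame["filename"],
            width=str(frame["width"]),
            height=str(frame["height"]),
        )
        # The pipeline's phase tag becomes the label the annotator sees.
        ET.SubElement(image, "tag", label=frame["phase"])
    return ET.tostring(root, encoding="unicode")

# Hypothetical frames carrying the phase tags assigned earlier in the pipeline.
frames = [
    {"filename": "inc-0001/frame_0000.jpg", "width": 1280, "height": 720, "phase": "normal"},
    {"filename": "inc-0001/frame_0059.jpg", "width": 1280, "height": 720, "phase": "abort"},
]
xml_text = build_cvat_xml(frames)
```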
Foxglove is where engineers review incidents before labeling. We configure a panel layout that opens the incident replay at the fault window: raw lidar vs. corrupted lidar side by side, particle cloud scatter, AMCL covariance plot over time, and camera feed. The visual difference between clean and degraded sensor data is obvious at a glance, and it's what makes triage decisions defensible. We embed Foxglove replay directly in the dashboard so stakeholders can scrub through incidents without a ROS environment.
The name matters. A one-time data cleanup project isn't a flywheel. The flywheel is what happens after the first 90 days.
New incidents generate new MCAP files. The detection pipeline runs on new logs automatically. New incidents flow into the FiftyOne dataset tagged and ready for review. The labeling team has a continuous stream of structured annotation tasks instead of periodic fire drills. Model retraining triggers when the dataset grows past a threshold. The model improves, which changes the failure mode distribution, which generates new training signal.
Most teams that engage us for the initial 90-day build extend into ongoing pipeline operations. The alternative is a one-time dataset that ages out of relevance as the robot software and operational environment evolve.
Tell us about your fleet: robot type, log volume, how incidents are currently surfaced, and what your ML team is blocked on. We'll tell you whether the Physical AI Data Flywheel maps to your situation and what a 90-day engagement would look like.