Overview
This study relies on 23 months of Stanford Energy Control Lab data to analyze lithium-ion battery capacity fade, impedance growth, and energy degradation. Advanced statistics, exploratory data analysis, anomaly detection, and machine learning combine to produce actionable State of Health (SOH) forecasts for BMS deployments, EV maintenance, and grid-scale storage diagnostics.
Motivation
- Capacity fade, impedance growth, and thermal stress increase risk for EVs and stationary storage alike.
- Accurate SOH estimation informs warranty exposure, Remaining Useful Life planning, and chemistry R&D.
- The pipeline trains on lab cycling data that follows realistic driving (UDDS discharge) and charging (CC/CV) profiles, so predictions reflect usage-like load patterns rather than constant-current tests alone.
Dataset Description
Two complementary sources feed the models:
- Cycling dataset: Captures charge/discharge behavior with voltage, current, temperature, capacity, step index, and test time across CC/CV charging and UDDS discharge profiles.
- Diagnostic dataset: Periodic capacity tests, EIS, and HPPC sessions expose long-term degradation trends that daily cycling alone might miss.
Exploratory Data Analysis
- Voltage & Current: Charging stabilizes around 4.20 V while current stays within controlled limits.
- Capacity Decline: Early formation loss appears as a sharp drop, followed by gradual stabilization.
- Temperature: Readings cluster near 23 °C with spikes during high-load events.
- Correlation: Strong relationships such as Discharge Energy ↔ Test Time and Voltage ↔ Step Index guide feature engineering.
- Anomaly Detection: Isolation Forest flags abnormal voltage dips, thermal spikes, and irregular discharge episodes, improving downstream model reliability.
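The Isolation Forest step above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's actual code: the two feature columns, the injected anomalies, and the `contamination` value are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for cycling records: voltage (V) and temperature (deg C).
n_normal = 500
voltage = rng.normal(3.7, 0.05, n_normal)
temperature = rng.normal(23.0, 0.5, n_normal)

# Hypothetical anomalies: deep voltage dips that coincide with thermal spikes.
anom_voltage = np.array([2.50, 2.60, 2.40, 2.55, 2.45])
anom_temperature = np.array([40.0, 42.0, 45.0, 41.0, 43.0])

X = np.column_stack([
    np.concatenate([voltage, anom_voltage]),
    np.concatenate([temperature, anom_temperature]),
])

# contamination is the assumed share of outliers; it needs tuning on real data.
detector = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
labels = detector.fit_predict(X)  # -1 marks an anomaly, 1 marks a normal record

is_anomaly = labels == -1
X_clean = X[~is_anomaly]  # records kept for downstream modeling
print(f"flagged {is_anomaly.sum()} of {len(X)} records")
```

Filtering with `X_clean` before training is one way to apply the outlier removal; keeping the flag as an extra column instead would let later stages inspect the flagged cycles rather than discard them.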


Machine Learning Pipeline
Two prediction tasks drive the modeling:
- State of Health (SOH) Prediction
- Discharge Energy (Wh) Prediction
Features come from both cycling and diagnostic streams, focusing on signals with clear physical meaning.
SOH Prediction (Random Forest)
- Inputs: Test time, voltage, current, and charge energy.
- Performance: R² = 0.9875, MAE = 0.14, RMSE = 0.24.
- Ten-fold cross-validation supports the model's ability to generalize, which makes it a candidate for real-time BMS integration.
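A Random Forest regressor with ten-fold cross-validation on the four listed inputs might look like the sketch below. The data is synthetic, and the value ranges, the made-up SOH relationship, and the hyperparameters are assumptions for illustration; the scores it produces say nothing about the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(42)
n = 600

# Hypothetical feature ranges standing in for the real cycling columns.
test_time = rng.uniform(0, 1000, n)
voltage = rng.uniform(3.0, 4.2, n)
current = rng.uniform(-5.0, 5.0, n)
charge_energy = rng.uniform(5.0, 17.0, n)
X = np.column_stack([test_time, voltage, current, charge_energy])

# Made-up target: SOH (%) declines with test time, plus measurement noise.
soh = 100.0 - 0.02 * test_time + 0.5 * (voltage - 3.6) + rng.normal(0, 0.2, n)

model = RandomForestRegressor(n_estimators=100, random_state=42)
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(
    model, X, soh, cv=cv,
    scoring=("r2", "neg_mean_absolute_error", "neg_root_mean_squared_error"),
)

# scikit-learn reports error metrics negated, so flip the sign back.
r2 = scores["test_r2"].mean()
mae = -scores["test_neg_mean_absolute_error"].mean()
rmse = -scores["test_neg_root_mean_squared_error"].mean()
print(f"R2={r2:.4f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```

Shuffled folds are used here for simplicity. On time-ordered cycling data, neighbouring samples can land in both the training and test folds, so splitting by cell or by time block (for example with `GroupKFold`) gives a stricter check.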

Energy Prediction (Linear Regression)
Discharge energy is modeled as a linear combination of test time and step index:
E_discharge = β0 + β1 * TestTime + β2 * StepIndex
- R²: 0.7321
- RMSE: 3.04
- MAE: 2.37
This lightweight estimator gives fast energy forecasts for embedded systems.
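As a rough sketch of this two-feature linear model, the example below fits an intercept and two coefficients for test time and step index with scikit-learn. The data and the "true" coefficients are invented for the example, and the metrics are computed in-sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(1)
n = 400

# Hypothetical inputs: elapsed test time and the cycler's step index.
test_time = rng.uniform(0, 100, n)
step_index = rng.integers(1, 15, n).astype(float)
X = np.column_stack([test_time, step_index])

# Made-up ground truth used only to generate the synthetic target.
true_b0, true_b1, true_b2 = 1.0, 0.10, 0.30
energy = (true_b0 + true_b1 * test_time + true_b2 * step_index
          + rng.normal(0, 0.5, n))

reg = LinearRegression().fit(X, energy)
pred = reg.predict(X)

b0 = reg.intercept_   # fitted intercept
b1, b2 = reg.coef_    # fitted weights for test time and step index

r2 = r2_score(energy, pred)
mae = mean_absolute_error(energy, pred)
rmse = np.sqrt(mean_squared_error(energy, pred))
print(f"b0={b0:.3f} b1={b1:.3f} b2={b2:.3f}  R2={r2:.3f} MAE={mae:.3f} RMSE={rmse:.3f}")
```

Once fitted, a prediction is a single multiply-add per feature, which is what makes this kind of estimator attractive on embedded hardware.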
Clustering Analysis
K-Means segments cycles into three degradation profiles:
- Cluster 1 – Healthy: Highest capacity and lowest temperature variance.
- Cluster 2 – Moderate Aging: Predictable voltage behavior with mid-level degradation.
- Cluster 3 – Severe Aging: Frequent voltage drops, thermal instability, and minimal capacity.
These clusters support targeted maintenance and replacement strategies.
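The three-way segmentation could be reproduced along the lines of the sketch below. The per-cycle features (capacity and temperature variance), their values, and the rule that names clusters by mean capacity are assumptions for illustration; the source does not state which features were clustered.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Hypothetical per-cycle features: (mean capacity in Ah, temperature variance).
group_centers = [(4.8, 0.2), (4.3, 0.6), (3.6, 1.5)]
X = np.vstack([
    np.column_stack([rng.normal(cap, 0.05, 100), rng.normal(tvar, 0.03, 100)])
    for cap, tvar in group_centers
])

# Scale features so capacity and temperature variance carry equal weight.
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# K-Means label ids are arbitrary, so rank clusters by mean capacity
# (highest first) before attaching human-readable names.
mean_cap = np.array([X[km.labels_ == k, 0].mean() for k in range(3)])
order = np.argsort(-mean_cap)
names = {
    int(order[0]): "Healthy",
    int(order[1]): "Moderate Aging",
    int(order[2]): "Severe Aging",
}
profiles = np.array([names[int(k)] for k in km.labels_])
print({name: int((profiles == name).sum()) for name in names.values()})
```

Naming clusters after the fit, rather than relying on label order, keeps the Healthy / Moderate Aging / Severe Aging mapping stable across random seeds.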
Key Findings
- The Random Forest SOH model reaches R² = 0.9875 (MAE = 0.14, RMSE = 0.24), which this study reports as surpassing published benchmarks.
- Test time, voltage, current, and step index emerge as dominant predictors.
- K-Means reveals distinct degradation behavior groups for proactive decision-making.
- Isolation Forest outlier removal boosts data quality and model trustworthiness.
- Even small temperature shifts correlate strongly with SOH decline, underscoring thermal management needs.
Practical Applications
- BMS: Real-time SOH and Remaining Useful Life estimates with automated degradation alerts.
- EV Fleets: Predictive maintenance, range-accuracy improvements, and warranty optimization.
- R&D & Manufacturing: Early-cycle loss diagnostics, new chemistry evaluation, and improved formation processes.
Conclusion
Combining statistical analysis, EDA, Random Forests, linear regression, clustering, and anomaly detection provides a full picture of lithium-ion battery aging. With R² = 0.9875 on SOH predictions, the pipeline demonstrates how data-driven diagnostics can transform EV and industrial energy storage operations while laying a solid foundation for future predictive battery analytics.
Summary
I built a comprehensive battery-aging analysis that ingests 23 months of cycling plus diagnostic data, models capacity fade and energy degradation with high-accuracy Random Forest and regression baselines, and pinpoints risky cycles via K-Means and Isolation Forest.