COEQ ML Model Testing Checklist

What Google tests — and what your organisation should too.

At COEQ, we believe AI quality is not optional — it is the foundation of trust. One of the most referenced frameworks in the AI testing space comes from Google: a structured set of 28 ML tests. It covers four critical domains — Data, Model Development, Infrastructure, and Monitoring — and it remains the benchmark every serious AI team should measure itself against.

Here is the checklist in full, with COEQ commentary on why each domain matters.

ML Data

Poor data is the silent killer of ML systems. Before a model is ever trained, the quality, governance, and testability of your data pipeline determines everything that follows.

  • Feature expectations are captured in a schema
  • All features are beneficial
  • No feature's cost is too much
  • Features adhere to meta-level requirements
  • The data pipeline has appropriate privacy controls
  • New features can be added quickly
  • All input feature code is tested
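The first item on the list, capturing feature expectations in a schema, is something you can start enforcing with very little machinery. The sketch below shows one possible shape for such a check; the schema format, feature names, and ranges are illustrative assumptions, not part of Google's checklist itself.

```python
# Minimal sketch: validating incoming feature rows against a declared schema.
# FEATURE_SCHEMA, the feature names, and the ranges are illustrative assumptions.

FEATURE_SCHEMA = {
    "age":     {"type": int,   "min": 0,   "max": 120},
    "income":  {"type": float, "min": 0.0, "max": 1e7},
    "country": {"type": str,   "allowed": {"DE", "FR", "US"}},
}

def validate_row(row: dict) -> list:
    """Return a list of schema violations for one feature row."""
    errors = []
    for name, spec in FEATURE_SCHEMA.items():
        if name not in row:
            errors.append(f"missing feature: {name}")
            continue
        value = row[name]
        if not isinstance(value, spec["type"]):
            errors.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "min" in spec and value < spec["min"]:
            errors.append(f"{name}: below minimum {spec['min']}")
        if "max" in spec and value > spec["max"]:
            errors.append(f"{name}: above maximum {spec['max']}")
        if "allowed" in spec and value not in spec["allowed"]:
            errors.append(f"{name}: value {value!r} not allowed")
    return errors

good = {"age": 34, "income": 52000.0, "country": "DE"}
bad  = {"age": -3, "income": 52000.0, "country": "XX"}
print(validate_row(good))  # → []
print(validate_row(bad))   # → two violations: age range, country not allowed
```

A check like this belongs in the data pipeline itself, so violations are caught before training or serving ever sees the row.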

Model Development

A model that performs well in non-production environments but fails in production is a liability, not an asset. These checks ensure your model is robust, fair, and genuinely better than the alternatives.

  • Model specs are reviewed and submitted
  • Offline and online metrics correlate
  • All hyperparameters have been tuned
  • The impact of model staleness is known
  • A simpler model is not better
  • Model quality is sufficient on important data slices
  • The model is tested for considerations of inclusion
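Checking quality on important data slices, rather than only in aggregate, is one of the easiest of these tests to automate. The sketch below shows one way to do it; the slice key, threshold, and toy data are illustrative assumptions.

```python
# Minimal sketch: computing model accuracy per data slice and flagging slices
# that fall below a quality threshold. Slice key, threshold, and the example
# data are illustrative assumptions.

from collections import defaultdict

def accuracy_by_slice(examples, slice_key):
    """Accuracy per slice; each example is (features, label, prediction)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for features, label, prediction in examples:
        s = features[slice_key]
        totals[s] += 1
        if prediction == label:
            hits[s] += 1
    return {s: hits[s] / totals[s] for s in totals}

examples = [
    ({"region": "EU"}, 1, 1),
    ({"region": "EU"}, 0, 0),
    ({"region": "EU"}, 1, 1),
    ({"region": "US"}, 1, 0),   # miss on the US slice
    ({"region": "US"}, 0, 0),
]

per_slice = accuracy_by_slice(examples, "region")
THRESHOLD = 0.8
failing = {s: acc for s, acc in per_slice.items() if acc < THRESHOLD}
print(per_slice)  # → {'EU': 1.0, 'US': 0.5}
print(failing)    # → {'US': 0.5}
```

A model with 90% aggregate accuracy can still be failing badly on a slice that matters, which is exactly what this test is designed to surface.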

ML Infrastructure

Even a great model will fail if the infrastructure around it is fragile. Reproducibility, testability, and rollback capability are non-negotiable in production ML systems.

  • Training is reproducible
  • Model specs are unit tested
  • The ML pipeline is integration tested
  • Model quality is validated before serving
  • The model is debuggable
  • Models are canaried before serving
  • Serving models can be rolled back
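The reproducibility item at the top of this list can be expressed as a literal test: train twice with the same seed and assert the results match. The toy "training" routine below is a stand-in for a real pipeline; the point is the assertion, not the model.

```python
# Minimal sketch: asserting that training is reproducible under a fixed seed.
# The train() routine here is a toy stand-in for a real training pipeline.

import random

def train(seed: int) -> list:
    """Stand-in training run: returns 'weights' derived from seeded randomness."""
    rng = random.Random(seed)
    weights = [0.0] * 3
    for _ in range(100):              # toy "gradient steps"
        i = rng.randrange(3)
        weights[i] += rng.uniform(-0.1, 0.1)
    return weights

run_a = train(seed=42)
run_b = train(seed=42)
assert run_a == run_b, "training is not reproducible under a fixed seed"
print("reproducible:", run_a == run_b)  # → reproducible: True
```

In a real pipeline the same assertion covers data ordering, initialisation, and any framework-level sources of nondeterminism — which is why it so often fails the first time it is written.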

Monitoring Tests

Deployment is not the finish line. ML systems degrade silently — through data drift, model staleness, and infrastructure regression. Monitoring is how you keep a model honest over time.

  • Dependency changes result in notification
  • Data invariants hold for inputs
  • Training and serving are not skewed
  • Models are not too stale
  • Models are numerically stable
  • Computing performance has not regressed
  • Prediction quality has not regressed
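The "data invariants hold for inputs" check is often the first monitoring test teams add: record the ranges seen in training, then alert when serving inputs fall outside them. The invariants, feature names, and batch below are illustrative assumptions.

```python
# Minimal sketch: monitoring that serving inputs stay within invariants
# observed in training data. Ranges and feature names are illustrative.

TRAINING_INVARIANTS = {
    "age":    (18, 90),      # (min, max) observed during training
    "income": (0.0, 5e5),
}

def check_invariants(batch: list) -> list:
    """Return alert messages for any feature outside its training range."""
    alerts = []
    for i, row in enumerate(batch):
        for name, (lo, hi) in TRAINING_INVARIANTS.items():
            value = row.get(name)
            if value is None or not (lo <= value <= hi):
                alerts.append(f"row {i}: {name}={value!r} outside [{lo}, {hi}]")
    return alerts

serving_batch = [
    {"age": 35, "income": 60000.0},
    {"age": 140, "income": 60000.0},   # drifted input
]
print(check_invariants(serving_batch))
```

Wired into a metrics or alerting system, a check like this turns silent data drift into a notification — which is the whole point of the monitoring domain.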

The COEQ Perspective

At COEQ, the ML model testing checklist sits at the core of how we assess, advise, and test AI systems for our clients. If your organisation is building or deploying ML systems without a structured test framework, you are not just taking a technical risk — you are taking a reputational one.

How many of these 28 checks can your team tick off today? If the answer is uncertain, that is exactly where COEQ can help.