How to run an MLI audit

An MLI audit produces one score per criterion, four pillar averages, a total, and a findings list ordered by score impact. The implementation guides tell you how to fix one thing; this page tells you how to find what to fix — and how to compare one site against another when you're auditing more than your own.

Two paths

With the Chrome extension

The fastest path. Install the extension, load any page in Chrome, and click the toolbar icon. Claude runs the 12 criteria against the live page and returns a report in under a minute.

Best for: auditing your own site, or any single site where you don't need the reasoning visible step by step.

Install for Chrome →

Manually

The slower, more transparent path. Read the criterion definitions on the methodology page, fetch the page (browser view-source, curl, or your audit tool of choice), score each criterion 1–5 against the rubric, and record a skip reason where a criterion doesn't apply.

Best for: comparative audits across a sample, sites where the extension can't run (intranet pages, gated content, non-Chrome environments), or any case where the reasoning needs to be fully visible — a researcher publishing scores has to show their work; an implementer fixing their own site usually doesn't.

Read the criteria spec →

Before you start

Two choices to make before either path.

Pick the URL

Score the page an agent would fetch when a user's query lands — that usually isn't the homepage. For a legal aid clinic, audit the asylum services page. For a regional bakery, audit the wholesale orders page. The homepage is the right target only when the homepage is the service surface (a single-program nonprofit, a one-product site).

Pick the comparison set

A single score is a measurement. A sample is a finding. If you're auditing your own site, one URL is enough — you'll fix what's broken regardless of how anyone else scores. If you're studying a sector — community housing organizations in a city, immigration legal services across a region — pick the sample first and audit consistently: same URL convention, same audit-date window, same rubric pass.
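A consistency check on the sample can be sketched as follows. This is a minimal sketch under stated assumptions: the record keys (`url`, `audited_on`, `rubric`) and the 14-day window are hypothetical, since this page does not fix a window length or a record format.

```python
from datetime import date

def check_sample(audits, max_window_days=14):
    """Return a list of consistency problems for a comparison sample.

    Each audit is a dict with hypothetical keys: "url", "audited_on" (a date),
    and "rubric" (a label for the rubric pass that was used).
    """
    if not audits:
        return []
    problems = []
    dates = [a["audited_on"] for a in audits]
    span = (max(dates) - min(dates)).days
    if span > max_window_days:
        problems.append(f"audit dates span {span} days (limit {max_window_days})")
    rubrics = {a["rubric"] for a in audits}
    if len(rubrics) > 1:
        problems.append(f"mixed rubric passes: {sorted(rubrics)}")
    return problems

# Placeholder sample: two audits taken 25 days apart with the same rubric pass.
sample = [
    {"url": "https://example.org/services", "audited_on": date(2025, 3, 3), "rubric": "pass-1"},
    {"url": "https://example.net/services", "audited_on": date(2025, 3, 28), "rubric": "pass-1"},
]
print(check_sample(sample))  # ['audit dates span 25 days (limit 14)']
```

An empty list means the sample passes both checks. URL convention is harder to check mechanically and is left to the auditor here.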

Reading the report

The score scale

Each criterion is scored 1–5. 5 means fully implemented to the rubric; 1 means absent. The rubric for each criterion lives on the methodology page, and the implementation guides walk you from a low score to a high one.

Pillar averages

Each pillar (Identity, Reachability, Structure, Currency) averages its three criterion scores. Skipped criteria are excluded from the average, not zeroed — a multilingual-reach criterion that doesn't apply to a single-language site shouldn't drag the Reachability pillar down.
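The difference between excluding and zeroing a skipped criterion is easiest to see with numbers. The sketch below assumes a skip is recorded as `None`. The scores are illustrative, and the IDs `R1` and `R2` are assumed from the three-per-pillar pattern rather than taken from this page.

```python
from statistics import mean

def pillar_average(scores):
    """Average the scored criteria in one pillar; None marks a skipped criterion."""
    kept = [s for s in scores.values() if s is not None]
    return mean(kept) if kept else None

reachability = {"R1": 4, "R2": 5, "R3": None}  # R3 skipped on a single-language site
print(pillar_average(reachability))  # 4.5
```

Had the skipped criterion been zeroed instead, the same pillar would average (4 + 5 + 0) / 3 = 3.0.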

Skip-if criteria

Three criteria can be skipped when the underlying feature doesn't apply: R3 (multilingual reach), C2 (time-sensitive markup), and C3 (eligibility, cost, and availability). The audit records an auditable reason for every skip — for example, "no Event or Offer schema present; no hours mentioned in prose" — so a reader can challenge the decision. Skip-if is a judgment call, not a free pass.
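In a manual audit, the skip rules above can be enforced at the point where a result is recorded. The following is a sketch only: the `CriterionResult` class and its field names are hypothetical and are not the extension's report format.

```python
from dataclasses import dataclass
from typing import Optional

SKIPPABLE = {"R3", "C2", "C3"}  # the three skip-if criteria named above

@dataclass(frozen=True)
class CriterionResult:
    """One audited criterion: either a 1-5 score or a skip with a reason."""
    criterion: str
    score: Optional[int] = None        # None means the criterion was skipped
    skip_reason: Optional[str] = None  # must be filled in for every skip

    def __post_init__(self):
        if self.score is not None:
            if not 1 <= self.score <= 5:
                raise ValueError(f"{self.criterion}: score must be 1-5")
            return
        if self.criterion not in SKIPPABLE:
            raise ValueError(f"{self.criterion} is not a skip-if criterion")
        if not (self.skip_reason or "").strip():
            raise ValueError(f"{self.criterion}: a skip needs a written reason")

ok = CriterionResult(
    "C2", skip_reason="no Event or Offer schema present; no hours mentioned in prose"
)
```

Trying to skip `I1`, or to skip `C2` without a reason, raises `ValueError`, so an unexplained skip cannot reach the report.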

The Public-interest tag

Five criteria — I2, I3, R3, C2, C3 — carry a Public-interest tag in the report. The tag tells you which criteria carry stakes for community organizations and underrepresented language groups. It does not change the math: findings rank by score impact alone, and readers apply their own weighting if their context calls for it.

The total score

The total is the unweighted average of the four pillar averages. It's useful for summarizing a single site, and for comparing one site against another in the same audit window. It is not a verdict: a site can score well overall and still fail a load-bearing criterion that matters more in context than the score reflects.
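As a worked example with illustrative pillar averages (not real audit results), the total counts each pillar once, regardless of how many of its criteria were scored:

```python
from statistics import mean

def total_score(pillar_averages):
    """Unweighted mean of the pillar averages; each pillar counts once."""
    return mean(pillar_averages.values())

pillars = {"Identity": 2.0, "Reachability": 4.5, "Structure": 4.0, "Currency": 3.5}
print(total_score(pillars))  # 3.5
```

A total of 3.5 looks acceptable on its own, yet the Identity average of 2.0 sits well below it. That gap is the kind of case where the total should not be read as a verdict.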

What to do with results

If you're an implementer

Sort findings by score impact and start at the top of the list, where the impact is largest. The implementation guides are linked from each criterion in the report. Identity scores tend to be load-bearing: when I1, I2, or I3 are weak, fix those first, because gains on later pillars carry less weight if agents still can't say who the organization is.
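This page does not define how score impact is computed. The sketch below assumes one plausible definition: the amount the total would rise if a criterion reached 5, which follows from the total being an average of pillar averages. The function name, the two-pillar example, and the ID `C1` (assumed from the three-per-pillar pattern) are all illustrative.

```python
def rank_findings(pillars, max_score=5):
    """Order scored criteria by assumed impact on the total, largest first."""
    findings = []
    for criteria in pillars.values():
        scored = {c: s for c, s in criteria.items() if s is not None}
        for criterion, score in scored.items():
            # Assumed impact: rise in the total if this criterion reached max_score.
            impact = (max_score - score) / len(scored) / len(pillars)
            findings.append((criterion, score, round(impact, 3)))
    return sorted(findings, key=lambda f: f[2], reverse=True)

# Two pillars only, to keep the example short; None marks a skipped criterion.
audit = {
    "Identity": {"I1": 4, "I2": 2, "I3": 3},
    "Currency": {"C1": 3, "C2": None, "C3": None},
}
for criterion, score, impact in rank_findings(audit):
    print(criterion, score, impact)
```

Under this assumption, a criterion in a pillar with skips carries more of that pillar's average, so `C1` ranks first here even though `I2` has the lower score. Ranking by impact alone does not encode the Identity-first advice, which remains a judgment call.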

If you're a researcher

One site's score is not a finding. The MLI's empirical contribution comes from comparative audits — scoring multiple sites in the same sector on the same criteria in the same window. Look for patterns: do community legal aid clinics score lower on I3 than the immigration law firms in the same market? Do housing nonprofits and civic-access organizations score differently on R3 than legal aid providers? The methodology page's Evidence basis section says more about what comparison can and can't tell you.
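A first pass at the pattern questions above can be as simple as a per-group mean on one criterion. The sketch below uses made-up placeholder scores, not audit results, and hypothetical record keys (`group`, `scores`). It reports the sample size next to each mean so that small groups stay visible.

```python
from collections import defaultdict
from statistics import mean

def mean_by_group(audits, criterion):
    """Return {group: (mean score, n)} for one criterion, ignoring skips."""
    groups = defaultdict(list)
    for audit in audits:
        score = audit["scores"].get(criterion)
        if score is not None:  # skipped or unscored criteria are left out
            groups[audit["group"]].append(score)
    return {g: (round(mean(v), 2), len(v)) for g, v in groups.items()}

# Placeholder data for illustration only.
audits = [
    {"group": "legal aid clinic", "scores": {"I3": 2}},
    {"group": "legal aid clinic", "scores": {"I3": 3}},
    {"group": "legal aid clinic", "scores": {"I3": 2}},
    {"group": "law firm", "scores": {"I3": 4}},
    {"group": "law firm", "scores": {"I3": 5}},
]
print(mean_by_group(audits, "I3"))
# {'legal aid clinic': (2.33, 3), 'law firm': (4.5, 2)}
```

A difference in means across groups this small is a prompt for a closer look, not a result in itself.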

Limits

Snapshot in time. Sites change. A score that is low this month may be higher next month, and vice versa, so an audit is only valid as of its date.
Non-deterministic between runs. Claude's judgment varies at the borderline between adjacent scores — 3 vs. 4, for example. The skip-reason field mitigates this for skip decisions but not for borderline scoring. For comparative work, run audits in a single window and document the rubric pass.
A score is a measurement, not a verdict. The score tells you what an agent can read; it doesn't tell you whether the page is accurate, trustworthy, or worth recommending. MLI measures legibility, not merit.
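If you have run the audit more than once on the same page, one way to see where the borderline cases fall is to list the criteria whose scores differ between runs. Repeated runs are an assumption of this sketch, not a step this page requires, and the run data is illustrative.

```python
def unstable_criteria(runs):
    """Return {criterion: distinct scores} for criteria where runs disagree."""
    seen = {}
    for run in runs:
        for criterion, score in run.items():
            seen.setdefault(criterion, set()).add(score)
    # key=str keeps sorting safe if a run recorded a skip as None.
    return {c: sorted(s, key=str) for c, s in seen.items() if len(s) > 1}

runs = [
    {"I1": 4, "I2": 3, "I3": 3},
    {"I1": 4, "I2": 4, "I3": 3},
    {"I1": 4, "I2": 3, "I3": 3},
]
print(unstable_criteria(runs))  # {'I2': [3, 4]}
```

Criteria that appear in the output are the ones worth documenting when you record the rubric pass.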

Next steps

Browse the 12 implementation guides →

Each guide walks one criterion from spec to verification, with code examples and a checklist.

Read the full methodology →

The rubric for each criterion, plus the framework's evidence basis and limits.