Methodology

How this directory is researched, sourced, and ranked, and, just as importantly, what we deliberately leave blank. The whole point of rl-list.com is that you can audit every claim.

What we’re trying to do

Procurement teams choosing an RL-environment vendor care about scale, velocity, quality, cost, and data specs. Almost none of that is on the public web; it surfaces only in direct vendor engagement. So we don’t pretend to answer it. Instead, for every company, we research the public proxies for those questions, cite each one, and tag how confident we are.

What a buyer wants to know	The public proxy we source
How fast can they scale production	Headcount, headcount growth, open roles, funding
Quality & rigor of the work	Researchers on the team, their backgrounds, published papers/benchmarks
Will they survive / are they credible	Capital raised, investors, customers, founding year
Can we clear security review	SOC 2 / ISO certifications
Footprint & jurisdiction	HQ + office locations

Our confidence tiers

Every non-trivial field on every vendor page carries one of these tags, so you never have to guess how solid a number is:

confirmed: from a primary or official source (the company’s own site, filings, an audit registry).
reported: from a credible third party (Crunchbase, reputable press).
estimated: a clearly labeled inference, used sparingly (e.g. a headcount band read off a visible team list).
unknown: we could not source it from public materials. We show this openly instead of filling the gap.
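
One way to picture the tagging is as a small lookup from the kind of source to a tier. The sketch below is illustrative only: the `Tier` enum, the source-kind labels, and `tier_for` are hypothetical names invented for this example, not the directory’s actual tooling.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    CONFIRMED = "confirmed"
    REPORTED = "reported"
    ESTIMATED = "estimated"
    UNKNOWN = "unknown"


# Hypothetical source-kind labels, grouped the way the tier definitions describe.
PRIMARY_KINDS = {"company_site", "filing", "audit_registry"}
THIRD_PARTY_KINDS = {"crunchbase", "press"}


def tier_for(source_kind: Optional[str], is_inference: bool = False) -> Tier:
    """Return the confidence tier for a field, given where its value came from."""
    if source_kind is None:
        return Tier.UNKNOWN      # no public source: show the gap
    if is_inference:
        return Tier.ESTIMATED    # labeled inference, e.g. a headcount band
    if source_kind in PRIMARY_KINDS:
        return Tier.CONFIRMED
    if source_kind in THIRD_PARTY_KINDS:
        return Tier.REPORTED
    return Tier.UNKNOWN          # assumption: unrecognized source kinds are not trusted
```

For example, a headcount band read off a team page would be `tier_for("company_site", is_inference=True)`, which yields `Tier.ESTIMATED` even though the underlying page is a primary source.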

The rules we hold ourselves to

We never fabricate. “unknown” is a valid, common answer. An honest gap is more useful to a buyer than a confident guess.
Every quantitative claim is cited with a source link and the date we accessed it.
Self-claimed ≠ verified. A customer logo on a vendor’s own site is a claim, and we label it self-claimed; only third-party-corroborated relationships are marked verified. Frontier-lab ties are flagged.
We flag conflicts when sources disagree, and staleness when data is more than ~12 months old.
Public sources only. No paywall circumvention, no login-gated scraping.
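
The citation, staleness, and conflict rules above can be sketched as simple checks on a record. This is a minimal illustration under stated assumptions: the `Claim` class, the `is_cited` and `flags` helpers, the 365-day reading of “~12 months”, and the use of the access date as a stand-in for the data’s age are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

STALE_AFTER_DAYS = 365  # assumption: "~12 months" approximated as 365 days


@dataclass
class Claim:
    value: Optional[str]                 # None means "unknown"
    source_url: Optional[str] = None
    accessed: Optional[date] = None      # date the source was accessed


def is_cited(claim: Claim) -> bool:
    """A published value needs both a source link and an access date."""
    if claim.value is None:
        return True  # an honest "unknown" needs no citation
    return claim.source_url is not None and claim.accessed is not None


def flags(claims: list[Claim], today: date) -> set[str]:
    """Return review flags for one field backed by one or more claims."""
    out: set[str] = set()
    known = [c for c in claims if c.value is not None]
    # Conflict: sources report different values for the same field.
    if len({c.value for c in known}) > 1:
        out.add("conflict")
    # Stale: any supporting claim is older than the threshold
    # (access date used here as a proxy for the data's age).
    if any(c.accessed and (today - c.accessed).days > STALE_AFTER_DAYS for c in known):
        out.add("stale")
    return out
```

A field whose two sources disagree, one of them accessed more than a year ago, would carry both flags; a field with a single recent source would carry none.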

What we deliberately don’t publish

The numbers frontier labs actually request in an RFI (task/sample counts, unique environments, pass@1 and difficulty, capability and complexity splits, data-type breakdowns, harness and data format, and unit/total pricing) are not on the public web and only come from direct engagement. We do not estimate them, and we do not let an AI “reason toward” a plausible figure. Those fields stay blank on purpose. If you see a number on rl-list.com, it has a source.

How the ranking works: the RL List score

We rank only the dedicated, pure-play RL-environment vendors. Three groups are deliberately excluded from the ranking and listed separately for reference, because they aren’t like-for-like comparable: data-labeling incumbents moving into environments (Scale AI, Surge AI, Mercor), execution-infrastructure providers (sandbox/compute layers), and open-source projects. Mixing a $1B labeling incumbent into a list of focused environment startups would mislead more than it informs.

Within the ranked set, order is driven by a transparent formula we call the RL List score. The same calculation is applied to every vendor, so the baseline order stays auditable. A small number of vendors we have reviewed in depth are placed editorially; everyone else falls where the score puts them. The score combines:

Scale & traction: funding raised (on a log scale, so it rewards order-of-magnitude differences, not vanity precision) and customer traction.
Signals: a research team and its pedigree, named/verified customers and frontier-lab ties, and security posture (SOC 2 / certifications).
Data confidence & verification: our overall confidence in the record, plus a bonus for each field backed by a primary source. Two companies with similar scale are separated by how much of their story is actually provable.
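
To make the shape of such a formula concrete, here is a minimal sketch of how three components like these could be combined. Every number in it is a placeholder: the weights, the per-field bonus, the normalization, and the function names are assumptions made for illustration and are not the published coefficients of the RL List score.

```python
import math

# All weights and caps below are placeholders chosen for illustration;
# the real coefficients are not stated in this methodology.
W_SCALE, W_SIGNALS, W_CONFIDENCE = 0.4, 0.3, 0.3
PRIMARY_BONUS = 0.02  # hypothetical bonus per primary-sourced field


def funding_component(funding_usd: float) -> float:
    """Log-scale funding: each 10x in capital raised adds one point."""
    return math.log10(funding_usd) if funding_usd >= 1 else 0.0


def rl_list_score(funding_usd: float, traction: float, signals: float,
                  confidence: float, primary_fields: int) -> float:
    """Combine the three components; traction, signals, confidence are in [0, 1]."""
    funding_norm = min(funding_component(funding_usd) / 10.0, 1.0)
    scale = 0.5 * funding_norm + 0.5 * traction
    # Confidence plus a capped bonus for each field backed by a primary source.
    verification = min(confidence + PRIMARY_BONUS * primary_fields, 1.0)
    return W_SCALE * scale + W_SIGNALS * signals + W_CONFIDENCE * verification
```

Under these placeholder weights, two vendors with identical funding, traction, and signals are separated only by `primary_fields`: the one with more primary-sourced fields scores higher, which mirrors the tie-breaking behavior described in the third component.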

The score is not a product-quality or endorsement rating; it reflects scale, signal, and how verifiable a vendor’s public record is. A company lower down is often simply earlier-stage or harder to verify, not a weaker product.

Category rankings. The use-case pages (coding, computer use, enterprise workflows) start from the same RL List score, restricted to the vendors that work in that category and renumbered #1..#N, then apply the same allowance for a few editorial placements where we’ve looked hard at a vendor’s fit for that specific use case. So a vendor’s category position can differ from its overall rank: a specialist can lead a category it’s built for even if a broader vendor sits higher overall. Editorial placements are the exception, not the rule, and incumbents, infrastructure, and open-source projects remain unranked everywhere.
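
The filter-and-renumber step can be sketched as follows. This is an illustration, not the site’s implementation: the vendor dictionary keys (`name`, `score`, `categories`, `ranked`) and the `pinned` map used to express editorial placements are hypothetical.

```python
def category_ranking(vendors, category, pinned=None):
    """Rank vendors in one category.

    vendors: list of dicts with "name", "score", "categories", "ranked".
    pinned:  hypothetical {name: position} map for editorial placements (1-based).
    """
    pinned = pinned or {}
    # Unranked groups (incumbents, infrastructure, open source) never enter the list.
    eligible = [v for v in vendors if v["ranked"] and category in v["categories"]]
    by_score = sorted(eligible, key=lambda v: v["score"], reverse=True)
    names = [v["name"] for v in by_score if v["name"] not in pinned]
    # Insert editorial placements at their requested positions, lowest first.
    for name, pos in sorted(pinned.items(), key=lambda kv: kv[1]):
        if any(v["name"] == name for v in eligible):
            names.insert(min(pos - 1, len(names)), name)
    # Renumber #1..#N within the category.
    return list(enumerate(names, start=1))
```

Because the list is rebuilt per category, a vendor that sits mid-table overall can come out at #1 on a page where the higher-scoring vendors do not operate.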

Freshness

Every vendor page shows a “last updated” date, and the directory is re-verified on a rolling basis. This snapshot was last updated 2026-07-15.