How to Evaluate an AI Colocation Operator: A 12-Point Checklist

Evaluating a colocation operator for an AI workload is a different exercise from evaluating one for general enterprise IT. The power densities are higher, the SLAs are tighter, the cooling demands are nonstandard, and the operational complexity scales nonlinearly with rack density. The buyers in our network ask us to vet operators against twelve specific criteria. This is the full list, with what we actually look for on each.

1. Power reliability and contract structure

How is power procured, what's the rate structure, and what's the historical uptime on the supply side? A facility with a take-or-pay PPA at a fixed rate is structurally different from one buying block-and-index from a wholesale market. Ask for 24 months of metered uptime data at the substation, not just the data hall. Look for ride-through capability on grid events and clearly documented procedures for utility curtailment.
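As a rough illustration of what to do with the metered data once you have it, the sketch below computes availability from a list of outage intervals. The function name, the record format, and the sample figures are all hypothetical; real substation logs will need their own parsing.

```python
from datetime import datetime, timedelta


def availability(outages, period_start, period_end):
    """Return fractional availability over [period_start, period_end).

    `outages` is a list of (start, end) datetime pairs. Overlapping
    entries are merged so the same minutes are not counted twice.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")

    # Clip each outage to the reporting period and drop those outside it.
    clipped = sorted(
        (max(s, period_start), min(e, period_end))
        for s, e in outages
        if e > period_start and s < period_end
    )

    down = 0.0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # Close out the previous merged interval, if any.
            if cur_end is not None:
                down += (cur_end - cur_start).total_seconds()
            cur_start, cur_end = s, e
        else:
            # Overlaps the current interval: extend it.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        down += (cur_end - cur_start).total_seconds()

    return 1.0 - down / total


# Hypothetical example: a 24-month window with two overlapping events.
start = datetime(2023, 1, 1)
end = datetime(2025, 1, 1)
events = [
    (datetime(2023, 6, 1, 2, 0), datetime(2023, 6, 1, 3, 0)),
    (datetime(2023, 6, 1, 2, 30), datetime(2023, 6, 1, 4, 0)),
]
print(f"{availability(events, start, end):.6%}")
```

Running the same calculation on the substation feed and on the data hall feed separately shows how much of the headline uptime comes from on-site ride-through rather than from the supply itself.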

2. Cooling system design and headroom

What's the cooling capacity per rack, what's the design margin, and how does the system perform at the upper end of the design envelope? Modern AI workloads push 30-100+ kW per rack. An operator who tells you they support 50 kW racks but whose chiller plant is sized for 25 kW average is selling you a problem. Ask for design loads versus nameplate capacity, and ask what's been tested at peak.
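The mismatch described above is simple arithmetic, and it is worth running on the operator's own numbers. A minimal sketch, assuming a hypothetical 100-rack hall and treating plant capacity as a single usable kW figure (real plants have redundancy and derating that this ignores):

```python
def cooling_headroom(racks, kw_per_rack, plant_capacity_kw):
    """Fraction of plant capacity left over at the given rack density.

    Negative values mean the plant cannot carry the load.
    """
    design_load_kw = racks * kw_per_rack
    return (plant_capacity_kw - design_load_kw) / plant_capacity_kw


def max_racks_at_density(kw_per_rack, plant_capacity_kw):
    """How many racks the plant can cool if every rack runs at kw_per_rack."""
    return int(plant_capacity_kw // kw_per_rack)


# Hypothetical hall: 100 racks, plant sized for a 25 kW average.
plant_kw = 100 * 25  # 2500 kW

print(cooling_headroom(100, 50, plant_kw))  # -1.0: load is double the plant
print(max_racks_at_density(50, plant_kw))   # 50 racks at the advertised 50 kW
```

In this example the advertised 50 kW density only holds if half the hall stays empty, which is the question to put to the operator directly.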

3. Liquid cooling readiness

If liquid cooling isn't already deployed, what's the path to it? Direct-to-chip and immersion solutions both require infrastructure beyond standard CRAC/CRAH. Look at floor loading, plumbing routes, secondary loop capacity, and whether the operator has actually commissioned a liquid system before. A first-time liquid deployment in your hall is risk you should price in or push back on.

4. Fiber and network connectivity

How many distinct carriers terminate in the facility, what's the path diversity, and what's the latency to the major exchange points and cloud regions? Training workloads can tolerate some latency, for example to checkpointing storage; user-facing inference cannot. Look at peering arrangements and verify that the carrier list isn't an aspiration or a list of carriers that 'could' bring service.

5. Operational team and depth

Who's running the facility, how long have they been there, and what's the bench? A facility that depends on one or two senior operators is fragile. Ask about training programs, on-call coverage structure, and how they handle vacation/turnover. Walk the floor with an operator on shift, not just sales.

6. Security — physical and logical

Check the layers of physical security (perimeter, mantraps, biometric controls, badging), but also the procedural side: who has access to your space, what's logged, and how vendor entries are handled. On the logical side, look at network segmentation, out-of-band management isolation, and whether the operator runs their own monitoring stack on your cages.

7. Maintenance windows and notification

What's the standard maintenance window structure, how much notice do you get, and what triggers an emergency window? Look at the maintenance log for the past 12 months. Frequent, short, unannounced windows are a leading indicator of operational immaturity.
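If the operator hands over the maintenance log in any structured form, a quick tally makes the pattern visible. The sketch below is one way to do it; the record fields and the 72-hour notice threshold are our assumptions for illustration, not an industry standard.

```python
from dataclasses import dataclass


@dataclass
class Window:
    notice_hours: float    # advance notice given to customers
    duration_hours: float  # how long the window lasted
    emergency: bool        # operator declared it an emergency window


def summarize(log, min_notice_hours=72.0):
    """Count planned windows that were announced with too little notice."""
    planned = [w for w in log if not w.emergency]
    short_notice = [w for w in planned if w.notice_hours < min_notice_hours]
    return {
        "total": len(log),
        "emergency": len(log) - len(planned),
        "short_notice": len(short_notice),
        "short_notice_share": (
            len(short_notice) / len(planned) if planned else 0.0
        ),
    }


# Hypothetical 12-month log.
log = [
    Window(notice_hours=168, duration_hours=4, emergency=False),
    Window(notice_hours=0, duration_hours=1, emergency=False),
    Window(notice_hours=2, duration_hours=0.5, emergency=False),
    Window(notice_hours=0, duration_hours=3, emergency=True),
]
print(summarize(log))
```

A high short-notice share among windows that were not declared emergencies is the pattern described above; emergency windows are counted separately so they don't mask it.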

8. SLA structure and remedies

Read the actual SLA, not the marketing summary. Look for the definition of an outage, the exclusions (planned maintenance, force majeure, customer-caused), and the remedy. Service credits capped at one month of fees are common and arguably weak. Ask what triggers a termination right.
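Two numbers help when reading the remedy clause: how much downtime the stated availability actually permits, and what the credit is worth once the cap applies. The sketch below works both out. The credit schedule (a flat percentage of the monthly fee per hour of outage) and the fee are hypothetical; substitute the schedule from the contract in front of you.

```python
def allowed_downtime_minutes(availability_pct, days=30):
    """Downtime permitted per period before the SLA is breached."""
    return (1 - availability_pct / 100) * days * 24 * 60


def service_credit(monthly_fee, outage_hours, credit_pct_per_hour,
                   cap_pct=100.0):
    """Credit owed under a hypothetical per-hour schedule, capped at
    cap_pct of one month's fee (100 means one full month)."""
    earned_pct = outage_hours * credit_pct_per_hour
    return monthly_fee * min(earned_pct, cap_pct) / 100


# 99.99% over a 30-day month allows roughly 4.3 minutes of downtime.
print(round(allowed_downtime_minutes(99.99), 2))

# Hypothetical: 100,000 monthly fee, 5% credit per outage hour.
print(service_credit(100_000, 2, 5.0))   # short outage, uncapped
print(service_credit(100_000, 30, 5.0))  # long outage, hits the cap
```

The capped case is the one to compare against your real cost of a multi-day outage; the gap between the two is the exposure the SLA leaves with you.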

9. Credit quality and financial stability

If the operator is private, what's the capital structure and runway? If they're public, look at debt covenants and capex commitments. The scenario to avoid is being three years into a five-year lease when the operator restructures and your power rates reset or service degrades during the restructuring.

10. Expansion path and reserve capacity

Even if you don't need it today, what's the path to 2-5x your current footprint? Reserved capacity costs money but compresses risk; a right of first refusal on adjacent space is sometimes available. AI deployments tend to expand faster than expected, so bake the option in now.

11. Insurance and regulatory posture

What's the operator's insurance stack, what's the coverage for catastrophic events, and what compliance certifications do they hold (SOC 2, ISO 27001, HIPAA-ready if relevant)? Look at the audit reports, not just the certifications. A certification with material exceptions is more concerning than no certification at all.

12. Cultural fit and responsiveness

Subjective but important. How do they handle a problem? Test this before you sign. Send a non-critical question to support and time the response. Visit at 2am if you can. The operator's culture during normal operations tells you a lot about how they'll behave during a real incident.

Putting it together

No operator scores ten out of ten on every criterion. The goal isn't perfection; it's understanding the gaps and pricing them. A weak SLA is recoverable if the credit quality is strong. A first-time liquid cooling deployment is recoverable if the operational team has done it elsewhere. The combinations that scare us are any two or more of weak credit, thin operations, and new cooling technology.
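One way to keep that last check honest is to write it down as a rule rather than a feeling. The sketch below assumes each criterion is scored from 1 to 10, treats anything below 5 as weak, and reads "in any combination" as two or more of the three compounding risks. The criterion keys, the scale, and the threshold are all our assumptions for illustration.

```python
# Criteria whose weaknesses compound rather than offset each other.
COMPOUNDING = ("credit", "operations", "liquid_cooling")


def compounding_risks(scores, weak_below=5):
    """Return the weak compounding criteria if two or more are weak.

    `scores` maps criterion name to a 1-10 score. A missing score is
    treated as weak, on the view that an unanswered question is a gap.
    """
    weak = [c for c in COMPOUNDING if scores.get(c, 0) < weak_below]
    return weak if len(weak) >= 2 else []


# Hypothetical operator: strong cooling track record, weak elsewhere.
operator = {"credit": 3, "operations": 4, "liquid_cooling": 8, "sla": 6}
print(compounding_risks(operator))
```

A single weak score returns an empty list by design: the point of the rule is that one gap can be priced, while stacked gaps usually cannot.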