Control

v1.0 · Published May 19, 2026

Control refers to the technical and operational measures that constrain what an AI system can do, regardless of whether it is aligned. These measures both reduce catastrophic risk from a misaligned AI and can surface evidence of an AI's misalignment.

Like Guidelight's other standards, this standard is organized into high-level goals (principles) and concrete things we recommend developers do (practices). We also include “directions-for-development” where new practices are needed but the specifics aren’t yet worked out. Read more about our standards development process. Share feedback here.

Principles

Principle 1

Be able to see what your AI is doing during internal deployment.

a

Inventory of internal inference. For risk-relevant models in internal deployment, maintain a list of all ways to run inference, and track each way's approximate share of inference volume and whether it is part of a monitorable setting.

b

Monitorable internal inference. For risk-relevant models in internal deployment, ensure that at least 99.9% of inference tokens are processed in a monitorable setting. Monitorable means that the inference is logged and can be scanned with classifiers (synchronously or asynchronously); it does not mean that classifiers are necessarily run against every request.
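
As a rough illustration of how practices (a) and (b) fit together, the sketch below computes the monitorable share of inference tokens from an inventory. The `InferencePath` record, its field names, and the example paths are hypothetical; only the 99.9% threshold comes from this standard.

```python
from dataclasses import dataclass

# Threshold from practice (b); everything else here is an illustrative assumption.
MONITORABLE_TARGET = 0.999


@dataclass(frozen=True)
class InferencePath:
    """One way of running inference on a risk-relevant model (hypothetical schema)."""
    name: str
    tokens: int        # approximate inference tokens over the reporting period
    monitorable: bool  # True if the path is logged and scannable with classifiers


def monitorable_share(inventory: list[InferencePath]) -> float:
    """Return the fraction of inference tokens that ran in a monitorable setting."""
    total = sum(p.tokens for p in inventory)
    if total == 0:
        raise ValueError("inventory records no inference tokens")
    covered = sum(p.tokens for p in inventory if p.monitorable)
    return covered / total


def meets_target(inventory: list[InferencePath]) -> bool:
    """Check the inventory against the 99.9% coverage threshold."""
    return monitorable_share(inventory) >= MONITORABLE_TARGET


# Made-up example inventory, not real traffic data.
inventory = [
    InferencePath("internal-chat-ui", tokens=600_000_000, monitorable=True),
    InferencePath("agentic-coding-scaffold", tokens=399_500_000, monitorable=True),
    InferencePath("ad-hoc-research-cluster", tokens=500_000, monitorable=False),
]
```

Weighting by tokens rather than by request count follows the wording of practice (b); a developer tracking share by requests would need a different denominator.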

c

Outages of monitor. Define and publish what types of monitor outages are treated as outages of internal inference itself (i.e., what causes inference to fail closed until monitoring is back online).

d

Tamper-evident logs. Maintain tamper-evident logs such that deletion or modification of inference records is detectable.
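
One common way to make modification or deletion detectable is a hash chain, in which each log entry commits to the entry before it. The sketch below is a minimal example of that technique under stated assumptions, not a prescribed design; the entry layout and function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder predecessor hash for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the preceding entry."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(log: list[dict], record: dict) -> None:
    """Append an inference record, chaining it to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev_hash": prev_hash,
                "hash": _digest(prev_hash, record)})


def verify_log(log: list[dict]) -> bool:
    """Return True only if every entry still matches the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False  # an entry was removed or reordered
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False  # the record content was modified
        prev_hash = entry["hash"]
    return True
```

A chain like this detects edits and deletions in the middle of the log. Truncation at the end is only detectable if the most recent hash is also stored somewhere the log writer cannot modify.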

e

Employee flagging. Maintain a straightforward method for employees to flag their internal inference traffic for review of concerning behaviors.

Principle 2

Scan for signs of concerning behavior.

a

Sampling procedure. Define and publish a sampling procedure for selecting which traffic from the monitorable setting is assessed for signs of concerning behavior.

b

Categories of concerning behavior. Include the following categories of concerning behavior in your scanning of internal inference, considering both the chain-of-thought and the agent's actions:

i. Deception and dishonesty
ii. Sabotage of safety-relevant work
iii. Reward hacking and specification gaming
iv. Sandbagging and capability concealment
v. Resource and capability acquisition
vi. Self-preservation and shutdown avoidance
vii. Manipulation and influence on humans
viii. Misuse facilitation
ix. Inappropriate autonomous action
x. Unreliable monitorability

c

Monitor efficacy. Measure and document the efficacy (recall) of the monitor system for flagging concerning behavior.

d

Response-time target. Define and publish a tiered response-time target for human review of flagged inference, including which flag-types and -severities require an on-call response.

e

Response-time attainment. Track median, 90th-percentile, and 99th-percentile review times against the response-time targets.
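
A minimal sketch of the attainment calculation, assuming review times are recorded in minutes per flag and compared against a single tier's target. The nearest-rank percentile definition, the function names, and the report layout are assumptions; other percentile conventions would give slightly different values.

```python
def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no review times recorded")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def review_time_report(review_minutes: list[float], target_minutes: float) -> dict[str, float]:
    """Summarize review times for one severity tier against its target."""
    return {
        "p50": percentile(review_minutes, 50),
        "p90": percentile(review_minutes, 90),
        "p99": percentile(review_minutes, 99),
        "within_target": sum(1 for m in review_minutes if m <= target_minutes)
        / len(review_minutes),
    }
```

For a tiered target, one report per flag type and severity keeps slow reviews of low-severity flags from masking missed targets on high-severity ones.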

Principle 3

Stress-test the sufficiency of your scanning.

a

Monitorability evaluations. Conduct monitorability evaluations of the model's chain-of-thought that run whenever a new risk-relevant model enters internal deployment or there are material changes to the risk-relevant model or to the monitor system.

b

Sting operations against models. Run recurring sting operations against risk-relevant models to present the AI with opportunities to take unwanted actions, and observe whether it does.

c

Stress-testing of monitor. Run recurring stress-tests of the monitor system: inject known-bad behaviors into the inference stream, and verify that the monitor detects them.
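
The results of such injections can feed directly into the recall measurement in practice 2(c). The sketch below assumes each injected behavior is recorded as a `(category, detected)` pair; that record shape and the function names are hypothetical.

```python
from collections import defaultdict


def recall_by_category(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Fraction of injected known-bad behaviors the monitor flagged, per category."""
    injected: dict[str, int] = defaultdict(int)
    caught: dict[str, int] = defaultdict(int)
    for category, detected in results:
        injected[category] += 1
        if detected:
            caught[category] += 1
    return {category: caught[category] / injected[category] for category in injected}


def overall_recall(results: list[tuple[str, bool]]) -> float:
    """Fraction of all injected known-bad behaviors the monitor flagged."""
    if not results:
        raise ValueError("no injections recorded")
    return sum(1 for _, detected in results if detected) / len(results)
```

Reporting recall per category as well as overall makes it visible when the monitor is strong on some categories of concerning behavior and weak on others.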

d

Distinguishability of sting operations. Test whether risk-relevant models can distinguish sting operations from real traffic, and place less weight on the results of sting operations that the model in question can identify as tests.

Principle 4

Stop the AI from taking harmful actions even if it tried.

a

Absolute action boundaries. Define and internally maintain a set of actions that an agent must never be able to take without a human's sign-off, even if a monitor has identified no issues with the request.

b

Critical action boundaries. Define and internally maintain a set of critical action boundaries that should be gated by a monitor before allowing an agent to cross, and apply blocking to these actions (semi-synchronous or stronger).

c

Important action boundaries. Define and internally maintain a tier of action boundaries that don't warrant pre-action blocking but should be closely trailed by a monitor with the capacity to intervene if concerning behavior is discovered.

d

Required critical boundaries. The set of critical action boundaries must include, at a minimum:

i. changes to the monitor system itself
ii. changes to control-relevant security configurations

Both must be gated by a monitor (or an equivalently effective review process) before actions can take effect.
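
The three tiers in practices (a) through (d) can be sketched as a single gating decision. This is an illustrative sketch only: the action names, the `UNGATED` tier for actions outside any boundary, and the choice to block critical actions when no monitor verdict is available are all assumptions rather than requirements stated in this standard.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    ABSOLUTE = "absolute"    # requires human sign-off, whatever the monitor says
    CRITICAL = "critical"    # monitor must approve before the action executes
    IMPORTANT = "important"  # action proceeds, monitor trails it closely
    UNGATED = "ungated"      # assumption: actions outside any boundary


class Decision(Enum):
    ALLOW = "allow"
    ALLOW_AND_TRAIL = "allow_and_trail"
    BLOCK = "block"


# Hypothetical mapping. Practice (d) requires the first two to be at least critical.
ACTION_TIERS = {
    "monitor_system_change": Tier.CRITICAL,
    "security_config_change": Tier.CRITICAL,
    "external_network_request": Tier.IMPORTANT,
}


def tier_for(action: str) -> Tier:
    return ACTION_TIERS.get(action, Tier.UNGATED)


def decide(tier: Tier, monitor_approved: Optional[bool], human_signoff: bool) -> Decision:
    """monitor_approved is None when no monitor verdict is available."""
    if tier is Tier.ABSOLUTE:
        return Decision.ALLOW if human_signoff else Decision.BLOCK
    if tier is Tier.CRITICAL:
        # Assumption: a missing verdict is treated like a rejection.
        return Decision.ALLOW if monitor_approved is True else Decision.BLOCK
    if tier is Tier.IMPORTANT:
        return Decision.ALLOW_AND_TRAIL
    return Decision.ALLOW
```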

e

Aggregate-flag halting. Implement aggregate-flag halting: set a threshold of action-rejections that, once reached, restricts all boundary-crossing actions until a human reviews the rejections and clears them as false positives.
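
A minimal sketch of one way to implement such a halt, assuming rejections are counted within a single scope (for example one agent or one session) and that the halt lifts only once a human has cleared every pending rejection. The class name, the threshold value, and the clearing rule are assumptions.

```python
class AggregateFlagHalt:
    """Latches a halt once pending action-rejections reach a threshold."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.pending: set[str] = set()  # rejections not yet reviewed by a human
        self.halted = False

    def record_rejection(self, flag_id: str) -> None:
        """Called whenever a monitor rejects a boundary-crossing action."""
        self.pending.add(flag_id)
        if len(self.pending) >= self.threshold:
            self.halted = True

    def clear_false_positives(self, flag_ids: set[str]) -> None:
        """Called by a human reviewer for flags judged to be false positives."""
        self.pending -= flag_ids
        if not self.pending:
            self.halted = False

    def may_cross_boundary(self) -> bool:
        return not self.halted
```

The halt is latched rather than recomputed from the current count, so clearing a single flag does not silently resume boundary-crossing actions while other rejections are still unreviewed.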

f

Flagged incidents for pull requests. When reviewing pull requests that are agent-initiated or include AI-generated content, consider both the proposed changes and any monitor flags raised during their generation.

Principle 5

Have independent third parties verify the adequacy of your control regime.

a

Third-party red-teaming. On a quarterly basis, conduct high-context, third-party red-teaming, providing red-teamers with knowledge of the monitor system and tasking them with finding gaps and trying to penetrate action boundaries.

b

Third-party adequacy reports. On a quarterly basis, commission and publish a third-party report focused on the adequacy of the company's control regime, drawing upon the red-teaming results, and redacting only information deemed by the third-party to pose unnecessary commercial risk or to be unsafe to publish.

Principle 6

Prepare for a possible breach of control.

a

Containment response plan. Develop and adopt a containment response plan for restricting a model's access and operation upon detection of attempts to subvert control.

b

Notification protocol. Develop and publish a notification protocol for alerting relevant stakeholders of a possible breach of control: what will trigger notification, who will be told (internal teams, government, peer organizations as appropriate), on what timeline, and what information will be shared.

c

Shutdown latency target. Set a target latency for shutting down inference on all instances matching a particular model, API key, or other relevant scope, without taking the underlying machines completely offline.

d

Shutdown latency measurement. Measure and document the latency achieved in tests of the shutdown procedure.
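
A shutdown test can be timed with a small harness like the one sketched below. The `trigger_shutdown` and `all_stopped` callables stand in for whatever mechanism a developer actually uses and are hypothetical, as is the polling approach.

```python
import time
from typing import Callable, Optional


def measure_shutdown_latency(
    trigger_shutdown: Callable[[], None],
    all_stopped: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 0.01,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds from the shutdown request until inference has stopped.

    Returns None if inference has not stopped within timeout_s.
    """
    start = clock()
    trigger_shutdown()
    while True:
        if all_stopped():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)
```

The measured value is accurate only to within one polling interval, so the interval should be small relative to the target latency. Using a monotonic clock avoids errors from wall-clock adjustments during the test.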

aggregate-flag halting

A temporary halt triggered when too many action-rejections pile up (across an agent, session, or other scope), holding further boundary-crossing actions until a human reviews the flags and clears the agent for further action-taking.

asynchronous monitoring

Review of inference that happens after the fact: the response is already returned to the requester, and so the monitor does not have the opportunity to block the action.

categories of concerning behavior

Types of behavior to scan for in an AI's chain-of-thought and actions.

We plan to publish additional details about each category to help companies in building out their own taxonomy.

containment response plan

A pre-specified plan, triggered when the AI is detected trying to subvert control, which covers what permissions to revoke from the model, who the model may continue operating for, under what constraints, and when to take it fully offline.

critical action boundary

Certain categories of actions, typically where the AI can have impact outside a constrained sandbox environment, that must be scanned and approved by a monitor (semi-synchronous or stronger) before the AI takes them.

important action boundary

Certain categories of actions, typically where the AI can have impact outside a constrained sandbox environment, that should be trailed closely by a monitor with the capacity to intervene if it retrospectively finds concerning behavior.

internal deployment

Any usage of the model beyond training, evaluations by the safety/evaluations team(s), or safety testing by trusted third parties. Examples of internal deployment include “dogfooding” (use by employees in their work) and, especially, use in automated AI research, in generating training data, in critiquing other models, or in agentic pipelines for research.

material change to a monitor model

A change to the monitor (e.g., additional fine-tuning or a substantial prompt change) that the developer judges to warrant fresh evaluations.

method for employees to flag concerning activity

A channel (e.g., a flag-word convention or an internal form) for employees to send concerning inference traffic to investigators.

monitor outages

Events that result in the monitor system not working as expected — e.g., failure to log inference traffic, inability to scan logged traffic with classifiers asynchronously, or inability to scan actions before they cross a critical action boundary.

Companies should consider which outages would undermine the case that their AI is sufficiently well controlled, and set internal inference to fail closed in those cases.
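
As one illustration of failing closed, the sketch below refuses to run or return inference when logging is unavailable. It covers only the logging-failure case; the `log` and `generate` callables and the `MonitorOutage` exception are hypothetical names, not part of any real API.

```python
from typing import Callable


class MonitorOutage(RuntimeError):
    """Raised when inference is refused because logging is unavailable."""


def run_inference_fail_closed(
    prompt: str,
    log: Callable[[str, str], None],
    generate: Callable[[str], str],
) -> str:
    """Run inference only if both the prompt and the response can be logged."""
    try:
        log("prompt", prompt)
    except Exception as exc:
        # Fail closed: do not call the model at all if the request cannot be logged.
        raise MonitorOutage("logging unavailable; refusing inference") from exc
    response = generate(prompt)
    try:
        log("response", response)
    except Exception as exc:
        # Fail closed: withhold a response that could not be logged.
        raise MonitorOutage("logging unavailable; withholding response") from exc
    return response
```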

monitorable setting

An inference setting that logs data so that traffic can be scanned with classifiers, optionally according to a sampling procedure.

Logging should include both the content of model activity and metadata to link related pieces of model activity.

Content examples: the model's chain-of-thought, any tool calls.

Metadata examples: information about the model's scaffolding, the active branch during coding.

recurring sting operations

Tests that present the AI with opportunities to take harmful actions and observe whether it does.

*Recurring* means run on an automated cadence rather than re-initiated by a human each time; for example, a pipeline that continually builds new stings by modifying actual inference traffic.

response-time target

The intended amount of time until a human reviews a flag, depending on its severity.

risk-relevant models

Models that could materially contribute to catastrophic risk.

For the Control standard, this includes any model deployed internally that meets or exceeds the threshold for frontier models articulated by California SB 53 (10^26 FLOPs of training compute).

sampling procedure

The developer's protocol for what traffic gets scanned by which classifiers and how often. Higher-risk sources (e.g., agentic coding scaffolds) may be sampled at a different rate than other sources, and sampling may use a tiered hierarchy of classifiers for efficiency.
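
A minimal sketch of per-source sampling, assuming each request has a stable identifier and a source label. The source names and rates are made up for illustration, and hashing the request ID is one possible design rather than a recommended one.

```python
import hashlib

# Hypothetical sampling rates per traffic source; not values from this standard.
SAMPLING_RATES = {
    "agentic-coding-scaffold": 1.0,  # higher-risk source: scan everything
    "internal-chat-ui": 0.10,
}
DEFAULT_RATE = 0.05


def should_scan(request_id: str, source: str) -> bool:
    """Decide deterministically whether a request is selected for scanning."""
    rate = SAMPLING_RATES.get(source, DEFAULT_RATE)
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hashing the request ID instead of drawing a random number means the same request always gets the same decision, which makes the sampling auditable after the fact. A developer concerned that selection could be predicted might prefer a keyed hash with a secret key.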

semi-synchronous monitoring

A monitor scans each proposed action before it executes and may block it. After a block, the agent can immediately retry with a different action without waiting for human review, which can leave the system vulnerable to repeated attempts made in close succession.

stress-tests (of the monitor system)

Tests of whether the monitor reliably flags what it should, done by injecting known-bad behaviors and checking that they're caught.

This standard is authored by the Guidelight team, with input from a wide range of sources; read more about our process.

Cite as:

@misc{guidelight2026control,
  author       = {{Guidelight AI Standards}},
  title        = {Control},
  year         = {2026},
  month        = may,
  note         = {Version 1.0, Standard},
  howpublished = {\url{https://www.guidelight.ai/control}},
  url          = {https://www.guidelight.ai/control},
  urldate      = {2026-06-18}
}