# OpenAI
---
# Source: https://developers.openai.com/changelog/codex/2025-05-19.md
# Codex in the ChatGPT iOS app
- Date: 2025-05-19
- Products: Codex
Start tasks, view diffs, and push PRs—while you're away from your desk.

---
# Source: https://developers.openai.com/changelog/codex/2025-05-22.md
# Reworked environment page
- Date: 2025-05-22
- Products: Codex
## Changes
- Added a button to retry failed tasks.
- Added indicators to show that the agent runs without network access after setup.
- Added options to copy git patches after pushing a PR.
- Added support for Unicode branch names.
- Fixed a bug where secrets were not piped to the setup script.
- Fixed creating branches when there’s a branch name conflict.
- Fixed rendering diffs with multi-character emojis.
- Improved error messages when starting tasks, running setup scripts, pushing PRs, or when GitHub is disconnected; they are now more specific and indicate how to resolve the error.
- Improved onboarding for teams.
- Polished how new tasks look while loading.
- Polished the followup composer.
- Reduced GitHub disconnects by 90%.
- Reduced PR creation latency by 35%.
- Reduced tool call latency by 50%.
- Reduced task completion latency by 20%.
- Started setting page titles to task names so Codex tabs are easier to tell apart.
- Tweaked the system prompt so the agent knows it's working without network access and can suggest that the user set up dependencies.
- Updated the docs.
It's now easier and faster to set up code execution.

---
# Source: https://developers.openai.com/changelog/codex/2025-06-03.md
# June update
- Date: 2025-06-03
- Products: Codex
## Changes
- Added a link to this changelog from the profile menu.
- Added support for binary files: When applying patches, all file operations are supported. When using PRs, only deleting or renaming binary files is supported for now.
- Fixed an issue on iOS where follow-up tasks were shown duplicated in the task list.
- Fixed an issue on iOS where pull request statuses were out of date.
- Fixed an issue with follow-ups where environments were incorrectly started with the state from the first turn rather than the most recent state.
- Fixed internationalization of task events and logs.
- Improved error messages for setup scripts.
- Increased the limit on task diffs from 1 MB to 5 MB.
- Increased the limit for setup script duration from 5 to 10 minutes.
- Polished GitHub connection flow.
- Re-enabled Live Activities on iOS after resolving an issue with missed notifications.
- Removed the mandatory two-factor authentication requirement for users who sign in with SSO or social logins.
#### Agent internet access

Now you can give Codex access to the internet during task execution to install dependencies, upgrade packages, run tests that need external resources, and more.
Internet access is off by default. Plus, Pro, and Business users can enable it for specific environments, with granular control of which domains and HTTP methods Codex can access. Internet access for Enterprise users is coming soon.
Learn more about usage and risks in the [docs](https://developers.openai.com/codex/cloud/agent-internet).
#### Update existing PRs

Now you can update existing pull requests when following up on a task.
#### Voice dictation

Now you can dictate tasks to Codex.
---
# Source: https://developers.openai.com/changelog/codex/2025-06-13.md
# Best of N
- Date: 2025-06-13
- Products: Codex
## Changes
- Added some keyboard shortcuts and a page to explore them. Open it by pressing ⌘-/ on macOS and Ctrl+/ on other platforms.
- Added a “branch” query parameter in addition to the existing “environment”, “prompt” and “tab=archived” parameters.
- Added a loading indicator when downloading a repo during container setup.
- Added support for cancelling tasks.
- Fixed issues causing tasks to fail during setup.
- Fixed issues running followups in environments where the setup script changes files that are gitignored.
- Improved how the agent understands and reacts to network access restrictions.
- Increased the update rate of text describing what Codex is doing.
- Increased the limit for setup script duration to 20 minutes for Pro and Business users.
- Polished code diffs: You can now option-click a code diff header to expand/collapse all of them.

Codex can now generate multiple responses simultaneously for a single task, helping you quickly explore possible solutions to pick the best approach.
---
# Source: https://developers.openai.com/changelog/codex/2025-08-21.md
# Mid August update
- Date: 2025-08-21
- Products: Codex
#### Image inputs

You can now attach images to your prompts in Codex web. This is great for asking Codex to implement frontend changes or follow up on whiteboarding sessions.
#### Container caching

Codex now caches containers to start new tasks and followups 90% faster, dropping the median start time from 48 seconds to 5 seconds. You can optionally configure a maintenance script to update the environment from its cached state to prepare for new tasks. See the docs for more.
#### Automatic environment setup
Now, environments without manual setup scripts automatically run the standard installation commands for common package managers like yarn, pnpm, npm, go mod, gradle, pip, poetry, uv, and cargo. This reduces test failures for new environments by 40%.
---
# Source: https://developers.openai.com/changelog/codex/2025-08-27.md
# Late August update
- Date: 2025-08-27
- Products: Codex
#### IDE extension (Compatible with VS Code, Cursor, Windsurf)

Codex now runs in your IDE with an interactive UI for fast local iteration. Easily switch between modes and reasoning efforts.
#### Sign in with ChatGPT (IDE & CLI)

One-click authentication that removes the need for API keys and uses ChatGPT Enterprise credits.
#### Move work between local ↔ cloud

Hand off tasks to Codex web from the IDE with the ability to apply changes locally so you can delegate jobs without leaving your editor.
#### Code Reviews

Codex goes beyond static analysis. It checks a PR against its intent, reasons across the codebase and dependencies, and can run code to validate the behavior of changes.
---
# Source: https://developers.openai.com/changelog/codex/2025-09-15.md
# Introducing GPT-5-Codex
- Date: 2025-09-15
- Products: Codex
#### New model: GPT-5-Codex

GPT-5-Codex is a version of GPT-5 further optimized for agentic coding in Codex.
It's available in the IDE extension and CLI when you sign in with your ChatGPT account.
It also powers the cloud agent and Code Review in GitHub.
To learn more about GPT-5-Codex and how it performs compared to GPT-5 on software engineering tasks, see our [announcement blog post](https://openai.com/index/introducing-upgrades-to-codex/).
#### Image outputs

When working in the cloud on front-end engineering tasks, GPT-5-Codex can now display screenshots of the UI in Codex web for you to review. With image output, you can iterate on the design without needing to check out the branch locally.
#### New in Codex CLI
- You can now resume sessions where you left off with `codex resume`.
- Context compaction automatically summarizes the session as it approaches the context window limit.
Learn more in the [latest release notes](https://github.com/openai/codex/releases/tag/rust-v0.36.0).
---
# Source: https://developers.openai.com/changelog/codex/2025-09-23.md
# GPT-5-Codex in the API
- Date: 2025-09-23
- Products: Codex
GPT-5-Codex is now available in the Responses API, and you can also use it with your API key in the Codex CLI.
We plan on regularly updating this model snapshot.
It is available at the same price as GPT-5. You can learn more about pricing and rate limits for this model on our [model page](http://platform.openai.com/docs/models/gpt-5-codex).
---
# Source: https://developers.openai.com/changelog/codex/2025-10-06.md
# Codex is now GA
- Date: 2025-10-06
- Products: Codex
Codex is now generally available with 3 new features — @Codex in Slack, Codex SDK, and new admin tools.
#### @Codex in Slack

You can now ask questions and assign tasks to Codex directly from Slack. See the [Slack guide](https://developers.openai.com/codex/integrations/slack) to get started.
#### Codex SDK
Integrate the same agent that powers the Codex CLI inside your own tools and workflows with the Codex SDK in TypeScript. With the new Codex GitHub Action, you can easily add Codex to CI/CD workflows. See the [Codex SDK guide](https://developers.openai.com/codex/sdk) to get started.
```ts
// Package name assumed: the TypeScript SDK is published as @openai/codex-sdk.
import { Codex } from "@openai/codex-sdk";

const agent = new Codex();
// Threads keep context across turns, so follow-up runs build on earlier ones.
const thread = await agent.startThread();
const result = await thread.run("Explore this repo");
console.log(result);
const result2 = await thread.run("Propose changes");
console.log(result2);
```
#### New admin controls and analytics

ChatGPT workspace admins can now edit or delete Codex Cloud environments. With managed config files, they can set safe defaults for CLI and IDE usage and monitor how Codex uses commands locally. New analytics dashboards help you track Codex usage and code review feedback. Learn more in the [enterprise admin guide.](https://developers.openai.com/codex/enterprise/admin-setup)
#### Availability and pricing updates
The Slack integration and Codex SDK are available to developers on ChatGPT Plus, Pro, Business, Edu, and Enterprise plans starting today, while the new admin features will be available to Business, Edu, and Enterprise.
Beginning October 20, Codex Cloud tasks will count toward your Codex usage. Review the [Codex pricing guide](https://developers.openai.com/codex/pricing) for plan-specific details.
---
# Source: https://developers.openai.com/changelog/codex/2025-10-22.md
# Tag @Codex on GitHub Issues and PRs
- Date: 2025-10-22
- Products: Codex
You can now tag `@codex` on a teammate's pull request to ask clarifying questions, request a follow-up, or ask Codex to make changes. GitHub Issues now also support `@codex` mentions, so you can kick off tasks from any issue, without leaving your workflow.

---
# Source: https://developers.openai.com/changelog/codex/2025-10-30.md
# Credits on ChatGPT Pro and Plus
- Date: 2025-10-30
- Products: Codex
Codex users on ChatGPT Plus and Pro can now use on-demand credits for more Codex usage beyond what's included in your plan. [Learn more.](https://developers.openai.com/codex/pricing)
---
# Source: https://developers.openai.com/changelog/codex/2025-11-06.md
# GPT-5-Codex model update
- Date: 2025-11-06
- Products: Codex
We've shipped a minor update to GPT-5-Codex:
- More reliable file edits with `apply_patch`.
- Fewer destructive actions such as `git reset`.
- More collaborative behavior when encountering user edits in files.
- 3% more efficient in time and usage.
---
# Source: https://developers.openai.com/changelog/codex/2025-11-07.md
# Introducing GPT-5-Codex-Mini
- Date: 2025-11-07
- Products: Codex
Today we are introducing a new `gpt-5-codex-mini` model option to Codex CLI and the IDE Extension. The model is a smaller, more cost-effective, but less capable version of `gpt-5-codex` that provides approximately 4x more usage as part of your ChatGPT subscription.
Starting today, the CLI and IDE Extension will automatically suggest switching to `gpt-5-codex-mini` when you reach 90% of your 5-hour usage limit, to help you work longer without interruptions.
You can try the model for a new Codex CLI session using:
```bash
codex --model gpt-5-codex-mini
```
You can also use the `/model` slash command in the CLI. In the Codex IDE Extension you can select GPT-5-Codex-Mini from the dropdown menu.
Alternatively, you can change your default model to `gpt-5-codex-mini` by updating your `config.toml` [configuration file](https://developers.openai.com/codex/local-config):
```toml
model = "gpt-5-codex-mini”
```
---
# Source: https://developers.openai.com/changelog/codex/2025-11-13.md
# Introducing GPT-5.1-Codex and GPT-5.1-Codex-Mini
- Date: 2025-11-13
- Products: Codex
Along with the [GPT-5.1 launch in the API](https://openai.com/index/gpt-5-1-for-developers/), we are introducing new `gpt-5.1-codex-mini` and `gpt-5.1-codex` model options in Codex: versions of GPT-5.1 optimized for long-running, agentic coding tasks in Codex and Codex-like harnesses.
Starting today, the CLI and IDE Extension will default to `gpt-5.1-codex` on macOS and Linux and `gpt-5.1` on Windows.
If you have a model specified in your [`config.toml` configuration file](https://developers.openai.com/codex/local-config), you can instead try out `gpt-5.1-codex` for a new Codex CLI session using:
```bash
codex --model gpt-5.1-codex
```
You can also use the `/model` slash command in the CLI. In the Codex IDE Extension you can select GPT-5.1-Codex from the dropdown menu.
If you want to switch for all sessions, you can change your default model to `gpt-5.1-codex` by updating your `config.toml` [configuration file](https://developers.openai.com/codex/local-config):
```toml
model = "gpt-5.1-codex”
```
---
# Source: https://developers.openai.com/changelog/codex/2025-11-18.md
# Introducing GPT-5.1-Codex-Max
- Date: 2025-11-18
- Products: Codex
[Today we are releasing GPT-5.1-Codex-Max](http://www.openai.com/index/gpt-5-1-codex-max), our new frontier agentic coding model.
GPT-5.1-Codex-Max is built on an update to our foundational reasoning model, which is trained on agentic tasks across software engineering, math, research, and more. GPT-5.1-Codex-Max is faster, more intelligent, and more token-efficient at every stage of the development cycle, and a new step towards becoming a reliable coding partner.
Starting today, the CLI and IDE Extension will default to `gpt-5.1-codex-max` for users who are signed in with ChatGPT. API access for the model will come soon.
For non-latency-sensitive tasks, we’ve also added a new Extra High (`xhigh`) reasoning effort, which lets the model think for an even longer period of time for a better answer. We still recommend medium as your daily driver for most tasks.
If you have a model specified in your [`config.toml` configuration file](https://developers.openai.com/codex/local-config), you can instead try out `gpt-5.1-codex-max` for a new Codex CLI session using:
```bash
codex --model gpt-5.1-codex-max
```
You can also use the `/model` slash command in the CLI. In the Codex IDE Extension you can select GPT-5.1-Codex-Max from the dropdown menu.
If you want to switch for all sessions, you can change your default model to `gpt-5.1-codex-max` by updating your `config.toml` [configuration file](https://developers.openai.com/codex/local-config):
```toml
model = "gpt-5.1-codex-max”
```
---
# Source: https://developers.openai.com/changelog/codex/2025-11-24.md
# Usage and credits fixes
- Date: 2025-11-24
- Products: Codex
Minor updates to address a few issues with Codex usage and credits:
- Adjusted all usage dashboards to show "limits remaining" for consistency. The CLI previously displayed "limits used."
- Fixed an issue preventing users from buying credits if their ChatGPT subscription was purchased via iOS or Google Play.
- Fixed an issue where the CLI could display stale usage information; it now refreshes without needing to send a message first.
- Optimized the backend to help smooth out usage throughout the day, irrespective of overall Codex load or how traffic is routed. Previously, users could hit a few cache misses in a row, which consumed much more of their available usage.
---
# Source: https://developers.openai.com/changelog/developers/2025-11-4.md
# Resources updates
- Date: 2025-11-04
- Products: Resources, Apps SDK
## Changes
- Published a new [Apps SDK state management](https://developers.openai.com/apps-sdk/build/state-management) guide.
- Added copy functionality to all code snippets.
- Launched a unified developers [changelog](https://developers.openai.com/changelog).
---
# Source: https://developers.openai.com/changelog/codex/2025-12-04.md
# Introducing Codex for Linear
- Date: 2025-12-04
- Products: Codex
Assign or mention @Codex in an issue to kick off a Codex cloud task. As Codex works, it posts updates back to Linear, providing a link to the completed task so you can review, open a PR, or keep working.

To learn more about how to connect Codex to Linear both locally through MCP and through the new integration, check out the [Codex for Linear documentation](https://developers.openai.com/codex/integrations/linear).
---
# Source: https://developers.openai.com/changelog/codex/2025-12-18.md
# Introducing GPT-5.2-Codex
- Date: 2025-12-18
- Products: Codex
[Today we are releasing GPT-5.2-Codex](http://www.openai.com/index/gpt-5-2-codex), the most advanced agentic coding model yet for complex, real-world software engineering.
GPT-5.2-Codex is a version of [GPT-5.2](https://openai.com/index/introducing-gpt-5-2/) further optimized for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities.
Starting today, the CLI and IDE Extension will default to `gpt-5.2-codex` for users who are signed in with ChatGPT. API access for the model will come soon.
If you have a model specified in your [`config.toml` configuration file](https://developers.openai.com/codex/local-config), you can instead try out `gpt-5.2-codex` for a new Codex CLI session using:
```bash
codex --model gpt-5.2-codex
```
You can also use the `/model` slash command in the CLI. In the Codex IDE Extension you can select GPT-5.2-Codex from the dropdown menu.
If you want to switch for all sessions, you can change your default model to `gpt-5.2-codex` by updating your `config.toml` [configuration file](https://developers.openai.com/codex/local-config):
```toml
model = "gpt-5.2-codex”
```
---
# Source: https://developers.openai.com/changelog/codex/2025-12-19.md
# Agent skills in Codex
- Date: 2025-12-19
- Products: Codex
Codex now supports **agent skills**: reusable bundles of instructions (plus optional scripts and resources) that help Codex reliably complete specific tasks.
Skills are available in both the Codex CLI and IDE extensions.
You can invoke a skill explicitly by typing `$skill-name` (for example, `$skill-installer` or the experimental `$create-plan` skill after installing it), or let Codex select a skill automatically based on your prompt.
Learn more in the [skills documentation](https://developers.openai.com/codex/skills).
#### Folder-based standard (agentskills.io)
Following the open [agent skills specification](https://agentskills.io/specification), a skill is a folder with a required `SKILL.md` and optional supporting files:
```text
my-skill/
  SKILL.md      # Required: instructions + metadata
  scripts/      # Optional: executable code
  references/   # Optional: documentation
  assets/       # Optional: templates, resources
```
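Inside the folder, `SKILL.md` might look something like the following minimal sketch. It assumes YAML frontmatter with `name` and `description` fields as described in the agent skills specification; the skill name and instructions are purely illustrative.
```md
---
name: release-notes
description: Draft release notes from merged PRs since the last tag.
---

# Release notes

1. List the PRs merged since the most recent git tag.
2. Group the changes by area and summarize each in one line.
3. Output the draft as Markdown suitable for a GitHub release.
```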
#### Install skills per-user or per-repo
You can install skills for just yourself in `~/.codex/skills`, or for everyone on a project by checking them into `.codex/skills` in the repository.
Codex also ships with a few built-in system skills to get started, including `$skill-creator` and `$skill-installer`. The `$create-plan` skill is experimental and needs to be installed (for example: `$skill-installer install the create-plan skill from the .experimental folder`).
#### Curated skills directory
Codex ships with a [small curated set of skills](https://github.com/openai/skills) inspired by popular workflows at OpenAI. Install them with `$skill-installer`, and expect more over time.
---
# Source: https://developers.openai.com/changelog/codex/2026-01-14.md
# GPT-5.2-Codex API availability
- Date: 2026-01-14
- Products: Codex
GPT-5.2-Codex is now available in the API and for users who sign in to Codex with the API.
To learn more about using GPT-5.2-Codex, check out our [API documentation](https://platform.openai.com/docs/models/gpt-5.2-codex).
---
# Source: https://developers.openai.com/changelog/apps-sdk/2026-01-15.md
# Session metadata for tool calls & requestModal template switching
- Date: 2026-01-15
- Products: Apps SDK
## Changes
- Tool calls now include `_meta["openai/session"]`, an anonymized conversation id you can use to correlate requests within a ChatGPT session.
- `window.openai.requestModal({ template })` now supports opening a different registered UI template by passing the template URI from `registerResource`.
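As a rough sketch of the `requestModal` change, the call below assumes it runs inside an Apps SDK component where `window.openai` is available; the template URI is hypothetical and stands in for whatever URI you registered via `registerResource`.
```ts
// Hypothetical template URI registered earlier with registerResource.
const DETAILS_TEMPLATE = "ui://my-app/details.html";

export async function openDetailsModal() {
  // Opens a different registered UI template in a modal.
  await (window as any).openai.requestModal({ template: DETAILS_TEMPLATE });
}
```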
---
# Source: https://developers.openai.com/changelog/apps-sdk/2026-01-21.md
# Company knowledge compatibility guidance
- Date: 2026-01-21
- Products: Apps SDK
## Changes
- Added [company knowledge in ChatGPT](https://openai.com/index/introducing-company-knowledge/) compatibility guidance for the `search`/`fetch` tools. [Click here to learn more](https://developers.openai.com/apps-sdk/build/mcp-server#company-knowledge-compatibility).
---
# Source: https://developers.openai.com/changelog/codex/2026-01-22.md
# Custom prompts deprecated
- Date: 2026-01-22
- Products: Codex
Custom prompts are now deprecated. Use [skills](https://developers.openai.com/codex/skills) for reusable instructions and workflows instead.
---
# Source: https://developers.openai.com/changelog/codex/2026-01-23.md
# Team Config for shared configuration
- Date: 2026-01-23
- Products: Codex
Team Config groups the files teams use to standardize Codex across repositories and machines. Use it to share:
- `config.toml` defaults
- `rules/` for command controls outside the sandbox
- `skills/` for reusable workflows
Codex loads these layers from `.codex/` folders in the current working directory, parent folders, and the repo root, plus user (`~/.codex/`) and system (`/etc/codex/`) locations. Higher-precedence locations override lower-precedence ones.
Admins can still enforce constraints with `requirements.toml`, which overrides defaults regardless of location.
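For example, a team that checks shared configuration into a repository might lay out its `.codex/` folder like this; the layout simply mirrors the pieces listed above, and the file contents are whatever your team standardizes on.
```text
repo-root/
  .codex/
    config.toml   # shared defaults (model, approvals, sandbox mode, ...)
    rules/        # command controls outside the sandbox
    skills/       # reusable workflows shared with the team
```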
Learn more in [Team Config](https://developers.openai.com/codex/enterprise/admin-setup#team-config).
---
# Source: https://developers.openai.com/changelog/codex/2026-01-28.md
# Web search is now enabled by default
- Date: 2026-01-28
- Products: Codex
Codex now enables web search for local tasks in the Codex CLI and IDE Extension.
By default, Codex uses a web search cache, which is an OpenAI-maintained index of web results. Cached mode returns pre-indexed results instead of fetching live pages, while live mode fetches the most recent data from the web. If you are using `--yolo` or another [full access sandbox setting](https://developers.openai.com/codex/security), web search defaults to live results. To disable this behavior or switch modes, use the `web_search` configuration option:
- `web_search = "cached"` (default; serves results from the web search cache)
- `web_search = "live"` (fetches the most recent data from the web; same as `--search`)
- `web_search = "disabled"` to remove the tool
To learn more, check out the [configuration documentation](https://developers.openai.com/codex/config-basic).
---
# Source: https://developers.openai.com/changelog/codex/2026-02-02.md
# Introducing the Codex app
- Date: 2026-02-02
- Products: Codex
#### Codex app
The Codex app for macOS is a desktop interface for running agent threads in parallel and collaborating with agents on long-running tasks. It includes a project sidebar, thread list, and review pane for tracking work across projects.
Key features:
- [Multitask across projects](https://developers.openai.com/codex/app/features#multitask-across-projects)
- [Built-in worktree support](https://developers.openai.com/codex/app/worktrees)
- [Voice dictation](https://developers.openai.com/codex/app/features#voice-dictation)
- [Built-in Git tooling](https://developers.openai.com/codex/app/features#built-in-git-tools)
- [Skills](https://developers.openai.com/codex/app/features#skills-support)
- [Automations](https://developers.openai.com/codex/app/automations)
For a limited time, **ChatGPT Free and Go include Codex**, and **Plus, Pro, Business, Enterprise, and Edu** plans get **double rate limits**. Those higher limits apply in the app, the CLI, your IDE, and the cloud.
Learn more in the [Introducing the Codex app](https://openai.com/index/introducing-the-codex-app/) blog post.
Check out the [Codex app documentation](https://developers.openai.com/codex/app) for more.
---
# Source: https://developers.openai.com/resources/video/4o-image-generation-intro.md
# 4o image generation intro
> Video introduction to 4o model image generation capabilities.
- Type: Video
- Tags: imagegen
- URL: https://www.youtube.com/watch?v=2f3K43FHRKo
- Created: 2025-07-18
- Updated: 2025-08-13
## Summary
Shows how to create images using the 4o model.
## Details
Walkthrough of features and examples for 4o image generation.
---
# Source: https://developers.openai.com/codex/enterprise/admin-setup.md
# Admin Setup
This guide is for ChatGPT Enterprise admins who want to set up Codex for their workspace.
## Enterprise-grade security and privacy
Codex supports ChatGPT Enterprise security features, including:
- No training on enterprise data
- Zero data retention for the CLI and IDE
- Residency and retention follow ChatGPT Enterprise policies
- Granular user access controls
- Data encryption at rest (AES 256) and in transit (TLS 1.2+)
For more, see [Security](https://developers.openai.com/codex/security).
## Local vs. cloud setup
Codex operates in two environments: local and cloud.
1. Local use includes the Codex app, CLI, and IDE extension. The agent runs on the developer's computer in a sandbox.
2. Use in the cloud includes Codex cloud, iOS, Code Review, and tasks created by the [Slack integration](https://developers.openai.com/codex/integrations/slack). The agent runs remotely in a hosted container with your codebase.
Use separate permissions and role-based access control (RBAC) to control access to local and cloud features. You can enable local, cloud, or both for all users or for specific groups.
## Codex local setup
### Enable Codex app, CLI, and IDE extension in workspace settings
To enable Codex locally for workspace members, go to [Workspace Settings > Settings and Permissions](https://chatgpt.com/admin/settings). Turn on **Allow members to use Codex Local**. This setting doesn't require the GitHub connector.
After you turn this on, users can sign in to use the Codex app, CLI, and IDE extension with their ChatGPT account. If you turn off this setting, users who attempt to use the Codex app, CLI, or IDE will see the following error: "403 - Unauthorized. Contact your ChatGPT administrator for access."
## Team Config
Teams who want to standardize Codex across an organization can use Team Config to share defaults, rules, and skills without duplicating setup on every local configuration.
| Type | Path | Use it to |
| ------------------------------------ | ------------- | ---------------------------------------------------------------------------- |
| [Config basics](https://developers.openai.com/codex/config-basic) | `config.toml` | Set defaults for sandbox mode, approvals, model, reasoning effort, and more. |
| [Rules](https://developers.openai.com/codex/rules) | `rules/` | Control which commands Codex can run outside the sandbox. |
| [Skills](https://developers.openai.com/codex/skills) | `skills/` | Make shared skills available to your team. |
For locations and precedence, see [Config basics](https://developers.openai.com/codex/config-basic#configuration-precedence).
## Codex cloud setup
### Prerequisites
Codex cloud requires **GitHub (cloud-hosted) repositories**. If your codebase is on-premises or not on GitHub, you can use the Codex SDK to build similar workflows on your own infrastructure.
To set up Codex as an admin, you must have GitHub access to the repositories
commonly used across your organization. If you don't have the necessary
access, work with someone on your engineering team who does.
### Enable Codex cloud in workspace settings
Start by turning on the ChatGPT GitHub Connector in the Codex section of [Workspace Settings > Settings and Permissions](https://chatgpt.com/admin/settings).
To enable Codex cloud for your workspace, turn on **Allow members to use Codex cloud**.
Once enabled, users can access Codex directly from the left-hand navigation panel in ChatGPT.
After you turn on Codex in your Enterprise workspace settings, it may take up
to 10 minutes for Codex to appear in ChatGPT.
### Configure the GitHub Connector IP allow list
To control which IP addresses can connect to your ChatGPT GitHub connector, configure these IP ranges:
- [ChatGPT egress IP ranges](https://openai.com/chatgpt-actions.json)
- [Codex container egress IP ranges](https://openai.com/chatgpt-agents.json)
These IP ranges can change. Consider checking them automatically and updating your allow list based on the latest values.
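As a minimal sketch of automating that check, the commands below only download the published documents so you can diff them against your current allow list; no particular JSON structure is assumed.
```bash
# Fetch the current egress IP range documents for review or diffing.
curl -s https://openai.com/chatgpt-actions.json -o chatgpt-actions.json
curl -s https://openai.com/chatgpt-agents.json -o chatgpt-agents.json
```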
### Allow members to administer Codex
This toggle allows users to view Codex workspace analytics and manage environments (edit and delete).
Codex supports role-based access (see [Role-based access (RBAC)](#role-based-access-rbac)), so you can turn on this toggle for a specific subset of users.
### Enable Codex Slack app to post answers on task completion
Codex integrates with Slack. When a user mentions `@Codex` in Slack, Codex starts a cloud task, gets context from the Slack thread, and responds with a link to a PR to review in the thread.
To allow the Slack app to post answers on task completion, turn on **Allow Codex Slack app to post answers on task completion**. When enabled, Codex posts its full answer back to Slack when the task completes. Otherwise, Codex posts only a link to the task.
To learn more, see [Codex in Slack](https://developers.openai.com/codex/integrations/slack).
### Enable Codex agent to access the internet
By default, Codex cloud agents have no internet access during runtime to help protect against security and safety risks like prompt injection.
As an admin, you can allow users to enable agent internet access in their environments. To enable it, turn on **Allow Codex agent to access the internet**.
When this setting is on, users can use an allow list for common software dependency domains, add more domains and trusted sites, and specify allowed HTTP methods.
### Enable code review with Codex cloud
To allow Codex to do code reviews, go to [Settings → Code review](https://chatgpt.com/codex/settings/code-review).
Users can specify whether they want Codex to review their pull requests. Users can also configure whether code review runs for all contributors to a repository.
Codex supports two types of code reviews:
1. Automatically triggered code reviews when a user opens a PR for review.
2. Reactive code reviews when a user mentions @Codex to look at issues. For example, "@Codex fix this CI error" or "@Codex address that feedback."
## Role-based access (RBAC)
Codex supports role-based access. RBAC is a security and permissions model used to control access to systems or resources based on a user's role assignments.
To enable RBAC for Codex, navigate to Settings & Permissions → Custom Roles in [ChatGPT's admin page](https://chatgpt.com/admin/settings) and assign roles to groups created in the Groups tab.
This simplifies permission management for Codex and improves security in your ChatGPT workspace. To learn more, see the [Help Center article](https://help.openai.com/en/articles/11750701-rbac).
## Set up your first Codex cloud environment
1. Go to Codex cloud and select **Get started**.
2. Select **Connect to GitHub** to install the ChatGPT GitHub Connector if you haven't already connected GitHub to ChatGPT.
- Allow the ChatGPT Connector for your account.
- Choose an installation target for the ChatGPT Connector (typically your main organization).
- Allow the repositories you want to connect to Codex (a GitHub admin may need to approve this).
3. Create your first environment by selecting the repository most relevant to your developers, then select **Create environment**.
- Add the email addresses of any environment collaborators to give them edit access.
4. Start a few starter tasks (for example, writing tests, fixing bugs, or exploring code).
You have now created your first environment. Users who connect to GitHub can create tasks using this environment. Users who have access to the repository can also push pull requests generated from their tasks.
### Environment management
As a ChatGPT workspace administrator, you can edit and delete Codex environments in your workspace.
### Connect more GitHub repositories with Codex cloud
1. Select **Environments**, or open the environment selector and select **Manage Environments**.
2. Select **Create Environment**.
3. Select the repository you want to connect.
4. Enter a name and description.
5. Select the environment visibility.
6. Select **Create Environment**.
Codex automatically optimizes your environment setup by reviewing your codebase. Avoid advanced environment configuration until you observe specific performance issues. For more, see [Codex cloud](https://developers.openai.com/codex/cloud).
### Share setup instructions with users
You can share these steps with end users:
1. Go to [Codex](https://chatgpt.com/codex) in the left-hand panel of ChatGPT.
2. Select **Connect to GitHub** in the prompt composer if you're not already connected.
- Sign in to GitHub.
3. You can now use shared environments with your workspace or create your own environment.
4. Try a task in both Ask and Code mode. For example:
- Ask: Find bugs in this codebase.
- Write code: Improve test coverage following the existing test patterns.
## Track Codex usage
- For workspaces with rate limits, use [Settings → Usage](https://chatgpt.com/codex/settings/usage) to view workspace metrics for Codex.
- For more detail on enterprise governance, refer to the [Governance](https://developers.openai.com/codex/enterprise/governance) page.
- For enterprise workspaces with flexible pricing, you can see credit usage in the ChatGPT workspace billing console.
## Zero data retention (ZDR)
Codex supports OpenAI organizations with [Zero Data Retention (ZDR)](https://platform.openai.com/docs/guides/your-data#zero-data-retention) enabled.
---
# Source: https://developers.openai.com/resources/guide/agentic-commerce-guide.md
# Agentic Commerce Protocol
> Design flows for embedded commerce in ChatGPT.
- Type: Guide
- Tags: commerce
- URL: /commerce
- Created: 2025-09-29
- Updated: 2025-09-29
## Summary
Docs for the Agentic Commerce Protocol
## Details
Docs for the Agentic Commerce Protocol.
---
# Source: https://developers.openai.com/cookbook/examples/agentkit/agentkit_walkthrough.md
# Build, deploy, and optimize agentic workflows with AgentKit
## Introduction
At DevDay 2025 we launched [AgentKit](https://openai.com/index/introducing-agentkit/), a complete set of tools for developers and enterprises to build, deploy, and optimize agents. AgentKit is a set of interconnected building blocks:
* [Agent Builder](https://platform.openai.com/docs/guides/agents/agent-builder): visually build and iterate on agent workflows
* [ChatKit](https://platform.openai.com/docs/guides/chatkit): easily embed chat-based workflows into your app
* [Evals](https://platform.openai.com/docs/guides/evals?api-mode=responses): improve the performance of your LLM-powered apps
**This cookbook will take an end-to-end journey through AgentKit - we'll build, deploy, and optimize an app. You'll understand how AgentKit’s building blocks connect together, enabling you to bring your agentic workflows into production faster and more reliably.**
We’ll walk through the following steps:
1. Build a workflow in Agent Builder to serve as the back-end of our app
2. Deploy a front-end chat app using the ChatKit web component
3. Optimize workflow performance in Evals with prompt optimization and trace grading
## Building the multi-agent workflow with Agent Builder
Let's get started by using Agent Builder to create the initial workflow that will underpin our app. Agent Builder is a visual canvas that lets you drag-and-drop nodes to design your agentic workflows. You can learn more about Agent Builder [here](https://platform.openai.com/docs/guides/agent-builder), including additional functionality and a full list of supported nodes, but in this cookbook we'll create a simple workflow with three Agent nodes working sequentially.
We’ll build a simple app that helps people accelerate their careers through curated learning recommendations. Users can upload their resume and tell us their dream job, and they'll receive a curated set of learning courses based on the skills they need to develop. So we'll create three agents:
1. **Resume extraction** agent to parse uploaded resumes and extract relevant skills and experiences
2. **Career analysis** agent to analyze knowledge gaps relative to their target job
3. **Course recommendation** agent which uses the upstream information to suggest relevant online courses.
Let's build each of these agents sequentially.
### 1. Resume extraction agent
This agent will be responsible for parsing the uploaded resume and returning a structured output of skills and experiences that will be used for downstream analysis. We'll use the following prompt:
```text
Extract and summarize information from the input resume, organizing your output by category and providing context where available.
- Analyze the provided input to identify skills and professional experiences.
- For each skill or experience, extract the supporting context or evidence from the text (e.g., for the skill of Python, context might be “used Python in data analysis for three years at [Company]”).
- Continue reviewing the text until all skills and experiences are extracted.
```
We'll use `gpt-5` for this agent, starting with `minimal` reasoning, but we can always change the model later if needed. We'll also enforce a structured response (by setting Output format to JSON and adding a schema) to instruct the model to return the exact data shape we're looking for. (The JSON schema for this structured output can be found [here](https://cdn.openai.com/cookbook/agent_walkthrough/Skills_schema.json).)
### 2. Career analysis agent
This agent will analyze skill and knowledge gaps for an individual to progress to a desired professional or career goal. We'll use `gpt-5` for this agent and select reasoning effort `low`, which should provide sufficient reasoning for this level of analysis while keeping the responses sufficiently fast.
```text
Your role is to analyze skill and knowledge gaps for an individual to progress to a desired professional or career goal.
You will receive a list of the already-obtained skills and experiences of an individual, as well as a description of the goal. First, understand the goal and analyze the critical skills or knowledge areas required for achieving the goal. Then, compare the requirements to what the individual already possesses.
Return a list of the top 3-5 skills that the individual does not possess, but are important for their professional goal. Along with each skill, include a brief description.
Individual's expressed goal:
{{workflow.input_as_text}}
Already-obtained skills and experiences:
{{input.output_text}}
```
Note that our prompt includes context from previous nodes enclosed in {{brackets}}. You can also click "Add context" to see the context variables available to the model.
### 3. Course recommendation agent
This agent will use the web search tool to find and select online training courses that match the identified skill gaps. We'll use `gpt-5` with `minimal` reasoning and equip this agent with Web Search.
```text
Your job is to identify and recommend online training courses that help develop one or more of the skills identified. Given the list of required skills and descriptions below, return a list of 3-5 online courses along with course details.
Skills: {{input.output_text}}
```
## Testing our workflow
**Now that we've built our initial workflow, we can use the Preview functionality in Agent Builder to give it a spin!** We'll first Publish the workflow, which will create a named and versioned copy (with a unique workflow ID) that we can share with colleagues, or deploy or revert between versions as needed.
Preview lets you interact with your workflow the same way a chat user would, from directly within Agent Builder. If we upload a resume, type in a description of our dream job, and click Submit, we'll see the workflow proceed step-by-step through each node on the left of the screen, and we'll see the output from each node on the right.
As an example, I've uploaded a resume for a teacher who's looking to become a school superintendent.
We can follow the workflow as it proceeds through resume parsing, skill analysis, and web search. As the workflow completes, as expected we see a list of online programs that meet the search parameters.
**Everything looks good - we're now ready to deploy our app!** Let's make sure we publish any changes we've made, and get the workflow ID. We can select "Code" at the top of the screen to access the ID again.
Note that you can use the "Agents SDK" tab to access the code that supports the workflow you just built, implemented using the Agents SDK package (in [JavaScript/TypeScript](https://github.com/openai/openai-agents-js) or [Python](https://github.com/openai/openai-agents-python)). This is a great option if you want to run your workflow in your own environment, or develop it further with custom functionality. (However, you would miss out on some of the benefits of using AgentKit in an integrated way, as we describe below.)
## Deploying the chat app with ChatKit
To deploy our app, we'll use the [ChatKit starter template](https://github.com/openai/openai-chatkit-starter-app) to help us spin up a chat-based app using the ChatKit web component.
Before doing that, it's worth explaining the full set of options that the suite of tools across AgentKit provides for deploying your agentic workflows. We've already seen how you can build a workflow in Agent Builder, and then run it directly within the tool (Preview), or export it as Agents SDK code to use in your own environment. Now, we'll demonstrate how you can use an Agent Builder workflow ID to create a chat experience embedded in your own front-end, which points to the workflow you created as a back-end. (By the way, you can also use just the rich chat GUI provided by the ChatKit SDK, without the workflow back-end - learn more [here](https://platform.openai.com/docs/guides/custom-chatkit).)
So let's get started with the ChatKit starter template and plug in our workflow. The starter template makes it simple to spin up a chat-based app using our newly created workflow. Just follow the [Getting Started](https://github.com/openai/openai-chatkit-starter-app?tab=readme-ov-file#getting-started) instructions in the repo, entering your workflow ID from Agent Builder as the value for `NEXT_PUBLIC_CHATKIT_WORKFLOW_ID` in `.env.local`, then running `npm install` and `npm run dev` to test the app.
In just a few minutes, the workflow is embedded in a front-end chat interface that's up and running!
## Quickly iterating on workflow and user experience
One of the most valuable aspects of AgentKit is how quickly it enables you to experiment, iterate, and improve your agentic applications. Let's make some quick changes that will improve the functionality of our app and provide a richer chat experience.
**First, let's add some custom theming** to give our front-end some style, while still retaining the native chat experience. A great resource here is [ChatKit Studio](https://chatkit.studio/), which includes a playground to explore the customization options in ChatKit, a Widget Builder (which we'll see in action shortly), and sample apps and galleries for inspiration. To get our custom theme, we'll use the ChatKit [Playground](https://chatkit.studio/playground) to visually select our desired style options, then click the `>` icon at the top of the Playground screen to get the configuration code.
We'll use the `theme` object from this code to overwrite the default theme located in [lib/config.ts](https://github.com/openai/openai-chatkit-starter-app/blob/main/lib/config.ts). While we're in that file, we'll also adjust the starter prompts, greeting text, and placeholder copy to more appropriate values:
```ts
export const GREETING = "Upload your resume, and tell me the job you're looking to get!";
export const PLACEHOLDER_INPUT = "Describe your dream job, and don't forget to attach your resume!";
```
**Next, we'll design a custom widget** to display our recommended courses in a more intuitive format that makes it easier for users to understand and compare. We can use the [Widget Builder](https://widgets.chatkit.studio/) to simply describe the output we're looking for, and get an LLM-generated starting point that we can edit further. For this example, we'll present the courses in a list with a clean, structured format, and we'll also show a summary below the recommendations. In the Widget Builder, we can see not only the widget code (top-left), but sample data (bottom-left) and how that data gets rendered within the widget for the end user (right).
When we're happy with the design, we can download the .widget file. (The file used in the screenshot below is located [here](https://cdn.openai.com/cookbook/agent_walkthrough/Course%20recommendation.widget).)
To actually use the custom widget we've designed in our chat app, **we need to instruct our workflow to return the widget component** as part of our recommendation agent's response. So we'll go back to the `Course recommendations` agent, select Output format of `Widget`, and upload the .widget file.
Our agent will automatically know to output the JSON format required to populate the widget. However, we'll need to update the agent prompt to instruct the model a bit more precisely what information it needs to obtain about each course.
```text
Your job is to identify and recommend online training courses that help develop one or more of the skills identified. Given the list of required skills, return a list of 3-5 online courses along with course details including course name, provider (school or program), recommendation reason (a brief sentence on why you're recommending the course), course format, and URL. In addition to the list of courses, share a few-sentence summary of the recommendations you're making.
```
Finally, because we're dealing with resumes, we'll add a guardrail to our workflow to make sure we're not propagating any personally identifiable information (PII) where it doesn't belong. We'll insert this guardrail between our resume parser and our career analysis agents, which will help prevent anything downstream of the resume agent from having access to any PII, such as a name or contact information.
## Improving system performance using prompt optimization and trace grading
Now we’ll see how the native integrations with Evals help make it easy to optimize both individual agents and your entire workflow. Let's imagine our chat app has been deployed - perhaps to an initial set of internal users or beta testers - and we have some real-world examples of users interacting with the system.
If this application were being developed into a production-grade system where performance and quality were critical, we'd want to incorporate evals even earlier and more systematically into our development process. (You can learn more in [Eval Driven System Design](https://cookbook.openai.com/examples/partners/eval_driven_system_design/receipt_inspection) about how to build an initial set of evals and establish ground truth, map evals to business metrics, and progressively improve your system to drive business goals.)
But in this cookbook, we'll focus less on the techniques behind evals as part of LLM app development, and more about how AgentKit lets you implement these techniques more easily. We'll drive performance improvements in two ways: first we'll **optimize a single agent node in our workflow** using the prompt optimization tool, then we'll **optimize the entire workflow** using trace grading.
### Single agent optimization
We want to dive into our Course recommendations agent to see if we can improve the quality of its recommendations to users. We've isolated some sample prompts for this agent from our test environment. (You can also access completed responses in the [Logs tab](https://platform.openai.com/logs?api=responses) of the API platform. For this cookbook example, you can download the dataset we'll be using [here](https://cdn.openai.com/cookbook/agent_walkthrough/course_recommendations_dataset.csv).)
We can optimize our agent starting directly from Agent Builder. Select the Course recommendations agent, and click on "Evaluate" in the bottom right of the agent modal. This will take us directly to the **Datasets** feature within Evals.
We see the configuration of our agent has been copied over, and we're ready to optimize. Let's first upload the data file with sample prompts (note the column names should match your input and output variables), and click "Generate output" to generate responses.
Now, let's create some **human annotations** and **model graders**. We'll select "Columns" to add a Rating (thumbs up/down) and Feedback (text input), and we'll manually review our samples to populate these fields with some high-quality feedback. We'll also add a couple of model graders, which will evaluate the agent's output in an automated way based on criteria that we can specify. For this example, we might be concerned about whether the course recommendations are relevant to the skill gaps identified (relevance), whether all of the skill gaps are addressed (coverage), and whether the recommendation summary that is presented is appropriate (style).
Here are example model grader prompts for each criterion:
```text
[relevance] You are evaluating whether a list of recommended courses is relevant to the skills described. Return a pass if all courses are relevant to at least one skill, and fail otherwise.
[coverage] You are evaluating whether a list of recommended courses covers all of the skills described. Return a pass if all of the skills are covered by at least one course, and fail otherwise.
[summary] You are evaluating whether the summary recommendation provided is relevant, thoughtful, and related to the recommended courses proposed. Evaluate the summary recommendation on a scale of 0 to 1, with 1 being the highest quality.
```
We'll use GPT-5 for our model graders, and include a 0.7 threshold for the summary grader.
We'll now select Grade > All graders to run these graders against the system output. As the grading proceeds, we'll start to see the cells populated to indicate how each example scored on our model grader criteria.
**Now, here's where the magic happens: we can click Optimize to automatically rewrite our prompt based on the feedback we've provided - both the model grader output and the human-provided feedback.**
If we examine the new prompt, we see that our prompt contains new **Requirements** and **Output** format sections to instruct the model to make the course descriptions more specific, and aim for better coverage of the different skills.
```text
Requirements:
- Use the web search tool to find and verify real, currently available online courses and their direct URLs.
- Return 3–5 courses that collectively cover the skills. If a course spans multiple skills, indicate it.
- Be specific and concise. Each course description must be one sentence (max 35 words) focused on outcomes and topics tied to the skills.
- Provide plain text only; no citations or references.
Output format:
1) Summary (2–4 sentences) explaining how the selections address the skills and any coverage tradeoffs.
2) Courses (3–5 items). For each course, include exactly:
- Course name — Provider (school or program)
- Description: [one sentence, max 35 words]
- URL: [direct course page]
- Skills covered: [list skill names from below]
```
Now, we can click Update to automatically insert the new prompt into our workflow in Agent Builder. **In just a few minutes, we've been able to use real prompt examples and feedback to automatically improve our system's performance - all directly within the Agent Builder and Evals tools.**
(Although in this cookbook we've optimized our prompt automatically using this grading output, it's often very helpful to examine specific failure examples to understand in what ways the model might be making mistakes. This analysis can help us generate more precise human-based or model-graded feedback, and even synthesize data to help improve performance against specific failure modes.)
### Entire workflow optimization
Once we're comfortable with the performance of each individual agent node, we can turn our attention to the full workflow.
Previously, in order to analyze and improve the performance of a complex workflow involving multiple agents, you'd need to read through entire traces of your workflow to understand exactly where and why the system was failing. This can be a time-consuming process, especially when you have a large number of trace examples.
Using **trace grading**, we can now run end-to-end assessments of full sets of traces using automated model grading. We'll create graders to describe the behavior we're looking to correct, and we'll automatically run these graders across our entire data set.
To get started, we'll go back to our workflow in Agent Builder and click Evaluate on the top of the screen. This lets us deep-dive into the traces that have been generated for our workflow runs, including examining the inputs and outputs for all nodes (in this case, the prompts and responses for each of our agents).
We can create graders to run across the entire model trace, defining evaluation criteria for the end-to-end workflow that **spans multiple agents**. For example, we might want to ensure that the final recommendation summary (output of agent 3) is relevant to the user’s initial input about their career goals (input to agent 1). And, we might want to check that the recommended courses (output of agent 3) are not duplicative with the skills the user already possesses (output of agent 1).
If you had a workflow with conditional statements or while loops, you could grade against more complex multi-step behavior, such as ensuring a support agent doesn't exchange more than three responses with a user without escalating to a supervisor.
Once we have a full set of grading criteria, we select Grade all to grade our traces. This action will lead us to the Evaluations tab, where we can see a new eval has been created and an eval run has been kicked off.
We can then dive into the workflow traces for our failure cases to better understand why a specific workflow run failed and how we can improve our system to avoid the failure. This approach helps you optimize complex workflows more efficiently by iteratively identifying failure modes, evaluating your system's performance, and targeting improvements.
## Recap and resources
We demonstrated how **Agent Builder**, **ChatKit**, and **Evals** work together to help you build, deploy, and optimize agentic workflows.
With a specific example — a career development app that analyzes resumes, identifies skill gaps, and recommends online courses — we saw how Agent Builder makes it easy to design and build multi-agent workflows, ChatKit lets us embed those workflows in a rich and customizable chat UI, and Evals close the loop by enabling prompt optimization and trace grading against real data.
To learn more, here's a list of some of the resources mentioned in this cookbook:
* [Agent Builder documentation](https://platform.openai.com/docs/guides/agent-builder)
* [ChatKit starter template](https://github.com/openai/openai-chatkit-starter-app)
* [ChatKit Studio](https://chatkit.studio/)
* Agents SDK ([JavaScript/TypeScript](https://github.com/openai/openai-agents-js), [Python](https://github.com/openai/openai-agents-python))
* [Evals](https://platform.openai.com/docs/guides/evals?api-mode=responses)
Happy building!
---
# Source: https://developers.openai.com/codex/guides/agents-md.md
# Custom instructions with AGENTS.md
Codex reads `AGENTS.md` files before doing any work. By layering global guidance with project-specific overrides, you can start each task with consistent expectations, no matter which repository you open.
## How Codex discovers guidance
Codex builds an instruction chain when it starts (once per run; in the TUI this usually means once per launched session). Discovery follows this precedence order:
1. **Global scope:** In your Codex home directory (defaults to `~/.codex`, unless you set `CODEX_HOME`), Codex reads `AGENTS.override.md` if it exists. Otherwise, Codex reads `AGENTS.md`. Codex uses only the first non-empty file at this level.
2. **Project scope:** Starting at the project root (typically the Git root), Codex walks down to your current working directory. If Codex cannot find a project root, it only checks the current directory. In each directory along the path, it checks for `AGENTS.override.md`, then `AGENTS.md`, then any fallback names in `project_doc_fallback_filenames`. Codex includes at most one file per directory.
3. **Merge order:** Codex concatenates files from the root down, joining them with blank lines. Files closer to your current directory override earlier guidance because they appear later in the combined prompt.
Codex skips empty files and stops adding files once the combined size reaches the limit defined by `project_doc_max_bytes` (32 KiB by default). For details on these knobs, see [Project instructions discovery](https://developers.openai.com/codex/config-advanced#project-instructions-discovery). Raise the limit or split instructions across nested directories when you hit the cap.
## Create global guidance
Create persistent defaults in your Codex home directory so every repository inherits your working agreements.
1. Ensure the directory exists:
```bash
mkdir -p ~/.codex
```
2. Create `~/.codex/AGENTS.md` with reusable preferences:
```md
# ~/.codex/AGENTS.md
## Working agreements
- Always run `npm test` after modifying JavaScript files.
- Prefer `pnpm` when installing dependencies.
- Ask for confirmation before adding new production dependencies.
```
3. Run Codex anywhere to confirm it loads the file:
```bash
codex --ask-for-approval never "Summarize the current instructions."
```
Expected: Codex quotes the items from `~/.codex/AGENTS.md` before proposing work.
Use `~/.codex/AGENTS.override.md` when you need a temporary global override without deleting the base file. Remove the override to restore the shared guidance.
## Layer project instructions
Repository-level files keep Codex aware of project norms while still inheriting your global defaults.
1. In your repository root, add an `AGENTS.md` that covers basic setup:
```md
# AGENTS.md
## Repository expectations
- Run `npm run lint` before opening a pull request.
- Document public utilities in `docs/` when you change behavior.
```
2. Add overrides in nested directories when specific teams need different rules. For example, inside `services/payments/` create `AGENTS.override.md`:
```md
# services/payments/AGENTS.override.md
## Payments service rules
- Use `make test-payments` instead of `npm test`.
- Never rotate API keys without notifying the security channel.
```
3. Start Codex from the payments directory:
```bash
codex --cd services/payments --ask-for-approval never "List the instruction sources you loaded."
```
Expected: Codex reports the global file first, the repository root `AGENTS.md` second, and the payments override last.
Codex stops searching once it reaches your current directory, so place overrides as close to specialized work as possible.
Here is a sample repository after you add a global file and a payments-specific override:
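```
~/.codex/AGENTS.md                  # global working agreements
<repository root>/
├── AGENTS.md                       # repository-wide expectations
└── services/
    └── payments/
        └── AGENTS.override.md      # payments-specific rules
```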
## Customize fallback filenames
If your repository already uses a different filename (for example `TEAM_GUIDE.md`), add it to the fallback list so Codex treats it like an instructions file.
1. Edit your Codex configuration:
```toml
# ~/.codex/config.toml
project_doc_fallback_filenames = ["TEAM_GUIDE.md", ".agents.md"]
project_doc_max_bytes = 65536
```
2. Restart Codex or run a new command so the updated configuration loads.
Now Codex checks each directory in this order: `AGENTS.override.md`, `AGENTS.md`, `TEAM_GUIDE.md`, `.agents.md`. Filenames not on this list are ignored for instruction discovery. The larger byte limit allows more combined guidance before truncation.
With the fallback list in place, Codex treats the alternate files as instructions.
Set the `CODEX_HOME` environment variable when you want a different profile, such as a project-specific automation user:
```bash
CODEX_HOME=$(pwd)/.codex codex exec "List active instruction sources"
```
Expected: The output lists files relative to the custom `.codex` directory.
## Verify your setup
- Run `codex --ask-for-approval never "Summarize the current instructions."` from a repository root. Codex should echo guidance from global and project files in precedence order.
- Use `codex --cd subdir --ask-for-approval never "Show which instruction files are active."` to confirm nested overrides replace broader rules.
- Check `~/.codex/log/codex-tui.log` (or the most recent `session-*.jsonl` file if you enabled session logging) after a session if you need to audit which instruction files Codex loaded.
- If instructions look stale, restart Codex in the target directory. Codex rebuilds the instruction chain on every run (and at the start of each TUI session), so there is no cache to clear manually.
## Troubleshoot discovery issues
- **Nothing loads:** Verify you are in the intended repository and that `codex status` reports the workspace root you expect. Ensure instruction files contain content; Codex ignores empty files.
- **Wrong guidance appears:** Look for an `AGENTS.override.md` higher in the directory tree or under your Codex home. Rename or remove the override to fall back to the regular file.
- **Codex ignores fallback names:** Confirm you listed the names in `project_doc_fallback_filenames` without typos, then restart Codex so the updated configuration takes effect.
- **Instructions truncated:** Raise `project_doc_max_bytes` or split large files across nested directories to keep critical guidance intact.
- **Profile confusion:** Run `echo $CODEX_HOME` before launching Codex. A non-default value points Codex at a different home directory than the one you edited.
## Next steps
- Visit the official [AGENTS.md](https://agents.md) website for more information.
- Review [Prompting Codex](https://developers.openai.com/codex/prompting) for conversational patterns that pair well with persistent guidance.
---
# Source: https://developers.openai.com/resources/guide/agents-quickstart-guide.md
# Agents SDK quickstart
> Step-by-step guide to quickly build agents with the OpenAI Agents SDK.
- Type: Guide
- Tags: agents
- URL: https://openai.github.io/openai-agents-python/quickstart/
- Created: 2025-07-21
- Updated: 2025-08-13
## Summary
Walkthrough for configuring and running your first agent. — agents, Agents SDK, agentic, tool calling
## Details
Provides instructions for setting up the Agents SDK and deploying a basic agent.
---
# Source: https://developers.openai.com/resources/code/agents-sdk-python.md
# Agents SDK — Python
> Python SDK for developing agents with OpenAI.
- Type: Code
- Tags: agents
- URL: https://github.com/openai/openai-agents-python
- Created: 2025-07-18
- Updated: 2025-08-13
## Summary
Library for building OpenAI agents using Python. — Agents SDK, agentic, tool calling
## Details
Offers Python modules and utilities to create agent applications.
---
# Source: https://developers.openai.com/resources/code/agents-sdk-quickstart.md
# Agents SDK quickstart
> Quickstart project for building agents with the Agents SDK.
- Type: Code
- Tags: agents
- URL: https://openai.github.io/openai-agents-python/quickstart/
- Created: 2025-07-21
- Updated: 2025-08-13
## Summary
Starter code to spin up your first agent in minutes. — agents, Agents SDK, agentic, tool calling
## Details
Provides boilerplate and instructions to initialize and run an agent using the OpenAI Agents SDK.
---
# Source: https://developers.openai.com/resources/code/agents-sdk-typescript.md
# Agents SDK — TypeScript
> TypeScript SDK for developing agents with OpenAI.
- Type: Code
- Tags: agents
- URL: https://github.com/openai/openai-agents-js
- Created: 2025-07-18
- Updated: 2025-08-13
## Summary
Library and tools for building OpenAI agents in TypeScript. — Agents SDK, agentic, tool calling
## Details
Provides TypeScript interfaces and utilities for agent development.
---
# Source: https://developers.openai.com/codex/guides/agents-sdk.md
# Use Codex with the Agents SDK
# Running Codex as an MCP server
You can run Codex as an MCP server and connect it from other MCP clients (for example, an agent built with the [OpenAI Agents SDK](https://openai.github.io/openai-agents-js/guides/mcp/)).
To start Codex as an MCP server, you can use the following command:
```bash
codex mcp-server
```
You can launch a Codex MCP server with the [Model Context Protocol Inspector](https://modelcontextprotocol.io/legacy/tools/inspector):
```bash
npx @modelcontextprotocol/inspector codex mcp-server
```
Send a `tools/list` request to see two tools:
**`codex`**: Run a Codex session. Accepts configuration parameters that match the Codex `Config` struct. The `codex` tool takes these properties:
| Property | Type | Description |
| ----------------------- | --------- | ------------------------------------------------------------------------------------------------------------ |
| **`prompt`** (required) | `string` | The initial user prompt to start the Codex conversation. |
| `approval-policy` | `string` | Approval policy for shell commands generated by the model: `untrusted`, `on-request`, `on-failure`, `never`. |
| `base-instructions` | `string` | The set of instructions to use instead of the default ones. |
| `config` | `object` | Individual configuration settings that override what's in `$CODEX_HOME/config.toml`. |
| `cwd` | `string` | Working directory for the session. If relative, resolved against the server process's current directory. |
| `include-plan-tool` | `boolean` | Whether to include the plan tool in the conversation. |
| `model` | `string` | Optional override for the model name (for example, `o3`, `o4-mini`). |
| `profile` | `string` | Configuration profile from `config.toml` to specify default options. |
| `sandbox` | `string` | Sandbox mode: `read-only`, `workspace-write`, or `danger-full-access`. |
**`codex-reply`**: Continue a Codex session by providing the thread ID and prompt. The `codex-reply` tool takes these properties:
| Property | Type | Description |
| ----------------------------- | ------ | --------------------------------------------------------- |
| **`prompt`** (required) | string | The next user prompt to continue the Codex conversation. |
| **`threadId`** (required) | string | The ID of the thread to continue. |
| `conversationId` (deprecated) | string | Deprecated alias for `threadId` (kept for compatibility). |
Use the `threadId` from `structuredContent.threadId` in the `tools/call` response. Approval elicitations (exec/patch) also include `threadId` in their `params` payload.
Example response payload:
```json
{
"structuredContent": {
"threadId": "019bbb20-bff6-7130-83aa-bf45ab33250e",
"content": "`ls -lah` (or `ls -alh`) — long listing, includes dotfiles, human-readable sizes."
},
"content": [
{
"type": "text",
"text": "`ls -lah` (or `ls -alh`) — long listing, includes dotfiles, human-readable sizes."
}
]
}
```
Note that modern MCP clients generally report only `"structuredContent"` as the result of a tool call when it is present; the Codex MCP server also returns `"content"` for the benefit of older MCP clients.
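As an example, here is a minimal sketch of driving both tools from a TypeScript MCP client using the [MCP TypeScript SDK](https://github.com/modelcontextprotocol/typescript-sdk). The prompts, the client name, and the chosen `sandbox` and `approval-policy` values are illustrative, and error handling is omitted:

```ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main() {
  // Launch `codex mcp-server` as a child process and connect over stdio.
  const transport = new StdioClientTransport({ command: "codex", args: ["mcp-server"] });
  const client = new Client({ name: "codex-mcp-demo", version: "0.1.0" });
  await client.connect(transport);

  // Start a Codex session with the `codex` tool.
  const first = await client.callTool({
    name: "codex",
    arguments: {
      prompt: "List the files in this repository and suggest a README outline.",
      sandbox: "read-only",
      "approval-policy": "never",
    },
  });
  const threadId = (first as any).structuredContent?.threadId;

  // Continue the same session with `codex-reply`, passing the thread ID back.
  await client.callTool({
    name: "codex-reply",
    arguments: { threadId, prompt: "Now draft the README itself." },
  });

  await client.close();
}

main();
```

Long Codex turns can exceed the SDK's default per-call timeout, so a real client would likely pass a larger timeout when calling these tools.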
# Creating multi-agent workflows
Codex CLI can do far more than run ad-hoc tasks. By exposing the CLI as a [Model Context Protocol](https://modelcontextprotocol.io/) (MCP) server and orchestrating it with the OpenAI Agents SDK, you can create deterministic, auditable workflows that scale from a single agent to a complete software delivery pipeline.
This guide walks through the same workflow showcased in the [OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/codex/codex_mcp_agents_sdk/building_consistent_workflows_codex_cli_agents_sdk.ipynb). You will:
- launch Codex CLI as a long-running MCP server,
- build a focused single-agent workflow that produces a playable browser game, and
- orchestrate a multi-agent team with hand-offs, guardrails, and full traces you can review afterwards.
Before starting, make sure you have:
- [Codex CLI](https://developers.openai.com/codex/cli) installed locally so `npx codex` can run.
- Python 3.10+ with `pip`.
- Node.js 18+ (required for `npx`).
- An OpenAI API key stored locally. You can create or manage keys in the [OpenAI dashboard](https://platform.openai.com/account/api-keys).
Create a working directory for the guide and add your API key to a `.env` file:
```bash
mkdir codex-workflows
cd codex-workflows
printf "OPENAI_API_KEY=sk-..." > .env
```
## Install dependencies
The Agents SDK handles orchestration across Codex, hand-offs, and traces. Install the latest SDK packages:
```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade openai openai-agents python-dotenv
```
Activating a virtual environment keeps the SDK dependencies isolated from the
rest of your system.
## Initialize Codex CLI as an MCP server
Start by turning Codex CLI into an MCP server that the Agents SDK can call. The server exposes two tools—`codex()` to start a conversation and `codex-reply()` to continue one—and keeps Codex alive across multiple agent turns.
Create a file called `codex_mcp.py` and add the following:
```python
import asyncio
from agents import Agent, Runner
from agents.mcp import MCPServerStdio
async def main() -> None:
async with MCPServerStdio(
name="Codex CLI",
params={
"command": "npx",
"args": ["-y", "codex", "mcp-server"],
},
client_session_timeout_seconds=360000,
) as codex_mcp_server:
print("Codex MCP server started.")
# More logic coming in the next sections.
return
if __name__ == "__main__":
asyncio.run(main())
```
Run the script once to verify that Codex launches successfully:
```bash
python codex_mcp.py
```
The script exits after printing `Codex MCP server started.`. In the next sections you will reuse the same MCP server inside richer workflows.
## Build a single-agent workflow
Let’s start with a scoped example that uses Codex MCP to ship a small browser game. The workflow relies on two agents:
1. **Game Designer** – writes a brief for the game.
2. **Game Developer** – implements the game by calling Codex MCP.
Update `codex_mcp.py` with the following code. It keeps the MCP server setup from above and adds both agents.
```python
import asyncio
import os
from dotenv import load_dotenv
from agents import Agent, Runner, set_default_openai_key
from agents.mcp import MCPServerStdio
load_dotenv(override=True)
set_default_openai_key(os.getenv("OPENAI_API_KEY"))
async def main() -> None:
async with MCPServerStdio(
name="Codex CLI",
params={
"command": "npx",
"args": ["-y", "codex", "mcp-server"],
},
client_session_timeout_seconds=360000,
) as codex_mcp_server:
developer_agent = Agent(
name="Game Developer",
instructions=(
"You are an expert in building simple games using basic html + css + javascript with no dependencies. "
"Save your work in a file called index.html in the current directory. "
"Always call codex with \"approval-policy\": \"never\" and \"sandbox\": \"workspace-write\"."
),
mcp_servers=[codex_mcp_server],
)
designer_agent = Agent(
name="Game Designer",
instructions=(
"You are an indie game connoisseur. Come up with an idea for a single page html + css + javascript game that a developer could build in about 50 lines of code. "
"Format your request as a 3 sentence design brief for a game developer and call the Game Developer coder with your idea."
),
model="gpt-5",
handoffs=[developer_agent],
)
await Runner.run(designer_agent, "Implement a fun new game!")
if __name__ == "__main__":
asyncio.run(main())
```
Execute the script:
```bash
python codex_mcp.py
```
Codex will read the designer’s brief, create an `index.html` file, and write the full game to disk. Open the generated file in a browser to play the result. Every run produces a different design with unique gameplay twists and polish.
## Expand to a multi-agent workflow
Now turn the single-agent setup into an orchestrated, traceable workflow. The system adds:
- **Project Manager** – creates shared requirements, coordinates hand-offs, and enforces guardrails.
- **Designer**, **Frontend Developer**, **Backend Developer**, and **Tester** – each with scoped instructions and output folders.
Create a new file called `multi_agent_workflow.py`:
```python
import asyncio
import os
from dotenv import load_dotenv
from agents import (
Agent,
ModelSettings,
Runner,
WebSearchTool,
    set_default_openai_key,
)
from agents.extensions.handoff_prompt import RECOMMENDED_PROMPT_PREFIX
from agents.mcp import MCPServerStdio
from openai.types.shared import Reasoning
load_dotenv(override=True)
set_default_openai_key(os.getenv("OPENAI_API_KEY"))
async def main() -> None:
async with MCPServerStdio(
name="Codex CLI",
params={"command": "npx", "args": ["-y", "codex", "mcp"]},
client_session_timeout_seconds=360000,
) as codex_mcp_server:
designer_agent = Agent(
name="Designer",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Designer.\n"
"Your only source of truth is AGENT_TASKS.md and REQUIREMENTS.md from the Project Manager.\n"
"Do not assume anything that is not written there.\n\n"
"You may use the internet for additional guidance or research."
"Deliverables (write to /design):\n"
"- design_spec.md – a single page describing the UI/UX layout, main screens, and key visual notes as requested in AGENT_TASKS.md.\n"
"- wireframe.md – a simple text or ASCII wireframe if specified.\n\n"
"Keep the output short and implementation-friendly.\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
tools=[WebSearchTool()],
mcp_servers=[codex_mcp_server],
)
frontend_developer_agent = Agent(
name="Frontend Developer",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Frontend Developer.\n"
"Read AGENT_TASKS.md and design_spec.md. Implement exactly what is described there.\n\n"
"Deliverables (write to /frontend):\n"
"- index.html – main page structure\n"
"- styles.css or inline styles if specified\n"
"- main.js or game.js if specified\n\n"
"Follow the Designer’s DOM structure and any integration points given by the Project Manager.\n"
"Do not add features or branding beyond the provided documents.\n\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager_agent."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
mcp_servers=[codex_mcp_server],
)
backend_developer_agent = Agent(
name="Backend Developer",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Backend Developer.\n"
"Read AGENT_TASKS.md and REQUIREMENTS.md. Implement the backend endpoints described there.\n\n"
"Deliverables (write to /backend):\n"
"- package.json – include a start script if requested\n"
"- server.js – implement the API endpoints and logic exactly as specified\n\n"
"Keep the code as simple and readable as possible. No external database.\n\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager_agent."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
mcp_servers=[codex_mcp_server],
)
tester_agent = Agent(
name="Tester",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Tester.\n"
"Read AGENT_TASKS.md and TEST.md. Verify that the outputs of the other roles meet the acceptance criteria.\n\n"
"Deliverables (write to /tests):\n"
"- TEST_PLAN.md – bullet list of manual checks or automated steps as requested\n"
"- test.sh or a simple automated script if specified\n\n"
"Keep it minimal and easy to run.\n\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
mcp_servers=[codex_mcp_server],
)
project_manager_agent = Agent(
name="Project Manager",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"""
You are the Project Manager.
Objective:
Convert the input task list into three project-root files the team will execute against.
Deliverables (write in project root):
- REQUIREMENTS.md: concise summary of product goals, target users, key features, and constraints.
- TEST.md: tasks with [Owner] tags (Designer, Frontend, Backend, Tester) and clear acceptance criteria.
- AGENT_TASKS.md: one section per role containing:
- Project name
- Required deliverables (exact file names and purpose)
- Key technical notes and constraints
Process:
- Resolve ambiguities with minimal, reasonable assumptions. Be specific so each role can act without guessing.
- Create files using Codex MCP with {"approval-policy":"never","sandbox":"workspace-write"}.
- Do not create folders. Only create REQUIREMENTS.md, TEST.md, AGENT_TASKS.md.
Handoffs (gated by required files):
1) After the three files above are created, hand off to the Designer with transfer_to_designer_agent and include REQUIREMENTS.md and AGENT_TASKS.md.
2) Wait for the Designer to produce /design/design_spec.md. Verify that file exists before proceeding.
3) When design_spec.md exists, hand off in parallel to both:
- Frontend Developer with transfer_to_frontend_developer_agent (provide design_spec.md, REQUIREMENTS.md, AGENT_TASKS.md).
- Backend Developer with transfer_to_backend_developer_agent (provide REQUIREMENTS.md, AGENT_TASKS.md).
4) Wait for Frontend to produce /frontend/index.html and Backend to produce /backend/server.js. Verify both files exist.
5) When both exist, hand off to the Tester with transfer_to_tester_agent and provide all prior artifacts and outputs.
6) Do not advance to the next handoff until the required files for that step are present. If something is missing, request the owning agent to supply it and re-check.
PM Responsibilities:
- Coordinate all roles, track file completion, and enforce the above gating checks.
- Do NOT respond with status updates. Just handoff to the next agent until the project is complete.
"""
),
model="gpt-5",
model_settings=ModelSettings(
reasoning=Reasoning(effort="medium"),
),
handoffs=[designer_agent, frontend_developer_agent, backend_developer_agent, tester_agent],
mcp_servers=[codex_mcp_server],
)
designer_agent.handoffs = [project_manager_agent]
frontend_developer_agent.handoffs = [project_manager_agent]
backend_developer_agent.handoffs = [project_manager_agent]
tester_agent.handoffs = [project_manager_agent]
task_list = """
Goal: Build a tiny browser game to showcase a multi-agent workflow.
High-level requirements:
- Single-screen game called "Bug Busters".
- Player clicks a moving bug to earn points.
- Game ends after 20 seconds and shows final score.
- Optional: submit score to a simple backend and display a top-10 leaderboard.
Roles:
- Designer: create a one-page UI/UX spec and basic wireframe.
- Frontend Developer: implement the page and game logic.
- Backend Developer: implement a minimal API (GET /health, GET/POST /scores).
- Tester: write a quick test plan and a simple script to verify core routes.
Constraints:
- No external database—memory storage is fine.
- Keep everything readable for beginners; no frameworks required.
- All outputs should be small files saved in clearly named folders.
"""
result = await Runner.run(project_manager_agent, task_list, max_turns=30)
print(result.final_output)
if __name__ == "__main__":
asyncio.run(main())
```
Run the script and watch the generated files:
```bash
python multi_agent_workflow.py
ls -R
```
The project manager agent writes `REQUIREMENTS.md`, `TEST.md`, and `AGENT_TASKS.md`, then coordinates hand-offs across the designer, frontend, backend, and tester agents. Each agent writes scoped artifacts in its own folder before handing control back to the project manager.
## Trace the workflow
The Agents SDK automatically records traces that capture every prompt, tool call, and hand-off. After the multi-agent run completes, open the [Traces dashboard](https://platform.openai.com/trace) to inspect the execution timeline.
The high-level trace highlights how the project manager verifies hand-offs before moving forward. Click into individual steps to see prompts, Codex MCP calls, files written, and execution durations. These details make it easy to audit every hand-off and understand how the workflow evolved turn by turn.
These traces make it easy to debug workflow hiccups, audit agent behavior, and measure performance over time without requiring any additional instrumentation.
---
# Source: https://developers.openai.com/codex/guides/api-key.md
# Using an OpenAI API key
You can extend your local Codex usage (CLI and IDE extension) with an API key. API key usage is billed through your OpenAI platform account at the standard API rates, which you can review on the [API pricing page](https://openai.com/api/pricing/).
First, make sure you set up your `OPENAI_API_KEY` environment variable globally. You can get your API key from the [OpenAI dashboard](https://platform.openai.com/api-keys).
Then, you can use the CLI and IDE extension with your API key.
If you’ve previously used the Codex CLI with an API key, update to the latest version, run `codex logout`, and then run `codex` to switch back to subscription-based access when you’re ready.
### Use your API key with Codex CLI
You can change which auth method the CLI uses by setting `preferred_auth_method` in the Codex config file:
```toml
# ~/.codex/config.toml
preferred_auth_method = "apikey"
```
You can also override it ad-hoc via CLI:
```bash
codex --config preferred_auth_method="apikey"
```
You can go back to ChatGPT auth (default) by running:
```bash
codex --config preferred_auth_method="chatgpt"
```
You can switch back and forth as needed, for example if you use your ChatGPT account but run out of usage credits.
### Use your API key with the IDE extension
When you open the IDE extension, you’ll be prompted to sign in with your ChatGPT account or to use your API key instead. If you choose the API key option, make sure the key is configured in your environment variables.
---
# Source: https://developers.openai.com/codex/app-server.md
# Codex App Server
Codex app-server is the interface Codex uses to power rich clients (for example, the Codex VS Code extension). Use it when you want a deep integration inside your own product: authentication, conversation history, approvals, and streamed agent events. The app-server implementation is open source in the Codex GitHub repository ([openai/codex/codex-rs/app-server](https://github.com/openai/codex/tree/main/codex-rs/app-server)). See the [Open Source](https://developers.openai.com/codex/open-source) page for the full list of open-source Codex components.
If you are automating jobs or running Codex in CI, use the
Codex SDK instead.
## Protocol
Like [MCP](https://modelcontextprotocol.io/), `codex app-server` supports bidirectional communication and streams JSONL over stdio. The protocol is JSON-RPC 2.0, but it omits the `"jsonrpc":"2.0"` header.
## Message schema
Requests include `method`, `params`, and `id`:
```json
{ "method": "thread/start", "id": 10, "params": { "model": "gpt-5.1-codex" } }
```
Responses echo the `id` with either `result` or `error`:
```json
{ "id": 10, "result": { "thread": { "id": "thr_123" } } }
```
```json
{ "id": 10, "error": { "code": 123, "message": "Something went wrong" } }
```
Notifications omit `id` and use only `method` and `params`:
```json
{ "method": "turn/started", "params": { "turn": { "id": "turn_456" } } }
```
You can generate a TypeScript schema or a JSON Schema bundle from the CLI. Each output is specific to the Codex version you ran, so the generated artifacts match that version exactly:
```bash
codex app-server generate-ts --out ./schemas
codex app-server generate-json-schema --out ./schemas
```
## Getting started
1. Start the server with `codex app-server`. It waits for JSONL over standard input and prints only protocol messages.
2. Connect a client over stdio, then send `initialize` followed by the `initialized` notification.
3. Start a thread and a turn, then keep reading notifications from stdout.
Example (Node.js / TypeScript):
```ts
import { spawn } from "node:child_process";
import * as readline from "node:readline";

const proc = spawn("codex", ["app-server"], {
stdio: ["pipe", "pipe", "inherit"],
});
const rl = readline.createInterface({ input: proc.stdout });
const send = (message: unknown) => {
proc.stdin.write(`${JSON.stringify(message)}\n`);
};
let threadId: string | null = null;
rl.on("line", (line) => {
const msg = JSON.parse(line) as any;
console.log("server:", msg);
if (msg.id === 1 && msg.result?.thread?.id && !threadId) {
threadId = msg.result.thread.id;
send({
method: "turn/start",
id: 2,
params: {
threadId,
input: [{ type: "text", text: "Summarize this repo." }],
},
});
}
});
send({
method: "initialize",
id: 0,
params: {
clientInfo: {
name: "my_product",
title: "My Product",
version: "0.1.0",
},
},
});
send({ method: "initialized", params: {} });
send({ method: "thread/start", id: 1, params: { model: "gpt-5.1-codex" } });
```
## Core primitives
- **Thread**: A conversation between a user and the Codex agent. Threads contain turns.
- **Turn**: A single user request and the agent work that follows. Turns contain items and stream incremental updates.
- **Item**: A unit of input or output (user message, agent message, command runs, file change, tool call, and more).
Use the thread APIs to create, list, or archive conversations. Drive a conversation with turn APIs and stream progress via turn notifications.
## Lifecycle overview
- **Initialize once**: Immediately after launching `codex app-server`, send an `initialize` request with your client metadata, then emit `initialized`. The server rejects any request before this handshake.
- **Start (or resume) a thread**: Call `thread/start` for a new conversation, `thread/resume` to continue an existing one, or `thread/fork` to branch history into a new thread id.
- **Begin a turn**: Call `turn/start` with the target `threadId` and user input. Optional fields override model, `cwd`, sandbox policy, and more.
- **Stream events**: After `turn/start`, keep reading notifications on stdout: `item/started`, `item/completed`, `item/agentMessage/delta`, tool progress, and other updates.
- **Finish the turn**: The server emits `turn/completed` with final status when the model finishes or after a `turn/interrupt` cancellation.
## Initialization
Clients must send a single `initialize` request before invoking any other method, then acknowledge with an `initialized` notification. Requests sent before initialization receive a `Not initialized` error, and repeated `initialize` calls return `Already initialized`.
The server returns the user agent string it will present to upstream services. Set `clientInfo` to identify your integration.
**Important**: Use `clientInfo.name` to identify your client for the OpenAI Compliance Logs Platform. If you are developing a new Codex integration intended for enterprise use, please contact OpenAI to get it added to a known clients list. For more context, see the [Codex logs reference](https://chatgpt.com/admin/api-reference#tag/Logs:-Codex).
Example (from the Codex VS Code extension):
```json
{
"method": "initialize",
"id": 0,
"params": {
"clientInfo": {
"name": "codex_vscode",
"title": "Codex VS Code Extension",
"version": "0.1.0"
}
}
}
```
## API overview
- `thread/start` - create a new thread; emits `thread/started` and automatically subscribes you to turn/item events for that thread.
- `thread/resume` - reopen an existing thread by id so later `turn/start` calls append to it.
- `thread/fork` - fork a thread into a new thread id by copying stored history; emits `thread/started` for the new thread.
- `thread/read` - read a stored thread by id without resuming it; set `includeTurns` to return full turn history.
- `thread/list` - page through stored thread logs; supports cursor-based pagination plus `modelProviders`, `sourceKinds`, and `archived` filters.
- `thread/loaded/list` - list the thread ids currently loaded in memory.
- `thread/archive` - move a thread's log file into the archived directory; returns `{}` on success.
- `thread/unarchive` - restore an archived thread rollout back into the active sessions directory; returns the restored `thread`.
- `thread/rollback` - drop the last N turns from the in-memory context and persist a rollback marker; returns the updated `thread`.
- `turn/start` - add user input to a thread and begin Codex generation; responds with the initial `turn` and streams events.
- `turn/interrupt` - request cancellation of an in-flight turn; success is `{}` and the turn ends with `status: "interrupted"`.
- `review/start` - kick off the Codex reviewer for a thread; emits `enteredReviewMode` and `exitedReviewMode` items.
- `command/exec` - run a single command under the server sandbox without starting a thread/turn.
- `model/list` - list available models (with effort options).
- `collaborationMode/list` - list collaboration mode presets (experimental, no pagination).
- `skills/list` - list skills for one or more `cwd` values (optional `forceReload`).
- `app/list` - list available apps (connectors) with pagination.
- `skills/config/write` - enable or disable skills by path.
- `mcpServer/oauth/login` - start an OAuth login for a configured MCP server; returns an authorization URL and emits `mcpServer/oauthLogin/completed` on completion.
- `tool/requestUserInput` - prompt the user with 1-3 short questions for a tool call (experimental); questions can set `isOther` for a free-form option.
- `config/mcpServer/reload` - reload MCP server configuration from disk and queue a refresh for loaded threads.
- `mcpServerStatus/list` - list MCP servers, tools, resources, and auth status (cursor + limit pagination).
- `feedback/upload` - submit a feedback report (classification + optional reason/logs + conversation id).
- `config/read` - fetch the effective configuration on disk after resolving configuration layering.
- `config/value/write` - write a single configuration key/value to the user's `config.toml` on disk.
- `config/batchWrite` - apply configuration edits atomically to the user's `config.toml` on disk.
- `configRequirements/read` - fetch requirements from `requirements.toml` and/or MDM, including allow-lists and residency requirements (or `null` if you haven't set any up).
## Threads
- `thread/read` reads a stored thread without subscribing to it; set `includeTurns` to include turns.
- `thread/list` supports cursor pagination plus `modelProviders`, `sourceKinds`, and `archived` filtering.
- `thread/loaded/list` returns the thread IDs currently in memory.
- `thread/archive` moves the thread's persisted JSONL log into the archived directory.
- `thread/unarchive` restores an archived thread rollout back into the active sessions directory.
- `thread/rollback` drops the last N turns from the in-memory context and records a rollback marker in the thread's persisted JSONL log.
### Start or resume a thread
Start a fresh thread when you need a new Codex conversation.
```json
{ "method": "thread/start", "id": 10, "params": {
"model": "gpt-5.1-codex",
"cwd": "/Users/me/project",
"approvalPolicy": "never",
"sandbox": "workspaceWrite"
} }
{ "id": 10, "result": {
"thread": {
"id": "thr_123",
"preview": "",
"modelProvider": "openai",
"createdAt": 1730910000
}
} }
{ "method": "thread/started", "params": { "thread": { "id": "thr_123" } } }
```
To continue a stored session, call `thread/resume` with the `thread.id` you recorded earlier. The response shape matches `thread/start`:
```json
{ "method": "thread/resume", "id": 11, "params": { "threadId": "thr_123" } }
{ "id": 11, "result": { "thread": { "id": "thr_123" } } }
```
Resuming a thread does not update `thread.updatedAt` (or the rollout file's modified time) by itself. The timestamp updates when you start a turn.
Dynamic tools supplied on `thread/start` (`dynamicTools`) are persisted in the thread rollout metadata and restored on `thread/resume` when you do not supply new dynamic tools.
To branch from a stored session, call `thread/fork` with the `thread.id`. This creates a new thread id and emits a `thread/started` notification for it:
```json
{ "method": "thread/fork", "id": 12, "params": { "threadId": "thr_123" } }
{ "id": 12, "result": { "thread": { "id": "thr_456" } } }
{ "method": "thread/started", "params": { "thread": { "id": "thr_456" } } }
```
### Read a stored thread (without resuming)
Use `thread/read` when you want stored thread data but do not want to resume the thread or subscribe to its events.
- `includeTurns` - when `true`, the response includes the thread's turns; when `false` or omitted, you get the thread summary only.
```json
{ "method": "thread/read", "id": 19, "params": { "threadId": "thr_123", "includeTurns": true } }
{ "id": 19, "result": { "thread": { "id": "thr_123", "turns": [] } } }
```
Unlike `thread/resume`, `thread/read` does not load the thread into memory or emit `thread/started`.
### List threads (with pagination & filters)
`thread/list` lets you render a history UI. Results default to newest-first by `createdAt`. Filters apply before pagination. Pass any combination of:
- `cursor` - opaque string from a prior response; omit for the first page.
- `limit` - server defaults to a reasonable page size if unset.
- `sortKey` - `created_at` (default) or `updated_at`.
- `modelProviders` - restrict results to specific providers; unset, null, or an empty array includes all providers.
- `sourceKinds` - restrict results to specific thread sources. When omitted or `[]`, the server defaults to interactive sources only: `cli` and `vscode`.
- `archived` - when `true`, list archived threads only. When `false` or omitted, list non-archived threads (default).
`sourceKinds` accepts the following values:
- `cli`
- `vscode`
- `exec`
- `appServer`
- `subAgent`
- `subAgentReview`
- `subAgentCompact`
- `subAgentThreadSpawn`
- `subAgentOther`
- `unknown`
Example:
```json
{ "method": "thread/list", "id": 20, "params": {
"cursor": null,
"limit": 25,
"sortKey": "created_at"
} }
{ "id": 20, "result": {
"data": [
{ "id": "thr_a", "preview": "Create a TUI", "modelProvider": "openai", "createdAt": 1730831111, "updatedAt": 1730831111 },
{ "id": "thr_b", "preview": "Fix tests", "modelProvider": "openai", "createdAt": 1730750000, "updatedAt": 1730750000 }
],
"nextCursor": "opaque-token-or-null"
} }
```
When `nextCursor` is `null`, you have reached the final page.
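As a sketch, a client can page through `thread/list` until `nextCursor` comes back `null`. The helper below spawns `codex app-server`, correlates responses by `id`, and collects every page. It is illustrative only: error handling is omitted, and the `request`/`notify` helpers are conveniences of this example, not part of the protocol.

```ts
import { spawn } from "node:child_process";
import * as readline from "node:readline";

const proc = spawn("codex", ["app-server"], { stdio: ["pipe", "pipe", "inherit"] });
const rl = readline.createInterface({ input: proc.stdout });

// Correlate JSON-RPC responses with requests by id (illustrative helper).
let nextId = 0;
const pending = new Map<number, (result: any) => void>();
rl.on("line", (line) => {
  const msg = JSON.parse(line);
  if (typeof msg.id === "number" && pending.has(msg.id)) {
    pending.get(msg.id)!(msg.result);
    pending.delete(msg.id);
  }
});

const request = (method: string, params: unknown = {}): Promise<any> => {
  const id = nextId++;
  proc.stdin.write(`${JSON.stringify({ method, id, params })}\n`);
  return new Promise((resolve) => pending.set(id, resolve));
};
const notify = (method: string, params: unknown = {}) => {
  proc.stdin.write(`${JSON.stringify({ method, params })}\n`);
};

async function listAllThreads() {
  await request("initialize", {
    clientInfo: { name: "my_product", title: "My Product", version: "0.1.0" },
  });
  notify("initialized");

  const threads: any[] = [];
  let cursor: string | null = null;
  do {
    const page = await request("thread/list", { cursor, limit: 25, sortKey: "created_at" });
    threads.push(...page.data);
    cursor = page.nextCursor; // null once you reach the final page
  } while (cursor !== null);
  return threads;
}

listAllThreads().then((threads) => console.log(`Loaded ${threads.length} threads`));
```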
### List loaded threads
`thread/loaded/list` returns thread IDs currently loaded in memory.
```json
{ "method": "thread/loaded/list", "id": 21 }
{ "id": 21, "result": { "data": ["thr_123", "thr_456"] } }
```
### Archive a thread
Use `thread/archive` to move the persisted thread log (stored as a JSONL file on disk) into the archived sessions directory.
```json
{ "method": "thread/archive", "id": 22, "params": { "threadId": "thr_b" } }
{ "id": 22, "result": {} }
```
Archived threads won't appear in future calls to `thread/list` unless you pass `archived: true`.
### Unarchive a thread
Use `thread/unarchive` to move an archived thread rollout back into the active sessions directory.
```json
{ "method": "thread/unarchive", "id": 24, "params": { "threadId": "thr_b" } }
{ "id": 24, "result": { "thread": { "id": "thr_b" } } }
```
## Turns
The `input` field accepts a list of items:
- `{ "type": "text", "text": "Explain this diff" }`
- `{ "type": "image", "url": "https://.../design.png" }`
- `{ "type": "localImage", "path": "/tmp/screenshot.png" }`
You can override configuration settings per turn (model, effort, `cwd`, sandbox policy, summary). When specified, these settings become the defaults for later turns on the same thread. `outputSchema` applies only to the current turn. For `sandboxPolicy.type = "externalSandbox"`, set `networkAccess` to `restricted` or `enabled`; otherwise use a boolean.
### Start a turn
```json
{ "method": "turn/start", "id": 30, "params": {
"threadId": "thr_123",
"input": [ { "type": "text", "text": "Run tests" } ],
"cwd": "/Users/me/project",
"approvalPolicy": "unlessTrusted",
"sandboxPolicy": {
"type": "workspaceWrite",
"writableRoots": ["/Users/me/project"],
"networkAccess": true
},
"model": "gpt-5.1-codex",
"effort": "medium",
"summary": "concise",
"outputSchema": {
"type": "object",
"properties": { "answer": { "type": "string" } },
"required": ["answer"],
"additionalProperties": false
}
} }
{ "id": 30, "result": { "turn": { "id": "turn_456", "status": "inProgress", "items": [], "error": null } } }
```
### Start a turn (invoke a skill)
Invoke a skill explicitly by including `$` in the text input and adding a `skill` input item alongside it.
```json
{ "method": "turn/start", "id": 33, "params": {
"threadId": "thr_123",
"input": [
{ "type": "text", "text": "$skill-creator Add a new skill for triaging flaky CI and include step-by-step usage." },
{ "type": "skill", "name": "skill-creator", "path": "/Users/me/.codex/skills/skill-creator/SKILL.md" }
]
} }
{ "id": 33, "result": { "turn": { "id": "turn_457", "status": "inProgress", "items": [], "error": null } } }
```
### Interrupt a turn
```json
{ "method": "turn/interrupt", "id": 31, "params": { "threadId": "thr_123", "turnId": "turn_456" } }
{ "id": 31, "result": {} }
```
On success, the turn finishes with `status: "interrupted"`.
## Review
`review/start` runs the Codex reviewer for a thread and streams review items. Targets include:
- `uncommittedChanges`
- `baseBranch` (diff against a branch)
- `commit` (review a specific commit)
- `custom` (free-form instructions)
Use `delivery: "inline"` (default) to run the review on the existing thread, or `delivery: "detached"` to fork a new review thread.
Example request/response:
```json
{ "method": "review/start", "id": 40, "params": {
"threadId": "thr_123",
"delivery": "inline",
"target": { "type": "commit", "sha": "1234567deadbeef", "title": "Polish tui colors" }
} }
{ "id": 40, "result": {
"turn": {
"id": "turn_900",
"status": "inProgress",
"items": [
{ "type": "userMessage", "id": "turn_900", "content": [ { "type": "text", "text": "Review commit 1234567: Polish tui colors" } ] }
],
"error": null
},
"reviewThreadId": "thr_123"
} }
```
For a detached review, use `"delivery": "detached"`. The response is the same shape, but `reviewThreadId` will be the id of the new review thread (different from the original `threadId`). The server also emits a `thread/started` notification for that new thread before streaming the review turn.
Codex streams the usual `turn/started` notification followed by an `item/started` with an `enteredReviewMode` item:
```json
{
"method": "item/started",
"params": {
"item": {
"type": "enteredReviewMode",
"id": "turn_900",
"review": "current changes"
}
}
}
```
When the reviewer finishes, the server emits `item/started` and `item/completed` containing an `exitedReviewMode` item with the final review text:
```json
{
"method": "item/completed",
"params": {
"item": {
"type": "exitedReviewMode",
"id": "turn_900",
"review": "Looks solid overall..."
}
}
}
```
Use this notification to render the reviewer output in your client.
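For example, a client could watch `item/completed` notifications and pull the final review text out of the `exitedReviewMode` item. This is a minimal sketch that assumes you already parse each stdout line into a notification object:

```ts
// Shape of a parsed app-server notification (simplified for this sketch).
type Notification = {
  method: string;
  params?: { item?: { type: string; id: string; review?: string } };
};

// Returns the reviewer's final text, or null if this notification isn't the end of a review.
function extractReviewText(msg: Notification): string | null {
  if (msg.method !== "item/completed") return null;
  const item = msg.params?.item;
  return item?.type === "exitedReviewMode" ? item.review ?? null : null;
}
```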
## Command execution
`command/exec` runs a single command (`argv` array) under the server sandbox without creating a thread.
```json
{ "method": "command/exec", "id": 50, "params": {
"command": ["ls", "-la"],
"cwd": "/Users/me/project",
"sandboxPolicy": { "type": "workspaceWrite" },
"timeoutMs": 10000
} }
{ "id": 50, "result": { "exitCode": 0, "stdout": "...", "stderr": "" } }
```
Use `sandboxPolicy.type = "externalSandbox"` if you already sandbox the server process and want Codex to skip its own sandbox enforcement. For external sandbox mode, set `networkAccess` to `restricted` (default) or `enabled`. For other sandbox policies, `networkAccess` is a boolean.
Notes:
- The server rejects empty `command` arrays.
- `sandboxPolicy` accepts the same shape used by `turn/start` (for example, `dangerFullAccess`, `readOnly`, `workspaceWrite`, `externalSandbox`).
- When omitted, `timeoutMs` falls back to the server default.
## Events
Event notifications are the server-initiated stream for thread lifecycles, turn lifecycles, and the items within them. After you start or resume a thread, keep reading stdout for `thread/started`, `turn/*`, and `item/*` notifications.
### Turn events
- `turn/started` - `{ turn }` with the turn id, empty `items`, and `status: "inProgress"`.
- `turn/completed` - `{ turn }` where `turn.status` is `completed`, `interrupted`, or `failed`; failures carry `{ error: { message, codexErrorInfo?, additionalDetails? } }`.
- `turn/diff/updated` - `{ threadId, turnId, diff }` with the latest aggregated unified diff across every file change in the turn.
- `turn/plan/updated` - `{ turnId, explanation?, plan }` whenever the agent shares or changes its plan; each `plan` entry is `{ step, status }` with `status` in `pending`, `inProgress`, or `completed`.
- `thread/tokenUsage/updated` - usage updates for the active thread.
`turn/diff/updated` and `turn/plan/updated` currently include empty `items` arrays even when item events stream. Use `item/*` notifications as the source of truth for turn items.
### Items
`ThreadItem` is the tagged union carried in turn responses and `item/*` notifications. Common item types include:
- `userMessage` - `{id, content}` where `content` is a list of user inputs (`text`, `image`, or `localImage`).
- `agentMessage` - `{id, text}` containing the accumulated agent reply.
- `plan` - `{id, text}` containing proposed plan text in plan mode. Treat the final `plan` item from `item/completed` as authoritative.
- `reasoning` - `{id, summary, content}` where `summary` holds streamed reasoning summaries and `content` holds raw reasoning blocks.
- `commandExecution` - `{id, command, cwd, status, commandActions, aggregatedOutput?, exitCode?, durationMs?}`.
- `fileChange` - `{id, changes, status}` describing proposed edits; `changes` list `{path, kind, diff}`.
- `mcpToolCall` - `{id, server, tool, status, arguments, result?, error?}`.
- `collabToolCall` - `{id, tool, status, senderThreadId, receiverThreadId?, newThreadId?, prompt?, agentStatus?}`.
- `webSearch` - `{id, query, action?}` for web search requests issued by the agent.
- `imageView` - `{id, path}` emitted when the agent invokes the image viewer tool.
- `enteredReviewMode` - `{id, review}` sent when the reviewer starts.
- `exitedReviewMode` - `{id, review}` emitted when the reviewer finishes.
- `contextCompaction` - `{id}` emitted when Codex compacts the conversation history.
For `webSearch.action`, the action `type` can be `search` (`query?`, `queries?`), `openPage` (`url?`), or `findInPage` (`url?`, `pattern?`).
The legacy `thread/compacted` notification is deprecated; use the `contextCompaction` item instead.
All items emit two shared lifecycle events:
- `item/started` - emits the full `item` when a new unit of work begins; the `item.id` matches the `itemId` used by deltas.
- `item/completed` - sends the final `item` once work finishes; treat this as the authoritative state.
### Item deltas
- `item/agentMessage/delta` - appends streamed text for the agent message (see the sketch after this list).
- `item/plan/delta` - streams proposed plan text. The final `plan` item may not exactly equal the concatenated deltas.
- `item/reasoning/summaryTextDelta` - streams readable reasoning summaries; `summaryIndex` increments when a new summary section opens.
- `item/reasoning/summaryPartAdded` - marks a boundary between reasoning summary sections.
- `item/reasoning/textDelta` - streams raw reasoning text (when supported by the model).
- `item/commandExecution/outputDelta` - streams stdout/stderr for a command; append deltas in order.
- `item/fileChange/outputDelta` - contains the tool call response of the underlying `apply_patch` tool call.
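To make the streaming behavior concrete, here is a minimal sketch of a client that accumulates `item/agentMessage/delta` notifications per item and then replaces the accumulated text with the final `agentMessage` item from `item/completed`. The exact delta payload field names (`itemId`, `delta`) are assumptions of this sketch; check the generated schema for your Codex version:

```ts
type Notification = { method: string; params?: any };

// itemId -> agent message text seen so far
const agentText = new Map<string, string>();

function handleNotification(msg: Notification): void {
  switch (msg.method) {
    case "item/agentMessage/delta": {
      // Assumed payload shape: { itemId, delta }; verify against the generated schema.
      const { itemId, delta } = msg.params;
      agentText.set(itemId, (agentText.get(itemId) ?? "") + delta);
      break;
    }
    case "item/completed": {
      const item = msg.params.item;
      if (item.type === "agentMessage") {
        // The completed item is authoritative; prefer it over concatenated deltas.
        agentText.set(item.id, item.text);
      }
      break;
    }
    case "turn/completed":
      console.log("Turn finished with status:", msg.params.turn.status);
      break;
  }
}
```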
## Errors
If a turn fails, the server emits an `error` event with `{ error: { message, codexErrorInfo?, additionalDetails? } }` and then finishes the turn with `status: "failed"`. When an upstream HTTP status is available, it appears in `codexErrorInfo.httpStatusCode`.
Common `codexErrorInfo` values include:
- `ContextWindowExceeded`
- `UsageLimitExceeded`
- `HttpConnectionFailed` (4xx/5xx upstream errors)
- `ResponseStreamConnectionFailed`
- `ResponseStreamDisconnected`
- `ResponseTooManyFailedAttempts`
- `BadRequest`, `Unauthorized`, `SandboxError`, `InternalServerError`, `Other`
When an upstream HTTP status is available, the server forwards it in `httpStatusCode` on the relevant `codexErrorInfo` variant.
## Approvals
Depending on a user's Codex settings, command execution and file changes may require approval. The app-server sends a server-initiated JSON-RPC request to the client, and the client responds with `{ "decision": "accept" | "decline" }` (plus optional `acceptSettings` for command approvals).
- Requests include `threadId` and `turnId` - use them to scope UI state to the active conversation.
- The server resumes or declines the work and ends the item with `item/completed`.
### Command execution approvals
Order of messages:
1. `item/started` shows the pending `commandExecution` item with `command`, `cwd`, and other fields.
2. `item/commandExecution/requestApproval` includes `itemId`, `threadId`, `turnId`, optional `reason` or `risk`, plus `parsedCmd` for display.
3. Client response accepts or declines (optionally setting `acceptSettings`).
4. `item/completed` returns the final `commandExecution` item with `status: completed | failed | declined`.
### File change approvals
Order of messages:
1. `item/started` emits a `fileChange` item with proposed `changes` and `status: "inProgress"`.
2. `item/fileChange/requestApproval` includes `itemId`, `threadId`, `turnId`, and an optional `reason`.
3. Client response accepts or declines.
4. `item/completed` returns the final `fileChange` item with `status: completed | failed | declined`.
### MCP tool-call approvals (apps)
App (connector) tool calls can also require approval. When an app tool call has side effects, the server may elicit approval with `tool/requestUserInput` and options such as **Accept**, **Decline**, and **Cancel**. If the user declines or cancels, the related `mcpToolCall` item completes with an error instead of running the tool.
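Putting these flows together, a client might answer approval requests along these lines. The sketch below is illustrative only: it assumes the server-initiated request is answered with a standard JSON-RPC response whose `result` carries the `decision`, and it auto-accepts everything, which a real client should never do without user confirmation.

```ts
import type { Writable } from "node:stream";

type ServerRequest = { id: number; method: string; params?: any };

// Reply to server-initiated approval requests by echoing the request id.
function handleApprovalRequest(msg: ServerRequest, stdin: Writable): void {
  if (
    msg.method === "item/commandExecution/requestApproval" ||
    msg.method === "item/fileChange/requestApproval"
  ) {
    // A real client would show `parsedCmd`, `reason`, or the proposed changes to the user
    // and wait for their decision instead of auto-accepting.
    stdin.write(`${JSON.stringify({ id: msg.id, result: { decision: "accept" } })}\n`);
  }
}
```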
## Skills
Invoke a skill by including `$` in the user text input. Add a `skill` input item (recommended) so the server injects full skill instructions instead of relying on the model to resolve the name.
```json
{
"method": "turn/start",
"id": 101,
"params": {
"threadId": "thread-1",
"input": [
{
"type": "text",
"text": "$skill-creator Add a new skill for triaging flaky CI."
},
{
"type": "skill",
"name": "skill-creator",
"path": "/Users/me/.codex/skills/skill-creator/SKILL.md"
}
]
}
}
```
If you omit the `skill` item, the model will still parse the `$` marker and try to locate the skill, which can add latency.
Example:
```
$skill-creator Add a new skill for triaging flaky CI and include step-by-step usage.
```
Use `skills/list` to fetch the available skills (optionally scoped by `cwds`, with `forceReload`). When present, `interface` and `dependencies` are sourced from `SKILL.json`.
```json
{ "method": "skills/list", "id": 25, "params": {
"cwds": ["/Users/me/project"],
"forceReload": false
} }
{ "id": 25, "result": {
"data": [{
"cwd": "/Users/me/project",
"skills": [
{
"name": "skill-creator",
"description": "Create or update a Codex skill",
"enabled": true,
"interface": {
"displayName": "Skill Creator",
"shortDescription": "Create or update a Codex skill"
},
"dependencies": {
"tools": [
{
"type": "env_var",
"value": "GITHUB_TOKEN",
"description": "GitHub API token"
},
{
"type": "mcp",
"value": "github",
"transport": "streamable_http",
"url": "https://example.com/mcp"
}
]
}
}
],
"errors": []
}]
} }
```
To enable or disable a skill by path:
```json
{
"method": "skills/config/write",
"id": 26,
"params": {
"path": "/Users/me/.codex/skills/skill-creator/SKILL.md",
"enabled": false
}
}
```
## Apps (connectors)
Use `app/list` to fetch available apps. In the CLI/TUI, `/apps` is the user-facing picker; in custom clients, call `app/list` directly.
```json
{ "method": "app/list", "id": 50, "params": {
"cursor": null,
"limit": 50
} }
{ "id": 50, "result": {
"data": [
{
"id": "demo-app",
"name": "Demo App",
"description": "Example connector for documentation.",
"logoUrl": "https://example.com/demo-app.png",
"installUrl": "https://chatgpt.com/apps/demo-app/demo-app",
"isAccessible": true
}
],
"nextCursor": null
} }
```
Invoke an app by inserting `$` in the text input and adding a `mention` input item with the `app://` path (recommended).
```json
{
"method": "turn/start",
"id": 51,
"params": {
"threadId": "thread-1",
"input": [
{
"type": "text",
"text": "$demo-app Pull the latest updates from the team."
},
{
"type": "mention",
"name": "Demo App",
"path": "app://demo-app"
}
]
}
}
```
## Auth endpoints
The JSON-RPC auth/account surface exposes request/response methods plus server-initiated notifications (no `id`). Use these to determine auth state, start or cancel logins, logout, and inspect ChatGPT rate limits.
### Authentication modes
Codex supports multiple authentication modes. The active mode is surfaced in `account/updated.authMode` and can be inferred from `account/read`.
- **API key (`apikey`)** - the caller supplies an OpenAI API key and Codex stores it for API requests.
- **ChatGPT managed (`chatgpt`)** - Codex owns the ChatGPT OAuth flow, persists tokens, and refreshes them automatically.
- **ChatGPT external tokens (`chatgptAuthTokens`)** - a host app supplies `idToken` and `accessToken` directly. Tokens are stored in memory, and the host app must refresh them when asked.
### API overview
- `account/read` - fetch current account info; optionally refresh tokens.
- `account/login/start` - begin login (`apiKey`, `chatgpt`, or `chatgptAuthTokens`).
- `account/login/completed` (notify) - emitted when a login attempt finishes (success or error).
- `account/login/cancel` - cancel a pending ChatGPT login by `loginId`.
- `account/logout` - sign out; triggers `account/updated`.
- `account/updated` (notify) - emitted whenever auth mode changes (`authMode`: `apikey`, `chatgpt`, `chatgptAuthTokens`, or `null`).
- `account/chatgptAuthTokens/refresh` (server request) - request fresh externally managed ChatGPT tokens after an authorization failure.
- `account/rateLimits/read` - fetch ChatGPT rate limits.
- `account/rateLimits/updated` (notify) - emitted whenever a user's ChatGPT rate limits change.
- `mcpServer/oauthLogin/completed` (notify) - emitted after a `mcpServer/oauth/login` flow finishes; payload includes `{ name, success, error? }`.
### 1) Check auth state
Request:
```json
{ "method": "account/read", "id": 1, "params": { "refreshToken": false } }
```
Response examples:
```json
{ "id": 1, "result": { "account": null, "requiresOpenaiAuth": false } }
```
```json
{ "id": 1, "result": { "account": null, "requiresOpenaiAuth": true } }
```
```json
{
"id": 1,
"result": { "account": { "type": "apiKey" }, "requiresOpenaiAuth": true }
}
```
```json
{
"id": 1,
"result": {
"account": {
"type": "chatgpt",
"email": "user@example.com",
"planType": "pro"
},
"requiresOpenaiAuth": true
}
}
```
Field notes:
- `refreshToken` (boolean): set `true` to force a token refresh in managed ChatGPT mode. In external token mode (`chatgptAuthTokens`), this flag is ignored.
- `requiresOpenaiAuth` reflects the active provider; when `false`, Codex can run without OpenAI credentials.
### 2) Log in with an API key
1. Send:
```json
{
"method": "account/login/start",
"id": 2,
"params": { "type": "apiKey", "apiKey": "sk-..." }
}
```
2. Expect:
```json
{ "id": 2, "result": { "type": "apiKey" } }
```
3. Notifications:
```json
{
"method": "account/login/completed",
"params": { "loginId": null, "success": true, "error": null }
}
```
```json
{ "method": "account/updated", "params": { "authMode": "apikey" } }
```
### 3) Log in with ChatGPT (browser flow)
1. Start:
```json
{ "method": "account/login/start", "id": 3, "params": { "type": "chatgpt" } }
```
```json
{
"id": 3,
"result": {
"type": "chatgpt",
"loginId": "",
"authUrl": "https://chatgpt.com/...&redirect_uri=http%3A%2F%2Flocalhost%3A%2Fauth%2Fcallback"
}
}
```
2. Open `authUrl` in a browser; the app-server hosts the local callback.
3. Wait for notifications:
```json
{
"method": "account/login/completed",
"params": { "loginId": "", "success": true, "error": null }
}
```
```json
{ "method": "account/updated", "params": { "authMode": "chatgpt" } }
```
### 3b) Log in with externally managed ChatGPT tokens (`chatgptAuthTokens`)
Use this mode when a host application owns the user's ChatGPT auth lifecycle and supplies tokens directly.
1. Send:
```json
{
"method": "account/login/start",
"id": 7,
"params": {
"type": "chatgptAuthTokens",
"idToken": "",
"accessToken": ""
}
}
```
2. Expect:
```json
{ "id": 7, "result": { "type": "chatgptAuthTokens" } }
```
3. Notifications:
```json
{
"method": "account/login/completed",
"params": { "loginId": null, "success": true, "error": null }
}
```
```json
{
"method": "account/updated",
"params": { "authMode": "chatgptAuthTokens" }
}
```
When the server receives a `401 Unauthorized`, it may request refreshed tokens from the host app:
```json
{
"method": "account/chatgptAuthTokens/refresh",
"id": 8,
"params": { "reason": "unauthorized", "previousAccountId": "org-123" }
}
{ "id": 8, "result": { "idToken": "", "accessToken": "" } }
```
The server retries the original request after a successful refresh response. Respond promptly; requests time out after about 10 seconds.
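As a rough illustration, a host application could answer this server-initiated request with a handler along these lines. This is a hedged, transport-agnostic sketch: `fetch_fresh_tokens` is a hypothetical stand-in for the host's own token store, and the stdio message framing is omitted.
```python
from typing import Any, Optional


def fetch_fresh_tokens(previous_account_id: Optional[str]) -> tuple[str, str]:
    # Hypothetical hook into the host app's own auth layer; return (idToken, accessToken).
    return "<idToken>", "<accessToken>"


def handle_server_request(message: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Answer an incoming `account/chatgptAuthTokens/refresh` request; ignore other methods."""
    if message.get("method") != "account/chatgptAuthTokens/refresh":
        return None
    params = message.get("params", {})
    id_token, access_token = fetch_fresh_tokens(params.get("previousAccountId"))
    # Respond quickly: per the note above, the server times out after roughly 10 seconds.
    return {"id": message["id"], "result": {"idToken": id_token, "accessToken": access_token}}
```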
### 4) Cancel a ChatGPT login
```json
{ "method": "account/login/cancel", "id": 4, "params": { "loginId": "" } }
{ "method": "account/login/completed", "params": { "loginId": "", "success": false, "error": "..." } }
```
### 5) Logout
```json
{ "method": "account/logout", "id": 5 }
{ "id": 5, "result": {} }
{ "method": "account/updated", "params": { "authMode": null } }
```
### 6) Rate limits (ChatGPT)
```json
{ "method": "account/rateLimits/read", "id": 6 }
{ "id": 6, "result": { "rateLimits": { "primary": { "usedPercent": 25, "windowDurationMins": 15, "resetsAt": 1730947200 }, "secondary": null } } }
{ "method": "account/rateLimits/updated", "params": { "rateLimits": { } } }
```
Field notes:
- `usedPercent` is current usage within the OpenAI quota window.
- `windowDurationMins` is the quota window length.
- `resetsAt` is a Unix timestamp (seconds) for the next reset; see the formatting sketch after these notes.
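If you want to display these values, here is a small formatting sketch that assumes only the field names documented above (window objects such as `secondary` may be `null`):
```python
from datetime import datetime, timezone
from typing import Any, Optional


def summarize_rate_limit(window: Optional[dict[str, Any]]) -> str:
    """Render one rate-limit window (e.g. `primary` or `secondary`) for display."""
    if not window:
        return "no data"
    resets_at = datetime.fromtimestamp(window["resetsAt"], tz=timezone.utc)
    return (
        f"{window['usedPercent']}% of the {window['windowDurationMins']}-minute "
        f"window used; resets at {resets_at.isoformat()}"
    )


# With the example payload above:
# summarize_rate_limit({"usedPercent": 25, "windowDurationMins": 15, "resetsAt": 1730947200})
# -> '25% of the 15-minute window used; resets at 2024-11-07T02:40:00+00:00'
```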
---
# Source: https://developers.openai.com/apps-sdk/app-submission-guidelines.md
# App submission guidelines
## Overview
The ChatGPT app ecosystem is built on trust. People come to ChatGPT expecting an experience that is safe, useful, and respectful of their privacy. Developers come to ChatGPT expecting a fair and transparent process. These developer guidelines set the policies every builder is expected to review and follow.
Before getting into specifics, we recommend first familiarizing yourself with two foundational resources:
- [**UX principles for ChatGPT apps**](https://developers.openai.com/apps-sdk/concepts/ux-principles) - this guide outlines principles and best practices for building ChatGPT apps, as well as a checklist to help you ensure your app is a great fit for ChatGPT.
- [**UI guidelines for ChatGPT apps**](https://developers.openai.com/apps-sdk/concepts/ui-guidelines) - this guide describes the interaction, layout, and design patterns that help apps feel intuitive, trustworthy, and consistent within ChatGPT.
You should also read our blog post on [what makes a great ChatGPT app](https://developers.openai.com/blog/what-makes-a-great-chatgpt-app/) to get a sense of the overall approach to building with the Apps SDK.
The guidelines below outline the minimum standard developers must meet for their app to be considered for publication in ChatGPT, and for their app to remain published and available to ChatGPT users. Apps that demonstrate strong real-world utility and high user satisfaction may be eligible for enhanced distribution opportunities—such as directory placement or proactive suggestions.
## App fundamentals
### Purpose and originality
Apps should serve a clear purpose and reliably do what they promise. In particular, they should provide functionality or workflows that are not natively supported by ChatGPT’s core conversational capabilities, and that meaningfully help satisfy common user intents expressed in conversation.
Only use intellectual property that you own or have permission to use. Do not engage in misleading or copycat designs, impersonation, spam, or static frames with no meaningful interaction. Apps should not imply that they are made or endorsed by OpenAI.
### Quality and reliability
Apps must behave predictably and reliably. Results should be accurate and relevant to user input. Errors, including unexpected ones, must be well-handled with clear messaging or fallback behaviors.
Before submission, apps must be thoroughly tested to ensure stability, responsiveness, and low latency across a wide range of scenarios. Apps should not crash, hang, or show inconsistent behavior. Apps must also be complete; submissions that are trials or demos will not be accepted.
### App name, description, and screenshots
App names and descriptions should be clear, accurate, and easy to understand. Screenshots must accurately represent app functionality and conform to the required dimensions.
### Tools
MCP tools act as the manual for ChatGPT to use your app. Clear, accurate tool definitions make your app safer, easier for the model to understand, and easier for users to trust.
#### Clear and accurate tool names
Tool names should be human-readable, specific, and descriptive of what the tool actually does.
- Tool names must be unique within your app.
- Use plain language that directly reflects the action, ideally as a verb (e.g., `get_order_status`).
- Avoid misleading, overly promotional, or comparative language (e.g., `pick_me`, `best`, `official`).
#### Descriptions that match behavior
Each tool must include a description that explains its purpose clearly and accurately.
- The description should describe what the tool does.
- Descriptions must not favor or disparage other apps or services or attempt to influence the model to select it over another app’s tools.
- Descriptions must not recommend overly-broad triggering beyond the explicit user intent and purpose the app fulfills.
- If a tool’s behavior is unclear or incomplete from its description, your app may be rejected.
#### Correct annotation
[Tool annotations](https://developers.openai.com/apps-sdk/reference#annotations) must be correctly set so that ChatGPT and users understand whether an action is safe or requires extra caution.
- You should label a tool with the `readOnlyHint` annotation if it only retrieves or lists data, but does not change anything outside of ChatGPT.
- Write or destructive tools (e.g., creating, updating, deleting, posting, sending) must be clearly marked using the `readOnlyHint` and `destructiveHint` annotations.
- Tools that interact with external systems, accounts, public platforms, or create publicly-visible content must be explicitly labeled using the `openWorldHint` annotation.
- Incorrect or missing action labels are a common cause of rejection. Double-check that the `readOnlyHint`, `openWorldHint`, and `destructiveHint` annotations are correctly set, and provide a detailed justification for each at submission time; a minimal annotation sketch follows this list.
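For illustration only, here is a hedged sketch of correctly annotated tools. It assumes a recent MCP Python SDK in which `FastMCP.tool()` accepts a `ToolAnnotations` value; if your server framework differs, set the equivalent annotation fields however it exposes them. The tool names and behavior are hypothetical.
```python
from mcp.server.fastmcp import FastMCP
from mcp.types import ToolAnnotations

mcp = FastMCP("acme-orders")  # hypothetical example server


# Read-only lookup: retrieves data and changes nothing outside ChatGPT.
@mcp.tool(annotations=ToolAnnotations(readOnlyHint=True, openWorldHint=False))
def get_order_status(order_id: str) -> str:
    """Return the current status of an order."""
    return f"Order {order_id}: shipped"


# Write + destructive + open-world: changes state in an external system.
@mcp.tool(annotations=ToolAnnotations(readOnlyHint=False, destructiveHint=True, openWorldHint=True))
def cancel_order(order_id: str) -> str:
    """Cancel an order in the external order system."""
    return f"Order {order_id}: cancelled"
```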
#### Minimal and purpose-driven inputs
Tools should request the minimum information necessary to complete their task.
- Input fields must be directly related to the tool’s stated purpose.
- Do not request the full conversation history, raw chat transcripts, or broad contextual fields “just in case.” A tool may request a _brief, task-specific_ user intent field only when it meaningfully improves execution and does not expand data collection beyond what is reasonably necessary to respond to the user’s request and for the purposes described in your privacy policy.
- If needed, rely on the coarse geolocation shared by the system. Do not request precise user location data (e.g., GPS coordinates or addresses).
#### Predictable, auditable behavior
Tools should behave exactly as their names, descriptions, and inputs indicate.
- Side effects should never be hidden or implicit.
- If a tool sends data outside the current environment (e.g., posting content, sending messages), this must be clear from the tool definition.
- Tools should be safe to retry where possible, or clearly indicate when retries may cause repeated effects.
Carefully designed tools help reduce surprises, protect users, and speed up the review process.
### Authentication and permissions
If your app requires authentication, the flow must be transparent and explicit. Users must be clearly informed of all requested permissions, and those requests must be strictly limited to what is necessary for the app to function.
#### Test credentials
When submitting an authenticated app for review, you must provide a login and password for a fully-featured demo account that includes sample data. Apps requiring any additional steps for login—such as requiring new account sign-up or 2FA through an inaccessible account—will be rejected.
## Commerce and monetization
Currently, apps may conduct commerce **only for physical goods**. Selling digital products or services—including subscriptions, digital content, tokens, or credits—is not allowed, whether offered directly or indirectly (for example, through freemium upsells).
In addition, apps may not be used to sell, promote, facilitate, or meaningfully enable the following goods or services:
#### **Prohibited goods**
- **Adult content & sexual services**
- Pornography, explicit sexual media, live-cam services, adult subscriptions
- Sex toys, sex dolls, BDSM gear, fetish products
- **Gambling**
- Real-money gambling services, casino credits, sportsbook wagers, crypto-casino tokens
- **Illegal or regulated drugs**
- Marijuana/THC products, psilocybin, illegal substances
- CBD products exceeding legal THC limits
- **Drug paraphernalia**
- Bongs, dab rigs, drug-use scales, cannabis grow equipment marketed for drugs
- **Prescription & age-restricted medications**
- Prescription-only drugs (e.g., insulin, antibiotics, Ozempic, opioids)
- Age-restricted Rx products (e.g., testosterone, HGH, fertility hormones)
- **Illicit goods**
- Counterfeit or replica products
- Stolen goods or items without clear provenance
- Financial-fraud tools (skimmers, fake POS devices)
- Piracy tools or cracked software
- Wildlife or environmental contraband (ivory, endangered species products)
- **Malware, spyware & surveillance**
- Malware, ransomware, keyloggers, stalkerware
- Covert surveillance devices (spy cameras, IMSI catchers, hidden trackers)
- **Tobacco & nicotine**
- Tobacco products
- Nicotine products (vapes, e-liquids, nicotine pouches)
- **Weapons & harmful materials**
- Firearms, ammunition, firearm parts
- Explosives, fireworks, bomb-making materials
- Illegal or age-restricted weapons (switchblades, brass knuckles, crossbows where banned)
- Self-defense weapons (pepper spray, stun guns, tasers)
- Extremist merchandise or propaganda
#### **Prohibited fraudulent, deceptive, or high-risk services**
- Fake IDs, forged documents, or document falsification services
- Debt relief, credit repair, or credit-score manipulation schemes
- Unregulated, deceptive, or abusive financial services
- Lending, advance-fee, or credit-building schemes designed to exploit users
- Crypto or NFT offerings involving speculation, consumer deception, or financial abuse
- Execution of money transfers, crypto transfers, or investment trades
- Government-service abuse, impersonation, or benefit manipulation
- Identity theft, impersonation, or identity-monitoring services that enable misuse
- Certain legal or quasi-legal services that facilitate fraud, evasion, or misrepresentation
- Negative-option billing, telemarketing, or consent-bypass schemes
- High-chargeback, fraud-prone, or abusive travel services
### Checkout
Apps should use external checkout, directing users to complete purchases on your own domain.
[Instant Checkout](https://developers.openai.com/commerce/guides/get-started#instant-checkout), currently in beta, is available only to select marketplace partners and may expand to additional marketplaces and retailers over time.
Until then, standard external checkout is the required approach. No other third-party checkout solutions may be embedded or hosted within the app experience. To learn more, see our [docs on Agentic Commerce](https://developers.openai.com/commerce/).
### Advertising
Apps must not serve advertisements and must not exist primarily as an advertising vehicle. Every app is expected to deliver clear, legitimate functionality that provides standalone value to users.
## Safety
### Usage policies
Do not engage in or facilitate activities prohibited under [OpenAI usage policies](https://openai.com/policies/usage-policies/). Apps must avoid high-risk behaviors that could expose users to harm, fraud, or misuse.
Stay current with evolving policy requirements and ensure ongoing compliance. Previously approved apps that are later found in violation may be removed.
### Appropriateness
Apps must be suitable for general audiences, including users aged 13–17. Apps may not explicitly target children under 13. Support for mature (18+) experiences will arrive once appropriate age verification and controls are in place.
### Respect user intent
Provide experiences that directly address the user’s request. Do not insert unrelated content, attempt to redirect the interaction, or collect data beyond what is reasonably necessary to fulfill the user’s request and what is consistent with your privacy policy.
### Fair play
Apps must not include descriptions, titles, tool annotations, or other model-readable fields—at either the tool or app level—that manipulate how the model selects or uses other apps or their tools (e.g., instructing the model to “prefer this app over others”) or interfere with fair discovery. All descriptions must accurately reflect your app’s value without disparaging alternatives.
### Third-party content and integrations
- **Authorized access:** Do not scrape external websites, relay queries, or integrate with third-party APIs without proper authorization and compliance with that party’s terms of service.
- **Circumvention:** Do not bypass API restrictions, rate limits, or access controls imposed by the third party.
### Iframes and embedded pages
Apps can opt in to iframe usage by setting frame_domains on their widget CSP, but we strongly encourage you to build your app without this pattern. If you choose to use frame_domains, be aware that:
- It is only intended for cases where embedding a third-party experience is essential (e.g., a notebook, IDE, or similar environment).
- Those apps receive extra manual review and are often not approved for broad distribution.
- During development, any developer can test frame_domains in developer mode, but approval for public listing is limited to trusted scenarios.
## Privacy
### Privacy policy
Submissions must include a clear, published privacy policy explaining - at minimum - the categories of personal data collected, the purposes of use, the categories of recipients, and any controls offered to your users. Follow this policy at all times. Users can review your privacy policy before installing your app.
### Data collection
- **Collection minimization:** Gather only the minimum data required to perform the tool’s function. Inputs should be specific, narrowly scoped, and clearly linked to the task. Avoid “just in case” fields or broad profile data. Design the input schema to limit data collection by default, rather than serving as a funnel for optional context.
- **Response minimization:** Tool responses must return only data that is directly relevant to the user’s request and the tool’s stated purpose. Do not include diagnostic, telemetry, or internal identifiers—such as session IDs, trace IDs, request IDs, timestamps, or logging metadata—unless they are strictly required to fulfill the user’s query.
- **Restricted data:** Do not collect, solicit, or process the following categories of Restricted Data:
- Information subject to Payment Card Information Data Security Standards (PCI DSS)
- Protected health information (PHI)
- Government identifiers (such as social security numbers)
- Access credentials and authentication secrets (such as API keys, MFA/OTP codes, or passwords).
- **Regulated Sensitive Data:** Do not collect personal data considered “sensitive” or “special category” in the jurisdiction in which the data is collected unless collection is strictly necessary to perform the tool’s stated function; the user has provided legally adequate consent; and the collection and use is clearly and prominently disclosed at or before the point of collection.
- **Data boundaries:**
- Avoid requesting raw location fields (e.g., city or coordinates) in your input schema. When location is needed, obtain it through the client’s controlled side channel (such as environment metadata or a referenced resource) so appropriate policy and consent controls can be applied. This reduces accidental PII capture, enforces least-privilege access, and keeps location handling auditable and revocable.
- Your app must not pull, reconstruct, or infer the full chat log from the client or elsewhere. Operate only on the explicit snippets and resources the client or model chooses to send. This separation can help prevent covert data expansion and keep analysis limited to intentionally shared content.
### Transparency and user control
- **Data practices:** Do not engage in surveillance, tracking, or behavioral profiling—including metadata collection such as timestamps, IPs, or query patterns—unless explicitly disclosed, narrowly scoped, subject to meaningful user control, and aligned with [OpenAI’s usage policies](https://openai.com/policies/usage-policies/).
- **Accurate action labels:** Mark any tool that changes external state (create, modify, delete) as a write action. You should only mark a tool as a read-only action if it is side-effect-free and safe to retry. Destructive actions require clear labels and friction (e.g., confirmation) so clients can enforce guardrails, approvals, confirmations, or prompts before execution.
- **Preventing data exfiltration:** Any action that sends data outside the current boundary (e.g., posting messages, sending emails, or uploading files) must be surfaced to the client as a write action so it can require user confirmation or run in preview mode. This reduces unintentional data leakage and aligns server behavior with client-side security expectations.
## Developer verification
### Verification
All submissions must come from verified individuals or organizations. Inside the [OpenAI Platform Dashboard general settings](https://platform.openai.com/settings/organization/general), we provide a way to confirm your identity and affiliation with any business you wish to publish on behalf of. Misrepresentation, hidden behavior, or attempts to game the system may result in removal from the program.
### Support contact details
You must provide customer support contact details where end users can reach you for help. Keep this information accurate and up to date.
## Submitting your app
Users with the Owner role may submit an app for review from the [OpenAI Platform Dashboard](http://platform.openai.com/apps-manage).
While you can publish multiple unique apps within a single Platform organization, each may only have one version in review at a time. You can check the review status in the Dashboard and will receive an email notification of any status changes.
To learn more about the app submission process, refer to our [dedicated guide](https://developers.openai.com/apps-sdk/deploy/submission).
---
# Source: https://developers.openai.com/codex/app.md
# Codex app
The Codex app is a focused desktop experience for working on Codex threads in parallel, with built-in worktree support, automations, and Git functionality.
ChatGPT Plus, Pro, Business, Edu, and Enterprise plans include Codex. Learn more about [what's included](https://developers.openai.com/codex/pricing).
## Getting started
The Codex app is available on macOS (Apple Silicon).
1. Download and install the Codex app
The Codex app is currently only available for macOS.
[Get notified for Windows and Linux](https://openai.com/form/codex-app/)
2. Open Codex and sign in
Once you've downloaded and installed the Codex app, open it and sign in with your ChatGPT account or an OpenAI API key.
If you sign in with an OpenAI API key, some functionality such as [cloud threads](https://developers.openai.com/codex/prompting#threads) might not be available.
3. Select a project
Choose a project folder that you want Codex to work in.
If you've used the Codex app, CLI, or IDE Extension before, you'll see past projects you've worked on.
4. Send your first message
After choosing the project, make sure **Local** is selected so Codex works on your machine, then send your first message to Codex.
You can ask Codex anything about the project or your computer in general.
If you need more inspiration, check out the [explore section](https://developers.openai.com/codex/explore).
---
## Work with the Codex app
### Multitask across projects
Run multiple tasks in parallel and switch quickly between them.
### Built-in Git tools
Review diffs, comment inline, stage or revert chunks, and commit without leaving the app.
### Worktrees for parallel tasks
Isolate changes of multiple Codex threads using built-in Git worktree support.
### Skills support
Give your Codex agent additional capabilities and reuse skills across App, CLI, and IDE Extension.
### Automations
Pair skills with automations to automate recurring tasks in the background. Codex adds findings to the inbox, or automatically archives runs if there's nothing to report.
### Built-in terminal
Open a terminal per thread to test your changes, run dev servers, scripts, and custom commands.
### Local environments
Define worktree setup scripts and common project actions for easy access.
### Sync with the IDE extension
Share Auto Context and active threads across app and IDE sessions.
### MCP support
Connect your Codex agent to additional services using MCP.
---
Need help? Visit the [troubleshooting guide](https://developers.openai.com/codex/app/troubleshooting).
---
# Source: https://developers.openai.com/cookbook/examples/agents_sdk/app_assistant_voice_agents.md
# Introduction
Let's say you're an AI lead at a consumer tech company. You have the vision of deploying a single entry point digital voice assistant with the ability to help users with any query, regardless of whether they want to take action on their account, find product information, or receive real-time guidance.
However, turning this vision into reality can be extremely difficult - it requires building and testing the capability to handle each individual use case through text first, integrating access to the wide range of tools and systems they require, and somehow orchestrating them into a coherent experience. Then, once you’ve achieved a satisfactory level of quality (and even evaluating this can be a struggle), you face the daunting task of refactoring the entire workflow for voice interaction.
Fortunately for you, three recent releases from OpenAI have made implementing this vision simpler than ever by providing the tools to build and orchestrate modular agentic workflows through voice with minimal configuration:
- [**Responses API**](https://platform.openai.com/docs/api-reference/responses) - an agentic API for easy engagement with our frontier models through managed stateful conversations, tracing of responses to enable evaluation, and built-in tools for file search, web search, computer use, and more
- [**Agents SDK**](https://openai.github.io/openai-agents-python/quickstart/) - a lightweight, customizable open source framework for building and orchestrating workflows across many different agents, enabling your assistant to route inputs to the appropriate agent and to scale to support many use cases
- [**Voice agents**](https://openai.github.io/openai-agents-python/voice/quickstart/) - an extension of the Agents SDK to support the use of voice pipelines, enabling your agents to go from being text-based to being able to interpret and produce audio in just a few lines of code
This cookbook demonstrates how to build a simple in-app voice assistant for a fictitious consumer application using the tools above. We'll create a **Triage Agent** that greets the user, determines their intent, and routes requests to one of three specialised agents:
- **Search Agent** - performs a web search via the built-in tooling of the Responses API to provide real-time information on the user's query
- **Knowledge Agent** - utilises the file search tooling of the Responses API to retrieve information from an OpenAI managed vector database
- **Account Agent** - uses function calling to provide the ability to trigger custom actions via API
Finally, we'll convert this workflow into a live voice assistant using the Agents SDK's voice functionality, capturing microphone input, performing speech‑to‑text, routing through our agents, and responding with text‑to‑speech.
# Setup
To execute this cookbook, you'll need to install the following packages providing access to OpenAI's API, the Agents SDK, and libraries for audio processing. Additionally, you can set your OpenAI API key for use by the agents via the `set_default_openai_key` function.
```python
%pip install openai
%pip install openai-agents 'openai-agents[voice]'
%pip install numpy
%pip install sounddevice
```
```python
from agents import Agent, function_tool, WebSearchTool, FileSearchTool, set_default_openai_key
from agents.extensions.handoff_prompt import prompt_with_handoff_instructions
set_default_openai_key("YOUR_API_KEY")
```
# Defining Agents & Tools
Today we're going to be building an assistant for our fictitious consumer application, ACME shop, focussed initially on three key use cases:
- Answering real-time questions to inform purchasing decisions using web search
- Providing information on the available options in our product portfolio
- Providing account information to enable the user to understand their budget and spending
To achieve this we'll be using an agentic architecture. This allows us to split the functionality for each use case into a separate agent, in turn reducing the complexity and range of tasks a single agent is asked to complete and increasing accuracy. Our agent architecture is relatively simple, focussing on the three use cases above, but the beauty of the Agents SDK is that it is incredibly easy to extend the workflow with additional agents when you want to add new functionality:

## Search Agent
Our first agent is a simple web search agent that uses the `WebSearchTool` provided by the Responses API to find real-time information on the user's query. We'll be keeping the instruction prompts simple for each of these examples, but we'll iterate later to show how to optimise the response format for your use case.
```python
# --- Agent: Search Agent ---
search_agent = Agent(
name="SearchAgent",
instructions=(
"You immediately provide an input to the WebSearchTool to find up-to-date information on the user's query."
),
tools=[WebSearchTool()],
)
```
*For more information on web search and the Responses API, be sure to check out the [Web Search and States with Responses API](https://cookbook.openai.com/examples/responses_api/responses_example) cookbook*
## Knowledge Agent
Our second agent needs to be able to answer questions on our product portfolio. To do this, we'll use the `FileSearchTool` to retrieve information from a vector store managed by OpenAI containing our company specific product information. For this, we have two options:
1. Use the OpenAI Platform Website - go to [platform.openai.com/storage](https://platform.openai.com/storage) and create a vector store, uploading your documents of choice. Then, take the vector store ID and substitute it into the `FileSearchTool` initialisation below.
2. Use the OpenAI API - use the `vector_stores.create` function from the OpenAI Python client to create a vector store and then the `vector_stores.files.create` function to add files to it. Once this is complete, you can again use the `FileSearchTool` to search the vector store. Please see the code below for an example of how to do this, either using the example file provided or substituting your own local file path:
```python
from openai import OpenAI
import os
client = OpenAI(api_key='YOUR_API_KEY')
def upload_file(file_path: str, vector_store_id: str):
file_name = os.path.basename(file_path)
try:
file_response = client.files.create(file=open(file_path, 'rb'), purpose="assistants")
attach_response = client.vector_stores.files.create(
vector_store_id=vector_store_id,
file_id=file_response.id
)
return {"file": file_name, "status": "success"}
except Exception as e:
print(f"Error with {file_name}: {str(e)}")
return {"file": file_name, "status": "failed", "error": str(e)}
def create_vector_store(store_name: str) -> dict:
try:
vector_store = client.vector_stores.create(name=store_name)
details = {
"id": vector_store.id,
"name": vector_store.name,
"created_at": vector_store.created_at,
"file_count": vector_store.file_counts.completed
}
print("Vector store created:", details)
return details
except Exception as e:
print(f"Error creating vector store: {e}")
return {}
vector_store_id = create_vector_store("ACME Shop Product Knowledge Base")
upload_file("voice_agents_knowledge/acme_product_catalogue.pdf", vector_store_id["id"])
```
With the vector store in place, we can now enable the knowledge agent to use the `FileSearchTool` to search the given store ID.
```python
# --- Agent: Knowledge Agent ---
knowledge_agent = Agent(
name="KnowledgeAgent",
instructions=(
"You answer user questions on our product portfolio with concise, helpful responses using the FileSearchTool."
),
tools=[FileSearchTool(
max_num_results=3,
vector_store_ids=["VECTOR_STORE_ID"],
),],
)
```
*For more information on the power of file search and the Responses API, be sure to check out the excellent cookbook on the subject where the example code above was taken from: [Doing RAG on PDFs using File Search in the Responses API](https://cookbook.openai.com/examples/file_search_responses)*
## Account Agent
Whilst so far we've been using the built-in tools provided by the Agents SDK, you can define your own tools with the `function_tool` decorator to integrate the agents with your own systems. Here, we'll define a simple dummy function to return account information for a given user ID for our account agent.
```python
# --- Tool 1: Fetch account information (dummy) ---
@function_tool
def get_account_info(user_id: str) -> dict:
"""Return dummy account info for a given user."""
return {
"user_id": user_id,
"name": "Bugs Bunny",
"account_balance": "£72.50",
"membership_status": "Gold Executive"
}
# --- Agent: Account Agent ---
account_agent = Agent(
name="AccountAgent",
instructions=(
"You provide account information based on a user ID using the get_account_info tool."
),
tools=[get_account_info],
)
```
*For more information on function calling with the Agents SDK, see the [Agents SDK Documentation](https://openai.github.io/openai-agents-python/tools/#function-tools)*
Finally, we'll define the triage agent that will route the user's query to the appropriate agent based on their intent. Here we're using the `prompt_with_handoff_instructions` function, which provides additional guidance on how to handle handoffs and is recommended for any agent that has a defined set of handoffs.
```python
# --- Agent: Triage Agent ---
triage_agent = Agent(
name="Assistant",
instructions=prompt_with_handoff_instructions("""
You are the virtual assistant for Acme Shop. Welcome the user and ask how you can help.
Based on the user's intent, route to:
- AccountAgent for account-related queries
- KnowledgeAgent for product FAQs
- SearchAgent for anything requiring real-time web search
"""),
handoffs=[account_agent, knowledge_agent, search_agent],
)
```
# Run the workflow
Now that we've defined our agents, we can run the workflow on a few example queries to see how it performs.
```python
# %%
from agents import Runner, trace
async def test_queries():
examples = [
"What's my ACME account balance doc? My user ID is 1234567890", # Account Agent test
"Ooh i've got money to spend! How big is the input and how fast is the output of the dynamite dispenser?", # Knowledge Agent test
"Hmmm, what about duck hunting gear - what's trending right now?", # Search Agent test
]
with trace("ACME App Assistant"):
for query in examples:
result = await Runner.run(triage_agent, query)
print(f"User: {query}")
print(result.final_output)
print("---")
# Run the tests
await test_queries()
```
```text
User: What's my ACME account balance doc? My user ID is 1234567890
Your ACME account balance is £72.50. You have a Gold Executive membership.
---
User: Ooh i've got money to spend! How big is the input and how fast is the output of the dynamite dispenser?
The Automated Dynamite Dispenser can hold up to 10 sticks of dynamite and dispenses them at a speed of 1 stick every 2 seconds.
---
User: Hmmm, what about duck hunting gear - what's trending right now?
Staying updated with the latest trends in duck hunting gear can significantly enhance your hunting experience. Here are some of the top trending items for the 2025 season:
**Banded Aspire Catalyst Waders**
These all-season waders feature waterproof-breathable technology, ensuring comfort in various conditions. They boast a minimal-stitch design for enhanced mobility and include PrimaLoft Aerogel insulation for thermal protection. Additional features like an over-the-boot protective pant and an integrated LED light in the chest pocket make them a standout choice. ([blog.gritroutdoors.com](https://blog.gritroutdoors.com/must-have-duck-hunting-gear-for-a-winning-season/?utm_source=openai))
**Sitka Delta Zip Waders**
Known for their durability, these waders have reinforced shins and knees with rugged foam pads, ideal for challenging terrains. Made with GORE-TEX material, they ensure dryness throughout the season. ([blog.gritroutdoors.com](https://blog.gritroutdoors.com/must-have-duck-hunting-gear-for-a-winning-season/?utm_source=openai))
**MOmarsh InvisiMan Blind**
This one-person, low-profile blind is praised for its sturdiness and ease of setup. Hunters have reported that even late-season, cautious ducks approach without hesitation, making it a valuable addition to your gear. ([bornhunting.com](https://bornhunting.com/top-duck-hunting-gear/?utm_source=openai))
**Slayer Calls Ranger Duck Call**
This double reed call produces crisp and loud sounds, effectively attracting distant ducks in harsh weather conditions. Its performance has been noted for turning the heads of ducks even at extreme distances. ([bornhunting.com](https://bornhunting.com/top-duck-hunting-gear/?utm_source=openai))
**Sitka Full Choke Pack**
A favorite among hunters, this backpack-style blind bag offers comfort and efficiency. It has proven to keep gear dry during heavy downpours and is durable enough to withstand over 60 hunts in a season. ([bornhunting.com](https://bornhunting.com/top-duck-hunting-gear/?utm_source=openai))
Incorporating these trending items into your gear can enhance your comfort, efficiency, and success during the hunting season.
---
```
# Tracing
Above we can see the outputs appear to be in line with our expectations, but one key benefit of the Agents SDK is that it includes built-in tracing which enables tracking of the flow of events during an agent run across the LLM calls, handoffs, and tools.
Using the [Traces dashboard](https://platform.openai.com/traces), we can debug, visualize, and monitor our workflows during development and in production. As we can see below, each test query was correctly routed to the appropriate agent.

# Enabling Voice
Having designed our workflow, in reality we would now spend time evaluating the traces and iterating on the workflow to ensure it is as effective as possible. But let's assume we're happy with it, so we can start thinking about how to convert our in-app assistant from text-based to voice-based interactions.
To do this, we can simply leverage the classes provided by the [Agents SDK](https://openai.github.io/openai-agents-python/voice/quickstart/) to convert our text-based workflow into a voice-based one. The `VoicePipeline` class provides an interface for transcribing audio input, executing a given agent workflow, and generating a text-to-speech response for playback to the user, whilst the `SingleAgentVoiceWorkflow` class lets us reuse the same agent workflow we built earlier for text. To provide and receive audio, we'll use the `sounddevice` library.
End to end, the new workflow looks like this:

And the code to enable this is as follows:
```python
# %%
import numpy as np
import sounddevice as sd
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline
async def voice_assistant():
samplerate = sd.query_devices(kind='input')['default_samplerate']
while True:
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(triage_agent))
# Check for input to either provide voice or exit
cmd = input("Press Enter to speak your query (or type 'esc' to exit): ")
if cmd.lower() == "esc":
print("Exiting...")
break
print("Listening...")
recorded_chunks = []
# Start streaming from microphone until Enter is pressed
with sd.InputStream(samplerate=samplerate, channels=1, dtype='int16', callback=lambda indata, frames, time, status: recorded_chunks.append(indata.copy())):
input()
# Concatenate chunks into single buffer
recording = np.concatenate(recorded_chunks, axis=0)
# Input the buffer and await the result
audio_input = AudioInput(buffer=recording)
with trace("ACME App Voice Assistant"):
result = await pipeline.run(audio_input)
# Transfer the streamed result into chunks of audio
response_chunks = []
async for event in result.stream():
if event.type == "voice_stream_event_audio":
response_chunks.append(event.data)
response_audio = np.concatenate(response_chunks, axis=0)
# Play response
print("Assistant is responding...")
sd.play(response_audio, samplerate=samplerate)
sd.wait()
print("---")
# Run the voice assistant
await voice_assistant()
```
```text
Listening...
Assistant is responding...
---
Exiting...
```
Executing the above code gives us the following responses, which correctly provide the same functionality as the text-based workflow.
```python
from IPython.display import display, Audio
display(Audio("voice_agents_audio/account_balance_response_base.mp3"))
display(Audio("voice_agents_audio/product_info_response_base.mp3"))
display(Audio("voice_agents_audio/trending_items_response_base.mp3"))
```
_Embedded media omitted from the markdown export._
_Embedded media omitted from the markdown export._
_Embedded media omitted from the markdown export._
*Tip: when using tracing with voice agents, you can playback audio in the traces dashboard*

# Optimizing Voice
This is a great start, but we can do better. As we've simply converted our text-based agents into voice-based ones, the responses are not optimised in their output for either tone or format, meaning they feel robotic and unnatural.
To address this, we'll need to make a few changes to our prompts.
Firstly, we can adapt our existing agents to include a common system prompt, providing instructions on how to optimise their text responses for later conversion to voice.
```python
# Common system prompt for voice output best practices:
voice_system_prompt = """
[Output Structure]
Your output will be delivered in an audio voice response, please ensure that every response meets these guidelines:
1. Use a friendly, human tone that will sound natural when spoken aloud.
2. Keep responses short and segmented—ideally one to two concise sentences per step.
3. Avoid technical jargon; use plain language so that instructions are easy to understand.
4. Provide only essential details so as not to overwhelm the listener.
"""
# --- Agent: Search Agent ---
search_voice_agent = Agent(
name="SearchVoiceAgent",
instructions=voice_system_prompt + (
"You immediately provide an input to the WebSearchTool to find up-to-date information on the user's query."
),
tools=[WebSearchTool()],
)
# --- Agent: Knowledge Agent ---
knowledge_voice_agent = Agent(
name="KnowledgeVoiceAgent",
instructions=voice_system_prompt + (
"You answer user questions on our product portfolio with concise, helpful responses using the FileSearchTool."
),
tools=[FileSearchTool(
max_num_results=3,
vector_store_ids=["VECTOR_STORE_ID"],
),],
)
# --- Agent: Account Agent ---
account_voice_agent = Agent(
name="AccountVoiceAgent",
instructions=voice_system_prompt + (
"You provide account information based on a user ID using the get_account_info tool."
),
tools=[get_account_info],
)
# --- Agent: Triage Agent ---
triage_voice_agent = Agent(
name="VoiceAssistant",
instructions=prompt_with_handoff_instructions("""
You are the virtual assistant for Acme Shop. Welcome the user and ask how you can help.
Based on the user's intent, route to:
- AccountAgent for account-related queries
- KnowledgeAgent for product FAQs
- SearchAgent for anything requiring real-time web search
"""),
handoffs=[account_voice_agent, knowledge_voice_agent, search_voice_agent],
)
```
Next, we can instruct the default OpenAI TTS model used by the Agents SDK, `gpt-4o-mini-tts`, on how to deliver the agent-generated text as audio using the `instructions` field.
Here we have a huge amount of control over the output, including the ability to specify the personality, pronunciation, speed, and emotion of the voice.
Below I've included a few examples of how to prompt the model for different applications.
```python
health_assistant = (
    "Voice Affect: Calm, composed, and reassuring; project quiet authority and confidence. "
    "Tone: Sincere, empathetic, and gently authoritative—express genuine apology while conveying competence. "
    "Pacing: Steady and moderate; unhurried enough to communicate care, yet efficient enough to demonstrate professionalism."
)

coach_assistant = (
    "Voice: High-energy, upbeat, and encouraging, projecting enthusiasm and motivation. "
    "Punctuation: Short, punchy sentences with strategic pauses to maintain excitement and clarity. "
    "Delivery: Fast-paced and dynamic, with rising intonation to build momentum and keep engagement high."
)

themed_character_assistant = (
    "Affect: Deep, commanding, and slightly dramatic, with an archaic and reverent quality that reflects the grandeur of Olde English storytelling. "
    "Tone: Noble, heroic, and formal, capturing the essence of medieval knights and epic quests, while reflecting the antiquated charm of Olde English. "
    "Emotion: Excitement, anticipation, and a sense of mystery, combined with the seriousness of fate and duty. "
    "Pronunciation: Clear, deliberate, and with a slightly formal cadence. "
    "Pause: Pauses after important Olde English phrases such as \"Lo!\" or \"Hark!\" and between clauses like \"Choose thy path\" to add weight to the decision-making process and allow the listener to reflect on the seriousness of the quest."
)
```
Our configuration is going to focus on creating a friendly, warm, and supportive tone that sounds natural when spoken aloud and guides the user through the conversation.
```python
from agents.voice import TTSModelSettings, VoicePipeline, VoicePipelineConfig, SingleAgentVoiceWorkflow, AudioInput
import sounddevice as sd
import numpy as np
# Define custom TTS model settings with the desired instructions
custom_tts_settings = TTSModelSettings(
instructions="Personality: upbeat, friendly, persuasive guide"
"Tone: Friendly, clear, and reassuring, creating a calm atmosphere and making the listener feel confident and comfortable."
"Pronunciation: Clear, articulate, and steady, ensuring each instruction is easily understood while maintaining a natural, conversational flow."
"Tempo: Speak relatively fast, include brief pauses and after before questions"
"Emotion: Warm and supportive, conveying empathy and care, ensuring the listener feels guided and safe throughout the journey."
)
async def voice_assistant_optimized():
samplerate = sd.query_devices(kind='input')['default_samplerate']
voice_pipeline_config = VoicePipelineConfig(tts_settings=custom_tts_settings)
while True:
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(triage_voice_agent), config=voice_pipeline_config)
# Check for input to either provide voice or exit
cmd = input("Press Enter to speak your query (or type 'esc' to exit): ")
if cmd.lower() == "esc":
print("Exiting...")
break
print("Listening...")
recorded_chunks = []
# Start streaming from microphone until Enter is pressed
with sd.InputStream(samplerate=samplerate, channels=1, dtype='int16', callback=lambda indata, frames, time, status: recorded_chunks.append(indata.copy())):
input()
# Concatenate chunks into single buffer
recording = np.concatenate(recorded_chunks, axis=0)
# Input the buffer and await the result
audio_input = AudioInput(buffer=recording)
with trace("ACME App Optimized Voice Assistant"):
result = await pipeline.run(audio_input)
# Transfer the streamed result into chunks of audio
response_chunks = []
async for event in result.stream():
if event.type == "voice_stream_event_audio":
response_chunks.append(event.data)
response_audio = np.concatenate(response_chunks, axis=0)
# Play response
print("Assistant is responding...")
sd.play(response_audio, samplerate=samplerate)
sd.wait()
print("---")
# Run the voice assistant
await voice_assistant_optimized()
```
```text
Listening...
Assistant is responding...
---
Listening...
Assistant is responding...
---
Listening...
Assistant is responding...
---
Listening...
Assistant is responding...
```
Running the above code gives us the following responses, which are much more naturally worded and engaging in their delivery.
```python
display(Audio("voice_agents_audio/account_balance_response_opti.mp3"))
display(Audio("voice_agents_audio/product_info_response_opti.mp3"))
display(Audio("voice_agents_audio/trending_items_response_opti.mp3"))
```
_Embedded media omitted from the markdown export._
_Embedded media omitted from the markdown export._
_Embedded media omitted from the markdown export._
...And for something less subtle, we can switch to the `themed_character_assistant` instructions and receive the following responses:
```python
display(Audio("voice_agents_audio/product_info_character.wav"))
display(Audio("voice_agents_audio/product_info_character_2.wav"))
```
_Embedded media omitted from the markdown export._
_Embedded media omitted from the markdown export._
# Conclusion
Voila!
In this cookbook, we've demonstrated how to:
- Define agents to provide specific use case functionality for our in-app voice assistant
- Leverage in-built and custom tools with the Responses API to provide agents with a range of functionality and evaluate their performance with tracing
- Orchestrate these agents using the Agents SDK
- Convert agents from text-based to voice-based interactions using the Agents SDK's Voice functionality
The Agents SDK enables a modular approach to building your voice assistant, allowing you to work on a use case by use case basis, evaluating and iterating on each use case individually, before implementing the next and then converting the workflow from text to voice when you're ready.
We hope this cookbook has provided you with a useful guide to help you get started with building your own in-app voice assistant!
---
# Source: https://developers.openai.com/resources/code/apps-sdk-examples.md
# Apps SDK examples
> Example demo apps and corresponding MCP servers for the Apps SDK.
- Type: Code
- Tags: apps-sdk
- URL: https://github.com/openai/openai-apps-sdk-examples
- Created: 2025-10-06
- Updated: 2025-10-06
## Summary
Demonstrates how to use the Apps SDK to build MCP servers and apps for ChatGPT.
## Details
Provides example apps for the Apps SDK.
---
# Source: https://developers.openai.com/cookbook/examples/assistants_api_overview_python.md
# Assistants API Overview (Python SDK)
The new [Assistants API](https://platform.openai.com/docs/assistants/overview) is a stateful evolution of our [Chat Completions API](https://platform.openai.com/docs/guides/text-generation/chat-completions-api) meant to simplify the creation of assistant-like experiences, and enable developer access to powerful tools like Code Interpreter and File Search.

## Chat Completions API vs Assistants API
The primitives of the **Chat Completions API** are `Messages`, on which you perform a `Completion` with a `Model` (`gpt-4o`, `gpt-4o-mini`, etc). It is lightweight and powerful, but inherently stateless, which means you have to manage conversation state, tool definitions, retrieval documents, and code execution manually.
The primitives of the **Assistants API** are
- `Assistants`, which encapsulate a base model, instructions, tools, and (context) documents,
- `Threads`, which represent the state of a conversation, and
- `Runs`, which power the execution of an `Assistant` on a `Thread`, including textual responses and multi-step tool use.
We'll take a look at how these can be used to create powerful, stateful experiences.
## Setup
### Python SDK
> **Note**
> We've updated our [Python SDK](https://github.com/openai/openai-python) to add support for the Assistants API, so you'll need to update it to the latest version (`1.59.4` at time of writing).
```python
!pip install --upgrade openai
```
```text
Requirement already satisfied: openai in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (1.59.4)
Requirement already satisfied: anyio<5,>=3.5.0 in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from openai) (3.7.1)
Requirement already satisfied: distro<2,>=1.7.0 in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from openai) (1.9.0)
Requirement already satisfied: httpx<1,>=0.23.0 in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from openai) (0.27.0)
Requirement already satisfied: jiter<1,>=0.4.0 in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from openai) (0.7.0)
Requirement already satisfied: pydantic<3,>=1.9.0 in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from openai) (2.8.2)
Requirement already satisfied: sniffio in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from openai) (1.3.1)
Requirement already satisfied: tqdm>4 in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from openai) (4.66.4)
Requirement already satisfied: typing-extensions<5,>=4.11 in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from openai) (4.12.2)
Requirement already satisfied: idna>=2.8 in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from anyio<5,>=3.5.0->openai) (3.7)
Requirement already satisfied: certifi in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from httpx<1,>=0.23.0->openai) (2024.7.4)
Requirement already satisfied: httpcore==1.* in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from httpx<1,>=0.23.0->openai) (1.0.5)
Requirement already satisfied: h11<0.15,>=0.13 in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from httpcore==1.*->httpx<1,>=0.23.0->openai) (0.14.0)
Requirement already satisfied: annotated-types>=0.4.0 in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from pydantic<3,>=1.9.0->openai) (0.7.0)
Requirement already satisfied: pydantic-core==2.20.1 in /Users/lee.spacagna/myenv/lib/python3.12/site-packages (from pydantic<3,>=1.9.0->openai) (2.20.1)
```
And make sure it's up to date by running:
```python
!pip show openai | grep Version
```
```text
Version: 1.59.4
```
### Pretty Printing Helper
```python
import json
def show_json(obj):
display(json.loads(obj.model_dump_json()))
```
## Complete Example with Assistants API
### Assistants
The easiest way to get started with the Assistants API is through the [Assistants Playground](https://platform.openai.com/playground).

Let's begin by creating an assistant! We'll create a Math Tutor just like in our [docs](https://platform.openai.com/docs/assistants/overview).

You can also create Assistants directly through the Assistants API, like so:
```python
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", ""))
assistant = client.beta.assistants.create(
name="Math Tutor",
instructions="You are a personal math tutor. Answer questions briefly, in a sentence or less.",
model="gpt-4o",
)
show_json(assistant)
```
```text
{'id': 'asst_qvXmYlZV8zhABI2RtPzDfV6z',
'created_at': 1736340398,
'description': None,
'instructions': 'You are a personal math tutor. Answer questions briefly, in a sentence or less.',
'metadata': {},
'model': 'gpt-4o',
'name': 'Math Tutor',
'object': 'assistant',
'tools': [],
'response_format': 'auto',
'temperature': 1.0,
'tool_resources': {'code_interpreter': None, 'file_search': None},
'top_p': 1.0}
```
Regardless of whether you create your Assistant through the Dashboard or with the API, you'll want to keep track of the Assistant ID. This is how you'll refer to your Assistant throughout Threads and Runs.
Next, we'll create a new Thread and add a Message to it. This will hold the state of our conversation, so we don't have to re-send the entire message history each time.
### Threads
Create a new thread:
```python
thread = client.beta.threads.create()
show_json(thread)
```
```text
{'id': 'thread_j4dc1TiHPfkviKUHNi4aAsA6',
'created_at': 1736340398,
'metadata': {},
'object': 'thread',
'tool_resources': {'code_interpreter': None, 'file_search': None}}
```
Then add the Message to the thread:
```python
message = client.beta.threads.messages.create(
thread_id=thread.id,
role="user",
content="I need to solve the equation `3x + 11 = 14`. Can you help me?",
)
show_json(message)
```
```text
{'id': 'msg_1q4Y7ZZ9gIcPoAKSx9UtrrKJ',
'assistant_id': None,
'attachments': [],
'completed_at': None,
'content': [{'text': {'annotations': [],
'value': 'I need to solve the equation `3x + 11 = 14`. Can you help me?'},
'type': 'text'}],
'created_at': 1736340400,
'incomplete_at': None,
'incomplete_details': None,
'metadata': {},
'object': 'thread.message',
'role': 'user',
'run_id': None,
'status': None,
'thread_id': 'thread_j4dc1TiHPfkviKUHNi4aAsA6'}
```
> **Note**
> Even though you're no longer sending the entire history each time, you will still be charged for the tokens of the entire conversation history with each Run.
### Runs
Notice how the Thread we created is **not** associated with the Assistant we created earlier! Threads exist independently from Assistants, which may be different from what you'd expect if you've used ChatGPT (where a thread is tied to a model/GPT).
To get a completion from an Assistant for a given Thread, we must create a Run. Creating a Run will indicate to an Assistant it should look at the messages in the Thread and take action: either by adding a single response, or using tools.
> **Note**
> Runs are a key difference between the Assistants API and Chat Completions API. While in Chat Completions the model will only ever respond with a single message, in the Assistants API a Run may result in an Assistant using one or multiple tools, and potentially adding multiple messages to the Thread.
To get our Assistant to respond to the user, let's create the Run. As mentioned earlier, you must specify _both_ the Assistant and the Thread.
```python
run = client.beta.threads.runs.create(
thread_id=thread.id,
assistant_id=assistant.id,
)
show_json(run)
```
```text
{'id': 'run_qVYsWok6OCjHxkajpIrdHuVP',
'assistant_id': 'asst_qvXmYlZV8zhABI2RtPzDfV6z',
'cancelled_at': None,
'completed_at': None,
'created_at': 1736340403,
'expires_at': 1736341003,
'failed_at': None,
'incomplete_details': None,
'instructions': 'You are a personal math tutor. Answer questions briefly, in a sentence or less.',
'last_error': None,
'max_completion_tokens': None,
'max_prompt_tokens': None,
'metadata': {},
'model': 'gpt-4o',
'object': 'thread.run',
'parallel_tool_calls': True,
'required_action': None,
'response_format': 'auto',
'started_at': None,
'status': 'queued',
'thread_id': 'thread_j4dc1TiHPfkviKUHNi4aAsA6',
'tool_choice': 'auto',
'tools': [],
'truncation_strategy': {'type': 'auto', 'last_messages': None},
'usage': None,
'temperature': 1.0,
'top_p': 1.0,
'tool_resources': {}}
```
Unlike creating a completion in the Chat Completions API, **creating a Run is an asynchronous operation**. It will return immediately with the Run's metadata, which includes a `status` that will initially be set to `queued`. The `status` will be updated as the Assistant performs operations (like using tools and adding messages).
To know when the Assistant has completed processing, we can poll the Run in a loop. (Support for streaming is coming soon!) While here we are only checking for a `queued` or `in_progress` status, in practice a Run may undergo a [variety of status changes](https://platform.openai.com/docs/api-reference/runs/object#runs/object-status) which you can choose to surface to the user. (These are called Steps, and will be covered later.)
```python
import time
def wait_on_run(run, thread):
while run.status == "queued" or run.status == "in_progress":
run = client.beta.threads.runs.retrieve(
thread_id=thread.id,
run_id=run.id,
)
time.sleep(0.5)
return run
```
```python
run = wait_on_run(run, thread)
show_json(run)
```
```text
{'id': 'run_qVYsWok6OCjHxkajpIrdHuVP',
'assistant_id': 'asst_qvXmYlZV8zhABI2RtPzDfV6z',
'cancelled_at': None,
'completed_at': 1736340406,
'created_at': 1736340403,
'expires_at': None,
'failed_at': None,
'incomplete_details': None,
'instructions': 'You are a personal math tutor. Answer questions briefly, in a sentence or less.',
'last_error': None,
'max_completion_tokens': None,
'max_prompt_tokens': None,
'metadata': {},
'model': 'gpt-4o',
'object': 'thread.run',
'parallel_tool_calls': True,
'required_action': None,
'response_format': 'auto',
'started_at': 1736340405,
'status': 'completed',
'thread_id': 'thread_j4dc1TiHPfkviKUHNi4aAsA6',
'tool_choice': 'auto',
'tools': [],
'truncation_strategy': {'type': 'auto', 'last_messages': None},
'usage': {'completion_tokens': 35,
'prompt_tokens': 66,
'total_tokens': 101,
'prompt_token_details': {'cached_tokens': 0},
'completion_tokens_details': {'reasoning_tokens': 0}},
'temperature': 1.0,
'top_p': 1.0,
'tool_resources': {}}
```
### Messages
Now that the Run has completed, we can list the Messages in the Thread to see what got added by the Assistant.
```python
messages = client.beta.threads.messages.list(thread_id=thread.id)
show_json(messages)
```
```text
{'data': [{'id': 'msg_A5eAN6ZAJDmFBOYutEm5DFCy',
'assistant_id': 'asst_qvXmYlZV8zhABI2RtPzDfV6z',
'attachments': [],
'completed_at': None,
'content': [{'text': {'annotations': [],
'value': 'Sure! Subtract 11 from both sides to get \\(3x = 3\\), then divide by 3 to find \\(x = 1\\).'},
'type': 'text'}],
'created_at': 1736340405,
'incomplete_at': None,
'incomplete_details': None,
'metadata': {},
'object': 'thread.message',
'role': 'assistant',
'run_id': 'run_qVYsWok6OCjHxkajpIrdHuVP',
'status': None,
'thread_id': 'thread_j4dc1TiHPfkviKUHNi4aAsA6'},
{'id': 'msg_1q4Y7ZZ9gIcPoAKSx9UtrrKJ',
'assistant_id': None,
'attachments': [],
'completed_at': None,
'content': [{'text': {'annotations': [],
'value': 'I need to solve the equation `3x + 11 = 14`. Can you help me?'},
'type': 'text'}],
'created_at': 1736340400,
'incomplete_at': None,
'incomplete_details': None,
'metadata': {},
'object': 'thread.message',
'role': 'user',
'run_id': None,
'status': None,
'thread_id': 'thread_j4dc1TiHPfkviKUHNi4aAsA6'}],
'object': 'list',
'first_id': 'msg_A5eAN6ZAJDmFBOYutEm5DFCy',
'last_id': 'msg_1q4Y7ZZ9gIcPoAKSx9UtrrKJ',
'has_more': False}
```
As you can see, Messages are ordered in reverse-chronological order, so the most recent results are always on the first `page` (results can be paginated). Keep this in mind, since it's the opposite of the message order in the Chat Completions API.
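Since results are paginated, you may eventually need more than the first page. One option, sketched below under the assumption that the `client` and `thread` from the earlier cells are still in scope, is to let the Python SDK auto-paginate by iterating over the list result, passing `order="asc"` if you prefer chronological order:
```python
# The Python SDK auto-paginates when you iterate over a list result,
# so you can walk an entire Thread without managing cursors yourself.
# order="asc" flips the default reverse-chronological ordering.
for message in client.beta.threads.messages.list(thread_id=thread.id, order="asc"):
    print(f"{message.role}: {message.content[0].text.value}")
```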
Let's ask our Assistant to explain the result a bit further!
```python
# Create a message to append to our thread
message = client.beta.threads.messages.create(
thread_id=thread.id, role="user", content="Could you explain this to me?"
)
# Execute our run
run = client.beta.threads.runs.create(
thread_id=thread.id,
assistant_id=assistant.id,
)
# Wait for completion
wait_on_run(run, thread)
# Retrieve all the messages added after our last user message
messages = client.beta.threads.messages.list(
thread_id=thread.id, order="asc", after=message.id
)
show_json(messages)
```
```text
{'data': [{'id': 'msg_wSHHvaMnaWktZWsKs6gyoPUB',
'assistant_id': 'asst_qvXmYlZV8zhABI2RtPzDfV6z',
'attachments': [],
'completed_at': None,
'content': [{'text': {'annotations': [],
'value': 'Certainly! To isolate \\(x\\), first subtract 11 from both sides of the equation \\(3x + 11 = 14\\), resulting in \\(3x = 3\\). Then, divide both sides by 3 to solve for \\(x\\), giving you \\(x = 1\\).'},
'type': 'text'}],
'created_at': 1736340414,
'incomplete_at': None,
'incomplete_details': None,
'metadata': {},
'object': 'thread.message',
'role': 'assistant',
'run_id': 'run_lJsumsDtPTmdG3Enx2CfYrrq',
'status': None,
'thread_id': 'thread_j4dc1TiHPfkviKUHNi4aAsA6'}],
'object': 'list',
'first_id': 'msg_wSHHvaMnaWktZWsKs6gyoPUB',
'last_id': 'msg_wSHHvaMnaWktZWsKs6gyoPUB',
'has_more': False}
```
This may feel like a lot of steps to get a response back, especially for this simple example. However, you'll soon see how we can add very powerful functionality to our Assistant without changing much code at all!
### Example
Let's take a look at how we could potentially put all of this together. Below is all the code you need to use an Assistant you've created.
Since we've already created our Math Assistant, I've saved its ID in `MATH_ASSISTANT_ID`. I then defined two functions:
- `submit_message`: create a Message on a Thread, then start (and return) a new Run
- `get_response`: returns the list of Messages in a Thread
```python
import os

from openai import OpenAI
MATH_ASSISTANT_ID = assistant.id  # or a hard-coded ID like "asst-..."
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", ""))
def submit_message(assistant_id, thread, user_message):
client.beta.threads.messages.create(
thread_id=thread.id, role="user", content=user_message
)
return client.beta.threads.runs.create(
thread_id=thread.id,
assistant_id=assistant_id,
)
def get_response(thread):
return client.beta.threads.messages.list(thread_id=thread.id, order="asc")
```
I've also defined a `create_thread_and_run` function that I can re-use (which is actually almost identical to the [`client.beta.threads.create_and_run`](https://platform.openai.com/docs/api-reference/runs/createThreadAndRun) compound function in our API ;) ). Finally, we can submit our mock user requests each to a new Thread.
Notice how all of these API calls are asynchronous operations; this means we actually get async behavior in our code without the use of async libraries! (e.g. `asyncio`)
```python
def create_thread_and_run(user_input):
thread = client.beta.threads.create()
run = submit_message(MATH_ASSISTANT_ID, thread, user_input)
return thread, run
# Emulating concurrent user requests
thread1, run1 = create_thread_and_run(
"I need to solve the equation `3x + 11 = 14`. Can you help me?"
)
thread2, run2 = create_thread_and_run("Could you explain linear algebra to me?")
thread3, run3 = create_thread_and_run("I don't like math. What can I do?")
# Now all Runs are executing...
```
Once all Runs are going, we can wait on each and get the responses.
```python
import time
# Pretty printing helper
def pretty_print(messages):
print("# Messages")
for m in messages:
print(f"{m.role}: {m.content[0].text.value}")
print()
# Waiting in a loop
def wait_on_run(run, thread):
while run.status == "queued" or run.status == "in_progress":
run = client.beta.threads.runs.retrieve(
thread_id=thread.id,
run_id=run.id,
)
time.sleep(0.5)
return run
# Wait for Run 1
run1 = wait_on_run(run1, thread1)
pretty_print(get_response(thread1))
# Wait for Run 2
run2 = wait_on_run(run2, thread2)
pretty_print(get_response(thread2))
# Wait for Run 3
run3 = wait_on_run(run3, thread3)
pretty_print(get_response(thread3))
# Thank our assistant on Thread 3 :)
run4 = submit_message(MATH_ASSISTANT_ID, thread3, "Thank you!")
run4 = wait_on_run(run4, thread3)
pretty_print(get_response(thread3))
```
```text
# Messages
user: I need to solve the equation `3x + 11 = 14`. Can you help me?
assistant: Sure! Subtract 11 from both sides to get \(3x = 3\), then divide by 3 to find \(x = 1\).
# Messages
user: Could you explain linear algebra to me?
assistant: Linear algebra is the branch of mathematics concerning vector spaces, linear transformations, and systems of linear equations, often represented with matrices.
# Messages
user: I don't like math. What can I do?
assistant: Try relating math to real-life interests or hobbies, practice with fun games or apps, and gradually build confidence with easier problems.
# Messages
user: I don't like math. What can I do?
assistant: Try relating math to real-life interests or hobbies, practice with fun games or apps, and gradually build confidence with easier problems.
user: Thank you!
assistant: You're welcome! If you have any more questions, feel free to ask!
```
Et voilà!
You may have noticed that this code is not actually specific to our math Assistant at all... this code will work for any new Assistant you create simply by changing the Assistant ID! That is the power of the Assistants API.
## Tools
A key feature of the Assistants API is the ability to equip our Assistants with Tools, like Code Interpreter, File Search, and custom Functions. Let's take a look at each.
### Code Interpreter
Let's equip our Math Tutor with the [Code Interpreter](https://platform.openai.com/docs/assistants/tools/code-interpreter) tool, which we can do from the Dashboard...

...or the API, using the Assistant ID.
```python
assistant = client.beta.assistants.update(
MATH_ASSISTANT_ID,
tools=[{"type": "code_interpreter"}],
)
show_json(assistant)
```
```text
{'id': 'asst_qvXmYlZV8zhABI2RtPzDfV6z',
'created_at': 1736340398,
'description': None,
'instructions': 'You are a personal math tutor. Answer questions briefly, in a sentence or less.',
'metadata': {},
'model': 'gpt-4o',
'name': 'Math Tutor',
'object': 'assistant',
'tools': [{'type': 'code_interpreter'}],
'response_format': 'auto',
'temperature': 1.0,
'tool_resources': {'code_interpreter': {'file_ids': []}, 'file_search': None},
 'top_p': 1.0}
```
Now, let's ask the Assistant to use its new tool.
```python
thread, run = create_thread_and_run(
"Generate the first 20 fibbonaci numbers with code."
)
run = wait_on_run(run, thread)
pretty_print(get_response(thread))
```
```text
# Messages
user: Generate the first 20 Fibonacci numbers with code.
assistant: The first 20 Fibonacci numbers are: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181.
```
And that's it! The Assistant used Code Interpreter in the background, and gave us a final response.
For some use cases this may be enough – however, if we want more details on what precisely an Assistant is doing we can take a look at a Run's Steps.
### Steps
A Run is composed of one or more Steps. Like a Run, each Step has a `status` that you can query. This is useful for surfacing the progress of a Step to a user (e.g. a spinner while the Assistant is writing code or performing retrieval).
```python
run_steps = client.beta.threads.runs.steps.list(
thread_id=thread.id, run_id=run.id, order="asc"
)
```
Let's take a look at each Step's `step_details`.
```python
for step in run_steps.data:
    step_details = step.step_details
    show_json(step_details)
```
```text
{'tool_calls': [{'id': 'call_E1EE1loDmcWoc7FpkOMKYj6n',
'code_interpreter': {'input': 'def generate_fibonacci(n):\n fib_sequence = [0, 1]\n while len(fib_sequence) < n:\n next_value = fib_sequence[-1] + fib_sequence[-2]\n fib_sequence.append(next_value)\n return fib_sequence\n\n# Generate the first 20 Fibonacci numbers\nfirst_20_fibonacci = generate_fibonacci(20)\nfirst_20_fibonacci',
'outputs': []},
'type': 'code_interpreter'}],
'type': 'tool_calls'}
```
```text
{'message_creation': {'message_id': 'msg_RzTnbBMmzDYHk79a0x9qM5uU'},
'type': 'message_creation'}
```
We can see the `step_details` for two Steps:
1. `tool_calls` (plural, since it could be more than one in a single Step)
2. `message_creation`
The first Step is a `tool_calls` Step, specifically a call to `code_interpreter`, which contains:
- `input`, the Python code generated before the tool was called, and
- `outputs`, the result of running the Code Interpreter.
The second Step is a `message_creation`, which contains the `message` that was added to the Thread to communicate the results to the user.
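If you wanted to surface this progress to a user (for example, a spinner while the Assistant writes code), a rough sketch is to branch on each Step's type; the `print` calls below are placeholders for whatever UI updates you would make:
```python
# Sketch: report Run progress by inspecting each Step's type and details.
run_steps = client.beta.threads.runs.steps.list(
    thread_id=thread.id, run_id=run.id, order="asc"
)
for step in run_steps.data:
    if step.type == "tool_calls":
        for tool_call in step.step_details.tool_calls:
            if tool_call.type == "code_interpreter":
                # e.g. show "writing code..." or stream the generated code
                print("Code Interpreter input:\n", tool_call.code_interpreter.input)
    elif step.type == "message_creation":
        # Fetch the Message this Step created and show it to the user
        message_id = step.step_details.message_creation.message_id
        message = client.beta.threads.messages.retrieve(
            thread_id=thread.id, message_id=message_id
        )
        print("Assistant replied:", message.content[0].text.value)
```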
### File search
Another powerful tool in the Assistants API is [File search](https://platform.openai.com/docs/assistants/tools/file-search). It lets you upload files for the Assistant to use as a knowledge base when answering questions.

```python
# Upload the file
file = client.files.create(
file=open(
"data/language_models_are_unsupervised_multitask_learners.pdf",
"rb",
),
purpose="assistants",
)
# Create a vector store
vector_store = client.beta.vector_stores.create(
name="language_models_are_unsupervised_multitask_learners",
)
# Add the file to the vector store
vector_store_file = client.beta.vector_stores.files.create_and_poll(
vector_store_id=vector_store.id,
file_id=file.id,
)
# Confirm the file was added (re-fetch while polling, in case the status is still in progress)
while vector_store_file.status == "in_progress":
    time.sleep(1)
    vector_store_file = client.beta.vector_stores.files.retrieve(
        vector_store_id=vector_store.id, file_id=file.id
    )
if vector_store_file.status == "completed":
    print("File added to vector store")
elif vector_store_file.status == "failed":
    raise Exception("Failed to add file to vector store")
# Update Assistant
assistant = client.beta.assistants.update(
MATH_ASSISTANT_ID,
tools=[{"type": "code_interpreter"}, {"type": "file_search"}],
tool_resources={
"file_search":{
"vector_store_ids": [vector_store.id]
},
"code_interpreter": {
"file_ids": [file.id]
}
},
)
show_json(assistant)
```
```text
File added to vector store
```
```text
{'id': 'asst_qvXmYlZV8zhABI2RtPzDfV6z',
'created_at': 1736340398,
'description': None,
'instructions': 'You are a personal math tutor. Answer questions briefly, in a sentence or less.',
'metadata': {},
'model': 'gpt-4o',
'name': 'Math Tutor',
'object': 'assistant',
'tools': [{'type': 'code_interpreter'},
{'type': 'file_search',
'file_search': {'max_num_results': None,
'ranking_options': {'score_threshold': 0.0,
'ranker': 'default_2024_08_21'}}}],
'response_format': 'auto',
'temperature': 1.0,
'tool_resources': {'code_interpreter': {'file_ids': ['file-GQFm2i7N8LrAQatefWKEsE']},
'file_search': {'vector_store_ids': ['vs_dEArILZSJh7J799QACi3QhuU']}},
'top_p': 1.0}
```
```python
thread, run = create_thread_and_run(
"What are some cool math concepts behind this ML paper pdf? Explain in two sentences."
)
run = wait_on_run(run, thread)
pretty_print(get_response(thread))
```
```text
# Messages
user: What are some cool math concepts behind this ML paper pdf? Explain in two sentences.
assistant: The paper explores the concept of multitask learning where a single model is used to perform various tasks, modeling the conditional distribution \( p(\text{output} | \text{input, task}) \), inspired by probabilistic approaches【6:10†source】. It also discusses the use of Transformer-based architectures and parallel corpus substitution in language models, enhancing their ability to generalize across domain tasks without explicit task-specific supervision【6:2†source】【6:5†source】.
```
> **Note**
> There are more intricacies in File Search, like [Annotations](https://platform.openai.com/docs/assistants/how-it-works/managing-threads-and-messages), which may be covered in another cookbook.
```python
# Delete the vector store
client.beta.vector_stores.delete(vector_store.id)
```
```text
VectorStoreDeleted(id='vs_dEArILZSJh7J799QACi3QhuU', deleted=True, object='vector_store.deleted')
```
### Functions
As a final powerful tool for your Assistant, you can specify custom [Functions](https://platform.openai.com/docs/assistants/tools/function-calling) (much like the [Function Calling](https://platform.openai.com/docs/guides/function-calling) in the Chat Completions API). During a Run, the Assistant can then indicate it wants to call one or more functions you specified. You are then responsible for calling the Function, and providing the output back to the Assistant.
Let's take a look at an example by defining a `display_quiz()` Function for our Math Tutor.
This function will take a `title` and an array of `question`s, display the quiz, and get input from the user for each:
- `title`
- `questions`
- `question_text`
- `question_type`: [`MULTIPLE_CHOICE`, `FREE_RESPONSE`]
- `choices`: ["choice 1", "choice 2", ...]
I'll mock out responses with the `get_mock_response...` helpers. In a real application, this is where you'd collect the user's actual input.
```python
def get_mock_response_from_user_multiple_choice():
return "a"
def get_mock_response_from_user_free_response():
return "I don't know."
def display_quiz(title, questions):
print("Quiz:", title)
print()
responses = []
for q in questions:
print(q["question_text"])
response = ""
# If multiple choice, print options
if q["question_type"] == "MULTIPLE_CHOICE":
for i, choice in enumerate(q["choices"]):
print(f"{i}. {choice}")
response = get_mock_response_from_user_multiple_choice()
# Otherwise, just get response
elif q["question_type"] == "FREE_RESPONSE":
response = get_mock_response_from_user_free_response()
responses.append(response)
print()
return responses
```
Here's what a sample quiz would look like:
```python
responses = display_quiz(
"Sample Quiz",
[
{"question_text": "What is your name?", "question_type": "FREE_RESPONSE"},
{
"question_text": "What is your favorite color?",
"question_type": "MULTIPLE_CHOICE",
"choices": ["Red", "Blue", "Green", "Yellow"],
},
],
)
print("Responses:", responses)
```
```text
Quiz: Sample Quiz
What is your name?
What is your favorite color?
0. Red
1. Blue
2. Green
3. Yellow
Responses: ["I don't know.", 'a']
```
Now, let's define the interface of this function in JSON format, so our Assistant can call it:
```python
function_json = {
"name": "display_quiz",
"description": "Displays a quiz to the student, and returns the student's response. A single quiz can have multiple questions.",
"parameters": {
"type": "object",
"properties": {
"title": {"type": "string"},
"questions": {
"type": "array",
"description": "An array of questions, each with a title and potentially options (if multiple choice).",
"items": {
"type": "object",
"properties": {
"question_text": {"type": "string"},
"question_type": {
"type": "string",
"enum": ["MULTIPLE_CHOICE", "FREE_RESPONSE"]
},
"choices": {"type": "array", "items": {"type": "string"}}
},
"required": ["question_text"]
}
}
},
"required": ["title", "questions"]
}
}
```
Once again, let's update our Assistant either through the Dashboard or the API.

> **Note**
> Pasting the function JSON into the Dashboard was a bit finicky due to indentation, etc. I just asked ChatGPT to format my function the same as one of the examples on the Dashboard :).
```python
assistant = client.beta.assistants.update(
MATH_ASSISTANT_ID,
tools=[
{"type": "code_interpreter"},
{"type": "file_search"},
{"type": "function", "function": function_json},
],
)
show_json(assistant)
```
```text
{'id': 'asst_qvXmYlZV8zhABI2RtPzDfV6z',
'created_at': 1736340398,
'description': None,
'instructions': 'You are a personal math tutor. Answer questions briefly, in a sentence or less.',
'metadata': {},
'model': 'gpt-4o',
'name': 'Math Tutor',
'object': 'assistant',
'tools': [{'type': 'code_interpreter'},
{'type': 'file_search',
'file_search': {'max_num_results': None,
'ranking_options': {'score_threshold': 0.0,
'ranker': 'default_2024_08_21'}}},
{'function': {'name': 'display_quiz',
'description': "Displays a quiz to the student, and returns the student's response. A single quiz can have multiple questions.",
'description': "Displays a quiz to the student, and returns the student's response. A single quiz can have multiple questions.",
'parameters': {'type': 'object',
'properties': {'title': {'type': 'string'},
'questions': {'type': 'array',
'description': 'An array of questions, each with a title and potentially options (if multiple choice).',
'items': {'type': 'object',
'properties': {'question_text': {'type': 'string'},
'question_type': {'type': 'string',
'enum': ['MULTIPLE_CHOICE', 'FREE_RESPONSE']},
'choices': {'type': 'array', 'items': {'type': 'string'}}},
'required': ['question_text']}}},
'required': ['title', 'questions']},
'strict': False},
'type': 'function'}],
'response_format': 'auto',
'temperature': 1.0,
'tool_resources': {'code_interpreter': {'file_ids': ['file-GQFm2i7N8LrAQatefWKEsE']},
'file_search': {'vector_store_ids': []}},
'top_p': 1.0}
```
And now, we ask for a quiz.
```python
thread, run = create_thread_and_run(
"Make a quiz with 2 questions: One open ended, one multiple choice. Then, give me feedback for the responses."
)
run = wait_on_run(run, thread)
run.status
```
```text
'requires_action'
```
Now, however, when we check the Run's `status`, we see `requires_action`! Let's take a closer look.
```python
show_json(run)
```
```text
{'id': 'run_ekMRSI2h35asEzKirRf4BTwZ',
'assistant_id': 'asst_qvXmYlZV8zhABI2RtPzDfV6z',
'cancelled_at': None,
'completed_at': None,
'created_at': 1736341020,
'expires_at': 1736341620,
'failed_at': None,
'incomplete_details': None,
'instructions': 'You are a personal math tutor. Answer questions briefly, in a sentence or less.',
'last_error': None,
'max_completion_tokens': None,
'max_prompt_tokens': None,
'metadata': {},
'model': 'gpt-4o',
'object': 'thread.run',
'parallel_tool_calls': True,
'required_action': {'submit_tool_outputs': {'tool_calls': [{'id': 'call_uvJEn0fxM4sgmzek8wahBGLi',
'function': {'arguments': '{"title":"Math Quiz","questions":[{"question_text":"What is the derivative of the function f(x) = 3x^2 + 2x - 5?","question_type":"FREE_RESPONSE"},{"question_text":"What is the value of \\\\( \\\\int_{0}^{1} 2x \\\\, dx \\\\)?","question_type":"MULTIPLE_CHOICE","choices":["0","1","2","3"]}]}',
'name': 'display_quiz'},
'type': 'function'}]},
'type': 'submit_tool_outputs'},
'response_format': 'auto',
'started_at': 1736341022,
'status': 'requires_action',
'thread_id': 'thread_8bK2PXfoeijEHBVEzYuJXt17',
'tool_choice': 'auto',
'tools': [{'type': 'code_interpreter'},
{'type': 'file_search',
'file_search': {'max_num_results': None,
'ranking_options': {'score_threshold': 0.0,
'ranker': 'default_2024_08_21'}}},
{'function': {'name': 'display_quiz',
'description': "Displays a quiz to the student, and returns the student's response. A single quiz can have multiple questions.",
'description': "Displays a quiz to the student, and returns the student's response. A single quiz can have multiple questions.",
'parameters': {'type': 'object',
'properties': {'title': {'type': 'string'},
'questions': {'type': 'array',
'description': 'An array of questions, each with a title and potentially options (if multiple choice).',
'items': {'type': 'object',
'properties': {'question_text': {'type': 'string'},
'question_type': {'type': 'string',
'enum': ['MULTIPLE_CHOICE', 'FREE_RESPONSE']},
'choices': {'type': 'array', 'items': {'type': 'string'}}},
'required': ['question_text']}}},
'required': ['title', 'questions']},
'strict': False},
'type': 'function'}],
'truncation_strategy': {'type': 'auto', 'last_messages': None},
'usage': None,
'temperature': 1.0,
'top_p': 1.0,
 'tool_resources': {}}
```
The `required_action` field indicates a Tool is waiting for us to run it and submit its output back to the Assistant. Specifically, the `display_quiz` function! Let's start by parsing the `name` and `arguments`.
> **Note**
> While in this case we know there is only one Tool call, in practice the Assistant may choose to call multiple tools.
```python
# Extract single tool call
tool_call = run.required_action.submit_tool_outputs.tool_calls[0]
name = tool_call.function.name
arguments = json.loads(tool_call.function.arguments)
print("Function Name:", name)
print("Function Arguments:")
arguments
```
```text
Function Name: display_quiz
Function Arguments:
```
```text
{'title': 'Math Quiz',
'questions': [{'question_text': 'What is the derivative of the function f(x) = 3x^2 + 2x - 5?',
'question_type': 'FREE_RESPONSE'},
{'question_text': 'What is the value of \\( \\int_{0}^{1} 2x \\, dx \\)?',
'question_type': 'MULTIPLE_CHOICE',
'choices': ['0', '1', '2', '3']}]}
```
Now let's actually call our `display_quiz` function with the arguments provided by the Assistant:
```python
responses = display_quiz(arguments["title"], arguments["questions"])
print("Responses:", responses)
```
```text
Quiz: Math Quiz
What is the derivative of the function f(x) = 3x^2 + 2x - 5?
What is the value of \( \int_{0}^{1} 2x \, dx \)?
0. 0
1. 1
2. 2
3. 3
Responses: ["I don't know.", 'a']
```
Great! (Remember, these responses are the ones we mocked earlier. In reality, we'd be getting real input back from the user through this function call.)
Now that we have our responses, let's submit them back to the Assistant. We'll need the `tool_call` ID, found in the `tool_call` we parsed out earlier. We'll also need to encode our `list` of responses into a `str`.
```python
run = client.beta.threads.runs.submit_tool_outputs(
    thread_id=thread.id,
    run_id=run.id,
    tool_outputs=[
        {"tool_call_id": tool_call.id, "output": json.dumps(responses)}
    ],
)
show_json(run)
```
```text
{'id': 'run_ekMRSI2h35asEzKirRf4BTwZ',
'assistant_id': 'asst_qvXmYlZV8zhABI2RtPzDfV6z',
'cancelled_at': None,
'completed_at': None,
'created_at': 1736341020,
'expires_at': 1736341620,
'failed_at': None,
'incomplete_details': None,
'instructions': 'You are a personal math tutor. Answer questions briefly, in a sentence or less.',
'last_error': None,
'max_completion_tokens': None,
'max_prompt_tokens': None,
'metadata': {},
'model': 'gpt-4o',
'object': 'thread.run',
'parallel_tool_calls': True,
'required_action': None,
'response_format': 'auto',
'started_at': 1736341022,
'status': 'queued',
'thread_id': 'thread_8bK2PXfoeijEHBVEzYuJXt17',
'tool_choice': 'auto',
'tools': [{'type': 'code_interpreter'},
{'type': 'file_search',
'file_search': {'max_num_results': None,
'ranking_options': {'score_threshold': 0.0,
'ranker': 'default_2024_08_21'}}},
{'function': {'name': 'display_quiz',
'description': "Displays a quiz to the student, and returns the student's response. A single quiz can have multiple questions.",
'description': "Displays a quiz to the student, and returns the student's response. A single quiz can have multiple questions.",
'parameters': {'type': 'object',
'properties': {'title': {'type': 'string'},
'questions': {'type': 'array',
'description': 'An array of questions, each with a title and potentially options (if multiple choice).',
'items': {'type': 'object',
'properties': {'question_text': {'type': 'string'},
'question_type': {'type': 'string',
'enum': ['MULTIPLE_CHOICE', 'FREE_RESPONSE']},
'choices': {'type': 'array', 'items': {'type': 'string'}}},
'required': ['question_text']}}},
'required': ['title', 'questions']},
'strict': False},
'type': 'function'}],
'truncation_strategy': {'type': 'auto', 'last_messages': None},
'usage': None,
'temperature': 1.0,
'top_p': 1.0,
 'tool_resources': {}}
```
We can now wait for the Run to complete once again, and check our Thread!
```python
run = wait_on_run(run, thread)
pretty_print(get_response(thread))
```
```text
# Messages
user: Make a quiz with 2 questions: One open ended, one multiple choice. Then, give me feedback for the responses.
assistant: Since no specific information was found in the uploaded file, I'll create a general math quiz for you:
1. **Open-ended Question**: What is the derivative of the function \( f(x) = 3x^2 + 2x - 5 \)?
2. **Multiple Choice Question**: What is the value of \( \int_{0}^{1} 2x \, dx \)?
- A) 0
- B) 1
- C) 2
- D) 3
I will now present the quiz to you for response.
assistant: Here is the feedback for your responses:
1. **Derivative Question**:
- Your Response: "I don't know."
- Feedback: The derivative of \( f(x) = 3x^2 + 2x - 5 \) is \( f'(x) = 6x + 2 \).
2. **Integration Question**:
- Your Response: A) 0
- Feedback: The correct answer is B) 1. The integration \(\int_{0}^{1} 2x \, dx \) evaluates to 1.
```
Woohoo 🎉
## Conclusion
We covered a lot of ground in this notebook, so give yourself a high-five! You should now have a strong foundation for building powerful, stateful experiences with tools like Code Interpreter, File Search, and Functions!
There are a few sections we didn't cover for the sake of brevity, so here are a few resources to explore further:
- [Annotations](https://platform.openai.com/docs/assistants/how-it-works/managing-threads-and-messages): parsing file citations
- [Files](https://platform.openai.com/docs/api-reference/assistants/file-object): Thread scoped vs Assistant scoped
- [Parallel Function Calls](https://platform.openai.com/docs/guides/function-calling/parallel-function-calling): calling multiple tools in a single Step
- Multi-Assistant Thread Runs: single Thread with Messages from multiple Assistants
- Streaming: coming soon!
Now go off and build something ama[zing](https://www.youtube.com/watch?v=xvFZjo5PgG0&pp=ygUQcmljayByb2xsIG5vIGFkcw%3D%3D)!
---
# Source: https://developers.openai.com/resources/guide/audio-speech-guide.md
# Audio & speech guide
> Overview of approaches for audio processing and speech in applications.
- Type: Guide
- Tags: speech
- URL: https://platform.openai.com/docs/guides/audio
- Created: 2025-07-21
- Updated: 2025-07-21
## Summary
Covers audio streaming, speech synthesis, and related APIs.
## Details
Introduces core concepts for handling audio and speech with OpenAI models.
---
# Source: https://developers.openai.com/codex/auth.md
# Source: https://developers.openai.com/apps-sdk/build/auth.md
# Authentication
## Authenticate your users
Many Apps SDK apps can operate in a read-only, anonymous mode, but anything that exposes customer-specific data or write actions should authenticate users.
You can integrate with your own authorization server when you need to connect to an existing backend or share data between users.
## Custom auth with OAuth 2.1
For an authenticated MCP server, you are expected to implement an OAuth 2.1 flow that conforms to the [MCP authorization spec](https://modelcontextprotocol.io/specification/2025-06-18/basic/authorization).
### Components
- **Resource server** – your MCP server, which exposes tools and verifies access tokens on each request.
- **Authorization server** – your identity provider (Auth0, Okta, Cognito, or a custom implementation) that issues tokens and publishes discovery metadata.
- **Client** – ChatGPT acting on behalf of the user. It supports dynamic client registration and PKCE.
### MCP authorization spec requirements
- Host protected resource metadata on your MCP server
- Publish OAuth metadata from your authorization server
- Echo the `resource` parameter throughout the OAuth flow
- Advertise PKCE support for ChatGPT
Here is what the spec expects, in plain language.
#### Host protected resource metadata on your MCP server
- You need an HTTPS endpoint such as `GET https://your-mcp.example.com/.well-known/oauth-protected-resource` (or advertise the same URL in a `WWW-Authenticate` header on `401 Unauthorized` responses) so ChatGPT knows where to fetch your metadata.
- That endpoint returns a JSON document describing the resource server and its available authorization servers:
```json
{
"resource": "https://your-mcp.example.com",
"authorization_servers": ["https://auth.yourcompany.com"],
"scopes_supported": ["files:read", "files:write"],
"resource_documentation": "https://yourcompany.com/docs/mcp"
}
```
- Key fields you must populate:
- `resource`: the canonical HTTPS identifier for your MCP server. ChatGPT sends this exact value as the `resource` query parameter during OAuth.
- `authorization_servers`: one or more issuer base URLs that point to your identity provider. ChatGPT will try each to find OAuth metadata.
- `scopes_supported`: optional list that helps ChatGPT explain the permissions it is going to ask the user for.
- Optional extras from [RFC 9728](https://datatracker.ietf.org/doc/html/rfc9728) such as `resource_documentation`, `token_endpoint_auth_methods_supported`, or `introspection_endpoint` make it easier for clients and admins to understand your setup.
When you block a request because it is unauthenticated, return a challenge like:
```http
HTTP/1.1 401 Unauthorized
WWW-Authenticate: Bearer resource_metadata="https://your-mcp.example.com/.well-known/oauth-protected-resource",
scope="files:read"
```
That single header lets ChatGPT discover the metadata URL even if it has not seen it before.
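To make the two pieces concrete, here is a minimal sketch of serving the metadata document and issuing the challenge. The choice of FastAPI is an assumption (any HTTPS-capable framework works); the URLs and scopes simply reuse the example values above:
```python
# Sketch: serve protected resource metadata and challenge unauthenticated
# requests with a WWW-Authenticate header. FastAPI is an arbitrary choice.
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

RESOURCE = "https://your-mcp.example.com"
METADATA_PATH = "/.well-known/oauth-protected-resource"

@app.get(METADATA_PATH)
def protected_resource_metadata():
    # The document ChatGPT fetches to discover your authorization server(s)
    return {
        "resource": RESOURCE,
        "authorization_servers": ["https://auth.yourcompany.com"],
        "scopes_supported": ["files:read", "files:write"],
    }

@app.middleware("http")
async def require_bearer_token(request: Request, call_next):
    # Let the metadata endpoint through; challenge anything else that
    # arrives without a bearer token.
    if request.url.path != METADATA_PATH and "authorization" not in request.headers:
        return JSONResponse(
            status_code=401,
            content={"error": "unauthorized"},
            headers={
                "WWW-Authenticate": (
                    f'Bearer resource_metadata="{RESOURCE}{METADATA_PATH}", '
                    'scope="files:read"'
                )
            },
        )
    return await call_next(request)
```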
#### Publish OAuth metadata from your authorization server
- Your identity provider must expose one of the well-known discovery documents so ChatGPT can read its configuration:
- OAuth 2.0 metadata at `https://auth.yourcompany.com/.well-known/oauth-authorization-server`
- OpenID Connect metadata at `https://auth.yourcompany.com/.well-known/openid-configuration`
- Each document answers three big questions for ChatGPT: where to send the user, how to exchange codes, and how to register itself. A typical response looks like:
```json
{
"issuer": "https://auth.yourcompany.com",
"authorization_endpoint": "https://auth.yourcompany.com/oauth2/v1/authorize",
"token_endpoint": "https://auth.yourcompany.com/oauth2/v1/token",
"registration_endpoint": "https://auth.yourcompany.com/oauth2/v1/register",
"code_challenge_methods_supported": ["S256"],
"scopes_supported": ["files:read", "files:write"]
}
```
- Fields that must be correct:
- `authorization_endpoint`, `token_endpoint`: the URLs ChatGPT needs to run the OAuth authorization-code + PKCE flow end to end.
- `registration_endpoint`: enables dynamic client registration (DCR) so ChatGPT can mint a dedicated `client_id` per connector.
- `code_challenge_methods_supported`: must include `S256`, otherwise ChatGPT will refuse to proceed because PKCE appears unsupported.
- Optional fields follow [RFC 8414](https://datatracker.ietf.org/doc/html/rfc8414) / [OpenID Discovery](https://openid.net/specs/openid-connect-discovery-1_0.html); include whatever helps your administrators configure policies.
#### Redirect URL
ChatGPT completes the OAuth flow by redirecting to `https://chatgpt.com/connector_platform_oauth_redirect`. Add that production redirect URI to your authorization server's allowlist so the authorization code can be returned successfully.
In addition, as you prepare to submit your app for review, allowlist the review redirect URI `https://platform.openai.com/apps-manage/oauth` so the review flow can complete OAuth successfully.
#### Echo the `resource` parameter throughout the OAuth flow
- Expect ChatGPT to append `resource=https%3A%2F%2Fyour-mcp.example.com` to both the authorization and token requests. This ties the token back to the protected resource metadata shown above.
- Configure your authorization server to copy that value into the access token (commonly the `aud` claim) so your MCP server can verify the token was minted for it and nobody else.
- If a token arrives without the expected audience or scopes, reject it and rely on the `WWW-Authenticate` challenge to prompt ChatGPT to re-authorize with the correct parameters.
#### Advertise PKCE support for ChatGPT
- ChatGPT, acting as the MCP client, performs the authorization-code flow with PKCE using the `S256` code challenge so intercepted authorization codes cannot be replayed by an attacker. That protection is why the MCP authorization spec mandates PKCE.
- Your authorization server metadata therefore needs to list `code_challenge_methods_supported` (or equivalent) including `S256`. If that field is missing, ChatGPT will refuse to complete the flow because it cannot confirm PKCE support.
### OAuth flow
Provided that you have implemented the MCP authorization spec delineated above, the OAuth flow will be as follows:
1. ChatGPT queries your MCP server for protected resource metadata.

2. ChatGPT registers itself via dynamic client registration with your authorization server using the `registration_endpoint` and obtains a `client_id`.

3. When the user first invokes a tool, the ChatGPT client launches the OAuth authorization code + PKCE flow. The user authenticates and consents to the requested scopes.

4. ChatGPT exchanges the authorization code for an access token and attaches it to subsequent MCP requests (`Authorization: Bearer `).

5. Your server verifies the token on each request (issuer, audience, expiration, scopes) before executing the tool.
### Client registration
The MCP spec currently requires dynamic client registration (DCR). This means that each time ChatGPT connects, it registers a fresh OAuth client with your authorization server, obtains a unique `client_id`, and uses that identity during token exchange. The downside of this approach is that it can generate thousands of short-lived clients—often one per user session.
To address this issue, the MCP council is currently advancing [Client Metadata Documents (CMID)](https://blog.modelcontextprotocol.io/posts/client_registration/). In the CMID model, ChatGPT will publish a stable document (for example `https://openai.com/chatgpt.json`) that declares its OAuth metadata and identity. Your authorization server can fetch the document over HTTPS, pin it as the canonical client record, and enforce policies such as redirect URI allowlists or rate limits without relying on per-session registration. CMID is still in draft, so continue supporting DCR until it has landed.
### Client identification
A frequent question is how your MCP server can confirm that a request actually comes from ChatGPT. Today the only reliable control is network-level filtering, such as allowlisting ChatGPT’s [published egress IP ranges](https://openai.com/chatgpt-connectors.json). ChatGPT does **not** support machine-to-machine OAuth grants such as client credentials, service accounts, or JWT bearer assertions, nor can it present custom API keys or mTLS certificates.
Once rolled out, CMID directly addresses the client identification problem by giving you a signed, HTTPS-hosted declaration of ChatGPT’s identity.
### Choosing an identity provider
Most OAuth 2.1 identity providers can satisfy the MCP authorization requirements once they expose a discovery document, allow dynamic client registration, and echo the `resource` parameter into issued tokens.
We _strongly_ recommend that you use an existing established identity provider rather than implementing authentication from scratch yourself.
Here are instructions for some popular identity providers.
#### Auth0
- [Guide to configuring Auth0 for MCP authorization](https://github.com/openai/openai-mcpkit/blob/main/python-authenticated-mcp-server-scaffold/README.md#2-configure-auth0-authentication)
#### Stytch
- [Guide to configuring Stytch for MCP authorization](https://stytch.com/docs/guides/connected-apps/mcp-server-overview)
- [Overview guide to MCP authorization](https://stytch.com/blog/MCP-authentication-and-authorization-guide/)
- [Overview guide to MCP authorization specifically for Apps SDK](https://stytch.com/blog/guide-to-authentication-for-the-openai-apps-sdk/)
### Implementing token verification
When the OAuth flow finishes, ChatGPT simply attaches the access token it received to subsequent MCP requests (`Authorization: Bearer …`). Once a request reaches your MCP server you must assume the token is untrusted and perform the full set of resource-server checks yourself—signature validation, issuer and audience matching, expiry, replay considerations, and scope enforcement. That responsibility sits with you, not with ChatGPT.
In practice you should:
- Fetch the signing keys published by your authorization server (usually via JWKS) and verify the token’s signature and `iss`.
- Reject tokens that have expired or have not yet become valid (`exp`/`nbf`).
- Confirm the token was minted for your server (`aud` or the `resource` claim) and contains the scopes you marked as required.
- Run any app-specific policy checks, then either attach the resolved identity to the request context or return a `401` with a `WWW-Authenticate` challenge.
If verification fails, respond with `401 Unauthorized` and a `WWW-Authenticate` header that points back to your protected-resource metadata. This tells the client to run the OAuth flow again.
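For example, if your authorization server issues JWT access tokens, verification might look like the sketch below. The use of the `PyJWT` library, the JWKS path, and the space-separated `scope` claim are assumptions; adapt them to your provider, or use token introspection instead:
```python
# Sketch: resource-server-side verification of a JWT bearer token.
import jwt  # PyJWT
from jwt import PyJWKClient

ISSUER = "https://auth.yourcompany.com"
AUDIENCE = "https://your-mcp.example.com"  # must match your `resource` value
REQUIRED_SCOPES = {"files:read"}           # example scope

# JWKS location varies by provider; this path is an assumption.
jwks_client = PyJWKClient(f"{ISSUER}/.well-known/jwks.json")

def verify_token(token: str) -> dict:
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    claims = jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        issuer=ISSUER,      # reject tokens from other issuers
        audience=AUDIENCE,  # reject tokens minted for another resource
    )  # raises if the signature, exp, or nbf checks fail
    granted = set(claims.get("scope", "").split())  # claim format varies by provider
    if not REQUIRED_SCOPES <= granted:
        raise PermissionError("missing required scopes")
    return claims  # attach to request context; on failure return 401 + WWW-Authenticate
```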
#### SDK token verification primitives
Both Python and TypeScript MCP SDKs include helpers so you do not have to wire this from scratch.
- [Python](https://github.com/modelcontextprotocol/python-sdk?tab=readme-ov-file#authentication)
- [TypeScript](https://github.com/modelcontextprotocol/typescript-sdk?tab=readme-ov-file#proxy-authorization-requests-upstream)
## Testing and rollout
- **Local testing** – start with a development tenant that issues short-lived tokens so you can iterate quickly.
- **Dogfood** – once authentication works, gate access to trusted testers before rolling out broadly. You can require linking for specific tools or the entire connector.
- **Rotation** – plan for token revocation, refresh, and scope changes. Your server should treat missing or stale tokens as unauthenticated and return a helpful error message.
- **OAuth debugging** – use the [MCP Inspector](https://modelcontextprotocol.io/docs/tools/inspector) Auth settings to walk through each OAuth step and pinpoint where the flow breaks before you ship.
With authentication in place you can confidently expose user-specific data and write actions to ChatGPT users.
## Triggering authentication UI
ChatGPT only surfaces its OAuth linking UI when your MCP server signals that OAuth is available or necessary.
Triggering the tool-level OAuth flow requires both metadata (`securitySchemes` and the resource metadata document) **and** runtime errors that carry `_meta["mcp/www_authenticate"]`. Without both halves ChatGPT will not show the linking UI for that tool.
1. **Publish resource metadata.** The MCP server must expose its OAuth configuration at a well-known URL such as `https://your-mcp.example.com/.well-known/oauth-protected-resource`.
2. **Describe each tool’s auth policy with `securitySchemes`.** Declaring `securitySchemes` per tool tells ChatGPT which tools require OAuth versus which can run anonymously. Stick to per-tool declarations even if the entire server uses the same policy; server-level defaults make it difficult to evolve individual tools later.
Two scheme types are available today, and you can list more than one to express optional auth:
- `noauth` — the tool is callable anonymously; ChatGPT can run it immediately.
- `oauth2` — the tool needs an OAuth 2.0 access token; include the scopes you will request so the consent screen is accurate.
If you omit the array entirely, the tool inherits whatever default the server advertises. Declaring both `noauth` and `oauth2` tells ChatGPT it can start with anonymous calls but that linking unlocks privileged behavior. Regardless of what you signal to the client, your server must still verify the token, scopes, and audience on every invocation.
Example (public + optional auth) – TypeScript SDK
```ts
declare const server: McpServer;
server.registerTool(
"search",
{
title: "Public Search",
description: "Search public documents.",
inputSchema: {
type: "object",
properties: { q: { type: "string" } },
required: ["q"],
},
securitySchemes: [
{ type: "noauth" },
{ type: "oauth2", scopes: ["search.read"] },
],
},
async ({ input }) => {
return {
content: [{ type: "text", text: `Results for ${input.q}` }],
structuredContent: {},
};
}
);
```
Example (auth required) – TypeScript SDK
```ts
declare const server: McpServer;
server.registerTool(
"create_doc",
{
title: "Create Document",
description: "Make a new doc in your account.",
inputSchema: {
type: "object",
properties: { title: { type: "string" } },
required: ["title"],
},
securitySchemes: [{ type: "oauth2", scopes: ["docs.write"] }],
},
async ({ input }) => {
return {
content: [{ type: "text", text: `Created doc: ${input.title}` }],
structuredContent: {},
};
}
);
```
3. **Check tokens inside the tool handler and emit `_meta["mcp/www_authenticate"]`** when you want ChatGPT to trigger the authentication UI. Inspect the token and verify issuer, audience, expiry, and scopes. If no valid token is present, return an error result that includes `_meta["mcp/www_authenticate"]` and make sure the value contains both an `error` and `error_description` parameter. This `WWW-Authenticate` payload is what actually triggers the tool-level OAuth UI once steps 1 and 2 are in place.
Example
```json
{
"jsonrpc": "2.0",
"id": 4,
"result": {
"content": [
{
"type": "text",
"text": "Authentication required: no access token provided."
}
],
"_meta": {
"mcp/www_authenticate": [
"'Bearer resource_metadata=\"https://your-mcp.example.com/.well-known/oauth-protected-resource\", error=\"insufficient_scope\", error_description=\"You need to login to continue\"'"
]
},
"isError": true
}
}
```
---
# Source: https://developers.openai.com/codex/guides/autofix-ci.md
# Source: https://developers.openai.com/codex/autofix-ci.md
# Autofix CI failures with Codex
Codex can keep your continuous integration (CI) signal green by running automatically whenever a workflow fails. This guide adapts the official Codex cookbook so GitHub Actions can invoke the Codex CLI, apply targeted fixes, verify tests, and open a pull request for review.
## End-to-end flow
Below is the pipeline flow we’ll implement:
1. A primary workflow named `CI` (rename as needed) runs as normal.
2. When the workflow finishes with a failure, a second workflow installs Codex, gathers context, and delegates to the Codex CLI via `openai/codex-action`.
3. Codex iterates locally in the GitHub-hosted runner, applies a minimal fix, and pushes a pull request back to the failing branch for review.

## Prerequisites
- A repository with GitHub Actions enabled and a primary workflow to monitor.
- The `OPENAI_API_KEY` secret configured either at the repo or organization level so Codex CLI can authenticate.
- Python available in the runner image (needed for `codex login`).
- Repository permissions that allow GitHub Actions to create branches and pull requests.

## Step 1: Add the GitHub Action to your CI Pipeline
Create a workflow such as `.github/workflows/codex-autofix.yml` that listens for failed runs from your primary workflow. Update the `workflows` array if your pipeline uses a different name. The job installs dependencies, runs Codex with a guard-railed prompt, re-runs your tests, and uses `peter-evans/create-pull-request` to stage a reviewable fix.
```yaml
name: Codex Auto-Fix on Failure
on:
workflow_run:
# Trigger this job after any run of the primary CI workflow completes
workflows: ["CI"]
types: [completed]
permissions:
contents: write
pull-requests: write
jobs:
auto-fix:
# Only run when the referenced workflow concluded with a failure
if: ${{ github.event.workflow_run.conclusion == 'failure' }}
runs-on: ubuntu-latest
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
FAILED_WORKFLOW_NAME: ${{ github.event.workflow_run.name }}
FAILED_RUN_URL: ${{ github.event.workflow_run.html_url }}
FAILED_HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
FAILED_HEAD_SHA: ${{ github.event.workflow_run.head_sha }}
steps:
- name: Check OpenAI API Key Set
run: |
if [ -z "$OPENAI_API_KEY" ]; then
echo "OPENAI_API_KEY secret is not set. Skipping auto-fix." >&2
exit 1
fi
- name: Checkout Failing Ref
uses: actions/checkout@v4
with:
ref: ${{ env.FAILED_HEAD_SHA }}
fetch-depth: 0
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: |
if [ -f package-lock.json ]; then npm ci; else npm i; fi
- name: Run Codex
uses: openai/codex-action@main
id: codex
with:
openai_api_key: ${{ secrets.OPENAI_API_KEY }}
prompt: >-
You are working in a Node.js monorepo with Jest tests and GitHub Actions. Read the repository,
run the test suite, identify the minimal change needed to make all tests pass, implement only that change,
and stop. Do not refactor unrelated code or files. Keep changes small and surgical.
codex_args: '["--config","sandbox_mode=\"workspace-write\""]'
- name: Verify tests
run: npm test --silent
- name: Create pull request with fixes
if: success()
uses: peter-evans/create-pull-request@v6
with:
commit-message: "fix(ci): auto-fix failing tests via Codex"
branch: codex/auto-fix-${{ github.event.workflow_run.run_id }}
base: ${{ env.FAILED_HEAD_BRANCH }}
title: "Auto-fix failing CI via Codex"
body: |
Codex automatically generated this PR in response to a CI failure on workflow `${{ env.FAILED_WORKFLOW_NAME }}`.
Failed run: ${{ env.FAILED_RUN_URL }}
Head branch: `${{ env.FAILED_HEAD_BRANCH }}`
This PR contains minimal changes intended solely to make the CI pass.
```
## Step 2: Watch the follow-up workflow run
When the main workflow fails you can monitor both the failure and the Codex follow-up under the Actions tab.

The autofix workflow will appear as soon as the triggering workflow finishes.

## Step 3: Review the generated pull request
After Codex finishes, it opens a pull request on a branch named `codex/auto-fix-<run_id>` that contains the proposed patch along with a summary referencing the failed run. Review and merge as you would with any contribution.

## Conclusion
Embedding Codex CLI in CI automates repetitive cleanup steps after failures. You can adapt the same scaffold to run different test commands, adjust prompts for your stack, or extend the workflow with additional safeguards while keeping Codex in control of quick fixes.
---
# Source: https://developers.openai.com/cookbook/examples/codex/autofix-github-actions.md
# Autofix CI failures on GitHub with Codex CLI
## Purpose of this cookbook
This cookbook shows you how to embed the OpenAI Codex CLI into your CI/CD pipeline so that when your builds or tests fail, Codex automatically generates and proposes fixes. The following is an example in a Node.js project with CI running in GitHub Actions.
## End to End Flow
Below is the pipeline flow we’ll implement:
## Prerequisites
- A GitHub Repo with Actions workflows
- You’ll need to create `OPENAI_API_KEY` as a secret in GitHub settings under https://github.com/{org-name}/{repo-name}/settings/secrets/actions. You can also set this at the org level (for sharing secrets across multiple repos).
- Codex requires Python as a prerequisite to use `codex login`.
- You’ll need to enable the setting that allows Actions to create PRs in your repo, and also in your organization.
## Step 1: Add the Github Action to your CI Pipeline
The following YAML shows a GitHub Action that triggers automatically when CI fails, runs Codex via `openai/codex-action`, and then opens a PR on the failing branch with the fix. Replace "CI" with the name of the workflow you want to monitor.
```yaml
name: Codex Auto-Fix on Failure
on:
workflow_run:
# Trigger this job after any run of the primary CI workflow completes
workflows: ["CI"]
types: [completed]
permissions:
contents: write
pull-requests: write
jobs:
auto-fix:
# Only run when the referenced workflow concluded with a failure
if: ${{ github.event.workflow_run.conclusion == 'failure' }}
runs-on: ubuntu-latest
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
FAILED_WORKFLOW_NAME: ${{ github.event.workflow_run.name }}
FAILED_RUN_URL: ${{ github.event.workflow_run.html_url }}
FAILED_HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
FAILED_HEAD_SHA: ${{ github.event.workflow_run.head_sha }}
steps:
- name: Check OpenAI API Key Set
run: |
if [ -z "$OPENAI_API_KEY" ]; then
echo "OPENAI_API_KEY secret is not set. Skipping auto-fix." >&2
exit 1
fi
- name: Checkout Failing Ref
uses: actions/checkout@v4
with:
ref: ${{ env.FAILED_HEAD_SHA }}
fetch-depth: 0
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- name: Install dependencies
run: |
if [ -f package-lock.json ]; then npm ci; else npm i; fi
- name: Run Codex
uses: openai/codex-action@main
id: codex
with:
openai_api_key: ${{ secrets.OPENAI_API_KEY }}
prompt: "You are working in a Node.js monorepo with Jest tests and GitHub Actions. Read the repository, run the test suite, identify the minimal change needed to make all tests pass, implement only that change, and stop. Do not refactor unrelated code or files. Keep changes small and surgical."
codex_args: '["--config","sandbox_mode=\"workspace-write\""]'
- name: Verify tests
run: npm test --silent
- name: Create pull request with fixes
if: success()
uses: peter-evans/create-pull-request@v6
with:
commit-message: "fix(ci): auto-fix failing tests via Codex"
branch: codex/auto-fix-${{ github.event.workflow_run.run_id }}
base: ${{ env.FAILED_HEAD_BRANCH }}
title: "Auto-fix failing CI via Codex"
body: |
Codex automatically generated this PR in response to a CI failure on workflow `${{ env.FAILED_WORKFLOW_NAME }}`.
Failed run: ${{ env.FAILED_RUN_URL }}
Head branch: `${{ env.FAILED_HEAD_BRANCH }}`
This PR contains minimal changes intended solely to make the CI pass.
```
## Step 2: Actions Workflow kicked off
You can navigate to the Actions tab in your repository to view the failing jobs in your Actions workflow.
The Codex workflow should be triggered upon completion of the failed workflow.
## Step 3: Verify that Codex Created a PR for Review
And after the Codex workflow completes execution, it should open a pull request from a feature branch named `codex/auto-fix-<run_id>`. Check that everything looks good, then merge it.
## Conclusion
This automation seamlessly integrates OpenAI Codex CLI with GitHub Actions to automatically propose fixes for failing CI runs.
By leveraging Codex, you can reduce manual intervention, accelerate code reviews, and keep your main branch healthy. The workflow ensures that test failures are addressed quickly and efficiently, letting developers focus on higher-value tasks. Explore more about codex-cli and its capabilities [here](https://github.com/openai/codex/).
---
# Source: https://developers.openai.com/codex/app/automations.md
# Automations
Automate recurring tasks in the background. Codex adds findings to the inbox, or automatically archives the task if there's nothing to report. You can combine automations with [skills](https://developers.openai.com/codex/skills) for more complex tasks.
Automations run locally in the Codex app. The app needs to be running, and the
selected project needs to be available on disk.
In Git repositories, each automation run starts in a new
[worktree](https://developers.openai.com/codex/app/worktrees) so it doesn't interfere with your main
checkout. In non-version-controlled projects, automations run directly in the
project directory.
## Managing tasks
All automations and their runs can be found in the automations pane inside your Codex app sidebar.
The "Triage" section acts as your inbox. Automation runs with findings show up there, and you can filter your inbox to show all automation runs or only unread ones.
When an automation runs in a Git repository, Codex uses a dedicated background [worktree](https://developers.openai.com/codex/app/features#worktree-support). In non-version-controlled projects, automations run directly in the project directory. Consider using Git to enable running on background worktrees. You can have the same automation run on multiple projects.
Automations use your default sandbox settings. In read-only mode, tool calls fail if they require modifying files, network access, or working with apps on your computer. With full access enabled, background automations carry elevated risk. You can adjust sandbox settings in [Settings](https://developers.openai.com/codex/app/settings) and selectively allowlist commands with [rules](https://developers.openai.com/codex/rules).
To keep automations maintainable and shareable across teams, you can use [skills](https://developers.openai.com/codex/skills) to define the action and provide tools and context to Codex. You can explicitly trigger a skill as part of an automation by using `$skill-name` inside your automation.
## Testing automations safely
Before you schedule an automation, test the prompt manually in a regular thread
first. This helps you confirm:
- The prompt is clear and scoped correctly.
- The selected model and tools behave as expected.
- The resulting diff is reviewable.
When you start scheduling runs, review the first few outputs closely and adjust
the prompt or cadence as needed.
## Worktree cleanup for automations
For Git repositories, automations run in worktrees. Frequent schedules can
create many worktrees over time. Archive automation runs you no longer need,
and avoid pinning runs unless you intend to keep their worktrees.
## Permissions and security model
Automations are designed to run unattended and use your default sandbox
settings.
- If your sandbox mode is **read-only**, tool calls fail if they require
modifying files, accessing the network, or working with apps on your computer.
Consider updating sandbox settings to workspace write.
- If your sandbox mode is **workspace-write**, tool calls fail if they require
modifying files outside the workspace, accessing the network, or working with
apps on your computer. You can selectively allowlist commands to run outside
the sandbox using [rules](https://developers.openai.com/codex/rules).
- If your sandbox mode is **full access**, background automations carry
elevated risk, as Codex may modify files, run commands, and access the network
without asking. Consider updating sandbox settings to workspace write, and
using [rules](https://developers.openai.com/codex/rules) to selectively define which commands the agent
can run with full access.
If you are in a managed environment, admins can restrict these behaviors using
admin-enforced requirements. For example, they can disallow `approval_policy =
"never"` or constrain allowed sandbox modes. See
[Admin-enforced requirements (`requirements.toml`)](https://developers.openai.com/codex/security#admin-enforced-requirements-requirementstoml).
Automations use `approval_policy = "never"` when your organization policy
allows it. If `approval_policy = "never"` is disallowed by admin requirements,
automations fall back to the approval behavior of your selected mode.
## Examples
### Automatically create new skills
```markdown
Scan all of the `~/.codex/sessions` files from the past day and if there have been any issues using particular skills, update the skills to be more helpful. Personal skills only, no repo skills.
If there’s anything we’ve been doing often and struggle with that we should save as a skill to speed up future work, let’s do it.
Definitely don't feel like you need to update any; only do it if there's a good reason!
Let me know if you make any.
```
### Stay up-to-date with your project
```markdown
Look at the latest remote origin/master or origin/main. Then produce an exec briefing for the last 24 hours of commits that touch
Formatting + structure:
- Use rich Markdown (H1 workstream sections, italics for the subtitle, horizontal rules as needed).
- Preamble can read something like “Here’s the last 24h brief for :”
- Subtitle should read: “Narrative walkthrough with owners; grouped by workstream.”
- Group by workstream rather than listing each commit. Workstream titles should be H1.
- Write a short narrative per workstream that explains the changes in plain language.
- Use bullet points and bolding when it makes things more readable
- Feel free to make bullets per person, but bold their name
Content requirements:
- Include PR links inline (e.g., [#123](...)) without a “PRs:” label.
- Do NOT include commit hashes or a “Key commits” section.
- It’s fine if multiple PRs appear under one workstream, but avoid per‑commit bullet lists.
Scope rules:
- Only include changes within the current cwd (or main checkout equivalent)
- Only include the last 24h of commits.
- Use `gh` to fetch PR titles and descriptions if it helps.
Also feel free to pull PR reviews and comments
```
### Combining automations with skills to fix your own bugs
Create a new `$recent-code-bugfix` skill that tries to fix a bug introduced by your own commits, and [store it in your personal skills](https://developers.openai.com/codex/skills#where-to-save-skills).
```markdown
---
name: recent-code-bugfix
description: Find and fix a bug introduced by the current author within the last week in the current working directory. Use when a user wants a proactive bugfix from their recent changes, when the prompt is empty, or when asked to triage/fix issues caused by their recent commits. Root cause must map directly to the author’s own changes.
---
# Recent Code Bugfix
## Overview
Find a bug introduced by the current author in the last week, implement a fix, and verify it when possible. Operate in the current working directory, assume the code is local, and ensure the root cause is tied directly to the author’s own edits.
## Workflow
### 1) Establish the recent-change scope
Use Git to identify the author and changed files from the last week.
- Determine the author from `git config user.name`/`user.email`. If unavailable, use the current user’s name from the environment or ask once.
- Use `git log --since=1.week --author=` to list recent commits and files. Focus on files touched by those commits.
- If the user’s prompt is empty, proceed directly with this default scope.
### 2) Find a concrete failure tied to recent changes
Prioritize defects that are directly attributable to the author’s edits.
- Look for recent failures (tests, lint, runtime errors) if logs or CI outputs are available locally.
- If no failures are provided, run the smallest relevant verification (single test, file-level lint, or targeted repro) that touches the edited files.
- Confirm the root cause is directly connected to the author’s changes, not unrelated legacy issues. If only unrelated failures are found, stop and report that no qualifying bug was detected.
### 3) Implement the fix
Make a minimal fix that aligns with project conventions.
- Update only the files needed to resolve the issue.
- Avoid adding extra defensive checks or unrelated refactors.
- Keep changes consistent with local style and tests.
### 4) Verify
Attempt verification when possible.
- Prefer the smallest validation step (targeted test, focused lint, or direct repro command).
- If verification cannot be run, state what would be run and why it wasn’t executed.
### 5) Report
Summarize the root cause, the fix, and the verification performed. Make it explicit how the root cause ties to the author’s recent changes.
```
Afterward, create a new automation:
```markdown
Check my commits from the last 24h and submit a $recent-code-bugfix.
```
---
# Source: https://developers.openai.com/cookbook/examples/partners/self_evolving_agents/autonomous_agent_retraining.md
# Self-Evolving Agents: A Cookbook for Autonomous Agent Retraining
## Overview
Agentic systems often reach a plateau after proof-of-concept because they depend on humans to diagnose edge cases and correct failures. This cookbook introduces a repeatable retraining loop that captures those issues, learns from the feedback, and promotes improvements back into production-like workflows. We ground the approach in a regulated healthcare documentation task, but the patterns generalize to any domain that demands accuracy, auditability, and rapid iteration.
### What You Will Learn
- Diagnose why an autonomous agent falls short of production readiness and instrument it with measurable feedback signals.
- Compare three prompt-optimization strategies—from quick manual iteration to fully automated loops—and understand when to reach for each.
- Assemble a self-healing workflow that combines human review, LLM-as-judge evals, and iterative prompt refinement.
### Who This Notebook Is For
- ML/AI engineers and solution architects who need to move beyond toy demos.
- Product and delivery teams looking for executable artifacts they can adapt into internal tooling or production pipelines.
### How to Work Through This Notebook
1. Start with Section 1 to understand the healthcare use case, baseline agent, and system architecture.
2. Use Section 2 to practice prompt optimization within the OpenAI Evals interface and collect structured feedback.
3. Run Section 3 to automate the optimization loop with graders, evals, and retraining logic.
4. Reference the appendix for reusable prompts, configurations, and evaluation templates as you tailor the workflow to your environment.
The notebook is modular—feel free to run sections independently or sequentially as you adapt the retraining loop to your own agents.
## 1. Use Case Overview: Self-Evolving Agents in Healthcare
### Problem Definition
For this cookbook, we focus on a **real-world use case**: drafting regulatory documents for pharmaceutical companies. These organizations must prepare and submit extensive documentation to regulatory authorities (e.g., the U.S. Food and Drug Administration) to obtain approval for new drugs. The accuracy and speed of these submissions are critical, as they directly impact how quickly life-saving treatments can reach patients.
Regulatory document drafting is a highly complex, iterative, and precision-driven process that requires deep scientific, medical, and compliance expertise. Despite the availability of advanced authoring tools, it remains labor-intensive and prone to human error. **Agentic systems offer substantial leverage** by assisting with research synthesis, content generation, and document structuring, yet human experts are still needed to ensure factual accuracy and regulatory compliance.
The key challenge is to design a feedback loop that enables these agentic systems to learn iteratively and refine model behavior over time. Such a system can gradually shift human effort from detailed correction to high-level oversight, improving efficiency while maintaining the rigorous standards required for regulatory submissions.
### Self-evolving Agent
The diagram below illustrates the iterative process for continuously improving an AI agent through feedback, meta prompting, and evaluation. The loop combines human judgment or automated feedback using an LLM-as-a-judge to iteratively enhance performance.
Figure 1 - Diagram showing the self-evolving loop for automated agent improvement.
The process consists of the following steps:
1. **Baseline Agent**
The process begins with a baseline agent. In this notebook, we use a deliberately simple example (an agent that summarizes sections of a document) to illustrate the iterative improvement loop. In real-world or enterprise settings, the baseline agent could be much more complex. The summaries it produces serve as the initial benchmark for subsequent evaluation and refinement.
2. **Human Feedback (or LLM-as-judge)**
The baseline agent’s outputs are then evaluated by human reviewers (e.g., in production environments), by an automated **LLM-as-judge** system, or both. This step gathers both quantitative and qualitative feedback that indicates how well the agent meets its goals — for instance, if we are testing the length of the summary, the feedback might be “the summary is too long” or a numerical score (generally between `0` and `1`) generated by an eval when assessing whether the summary is under 500 words.
3. **Evals and Aggregated Score**
Based on the collected feedback, new prompts are generated and tested through evaluations (**Evals**). These tests measure performance against predefined criteria, and the outcomes are combined into an aggregated score that reflects the overall performance. The loop continues until the score exceeds a target threshold (e.g., `0.8`) or the maximum number of retries is reached (e.g., `max_retry = 10`). If the retry limit is hit, engineers are alerted that manual improvements are required.
4. **Updated Baseline Agent**
Once an improved version achieves the target performance, it replaces the original baseline agent. This updated agent becomes the foundation for the next iteration, supporting a continuous cycle of learning, feedback, and optimization.
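The control flow above boils down to a simple retry loop. The sketch below is a toy, self-contained illustration of that loop; every helper in it is a hypothetical stand-in, and Sections 2 and 3 implement the real feedback collection, evals, and prompt optimization.

```python
# Toy sketch of the self-evolving loop. All helpers are hypothetical stand-ins
# for the real agent, evals, and metaprompting introduced later in this cookbook.
TARGET_SCORE = 0.8   # aggregated-score threshold for accepting a prompt
MAX_RETRY = 10       # alert engineers once this many attempts fail

def run_baseline_agent(prompt: str) -> str:
    # Stand-in for the summarizer agent.
    return f"summary generated with prompt: {prompt}"

def aggregated_score(prompt: str) -> float:
    # Stand-in for human feedback / LLM-as-judge evals: here, longer and more
    # specific prompts simply score higher.
    return min(1.0, 0.5 + 0.05 * len(prompt.split()))

def improve_prompt(prompt: str) -> str:
    # Stand-in for the metaprompting step that rewrites the prompt.
    return prompt + " Preserve every chemical name and technical detail."

prompt = "Summarize the section."
for attempt in range(1, MAX_RETRY + 1):
    summary = run_baseline_agent(prompt)
    score = aggregated_score(prompt)
    print(f"attempt {attempt}: score={score:.2f}")
    if score >= TARGET_SCORE:
        break  # promote the improved prompt as the new baseline
    prompt = improve_prompt(prompt)
else:
    print("Retry limit reached; manual improvement required.")
```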
### Dataset Overview
The dataset used for evaluation comprises ~70 sections extracted from the _Sample CMC Section for Hyperpolarized Pyruvate (13C) Injection_, publicly available [here](https://dctd.cancer.gov/drug-discovery-development/reagents-materials/imaging-ind-resources/documentation/13c-pyruvate-cmc.pdf). This dataset provides realistic, domain-specific content suitable for testing both scientific summarization and regulatory compliance behavior.
### Baseline Agent Overview
To keep this cookbook self-contained and easily reproducible, we simplified the regulatory drafting use case while retaining its essential complexity. In production, a typical regulatory authoring agent comprises multiple specialized sub-agents responsible for tasks such as drafting, data analysis, compliance checking, citation generation, and fact verification.
For this guide, we narrow the scope of the regulatory authoring agent to focus on the self-healing aspect of the system. Our regulatory authoring agent consists of two sub-agents:
- **A summarizer**: creates concise, scientific summaries.
- **A compliance checker**: evaluates each summary against key regulatory requirements (e.g., FDA 21 CFR Part 11).
Figure 2 - The baseline agent as created in the AgentBuilder UI.
For the remainder of this cookbook, we implemented a simplified version of the Summarizer agent (see the section **Agent Setup** below). Alternatively, you can reuse the code for the agent created with AgentBuilder. If you’d like to reproduce the agent directly from the AgentBuilder UI, here are the key prompts and parameters used:
- **Summarizer agent:** This agent used the file search tool, where the [CMC PDF](data/c13_pyruvate_sample_CMC_from_UCSF.pdf) was uploaded to the vector store.
> _Prompt:_ "Summarize section {{workflow.input_as_text}} from {{state.cmc_pdf}} uploaded to the vector store."
- **Compliance Checker agent:**
> _Prompt:_ "Verify that the summary below is compliant with FDA 21 CFR Part 11: {{input.output_text}}. If the summary is compliant, return _Compliant_. Otherwise, return _This section needs to be manually summarized_."
Both agents were configured with the default parameters - using GPT-5, low reasoning effort, and text as the output format.
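If you prefer to reproduce this setup in code rather than in the AgentBuilder UI, the sketch below shows one possible arrangement of the two sub-agents with the Agents SDK. It is a minimal illustration rather than the exported AgentBuilder workflow, and `VECTOR_STORE_ID` is a hypothetical placeholder for a vector store that already contains the CMC PDF.

```python
# Minimal sketch (not the AgentBuilder export) of the two sub-agents.
from agents import Agent, FileSearchTool

VECTOR_STORE_ID = "vs_..."  # hypothetical placeholder: a vector store containing the CMC PDF

summarizer_agent = Agent(
    name="Summarizer",
    instructions="Summarize the requested section of the CMC document stored in the vector store.",
    tools=[FileSearchTool(vector_store_ids=[VECTOR_STORE_ID])],
    model="gpt-5",
)

compliance_checker_agent = Agent(
    name="ComplianceChecker",
    instructions=(
        "Verify that the summary below is compliant with FDA 21 CFR Part 11. "
        "If the summary is compliant, return 'Compliant'. Otherwise, return "
        "'This section needs to be manually summarized'."
    ),
    model="gpt-5",
)
```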
### Evaluation Approach
To evaluate the baseline agent, there are two main approaches:
1. **Collecting Human Feedback.** This approach involves gathering feedback from human users through the OpenAI Evals platform (or a custom UI built for a specific application). It is best suited for production settings or when piloting a tool where subject matter experts (SMEs) interact with the tool in real-world scenarios. This method helps uncover edge cases that may not have been identified during development. On the Evals platform, users can provide thumbs-up or thumbs-down ratings and share qualitative feedback about the summaries.
2. **Using an LLM-as-a-Judge.** This option is typically used during the development phase, enabling fast feedback loops without requiring SMEs' time. An **LLM-as-a-judge** uses an LLM to automatically evaluate and score the agent’s outputs based on predefined criteria. It can also be used for monitoring model drift (e.g., in production) or validating changes between model versions (e.g., switching between `gpt-5` and `gpt-5-mini`).
This cookbook demonstrates both approaches:
- **Section 2** shows the platform UI approach for manual prompt optimization
- **Section 3** implements the fully automated API approach using LLM-as-a-judge
_Note: The Evals platform does not yet provide an API to retrieve user feedback programmatically._
## 2. Using the OpenAI Evals Platform
The OpenAI Evals platform provides an intuitive interface for prompt optimization and evaluation. This section demonstrates the complete workflow from dataset upload through iterative prompt improvement, showing how you can leverage the platform's visual interface to optimize your prompts before implementing automated solutions.
### Step 1: Upload Dataset
To begin using the OpenAI Evaluation platform, you'll first need to upload your dataset:
1. Click the **+ Create** button
2. Define the dataset name
3. Upload a CSV file and select the columns to keep
4. Upload
Your dataset should contain the documents or document sections that need to be summarized. Each row represents one input that will be processed by your system.
### Step 2: Explore Your Data
Once uploaded, you can explore your dataset. Click the dataset name to explore the uploaded data. This allows you to verify that your data is properly formatted and contains the expected content before proceeding with prompt configuration.
### Step 3: Configure Initial Prompt
This is where you define your initial system prompt and configure how data flows through your model.
Figure 3 - The platform's "New prompt" interface showing model configuration, variables, and system message settings.
#### Configuration Steps
1. **System Prompt**: Add the system message that defines the model's task and behavior (this prompt will be optimized)
2. **User Prompt Template**: Add the prompt message template for user messages, using variables such as `{{}}` that get replaced with actual data from your dataset
3. **Model Selection**: Choose the model for generation (e.g., gpt-4.1, gpt-5)
4. **Temperature**: Configure creativity vs. determinism
You can start with a very simple prompt to demonstrate the power of the optimization process. For example, beginning with just "summarize" shows how the system can evolve from a minimal starting point.
### Step 4: Generate Outputs
Once your prompt is configured, you're ready to generate outputs across your dataset. The prompt runs once per row, and the output is written to a new **Output** column.
1. Click **"Generate Output"**
2. The platform runs your prompt against all samples
3. Results appear in a new **Output** column
The platform will process each row in your dataset, replacing template variables with actual values and calling the model with your system prompt. This creates a baseline of outputs that you can evaluate.
### Step 5: Review and Evaluate
Evaluation is where you provide structured feedback to guide prompt improvement.
#### Review Outputs
1. **Add Evaluation Columns** if not automatically added - Click "Columns" → "Annotations" → "Add":
- **Rating** - Binary (good/bad) or numeric ratings
- **Feedback** - Text describing what needs improvement
2. **Provide Rating and Feedback** - Add your assessment for each output.
Depending on the quality of the output, you may select a good or bad rating and explain your score based on how you would like the answer to be improved. For example:
| Rating | Feedback |
|---|---|
| Good | Good, but only the answer should be provided. The output should not include headers or any text other than the answer. |
| Bad | The information is good, but it should be presented as bullet points. |
| Good | Good summary; it is clear. |
| Bad | Use bullet points when answering to improve readability. Summarize each sub-section individually. |
3. **Save Annotations** - Your feedback is saved with the evaluation run
Figure 4 - The evaluation interface showing generated outputs with rating and feedback columns for annotation.
This structured feedback becomes the foundation for automatic prompt optimization.
### Step 6: Optimize Prompt
After collecting feedback, the platform can automatically generate an improved prompt.
1. Click **"Optimize"**
2. A new prompt version is generated in a new tab
3. Click **"View Prompt"** to see the improved version
Figure 5 - The improved prompt generated by the platform, showing detailed instructions and requirements.
### Step 7: Iterate and Compare
With your improved prompt ready, start a new iteration to measure improvement.
1. Click **"Generate Output"**
2. Review the new results and provide feedback on any remaining issues
3. Click **"Optimize"** again if needed
4. Repeat until satisfied
The platform's tab structure allows you to compare performance across iterations. You can easily see how outputs evolved from your initial prompt to the optimized versions.
Figure 6 - Feedback and evaluation results for the optimized prompt, showing improvements in output quality.
#### When to Stop Iterating
Continue the optimization cycle until:
- **Quality threshold reached**: >80% of outputs receive positive feedback
- **Diminishing returns**: New iterations show minimal improvement
- **Specific issues resolved**: All identified failure modes are addressed
This platform-based approach provides an excellent foundation for understanding prompt optimization before moving to automated implementations. The visual interface makes it easy to see the impact of changes and understand the optimization process.
## 3. Self-evolving Loop with LLM-as-a-Judge
This section introduces a fully automated evaluation workflow using an LLM-as-a-Judge through the OpenAI API, eliminating the need for any user interface. This approach enables scalable, programmatic assessment of agent performance, supporting rapid iteration and continuous model monitoring in production.
```python
# gepa and litellm are only required for Section 4.b (prompt optimization with GEPA)
%pip install --upgrade openai openai-agents pydantic pandas gepa litellm python-dotenv -qqq
%load_ext dotenv
%dotenv
# Place your API key in a file called .env
# OPENAI_API_KEY=sk-...
```
### Eval Creation
To evaluate the baseline summarization agent, we use four complementary graders that balance deterministic checks with semantic judgment.
| Grader | Type | Pass threshold | What it checks | Why |
|---|---|---:|---|---|
| Chemical string name | `python` | 0.8 | If any exact chemical names in the section appear in the summary. | Forces preservation of critical domain entities so summaries don’t omit chemically meaningful terms. |
| Summarization length | `python` | 0.85 | Inverse deviation from an expected 100-word length. | Keeps summaries concise and comparable, reducing verbosity that can mask poor content. |
| Cosine similarity | `text_similarity` | 0.85 | Cosine similarity between section and summary texts. | Ensures the summary stays anchored to the source content rather than drifting semantically. |
| LLM-as-judge | `score_model` | 0.85 | A rubric-driven score from a model acting as an evaluator. | Captures nuanced quality signals that rule-based metrics miss, improving overall robustness. |
**Notes**
- The two Python graders catch domain fidelity and length discipline early, which stabilizes optimization before semantic tuning.
- Text similarity guards against superficial rephrasing that strays from the source.
- The LLM judge provides a holistic failsafe when edge cases slip past deterministic checks.
```python
import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
data_source_config = {
"type": "custom",
"item_schema": {
"type": "object",
"properties": {"section": {"type": "string"}, "summary": {"type": "string"}},
"required": ["section", "summary"],
},
"include_sample_schema": False,
}
testing_criteria = [
{
"type": "python",
"name": "chemical_name_grader",
"image_tag": "2025-05-08",
"pass_threshold": 0.8,
"source": r"""def grade(sample: dict, item: dict) -> float:
section = item["section"]
summary = item["summary"]
CHEMICALS_MASTER = ["[1-¹³C]Pyruvic acid","[1-¹³C]Pyruvate","¹²C Pyruvic acid","Sodium [1-¹³C]pyruvate","Sodium pyruvate (¹²C)","AH111501 (Trityl radical)","Tris{8-carboxyl-2,2,6,6-tetra[2-(1-methoxyethyl)]-benzo(1,2-d:4,5-d’)bis(1,3)dithiole-4-yl}methyl acid","AH111501 sodium salt","Methyl, tris[8-carboxy-2,2,6,6-tetrakis(2-methoxyethyl)benzo[1,2-d:4,5-d’]bis[1,3]dithiol-4-yl]-, trisodium salt","AH111501 trisodium salt","AH111576","2,2′,2″,2‴-(4,8-Dibromobenzo[1,2-d:4,5-d′]bis([1,3]dithiole)-2,2,6,6-tetrayl)tetraethanol","AH111586","4,8-Dibromo-2,2,6,6-tetrakis(2-methoxyethyl)benzo[1,2-d:4,5-d′]bis([1,3]dithiole)","AH111709","AH111743","AH112615","4,4-Bis-hydroxymethyl-2-methyl-oxazolidine-2-carboxylic acid","AH112623","Parapyruvate","2-Hydroxy-2-methyl-4-oxo-pentanedioic acid","AH113127","(4-Hydroxymethyl-oxazolidin-4-yl)-methanol","AH113462/E","Enol lactone","AH113462/K","Keto lactone","Acetyl bromide","Methanol","Dimethyl sulfoxide","DMSO","Tetrahydrofuran","THF","Acetonitrile","ACN","Diethyl ether","Et₂O","N,N-Dimethylacetamide","DMA","1,3-Dimethyl-2-imidazolidinone","DMI","Hydrochloric acid","HCl","Sodium hydroxide","NaOH","Disodium ethylenediaminetetraacetate","Na₂EDTA","Ethylenediaminetetraacetic acid","EDTA","Tris(hydroxymethyl)aminomethane","TRIS","Trometamol","Trifluoroacetic acid","TFA","Toluene","Heptane","Ethyl acetate","Ethanol","Water","H₂O","Sodium chloride","NaCl","Cuprous [1-¹³C]cyanide","Cu¹³CN","Gadolinium","Gd","Tin","Sn","Phosphorus","P","Carbon dioxide","CO₂","Sodium [1-13C]pyruvate","[1-13C]Pyruvic acid","1-13C pyruvate"]
# Identify the chemicals present in the section
present = [chem for chem in CHEMICALS_MASTER if chem in section]
# If no chemicals present, consider it satisfied
if not present:
return 1.0
correct = 0
for chem in present:
# Only count as correct if the exact chemical string appears in the summary
if chem in summary:
correct += 1
return correct / len(present)""",
},
{
"type": "python",
"name": "word_length_deviation_grader",
"image_tag": "2025-05-08",
"pass_threshold": 0.85,
"source": r"""
def grade(sample: dict, item: dict) -> float:
summary = item["summary"]
word_count = len(summary.split())
expected_summary_length = 100
tolerance = 0.2 # 20% band around target
# relative deviation
deviation = abs(word_count - expected_summary_length) / expected_summary_length
# If within tolerance band → full score
if deviation <= tolerance:
return 1.0
# Outside band → score decays linearly, capped at 0
# e.g., deviation 0.3 → score 0.8, deviation 1.0+ → 0.0
score = 1.0 - (deviation - tolerance)
return max(0.0, score)
""",
},
{
"name": "cosine_similarity",
"type": "text_similarity",
"input": "{{ item.summary }}",
"reference": "{{ item.section }}",
"evaluation_metric": "cosine",
"pass_threshold": 0.85,
},
{
"name": "llm_as_judge",
"type": "score_model",
"model": "gpt-4.1",
"input": [
{
"role": "system",
"content": (
"You are an expert technical summarization evaluator. "
"Evaluate whether the summary captures and preserves the important technical facts and specific details from the section, allowing for occasional minor rewording or omissions of less important points, but not major technical inaccuracies or information loss.\n\n"
"Scoring Guidelines:\n"
"- Return a numerical score between 0 and 1 (with up to two decimal places).\n"
"- A score of 1 means the summary is almost flawless: it is comprehensive, highly faithful, and technically accurate, with virtually no important or meaningful details missing, and no significant misstatements or distortions.\n"
"- 0.75-0.99 indicates excellent work: all main facts are represented, but there may be trivial omissions or very minor rewording that do not materially affect understanding.\n"
"- 0.5-0.75 indicates good but imperfect: most technical information is retained and correctly presented, some less critical details might be missing or slightly rephrased, but overall fidelity is preserved.\n"
"- 0.3-0.5 means significant information is missing, or some technical inaccuracies are present, but the summary retains a reasonable portion of key facts.\n"
"- 0.0-0.3 means there are major omissions, misunderstandings, or a failure to capture the most important technical content.\n\n"
"Respond only with a single number between 0 and 1 indicating summary quality by these criteria."
),
},
{
"role": "user",
"content": (
"Section:\n{{item.section}}\n"
"Summary:\n{{sample.output_text}}"
),
},
],
"range": [0, 1],
"pass_threshold": 0.85,
},
]
eval = client.evals.create(
name="self_evolving_eval",
data_source_config=data_source_config,
testing_criteria=testing_criteria,
)
print(f"Created Eval: {eval.id}")
```
You should see an eval ID in the output, e.g. `eval_...`. This is the ID of the eval we just created (as shown below).
Figure 7 - The platform's Eval interface showing data source configuration and test criteria settings.
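If you restart the notebook later, there is no need to recreate the eval; you can reload it by its ID. A small sketch, assuming you substitute the `eval_...` ID printed above:

```python
# Reload a previously created eval by ID instead of recreating it.
existing_eval = client.evals.retrieve("eval_...")  # paste your eval ID here
print(existing_eval.id, existing_eval.name)
```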
### Grader Scoring and Parsing
Next, we'll need to run the evals on the summarization agent's output and parse the results for the eval's grader scores. To do this, we'll use a few helper functions:
- `run_eval`: Simple runner to call the evals API with proper formatting
- `poll_eval_run`: A polling utility to wait for the scheduled eval run to complete
- `parse_eval_run_output`: Parses the eval run and returns a structured output for the feedback loop
```python
import time
import json
def run_eval(eval_id: str, section: str, summary: str):
"""Creates a run of the eval with the input section and output summary."""
return client.evals.runs.create(
eval_id=eval_id,
name="self-evolving-eval",
data_source={
"type": "jsonl",
"source": {
"type": "file_content",
"content": [
{
"item": {
"section": section,
"summary": summary,
}
}
],
},
},
)
def poll_eval_run(eval_id: str, run_id: str, max_polls = 10):
"""
Polls the evaluation run until completion or timeout.
This function exists to handle asynchronous behavior in the eval service by
periodically checking run status. It balances responsiveness and resource use by
polling at fixed intervals rather than blocking indefinitely. The retry limit
prevents runaway loops in cases where the service never returns a completed status.
"""
run = None
for attempt in range(1, max_polls + 1):
run = client.evals.runs.retrieve(eval_id=eval_id, run_id=run_id)
if run.status == "completed":
break
if attempt == max_polls:
print("Exceeded retries, aborting")
break
time.sleep(5)
run_output_items = client.evals.runs.output_items.list(
eval_id=eval_id, run_id=run_id
)
return run_output_items
def parse_eval_run_output(items):
"""Extract all grader scores and any available conclusion outputs."""
all_results = []
for item in items.data:
for result in item.results:
grader_name_full = result.name
score = result.score
passed = result.passed
reasoning = None
try:
sample = result.sample
if sample:
content = result.sample["output"][0]["content"]
content_json = json.loads(content)
steps = content_json["steps"]
reasoning = " ".join([step["conclusion"] for step in steps])
except Exception:
pass
all_results.append(
{
"grader_name": grader_name_full,
"score": score,
"passed": passed,
"reasoning": reasoning,
}
)
return all_results
```
Now we can use the eval ID created earlier and run the graders against an arbitrary input section and summary output. This forms the backbone of the feedback loop that will kick off the prompt optimization routine.
### Eval execution run
Let's test our evals by providing a section and a generated summary directly.
```python
EVAL_ID = eval.id #Created eval ID from above cell
SECTION = "3.2.S.1 General Information ([1-13C]pyruvic acid) The active ingredient in Hyperpolarized Pyruvate (13C) Injection is hyperpolarized [1-13C]pyruvate. The drug substance is defined as [13C]pyruvic acid, which is neutralized to [1-13C]pyruvate during the compounding process. In several pre-clinical and clinical studies and during evaluation of stability, pyruvic acid has been used instead of [1-13C]pyruvic acid (see Sections 3.2.P.2.2.1 Formulation Development for Hyperpolarized Pyruvate (13C) Injection and Section 8.1 Introduction for Item 8 Pharmacology and Toxicology Info). In the Section 3.2.S Drug Substance, data are presented for both pyruvic acid and for [1-13C]pyruvic acid. For simplicity, the terminology used in headings and captions is [1-13C]pyruvic acid. Batches containing pyruvic acid are specified by footnotes. 3.2.S.1.1 Nomenclature ([1-13C]pyruvic acid) The drug substance used for compounding of Hyperpolarized Pyruvate (13C) Injection is [1-13C]pyruvic acid. Company code: W6578 Chemical name: [1-13C]pyruvic acid CAS registry number: 127-17-3 3.2.S.1.2 Structure ([1-13C]pyruvic acid) Figure 1 Structure of [1-13C]pyruvic acid Molecular formula: C H O 3 4 3 Molecular weight: 89.06 3.2.S.1.3 General Properties ([1-13C]pyruvic acid) Appearance: Colorless to yellow, clear, viscous liquid pKa:Ka:aranWater solubility: Complete The structure of [1-13C]pyruvic acid has been confirmed by spectroscopic analysis (see Section 3.2.S.3.1 Elucidation of Structure and other Characteristics)."
SUMMARY = "The active ingredient in Hyperpolarized Pyruvate (13C) Injection is hyperpolarized [1-13C]pyruvate, derived from [1-13C]pyruvic acid (neutralized during compounding). Both pyruvic acid and [1-13C]pyruvic acid were used in studies and stability evaluations, but the documentation refers to [1-13C]pyruvic acid unless otherwise noted. The drug substance ([1-13C]pyruvic acid, CAS 127-17-3) is a colorless to yellow, clear, viscous liquid with a molecular formula C3H4O3 and molecular weight 89.06. Its structure has been confirmed by spectroscopic analysis, and it is completely soluble in water."
eval_run = run_eval(EVAL_ID, section=SECTION, summary=SUMMARY)
run_output = poll_eval_run(eval_id=EVAL_ID, run_id=eval_run.id)
grader_scores = parse_eval_run_output(run_output)
print(grader_scores)
```
You should see a list of grader scores in the output, e.g.
```
[{'grader_name': 'chemical_name_grader-', 'score': 0.5, 'passed': False, 'reasoning': None}, {'grader_name': 'word_length_deviation_grader-', 'score': 0.8, 'passed': True, 'reasoning': None}, {'grader_name': 'cosine_similarity-', 'score': 0.9104484223477793, 'passed': True, 'reasoning': None}, {'grader_name': 'llm_as_judge-', 'score': 0.8, 'passed': True, 'reasoning': 'The summary needs to include specific details from the section. Part of the essential information is captured. Key pieces of information are missing. Not all relevant structural information is included.'}]
```
Running this script, we can see that most of our graders pass except the `chemical_name_grader`. Next, we'll programmatically recognize this as an opportunity to improve the summarization agent.
_Note: When you run it locally, graders other than `chemical_name_grader` may fail at first. This is normal, as graders can initially fail, but the results should improve through the feedback loop. Early failures simply reflect the model adjusting its responses before converging on more accurate results._
### Dashboard Observability
Eval runs and results can also be seen in the OpenAI Dashboard:
Figure 8 - Eval dashboard showing evaluation runs and results.
We can also drill down into a specific eval run:
Figure 9 - Detailed eval run results showing grader scores and performance metrics.
## Agent Setup
Now that we have our evals and graders set up, we can go back to our summarization agent.
For simplicity, we will provide the code for a simple agent below. You could also use `AgentBuilder`, as shown in Figure 2, and export the code from the UI.
We will also need a metaprompt optimization agent to optimize our prompt, as well as some simple utilities to handle prompt versions:
- `PromptVersionEntry`: A pydantic model used to track the prompt and metadata as it changes in production
- `VersionedPrompt`: A utility class to track prompt versions. This will be important in production for analyzing how the prompt evolves and for ensuring there is a fallback history in case of a regression
```python
from datetime import datetime
from typing import Any, Optional
from pydantic import BaseModel, Field, ConfigDict, field_validator
class PromptVersionEntry(BaseModel):
"""Data model for a prompt and associated data for observability"""
version: int = Field(
..., ge=0, description="Version number of the prompt (increments)"
)
model: str = Field(
"gpt-5",
min_length=1,
description="The model version to use for this version of the prompt, defaults to gpt-5",
)
prompt: str = Field(
..., min_length=1, description="The prompt text for this version"
)
timestamp: datetime = Field(
default_factory=datetime.utcnow,
description="UTC timestamp when this version was created",
)
eval_id: Optional[str] = Field(
None, description="ID of the evaluation associated with this prompt version"
)
run_id: Optional[str] = Field(
None, description="ID of the run associated with this prompt version"
)
metadata: Optional[dict[str, Any]] = Field(
None, description="Free-form metadata dict (e.g., section, summary)"
)
model_config = ConfigDict(
str_strip_whitespace=True, validate_assignment=True, extra="forbid"
)
@field_validator("prompt")
@classmethod
def prompt_not_blank(cls, v: str) -> str:
if not v.strip():
raise ValueError("prompt must not be blank or only whitespace")
return v
class VersionedPrompt:
"""Manages a collection of prompt versions and provides controlled updates and rollbacks."""
def __init__(
self,
initial_prompt: str,
model: Optional[str] = "gpt-5",
eval_id: Optional[str] = None,
run_id: Optional[str] = None,
metadata: Optional[dict[str, Any]] = None,
):
if not initial_prompt or not initial_prompt.strip():
raise ValueError("initial_prompt must be non-empty")
self._versions: list[PromptVersionEntry] = []
first_entry = PromptVersionEntry(
version=0,
prompt=initial_prompt,
model=model,
eval_id=eval_id,
run_id=run_id,
metadata=metadata,
)
self._versions.append(first_entry)
def update(
self,
new_prompt: str,
model: Optional[str] = "gpt-5",
eval_id: Optional[str] = None,
run_id: Optional[str] = None,
metadata: Optional[dict[str, Any]] = None,
) -> PromptVersionEntry:
if not new_prompt or not new_prompt.strip():
raise ValueError("new_prompt must be non-empty")
version = self.current().version + 1
entry = PromptVersionEntry(
version=version,
prompt=new_prompt,
model=model,
eval_id=eval_id,
run_id=run_id,
metadata=metadata,
)
self._versions.append(entry)
return entry
def current(self) -> PromptVersionEntry:
return self._versions[-1]
def revert_to_version(self, version: int) -> PromptVersionEntry:
idx = None
for i, entry in enumerate(self._versions):
if entry.version == version:
idx = i
break
if idx is None:
raise ValueError(f"No version found with version={version}")
self._versions = self._versions[: idx + 1]
return self._versions[-1]
```
Next we'll create the starting summarization and prompt optimization agents.
_Note: We created a wrapper to track prompt changes in the summarization agent since it is expected to evolve in production; the metaprompt agent's prompt will stay static for the purposes of this cookbook._
```python
from agents import Agent
METAPROMPT_TEMPLATE = """
# Context:
## Original prompt:
{original_prompt}
## Section:
{section}
## Summary:
{summary}
## Reason to improve the prompt:
{reasoning}
# Task:
Write a new summarization prompt that is significantly improved and more specific than the original.
The new prompt should instruct the model to produce concise yet comprehensive technical summaries that precisely preserve all explicit information from the source text. It should emphasize the inclusion of all named entities, quantities, compounds, and technical terminology without paraphrasing or omission. The resulting prompt should read like a clear, directive system message for a technical summarization assistant—structured, unambiguous, and generalizable across scientific or regulatory document sections.
"""
metaprompt_agent = Agent(
name="MetapromptAgent", instructions="You are a prompt optimizer."
)
summarization_prompt = VersionedPrompt(
initial_prompt="""You are a summarization assistant.
Given a section of text, produce a summary."""
)
def make_summarization_agent(prompt_entry: PromptVersionEntry) -> Agent:
return Agent(
name="SummarizationAgent",
instructions=prompt_entry.prompt,
model=prompt_entry.model,
)
summarization_agent = make_summarization_agent(summarization_prompt.current())
# Cache eval results by section + summary so repeated attempts do not trigger redundant grader runs.
eval_cache: dict[tuple[str, str], list[dict[str, Any]]] = {}
# Track the highest-scoring candidate that also passes the lenient score threshold.
best_candidate: dict[str, Any] = {
"score": float("-inf"),
"prompt": summarization_prompt.current().prompt,
"model": summarization_prompt.current().model,
"summary": None,
"metadata": None,
"version": summarization_prompt.current().version,
"passed_lenient": False,
"total_score": float("-inf"),
}
# Aggregate per-version performance so we can pick the strongest total scorer at the end.
aggregate_prompt_stats: dict[int, dict[str, Any]] = {}
```
### Orchestration and Monitoring
This is what we've done so far - we've created:
- Evals with 4 graders that will assess the outputs and produce a score for each grader
- A summarization agent with a versioned prompt class to track changes to the prompt and model
- A metaprompt optimization agent that will attempt to update the prompt based on a set of reasoning
Now these different functionalities can be composed to orchestrate the self-evolving loop with Agent tracing in the OpenAI dashboard.
Keep in mind that this is a simplified example. In a real-world scenario, you'd want to ensure you have guardrails for optimization attempts and that an alert notifies a human when a guardrail is triggered.
_Note: Due to practical limitations of the cookbook we are simulating a stream of data by feeding in a static dataset and using `print` statements in place of true observability._
### Orchestration Utilities
As in previous sections we'll create some utilities to manage the orchestration logic of the feedback loop.
```python
import asyncio
from typing import Any, Optional
from agents import Runner
LENIENT_PASS_RATIO = 0.75 # 75% of graders must pass (binary)
LENIENT_AVERAGE_THRESHOLD = 0.85 # 85% average score across graders
def reset_best_candidate() -> None:
"""Reset the best candidate tracker for a new optimization run."""
global best_candidate
current = summarization_prompt.current()
best_candidate = {
"score": float("-inf"),
"prompt": current.prompt,
"model": current.model,
"summary": None,
"metadata": None,
"version": current.version,
}
def reset_best_trackers() -> None:
"""Reset both the best-candidate tracker and aggregate stats."""
reset_best_candidate()
aggregate_prompt_stats.clear()
def update_best_candidate(
*,
average_score: Optional[float] = None,
prompt_text: str,
model_name: str,
summary_text: str = None,
metadata: dict[str, Any] = None,
lenient_passed: bool = False,
prompt_version: int = None,
total_score: Optional[float] = None,
score: Optional[float] = None,
) -> None:
"""Persist the best lenient-passing candidate."""
global best_candidate
if prompt_version is None:
prompt_version = summarization_prompt.current().version
if average_score is None:
average_score = score
if average_score is None:
return
if lenient_passed:
best_candidate.update(
{
"score": average_score,
"prompt": prompt_text,
"model": model_name,
"summary": summary_text,
"metadata": metadata,
"version": prompt_version,
"total_score": total_score if total_score is not None else average_score,
}
)
def apply_best_candidate_if_needed() -> Agent:
"""Ensure summarization_prompt reflects the best prompt candidate."""
if best_candidate["score"] > float("-inf"):
current = summarization_prompt.current()
target = best_candidate
# Only update if different
if (
current.prompt != target["prompt"]
or current.model != target["model"]
or current.version != target.get("version")
):
summarization_prompt.update(
new_prompt=target["prompt"],
model=target["model"],
metadata=target.get("metadata"),
)
target["version"] = summarization_prompt.current().version
return make_summarization_agent(summarization_prompt.current())
return make_summarization_agent(summarization_prompt.current())
def record_aggregate_prompt_score(
*,
prompt_version: int,
prompt_text: str,
model_name: str,
average_score: float,
total_score: Optional[float] = None,
) -> None:
"""Accumulate per-version grader scores for aggregate selection."""
stats = aggregate_prompt_stats.setdefault(
prompt_version,
{
"version": prompt_version,
"prompt": prompt_text,
"model": model_name,
"total_score": 0.0,
"total_average": 0.0,
"count": 0,
},
)
stats["total_score"] += total_score if total_score is not None else average_score
stats["total_average"] += average_score
stats["count"] += 1
stats["prompt"] = prompt_text
stats["model"] = model_name
def select_best_aggregate_prompt() -> Optional[dict[str, Any]]:
"""Return the prompt version with the highest cumulative score."""
if not aggregate_prompt_stats:
return None
return max(
aggregate_prompt_stats.values(),
key=lambda entry: (
entry.get("total_score", float("-inf")),
entry.get("version", -1),
),
)
async def get_eval_grader_score(eval_id: str, section: str, summary: str):
"""Retrieve grader scores for a section-summary pair with caching."""
cache_key = (section, summary)
if cache_key in eval_cache:
return eval_cache[cache_key]
eval_run = run_eval(eval_id=eval_id, section=section, summary=summary)
run_output = poll_eval_run(eval_id=eval_id, run_id=eval_run.id)
results = parse_eval_run_output(run_output)
eval_cache[cache_key] = results
return results
def calculate_grader_score(grader_scores):
"""Simple average score of all graders from the eval."""
if not grader_scores:
return 0.0
score_sum = 0.0
for entry in grader_scores:
score_sum += entry.get("score", 0.0)
return score_sum / len(grader_scores)
def calculate_total_grader_score(grader_scores):
"""Sum of all grader scores for aggregate tracking."""
if not grader_scores:
return 0.0
return sum(entry.get("score", 0.0) for entry in grader_scores)
DEFAULT_PASSING_FEEDBACK = (
"All graders passed; tighten factual coverage, chemical completeness, and conciseness."
)
def is_lenient_pass(grader_scores, average_score: float) -> bool:
if not grader_scores:
return False
passed_count = sum(1 for entry in grader_scores if entry.get("passed"))
total_graders = len(grader_scores)
if total_graders and (passed_count / total_graders) >= LENIENT_PASS_RATIO:
return True
return average_score >= LENIENT_AVERAGE_THRESHOLD
def collect_grader_feedback(grader_scores):
"""Consolidate grader reasoning into actionable feedback for the metaprompt agent."""
feedback_lines = []
for entry in grader_scores:
grader = entry.get("grader_name", "")
passed = entry.get("passed", False)
reasoning = entry.get("reasoning")
if not passed:
if grader.startswith("chemical_name_grader"):
feedback_lines.append(
"Not all chemical names in the input section were included in the summary."
)
elif grader.startswith("word_length_deviation_grader"):
feedback_lines.append(
"The summary length deviates too much from the expected length."
)
elif grader.startswith("cosine_similarity"):
feedback_lines.append(
"The summary is not sufficiently similar to the source section (cosine similarity too low)."
)
elif grader.startswith("llm_as_judge") and reasoning:
feedback_lines.append(reasoning)
if not feedback_lines:
feedback_lines.append(DEFAULT_PASSING_FEEDBACK)
return "".join(feedback_lines)
```
### Self-evolving loop
Now to simulate a stream of requests for summarization we'll feed in a prepared dataset and observe the optimization evolve from a naive prompt.
> The referenced `dataset.csv` can be found in the GitHub repository.
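The loop below relies on just two columns: `section_number` (the section identifier) and `content` (the section text). Here is a hypothetical illustration of that layout:

```python
# Hypothetical illustration of the dataset.csv layout expected by the loop:
# one row per document section, with a section identifier and the section text.
import pandas as pd

example = pd.DataFrame(
    [
        {"section_number": "3.2.S.1", "content": "General Information ([1-13C]pyruvic acid) ..."},
        {"section_number": "3.2.S.1.1", "content": "Nomenclature ([1-13C]pyruvic acid) ..."},
    ]
)
print(example)
# example.to_csv("data/dataset.csv", index=False)  # uncomment to write a small stand-in file
```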
```python
import pandas as pd
from agents import Agent, trace
EVAL_ID = eval.id #Created eval ID from above cell
MAX_OPTIMIZATION_RETRIES = 3
async def self_evolving_loop(summarization_agent: Agent) -> Agent:
print(f"Starting self-evolving loop | Initial prompt v{summarization_prompt.current().version}")
print(f"Prompt:{summarization_prompt.current().prompt}")
print("-" * 80)
reset_best_trackers()
df = pd.read_csv("data/dataset.csv")
with trace("Self-evolving Optimization Workflow"):
for _, row in df.head().iterrows():
content = row.get("content")
if pd.isna(content) or (isinstance(content, str) and not content.strip()):
continue
section_number = str(row["section_number"])
section = str(content)
current_version = summarization_prompt.current().version
print(f"[Section {section_number}] Using prompt v{current_version}")
optimization_success = False
for attempt in range(1, MAX_OPTIMIZATION_RETRIES + 1):
print(f" Attempt {attempt}: evaluating summary...")
summary_result = await Runner.run(summarization_agent, section)
summary = summary_result.final_output
grader_scores = await get_eval_grader_score(eval_id=EVAL_ID, summary=summary, section=section)
average_score = calculate_grader_score(grader_scores)
total_score = calculate_total_grader_score(grader_scores)
lenient_passed = is_lenient_pass(grader_scores, average_score)
print(
f" Scores — avg={average_score:.3f}, total={total_score:.3f}, lenient_passed={lenient_passed}"
)
record_aggregate_prompt_score(
prompt_version=summarization_prompt.current().version,
prompt_text=summarization_prompt.current().prompt,
model_name=summarization_prompt.current().model,
average_score=average_score,
total_score=total_score,
)
update_best_candidate(
average_score=average_score,
prompt_text=summarization_prompt.current().prompt,
model_name=summarization_prompt.current().model,
summary_text=summary,
metadata={
"section": section_number,
"average_score": average_score,
"grader_results": grader_scores,
"prompt_version": summarization_prompt.current().version,
},
lenient_passed=lenient_passed,
prompt_version=summarization_prompt.current().version,
)
if lenient_passed:
optimization_success = True
print(f" Passed with prompt v{summarization_prompt.current().version}")
break
print(" Failed eval. Improving prompt...")
eval_feedback = collect_grader_feedback(grader_scores)
metaprompt_result = await Runner.run(
metaprompt_agent,
input=METAPROMPT_TEMPLATE.format(
original_prompt=summarization_prompt.current().prompt,
section=section,
summary=summary,
reasoning=eval_feedback,
),
)
improved_prompt = metaprompt_result.final_output
summarization_prompt.update(
new_prompt=improved_prompt,
metadata={"section": section, "summary": summary},
)
summarization_agent = make_summarization_agent(summarization_prompt.current())
print(f" Prompt improved → v{summarization_prompt.current().version}")
if not optimization_success:
print(
" All attempts failed; keeping latest prompt version "
f"v{summarization_prompt.current().version} for the next section."
)
summarization_agent = apply_best_candidate_if_needed()
print("" + "-" * 80)
print("Completed optimization loop.")
print(f"Final prompt version: v{summarization_prompt.current().version}")
if best_candidate["score"] > float("-inf"):
print(
f"Best lenient prompt: v{best_candidate.get('version')} (avg={best_candidate['score']:.3f})"
)
aggregate_best = select_best_aggregate_prompt()
if aggregate_best:
per_section = (
aggregate_best.get("total_average", 0.0) / aggregate_best.get("count", 1)
if aggregate_best.get("count")
else 0.0
)
print(
f"Aggregate best prompt: v{aggregate_best.get('version')} "
f"(total={aggregate_best.get('total_score', 0.0):.3f}, avg/section={per_section:.3f}, model={aggregate_best.get('model', 'unknown')})"
)
print(f"Final prompt:{summarization_prompt.current().prompt}")
return summarization_agent
summarization_agent = await self_evolving_loop(summarization_agent)
```
**How the final prompt is chosen**
- Every evaluation logs the average grader score, the total score across graders, and whether the attempt passed the lenient criteria.
- `best_candidate` tracks the most recent lenient-passing candidate, while `aggregate_prompt_stats` accumulates per-version totals so the top cumulative scorer can be identified.
- When the loop ends, `apply_best_candidate_if_needed` re-applies the best lenient-passing candidate, and `select_best_aggregate_prompt` reports the prompt version with the highest cumulative grader score (ties favor the latest version).
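To make the tie-breaking in `select_best_aggregate_prompt` concrete, here is a small self-contained illustration with hypothetical stats for two prompt versions that end up with the same cumulative score:

```python
# Illustrative only: the aggregate selection key prefers the higher total score
# and, on a tie, the later prompt version.
aggregate_example = {
    4: {"version": 4, "prompt": "v4 prompt", "total_score": 9.63, "total_average": 4.01, "count": 5},
    5: {"version": 5, "prompt": "v5 prompt", "total_score": 9.63, "total_average": 4.01, "count": 5},
}
best = max(
    aggregate_example.values(),
    key=lambda entry: (entry.get("total_score", float("-inf")), entry.get("version", -1)),
)
print(best["version"])  # prints 5: the later version wins the tie
```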
Here is an example (abridged) output of the `self_evolving_loop` run above.
Inspecting the output shows that the self-evolving prompt worked. There are a few takeaways to account for:
1. The optimization is not always successful, so being able to roll back the prompt version is important.
2. The fidelity of the information coming from the graders is crucial to ensuring a quality optimization.
```
Starting self-evolving loop | Initial prompt v0
Prompt:You are a summarization assistant.
Given a section of text, produce a summary.
--------------------------------------------------------------------------------
[Section 7.1] Using prompt v0
Attempt 1: evaluating summary...
Scores — avg=0.805, total=3.218, lenient_passed=False
Failed eval. Improving prompt...
Prompt improved → v1
Attempt 2: evaluating summary...
Scores — avg=0.720, total=2.881, lenient_passed=False
Failed eval. Improving prompt...
Prompt improved → v2
Attempt 3: evaluating summary...
Scores — avg=0.762, total=3.048, lenient_passed=True
Passed with prompt v2
[Section 7.2] Using prompt v2
Attempt 1: evaluating summary...
Scores — avg=0.612, total=2.450, lenient_passed=False
Failed eval. Improving prompt...
Prompt improved → v3
Attempt 2: evaluating summary...
Scores — avg=0.915, total=3.660, lenient_passed=True
Passed with prompt v3
[Section 3.2.P.2.1] Using prompt v3
Attempt 1: evaluating summary...
Scores — avg=0.684, total=2.736, lenient_passed=False
Failed eval. Improving prompt...
Prompt improved → v4
Attempt 2: evaluating summary...
Scores — avg=0.684, total=2.736, lenient_passed=False
Failed eval. Improving prompt...
Prompt improved → v5
Attempt 3: evaluating summary...
Scores — avg=0.920, total=3.680, lenient_passed=True
Passed with prompt v5
[Section 3.2.P.2.2] Using prompt v5
Attempt 1: evaluating summary...
Scores — avg=0.737, total=2.950, lenient_passed=True
Passed with prompt v5
[Section 3.2.P.2.3] Using prompt v5
Attempt 1: evaluating summary...
Scores — avg=0.750, total=3.000, lenient_passed=True
Passed with prompt v5
--------------------------------------------------------------------------------
Completed optimization loop.
Final prompt version: v5
Best lenient prompt: v5 (avg=0.750)
Aggregate best prompt: v5 (total=9.630, avg/section=0.802)
```
Final prompt:
**Optimized Technical Summarization System Prompt**
You are a technical summarization assistant specialized in scientific and regulatory documents. Your objective is to generate a summary that preserves every explicit detail and organizational structure from the source text, without any paraphrasing, omission, or synthesis.
**Strict Summarization Guidelines:**
**1. Comprehensive Detail Inclusion:**
- Transcribe all named compounds, salts, excipients, drug substances, molecular designations, batch codes, identifiers, and CAS numbers exactly as written.
- Include every stated concentration, unit, measurement, quantitative value, compositional detail, and preparatory parameter verbatim and in original format.
- Accurately replicate all descriptions of appearance, color, physical state, rationale for inclusion, and labeling or typographical conventions present in the source.
- Clearly include all section titles, headings, subsections, hierarchical numbering, referenced sections, and in-line citations or figures.
**2. Prohibited Actions:**
- Do NOT paraphrase, summarize, interpret, synthesize, restructure, generalize, or alter any information at any level.
- Do NOT omit, compress, merge, or reorder any data point, named entity, technical term, or explicit instruction from the source.
- Do NOT introduce additional content, inference, or editorial clarification.
**3. Structural and Formatting Requirements:**
- Maintain verbatim order, sectioning, and hierarchy from the source text, including all original lists, bullet points, numbering, or formatting.
- Reproduce every element in the precise sequence, alignment, and structure as the input, ensuring maximal traceability.
- If the source uses lists, tables, subpoints, or hierarchies, mirror them exactly.
**4. Precision, Fidelity, and Reviewability:**
- Your summary must enable full regulatory or technical audit by containing every explicit detail, designation, and measurement from the original—unaltered and unabridged.
- The output must be comprehensive, exhaustive, and identical in informational content and structure to the input. Every visible explicit detail must be present.
**Output Instruction:**
Begin summarization after this message, applying the above rules without exception. Each output must be concise in format but all-inclusive in content, reflecting every explicit fact, designation, and organizational feature of the source text, and suitable for regulatory or expert review. No interpretation, paraphrasing, or omission is permitted under any circumstance.
### Agent Logs & Tracing
We can view optimization workflow runs in the dashboard under logs:
Figure 10 - Agent log traces showing optimization workflow runs in the dashboard.
And drill down into the different agent calls:
Figure 11 - Detailed agent trace showing individual agent calls and execution flow.
### Continuous Monitoring
Once the evaluation loop is complete, the system should continue to monitor new incoming data and periodically re-evaluate model performance on blind datasets. This ensures the model remains accurate and compliant as the data distribution evolves.
To enable continuous monitoring, you can integrate a cron job or a lightweight scheduler loop that periodically checks for updates in your data source (e.g., new PDF uploads or database entries). When new data is detected, the system automatically triggers the evaluation and optimization loop described earlier.
For example (pseudo code):
```python
# this cell is pseudo-code and not meant to be run as-is
import asyncio
import time

def new_data_detected() -> bool:
    """Placeholder: check your data source (e.g., new PDF uploads or database entries)."""
    ...

def continuous_monitoring(interval_hours=24):
    """Periodically check for new data and trigger the evaluation loop."""
    while True:
        print("Checking for new data...")
        if new_data_detected():
            print("New data found — running evaluation and optimization loop.")
            asyncio.run(self_evolving_loop(summarization_agent))
        else:
            print("No new data. Sleeping until next cycle.")
        time.sleep(interval_hours * 3600)

continuous_monitoring(interval_hours=24)
```
This approach allows the model to continuously learn and adapt, improving over time as it processes fresh data — a key requirement for maintaining high-quality, real-world performance.
## 4. Going Further
### a. Model Evaluation
We now have a fully automated loop improving our prompt with **evals** and accepting the new prompt when the rating is over the defined threshold.
In production, you could use a similar framework to monitor the performance of your agents as new user requests come in.
As mentioned above, this is a simplified example, and in a real-world scenario you'd want to have additional guardrails and a human-in-the-loop approach to approve new prompts.
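For example, a lightweight human-in-the-loop gate could sit between prompt generation and adoption. The sketch below is illustrative only: `request_approval` is a hypothetical helper (in practice the review might happen in a ticketing system or internal tool rather than via `input()`), and it reuses the notebook's `summarization_prompt` versioning helper.
```python
def request_approval(old_prompt: str, new_prompt: str, eval_feedback: str) -> bool:
    # Hypothetical review step: surface the proposed change and grader feedback
    # to a human reviewer before the new prompt is adopted.
    print("Current prompt:\n", old_prompt)
    print("Proposed prompt:\n", new_prompt)
    print("Eval feedback:\n", eval_feedback)
    return input("Approve this prompt update? [y/N] ").strip().lower() == "y"

def adopt_prompt_with_review(summarization_prompt, improved_prompt: str, eval_feedback: str) -> None:
    current = summarization_prompt.current()
    if request_approval(current.prompt, improved_prompt, eval_feedback):
        summarization_prompt.update(
            new_prompt=improved_prompt,
            model=current.model,
            metadata={"approved_by": "human-reviewer"},
        )
        print(f"Prompt updated to v{summarization_prompt.current().version} after human approval.")
    else:
        print("Prompt update rejected; keeping the current version.")
```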
Taking this concept further, we can also use evals to test different model parameter candidates such as the model version, verbosity, and reasoning. To see the full set of parameters that could be considered, check the [ModelSettings class in the Agents SDK](https://openai.github.io/openai-agents-python/ref/model_settings/#agents.model_settings.ModelSettings).
The `compare_model_candidates` function is an example of how to:
1. Optimize the prompt
2. Generate candidate outputs from the optimized prompt using two or more different models
3. Use evals to grade the candidate outputs and select the best candidate
It can be worked into the `self_evolving_loop` function with minimal refactoring.
> **NOTE:** Production testing of model versions should be limited to versions within the same model family (e.g., gpt-5, gpt-5-mini, gpt-5-nano). Cross-family model selection is best done before production deployment.
And the final `self_evolving_loop` with model comparison code:
```python
import asyncio
import pandas as pd
from agents import Agent, Runner, trace
async def eval_agent_candidate(agent: Agent, section: str, prompt_text: str, model_name: str):
summary_result = await Runner.run(agent, section)
summary = summary_result.final_output
scores = await get_eval_grader_score(
eval_id=EVAL_ID, summary=summary, section=section
)
average = calculate_grader_score(scores)
lenient_passed = is_lenient_pass(scores, average)
passed = all(entry.get("passed") is True for entry in scores)
update_best_candidate(
average_score=average,
prompt_text=prompt_text,
model_name=model_name,
summary_text=summary,
metadata={
"section": section,
"average_score": average,
"grader_results": scores,
},
lenient_passed=lenient_passed,
)
return {"summary": summary, "scores": scores, "average": average, "passed": passed}
async def compare_model_candidates(
summarization_prompt,
eval_feedback: str,
section: str,
summary: str,
model_candidates=None,
):
"""Improve the prompt, evaluate it across candidate models, and adopt the top performer."""
if model_candidates is None:
model_candidates = ["gpt-5", "gpt-5-mini"]
metaprompt_result = await Runner.run(
metaprompt_agent,
input=METAPROMPT_TEMPLATE.format(
original_prompt=summarization_prompt.current().prompt,
section=section,
summary=summary,
reasoning=eval_feedback,
),
)
improved_prompt = metaprompt_result.final_output
async def evaluate_model(model_name: str):
candidate_agent = Agent(
name=f"SummarizationAgent:{model_name}",
instructions=improved_prompt,
model=model_name,
)
result = await eval_agent_candidate(candidate_agent, section, improved_prompt, model_name)
return model_name, candidate_agent, result
best = {
"average": float("-inf"),
"passed": False,
"agent": None,
"model": None,
"summary": None,
}
tasks = [asyncio.create_task(evaluate_model(model_name)) for model_name in model_candidates]
for task in asyncio.as_completed(tasks):
model_name, candidate_agent, result = await task
print(
f"Candidate average — {model_name}: {result['average']:.4f} "
f"(passed={result.get('passed', False)})"
)
if result["average"] > best["average"]:
best.update(
{
"average": result["average"],
"model": model_name,
"summary": result.get("summary"),
"agent": candidate_agent,
"passed": result.get("passed", False),
}
)
for task in tasks:
if not task.done():
task.cancel()
if best["passed"] and best["model"]:
summarization_prompt.update(
new_prompt=improved_prompt,
model=best["model"],
metadata={"section": section, "summary": best["summary"]},
)
print(f"Updated summarization_prompt with passing model: {best['model']}")
return make_summarization_agent(summarization_prompt.current())
print(
f"No passing models. Best candidate (model={best['model']}, "
f"avg={best['average']:.4f}) did not pass. Prompt not updated."
)
return None
async def self_evolving_loop_with_model_comparison(summarization_agent: Agent) -> Agent:
print(
f"Starting self-evolving loop | Initial prompt v{summarization_prompt.current().version}"
)
print(f"Prompt: {summarization_prompt.current().prompt}")
print(f"Model: {summarization_prompt.current().model}")
print("-" * 80)
reset_best_trackers()
df = pd.read_csv("data/dataset.csv")
with trace("Self-evolving Optimization Workflow: model comparison"):
for _, row in df.head(5).iterrows():
content = row.get("content")
if pd.isna(content) or (isinstance(content, str) and not content.strip()):
continue
section_number = str(row["section_number"])
section = str(content)
current_version = summarization_prompt.current().version
print(f"[Section {section_number}] Using prompt v{current_version}")
summary_passed = False
for attempt in range(1, MAX_OPTIMIZATION_RETRIES + 1):
print(f"\tAttempt {attempt}: evaluating summary...")
summary_result = await Runner.run(summarization_agent, section)
summary = summary_result.final_output
grader_scores = await get_eval_grader_score(
eval_id=EVAL_ID, summary=summary, section=section
)
average_score = calculate_grader_score(grader_scores)
total_score = calculate_total_grader_score(grader_scores)
lenient_passed = is_lenient_pass(grader_scores, average_score)
print(
f"\tScores — avg={average_score:.3f}, total={total_score:.3f}, lenient_passed={lenient_passed}"
)
record_aggregate_prompt_score(
prompt_version=summarization_prompt.current().version,
prompt_text=summarization_prompt.current().prompt,
model_name=summarization_prompt.current().model,
average_score=average_score,
total_score=total_score,
)
update_best_candidate(
average_score=average_score,
total_score=total_score,
prompt_text=summarization_prompt.current().prompt,
model_name=summarization_prompt.current().model,
summary_text=summary,
metadata={
"section": section_number,
"average_score": average_score,
"grader_results": grader_scores,
"prompt_version": summarization_prompt.current().version,
},
lenient_passed=lenient_passed,
prompt_version=summarization_prompt.current().version,
)
if lenient_passed:
summary_passed = True
print(
f"\tPassed with prompt v{summarization_prompt.current().version} (model={summarization_prompt.current().model})"
)
break
print("\tFailed eval. Improving prompt...")
eval_feedback = collect_grader_feedback(grader_scores)
new_agent = await compare_model_candidates(
summarization_prompt=summarization_prompt,
eval_feedback=eval_feedback,
section=section,
summary=summary,
# model_candidates could be given as an argument if you want to expand options.
)
if new_agent is None:
print(
"\tNo passing model found. Optimization failed for this section."
)
summary_passed = False
else:
summarization_agent = new_agent
summary_passed = True
print(
f"\tPrompt improved → v{summarization_prompt.current().version} "
f"(model={summarization_prompt.current().model})"
)
break
if not summary_passed:
print(
"\tAll attempts failed; keeping latest prompt version "
f"v{summarization_prompt.current().version} (model={summarization_prompt.current().model}) for the next section."
)
summarization_agent = apply_best_candidate_if_needed()
print("" + "-" * 80)
print("Completed optimization loop.")
print(f"Final prompt version: v{summarization_prompt.current().version}")
print(f"Final model: {summarization_prompt.current().model}")
aggregate_best = select_best_aggregate_prompt()
if best_candidate["score"] > float("-inf"):
print(
f"Best lenient prompt: v{best_candidate.get('version')} (avg={best_candidate['score']:.3f}, model={best_candidate.get('model', 'unknown')})"
)
if aggregate_best:
per_section = (
aggregate_best.get("total_average", 0.0) / aggregate_best.get("count", 1)
if aggregate_best.get("count")
else 0.0
)
print(
f"Aggregate best prompt: v{aggregate_best.get('version')} "
f"(total={aggregate_best.get('total_score', 0.0):.3f}, avg/section={per_section:.3f}, model={aggregate_best.get('model', 'unknown')})"
)
print(f"Final prompt: {summarization_prompt.current().prompt}")
print(f"Final model: {summarization_prompt.current().model}")
return summarization_agent
summarization_agent = await self_evolving_loop_with_model_comparison(summarization_agent)
```
Here we can see a very similar output with additional information on the model version scores:
Starting self-evolving loop | Initial prompt v0
Prompt:
You are a summarization assistant.
Given a section of text, produce a concise, accurate summary.
[....]
[Section 3.2.P.2.2] Using prompt v2
Attempt 1: evaluating summary...
Failed eval. Improving prompt...
Candidate average — gpt-5: 0.3533 (passed=False)
Candidate average — gpt-5-mini: 0.4670 (passed=False)
No passing models. Best candidate (model=gpt-5-mini, avg=0.4670) did not pass. Prompt not updated.
No passing model found. Optimization failed for this section.
Attempt 2: evaluating summary...
Exceeded retries, aborting
Passed with prompt v2
--------------------------------------------------------------------------------
Completed optimization loop.
Final prompt version: v2
Final prompt:
**Improved Prompt:**
You are a summarization assistant.
Given any section of text, generate a concise and accurate summary that includes all key concepts, components, and their main characteristics or interactions as described in the original section. Your summary should be brief yet complete, faithfully reflecting essential information, descriptors, and relationships between elements while omitting unnecessary details. Ensure the summary maintains the original meaning and captures all critical content and terminology relevant to the section.
### b. Prompt Optimization with Genetic-Pareto (GEPA)
We've demonstrated that the self-evolving loop works and that a prompt can be improved autonomously using Evals. However, we relied on a relatively straightforward, static metaprompt to improve our system prompt. In this section, we explore a more dynamic and reflexive method by using Genetic-Pareto (GEPA) [[1]](#citations), a framework that samples agent trajectories, reflects on them in natural language, proposes prompt revisions, and evolves the system through iterative feedback loops.
The GEPA method, described in the paper available [here](https://doi.org/10.48550/arXiv.2507.19457), offers a compelling blueprint for continuous, self-improving prompt optimization. The code below draws generously on the GEPA GitHub repository, available [here](https://github.com/gepa-ai/gepa).
```python
import pandas as pd
import gepa
from gepa import EvaluationBatch
# Extract sections from dataset
def read_csv_content(file_path: str) -> list[dict]:
"""Read csv and return section to summarize."""
df = pd.read_csv(file_path)
return [{'content': content} for content in df['content'].tolist()]
# Split dataset into training and validation sets
trainset = read_csv_content("data/dataset.csv")
val_cut = max(1, int(0.1 * len(trainset)))
valset = trainset[:val_cut] if len(trainset) > 1 else trainset
```
We'll reuse our graders and helper functions by adding a small adapter so that our setup works with GEPA. GEPA's `GEPAAdapter` makes it easy to plug into our eval framework. We define three hooks:
- `evaluate`: runs the summarization and grades with graders defined in the previous section (i.e., chemical_name_grader, word_length_deviation_grader, cosine_similarity, llm_as_judge).
- `get_components_to_update`: gets the text fields GEPA should evolve (here, system_prompt).
- `make_reflective_dataset`: packages inputs, outputs, and feedback for reflection.
```python
class EvalsBackedSummarizationAdapter:
"""
Minimal adapter for GEPA:
- evaluate(...) -> EvaluationBatch (scores + outputs + feedback-rich trajectories)
- get_components_to_update(...) returns the prompt to update
- make_reflective_dataset(...) packages examples for reflection
"""
propose_new_texts = None # use GEPA's default reflection flow
def __init__(self, client, eval_id: str, gen_model: str = "gpt-5", user_prefix: str | None = None):
self.client = client
self.eval_id = eval_id
self.gen_model = gen_model
self.user_prefix = user_prefix or "Summarize:\n\n"
# Same summarization agent as in the previous section
def _summarize(self, system_prompt: str, section: str) -> str:
resp = self.client.chat.completions.create(
model=self.gen_model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": f"{self.user_prefix}{section}"},
],
)
return resp.choices[0].message.content.strip()
# Required by GEPA: run eval minibatch
def evaluate(self, inputs: list[dict], candidate: dict, capture_traces: bool = True) -> EvaluationBatch:
system_prompt = candidate["system_prompt"]
scores: list[float] = []
outputs: list[str] = []
trajectories: list[dict] = []
for item in inputs:
section = item["content"]
# 1) Generate with the candidate prompt
summary = self._summarize(system_prompt, section)
outputs.append(summary)
# 2) Grade using previous evals pipeline
run = run_eval(eval_id=self.eval_id, section=section, summary=summary)
out_items = poll_eval_run(eval_id=self.eval_id, run_id=run.id)
grader_scores = parse_eval_run_output(out_items)
# 3) Score + actionable feedback
scalar = calculate_grader_score(grader_scores)
feedback = collect_grader_feedback(grader_scores) or "All graders passed; keep precision and coverage."
scores.append(float(scalar))
trajectories.append(
{
"inputs": {"section": section},
"generated_output": summary,
"metrics": {
"combined": float(scalar),
"by_grader": grader_scores, # keeping for analysis if needed
},
"feedback": feedback,
}
)
return EvaluationBatch(scores=scores, outputs=outputs, trajectories=trajectories)
# Required by GEPA: text field to evolve
def get_components_to_update(self, candidate: dict) -> list[str]:
return ["system_prompt"]
# Required by GEPA: build the reflective dataset the reflection LM will read
def make_reflective_dataset(self, candidate: dict, eval_batch: EvaluationBatch, components_to_update: list[str]) -> dict:
examples = []
for traj in (eval_batch.trajectories or []):
examples.append(
{
"Inputs": {"section": traj["inputs"]["section"]},
"Generated Outputs": traj["generated_output"],
"Feedback": traj["feedback"],
}
)
return {"system_prompt": examples}
```
Now that the adapter is ready, we can run GEPA using the same starting prompt (`"You are a summarization assistant. Given a section of text, produce a summary."`) and model (here, `gpt-5`) as in the earlier self-evolving loop for comparison. We provide our adapter instance, seed candidate, and training/validation sets to `gepa.optimize(...)`. During the optimization, GEPA repeatedly invokes the adapter to score candidates, reflects on feedback, and ultimately produces the best evolved prompt.
_Note: GEPA might take ~10-15 minutes to complete._
```python
seed_candidate = {"system_prompt": "You are a summarization assistant. Given a section of text, produce a summary."}
adapter = EvalsBackedSummarizationAdapter(
client=client,
eval_id=EVAL_ID,
gen_model=summarization_prompt.current().model,
)
# Keeping max_metric_calls small for the cookbook.
# In practice, use a larger value to allow more optimization iterations.
result = gepa.optimize(
seed_candidate=seed_candidate,
trainset=trainset,
valset=valset,
adapter=adapter,
reflection_lm="gpt-5",
max_metric_calls=10,
track_best_outputs=True,
display_progress_bar=True
)
best_prompt = result.best_candidate["system_prompt"]
print("\n=== Best evolved instruction ===\n")
print(best_prompt)
```
Here is an example (abridged) output for the code above:
Iteration 0: Base program full valset score: 0.2183466466681351
Iteration 1: Selected program 0 score: 0.2183466466681351
Iteration 1: Proposed new text for system_prompt:
[.......]
Iteration 3: New subsample score 0.6592202195294341 is better than old score 0.6565039300893376. Continue to full eval and add to candidate pool.
GEPA Optimization: 90%|█████████ | 18/20 [39:21<04:22, 131.19s/rollouts]
Iteration 3: Full valset score for new program: 0.2225472423976205
Iteration 3: Full train_val score for new program: 0.2225472423976205
Iteration 3: Individual valset scores for new program: [0.22866548337721018, 0.21864704884895614, 0.2203291949666952]
Iteration 3: New valset pareto front scores: [0.23142100182952327, 0.2389098334382265, 0.23513790628541456]
Iteration 3: Full valset pareto front score: 0.2351562471843881
Iteration 3: Updated valset pareto front programs: [{1}, {1}, {1}]
Iteration 3: Best valset aggregate score so far: 0.2351562471843881
Iteration 3: Best program as per aggregate score on train_val: 1
Iteration 3: Best program as per aggregate score on valset: 1
Iteration 3: Best score on valset: 0.2351562471843881
Iteration 3: Best score on train_val: 0.2351562471843881
Iteration 3: Linear pareto front program index: 1
Iteration 3: New program candidate index: 2
=== Best evolved instruction ===
You are a domain-aware summarization assistant for technical pharmaceutical texts. Given a “section” of text, produce a concise summary that preserves key technical facts and exact nomenclature.
Requirements:
- Length and format:
- Write 1–3 sentences totaling about 45–70 words (never exceed 90 words). Default to ~60 words.
- Use a single paragraph (no bullet points, headings, or heavy formatting).
- Preserve exact technical names and notation:
- Include every chemical name that appears in the section at least once, with exact spelling, capitalization, isotopic labels, brackets, hyphens, salts, and buffer names (e.g., Hyperpolarized Pyruvate (13C) Injection; [1-13C]pyruvic acid; hyperpolarized [1-13C]pyruvate; 15 mM AH111501 sodium salt; TRIS/EDTA buffer solution).
- Keep study identifiers, section numbers, regulatory citations, and codes verbatim when mentioned (e.g., GE-101-001, GE-101-003, USP <797>, 3.2.P.7, company codes, CAS numbers).
...
Self-check before finalizing:
- Have you included every chemical name exactly as written?
- Is the summary within 45–70 words (≤90 max) and a single paragraph?
- Are key process/regulatory/test details and critical numbers preserved without unnecessary verbosity?
In this cookbook, we explored three distinct approaches to prompt optimization:
- **OpenAI Platform Optimizer:** Using the _Optimize_ button with a dataset containing manually entered human feedback (thumbs up/down and textual comments), we quickly produced a strong prompt with minimal configuration. This method excels at rapid iteration, but does not provide the automation needed for production environments.
- **Optimization using a static metaprompt:** Our loop, incorporating four different graders, enabled automated exploration and iterative self-improvement without manual intervention. However, its exploration space was limited by a single static metaprompt, and evaluation was performed section by section. Consequently, this approach risked overfitting to immediate grader feedback instead of achieving broader generalization.
- **GEPA optimization:** Offering a more structured search process, reflective updates were informed by both quantitative scores and textual feedback, while candidates were trained on one dataset and validated on another. This method produced a more robust, generalized prompt and provided clearer empirical evidence of its performance.
_Note: Examples of prompts generated by each method are available in the Appendix._
Depending on your use case, you may prioritize speed (OpenAI optimizer), lightweight automation (static metaprompt), or systematic generalization (GEPA). In practice, combining these methods by starting with rapid iteration and progressing toward reflective optimization can deliver both agility and performance.
Happy coding!
## Contributors
This cookbook is based on a joint collaboration between [Bain](https://www.bain.com) and [OpenAI](https://openai.com).
[Calvin Maguranis](https://www.linkedin.com/in/calvin-maguranis-b9956045/)
[Fanny Perraudeau](https://www.linkedin.com/in/fanny-sabran-perraudeau-494b7573/)
[Giorgio Saladino](https://www.linkedin.com/in/giorgio-saladino-202/)
[Shikhar Kwatra](https://www.linkedin.com/in/shikharkwatra/)
[Valentina Frenkel](https://www.linkedin.com/in/valentina-frenkel/)
## Citations
[1] _GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning_ by Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab - https://arxiv.org/abs/2507.19457
## Appendix
### Examples of output prompts:
- **Initial prompt:**
```text
You are a summarization assistant. Given a section of text, produce a summary.
```
- **OpenAI Platform Optimizer:**
```text
You are a summarization assistant.
Task: Summarize the provided text concisely and accurately.
Output requirements:
- Output only the summary. Do not add titles, labels (e.g.,
"Summary:"), prefaces, or commentary.
- Preserve the document's structure. If multiple sections/subsections appear, summarize each one.
- Use a numbered list for sections/subsections (use their numbers/titles when present).
- Under each, use short dash bullets for key points.
- If there is only a single short section, return a brief bullet list or 1-2 concise sentences.
- Split any inline lists into separate bullets.
- Use plain, simple language. Keep bullets tight (ideally one line each). Remove redundancy.
- Include important quantitative details (values, units, conditions) and constraints. Do not invent information.
- Keep formatting simple: plain text, "1." numbering and "-" bullets only. No tables or special markup.
- Retain exact technical terms/notation from the source (e.g., chemical names, isotopic labels).
- If a section is explicitly marked "Not applicable," include that status; otherwise do not add it.
```
- **Static metaprompt:**
```text
You are a technical summarization assistant for scientific and regulatory documentation. Your task is to generate a concise, comprehensive, and fully detailed summary of any scientific, technical, or regulatory text provided. Strictly adhere to the following instructions:
---
**1. Complete and Exact Information Inclusion**
- Capture *every* explicit fact, technical value, specification, quantity, measurement, regulatory reference, entity, process, site, and contextual detail verbatim from the source text.
- Do not omit or generalize any explicit information, no matter how minor.
**2. Precise Terminology and Named Entity Retention**
- Reproduce all names of chemicals, drugs, mixtures, buffer components, devices, companies, institutions, regulatory standards, section numbers, and procedural labels *exactly as stated*.
- Report all quantities, measurements, concentrations, ratios, masses, volumes, compositions, pH values, and units precisely as given.
- Do not paraphrase, rename, substitute, or simplify any term or value.
**3. All Procedural Details and Justifications**
- Explicitly include all described procedures, technical processes (e.g., terminal sterilization, aseptic processing), operational constraints, process justifications, compliance requirements, and standards references.
- Clearly state all reasons provided for choosing or omitting particular methods or processes.
**4. Regulatory and Compliance References**
- Accurately cite all regulations, standards (e.g., USP <797>), compliance statements, section numbers, and cross-references as in the original.
- Include all explicit mentions of compliance, applicability, and site location details.
**5. Explicit Statements of Absence, Limitations, and Applicability**
- Clearly state any declarations of absence, inapplicability (“Not applicable”), or limitations exactly as written in the source.
**6. Structural and Organizational Fidelity**
- Precisely reflect the original document’s section and subsection hierarchy, using clear section labels and indentation.
- Present all enumerations, lists, and tabulated data in structured bullet-point or numbered format, organized in accordance with the source document’s arrangement.
**7. No Paraphrasing, Summarizing, or Reinterpretation**
- Do *not* paraphrase, summarize contextually, reinterpret, or alter the meaning or sequence of any content.
- Remove only literal repetitions or redundant phrasing; otherwise, preserve all explicit statements, technical details, and contextual notes.
---
**Summary Output Objective:**
Produce a summary that delivers the full technical, factual, and regulatory content and structure of the original text, reformatted by eliminating only redundant language. The summary must enable audit, regulatory review, or peer reference without loss of any explicit information or terminology from the source.
---
*Apply these instructions rigorously to every provided document section to ensure scientific and regulatory accuracy and completeness.*
```
- **GEPA optimizer**:
```text
You are a domain-aware summarization assistant for technical pharmaceutical texts. Given a “section” of text, produce a concise, single-paragraph summary that preserves key technical facts and exact nomenclature.
Length and format
- Write 1–3 sentences totaling about 45–70 words (target ~60; never exceed 90).
- Use one paragraph; no bullets, headings, tables, or heavy formatting.
Exact names and notation
- Include every chemical name that appears in the section at least once, using the exact original spelling, capitalization, punctuation, isotopic labels, brackets, hyphens, salts, buffer names, and parenthetical qualifiers. Treat distinct case/format variants as distinct names (e.g., [1-13C]pyruvic acid and [1-13C]Pyruvic acid are separate and each must appear once).
- Examples you must preserve verbatim when present: Hyperpolarized Pyruvate (13C) Injection; non-polarized Pyruvate Injection; Pyruvate (13C) Injection; hyperpolarized [1-13C]pyruvate; Mixture of [1-13C]pyruvic acid and 15 mM AH111501 sodium salt; TRIS/EDTA buffer solution; TRIS; NaOH; Na2EDTA; [1-13C]pyruvic acid; AH111501 sodium salt.
- Also preserve exact study identifiers, batch codes, section numbers, regulatory citations, and instrument parameters as written (e.g., GE-101-001, GE-101-003, USP <797>, 3.2.P.5.2.5, FFF106/140-806, FFF106/142-806, 3T MRI, 5 degree RF pulse, TR=3s, 90 degree pulse, 64 averages, TR=10s, 10 μl Gd/ml solution).
Content prioritization (if space is tight)
1) What the section is about (topic/purpose).
2) All named chemical entities and compositions (list all chemical names at least once; include concentrations/amounts if given).
3) Critical process/handling facts (e.g., aseptic processing vs terminal sterilization; ISO classifications; filtration specs; compounding/filling steps; temperatures/times/volumes; storage/administration limits).
4) Container/packaging specifics (e.g., cryovials, “sterile fluid path”).
5) Microbiological/testing/regulatory details (e.g., sterility/pyrogenicity testing timing; USP <797>; state board compliance; site/manufacturer if stated).
6) Overages/single-dose formulas and key quantities.
Numerical fidelity
- Preserve all critical numbers and units exactly (e.g., 1.44 g, 27.7 mg, 15 mM, 18 mL, 1.47 g, two 0.2 μm filters, ISO 7, ISO 5, 38 mL).
- Include testing/analysis parameters when present (e.g., polarization/relaxation time (T1); number of spectra; pulse angles; TR values; MRI location relative to clean room).
Style and compression
- Be neutral and factual; do not infer unstated information.
- Consolidate repeated statements; compress lists with commas/semicolons to save words.
- Mention tables/figures only to convey key data; do not reproduce them.
- If many chemicals are present, ensure each distinct name appears once; group them succinctly.
- Avoid symbols or special formatting not in the source text.
Common domain cues to include when present
- Aseptic processing vs terminal sterilization and the rationale/timing (e.g., “tested for sterility and pyrogenicity subsequent to patient administration”).
- Environmental/processing controls (ISO 7/ISO 5; LAF unit; filtration; filling/weight targets per cryovial).
- Site/regulatory context (e.g., USP <797>; California State Board of Pharmacy; University of California, San Francisco Department of Clinical Pharmacy).
- Study/kit equivalence statements (e.g., equivalence to GE-101-001/GE-101-003 formulations).
- QC/measurement methods (e.g., capacitive threshold at Administration syringe nominal 38 mL).
Self-check before finalizing
- Does the paragraph contain every distinct chemical name exactly as written in the section (including case and notation variants)?
- Is the summary 45–70 words (≤90), in a single paragraph?
- Are the most critical process/regulatory/testing details and all key numbers preserved without unnecessary verbosity?
```
---
# Source: https://developers.openai.com/cookbook/examples/chatgpt/rag-quickstart/azure/azure_ai_search_with_azure_functions_and_gpt_actions_in_chatgpt.md
# Azure AI Search as a vector database + Azure Functions for GPT integration in ChatGPT
This notebook provides step-by-step instructions for using Azure AI Search (f.k.a. Azure Cognitive Search) as a vector database with OpenAI embeddings, then creating an Azure Function on top of it to plug into a Custom GPT in ChatGPT.
This can be a solution for customers looking to set up RAG infrastructure contained within Azure and expose it as an endpoint that can be integrated with other platforms such as ChatGPT.
Azure AI Search is a cloud search service that gives developers infrastructure, APIs, and tools for building a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications.
Azure Functions is a serverless compute service that runs event-driven code, automatically managing infrastructure, scaling, and integrating with other Azure services.
## Prerequisites:
For the purposes of this exercise you must have the following:
- Azure user with permission to create [Azure AI Search Service](https://learn.microsoft.com/azure/search/) and Azure Function Apps
- Azure subscription ID and a resource group.
- [OpenAI Key](https://platform.openai.com/account/api-keys)
# Architecture
Below is a diagram of the architecture of this solution, which we'll walk through step-by-step.

> Note: This architecture pattern of vector data store + serverless functions can be extrapolated to other vector data stores. For example, if you would want to use something like Postgres within Azure, you'd change the [Configure Azure AI Search Settings](#configure-azure-ai-search-settings) step to set-up the requirements for Postgres, you'd modify the [Create Azure AI Vector Search](#create-azure-ai-vector-search) to create the database and table in Postgres instead, and you'd update the `function_app.py` code in this repository to query Postgres instead of Azure AI Search. The data preparation and creation of the Azure Function would stay consistent.
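For illustration only (not part of this notebook), swapping in Postgres with the pgvector extension might make the query step look roughly like the sketch below; the table name, columns, and connection string are assumptions:
```python
import psycopg2

def query_pgvector(query_embedding: list[float], category: str, k: int = 3):
    # Hypothetical table: docs(title text, text text, category text, content_vector vector(1536))
    conn = psycopg2.connect("dbname=rag user=postgres password=postgres host=localhost")
    # pgvector accepts a bracketed string literal cast to the vector type.
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT title, text
            FROM docs
            WHERE category = %s
            ORDER BY content_vector <=> %s::vector  -- cosine distance operator from pgvector
            LIMIT %s
            """,
            (category, vec_literal, k),
        )
        return cur.fetchall()
```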
# Table of Contents:
1. **[Setup of Environment](#set-up-environment)**
Set up the environment by installing and importing the required libraries and configuring our Azure settings. Includes:
- [Install and Import Required Libraries](#install-and-import-required-libraries)
- [Configure OpenAI Settings](#configure-openai-settings)
- [Configure Azure AI Search Settings](#configure-azure-ai-search-settings)
2. **[Prepare Data](#prepare-data)** Prepare the data for uploading by embedding the documents, as well as capturing additional metadata. We will use a subset of OpenAI's docs as example data for this.
3. **[Create Azure AI Vector Search](#create-azure-ai-vector-search)** Create an Azure AI Vector Search and upload the data we've prepared. Includes:
- [Create Index](#create-index): Steps to create an index in Azure AI Search.
- [Upload Data](#upload-data): Instructions to upload data to Azure AI Search.
- [Test Search](#test-search): Steps to test the search functionality.
4. **[Create Azure Function](#create-azure-function)** Create an Azure Function to interact with the Azure AI Vector Search. Includes:
- [Create Storage Account](#create-storage-account): Steps to create a storage account for the Azure Function.
- [Create Function App](#create-function-app): Instructions to create a function app in Azure.
5. **[Input in a Custom GPT in ChatGPT](#input-in-a-custom-gpt-in-chatgpt)** Integrate the Azure Function with a Custom GPT in ChatGPT. Includes:
- [Create OpenAPI Spec](#create-openapi-spec): Steps to create an OpenAPI specification for the Azure Function.
- [Create GPT Instructions](#create-gpt-instructions): Instructions to create GPT-specific instructions for the integration.
# Set up environment
We'll set up our environment by importing the required libraries and configuring our Azure settings.
## Install and import required libraries
We categorize these libraries into standard Python libraries, third-party libraries, and Azure-related libraries for readability.
```python
! pip install -q wget
! pip install -q azure-search-documents
! pip install -q azure-identity
! pip install -q openai
! pip install -q azure-mgmt-search
! pip install -q pandas
! pip install -q azure-mgmt-resource
! pip install -q azure-mgmt-storage
! pip install -q pyperclip
! pip install -q PyPDF2
! pip install -q tiktoken
```
```python
# Standard Libraries
import json
import os
import platform
import subprocess
import csv
from itertools import islice
import uuid
import shutil
import concurrent.futures
# Third-Party Libraries
import pandas as pd
from PyPDF2 import PdfReader
import tiktoken
from dotenv import load_dotenv
import pyperclip
# OpenAI Libraries (note we use OpenAI directly here, but you can replace with Azure OpenAI as needed)
from openai import OpenAI
# Azure Identity and Credentials
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import HttpResponseError
# Azure Search Documents
from azure.search.documents import SearchClient, SearchIndexingBufferedSender
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import (
VectorizedQuery
)
from azure.search.documents.indexes.models import (
HnswAlgorithmConfiguration,
HnswParameters,
SearchField,
SearchableField,
SearchFieldDataType,
SearchIndex,
SimpleField,
VectorSearch,
VectorSearchAlgorithmKind,
VectorSearchAlgorithmMetric,
VectorSearchProfile,
)
# Azure Management Clients
from azure.mgmt.search import SearchManagementClient
from azure.mgmt.resource import ResourceManagementClient, SubscriptionClient
from azure.mgmt.storage import StorageManagementClient
```
## Configure OpenAI settings
Before going through this section, make sure you have your OpenAI API key.
```python
openai_api_key = os.environ.get("OPENAI_API_KEY", "") # Saving this as a variable to reference in function app in later step
openai_client = OpenAI(api_key=openai_api_key)
embeddings_model = "text-embedding-3-small" # We'll use this by default, but you can change to text-embedding-3-large if desired
```
## Configure Azure AI Search Settings
You can locate your Azure AI Search service details in the Azure Portal or programmatically via the [Search Management SDK](https://learn.microsoft.com/rest/api/searchmanagement/).
#### Prerequisites:
- Subscription ID from Azure
- Resource Group name from Azure
- Region in Azure
```python
# Update the below with your values
subscription_id=""
resource_group=""
## Make sure to choose a region that supports the proper products. We've defaulted to "eastus" below. https://azure.microsoft.com/en-us/explore/global-infrastructure/products-by-region/#products-by-region_tab5
region = "eastus"
credential = InteractiveBrowserCredential()
subscription_client = SubscriptionClient(credential)
subscription = next(subscription_client.subscriptions.list())
```
#### Create and Configure Azure AI Search Service
Below we'll generate a unique name for the search service, set up the service properties, and create the search service.
```python
# Initialize the SearchManagementClient with the provided credentials and subscription ID
search_management_client = SearchManagementClient(
credential=credential,
subscription_id=subscription_id,
)
# Generate a unique name for the search service using UUID, but you can change this if you'd like.
generated_uuid = str(uuid.uuid4())
search_service_name = "search-service-gpt-demo" + generated_uuid
## The below is the default endpoint structure that is created when you create a search service. This may differ based on your Azure settings.
search_service_endpoint = 'https://'+search_service_name+'.search.windows.net'
# Create or update the search service with the specified parameters
response = search_management_client.services.begin_create_or_update(
resource_group_name=resource_group,
search_service_name=search_service_name,
service={
"location": region,
"properties": {"hostingMode": "default", "partitionCount": 1, "replicaCount": 1},
# We are using the free pricing tier for this demo. You are only allowed one free search service per subscription.
"sku": {"name": "free"},
"tags": {"app-name": "Search service demo"},
},
).result()
# Convert the response to a dictionary and then to a pretty-printed JSON string
response_dict = response.as_dict()
response_json = json.dumps(response_dict, indent=4)
print(response_json)
print("Search Service Name:" + search_service_name)
print("Search Service Endpoint:" + search_service_endpoint)
```
#### Get the Search Service API Key
Now that we have the search service up and running, we need the [Search Service API Key](https://learn.microsoft.com/en-us/azure/search/search-security-api-keys?tabs=rest-use,portal-find,portal-query), which we'll use to initiate the index creation, and later to execute the search.
```python
# Retrieve the admin keys for the search service
try:
response = search_management_client.admin_keys.get(
resource_group_name=resource_group,
search_service_name=search_service_name,
)
# Extract the primary API key from the response and save as a variable to be used later
search_service_api_key = response.primary_key
print("Successfully retrieved the API key.")
except Exception as e:
print(f"Failed to retrieve the API key: {e}")
```
# Prepare data
We're going to embed and store a few pages of the OpenAI docs in the oai_docs folder. We'll first embed each, add it to a CSV, and then use that CSV to upload to the index.
In order to handle longer text files beyond the context of 8191 tokens, we can either use the chunk embeddings separately, or combine them in some way, such as averaging (weighted by the size of each chunk).
We will take a function from Python's own cookbook that breaks up a sequence into chunks.
```python
def batched(iterable, n):
"""Batch data into tuples of length n. The last batch may be shorter."""
# batched('ABCDEFG', 3) --> ABC DEF G
if n < 1:
raise ValueError('n must be at least one')
it = iter(iterable)
while (batch := tuple(islice(it, n))):
yield batch
```
Now we define a function that encodes a string into tokens and then breaks it up into chunks. We'll use tiktoken, a fast open-source tokenizer by OpenAI.
To read more about counting tokens with Tiktoken, check out [this cookbook](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken).
```python
def chunked_tokens(text, chunk_length, encoding_name='cl100k_base'):
# Get the encoding object for the specified encoding name. tiktoken supports several encodings; 'cl100k_base' is the encoding used by OpenAI's text-embedding-3 models (and GPT-4-era models), which is why we default to it here.
encoding = tiktoken.get_encoding(encoding_name)
# Encode the input text into tokens
tokens = encoding.encode(text)
# Create an iterator that yields chunks of tokens of the specified length
chunks_iterator = batched(tokens, chunk_length)
# Yield each chunk from the iterator
yield from chunks_iterator
```
Finally, we can write a function that safely handles embedding requests, even when the input text is longer than the maximum context length, by chunking the input tokens and embedding each chunk individually. It returns the list of chunk embeddings along with the corresponding text chunks; if you need a single vector per document, you can combine the chunk embeddings yourself, for example with an average weighted by chunk size (a sketch of this follows the embedding helpers below).
> Note: there are other, more sophisticated techniques you can take here, including:
> - using GPT-4o to capture images/chart descriptions for embedding.
> - keeping text overlap between the chunks to minimize cutting off important context.
> - chunking based on paragraphs or sections.
> - adding more descriptive metadata about each article.
```python
## Change the below based on your model. The values below are for the latest OpenAI embedding models, so you can leave them as is unless you are using a different embedding model.
EMBEDDING_CTX_LENGTH = 8191
EMBEDDING_ENCODING='cl100k_base'
```
```python
def generate_embeddings(text, model):
# Generate embeddings for the provided text using the specified model
embeddings_response = openai_client.embeddings.create(model=model, input=text)
# Extract the embedding data from the response
embedding = embeddings_response.data[0].embedding
return embedding
def len_safe_get_embedding(text, model=embeddings_model, max_tokens=EMBEDDING_CTX_LENGTH, encoding_name=EMBEDDING_ENCODING):
# Initialize lists to store embeddings and corresponding text chunks
chunk_embeddings = []
chunk_texts = []
# Iterate over chunks of tokens from the input text
for chunk in chunked_tokens(text, chunk_length=max_tokens, encoding_name=encoding_name):
# Generate embeddings for each chunk and append to the list
chunk_embeddings.append(generate_embeddings(chunk, model=model))
# Decode the chunk back to text and append to the list
chunk_texts.append(tiktoken.get_encoding(encoding_name).decode(chunk))
# Return the list of chunk embeddings and the corresponding text chunks
return chunk_embeddings, chunk_texts
```
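If you want a single vector per document rather than one per chunk, the chunk embeddings returned above can be combined with a length-weighted average, as mentioned earlier; a minimal sketch (this helper is optional and not used in the rest of the notebook):
```python
import numpy as np

def combine_chunk_embeddings(chunk_embeddings, chunk_texts):
    # Weight each chunk embedding by the length of its text so longer chunks
    # contribute proportionally more to the combined document vector.
    weights = np.array([len(text) for text in chunk_texts], dtype=float)
    vectors = np.array(chunk_embeddings, dtype=float)
    combined = np.average(vectors, axis=0, weights=weights)
    # Normalize to unit length so it behaves like the individual embeddings.
    return (combined / np.linalg.norm(combined)).tolist()
```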
Next, we can define a helper function that will capture additional metadata about the documents. This metadata is useful as a filter for search queries and for capturing richer data about each document.
In this example, I'll choose from a list of categories to use later on in a metadata filter.
```python
## These are the categories I will be using for the categorization task. You can change these as needed based on your use case.
categories = ['authentication','models','techniques','tools','setup','billing_limits','other']
def categorize_text(text, categories):
# Create a prompt for categorization
messages = [
{"role": "system", "content": f"""You are an expert in LLMs, and you will be given text that corresponds to an article in OpenAI's documentation.
Categorize the document into one of these categories: {', '.join(categories)}. Only respond with the category name and nothing else."""},
{"role": "user", "content": text}
]
try:
# Call the OpenAI API to categorize the text
response = openai_client.chat.completions.create(
model="gpt-4o",
messages=messages
)
# Extract the category from the response
category = response.choices[0].message.content
return category
except Exception as e:
print(f"Error categorizing text: {str(e)}")
return None
```
Now, we can define some helper functions to process the .txt files in the oai_docs folder within the data folder. You can use this with your own data as well; it supports both .txt and .pdf files.
```python
def extract_text_from_pdf(pdf_path):
# Initialize the PDF reader
reader = PdfReader(pdf_path)
text = ""
# Iterate through each page in the PDF and extract text
for page in reader.pages:
text += page.extract_text()
return text
def process_file(file_path, idx, categories, embeddings_model):
file_name = os.path.basename(file_path)
print(f"Processing file {idx + 1}: {file_name}")
# Read text content from .txt files
if file_name.endswith('.txt'):
with open(file_path, 'r', encoding='utf-8') as file:
text = file.read()
# Extract text content from .pdf files
elif file_name.endswith('.pdf'):
text = extract_text_from_pdf(file_path)
title = file_name
# Generate embeddings for the title
title_vectors, title_text = len_safe_get_embedding(title, embeddings_model)
print(f"Generated title embeddings for {file_name}")
# Generate embeddings for the content
content_vectors, content_text = len_safe_get_embedding(text, embeddings_model)
print(f"Generated content embeddings for {file_name}")
category = categorize_text(' '.join(content_text), categories)
print(f"Categorized {file_name} as {category}")
# Prepare the data to be appended
data = []
for i, content_vector in enumerate(content_vectors):
data.append({
"id": f"{idx}_{i}",
"vector_id": f"{idx}_{i}",
"title": title_text[0],
"text": content_text[i],
"title_vector": json.dumps(title_vectors[0]), # Assuming title is short and has only one chunk
"content_vector": json.dumps(content_vector),
"category": category
})
print(f"Appended data for chunk {i + 1}/{len(content_vectors)} of {file_name}")
return data
```
We'll now use this helper function to process our OpenAI documentation. Feel free to update this to use your own data by changing `folder_name` below.
Note that this will process the documents in the chosen folder concurrently, so this should take <30 seconds if using txt files, and slightly longer if using PDFs.
```python
## Customize the location below if you are using different data besides the OpenAI documentation. Note that if you are using a different dataset, you will need to update the categories list as well.
folder_name = "../../../data/oai_docs"
files = [os.path.join(folder_name, f) for f in os.listdir(folder_name) if f.endswith('.txt') or f.endswith('.pdf')]
data = []
# Process each file concurrently
with concurrent.futures.ThreadPoolExecutor() as executor:
futures = {executor.submit(process_file, file_path, idx, categories, embeddings_model): idx for idx, file_path in enumerate(files)}
for future in concurrent.futures.as_completed(futures):
try:
result = future.result()
data.extend(result)
except Exception as e:
print(f"Error processing file: {str(e)}")
# Write the data to a CSV file
csv_file = os.path.join("..", "embedded_data.csv")
with open(csv_file, 'w', newline='', encoding='utf-8') as csvfile:
fieldnames = ["id", "vector_id", "title", "text", "title_vector", "content_vector","category"]
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for row in data:
writer.writerow(row)
print(f"Wrote row with id {row['id']} to CSV")
# Convert the CSV file to a Dataframe
article_df = pd.read_csv("../embedded_data.csv")
# Read vectors from strings back into a list using json.loads
article_df["title_vector"] = article_df.title_vector.apply(json.loads)
article_df["content_vector"] = article_df.content_vector.apply(json.loads)
article_df["vector_id"] = article_df["vector_id"].apply(str)
article_df["category"] = article_df["category"].apply(str)
article_df.head()
```
We now have an `embedded_data.csv` file with seven columns that we can upload to our vector database!
# Create Azure AI Vector Search
## Create index
We'll define and create a search index using the `SearchIndexClient` from the Azure AI Search Python SDK. The index incorporates both vector search and hybrid search capabilities. For more details, visit Microsoft's documentation on how to [Create a Vector Index](https://learn.microsoft.com/azure/search/vector-search-how-to-create-index?tabs=config-2023-11-01%2Crest-2023-11-01%2Cpush%2Cportal-check-index).
```python
index_name = "azure-ai-search-openai-cookbook-demo"
# index_name = ""
index_client = SearchIndexClient(
endpoint=search_service_endpoint, credential=AzureKeyCredential(search_service_api_key)
)
# Define the fields for the index. Update these based on your data.
# Each field represents a column in the search index
fields = [
SimpleField(name="id", type=SearchFieldDataType.String), # Simple string field for document ID
SimpleField(name="vector_id", type=SearchFieldDataType.String, key=True), # Key field for the index
# SimpleField(name="url", type=SearchFieldDataType.String), # URL field (commented out)
SearchableField(name="title", type=SearchFieldDataType.String), # Searchable field for document title
SearchableField(name="text", type=SearchFieldDataType.String), # Searchable field for document text
SearchField(
name="title_vector",
type=SearchFieldDataType.Collection(SearchFieldDataType.Single), # Collection of single values for title vector
vector_search_dimensions=1536, # Number of dimensions in the vector
vector_search_profile_name="my-vector-config", # Profile name for vector search configuration
),
SearchField(
name="content_vector",
type=SearchFieldDataType.Collection(SearchFieldDataType.Single), # Collection of single values for content vector
vector_search_dimensions=1536, # Number of dimensions in the vector
vector_search_profile_name="my-vector-config", # Profile name for vector search configuration
),
SearchableField(name="category", type=SearchFieldDataType.String, filterable=True), # Searchable field for document category
]
# This configuration defines the algorithm and parameters for vector search
vector_search = VectorSearch(
algorithms=[
HnswAlgorithmConfiguration(
name="my-hnsw", # Name of the HNSW algorithm configuration
kind=VectorSearchAlgorithmKind.HNSW, # Type of algorithm
parameters=HnswParameters(
m=4, # Number of bi-directional links created for every new element
ef_construction=400, # Size of the dynamic list for the nearest neighbors during construction
ef_search=500, # Size of the dynamic list for the nearest neighbors during search
metric=VectorSearchAlgorithmMetric.COSINE, # Distance metric used for the search
),
)
],
profiles=[
VectorSearchProfile(
name="my-vector-config", # Name of the vector search profile
algorithm_configuration_name="my-hnsw", # Reference to the algorithm configuration
)
],
)
# Create the search index with the vector search configuration
# This combines all the configurations into a single search index
index = SearchIndex(
name=index_name, # Name of the index
fields=fields, # Fields defined for the index
vector_search=vector_search # Vector search configuration
)
# Create or update the index
# This sends the index definition to the Azure Search service
result = index_client.create_index(index)
print(f"{result.name} created") # Output the name of the created index
```
## Upload Data
Now we'll upload the articles from above that we've stored in `embedded_data.csv` from a pandas DataFrame to an Azure AI Search index. For a detailed guide on data import strategies and best practices, refer to [Data Import in Azure AI Search](https://learn.microsoft.com/azure/search/search-what-is-data-import).
```python
# Convert the 'id' and 'vector_id' columns to string so one of them can serve as our key field
article_df["id"] = article_df["id"].astype(str)
article_df["vector_id"] = article_df["vector_id"].astype(str)
# Convert the DataFrame to a list of dictionaries
documents = article_df.to_dict(orient="records")
# Log the number of documents to be uploaded
print(f"Number of documents to upload: {len(documents)}")
# Create a SearchIndexingBufferedSender
batch_client = SearchIndexingBufferedSender(
search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key)
)
# Get the first document to check its schema
first_document = documents[0]
# Get the index schema
index_schema = index_client.get_index(index_name)
# Get the field names from the index schema
index_fields = {field.name: field.type for field in index_schema.fields}
# Check each field in the first document
for field, value in first_document.items():
if field not in index_fields:
print(f"Field '{field}' is not in the index schema.")
# Check for any fields in the index schema that are not in the documents
for field in index_fields:
if field not in first_document:
print(f"Field '{field}' is in the index schema but not in the documents.")
try:
if documents:
# Add upload actions for all documents in a single call
upload_result = batch_client.upload_documents(documents=documents)
# Check if the upload was successful
# Manually flush to send any remaining documents in the buffer
batch_client.flush()
print(f"Uploaded {len(documents)} documents in total")
else:
print("No documents to upload.")
except HttpResponseError as e:
print(f"An error occurred: {e}")
raise # Re-raise the exception to ensure it errors out
finally:
# Clean up resources
batch_client.close()
```
## Test search
Now that the data is uploaded, we'll test both vector similarity search and hybrid search locally below to make sure everything is working as expected.
You can test both a pure vector search and a hybrid search. Pure vector search passes `None` to `search_text` below and searches only on vector similarity. Hybrid search combines traditional keyword-based search (by passing the query text `query` to `search_text`) with vector-based similarity search, to provide more relevant and contextual results.
```python
query = "What model should I use to embed?"
# Note: we'll have the GPT choose the category automatically once we put it in ChatGPT
category ="models"
search_client = SearchClient(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key))
vector_query = VectorizedQuery(vector=generate_embeddings(query, embeddings_model), k_nearest_neighbors=3, fields="content_vector")
results = search_client.search(
search_text=None, # Pass in None if you want to use pure vector search, and `query` if you want to use hybrid search
vector_queries= [vector_query],
select=["title", "text"],
filter=f"category eq '{category}'"
)
for result in results:
print(result)
```
## Create Azure Function
Azure Functions are an easy way to build an API on top of our new AI search. Our code (see the `function_app.py` file in this folder, or linked [here](https://github.com/openai/openai-cookbook/blob/main/examples/chatgpt/rag-quickstart/azure/function_app.py)) does the following:
1. Takes as input the user's query, the search index endpoint, the index name, the k_nearest_neighbors*, the search column to use (either content_vector or title_vector), and whether it should use a hybrid query
2. Takes the user's query and embeds it.
3. Conducts a vector search and retrieves relevant text chunks.
4. Returns those relevant text chunks as the response body.
*In the context of vector search, k_nearest_neighbors specifies the number of "closest" vectors (in terms of cosine similarity) that the search should return. For example, if k_nearest_neighbors is set to 3, the search will return the 3 vectors in the index that are most similar to the query vector.
> Note that this Azure Function _does not have any authentication_. However, you can set authentication on it by following the docs [here](https://learn.microsoft.com/en-us/azure/azure-functions/security-concepts?tabs=v4).
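The actual implementation lives in the linked `function_app.py`; the sketch below only illustrates the general shape of such a function using the Azure Functions Python v2 programming model, and the route name, request fields, and defaults are assumptions rather than the real interface:
```python
import json
import os

import azure.functions as func
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import OpenAI

app = func.FunctionApp()

@app.route(route="vector_similarity_search", auth_level=func.AuthLevel.ANONYMOUS)
def vector_similarity_search(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    query = body["query"]

    # 1) Embed the user's query with the same model used at indexing time.
    openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    embedding = openai_client.embeddings.create(
        model=os.environ["EMBEDDINGS_MODEL"], input=query
    ).data[0].embedding

    # 2) Run a vector (or hybrid) search against the index.
    search_client = SearchClient(
        body["search_service_endpoint"],
        body["index_name"],
        AzureKeyCredential(os.environ["SEARCH_SERVICE_API_KEY"]),
    )
    vector_query = VectorizedQuery(
        vector=embedding,
        k_nearest_neighbors=int(body.get("k_nearest_neighbors", 3)),
        fields=body.get("search_column", "content_vector"),
    )
    results = search_client.search(
        search_text=query if body.get("use_hybrid_query") else None,
        vector_queries=[vector_query],
        select=["title", "text"],
    )

    # 3) Return the relevant text chunks as the response body.
    chunks = [{"title": r["title"], "text": r["text"]} for r in results]
    return func.HttpResponse(json.dumps({"results": chunks}), mimetype="application/json")
```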
### Create storage account
We can create a new storage account using the code below, but feel free to skip that block and modify the subsequent steps to use an existing storage account. This may take up to 30 seconds.
```python
## Update below with a different name
storage_account_name = ""
## Use below SKU or any other SKU as per your requirement
sku = "Standard_LRS"
resource_client = ResourceManagementClient(credential, subscription_id)
storage_client = StorageManagementClient(credential, subscription_id)
# Create resource group if it doesn't exist
rg_result = resource_client.resource_groups.create_or_update(resource_group, {"location": region})
# Create storage account
storage_async_operation = storage_client.storage_accounts.begin_create(
resource_group,
storage_account_name,
{
"sku": {"name": sku},
"kind": "StorageV2",
"location": region,
},
)
storage_account = storage_async_operation.result()
print(f"Storage account {storage_account.name} created")
```
### Create Function App
This Function App is where the python code will execute once it is triggered via a GPT Action. To read more about Function Apps, see the docs [here](https://learn.microsoft.com/en-us/azure/azure-functions/functions-overview?pivots=programming-language-csharp).
To deploy Function Apps, we'll need to use the Azure CLI and Azure Functions Core Tools.
> The code below attempts to install and run these tools in your virtual environment based on your platform. If that does not work, follow the Azure documentation to install the [Azure Functions Core Tools](https://learn.microsoft.com/en-us/azure/azure-functions/create-first-function-cli-python?tabs=linux,bash,azure-cli,browser) and the [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli), then run the `subprocess.run` commands below in your terminal after navigating to this folder.
First, we'll make sure the relevant tools are available in the environment so we can run the necessary Azure commands. This may take a few minutes to install.
```python
os_type = platform.system()
if os_type == "Windows":
# Install Azure Functions Core Tools on Windows
subprocess.run(["npm", "install", "-g", "azure-functions-core-tools@3", "--unsafe-perm", "true"], check=True)
# Install Azure CLI on Windows
subprocess.run(["powershell", "-Command", "Invoke-WebRequest -Uri https://aka.ms/installazurecliwindows -OutFile .\\AzureCLI.msi; Start-Process msiexec.exe -ArgumentList '/I AzureCLI.msi /quiet' -Wait"], check=True)
elif os_type == "Darwin": # MacOS
# Install Azure Functions Core Tools on MacOS
if platform.machine() == 'arm64':
# For M1 Macs
subprocess.run(["arch", "-arm64", "brew", "install", "azure-functions-core-tools@3"], check=True)
else:
# For Intel Macs
subprocess.run(["brew", "install", "azure-functions-core-tools@3"], check=True)
# Install Azure CLI on MacOS
subprocess.run(["brew", "update"], check=True)
subprocess.run(["brew", "install", "azure-cli"], check=True)
elif os_type == "Linux":
    # Install Azure Functions Core Tools on Linux
    # (use a single shell string so the pipe and redirection work with shell=True)
    subprocess.run("curl https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > microsoft.gpg", check=True, shell=True)
    subprocess.run(["sudo", "mv", "microsoft.gpg", "/etc/apt/trusted.gpg.d/microsoft.gpg"], check=True)
    subprocess.run('sudo sh -c \'echo "deb [arch=amd64] https://packages.microsoft.com/repos/microsoft-ubuntu-$(lsb_release -cs)-prod $(lsb_release -cs) main" > /etc/apt/sources.list.d/dotnetdev.list\'', check=True, shell=True)
    subprocess.run(["sudo", "apt-get", "update"], check=True)
    subprocess.run(["sudo", "apt-get", "install", "-y", "azure-functions-core-tools-3"], check=True)
    # Install Azure CLI on Linux
    subprocess.run("curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash", check=True, shell=True)
else:
# Raise an error if the operating system is not supported
raise OSError("Unsupported operating system")
# Verify the installation of Azure Functions Core Tools
subprocess.run(["func", "--version"], check=True)
# Verify the installation of Azure CLI
subprocess.run(["az", "--version"], check=True)
subprocess.run([
"az", "login"
], check=True)
```
Now, we need to create a `local.settings.json` file with our key environment variables for Azure.
```python
local_settings_content = f"""
{{
"IsEncrypted": false,
"Values": {{
"AzureWebJobsStorage": "UseDevelopmentStorage=true",
"FUNCTIONS_WORKER_RUNTIME": "python",
"OPENAI_API_KEY": "{openai_api_key}",
"EMBEDDINGS_MODEL": "{embeddings_model}",
"SEARCH_SERVICE_API_KEY": "{search_service_api_key}",
}}
}}
"""
with open("local.settings.json", "w") as file:
file.write(local_settings_content)
```
Check the `local.settings.json` file and make sure that the environment variables match what you expect.
Now, give your app a name below, and you are ready to create your Function App and then publish your function.
```python
# Replace this with your own value. This name will appear in the URL of the API call: https://<app_name>.azurewebsites.net
app_name = ""
subprocess.run([
"az", "functionapp", "create",
"--resource-group", resource_group,
"--consumption-plan-location", region,
"--runtime", "python",
"--name", app_name,
"--storage-account", storage_account_name,
"--os-type", "Linux",
], check=True)
```
Once we've created the Function App, we want to add the configuration variables it needs. Specifically, we need `OPENAI_API_KEY`, `SEARCH_SERVICE_API_KEY`, and `EMBEDDINGS_MODEL`, as these are all used in the `function_app.py` code.
```python
# Collect the relevant environment variables
env_vars = {
"OPENAI_API_KEY": openai_api_key,
"SEARCH_SERVICE_API_KEY": search_service_api_key,
"EMBEDDINGS_MODEL": embeddings_model
}
# Create the settings arguments for the az functionapp config appsettings set command
settings_args = []
for key, value in env_vars.items():
settings_args.append(f"{key}={value}")
subprocess.run([
"az", "functionapp", "config", "appsettings", "set",
"--name", app_name,
"--resource-group", resource_group,
"--settings", *settings_args
], check=True)
```
We are now ready to publish the function code in `function_app.py` to the Azure Function App. This may take up to 10 minutes to deploy. Once it finishes, we'll have an API endpoint built on an Azure Function on top of Azure AI Search.
```python
subprocess.run([
"func", "azure", "functionapp", "publish", app_name
], check=True)
```
## Input in a Custom GPT in ChatGPT
Now that we have an Azure Function that queries this vector search index, let's wire it up as a GPT Action in a Custom GPT!
See the documentation [here](https://openai.com/index/introducing-gpts/) on GPTs and [here](https://platform.openai.com/docs/actions) on GPT Actions. Use the outputs below as the instructions for the GPT and as the OpenAPI spec for the GPT Action.
### Create OpenAPI Spec
Below is a sample OpenAPI spec. When we run the block below, a working spec is copied to the clipboard so you can paste it into the GPT Action.
Note that this does not have any authentication by default, but you can set up Azure Functions with OAuth by following the pattern in [this cookbook](https://cookbook.openai.com/examples/chatgpt/gpt_actions_library/gpt_middleware_azure_function#part-2-set-up-auth) in the Authentication section or looking at the documentation [here](https://learn.microsoft.com/en-us/azure/app-service/overview-authentication-authorization).
```python
spec = f"""
openapi: 3.1.0
info:
title: Vector Similarity Search API
description: API for performing vector similarity search.
version: 1.0.0
servers:
- url: https://{app_name}.azurewebsites.net/api
description: Main (production) server
paths:
/vector_similarity_search:
post:
operationId: vectorSimilaritySearch
summary: Perform a vector similarity search.
requestBody:
required: true
content:
application/json:
schema:
type: object
properties:
search_service_endpoint:
type: string
description: The endpoint of the search service.
index_name:
type: string
description: The name of the search index.
query:
type: string
description: The search query.
k_nearest_neighbors:
type: integer
description: The number of nearest neighbors to return.
search_column:
type: string
description: The name of the search column.
use_hybrid_query:
type: boolean
description: Whether to use a hybrid query.
category:
type: string
description: category to filter.
required:
- search_service_endpoint
- index_name
- query
- k_nearest_neighbors
- search_column
- use_hybrid_query
responses:
'200':
description: A successful response with the search results.
content:
application/json:
schema:
type: object
properties:
results:
type: array
items:
type: object
properties:
id:
type: string
description: The identifier of the result item.
score:
type: number
description: The similarity score of the result item.
content:
type: object
description: The content of the result item.
'400':
description: Bad request due to missing or invalid parameters.
'500':
description: Internal server error.
"""
pyperclip.copy(spec)
print("OpenAPI spec copied to clipboard")
print(spec)
```
### Create GPT Instructions
Feel free to modify instructions as you see fit. Check out our docs [here](https://platform.openai.com/docs/guides/prompt-engineering) for some tips on prompt engineering.
```python
instructions = f'''
You are an OAI docs assistant. You have an action in your knowledge base where you can make a POST request to search for information. The POST request should always include: {{
"search_service_endpoint": "{search_service_endpoint}",
"index_name": {index_name},
"query": "",
"k_nearest_neighbors": 1,
"search_column": "content_vector",
"use_hybrid_query": true,
"category": ""
}}. Only the query and category change based on the user's request. Your goal is to assist users by performing searches using this POST request and providing them with relevant information based on the query.
You must only include knowledge you get from your action in your response.
The category must be from the following list: {categories}, which you should determine based on the user's query. If you cannot determine, then do not include the category in the POST request.
'''
pyperclip.copy(instructions)
print("GPT Instructions copied to clipboard")
print(instructions)
```
We now have a GPT that queries a vector database!
# Recap
We've now successfully integrated Azure AI Search with GPT Actions in ChatGPT by doing the following:
1. Embedded our documents using OpenAI's embeddings, while adding some additional metadata using gpt-4o.
2. Uploaded that data to Azure AI Search.
3. Created an endpoint to query it using Azure Functions.
4. Incorporated it into a Custom GPT.
Our GPT can now retrieve information to help answer user queries, making it much more accurate and customized to our data.
---
# Source: https://developers.openai.com/resources/guide/background-mode-guide.md
# Background mode guide
> Guide to running tasks in the background with Responses.
- Type: Guide
- Tags: responses
- URL: https://platform.openai.com/docs/guides/background
- Created: 2025-07-22
- Updated: 2025-08-13
## Summary
Shows how to handle long-running actions asynchronously. — Responses API, tools, function calling
## Details
Covers patterns for deferring work and delivering results later.
---
# Source: https://developers.openai.com/resources/video/balancing-accuracy-latency-cost-video.md
# Balance accuracy, latency, and cost
> Talk on optimizing AI systems for accuracy, speed, and cost.
- Type: Video
- Tags: optimization
- URL: https://www.youtube.com/watch?v=Bx6sUDRMx-8
- Created: 2025-07-21
- Updated: 2025-08-13
## Summary
Shares strategies for achieving the right trade-offs between quality, performance, and expenses. — latency, cost
## Details
Covers practical approaches to scale models efficiently while maintaining desired accuracy and responsiveness.
---
# Source: https://developers.openai.com/resources/guide/batch-api-guide.md
# Batch API guide
> Guide on how to use the Batch API to reduce costs
- Type: Guide
- Tags: tools, search
- URL: https://platform.openai.com/docs/guides/batch
- Created: 2025-07-22
- Updated: 2025-08-13
## Summary
Describes how to use the Batch API to reduce costs
## Details
Provides instructions for enabling the Batch API within your applications.
---
# Source: https://developers.openai.com/cookbook/examples/batch_processing.md
# Batch processing with the Batch API
The new Batch API allows you to **create async batch jobs for a lower price and with higher rate limits**.
Batches will be completed within 24h, but may be processed sooner depending on global usage.
Ideal use cases for the Batch API include:
- Tagging, captioning, or enriching content on a marketplace or blog
- Categorizing and suggesting answers for support tickets
- Performing sentiment analysis on large datasets of customer feedback
- Generating summaries or translations for collections of documents or articles
and much more!
This cookbook will walk you through how to use the Batch API with a couple of practical examples.
We will start with an example to categorize movies using `gpt-4o-mini`, and then cover how we can use the vision capabilities of this model to caption images.
Please note that multiple models are available through the Batch API, and that you can use the same parameters in your Batch API calls as with the Chat Completions endpoint.
## Setup
```python
# Make sure you have the latest version of the SDK available to use the Batch API
%pip install openai --upgrade
```
```python
import json
from openai import OpenAI
import pandas as pd
from IPython.display import Image, display
```
```python
# Initializing OpenAI client - see https://platform.openai.com/docs/quickstart?context=python
client = OpenAI()
```
## First example: Categorizing movies
In this example, we will use `gpt-4o-mini` to extract movie categories from a description of the movie. We will also extract a 1-sentence summary from this description.
We will use [JSON mode](https://platform.openai.com/docs/guides/text-generation/json-mode) to extract categories as an array of strings and the 1-sentence summary in a structured format.
For each movie, we want to get a result that looks like this:
```
{
categories: ['category1', 'category2', 'category3'],
summary: '1-sentence summary'
}
```
### Loading data
We will use the IMDB top 1000 movies dataset for this example.
```python
dataset_path = "data/imdb_top_1000.csv"
df = pd.read_csv(dataset_path)
df.head()
```
| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28,341,469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134,966,411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534,858,444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57,300,000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4,360,000 |
### Processing step
Here, we will prepare our requests by first trying them out with the Chat Completions endpoint.
Once we're happy with the results, we can move on to creating the batch file.
```python
categorize_system_prompt = '''
Your goal is to extract movie categories from movie descriptions, as well as a 1-sentence summary for these movies.
You will be provided with a movie description, and you will output a json object containing the following information:
{
categories: string[] // Array of categories based on the movie description,
summary: string // 1-sentence summary of the movie based on the movie description
}
Categories refer to the genre or type of the movie, like "action", "romance", "comedy", etc. Keep category names simple and use only lower case letters.
Movies can have several categories, but try to keep it under 3-4. Only mention the categories that are the most obvious based on the description.
'''
def get_categories(description):
response = client.chat.completions.create(
model="gpt-4o-mini",
temperature=0.1,
# This is to enable JSON mode, making sure responses are valid json objects
response_format={
"type": "json_object"
},
messages=[
{
"role": "system",
"content": categorize_system_prompt
},
{
"role": "user",
"content": description
}
],
)
return response.choices[0].message.content
```
```python
# Testing on a few examples
for _, row in df[:5].iterrows():
description = row['Overview']
title = row['Series_Title']
result = get_categories(description)
print(f"TITLE: {title}\nOVERVIEW: {description}\n\nRESULT: {result}")
print("\n\n----------------------------\n\n")
```
```text
TITLE: The Shawshank Redemption
OVERVIEW: Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.
RESULT: {
"categories": ["drama"],
"summary": "Two imprisoned men develop a deep bond over the years, ultimately finding redemption through their shared acts of kindness."
}
----------------------------
TITLE: The Godfather
OVERVIEW: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.
RESULT: {
"categories": ["crime", "drama"],
"summary": "An aging crime lord hands over his empire to his hesitant son."
}
----------------------------
TITLE: The Dark Knight
OVERVIEW: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.
RESULT: {
"categories": ["action", "thriller", "superhero"],
"summary": "Batman faces a formidable challenge as the Joker unleashes chaos on Gotham City."
}
----------------------------
TITLE: The Godfather: Part II
OVERVIEW: The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands and tightens his grip on the family crime syndicate.
RESULT: {
"categories": ["crime", "drama"],
"summary": "The film depicts the early life of Vito Corleone and the rise of his son Michael within the family crime syndicate in 1920s New York City."
}
----------------------------
TITLE: 12 Angry Men
OVERVIEW: A jury holdout attempts to prevent a miscarriage of justice by forcing his colleagues to reconsider the evidence.
RESULT: {
"categories": ["drama", "thriller"],
"summary": "A jury holdout fights to ensure justice is served by challenging his fellow jurors to reevaluate the evidence."
}
----------------------------
```
### Creating the batch file
The batch file, in the `jsonl` format, should contain one line (json object) per request.
Each request is defined as such:
```
{
"custom_id": ,
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": ,
"messages": ,
// other parameters
}
}
```
Note: the request ID should be unique per batch. You can use it to match results to the initial input file, as requests will not be returned in the same order.
```python
# Creating an array of json tasks
tasks = []
for index, row in df.iterrows():
description = row['Overview']
task = {
"custom_id": f"task-{index}",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
# This is what you would have in your Chat Completions API call
"model": "gpt-4o-mini",
"temperature": 0.1,
"response_format": {
"type": "json_object"
},
"messages": [
{
"role": "system",
"content": categorize_system_prompt
},
{
"role": "user",
"content": description
}
],
}
}
tasks.append(task)
```
```python
# Creating the file
file_name = "data/batch_tasks_movies.jsonl"
with open(file_name, 'w') as file:
for obj in tasks:
file.write(json.dumps(obj) + '\n')
```
### Uploading the file
```python
batch_file = client.files.create(
file=open(file_name, "rb"),
purpose="batch"
)
```
```python
print(batch_file)
```
```text
FileObject(id='file-lx16f1KyIxQ2UHVvkG3HLfNR', bytes=1127310, created_at=1721144107, filename='batch_tasks_movies.jsonl', object='file', purpose='batch', status='processed', status_details=None)
```
### Creating the batch job
```python
batch_job = client.batches.create(
input_file_id=batch_file.id,
endpoint="/v1/chat/completions",
completion_window="24h"
)
```
### Checking batch status
Note: this can take up to 24h, but it will usually be completed faster.
You can continue checking until the status is 'completed'.
```python
batch_job = client.batches.retrieve(batch_job.id)
print(batch_job)
```
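If you'd rather wait programmatically than re-run the cell by hand, a small polling loop works. The sketch below is not part of the original notebook; it simply re-checks the job every minute until it reaches a terminal state.
```python
import time

# Poll the batch job until it reaches a terminal state
while batch_job.status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)  # wait a minute between checks
    batch_job = client.batches.retrieve(batch_job.id)
    print(f"Batch status: {batch_job.status}")
```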
### Retrieving results
```python
result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content
```
```python
result_file_name = "data/batch_job_results_movies.jsonl"
with open(result_file_name, 'wb') as file:
file.write(result)
```
```python
# Loading data from saved file
results = []
with open(result_file_name, 'r') as file:
for line in file:
# Parsing the JSON string into a dict and appending to the list of results
json_object = json.loads(line.strip())
results.append(json_object)
```
### Reading results
Reminder: the results are not in the same order as in the input file.
Make sure to check the `custom_id` to match the results against the input requests.
```python
# Reading only the first results
for res in results[:5]:
task_id = res['custom_id']
# Getting index from task id
index = task_id.split('-')[-1]
result = res['response']['body']['choices'][0]['message']['content']
movie = df.iloc[int(index)]
description = movie['Overview']
title = movie['Series_Title']
print(f"TITLE: {title}\nOVERVIEW: {description}\n\nRESULT: {result}")
print("\n\n----------------------------\n\n")
```
```text
TITLE: American Psycho
OVERVIEW: A wealthy New York City investment banking executive, Patrick Bateman, hides his alternate psychopathic ego from his co-workers and friends as he delves deeper into his violent, hedonistic fantasies.
RESULT: {
"categories": ["thriller", "psychological", "drama"],
"summary": "A wealthy investment banker in New York City conceals his psychopathic alter ego while indulging in violent and hedonistic fantasies."
}
----------------------------
TITLE: Lethal Weapon
OVERVIEW: Two newly paired cops who are complete opposites must put aside their differences in order to catch a gang of drug smugglers.
RESULT: {
"categories": ["action", "comedy", "crime"],
"summary": "An action-packed comedy about two mismatched cops teaming up to take down a drug smuggling gang."
}
----------------------------
TITLE: A Star Is Born
OVERVIEW: A musician helps a young singer find fame as age and alcoholism send his own career into a downward spiral.
RESULT: {
"categories": ["drama", "music"],
"summary": "A musician's career spirals downward as he helps a young singer find fame amidst struggles with age and alcoholism."
}
----------------------------
TITLE: From Here to Eternity
OVERVIEW: In Hawaii in 1941, a private is cruelly punished for not boxing on his unit's team, while his captain's wife and second-in-command are falling in love.
RESULT: {
"categories": ["drama", "romance", "war"],
"summary": "A drama set in Hawaii in 1941, where a private faces punishment for not boxing on his unit's team, amidst a forbidden love affair between his captain's wife and second-in-command."
}
----------------------------
TITLE: The Jungle Book
OVERVIEW: Bagheera the Panther and Baloo the Bear have a difficult time trying to convince a boy to leave the jungle for human civilization.
RESULT: {
"categories": ["adventure", "animation", "family"],
"summary": "An adventure-filled animated movie about a panther and a bear trying to persuade a boy to leave the jungle for human civilization."
}
----------------------------
```
## Second example: Captioning images
In this example, we will use `gpt-4o-mini` to caption images of furniture items.
We will use the vision capabilities of the model to analyze the images and generate the captions.
### Loading data
We will use the Amazon furniture dataset for this example.
```python
dataset_path = "data/amazon_furniture_dataset.csv"
df = pd.read_csv(dataset_path)
df.head()
```
| | asin | url | title | brand | price | availability | categories | primary_image | images | upc | ... | color | material | style | important_information | product_overview | about_item | description | specifications | uniq_id | scraped_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | B0CJHKVG6P | https://www.amazon.com/dp/B0CJHKVG6P | GOYMFK 1pc Free Standing Shoe Rack, Multi-laye... | GOYMFK | $24.99 | Only 13 left in stock - order soon. | ['Home & Kitchen', 'Storage & Organization', '... | https://m.media-amazon.com/images/I/416WaLx10j... | ['https://m.media-amazon.com/images/I/416WaLx1... | NaN | ... | White | Metal | Modern | [] | [{'Brand': ' GOYMFK '}, {'Color': ' White '}, ... | ['Multiple layers: Provides ample storage spac... | multiple shoes, coats, hats, and other items E... | ['Brand: GOYMFK', 'Color: White', 'Material: M... | 02593e81-5c09-5069-8516-b0b29f439ded | 2024-02-02 15:15:08 |
| 1 | B0B66QHB23 | https://www.amazon.com/dp/B0B66QHB23 | subrtex Leather ding Room, Dining Chairs Set o... | subrtex | NaN | NaN | ['Home & Kitchen', 'Furniture', 'Dining Room F... | https://m.media-amazon.com/images/I/31SejUEWY7... | ['https://m.media-amazon.com/images/I/31SejUEW... | NaN | ... | Black | Sponge | Black Rubber Wood | [] | NaN | ['【Easy Assembly】: Set of 2 dining room chairs... | subrtex Dining chairs Set of 2 | ['Brand: subrtex', 'Color: Black', 'Product Di... | 5938d217-b8c5-5d3e-b1cf-e28e340f292e | 2024-02-02 15:15:09 |
| 2 | B0BXRTWLYK | https://www.amazon.com/dp/B0BXRTWLYK | Plant Repotting Mat MUYETOL Waterproof Transpl... | MUYETOL | $5.98 | In Stock | ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... | https://m.media-amazon.com/images/I/41RgefVq70... | ['https://m.media-amazon.com/images/I/41RgefVq... | NaN | ... | Green | Polyethylene | Modern | [] | [{'Brand': ' MUYETOL '}, {'Size': ' 26.8*26.8 ... | ['PLANT REPOTTING MAT SIZE: 26.8" x 26.8", squ... | NaN | ['Brand: MUYETOL', 'Size: 26.8*26.8', 'Item We... | b2ede786-3f51-5a45-9a5b-bcf856958cd8 | 2024-02-02 15:15:09 |
| 3 | B0C1MRB2M8 | https://www.amazon.com/dp/B0C1MRB2M8 | Pickleball Doormat, Welcome Doormat Absorbent ... | VEWETOL | $13.99 | Only 10 left in stock - order soon. | ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... | https://m.media-amazon.com/images/I/61vz1Igler... | ['https://m.media-amazon.com/images/I/61vz1Igl... | NaN | ... | A5589 | Rubber | Modern | [] | [{'Brand': ' VEWETOL '}, {'Size': ' 16*24INCH ... | ['Specifications: 16x24 Inch ', " High-Quality... | The decorative doormat features a subtle textu... | ['Brand: VEWETOL', 'Size: 16*24INCH', 'Materia... | 8fd9377b-cfa6-5f10-835c-6b8eca2816b5 | 2024-02-02 15:15:10 |
| 4 | B0CG1N9QRC | https://www.amazon.com/dp/B0CG1N9QRC | JOIN IRON Foldable TV Trays for Eating Set of ... | JOIN IRON Store | $89.99 | Usually ships within 5 to 6 weeks | ['Home & Kitchen', 'Furniture', 'Game & Recrea... | https://m.media-amazon.com/images/I/41p4d4VJnN... | ['https://m.media-amazon.com/images/I/41p4d4VJ... | NaN | ... | Grey Set of 4 | Iron | X Classic Style | [] | NaN | ['Includes 4 Folding Tv Tray Tables And one Co... | Set of Four Folding Trays With Matching Storag... | ['Brand: JOIN IRON', 'Shape: Rectangular', 'In... | bdc9aa30-9439-50dc-8e89-213ea211d66a | 2024-02-02 15:15:11 |
5 rows × 25 columns
### Processing step
Again, we will first prepare our requests with the Chat Completions endpoint, and create the batch file afterwards.
```python
caption_system_prompt = '''
Your goal is to generate short, descriptive captions for images of items.
You will be provided with an item image and the name of that item and you will output a caption that captures the most important information about the item.
If there are multiple items depicted, refer to the name provided to understand which item you should describe.
Your generated caption should be short (1 sentence), and include only the most important information about the item.
The most important information could be: the type of item, the style (if mentioned), the material or color if especially relevant and/or any distinctive features.
Keep it short and to the point.
'''
def get_caption(img_url, title):
response = client.chat.completions.create(
model="gpt-4o-mini",
temperature=0.2,
max_tokens=300,
messages=[
{
"role": "system",
"content": caption_system_prompt
},
{
"role": "user",
"content": [
{
"type": "text",
"text": title
},
# The content type should be "image_url" to use the model's vision capabilities
{
"type": "image_url",
"image_url": {
"url": img_url
}
},
],
}
]
)
return response.choices[0].message.content
```
```python
# Testing on a few images
for _, row in df[:5].iterrows():
img_url = row['primary_image']
caption = get_caption(img_url, row['title'])
img = Image(url=img_url)
display(img)
print(f"CAPTION: {caption}\n\n")
```
```text
CAPTION: A stylish white free-standing shoe rack featuring multiple layers and eight double hooks, perfect for organizing shoes and accessories in living rooms, bathrooms, or hallways.
```
```text
CAPTION: Set of 2 black leather dining chairs featuring a sleek design with vertical stitching and sturdy wooden legs.
```
```text
CAPTION: The MUYETOL Plant Repotting Mat is a waterproof, portable, and foldable gardening work mat measuring 26.8" x 26.8", designed for easy soil changing and indoor transplanting.
```
```text
CAPTION: Absorbent non-slip doormat featuring the phrase "It's a good day to play PICKLEBALL" with paddle graphics, measuring 16x24 inches.
```
```text
CAPTION: Set of 4 foldable TV trays in grey, featuring a compact design with a stand for easy storage, perfect for small spaces.
```
### Creating the batch job
As with the first example, we will create an array of json tasks to generate a `jsonl` file and use it to create the batch job.
```python
# Creating an array of json tasks
tasks = []
for index, row in df.iterrows():
title = row['title']
img_url = row['primary_image']
task = {
"custom_id": f"task-{index}",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
# This is what you would have in your Chat Completions API call
"model": "gpt-4o-mini",
"temperature": 0.2,
"max_tokens": 300,
"messages": [
{
"role": "system",
"content": caption_system_prompt
},
{
"role": "user",
"content": [
{
"type": "text",
"text": title
},
{
"type": "image_url",
"image_url": {
"url": img_url
}
},
],
}
]
}
}
tasks.append(task)
```
```python
# Creating the file
file_name = "data/batch_tasks_furniture.jsonl"
with open(file_name, 'w') as file:
for obj in tasks:
file.write(json.dumps(obj) + '\n')
```
```python
# Uploading the file
batch_file = client.files.create(
file=open(file_name, "rb"),
purpose="batch"
)
```
```python
# Creating the job
batch_job = client.batches.create(
input_file_id=batch_file.id,
endpoint="/v1/chat/completions",
completion_window="24h"
)
```
```python
batch_job = client.batches.retrieve(batch_job.id)
print(batch_job)
```
### Getting results
As with the first example, we can retrieve results once the batch job is done.
Reminder: the results are not in the same order as in the input file.
Make sure to check the `custom_id` to match the results against the input requests.
```python
# Retrieving result file
result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content
```
```python
result_file_name = "data/batch_job_results_furniture.jsonl"
with open(result_file_name, 'wb') as file:
file.write(result)
```
```python
# Loading data from saved file
results = []
with open(result_file_name, 'r') as file:
for line in file:
# Parsing the JSON string into a dict and appending to the list of results
json_object = json.loads(line.strip())
results.append(json_object)
```
```python
# Reading only the first results
for res in results[:5]:
task_id = res['custom_id']
# Getting index from task id
index = task_id.split('-')[-1]
result = res['response']['body']['choices'][0]['message']['content']
item = df.iloc[int(index)]
img_url = item['primary_image']
img = Image(url=img_url)
display(img)
print(f"CAPTION: {result}\n\n")
```
```text
CAPTION: Brushed brass pedestal towel rack with a sleek, modern design, featuring multiple bars for hanging towels, measuring 25.75 x 14.44 x 32 inches.
```
```text
CAPTION: Black round end table featuring a tempered glass top and a metal frame, with a lower shelf for additional storage.
```
```text
CAPTION: Black collapsible and height-adjustable telescoping stool, portable and designed for makeup artists and hairstylists, shown in various stages of folding for easy transport.
```
```text
CAPTION: Ergonomic pink gaming chair featuring breathable fabric, adjustable height, lumbar support, a footrest, and a swivel recliner function.
```
```text
CAPTION: A set of two Glitzhome adjustable bar stools featuring a mid-century modern design with swivel seats, PU leather upholstery, and wooden backrests.
```
## Wrapping up
In this cookbook, we have seen two examples of how to use the new Batch API, but keep in mind that the Batch API works the same way as the Chat Completions endpoint, supporting the same parameters and most of the recent models (gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo...).
By using this API, you can significantly reduce costs, so we recommend switching every workload that can run asynchronously to a batch job with this new API.
---
# Source: https://developers.openai.com/codex/guides/build-ai-native-engineering-team.md
# Building an AI-Native Engineering Team
## Introduction
AI models are rapidly expanding the range of tasks they can perform, with significant implications for engineering. Frontier systems now sustain multi-hour reasoning: as of August 2025, METR found that leading models could complete **2 hours and 17 minutes** of continuous work with roughly **50% confidence** of producing a correct answer.
This capability is improving quickly, with task length doubling about every seven months. Only a few years ago, models could manage about 30 seconds of reasoning – enough for small code suggestions. Today, as models sustain longer chains of reasoning, the entire software development lifecycle is potentially in scope for AI assistance, enabling coding agents to contribute effectively to planning, design, development, testing, code reviews, and deployment.
In this guide, we’ll share real examples of how AI agents are contributing to the software development lifecycle, along with practical guidance on what engineering leaders can do today to start building AI-native teams and processes.
## AI Coding: From Autocomplete to Agents
AI coding tools have progressed far beyond their origins as autocomplete assistants. Early tools handled quick tasks such as suggesting the next line of code or filling in function templates. As models gained stronger reasoning abilities, developers began interacting with agents through chat interfaces in IDEs for pair programming and code exploration.
Today’s coding agents can generate entire files, scaffold new projects, and translate designs into code. They can reason through multi-step problems such as debugging or refactoring, with agent execution also now shifting from an individual developer’s machine to cloud-based, multi-agent environments. This is changing how developers work, allowing them to spend less time generating code with the agent inside the IDE and more time delegating entire workflows.
| Capability | What It Enables |
| :--------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Unified context across systems** | A single model can read code, configuration, and telemetry, providing consistent reasoning across layers that previously required separate tooling. |
| **Structured tool execution** | Models can now call compilers, test runners, and scanners directly, producing verifiable results rather than static suggestions. |
| **Persistent project memory** | Long context windows and techniques like compaction allow models to follow a feature from proposal to deployment, remembering previous design choices and constraints. |
| **Evaluation loops** | Model outputs can be tested automatically against benchmarks—unit tests, latency targets, or style guides—so improvements are grounded in measurable quality. |
At OpenAI, we have witnessed this firsthand. Development cycles have accelerated, with work that once required weeks now being delivered in days. Teams move more easily across domains, onboard faster to unfamiliar projects, and operate with greater agility and autonomy across the organization. Many routine and time-consuming tasks, from documenting new code and surfacing relevant tests to maintaining dependencies and cleaning up feature flags, are now delegated to Codex entirely.
However, some aspects of engineering remain unchanged. True ownership of code—especially for new or ambiguous problems—still rests with engineers, and certain challenges exceed the capabilities of current models. But with coding agents like Codex, engineers can now spend more time on complex and novel challenges, focusing on design, architecture, and system-level reasoning rather than debugging or rote implementation.
In the following sections, we break down how each phase of the SDLC changes with coding agents — and outline the concrete steps your team can take to start operating as an AI-native engineering org.
## 1. Plan
Teams across an organization often depend on engineers to determine whether a feature is feasible, how long it will take to build, and which systems or teams will be involved. While anyone can draft a specification, forming an accurate plan typically requires deep codebase awareness and multiple rounds of iteration with engineering to uncover requirements, clarify edge cases, and align on what is technically realistic.
### How coding agents help
AI coding agents give teams immediate, code-aware insights during planning and scoping. For example, teams may build workflows that connect coding agents to their issue-tracking systems to read a feature specification, cross-reference it against the codebase, and then flag ambiguities, break the work into subcomponents, or estimate difficulty.
Coding agents can also instantly trace code paths to show which services are involved in a feature — work that previously required hours or days of manual digging through a large codebase.
### What engineers do instead
Teams spend more time on core feature work because agents surface the context that previously required meetings for product alignment and scoping. Key implementation details, dependencies, and edge cases are identified up front, enabling faster decisions with fewer meetings.
| Delegate | Review | Own |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| AI agents can take the first pass at feasibility and architectural analysis. They read a specification, map it to the codebase, identify dependencies, and surface ambiguities or edge cases that need clarification. | Teams review the agent’s findings to validate accuracy, assess completeness, and ensure estimates reflect real technical constraints. Story point assignment, effort sizing, and identifying non-obvious risks still require human judgment. | Strategic decisions — such as prioritization, long-term direction, sequencing, and tradeoffs — remain human-led. Teams may ask the agent for options or next steps, but final responsibility for planning and product direction stays with the organization. |
### Getting started checklist
- Identify common processes that require alignment between features and source code. Common areas include feature scoping and ticket creation.
- Begin by implementing basic workflows, for example tagging and deduplicating issues or feature requests.
- Consider more advanced workflows, like adding sub-tasks to a ticket based on an initial feature description. Or kick off an agent run when a ticket reaches a specific stage to supplement the description with more details.
## 2. Design
The design phase is often slowed by foundational setup work. Teams spend significant time wiring up boilerplate, integrating design systems, and refining UI components or flows. Misalignment between mockups and implementation can create rework and long feedback cycles, and limited bandwidth to explore alternatives or adapt to changing requirements delays design validation.
### How coding agents help
AI coding tools dramatically accelerate prototyping by scaffolding boilerplate code, building project structures, and instantly implementing design tokens or style guides. Engineers can describe desired features or UI layouts in natural language and receive prototype code or component stubs that match the team’s conventions.
They can convert designs directly into code, suggest accessibility improvements, and even analyze the codebase for user flows or edge cases. This makes it possible to iterate on multiple prototypes in hours instead of days, and to prototype in high fidelity early, giving teams a clearer basis for decision-making and enabling customer testing far sooner in the process.
### What engineers do instead
With routine setup and translation tasks handled by agents, teams can redirect their attention to higher-leverage work. Engineers focus on refining core logic, establishing scalable architectural patterns, and ensuring components meet quality and reliability standards. Designers can spend more time evaluating user flows and exploring alternative concepts. The collaborative effort shifts from implementation overhead to improving the underlying product experience.
| Delegate | Review | Own |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| Agents handle the initial implementation work by scaffolding projects, generating boilerplate code, translating mockups into components, and applying design tokens or style guides. | The team reviews the agent’s output to ensure components follow design conventions, meet quality and accessibility standards, and integrate correctly with existing systems. | The team owns the overarching design system, UX patterns, architectural decisions, and the final direction of the user experience. |
### Getting started checklist
- Use a multi-modal coding agent that accepts both text and image input
- Integrate design tools via MCP with coding agents
- Programmatically expose component libraries with MCP, and integrate them with your coding model
- Build workflows that map designs → components → implementation of components
- Utilize typed languages (e.g. TypeScript) to define valid props and subcomponents for the agent
## 3. Build
The build phase is where teams feel the most friction, and where coding agents have the clearest impact. Engineers spend substantial time translating specs into code structures, wiring services together, duplicating patterns across the codebase, and filling in boilerplate, with even small features requiring hours of busy-work.
As systems grow, this friction compounds. Large monorepos accumulate patterns, conventions, and historical quirks that slow contributors down. Engineers can spend as much time rediscovering the “right way” to do something as implementing the feature itself. Constant context switching between specs, code search, build errors, test failures, and dependency management adds cognitive load — and interruptions during long-running tasks break flow and delay delivery further.
### How coding agents help
Coding agents running in the IDE and CLI accelerate the build phase by handling larger, multi-step implementation tasks. Rather than producing just the next function or file, they can produce full features end-to-end — data models, APIs, UI components, tests, and documentation — in a single coordinated run. With sustained reasoning across the entire codebase, they handle decisions that once required engineers to manually trace code paths.
With long-running tasks, agents can:
- Draft entire feature implementations based on a written spec.
- Search and modify code across dozens of files while maintaining consistency.
- Generate boilerplate that matches conventions: error handling, telemetry, security wrappers, or style patterns.
- Fix build errors as they appear rather than pausing for human intervention.
- Write tests alongside implementation as part of a single workflow.
- Produce diff-ready changesets that follow internal guidelines and include PR messages.
In practice, this shifts much of the mechanical “build work” from engineers to agents. The agent becomes the first-pass implementer; the engineer becomes the reviewer, editor, and source of direction.
### What engineers do instead
When agents can reliably execute multi-step build tasks, engineers shift their attention to higher-order work:
- Clarifying product behavior, edge cases, and specs before implementation.
- Reviewing architectural implications of AI-generated code instead of performing rote wiring.
- Refining business logic and performance-critical paths that require deep domain reasoning.
- Designing patterns, guardrails, and conventions that guide agent-generated code.
- Collaborating with PMs and design to iterate on feature intent, not boilerplate.
Instead of “translating” a feature spec into code, engineers concentrate on correctness, coherence, maintainability, and long-term quality, areas where human context still matters most.
| Delegate | Review | Own |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Agents draft the first implementation pass for well-specified features — scaffolding, CRUD logic, wiring, refactors, and tests. As long-running reasoning improves, this increasingly covers full end-to-end builds rather than isolated snippets. | Engineers assess design choices, performance, security, migration risk, and domain alignment while correcting subtle issues the agent may miss. They shape and refine AI-generated code rather than performing the mechanical work. | Engineers retain ownership of work requiring deep system intuition: new abstractions, cross-cutting architectural changes, ambiguous product requirements, and long-term maintainability trade-offs. As agents take on longer tasks, engineering shifts from line-by-line implementation to iterative oversight. |
Example:
Engineers, PMs, designers, and operators at Cloudwalk use Codex daily to turn specs into working code, whether they need a script, a new fraud rule, or a full microservice, delivered in minutes. It removes the busy work from the build phase and gives every employee the power to implement ideas at remarkable speed.
### Getting started checklist
- Start with well specified tasks
- Have the agent use a planning tool via MCP, or by writing a PLAN.md file that is committed to the codebase
- Check that the commands the agent attempts to execute are succeeding
- Iterate on an AGENTS.md file that unlocks agentic loops like running tests and linters to receive feedback
## 4. Test
Developers often struggle to ensure adequate test coverage because writing and maintaining comprehensive tests takes time, requires context switching, and deep understanding of edge cases. Teams frequently face trade-offs between moving fast and writing thorough tests. When deadlines loom, test coverage is often the first thing to suffer.
Even when tests are written, keeping them updated as code evolves introduces ongoing friction. Tests can become brittle, fail for unclear reasons, and can require their own major refactors as the underlying product changes. High quality tests let teams ship faster with more confidence.
### How coding agents help
AI coding tools can help developers author better tests in several powerful ways. First, they can suggest test cases based on reading a requirements document and the logic of the feature code. Models can be surprisingly good at suggesting edge cases and failure modes that may be easy for a developer to overlook, especially when they have been deeply focused on the feature and need a second opinion.
In addition, models can help keep tests up to date as code evolves, reducing the friction of refactoring and avoiding stale tests that become flaky. By handling the basic implementation details of test writing and surfacing edge cases, coding agents accelerate the process of developing tests.
### What engineers do instead
Writing tests with AI tools doesn’t remove the need for developers to think about testing. In fact, as agents remove barriers to generating code, tests serve an increasingly important function as a source of truth for application functionality. Since agents can run the test suite and iterate based on the output, defining high-quality tests is often the first step to allowing an agent to build a feature.
Instead, developers focus more on seeing the high-level patterns in test coverage, building on and challenging the model’s identification of test cases. Making test writing faster allows developers to ship features more quickly and take on more ambitious features.
| Delegate | Review | Own |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Engineers will delegate the initial pass at generating test cases based on feature specifications. They’ll also use the model to take a first pass at generating tests. It can be helpful to have the model generate tests in a separate session from the feature implementation. | Engineers must still thoroughly review model-generated tests to ensure that the model did not take shortcuts or implement stubbed tests. Engineers also ensure that tests are runnable by their agents; that the agent has the appropriate permissions to run, and that the agent has context awareness of the different test suites it can run. | Engineers own aligning test coverage with feature specifications and user experience expectations. Adversarial thinking, creativity in mapping edge cases, and focus on intent of the tests remain critical skills. |
### Getting started checklist
- Guide the model to implement tests as a separate step, and validate that new tests fail before moving to feature implementation.
- Set guidelines for test coverage in your AGENTS.md file
- Give the agent specific examples of code coverage tools it can call to understand test coverage
## 5. Review
On average, developers spend 2–5 hours per week conducting code reviews. Teams often face a choice between investing significant time in a deep review or doing a quick “good enough” pass for changes that seem small. When this prioritization is off, bugs slip into production, causing issues for users and creating substantial rework.
### How coding agents help
Coding agents allow the code review process to scale so every PR receives a consistent baseline of attention. Unlike traditional static analysis tools (which rely on pattern matching and rule-based checks) AI reviewers can actually execute parts of the code, interpret runtime behavior, and trace logic across files and services. To be effective, however, models must be trained specifically to identify P0 and P1-level bugs, and tuned to provide concise, high-signal feedback; overly verbose responses are ignored just as easily as noisy lint warnings.
### What engineers do instead
At OpenAI, we find that AI code review gives engineers more confidence that they are not shipping major bugs into production. Frequently, code review will catch issues that the contributor can correct before pulling in another engineer. Code review doesn’t necessarily make the pull request process faster, especially if it finds meaningful bugs – but it does prevent defects and outages.
### Delegate vs review vs own
Even with AI code review, engineers are still responsible for ensuring that the code is ready to ship. Practically, this means reading and understanding the implications of the change. Engineers delegate the initial code review to an agent, but own the final review and merge process.
| Delegate | Review | Own |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| Engineers delegate the initial coding review to agents. This may happen multiple times before the pull request is marked as ready for review by a teammate. | Engineers still review pull requests, but with more of an emphasis on architectural alignment; are composable patterns being implemented, are the correct conventions being used, does the functionality match requirements. | Engineers ultimately own the code that is deployed to production; they must ensure it functions reliably and fulfills the intended requirements. |
Example:
Sansan uses Codex review for race conditions and database relations, which are issues humans often overlook. Codex has also been able to catch improper hard-coding and even anticipates future scalability concerns.
### Getting started checklist
- Curate examples of gold-standard PRs that have been conducted by engineers including both the code changes and comments left. Save this as an evaluation set to measure different tools.
- Select a product that has a model specifically trained on code review. We’ve found that generalized models often nitpick and provide a low signal to noise ratio.
- Define how your team will measure whether reviews are high quality. We recommend tracking PR comment reactions as a low-friction way to mark good and bad reviews.
- Start small, but roll out quickly once you gain confidence in the results of reviews.
## 6. Document
Most engineering teams know their documentation is behind, but find catching up costly. Critical knowledge is often held by individuals rather than captured in searchable knowledge bases, and existing docs quickly go stale because updating them pulls engineers away from product work. And even when teams run documentation sprints, the result is usually a one-off effort that decays as soon as the system evolves.
### How coding agents help
Coding agents are highly capable of summarizing functionality based on reading codebases. Not only can they write about how parts of the codebase work, but they can also generate system diagrams in syntaxes like mermaid. As developers build features with agents, they can also update documentation simply by prompting the model. With AGENTS.md, instructions to update documentation as needed can be automatically included with every prompt for more consistency.
Since coding agents can be run programmatically through SDKs, they can also be incorporated into release workflows. For example, we can ask a coding agent to review commits being included in the release and summarize key changes. The result is that documentation becomes a built-in part of the delivery pipeline: faster to produce, easier to keep current, and no longer dependent on someone “finding the time.”
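As a rough illustration (not an official integration), a release pipeline might shell out to a coding agent CLI in non-interactive mode. The sketch below assumes a `codex exec`-style one-shot command is available on the build machine and that a `last-release` tag marks the previous release; both are assumptions to adapt to your own setup.
```python
import subprocess

# Collect the commits going into this release (assumes a `last-release` tag exists)
commits = subprocess.run(
    ["git", "log", "--oneline", "last-release..HEAD"],
    capture_output=True, text=True, check=True,
).stdout

# Ask the agent to summarize the changes; `codex exec` runs the agent non-interactively,
# and any agent CLI with a similar one-shot mode would work the same way
prompt = (
    "Review these commits and draft release notes that summarize key changes, "
    "call out breaking changes, and flag anything that needs a docs update:\n" + commits
)
notes = subprocess.run(
    ["codex", "exec", prompt],
    capture_output=True, text=True, check=True,
).stdout

# Save the draft for human review before publishing
with open("RELEASE_NOTES.md", "w") as f:
    f.write(notes)
```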
### What engineers do instead
Engineers move from writing every doc by hand to shaping and supervising the system. They decide how docs are organized, add the important “why” behind decisions, set clear standards and templates for agents to follow, and review the critical or customer-facing pieces. Their job becomes making sure documentation is structured, accurate, and wired into the delivery process rather than doing all the typing themselves.
| Delegate | Review | Own |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Fully hand off low-risk, repetitive work to Codex like first-pass summaries of files and modules, basic descriptions of inputs and outputs, dependency lists, and short summaries of pull-request changes. | Engineers review and edit important docs drafted by Codex like overviews of core services, public API and SDK docs, runbooks, and architecture pages, before anything is published. | Engineers remain responsible for overall documentation strategy and structure, standards and templates the agent follows, and all external-facing or safety-critical documentation involving legal, regulatory, or brand risk. |
### Getting started checklist
- Experiment with documentation generation by prompting the coding agent
- Incorporate documentation guidelines into your AGENTS.md
- Identify workflows (e.g. release cycles) where documentation can be automatically generated
- Review generated content for quality, correctness, and focus
## 7. Deploy and Maintain
Understanding application logging is critical to software reliability. During an incident, software engineers reference logging tools, code deploys, and infrastructure changes to identify a root cause. This process is often surprisingly manual and requires developers to tab back and forth between different systems, costing critical minutes in high-pressure situations.
### How coding agents help
With AI coding tools, you can provide access to your logging tools via MCP servers in addition to the context of your codebase. This allows developers to have a single workflow where they can prompt the model to look at errors for a specific endpoint, and then the model can use that context to traverse the codebase and find relevant bugs or performance issues. Since coding agents can also use command line tools, they can look at the git history to identify specific changes that might result in issues captured in log traces.
### What engineers do instead
By automating the tedious aspects of log analysis and incident triage, AI enables engineers to concentrate on higher-level troubleshooting and system improvement. Rather than manually correlating logs, commits, and infrastructure changes, engineers can focus on validating AI-generated root causes, designing resilient fixes, and developing preventative measures. This shift reduces time spent on reactive firefighting, allowing teams to invest more energy in proactive reliability engineering and architectural improvements.
| Delegate | Review | Own |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Many operational tasks can be delegated to agents — parsing logs, surfacing anomalous metrics, identifying suspect code changes, and even proposing hotfixes. | Engineers vet and refine AI-generated diagnostics, confirm accuracy, and approve remediation steps. They ensure fixes meet reliability, security, and compliance standards. | Critical decisions stay with engineers, especially for novel incidents, sensitive production changes, or situations where model confidence is low. Humans remain responsible for judgment and final sign-off. |
Example:
Virgin Atlantic uses Codex to strengthen how teams deploy and maintain their systems. The Codex VS Code Extension gives engineers a single place to investigate logs, trace issues across code and data, and review changes through Azure DevOps MCP and Databricks Managed MCPs. By unifying this operational context inside the IDE, Codex speeds up root cause discovery, reduces manual triage, and helps teams focus on validating fixes and improving system reliability.
### Getting started checklist
- Connect AI tools to logging and deployment systems: Integrate Codex CLI or similar with your MCP servers and log aggregators.
- Define access scopes and permissions: Ensure agents can access relevant logs, code repositories, and deployment histories, while maintaining security best practices.
- Configure prompt templates: Create reusable prompts for common operational queries, such as “Investigate errors for endpoint X” or “Analyze log spikes post-deploy.” A minimal sketch follows this checklist.
- Test the workflow: Run simulated incident scenarios to ensure the AI surfaces correct context, traces code accurately, and proposes actionable diagnostics.
- Iterate and improve: Collect feedback from real incidents, tune prompt strategies, and expand agent capabilities as your systems and processes evolve.
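As a minimal sketch of the “prompt templates” item above, you might keep reusable operational prompts in code so every incident starts from consistent wording. The template names, placeholders, and queries below are hypothetical:

```python
# Hypothetical reusable prompt templates for common operational queries.
PROMPT_TEMPLATES = {
    "endpoint_errors": (
        "Investigate errors for the {endpoint} endpoint over the last {hours} hours. "
        "Correlate them with recent deploys and suggest likely root causes."
    ),
    "post_deploy_spike": (
        "Analyze the log spike following release {release}. "
        "List suspect commits and the code paths they touch."
    ),
}

def render_prompt(name: str, **kwargs: str) -> str:
    """Fill in a template so the same wording is reused across incidents."""
    return PROMPT_TEMPLATES[name].format(**kwargs)

print(render_prompt("endpoint_errors", endpoint="/checkout", hours="6"))
```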
## Conclusion
Coding agents are transforming the software development lifecycle by taking on the mechanical, multi-step work that has traditionally slowed engineering teams down. With sustained reasoning, unified codebase context, and the ability to execute real tools, these agents now handle tasks ranging from scoping and prototyping to implementation, testing, review, and even operational triage. Engineers stay firmly in control of architecture, product intent, and quality — but coding agents increasingly serve as the first-pass implementer and continuous collaborator across every phase of the SDLC.
This shift doesn’t require a radical overhaul; small, targeted workflows compound quickly as coding agents become more capable and reliable. Teams that start with well-scoped tasks, invest in guardrails, and iteratively expand agent responsibility see meaningful gains in speed, consistency, and developer focus.
If you’re exploring how coding agents can accelerate your organization or preparing for your first deployment, reach out to OpenAI. We’re here to help you turn coding agents into real leverage—designing end-to-end workflows across planning, design, build, test, review, and operations, and helping your team adopt production-ready patterns that make AI-native engineering a reality.
[image1]: https://developers.openai.com/images/codex/guides/build-ai-native-engineering-team.png
---
# Source: https://developers.openai.com/resources/video/build-frontends-codex-video.md
# Build beautiful frontends with OpenAI Codex
> Learn how OpenAI Codex's multimodal abilities accelerate frontend development.
- Type: Video
- Tags: codex, frontend
- URL: https://www.youtube.com/watch?v=fK_bm84N7bs
- Created: 2025-10-27
- Updated: 2025-10-27
## Summary
Shows Codex Cloud turning sketches and photos into responsive interfaces. — codex, frontend
## Details
Experts capture whiteboard ideas, upload sketches, and iterate on Codex-generated UI code to launch production-ready features.
---
# Source: https://developers.openai.com/resources/video/build-hour-tool-calling-video.md
# Build hour — agentic tool calling
> Build hour giving an overview of agentic tool calling.
- Type: Video
- Tags: responses, agents
- URL: https://webinar.openai.com/on-demand/d1a99ac5-8de8-43c5-b209-21903d76b5b2
- Created: 2025-07-21
- Updated: 2025-08-13
## Summary
Shows how agents can call tools to accomplish tasks. — Responses API, function calling, Agents SDK, agentic, tool calling
## Details
Covers practical examples of integrating external tools in agent workflows.
---
# Source: https://developers.openai.com/cookbook/articles/gpt-oss/build-your-own-fact-checker-cerebras.md
# **Build your own content fact-checker with OpenAI gpt-oss-120B, Cerebras, and Parallel**
Ever read an article only to discover later that some of the “facts” were fabricated? As information becomes more abundant, verifying its accuracy has become increasingly challenging.
This guide provides a practical, automated way to assess factual accuracy at scale. It extracts claims from any text or URL, retrieves real-world evidence, and evaluates each claim using gpt-oss-120B, powered by Cerebras' ultra-low-latency inference.
See demo here: [Content Fact-Checker](https://oss.parallel.ai/agents/cerebras-fact-checker).
For this guide, set up the following accounts:
- Cerebras API: the fastest inference provider, [get started for free here.](https://cloud.cerebras.ai/?utm_source=DevX&utm_campaign=parallel)
- Parallel API: The search engine for AI, [get started for free here.](https://platform.parallel.ai/)
Learn more about best practices of gpt-oss-120B [here](https://openai.com/index/introducing-gpt-oss/).
### **Step 1: Environment Setup (Colab or local)**
This guide supports both local Jupyter environments and Google Colab. Set the following environment variables:
- CEREBRAS_API_KEY
- PARALLEL_API_KEY
```bash
python3 -m pip install -U cerebras_cloud_sdk parallel-web requests beautifulsoup4
```
```python
import os
from cerebras.cloud.sdk import Cerebras
from parallel import Parallel
# API keys: Colab userdata (if available) -> env vars fallback
try:
from google.colab import userdata # type: ignore
CEREBRAS_API_KEY = userdata.get("CEREBRAS_API_KEY") or os.getenv("CEREBRAS_API_KEY")
PARALLEL_API_KEY = userdata.get("PARALLEL_API_KEY") or os.getenv("PARALLEL_API_KEY")
except ImportError:
CEREBRAS_API_KEY = os.getenv("CEREBRAS_API_KEY")
PARALLEL_API_KEY = os.getenv("PARALLEL_API_KEY")
if not CEREBRAS_API_KEY or not PARALLEL_API_KEY:
raise RuntimeError("Set CEREBRAS_API_KEY and PARALLEL_API_KEY as environment variables.")
cerebras_client = Cerebras(
api_key=CEREBRAS_API_KEY,
default_headers={
"X-Cerebras-3rd-Party-Integration": "parallel-ai-workshop"
}
)
parallel_client = Parallel(api_key=PARALLEL_API_KEY)
CEREBRAS_MODEL_NAME = "gpt-oss-120b"
print("Clients initialized, model:", CEREBRAS_MODEL_NAME)
```
```text
Clients initialized, model: gpt-oss-120b
```
### **Step 2: Set up the LLM**
Now, with the environment ready, create the function that will call the LLM.
```python
def call_cerebras_chat(
user_content: str,
system_content: str | None = None,
model: str = CEREBRAS_MODEL_NAME,
temperature: float = 1.0,
top_p= 1.0,
max_tokens: int = 4096,
reasoning_effort: str = "medium"
):
"""
Calls the Cerebras chat completion API.
Args:
user_content (str): The user's message.
system_content (str | None): Optional system message to guide the LLM.
model (str): The Cerebras model to use.
temperature (float): Controls the randomness of the output.
max_tokens (int): The maximum number of tokens in the response.
Returns:
str: The content of the LLM's response.
"""
messages = []
# Add a system message to guide the model's behavior
if system_content:
messages.append({"role": "system", "content": system_content})
messages.append({"role": "user", "content": user_content})
# Make the API call to Cerebras chat completions
resp = cerebras_client.chat.completions.create(
model=model,
messages=messages,
temperature=temperature,
top_p=top_p,
max_tokens=max_tokens,
reasoning_effort=reasoning_effort,
)
return resp.choices[0].message.content
```
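As a quick sanity check, you can call the helper with a short prompt. This assumes the clients from Step 1 are initialized; the prompt itself is arbitrary.

```python
# Smoke-test the chat helper with a throwaway prompt.
reply = call_cerebras_chat(
    user_content="In one sentence, explain what makes a statement a checkable factual claim.",
    system_content="You are a concise assistant.",
)
print(reply)
```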
### **Step 3: Connect the LLM to the web**
To fact-check a claim, the model needs to find evidence online, and this step builds the function that connects the LLM to the web.
Notice a few fields:
- `objective` field: natural-language intent rather than keywords.
- `one-shot` mode: For simplicity and speed, this guide sticks to a one-shot setup, which gives high-quality excerpts in a single call.
```python
def search_web(query: str, num: int = 5, mode: str = "one-shot"):
"""
Search the web using Parallel's Search API.
Returns a list of dicts with:
- url
- title
- publish_date
- excerpts (list of strings)
"""
    # Describe the search objective so Parallel can find high-quality sources.
objective = (
f"Find high-quality, up-to-date sources that answer the question:\n\n{query}\n\n"
"Prefer authoritative sites (e.g., .gov, .edu, major news, or official org websites)."
)
    # Call Parallel's Search API to retrieve web evidence
search = parallel_client.beta.search(
objective=objective,
search_queries=[query],
mode=mode,
max_results=num,
excerpts={
"max_chars_per_result": 8000,
},
)
results = []
# Process the search results and extract information like URL, title, and excerpts.
for r in search.results:
results.append(
{
"url": r.url,
"title": getattr(r, "title", None),
"publish_date": getattr(r, "publish_date", None),
"excerpts": list(r.excerpts or []),
}
)
return results
```
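To see what the search layer returns, you might run a single query and print the source titles and URLs. The query below is only an illustration.

```python
# Inspect the raw search results for an example query.
example_results = search_web("When was the IBM Simon smartphone released?", num=3)
for r in example_results:
    print(r["title"], "-", r["url"])
```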
### **Step 4: Organize and summarize web results**
After retrieving information from the web, organize it into a clean, readable format. This step takes the search results and compiles the key excerpts into a simple summary for evaluation.
```python
import textwrap
from typing import List, Dict, Any
def build_evidence_context(results: List[Dict[str, Any]], max_chars: int = 8000) -> str:
blocks = []
for idx, r in enumerate(results):
excerpts_text = "\n\n".join(r["excerpts"][:2])
block = textwrap.dedent(f"""
[Source {idx+1}]
Title: {r['title'] or r['url']}
URL: {r['url']}
Publish date: {r['publish_date']}
Excerpts:
{excerpts_text}
""").strip()
blocks.append(block)
context = "\n\n".join(blocks)
if len(context) > max_chars:
context = context[:max_chars] + "\n\n[Context truncated for length]"
return context
```
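Putting the two helpers together, you can preview the compiled evidence block that will later be passed to the judging step. The query and character limit here are arbitrary.

```python
# Retrieve evidence and preview the compiled context the judge will see.
preview_results = search_web("What year was the first iPhone released?", num=3)
preview_context = build_evidence_context(preview_results, max_chars=2000)
print(preview_context[:800])
```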
### **Step 5: Find the claims to verify**
Next, identify the specific statements in the text to verify. Rather than analyzing an entire article at once, the LLM should break it into multiple clear, stand-alone claims that can each be evaluated on its own.
For example, from a short paragraph like: “The unemployment rate fell to 3.5% in March 2024, and Company X announced a $10B merger the same week.”
The LLM should extract individual factual statements such as:
* “The unemployment rate fell to 3.5% in March 2024.”
* “Company X announced a $10 billion merger.”
Each one can then be checked independently, which makes the entire fact-checking process precise and reliable.
````python
import json
import re
import time
def extract_claims_from_text(text: str, max_claims: int = 8) -> list[str]:
"""
Use Cerebras LLM to extract atomic factual claims from text.
Output format (strict JSON):
{
"claims": ["...", "."...]
}
"""
# Instruct the LLM to extract factual claims
system_prompt_content = (
"You are an information extraction assistant.\n"
"From the user's text, extract up to {max_claims} atomic factual claims.\n"
"Each claim should:\n"
"- Be checkable against external sources (dates, numbers, named entities)\n"
"- Be concrete and not an opinion.\n\n"
"Return STRICT JSON:\n"
"{{\n"
' "claims": ["...", "..."]\n'
"}}\n"
).format(max_claims=max_claims)
# Prompt the LLM for claim extraction
user_prompt_content = f"Text:\n\n{text}\n\nExtract up to {max_claims} factual claims."
messages = [
{"role": "system", "content": system_prompt_content},
{"role": "user", "content": user_prompt_content}
]
start_time = time.time()
# Call Cerebras LLM (gpt-oss-120B) for claim extraction
resp = cerebras_client.chat.completions.create(
model=CEREBRAS_MODEL_NAME,
messages=messages,
temperature=1.0,
top_p=1.0,
max_tokens=4096,
reasoning_effort="medium",
)
raw = resp.choices[0].message.content.strip()
end_time = time.time()
print(f"Cerebras LLM claim extraction took {end_time - start_time:.2f} seconds")
# Clean up the raw JSON output
raw = re.sub(r"^\s*```(?:json)?\s*", "", raw, flags=re.IGNORECASE)
raw = re.sub(r"\s*```\s*$", "", raw)
try:
data = json.loads(raw)
claims = data.get("claims", [])
claims = [c.strip() for c in claims if isinstance(c, str) and c.strip()]
return claims[:max_claims]
except Exception as e:
print("Error parsing claims JSON:", e)
print("Raw model output:\n", raw)
return []
print("Claim extraction ready")
````
```text
Claim extraction ready
```
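You could try the extractor on a short snippet before wiring it into the full pipeline; the sentence below is only an example.

```python
# Extract atomic claims from a small example sentence.
demo_claims = extract_claims_from_text(
    "The Eiffel Tower is located in Paris and was completed in 1889."
)
for claim in demo_claims:
    print("-", claim)
```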
### **Step 6: Check claims against evidence (true / false / uncertain)**
After extracting the text into independent factual claims, the LLM can now evaluate each claim and return a verdict. The process has two steps:
1) **Retrieve evidence with Parallel:**
First, use Parallel to query authoritative sources related to the claim.
2) **Judge the claim with Cerebras:**
Then, send the evidence and the original claim to Cerebras for evaluation. This is where Cerebras's ultra-fast inference becomes crucial: the LLM can analyze multiple pieces of evidence, weigh contradictions, and generate a verdict quickly.
The model will return one of three structured verdicts:
- **True** — Evidence supports the claim
- **False** — Evidence contradicts the claim
- **Uncertain** — Not enough evidence, or sources conflict
Each verdict comes with an explanation and cited URLs, so the model's reasoning is transparent.
````python
from typing import Dict, Any
import textwrap
import re
import time
def fact_check_single_claim(claim: str) -> Dict[str, Any]:
"""
Fact-check a single claim using:
- Parallel Search for evidence
- Cerebras LLM for verdict
Args:
claim (str): The factual claim to be checked.
Returns:
Dict[str, Any]: A dictionary containing the claim, verdict, reason, and sources.
{
"claim": str,
"verdict": "true" | "false" | "uncertain",
"reason": str,
"sources": [url, ...]
}
"""
print(f"\nFact-checking claim: {claim}")
# Search the web for evidence relevant to the claim
results = search_web(query=claim, num=6, mode="one-shot")
print(f"Retrieved {len(results)} evidence sources")
# Compile the search results into a clean, readable context for the LLM
evidence_context = build_evidence_context(results)
# Define the system prompt to instruct the Cerebras LLM (gpt-oss-120B) on how to evaluate each claim
system_prompt_content = (
"You are a careful, skeptical fact-checking assistant.\n"
"You get a factual claim and web search excerpts.\n"
"Decide if the evidence supports, contradicts, or does not clearly resolve the claim.\n\n"
"Respond with STRICT JSON:\n"
"{\n"
' "verdict": "true" | "false" | "uncertain",\n'
' "reason": "short explanation",\n'
' "top_sources": ["url1", "url2", ...]\n'
"}\n"
"Use 'true' only when the evidence strongly supports the claim.\n"
"Use 'false' only when it clearly contradicts the claim.\n"
"Otherwise use 'uncertain'."
)
# Construct the user prompt
user_prompt_content = textwrap.dedent(f"""
Claim:
{claim}
Evidence (web search excerpts):
{evidence_context}
""")
messages = [
{"role": "system", "content": system_prompt_content},
{"role": "user", "content": user_prompt_content}
]
start_time = time.time()
# Call the Cerebras LLM (gpt-oss-120B) to get a structured verdict
resp = cerebras_client.chat.completions.create(
model=CEREBRAS_MODEL_NAME,
messages=messages,
temperature=1.0,
top_p=1.0,
max_tokens=4096,
reasoning_effort="medium"
)
raw = resp.choices[0].message.content.strip()
end_time = time.time()
print(f"Cerebras LLM judgment for this claim took {end_time - start_time:.2f} seconds")
# Clean up the raw JSON output from the LLM
raw = re.sub(r"^\s*```(?:json)?\s*", "", raw, flags=re.IGNORECASE)
raw = re.sub(r"\s*```\s*$", "", raw)
try:
data = json.loads(raw)
except Exception as e:
print("Error parsing judgment JSON:", e)
print("Raw model output:\n", raw)
data = {
"verdict": "uncertain",
"reason": "Could not parse model output.",
"top_sources": [],
}
# Extract and normalize the verdict (true, false, or uncertain)
verdict = str(data.get("verdict", "uncertain")).lower()
if verdict not in {"true", "false", "uncertain"}:
verdict = "uncertain"
# Extract and format the top sources cited by the LLM
top_sources = data.get("top_sources") or []
if not isinstance(top_sources, list):
top_sources = [str(top_sources)]
top_sources = [str(u) for u in top_sources][:5]
# Consolidate all the fact-checking results into a single dictionary
result = {
"claim": claim,
"verdict": verdict,
"reason": data.get("reason", ""),
"sources": top_sources,
}
# Print the detailed fact-checking result for clarity
print("Verdict:", result["verdict"].upper())
print("Reason:", result["reason"])
if result["sources"]:
print("Sources:")
for s in result["sources"]:
print(" •", s)
return result
print("Single-claim fact-checker ready")
````
```text
Single-claim fact checker ready
```
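Before fact-checking a whole document, it can help to run a single claim end to end; the claim below is an illustrative example.

```python
# Fact-check one claim end to end: search, judge, and print the verdict.
single_result = fact_check_single_claim(
    "The Great Wall of China is visible from the Moon with the naked eye."
)
print(single_result["verdict"], "-", single_result["reason"])
```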
### **Step 7: Fact-check an entire text**
This final step brings everything together: take any piece of text, extract its claims, and run each one through the full fact-checking process you built.
```python
def fact_check_text(text: str, max_claims: int = 6):
# First, extract factual claims from the input text
claims = extract_claims_from_text(text, max_claims=max_claims)
print(f"Extracted {len(claims)} claims:")
for i, c in enumerate(claims, 1):
print(f" {i}. {c}")
all_results = []
# Iterate through each extracted claim and perform a single fact-check
for i, claim in enumerate(claims):
print(f"\n{'='*50}\nFact-checking Claim {i+1} of {len(claims)}: '{claim}'")
single_claim_result = fact_check_single_claim(claim)
all_results.append(single_claim_result)
print(f"{'='*50}")
# After all claims are checked, print a summary of all results
print("\n\n--- Summary of All Fact-Checking Results ---\n")
for result in all_results:
print(f"Claim: {result['claim']}")
print(f"Verdict: {result['verdict'].upper()}")
print(f"Reason: {result['reason']}")
if result['sources']:
print("Sources:")
for s in result['sources']:
print(f" • {s}")
print("\n" + "-"*50 + "\n")
return all_results
print("Full fact-checking pipeline ready")
```
```text
Full fact-checking pipeline ready
```
### **Step 8: Fact-check directly from a URL**
Finally, to make the fact-checker even easier to use, add a function that accepts a URL directly.
```python
import requests
from bs4 import BeautifulSoup
def extract_claims_from_url(url: str, max_claims: int = 8) -> list[str]:
"""
Extracts atomic factual claims from the main content of a given URL.
Fetches content using requests/BeautifulSoup and uses Cerebras LLM for claim extraction.
"""
print(f"Fetching content from URL: {url}")
try:
# Fetch the content of the URL
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# Attempt to find the main content by looking for 'article' or 'main' tags
main_content_div = soup.find('article') or soup.find('main')
if main_content_div:
main_text = ' '.join([p.get_text() for p in main_content_div.find_all('p')])
else:
main_text_elements = soup.find_all(['p', 'h1', 'h2', 'h3'])
main_text = ' '.join([elem.get_text() for elem in main_text_elements])
# Check if enough text was extracted
if not main_text or len(main_text.strip()) < 100:
print(f"Warning: Not enough main text found for URL: {url}")
return []
print(f"Extracted {len(main_text)} characters from the URL. Now extracting claims...")
# Use the LLM to extract claims from the cleaned text
claims = extract_claims_from_text(main_text, max_claims=max_claims)
return claims
except requests.exceptions.RequestException as e:
print(f"Error fetching content from URL {url}: {e}")
return []
except Exception as e:
print(f"Error processing URL {url}: {e}")
return []
print("URL claim extraction function ready")
```
```text
URL claim extraction function ready
```
### **Examples**
Start with a short sample text.
```python
sample_text = """\nThe Earth is flat and the moon is made of cheese. Humans landed on Mars in 1969. Albert Einstein was born in Germany in 1879.\n"""
print("Fact-checking the following text:\n")
print(sample_text)
fact_check_results = fact_check_text(sample_text)
display(fact_check_results)
```
```text
Fact-checking the following text:
The Earth is flat and the moon is made of cheese. Humans landed on Mars in 1969. Albert Einstein was born in Germany in 1879.
Cerebras LLM claim extraction took 0.34 seconds
Extracted 5 claims:
1. The Earth is flat.
2. The moon is made of cheese.
3. Humans landed on Mars in 1969.
4. Albert Einstein was born in Germany.
5. Albert Einstein was born in 1879.
==================================================
Fact-checking Claim 1 of 5: 'The Earth is flat.'
Fact-checking claim: The Earth is flat.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.39 seconds
Verdict: FALSE
Reason: The provided sources explain that scientific evidence demonstrates the Earth is a sphere and that flat‑Earth beliefs are a debunked conspiracy, directly contradicting the claim.
Sources:
• https://pursuit.unimelb.edu.au/articles/why-do-some-people-believe-the-earth-is-flat
==================================================
==================================================
Fact-checking Claim 2 of 5: 'The moon is made of cheese.'
Fact-checking claim: The moon is made of cheese.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.30 seconds
Verdict: FALSE
Reason: NASA scientific sources describe the Moon's composition as layered rock, iron, silicon, magnesium, etc., with no indication of cheese, directly contradicting the claim.
Sources:
• https://science.nasa.gov/moon/composition/
==================================================
==================================================
Fact-checking Claim 3 of 5: 'Humans landed on Mars in 1969.'
Fact-checking claim: Humans landed on Mars in 1969.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.48 seconds
Verdict: FALSE
Reason: The evidence shows that in 1969 NASA conducted unmanned Mars flyby missions (Mariner 6 and 7) and a manned Moon landing, with no indication of humans landing on Mars.
Sources:
• https://www.facebook.com/groups/jameswebbtelescopecosmicexplorations/posts/762176293540444/
• https://www.jpl.nasa.gov/missions/mariner-7/
==================================================
==================================================
Fact-checking Claim 4 of 5: 'Albert Einstein was born in Germany.'
Fact-checking claim: Albert Einstein was born in Germany.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.80 seconds
Verdict: TRUE
Reason: Wikipedia describes Einstein as a German-born theoretical physicist, confirming he was born in Germany.
Sources:
• https://en.wikipedia.org/wiki/Albert_Einstein
• https://www.nobelprize.org/prizes/physics/1921/einstein/biographical/
==================================================
==================================================
Fact-checking Claim 5 of 5: 'Albert Einstein was born in 1879.'
Fact-checking claim: Albert Einstein was born in 1879.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.27 seconds
Verdict: TRUE
Reason: The Wikipedia entry lists Einstein's birthdate as 14 March 1879, confirming the claim.
Sources:
• https://en.wikipedia.org/wiki/Albert_Einstein
==================================================
--- Summary of All Fact-Checking Results ---
Claim: The Earth is flat.
Verdict: FALSE
Reason: The provided sources explain that scientific evidence demonstrates the Earth is a sphere and that flat‑Earth beliefs are a debunked conspiracy, directly contradicting the claim.
Sources:
• https://pursuit.unimelb.edu.au/articles/why-do-some-people-believe-the-earth-is-flat
--------------------------------------------------
Claim: The moon is made of cheese.
Verdict: FALSE
Reason: NASA scientific sources describe the Moon's composition as layered rock, iron, silicon, magnesium, etc., with no indication of cheese, directly contradicting the claim.
Sources:
• https://science.nasa.gov/moon/composition/
--------------------------------------------------
Claim: Humans landed on Mars in 1969.
Verdict: FALSE
Reason: The evidence shows that in 1969 NASA conducted unmanned Mars flyby missions (Mariner 6 and 7) and a manned Moon landing, with no indication of humans landing on Mars.
Sources:
• https://www.facebook.com/groups/jameswebbtelescopecosmicexplorations/posts/762176293540444/
• https://www.jpl.nasa.gov/missions/mariner-7/
--------------------------------------------------
Claim: Albert Einstein was born in Germany.
Verdict: TRUE
Reason: Wikipedia describes Einstein as a German-born theoretical physicist, confirming he was born in Germany.
Sources:
• https://en.wikipedia.org/wiki/Albert_Einstein
• https://www.nobelprize.org/prizes/physics/1921/einstein/biographical/
--------------------------------------------------
Claim: Albert Einstein was born in 1879.
Verdict: TRUE
Reason: The Wikipedia entry lists Einstein's birthdate as 14 March 1879, confirming the claim.
Sources:
• https://en.wikipedia.org/wiki/Albert_Einstein
--------------------------------------------------
```
```text
[{'claim': 'The Earth is flat.',
'verdict': 'false',
'reason': 'The provided sources explain that scientific evidence demonstrates the Earth is a sphere and that flat‑Earth beliefs are a debunked conspiracy, directly contradicting the claim.',
'sources': ['https://pursuit.unimelb.edu.au/articles/why-do-some-people-believe-the-earth-is-flat']},
{'claim': 'The moon is made of cheese.',
'verdict': 'false',
'reason': "NASA scientific sources describe the Moon's composition as layered rock, iron, silicon, magnesium, etc., with no indication of cheese, directly contradicting the claim.",
'sources': ['https://science.nasa.gov/moon/composition/']},
{'claim': 'Humans landed on Mars in 1969.',
'verdict': 'false',
'reason': 'The evidence shows that in 1969 NASA conducted unmanned Mars flyby missions (Mariner 6 and 7) and a manned Moon landing, with no indication of humans landing on Mars.',
'sources': ['https://www.facebook.com/groups/jameswebbtelescopecosmicexplorations/posts/762176293540444/',
'https://www.jpl.nasa.gov/missions/mariner-7/']},
{'claim': 'Albert Einstein was born in Germany.',
'verdict': 'true',
'reason': 'Wikipedia describes Einstein as a German-born theoretical physicist, confirming he was born in Germany.',
'sources': ['https://en.wikipedia.org/wiki/Albert_Einstein',
'https://www.nobelprize.org/prizes/physics/1921/einstein/biographical/']},
{'claim': 'Albert Einstein was born in 1879.',
'verdict': 'true',
'reason': "The Wikipedia entry lists Einstein's birthdate as 14 March 1879, confirming the claim.",
'sources': ['https://en.wikipedia.org/wiki/Albert_Einstein']}]
```
Now, paste in a longer, multi-paragraph statement and see what the fact-checker says.
[Note: this is a composite text example designed to verify the content fact-checker. It contains plausible but fabricated claims.]
```python
long_sample_text = """
In recent months, a number of widely shared posts and articles have circulated online making bold claims about technology, science, and public health. One viral thread asserted that Apple released the world’s first smartphone in 1992, long before the launch of the iPhone. The post claimed the device had a touchscreen, mobile internet capabilities, and even early forms of voice control. In reality, Apple did not release a smartphone in 1992, and the first widely recognized smartphone, the IBM Simon, was introduced in 1994 with far more limited features. The iPhone, launched in 2007, is credited with defining the modern smartphone era.
Another widely repeated claim stated that Mount Everest has shrunk by more than 500 meters due to rapid climate change. Several posts argued that melting ice and tectonic shifts had dramatically reduced the mountain’s height, supposedly confirmed by new satellite imagery. Geologists and survey data contradict this, showing that Everest’s height has changed only minimally over time. Recent revisions to Everest’s official height reflect improved measurement technology—not catastrophic geological change or the environmental collapse suggested online.
A sensational article suggested that NASA announced Earth will experience 15 days of complete darkness in November 2025 because of a rare planetary alignment. This claim resurfaces every few years in slightly different forms, yet NASA has consistently debunked every version of it. Astronomers explain that no known configuration of planets could block sunlight from reaching Earth for even a single day, let alone two weeks.
Another persistent piece of misinformation claimed that COVID-19 vaccines contain microchips designed for government tracking. Public health organizations worldwide have addressed this rumor repeatedly, stating unequivocally that no such technology exists in vaccines and that microelectronics cannot function or survive in biological environments in the way conspiracy theories suggest. Despite extensive scientific communication, this claim continues to spread across certain corners of the internet.
More recently, a trending health blog claimed that drinking eight cups of coffee per day reduces the risk of heart disease by 70%. While moderate coffee consumption has been studied for potential health benefits, no reputable research supports the exaggerated 70% figure promoted in the article. Excessive caffeine intake can create health concerns for many individuals, including increased heart rate, anxiety, and disrupted sleep.
In the tech sector, several posts gained traction by asserting that electric vehicles routinely explode in temperatures above 80 degrees Fahrenheit. Critics use this claim to argue that EVs pose unique safety threats. However, investigations by fire departments, insurance groups, and automotive engineers show no evidence of spontaneous combustion linked to moderate ambient temperatures. Vehicle fires—when they do occur—typically result from accidents, mechanical failures, or battery punctures, not temperature alone.
Another claim circulating widely suggests that major tech companies are secretly restricting home Wi-Fi speeds to force consumers into new subscription tiers. Internet service providers and independent network analysts have found no support for this, noting that slowdowns are far more commonly caused by outdated hardware, overcrowded networks, or poor signal placement within the home.
"""
print("Fact-checking the following longer text:\n")
print(long_sample_text[:500] + ('...' if len(long_sample_text) > 500 else ''))
long_fact_check_results = fact_check_text(long_sample_text)
display(long_fact_check_results)
```
```text
Fact-checking the following longer text:
In recent months, a number of widely shared posts and articles have circulated online making bold claims about technology, science, and public health. One viral thread asserted that Apple released the world’s first smartphone in 1992, long before the launch of the iPhone. The post claimed the device had a touchscreen, mobile internet capabilities, and even early forms of voice control. In reality, Apple did not release a smartphone in 1992, and the first widely recognized smartphone, the IBM Si...
Cerebras LLM claim extraction took 0.56 seconds
Extracted 6 claims:
1. Apple did not release a smartphone in 1992; the first widely recognized smartphone, the IBM Simon, was introduced in 1994.
2. The iPhone was launched in 2007 and is credited with defining the modern smartphone era.
3. Mount Everest has not shrunk by more than 500 meters; its height has changed only minimally and recent revisions reflect improved measurement technology.
4. NASA has debunked claims that Earth will experience 15 days of complete darkness in November 2025 due to a planetary alignment, stating no known configuration can block sunlight for that duration.
5. COVID‑19 vaccines do not contain microchips for government tracking; no such microelectronics are present in any authorized vaccine.
6. Drinking eight cups of coffee per day does not reduce the risk of heart disease by 70%; no reputable research supports that specific reduction figure.
==================================================
Fact-checking Claim 1 of 6: 'Apple did not release a smartphone in 1992; the first widely recognized smartphone, the IBM Simon, was introduced in 1994.'
Fact-checking claim: Apple did not release a smartphone in 1992; the first widely recognized smartphone, the IBM Simon, was introduced in 1994.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.40 seconds
Verdict: UNCERTAIN
Reason: The evidence clearly shows IBM Simon was first released in 1994, supporting that part of the claim. However, there is no explicit evidence provided about Apple not releasing a smartphone in 1992, so the claim cannot be fully verified.
Sources:
• https://en.wikipedia.org/wiki/IBM_Simon
• https://en.wikipedia.org/wiki/Smartphone
==================================================
==================================================
Fact-checking Claim 2 of 6: 'The iPhone was launched in 2007 and is credited with defining the modern smartphone era.'
Fact-checking claim: The iPhone was launched in 2007 and is credited with defining the modern smartphone era.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.35 seconds
Verdict: TRUE
Reason: The evidence confirms the iPhone was first released on June 29 2007 and describes it as a revolutionary device that "reinvented" the phone, indicating it is widely credited with defining the modern smartphone era.
Sources:
• https://en.wikipedia.org/wiki/IPhone_(1st_generation)
• https://theprint.in/features/brandma/iphone-1-a-revolutionary-smartphone-that-debuted-at-the-2007-oscars/889755/
==================================================
==================================================
Fact-checking Claim 3 of 6: 'Mount Everest has not shrunk by more than 500 meters; its height has changed only minimally and recent revisions reflect improved measurement technology.'
Fact-checking claim: Mount Everest has not shrunk by more than 500 meters; its height has changed only minimally and recent revisions reflect improved measurement technology.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.38 seconds
Verdict: TRUE
Reason: The sources state that Everest’s height is now 8,848.86 m, noting only slight adjustments from earlier measurements due to better technology and minor natural effects, with no indication of a shrinkage anywhere near 500 m.
Sources:
• https://www.himalayanrecreation.com/blog/the-height-of-mount-everest
• https://www.britannica.com/place/Mount-Everest
==================================================
==================================================
Fact-checking Claim 4 of 6: 'NASA has debunked claims that Earth will experience 15 days of complete darkness in November 2025 due to a planetary alignment, stating no known configuration can block sunlight for that duration.'
Fact-checking claim: NASA has debunked claims that Earth will experience 15 days of complete darkness in November 2025 due to a planetary alignment, stating no known configuration can block sunlight for that duration.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.47 seconds
Verdict: UNCERTAIN
Reason: The provided sources debunk similar 15‑day darkness hoaxes for 2015/2017 and state NASA never confirmed such an event, but none specifically address a claimed November 2025 event, so the claim is not directly supported or contradicted.
Sources:
• https://www.snopes.com/fact-check/15-days-darkness-november/
• https://www.space.com/31118-earth-darkness-hoax-debunked.html
==================================================
==================================================
Fact-checking Claim 5 of 6: 'COVID‑19 vaccines do not contain microchips for government tracking; no such microelectronics are present in any authorized vaccine.'
Fact-checking claim: COVID‑19 vaccines do not contain microchips for government tracking; no such microelectronics are present in any authorized vaccine.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.44 seconds
Verdict: TRUE
Reason: Multiple reputable sources explicitly state that COVID‑19 vaccines contain no microchips or any tracking hardware, directly confirming the claim.
Sources:
• https://revealnews.org/article/where-did-the-microchip-vaccine-conspiracy-theory-come-from-anyway/
• https://www.mayoclinic.org/diseases-conditions/coronavirus/in
==================================================
==================================================
Fact-checking Claim 6 of 6: 'Drinking eight cups of coffee per day does not reduce the risk of heart disease by 70%; no reputable research supports that specific reduction figure.'
Fact-checking claim: Drinking eight cups of coffee per day does not reduce the risk of heart disease by 70%; no reputable research supports that specific reduction figure.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.45 seconds
Verdict: TRUE
Reason: The cited review shows mixed or even increased risk with heavy coffee consumption and does not report a 70% reduction in heart disease risk for eight cups per day, indicating no reputable research supports that specific figure.
Sources:
• https://pmc.ncbi.nlm.nih.gov/articles/PMC10262944/
• https://www.escardio.org/The-ESC/Press-Office/Press-releases/morning-coffee-may-protect-the-heart-better-than-all-day-coffee-drinking
==================================================
--- Summary of All Fact-Checking Results ---
Claim: Apple did not release a smartphone in 1992; the first widely recognized smartphone, the IBM Simon, was introduced in 1994.
Verdict: UNCERTAIN
Reason: The evidence clearly shows IBM Simon was first released in 1994, supporting that part of the claim. However, there is no explicit evidence provided about Apple not releasing a smartphone in 1992, so the claim cannot be fully verified.
Sources:
• https://en.wikipedia.org/wiki/IBM_Simon
• https://en.wikipedia.org/wiki/Smartphone
--------------------------------------------------
Claim: The iPhone was launched in 2007 and is credited with defining the modern smartphone era.
Verdict: TRUE
Reason: The evidence confirms the iPhone was first released on June 29 2007 and describes it as a revolutionary device that "reinvented" the phone, indicating it is widely credited with defining the modern smartphone era.
Sources:
• https://en.wikipedia.org/wiki/IPhone_(1st_generation)
• https://theprint.in/features/brandma/iphone-1-a-revolutionary-smartphone-that-debuted-at-the-2007-oscars/889755/
--------------------------------------------------
Claim: Mount Everest has not shrunk by more than 500 meters; its height has changed only minimally and recent revisions reflect improved measurement technology.
Verdict: TRUE
Reason: The sources state that Everest’s height is now 8,848.86 m, noting only slight adjustments from earlier measurements due to better technology and minor natural effects, with no indication of a shrinkage anywhere near 500 m.
Sources:
• https://www.himalayanrecreation.com/blog/the-height-of-mount-everest
• https://www.britannica.com/place/Mount-Everest
--------------------------------------------------
Claim: NASA has debunked claims that Earth will experience 15 days of complete darkness in November 2025 due to a planetary alignment, stating no known configuration can block sunlight for that duration.
Verdict: UNCERTAIN
Reason: The provided sources debunk similar 15‑day darkness hoaxes for 2015/2017 and state NASA never confirmed such an event, but none specifically address a claimed November 2025 event, so the claim is not directly supported or contradicted.
Sources:
• https://www.snopes.com/fact-check/15-days-darkness-november/
• https://www.space.com/31118-earth-darkness-hoax-debunked.html
--------------------------------------------------
Claim: COVID‑19 vaccines do not contain microchips for government tracking; no such microelectronics are present in any authorized vaccine.
Verdict: TRUE
Reason: Multiple reputable sources explicitly state that COVID‑19 vaccines contain no microchips or any tracking hardware, directly confirming the claim.
Sources:
• https://revealnews.org/article/where-did-the-microchip-vaccine-conspiracy-theory-come-from-anyway/
• https://www.mayoclinic.org/diseases-conditions/coronavirus/in
--------------------------------------------------
Claim: Drinking eight cups of coffee per day does not reduce the risk of heart disease by 70%; no reputable research supports that specific reduction figure.
Verdict: TRUE
Reason: The cited review shows mixed or even increased risk with heavy coffee consumption and does not report a 70% reduction in heart disease risk for eight cups per day, indicating no reputable research supports that specific figure.
Sources:
• https://pmc.ncbi.nlm.nih.gov/articles/PMC10262944/
• https://www.escardio.org/The-ESC/Press-Office/Press-releases/morning-coffee-may-protect-the-heart-better-than-all-day-coffee-drinking
--------------------------------------------------
```
```text
[{'claim': 'Apple did not release a smartphone in 1992; the first widely recognized smartphone, the IBM Simon, was introduced in 1994.',
'verdict': 'uncertain',
'reason': 'The evidence clearly shows IBM Simon was first released in 1994, supporting that part of the claim. However, there is no explicit evidence provided about Apple not releasing a smartphone in 1992, so the claim cannot be fully verified.',
'sources': ['https://en.wikipedia.org/wiki/IBM_Simon',
'https://en.wikipedia.org/wiki/Smartphone']},
{'claim': 'The iPhone was launched in 2007 and is credited with defining the modern smartphone era.',
'verdict': 'true',
'reason': 'The evidence confirms the iPhone was first released on June\u202f29\u202f2007 and describes it as a revolutionary device that "reinvented" the phone, indicating it is widely credited with defining the modern smartphone era.',
'sources': ['https://en.wikipedia.org/wiki/IPhone_(1st_generation)',
'https://theprint.in/features/brandma/iphone-1-a-revolutionary-smartphone-that-debuted-at-the-2007-oscars/889755/']},
{'claim': 'Mount Everest has not shrunk by more than 500\u202fmeters; its height has changed only minimally and recent revisions reflect improved measurement technology.',
'verdict': 'true',
'reason': 'The sources state that Everest’s height is now 8,848.86\u202fm, noting only slight adjustments from earlier measurements due to better technology and minor natural effects, with no indication of a shrinkage anywhere near 500\u202fm.',
'sources': ['https://www.himalayanrecreation.com/blog/the-height-of-mount-everest',
'https://www.britannica.com/place/Mount-Everest']},
{'claim': 'NASA has debunked claims that Earth will experience 15\u202fdays of complete darkness in November\u202f2025 due to a planetary alignment, stating no known configuration can block sunlight for that duration.',
'verdict': 'uncertain',
'reason': 'The provided sources debunk similar 15‑day darkness hoaxes for 2015/2017 and state NASA never confirmed such an event, but none specifically address a claimed November\u202f2025 event, so the claim is not directly supported or contradicted.',
'sources': ['https://www.snopes.com/fact-check/15-days-darkness-november/',
'https://www.space.com/31118-earth-darkness-hoax-debunked.html']},
{'claim': 'COVID‑19 vaccines do not contain microchips for government tracking; no such microelectronics are present in any authorized vaccine.',
'verdict': 'true',
'reason': 'Multiple reputable sources explicitly state that COVID‑19 vaccines contain no microchips or any tracking hardware, directly confirming the claim.',
'sources': ['https://revealnews.org/article/where-did-the-microchip-vaccine-conspiracy-theory-come-from-anyway/',
'https://www.mayoclinic.org/diseases-conditions/coronavirus/in']},
{'claim': 'Drinking eight cups of coffee per day does not reduce the risk of heart disease by 70%; no reputable research supports that specific reduction figure.',
'verdict': 'true',
'reason': 'The cited review shows mixed or even increased risk with heavy coffee consumption and does not report a 70% reduction in heart disease risk for eight cups per day, indicating no reputable research supports that specific figure.',
'sources': ['https://pmc.ncbi.nlm.nih.gov/articles/PMC10262944/',
'https://www.escardio.org/The-ESC/Press-Office/Press-releases/morning-coffee-may-protect-the-heart-better-than-all-day-coffee-drinking']}]
```
Finally, paste a URL directly.
```python
current_doc_url = "https://www.snopes.com/fact-check/drinking-at-disney-world/"
print(f"Extracting and fact-checking claims from: {current_doc_url}")
url_extracted_claims = extract_claims_from_url(current_doc_url)
if url_extracted_claims:
print(f"\nSuccessfully extracted {len(url_extracted_claims)} claims from the URL. Now fact-checking them...")
claims_text_for_fact_check = "\n".join(url_extracted_claims)
url_fact_check_results = fact_check_text(claims_text_for_fact_check)
display(url_fact_check_results)
else:
print("Could not extract claims from the URL to fact-check.")
```
```text
Extracting and fact-checking claims from: https://www.snopes.com/fact-check/drinking-at-disney-world/
Fetching content from URL: https://www.snopes.com/fact-check/drinking-at-disney-world/
Extracted 1820 characters from the URL. Now extracting claims...
Cerebras LLM claim extraction took 0.57 seconds
Successfully extracted 8 claims from the URL. Now fact-checking them...
Cerebras LLM claim extraction took 0.67 seconds
Extracted 6 claims:
1. On September 9, 2023, Mouse Trap News published an article claiming that the Walt Disney World Resort had officially removed the drinking age.
2. The TikTok video posted by @mousetrapnews had 8.8 million views at the time of this check.
3. Mouse Trap News states on its About page that every story on its website is fake and that it is a satire site.
4. The Pensacola News Journal reported that Disney World was still allowed to sell alcohol only to adults aged 21 or older at the time of the writing.
5. Mouse Trap News previously made a claim that Disney World was supposedly lobbying to lower the drinking age at the resort to 18.
6. Disney World’s policy permits the sale of alcohol only to guests who are 21 years of age or older.
==================================================
Fact-checking Claim 1 of 6: 'On September 9, 2023, Mouse Trap News published an article claiming that the Walt Disney World Resort had officially removed the drinking age.'
Fact-checking claim: On September 9, 2023, Mouse Trap News published an article claiming that the Walt Disney World Resort had officially removed the drinking age.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 2.21 seconds
Verdict: UNCERTAIN
Reason: The available sources discuss rumors about Disney lowering its drinking age and debunk them, but they do not directly confirm that Mouse Trap News published an article on September 9, 2023 making that claim.
Sources:
• https://www.pnj.com/story/news/2023/09/11/disney-world-remove-legal-drinking-age-requirement-florida-debunked/70822543007/
• https://www.aol.com/news/fact-fiction-disney-world-lobbying-040148528.html
==================================================
==================================================
Fact-checking Claim 2 of 6: 'The TikTok video posted by @mousetrapnews had 8.8 million views at the time of this check.'
Fact-checking claim: The TikTok video posted by @mousetrapnews had 8.8 million views at the time of this check.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.34 seconds
Verdict: UNCERTAIN
Reason: The provided excerpts do not include any view count for the specific TikTok video, so they neither confirm nor refute the claim of 8.8 million views.
Sources:
• https://www.tiktok.com/@mousetrapnews/video/7590889191806995743
• https://www.tiktok.com/@mousetrapnews/video/7485897545890336046
==================================================
==================================================
Fact-checking Claim 3 of 6: 'Mouse Trap News states on its About page that every story on its website is fake and that it is a satire site.'
Fact-checking claim: Mouse Trap News states on its About page that every story on its website is fake and that it is a satire site.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.35 seconds
Verdict: TRUE
Reason: The About page explicitly describes Mouse Trap News as a satire/parody site and states that its stories are made‑up and not true, confirming that the site claims all its content is fake. A Facebook post also refers to it as a satirical site.
Sources:
• https://mousetrapnews.com/about/
• https://www.facebook.com/groups/276199024736470/posts/358951296461242/
==================================================
==================================================
Fact-checking Claim 4 of 6: 'The Pensacola News Journal reported that Disney World was still allowed to sell alcohol only to adults aged 21 or older at the time of the writing.'
Fact-checking claim: The Pensacola News Journal reported that Disney World was still allowed to sell alcohol only to adults aged 21 or older at the time of the writing.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.32 seconds
Verdict: TRUE
Reason: The Pensacola News Journal article explicitly states Disney World’s alcohol policy limits sales to guests 21 years old or older, confirming the claim.
Sources:
• https://www.pnj.com/story/news/2023/09/11/disney-world-remove-legal-drinking-age-requirement-florida-debunked/70822543007/
• https://disneyworld.disney.go.com/faq/restaurants/required-id-for-alcohol/
==================================================
==================================================
Fact-checking Claim 5 of 6: 'Mouse Trap News previously made a claim that Disney World was supposedly lobbying to lower the drinking age at the resort to 18.'
Fact-checking claim: Mouse Trap News previously made a claim that Disney World was supposedly lobbying to lower the drinking age at the resort to 18.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.44 seconds
Verdict: TRUE
Reason: The Mouse Trap News article titled “Drinking Age at Disney World May be Lowered to 18” explicitly states that Disney World is lobbying to lower the drinking age, confirming that Mouse Trap News made this claim.
Sources:
• https://mousetrapnews.com/drinking-age-at-disney-world-may-be-lowered-to-18/
• https://www.10news.com/news/fact-or-fiction/fact-or-fiction-disney-world-lobbying-to-lower-drinking-age-on-florida-property
==================================================
==================================================
Fact-checking Claim 6 of 6: 'Disney World’s policy permits the sale of alcohol only to guests who are 21 years of age or older.'
Fact-checking claim: Disney World’s policy permits the sale of alcohol only to guests who are 21 years of age or older.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.30 seconds
Verdict: TRUE
Reason: Official Disney World FAQ states alcoholic beverages can be purchased only by guests 21 years or older, confirming the policy.
Sources:
• https://disneyworld.disney.go.com/faq/restaurants/required-id-for-alcohol/
• https://www.disneyfoodblog.com/2023/10/08/the-one-rule-about-drinking-in-disney-world-you-need-to-know/
==================================================
--- Summary of All Fact-Checking Results ---
Claim: On September 9, 2023, Mouse Trap News published an article claiming that the Walt Disney World Resort had officially removed the drinking age.
Verdict: UNCERTAIN
Reason: The available sources discuss rumors about Disney lowering its drinking age and debunk them, but they do not directly confirm that Mouse Trap News published an article on September 9, 2023 making that claim.
Sources:
• https://www.pnj.com/story/news/2023/09/11/disney-world-remove-legal-drinking-age-requirement-florida-debunked/70822543007/
• https://www.aol.com/news/fact-fiction-disney-world-lobbying-040148528.html
--------------------------------------------------
Claim: The TikTok video posted by @mousetrapnews had 8.8 million views at the time of this check.
Verdict: UNCERTAIN
Reason: The provided excerpts do not include any view count for the specific TikTok video, so they neither confirm nor refute the claim of 8.8 million views.
Sources:
• https://www.tiktok.com/@mousetrapnews/video/7590889191806995743
• https://www.tiktok.com/@mousetrapnews/video/7485897545890336046
--------------------------------------------------
Claim: Mouse Trap News states on its About page that every story on its website is fake and that it is a satire site.
Verdict: TRUE
Reason: The About page explicitly describes Mouse Trap News as a satire/parody site and states that its stories are made‑up and not true, confirming that the site claims all its content is fake. A Facebook post also refers to it as a satirical site.
Sources:
• https://mousetrapnews.com/about/
• https://www.facebook.com/groups/276199024736470/posts/358951296461242/
--------------------------------------------------
Claim: The Pensacola News Journal reported that Disney World was still allowed to sell alcohol only to adults aged 21 or older at the time of the writing.
Verdict: TRUE
Reason: The Pensacola News Journal article explicitly states Disney World’s alcohol policy limits sales to guests 21 years old or older, confirming the claim.
Sources:
• https://www.pnj.com/story/news/2023/09/11/disney-world-remove-legal-drinking-age-requirement-florida-debunked/70822543007/
• https://disneyworld.disney.go.com/faq/restaurants/required-id-for-alcohol/
--------------------------------------------------
Claim: Mouse Trap News previously made a claim that Disney World was supposedly lobbying to lower the drinking age at the resort to 18.
Verdict: TRUE
Reason: The Mouse Trap News article titled “Drinking Age at Disney World May be Lowered to 18” explicitly states that Disney World is lobbying to lower the drinking age, confirming that Mouse Trap News made this claim.
Sources:
• https://mousetrapnews.com/drinking-age-at-disney-world-may-be-lowered-to-18/
• https://www.10news.com/news/fact-or-fiction/fact-or-fiction-disney-world-lobbying-to-lower-drinking-age-on-florida-property
--------------------------------------------------
Claim: Disney World’s policy permits the sale of alcohol only to guests who are 21 years of age or older.
Verdict: TRUE
Reason: Official Disney World FAQ states alcoholic beverages can be purchased only by guests 21 years or older, confirming the policy.
Sources:
• https://disneyworld.disney.go.com/faq/restaurants/required-id-for-alcohol/
• https://www.disneyfoodblog.com/2023/10/08/the-one-rule-about-drinking-in-disney-world-you-need-to-know/
--------------------------------------------------
```
```text
[{'claim': 'On September 9, 2023, Mouse Trap News published an article claiming that the Walt Disney World Resort had officially removed the drinking age.',
'verdict': 'uncertain',
'reason': 'The available sources discuss rumors about Disney lowering its drinking age and debunk them, but they do not directly confirm that Mouse Trap News published an article on September 9, 2023 making that claim.',
'sources': ['https://www.pnj.com/story/news/2023/09/11/disney-world-remove-legal-drinking-age-requirement-florida-debunked/70822543007/',
'https://www.aol.com/news/fact-fiction-disney-world-lobbying-040148528.html']},
{'claim': 'The TikTok video posted by @mousetrapnews had 8.8 million views at the time of this check.',
'verdict': 'uncertain',
'reason': 'The provided excerpts do not include any view count for the specific TikTok video, so they neither confirm nor refute the claim of 8.8\u202fmillion views.',
'sources': ['https://www.tiktok.com/@mousetrapnews/video/7590889191806995743',
'https://www.tiktok.com/@mousetrapnews/video/7485897545890336046']},
{'claim': 'Mouse Trap News states on its About page that every story on its website is fake and that it is a satire site.',
'verdict': 'true',
'reason': 'The About page explicitly describes Mouse Trap News as a satire/parody site and states that its stories are made‑up and not true, confirming that the site claims all its content is fake. A Facebook post also refers to it as a satirical site.',
'sources': ['https://mousetrapnews.com/about/',
'https://www.facebook.com/groups/276199024736470/posts/358951296461242/']},
{'claim': 'The Pensacola News Journal reported that Disney World was still allowed to sell alcohol only to adults aged 21 or older at the time of the writing.',
'verdict': 'true',
'reason': 'The Pensacola News Journal article explicitly states Disney World’s alcohol policy limits sales to guests 21 years old or older, confirming the claim.',
'sources': ['https://www.pnj.com/story/news/2023/09/11/disney-world-remove-legal-drinking-age-requirement-florida-debunked/70822543007/',
'https://disneyworld.disney.go.com/faq/restaurants/required-id-for-alcohol/']},
{'claim': 'Mouse Trap News previously made a claim that Disney World was supposedly lobbying to lower the drinking age at the resort to 18.',
'verdict': 'true',
'reason': 'The Mouse Trap News article titled “Drinking Age at Disney World May be Lowered to 18” explicitly states that Disney World is lobbying to lower the drinking age, confirming that Mouse Trap News made this claim.',
'sources': ['https://mousetrapnews.com/drinking-age-at-disney-world-may-be-lowered-to-18/',
'https://www.10news.com/news/fact-or-fiction/fact-or-fiction-disney-world-lobbying-to-lower-drinking-age-on-florida-property']},
{'claim': 'Disney World’s policy permits the sale of alcohol only to guests who are 21 years of age or older.',
'verdict': 'true',
'reason': 'Official Disney World FAQ states alcoholic beverages can be purchased only by guests 21\u202fyears or older, confirming the policy.',
'sources': ['https://disneyworld.disney.go.com/faq/restaurants/required-id-for-alcohol/',
'https://www.disneyfoodblog.com/2023/10/08/the-one-rule-about-drinking-in-disney-world-you-need-to-know/']}]
```
Here's another example, this time extracting and fact-checking claims directly from a URL.
```python
article_url = "https://theonion.com/shedeur-sanders-confident-he-can-deliver-everything-browns-fans-have-come-to-expect/"
print(f"Extracting and fact-checking claims from: {article_url}")
claims_from_url = extract_claims_from_url(article_url)
if claims_from_url:
print(f"\nSuccessfully extracted {len(claims_from_url)} claims from the URL. Now fact-checking them...")
claims_text_for_fact_check = "\n".join(claims_from_url)
fact_check_results = fact_check_text(claims_text_for_fact_check)
display(fact_check_results)
else:
print("Could not extract claims from the URL to fact-check.")
```
```text
Extracting and fact-checking claims from: https://theonion.com/shedeur-sanders-confident-he-can-deliver-everything-browns-fans-have-come-to-expect/
Fetching content from URL: https://theonion.com/shedeur-sanders-confident-he-can-deliver-everything-browns-fans-have-come-to-expect/
Extracted 1224 characters from the URL. Now extracting claims...
Cerebras LLM claim extraction took 0.45 seconds
Successfully extracted 8 claims from the URL. Now fact-checking them...
Cerebras LLM claim extraction took 0.28 seconds
Extracted 6 claims:
1. Shedeur Sanders is a rookie quarterback for the Cleveland Browns.
2. Shedeur Sanders was selected in the fifth round of the NFL Draft.
3. He is the 42nd starting quarterback for the Browns since 1999.
4. The Browns had a 2–8 record at the time of the interview.
5. Dillon Gabriel previously started as quarterback for the Browns.
6. Shedeur Sanders expects to lose the starting quarterback job to Bailey Zappe after two weeks.
==================================================
Fact-checking Claim 1 of 6: 'Shedeur Sanders is a rookie quarterback for the Cleveland Browns.'
Fact-checking claim: Shedeur Sanders is a rookie quarterback for the Cleveland Browns.
Retrieved 5 evidence sources
Cerebras LLM judgment for this claim took 0.42 seconds
Verdict: TRUE
Reason: The Browns roster page lists Shedeur Sanders with experience marked as 'R' (rookie) and notes he was drafted in 2025, confirming he is a rookie quarterback for Cleveland.
Sources:
• https://www.clevelandbrowns.com/team/players-roster/shedeur-sanders/
==================================================
==================================================
Fact-checking Claim 2 of 6: 'Shedeur Sanders was selected in the fifth round of the NFL Draft.'
Fact-checking claim: Shedeur Sanders was selected in the fifth round of the NFL Draft.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.84 seconds
Verdict: TRUE
Reason: Both sources state Shedeur Sanders was chosen with the 144th overall pick, which corresponds to the fifth round of the 2025 NFL Draft.
Sources:
• https://www.clevelandbrowns.com/video/browns-select-shedeur-sanders-with-no-144-pick-in-2025-draft
• https://www.clevelandbrowns.com/news/browns-select-qb-shedeur-sanders-with-the-no-144-pick-in-the-2025-nfl-draft
==================================================
==================================================
Fact-checking Claim 3 of 6: 'He is the 42nd starting quarterback for the Browns since 1999.'
Fact-checking claim: He is the 42nd starting quarterback for the Browns since 1999.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.37 seconds
Verdict: TRUE
Reason: The Wikipedia article states that from 1999 through mid‑2025 the Browns have had 42 players start at quarterback, confirming that the most recent starter is indeed the 42nd.
Sources:
• https://en.wikipedia.org/wiki/List_of_Cleveland_Browns_starting_quarterbacks
==================================================
==================================================
Fact-checking Claim 4 of 6: 'The Browns had a 2–8 record at the time of the interview.'
Fact-checking claim: The Browns had a 2–8 record at the time of the interview.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.43 seconds
Verdict: TRUE
Reason: A news article from November 22, 2015 explicitly states the Browns were 2‑8 at that time, which aligns with the claim about the interview timing.
Sources:
• https://www.tribtoday.com/uncategorized/2015/11/browns-2-8-record-has-been-a-team-effort/
==================================================
==================================================
Fact-checking Claim 5 of 6: 'Dillon Gabriel previously started as quarterback for the Browns.'
Fact-checking claim: Dillon Gabriel previously started as quarterback for the Browns.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.37 seconds
Verdict: TRUE
Reason: The evidence shows Gabriel was named the Browns' starter for a game on October 5, 2025, indicating he has previously started as quarterback for Cleveland.
Sources:
• https://en.wikipedia.org/wiki/Dillon_Gabriel
• https://www.sports-reference.com/cfb/players/dillon-gabriel-1.html
==================================================
==================================================
Fact-checking Claim 6 of 6: 'Shedeur Sanders expects to lose the starting quarterback job to Bailey Zappe after two weeks.'
Fact-checking claim: Shedeur Sanders expects to lose the starting quarterback job to Bailey Zappe after two weeks.
Retrieved 6 evidence sources
Cerebras LLM judgment for this claim took 0.49 seconds
Verdict: UNCERTAIN
Reason: The provided sources show Bailey Zappe being elevated as a backup to Shedeur Sanders and discuss uncertainty about Sanders' future, but none mention Sanders expecting to lose his starting job within two weeks.
Sources:
• https://www.nbcsports.com/nfl/profootballtalk/rumor-mill/news/browns-elevate-bailey-zappe-to-back-up-shedeur-sanders
• https://sports.yahoo.com/nfl/article/what-does-browns-firing-kevin-stefanski-mean-for-shedeur-sanders-062323054.html
==================================================
--- Summary of All Fact-Checking Results ---
Claim: Shedeur Sanders is a rookie quarterback for the Cleveland Browns.
Verdict: TRUE
Reason: The Browns roster page lists Shedeur Sanders with experience marked as 'R' (rookie) and notes he was drafted in 2025, confirming he is a rookie quarterback for Cleveland.
Sources:
• https://www.clevelandbrowns.com/team/players-roster/shedeur-sanders/
--------------------------------------------------
Claim: Shedeur Sanders was selected in the fifth round of the NFL Draft.
Verdict: TRUE
Reason: Both sources state Shedeur Sanders was chosen with the 144th overall pick, which corresponds to the fifth round of the 2025 NFL Draft.
Sources:
• https://www.clevelandbrowns.com/video/browns-select-shedeur-sanders-with-no-144-pick-in-2025-draft
• https://www.clevelandbrowns.com/news/browns-select-qb-shedeur-sanders-with-the-no-144-pick-in-the-2025-nfl-draft
--------------------------------------------------
Claim: He is the 42nd starting quarterback for the Browns since 1999.
Verdict: TRUE
Reason: The Wikipedia article states that from 1999 through mid‑2025 the Browns have had 42 players start at quarterback, confirming that the most recent starter is indeed the 42nd.
Sources:
• https://en.wikipedia.org/wiki/List_of_Cleveland_Browns_starting_quarterbacks
--------------------------------------------------
Claim: The Browns had a 2–8 record at the time of the interview.
Verdict: TRUE
Reason: A news article from November 22, 2015 explicitly states the Browns were 2‑8 at that time, which aligns with the claim about the interview timing.
Sources:
• https://www.tribtoday.com/uncategorized/2015/11/browns-2-8-record-has-been-a-team-effort/
--------------------------------------------------
Claim: Dillon Gabriel previously started as quarterback for the Browns.
Verdict: TRUE
Reason: The evidence shows Gabriel was named the Browns' starter for a game on October 5, 2025, indicating he has previously started as quarterback for Cleveland.
Sources:
• https://en.wikipedia.org/wiki/Dillon_Gabriel
• https://www.sports-reference.com/cfb/players/dillon-gabriel-1.html
--------------------------------------------------
Claim: Shedeur Sanders expects to lose the starting quarterback job to Bailey Zappe after two weeks.
Verdict: UNCERTAIN
Reason: The provided sources show Bailey Zappe being elevated as a backup to Shedeur Sanders and discuss uncertainty about Sanders' future, but none mention Sanders expecting to lose his starting job within two weeks.
Sources:
• https://www.nbcsports.com/nfl/profootballtalk/rumor-mill/news/browns-elevate-bailey-zappe-to-back-up-shedeur-sanders
• https://sports.yahoo.com/nfl/article/what-does-browns-firing-kevin-stefanski-mean-for-shedeur-sanders-062323054.html
--------------------------------------------------
```
```text
[{'claim': 'Shedeur Sanders is a rookie quarterback for the Cleveland Browns.',
'verdict': 'true',
'reason': "The Browns roster page lists Shedeur Sanders with experience marked as 'R' (rookie) and notes he was drafted in 2025, confirming he is a rookie quarterback for Cleveland.",
'sources': ['https://www.clevelandbrowns.com/team/players-roster/shedeur-sanders/']},
{'claim': 'Shedeur Sanders was selected in the fifth round of the NFL Draft.',
'verdict': 'true',
'reason': 'Both sources state Shedeur Sanders was chosen with the 144th overall pick, which corresponds to the fifth round of the 2025 NFL Draft.',
'sources': ['https://www.clevelandbrowns.com/video/browns-select-shedeur-sanders-with-no-144-pick-in-2025-draft',
'https://www.clevelandbrowns.com/news/browns-select-qb-shedeur-sanders-with-the-no-144-pick-in-the-2025-nfl-draft']},
{'claim': 'He is the 42nd starting quarterback for the Browns since 1999.',
'verdict': 'true',
'reason': 'The Wikipedia article states that from 1999 through mid‑2025 the Browns have had 42 players start at quarterback, confirming that the most recent starter is indeed the 42nd.',
'sources': ['https://en.wikipedia.org/wiki/List_of_Cleveland_Browns_starting_quarterbacks']},
{'claim': 'The Browns had a 2–8 record at the time of the interview.',
'verdict': 'true',
'reason': 'A news article from November 22, 2015 explicitly states the Browns were 2‑8 at that time, which aligns with the claim about the interview timing.',
'sources': ['https://www.tribtoday.com/uncategorized/2015/11/browns-2-8-record-has-been-a-team-effort/']},
{'claim': 'Dillon Gabriel previously started as quarterback for the Browns.',
'verdict': 'true',
'reason': "The evidence shows Gabriel was named the Browns' starter for a game on October 5, 2025, indicating he has previously started as quarterback for Cleveland.",
'sources': ['https://en.wikipedia.org/wiki/Dillon_Gabriel',
'https://www.sports-reference.com/cfb/players/dillon-gabriel-1.html']},
{'claim': 'Shedeur Sanders expects to lose the starting quarterback job to Bailey Zappe after two weeks.',
'verdict': 'uncertain',
'reason': "The provided sources show Bailey Zappe being elevated as a backup to Shedeur Sanders and discuss uncertainty about Sanders' future, but none mention Sanders expecting to lose his starting job within two weeks.",
'sources': ['https://www.nbcsports.com/nfl/profootballtalk/rumor-mill/news/browns-elevate-bailey-zappe-to-back-up-shedeur-sanders',
'https://sports.yahoo.com/nfl/article/what-does-browns-firing-kevin-stefanski-mean-for-shedeur-sanders-062323054.html']}]
```
And with that, you've successfully built a fact-checker using gpt-oss-120B, Cerebras, and Parallel!
**⚠️ Disclaimer:**
This guide is meant purely as an educational starting point. To keep things simple, the code here skips several production concerns, such as prompt-injection defenses, input sanitization, and stricter output validation. If you decide to turn this into a real app, add those protections.
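For instance, here is a minimal sketch of stricter output validation, assuming the list-of-dicts result format shown above (the `validate_fact_check_results` helper and the allowed-verdict set are illustrative, not part of this guide's code):
```python
# Illustrative only: a lightweight validation pass over fact-check results.
# Assumes each entry is a dict with 'claim', 'verdict', 'reason', and 'sources',
# as in the output printed above.
ALLOWED_VERDICTS = {"true", "false", "uncertain"}

def validate_fact_check_results(results: list[dict]) -> list[dict]:
    validated = []
    for entry in results:
        verdict = str(entry.get("verdict", "")).lower()
        if verdict not in ALLOWED_VERDICTS:
            continue  # drop entries with unexpected verdict values
        # keep only http(s) URLs so arbitrary strings aren't echoed to users
        sources = [
            s for s in entry.get("sources", [])
            if isinstance(s, str) and s.startswith("http")
        ]
        validated.append({**entry, "verdict": verdict, "sources": sources})
    return validated
```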
**Contributors**
This guide is a joint collaboration between OpenAI, [Cerebras Systems](https://www.cerebras.ai/), and [Parallel Web Systems](https://parallel.ai/). Thanks to the following people for their valuable feedback and support.
- Vaibhav Srivastav
- Dominik Kundel
- Sarah Chieng
- Sebastian Duerr
- Matt Harris
- Lukas Levert
- Joyce Er
- Kevin Taylor
- Khushi Shelat
---
# Source: https://developers.openai.com/cookbook/examples/build_a_coding_agent_with_gpt-5.1.md
# Building a Coding Agent with GPT-5.1 and the OpenAI Agents SDK
GPT-5.1 is exceptionally strong at coding, and with the new code-editing and command-execution tools available in the [Responses API](https://platform.openai.com/docs/api-reference/responses), it’s now easier than ever to build coding agents that can work across full codebases and iterate quickly.
In this guide, we’ll use the [Agents SDK](https://openai.github.io/openai-agents-python/) to build a **coding agent that can scaffold a brand-new app from a prompt and refine it through user feedback**. Our agent will be equipped with the following tools:
- **apply_patch** — to edit files
- **shell** — to run shell commands
- **web_search** — to pull fresh information from the web
- **Context7 MCP** — to access up-to-date documentation
We’ll begin by focusing on the `shell` and `web_search` tools to generate a new project with web-sourced context. Then we’ll add `apply_patch` so the agent can iterate on the codebase, and we’ll connect it to the [Context7 MCP server](https://context7.com/) so it can write code informed by the most recent docs.
## Set up the agent
With the Agents SDK, defining an agent is as simple as providing instructions and a list of tools. In this example, we want to use the newest `gpt-5.1` model for its state-of-the-art coding abilities.
We’ll start by enabling `web_search`, which lets the agent look up current information online, and `shell`, which lets it propose shell commands for tasks like scaffolding, installing dependencies, and running build steps.
The shell tool works by letting the model propose commands it believes should be executed. Your environment is responsible for actually running those commands and returning the output.
The Agents SDK automates most of this command-execution handshake for you—you only need to implement the shell executor, the environment in which those commands will run.
```python
%pip install openai-agents openai
```
```python
import os
# Make sure your OpenAI API key is defined (you can set it on your global environment, or export it manually)
# export OPENAI_API_KEY="sk-..."
assert "OPENAI_API_KEY" in os.environ, "Please set OPENAI_API_KEY first."
```
### Define a working environment and shell executor
For simplicity, we'll run shell commands locally and isolate them in a dedicated workspace directory. This ensures the agent only interacts with files inside that folder.
**Note:** In production, **always execute shell commands in a sandboxed environment**. Arbitrary command execution is inherently risky and must be tightly controlled.
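As one lightweight example of the kind of guard you might layer on top of the approval prompt below, you could reject proposed commands whose first token isn't on an allowlist. This is only a sketch (the `ALLOWED_PREFIXES` set and `check_command_allowed` helper are illustrative, not part of the Agents SDK), and it is not a substitute for a real sandbox:
```python
# Illustrative only: a naive allowlist check for proposed shell commands.
# Compound commands (pipes, `&&` chains) would need more careful parsing,
# and a real deployment should still isolate execution in a container or VM.
import shlex

ALLOWED_PREFIXES = {"ls", "cat", "sed", "mkdir", "npm", "npx", "node"}

def check_command_allowed(command: str) -> None:
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in ALLOWED_PREFIXES:
        raise RuntimeError(f"Command not allowed: {command!r}")
```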
```python
# Create an isolated workspace for shell commands
from pathlib import Path
workspace_dir = Path("coding-agent-workspace").resolve()
workspace_dir.mkdir(exist_ok=True)
print(f"Workspace directory: {workspace_dir}")
```
```text
Workspace directory: /Users/katia/dev/openai-cookbook/examples/coding-agent-workspace
```
We’ll now define a small `ShellExecutor` class that:
- Receives a `ShellCommandRequest` from the agent
- Optionally asks for approval before running commands
- Runs them using `asyncio.create_subprocess_shell`
- Returns a `ShellResult` with the outputs
All commands will run with `cwd=workspace_dir`, so they only affect files in that subfolder.
```python
import asyncio
import os
from collections.abc import Sequence
from pathlib import Path
from typing import Literal
from agents import (
ShellTool,
ShellCommandRequest,
ShellCommandOutput,
ShellCallOutcome,
ShellResult,
)
async def require_approval(commands: Sequence[str]) -> None:
"""
Ask for confirmation before running shell commands.
Set SHELL_AUTO_APPROVE=1 in your environment to skip this prompt
(useful when you're iterating a lot or running in CI).
"""
if os.environ.get("SHELL_AUTO_APPROVE") == "1":
return
print("Shell command approval required:")
for entry in commands:
print(" ", entry)
response = input("Proceed? [y/N] ").strip().lower()
if response not in {"y", "yes"}:
raise RuntimeError("Shell command execution rejected by user.")
class ShellExecutor:
"""
Shell executor for the notebook cookbook.
- Runs all commands inside `workspace_dir`
- Captures stdout/stderr
- Enforces an optional timeout from `action.timeout_ms`
- Returns a ShellResult with ShellCommandOutput entries using ShellCallOutcome
"""
def __init__(self, cwd: Path):
self.cwd = cwd
async def __call__(self, request: ShellCommandRequest) -> ShellResult:
action = request.data.action
await require_approval(action.commands)
outputs: list[ShellCommandOutput] = []
for command in action.commands:
proc = await asyncio.create_subprocess_shell(
command,
cwd=self.cwd,
env=os.environ.copy(),
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
timed_out = False
try:
timeout = (action.timeout_ms or 0) / 1000 or None
stdout_bytes, stderr_bytes = await asyncio.wait_for(
proc.communicate(),
timeout=timeout,
)
except asyncio.TimeoutError:
proc.kill()
stdout_bytes, stderr_bytes = await proc.communicate()
timed_out = True
stdout = stdout_bytes.decode("utf-8", errors="ignore")
stderr = stderr_bytes.decode("utf-8", errors="ignore")
# Use ShellCallOutcome instead of exit_code/status fields directly
outcome = ShellCallOutcome(
type="timeout" if timed_out else "exit",
exit_code=getattr(proc, "returncode", None),
)
outputs.append(
ShellCommandOutput(
command=command,
stdout=stdout,
stderr=stderr,
outcome=outcome,
)
)
if timed_out:
# Stop running further commands if this one timed out
break
return ShellResult(
output=outputs,
provider_data={"working_directory": str(self.cwd)},
)
shell_tool = ShellTool(executor=ShellExecutor(cwd=workspace_dir))
```
### Define the agent
```python
# Define the agent's instructions
INSTRUCTIONS = '''
You are a coding assistant. The user will explain what they want to build, and your goal is to run commands to generate a new app.
You can search the web to find which command you should use based on the technical stack, and use commands to create code files.
You should also install necessary dependencies for the project to work.
'''
```
```python
from agents import Agent, Runner, ShellTool, WebSearchTool
coding_agent = Agent(
name="Coding Agent",
model="gpt-5.1",
instructions=INSTRUCTIONS,
tools=[
WebSearchTool(),
shell_tool
]
)
```
## Start a new project
Let’s send a prompt to our coding agent and then inspect the files it created in the `workspace_dir`.
In this example, we'll create a NextJS dashboard using the [shadcn](https://ui.shadcn.com/) library.
**Note:** Sometimes you might run into a `MaxTurnsExceeded` error, or the project might have a dependency error. Simply run the agent loop again. In a production environment, you would implement an outer loop or handle user input so the agent can iterate if project creation fails.
```python
prompt = "Create a new NextJS app that shows dashboard-01 from https://ui.shadcn.com/blocks on the home page"
```
```python
import asyncio
from agents import ItemHelpers, RunConfig
async def run_coding_agent_with_logs(prompt: str):
"""
Run the coding agent and stream logs about what's happening
"""
print("=== Run starting ===")
print(f"[user] {prompt}\n")
result = Runner.run_streamed(
coding_agent,
input=prompt
)
async for event in result.stream_events():
# High-level items: messages, tool calls, tool outputs, MCP, etc.
if event.type == "run_item_stream_event":
item = event.item
# 1) Tool calls (function tools, web_search, shell, MCP, etc.)
if item.type == "tool_call_item":
raw = item.raw_item
raw_type_name = type(raw).__name__
# Special-case the ones we care most about in this cookbook
if raw_type_name == "ResponseFunctionWebSearch":
print("[tool] web_search_call – agent is calling web search")
elif raw_type_name == "LocalShellCall":
# LocalShellCall.action.commands is where the commands live
commands = getattr(getattr(raw, "action", None), "commands", None)
if commands:
print(f"[tool] shell – running commands: {commands}")
else:
print("[tool] shell – running command")
else:
# Generic fallback for other tools (MCP, function tools, etc.)
print(f"[tool] {raw_type_name} called")
# 2) Tool call outputs
elif item.type == "tool_call_output_item":
# item.output is whatever your tool returned (could be structured)
output_preview = str(item.output)
if len(output_preview) > 400:
output_preview = output_preview[:400] + "…"
print(f"[tool output] {output_preview}")
# 3) Normal assistant messages
elif item.type == "message_output_item":
text = ItemHelpers.text_message_output(item)
print(f"[assistant]\n{text}\n")
# 4) Other event types (reasoning, MCP list tools, etc.) – ignore
else:
pass
print("=== Run complete ===\n")
# Once streaming is done, result.final_output contains the final answer
print("Final answer:\n")
print(result.final_output)
```
```python
await run_coding_agent_with_logs(prompt)
```
````text
=== Run starting ===
[user] Create a new NextJS app that shows dashboard-01 from https://ui.shadcn.com/blocks on the home page
Shell command approval required:
npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias "@/*"
cd shadcn-dashboard && npm install shadcn-ui class-variance-authority clsx tailwind-merge lucide-react
cd shadcn-dashboard && npx shadcn-ui@latest init -y
Proceed? [y/N] y
[tool] ResponseOutputMessage called
[tool output] $ npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias "@/*"
[?25l[2K[1G[36m?[39m [1mWould you like to use [34mReact Compiler[39m?[22m [90m›[39m [36m[4mNo[39m[24m [90m/[39m Yes
$ cd shadcn-dashboard && npm install shadcn-ui class-variance-authority clsx tailwind-merge lucide-react
stderr:
/bin/sh: line 0: cd: shadcn-dashboard…
Shell command approval required:
yes "No" | npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias "@/*"
cd shadcn-dashboard && npm install shadcn-ui class-variance-authority clsx tailwind-merge lucide-react
cd shadcn-dashboard && npx shadcn@latest init -y
Proceed? [y/N] y
[tool] ResponseOutputMessage called
[tool output] $ yes "No" | npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias "@/*"
[?25l[2K[1G[36m?[39m [1mWould you like to use [34mReact Compiler[39m?[22m [90m›[39m [36m[4mNo[39m[24m [90m/[39m Yes[2K[1G[2K[1G[32m✔[39m [1mWould you like to use [34mReact Compiler[39m?[22m [90m…[39m [36m[4mNo[39m[24m [90m/[39m Yes
[?2…
Shell command approval required:
cd shadcn-dashboard && yes "" | npx shadcn@latest init
cd shadcn-dashboard && npx shadcn@latest add button card dropdown-menu input label progress select separator sheet sidebar skeleton tabs avatar
Proceed? [y/N] y
[tool] ResponseOutputMessage called
[tool output] $ cd shadcn-dashboard && yes "" | npx shadcn@latest init
[?25l[36m?[39m [1mWhich color would you like to use as the [36mbase color[39m?[22m [90m›[39m [90m- Use arrow-keys. Return to submit.[39m
[36m❯[39m [36m[4mNeutral[39m[24m[90m[39m
Gray[90m[39m
Zinc[90m[39m
Stone[90m[39m
Slate[90m[39m
[2K[1G[32m✔[39m [1mWhich color would you like to use as the …
Shell command approval required:
cd shadcn-dashboard && ls
cd shadcn-dashboard && sed -n '1,200p' src/app/page.tsx
cd shadcn-dashboard && sed -n '1,260p' src/app/layout.tsx
Proceed? [y/N] y
[tool] ResponseOutputMessage called
[tool output] $ cd shadcn-dashboard && ls
components.json
eslint.config.mjs
next-env.d.ts
next.config.ts
[1m[36mnode_modules[m[m
package-lock.json
package.json
postcss.config.mjs
[1m[36mpublic[m[m
README.md
[1m[36msrc[m[m
tsconfig.json
$ cd shadcn-dashboard && sed -n '1,200p' src/app/page.tsx
import Image from "next/image";
export default function Home() {
return (
{/* Dashboard-01 main content */}
{/* We can just reuse the dashboard page component structure here if desired. */}
)
}
EOF
cd shadcn-dashboard && sed -n '1,260p' src/app/page.tsx
Proceed? [y/N] y
[tool] ResponseOutputMessage called
[tool output] $ cd shadcn-dashboard && sed -n '1,260p' src/app/dashboard/page.tsx
import { AppSidebar } from "@/components/app-sidebar"
import { ChartAreaInteractive } from "@/components/chart-area-interactive"
import { DataTable } from "@/components/data-table"
import { SectionCards } from "@/components/section-cards"
import { SiteHeader } from "@/components/site-header"
import {
SidebarInset,
SidebarProvi…
Shell command approval required:
cd shadcn-dashboard && cat > src/app/page.tsx << 'EOF'
import { AppSidebar } from "@/components/app-sidebar"
import { ChartAreaInteractive } from "@/components/chart-area-interactive"
import { DataTable } from "@/components/data-table"
import { SectionCards } from "@/components/section-cards"
import { SiteHeader } from "@/components/site-header"
import {
SidebarInset,
SidebarProvider,
} from "@/components/ui/sidebar"
import data from "./dashboard/data.json"
import React from "react"
export default function Home() {
return (
)
}
EOF
cd shadcn-dashboard && sed -n '1,260p' src/app/page.tsx
Proceed? [y/N] y
[tool] ResponseOutputMessage called
[tool output] $ cd shadcn-dashboard && cat > src/app/page.tsx << 'EOF'
import { AppSidebar } from "@/components/app-sidebar"
import { ChartAreaInteractive } from "@/components/chart-area-interactive"
import { DataTable } from "@/components/data-table"
import { SectionCards } from "@/components/section-cards"
import { SiteHeader } from "@/components/site-header"
import {
SidebarInset,
SidebarProvider,
} from…
[assistant]
I’ve created the Next.js app and wired it so that `dashboard-01` is rendered on the **home page (`/`)**.
Here’s what was done and how to run it locally:
1. **App creation & setup**
```bash
npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias "@/*"
cd shadcn-dashboard
# shadcn/ui & dependencies
npm install shadcn-ui class-variance-authority clsx tailwind-merge lucide-react
# Initialize shadcn/ui
npx shadcn@latest init
# Add components required by dashboard-01
npx shadcn@latest add button card dropdown-menu input label progress select separator sheet sidebar skeleton tabs avatar
npx shadcn@latest add dashboard-01
```
2. **Home page now renders dashboard-01**
I replaced `src/app/page.tsx` with a version that reuses the generated `dashboard-01` layout and data, so `/` shows that dashboard:
`src/app/page.tsx`:
```tsx
import { AppSidebar } from "@/components/app-sidebar"
import { ChartAreaInteractive } from "@/components/chart-area-interactive"
import { DataTable } from "@/components/data-table"
import { SectionCards } from "@/components/section-cards"
import { SiteHeader } from "@/components/site-header"
import {
SidebarInset,
SidebarProvider,
} from "@/components/ui/sidebar"
import data from "./dashboard/data.json"
import React from "react"
export default function Home() {
return (
)
}
```
This uses:
- `src/app/dashboard/data.json` (created by `npx shadcn add dashboard-01`)
- All the supporting components (`AppSidebar`, `SiteHeader`, `SectionCards`, charts, table, etc.) that the block installer added.
3. **Run the app**
From the `shadcn-dashboard` folder:
```bash
npm run dev
```
Then open `http://localhost:3000` — you should see the **dashboard-01** block as the home page.
If you’d like, I can also remove the `/dashboard` route or keep it as a separate page; right now, the main dashboard layout is mirrored on `/`.
=== Run complete ===
Final answer:
I’ve created the Next.js app and wired it so that `dashboard-01` is rendered on the **home page (`/`)**.
Here’s what was done and how to run it locally:
1. **App creation & setup**
```bash
npx create-next-app@latest shadcn-dashboard --typescript --eslint --tailwind --app --src-dir --import-alias "@/*"
cd shadcn-dashboard
# shadcn/ui & dependencies
npm install shadcn-ui class-variance-authority clsx tailwind-merge lucide-react
# Initialize shadcn/ui
npx shadcn@latest init
# Add components required by dashboard-01
npx shadcn@latest add button card dropdown-menu input label progress select separator sheet sidebar skeleton tabs avatar
npx shadcn@latest add dashboard-01
```
2. **Home page now renders dashboard-01**
I replaced `src/app/page.tsx` with a version that reuses the generated `dashboard-01` layout and data, so `/` shows that dashboard:
`src/app/page.tsx`:
```tsx
import { AppSidebar } from "@/components/app-sidebar"
import { ChartAreaInteractive } from "@/components/chart-area-interactive"
import { DataTable } from "@/components/data-table"
import { SectionCards } from "@/components/section-cards"
import { SiteHeader } from "@/components/site-header"
import {
SidebarInset,
SidebarProvider,
} from "@/components/ui/sidebar"
import data from "./dashboard/data.json"
import React from "react"
export default function Home() {
return (
)
}
```
This uses:
- `src/app/dashboard/data.json` (created by `npx shadcn add dashboard-01`)
- All the supporting components (`AppSidebar`, `SiteHeader`, `SectionCards`, charts, table, etc.) that the block installer added.
3. **Run the app**
From the `shadcn-dashboard` folder:
```bash
npm run dev
```
Then open `http://localhost:3000` — you should see the **dashboard-01** block as the home page.
If you’d like, I can also remove the `/dashboard` route or keep it as a separate page; right now, the main dashboard layout is mirrored on `/`.
````
Once the agent is done creating the initial project (you should see a "=== Run complete ===" log followed by the final answer), you can check the output with the following commands:
```bash
cd coding-agent-workspace/shadcn-dashboard
npm run dev
```
You should see something like this:

## Iterate on the project
Now that we have an initial version of the app, we can start iterating using the apply_patch tool. We also want to include calls to the OpenAI Responses API, and for that the model needs access to the most recent documentation. To make this possible, we’ll connect the agent to the [Context7 MCP server](https://context7.com/), which serves up-to-date docs.
### Set up the `apply_patch` tool for in-place edits
Note: in production you’ll typically want to run these edits in a sandboxed project workspace (e.g. ephemeral containers) or integrate them with the user's IDE.
```python
import hashlib
import os
from pathlib import Path
from agents import ApplyPatchTool
from agents.editor import ApplyPatchOperation, ApplyPatchResult
class ApprovalTracker:
"""Tracks which apply_patch operations have already been approved."""
def __init__(self) -> None:
self._approved: set[str] = set()
def fingerprint(self, operation: ApplyPatchOperation, relative_path: str) -> str:
hasher = hashlib.sha256()
hasher.update(operation.type.encode("utf-8"))
hasher.update(b"\0")
hasher.update(relative_path.encode("utf-8"))
hasher.update(b"\0")
hasher.update((operation.diff or "").encode("utf-8"))
return hasher.hexdigest()
def remember(self, fingerprint: str) -> None:
self._approved.add(fingerprint)
def is_approved(self, fingerprint: str) -> bool:
return fingerprint in self._approved
class WorkspaceEditor:
"""
Minimal editor for the apply_patch tool:
- keeps all edits under `root`
- optional manual approval (APPLY_PATCH_AUTO_APPROVE=1 to skip prompts)
"""
def __init__(self, root: Path, approvals: ApprovalTracker, auto_approve: bool = False) -> None:
self._root = root.resolve()
self._approvals = approvals
self._auto_approve = auto_approve or os.environ.get("APPLY_PATCH_AUTO_APPROVE") == "1"
def create_file(self, operation: ApplyPatchOperation) -> ApplyPatchResult:
relative = self._relative_path(operation.path)
self._require_approval(operation, relative)
target = self._resolve(operation.path, ensure_parent=True)
diff = operation.diff or ""
content = apply_unified_diff("", diff, create=True)
target.write_text(content, encoding="utf-8")
return ApplyPatchResult(output=f"Created {relative}")
def update_file(self, operation: ApplyPatchOperation) -> ApplyPatchResult:
relative = self._relative_path(operation.path)
self._require_approval(operation, relative)
target = self._resolve(operation.path)
original = target.read_text(encoding="utf-8")
diff = operation.diff or ""
patched = apply_unified_diff(original, diff)
target.write_text(patched, encoding="utf-8")
return ApplyPatchResult(output=f"Updated {relative}")
def delete_file(self, operation: ApplyPatchOperation) -> ApplyPatchResult:
relative = self._relative_path(operation.path)
self._require_approval(operation, relative)
target = self._resolve(operation.path)
target.unlink(missing_ok=True)
return ApplyPatchResult(output=f"Deleted {relative}")
def _relative_path(self, value: str) -> str:
resolved = self._resolve(value)
return resolved.relative_to(self._root).as_posix()
def _resolve(self, relative: str, ensure_parent: bool = False) -> Path:
candidate = Path(relative)
target = candidate if candidate.is_absolute() else (self._root / candidate)
target = target.resolve()
try:
target.relative_to(self._root)
except ValueError:
raise RuntimeError(f"Operation outside workspace: {relative}") from None
if ensure_parent:
target.parent.mkdir(parents=True, exist_ok=True)
return target
def _require_approval(self, operation: ApplyPatchOperation, display_path: str) -> None:
fingerprint = self._approvals.fingerprint(operation, display_path)
if self._auto_approve or self._approvals.is_approved(fingerprint):
self._approvals.remember(fingerprint)
return
print("\n[apply_patch] approval required")
print(f"- type: {operation.type}")
print(f"- path: {display_path}")
if operation.diff:
preview = operation.diff if len(operation.diff) < 400 else f"{operation.diff[:400]}…"
print("- diff preview:\n", preview)
answer = input("Proceed? [y/N] ").strip().lower()
if answer not in {"y", "yes"}:
raise RuntimeError("Apply patch operation rejected by user.")
self._approvals.remember(fingerprint)
def apply_unified_diff(original: str, diff: str, create: bool = False) -> str:
"""
Simple "diff" applier (adapt this based on your environment)
- For create_file, the diff can be the full desired file contents,
optionally with leading '+' on each line.
- For update_file, we treat the diff as the new file contents:
keep lines starting with ' ' or '+', drop '-' lines and diff headers.
This avoids context/delete mismatch errors while still letting the model
send familiar diff-like patches.
"""
if not diff:
return original
lines = diff.splitlines()
body: list[str] = []
for line in lines:
if not line:
body.append("")
continue
# Skip typical unified diff headers / metadata
if line.startswith("@@") or line.startswith("---") or line.startswith("+++"):
continue
prefix = line[0]
content = line[1:]
if prefix in ("+", " "):
body.append(content)
elif prefix in ("-", "\\"):
# skip deletions and "\ No newline at end of file"
continue
else:
# If it doesn't look like diff syntax, keep the full line
body.append(line)
text = "\n".join(body)
if diff.endswith("\n"):
text += "\n"
return text
approvals = ApprovalTracker()
editor = WorkspaceEditor(root=workspace_dir, approvals=approvals, auto_approve=True)
apply_patch_tool = ApplyPatchTool(editor=editor)
```
### Connect to the Context7 MCP server
```python
# Optional: set CONTEXT7_API_KEY in your environment for higher rate limits
CONTEXT7_API_KEY = os.getenv("CONTEXT7_API_KEY")
```
```python
from agents import HostedMCPTool
context7_tool = HostedMCPTool(
tool_config={
"type": "mcp",
"server_label": "context7",
"server_url": "https://mcp.context7.com/mcp",
# Basic usage works without auth; for higher rate limits, pass your key here.
**(
{"authorization": f"Bearer {CONTEXT7_API_KEY}"}
if CONTEXT7_API_KEY
else {}
),
"require_approval": "never",
},
)
```
### Update the agent
Let's create a new agent that also uses these two additional tools, and update the instructions accordingly.
To avoid a context mismatch when applying diffs, we'll instruct this agent not to edit files via shell commands.
```python
UPDATED_INSTRUCTIONS = """
You are a coding assistant helping a user with an existing project.
Use the apply_patch tool to edit files based on their feedback.
When editing files:
- Never edit code via shell commands.
- Always read the file first using `cat` with the shell tool.
- Then generate a unified diff relative to EXACTLY that content.
- Use apply_patch only once per edit attempt.
- If apply_patch fails, stop and report the error; do NOT retry.
You can search the web to find which command you should use based on the technical stack, and use commands to install dependencies if needed.
When the user refers to an external API, use the Context7 MCP server to fetch docs for that API.
For example, if they want to use the OpenAI API, search docs for the openai-python or openai-node sdk depending on the project stack.
"""
```
```python
updated_coding_agent = Agent(
name="Updated Coding Agent",
model="gpt-5.1",
instructions=UPDATED_INSTRUCTIONS,
tools=[
WebSearchTool(),
shell_tool,
apply_patch_tool,
context7_tool,
]
)
```
### Run the agent to edit the project
```python
import asyncio
from agents import ItemHelpers, Runner
async def run_updated_coding_agent_with_logs(prompt: str):
"""
Run the updated coding agent (shell + web + apply_patch + Context7 MCP)
and stream logs about what's happening.
- Logs web_search, shell, apply_patch, and MCP (Context7) calls.
- For apply_patch, logs the outputs returned by the editor.
- At the end, shows a single "Apply all changes?" prompt for the tutorial.
"""
print("=== Run starting ===")
print(f"[user] {prompt}\n")
apply_patch_seen = False
# Start streamed run
result = Runner.run_streamed(
updated_coding_agent,
input=prompt,
)
async for event in result.stream_events():
if event.type != "run_item_stream_event":
continue
item = event.item
# 1) Tool calls (function tools, web_search, shell, MCP, etc.)
if item.type == "tool_call_item":
raw = item.raw_item
raw_type_name = type(raw).__name__
# web_search (hosted Responses tool)
if raw_type_name == "ResponseFunctionWebSearch":
print("[tool] web_search – agent is calling web search")
# shell (new ShellTool executor)
elif raw_type_name == "LocalShellCall":
action = getattr(raw, "action", None)
commands = getattr(action, "commands", None) if action else None
if commands:
print(f"[tool] shell – running commands: {commands}")
else:
print("[tool] shell – running command")
# MCP (e.g. Context7)
elif "MCP" in raw_type_name or "Mcp" in raw_type_name:
tool_name = getattr(raw, "tool_name", None)
if tool_name is None:
action = getattr(raw, "action", None)
tool_name = getattr(action, "tool", None) if action else None
server_label = getattr(raw, "server_label", None)
label_str = f" (server={server_label})" if server_label else ""
if tool_name:
print(f"[tool] mcp{label_str} – calling tool {tool_name!r}")
else:
print(f"[tool] mcp{label_str} – MCP tool call")
# Generic fallback for other tools (including hosted ones)
else:
print(f"[tool] {raw_type_name} called")
# 2) Tool call outputs (where apply_patch shows up)
elif item.type == "tool_call_output_item":
raw = item.raw_item
output_preview = str(item.output)
# Detect apply_patch via raw_item type or output format
is_apply_patch = False
if isinstance(raw, dict) and raw.get("type") == "apply_patch_call_output":
is_apply_patch = True
elif any(
output_preview.startswith(prefix)
for prefix in ("Created ", "Updated ", "Deleted ")
):
is_apply_patch = True
if is_apply_patch:
apply_patch_seen = True
if len(output_preview) > 400:
output_preview = output_preview[:400] + "…"
print(f"[apply_patch] {output_preview}\n")
else:
if len(output_preview) > 400:
output_preview = output_preview[:400] + "…"
print(f"[tool output]\n{output_preview}\n")
# 3) Normal assistant messages
elif item.type == "message_output_item":
text = ItemHelpers.text_message_output(item)
print(f"[assistant]\n{text}\n")
# 4) Other event types – ignore for now
else:
pass
print("=== Run complete ===\n")
# Final answer
print("Final answer:\n")
print(result.final_output)
# Single end-of-run confirmation about edits
if apply_patch_seen:
_ = print("\n[apply_patch] One or more apply_patch calls were executed.")
else:
print("\n[apply_patch] No apply_patch calls detected in this run.")
```
```python
edit_prompt = '''Update the dashboard to add a 'summarize' button in the top right corner.
When clicked, use the OpenAI Responses API with the gpt-5.1 model to generate a summary of the metrics on the dashboard, and display it in a modal.'''
```
```python
await run_updated_coding_agent_with_logs(edit_prompt)
```
````text
=== Run starting ===
[user] Update the dashboard to add a 'summarize' button in the top right corner.
When clicked, use the OpenAI Responses API with the gpt-5.1 model to generate a summary of the metrics on the dashboard, and display it in a modal.
Shell command approval required:
ls
ls -R
cat package.json || pip show flask || pip show django || echo 'no package.json'
Proceed? [y/N] y
[tool] ResponseOutputMessage called
[tool output]
$ ls
[1m[36mshadcn-dashboard[m[m
$ ls -R
[1m[36mshadcn-dashboard[m[m
./shadcn-dashboard:
components.json
eslint.config.mjs
next-env.d.ts
next.config.ts
package-lock.json
package.json
postcss.config.mjs
[1m[36mpublic[m[m
README.md
[1m[36msrc[m[m
tsconfig.json
./shadcn-dashboard/public:
file.svg
globe.svg
next.svg
vercel.svg
window.svg
./shadcn-dashboard/src:
[1m[36mapp[m[m
…
Shell command approval required:
cd shadcn-dashboard && cat package.json
cd shadcn-dashboard && cat src/components/site-header.tsx
cd shadcn-dashboard && cat src/app/dashboard/page.tsx
Proceed? [y/N] y
[tool] ResponseOutputMessage called
[tool output]
$ cd shadcn-dashboard && cat package.json
{
"name": "shadcn-dashboard",
"version": "0.1.0",
"private": true,
"scripts": {
"dev": "next dev",
"build": "next build",
"start": "next start",
"lint": "eslint"
},
"dependencies": {
"@dnd-kit/core": "^6.3.1",
"@dnd-kit/modifiers": "^9.0.0",
"@dnd-kit/sortable": "^10.0.0",
"@dnd-kit/utilities": "^3.2.2",
"@ra…
[tool] mcp (server=context7) – MCP tool call
[tool] mcp (server=context7) – MCP tool call
Shell command approval required:
cd shadcn-dashboard && cat src/components/ui/dialog.tsx || echo 'no dialog'
cd shadcn-dashboard && cat src/app/layout.tsx
Proceed? [y/N] y
Shell command approval required:
cd shadcn-dashboard && npm install openai
cd shadcn-dashboard && ls src/components/ui
Proceed? [y/N] y
[tool] ResponseOutputMessage called
[tool] ResponseOutputMessage called
[tool output]
$ cd shadcn-dashboard && cat src/components/ui/dialog.tsx || echo 'no dialog'
no dialog
stderr:
cat: src/components/ui/dialog.tsx: No such file or directory
$ cd shadcn-dashboard && cat src/app/layout.tsx
import type { Metadata } from "next";
import { Geist, Geist_Mono } from "next/font/google";
import "./globals.css";
const geistSans = Geist({
variable: "--font-geist-sans",
subsets: ["lati…
[tool output]
$ cd shadcn-dashboard && npm install openai
added 513 packages, and audited 514 packages in 10s
150 packages are looking for funding
run `npm fund` for details
found 0 vulnerabilities
$ cd shadcn-dashboard && ls src/components/ui
avatar.tsx
badge.tsx
breadcrumb.tsx
button.tsx
card.tsx
chart.tsx
checkbox.tsx
drawer.tsx
dropdown-menu.tsx
input.tsx
label.tsx
progress.tsx
select.tsx
separator.ts…
[tool] ResponseOutputMessage called
[tool] ResponseOutputMessage called
[tool] ResponseOutputMessage called
[apply_patch] Created shadcn-dashboard/src/components/ui/dialog.tsx
[apply_patch] Updated shadcn-dashboard/src/components/site-header.tsx
[apply_patch] Created shadcn-dashboard/src/app/api/summarize/route.ts
[assistant]
I’ve wired up a “Summarize” button and hooked it to the OpenAI Responses API with `gpt-5.1`, returning the result in a modal.
Here’s what changed and how to use it.
---
## 1. New dialog UI primitive
Added a Radix-based dialog component so we can show the summary in a modal.
**File:** `src/components/ui/dialog.tsx`
(standard shadcn-style Dialog wrapper)
You now have:
```ts
export {
Dialog,
DialogPortal,
DialogOverlay,
DialogClose,
DialogTrigger,
DialogContent,
DialogHeader,
DialogFooter,
DialogTitle,
DialogDescription,
}
```
You don’t need to change anything here unless you want to tweak styling/animations.
---
## 2. Header updated with “Summarize” button + modal
**File:** `src/components/site-header.tsx`
Key updates:
```tsx
import { useState } from "react"
import { Button } from "@/components/ui/button"
import {
Dialog,
DialogContent,
DialogHeader,
DialogTitle,
} from "@/components/ui/dialog"
import { Separator } from "@/components/ui/separator"
import { SidebarTrigger } from "@/components/ui/sidebar"
export function SiteHeader() {
const [open, setOpen] = useState(false)
const [loading, setLoading] = useState(false)
const [summary, setSummary] = useState(null)
const [error, setError] = useState(null)
async function handleSummarize() {
try {
setLoading(true)
setError(null)
const res = await fetch("/api/summarize", {
method: "POST",
})
if (!res.ok) {
throw new Error("Failed to generate summary")
}
const data = (await res.json()) as { summary: string }
setSummary(data.summary)
} catch (err) {
console.error(err)
setError(
err instanceof Error
? err.message
: "Something went wrong generating the summary.",
)
} finally {
setLoading(false)
}
}
return (
<>
Dashboard
>
)
}
```
Behavior:
- The **Summarize** button is in the top-right of the header.
- Clicking it:
- Opens the modal.
- Immediately calls `/api/summarize`.
- The modal shows:
- “Analyzing metrics…” while loading.
- The summary when complete.
- An error message if the API call fails.
---
## 3. API route using OpenAI Responses API with `gpt-5.1`
**File:** `src/app/api/summarize/route.ts`
```ts
import { NextResponse } from "next/server"
import OpenAI from "openai"
const client = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
})
export async function POST() {
try {
const response = await client.responses.create({
model: "gpt-5.1",
instructions:
"You are a concise analytics assistant. Summarize the key insights from a metrics dashboard that includes KPI cards, a time-series chart, and a data table. Focus on trends, anomalies, and overall performance in 3-5 short bullet points. Do not fabricate exact numbers.",
input:
"Summarize the current metrics view for a business stakeholder at a high level.",
})
const summary = (response as any).output_text ?? "No summary generated."
return NextResponse.json({ summary })
} catch (error) {
console.error("Error generating summary:", error)
return NextResponse.json(
{ error: "Failed to generate summary" },
{ status: 500 },
)
}
}
```
Notes:
- Uses the **OpenAI Responses API**.
- Model is explicitly set to `"gpt-5.1"` as requested.
- Returns JSON `{ summary: string }` to the client.
- Reads API key from `process.env.OPENAI_API_KEY`.
---
## 4. Environment setup
Ensure you have your OpenAI key set:
- Create a `.env.local` in `shadcn-dashboard` (if you don’t already):
```bash
OPENAI_API_KEY=sk-...
```
Next.js will expose this to the server runtime only (safe for this API route).
---
## 5. How to run
From `shadcn-dashboard`:
```bash
npm run dev
```
Navigate to `/dashboard`:
- You’ll see the **Summarize** button in the header.
- Click it to open the modal and generate a metrics summary.
---
If you’d like the summary to reflect actual numbers from your dashboard (e.g., pull from `data.json`, cards, or charts), I can next wire the API route to read that data and send it as structured input to the Responses API instead of the generic instructions.
=== Run complete ===
Final answer:
I’ve wired up a “Summarize” button and hooked it to the OpenAI Responses API with `gpt-5.1`, returning the result in a modal.
Here’s what changed and how to use it.
---
## 1. New dialog UI primitive
Added a Radix-based dialog component so we can show the summary in a modal.
**File:** `src/components/ui/dialog.tsx`
(standard shadcn-style Dialog wrapper)
You now have:
```ts
export {
Dialog,
DialogPortal,
DialogOverlay,
DialogClose,
DialogTrigger,
DialogContent,
DialogHeader,
DialogFooter,
DialogTitle,
DialogDescription,
}
```
You don’t need to change anything here unless you want to tweak styling/animations.
---
## 2. Header updated with “Summarize” button + modal
**File:** `src/components/site-header.tsx`
Key updates:
```tsx
import { useState } from "react"
import { Button } from "@/components/ui/button"
import {
Dialog,
DialogContent,
DialogHeader,
DialogTitle,
} from "@/components/ui/dialog"
import { Separator } from "@/components/ui/separator"
import { SidebarTrigger } from "@/components/ui/sidebar"
export function SiteHeader() {
const [open, setOpen] = useState(false)
const [loading, setLoading] = useState(false)
const [summary, setSummary] = useState(null)
const [error, setError] = useState(null)
async function handleSummarize() {
try {
setLoading(true)
setError(null)
const res = await fetch("/api/summarize", {
method: "POST",
})
if (!res.ok) {
throw new Error("Failed to generate summary")
}
const data = (await res.json()) as { summary: string }
setSummary(data.summary)
} catch (err) {
console.error(err)
setError(
err instanceof Error
? err.message
: "Something went wrong generating the summary.",
)
} finally {
setLoading(false)
}
}
return (
<>
Dashboard
>
)
}
```
Behavior:
- The **Summarize** button is in the top-right of the header.
- Clicking it:
- Opens the modal.
- Immediately calls `/api/summarize`.
- The modal shows:
- “Analyzing metrics…” while loading.
- The summary when complete.
- An error message if the API call fails.
---
## 3. API route using OpenAI Responses API with `gpt-5.1`
**File:** `src/app/api/summarize/route.ts`
```ts
import { NextResponse } from "next/server"
import OpenAI from "openai"
const client = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
})
export async function POST() {
try {
const response = await client.responses.create({
model: "gpt-5.1",
instructions:
"You are a concise analytics assistant. Summarize the key insights from a metrics dashboard that includes KPI cards, a time-series chart, and a data table. Focus on trends, anomalies, and overall performance in 3-5 short bullet points. Do not fabricate exact numbers.",
input:
"Summarize the current metrics view for a business stakeholder at a high level.",
})
const summary = (response as any).output_text ?? "No summary generated."
return NextResponse.json({ summary })
} catch (error) {
console.error("Error generating summary:", error)
return NextResponse.json(
{ error: "Failed to generate summary" },
{ status: 500 },
)
}
}
```
Notes:
- Uses the **OpenAI Responses API**.
- Model is explicitly set to `"gpt-5.1"` as requested.
- Returns JSON `{ summary: string }` to the client.
- Reads API key from `process.env.OPENAI_API_KEY`.
---
## 4. Environment setup
Ensure you have your OpenAI key set:
- Create a `.env.local` in `shadcn-dashboard` (if you don’t already):
```bash
OPENAI_API_KEY=sk-...
```
Next.js will expose this to the server runtime only (safe for this API route).
---
## 5. How to run
From `shadcn-dashboard`:
```bash
npm run dev
```
Navigate to `/dashboard`:
- You’ll see the **Summarize** button in the header.
- Click it to open the modal and generate a metrics summary.
---
If you’d like the summary to reflect actual numbers from your dashboard (e.g., pull from `data.json`, cards, or charts), I can next wire the API route to read that data and send it as structured input to the Responses API instead of the generic instructions.
[apply_patch] One or more apply_patch calls were executed.
````
Once the agent is done updating the project (you should see a "=== Run complete ===" log followed by the final answer), you will see the updated UI, with the OpenAI Responses API call to summarize what's on the dashboard.
**Note**: If this step fails, you can re-run the agent loop. In a production environment, you would implement an outer loop that handles errors or waits for user input and iterates.
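A minimal sketch of such an outer loop is shown below. Here `run_agent_loop` is a placeholder for whatever entry point your agent run uses, so treat the whole snippet as an illustration rather than part of the project code:
```python
# Sketch of a production-style outer loop: retry the agent run on failure with
# simple backoff, then surface the error so a human can decide the next step.
import time

def run_with_retries(run_agent_loop, task: str, max_attempts: int = 3) -> str:
    last_error: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        try:
            # run_agent_loop is a placeholder for your agent's entry point.
            return run_agent_loop(task)
        except Exception as exc:  # in practice, catch the specific errors you expect
            last_error = exc
            print(f"Attempt {attempt} failed: {exc}")
            time.sleep(2 * attempt)  # simple linear backoff before retrying
    raise RuntimeError("Agent loop failed after retries; hand off to a human.") from last_error
```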

## Wrapping up
In this cookbook guide, we built a coding agent that can scaffold a project, refine it through patches, execute commands, and stay up to date with external documentation. By combining GPT 5.1 with the Agents SDK and tools like `shell`, `apply_patch`, `web_search`, and the Context7 MCP, you can create agents that don’t just generate code—they actively work with codebases: running commands, applying edits, pulling in fresh context, and evolving a project end-to-end.
This workflow is a powerful blueprint for building agents that feel less like tools and more like collaborators. You can extend this pattern to integrate agents into IDEs or code sandboxes, generate new apps from scratch, work across large codebases, or even collaborate with developers in real time.
---
# Source: https://developers.openai.com/cookbook/examples/codex/build_code_review_with_codex_sdk.md
# Build Code Review with the Codex SDK
With [Code Review](https://chatgpt.com/codex/settings/code-review) in Codex Cloud, you can connect your team's cloud-hosted GitHub repository to Codex and receive automated code reviews on every PR. But what if your code is hosted on-prem, or you don't use GitHub as your SCM?
Luckily, we can replicate Codex's cloud-hosted review process in our own CI/CD runners. In this guide, we'll build our own code review action using the Codex CLI's headless mode with GitHub Actions, GitLab CI/CD, and Jenkins.
Model recommendation: use `gpt-5.2-codex` for the strongest code review accuracy and consistency in these workflows.
To build our own code review, we'll take the following steps (a minimal Python sketch of the complete flow follows the list):
1. Install the Codex CLI in our CI/CD runner
2. Prompt Codex in headless (exec) mode with the Code Review prompt that ships with the CLI
3. Specify a structured output JSON schema for Codex
4. Parse the JSON result and use it to make API calls to our SCM to create review comments
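Before wiring this into CI, here is a compact sketch of steps 2 through 4. It assumes the Codex CLI is already installed on the runner and that `GITHUB_TOKEN`, `REPOSITORY`, `PR_NUMBER`, and `HEAD_SHA` are available as environment variables; the findings shape matches the JSON schema we define in the workflows below.
```python
# Minimal sketch: run Codex headless with a structured-output schema, then post
# each finding as an inline PR review comment via the GitHub REST API.
import json
import os
import subprocess
import urllib.request

def run_codex_review(prompt_path: str) -> dict:
    """Steps 2 + 3: prompt Codex in exec mode and collect its structured output."""
    with open(prompt_path, "rb") as prompt:
        subprocess.run(
            [
                "codex", "exec",
                "--output-schema", "codex-output-schema.json",
                "--output-last-message", "codex-output.json",
                "--sandbox", "read-only",
                "-",  # read the prompt from stdin
            ],
            stdin=prompt,
            check=True,
        )
    with open("codex-output.json") as f:
        return json.load(f)

def post_inline_comment(finding: dict) -> None:
    """Step 4: turn one finding into an inline review comment on the PR."""
    repo = os.environ["REPOSITORY"]
    pr = os.environ["PR_NUMBER"]
    payload = {
        "body": f"{finding['title']}\n\n{finding['body']}",
        "commit_id": os.environ["HEAD_SHA"],
        "path": finding["code_location"]["absolute_file_path"],
        "line": finding["code_location"]["line_range"]["end"],
        "side": "RIGHT",
    }
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/pulls/{pr}/comments",
        data=json.dumps(payload).encode(),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
        method="POST",
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    review = run_codex_review("codex-prompt.md")
    for finding in review.get("findings", []):
        post_inline_comment(finding)
```
The CI examples below implement the same flow with `jq` and `curl` directly in the runner.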
Once implemented, Codex will be able to leave inline code review comments directly on your pull requests.
## The Code Review Prompt
GPT-5.2-Codex has received specific training to improve its code review abilities. You can steer GPT-5.2-Codex to conduct a code review with the following prompt:
```
You are acting as a reviewer for a proposed code change made by another engineer.
Focus on issues that impact correctness, performance, security, maintainability, or developer experience.
Flag only actionable issues introduced by the pull request.
When you flag an issue, provide a short, direct explanation and cite the affected file and line range.
Prioritize severe issues and avoid nit-level comments unless they block understanding of the diff.
After listing findings, produce an overall correctness verdict (\"patch is correct\" or \"patch is incorrect\") with a concise justification and a confidence score between 0 and 1.
Ensure that file citations and line numbers are exactly correct using the tools available; if they are incorrect your comments will be rejected.
```
## Codex Structured Outputs
In order to make comments on code ranges in our pull request, we need to receive Codex's response in a specific format. To do that we can create a file called `codex-output-schema.json` that conforms to OpenAI's [structured outputs](https://platform.openai.com/docs/guides/structured-outputs) format.
To use this file in our workflow YAML, we can call Codex with the `output-schema-file` argument like this:
```yaml
- name: Run Codex structured review
id: run-codex
uses: openai/codex-action@main
with:
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
prompt-file: codex-prompt.md
sandbox: read-only
model: ${{ env.CODEX_MODEL }}
output-schema-file: codex-output-schema.json # <-- Our schema file
output-file: codex-output.json
```
You can also pass a similar argument to `codex exec`, for example:
```bash
codex exec "Review my pull request!" --output-schema codex-output-schema.json
```
## GitHub Actions Example
Let's put it all together. If you're using GitHub Actions in an on-prem environment, you can tailor this example to your specific workflow. Inline comments highlight the key steps.
```yaml
name: Codex Code Review
# Determine when the review action should be run:
on:
pull_request:
types:
- opened
- reopened
- synchronize
- ready_for_review
concurrency:
group: codex-structured-review-${{ github.event.pull_request.number }}
cancel-in-progress: true
jobs:
codex-structured-review:
name: Run Codex structured review
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
GITHUB_TOKEN: ${{ github.token }}
CODEX_MODEL: ${{ vars.CODEX_MODEL || 'o4-mini' }}
PR_NUMBER: ${{ github.event.pull_request.number }}
HEAD_SHA: ${{ github.event.pull_request.head.sha }}
BASE_SHA: ${{ github.event.pull_request.base.sha }}
REPOSITORY: ${{ github.repository }}
outputs:
codex-output: ${{ steps.run-codex.outputs.final-message }}
steps:
- name: Checkout pull request merge commit
uses: actions/checkout@v5
with:
ref: refs/pull/${{ github.event.pull_request.number }}/merge
- name: Fetch base and head refs
run: |
set -euxo pipefail
git fetch --no-tags origin \
"${{ github.event.pull_request.base.ref }}" \
+refs/pull/${{ github.event.pull_request.number }}/head
shell: bash
# The structured output schema ensures that codex produces comments
# with filepaths, line numbers, title, body, etc.
- name: Generate structured output schema
run: |
set -euo pipefail
cat <<'JSON' > codex-output-schema.json
{
"type": "object",
"properties": {
"findings": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": {
"type": "string",
"maxLength": 80
},
"body": {
"type": "string",
"minLength": 1
},
"confidence_score": {
"type": "number",
"minimum": 0,
"maximum": 1
},
"priority": {
"type": "integer",
"minimum": 0,
"maximum": 3
},
"code_location": {
"type": "object",
"properties": {
"absolute_file_path": {
"type": "string",
"minLength": 1
},
"line_range": {
"type": "object",
"properties": {
"start": {
"type": "integer",
"minimum": 1
},
"end": {
"type": "integer",
"minimum": 1
}
},
"required": [
"start",
"end"
],
"additionalProperties": false
}
},
"required": [
"absolute_file_path",
"line_range"
],
"additionalProperties": false
}
},
"required": [
"title",
"body",
"confidence_score",
"priority",
"code_location"
],
"additionalProperties": false
}
},
"overall_correctness": {
"type": "string",
"enum": [
"patch is correct",
"patch is incorrect"
]
},
"overall_explanation": {
"type": "string",
"minLength": 1
},
"overall_confidence_score": {
"type": "number",
"minimum": 0,
"maximum": 1
}
},
"required": [
"findings",
"overall_correctness",
"overall_explanation",
"overall_confidence_score"
],
"additionalProperties": false
}
JSON
shell: bash
# This section generates our prompt:
- name: Build Codex review prompt
env:
REVIEW_PROMPT_PATH: ${{ vars.CODEX_PROMPT_PATH || 'review_prompt.md' }}
run: |
set -euo pipefail
PROMPT_PATH="codex-prompt.md"
TEMPLATE_PATH="${REVIEW_PROMPT_PATH}"
if [ -n "$TEMPLATE_PATH" ] && [ -f "$TEMPLATE_PATH" ]; then
cat "$TEMPLATE_PATH" > "$PROMPT_PATH"
else
{
printf '%s\n' "You are acting as a reviewer for a proposed code change made by another engineer."
printf '%s\n' "Focus on issues that impact correctness, performance, security, maintainability, or developer experience."
printf '%s\n' "Flag only actionable issues introduced by the pull request."
printf '%s\n' "When you flag an issue, provide a short, direct explanation and cite the affected file and line range."
printf '%s\n' "Prioritize severe issues and avoid nit-level comments unless they block understanding of the diff."
printf '%s\n' "After listing findings, produce an overall correctness verdict (\"patch is correct\" or \"patch is incorrect\") with a concise justification and a confidence score between 0 and 1."
printf '%s\n' "Ensure that file citations and line numbers are exactly correct using the tools available; if they are incorrect your comments will be rejected."
} > "$PROMPT_PATH"
fi
{
echo ""
echo "Repository: ${REPOSITORY}"
echo "Pull Request #: ${PR_NUMBER}"
echo "Base ref: ${{ github.event.pull_request.base.ref }}"
echo "Head ref: ${{ github.event.pull_request.head.ref }}"
echo "Base SHA: ${BASE_SHA}"
echo "Head SHA: ${HEAD_SHA}"
echo "Changed files:"
git --no-pager diff --name-status "${BASE_SHA}" "${HEAD_SHA}"
echo ""
echo "Unified diff (context=5):"
git --no-pager diff --unified=5 --stat=200 "${BASE_SHA}" "${HEAD_SHA}" > /tmp/diffstat.txt
git --no-pager diff --unified=5 "${BASE_SHA}" "${HEAD_SHA}" > /tmp/full.diff
cat /tmp/diffstat.txt
echo ""
cat /tmp/full.diff
} >> "$PROMPT_PATH"
shell: bash
# Putting it all together: we run the codex action with our code review prompt,
# structured output, and output file:
- name: Run Codex structured review
id: run-codex
uses: openai/codex-action@main
with:
openai-api-key: ${{ secrets.OPENAI_API_KEY }}
prompt-file: codex-prompt.md
output-schema-file: codex-output-schema.json
output-file: codex-output.json
sandbox: read-only
model: ${{ env.CODEX_MODEL }}
- name: Inspect structured Codex output
if: ${{ always() }}
run: |
if [ -s codex-output.json ]; then
jq '.' codex-output.json || true
else
echo "Codex output file missing"
fi
shell: bash
# This step produces in-line code review comments on specific line
# ranges of code.
- name: Publish inline review comments
if: ${{ always() }}
env:
REVIEW_JSON: codex-output.json
run: |
set -euo pipefail
if [ ! -s "$REVIEW_JSON" ]; then
echo "No Codex output file present; skipping comment publishing."
exit 0
fi
findings_count=$(jq '.findings | length' "$REVIEW_JSON")
if [ "$findings_count" -eq 0 ]; then
echo "Codex returned no findings; skipping inline comments."
exit 0
fi
jq -c --arg commit "$HEAD_SHA" '.findings[] | {
body: (.title + "\n\n" + .body + "\n\nConfidence: " + (.confidence_score | tostring) + (if has("priority") then "\nPriority: P" + (.priority | tostring) else "" end)),
commit_id: $commit,
path: .code_location.absolute_file_path,
line: .code_location.line_range.end,
side: "RIGHT",
start_line: (if .code_location.line_range.start != .code_location.line_range.end then .code_location.line_range.start else null end),
start_side: (if .code_location.line_range.start != .code_location.line_range.end then "RIGHT" else null end)
} | with_entries(select(.value != null))' "$REVIEW_JSON" > findings.jsonl
while IFS= read -r payload; do
echo "Posting review comment payload:" && echo "$payload" | jq '.'
curl -sS \
-X POST \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer ${GITHUB_TOKEN}" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"https://api.github.com/repos/${REPOSITORY}/pulls/${PR_NUMBER}/comments" \
-d "$payload"
done < findings.jsonl
shell: bash
# This section creates a single comment summarizing the review.
- name: Publish overall summary comment
if: ${{ always() }}
env:
REVIEW_JSON: codex-output.json
run: |
set -euo pipefail
if [ ! -s "$REVIEW_JSON" ]; then
echo "Codex output missing; skipping summary."
exit 0
fi
overall_state=$(jq -r '.overall_correctness' "$REVIEW_JSON")
overall_body=$(jq -r '.overall_explanation' "$REVIEW_JSON")
confidence=$(jq -r '.overall_confidence_score' "$REVIEW_JSON")
msg="**Codex automated review**\n\nVerdict: ${overall_state}\nConfidence: ${confidence}\n\n${overall_body}"
curl -sS \
-X POST \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer ${GITHUB_TOKEN}" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"https://api.github.com/repos/${REPOSITORY}/issues/${PR_NUMBER}/comments" \
-d "$(jq -n --arg body "$msg" '{body: $body}')"
shell: bash
```
## GitLab Example
GitLab doesn’t have a direct equivalent to the GitHub Action, but you can run `codex exec` inside GitLab CI/CD to perform automated code reviews.
However, the GitHub Action includes an important [safety strategy](https://github.com/openai/codex-action?tab=readme-ov-file#safety-strategy): it drops sudo permissions so Codex cannot access its own OpenAI API key. This isolation is critical—especially for public repositories where sensitive secrets (like your OpenAI API key) may be present—because it prevents Codex from reading or exfiltrating credentials during execution.
Before running this job, configure your GitLab project:
1. Go to **Project → Settings → CI/CD**.
2. Expand the **Variables** section.
3. Add these variables:
- `OPENAI_API_KEY`
- `GITLAB_TOKEN`
4. Mark them as masked/protected as appropriate.
5. Add the following GitLab example job to your `.gitlab-ci.yml` file at the root of your repository so it runs during merge request pipelines.
Please be careful with your API key on public repositories.
```yaml
stages:
- review
codex-structured-review:
stage: review
image: ubuntu:22.04
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
variables:
PR_NUMBER: $CI_MERGE_REQUEST_IID
REPOSITORY: "$CI_PROJECT_PATH"
BASE_SHA: "$CI_MERGE_REQUEST_DIFF_BASE_SHA"
HEAD_SHA: "$CI_COMMIT_SHA"
before_script:
- apt-get update -y
- apt-get install -y git curl jq
- |
if ! command -v codex >/dev/null 2>&1; then
ARCH="$(uname -m)"
case "$ARCH" in
x86_64) CODEX_PLATFORM="x86_64-unknown-linux-musl" ;;
aarch64|arm64) CODEX_PLATFORM="aarch64-unknown-linux-musl" ;;
*)
echo "Unsupported architecture: $ARCH"
exit 1
;;
esac
CODEX_VERSION="${CODEX_VERSION:-latest}"
if [ -n "${CODEX_DOWNLOAD_URL:-}" ]; then
CODEX_URL="$CODEX_DOWNLOAD_URL"
elif [ "$CODEX_VERSION" = "latest" ]; then
CODEX_URL="https://github.com/openai/codex/releases/latest/download/codex-${CODEX_PLATFORM}.tar.gz"
else
CODEX_URL="https://github.com/openai/codex/releases/download/${CODEX_VERSION}/codex-${CODEX_PLATFORM}.tar.gz"
fi
TMP_DIR="$(mktemp -d)"
curl -fsSL "$CODEX_URL" -o "$TMP_DIR/codex.tar.gz"
tar -xzf "$TMP_DIR/codex.tar.gz" -C "$TMP_DIR"
install -m 0755 "$TMP_DIR"/codex-* /usr/local/bin/codex
rm -rf "$TMP_DIR"
fi
- git fetch origin $CI_MERGE_REQUEST_TARGET_BRANCH_NAME
- git fetch origin $CI_MERGE_REQUEST_SOURCE_BRANCH_NAME
- git checkout $CI_MERGE_REQUEST_SOURCE_BRANCH_NAME
script:
- echo "Running Codex structured review for MR !${PR_NUMBER}"
# Generate structured output schema
- |
cat <<'JSON' > codex-output-schema.json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Codex Structured Review",
"type": "object",
"additionalProperties": false,
"required": [
"overall_correctness",
"overall_explanation",
"overall_confidence_score",
"findings"
],
"properties": {
"overall_correctness": {
"type": "string",
"description": "Overall verdict for the merge request."
},
"overall_explanation": {
"type": "string",
"description": "Explanation backing up the verdict."
},
"overall_confidence_score": {
"type": "number",
"minimum": 0,
"maximum": 1,
"description": "Confidence level for the verdict."
},
"findings": {
"type": "array",
"description": "Collection of actionable review findings.",
"items": {
"type": "object",
"additionalProperties": false,
"required": [
"title",
"body",
"confidence_score",
"code_location"
],
"properties": {
"title": {
"type": "string"
},
"body": {
"type": "string"
},
"confidence_score": {
"type": "number",
"minimum": 0,
"maximum": 1
},
"code_location": {
"type": "object",
"additionalProperties": false,
"required": [
"absolute_file_path",
"relative_file_path",
"line_range"
],
"properties": {
"absolute_file_path": {
"type": "string"
},
"relative_file_path": {
"type": "string"
},
"line_range": {
"type": "object",
"additionalProperties": false,
"required": [
"start",
"end"
],
"properties": {
"start": {
"type": "integer",
"minimum": 1
},
"end": {
"type": "integer",
"minimum": 1
}
}
}
}
}
}
},
"default": []
}
}
}
JSON
# Build Codex review prompt
- |
PROMPT_PATH="codex-prompt.md"
TEMPLATE_PATH="${REVIEW_PROMPT_PATH:-review_prompt.md}"
if [ -n "$TEMPLATE_PATH" ] && [ -f "$TEMPLATE_PATH" ]; then
cat "$TEMPLATE_PATH" > "$PROMPT_PATH"
else
{
printf '%s\n' "You are acting as a reviewer for a proposed code change..."
printf '%s\n' "Focus on issues that impact correctness, performance, security..."
printf '%s\n' "Flag only actionable issues introduced by this merge request..."
printf '%s\n' "Provide an overall correctness verdict..."
} > "$PROMPT_PATH"
fi
{
echo ""
echo "Repository: ${REPOSITORY}"
echo "Merge Request #: ${PR_NUMBER}"
echo "Base SHA: ${BASE_SHA}"
echo "Head SHA: ${HEAD_SHA}"
echo ""
echo "Changed files:"
git --no-pager diff --name-status "${BASE_SHA}" "${HEAD_SHA}"
echo ""
echo "Unified diff (context=5):"
git --no-pager diff --unified=5 "${BASE_SHA}" "${HEAD_SHA}"
} >> "$PROMPT_PATH"
# Run Codex exec CLI
- |
printenv OPENAI_API_KEY | codex login --with-api-key && \
codex exec --output-schema codex-output-schema.json \
--output-last-message codex-output.json \
--sandbox read-only \
- < codex-prompt.md
# Inspect structured Codex output
- |
if [ -s codex-output.json ]; then
jq '.' codex-output.json || true
else
echo "Codex output file missing"; exit 1
fi
# Publish inline comments to GitLab MR
- |
findings_count=$(jq '.findings | length' codex-output.json)
if [ "$findings_count" -eq 0 ]; then
echo "No findings from Codex; skipping comments."
exit 0
fi
jq -c \
--arg base "$BASE_SHA" \
--arg start "$BASE_SHA" \
--arg head "$HEAD_SHA" '
.findings[] | {
body: (.title + "\n\n" + .body + "\n\nConfidence: " + (.confidence_score | tostring)),
position: {
position_type: "text",
base_sha: $base,
start_sha: $start,
head_sha: $head,
new_path: (.code_location.relative_file_path // .code_location.absolute_file_path),
new_line: .code_location.line_range.end
}
}
' codex-output.json > findings.jsonl
while IFS= read -r payload; do
curl -sS --request POST \
--header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
--header "Content-Type: application/json" \
--data "$payload" \
"https://gitlab.com/api/v4/projects/${CI_PROJECT_ID}/merge_requests/${PR_NUMBER}/discussions"
done < findings.jsonl
# Publish overall summary comment
- |
overall_state=$(jq -r '.overall_correctness' codex-output.json)
overall_body=$(jq -r '.overall_explanation' codex-output.json)
confidence=$(jq -r '.overall_confidence_score' codex-output.json)
summary="**Codex automated review**\n\nVerdict: ${overall_state}\nConfidence: ${confidence}\n\n${overall_body}"
curl -sS --request POST \
--header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
--header "Content-Type: application/json" \
--data "$(jq -n --arg body "$summary" '{body: $body}')" \
"https://gitlab.com/api/v4/projects/${CI_PROJECT_ID}/merge_requests/${PR_NUMBER}/notes"
artifacts:
when: always
paths:
- codex-output.json
- codex-prompt.md
```
## Jenkins Example
We can use the same approach to script a job with Jenkins. Once again, comments highlight the key stages of the workflow:
```groovy
pipeline {
agent any
options {
timestamps()
ansiColor('xterm')
// Prevent overlapping runs on the same PR. Newer builds will cancel older ones after passing the milestone.
disableConcurrentBuilds()
}
environment {
// Default model, mirroring the GitHub Actions example (can be overridden at the job/env level)
CODEX_MODEL = "${env.CODEX_MODEL ?: 'o4-mini'}"
// Filled in during Init
PR_NUMBER = ''
HEAD_SHA = ''
BASE_SHA = ''
REPOSITORY = '' // org/repo
}
stages {
stage('Init (PR context, repo, SHAs)') {
steps {
checkout scm
// Compute PR context and SHAs similar to the GitHub Action
sh '''
set -euo pipefail
# Derive PR number from Jenkins env when building PRs via GitHub Branch Source
PR_NUMBER="${CHANGE_ID:-}"
if [ -z "$PR_NUMBER" ]; then
echo "Not a PR build (CHANGE_ID missing). Exiting."
exit 1
fi
echo "PR_NUMBER=$PR_NUMBER" >> $WORKSPACE/jenkins.env
# Discover owner/repo (normalize SSH/HTTPS forms)
ORIGIN_URL="$(git config --get remote.origin.url)"
if echo "$ORIGIN_URL" | grep -qE '^git@github.com:'; then
REPO_PATH="${ORIGIN_URL#git@github.com:}"
REPO_PATH="${REPO_PATH%.git}"
else
# e.g. https://github.com/owner/repo.git
REPO_PATH="${ORIGIN_URL#https://github.com/}"
REPO_PATH="${REPO_PATH%.git}"
fi
echo "REPOSITORY=$REPO_PATH" >> $WORKSPACE/jenkins.env
# Ensure we have all refs we need
git fetch --no-tags origin \
"+refs/heads/*:refs/remotes/origin/*" \
"+refs/pull/${PR_NUMBER}/head:refs/remotes/origin/PR-${PR_NUMBER}-head" \
"+refs/pull/${PR_NUMBER}/merge:refs/remotes/origin/PR-${PR_NUMBER}-merge"
# HEAD (PR head) and BASE (target branch tip)
CHANGE_TARGET="${CHANGE_TARGET:-main}"
HEAD_SHA="$(git rev-parse refs/remotes/origin/PR-${PR_NUMBER}-head)"
BASE_SHA="$(git rev-parse refs/remotes/origin/${CHANGE_TARGET})"
echo "HEAD_SHA=$HEAD_SHA" >> $WORKSPACE/jenkins.env
echo "BASE_SHA=$BASE_SHA" >> $WORKSPACE/jenkins.env
echo "Resolved:"
echo " REPOSITORY=$REPO_PATH"
echo " PR_NUMBER=$PR_NUMBER"
echo " CHANGE_TARGET=$CHANGE_TARGET"
echo " HEAD_SHA=$HEAD_SHA"
echo " BASE_SHA=$BASE_SHA"
'''
script {
def envMap = readProperties file: 'jenkins.env'
env.PR_NUMBER = envMap['PR_NUMBER']
env.REPOSITORY = envMap['REPOSITORY']
env.HEAD_SHA = envMap['HEAD_SHA']
env.BASE_SHA = envMap['BASE_SHA']
}
// Ensure only latest build for this PR proceeds; older in-flight builds will be aborted here
milestone 1
}
}
stage('Generate structured output schema') {
steps {
sh '''
set -euo pipefail
cat > codex-output-schema.json <<'JSON'
{
"type": "object",
"properties": {
"findings": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": { "type": "string", "maxLength": 80 },
"body": { "type": "string", "minLength": 1 },
"confidence_score": { "type": "number", "minimum": 0, "maximum": 1 },
"priority": { "type": "integer", "minimum": 0, "maximum": 3 },
"code_location": {
"type": "object",
"properties": {
"absolute_file_path": { "type": "string", "minLength": 1 },
"line_range": {
"type": "object",
"properties": {
"start": { "type": "integer", "minimum": 1 },
"end": { "type": "integer", "minimum": 1 }
},
"required": ["start","end"],
"additionalProperties": false
}
},
"required": ["absolute_file_path","line_range"],
"additionalProperties": false
}
},
"required": ["title","body","confidence_score","priority","code_location"],
"additionalProperties": false
}
},
"overall_correctness": { "type": "string", "enum": ["patch is correct","patch is incorrect"] },
"overall_explanation": { "type": "string", "minLength": 1 },
"overall_confidence_score": { "type": "number", "minimum": 0, "maximum": 1 }
},
"required": ["findings","overall_correctness","overall_explanation","overall_confidence_score"],
"additionalProperties": false
}
JSON
'''
}
}
stage('Build Codex review prompt') {
environment {
REVIEW_PROMPT_PATH = "${env.CODEX_PROMPT_PATH ?: 'review_prompt.md'}"
}
steps {
sh '''
set -euo pipefail
PROMPT_PATH="codex-prompt.md"
TEMPLATE_PATH="${REVIEW_PROMPT_PATH}"
if [ -n "$TEMPLATE_PATH" ] && [ -f "$TEMPLATE_PATH" ]; then
cat "$TEMPLATE_PATH" > "$PROMPT_PATH"
else
{
printf '%s\n' "You are acting as a reviewer for a proposed code change made by another engineer."
printf '%s\n' "Focus on issues that impact correctness, performance, security, maintainability, or developer experience."
printf '%s\n' "Flag only actionable issues introduced by the pull request."
printf '%s\n' "When you flag an issue, provide a short, direct explanation and cite the affected file and line range."
printf '%s\n' "Prioritize severe issues and avoid nit-level comments unless they block understanding of the diff."
printf '%s\n' "After listing findings, produce an overall correctness verdict (\\\"patch is correct\\\" or \\\"patch is incorrect\\\") with a concise justification and a confidence score between 0 and 1."
printf '%s\n' "Ensure that file citations and line numbers are exactly correct using the tools available; if they are incorrect your comments will be rejected."
} > "$PROMPT_PATH"
fi
{
echo ""
echo "Repository: ${REPOSITORY}"
echo "Pull Request #: ${PR_NUMBER}"
echo "Base ref: ${CHANGE_TARGET}"
echo "Head ref: ${CHANGE_BRANCH:-PR-${PR_NUMBER}-head}"
echo "Base SHA: ${BASE_SHA}"
echo "Head SHA: ${HEAD_SHA}"
echo "Changed files:"
git --no-pager diff --name-status "${BASE_SHA}" "${HEAD_SHA}"
echo ""
echo "Unified diff (context=5):"
git --no-pager diff --unified=5 --stat=200 "${BASE_SHA}" "${HEAD_SHA}" > /tmp/diffstat.txt
git --no-pager diff --unified=5 "${BASE_SHA}" "${HEAD_SHA}" > /tmp/full.diff
cat /tmp/diffstat.txt
echo ""
cat /tmp/full.diff
} >> "$PROMPT_PATH"
'''
}
}
stage('Run Codex structured review') {
environment {
REVIEW_PROMPT = 'codex-prompt.md'
REVIEW_SCHEMA = 'codex-output-schema.json'
REVIEW_OUTPUT = 'codex-output.json'
}
steps {
withCredentials([
string(credentialsId: 'openai-api-key', variable: 'OPENAI_API_KEY')
]) {
// Option A: If you have the OpenAI CLI installed on the Jenkins agent
sh '''
set -euo pipefail
if command -v openai >/dev/null 2>&1; then
# Use the Responses API with a JSON schema tool spec
# Produces codex-output.json with the structured result.
openai responses.create \
--model "${CODEX_MODEL}" \
--input-file "${REVIEW_PROMPT}" \
--response-format "json_object" \
--output-schema "${REVIEW_SCHEMA}" \
--tool-choice "auto" \
> raw_response.json || true
# Fallback if CLI doesn’t support your exact flags:
# Keep demo resilient: If raw_response.json is empty, create a minimal stub so later steps don’t fail.
if [ ! -s raw_response.json ]; then
echo '{"findings":[],"overall_correctness":"patch is correct","overall_explanation":"No issues detected.","overall_confidence_score":0.5}' > "${REVIEW_OUTPUT}"
else
# If your CLI/format returns a JSON object with the structured content in .output or similar, map it here.
# Adjust jq path to match your CLI output shape.
jq -r '.output // .' raw_response.json > "${REVIEW_OUTPUT}" || cp raw_response.json "${REVIEW_OUTPUT}"
fi
else
echo "openai CLI not found; creating a stub output for demo continuity."
echo '{"findings":[],"overall_correctness":"patch is correct","overall_explanation":"(CLI not available on agent)","overall_confidence_score":0.4}' > "${REVIEW_OUTPUT}"
fi
'''
}
}
}
stage('Inspect structured Codex output') {
steps {
sh '''
if [ -s codex-output.json ]; then
jq '.' codex-output.json || true
else
echo "Codex output file missing"
fi
'''
}
}
stage('Publish inline review comments') {
when { expression { true } }
steps {
withCredentials([string(credentialsId: 'github-token', variable: 'GITHUB_TOKEN')]) {
sh '''
set -euo pipefail
REVIEW_JSON="codex-output.json"
if [ ! -s "$REVIEW_JSON" ]; then
echo "No Codex output file present; skipping comment publishing."
exit 0
fi
findings_count=$(jq '.findings | length' "$REVIEW_JSON")
if [ "$findings_count" -eq 0 ]; then
echo "Codex returned no findings; skipping inline comments."
exit 0
fi
jq -c --arg commit "$HEAD_SHA" '.findings[] | {
body: (.title + "\\n\\n" + .body + "\\n\\nConfidence: " + (.confidence_score | tostring) + (if has("priority") then "\\nPriority: P" + (.priority | tostring) else "" end)),
commit_id: $commit,
path: .code_location.absolute_file_path,
line: .code_location.line_range.end,
side: "RIGHT",
start_line: (if .code_location.line_range.start != .code_location.line_range.end then .code_location.line_range.start else null end),
start_side: (if .code_location.line_range.start != .code_location.line_range.end then "RIGHT" else null end)
} | with_entries(select(.value != null))' "$REVIEW_JSON" > findings.jsonl
while IFS= read -r payload; do
echo "Posting review comment payload:" && echo "$payload" | jq '.'
curl -sS \
-X POST \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer ${GITHUB_TOKEN}" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"https://api.github.com/repos/${REPOSITORY}/pulls/${PR_NUMBER}/comments" \
-d "$payload"
done < findings.jsonl
'''
}
}
}
stage('Publish overall summary comment') {
steps {
withCredentials([string(credentialsId: 'github-token', variable: 'GITHUB_TOKEN')]) {
sh '''
set -euo pipefail
REVIEW_JSON="codex-output.json"
if [ ! -s "$REVIEW_JSON" ]; then
echo "Codex output missing; skipping summary."
exit 0
fi
overall_state=$(jq -r '.overall_correctness' "$REVIEW_JSON")
overall_body=$(jq -r '.overall_explanation' "$REVIEW_JSON")
confidence=$(jq -r '.overall_confidence_score' "$REVIEW_JSON")
msg="**Codex automated review**\\n\\nVerdict: ${overall_state}\\nConfidence: ${confidence}\\n\\n${overall_body}"
jq -n --arg body "$msg" '{body: $body}' > /tmp/summary.json
curl -sS \
-X POST \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer ${GITHUB_TOKEN}" \
-H "X-GitHub-Api-Version: 2022-11-28" \
"https://api.github.com/repos/${REPOSITORY}/issues/${PR_NUMBER}/comments" \
-d @/tmp/summary.json
'''
}
}
}
}
post {
always {
archiveArtifacts artifacts: 'codex-*.json, *.md, /tmp/diff*.txt', allowEmptyArchive: true
}
}
}
```
## Wrap Up
With the Codex SDK, you can build your own GitHub Code Review in on-prem environments. However, the pattern of triggering Codex with a prompt, receiving a structured output, and then acting on that output with an API call extends far beyond Code Review. For example, we could use this pattern to trigger a root-cause analysis when an incident is created and post a structured report into a Slack channel. Or we could create a code quality report on each PR and post results into a dashboard.
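As a rough illustration of that broader pattern, the sketch below prompts Codex headless with an incident description, expects a structured report back, and forwards it to a Slack incoming webhook. The webhook URL, the `rca-output-schema.json` schema file, and the report field names are illustrative assumptions, not part of the Codex SDK.
```python
# Sketch of the generalized pattern: prompt Codex headless with a schema,
# then act on the structured output -- here, posting a report to Slack.
import json
import subprocess
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook

def run_codex_rca(incident_summary: str) -> dict:
    # rca-output-schema.json is a hypothetical schema describing the report fields.
    subprocess.run(
        [
            "codex", "exec",
            "--output-schema", "rca-output-schema.json",
            "--output-last-message", "rca-output.json",
            "--sandbox", "read-only",
            f"Investigate this incident and produce a root-cause report:\n{incident_summary}",
        ],
        check=True,
    )
    with open("rca-output.json") as f:
        return json.load(f)

def post_to_slack(report: dict) -> None:
    # root_cause / suggested_fix are assumed fields from the hypothetical schema.
    text = (
        f"*Root cause:* {report.get('root_cause', 'unknown')}\n"
        f"*Suggested fix:* {report.get('suggested_fix', 'n/a')}"
    )
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    post_to_slack(run_codex_rca("Checkout latency spiked after the 14:00 deploy."))
```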
---
# Source: https://developers.openai.com/resources/guide/building-agents-guide.md
# Building agents guide
> Official guide to building agents using the OpenAI platform.
- Type: Guide
- Tags: agents
- URL: https://platform.openai.com/docs/guides/agents
- Created: 2025-07-18
- Updated: 2025-08-13
## Summary
This guide describes how to create and manage agents. — Agents SDK, agentic, tool calling
## Details
Walks through agent architecture and best practices.
---
# Source: https://developers.openai.com/resources/video/building-with-open-models-video.md
# Building with Open Models
> Talk covering how developers customize and deploy OpenAI’s open models.
- Type: Video
- Tags: fine-tuning
- URL: https://www.youtube.com/watch?v=1HL2YHRj270
- Created: 2025-10-22
- Updated: 2025-10-22
## Summary
Explains strategies for adapting open models to specific products and workflows. — open models, customization
## Details
Walks through real examples of preparing data, fine-tuning, and evaluating open models so they can power production-ready experiences.
---
# Source: https://developers.openai.com/cookbook/examples/codex/codex_mcp_agents_sdk/building_consistent_workflows_codex_cli_agents_sdk.md
# Building Consistent Workflows with Codex CLI & Agents SDK
### Ensuring Repeatable, Traceable, and Scalable Agentic Development
## Introduction
Developers strive for consistency in everything they do. With Codex CLI and the Agents SDK, that consistency can now scale like never before. Whether you’re refactoring a large codebase, rolling out new features, or introducing a new testing framework, Codex integrates seamlessly into CLI, IDE, and cloud workflows to automate and enforce repeatable development patterns.
In this track, we’ll build both single and multi-agent systems using the Agents SDK, with Codex CLI exposed as an MCP Server. This enables:
- **Consistency and Repeatability** by providing each agent a scoped context.
- **Scalable Orchestration** to coordinate single and multi-agent systems.
- **Observability & Auditability** by reviewing the full agentic stack trace.
## What We’ll Cover
- Initializing Codex CLI as an MCP Server: How to run Codex as a long-running MCP process.
- Building Single-Agent Systems: Using Codex MCP for scoped tasks.
- Orchestrating Multi-Agent Workflows: Coordinating multiple specialized agents.
- Tracing Agentic Behavior: Leveraging agent traces for visibility and evaluation.
## Prerequisites & Setup
Before starting this track, ensure you have the following:
- Basic coding familiarity: You should be comfortable with Python and JavaScript.
- Developer environment: You’ll need an IDE, like VS Code or Cursor.
- OpenAI API key: Create or find your API key in the OpenAI Dashboard.
## Environment Setup
1. Create a `.env` file in your project directory and add your `OPENAI_API_KEY`
2. Install dependencies
```python
%pip install openai-agents openai ## install dependencies
```
## Initializing Codex CLI as an MCP Server
Here we run Codex CLI as an MCP server inside the Agents SDK by launching the `codex mcp-server` command. This starts Codex CLI as a long-running MCP server and exposes two Codex tools — `codex()` and `codex-reply()`. These are the underlying tools that the Agents SDK calls when it needs to invoke Codex.
- `codex()` is used for creating a conversation.
- `codex-reply()` is for continuing a conversation.
```python
import asyncio
from agents import Agent, Runner
from agents.mcp import MCPServerStdio
async def main() -> None:
async with MCPServerStdio(
name="Codex CLI",
params={
"command": "npx",
"args": ["-y", "codex", "mcp-server"],
},
client_session_timeout_seconds=360000,
) as codex_mcp_server:
print("Codex MCP server started.")
# We will add more code here in the next section
return
```
Also note that we are extending the MCP Server timeout to allow Codex CLI enough time to execute and complete the given task.
---
## Building Single Agent Systems
Let’s start with a simple example to use our Codex MCP Server. We define two agents:
1. **Designer Agent** – brainstorms and creates a small brief for a game.
2. **Developer Agent** – implements a simple game according to the Designer’s spec.
```python
developer_agent = Agent(
name="Game Developer",
instructions=(
"You are an expert in building simple games using basic html + css + javascript with no dependencies. "
"Save your work in a file called index.html in the current directory."
"Always call codex with \"approval-policy\": \"never\" and \"sandbox\": \"workspace-write\""
),
mcp_servers=[codex_mcp_server],
)
designer_agent = Agent(
name="Game Designer",
instructions=(
"You are an indie game connoisseur. Come up with an idea for a single page html + css + javascript game that a developer could build in about 50 lines of code. "
"Format your request as a 3 sentence design brief for a game developer and call the Game Developer coder with your idea."
),
model="gpt-5",
handoffs=[developer_agent],
)
result = await Runner.run(designer_agent, "Implement a fun new game!")
```
Notice that we are providing the Developer agent with the ability to write files to the project directory without asking the user for permissions.
Now run the code and you’ll see an `index.html` file generated. Go ahead and open the file and start playing the game!
Here are a few screenshots of the game my agentic system created, showing example gameplay and the game-over score screen. Yours will be different!
Here's the full executable code. Note that it might take a few minutes to run. It will have run successfully if you see an `index.html` file produced. You might also see some MCP event warnings about formatting; you can safely ignore them.
```python
import os
from dotenv import load_dotenv
import asyncio
from agents import Agent, Runner, set_default_openai_key
from agents.mcp import MCPServerStdio
load_dotenv(override=True) # load the API key from the .env file. We set override to True here to ensure the notebook is loading any changes
set_default_openai_key(os.getenv("OPENAI_API_KEY"))
async def main() -> None:
async with MCPServerStdio(
name="Codex CLI",
params={
"command": "npx",
"args": ["-y", "codex", "mcp-server"],
},
client_session_timeout_seconds=360000,
) as codex_mcp_server:
developer_agent = Agent(
name="Game Developer",
instructions=(
"You are an expert in building simple games using basic html + css + javascript with no dependencies. "
"Save your work in a file called index.html in the current directory."
"Always call codex with \"approval-policy\": \"never\" and \"sandbox\": \"workspace-write\""
),
mcp_servers=[codex_mcp_server],
)
designer_agent = Agent(
name="Game Designer",
instructions=(
"You are an indie game connoisseur. Come up with an idea for a single page html + css + javascript game that a developer could build in about 50 lines of code. "
"Format your request as a 3 sentence design brief for a game developer and call the Game Developer coder with your idea."
),
model="gpt-5",
handoffs=[developer_agent],
)
result = await Runner.run(designer_agent, "Implement a fun new game!")
# print(result.final_output)
if __name__ == "__main__":
# Jupyter/IPython already runs an event loop, so calling asyncio.run() here
# raises "asyncio.run() cannot be called from a running event loop".
# Workaround: if a loop is running (notebook), use top-level `await`; otherwise use asyncio.run().
try:
asyncio.get_running_loop()
await main()
except RuntimeError:
asyncio.run(main())
```
---
## Orchestrating Multi-Agent Workflows
For larger workflows, we introduce a team of agents:
- **Project Manager**: Breaks down the task list, creates requirements, and coordinates work.
- **Designer**: Produces UI/UX specifications.
- **Frontend Developer**: Implements UI/UX.
- **Backend Developer**: Implements APIs and logic.
- **Tester**: Validates outputs against acceptance criteria.
In this example, we intentionally have the Project Manager agent enforce gating logic between each of the specialized downstream agents. This ensures that artifacts exist before handoffs are made. This mirrors real-world enterprise workflows such as Jira task orchestration, long-chained rollouts, and QA sign-offs.
Multi-agent orchestration with Codex MCP and gated handoffs producing artifacts.
In this structure, each of our agents serves a specialized purpose. The Project Manager is responsible for coordinating across all the other agents and ensuring the overall task is complete.
## Define the Codex CLI MCP Server
We set up our MCP Server to initialize Codex CLI just as we did in the single agent example.
```python
async def main() -> None:
async with MCPServerStdio(
name="Codex CLI",
params={
"command": "npx",
"args": ["-y", "codex", "mcp-server"],
},
client_session_timeout_seconds=360000,
) as codex_mcp_server:
print("Codex MCP server started.")
# We will add more code here in the next section
return
```
## Define each specialized agent
Below we define each of our specialized agents and provide access to our Codex MCP server. Notice that we are also passing the `RECOMMENDED_PROMPT_PREFIX` to each agent, which helps the system optimize handoffs between agents.
```python
# Downstream agents are defined first for clarity, then PM references them in handoffs.
designer_agent = Agent(
name="Designer",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Designer.\n"
"Your only source of truth is AGENT_TASKS.md and REQUIREMENTS.md from the Project Manager.\n"
"Do not assume anything that is not written there.\n\n"
"You may use the internet for additional guidance or research."
"Deliverables (write to /design):\n"
"- design_spec.md – a single page describing the UI/UX layout, main screens, and key visual notes as requested in AGENT_TASKS.md.\n"
"- wireframe.md – a simple text or ASCII wireframe if specified.\n\n"
"Keep the output short and implementation-friendly.\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
tools=[WebSearchTool()],
mcp_servers=[codex_mcp_server],
handoffs=[],
)
frontend_developer_agent = Agent(
name="Frontend Developer",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Frontend Developer.\n"
"Read AGENT_TASKS.md and design_spec.md. Implement exactly what is described there.\n\n"
"Deliverables (write to /frontend):\n"
"- index.html – main page structure\n"
"- styles.css or inline styles if specified\n"
"- main.js or game.js if specified\n\n"
"Follow the Designer’s DOM structure and any integration points given by the Project Manager.\n"
"Do not add features or branding beyond the provided documents.\n\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager_agent."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
mcp_servers=[codex_mcp_server],
handoffs=[],
)
backend_developer_agent = Agent(
name="Backend Developer",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Backend Developer.\n"
"Read AGENT_TASKS.md and REQUIREMENTS.md. Implement the backend endpoints described there.\n\n"
"Deliverables (write to /backend):\n"
"- package.json – include a start script if requested\n"
"- server.js – implement the API endpoints and logic exactly as specified\n\n"
"Keep the code as simple and readable as possible. No external database.\n\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager_agent."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
mcp_servers=[codex_mcp_server],
handoffs=[],
)
tester_agent = Agent(
name="Tester",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Tester.\n"
"Read AGENT_TASKS.md and TEST.md. Verify that the outputs of the other roles meet the acceptance criteria.\n\n"
"Deliverables (write to /tests):\n"
"- TEST_PLAN.md – bullet list of manual checks or automated steps as requested\n"
"- test.sh or a simple automated script if specified\n\n"
"Keep it minimal and easy to run.\n\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
mcp_servers=[codex_mcp_server],
handoffs=[],
)
```
After each role completes its assignment, it calls `transfer_to_project_manager_agent` and lets the Project Manager confirm that the required files exist (or request fixes) before unblocking the next team.
## Define Project Manager Agent
The Project Manager is the only agent that receives the initial prompt, creates the planning documents in the project directory, and enforces the gatekeeping logic before every transfer.
```python
project_manager_agent = Agent(
name="Project Manager",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"""
You are the Project Manager.
Objective:
Convert the input task list into three project-root files the team will execute against.
Deliverables (write in project root):
- REQUIREMENTS.md: concise summary of product goals, target users, key features, and constraints.
- TEST.md: tasks with [Owner] tags (Designer, Frontend, Backend, Tester) and clear acceptance criteria.
- AGENT_TASKS.md: one section per role containing:
- Project name
- Required deliverables (exact file names and purpose)
- Key technical notes and constraints
Process:
- Resolve ambiguities with minimal, reasonable assumptions. Be specific so each role can act without guessing.
- Create files using Codex MCP with {"approval-policy":"never","sandbox":"workspace-write"}.
- Do not create folders. Only create REQUIREMENTS.md, TEST.md, AGENT_TASKS.md.
Handoffs (gated by required files):
1) After the three files above are created, hand off to the Designer with transfer_to_designer_agent and include REQUIREMENTS.md, and AGENT_TASKS.md.
2) Wait for the Designer to produce /design/design_spec.md. Verify that file exists before proceeding.
3) When design_spec.md exists, hand off in parallel to both:
- Frontend Developer with transfer_to_frontend_developer_agent (provide design_spec.md, REQUIREMENTS.md, AGENT_TASKS.md).
- Backend Developer with transfer_to_backend_developer_agent (provide REQUIREMENTS.md, AGENT_TASKS.md).
4) Wait for Frontend to produce /frontend/index.html and Backend to produce /backend/server.js. Verify both files exist.
5) When both exist, hand off to the Tester with transfer_to_tester_agent and provide all prior artifacts and outputs.
6) Do not advance to the next handoff until the required files for that step are present. If something is missing, request the owning agent to supply it and re-check.
PM Responsibilities:
- Coordinate all roles, track file completion, and enforce the above gating checks.
- Do NOT respond with status updates. Just handoff to the next agent until the project is complete.
"""
),
model="gpt-5",
model_settings=ModelSettings(
reasoning=Reasoning(effort="medium")
),
handoffs=[designer_agent, frontend_developer_agent, backend_developer_agent, tester_agent],
mcp_servers=[codex_mcp_server],
)
```
After constructing the Project Manager, the script sets every specialist's handoffs back to the Project Manager. This ensures deliverables return for validation before moving on.
```python
designer_agent.handoffs = [project_manager_agent]
frontend_developer_agent.handoffs = [project_manager_agent]
backend_developer_agent.handoffs = [project_manager_agent]
tester_agent.handoffs = [project_manager_agent]
```
## Add in your task list
This is the task that the Project Manager will refine into specific requirements and tasks for the entire system.
```python
task_list = """
Goal: Build a tiny browser game to showcase a multi-agent workflow.
High-level requirements:
- Single-screen game called "Bug Busters".
- Player clicks a moving bug to earn points.
- Game ends after 20 seconds and shows final score.
- Optional: submit score to a simple backend and display a top-10 leaderboard.
Roles:
- Designer: create a one-page UI/UX spec and basic wireframe.
- Frontend Developer: implement the page and game logic.
- Backend Developer: implement a minimal API (GET /health, GET/POST /scores).
- Tester: write a quick test plan and a simple script to verify core routes.
Constraints:
- No external database—memory storage is fine.
- Keep everything readable for beginners; no frameworks required.
- All outputs should be small files saved in clearly named folders.
"""
```
Next, run your system, sit back, and watch the agents go to work and create a game. We've included the fully executable code below. Once it finishes, you'll see the following directory structure. Note that this multi-agent orchestration usually takes about 11 minutes to fully complete.
```markdown
root_directory/
├── AGENT_TASKS.md
├── REQUIREMENTS.md
├── backend
│ ├── package.json
│ └── server.js
├── design
│ ├── design_spec.md
│ └── wireframe.md
├── frontend
│ ├── game.js
│ ├── index.html
│ └── styles.css
└── TEST.md
```
Start your backend server with `node server.js` and open your `index.html` file to play your game.
```python
import os
from dotenv import load_dotenv
import asyncio
from agents import Agent, Runner, WebSearchTool, ModelSettings, set_default_openai_key
from agents.mcp import MCPServerStdio
from agents.extensions.handoff_prompt import RECOMMENDED_PROMPT_PREFIX
from openai.types.shared import Reasoning
load_dotenv(override=True) # load the API key from the .env file. We set override to True here to ensure the notebook is loading any changes
set_default_openai_key(os.getenv("OPENAI_API_KEY"))
async def main() -> None:
async with MCPServerStdio(
name="Codex CLI",
params={"command": "npx", "args": ["-y", "codex", "mcp-server"]},
client_session_timeout_seconds=360000,
) as codex_mcp_server:
# Downstream agents are defined first for clarity, then PM references them in handoffs.
designer_agent = Agent(
name="Designer",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Designer.\n"
"Your only source of truth is AGENT_TASKS.md and REQUIREMENTS.md from the Project Manager.\n"
"Do not assume anything that is not written there.\n\n"
"You may use the internet for additional guidance or research."
"Deliverables (write to /design):\n"
"- design_spec.md – a single page describing the UI/UX layout, main screens, and key visual notes as requested in AGENT_TASKS.md.\n"
"- wireframe.md – a simple text or ASCII wireframe if specified.\n\n"
"Keep the output short and implementation-friendly.\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
tools=[WebSearchTool()],
mcp_servers=[codex_mcp_server],
handoffs=[],
)
frontend_developer_agent = Agent(
name="Frontend Developer",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Frontend Developer.\n"
"Read AGENT_TASKS.md and design_spec.md. Implement exactly what is described there.\n\n"
"Deliverables (write to /frontend):\n"
"- index.html – main page structure\n"
"- styles.css or inline styles if specified\n"
"- main.js or game.js if specified\n\n"
"Follow the Designer’s DOM structure and any integration points given by the Project Manager.\n"
"Do not add features or branding beyond the provided documents.\n\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager_agent."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
mcp_servers=[codex_mcp_server],
handoffs=[],
)
backend_developer_agent = Agent(
name="Backend Developer",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Backend Developer.\n"
"Read AGENT_TASKS.md and REQUIREMENTS.md. Implement the backend endpoints described there.\n\n"
"Deliverables (write to /backend):\n"
"- package.json – include a start script if requested\n"
"- server.js – implement the API endpoints and logic exactly as specified\n\n"
"Keep the code as simple and readable as possible. No external database.\n\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager_agent."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
mcp_servers=[codex_mcp_server],
handoffs=[],
)
tester_agent = Agent(
name="Tester",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"You are the Tester.\n"
"Read AGENT_TASKS.md and TEST.md. Verify that the outputs of the other roles meet the acceptance criteria.\n\n"
"Deliverables (write to /tests):\n"
"- TEST_PLAN.md – bullet list of manual checks or automated steps as requested\n"
"- test.sh or a simple automated script if specified\n\n"
"Keep it minimal and easy to run.\n\n"
"When complete, handoff to the Project Manager with transfer_to_project_manager."
"When creating files, call Codex MCP with {\"approval-policy\":\"never\",\"sandbox\":\"workspace-write\"}."
),
model="gpt-5",
mcp_servers=[codex_mcp_server],
handoffs=[],
)
project_manager_agent = Agent(
name="Project Manager",
instructions=(
f"""{RECOMMENDED_PROMPT_PREFIX}"""
"""
You are the Project Manager.
Objective:
Convert the input task list into three project-root files the team will execute against.
Deliverables (write in project root):
- REQUIREMENTS.md: concise summary of product goals, target users, key features, and constraints.
- TEST.md: tasks with [Owner] tags (Designer, Frontend, Backend, Tester) and clear acceptance criteria.
- AGENT_TASKS.md: one section per role containing:
- Project name
- Required deliverables (exact file names and purpose)
- Key technical notes and constraints
Process:
- Resolve ambiguities with minimal, reasonable assumptions. Be specific so each role can act without guessing.
- Create files using Codex MCP with {"approval-policy":"never","sandbox":"workspace-write"}.
- Do not create folders. Only create REQUIREMENTS.md, TEST.md, AGENT_TASKS.md.
Handoffs (gated by required files):
1) After the three files above are created, hand off to the Designer with transfer_to_designer_agent and include REQUIREMENTS.md, and AGENT_TASKS.md.
2) Wait for the Designer to produce /design/design_spec.md. Verify that file exists before proceeding.
3) When design_spec.md exists, hand off in parallel to both:
- Frontend Developer with transfer_to_frontend_developer_agent (provide design_spec.md, REQUIREMENTS.md, AGENT_TASKS.md).
- Backend Developer with transfer_to_backend_developer_agent (provide REQUIREMENTS.md, AGENT_TASKS.md).
4) Wait for Frontend to produce /frontend/index.html and Backend to produce /backend/server.js. Verify both files exist.
5) When both exist, hand off to the Tester with transfer_to_tester_agent and provide all prior artifacts and outputs.
6) Do not advance to the next handoff until the required files for that step are present. If something is missing, request the owning agent to supply it and re-check.
PM Responsibilities:
- Coordinate all roles, track file completion, and enforce the above gating checks.
- Do NOT respond with status updates. Just handoff to the next agent until the project is complete.
"""
),
model="gpt-5",
model_settings=ModelSettings(
reasoning=Reasoning(effort="medium")
),
handoffs=[designer_agent, frontend_developer_agent, backend_developer_agent, tester_agent],
mcp_servers=[codex_mcp_server],
)
designer_agent.handoffs = [project_manager_agent]
frontend_developer_agent.handoffs = [project_manager_agent]
backend_developer_agent.handoffs = [project_manager_agent]
tester_agent.handoffs = [project_manager_agent]
# Example task list input for the Project Manager
task_list = """
Goal: Build a tiny browser game to showcase a multi-agent workflow.
High-level requirements:
- Single-screen game called "Bug Busters".
- Player clicks a moving bug to earn points.
- Game ends after 20 seconds and shows final score.
- Optional: submit score to a simple backend and display a top-10 leaderboard.
Roles:
- Designer: create a one-page UI/UX spec and basic wireframe.
- Frontend Developer: implement the page and game logic.
- Backend Developer: implement a minimal API (GET /health, GET/POST /scores).
- Tester: write a quick test plan and a simple script to verify core routes.
Constraints:
- No external database—memory storage is fine.
- Keep everything readable for beginners; no frameworks required.
- All outputs should be small files saved in clearly named folders.
"""
# Only the Project Manager receives the task list directly
result = await Runner.run(project_manager_agent, task_list, max_turns=30)
print(result.final_output)
if __name__ == "__main__":
# Jupyter/IPython already runs an event loop, so calling asyncio.run() here
# raises "asyncio.run() cannot be called from a running event loop".
# Workaround: if a loop is running (notebook), use top-level `await`; otherwise use asyncio.run().
try:
asyncio.get_running_loop()
await main()
except RuntimeError:
asyncio.run(main())
```
---
## Tracing the agentic behavior using Traces
As the complexity of your agentic systems grows, it's important to see how these agents interact. We can do this with the Traces dashboard, which records:
- Prompts, tool calls, and handoffs between agents.
- MCP Server calls, Codex CLI calls, execution times, and file writes.
- Errors and warnings.
Let’s take a look at the agent trace for the team of agents above.
In this trace, we can confirm that every agent handoff is quarterbacked by our Project Manager agent, which verifies that specific artifacts exist before handing off to the next agent. We can also see the individual invocations of the Codex MCP server and the Responses API calls that generate each output. The timeline bars highlight execution durations, making it easy to spot long-running steps and understand how control passes between agents.
You can even click into each trace to see the specific details of the prompt, tool calls, and other metadata. Over time, you can use this information to further tune, optimize, and track your agentic system's performance.
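The Agents SDK also lets you group an entire run under a named trace so the whole workflow shows up as one entry in the dashboard. Here is a minimal sketch, assuming the `project_manager_agent` and `task_list` defined above and the `trace` helper exported by the Agents SDK:
```python
# Minimal sketch: wrap the multi-agent run in a named trace so the whole
# workflow appears as a single grouped entry in the Traces dashboard.
# Assumes project_manager_agent and task_list are defined as earlier in this guide.
from agents import Runner, trace

async def run_traced_workflow() -> None:
    with trace("Bug Busters multi-agent build"):
        result = await Runner.run(project_manager_agent, task_list, max_turns=30)
        print(result.final_output)
```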
---
## Recap of What We Did in This Guide
In this guide, we walked through the process of building consistent, scalable workflows using Codex CLI and the Agents SDK. Specifically, we covered:
- **Codex MCP Server Setup** – How to initialize Codex CLI as an MCP server and make it available as tools for agent interactions.
- **Single-Agent Example** – A simple workflow with a Designer Agent and a Developer Agent, where Codex executed scoped tasks deterministically to produce a playable game.
- **Multi-Agent Orchestration** – Expanding to a larger workflow with a Project Manager, Designer, Frontend Developer, Backend Developer, and Tester, mirroring complex task orchestration and sign-off processes.
- **Traces & Observability** – Using built-in Traces to capture prompts, tool calls, handoffs, execution times, and artifacts, giving full visibility into agentic behavior for debugging, evaluation, and future optimization.
---
## Moving Forward: Applying These Lessons
Now that you’ve seen Codex MCP and the Agents SDK in action, here’s how you can apply the concepts in real projects and extract value:
### 1. Scale to Real-World Rollouts
- Apply the same multi-agent orchestration to large code refactors (e.g., 500+ files, framework migrations).
- Use Codex MCP’s deterministic execution for long-running, auditable rollouts with traceable progress.
### 2. Accelerate Delivery Without Losing Control
- Organize teams of specialized agents to parallelize development, while maintaining gating logic for artifact validation.
- Reduce turnaround time for new features, testing, or codebase modernization.
### 3. Extend and Connect to Your Development Workflows
- Connect MCP-powered agents with Jira, GitHub, or CI/CD pipelines via webhooks for automated, repeatable development cycles.
- Leverage Codex MCP in multi-agent service orchestration: not just codegen, but also documentation, QA, and deployment.
---
# Source: https://developers.openai.com/cookbook/examples/evaluation/building_resilient_prompts_using_an_evaluation_flywheel.md
## Overview
### Purpose of this cookbook
This cookbook provides a practical guide on how to use the OpenAI Platform to easily build resilience into your prompts.
> A **resilient prompt** is one that provides high-quality responses across the full breadth of possible inputs.
Prompt resilience is an essential piece of deploying AI applications in production. Without this property, your prompts can produce unexpected results on edge cases, provide subpar responses in normal cases, and undermine the effectiveness of your AI application.
To build resilience into your prompts, we recommend the **evaluation flywheel** process — a methodology that enables builders to continuously refine their AI applications over time in a measurable way.
### Target audience
This cookbook is designed for subject-matter experts, solutions architects, data scientists, and AI engineers who are looking to improve the general consistency and quality of their prompts, or address specific edge cases in their AI applications.
## The evaluation flywheel
AI applications often feel brittle. A prompt that works well one day can produce unexpected and low-quality results the next. This happens because prompts can be sensitive to small changes in user input or context. To build reliable AI products, we need a systematic way to make prompts more resilient.
The solution is a continuous, iterative process called the **evaluation flywheel**. Instead of guessing what might improve a prompt ("prompt-and-pray"), this lifecycle provides a structured engineering discipline to diagnose, measure, and solve problems.
The flywheel consists of three phases:
1. **Analyze**:
Understand how and why your system is failing through qualitative review. Manually examine and annotate examples where the model behaves incorrectly to identify recurring failure modes.
2. **Measure**:
Quantify the identified failure modes and set a baseline. You can’t improve what you can’t measure. Create a test dataset and build automated evaluators (“graders”) to score your system’s performance at scale.
3. **Improve**:
Make targeted improvements such as rewriting prompts, adding better examples, or adjusting system components. With measurement in place, you can immediately see the impact of changes and iterate until failure rates are acceptably low.
This is a continuous cycle. As you improve the system, new, subtler failure modes emerge — and the flywheel begins again. This process is the core methodology for building robust and reliable AI applications.

> **Source:** Shankar, S., & Husain, H. (2025). *Application-Centric AI Evals for Engineers and Technical Product Managers*. AI Evals Course Reader.
## An Example
To illustrate the evaluation process, let’s use data from an **apartment leasing assistant** in production.
It answers questions from prospective renters, such as:
* “How large are the apartments?”
* “When can I come in for a tour?”
Suppose we have a specific prompt within our application that we’d like to analyze. We can get started in the OpenAI Platform by adding in our prompt and uploading our input and output data to our Dataset (learn more about how to do this in [our docs](https://platform.openai.com/docs/guides/evaluations-getting-started)).

With our prompt and traces loaded in, we’re ready to analyze prompt effectiveness.
## Analyzing prompt effectiveness
To improve a system, you must first understand how it fails. While automated metrics are useful for tracking progress, they cannot reveal *why* a failure occurred. Manual analysis of model outputs is the most effective way to diagnose issues and gain insights for targeted improvements.
The core of this analysis is **annotation** — applying structured labels to text to categorize and understand failure modes. This turns unstructured failures into an actionable roadmap for improvement. We recommend a two-step method drawn from qualitative research: open coding and axial coding.
### 1. Open Coding: Discovering failure modes
The first step is to read through a sample of failing traces (we recommend starting with around 50) and apply descriptive labels to each error you find. In this phase, do not worry about creating a perfect, structured taxonomy. The goal is discovery.
On the OpenAI Platform, you can use annotation columns to open code your dataset. Here, we add a **Feedback**-type annotation column titled `open_coding` to capture our results.

For our apartment leasing assistant, our initial open codes might look like this:
* “bot suggested a tour time that wasn't available”
* “the list of amenities was a single block of text”
* “failed to cancel the original appointment when rescheduling”
* “the link to the floorplan was broken”
These specific, grounded-in-data labels become the raw material for the next step.

Here's our dataset after open coding.
### 2. Axial Coding: Structuring your insights
Once you have a set of open codes, the next step is to group them into higher-level categories. This is axial coding—the process of identifying relationships between your initial labels to build a structured understanding of the core problems.
We can group our open codes into predefined axial codes:
* **Tour scheduling/rescheduling issue:**
* Bot suggested a tour time that wasn't available
* Failed to cancel the original appointment when rescheduling
* **Formatting error with output:**
* The list of amenities was a single block of text
* The link to the floorplan was broken
We will add a new **Label**-type annotation column titled `axial_coding` to our dataset to capture this.

This simple taxonomy gives us a clear, quantitative picture of our system's primary weaknesses. We might discover that 35% of failures are related to tour scheduling, while only 10% are formatting errors. This tells us exactly where to focus our improvement efforts. For more information on how to conduct error analysis, see [this walkthrough](https://youtu.be/qH1dZ8JLLdU?si=Sxczt-LpKVVnMEdG).
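Once the `axial_coding` column is filled in, producing this kind of breakdown takes only a few lines. Here's a minimal sketch; the annotation values are hypothetical stand-ins for an exported copy of the dataset:
```python
from collections import Counter

# Hypothetical axial-coding annotations exported from the dataset.
axial_codes = [
    "tour scheduling/rescheduling issue",
    "formatting error with output",
    "tour scheduling/rescheduling issue",
    "other",
]

counts = Counter(axial_codes)
total = len(axial_codes)
for label, n in counts.most_common():
    print(f"{label}: {n}/{total} ({n / total:.0%})")
```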
## Adding robustness with automatic graders
Armed with our taxonomy and dataset, we’re now ready to start automating the evaluation flywheel. The OpenAI Platform supports [a variety of grader types](https://platform.openai.com/docs/guides/graders) (including Python graders and LLM graders) that can be run in bulk on our dataset (learn more [here](https://platform.openai.com/docs/guides/evaluation-getting-started#adding-graders)). For this example, we can build and run LLM graders for the following:
* **Formatting grader:** assesses whether the model's response matches the desired format
* **Availability accuracy grader:** compares the availability returned by the model to a ground truth value you specify in your dataset
Our formatting grader is a fairly straightforward directive.

Our availability accuracy grader will reference additional input columns we’ve added to our dataset to capture business hours and day availability.


With automated graders in place, we can easily evaluate our performance on any change to our system — an updated prompt, updated model parameters, or newly discovered edge cases.
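Graders can also be defined in code rather than through the UI. As a rough sketch, the formatting grader might look like the following `label_model` grader; the labels, model choice, and template variable are illustrative and depend on how your dataset columns are named:
```python
formatting_grader = {
    "name": "Formatting grader",
    "type": "label_model",
    "model": "gpt-4o-mini",
    "input": [
        {
            "role": "developer",
            "content": (
                "Label the response 'well_formatted' if it matches the requested "
                "format (for example, amenities as a bulleted list); otherwise "
                "label it 'poorly_formatted'."
            ),
        },
        # Assumes the dataset has an `output` column holding the model response.
        {"role": "user", "content": "{{item.output}}"},
    ],
    "labels": ["well_formatted", "poorly_formatted"],
    "passing_labels": ["well_formatted"],
}
```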
For more detail on how to get graders right, see our section on “Aligning your LLM judge” below.
## Optimizing the prompt
We’ve now identified and classified our errors, and built out grading to automate our flywheel. At this stage, we could choose to use our data to inform manual changes to our prompt. However, the OpenAI Platform supports an automatic [prompt optimization tool](https://platform.openai.com/docs/guides/prompt-optimizer) that speeds up this process.
The prompt optimizer takes our generated output, our custom annotation columns, and our graders into consideration to construct an improved prompt. We’ve constructed a fairly small example here, but with a full-fledged dataset (say, with the 50 rows we recommended earlier), the optimizer will produce a new prompt that solves many of our identified errors.
We may find ourselves wanting to iterate further, by re-annotating new model outputs, adding or refining graders, and re-optimizing. Graders and annotation column specifications are preserved across tabs, so we can continue to create additional prompt versions in new tabs as we work. The tabs also allow us to compare performance across different models, so we can use our graders to measure which model parameter configuration performs best.
This process enables us to improve our prompt over time, proactively responding to new errors or new model releases.
## Advanced techniques
### Expanding datasets with synthetic data
The core evaluation flywheel is your primary tool for improving your system. However, there are times when you may need more test data than you can gather from production logs. Synthetic data generation is a powerful, additional technique for these situations. It is particularly useful if you want to more extensively explore a specific failure mode, if you haven't shipped your product yet and need initial data, or if you have a hypothesis about a weakness but lack real-world examples to validate it.
Simply asking an LLM to "generate N examples" often produces a homogenous set of test cases. A more structured approach is to define key dimensions of a query and generate data across combinations of them, forming tuples. This ensures greater diversity and coverage in your test set.
For our leasing assistant, you could define dimensions such as:
* **Channel:** Voice, Chat, Text
* **Intent:** Tour Scheduling, Maintenance, General Info & Inquiries
* **Persona:** Prospective Resident, Agency
You can then combine these into a tuple like `(Text, Tour Scheduling, Prospective Resident)` and prompt an LLM to generate specific test cases that match this profile. This structured method creates challenging, realistic scenarios that a simpler generation process might miss.
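To make the tuple idea concrete, here's a minimal sketch that enumerates every combination of the dimensions above and turns each one into a generation prompt; the prompt wording is illustrative. Each prompt can then be sent to a model to produce one or more test cases for that slice of the space.
```python
from itertools import product

channels = ["Voice", "Chat", "Text"]
intents = ["Tour Scheduling", "Maintenance", "General Info & Inquiries"]
personas = ["Prospective Resident", "Agency"]

generation_prompts = []
for channel, intent, persona in product(channels, intents, personas):
    generation_prompts.append(
        f"Write a realistic {channel.lower()} message to an apartment leasing "
        f"assistant from a {persona.lower()} whose intent is {intent.lower()}."
    )

print(len(generation_prompts), "combinations")  # 3 x 3 x 2 = 18
print(generation_prompts[0])
```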
In addition to varying the core components of the query, you can apply **perturbations** to make test cases harder and more realistic. This involves slightly altering your generated examples to test the system's resilience. Common perturbations include adding irrelevant information, introducing mistakes, or using different slang.
For a deeper dive into this topic, see [this discussion](https://hamel.dev/blog/posts/evals-faq/#q-what-is-the-best-approach-for-generating-synthetic-data).
### Aligning your LLM judge
An automated LLM judge is only useful if its judgments are trustworthy. To ensure this, you must systematically measure its performance against a human subject-matter expert (SME) using a "gold standard" dataset.
However, most test sets are **imbalanced** — they contain far more "pass" examples than "fail" examples. This makes a simple accuracy score misleading. A judge that always guesses "pass" might be 95% accurate but will never find a single failure. Instead of overall accuracy, measure two rates separately (we'll sketch the computation right after the dataset split below):
* **True Positive Rate (TPR):** How well does the judge correctly identify the *failures*?
* **True Negative Rate (TNR):** How well does the judge correctly identify the *passes*?
The goal is to achieve high scores on both TPR and TNR. This confirms the judge is effective at finding real problems without being overly critical. This measurement process uses a standard dataset split.
1. **Train Set (~20%)**
This set's only job is to provide the "few-shot" examples for your judge's prompt. You will select a handful of clear pass/fail cases from this set and embed them directly into the prompt to give it a strong starting point.
2. **Validation Set (~40%)**
This is where you will iteratively improve your judge. You run the judge against this set and analyze the cases where its decision differs from the expert's. Tune the judge's prompt instructions to improve both its TPR and TNR.
3. **Test Set (~40%)**
This final, held-out set is your report card. After tuning, run the judge on this set one time. The final TPR and TNR scores confirm you haven't overfit and give you a trustworthy measure of your judge's performance.
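To make the TPR/TNR targets concrete, here's a minimal sketch of the computation, run against the validation or test set; the label values are an assumption, with "fail" treated as the positive class:
```python
def judge_agreement(sme_labels, judge_labels, positive="fail"):
    """Compute TPR/TNR, treating `positive` (a real failure) as the positive class."""
    tp = sum(s == positive and j == positive for s, j in zip(sme_labels, judge_labels))
    tn = sum(s != positive and j != positive for s, j in zip(sme_labels, judge_labels))
    fn = sum(s == positive and j != positive for s, j in zip(sme_labels, judge_labels))
    fp = sum(s != positive and j == positive for s, j in zip(sme_labels, judge_labels))
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of real failures the judge caught
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # share of real passes the judge accepted
    return tpr, tnr

# Hypothetical SME vs. judge labels for five traces.
sme = ["pass", "fail", "pass", "fail", "pass"]
judge = ["pass", "fail", "fail", "fail", "pass"]
print(judge_agreement(sme, judge))  # (1.0, 0.666...)
```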
For more guidance on how to align an LLM judge with your SMEs, see [this discussion](https://hamel.dev/blog/posts/llm-judge/). For more guidance on what model you should use for judging your AI, see [this post](https://hamel.dev/blog/posts/evals-faq/#q-can-i-use-the-same-model-for-both-the-main-task-and-evaluation).
## Next steps
This cookbook provides a foundational workflow for building resilient prompts, but the evaluation flywheel doesn't stop after one cycle. The next step is to make this process a core part of your engineering practice by integrating your graders into a CI/CD pipeline and monitoring production data to discover new failure modes.
In addition, the world of AI evaluations is deep and full of challenges we couldn't cover here. As you work to build out your eval strategy, you'll likely encounter more complex questions, such as:
* How do I make the case for investing in evaluations to my team?
* Why is a binary (pass/fail) evaluation often better than a 1-5 rating scale?
* What is the best way to debug a complex, multi-turn conversation trace?
* How should I approach evaluating my RAG system?
* How does this workflow adapt to agentic systems?
We recommend exploring [this FAQ about Evals](https://hamel.dev/blog/posts/evals-faq/) for further study.
---
# Source: https://developers.openai.com/cookbook/examples/building_w_rt_mini/building_w_rt_mini.md
# Build with Realtime Mini
Growing up, I was fascinated by the idea of Jarvis—an intelligent assistant that could autonomously handle complex workflows. What I didn’t realize back then was that I was imagining the future of voice agents. OpenAI was the first to make this vision real with the launch of `4o-audio`, and more recently made it even more accessible—cutting costs by 70%—with the release of [GPT Realtime Mini](https://platform.openai.com/docs/models/gpt-realtime-mini), which offers lower latency and major improvements in tool calling.
Building with speech models, however, is fundamentally different from working with text-only interfaces. In addition to prompt engineering, audio models bring new challenges: they’re more latency-sensitive, require managing a WebRTC session, and introduce additional variability through voice activity detection (VAD).
To make this process easier, OpenAI has released the Agents SDK in both Python and TypeScript, along with detailed examples that showcase our recommended design patterns for building reliable voice agents.
Before diving into code, let’s map out exactly what we’ll be building—and how it fits into the broader agent handoff architecture.
## System Architecture
For our application today, we are going to build an extremely simple customer support app using the **handoff architecture**: a **primary agent** acts as the orchestrator for all incoming customer queries. Rather than handling every request directly, the primary agent analyzes the intent behind the user’s message and **categorizes it into one of two core pathways**:
1. General questions and basic support (no authenticator required).
2. Specific questions (user authentication required before lookup is performed).
Based on this categorization, the primary agent **hands off the conversation** to the appropriate specialist agent designed for that specific task.

## Setup
Instead of starting from scratch, we're going to work from the [openai-agents-js](https://github.com/openai/openai-agents-js/tree/main) repo, so let's start by cloning it, installing the necessary dependencies, and running the web demo.
```bash
git clone https://github.com/openai/openai-agents-js.git
```
After cloning, follow along with the steps in the README to get started:
```bash
npm install @openai/agents zod@3
pnpm examples:realtime-next
```
If everything works as expected, you should see a simple chat interface.

## Main Agent
Great! Now that we've cloned the repo, we are going to modify `openai-agents-js/examples/realtime-next/src/app/page.tsx`, starting with the **Main Agent**. Our **Main Agent** is the point of entry for the application stack: it acts as an intent classifier for incoming user queries, deciding how to route them between the different layers.
The implementation is fairly straightforward:
```js
const mainAgent = new RealtimeAgent({
  name: 'Main Agent',
  instructions:
    'You are the entry point for all customer queries. Default to the no-auth QA flow. If authentication is needed and validated, escalate to the Auth Layer by handing off to either the Flight Status Checker or Rebooking Agent. Do not answer policy questions from your own knowledge; rely on subordinate agents and tools.',
  tools: [
    checkFlightsTool,
  ],
  handoffs: [qaAgent],
});
```
## QA Agent
Now that we’ve built the main agent, the next step is to add a specialized supporting agent to handle a specific class of customer queries. For general airline policy questions, this will be the QA Agent.
In a real-world product, this agent would power a more sophisticated experience: it would ingest company-specific PDFs and other reference materials, embed them, and dynamically query those documents at runtime to provide accurate, policy-grounded answers.
```
┌────────────┐      ┌────────────┐      ┌────────────────────────┐      ┌────────────┐
│ User Query │ ───► │  QA Agent  │ ───► │ Vector DB / Retriever  │ ───► │ LLM Answer │
└────────────┘      └────────────┘      └────────────────────────┘      └────────────┘
                          │                          │
                          │ build search             │ top-k context
                          ▼                          ▼
                  (semantic search)        (grounded generation)
```
This would typically involve building a full vector database service that embeds the customer’s query and retrieves the most relevant results. For the sake of simplicity in this demo, we’ll mock that part of the pipeline.
If you’re interested in learning how to implement a fully featured retrieval system, take a look at our other cookbooks on the topic [here](https://cookbook.openai.com/examples/vector_databases/pinecone/readme).
```js
const documentLookupTool = tool({
  name: 'document_lookup_tool',
  description: 'Looks up answers from known airline documentation to handle general questions without authentication.',
  parameters: z.object({
    request: z.string(),
  }),
  execute: async ({ request }) => {
    const mockDocument = `**Airline Customer Support — Quick Reference**
1. Each passenger may bring 1 carry-on (22 x 14 x 9) and 1 personal item.
2. Checked bags must be under 50 lbs; overweight fees apply.
3. Online check-in opens 24 hours before departure.
4. Seat upgrades can be requested up to 1 hour before boarding.
5. Wi‑Fi is complimentary on all flights over 2 hours.
6. Customers can change flights once for free within 24 hours of booking.
7. Exit rows offer extra legroom and require passengers to meet safety criteria.
8. Refunds can be requested for canceled or delayed flights exceeding 3 hours.
9. Pets are allowed in the cabin if under 20 lbs and in an approved carrier.
10. For additional help, contact our support team via chat or call center.`;
    return mockDocument;
  },
});
```
As before with the Main Agent, we create another instance of `RealtimeAgent`, but this time we supply the `documentLookupTool`.
```js
const qaAgent = new RealtimeAgent({
  name: 'QA Agent',
  instructions:
    'You handle general customer questions using the document lookup tool. Use only the document lookup for answers. If the request may involve personal data or operations (rebooking, flight status), call the auth check tool. If auth is required and validated, handoff to the appropriate Auth Layer agent.',
  tools: [documentLookupTool],
});
```
## Flight Status Agent
We’ve already built a powerful foundation: a main agent that can handle inbound customer queries, and a QA agent that searches our document store to provide accurate, policy-based answers.
What’s missing is a layer for customer-specific information—for example, queries like “What’s the status of my flight?” or “Which terminal should I go to?”. To support these kinds of personalized interactions, we need to embed an authentication layer into the workflow so the system can securely access and respond with user-specific data.
```
┌────────────┐      ┌──────────────┐      ┌───────────────────────┐      ┌───────────────────────┐
│ User Query │ ───► │  Auth Layer  │ ───► │ Customer Data Access  │ ───► │ LLM Answer (Personal) │
└────────────┘      └──────────────┘      └───────────────────────┘      └───────────────────────┘
                           │                          │
                           │ verify identity          │ query flight / account
                           ▼                          ▼
                (token, SSO, OTP, etc.)    (e.g., flight status, profile info)
```
Fortunately, the Agents SDK is designed to support this kind of use case. For customer support scenarios that involve sensitive, account-level information, we can ensure proper access control by using the `needsApproval` parameter within `tool`, which requires the user to authenticate before any protected data is accessed.
```js
const checkFlightsTool = tool({
  name: 'checkFlightsTool',
  description: 'Call this tool if the user queries about their current flight status',
  parameters: z.object({}),
  // Require approval so the UI can collect creds before executing.
  needsApproval: true,
  execute: async () => {
    if (!credState.username || !credState.password) {
      return 'Authentication missing.';
    }
    return `${credState.username} you are currently booked on the 8am flight from SFO to JFK`;
  },
});
```
When a tool is registered with `needsApproval`, the session emits a `tool_approval_requested` event whenever that tool is about to run. This lets us add logic where we instantiate the `RealtimeSession` in our web application to listen for these events and update the UI accordingly, for example by prompting the user to approve or authenticate before continuing.
```js
const [credUsername, setCredUsername] = useState('');
const [credPassword, setCredPassword] = useState('');
const [pendingApproval, setPendingApproval] = useState(null);

useEffect(() => {
  session.current = new RealtimeSession(mainAgent, {
    // other configs go here!
  });

  // various other event based logic goes here!
  session.current.on(
    'tool_approval_requested',
    (_context, _agent, approvalRequest) => {
      setPendingApproval(approvalRequest.approvalItem); // <- Alterations to react state!
      setCredUsername('');
      setCredPassword('');
      setCredOpen(true);
    },
  );
}, []);

// ....

return (
  {credOpen && (
    // ... remainder of component logic
  )}
)
```
## Final Code Snippet
And with that, we’re done! You’ve now built the core components of a customer support application:
* A generalist agent capable of handling a wide range of customer support queries
* An authentication workflow that verifies user identity and retrieves customer-specific information
With everything in place, the final version of `realtime-next/src/app/page.tsx` should look like this.
```js
'use client';
import {
RealtimeAgent,
RealtimeSession,
tool,
TransportEvent,
RealtimeOutputGuardrail,
OutputGuardrailTripwireTriggered,
RealtimeItem,
} from '@openai/agents/realtime';
import { useEffect, useRef, useState } from 'react';
import { z } from 'zod';
import { getToken } from './server/token.action';
import { App } from '@/components/App';
import { CameraCapture } from '@/components/CameraCapture';
// Demo-only credential store the tool can read at execution time
const credState: { username?: string; password?: string } = {};
// ---------------------------------------------
// Tools.
const documentLookupTool = tool({
name: 'document_lookup_tool',
description: 'Looks up answers from known airline documentation to handle general questions without authentication.',
parameters: z.object({
request: z.string(),
}),
execute: async ({ request }) => {
const mockDocument = `**Airline Customer Support — Quick Reference**
1. Each passenger may bring 1 carry-on (22 x 14 x 9) and 1 personal item.
2. Checked bags must be under 50 lbs; overweight fees apply.
3. Online check-in opens 24 hours before departure.
4. Seat upgrades can be requested up to 1 hour before boarding.
5. Wi‑Fi is complimentary on all flights over 2 hours.
6. Customers can change flights once for free within 24 hours of booking.
7. Exit rows offer extra legroom and require passengers to meet safety criteria.
8. Refunds can be requested for canceled or delayed flights exceeding 3 hours.
9. Pets are allowed in the cabin if under 20 lbs and in an approved carrier.
10. For additional help, contact our support team via chat or call center.`;
return mockDocument;
},
});
const checkFlightsTool = tool({
name: 'checkFlightsTool',
description: 'Call this tool if the user queries about their current flight status',
parameters: z.object({}),
// Require approval so the UI can collect creds before executing.
needsApproval: true,
execute: async () => {
if (!credState.username || !credState.password) {
return 'Authentication missing.';
}
return `${credState.username} you are currently booked on the 8am flight from SFO to JFK`;
},
});
// ---------------------------------------------
// Agents for each layer.
// 2. No-Auth Layer: QA Agent with doc lookup and auth check tool.
const qaAgent = new RealtimeAgent({
name: 'QA Agent',
instructions:
'You handle general customer questions using the document lookup tool. Use only the document lookup for answers. If the request may involve personal data or operations (rebooking, flight status), call the auth check tool. If auth is required and validated, handoff to the appropriate Auth Layer agent.',
tools: [documentLookupTool],
});
// 1. Main Agent: entry point and routing.
const mainAgent = new RealtimeAgent({
name: 'Main Agent',
instructions:
'You are the entry point for all customer queries. Default to the no-auth QA flow. If authentication is needed and validated, escalate to the Auth Layer by handing off to either the Flight Status Checker or Rebooking Agent. Do not answer policy questions from your own knowledge; rely on subordinate agents and tools.',
tools: [
checkFlightsTool,
],
handoffs: [qaAgent],
});
// Cross-handoffs so agents can return or escalate.
qaAgent.handoffs = [mainAgent];
export default function Home() {
const session = useRef<RealtimeSession | null>(null);
const [isConnected, setIsConnected] = useState(false);
const [isMuted, setIsMuted] = useState(false);
const [outputGuardrailResult, setOutputGuardrailResult] =
useState<OutputGuardrailTripwireTriggered | null>(null);
const [events, setEvents] = useState([]);
const [history, setHistory] = useState([]);
const [mcpTools, setMcpTools] = useState([]);
const [credOpen, setCredOpen] = useState(false);
const [credUsername, setCredUsername] = useState('');
const [credPassword, setCredPassword] = useState('');
const [pendingApproval, setPendingApproval] = useState(null);
useEffect(() => {
session.current = new RealtimeSession(mainAgent, {
model: 'gpt-realtime-mini',
outputGuardrailSettings: {
debounceTextLength: 200,
},
config: {
audio: {
output: {
voice: 'cedar',
},
},
},
});
session.current.on('transport_event', (event) => {
setEvents((events) => [...events, event]);
});
session.current.on('mcp_tools_changed', (tools) => {
setMcpTools(tools.map((t) => t.name));
});
session.current.on(
'guardrail_tripped',
(_context, _agent, guardrailError) => {
setOutputGuardrailResult(guardrailError);
},
);
session.current.on('history_updated', (history) => {
setHistory(history);
});
session.current.on(
'tool_approval_requested',
(_context, _agent, approvalRequest) => {
setPendingApproval(approvalRequest.approvalItem);
setCredUsername('');
setCredPassword('');
setCredOpen(true);
},
);
}, []);
async function connect() {
if (isConnected) {
await session.current?.close();
setIsConnected(false);
} else {
const token = await getToken();
try {
await session.current?.connect({
apiKey: token,
});
setIsConnected(true);
} catch (error) {
console.error('Error connecting to session', error);
}
}
}
async function toggleMute() {
if (isMuted) {
await session.current?.mute(false);
setIsMuted(false);
} else {
await session.current?.mute(true);
setIsMuted(true);
}
}
function handleCredCancel() {
const approval = pendingApproval;
setCredOpen(false);
setPendingApproval(null);
if (approval) session.current?.reject(approval);
}
function handleCredSubmit(e: React.FormEvent) {
e.preventDefault();
if (!credUsername || !credPassword) return;
// Store creds for the tool to read
credState.username = credUsername;
credState.password = credPassword;
const approval = pendingApproval;
setCredOpen(false);
setPendingApproval(null);
setCredUsername('');
setCredPassword('');
if (approval) session.current?.approve(approval);
}
return (
);
}
```
---
# Source: https://developers.openai.com/resources/guide/built-in-tools-guide.md
# Built-in tools guide
> Guide to using OpenAI's built-in tools with the Responses API.
- Type: Guide
- Tags: tools
- URL: https://platform.openai.com/docs/guides/tools?api-mode=responses
- Created: 2025-07-18
- Updated: 2025-08-13
## Summary
Details available built-in tools and their usage. — tool calling
## Details
Provides instructions and examples for integrating built-in tools.
---
# Source: https://developers.openai.com/resources/video/built-in-tools-video.md
# Build hour — built-in tools
> Build hour giving an overview of built-in tools available in the Responses API.
- Type: Video
- Tags: responses, agents
- URL: https://webinar.openai.com/on-demand/c17a0484-d32c-4359-b5ee-d318dad51586
- Created: 2025-07-21
- Updated: 2025-08-13
## Summary
Shows how agents can call tools to accomplish tasks. — Responses API, function calling, Agents SDK, agentic, tool calling
## Details
Covers practical examples of integrating external tools in agent workflows.
---
# Source: https://developers.openai.com/cookbook/examples/evaluation/use-cases/bulk-experimentation.md
# Evaluations Example: Push Notifications Bulk Experimentation
Evals are **task oriented** and iterative; they're the best way to check how your LLM integration is doing and improve it.
In the following eval, we are going to focus on the task of **testing many variants of models and prompts**.
Our use-case is:
1. I want to get the best possible performance out of my push notifications summarizer
## Evals structure
Evals have two parts, the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval `has_many` Runs, which are evaluated by your testing criteria.
```python
import pydantic
import openai
from openai.types.chat import ChatCompletion
import os
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")
```
## Use-case
We're testing the following integration, a push notifications summarizer, which takes in multiple push notifications and collapses them into a single message.
```python
class PushNotifications(pydantic.BaseModel):
    notifications: str

print(PushNotifications.model_json_schema())
```
```python
DEVELOPER_PROMPT = """
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
"""
def summarize_push_notification(push_notifications: str) -> ChatCompletion:
    result = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "developer", "content": DEVELOPER_PROMPT},
            {"role": "user", "content": push_notifications},
        ],
    )
    return result
example_push_notifications_list = PushNotifications(notifications="""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""")
result = summarize_push_notification(example_push_notifications_list.notifications)
print(result.choices[0].message.content)
```
# Setting up your eval
An Eval holds the configuration that is shared across multiple *Runs*; it has two components:
1. Data source configuration `data_source_config` - the schema (columns) that your future *Runs* conform to.
- The `data_source_config` uses JSON Schema to define what variables are available in the Eval.
2. Testing Criteria `testing_criteria` - How you'll determine if your integration is working for each *row* of your data source.
For this use-case, we want to test whether the push notification summary completion is good, so we'll set up our eval with this in mind.
```python
# We want our input data to be available in our variables, so we set the item_schema to
# PushNotifications.model_json_schema()
data_source_config = {
    "type": "custom",
    "item_schema": PushNotifications.model_json_schema(),
    # We're going to be uploading completions from the API, so we tell the Eval to expect this
    "include_sample_schema": True,
}
```
This data_source_config defines what variables are available throughout the eval.
This item schema:
```json
{
  "properties": {
    "notifications": {
      "title": "Notifications",
      "type": "string"
    }
  },
  "required": ["notifications"],
  "title": "PushNotifications",
  "type": "object"
}
```
means that we'll have the variable `{{item.notifications}}` available in our eval.
Setting `"include_sample_schema": True`
means that we'll have the variable `{{sample.output_text}}` available in our eval.
**Now, we'll use those variables to set up our test criteria.**
```python
GRADER_DEVELOPER_PROMPT = """
Categorize the following push notification summary into the following categories:
1. concise-and-snappy
2. drops-important-information
3. verbose
4. unclear
5. obscures-meaning
6. other
You'll be given the original list of push notifications and the summary like this:
...notificationlist...
...summary...
You should only pick one of the categories above, pick the one which most closely matches and why.
"""
GRADER_TEMPLATE_PROMPT = """
{{item.notifications}}{{sample.output_text}}
"""
push_notification_grader = {
    "name": "Push Notification Summary Grader",
    "type": "label_model",
    "model": "o3-mini",
    "input": [
        {
            "role": "developer",
            "content": GRADER_DEVELOPER_PROMPT,
        },
        {
            "role": "user",
            "content": GRADER_TEMPLATE_PROMPT,
        },
    ],
    "passing_labels": ["concise-and-snappy"],
    "labels": [
        "concise-and-snappy",
        "drops-important-information",
        "verbose",
        "unclear",
        "obscures-meaning",
        "other",
    ],
}
```
The `push_notification_grader` is a model grader (LLM-as-a-judge) that looks at the input `{{item.notifications}}` and the generated summary `{{sample.output_text}}` and assigns one of the labels defined above.
We then specify, via `passing_labels`, which labels count as a passing answer.
Note: under the hood, this uses structured outputs so that labels are always valid.
**Now we'll create our eval, and start adding data to it!**
```python
eval_create_result = openai.evals.create(
    name="Push Notification Bulk Experimentation Eval",
    metadata={
        "description": "This eval tests many prompts and models to find the best performing combination.",
    },
    data_source_config=data_source_config,
    testing_criteria=[push_notification_grader],
)

eval_id = eval_create_result.id
```
# Creating runs
Now that we have our eval set up with our testing criteria, we can start adding a bunch of runs!
We'll start with some push notification data.
```python
push_notification_data = [
"""
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
"""
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
"""
- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
"""
- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
"""
- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
"""
- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
"""
- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
"""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
"""
- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
"""
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]
```
Now we're going to set up a bunch of prompts to test.
We want to test a basic prompt, with a couple of variations:
1. In one variation, we'll just have the basic prompt
2. In the next one, we'll include some positive examples of what we want the summaries to look like
3. In the final one, we'll include both positive and negative examples.
We'll also include a list of models to use.
```python
PROMPT_PREFIX = """
You are a helpful assistant that takes in an array of push notifications and returns a collapsed summary of them.
The push notification will be provided as follows:
...notificationlist...
You should return just the summary and nothing else.
"""
PROMPT_VARIATION_BASIC = f"""
{PROMPT_PREFIX}
You should return a summary that is concise and snappy.
"""
PROMPT_VARIATION_WITH_EXAMPLES = f"""
{PROMPT_VARIATION_BASIC}
Here is an example of a good summary:
- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.
Traffic alert, package expected by 5pm, suggestion for new friend (Emma).
"""
PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES = f"""
{PROMPT_VARIATION_WITH_EXAMPLES}
Here is an example of a bad summary:
- Traffic alert: Accident reported on Main Street.- Package out for delivery: Expected by 5 PM.- New friend suggestion: Connect with Emma.
Traffic alert reported on main street. You have a package that will arrive by 5pm, Emma is a new friend suggested for you.
"""
prompts = [
("basic", PROMPT_VARIATION_BASIC),
("with_examples", PROMPT_VARIATION_WITH_EXAMPLES),
("with_negative_examples", PROMPT_VARIATION_WITH_NEGATIVE_EXAMPLES),
]
models = ["gpt-4o", "gpt-4o-mini", "o3-mini"]
```
**Now we can just loop through all prompts and all models to test a bunch of configurations at once!**
We'll use the 'completions' run data source with template variables for our push notification list.
OpenAI will handle making the completions calls for you and populating `{{sample.output_text}}`.
```python
for prompt_name, prompt in prompts:
    for model in models:
        run_data_source = {
            "type": "completions",
            "input_messages": {
                "type": "template",
                "template": [
                    {
                        "role": "developer",
                        "content": prompt,
                    },
                    {
                        "role": "user",
                        "content": "{{item.notifications}}",
                    },
                ],
            },
            "model": model,
            "source": {
                "type": "file_content",
                "content": [
                    {
                        "item": PushNotifications(notifications=notification).model_dump()
                    }
                    for notification in push_notification_data
                ],
            },
        }

        run_create_result = openai.evals.runs.create(
            eval_id=eval_id,
            name=f"bulk_{prompt_name}_{model}",
            data_source=run_data_source,
        )
        print(f"Report URL {model}, {prompt_name}:", run_create_result.report_url)
```
## Congratulations, you just tested 9 different prompt and model variations across your dataset!
---
# Source: https://developers.openai.com/cookbook/examples/azure/chat.md
# Azure chat completions example
This example will cover chat completions using the Azure OpenAI service. It also includes information on content filtering.
## Setup
First, we install the necessary dependencies and import the libraries we will be using.
```python
! pip install "openai>=1.0.0,<2.0.0"
! pip install python-dotenv
```
```python
import os
import openai
import dotenv
dotenv.load_dotenv()
```
### Authentication
The Azure OpenAI service supports multiple authentication mechanisms that include API keys and Azure Active Directory token credentials.
```python
use_azure_active_directory = False # Set this flag to True if you are using Azure Active Directory
```
#### Authentication using API key
To set up the OpenAI SDK to use an *Azure API Key*, we need to set `api_key` to a key associated with your endpoint (you can find this key in *"Keys and Endpoints"* under *"Resource Management"* in the [Azure Portal](https://portal.azure.com)). You'll also find the endpoint for your resource here.
```python
if not use_azure_active_directory:
    endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
    api_key = os.environ["AZURE_OPENAI_API_KEY"]

    client = openai.AzureOpenAI(
        azure_endpoint=endpoint,
        api_key=api_key,
        api_version="2023-09-01-preview"
    )
```
#### Authentication using Azure Active Directory
Let's now see how we can authenticate via Azure Active Directory. We'll start by installing the `azure-identity` library. This library will provide the token credentials we need to authenticate and help us build a token credential provider through the `get_bearer_token_provider` helper function. It's recommended to use `get_bearer_token_provider` over providing a static token to `AzureOpenAI` because this API will automatically cache and refresh tokens for you.
For more information on how to set up Azure Active Directory authentication with Azure OpenAI, see the [documentation](https://learn.microsoft.com/azure/ai-services/openai/how-to/managed-identity).
```python
! pip install "azure-identity>=1.15.0"
```
```python
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
if use_azure_active_directory:
    endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]

    client = openai.AzureOpenAI(
        azure_endpoint=endpoint,
        azure_ad_token_provider=get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"),
        api_version="2023-09-01-preview"
    )
```
> Note: the `AzureOpenAI` client infers the following arguments from their corresponding environment variables if they are not provided:
- `api_key` from `AZURE_OPENAI_API_KEY`
- `azure_ad_token` from `AZURE_OPENAI_AD_TOKEN`
- `api_version` from `OPENAI_API_VERSION`
- `azure_endpoint` from `AZURE_OPENAI_ENDPOINT`
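In other words, when those environment variables are set, constructing the client can be as short as the sketch below (shown for the API-key path); the rest of this example keeps the explicit arguments for clarity:
```python
# Relies on AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and OPENAI_API_VERSION
# being set in the environment.
client = openai.AzureOpenAI()
```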
## Deployments
In this section we are going to create a deployment of a GPT model that we can use to create chat completions.
### Deployments: Create in the Azure OpenAI Studio
Let's deploy a model to use with chat completions. Go to https://portal.azure.com, find your Azure OpenAI resource, and then navigate to the Azure OpenAI Studio. Click on the "Deployments" tab and then create a deployment for the model you want to use for chat completions. The deployment name that you give the model will be used in the code below.
```python
deployment = "" # Fill in the deployment name from the portal here
```
## Create chat completions
Now let's create a chat completion using the client we built.
```python
# For all possible arguments see https://platform.openai.com/docs/api-reference/chat-completions/create
response = client.chat.completions.create(
    model=deployment,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Knock knock."},
        {"role": "assistant", "content": "Who's there?"},
        {"role": "user", "content": "Orange."},
    ],
    temperature=0,
)
print(f"{response.choices[0].message.role}: {response.choices[0].message.content}")
```
### Create a streaming chat completion
We can also stream the response.
```python
response = client.chat.completions.create(
    model=deployment,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Knock knock."},
        {"role": "assistant", "content": "Who's there?"},
        {"role": "user", "content": "Orange."},
    ],
    temperature=0,
    stream=True
)

for chunk in response:
    if len(chunk.choices) > 0:
        delta = chunk.choices[0].delta
        if delta.role:
            print(delta.role + ": ", end="", flush=True)
        if delta.content:
            print(delta.content, end="", flush=True)
```
### Content filtering
Azure OpenAI service includes content filtering of prompts and completion responses. You can learn more about content filtering and how to configure it [here](https://learn.microsoft.com/azure/ai-services/openai/concepts/content-filter).
If the prompt is flagged by the content filter, the library will raise a `BadRequestError` exception with a `content_filter` error code. Otherwise, you can access the `prompt_filter_results` and `content_filter_results` on the response to see the results of the content filtering and what categories were flagged.
#### Prompt flagged by content filter
```python
import json

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "<text violating the content policy>"}
]

try:
    completion = client.chat.completions.create(
        messages=messages,
        model=deployment,
    )
except openai.BadRequestError as e:
    err = json.loads(e.response.text)
    if err["error"]["code"] == "content_filter":
        print("Content filter triggered!")
        content_filter_result = err["error"]["innererror"]["content_filter_result"]
        for category, details in content_filter_result.items():
            print(f"{category}:\n filtered={details['filtered']}\n severity={details['severity']}")
```
### Checking the result of the content filter
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the biggest city in Washington?"}
]

completion = client.chat.completions.create(
    messages=messages,
    model=deployment,
)
print(f"Answer: {completion.choices[0].message.content}")

# prompt content filter result in "model_extra" for azure
prompt_filter_result = completion.model_extra["prompt_filter_results"][0]["content_filter_results"]
print("\nPrompt content filter results:")
for category, details in prompt_filter_result.items():
    print(f"{category}:\n filtered={details['filtered']}\n severity={details['severity']}")

# completion content filter result
print("\nCompletion content filter results:")
completion_filter_result = completion.choices[0].model_extra["content_filter_results"]
for category, details in completion_filter_result.items():
    print(f"{category}:\n filtered={details['filtered']}\n severity={details['severity']}")
```
---
# Source: https://developers.openai.com/cookbook/examples/chat_finetuning_data_prep.md
# Data preparation and analysis for chat model fine-tuning
This notebook serves as a tool to preprocess and analyze the chat dataset used for fine-tuning a chat model.
It checks for format errors, provides basic statistics, and estimates token counts for fine-tuning costs.
The method shown here corresponds to the [current fine-tuning method](https://platform.openai.com/docs/guides/fine-tuning) for gpt-3.5-turbo.
See [legacy fine-tuning](https://platform.openai.com/docs/guides/legacy-fine-tuning) for models like babbage-002 and davinci-002.
```python
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict
```
## Data loading
We first load the chat dataset from an [example JSONL file](https://github.com/openai/openai-cookbook/blob/main/examples/data/toy_chat_fine_tuning.jsonl).
```python
data_path = "data/toy_chat_fine_tuning.jsonl"
# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)
```
```text
Num examples: 5
First example:
{'role': 'system', 'content': 'You are a happy assistant that puts a positive spin on everything.'}
{'role': 'user', 'content': 'I fell off my bike today.'}
{'role': 'assistant', 'content': "It's great that you're getting exercise outdoors!"}
```
## Format validation
We can perform a variety of error checks to validate that each conversation in the dataset adheres to the format expected by the fine-tuning API. Errors are categorized based on their nature for easier debugging.
1. **Data Type Check**: Checks whether each entry in the dataset is a dictionary (`dict`). Error type: `data_type`.
2. **Presence of Message List**: Checks if a `messages` list is present in each entry. Error type: `missing_messages_list`.
3. **Message Keys Check**: Validates that each message in the `messages` list contains the keys `role` and `content`. Error type: `message_missing_key`.
4. **Unrecognized Keys in Messages**: Logs if a message has keys other than `role`, `content`, `weight`, `function_call`, and `name`. Error type: `message_unrecognized_key`.
5. **Role Validation**: Ensures the `role` is one of "system", "user", or "assistant". Error type: `unrecognized_role`.
6. **Content Validation**: Verifies that `content` has textual data and is a string. Error type: `missing_content`.
7. **Assistant Message Presence**: Checks that each conversation has at least one message from the assistant. Error type: `example_missing_assistant_message`.
The code below performs these checks, and outputs counts for each type of error found are printed. This is useful for debugging and ensuring the dataset is ready for the next steps.
```python
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue

    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1

        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1

        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1

        content = message.get("content", None)
        function_call = message.get("function_call", None)

        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")
```
```text
No errors found
```
## Token Counting Utilities
Let's define a few helpful utilities to be used in the rest of the notebook.
```python
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")
```
## Data Warnings and Token Counts
With some lightweight analysis we can identify potential issues in the dataset, like missing messages, and provide statistical insights into message and token counts.
1. **Missing System/User Messages**: Counts the number of conversations missing a "system" or "user" message. Such messages are critical for defining the assistant's behavior and initiating the conversation.
2. **Number of Messages Per Example**: Summarizes the distribution of the number of messages in each conversation, providing insight into dialogue complexity.
3. **Total Tokens Per Example**: Calculates and summarizes the distribution of the total number of tokens in each conversation. Important for understanding fine-tuning costs.
4. **Tokens in Assistant's Messages**: Calculates the number of tokens in the assistant's messages per conversation and summarizes this distribution. Useful for understanding the assistant's verbosity.
5. **Token Limit Warnings**: Checks if any examples exceed the maximum token limit (16,385 tokens), as such examples will be truncated during fine-tuning, potentially resulting in data loss.
```python
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))

print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 16385 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 16,385 token limit, they will be truncated during fine-tuning")
```
```text
Num examples missing system message: 1
Num examples missing user message: 1
#### Distribution of num_messages_per_example:
min / max: 2, 9
mean / median: 3.8, 3.0
p10 / p90: 2.0, 6.6000000000000005
#### Distribution of num_total_tokens_per_example:
min / max: 26, 8032
mean / median: 1648.4, 45.0
p10 / p90: 26.8, 4863.6
#### Distribution of num_assistant_tokens_per_example:
min / max: 4, 8000
mean / median: 1610.2, 10.0
p10 / p90: 6.0, 4811.200000000001
0 examples may be over the 16,385 token limit, they will be truncated during fine-tuning
```
## Cost Estimation
In this final section, we estimate the total number of tokens that will be used for fine-tuning, which allows us to approximate the cost. It is worth noting that the duration of the fine-tuning jobs will also increase with the token count.
```python
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 16385
TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25
n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")
```
```text
Dataset has ~4306 tokens that will be charged for during training
By default, you'll train for 20 epochs on this dataset
By default, you'll be charged for ~86120 tokens
```
See https://openai.com/pricing to estimate total costs.
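As a rough sketch, you can turn the billable token count into a dollar estimate by plugging in the current training price from the pricing page; the price below is a placeholder, not a real figure:
```python
# Placeholder: fill in the current training price per 1K tokens from the pricing page.
price_per_1k_training_tokens = 0.0  # USD per 1,000 tokens

estimated_cost = (n_epochs * n_billing_tokens_in_dataset / 1000) * price_per_1k_training_tokens
print(f"Estimated fine-tuning cost: ~${estimated_cost:.2f}")
```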
---
# Source: https://developers.openai.com/cookbook/examples/azure/chat_with_your_own_data.md
# Azure chat completion models with your own data (preview)
This example shows how to use Azure OpenAI service models with your own data. The feature is currently in preview.
Azure OpenAI on your data enables you to run supported chat models such as GPT-3.5-Turbo and GPT-4 on your data without needing to train or fine-tune models. Running models on your data enables you to chat on top of, and analyze your data with greater accuracy and speed. One of the key benefits of Azure OpenAI on your data is its ability to tailor the content of conversational AI. Because the model has access to, and can reference specific sources to support its responses, answers are not only based on its pretrained knowledge but also on the latest information available in the designated data source. This grounding data also helps the model avoid generating responses based on outdated or incorrect information.
Azure OpenAI on your own data with Azure AI Search (f.k.a. Azure Cognitive Search) provides a customizable, pre-built solution for knowledge retrieval, from which a conversational AI application can be built. To see alternative methods for knowledge retrieval and semantic search, check out the cookbook examples for [vector databases](https://github.com/openai/openai-cookbook/tree/main/examples/vector_databases).
## How it works
[Azure OpenAI on your own data](https://learn.microsoft.com/azure/ai-services/openai/concepts/use-your-data) connects the model with your data, giving it the ability to retrieve and utilize data in a way that enhances the model's output. Together with Azure AI Search, data is retrieved from designated data sources based on the user input and provided conversation history. The data is then augmented and resubmitted as a prompt to the model, giving the model contextual information it can use to generate a response.
See the [Data, privacy, and security for Azure OpenAI Service](https://learn.microsoft.com/legal/cognitive-services/openai/data-privacy?context=%2Fazure%2Fai-services%2Fopenai%2Fcontext%2Fcontext) for more information.
## Prerequisites
To get started, we'll cover a few prerequisites.
To properly access the Azure OpenAI Service, we need to create the proper resources at the [Azure Portal](https://portal.azure.com) (you can find a detailed guide on how to do this in the [Microsoft Docs](https://learn.microsoft.com/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal)).
To use your own data with Azure OpenAI models, you will need:
1. Azure OpenAI access and a resource with a chat model deployed (for example, GPT-3.5-Turbo or GPT-4)
2. Azure AI Search (f.k.a. Azure Cognitive Search) resource
3. Azure Blob Storage resource
4. Your documents to be used as data (See [data source options](https://learn.microsoft.com/azure/ai-services/openai/concepts/use-your-data#data-source-options))
For a full walk-through on how to upload your documents to blob storage and create an index using the Azure AI Studio, see this [Quickstart](https://learn.microsoft.com/azure/ai-services/openai/use-your-data-quickstart?pivots=programming-language-studio&tabs=command-line).
## Setup
First, we install the necessary dependencies.
```python
! pip install "openai>=1.0.0,<2.0.0"
! pip install python-dotenv
```
In this example, we'll use `dotenv` to load our environment variables. To connect with Azure OpenAI and the Search index, the following variables should be added to a `.env` file in `KEY=VALUE` format:
* `AZURE_OPENAI_ENDPOINT` - the Azure OpenAI endpoint. This can be found under "Keys and Endpoints" for your Azure OpenAI resource in the Azure Portal.
* `AZURE_OPENAI_API_KEY` - the Azure OpenAI API key. This can be found under "Keys and Endpoints" for your Azure OpenAI resource in the Azure Portal. Omit if using Azure Active Directory authentication (see `Authentication using Azure Active Directory` below).
* `SEARCH_ENDPOINT` - the AI Search endpoint. This URL can be found on the "Overview" page of your Search resource in the Azure Portal.
* `SEARCH_KEY` - the AI Search API key. Found under "Keys" for your Search resource in the Azure Portal.
* `SEARCH_INDEX_NAME` - the name of the index you created with your own data.
```python
import os
import openai
import dotenv
dotenv.load_dotenv()
```
### Authentication
The Azure OpenAI service supports multiple authentication mechanisms that include API keys and Azure Active Directory token credentials.
```python
use_azure_active_directory = False # Set this flag to True if you are using Azure Active Directory
```
#### Authentication using API key
To set up the OpenAI SDK to use an *Azure API Key*, we need to set `api_key` to a key associated with your endpoint (you can find this key in *"Keys and Endpoints"* under *"Resource Management"* in the [Azure Portal](https://portal.azure.com)). You'll also find the endpoint for your resource here.
```python
if not use_azure_active_directory:
    endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
    api_key = os.environ["AZURE_OPENAI_API_KEY"]

    # set the deployment name for the model we want to use
    deployment = ""

    client = openai.AzureOpenAI(
        base_url=f"{endpoint}/openai/deployments/{deployment}/extensions",
        api_key=api_key,
        api_version="2023-09-01-preview"
    )
```
#### Authentication using Azure Active Directory
Let's now see how we can authenticate via Azure Active Directory. We'll start by installing the `azure-identity` library. This library will provide the token credentials we need to authenticate and help us build a token credential provider through the `get_bearer_token_provider` helper function. It's recommended to use `get_bearer_token_provider` over providing a static token to `AzureOpenAI` because this API will automatically cache and refresh tokens for you.
For more information on how to set up Azure Active Directory authentication with Azure OpenAI, see the [documentation](https://learn.microsoft.com/azure/ai-services/openai/how-to/managed-identity).
```python
! pip install "azure-identity>=1.15.0"
```
```python
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
if use_azure_active_directory:
    endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]

    # set the deployment name for the model we want to use
    deployment = ""

    client = openai.AzureOpenAI(
        base_url=f"{endpoint}/openai/deployments/{deployment}/extensions",
        azure_ad_token_provider=get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"),
        api_version="2023-09-01-preview"
    )
```
> Note: the `AzureOpenAI` client infers the following arguments from their corresponding environment variables if they are not provided:
- `api_key` from `AZURE_OPENAI_API_KEY`
- `azure_ad_token` from `AZURE_OPENAI_AD_TOKEN`
- `api_version` from `OPENAI_API_VERSION`
- `azure_endpoint` from `AZURE_OPENAI_ENDPOINT`
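For example, if those environment variables are already set, the client can be constructed without passing the values explicitly. This is a minimal sketch of the inference behavior only; the examples above still pass an explicit `base_url` for the extensions endpoint.
```python
# Assumes AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, and OPENAI_API_VERSION
# are set in the environment, so AzureOpenAI can infer them.
client_from_env = openai.AzureOpenAI()
```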
## Chat completion model with your own data
### Setting the context
In this example, we want our model to base its responses on Azure AI services documentation data. Following the [Quickstart](https://learn.microsoft.com/azure/ai-services/openai/use-your-data-quickstart?tabs=command-line&pivots=programming-language-studio) shared previously, we have added the [markdown](https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/cognitive-services-and-machine-learning.md) file for the [Azure AI services and machine learning](https://learn.microsoft.com/azure/ai-services/cognitive-services-and-machine-learning) documentation page to our search index. The model is now ready to answer questions about Azure AI services and machine learning.
### Code
Now we can use Azure OpenAI on your own data with Chat Completions. Because we provide our search endpoint, key, and index name in `dataSources`, any questions posed to the model are grounded in our own data. An additional property, `context`, is included in the response to show the data the model referenced when answering the question.
```python
completion = client.chat.completions.create(
    messages=[{"role": "user", "content": "What are the differences between Azure Machine Learning and Azure AI services?"}],
    model=deployment,
    extra_body={
        "dataSources": [
            {
                "type": "AzureCognitiveSearch",
                "parameters": {
                    "endpoint": os.environ["SEARCH_ENDPOINT"],
                    "key": os.environ["SEARCH_KEY"],
                    "indexName": os.environ["SEARCH_INDEX_NAME"],
                }
            }
        ]
    }
)
print(f"{completion.choices[0].message.role}: {completion.choices[0].message.content}")

# `context` is in the model_extra for Azure
print(f"\nContext: {completion.choices[0].message.model_extra['context']['messages'][0]['content']}")
```
If you would prefer to stream the response from the model, you can pass the `stream=True` keyword argument:
```python
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "What are the differences between Azure Machine Learning and Azure AI services?"}],
    model=deployment,
    extra_body={
        "dataSources": [
            {
                "type": "AzureCognitiveSearch",
                "parameters": {
                    "endpoint": os.environ["SEARCH_ENDPOINT"],
                    "key": os.environ["SEARCH_KEY"],
                    "indexName": os.environ["SEARCH_INDEX_NAME"],
                }
            }
        ]
    },
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta

    if delta.role:
        print("\n" + delta.role + ": ", end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)
    if delta.model_extra.get("context"):
        print(f"Context: {delta.model_extra['context']}", end="", flush=True)
```
---
# Source: https://developers.openai.com/apps-sdk/build/chatgpt-ui.md
# Build your ChatGPT UI
## Overview
UI components turn structured tool results from your MCP server into a human-friendly UI. Your components run inside an iframe in ChatGPT, talk to the host via the `window.openai` API, and render inline with the conversation. This guide describes how to structure your component project, bundle it, and wire it up to your MCP server.
You can also check out the [examples repository on GitHub](https://github.com/openai/openai-apps-sdk-examples).
### Component library
Use the optional UI kit at [apps-sdk-ui](https://openai.github.io/apps-sdk-ui) for ready-made buttons, cards, input controls, and layout primitives that match ChatGPT’s container. It saves time when you want consistent styling without rebuilding base components.
## Understand the `window.openai` API
The host injects `window.openai` with UI-related globals and methods for calling tools, sending follow-ups, and managing layout. In your widget, read values directly from `window.openai` (e.g., `window.openai.toolOutput`, `window.openai.locale`) or through helper hooks like `useOpenAiGlobal` shown later.
`window.openai` is the bridge between your frontend and ChatGPT. For the full API reference, see [Apps SDK Reference](https://developers.openai.com/apps-sdk/reference#windowopenai-component-bridge).
### useOpenAiGlobal
Many Apps SDK projects wrap `window.openai` access in small hooks so views remain testable. This example hook listens for host `openai:set_globals` events and lets React components subscribe to a single global value:
```ts
export function useOpenAiGlobal<K extends keyof OpenAiGlobals>(
  key: K
): OpenAiGlobals[K] {
  return useSyncExternalStore(
    (onChange) => {
      const handleSetGlobal = (event: SetGlobalsEvent) => {
        const value = event.detail.globals[key];
        if (value === undefined) {
          return;
        }

        onChange();
      };

      window.addEventListener(SET_GLOBALS_EVENT_TYPE, handleSetGlobal, {
        passive: true,
      });

      return () => {
        window.removeEventListener(SET_GLOBALS_EVENT_TYPE, handleSetGlobal);
      };
    },
    () => window.openai[key]
  );
}
```
`useOpenAiGlobal` is an important primitive to make your app reactive to changes in display mode, theme, and "props" via subsequent tool calls.
For example, read the tool input, output, and metadata:
```ts
export function useToolInput() {
  return useOpenAiGlobal("toolInput");
}

export function useToolOutput() {
  return useOpenAiGlobal("toolOutput");
}

export function useToolResponseMetadata() {
  return useOpenAiGlobal("toolResponseMetadata");
}
```
### Persist component state, expose context to ChatGPT
Widget state can be used to persist data across user sessions and to expose data to ChatGPT. Anything you pass to `setWidgetState` is shown to the model and hydrated into `window.openai.widgetState`.
Widget state is scoped to the specific widget instance that lives on a single conversation message. When your component calls `window.openai.setWidgetState(payload)`, the host stores that payload under that widget’s `message_id/widgetId` pair and rehydrates it only for that widget. The state does not travel across the whole conversation or between different widgets.
Follow-up turns keep the same widget (and therefore the same state) only when the user submits through that widget’s controls—inline follow-ups, PiP composer, or fullscreen composer. If the user types into the main chat composer, the request is treated as a new widget run with a fresh `widgetId` and empty `widgetState`.
Anything you pass to `setWidgetState` is sent to the model, so keep the payload focused and well under 4k [tokens](https://platform.openai.com/tokenizer) for performance.
### Trigger server actions
`window.openai.callTool` lets the component directly make MCP tool calls. Use this for direct manipulations (refresh data, fetch nearby restaurants). Design tools to be idempotent where possible and return updated structured content that the model can reason over in subsequent turns.
Please note that your tool needs to be marked as [able to be initiated by the component](https://developers.openai.com/apps-sdk/build/mcp-server#allow-component-initiated-tool-access).
```tsx
async function refreshPlaces(city: string) {
await window.openai?.callTool("refresh_pizza_list", { city });
}
```
### Send conversational follow-ups
Use `window.openai.sendFollowUpMessage` to insert a message into the conversation as if the user asked it.
```tsx
await window.openai?.sendFollowUpMessage({
prompt: "Draft a tasting itinerary for the pizzerias I favorited.",
});
```
### Upload files from the widget
Use `window.openai.uploadFile(file)` to upload a user-selected file and receive a `fileId`. This currently supports `image/png`, `image/jpeg`, and `image/webp`.
```tsx
function FileUploadInput() {
  return (
    <input
      type="file"
      onChange={async (event) => {
        const file = event.currentTarget.files?.[0];
        if (!file || !window.openai?.uploadFile) {
          return;
        }

        const { fileId } = await window.openai.uploadFile(file);
        console.log("Uploaded fileId:", fileId);
      }}
    />
  );
}
```
### Download files in the widget
Use `window.openai.getFileDownloadUrl({ fileId })` to retrieve a temporary URL for files that were uploaded by the widget or passed to your tool via file params.
```tsx
const { downloadUrl } = await window.openai.getFileDownloadUrl({ fileId });
imageElement.src = downloadUrl;
```
### Close the widget
You can close the widget two ways: from the UI by calling `window.openai.requestClose()`, or from the server by having your tool response set `metadata.openai/closeWidget: true`, which instructs the host to hide the widget when that response arrives:
```json
{
  "role": "tool",
  "tool_call_id": "abc123",
  "content": "...",
  "metadata": {
    "openai/closeWidget": true,
    "openai/widgetDomain": "https://myapp.example.com",
    "openai/widgetCSP": {
      "connect_domains": ["https://api.myapp.example.com"],
      "resource_domains": ["https://*.oaistatic.com"],
      "redirect_domains": ["https://checkout.example.com"], // Optional: allow openExternal redirects + return link
      "frame_domains": ["https://*.example.com"] // Optional: allow iframes from these domains
    }
  }
}
```
Note: By default, widgets cannot render subframes. Setting `frame_domains` relaxes this and allows your widget to embed iframes from those origins. Apps that use `frame_domains` are subject to stricter review and are likely to be rejected for broad distribution unless iframe content is core to the use case.
If you want `window.openai.openExternal` to send users to an external flow (like checkout) and enable a return link to the same conversation, optionally add the destination origin to `redirect_domains`. ChatGPT will skip the safe-link modal and append a `redirectUrl` query parameter to the destination so you can route the user back into ChatGPT.
### Widget session ID
The host includes a per-widget identifier in tool response metadata as `openai/widgetSessionId`. Use it to correlate multiple tool calls or logs for the same widget instance while it remains mounted.
### Request alternate layouts
If the UI needs more space—like maps, tables, or embedded editors—ask the host to change the container. `window.openai.requestDisplayMode` negotiates inline, PiP, or fullscreen presentations.
```tsx
await window.openai?.requestDisplayMode({ mode: "fullscreen" });
// Note: on mobile, PiP may be coerced to fullscreen
```
### Open a modal
Use `window.openai.requestModal` to open a host-controlled modal. You can pass a different UI template from the same app by providing the template URI that you registered on your MCP server with `registerResource`, or omit `template` to open the current one.
```tsx
await window.openai.requestModal({
template: "ui://widget/checkout.html",
});
```
### Use host-backed navigation
Skybridge (the sandbox runtime) mirrors the iframe’s history into ChatGPT’s UI. Use standard routing APIs—such as React Router—and the host will keep navigation controls in sync with your component.
Router setup (React Router’s `BrowserRouter`):
```ts
// Component names in the JSX below are illustrative placeholders.
export default function PizzaListRouter() {
  return (
    <BrowserRouter>
      <Routes>
        <Route path="/" element={<PizzaList />}>
          <Route path="place/:placeId" element={<PizzaDetails />} />
        </Route>
      </Routes>
    </BrowserRouter>
  );
}
```
Programmatic navigation:
```ts
const navigate = useNavigate();
function openDetails(placeId: string) {
navigate(`place/${placeId}`, { replace: false });
}
function closeDetails() {
navigate("..", { replace: true });
}
```
## Scaffold the component project
Now that you understand the `window.openai` API, it's time to scaffold your component project.
As best practice, keep the component code separate from your server logic. A common layout is:
```
app/
  server/              # MCP server (Python or Node)
  web/                 # Component bundle source
    package.json
    tsconfig.json
    src/component.tsx
    dist/component.js  # Build output
```
Create the project and install dependencies (Node 18+ recommended):
```bash
cd app/web
npm init -y
npm install react@^18 react-dom@^18
npm install -D typescript esbuild
```
If your component requires drag-and-drop, charts, or other libraries, add them now. Keep the dependency set lean to reduce bundle size.
## Author the React component
Your entry file should mount a component into a `root` element and read initial data from `window.openai.toolOutput` or persisted state.
We provide example apps on the [examples page](https://developers.openai.com/apps-sdk/build/examples#pizzaz-list-source), such as a "Pizza list" app that renders a list of pizza restaurants.
### Explore the Pizzaz component gallery
We provide a number of example components in the [Apps SDK examples](https://developers.openai.com/apps-sdk/build/examples). Treat them as blueprints when shaping your own UI:
- **Pizzaz List** – ranked card list with favorites and call-to-action buttons.

- **Pizzaz Carousel** – embla-powered horizontal scroller that demonstrates media-heavy layouts.

- **Pizzaz Map** – Mapbox integration with fullscreen inspector and host state sync.

- **Pizzaz Album** – stacked gallery view built for deep dives on a single place.

- **Pizzaz Video** – scripted player with overlays and fullscreen controls.
Each example shows how to bundle assets, wire host APIs, and structure state for real conversations. Copy the one closest to your use case and adapt the data layer for your tool responses.
### React helper hooks
You can use `useOpenAiGlobal` inside a `useWidgetState` hook to keep host-persisted widget state aligned with your local React state:
```ts
export function useWidgetState<T extends WidgetState>(
  defaultState: T | (() => T)
): readonly [T, (state: SetStateAction<T>) => void];
export function useWidgetState<T extends WidgetState>(
  defaultState?: T | (() => T | null) | null
): readonly [T | null, (state: SetStateAction<T | null>) => void];
export function useWidgetState<T extends WidgetState>(
  defaultState?: T | (() => T | null) | null
): readonly [T | null, (state: SetStateAction<T | null>) => void] {
  const widgetStateFromWindow = useOpenAiGlobal("widgetState") as T;

  const [widgetState, _setWidgetState] = useState<T | null>(() => {
    if (widgetStateFromWindow != null) {
      return widgetStateFromWindow;
    }

    return typeof defaultState === "function"
      ? defaultState()
      : (defaultState ?? null);
  });

  useEffect(() => {
    _setWidgetState(widgetStateFromWindow);
  }, [widgetStateFromWindow]);

  const setWidgetState = useCallback(
    (state: SetStateAction<T | null>) => {
      _setWidgetState((prevState) => {
        const newState = typeof state === "function" ? state(prevState) : state;

        if (newState != null) {
          window.openai.setWidgetState(newState);
        }

        return newState;
      });
    },
    [window.openai.setWidgetState]
  );

  return [widgetState, setWidgetState] as const;
}
```
The hooks above make it easy to read the latest tool output, layout globals, or widget state directly from React components while still delegating persistence back to ChatGPT.
## Widget localization
The host passes `locale` in `window.openai` and mirrors it to `document.documentElement.lang`. It is up to your widget to use that locale to load translations and format dates/numbers. A simple pattern with `react-intl`:
```tsx
const messages: Record<string, Record<string, string>> = {
  "en-US": en,
  "es-ES": es,
};

export function App() {
  const locale = window.openai.locale ?? "en-US";

  return (
    <IntlProvider locale={locale} messages={messages[locale] ?? messages["en-US"]}>
      {/* Render UI with <FormattedMessage /> or useIntl() */}
    </IntlProvider>
  );
}
```
## Bundle for the iframe
Once you are done writing your React component, you can build it into a single JavaScript module that the server can inline:
```json
// package.json
{
  "scripts": {
    "build": "esbuild src/component.tsx --bundle --format=esm --outfile=dist/component.js"
  }
}
```
Run `npm run build` to produce `dist/component.js`. If esbuild complains about missing dependencies, confirm you ran `npm install` in the `web/` directory and that your imports match installed package names (e.g., `@react-dnd/html5-backend` vs `react-dnd-html5-backend`).
## Embed the component in the server response
See the [Set up your server docs](https://developers.openai.com/apps-sdk/build/mcp-server) for how to embed the component in your MCP server response.
Component UI templates are the recommended path for production.
During development you can rebuild the component bundle whenever your React code changes and hot-reload the server.
---
# Source: https://developers.openai.com/resources/code/chatkit-advanced-samples.md
# ChatKit advanced samples
> Advanced samples showcasing the capabilities of ChatKit (part of AgentKit).
- Type: Code
- Tags: chatkit, agentkit, agents, customer-service, knowledge-assistant, ad-generation
- URL: https://github.com/openai/openai-chatkit-advanced-samples
- Created: 2025-10-06
- Updated: 2025-10-06
## Summary
Demonstrates advanced use cases for ChatKit (part of AgentKit) with custom ChatKit server integrations.
## Details
Provides example workflows showcasing the capabilities of ChatKit (part of AgentKit) with custom ChatKit server integrations for different use cases.
---
# Source: https://developers.openai.com/resources/code/chatkit-starter-app.md
# ChatKit starter app
> Integrate ChatKit with an Agent Builder workflow in your application.
- Type: Code
- Tags: chatkit, agentkit, agents
- URL: https://github.com/openai/openai-chatkit-starter-app
- Created: 2025-10-06
- Updated: 2025-10-06
## Summary
Demonstrates how to use ChatKit (part of AgentKit) to build agents easily in your own applications.
## Details
Provides example workflows utilizing the ChatKit API to build agents in your own applications.
---
# Source: https://developers.openai.com/commerce/specs/checkout.md
# Agentic Checkout Spec
## Overview
Enable merchants to run end-to-end checkout flows inside ChatGPT while keeping orders, payments, and compliance on their existing commerce stack.
**How it works**
1. Create session (REST). ChatGPT calls your `POST /checkout_sessions` to start a session with cart contents and buyer context; your response must include a rich, authoritative cart state.
2. Update session (REST). As the user changes items, shipping, or discounts, ChatGPT calls `POST /checkout_sessions/{checkout_session_id}`; each response returns the full cart state for display and validation.
3. Order events (webhooks). Your system publishes order lifecycle events (e.g., `order_created`, `order_updated`) to the provided webhook so ChatGPT stays in sync with fulfillment-grade truth.
4. Complete checkout (REST). ChatGPT finalizes via `POST /checkout_sessions/{checkout_session_id}/complete`; you confirm order creation and return the final cart and order identifiers.
5. Optionally, cancel checkouts using `POST /checkout_sessions/{checkout_session_id}/cancel` and get checkout information with `GET /checkout_sessions/{checkout_session_id}`.
6. Payments on your rails. You process payment with your existing PSP; if using Delegated Payments, accept the token and apply your normal authorization/capture flow.
**Key points**
- **Required endpoints.** Implement create, update, and complete checkout session REST endpoints; all responses must return a rich cart state (items, pricing, taxes/fees, shipping, discounts, totals, status).
- **Authoritative webhooks.** Emit order events to the provided webhook to keep state consistent across retries and edge cases.
- **Keep payments where they are.** Use your current PSP and settlement processes; integrate Delegated Payments only if applicable.
- **Security and robustness.** Authenticate every request, verify signatures, enforce idempotency, validate inputs, and support safe retries.
- **Certify integration.** Pass conformance checks (schema, error codes, rate limits, webhook delivery) to ensure reliable in-ChatGPT checkout.
## Checkout session
For users to place an order through ChatGPT, you must create, update, and complete a Checkout session. This Checkout session holds the items to be purchased, fulfillment information, and payment information.
As the user progresses through the checkout flow, the Checkout session is updated and moves between various states.
The response to update calls should return all checkout options, messages, and errors to be displayed to the user. Once the customer clicks “Buy”, the checkout session is completed with a selected payment method.

## REST endpoints
Merchants must implement the following five endpoints to place orders on behalf of ChatGPT users.
In the future, the Agentic Checkout Spec will support MCP servers.
### Common features of all endpoints
All endpoints must use HTTPS and return JSON.
#### Request headers
All endpoints will be called with the following headers set:
| Field | Description | Example Value |
| :-------------- | :-------------------------------------------------------- | :---------------------------------------------- |
| Authorization | API Key used to make requests | `Bearer api_key_123` |
| Accept-Language | The preferred locale for content like messages and errors | `en-US` |
| User-Agent | Information about the client making this request | `ChatGPT/2.0 (Mac OS X 15.0.1; arm64; build 0)` |
| Idempotency-Key | Key used to ensure requests are idempotent | `idempotency_key_123` |
| Request-Id | Unique key for each request for tracing purposes | `request_id_123` |
| Content-Type | Type of request content | `application/json` |
| Signature | Base64 encoded signature of the request body | `eyJtZX...` |
| Timestamp | Formatted as an RFC 3339 string. | 2025-09-25T10:30:00Z |
| API-Version | API version | 2025-09-12 |
#### Response headers
| Field | Description | Example Value |
| :-------------- | :------------------------------------ | :-------------------- |
| Idempotency-Key | Idempotency key passed in the request | `idempotency_key_123` |
| Request-Id | Request ID passed in the request | `request_id_123` |
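The sketch below is illustrative only and not part of the spec: it shows one way a merchant server might honor the `Idempotency-Key` header by caching the first response for each key and echoing the tracing headers back. The in-memory cache and the `process` callback are hypothetical stand-ins for your own storage and business logic.
```python
# Hypothetical helper, not part of the spec: replay cached responses for
# repeated Idempotency-Key values and echo tracing headers back.
idempotency_cache: dict[str, dict] = {}

def handle_checkout_request(headers: dict, body: dict, process) -> tuple[dict, dict]:
    idempotency_key = headers["Idempotency-Key"]
    request_id = headers["Request-Id"]

    if idempotency_key in idempotency_cache:
        # Safe retry: return the response produced by the first attempt.
        response_body = idempotency_cache[idempotency_key]
    else:
        response_body = process(body)  # e.g. create or update the checkout session
        idempotency_cache[idempotency_key] = response_body

    # Echo the tracing headers, as described in the response header table above.
    response_headers = {"Idempotency-Key": idempotency_key, "Request-Id": request_id}
    return response_headers, response_body
```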
### POST /checkout_sessions
Call direction: OpenAI -> Merchant
This is the initial call to create a checkout session. The call will contain information about the items the customer wishes to purchase and should return line item information, along with any messages or errors to be displayed to the customer. It should always return a checkout session id. All responses should be returned with a 201 status.
#### Request
| Field | Type | Required | Description | Validation |
| :------------------ | :--------- | :------- | :---------------------------------------------------------- | :------------------------- |
| buyer | Buyer | No | Optional information about the buyer. | None |
| items | List[Item] | Yes | The initial list of items to initiate the checkout session. | Should be a non-empty list |
| fulfillment_address | Address | No | Optional fulfillment address if present. | None |
#### Response
| Field | Type | Required | Description | Validation |
| :-------------------- | :---------------------- | :------- | :------------------------------------------------------------------------------------------------------------------------------ | :------------------------------------------------ |
| id | String | Yes | Unique id that identifies the checkout session. This id will be used to update the checkout session in subsequent calls. | None |
| buyer | Buyer | No | Buyer information, if provided | None |
| payment_provider | PaymentProvider | Yes | Payment provider that will be used to complete this transaction. | None |
| status | String enum | Yes | Current status of the checkout session. Possible values are: `not_ready_for_payment` `ready_for_payment` `completed` `canceled` | None |
| currency | String | Yes | Currency code as per the ISO 4217 standard | Should follow the ISO 4217 standard in lower case |
| line_items | List[LineItem] | Yes | List of items and computed costs. | None |
| fulfillment_address | Address | No | Address to ship items to. | None |
| fulfillment_options | List[FulfillmentOption] | Yes | All available fulfillment options and associated costs. | None |
| fulfillment_option_id | String | No | Id of the selected fulfillment option. | None |
| totals | List[Total] | Yes | List of totals. | None |
| messages | List[Message] | Yes | List of informational and error messages to be displayed to the customer. | None |
| links | List[Link] | Yes | List of links (e.g. ToS/privacy policy/etc.) to be displayed to the customer. | None |
#### Examples
1. Creating a checkout session with a single item and quantity. No fulfillment address is provided, so the checkout cannot be completed.
```json
POST Request to /checkout_sessions
{
"items": [
{
"id": "item_123",
"quantity": 1
}
]
}
```
```json
Response
{
"id": "checkout_session_123",
"payment_provider": {
"provider": "stripe",
"supported_payment_methods": ["card"]
},
"status": "in_progress",
"currency": "usd",
"line_items": [
{
"id": "line_item_123",
"item": {
"id": "item_123",
"quantity": 1
},
"base_amount": 300,
"discount": 0,
"subtotal": 300,
"tax": 30,
"total": 330
}
],
"totals": [
{
"type": "items_base_amount",
"display_text": "Item(s) total",
"amount": 300
},
{
"type": "subtotal",
"display_text": "Subtotal",
"amount": 300
},
{
"type": "tax",
"display_text": "Tax",
"amount": "0.30"
},
{
"type": "total",
"display_text": "Total",
"amount": 330
}
],
"fulfillment_options": [],
"messages": [
{
"type": "error",
"code": "out_of_stock",
"path": "$.line_items[0]",
"content_type": "plain",
"content": "This item is not available for sale.",
}
],
"links": [
{
"type": "terms_of_use",
"url": "https://www.testshop.com/legal/terms-of-use"
}
]
}
```
2. Creating a checkout session with a single item and quantity, and a provided fulfillment address. Since a fulfillment address is provided, taxes are returned as well. Fulfillment options are also available, and the cheapest one is selected by default. Any messages to show to the customer based on their fulfillment address (e.g. CA 65 warning) are also returned.
```json
POST Request to /checkout_sessions
{
"items": [
{
"id": "item_456",
"quantity": 1
}
],
"fulfillment_address": {
"name": "test",
"line_one": "1234 Chat Road",
"line_two": "Apt 101",
"city": "San Francisco",
"state": "CA",
"country": "US",
"postal_code": "94131"
}
}
```
```json
Response
{
"id": "checkout_session_123",
"payment_provider": {
"provider": "stripe",
"supported_payment_methods": ["card"]
},
"status": "ready_for_payment",
"currency": "usd",
"line_items": [
{
"id": "line_item_456",
"item": {
"id": "item_456",
"quantity": 1
},
"base_amount": 300,
"discount": 0,
"subtotal": 0,
"tax": 30,
"total": 330
}
],
"fulfillment_address": {
"name": "test",
"line_one": "1234 Chat Road",
"line_two": "Apt 101",
"city": "San Francisco",
"state": "CA",
"country": "US",
"postal_code": "94131"
},
"fulfillment_option_id": "fulfillment_option_123",
"totals": [
{
"type": "items_base_amount",
"display_text": "Item(s) total",
"amount": 300
},
{
"type": "subtotal",
"display_text": "Subtotal",
"amount": 300
},
{
"type": "tax",
"display_text": "Tax",
"amount": 30
},
{
"type": "fulfillment",
"display_text": "Fulfillment",
"amount": 100
},
{
"type": "total",
"display_text": "Total",
"amount": 430
}
],
"fulfillment_options": [
{
"type": "shipping",
"id": "fulfillment_option_123",
"title": "Standard",
"subtitle": "Arrives in 4-5 days",
"carrier": "USPS",
"earliest_delivery_time": "2025-10-12T07:20:50.52Z",
"latest_delivery_time": "2025-10-13T07:20:50.52Z",
"subtotal": 100,
"tax": 0,
"total": 100
},
{
"type": "shipping",
"id": "fulfillment_option_456",
"title": "Express",
"subtitle": "Arrives in 1-2 days",
"carrier": "USPS",
"earliest_delivery_time": "2025-10-09T07:20:50.52Z",
"latest_delivery_time": "2025-10-10T07:20:50.52Z",
"subtotal": 500,
"tax": 0,
"total": 500
}
],
"messages": [],
"links": [
{
"type": "terms_of_use",
"url": "https://www.testshop.com/legal/terms-of-use"
}
]
}
```
### POST `/checkout_sessions/{checkout_session_id}`
Call direction: OpenAI -> Merchant
This endpoint will be called on checkout session updates, such as a change in fulfillment address or fulfillment option. The endpoint should return updated costs, new options (e.g. new fulfillment options based on update in fulfillment address), and any new errors.
#### Request
| Field | Type | Required | Description | Validation |
| :-------------------- | :--------- | :------- | :-------------------------------------------------------------------- | :--------- |
| buyer | Buyer | No | Optional information about the buyer. | None |
| items | List[Item] | No | Optional list of updated items to be purchased. | None |
| fulfillment_address | Address | No | Newly added or updated fulfillment address specified by the customer. | None |
| fulfillment_option_id | String | No | Id of the fulfillment option specified by the customer. | None |
#### Response
| Field | Type | Required | Description | Validation |
| :-------------------- | :---------------------- | :------- | :------------------------------------------------------------------------------------------------------------------------------ | :------------------------------------------------ |
| id | String | Yes | Unique id that identifies the checkout session. This id will be used to update the checkout session in subsequent calls. | None |
| buyer | Buyer | No | Buyer information, if provided | None |
| status | String enum | Yes | Current status of the checkout session. Possible values are: `not_ready_for_payment` `ready_for_payment` `completed` `canceled` | None |
| currency | String | Yes | Currency code as per the ISO 4217 standard | Should follow the ISO 4217 standard in lower case |
| line_items | List[LineItem] | Yes | List of items and computed costs. | None |
| fulfillment_address | Address | No | Address to ship items to. | None |
| fulfillment_options | List[FulfillmentOption] | Yes | All available fulfillment options and associated costs. | None |
| fulfillment_option_id | String | No | Id of the selected fulfillment option. | None |
| totals | List[Total] | Yes | List of totals. | None |
| messages | List[Message] | Yes | List of informational and error messages to be displayed to the customer. | None |
| links | List[Link] | Yes | List of links (e.g. ToS/privacy policy/etc.) to be displayed to the customer. | None |
#### Example
Updating the fulfillment option updates the checkout session totals.
```json
POST Request to /checkout_sessions/checkout_session_123
{
"fulfillment_option_id": "fulfillment_option_456"
}
```
```json
Response
{
"id": "checkout_session_123",
"status": "ready_for_payment",
"currency": "usd",
"line_items": [
{
"id": "line_item_456",
"item": {
"id": "item_456",
"quantity": 1
},
"base_amount": 300,
"discount": 0,
"subtotal": 0,
"tax": 30,
"total": 330
}
],
"fulfillment_address": {
"name": "test",
"line_one": "1234 Chat Road",
"line_two": "Apt 101",
"city": "San Francisco",
"state": "CA",
"country": "US",
"postal_code": "94131"
},
"fulfillment_option_id": "fulfillment_option_456",
"totals": [
{
"type": "items_base_amount",
"display_text": "Item(s) total",
"amount": 300
},
{
"type": "subtotal",
"display_text": "Subtotal",
"amount": 300
},
{
"type": "tax",
"display_text": "Tax",
"amount": 30
},
{
"type": "fulfillment",
"display_text": "Fulfillment",
"amount": 500
},
{
"type": "total",
"display_text": "Total",
"amount": 830
}
],
"fulfillment_options": [
{
"type": "shipping",
"id": "fulfillment_option_123",
"title": "Standard",
"subtitle": "Arrives in 4-5 days",
"carrier": "USPS",
"earliest_delivery_time": "2025-10-12T07:20:50.52Z",
"latest_delivery_time": "2025-10-13T07:20:50.52Z",
"subtotal": 100,
"tax": 0,
"total": 100
},
{
"type": "shipping",
"id": "fulfillment_option_456",
"title": "Express",
"subtitle": "Arrives in 1-2 days",
"carrier": "USPS",
"earliest_delivery_time": "2025-10-09T07:20:50.52Z",
"latest_delivery_time": "2025-10-10T07:20:50.52Z",
"subtotal": 500,
"tax": 0,
"total": 500
}
],
"messages": [],
"links": [
{
"type": "terms_of_use",
"url": "https://www.testshop.com/legal/terms-of-use"
}
]
}
```
### POST `/checkout_sessions/{checkout_session_id}/complete`
Call direction: OpenAI -> Merchant
The endpoint will be called with the payment method to complete the purchase. It is expected that the checkout session will be completed and an order will be created after this call. Any errors that prevent this from happening should be returned in the response.
#### Request
| Field | Type | Required | Description | Validation |
| :----------- | :---------- | :------- | :-------------------------------------------------- | :--------- |
| buyer | Buyer | No | Optional information about the buyer. | None |
| payment_data | PaymentData | Yes | Payment data used to complete the checkout session. | None |
#### Response
| Field | Type | Required | Description | Validation |
| :-------------------- | :---------------------- | :------- | :------------------------------------------------------------------------------------------------------------------------------ | :------------------------------------------------ |
| id | String | Yes | Unique id that identifies the checkout session. This id will be used to update the checkout session in subsequent calls. | None |
| buyer | Buyer | Yes | Buyer information | None |
| status | String enum | Yes | Current status of the checkout session. Possible values are: `not_ready_for_payment` `ready_for_payment` `completed` `canceled` | None |
| currency | String | Yes | Currency code as per the ISO 4217 standard | Should follow the ISO 4217 standard in lower case |
| line_items | List[LineItem] | Yes | List of items and computed costs. | None |
| fulfillment_address | Address | No | Address to ship items to. | None |
| fulfillment_options | List[FulfillmentOption] | Yes | All available fulfillment options and associated costs. | None |
| fulfillment_option_id | String | No | Id of the selected fulfillment option. | None |
| totals | List[Total] | Yes | List of totals. | None |
| order | Order | No | Order that is created after the checkout session completes. | None |
| messages | List[Message] | Yes | List of informational and error messages to be displayed to the customer. | None |
| links | List[Link] | Yes | List of links (e.g. ToS/privacy policy/etc.) to be displayed to the customer. | None |
#### Example
Completing the checkout session with an encrypted payload representing the payment method.
```json
POST Request to /checkout_sessions/checkout_session_123/complete
{
"buyer": {
"name": "John Smith",
"email": "johnsmith@mail.com",
"phone_number": "+15552003434"
},
"payment_data": {
"token": "spt_123",
"provider": "stripe",
"billing_address": {
"name": "test",
"line_one": "1234 Chat Road",
"line_two": "Apt 101",
"city": "San Francisco",
"state": "CA",
"country": "US",
"postal_code": "94131",
"phone_number": "+15552428478"
}
}
}
```
```json
Response
{
"id": "checkout_session_123",
"buyer": {
"name": "John Smith",
"email": "johnsmith@mail.com",
"phone_number": "+15552003434"
},
"status": "completed",
"currency": "usd",
"line_items": [
{
"id": "line_item_456",
"item": {
"id": "item_456",
"quantity": 1
},
"base_amount": 300,
"discount": 0,
"subtotal": 300,
"tax": 30,
"total": 330
}
],
"fulfillment_address": {
"name": "test",
"line_one": "1234 Chat Road",
"line_two": "Apt 101",
"city": "San Francisco",
"state": "CA",
"country": "US",
"postal_code": "94131"
},
"fulfillment_option_id": "fulfillment_option_123",
"totals": [
{
"type": "items_base_amount",
"display_text": "Item(s) total",
"amount": 300
},
{
"type": "subtotal",
"display_text": "Subtotal",
"amount": 300
},
{
"type": "tax",
"display_text": "Tax",
"amount": 30
},
{
"type": "fulfillment",
"display_text": "Fulfillment",
"Amount": 100
},
{
"type": "total",
"display_text": "Total",
"amount": 430
}
],
"fulfillment_options": [
{
"type": "shipping",
"id": "fulfillment_option_123",
"title": "Standard",
"subtitle": "Arrives in 4-5 days",
"carrier": "USPS",
"earliest_delivery_time": "2025-10-12T07:20:50.52Z",
"latest_delivery_time": "2025-10-13T07:20:50.52Z",
"subtotal": 100,
"tax": 0,
"total": 100
},
{
"type": "shipping",
"id": "fulfillment_option_456",
"title": "Express",
"subtitle": "Arrives in 1-2 days",
"carrier": "USPS",
"earliest_delivery_time": "2025-10-09T07:20:50.52Z",
"latest_delivery_time": "2025-10-10T07:20:50.52Z",
"subtotal": 500,
"tax": 0,
"total": 500
}
],
"messages": [],
"links": [
{
"type": "terms_of_use",
"url": "https://www.testshop.com/legal/terms-of-use"
}
]
}
```
### POST `/checkout_sessions/{checkout_session_id}/cancel`
This endpoint will be used to cancel a checkout session, if it can be canceled. If the checkout session cannot be canceled (e.g. if the checkout session is already canceled or completed), then the server should send back a response with status 405. Any checkout session with a status that is not equal to completed or canceled should be cancelable.
#### Request
None
#### Response
| Field | Type | Required | Description | Validation |
| :-------------------- | :---------------------- | :------- | :------------------------------------------------------------------------------------------------------------------------------ | :------------------------------------------------ |
| id | String | Yes | Unique id that identifies the checkout session. This id will be used to update the checkout session in subsequent calls. | None |
| buyer | Buyer | No | Buyer information, if provided | None |
| status | String enum | Yes | Current status of the checkout session. Possible values are: `not_ready_for_payment` `ready_for_payment` `completed` `canceled` | None |
| currency | String | Yes | Currency code as per the ISO 4217 standard | Should follow the ISO 4217 standard in lower case |
| line_items | List[LineItem] | Yes | List of items and computed costs. | None |
| fulfillment_address | Address | No | Address to ship items to. | None |
| fulfillment_options | List[FulfillmentOption] | Yes | All available fulfillment options and associated costs. | None |
| fulfillment_option_id | String | No | Id of the selected fulfillment option. | None |
| totals | List[Total] | Yes | List of totals. | None |
| messages | List[Message] | Yes | List of informational and error messages to be displayed to the customer. | None |
| links | List[Link] | Yes | List of links (e.g. ToS/privacy policy/etc.) to be displayed to the customer. | None |
### GET `/checkout_sessions/{checkout_session_id}`
This endpoint is used to return up-to-date information about the checkout session. If the checkout session is not found, then the server should return a response with status 404.
#### Request
None
#### Response
| Field | Type | Required | Description | Validation |
| :-------------------- | :---------------------- | :------- | :------------------------------------------------------------------------------------------------------------------------------ | :------------------------------------------------ |
| id | String | Yes | Unique id that identifies the checkout session. This id will be used to update the checkout session in subsequent calls. | None |
| buyer | Buyer | No | Buyer information, if provided | None |
| status | String enum | Yes | Current status of the checkout session. Possible values are: `not_ready_for_payment` `ready_for_payment` `completed` `canceled` | None |
| currency | String | Yes | Currency code as per the ISO 4217 standard | Should follow the ISO 4217 standard in lower case |
| line_items | List[LineItem] | Yes | List of items and computed costs. | None |
| fulfillment_address | Address | No | Address to ship items to. | None |
| fulfillment_options | List[FulfillmentOption] | Yes | All available fulfillment options and associated costs. | None |
| fulfillment_option_id | String | No | Id of the selected fulfillment option. | None |
| totals | List[Total] | Yes | List of totals. | None |
| messages | List[Message] | Yes | List of informational and error messages to be displayed to the customer. | None |
| links | List[Link] | Yes | List of links (e.g. ToS/privacy policy/etc.) to be displayed to the customer. | None |
### Response Errors
If the server is unable to return a 201 response, then it should return an error of the following shape with a 4xx/5xx status.
#### Error
| Field | Type | Required | Description |
| :------ | :---------- | :------- | :--------------------------------------------------------------------- |
| type | String enum | Yes | Error type. Possible values are: `invalid_request` |
| code | String enum | Yes | Error code. Possible values are: `request_not_idempotent` |
| message | String | Yes | Human‑readable description of the error. |
| param | String | No | JSONPath referring to the offending request body field, if applicable. |
## Object definitions
### Item
| Field | Type | Required | Description | Example Value | Validation |
| :------- | :----- | :------- | :------------------------------------------------- | :------------ | :------------------------------------------- |
| id | string | Yes | Id of a piece of merchandise that can be purchased | `"itm_123"` | None |
| quantity | int | Yes | Quantity of the item for fulfillment | `1` | Should be a positive integer greater than 0. |
### Address
| Field | Type | Required | Description | Validation |
| :----------- | :----- | :------- | :----------------------------------------------- | :------------------------------------ |
| name | String | Yes | Name of the person to whom the items are shipped | Max. length is 256 |
| line_one | String | Yes | First line of address | Max. length is 60 |
| line_two | String | No | Optional second line of address | Max. length is 60 |
| city | String | Yes | Address city/district/suburb/town/village. | Max. length is 60 |
| state | String | Yes | Address state/county/province/region. | Should follow the ISO 3166-1 standard |
| country | String | Yes | Address country | Should follow the ISO 3166-1 standard |
| postal_code | String | Yes | Address postal code or zip code | Max. length is 20 |
| phone_number | String | No | Optional phone number | Follows the E.164 standard |
### PaymentProvider
| Field | Type | Required | Description | Validation |
| :------------------------ | :---------------- | :------- | :--------------------------------------------------------------------------------------------- | :--------- |
| provider | String enum | Yes | String value representing payment processor. Possible values are: `stripe` `adyen` `braintree` | None |
| supported_payment_methods | List[String enum] | Yes | List of payment methods that the merchant is willing to accept. Possible values are: `card` | None |
### Message (type = info)
| Field | Type | Required | Description | Validation |
| :----------- | :---------- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------- |
| type | String | Yes | String value representing the type of message. For an informational message, the type should be `info`. | None |
| param | String | Yes | RFC 9535 JSONPath to the component of the checkout session that the message is referring to. For instance, if the message is referring to the second line item, the path would be `$.line_items[1]`. | None |
| content_type | String enum | Yes | Type of the message content for rendering purposes. Possible values are: `plain` `markdown` | None |
| content | String | Yes | Raw message content. | None |
### Message (type = error)
| Field | Type | Required | Description | Validation |
| :----------- | :---------- | :------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------- |
| type | String | Yes | String value representing the type of message. For an error message, the type should be `error`. | None |
| code | String enum | Yes | Error code. Possible values are: `missing` `invalid` `out_of_stock` `payment_declined` `requires_sign_in` `requires_3ds` | None |
| param | String | No | RFC 9535 JSONPath to the component of the checkout session that the message is referring to. For instance, if the message is referring to the second line item, the path would be `$.line_items[1]`. | None |
| content_type | String enum | Yes | Type of the message content for rendering purposes. Possible values are: `plain` `markdown` | None |
| content | String | Yes | Raw message content. | None |
### Link
| Field | Type | Required | Description | Validation |
| :---- | :----------- | :------- | :-------------------------------------------------------------------------------------------- | :--------- |
| type | Enum(String) | Yes | Type of the link. Possible values are: `terms_of_use` `privacy_policy` `seller_shop_policies` | None |
| url | String | Yes | Link content specified as a URL. | None |
### Buyer
| Field | Type | Required | Description | Validation |
| :----------- | :----- | :------- | :------------------------------------------------------- | :------------------------- |
| name | String | Yes | Name of the buyer. | Max. length is 256 |
| email | String | Yes | Email address of the buyer to be used for communication. | Max. length is 256 |
| phone_number | String | No | Optional phone number of the buyer. | Follows the E.164 standard |
### Line Item
| Field | Type | Required | Description | Validation |
| :---------- | :----- | :------- | :-------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------- |
| id | String | Yes | Id of the line item. This is different from the id of the item - two line items representing the same item will have different line item ids. | None |
| item | Item | Yes | Item that is represented by the line item. | None |
| base_amount | int | Yes | Integer representing item base amount before adjustments. | Should be >= 0 |
| discount | int | Yes | Integer representing any discount applied to the item. | Should be >= 0 |
| subtotal | int | Yes | Integer representing amount after all adjustments. | Should sum up to `base_amount - discount` Should be >= 0 |
| tax | int | Yes | Integer representing tax amount. | Should be >= 0 |
| total | int | Yes | Integer representing total amount. | Should sum up to `base_amount - discount + tax` Should be >= 0 |
### Total
| Field | Type | Required | Description | Validation |
| :----------- | :---------- | :------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| type | String enum | Yes | String value representing the type of total. Possible values are: `items_base_amount` `items_discount` `subtotal` `discount` `fulfillment` `tax` `fee` `total` | None |
| display_text | String | Yes | The text displayed to the customer for this total. | None |
| amount | int | Yes | Integer representing total amount in minor units. | If type == `subtotal`, should sum to `items_base_amount - items_discount` If type == `total`, should sum to `items_base_amount - items_discount - discount + fulfillment + tax + fee` Should be >= 0 |
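As a non-normative illustration of the arithmetic above, the sketch below derives a totals list from line items plus order-level discount, fulfillment, and fee amounts. All values are integers in minor units, and the field names mirror the tables in this section.
```python
def build_totals(line_items: list[dict], discount: int = 0, fulfillment: int = 0, fee: int = 0) -> list[dict]:
    # All amounts are integers in minor units (e.g. cents).
    items_base_amount = sum(li["base_amount"] for li in line_items)
    items_discount = sum(li["discount"] for li in line_items)
    subtotal = items_base_amount - items_discount
    tax = sum(li["tax"] for li in line_items)
    total = subtotal - discount + fulfillment + tax + fee

    return [
        {"type": "items_base_amount", "display_text": "Item(s) total", "amount": items_base_amount},
        {"type": "subtotal", "display_text": "Subtotal", "amount": subtotal},
        {"type": "tax", "display_text": "Tax", "amount": tax},
        {"type": "fulfillment", "display_text": "Fulfillment", "amount": fulfillment},
        {"type": "total", "display_text": "Total", "amount": total},
    ]

# Matches the earlier example: one 300 item with 30 tax and 100 standard shipping -> total 430.
totals = build_totals([{"base_amount": 300, "discount": 0, "tax": 30}], fulfillment=100)
```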
### FulfillmentOption (type = shipping)
| Field | Type | Required | Description | Validation |
| :--------------------- | :----- | :------- | :--------------------------------------------------------------------------------------------------------------- | :------------------------------------- |
| type | String | Yes | String value representing the type of fulfillment option. For a shipping option, the value should be `shipping`. | None |
| id | String | Yes | Unique ID that represents the shipping option. Unique across all fulfillment options. | Unique across all fulfillment options. |
| title | String | Yes | Title of the shipping option to display to the customer. | None |
| subtitle | String | Yes | Text content describing the estimated timeline for shipping to display to the customer. | None |
| carrier | String | Yes | Name of the shipping carrier. | None |
| earliest_delivery_time | String | Yes | Estimated earliest delivery time, formatted as an RFC 3339 string. | Formatted as an RFC 3339 string. |
| latest_delivery_time | String | Yes | Estimated latest delivery time, formatted as an RFC 3339 string. | Formatted as an RFC 3339 string. |
| subtotal | int | Yes | Integer subtotal cost of the shipping option in minor units. | Should be >= 0 |
| tax | int | Yes | Integer representing tax amount. | Should be >= 0 |
| total | int | Yes | Integer total cost of the shipping option in minor units. | Should sum to `subtotal + tax` |
### FulfillmentOption (type = digital)
| Field | Type | Required | Description | Validation |
| :------- | :----- | :------- | :------------------------------------------------------------------------------------------------------------- | :------------------------------------- |
| type | String | Yes | String value representing the type of fulfillment option. For a digital option, the value should be `digital`. | None |
| id | String | Yes | Unique ID that represents the digital option. Unique across all fulfillment options. | Unique across all fulfillment options. |
| title | String | Yes | Title of the digital option to display to the customer. | None |
| subtitle | String | No | Text content describing how the item will be digitally delivered to the customer. | None |
| subtotal | int | Yes | Integer subtotal cost of the digital option in minor units. | Should be >= 0 |
| tax | int | Yes | Integer representing tax amount. | Should be >= 0 |
| total | int | Yes | Integer total cost of the digital option in minor units. | Should sum to `subtotal + tax` |
### PaymentData
| Field | Type | Required | Description | Validation |
| :-------------- | :---------- | :------- | :------------------------------------------------------------------------------------------------- | :--------- |
| token | String | Yes | Token that represents the payment method. | None |
| provider | String enum | Yes | String value representing the payment processor. Possible values are: `stripe` `adyen` `braintree` | None |
| billing_address | Address | No | Optional billing address associated with the payment method | None |
### Order
| Field | Type | Required | Description | Validation |
| :------------------ | :----- | :------- | :-------------------------------------------------------------------------------------------------------------------------------------- | :--------- |
| id | String | Yes | Unique id that identifies the order that is created after completing the checkout session. | None |
| checkout_session_id | String | Yes | Id that identifies the checkout session that created this order | None |
| permalink_url | String | Yes | URL that points to the order. Customers should be able to visit this URL and provide at most their email address to view order details. | None |
## Webhooks
The merchant sends OpenAI webhook events when orders are created and updated. These events ensure that the buyer’s view stays in sync. Webhook events must be sent with an HMAC signature in a request header (e.g. `Merchant_Name-Signature`), computed over the webhook payload and signed using a key provided by OpenAI.
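As an illustration only, a merchant might compute and verify that signature roughly as follows. The HMAC-SHA256 algorithm and hex encoding shown here are assumptions for the sketch; use whatever scheme is agreed for your integration.
```python
import hashlib
import hmac

def sign_webhook_payload(payload: bytes, signing_key: str) -> str:
    # Assumption for illustration: HMAC-SHA256 over the raw payload, hex encoded.
    return hmac.new(signing_key.encode("utf-8"), payload, hashlib.sha256).hexdigest()

def verify_webhook_signature(payload: bytes, received_signature: str, signing_key: str) -> bool:
    expected = sign_webhook_payload(payload, signing_key)
    # Constant-time comparison avoids leaking timing information.
    return hmac.compare_digest(expected, received_signature)
```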
### Webhook Event
| Field | Type | Required | Description | Validation |
| :---- | :---------- | :------- | :------------------------------------------------------------------------------------------ | :--------- |
| type | String enum | Yes | String representing the type of event. Possible values are: `order_created` `order_updated` | None |
| data | EventData | Yes | Webhook event data. See EventData for more information. | None |
### EventData (type = order)
| Field | Type | Required | Description | Validation |
| :------------------ | :----------- | :------- | :---------------------------------------------------------------------------------------------------------------------------------------------- | :--------- |
| type | String | Yes | String value representing the type of event data. For order data, the value should be `order` | None |
| checkout_session_id | String | Yes | ID that identifies the checkout session that created this order. | None |
| permalink_url | String | Yes | URL that points to the order. Customers should be able to visit this URL and provide at most their email address to view order details. | None |
| status | String enum | Yes | String representing the latest status of the order. Possible values are: `created` `manual_review` `confirmed` `canceled` `shipped` `fulfilled` | None |
| refunds | List[Refund] | Yes | List of refunds that have been issued for the order. | None |
### Refund
| Field | Type | Required | Description | Validation |
| :----- | :---------- | :------- | :--------------------------------------------------------------------------------------------- | :------------- |
| type | String enum | Yes | String representing the type of refund. Possible values are: `store_credit` `original_payment` | None |
| amount | integer | Yes | Integer representing total amount of money refunded. | Should be >= 0 |
---
# Source: https://developers.openai.com/codex/cli.md
# Codex CLI
Codex CLI is OpenAI's coding agent that you can run locally from your terminal. It can read, change, and run code on your machine in the selected directory.
It's [open source](https://github.com/openai/codex) and built in Rust for speed and efficiency.
Codex is included with ChatGPT Plus, Pro, Business, Edu, and Enterprise plans. Learn more about [what's included](https://developers.openai.com/codex/pricing).
## CLI setup
The Codex CLI is available on macOS and Linux. Windows support is
experimental. For the best Windows experience, use Codex in a WSL workspace
and follow our Windows setup guide.
---
## Work with the Codex CLI
### Run Codex interactively
Run `codex` to start an interactive terminal UI (TUI) session.
### Control model and reasoning
Use `/model` to switch between GPT-5-Codex and GPT-5, or adjust reasoning levels.
### Image inputs
Attach screenshots or design specs so Codex reads them alongside your prompt.
### Run local code review
Get your code reviewed by a separate Codex agent before you commit or push your changes.
### Web search
Use Codex to search the web and get up-to-date information for your task.
### Codex Cloud tasks
Launch a Codex Cloud task, choose environments, and apply the resulting diffs without leaving your terminal.
### Scripting Codex
Automate repeatable workflows by scripting Codex with the `exec` command.
### Model Context Protocol
Give Codex access to additional third-party tools and context with Model Context Protocol (MCP).
### Approval modes
Choose the approval mode that matches your comfort level before Codex edits or runs commands.
---
# Source: https://developers.openai.com/codex/cloud.md
# Codex web
Codex is OpenAI's coding agent that can read, edit, and run code. It helps you build faster, fix bugs, and understand unfamiliar code. With Codex cloud, Codex can work on tasks in the background (including in parallel) using its own cloud environment.
## Codex web setup
Go to [Codex](https://chatgpt.com/codex) and connect your GitHub account. This lets Codex work with the code in your repositories and create pull requests from its work.
Your Plus, Pro, Business, Edu, or Enterprise plan includes Codex. Learn more about [what's included](https://developers.openai.com/codex/pricing). Some Enterprise workspaces may require [admin setup](https://developers.openai.com/codex/enterprise/admin-setup) before you can access Codex.
---
## Work with Codex web
### Learn about prompting
Write clearer prompts, add constraints, and choose the right level of detail to get better results.
### Common workflows
Start with proven patterns for delegating tasks, reviewing changes, and turning results into PRs.
### Configuring environments
Choose the repo, setup steps, and tools Codex should use when it runs tasks in the cloud.
### Delegate work from the IDE extension
Kick off a cloud task from your editor, then monitor progress and apply the resulting diffs locally.
### Delegating from GitHub
Tag `@codex` on issues and pull requests to spin up tasks and propose changes directly from GitHub.
### Control internet access
Decide whether Codex can reach the public internet from cloud environments, and when to enable it.
---
# Source: https://developers.openai.com/resources/guide/code-interpreter-guide.md
# Code interpreter guide
> Guide to using the built-in code interpreter tool.
- Type: Guide
- Tags: tools, code
- URL: https://platform.openai.com/docs/guides/tools-code-interpreter
- Created: 2025-07-22
- Updated: 2025-07-22
## Summary
Shows how to run computations and analyze data via the code interpreter.
## Details
Includes setup instructions and examples for leveraging the interpreter in Responses.
---
# Source: https://developers.openai.com/resources/cookbook/code-modernization.md
# Modernizing your Codebase with Codex
> Cookbook to modernize legacy codebases using the OpenAI Codex CLI.
- Type: Cookbook
- Tags: codex
- URL: /cookbook/examples/codex/code_modernization
- Created: 2025-11-19
- Updated: 2025-11-19
## Summary
Cookbook to modernize legacy codebases using the OpenAI Codex CLI.
## Details
Cookbook to modernize legacy codebases using the OpenAI Codex CLI.
---
# Source: https://developers.openai.com/cookbook/examples/codex/code_modernization.md
# Modernizing your Codebase with Codex
## Introduction
Codex is trained to read and reason about large, complex codebases, plan work alongside engineers, and produce high-quality changes. Code modernization has quickly become one of its most common and valuable uses. In this setup, engineers focus on architecture and business rules while Codex handles the heavy lifting: translating legacy patterns, proposing safe refactors, and keeping documentation and tests in sync as the system evolves.
This cookbook shows how to use **OpenAI's Codex CLI** to modernize a legacy repository in a way that is:
* Understandable to new engineers
* Auditable for architects and risk teams
* Repeatable as a pattern across other systems
We’ll use a COBOL-based [investment portfolio system](https://github.com/sentientsergio/COBOL-Legacy-Benchmark-Suite/) as the running example and choose a single pilot flow to focus on. You can substitute any legacy stack (e.g. a Java monolith or PL/SQL) where you have legacy programs, orchestration (jobs, schedulers, scripts), or shared data sources.
---
## High Level Overview
We’ve broken the work into five phases, plus a setup phase (Phase 0), that revolve around an executive plan (ExecPlan for short): a design document that the agent can follow to deliver the system change.
We will create 4 types of documents for the pilot flow we choose:
* **pilot_execplan.md** - ExecPlan that orchestrates the pilot and answers: what’s in scope, why it matters, what steps we’ll take, and how we’ll know we’re done.
* **pilot_overview.md** - Which legacy programs (COBOL in our example), orchestration jobs (JCL here), and data sources are involved, how data flows between them, and what the business flow actually does.
* **pilot_design.md** - Target shape of the system: the service/module that will own this flow, the new data model, and the public APIs or batch entry points.
* **pilot_validation.md** - Defines how we’ll prove parity: key scenarios, shared input datasets, how to run legacy vs modern side-by-side, and what “matching outputs” means in practice.
These 4 files help lay out what code is being changed, what the new system should look like, and exactly how to check that behavior hasn’t regressed.
---
## Phase 0 - Set up AGENTS and PLANS
**Goal**: Give Codex a lightweight contract for how planning works in this repo, without overwhelming people with process.
We’re taking inspiration from the [Using PLANS.md for multi-hour problem solving](https://cookbook.openai.com/articles/codex_exec_plans) cookbook to create AGENTS.md and PLANS.md files that will be placed in a .agent folder.
* AGENTS.md: If you haven’t created an AGENTS.md for your repository yet, we suggest using the /init command. Once generated, add a section to your AGENTS.md instructing the agent to reference PLANS.md.
* PLANS.md: Use the example provided in the cookbook as a starting point
These explain what an ExecPlan is, when to create or update one, where it lives, and what sections every plan must have.
### Where Codex CLI helps
If you want Codex to tighten AGENTS or PLANS for your specific repo, you can run:
```md
Please read the directory structure and refine .agent/AGENTS.md and .agent/PLANS.md so they are a clear, opinionated standard for how we plan COBOL modernization work here. Keep the ExecPlan skeleton but add one or two concrete examples.
```
---
## Phase 1 - Pick a pilot and create the first ExecPlan
**Goal**: Align on one realistic but bounded pilot flow and capture the plan for Phase 1 in a single ExecPlan file.
**Key artifact**: pilot_execplan.md
### 1.1 Choose pilot flow
If you don’t have a flow in mind to pilot with, you can ask Codex to propose candidates. Here’s an example prompt, run from the repository root:
```md
Look through this repository and propose one or two candidate pilot flows for modernization that are realistic but bounded.
For each candidate, list:
- COBOL programs and copybooks involved
- JCL members involved
- The business scenario in plain language
- End with a clear recommendation for which flow we should use as the first pilot
```
In this case, we’ll choose a reporting flow as the pilot.
### 1.2 Ask Codex to create the pilot ExecPlan
```md
Create pilot_execplan.md following .agent/PLANS.md. Scope it to the daily reporting flow. The plan should cover four outcomes for this one flow:
- Inventory and diagrams
- Modernization Technical Report content
- Target design and spec
- Test plan for parity
Use the ExecPlan skeleton and fill it in with concrete references to the actual COBOL and JCL files.
```
This plan is now your “home base” for all pilot work.
---
## Phase 2 - Inventory and discovery
**Goal**: Capture what the pilot flow actually does today: programs, jobs, data flows, and business rules. Engineers can reason about the change without reading every line of legacy code.
**Key artifact**: pilot_reporting_overview.md
**Where engineers can focus**:
* Confirm which jobs truly run in production
* Fill in gaps Codex cannot infer from code (SLAs, operational context, owners)
* Sanity check diagrams and descriptions
### 2.1 Ask Codex to draft the overview
```md
Create or update pilot_reporting_overview.md with two top-level sections: “Inventory for the pilot” and “Modernization Technical Report for the pilot”.
Use pilot_execplan.md to identify the pilot flow.
In the inventory section, include:
1. The COBOL programs and copybooks involved, grouped as batch, online, and utilities if applicable
2. The JCL jobs and steps that call these programs
3. The data sets or tables they read and write
4. A simple text diagram that shows the sequence of jobs and data flows
In the modernization technical report section, describe:
1. The business scenario for this flow in plain language
2. Detailed behavior of each COBOL program in the flow
3. The data model for the key files and tables, including field names and meanings
4. Known technical risks such as date handling, rounding, special error codes, or tricky conditions
```
This document will be helpful for engineers to understand the shape and behavior of the pilot without reading all the code.
Example of the flow diagram in pilot_reporting_overview.md
### 2.2 Update the ExecPlan
Once the overview exists, ask Codex to keep the plan aligned:
```md
Update pilot_execplan.md to reflect the new pilot_reporting_overview.md file.
- In Progress, mark the inventory and MTR sections as drafted.
- Add any notable findings to Surprises and discoveries and Decision log.
- Keep the ExecPlan readable for someone new to the repo.
```
At the end of Phase 2, you’ll have a single pilot overview doc that plays the role of both system inventory report and modernization technical report.
---
## Phase 3 - Design, spec, and validation plan
**Goal**
* Decide what the modern version of the pilot flow should look like
* Describe the target service and data model
* Define how to prove parity through tests and parallel runs.
By the end of this phase, we’ll have decided what we’re building and how we’ll prove it works.
**Key artifacts**
* pilot_reporting_design.md
* pilot_reporting_validation.md
* modern/openapi/pilot.yaml
* modern/tests/pilot_parity_test.py
### 3.1 Target design document
```md
Based on pilot_reporting_overview.md, draft pilot_reporting_design.md with these sections:
# Target service design
- Which service or module will own this pilot flow in the modern architecture.
- Whether it will be implemented as a batch job, REST API, event listener, or a combination.
- How it fits into the broader domain model.
# Target data model
- Proposed database tables and columns that replace the current files or DB2 tables.
- Keys, relationships, and any derived fields.
- Notes about how legacy encodings such as packed decimals or EBCDIC fields will be represented.
# API design overview
- The main operations users or systems will call.
- A short description of each endpoint or event.
- A pointer to modern/openapi/pilot.yaml where the full schema will live.
```
### 3.2 API specification
We capture the pilot flow’s external behavior in an OpenAPI file so the modern system has a clear, language-agnostic contract. This spec becomes the anchor for implementation, test generation, and future integrations, and it gives Codex something concrete to scaffold code and tests from.
```md
Using pilot_reporting_design.md, draft an OpenAPI file at modern/openapi/pilot.yaml that describes the external API for this pilot. Include:
- Paths and operations for the main endpoints or admin hooks
- Request and response schemas for each operation
- Field types and constraints, aligning with the target data model
```
### 3.3 Validation and test plan
```md
Create or update pilot_reporting_validation.md with three sections:
# Test plan
- Key scenarios, including at least one happy path and a couple of edge cases.
- Inputs and outputs to capture for each scenario.
# Parity and comparison strategy
- How you will run the legacy COBOL flow and the modern implementation on the same input data.
- What outputs will be compared (files, tables, logs).
- How differences will be detected and triaged.
# Test scaffolding
- Notes about the test file modern/tests/pilot_parity_test.py, including how to run it.
- What needs to be filled in once the modern implementation exists.
```
Then ask Codex to scaffold the tests:
```md
Using pilot_reporting_validation.md, create an initial test file at modern/tests/pilot_parity_test.py.
Include placeholder assertions and comments that reference the scenarios in the test plan, but do not assume the modern implementation is present yet.
```
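To make the intent concrete, here is a rough sketch of what such a scaffold might look like; the scenario names, fixture paths, and TODOs are placeholders, and Codex will derive its own structure from pilot_reporting_validation.md.
```python
# modern/tests/pilot_parity_test.py (illustrative sketch; scenario names, fixture
# paths, and helpers are placeholders until Codex generates the real scaffold).
from pathlib import Path

import pytest

# Scenarios should mirror the "Test plan" section of pilot_reporting_validation.md.
SCENARIOS = [
    "happy_path_daily_report",
    "edge_case_zero_positions",
]


@pytest.mark.parametrize("scenario", SCENARIOS)
def test_pilot_parity(scenario: str) -> None:
    input_data = Path(f"modern/tests/fixtures/{scenario}/input.dat")
    if not input_data.exists():
        pytest.skip("TODO: capture the shared input dataset for this scenario")

    # TODO: run the legacy COBOL flow on input_data (e.g. via a JCL test-harness wrapper).
    # TODO: run the modern implementation on the same data.
    # TODO: compare outputs per the "Parity and comparison strategy" section.
    pytest.skip("Modern implementation not available yet")
```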
### 3.4 Update the ExecPlan
```md
Update pilot_execplan.md so that Plan of work, Concrete steps, and Validation and acceptance explicitly reference:
1. pilot_reporting_overview.md
2. pilot_reporting_design.md
3. pilot_reporting_validation.md
4. modern/openapi/pilot.yaml
5. modern/tests/pilot_parity_test.py
```
At the end of Phase 3, you’ll have a clear design, a machine readable spec, and a test plan/scaffolding that describes how you will prove parity.
---
## Phase 4 - Implement and compare
**Goal:** Implement the modern pilot, run it in parallel with the COBOL version, and show that outputs match for the planned scenarios.
**Key artifacts**
* Code under modern/&lt;language&gt;/pilot (for example modern/java/pilot)
* Completed tests in modern/tests/pilot_parity_test.py
* Updated sections in pilot_reporting_validation.md that describe the actual parallel run steps
### 4.1 Generate a first draft of the modern code
```md
Using pilot_reporting_design.md and the COBOL programs listed in pilot_reporting_overview.md, generate initial implementation code under modern/&lt;language&gt;/pilot that:
- Defines domain models and database entities for the key records and tables.
- Implements the core business logic in service classes, preserving behavior from COBOL paragraphs.
- Adds comments that reference the original COBOL paragraphs and copybooks.
- Treat this as a first draft for engineers to review.
```
You can run this several times, focusing on different modules.
### 4.2 Wire up the parity tests
```md
Extend modern/tests/pilot_parity_test.py so that it:
- Invokes the legacy pilot flow using whatever wrapper or command we have for COBOL (for example a script that runs the JCL in a test harness).
- Invokes the new implementation through its API or batch entry point.
- Compares the outputs according to the “Parity and comparison strategy” in pilot_reporting_validation.md.
```
### 4.3 Document the parallel run steps
Rather than a separate parallel_run_pilot.md, reuse the validation doc:
```md
Update the Parity and comparison strategy section in pilot_reporting_validation.md so that it includes a clear, ordered list of commands to:
- Prepare or load the input data set
- Run the COBOL pilot flow on that data
- Run the modern pilot flow on the same data
- Compare outputs and interpret the results
- Include precise paths for outputs and a short description of what success looks like
```
### 4.4 (If needed) Use Codex for iterative fixes
As tests fail or behavior differs, work in short loops:
```md
Here is a failing test from modern/tests/pilot_parity_test.py and the relevant COBOL and modern code. Explain why the outputs differ and propose the smallest change to the modern implementation that will align it with the COBOL behavior. Show the updated code and any test adjustments.
```
Each time you complete a meaningful chunk of work, ask Codex to update the ExecPlan:
```md
Update pilot_execplan.md so that Progress, Decision log, and Outcomes reflect the latest code, tests, and validation results for the pilot.
```
You’ll see that the ExecPlan’s “Progress” and “Outcomes” sections will be updated with something along the lines of:
```md
Progress
- [x] Inventory and diagrams drafted (`pilot_reporting_overview.md` plus supporting notes in `system-architecture.md`).
- [x] Modernization technical report drafted (`pilot_reporting_overview.md` MTR section).
- [x] Target design spec drafted (`pilot_reporting_design.md` and `modern/openapi/pilot.yaml`).
- [x] Parity test plan and scaffolding documented (`pilot_reporting_validation.md` and `modern/tests/pilot_parity_test.py`).
Outcomes
- `pilot_reporting_overview.md`, `pilot_reporting_design.md`, and `pilot_reporting_validation.md` now provide an end-to-end narrative (inventory, design, validation).
- `modern/openapi/pilot.yaml` describes the API surface, and `modern/python/pilot/{models,repositories,services}.py` hold the draft implementation.
- `modern/tests/pilot_parity_test.py` exercises the parity flow using placeholders and helpers aligned with the validation strategy.
- Remaining work is limited to updating the operations test appendix and wiring the services to the real runtime.
```
---
## Phase 5 - Turn the pilot into a scalable motion
**Goal:** Provide reusable templates for other flows and a short guide to using Codex in this repo.
**Key artifacts**
* template_modernization_execplan.md
* how_to_use_codex_for_cobol_modernization.md
### 5.1 Template ExecPlan
```md
Look at the pilot files we created:
1. pilot_reporting_overview.md
2. pilot_reporting_design.md
3. pilot_reporting_validation.md
4. pilot_execplan.md
Create template_modernization_execplan.md that a team can copy when modernizing another flow. It should:
1. Follow .agent/PLANS.md
2. Include placeholders for “Overview”, “Inventory”, “Modernization Technical Report”, “Target design”, and “Validation plan”
3. Assume a similar pattern: overview doc, design doc, validation doc, OpenAPI spec, and tests.
```
### 5.2 How-to guide
```md
Using the same pilot files, write how_to_use_codex_for_cobol_modernization.md that:
1. Explains the phases at a high level (Pick a pilot, Inventory and discover, Design and spec, Implement and validate, Factory pattern).
2. For each phase, lists where coding agents help and points to the relevant files and example prompts.
```
---
## Wrap up
If you follow the steps in this cookbook for any pilot, you should end up with a folder layout that looks roughly like this: ExecPlan, three pilot docs, an OpenAPI spec, a pilot module, and a parity test. You can further organize the markdown files in additional pilot and template subfolders for more structure.
You’ll notice that there isn’t a runnable entry point in modern/python/pilot yet, since the modules (models.py, repositories.py, services.py) are first‑draft building blocks. If you want to experiment locally, you can either:
* Use an interactive shell or a small script
* Create your own runner (e.g. modern/python/pilot/main.py) that wires the repositories and services together, as sketched below
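Here is a minimal sketch of what such a runner could look like, assuming hypothetical class names (ReportRepository, ReportingService) in the draft modules; adapt it to whatever Codex actually generated.
```python
# modern/python/pilot/main.py (illustrative sketch; the class names below are
# placeholders for whatever Codex generated in repositories.py and services.py).
from modern.python.pilot.repositories import ReportRepository
from modern.python.pilot.services import ReportingService


def main() -> None:
    # Wire a repository into the reporting service and run the pilot flow locally.
    repository = ReportRepository(database_url="sqlite:///pilot_local.db")
    service = ReportingService(repository)

    report = service.generate_daily_report(as_of="2024-12-31")
    for line in report.lines:
        print(line)


if __name__ == "__main__":
    main()
```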
While this cookbook uses a COBOL pilot flow as the running example, the same pattern shows up in very different kinds of refactors. For example, one customer used Codex to migrate a large monorepo by feeding it hundreds of Jira tickets, having Codex flag higher-risk work, surface cross-cutting dependencies, and draft the code changes, with a separate validator reviewing and merging.
Modernizing COBOL repositories is just one popular case, but the same approach applies to any legacy stack or large-scale migration: turn “modernize our codebase” into a series of small, testable steps (an ExecPlan, a handful of docs, and a parity-first implementation). Codex handles the grind of understanding old patterns, generating candidate migrations, and tightening parity, while you and your team stay focused on architecture and trade-offs. The result is modernization that is faster, safer, and repeatable across every system you decide to bring forward.
---
# Source: https://developers.openai.com/cookbook/examples/third_party/code_quality_and_security_scan_with_github_actions.md
# Reasoning over Code Quality and Security in GitHub Pull Requests
## Introduction
This guide explains how to integrate OpenAI reasoning models into your GitHub Pull Request (PR) workflow to automatically review code for quality, security, and enterprise standards compliance. By leveraging AI-driven insights early in the development process, you can catch issues sooner, reduce manual effort, and maintain consistent best practices across your codebase.
## Why Integrate OpenAI Reasoning Models in PRs?
• Save time during code reviews by automatically detecting code smells, security vulnerabilities, and style inconsistencies.
• Enforce coding standards organization-wide for consistent, reliable code.
• Provide developers with prompt, AI-guided feedback on potential improvements.
## Example Use Cases
• A reviewer wants feedback on the security of a new code change before merging.
• A team seeks to enforce standard coding guidelines, ensuring consistent code quality across the organization.
## Prerequisites
### 1. Generate an OpenAI “Project Key”
1. Go to platform.openai.com/api-keys and click to create a new secret key.
2. Securely store the token in your GitHub repository secrets as OPENAI_API_KEY.
### 2. Choose Your OpenAI Model
Use [OpenAI Reasoning Models](https://platform.openai.com/docs/guides/reasoning) for in-depth analysis of code changes. Begin with the most advanced model and refine your prompt as needed.
### 3. Select a Pull Request
1. Confirm GitHub Actions is enabled for your repository.
2. Ensure you have permissions to configure repository secrets or variables (e.g., for your PROMPT, MODELNAME, and BEST_PRACTICES variables).
### 4. Define Enterprise Coding Standards
Store your standards as a repository variable (BEST_PRACTICES). These may include:
• Code style & formatting
• Readability & maintainability
• Security & compliance
• Error handling & logging
• Performance & scalability
• Testing & QA
• Documentation & version control
• Accessibility & internationalization
### 5. Define Prompt Content
Construct a meta-prompt to guide OpenAI toward security, quality, and best-practice checks. Include:
1. Code Quality & Standards
2. Security & Vulnerability Analysis
3. Fault Tolerance & Error Handling
4. Performance & Resource Management
5. Step-by-Step Validation
Encourage OpenAI to provide a thorough, line-by-line review with explicit recommendations.
## Create Your GitHub Actions Workflow
This GitHub Actions workflow is triggered on every pull request against the main branch and comprises two jobs. The first job gathers a diff of all changed files—excluding .json and .png files—and sends these changes to OpenAI for analysis. Any suggested fixes from OpenAI are included in a comment on the PR. The second job evaluates the PR against your defined enterprise standards and returns a markdown table that summarizes the code’s adherence to those standards. You can easily adjust or refine the workflow by updating variables such as the prompt, model name, and best practices.
```yaml
name: PR Quality and Security Check
on:
pull_request:
branches: [main]
permissions:
contents: read
pull-requests: write
jobs:
quality-security-analysis:
runs-on: ubuntu-latest
steps:
- name: Check out code
uses: actions/checkout@v3
with:
fetch-depth: 0 # Ensure full history for proper diff
- name: Gather Full Code From Changed Files
run: |
CHANGED_FILES=$(git diff --name-only origin/main...HEAD)
echo '{"original files": [' > original_files_temp.json
for file in $CHANGED_FILES; do
if [[ $file == *.json ]] || [[ $file == *.png ]]; then
continue
fi
if [ -f "$file" ]; then
CONTENT=$(jq -Rs . < "$file")
echo "{\"filename\": \"$file\", \"content\": $CONTENT}," >> original_files_temp.json
fi
done
sed -i '$ s/,$//' original_files_temp.json
echo "]}" >> original_files_temp.json
- name: Display Processed Diff (Debug)
run: cat original_files_temp.json
- name: Get Diff
run: |
git diff origin/main...HEAD \
| grep '^[+-]' \
| grep -Ev '^(---|\+\+\+)' > code_changes_only.txt
jq -Rs '{diff: .}' code_changes_only.txt > diff.json
if [ -f original_files_temp.json ]; then
jq -s '.[0] * .[1]' diff.json original_files_temp.json > combined.json
mv combined.json diff.json
fi
- name: Display Processed Diff (Debug)
run: cat diff.json
- name: Analyze with OpenAI
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
DIFF_CONTENT=$(jq -r '.diff' diff.json)
ORIGINAL_FILES=$(jq -r '."original files"' diff.json)
PROMPT="Please review the following code changes for any obvious quality or security issues. Provide a brief report in markdown format:\n\nDIFF:\n${DIFF_CONTENT}\n\nORIGINAL FILES:\n${ORIGINAL_FILES}"
jq -n --arg prompt "$PROMPT" '{
"model": "gpt-4",
"messages": [
{ "role": "system", "content": "You are a code reviewer." },
{ "role": "user", "content": $prompt }
]
}' > request.json
curl -sS https://api.openai.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${OPENAI_API_KEY}" \
-d @request.json > response.json
- name: Extract Review Message
id: extract_message
run: |
ASSISTANT_MSG=$(jq -r '.choices[0].message.content' response.json)
{
  echo "message<<EOF"
  echo "$ASSISTANT_MSG"
  echo "EOF"
} >> "$GITHUB_OUTPUT"
- name: Post Comment to PR
env:
COMMENT: ${{ steps.extract_message.outputs.message }}
GH_TOKEN: ${{ github.token }}
run: |
gh api \
repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/comments \
-f body="$COMMENT"
enterprise-standard-check:
runs-on: ubuntu-latest
needs: [quality-security-analysis]
steps:
- name: Checkout code
uses: actions/checkout@v3
with:
fetch-depth: 0 # ensures we get both PR base and head
- name: Gather Full Code From Changed Files
run: |
# Identify changed files from the base (origin/main) to the pull request HEAD
CHANGED_FILES=$(git diff --name-only origin/main...HEAD)
# Build a JSON array containing filenames and their content
echo '{"original files": [' > original_files_temp.json
for file in $CHANGED_FILES; do
# Skip .json and .txt files
if [[ $file == *.json ]] || [[ $file == *.txt ]]; then
continue
fi
# If the file still exists (i.e., wasn't deleted)
if [ -f "$file" ]; then
CONTENT=$(jq -Rs . < "$file")
echo "{\"filename\": \"$file\", \"content\": $CONTENT}," >> original_files_temp.json
fi
done
# Remove trailing comma on the last file entry and close JSON
sed -i '$ s/,$//' original_files_temp.json
echo "]}" >> original_files_temp.json
- name: Analyze Code Against Best Practices
id: validate
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
set -e
# Read captured code
ORIGINAL_FILES=$(cat original_files_temp.json)
# Construct the prompt:
# - Summarize each best-practice category
# - Provide a rating for each category: 'extraordinary', 'acceptable', or 'poor'
# - Return a Markdown table titled 'Enterprise Standards'
PROMPT="You are an Enterprise Code Assistant. Review each code snippet below for its adherence to the following categories:
1) Code Style & Formatting
2) Security & Compliance
3) Error Handling & Logging
4) Readability & Maintainability
5) Performance & Scalability
6) Testing & Quality Assurance
7) Documentation & Version Control
8) Accessibility & Internationalization
Using \${{ vars.BEST_PRACTICES }} as a reference, assign a rating of 'extraordinary', 'acceptable', or 'poor' for each category. Return a markdown table titled 'Enterprise Standards' with rows for each category and columns for 'Category' and 'Rating'.
Here are the changed file contents to analyze:
$ORIGINAL_FILES"
# Create JSON request for OpenAI
jq -n --arg system_content "You are an Enterprise Code Assistant ensuring the code follows best practices." \
--arg user_content "$PROMPT" \
'{
"model": "${{ vars.MODELNAME }}",
"messages": [
{
"role": "system",
"content": $system_content
},
{
"role": "user",
"content": $user_content
}
]
}' > request.json
# Make the API call
curl -sS https://api.openai.com/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d @request.json > response.json
# Extract the model's message
ASSISTANT_MSG=$(jq -r '.choices[0].message.content' response.json)
# Store for next step
{
  echo "review<<EOF"
  echo "$ASSISTANT_MSG"
  echo "EOF"
} >> "$GITHUB_OUTPUT"
- name: Post Table Comment
env:
COMMENT: ${{ steps.validate.outputs.review }}
GH_TOKEN: ${{ github.token }}
run: |
# If COMMENT is empty or null, skip posting
if [ -z "$COMMENT" ] || [ "$COMMENT" = "null" ]; then
echo "No comment to post."
exit 0
fi
gh api \
repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/comments \
-f body="$COMMENT"
```
## Test the Workflow
Commit this workflow to your repository, then open a new PR. The workflow will run automatically, posting AI-generated feedback as a PR comment.
*For a public example, see the OpenAI-Forum repository’s workflow: [pr_quality_and_security_check.yml](https://github.com/alwell-kevin/OpenAI-Forum/blob/main/.github/workflows/pr_quality_and_security_check.yml).*


---
# Source: https://developers.openai.com/blog/codex-at-devday.md
# How Codex ran OpenAI DevDay 2025
This week we wrapped up our third and largest OpenAI DevDay in San Francisco. The event was the result of the hard work of people across the company. But as we approached DevDay one thing came up again and again in discussions: “I couldn’t have done this without [Codex](/codex)”.
This year was the first DevDay with Codex, and we used it in everything we built: from stage demos (even those not about Codex), to the arcade machines in the community hall, to the products themselves. Codex was a key part of creating DevDay 2025.
Here’s a brief glimpse behind the scenes of a couple of ways that Codex helped us save time, problem solve, multi-task, prioritize, and get organized.
## Controlling cameras and creating a venue lighting MCP
Let’s start with the most obvious project: Romain Huet’s keynote demo of Codex. If you missed it, you can [check it out here](https://www.youtube.com/live/hS1YqcewH0c?si=gw-CPYc-bZ9f0huh&t=2067).
As Romain mentioned, everything you see in this demo beyond using our [Realtime agents starter app](https://github.com/openai/openai-agents-js/tree/main/examples/realtime-next) was built by Codex.
The demo actually started with the idea of wanting to show how Realtime was controlling the camera and lights in the audience. But as Romain started digging into this project, he faced the challenge of programmatically controlling the camera and lights.
Codex was able to figure out a solution to control the network enabled camera using the VISCA protocol (a protocol from the early 90s!), implement the protocol entirely on its own, and even go ahead and build an MCP server to control the protocol of the lights.
Using the [Codex CLI](/codex/cli), Romain was able to work on both problems in parallel and have an initial version up and running in an afternoon without having to touch the keyboard, avoiding what would otherwise have been an extensive research and hacking session.
## Bringing the beats
One of the big launches at DevDay was the [Apps SDK](https://developers.openai.com/apps-sdk), which lets you build rich app experiences directly within ChatGPT. For Katia Gil Guzman’s Developer State of the Union demo, the idea was to build on the light MCP server that Codex had built for Romain and have a rich beat pad interface.
This meant building a visually pleasing interface that was also functionally working, including handling the connection with the lights MCP server to control the lights and allow for it to play different instruments.
Thanks to [Codex Cloud](/codex/cloud) and best-of-N, Katia was able to not only get a functional app out quickly, but iterate on multiple different designs in parallel. She tried out everything from more futuristic modern looks to more OpenAI DevDay branded UIs and even experimented with different features, all without wasting time and effort.

## Multi-tasking game design
If you wandered the hallways of DevDay, you might have seen ArcadeGPT, two arcade cabinets that let you customize your own video game by remixing a collection of existing video games using GPT-5.
As Kevin Whinnery started building the foundation, he needed a range of starting games for GPT-5 to remix, and he needed them fast. To create and iterate on them quickly, he had seven (!!) different terminals open, each with an instance of Codex CLI working on one single-file Phaser game implementation.
Thanks to Codex CLI, he could iterate on each of the games asynchronously, testing them all at the same time to provide attendees with a wide range of games to play and remix.
## Rebuilding demo apps
Personally, I used Codex for basically every task leading up to DevDay. It’s hard to cover every single moment that I felt grateful for Codex, but one stood out.
I had been working on the fine-tuning demo for my [Open Models talk](https://www.youtube.com/watch?v=1HL2YHRj270) and used Streamlit for all of it. But the Streamlit app felt convoluted, was hard to grasp for the audience, and had some behavioral bugs that weren’t easy to fix. After taking some screenshots and creating a quick initial design using v0, I downloaded the mock [Next.js](https://nextjs.org) app and put the Codex IDE extension to work.
I asked it to take my Streamlit app and create a FastAPI server that would perform the same work and connect it to my [Next.js](https://nextjs.org) front-end. After firing off the task, I went to lunch and came back to a fully implemented and working application. From there, I was able to have Codex work on additional tasks to create additional pages that helped me better illustrate the demo.
Without Codex, this demo would have never landed on time.

## Making it real
Erika Kettleson was able to save time by using the Codex IDE extension to turn an entire booth demo into reality. She started with a sketch that was fed into Codex to create the initial UI, and even had Codex write evals to help determine the best model to use to generate SVGs while trading off speed and quality. Codex helped Erika evaluate the tradeoffs of using a single or multi-agent architecture for the demo and then refactored the whole codebase to move to the single agent architecture.
And after building it all, Codex created detailed Mermaid diagrams that Erika used at the booth to explain to people how the app worked.
## Reviewing at scale
One part of the [AgentKit launch](https://openai.com/index/introducing-agentkit/) was the release of our new Guardrails SDKs for [Python](https://pypi.org/project/openai-guardrails/) and [TypeScript](https://www.npmjs.com/package/@openai/guardrails). These SDKs are designed to work with our Agents SDKs in [Python](https://openai.github.io/openai-agents-python) and [TypeScript](https://openai.github.io/openai-agents-js) and with Agent Builder. To ensure that developers had a great experience with the SDKs, Kazuhiro (Kaz) Sera came onto the project to help get the project over the finish line.
He used Codex to quickly ramp up with the codebase of the two SDKs, identify the root causes of some of the bugs that he and Codex identified, use the Codex CLI and IDE extension to fix them and leverage Codex code review to identify any outstanding bugs.
Thanks to Codex he was able to do all of that to help the team get the SDKs out while also using the same tools to polish the [ChatKit](https://platform.openai.com/docs/guides/chatkit) sample apps that we released the same day.
## Juggling multiple projects at once
Leading up to DevDay, a lot of us were working on an increasing number of projects at the same time. Codex allowed us to delegate across both local and cloud tasks, using the IDE extension and CLI to tackle several tasks at once.
Often you would see us run 3-4 completely independent tasks at the same time. For example, at one point I had Codex simultaneously building Jupyter notebook support into the [gpt-oss server](https://github.com/openai/gpt-oss), refactoring and fixing some bugs in my agent demo, restructuring some Codex docs, and debugging my fine-tuning run.
To context switch quickly, we wouldn’t spend a lot of time carefully crafting the right prompt; instead, we’d describe the problem in short sentences to Codex, fire off the task, immediately switch to the next one, and return later to check in on its status. Even stepping away from your desk came with the habit of “let me just send off one more Codex task” before getting up.
## Getting organized
Launching multiple new products for developers comes with a lot of new documentation that, in the early stages, gets written in documents all over the place: whether it’s inside GitHub repositories, in Google Docs, or in Notion. Often, these documents get iterated on until the very last minute. This launch was no different.
Thanks to Codex Cloud, the team was able to take the fragmented documents, hand them off to Codex with a rough description of how we wanted them to be broken up and organized across our docs, and let Codex handle the rest. Codex split up the files, converted them into MDX files, set up the necessary navigation structures and opened up a PR that we could share with teams for review and iteration thanks to deploy previews.
Without Codex, this would have normally taken hours (if not days) leading up to DevDay.
## Dealing with side quests
Lastly, we’ve all been there–you’re working on the most important task but suddenly you remember this one task you had been planning to do, but you keep getting distracted.
The night before DevDay wasn’t much different. Between rehearsals we were trying to get everything ready for the big day. Katia was getting ready to go onstage to rehearse her demo when she realized she hadn’t shipped an updated 404 page like she had planned.
She quickly opened up another tab on Codex Web and sent a task asking Codex to implement a new [developers.openai.com/404](https://developers.openai.com/404) page, using the best-of-N feature to have Codex create two attempts at the same time.
Before Katia went on stage five minutes later, she was able to review the two options thanks to the preview screenshots in Codex, quickly check out the page to make a couple edits using the IDE extension, and ship the newly redesigned 404 page.

## Just scratching the surface
We could probably talk for hours about how Codex helped us shape DevDay, let alone how it helps every one of us on a day-to-day basis–but this is just a glimpse into how we’re using Codex across OpenAI.
If you want to learn more about how we use Codex and some best practices, [check out our DevDay talk about Codex](https://www.youtube.com/watch?v=Gr41tYOzE20) or [check out our documentation](https://developers.openai.com/codex).
---
# Source: https://developers.openai.com/resources/video/codex-cli-gpt5-video.md
# Using OpenAI Codex CLI with GPT-5-Codex
> Overview of running the Codex CLI locally with GPT-5-Codex.
- Type: Video
- Tags: codex
- URL: https://www.youtube.com/watch?v=iqNzfK4_meQ
- Created: 2025-10-22
- Updated: 2025-10-22
## Summary
Covers installation, authentication, and power-user workflows for the Codex CLI.
## Details
Shows how to install the open-source Codex CLI, select models, and use the agent to read, modify, and run code in local projects.
---
# Source: https://developers.openai.com/resources/video/codex-code-review-video.md
# Codex code review
> Walkthrough of how Codex drives end-to-end pull request reviews with the new onboarding flow.
- Type: Video
- Tags: codex, code-review
- URL: https://www.youtube.com/watch?v=HwbSWVg5Ln4
- Created: 2025-11-04
- Updated: 2025-11-04
## Summary
Shows Codex pairing with developers to triage diffs, leave inline suggestions, and merge confidently.
## Details
Demonstrates the streamlined onboarding experience for inviting Codex to review repositories plus how the agent reasons about test results, surfaces regressions, and proposes fixes.
---
# Source: https://developers.openai.com/blog/codex-for-documentation-dagster.md
# Using Codex for education at Dagster Labs
At [Dagster Labs](https://dagster.io), we produce a lot of technical educational content for data engineers, machine learning engineers, and analysts to better understand how to use Dagster, an open source workflow orchestration framework. Because our users come from varied technical backgrounds, we’ve found it essential to meet each persona at the right technical depth.
In this post, I’ll share how we use OpenAI’s Codex to accelerate documentation, translate content across mediums, and even measure how complete our docs are.
## The power of CONTRIBUTING.md files
To make it easier for our community members and internal engineers to contribute documentation, we overhauled our [CONTRIBUTING.md](https://github.com/dagster-io/dagster/blob/3c2d36054f4014ca8316e533975a538d6eff62c4/docs/CONTRIBUTING.md) file. To our surprise, we had inadvertently made Codex significantly more useful. It turns out there is serious value in clearly outlining the hierarchy, structure, and best practices for writing documentation in your codebase, both for humans and robots.
````markdown
# Contributing documentation
## Content
### Links
#### Use full paths instead of relative links
Docusaurus doesn't always render relative links correctly, which can result in users seeing intermittent 404s when accessing those links. Use full paths instead of relative links, like this:
```
For more information, see "[Defining assets](/guides/build/assets/defining-assets)".
```
instead of this:
```
For more information, see "[Defining assets](defining-assets)".
```
#### Use non-trailing slash links to Dagster docs
e.g. use `/guides/build/assets/defining-assets` instead of `/guides/build/assets/defining-assets/`.
**Context:** Links to Dagster docs with trailing slashes automatically redirect to non-trailing slash links. While that's helpful for docs links we don't control, too many redirects on our own pages can confuse search engines and cause SEO issues.
### API documentation
...
````
Codex is only as good as the scaffolding you give it. A well-structured CONTRIBUTING.md becomes both documentation for humans and a map for AI.
## Codex for understanding
Beyond writing docs, Codex can act as an always-available code explainer. For developer advocates and technical writers, this has been invaluable. In open source projects, or projects with many engineers, it can often be difficult to stay up-to-date on all of the features being developed, and how they work. This is especially true for smaller teams of developer advocates and technical writers. We've found that some of the best assistance Codex provides is through explaining pull requests, or pointing it to a part of the codebase and asking for an explanation.
A tip we’ve found here is to leverage the `gh` command from within Codex to explain pull requests. Ask it to review the PR description and diff, summarize why the feature was implemented, and explain how it should be exposed to end users.
## The power of the mono repo
This might be a controversial opinion, but I’m a big fan of mono repos. When context is key, having everything accessible in a single repository makes it so much easier to get access to what you need, and in the case of Codex, this means complete context: code, docs, and examples all in one place.
While some people have expressed concern about tools like Codex failing to scale to large repositories, I've found this not to be the case. By making use of file references within Codex (`@`) you can give Codex the initial subdirectory or file to start with before further exploration. And there are some serious benefits to having your framework code alongside your documentation in a single repository. This setup lets us prompt Codex to read framework code and draft initial documentation scaffolds that we can then refine.
Here’s an example where we prompted Codex to review an existing pull request, and add a section to the documentation on how exactly these environment variables are useful when configuring your deployment.
```
>_ You are using OpenAI Codex in ~/src/dagster
To get started, describe a task or try one of these commands:
/init - create an AGENTS.md file with instructions for Codex
/status - show current session configuration and token usage
/approvals - choose what Codex can do without approval
/model - choose what model and reasoning effort to use
▌ Review the changes made in pull request 32557 using `gh pr diff `, review the environment
▌ variables that were introduced and already exist for configuring the gRPC server, then update
▌ docs/docs/deployment/oss/oss-instance-configuration.md page to outline the configuration options.
> I’ll fetch the diff for PR 32557 and scan for added/updated gRPC-related
environment variables, then cross-check with existing env var usage in the
repo.
• Ran gh pr diff 32557 | sed -n '1,200p'
• Ran gh pr diff 32557 | sed -n '200,400p'
• Explored
└ Search DAGSTER_GRPC_[A-Z_]+ in dagster
Read api.py, code_server.py, oss-instance-configuration.md
Search DAGSTER_CODE_SERVER_[A-Z_]+
Read constants.py
Search DAGSTER_(REPOSITORY|SCHEDULE|SENSOR)_GRPC_TIMEOUT_SECONDS|
DAGSTER_GRPC_SHUTDOWN_GRACE_PERIOD in dagster
Read proxy_server_manager.py
> Next, I’ll expand the gRPC servers section of the instance configuration docs
to include API gRPC server variables and communication timeouts/limits, keeping
style consisten
...
```
You can find the pull request that was produced by this session at [dagster-io/dagster \# 32558](https://github.com/dagster-io/dagster/pull/32558).
## Translating between content mediums
Different personas prefer different learning formats, but the underlying ideas are often the same. Because of that, we produce a variety of content mediums: blog posts, tutorials, online courses, YouTube videos, and more. The core content of these materials can often be the same, just presented differently depending on the target persona.
Codex works exceptionally well at translating between these mediums. For example, taking a tutorial and producing a transcript for a YouTube video. Or taking a low-level tutorial, and abstracting it slightly for a blog post. In the below screenshot you can find an example prompt used to take one of our example projects and produce a video transcript. This approach saves hours of rewriting while keeping our messaging consistent across channels.
```
>_ You are using OpenAI Codex in ~/src/dagster
▌Review the example project: examples/docs_projects/project_dagster_modal_pipes
▌and produce a video transcript to teach users the core Dagster and Modal concept
▌covered. Break the transcript into sections, and include code snippets to be
▌showcased in the video. The prose of the script should be fun but professional.
▌Save this to MODAL_VIDEO_TRANSCRIPT.md.
> I’m going to scan the example project to understand its structure and code
before drafting the video transcript.
• Ran
└ bash -lc 'ls -la' bash -lc 'ls -la examples/docs_projects/
project_dagster_modal_pipes' bash -lc "rg --files examples/docs_projects/
project_dagster_modal_pipes | sed -n '1,120p'"
• Explored
└ List project_dagster_modal_pipes
Search rg --files examples/docs_projects/project_dagster_modal_pipes
Read README.md, definitions.py, __init__.py, resources.py, constants.py,
feeds.py, pipeline_factory.py, summarize.py, config.py, transcribe.py,
hello_dagster.py, verify_cloud_mount.py
• Added MODAL_VIDEO_TRANSCRIPT.md (+335 -0)
```
See the full video transcript [here](https://gist.github.com/cmpadden/a371c1300bdc42601a77b3eefb938483)!
## Evaluating documentation coverage
One of the more experimental ways we’re using Codex is as a proxy for human understanding.
By using documentation as the source-of-truth, and base context for Codex, we can have it generate code. For example, people often use Dagster for running and observing their dbt data models alongside other data processing code.
By prompting Codex to reference the documentation and produce the code for this project, we can then run a test suite on the resulting code to check that it behaves as expected. If it does, we can assume that our documentation adequately covers the required content. If Codex can generate working code purely from our documentation, it’s a strong signal that humans can as well, which makes this an indirect measure of documentation completeness.
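A rough sketch of how this loop could be automated, assuming the `codex exec` scripting entry point mentioned in the CLI docs; the prompt, paths, and test command are placeholders rather than an official workflow.
```python
# Illustrative sketch of an automated docs-coverage check; the prompt, paths,
# and test command are placeholders.
import subprocess


def check_docs_coverage(docs_dir: str, tests_dir: str) -> bool:
    prompt = (
        f"Using only the documentation in {docs_dir}, create a minimal Dagster project "
        "that runs and observes a dbt model alongside other processing code."
    )
    # Have Codex generate the project non-interactively (see "Scripting Codex" in the CLI docs).
    subprocess.run(["codex", "exec", prompt], check=True)
    # If the generated code passes the prepared test suite, the docs likely cover the topic.
    result = subprocess.run(["pytest", tests_dir])
    return result.returncode == 0


if __name__ == "__main__":
    ok = check_docs_coverage("docs/", "tests/doc_coverage/")
    print("Docs coverage check:", "passed" if ok else "failed")
```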
## Summary
All in all, the Dagster team has found Codex to be immensely helpful for creating, reviewing, and translating education content. It has enabled us to scale beyond our original capacity, helped us ensure that we have adequate documentation coverage as the framework evolves, and more importantly, it has made it so that we can more easily support our community.
Codex has underscored how important context and structure are. For us, that means refining our documentation architecture so both humans and AI can navigate it easily. This feedback loop, powered by AI, has improved both how we create content and how users generate framework code. As AI tools evolve, the line between documentation, code, and automation will blur. Teams that treat documentation as structured data will have a major advantage.
---
# Source: https://developers.openai.com/resources/video/codex-ide-extension-video.md
# OpenAI Codex in your code editor
> Walkthrough of the Codex IDE extension for VS Code, Cursor, and other forks.
- Type: Video
- Tags: codex
- URL: https://www.youtube.com/watch?v=sd21Igx4HtA
- Created: 2025-10-22
- Updated: 2025-10-22
## Summary
Shows how to pair Codex with leading editors and streamline in-editor workflows.
## Details
Use the Codex IDE extension to chat, edit, and ship code directly from VS Code, Cursor, and other supported environments.
---
# Source: https://developers.openai.com/resources/video/codex-intro.md
# Codex intro
> Introductory video introducing Codex and its capabilities.
- Type: Video
- Tags: codex
- URL: https://www.youtube.com/watch?v=hhdpnbfH6NU
- Created: 2025-07-18
- Updated: 2025-08-13
## Summary
Overview of programming with OpenAI Codex.
## Details
Demonstrates how Codex can help with code generation and editing.
---
# Source: https://developers.openai.com/resources/video/codex-jetbrains-ides-video.md
# Codex in JetBrains IDEs
> How to use Codex inside JetBrains IDEs like Rider, IntelliJ, PyCharm, and WebStorm.
- Type: Video
- Tags: codex
- URL: https://www.youtube.com/watch?v=1XkVsE9-ZK4
- Created: 2026-01-22
- Updated: 2026-01-22
## Summary
Walkthrough of how to use Codex inside your JetBrains IDEs.
## Details
Shows how to use the JetBrains IDE integration, including how to sign in with ChatGPT, an API key, or a JetBrains AI subscription.
---
# Source: https://developers.openai.com/resources/cookbook/codex-prompting-guide.md
# Codex Prompting Guide
> Codex models advance the frontier of intelligence and efficiency and are our recommended agentic coding models. Follow this guide closely to ensure you’re getting the most out of them.
- Type: Cookbook
- Tags: codex, compaction, responses
- URL: /cookbook/examples/gpt-5/codex_prompting_guide
- Created: 2025-12-04
- Updated: 2025-12-04
## Summary
Codex models advance the frontier of intelligence and efficiency and are our recommended agentic coding models. Follow this guide closely to ensure you’re getting the most out of them.
## Details
Codex models advance the frontier of intelligence and efficiency and are our recommended agentic coding models. Follow this guide closely to ensure you’re getting the most out of them.
---
# Source: https://developers.openai.com/cookbook/articles/codex_exec_plans.md
# Using PLANS.md for multi-hour problem solving
Codex and the `gpt-5.2-codex` model (recommended) can be used to implement complex tasks that take significant time to research, design, and implement. The approach described here is one way to prompt the model to implement these tasks and to steer it towards successful completion of a project.
These plans are thorough design documents, and "living documents". As a user of Codex, you can use these documents to verify the approach that Codex will take before it begins a long implementation process. The particular `PLANS.md` included below is very similar to one that has enabled Codex to work for more than seven hours from a single prompt.
We enable Codex to use these documents by first updating `AGENTS.md` to describe when to use `PLANS.md`, and then of course, to add the `PLANS.md` file to our repository.
## `AGENTS.md`
[`AGENTS.md`](https://github.com/openai/agents.md) is a simple format for guiding coding agents such as Codex. We describe a term that users can use as a shorthand, along with a simple rule for when to use planning documents. Here, we call it an "ExecPlan". Note that this is an arbitrary term; Codex has not been trained on it. This shorthand can then be used when prompting Codex to direct it to a particular definition of a plan.
Here's an `AGENTS.md` section instructing an agent about when to use a plan:
```md
# ExecPlans
When writing complex features or significant refactors, use an ExecPlan (as described in .agent/PLANS.md) from design to implementation.
```
## `PLANS.md`
Below is the entire document. The prompting in this document was carefully chosen to provide significant amounts of feedback to users and to guide the model to implement precisely what a plan specifies. Users may find that they benefit from customizing the file to meet their needs, or to add or remove required sections.
~~~md
# Codex Execution Plans (ExecPlans):
This document describes the requirements for an execution plan ("ExecPlan"), a design document that a coding agent can follow to deliver a working feature or system change. Treat the reader as a complete beginner to this repository: they have only the current working tree and the single ExecPlan file you provide. There is no memory of prior plans and no external context.
## How to use ExecPlans and PLANS.md
When authoring an executable specification (ExecPlan), follow PLANS.md _to the letter_. If it is not in your context, refresh your memory by reading the entire PLANS.md file. Be thorough in reading (and re-reading) source material to produce an accurate specification. When creating a spec, start from the skeleton and flesh it out as you do your research.
When implementing an executable specification (ExecPlan), do not prompt the user for "next steps"; simply proceed to the next milestone. Keep all sections up to date, add or split entries in the list at every stopping point to affirmatively state the progress made and next steps. Resolve ambiguities autonomously, and commit frequently.
When discussing an executable specification (ExecPlan), record decisions in a log in the spec for posterity; it should be unambiguously clear why any change to the specification was made. ExecPlans are living documents, and it should always be possible to restart from _only_ the ExecPlan and no other work.
When researching a design with challenging requirements or significant unknowns, use milestones to implement proof of concepts, "toy implementations", etc., that allow validating whether the user's proposal is feasible. Read the source code of libraries by finding or acquiring them, research deeply, and include prototypes to guide a fuller implementation.
## Requirements
NON-NEGOTIABLE REQUIREMENTS:
* Every ExecPlan must be fully self-contained. Self-contained means that in its current form it contains all knowledge and instructions needed for a novice to succeed.
* Every ExecPlan is a living document. Contributors are required to revise it as progress is made, as discoveries occur, and as design decisions are finalized. Each revision must remain fully self-contained.
* Every ExecPlan must enable a complete novice to implement the feature end-to-end without prior knowledge of this repo.
* Every ExecPlan must produce a demonstrably working behavior, not merely code changes to "meet a definition".
* Every ExecPlan must define every term of art in plain language or do not use it.
Purpose and intent come first. Begin by explaining, in a few sentences, why the work matters from a user's perspective: what someone can do after this change that they could not do before, and how to see it working. Then guide the reader through the exact steps to achieve that outcome, including what to edit, what to run, and what they should observe.
The agent executing your plan can list files, read files, search, run the project, and run tests. It does not know any prior context and cannot infer what you meant from earlier milestones. Repeat any assumption you rely on. Do not point to external blogs or docs; if knowledge is required, embed it in the plan itself in your own words. If an ExecPlan builds upon a prior ExecPlan and that file is checked in, incorporate it by reference. If it is not, you must include all relevant context from that plan.
## Formatting
Format and envelope are simple and strict. Each ExecPlan must be one single fenced code block labeled as `md` that begins and ends with triple backticks. Do not nest additional triple-backtick code fences inside; when you need to show commands, transcripts, diffs, or code, present them as indented blocks within that single fence. Use indentation for clarity rather than code fences inside an ExecPlan to avoid prematurely closing the ExecPlan's code fence. Use two newlines after every heading, use # and ## and so on, and correct syntax for ordered and unordered lists.
When writing an ExecPlan to a Markdown (.md) file where the content of the file *is only* the single ExecPlan, you should omit the triple backticks.
Write in plain prose. Prefer sentences over lists. Avoid checklists, tables, and long enumerations unless brevity would obscure meaning. Checklists are permitted only in the `Progress` section, where they are mandatory. Narrative sections must remain prose-first.
## Guidelines
Self-containment and plain language are paramount. If you introduce a phrase that is not ordinary English ("daemon", "middleware", "RPC gateway", "filter graph"), define it immediately and remind the reader how it manifests in this repository (for example, by naming the files or commands where it appears). Do not say "as defined previously" or "according to the architecture doc." Include the needed explanation here, even if you repeat yourself.
Avoid common failure modes. Do not rely on undefined jargon. Do not describe "the letter of a feature" so narrowly that the resulting code compiles but does nothing meaningful. Do not outsource key decisions to the reader. When ambiguity exists, resolve it in the plan itself and explain why you chose that path. Err on the side of over-explaining user-visible effects and under-specifying incidental implementation details.
Anchor the plan with observable outcomes. State what the user can do after implementation, the commands to run, and the outputs they should see. Acceptance should be phrased as behavior a human can verify ("after starting the server, navigating to [http://localhost:8080/health](http://localhost:8080/health) returns HTTP 200 with body OK") rather than internal attributes ("added a HealthCheck struct"). If a change is internal, explain how its impact can still be demonstrated (for example, by running tests that fail before and pass after, and by showing a scenario that uses the new behavior).
Specify repository context explicitly. Name files with full repository-relative paths, name functions and modules precisely, and describe where new files should be created. If touching multiple areas, include a short orientation paragraph that explains how those parts fit together so a novice can navigate confidently. When running commands, show the working directory and exact command line. When outcomes depend on environment, state the assumptions and provide alternatives when reasonable.
Be idempotent and safe. Write the steps so they can be run multiple times without causing damage or drift. If a step can fail halfway, include how to retry or adapt. If a migration or destructive operation is necessary, spell out backups or safe fallbacks. Prefer additive, testable changes that can be validated as you go.
Validation is not optional. Include instructions to run tests, to start the system if applicable, and to observe it doing something useful. Describe comprehensive testing for any new features or capabilities. Include expected outputs and error messages so a novice can tell success from failure. Where possible, show how to prove that the change is effective beyond compilation (for example, through a small end-to-end scenario, a CLI invocation, or an HTTP request/response transcript). State the exact test commands appropriate to the project’s toolchain and how to interpret their results.
Capture evidence. When your steps produce terminal output, short diffs, or logs, include them inside the single fenced block as indented examples. Keep them concise and focused on what proves success. If you need to include a patch, prefer file-scoped diffs or small excerpts that a reader can recreate by following your instructions rather than pasting large blobs.
## Milestones
Milestones are narrative, not bureaucracy. If you break the work into milestones, introduce each with a brief paragraph that describes the scope, what will exist at the end of the milestone that did not exist before, the commands to run, and the acceptance you expect to observe. Keep it readable as a story: goal, work, result, proof. Progress and milestones are distinct: milestones tell the story, progress tracks granular work. Both must exist. Never abbreviate a milestone merely for the sake of brevity, do not leave out details that could be crucial to a future implementation.
Each milestone must be independently verifiable and incrementally implement the overall goal of the execution plan.
## Living plans and design decisions
* ExecPlans are living documents. As you make key design decisions, update the plan to record both the decision and the thinking behind it. Record all decisions in the `Decision Log` section.
* ExecPlans must contain and maintain a `Progress` section, a `Surprises & Discoveries` section, a `Decision Log`, and an `Outcomes & Retrospective` section. These are not optional.
* When you discover optimizer behavior, performance tradeoffs, unexpected bugs, or inverse/unapply semantics that shaped your approach, capture those observations in the `Surprises & Discoveries` section with short evidence snippets (test output is ideal).
* If you change course mid-implementation, document why in the `Decision Log` and reflect the implications in `Progress`. Plans are guides for the next contributor as much as checklists for you.
* At completion of a major task or the full plan, write an `Outcomes & Retrospective` entry summarizing what was achieved, what remains, and lessons learned.
## Prototyping milestones and parallel implementations
It is acceptable--and often encouraged--to include explicit prototyping milestones when they de-risk a larger change. Examples: adding a low-level operator to a dependency to validate feasibility, or exploring two composition orders while measuring optimizer effects. Keep prototypes additive and testable. Clearly label the scope as “prototyping”; describe how to run and observe results; and state the criteria for promoting or discarding the prototype.
Prefer additive code changes followed by subtractions that keep tests passing. Parallel implementations (e.g., keeping an adapter alongside an older path during migration) are fine when they reduce risk or enable tests to continue passing during a large migration. Describe how to validate both paths and how to retire one safely with tests. When working with multiple new libraries or feature areas, consider creating spikes that evaluate the feasibility of these features _independently_ of one another, proving that the external library performs as expected and implements the features we need in isolation.
## Skeleton of a Good ExecPlan
# <Title of the feature or change>
This ExecPlan is a living document. The sections `Progress`, `Surprises & Discoveries`, `Decision Log`, and `Outcomes & Retrospective` must be kept up to date as work proceeds.
If the PLANS.md file is checked into the repo, reference its path from the repository root here and note that this document must be maintained in accordance with PLANS.md.
## Purpose / Big Picture
Explain in a few sentences what someone gains after this change and how they can see it working. State the user-visible behavior you will enable.
## Progress
Use a list with checkboxes to summarize granular steps. Every stopping point must be documented here, even if it requires splitting a partially completed task into two (“done” vs. “remaining”). This section must always reflect the actual current state of the work.
- [x] (2025-10-01 13:00Z) Example completed step.
- [ ] Example incomplete step.
- [ ] Example partially completed step (completed: X; remaining: Y).
Use timestamps to measure rates of progress.
## Surprises & Discoveries
Document unexpected behaviors, bugs, optimizations, or insights discovered during implementation. Provide concise evidence.
- Observation: …
Evidence: …
## Decision Log
Record every decision made while working on the plan in the format:
- Decision: …
Rationale: …
Date/Author: …
## Outcomes & Retrospective
Summarize outcomes, gaps, and lessons learned at major milestones or at completion. Compare the result against the original purpose.
## Context and Orientation
Describe the current state relevant to this task as if the reader knows nothing. Name the key files and modules by full path. Define any non-obvious term you will use. Do not refer to prior plans.
## Plan of Work
Describe, in prose, the sequence of edits and additions. For each edit, name the file and location (function, module) and what to insert or change. Keep it concrete and minimal.
## Concrete Steps
State the exact commands to run and where to run them (working directory). When a command generates output, show a short expected transcript so the reader can compare. This section must be updated as work proceeds.
## Validation and Acceptance
Describe how to start or exercise the system and what to observe. Phrase acceptance as behavior, with specific inputs and outputs. If tests are involved, say "run the project's test command and expect it to pass; the new test fails before the change and passes after".
## Idempotence and Recovery
If steps can be repeated safely, say so. If a step is risky, provide a safe retry or rollback path. Keep the environment clean after completion.
## Artifacts and Notes
Include the most important transcripts, diffs, or snippets as indented examples. Keep them concise and focused on what proves success.
## Interfaces and Dependencies
Be prescriptive. Name the libraries, modules, and services to use and why. Specify the types, traits/interfaces, and function signatures that must exist at the end of the milestone. Prefer stable names and paths such as `crate::module::function` or `package.submodule.Interface`. E.g.:
In crates/foo/planner.rs, define:

    pub trait Planner {
        fn plan(&self, observed: &Observed) -> Vec<Action>;
    }
If you follow the guidance above, a single, stateless agent -- or a human novice -- can read your ExecPlan from top to bottom and produce a working, observable result. That is the bar: SELF-CONTAINED, SELF-SUFFICIENT, NOVICE-GUIDING, OUTCOME-FOCUSED.
When you revise a plan, you must ensure your changes are comprehensively reflected across all sections, including the living document sections, and you must write a note at the bottom of the plan describing the change and the reason why. ExecPlans must describe not just the what but the why for almost everything.
~~~
---
# Source: https://developers.openai.com/cookbook/examples/gpt-5/codex_prompting_guide.md
# **Codex** Prompting Guide
Codex models advance the frontier of intelligence and efficiency and are our recommended agentic coding models. Follow this guide closely to ensure you’re getting the best performance possible from this model. This guide is for anyone using the model directly via the API for maximum customizability; we also have the [Codex SDK](https://developers.openai.com/codex/sdk/) for simpler integrations.
In the API, the Codex-tuned model is `gpt-5.2-codex` (see the [model page](https://platform.openai.com/docs/models/gpt-5.2-codex)).
Recent improvements to Codex models:
* Faster and more token efficient: uses fewer thinking tokens to accomplish a task. We recommend `medium` reasoning effort as a good all-around setting for interactive coding that balances intelligence and speed (see the example after this list).
* Higher intelligence and long-running autonomy: Codex is very capable and will work autonomously for hours to complete your hardest tasks. Use `high` or `xhigh` reasoning effort for the most demanding work.
* First-class compaction support: Compaction enables multi-hour reasoning without hitting context limits and longer continuous user conversations without needing to start new chat sessions.
* Codex is also much better in PowerShell and Windows environments.
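For API usage, reasoning effort is a per-request setting. Here is a minimal sketch with the Responses API; the prompt is just an illustrative placeholder:
```py
from openai import OpenAI

client = OpenAI()

# "medium" is a good interactive default; reserve "high" or "xhigh" for the hardest tasks.
response = client.responses.create(
    model="gpt-5.2-codex",
    reasoning={"effort": "medium"},
    input="Fix the failing test in tests/test_sync.py and explain the root cause.",
)
print(response.output_text)
```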
# Getting Started
If you already have a working Codex implementation, this model should work well with relatively minimal updates, but if you’re starting with a prompt and set of tools that’s optimized for GPT-5-series models, or a third-party model, we recommend making more significant changes. The best reference implementation is our fully open-source codex-cli agent, available on [GitHub](https://github.com/openai/codex). Clone this repo and use Codex (or any coding agent) to ask questions about how things are implemented. From working with customers, we’ve also learned how to customize agent harnesses beyond this particular implementation.
Key steps to migrate your harness to codex-cli:
1. Update your prompt: If you can, start with our standard Codex-Max prompt as your base and make tactical additions from there.
a) The most critical snippets are those covering autonomy and persistence, codebase exploration, tool use, and frontend quality.
b) You should also remove all prompting for the model to communicate an upfront plan, preambles, or other status updates during the rollout, as this can cause the model to stop abruptly before the rollout is complete.
2. Update your tools, including our apply\_patch implementation and other best practices below. This is a major lever for getting the most performance.
# Prompting
## Recommended Starter Prompt
This prompt began as the default [GPT-5.1-Codex-Max prompt](https://github.com/openai/codex/blob/main/codex-rs/core/gpt-5.1-codex-max_prompt.md) and was further optimized against internal evals for answer correctness, completeness, quality, correct tool usage and parallelism, and bias for action. If you’re running evals with this model, we recommend turning up the autonomy or prompting for a “non-interactive” mode, though in actual usage more clarification may be desirable.
```
You are Codex, based on GPT-5. You are running as a coding agent in the Codex CLI on a user's computer.
# General
- When searching for text or files, prefer using `rg` or `rg --files` respectively because `rg` is much faster than alternatives like `grep`. (If the `rg` command is not found, then use alternatives.)
- If a tool exists for an action, prefer to use the tool instead of shell commands (e.g `read_file` over `cat`). Strictly avoid raw `cmd`/terminal when a dedicated tool exists. Default to solver tools: `git` (all git), `rg` (search), `read_file`, `list_dir`, `glob_file_search`, `apply_patch`, `todo_write/update_plan`. Use `cmd`/`run_terminal_cmd` only when no listed tool can perform the action.
- When multiple tool calls can be parallelized (e.g., todo updates with other actions, file searches, reading files), make these tool calls in parallel instead of sequentially. Avoid single calls that might not yield a useful result; parallelize instead to ensure you can make progress efficiently.
- Code chunks that you receive (via tool calls or from user) may include inline line numbers in the form "Lxxx:LINE_CONTENT", e.g. "L123:LINE_CONTENT". Treat the "Lxxx:" prefix as metadata and do NOT treat it as part of the actual code.
- Default expectation: deliver working code, not just a plan. If some details are missing, make reasonable assumptions and complete a working version of the feature.
# Autonomy and Persistence
- You are an autonomous senior engineer: once the user gives a direction, proactively gather context, plan, implement, test, and refine without waiting for additional prompts at each step.
- Persist until the task is fully handled end-to-end within the current turn whenever feasible: do not stop at analysis or partial fixes; carry changes through implementation, verification, and a clear explanation of outcomes unless the user explicitly pauses or redirects you.
- Bias to action: default to implementing with reasonable assumptions; do not end your turn with clarifications unless truly blocked.
- Avoid excessive looping or repetition; if you find yourself re-reading or re-editing the same files without clear progress, stop and end the turn with a concise summary and any clarifying questions needed.
# Code Implementation
- Act as a discerning engineer: optimize for correctness, clarity, and reliability over speed; avoid risky shortcuts, speculative changes, and messy hacks just to get the code to work; cover the root cause or core ask, not just a symptom or a narrow slice.
- Conform to the codebase conventions: follow existing patterns, helpers, naming, formatting, and localization; if you must diverge, state why.
- Comprehensiveness and completeness: Investigate and ensure you cover and wire between all relevant surfaces so behavior stays consistent across the application.
- Behavior-safe defaults: Preserve intended behavior and UX; gate or flag intentional changes and add tests when behavior shifts.
- Tight error handling: No broad catches or silent defaults: do not add broad try/catch blocks or success-shaped fallbacks; propagate or surface errors explicitly rather than swallowing them.
- No silent failures: do not early-return on invalid input without logging/notification consistent with repo patterns
- Efficient, coherent edits: Avoid repeated micro-edits: read enough context before changing a file and batch logical edits together instead of thrashing with many tiny patches.
- Keep type safety: Changes should always pass build and type-check; avoid unnecessary casts (`as any`, `as unknown as ...`); prefer proper types and guards, and reuse existing helpers (e.g., normalizing identifiers) instead of type-asserting.
- Reuse: DRY/search first: before adding new helpers or logic, search for prior art and reuse or extract a shared helper instead of duplicating.
- Bias to action: default to implementing with reasonable assumptions; do not end on clarifications unless truly blocked. Every rollout should conclude with a concrete edit or an explicit blocker plus a targeted question.
# Editing constraints
- Default to ASCII when editing or creating files. Only introduce non-ASCII or other Unicode characters when there is a clear justification and the file already uses them.
- Add succinct code comments that explain what is going on if code is not self-explanatory. You should not add comments like "Assigns the value to the variable", but a brief comment might be useful ahead of a complex code block that the user would otherwise have to spend time parsing out. Usage of these comments should be rare.
- Try to use apply_patch for single file edits, but it is fine to explore other options to make the edit if it does not work well. Do not use apply_patch for changes that are auto-generated (i.e. generating package.json or running a lint or format command like gofmt) or when scripting is more efficient (such as search and replacing a string across a codebase).
- You may be in a dirty git worktree.
* NEVER revert existing changes you did not make unless explicitly requested, since these changes were made by the user.
* If asked to make a commit or code edits and there are unrelated changes to your work or changes that you didn't make in those files, don't revert those changes.
* If the changes are in files you've touched recently, you should read carefully and understand how you can work with the changes rather than reverting them.
* If the changes are in unrelated files, just ignore them and don't revert them.
- Do not amend a commit unless explicitly requested to do so.
- While you are working, you might notice unexpected changes that you didn't make. If this happens, STOP IMMEDIATELY and ask the user how they would like to proceed.
- **NEVER** use destructive commands like `git reset --hard` or `git checkout --` unless specifically requested or approved by the user.
# Exploration and reading files
- **Think first.** Before any tool call, decide ALL files/resources you will need.
- **Batch everything.** If you need multiple files (even from different places), read them together.
- **multi_tool_use.parallel** Use `multi_tool_use.parallel` to parallelize tool calls and only this.
- **Only make sequential calls if you truly cannot know the next file without seeing a result first.**
- **Workflow:** (a) plan all needed reads → (b) issue one parallel batch → (c) analyze results → (d) repeat if new, unpredictable reads arise.
- Additional notes:
- Always maximize parallelism. Never read files one-by-one unless logically unavoidable.
- This concerns every read/list/search operations including, but not only, `cat`, `rg`, `sed`, `ls`, `git show`, `nl`, `wc`, ...
- Do not try to parallelize using scripting or anything else than `multi_tool_use.parallel`.
# Plan tool
When using the planning tool:
- Skip using the planning tool for straightforward tasks (roughly the easiest 25%).
- Do not make single-step plans.
- When you have made a plan, update it after performing one of the sub-tasks that you shared on the plan.
- Unless asked for a plan, never end the interaction with only a plan. Plans guide your edits; the deliverable is working code.
- Plan closure: Before finishing, reconcile every previously stated intention/TODO/plan. Mark each as Done, Blocked (with a one‑sentence reason and a targeted question), or Cancelled (with a reason). Do not end with in_progress/pending items. If you created todos via a tool, update their statuses accordingly.
- Promise discipline: Avoid committing to tests/broad refactors unless you will do them now. Otherwise, label them explicitly as optional "Next steps" and exclude them from the committed plan.
- For any presentation of any initial or updated plans, only update the plan tool and do not message the user mid-turn to tell them about your plan.
# Special user requests
- If the user makes a simple request (such as asking for the time) which you can fulfill by running a terminal command (such as `date`), you should do so.
- If the user asks for a "review", default to a code review mindset: prioritise identifying bugs, risks, behavioural regressions, and missing tests. Findings must be the primary focus of the response - keep summaries or overviews brief and only after enumerating the issues. Present findings first (ordered by severity with file/line references), follow with open questions or assumptions, and offer a change-summary only as a secondary detail. If no findings are discovered, state that explicitly and mention any residual risks or testing gaps.
# Frontend tasks
When doing frontend design tasks, avoid collapsing into "AI slop" or safe, average-looking layouts.
Aim for interfaces that feel intentional, bold, and a bit surprising.
- Typography: Use expressive, purposeful fonts and avoid default stacks (Inter, Roboto, Arial, system).
- Color & Look: Choose a clear visual direction; define CSS variables; avoid purple-on-white defaults. No purple bias or dark mode bias.
- Motion: Use a few meaningful animations (page-load, staggered reveals) instead of generic micro-motions.
- Background: Don't rely on flat, single-color backgrounds; use gradients, shapes, or subtle patterns to build atmosphere.
- Overall: Avoid boilerplate layouts and interchangeable UI patterns. Vary themes, type families, and visual languages across outputs.
- Ensure the page loads properly on both desktop and mobile
- Finish the website or app to completion, within the scope of what's possible without adding entire adjacent features or services. It should be in a working state for a user to run and test.
Exception: If working within an existing website or design system, preserve the established patterns, structure, and visual language.
# Presenting your work and final message
You are producing plain text that will later be styled by the CLI. Follow these rules exactly. Formatting should make results easy to scan, but not feel mechanical. Use judgment to decide how much structure adds value.
- Default: be very concise; friendly coding teammate tone.
- Format: Use natural language with high-level headings.
- Ask only when needed; suggest ideas; mirror the user's style.
- For substantial work, summarize clearly; follow final‑answer formatting.
- Skip heavy formatting for simple confirmations.
- Don't dump large files you've written; reference paths only.
- No "save/copy this file" - User is on the same machine.
- Offer logical next steps (tests, commits, build) briefly; add verify steps if you couldn't do something.
- For code changes:
* Lead with a quick explanation of the change, and then give more details on the context covering where and why a change was made. Do not start this explanation with "summary", just jump right in.
* If there are natural next steps the user may want to take, suggest them at the end of your response. Do not make suggestions if there are no natural next steps.
* When suggesting multiple options, use numeric lists for the suggestions so the user can quickly respond with a single number.
- The user does not see command execution outputs. When asked to show the output of a command (e.g. `git show`), relay the important details in your answer or summarize the key lines so the user understands the result.
## Final answer structure and style guidelines
- Plain text; CLI handles styling. Use structure only when it helps scanability.
- Headers: optional; short Title Case (1-3 words) wrapped in **…**; no blank line before the first bullet; add only if they truly help.
- Bullets: use - ; merge related points; keep to one line when possible; 4–6 per list ordered by importance; keep phrasing consistent.
- Monospace: backticks for commands/paths/env vars/code ids and inline examples; use for literal keyword bullets; never combine with **.
- Code samples or multi-line snippets should be wrapped in fenced code blocks; include an info string as often as possible.
- Structure: group related bullets; order sections general → specific → supporting; for subsections, start with a bolded keyword bullet, then items; match complexity to the task.
- Tone: collaborative, concise, factual; present tense, active voice; self‑contained; no "above/below"; parallel wording.
- Don'ts: no nested bullets/hierarchies; no ANSI codes; don't cram unrelated keywords; keep keyword lists short—wrap/reformat if long; avoid naming formatting styles in answers.
- Adaptation: code explanations → precise, structured with code refs; simple tasks → lead with outcome; big changes → logical walkthrough + rationale + next actions; casual one-offs → plain sentences, no headers/bullets.
- File References: When referencing files in your response follow the below rules:
* Use inline code to make file paths clickable.
* Each reference should have a stand alone path. Even if it's the same file.
* Accepted: absolute, workspace‑relative, a/ or b/ diff prefixes, or bare filename/suffix.
* Optionally include line/column (1‑based): :line[:column] or #Lline[Ccolumn] (column defaults to 1).
* Do not use URIs like file://, vscode://, or https://.
* Do not provide range of lines
* Examples: src/app.ts, src/app.ts:42, b/server/index.js#L10, C:\repo\project\main.rs:12:5
```
## Mid-Rollout User Updates
The Codex model family uses reasoning summaries to communicate user updates as it’s working. These can take the form of one-liner headings (which update the ephemeral text in the Codex CLI), or a heading plus a short body. This is done by a separate model and therefore is **not promptable**, and we advise against adding any instructions to the prompt related to intermediate plans or messages to the user. We’ve improved these summaries for Codex-Max to be more communicative and to provide more critical information about what’s happening and why; some of our users are updating their UX to surface these summaries more prominently in their UI, similar to how intermediate messages are displayed for GPT-5 series models.
## Using agents.md
Codex-cli automatically enumerates these files and injects them into the conversation; the model has been trained to closely adhere to these instructions.
1. Files are pulled from `~/.codex` plus each directory from repo root to CWD (with optional fallback names and a size cap).
2. They’re merged in order, with later directories overriding earlier ones.
3. Each merged chunk shows up to the model as its own user-role message, like so:
```
# AGENTS.md instructions for <folder>
...file contents...
```
Additional details:
* Each discovered file becomes its own user-role message that starts with `# AGENTS.md instructions for <folder>`, where `<folder>` is the path (relative to the repo root) of the folder that provided that file.
* Messages are injected near the top of the conversation history, before the user prompt, in root-to-leaf order: global instructions first, then repo root, then each deeper directory. If an AGENTS.override.md was used, its directory name still appears in the header (e.g., `# AGENTS.md instructions for backend/api`), so the context is obvious in the transcript.
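As a rough sketch of that discovery order (not the actual codex-cli implementation, and omitting the fallback names and size cap), a harness might collect the files like this:
```py
from pathlib import Path

def collect_agents_files(repo_root: Path, cwd: Path) -> list[Path]:
    """Gather AGENTS.md files from ~/.codex and each directory from repo root to CWD."""
    candidates = [Path.home() / ".codex" / "AGENTS.md", repo_root / "AGENTS.md"]
    # Walk from the repo root down to the working directory (assumed to be inside it),
    # root-to-leaf, so deeper directories come later and override earlier ones.
    current = repo_root
    for part in cwd.resolve().relative_to(repo_root.resolve()).parts:
        current = current / part
        candidates.append(current / "AGENTS.md")
    return [path for path in candidates if path.is_file()]

# Each discovered file then becomes its own user-role message:
# "# AGENTS.md instructions for <folder>" followed by the file contents.
```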
# Compaction
Compaction unlocks significantly longer effective context windows: user conversations can persist for many turns without hitting context limits or degrading long-context performance, and agents can run trajectories that exceed a typical context window on long-running, complex tasks. A weaker version of this was previously possible with ad-hoc scaffolding and conversation summarization, but our first-class implementation, available via the Responses API, is integrated with the model and is highly performant.
How it works:
1. You use the Responses API as today, sending input items that include tool calls, user inputs, and assistant messages.
2. When your context window grows large, invoke /compact to generate a new, compacted context. Two things to note:
   1. The input you send to /compact must itself fit within your model’s context window.
   2. The endpoint is ZDR compatible and will return an `encrypted_content` item that you can pass into future requests.
3. For subsequent calls to the /responses endpoint, you can pass your updated, compacted list of conversation items (including the added compaction item). The model retains key prior state with fewer conversation tokens.
For endpoint details see our `/responses/compact` [docs](https://platform.openai.com/docs/api-reference/responses/compact).
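As a rough sketch of the flow (the request body below is an illustrative assumption; check the linked reference for the exact fields):
```py
import os

import requests

# conversation_items: the same list of items (tool calls, user inputs, assistant
# messages) you would otherwise send to /responses on the next turn.
conversation_items = [...]

# Ask the API for a compacted version of the conversation.
resp = requests.post(
    "https://api.openai.com/v1/responses/compact",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "gpt-5.2-codex", "input": conversation_items},
)
compacted_items = resp.json()

# Use the compacted items (including the returned compaction item and its
# encrypted_content) as the input for subsequent /responses calls.
```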
# Tools
1. We strongly recommend using our exact `apply_patch` implementation as the model has been trained to excel at this diff format. For terminal commands we recommend our `shell` tool, and for plan/TODO items our `update_plan` tool should be most performant.
2. If you prefer your agent to use more “terminal-like” tools (like `file_read()` instead of calling `sed` in the terminal), this model can reliably call them instead of the terminal (following the instructions below).
3. Other tools, including semantic search, MCPs, and other custom tools, can work, but they require more tuning and experimentation.
### Apply_patch
The easiest way to implement apply_patch is with our first-class implementation in the Responses API, but you can also use our freeform tool implementation with a [context-free grammar](https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_and_tools#3-contextfree-grammar-cfg). Both are demonstrated below.
```py
# Sample script demonstrating apply_patch tooling: the server-defined tool's patch
# format and a custom freeform (CFG-constrained) tool implementation
import json
from pprint import pprint
from typing import cast
from openai import OpenAI
from openai.types.responses import ResponseInputParam, ToolParam
client = OpenAI()
## Shared tools and prompt
user_request = """Add a cancel button that logs when clicked"""
file_excerpt = """\
export default function Page() {
return (
\n'
# ' );\n'
# ' }\n',
# 'path': '/app/page.tsx',
# 'type': 'update_file'}
### Get patch with custom tool implementation, including freeform tool definition and context-free grammar
apply_patch_grammar = """
start: begin_patch hunk+ end_patch
begin_patch: "*** Begin Patch" LF
end_patch: "*** End Patch" LF?
hunk: add_hunk | delete_hunk | update_hunk
add_hunk: "*** Add File: " filename LF add_line+
delete_hunk: "*** Delete File: " filename LF
update_hunk: "*** Update File: " filename LF change_move? change?
filename: /(.+)/
add_line: "+" /(.*)/ LF -> line
change_move: "*** Move to: " filename LF
change: (change_context | change_line)+ eof_line?
change_context: ("@@" | "@@ " /(.+)/) LF
change_line: ("+" | "-" | " ") /(.*)/ LF
eof_line: "*** End of File" LF
%import common.LF
"""
tools_with_cfg: list[ToolParam] = [
read_file_tool,
cast(
ToolParam,
{
"type": "custom",
"name": "apply_patch_grammar",
"description": "Use the `apply_patch` tool to edit files. This is a FREEFORM tool, so do not wrap the patch in JSON.",
"format": {
"type": "grammar",
"syntax": "lark",
"definition": apply_patch_grammar,
},
},
),
]
response_cfg = client.responses.create(
model="gpt-5.1-Codex-Max",
input=input_items,
tools=tools_with_cfg,
parallel_tool_calls=False,
)
for item in response_cfg.output:
if item.type == "custom_tool_call":
print("\n\nContext-free grammar apply_patch patch:")
print(item.input)
# Output (JSX in the patch reconstructed for illustration; exact output will vary):
# *** Begin Patch
# *** Update File: /app/page.tsx
# @@
#      <div>
#        <h1>Page component not implemented</h1>
# +      <button onClick={() => console.log("Cancel clicked")}>Cancel</button>
#      </div>
#    );
#  }
# *** End Patch
```
Patch objects from the Responses API tool can be applied by following this [example](https://github.com/openai/openai-agents-python/blob/main/examples/tools/apply_patch.py), and patches from the freeform tool can be applied with the logic in our canonical GPT-5 [apply_patch.py](https://github.com/openai/openai-cookbook/blob/main/examples/gpt-5/apply_patch.py) implementation.
### Shell_command
This is our default shell tool. Note that we have seen better performance when the `command` parameter is a single string rather than a list of commands.
```
{
"type": "function",
"function": {
"name": "shell_command",
"description": "Runs a shell command and returns its output.\n- Always set the `workdir` param when using the shell_command function. Do not use `cd` unless absolutely necessary.",
"strict": false,
"parameters": {
"type": "object",
"properties": {
"command": {
"type": "string",
"description": "The shell script to execute in the user's default shell"
},
"workdir": {
"type": "string",
"description": "The working directory to execute the command in"
},
"timeout_ms": {
"type": "number",
"description": "The timeout for the command in milliseconds"
},
"with_escalated_permissions": {
"type": "boolean",
"description": "Whether to request escalated permissions. Set to true if command needs to be run without sandbox restrictions"
},
"justification": {
"type": "string",
"description": "Only set if with_escalated_permissions is true. 1-sentence explanation of why we want to run this command."
}
},
"required": ["command"],
"additionalProperties": false
}
}
}
```
If you’re using Windows PowerShell, update to this tool description.
```
Runs a shell command and returns its output. The arguments you pass will be invoked via PowerShell (e.g., ["pwsh", "-NoLogo", "-NoProfile", "-Command", "<command>"]). Always fill in workdir; avoid using cd in the command string.
```
You can check out codex-cli for the implementation of `exec_command`, which launches a long-lived PTY when you need streaming output, REPLs, or interactive sessions, and `write_stdin`, which feeds extra keystrokes (or simply polls output) for an existing `exec_command` session.
### Update Plan
This is our default TODO tool; feel free to customize as you’d prefer. See the `## Plan tool` section of our starter prompt for additional instructions to maintain hygiene and tweak behavior.
```json
{
"type": "function",
"function": {
"name": "update_plan",
"description": "Updates the task plan.\nProvide an optional explanation and a list of plan items, each with a step and status.\nAt most one step can be in_progress at a time.",
"strict": false,
"parameters": {
"type": "object",
"properties": {
"explanation": {
"type": "string"
},
"plan": {
"type": "array",
"items": {
"type": "object",
"properties": {
"step": {
"type": "string"
},
"status": {
"type": "string",
"description": "One of: pending, in_progress, completed"
}
},
"additionalProperties": false,
"required": [
"step",
"status"
]
},
"description": "The list of steps"
}
},
"additionalProperties": false,
"required": [
"plan"
]
}
}
}
```
### View_image
This is a basic function used in codex-cli for the model to view images.
```
{
"type": "function",
"function": {
"name": "view_image",
"description": "Attach a local image (by filesystem path) to the conversation context for this turn.",
"strict": false,
"parameters": {
"type": "object",
"properties": {
"path": {
"type": "string",
"description": "Local filesystem path to an image file"
}
},
"additionalProperties": false,
"required": [
"path"
]
}
}
}
```
## Dedicated terminal-wrapping tools
If you would prefer your codex agent to use terminal-wrapping tools (like a dedicated `list_dir('.')` tool instead of `terminal('ls .')`), this generally works well. We see the best results when the name of the tool, the arguments, and the output are as close as possible to those of the underlying command, so it’s as in-distribution as possible for the model (which was primarily trained using a dedicated terminal tool). For example, if you notice the model using git via the terminal and would prefer it to use a dedicated tool, we found that creating a related tool and adding a directive in the prompt to only use that tool for git commands fully mitigated the model’s terminal usage for git commands.
```
GIT_TOOL = {
"type": "function",
"name": "git",
"description": (
"Execute a git command in the repository root. Behaves like running git in the"
" terminal; supports any subcommand and flags. The command can be provided as a"
" full git invocation (e.g., `git status -sb`) or just the arguments after git"
" (e.g., `status -sb`)."
),
"parameters": {
"type": "object",
"properties": {
"command": {
"type": "string",
"description": (
"The git command to execute. Accepts either a full git invocation or"
" only the subcommand/args."
),
},
"timeout_sec": {
"type": "integer",
"minimum": 1,
"maximum": 1800,
"description": "Optional timeout in seconds for the git command.",
},
},
"required": ["command"],
},
}
...
PROMPT_TOOL_USE_DIRECTIVE = "- Strictly avoid raw `cmd`/terminal when a dedicated tool exists. Default to solver tools: `git` (all git), `list_dir`, `apply_patch`. Use `cmd`/`run_terminal_cmd` only when no listed tool can perform the action." # update with your desired tools
```
## Other Custom Tools (web search, semantic search, memory, etc.)
The model hasn’t necessarily been post-trained to excel at these tools, but we have seen success here as well. To get the most out of these tools, we recommend:
1. Making the tool names and arguments as semantically “correct” as possible. For example, “search” is ambiguous, but “semantic_search” clearly indicates what the tool does relative to other potential search-related tools you might have; “query” would be a good param name for this tool (see the sketch after this list).
2. Be explicit in your prompt about when, why, and how to use these tools, including good and bad examples.
3. It could also be helpful to make the results look different from outputs the model is accustomed to seeing from other tools, for example ripgrep results should look different from semantic search results to avoid the model collapsing into old habits.
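For example, a semantic search tool following these recommendations might be defined like the sketch below. The name, description, and parameters are illustrative, in the same style as the `GIT_TOOL` definition above:
```py
SEMANTIC_SEARCH_TOOL = {
    "type": "function",
    "name": "semantic_search",
    "description": (
        "Search the codebase by meaning rather than exact text. Use this for"
        " conceptual questions (e.g., 'where is auth token refresh handled?');"
        " use `rg` for exact strings or symbols."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural-language description of the code you are looking for.",
            },
            "max_results": {
                "type": "integer",
                "minimum": 1,
                "maximum": 50,
                "description": "Optional cap on the number of results to return.",
            },
        },
        "required": ["query"],
    },
}
```
Returning results in a visibly different shape from `rg` output (for example, scored snippets with file paths) also helps the model treat this as a distinct tool rather than falling back to text-search habits.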
## Parallel Tool Calling
In codex-cli, when parallel tool calling is enabled, the responses API request sets `parallel_tool_calls: true` and the following snippet is added to the system instructions:
```
## Exploration and reading files
- **Think first.** Before any tool call, decide ALL files/resources you will need.
- **Batch everything.** If you need multiple files (even from different places), read them together.
- **multi_tool_use.parallel** Use `multi_tool_use.parallel` to parallelize tool calls and only this.
- **Only make sequential calls if you truly cannot know the next file without seeing a result first.**
- **Workflow:** (a) plan all needed reads → (b) issue one parallel batch → (c) analyze results → (d) repeat if new, unpredictable reads arise.
**Additional notes**:
- Always maximize parallelism. Never read files one-by-one unless logically unavoidable.
- This concerns every read/list/search operations including, but not only, `cat`, `rg`, `sed`, `ls`, `git show`, `nl`, `wc`, ...
- Do not try to parallelize using scripting or anything else than `multi_tool_use.parallel`.
```
We've found it to be helpful and more in-distribution if parallel tool call items and responses are ordered in the following way:
```
function_call
function_call
function_call_output
function_call_output
```
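Concretely, when assembling the input for the follow-up request, append the model’s `function_call` items first and then their outputs, keeping the same order. The variable names below are placeholders:
```py
# call_a and call_b are the two function_call items the model returned in parallel;
# output_a and output_b are the strings your harness produced by running them.
next_input = previous_items + [
    call_a,
    call_b,
    {"type": "function_call_output", "call_id": call_a.call_id, "output": output_a},
    {"type": "function_call_output", "call_id": call_b.call_id, "output": output_b},
]
```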
## Tool Response Truncation
We recommend doing tool call response truncation as follows to be as in-distribution for the model as possible:
* Limit to 10k tokens. You can cheaply approximate this by computing `num_bytes/4`.
* If you hit the truncation limit, you should use half of the budget for the beginning, half for the end, and truncate in the middle with `…3 tokens truncated…`
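A minimal sketch of that truncation scheme, using the `num_bytes/4` approximation from above:
```py
def truncate_tool_output(text: str, max_tokens: int = 10_000) -> str:
    """Keep the head and tail of an oversized tool response, trimming the middle."""
    approx_tokens = len(text.encode("utf-8")) // 4
    if approx_tokens <= max_tokens:
        return text
    # Spend half of the budget on the beginning and half on the end
    # (characters are used here as a cheap stand-in for the byte/token budget).
    budget_chars = max_tokens * 4
    head = text[: budget_chars // 2]
    tail = text[-(budget_chars // 2):]
    truncated = approx_tokens - max_tokens
    return f"{head}…{truncated} tokens truncated…{tail}"
```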
---
# Source: https://developers.openai.com/codex/ide/commands.md
# Source: https://developers.openai.com/codex/app/commands.md
# Codex app commands
Use these commands and keyboard shortcuts to navigate the Codex app.
## Keyboard shortcuts
| | Action | macOS shortcut |
| ----------- | ------------------ | --------------------------------------------------------------------------------- |
| **General** | | |
| | Command menu | Cmd + Shift + P or Cmd + K |
| | Settings | Cmd + , |
| | Open folder | Cmd + O |
| | Navigate back | Cmd + [ |
| | Navigate forward | Cmd + ] |
| | Increase font size | Cmd + + or Cmd + = |
| | Decrease font size | Cmd + - or Cmd + _ |
| | Toggle sidebar | Cmd + B |
| | Toggle diff panel | Cmd + Option + B |
| | Toggle terminal | Cmd + J |
| | Clear the terminal | Ctrl + L |
| **Thread** | | |
| | New thread | Cmd + N or Cmd + Shift + O |
| | Find in thread | Cmd + F |
| | Previous thread | Cmd + Shift + [ |
| | Next thread | Cmd + Shift + ] |
| | Dictation | Ctrl + M |
## Slash commands
Slash commands let you control Codex without leaving the thread composer. Available commands vary based on your environment and access.
### Use a slash command
1. In the thread composer, type `/`.
2. Select a command from the list, or keep typing to filter (for example, `/status`).
You can also explicitly invoke skills by typing `$` in the thread composer. See [Skills](https://developers.openai.com/codex/skills).
Enabled skills also appear in the slash command list (for example, `/imagegen`).
### Available slash commands
| Slash command | Description |
| ------------- | -------------------------------------------------------------------------------------- |
| `/feedback` | Open the feedback dialog to submit feedback and optionally include logs. |
| `/mcp` | Open MCP status to view connected servers. |
| `/plan-mode` | Toggle plan mode for multi-step planning. |
| `/review` | Start code review mode to review uncommitted changes or compare against a base branch. |
| `/status` | Show the thread ID, context usage, and rate limits. |
## See also
- [Features](https://developers.openai.com/codex/app/features)
- [Settings](https://developers.openai.com/codex/app/settings)
---
# Source: https://developers.openai.com/cookbook/examples/evaluation/use-cases/completion-monitoring.md
# Evaluations Example: Push Notifications Summarizer Monitoring
Evals are **task-oriented** and iterative; they're the best way to check how your LLM integration is doing and to improve it.
In the following eval, we are going to focus on the task of **detecting our prompt changes for regressions**.
Our use-case is:
1. We have been logging chat completion requests by setting `store=True` in our production Chat Completions calls. Note that you can also enable "on by default" logging in your admin panel (https://platform.openai.com/settings/organization/data-controls/data-retention).
2. We want to see whether our prompt changes have introduced regressions.
## Evals structure
Evals have two parts, the "Eval" and the "Run". An "Eval" holds the configuration for your testing criteria and the structure of the data for your "Runs". An Eval can have many Runs, which are each evaluated using your testing criteria.
```python
from openai import AsyncOpenAI
import os
import asyncio
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")
client = AsyncOpenAI()
```
## Use-case
We're testing the following integration: a push notification summarizer that takes in multiple push notifications and collapses them into a single one, via a Chat Completions call.
# Generate our test data
I'm going to produce simulated production chat completions requests with two different prompt versions to test how each performs. The first is a "good" prompt; the second is a "bad" prompt. These will have different metadata, which we'll use later.
```python
push_notification_data = [
"""
- New message from Sarah: "Can you call me later?"
- Your package has been delivered!
- Flash sale: 20% off electronics for the next 2 hours!
""",
"""
- Weather alert: Thunderstorm expected in your area.
- Reminder: Doctor's appointment at 3 PM.
- John liked your photo on Instagram.
""",
"""
- Breaking News: Local elections results are in.
- Your daily workout summary is ready.
- Check out your weekly screen time report.
""",
"""
- Your ride is arriving in 2 minutes.
- Grocery order has been shipped.
- Don't miss the season finale of your favorite show tonight!
""",
"""
- Event reminder: Concert starts at 7 PM.
- Your favorite team just scored!
- Flashback: Memories from 3 years ago.
""",
"""
- Low battery alert: Charge your device.
- Your friend Mike is nearby.
- New episode of "The Tech Hour" podcast is live!
""",
"""
- System update available.
- Monthly billing statement is ready.
- Your next meeting starts in 15 minutes.
""",
"""
- Alert: Unauthorized login attempt detected.
- New comment on your blog post: "Great insights!"
- Tonight's dinner recipe: Pasta Primavera.
""",
"""
- Special offer: Free coffee with any breakfast order.
- Your flight has been delayed by 30 minutes.
- New movie release: "Adventures Beyond" now streaming.
""",
"""
- Traffic alert: Accident reported on Main Street.
- Package out for delivery: Expected by 5 PM.
- New friend suggestion: Connect with Emma.
"""]
```
```python
PROMPTS = [
(
"""
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
Output only the final summary, nothing else.
""",
"v1"
),
(
"""
You are a helpful assistant that summarizes push notifications.
You are given a list of push notifications and you need to collapse them into a single one.
The summary should be longer than it needs to be and include more information than is necessary.
Output only the final summary, nothing else.
""",
"v2"
)
]
tasks = []
for notifications in push_notification_data:
for (prompt, version) in PROMPTS:
tasks.append(client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "developer", "content": prompt},
{"role": "user", "content": notifications},
],
store=True,
metadata={"prompt_version": version, "usecase": "push_notifications_summarizer"},
))
await asyncio.gather(*tasks)
```
You can view the completions you just created at https://platform.openai.com/logs.
**Make sure that the chat completions show up, as they are necessary for the next step.**
```python
completions = await client.chat.completions.list()
assert completions.data, "No completions found. You may need to enable logs in your admin panel."
completions.data[0]
```
# Setting up your eval
An Eval holds the configuration that is shared across multiple *Runs*; it has two components:
1. Data source configuration `data_source_config` - the schema (columns) that your future *Runs* conform to.
- The `data_source_config` uses JSON Schema to define what variables are available in the Eval.
2. Testing Criteria `testing_criteria` - How you'll determine if your integration is working for each *row* of your data source.
For this use case, we're using stored completions, so we'll set up that `data_source_config`.
**Important**
You are likely to have many different stored-completions use cases; metadata is the best way to keep track of them, keeping your evals focused and task-oriented.
```python
# For stored completions, we scope the eval to our use case via a metadata filter.
data_source_config = {
"type": "stored_completions",
"metadata": {
"usecase": "push_notifications_summarizer"
}
}
```
This data_source_config defines what variables are available throughout the eval.
The stored completions config provides two variables for you to use throughout your eval:
1. {{item.input}} - the messages sent to the completions call
2. {{sample.output_text}} - the text response from the assistant
**Now, we'll use those variables to set up our test criteria.**
```python
GRADER_DEVELOPER_PROMPT = """
Label the following push notification summary as either correct or incorrect.
The push notification and the summary will be provided below.
A good push notification summary is concise and snappy.
If it is good, then label it as correct, if not, then incorrect.
"""
GRADER_TEMPLATE_PROMPT = """
Push notifications: {{item.input}}
Summary: {{sample.output_text}}
"""
push_notification_grader = {
"name": "Push Notification Summary Grader",
"type": "label_model",
"model": "o3-mini",
"input": [
{
"role": "developer",
"content": GRADER_DEVELOPER_PROMPT,
},
{
"role": "user",
"content": GRADER_TEMPLATE_PROMPT,
},
],
"passing_labels": ["correct"],
"labels": ["correct", "incorrect"],
}
```
The `push_notification_grader` is a model grader (llm-as-a-judge), which looks at the input `{{item.input}}` and the generated summary `{{sample.output_text}}` and labels it as "correct" or "incorrect".
Note: under the hood, this uses structured outputs so that labels are always valid.
**Now we'll create our eval and start adding data to it!**
```python
eval_create_result = await client.evals.create(
name="Push Notification Completion Monitoring",
metadata={"description": "This eval monitors completions"},
data_source_config=data_source_config,
testing_criteria=[push_notification_grader],
)
eval_id = eval_create_result.id
```
# Creating runs
Now that we have our eval set up with our testing criteria, we can start adding runs.
I want to compare the performance between my two **prompt versions**.
To do this, we just define our source as "stored_completions" with a metadata filter for each of our prompt versions.
```python
# Grade prompt_version=v1
eval_run_result = await client.evals.runs.create(
eval_id=eval_id,
name="v1-run",
data_source={
"type": "completions",
"source": {
"type": "stored_completions",
"metadata": {
"prompt_version": "v1",
}
}
}
)
print(eval_run_result.report_url)
```
```python
# Grade prompt_version=v2
eval_run_result_v2 = await client.evals.runs.create(
eval_id=eval_id,
name="v2-run",
data_source={
"type": "completions",
"source": {
"type": "stored_completions",
"metadata": {
"prompt_version": "v2",
}
}
}
)
print(eval_run_result_v2.report_url)
```
Just to be thorough, let's see how this prompt would do with gpt-4o instead of gpt-4o-mini, with both prompt versions as the starting point.
All we have to do is reference the input messages ({{item.input}}) and set the model to gpt-4o. Since we don't already have any stored completions for gpt-4o, this eval run will generate new completions.
```python
tasks = []
for prompt_version in ["v1", "v2"]:
tasks.append(client.evals.runs.create(
eval_id=eval_id,
name=f"post-fix-new-model-run-{prompt_version}",
data_source={
"type": "completions",
"input_messages": {
"type": "item_reference",
"item_reference": "item.input",
},
"model": "gpt-4o",
"source": {
"type": "stored_completions",
"metadata": {
"prompt_version": prompt_version,
}
}
},
))
result = await asyncio.gather(*tasks)
for run in result:
print(run.report_url)
```
If you view those reports, you'll see that prompt_version=v2 has a regression!
## Congratulations, you just discovered a bug! You could revert it, make another prompt change, and so on.
---
# Source: https://developers.openai.com/resources/cookbook/completions-usage-api.md
# How to use the Usage API and Cost API to monitor your OpenAI usage
> Cookbook to fetch and visualize Completions Usage and cost data via API.
- Type: Cookbook
- Tags: cost-api, usage-api
- URL: /cookbook/examples/completions_usage_api
- Created: 2025-01-14
- Updated: 2025-01-14
## Summary
Cookbook to fetch and visualize Completions Usage and cost data via API.
## Details
Cookbook to fetch and visualize Completions Usage and cost data via API.
---
# Source: https://developers.openai.com/cookbook/examples/completions_usage_api.md
# OpenAI Completions Usage API Extended Example
For most of our users, the [default usage and cost dashboards](https://platform.openai.com/usage) are sufficient. However, if you need more detailed data or a custom dashboard, you can use the Completions Usage API.
This notebook demonstrates how to retrieve and visualize usage data from the OpenAI Completions Usage API and Costs API. We'll:
- Call the API to get completions usage data.
- Parse the JSON response into a pandas DataFrame.
- Visualize token usage over time using matplotlib.
- Use grouping by model to analyze token usage across different models.
- Display model distribution with a pie chart.
We also include placeholders for all possible API parameters for a comprehensive overview.
```python
# Install required libraries (if not already installed)
!pip install requests pandas numpy matplotlib --quiet
# Import libraries
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import time
import json
# For inline plotting in Jupyter
%matplotlib inline
```
## Setup API Credentials and Parameters
Set up an Admin Key - https://platform.openai.com/settings/organization/admin-keys
Replace `'PLACEHOLDER'` with your actual ADMIN API key. It's best practice to load the key from an environment variable for security.
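For example, you could read the admin key from an environment variable rather than hard-coding it (the variable name is just a convention):
```python
import os

# Prefer an environment variable over a hard-coded key.
OPENAI_ADMIN_KEY = os.environ.get("OPENAI_ADMIN_KEY", "PLACEHOLDER")
```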
```python
# Reusable function for retrieving paginated data from the API
def get_data(url, params):
# Set up the API key and headers
OPENAI_ADMIN_KEY = 'PLACEHOLDER'
headers = {
"Authorization": f"Bearer {OPENAI_ADMIN_KEY}",
"Content-Type": "application/json",
}
# Initialize an empty list to store all data
all_data = []
# Initialize pagination cursor
page_cursor = None
# Loop to handle pagination
while True:
if page_cursor:
params["page"] = page_cursor
response = requests.get(url, headers=headers, params=params)
if response.status_code == 200:
data_json = response.json()
all_data.extend(data_json.get("data", []))
page_cursor = data_json.get("next_page")
if not page_cursor:
break
else:
print(f"Error: {response.status_code}")
break
if all_data:
print("Data retrieved successfully!")
else:
print("Issue: No data available to retrieve.")
return all_data
```
```python
# Define the API endpoint
url = "https://api.openai.com/v1/organization/usage/completions"
# Calculate start time: n days ago from now
days_ago = 30
start_time = int(time.time()) - (days_ago * 24 * 60 * 60)
# Define parameters with placeholders for all possible options
params = {
"start_time": start_time, # Required: Start time (Unix seconds)
# "end_time": end_time, # Optional: End time (Unix seconds)
"bucket_width": "1d", # Optional: '1m', '1h', or '1d' (default '1d')
# "project_ids": ["proj_example"], # Optional: List of project IDs
# "user_ids": ["user_example"], # Optional: List of user IDs
# "api_key_ids": ["key_example"], # Optional: List of API key IDs
# "models": ["o1-2024-12-17", "gpt-4o-2024-08-06", "gpt-4o-mini-2024-07-18"], # Optional: List of models
# "batch": False, # Optional: True for batch jobs, False for non-batch
# "group_by": ["model"], # Optional: Fields to group by
"limit": 7, # Optional: Number of buckets to return, this will chunk the data into 7 buckets
# "page": "cursor_string" # Optional: Cursor for pagination
}
usage_data = get_data(url, params)
```
```text
Data retrieved successfully!
```
## Inspect the JSON Response
Let's take a look at the raw JSON response from the API to understand its structure.
```python
print(json.dumps(usage_data, indent=2))
```
_Raw JSON output omitted from the markdown export._
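Because the raw response isn't reproduced in this export, here is an illustrative sketch of the bucket shape that the parsing code below expects. The field names come from that parsing code; the numbers are placeholders copied from the sample rows shown later, not fresh API output:
```python
# Illustrative shape of one usage bucket (placeholder values, not real output).
example_bucket = {
    "start_time": 1736616660,  # Unix seconds for the start of the bucket
    "end_time": 1736640000,    # Unix seconds for the end of the bucket
    "results": [
        {
            "input_tokens": 141201,
            "output_tokens": 9756,
            "input_cached_tokens": 0,
            "input_audio_tokens": 0,
            "output_audio_tokens": 0,
            "num_model_requests": 470,
            "project_id": None,  # populated only when group_by includes project_id
            "user_id": None,
            "api_key_id": None,
            "model": None,       # populated only when group_by includes model
            "batch": None,
        }
    ],
}
```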
## Parse the API Response and Create a DataFrame
Now we will parse the JSON data, extract relevant fields, and create a pandas DataFrame for easier manipulation and analysis.
```python
# Initialize a list to hold parsed records
records = []
# Iterate through the data to extract bucketed data
for bucket in usage_data:
start_time = bucket.get("start_time")
end_time = bucket.get("end_time")
for result in bucket.get("results", []):
records.append(
{
"start_time": start_time,
"end_time": end_time,
"input_tokens": result.get("input_tokens", 0),
"output_tokens": result.get("output_tokens", 0),
"input_cached_tokens": result.get("input_cached_tokens", 0),
"input_audio_tokens": result.get("input_audio_tokens", 0),
"output_audio_tokens": result.get("output_audio_tokens", 0),
"num_model_requests": result.get("num_model_requests", 0),
"project_id": result.get("project_id"),
"user_id": result.get("user_id"),
"api_key_id": result.get("api_key_id"),
"model": result.get("model"),
"batch": result.get("batch"),
}
)
# Create a DataFrame from the records
df = pd.DataFrame(records)
# Convert Unix timestamps to datetime for readability
df["start_datetime"] = pd.to_datetime(df["start_time"], unit="s")
df["end_datetime"] = pd.to_datetime(df["end_time"], unit="s")
# Reorder columns for better readability
df = df[
[
"start_datetime",
"end_datetime",
"start_time",
"end_time",
"input_tokens",
"output_tokens",
"input_cached_tokens",
"input_audio_tokens",
"output_audio_tokens",
"num_model_requests",
"project_id",
"user_id",
"api_key_id",
"model",
"batch",
]
]
# Display the DataFrame
df.head()
```
| | start_datetime | end_datetime | start_time | end_time | input_tokens | output_tokens | input_cached_tokens | input_audio_tokens | output_audio_tokens | num_model_requests | project_id | user_id | api_key_id | model | batch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2025-01-11 17:31:00 | 2025-01-12 | 1736616660 | 1736640000 | 141201 | 9756 | 0 | 0 | 0 | 470 | None | None | None | None | None |
| 1 | 2025-01-12 00:00:00 | 2025-01-13 | 1736640000 | 1736726400 | 45949 | 282 | 0 | 0 | 0 | 150 | None | None | None | None | None |
| 2 | 2025-01-13 00:00:00 | 2025-01-14 | 1736726400 | 1736812800 | 3718360 | 97756 | 76544 | 5776 | 3166 | 3053 | None | None | None | None | None |
| 3 | 2025-01-14 00:00:00 | 2025-01-15 | 1736812800 | 1736899200 | 52786 | 38204 | 5440 | 4066 | 1097 | 157 | None | None | None | None | None |
| 4 | 2025-01-15 00:00:00 | 2025-01-16 | 1736899200 | 1736985600 | 35664 | 1835 | 192 | 2520 | 1549 | 55 | None | None | None | None | None |
## Visualize Token Usage Over Time
We'll create a bar chart to visualize input and output token usage for each time bucket.
```python
if not df.empty:
plt.figure(figsize=(12, 6))
# Create bar charts for input and output tokens
width = 0.35 # width of the bars
indices = range(len(df))
plt.bar(indices, df["input_tokens"], width=width, label="Input Tokens", alpha=0.7)
plt.bar(
[i + width for i in indices],
df["output_tokens"],
width=width,
label="Output Tokens",
alpha=0.7,
)
# Set labels and ticks
plt.xlabel("Time Bucket")
plt.ylabel("Number of Tokens")
plt.title("Daily Input vs Output Token Usage Last 30 Days")
plt.xticks(
[i + width / 2 for i in indices],
[dt.strftime("%Y-%m-%d") for dt in df["start_datetime"]],
rotation=45,
)
plt.legend()
plt.tight_layout()
plt.show()
else:
print("No data available to plot.")
```

## Visual Example: Grouping by Model
In this section, we retrieve and visualize usage data grouped by model and project_id. This can help you see the total tokens used by each model over the specified period.
### Note on Grouping Parameter
- If you do not specify a `group_by` parameter, fields such as `project_id`, `model`, and others will return as `null`. Although `group_by` is optional, it is recommended to include it in most cases to retrieve meaningful data.
- You can specify multiple group fields by separating them with commas. For example: `group_by=["model", "project_id"]`.
```python
# Calculate start time: n days ago from now
days_ago = 30
start_time = int(time.time()) - (days_ago * 24 * 60 * 60)
# Define parameters with grouping by model and project_id
params = {
"start_time": start_time, # Required: Start time (Unix seconds)
"bucket_width": "1d", # Optional: '1m', '1h', or '1d' (default '1d')
"group_by": ["model", "project_id"], # Group data by model and project_id
"limit": 7, # Optional: Number of buckets to return
}
# Initialize an empty list to store all data
all_group_data = get_data(url, params)
# Initialize a list to hold parsed records
records = []
# Iterate through the data to extract bucketed data
for bucket in all_group_data:
start_time = bucket.get("start_time")
end_time = bucket.get("end_time")
for result in bucket.get("results", []):
records.append(
{
"start_time": start_time,
"end_time": end_time,
"input_tokens": result.get("input_tokens", 0),
"output_tokens": result.get("output_tokens", 0),
"input_cached_tokens": result.get("input_cached_tokens", 0),
"input_audio_tokens": result.get("input_audio_tokens", 0),
"output_audio_tokens": result.get("output_audio_tokens", 0),
"num_model_requests": result.get("num_model_requests", 0),
"project_id": result.get("project_id", "N/A"),
"user_id": result.get("user_id", "N/A"),
"api_key_id": result.get("api_key_id", "N/A"),
"model": result.get("model", "N/A"),
"batch": result.get("batch", "N/A"),
}
)
# Create a DataFrame from the records
df = pd.DataFrame(records)
# Convert Unix timestamps to datetime for readability
df["start_datetime"] = pd.to_datetime(df["start_time"], unit="s", errors="coerce")
df["end_datetime"] = pd.to_datetime(df["end_time"], unit="s", errors="coerce")
# Reorder columns for better readability
df = df[
[
"start_datetime",
"end_datetime",
"start_time",
"end_time",
"input_tokens",
"output_tokens",
"input_cached_tokens",
"input_audio_tokens",
"output_audio_tokens",
"num_model_requests",
"project_id",
"user_id",
"api_key_id",
"model",
"batch",
]
]
# Display the DataFrame
df.head()
```
```text
Data retrieved successfully!
```
| | start_datetime | end_datetime | start_time | end_time | input_tokens | output_tokens | input_cached_tokens | input_audio_tokens | output_audio_tokens | num_model_requests | project_id | user_id | api_key_id | model | batch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2025-01-11 17:31:39 | 2025-01-12 | 1736616699 | 1736640000 | 6897 | 97 | 0 | 0 | 0 | 97 | proj_hNhhQzyYu7HxySZWs7cA3Ugu | None | None | gpt-4o-mini-2024-07-18 | None |
| 1 | 2025-01-11 17:31:39 | 2025-01-12 | 1736616699 | 1736640000 | 33984 | 206 | 0 | 0 | 0 | 95 | proj_hNhhQzyYu7HxySZWs7cA3Ugu | None | None | ft:gpt-4o-2024-08-06:distillation-test:wordle2... | None |
| 2 | 2025-01-11 17:31:39 | 2025-01-12 | 1736616699 | 1736640000 | 2846 | 8874 | 0 | 0 | 0 | 8 | proj_hNhhQzyYu7HxySZWs7cA3Ugu | None | None | o1-mini-2024-09-12 | None |
| 3 | 2025-01-11 17:31:39 | 2025-01-12 | 1736616699 | 1736640000 | 97474 | 579 | 0 | 0 | 0 | 270 | proj_hNhhQzyYu7HxySZWs7cA3Ugu | None | None | gpt-4o-2024-08-06 | None |
| 4 | 2025-01-12 00:00:00 | 2025-01-13 | 1736640000 | 1736726400 | 1989 | 28 | 0 | 0 | 0 | 28 | proj_hNhhQzyYu7HxySZWs7cA3Ugu | None | None | gpt-4o-mini-2024-07-18 | None |
## Parse the API Response into a DataFrame and Render a Stacked Bar Chart
Next, we group the parsed records by model and project ID, aggregate the model request counts, and render a stacked bar chart.
```python
# Group data by model and project_id and aggregate model request counts
grouped_by_model_project = (
df.groupby(["model", "project_id"])
.agg(
{
"num_model_requests": "sum",
}
)
.reset_index()
)
# Determine unique models and project IDs for plotting and color mapping
models = sorted(grouped_by_model_project["model"].unique())
project_ids = sorted(grouped_by_model_project["project_id"].unique())
distinct_colors = [
"#1f77b4",
"#ff7f0e",
"#2ca02c",
"#d62728",
"#9467bd",
"#8c564b",
"#e377c2",
"#7f7f7f",
"#bcbd22",
"#17becf",
]
project_color_mapping = {
pid: distinct_colors[i % len(distinct_colors)] for i, pid in enumerate(project_ids)
}
# Calculate total number of requests per project_id for legend
project_totals = (
grouped_by_model_project.groupby("project_id")["num_model_requests"]
.sum()
.sort_values(ascending=False) # Sort by highest total first
)
# Set up bar positions
n_models = len(models)
bar_width = 0.6
x = np.arange(n_models)
plt.figure(figsize=(12, 6))
# Plot stacked bars for each model
for model_idx, model in enumerate(models):
# Filter data for the current model
model_data = grouped_by_model_project[grouped_by_model_project["model"] == model]
bottom = 0
# Stack segments for each project ID within the bars
for _, row in model_data.iterrows():
color = project_color_mapping[row["project_id"]]
plt.bar(
x[model_idx],
row["num_model_requests"],
width=bar_width,
bottom=bottom,
color=color,
)
bottom += row["num_model_requests"]
# Labeling and styling
plt.xlabel("Model")
plt.ylabel("Number of Model Requests")
plt.title("Total Model Requests by Model and Project ID Last 30 Days")
plt.xticks(x, models, rotation=45, ha="right")
# Create a sorted legend with totals
handles = [
mpatches.Patch(color=project_color_mapping[pid], label=f"{pid} (Total: {total})")
for pid, total in project_totals.items()
]
plt.legend(handles=handles, bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()
```

## Visual Example: Model Distribution Pie Chart
This section visualizes the distribution of token usage across different models using a pie chart.
```python
records = []
for bucket in all_group_data:
for result in bucket.get("results", []):
records.append(
{
"project_id": result.get("project_id", "N/A"),
"num_model_requests": result.get("num_model_requests", 0),
}
)
# Create a DataFrame
df = pd.DataFrame(records)
# Aggregate data by project_id
grouped_by_project = (
df.groupby("project_id").agg({"num_model_requests": "sum"}).reset_index()
)
# Visualize Pie Chart
if not grouped_by_project.empty:
# Filter out rows where num_model_requests == 0
filtered_grouped_by_project = grouped_by_project[
grouped_by_project["num_model_requests"] > 0
]
# Calculate the total model requests after filtering
total_requests = filtered_grouped_by_project["num_model_requests"].sum()
if total_requests > 0:
# Calculate percentage of total for each project
filtered_grouped_by_project["percentage"] = (
filtered_grouped_by_project["num_model_requests"] / total_requests
) * 100
# Separate "Other" projects (below 5%)
other_projects = filtered_grouped_by_project[
filtered_grouped_by_project["percentage"] < 5
]
main_projects = filtered_grouped_by_project[
filtered_grouped_by_project["percentage"] >= 5
]
# Sum up "Other" projects
if not other_projects.empty:
other_row = pd.DataFrame(
{
"project_id": ["Other"],
"num_model_requests": [other_projects["num_model_requests"].sum()],
"percentage": [other_projects["percentage"].sum()],
}
)
filtered_grouped_by_project = pd.concat(
[main_projects, other_row], ignore_index=True
)
# Sort by number of requests for better legend organization
filtered_grouped_by_project = filtered_grouped_by_project.sort_values(
by="num_model_requests", ascending=False
)
# Main pie chart for distribution of model requests by project_id
plt.figure(figsize=(10, 8))
plt.pie(
filtered_grouped_by_project["num_model_requests"],
labels=filtered_grouped_by_project["project_id"],
autopct=lambda p: f"{p:.1f}%\n({int(p * total_requests / 100):,})",
startangle=140,
textprops={"fontsize": 10},
)
plt.title("Distribution of Model Requests by Project ID", fontsize=14)
plt.axis("equal") # Equal aspect ratio ensures pie chart is circular.
plt.tight_layout()
plt.show()
# If there are "Other" projects, generate a second pie chart for breakdown
if not other_projects.empty:
other_total_requests = other_projects["num_model_requests"].sum()
plt.figure(figsize=(10, 8))
plt.pie(
other_projects["num_model_requests"],
labels=other_projects["project_id"],
autopct=lambda p: f"{p:.1f}%\n({int(p * other_total_requests / 100):,})",
startangle=140,
textprops={"fontsize": 10},
)
plt.title('Breakdown of "Other" Projects by Model Requests', fontsize=14)
plt.axis("equal") # Equal aspect ratio ensures pie chart is circular.
plt.tight_layout()
plt.show()
else:
print("Total model requests is zero. Pie chart will not be rendered.")
else:
print("No grouped data available for pie chart.")
```


## Costs API Example
In this section, we'll work with the OpenAI Costs API to retrieve and visualize cost data. Similar to the completions data, we'll:
- Call the Costs API to get aggregated cost data.
- Parse the JSON response into a pandas DataFrame.
- Visualize costs grouped by line item using a bar chart.
```python
# Calculate start time: n days ago from now
days_ago = 30
start_time = int(time.time()) - (days_ago * 24 * 60 * 60)
# Define the Costs API endpoint
costs_url = "https://api.openai.com/v1/organization/costs"
costs_params = {
"start_time": start_time, # Required: Start time (Unix seconds)
"bucket_width": "1d", # Optional: Currently only '1d' is supported
"limit": 30, # Optional: Number of buckets to return
}
# Initialize an empty list to store all data
all_costs_data = get_data(costs_url, costs_params)
```
```text
Data retrieved successfully!
```
```python
print(json.dumps(all_costs_data, indent=2))
```
_Raw JSON output omitted from the markdown export._
## Parse the Costs API Response and Create a DataFrame
We will now parse the JSON data from the Costs API, extract relevant fields, and create a pandas DataFrame for further analysis.
```python
# Initialize a list to hold parsed cost records
cost_records = []
# Extract bucketed cost data from all_costs_data
for bucket in all_costs_data:
start_time = bucket.get("start_time")
end_time = bucket.get("end_time")
for result in bucket.get("results", []):
cost_records.append(
{
"start_time": start_time,
"end_time": end_time,
"amount_value": result.get("amount", {}).get("value", 0),
"currency": result.get("amount", {}).get("currency", "usd"),
"line_item": result.get("line_item"),
"project_id": result.get("project_id"),
}
)
# Create a DataFrame from the cost records
cost_df = pd.DataFrame(cost_records)
# Convert Unix timestamps to datetime for readability
cost_df["start_datetime"] = pd.to_datetime(cost_df["start_time"], unit="s")
cost_df["end_datetime"] = pd.to_datetime(cost_df["end_time"], unit="s")
# Display the first few rows of the DataFrame
cost_df.head()
```
| | start_time | end_time | amount_value | currency | line_item | project_id | start_datetime | end_datetime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1736553600 | 1736640000 | 0.130804 | usd | None | None | 2025-01-11 | 2025-01-12 |
| 1 | 1736640000 | 1736726400 | 0.122704 | usd | None | None | 2025-01-12 | 2025-01-13 |
| 2 | 1736726400 | 1736812800 | 9.888144 | usd | None | None | 2025-01-13 | 2025-01-14 |
| 3 | 1736812800 | 1736899200 | 0.350764 | usd | None | None | 2025-01-14 | 2025-01-15 |
| 4 | 1736899200 | 1736985600 | 0.297748 | usd | None | None | 2025-01-15 | 2025-01-16 |
## Visualize Total Costs per Day
We'll create a bar chart to visualize the total costs aggregated by day. This gives a high-level perspective on organizational spend.
```python
if not cost_df.empty:
# Ensure datetime conversion for 'start_datetime' column
if (
"start_datetime" not in cost_df.columns
or not pd.api.types.is_datetime64_any_dtype(cost_df["start_datetime"])
):
cost_df["start_datetime"] = pd.to_datetime(
cost_df["start_time"], unit="s", errors="coerce"
)
# Create a new column for just the date part of 'start_datetime'
cost_df["date"] = cost_df["start_datetime"].dt.date
# Group by date and sum the amounts
cost_per_day = cost_df.groupby("date")["amount_value"].sum().reset_index()
# Plot the data
plt.figure(figsize=(12, 6))
plt.bar(
cost_per_day["date"],
cost_per_day["amount_value"],
width=0.6,
color="skyblue",
alpha=0.8,
)
plt.xlabel("Date")
plt.ylabel("Total Cost (USD)")
plt.title("Total Cost per Day (Last 30 Days)")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
else:
print("No cost data available to plot.")
```

## Visualize Costs by Line Item
We'll create a bar chart to visualize the total costs aggregated by line item. This helps identify which categories (e.g., models or other services) contribute most to the expenses.
```python
days_ago = 30
start_time = int(time.time()) - (days_ago * 24 * 60 * 60)
costs_params = {
"start_time": start_time, # Required: Start time (Unix seconds)
"bucket_width": "1d", # Optional: Currently only '1d' is supported
"limit": 30, # Optional: Number of buckets to return
"group_by": ["line_item"],
}
line_item_cost_data = get_data(costs_url, costs_params)
# Initialize a list to hold parsed cost records
cost_records = []
# Extract bucketed cost data from all_costs_data
for bucket in line_item_cost_data:
start_time = bucket.get("start_time")
end_time = bucket.get("end_time")
for result in bucket.get("results", []):
cost_records.append(
{
"start_time": start_time,
"end_time": end_time,
"amount_value": result.get("amount", {}).get("value", 0),
"currency": result.get("amount", {}).get("currency", "usd"),
"line_item": result.get("line_item"),
"project_id": result.get("project_id"),
}
)
# Create a DataFrame from the cost records
cost_df = pd.DataFrame(cost_records)
# Convert Unix timestamps to datetime for readability
cost_df["start_datetime"] = pd.to_datetime(cost_df["start_time"], unit="s")
cost_df["end_datetime"] = pd.to_datetime(cost_df["end_time"], unit="s")
# Display the first few rows of the DataFrame
cost_df.head()
```
```text
Data retrieved successfully!
```
| | start_time | end_time | amount_value | currency | line_item | project_id | start_datetime | end_datetime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1736553600 | 1736640000 | 0.127440 | usd | ft-gpt-4o-2024-08-06, input | proj_hNhhQzyYu7HxySZWs7cA3Ugu | 2025-01-11 | 2025-01-12 |
| 1 | 1736553600 | 1736640000 | 0.003090 | usd | ft-gpt-4o-2024-08-06, output | proj_hNhhQzyYu7HxySZWs7cA3Ugu | 2025-01-11 | 2025-01-12 |
| 2 | 1736553600 | 1736640000 | 0.000271 | usd | assistants api \| file search | proj_L67gOme4S2nBA8aQieEOwLy7 | 2025-01-11 | 2025-01-12 |
| 3 | 1736553600 | 1736640000 | 0.000003 | usd | assistants api \| file search | proj_VV4ZAjd6ALfFd9uh0vY8joR1 | 2025-01-11 | 2025-01-12 |
| 4 | 1736640000 | 1736726400 | 0.028607 | usd | evals \| gpt-4o-mini-2024-07-18, input | proj_L67gOme4S2nBA8aQieEOwLy7 | 2025-01-12 | 2025-01-13 |
```python
if not cost_df.empty:
# Ensure datetime conversion for 'start_datetime' column
if "start_datetime" not in cost_df.columns or not pd.api.types.is_datetime64_any_dtype(cost_df["start_datetime"]):
cost_df["start_datetime"] = pd.to_datetime(cost_df["start_time"], unit="s", errors="coerce")
# Create a new column for just the date part of 'start_datetime'
cost_df["date"] = cost_df["start_datetime"].dt.date
# Group by date and line_item and sum the amounts
cost_per_day = cost_df.groupby(["date", "line_item"])["amount_value"].sum().reset_index()
# Pivot the DataFrame so each date has one bar with line_item stacks
cost_pivot = cost_per_day.pivot(index="date", columns="line_item", values="amount_value").fillna(0)
cost_pivot = cost_pivot.sort_index()
# Plot a stacked bar chart with one bar for each grouped day
plt.figure(figsize=(12, 6))
ax = cost_pivot.plot(kind="bar", stacked=True, ax=plt.gca(), width=0.8)
plt.xlabel("Date")
plt.ylabel("Total Cost (USD)")
plt.title("Total Cost by Line Item")
plt.xticks(rotation=45, ha="right")
# Update legend so it doesn't overlay the graph by placing it outside the plot area
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0.)
plt.tight_layout()
plt.show()
else:
print("No cost data available to plot.")
```
```text
/var/folders/r_/g8r2dz8s2qd104th5p5yxljr0000gp/T/ipykernel_49468/2813361465.py:25: UserWarning: Tight layout not applied. The bottom and top margins cannot be made large enough to accommodate all Axes decorations.
plt.tight_layout()
```

## Additional Visualizations (Optional)
You can extend this notebook with more visualizations for both the Completions and Costs APIs. For example:
**Completions API:**
- Group by user, project, or model to see which ones consume the most tokens.
- Create line plots for time series analysis of token usage over days or hours.
- Use pie charts to visualize distribution of tokens across models, users, or projects.
- Experiment with different `group_by` parameters (e.g., `["model", "user_id"]`) to gain deeper insights.
**Costs API:**
- Group by project or line item to identify spending patterns.
- Create line or bar charts to visualize daily cost trends.
- Use pie charts to show how costs are distributed across projects, services, or line items.
- Try various `group_by` options (e.g., `["project_id"]`, `["line_item"]`) for granular analysis.
Experiment with different parameters and visualization techniques using `pandas` and `matplotlib` (or libraries like Plotly/Bokeh) to gain deeper insights, and consider integrating these visualizations into interactive dashboards for real-time monitoring.
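As one concrete sketch of the time-series suggestion above, the snippet below plots daily input and output tokens as lines. It assumes the ungrouped completions DataFrame built earlier in this notebook (with `start_datetime`, `input_tokens`, and `output_tokens` columns) is still in memory as `df`; because later cells reassign `df`, the guard re-checks for those columns first.
```python
# Line plot of daily input vs output tokens, reusing the completions DataFrame from above.
required_cols = {"start_datetime", "input_tokens", "output_tokens"}
if not df.empty and required_cols <= set(df.columns):
    daily_tokens = (
        df.assign(date=df["start_datetime"].dt.date)
        .groupby("date")[["input_tokens", "output_tokens"]]
        .sum()
    )
    ax = daily_tokens.plot(figsize=(12, 6), marker="o")
    ax.set_xlabel("Date")
    ax.set_ylabel("Number of Tokens")
    ax.set_title("Daily Input vs Output Token Usage (Line Plot)")
    plt.tight_layout()
    plt.show()
else:
    print("Completions DataFrame with token columns is not available.")
```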
## Integrating with Third-Party Dashboarding Platforms
To bring OpenAI usage and cost data into external dashboarding tools like Tableau, Power BI, or custom platforms (e.g., Plotly Dash, Bokeh), follow these steps:
1. **Data Collection & Preparation:**
- Use Python scripts to regularly fetch data from the Completions and Costs APIs.
- Process and aggregate the data with pandas, then store it in a database, data warehouse, or export it as CSV/JSON files (see the export sketch after this list).
2. **Connecting to a Dashboard:**
- **BI Tools (Tableau, Power BI):**
- Connect directly to the prepared data source (SQL database, CSV files, or web APIs).
- Use built-in connectors to schedule data refreshes, ensuring dashboards always display current information.
- **Custom Dashboards (Plotly Dash, Bokeh):**
- Embed API calls and data processing into the dashboard code.
- Build interactive visual components that automatically update as new data is fetched.
3. **Real-Time & Automated Updates:**
- Schedule scripts using cron jobs, task schedulers, or workflow tools (e.g., Apache Airflow) to refresh data periodically.
- Implement webhooks or streaming APIs (if available) for near real-time data updates.
By integrating API data into third-party platforms, you can create interactive, real-time dashboards that combine OpenAI metrics with other business data, offering comprehensive insights and automated monitoring.
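As a minimal sketch of the export step mentioned above, the DataFrames prepared in this notebook can be written to CSV for a BI tool to ingest (the file names are arbitrary examples):
```python
# Export the prepared DataFrames so external dashboards (Tableau, Power BI, etc.) can ingest them.
# File names are illustrative; schedule this cell (or a script version of it) to refresh the data.
df.to_csv("openai_completions_usage.csv", index=False)
cost_df.to_csv("openai_costs.csv", index=False)
```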
---
# Source: https://developers.openai.com/apps-sdk/plan/components.md
# Design components
## Why components matter
UI components are the human-visible half of your connector. They let users view or edit data inline, switch to fullscreen when needed, and keep context synchronized between typed prompts and UI actions. Planning them early ensures your MCP server returns the right structured data and component metadata from day one.
## Explore sample components
We publish reusable examples in [openai-apps-sdk-examples](https://github.com/openai/openai-apps-sdk-examples) so you can see common patterns before you build your own. The pizzaz gallery covers every default surface we provide today:
### List
Renders dynamic collections with empty-state handling. [View the code](https://github.com/openai/openai-apps-sdk-examples/tree/main/src/pizzaz-list).

### Map
Plots geo data with marker clustering and detail panes. [View the code](https://github.com/openai/openai-apps-sdk-examples/tree/main/src/pizzaz).

### Album
Showcases media grids with fullscreen transitions. [View the code](https://github.com/openai/openai-apps-sdk-examples/tree/main/src/pizzaz-albums).

### Carousel
Highlights featured content with swipe gestures. [View the code](https://github.com/openai/openai-apps-sdk-examples/tree/main/src/pizzaz-carousel).

### Shop
Demonstrates product browsing with checkout affordances. [View the code](https://github.com/openai/openai-apps-sdk-examples/tree/main/src/pizzaz-shop).


## Clarify the user interaction
For each use case, decide what the user needs to see and manipulate:
- **Viewer vs. editor** – is the component read-only (a chart, a dashboard) or should it support editing and writebacks (forms, kanban boards)?
- **Single-shot vs. multiturn** – will the user accomplish the task in one invocation, or should state persist across turns as they iterate?
- **Inline vs. fullscreen** – some tasks are comfortable in the default inline card, while others benefit from fullscreen or picture-in-picture modes. Sketch these states before you implement.
Write down the fields, affordances, and empty states you need so you can validate them with design partners and reviewers.
## Map data requirements
Components should receive everything they need in the tool response. When planning:
- **Structured content** – define the JSON payload that the component will parse.
- **Initial component state** – use `window.openai.toolOutput` as the initial render data. On subsequent followups that invoke `callTool`, use the return value of `callTool`. To cache state for re-rendering, you can use `window.openai.setWidgetState`.
- **Auth context** – note whether the component should display linked-account information, or whether the model must prompt the user to connect first.
Feeding this data through the MCP response is simpler than adding ad-hoc APIs later.
## Design for responsive layouts
Components run inside an iframe on both desktop and mobile. Plan for:
- **Adaptive breakpoints** – set a max width and design layouts that collapse gracefully on small screens.
- **Accessible color and motion** – respect system dark mode (match color-scheme) and provide focus states for keyboard navigation.
- **Launcher transitions** – if the user opens your component from the launcher or expands to fullscreen, make sure navigation elements stay visible.
Document CSS variables, font stacks, and iconography up front so they are consistent across components.
## Define the state contract
Because components and the chat surface share conversation state, be explicit about what is stored where:
- **Component state** – use the `window.openai.setWidgetState` API to persist state the host should remember (selected record, scroll position, staged form data).
- **Server state** – store authoritative data in your backend or the built-in storage layer. Decide how to merge server changes back into component state after follow-up tool calls.
- **Model messages** – think about what human-readable updates the component should send back via `sendFollowUpMessage` so the transcript stays meaningful.
Capturing this state diagram early prevents hard-to-debug sync issues later.
## Plan telemetry and debugging hooks
Inline experiences are hardest to debug without instrumentation. Decide in advance how you will:
- Emit analytics events for component loads, button clicks, and validation errors.
- Log tool-call IDs alongside component telemetry so you can trace issues end to end.
- Provide fallbacks when the component fails to load (e.g., show the structured JSON and prompt the user to retry).
Once these plans are in place you are ready to move on to the implementation details in [Build a ChatGPT UI](https://developers.openai.com/apps-sdk/build/chatgpt-ui).
---
# Source: https://developers.openai.com/codex/concepts.md
# Tasks & Prompts
## Local tasks
Codex can perform two types of tasks for you: local tasks and [cloud tasks](#cloud-tasks).
Codex completes local tasks directly on your machine. This can be your personal laptop, desktop, or even a server you have access to.
For local tasks, Codex directly interacts with your local file system to change files and run commands. This means you can see which files are changing in real time, let Codex use your local tools, and have it jump into parts of your codebase that you are currently working on.
To [limit the risk of Codex modifying files outside of your workspace](/codex/security) or performing other undesired actions, Codex runs local tasks in a [sandbox](#sandbox) environment by default.
## Cloud tasks
The alternative to local tasks is cloud tasks, which are helpful when you want Codex to work on tasks in parallel or when inspiration strikes on the go.
Codex runs each cloud task in an isolated [environment](/codex/cloud/environments) where the agent can work on the task securely. To set up the environment, Codex will clone your repository and check out the relevant branch it's working on. To use Codex for cloud tasks, push your code to GitHub first. If you haven't pushed your code to GitHub yet, you can also use the Codex CLI or IDE extension to [delegate tasks from your local machine](/codex/ide/cloud-tasks), which includes the current code you are working on.
By default, environments come with common programming languages and dependency management tools. To get the most out of Codex cloud tasks, you can also install more packages and enable internet access by [customizing the environment](/codex/cloud/environments) for your project.
## Codex interfaces
Codex is available through a range of interfaces depending on your use case. You can use Codex in [your terminal](/codex/cli), [your IDE](/codex/ide), on [GitHub](/codex/integrations/github), in [Slack](/codex/integrations/slack), and more. The goal is for Codex to be available wherever you are, whenever you need it.
[Codex Web](/codex/cloud) is our web interface available at [chatgpt.com/codex](https://chatgpt.com/codex). You can use Codex Web to configure your cloud task environments, delegate tasks to Codex, and track [code reviews](/codex/integrations/github).
## Prompting Codex
Just like ChatGPT, Codex is only as effective as the instructions you give it. Here are some tips we find helpful when prompting Codex:
- Codex produces higher-quality outputs when it can verify its work. Provide **steps to reproduce an issue, validate a feature, and run any linter or pre-commit checks**. If additional packages or custom setups are needed, see [Environment configuration](/codex/cloud/environments).
- Like a human engineer, Codex handles really complex work better when it's broken into smaller, focused steps. Smaller tasks are easier for Codex to test and for you to review. You can even ask Codex to help break tasks down.
---
# Source: https://developers.openai.com/codex/config-advanced.md
# Advanced Configuration
Use these options when you need more control over providers, policies, and integrations. For a quick start, see [Config basics](https://developers.openai.com/codex/config-basic).
## Profiles
Profiles let you save named sets of configuration values and switch between them from the CLI.
Profiles are experimental and may change or be removed in future releases.
Profiles are not currently supported in the Codex IDE extension.
Define profiles under `[profiles.<name>]` in `config.toml`, then run `codex --profile <name>`:
```toml
model = "gpt-5-codex"
approval_policy = "on-request"
[profiles.deep-review]
model = "gpt-5-pro"
model_reasoning_effort = "high"
approval_policy = "never"
[profiles.lightweight]
model = "gpt-4.1"
approval_policy = "untrusted"
```
To make a profile the default, add `profile = "deep-review"` at the top level of `config.toml`. Codex loads that profile unless you override it on the command line.
## One-off overrides from the CLI
In addition to editing `~/.codex/config.toml`, you can override configuration for a single run from the CLI:
- Prefer dedicated flags when they exist (for example, `--model`).
- Use `-c` / `--config` when you need to override an arbitrary key.
Examples:
```shell
# Dedicated flag
codex --model gpt-5.2
# Generic key/value override (value is TOML, not JSON)
codex --config model='"gpt-5.2"'
codex --config sandbox_workspace_write.network_access=true
codex --config 'shell_environment_policy.include_only=["PATH","HOME"]'
```
Notes:
- Keys can use dot notation to set nested values (for example, `mcp_servers.context7.enabled=false`).
- `--config` values are parsed as TOML. When in doubt, quote the value so your shell doesn't split it on spaces.
- If the value can't be parsed as TOML, Codex treats it as a string.
## Config and state locations
Codex stores its local state under `CODEX_HOME` (defaults to `~/.codex`).
Common files you may see there:
- `config.toml` (your local configuration)
- `auth.json` (if you use file-based credential storage) or your OS keychain/keyring
- `history.jsonl` (if history persistence is enabled)
- Other per-user state such as logs and caches
For authentication details (including credential storage modes), see [Authentication](https://developers.openai.com/codex/auth). For the full list of configuration keys, see [Configuration Reference](https://developers.openai.com/codex/config-reference).
For shared defaults, rules, and skills checked into repos or system paths, see [Team Config](https://developers.openai.com/codex/enterprise/admin-setup#team-config).
If you just need to point the built-in OpenAI provider at an LLM proxy, router, or data-residency enabled project, set environment variable `OPENAI_BASE_URL` instead of defining a new provider. This overrides the default OpenAI endpoint without a `config.toml` change.
```shell
export OPENAI_BASE_URL="https://api.openai.com/v1"
codex
```
## Project config files (`.codex/config.toml`)
In addition to your user config, Codex reads project-scoped overrides from `.codex/config.toml` files inside your repo. Codex walks from the project root to your current working directory and loads every `.codex/config.toml` it finds. If multiple files define the same key, the closest file to your working directory wins.
For security, Codex loads project-scoped config files only when the project is trusted. If the project is untrusted, Codex ignores `.codex/config.toml` files in the project.
Relative paths inside a project config (for example, `experimental_instructions_file`) are resolved relative to the `.codex/` folder that contains the `config.toml`.
## Project root detection
Codex discovers project configuration (for example, `.codex/` layers and `AGENTS.md`) by walking up from the working directory until it reaches a project root.
By default, Codex treats a directory containing `.git` as the project root. To customize this behavior, set `project_root_markers` in `config.toml`:
```toml
# Treat a directory as the project root when it contains any of these markers.
project_root_markers = [".git", ".hg", ".sl"]
```
Set `project_root_markers = []` to skip searching parent directories and treat the current working directory as the project root.
## Custom model providers
A model provider defines how Codex connects to a model (base URL, wire API, and optional HTTP headers).
Define additional providers and point `model_provider` at them:
```toml
model = "gpt-5.1"
model_provider = "proxy"
[model_providers.proxy]
name = "OpenAI using LLM proxy"
base_url = "http://proxy.example.com"
env_key = "OPENAI_API_KEY"
[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
[model_providers.mistral]
name = "Mistral"
base_url = "https://api.mistral.ai/v1"
env_key = "MISTRAL_API_KEY"
```
Add request headers when needed:
```toml
[model_providers.example]
http_headers = { "X-Example-Header" = "example-value" }
env_http_headers = { "X-Example-Features" = "EXAMPLE_FEATURES" }
```
## OSS mode (local providers)
Codex can run against a local "open source" provider (for example, Ollama or LM Studio) when you pass `--oss`. If you pass `--oss` without specifying a provider, Codex uses `oss_provider` as the default.
```toml
# Default local provider used with `--oss`
oss_provider = "ollama" # or "lmstudio"
```
## Azure provider and per-provider tuning
```toml
[model_providers.azure]
name = "Azure"
base_url = "https://YOUR_PROJECT_NAME.openai.azure.com/openai"
env_key = "AZURE_OPENAI_API_KEY"
query_params = { api-version = "2025-04-01-preview" }
wire_api = "responses"
[model_providers.openai]
request_max_retries = 4
stream_max_retries = 10
stream_idle_timeout_ms = 300000
```
## ChatGPT customers using data residency
Projects created with [data residency](https://help.openai.com/en/articles/9903489-data-residency-and-inference-residency-for-chatgpt) enabled can create a model provider to update the base_url with the [correct prefix](https://platform.openai.com/docs/guides/your-data#which-models-and-features-are-eligible-for-data-residency).
```toml
model_provider = "openaidr"
[model_providers.openaidr]
name = "OpenAI Data Residency"
base_url = "https://us.api.openai.com/v1" # Replace 'us' with domain prefix
```
## Model reasoning, verbosity, and limits
```toml
model_reasoning_summary = "none" # Disable summaries
model_verbosity = "low" # Shorten responses
model_supports_reasoning_summaries = true # Force reasoning
model_context_window = 128000 # Context window size
```
`model_verbosity` applies only to providers using the Responses API. Chat Completions providers will ignore the setting.
## Approval policies and sandbox modes
Pick approval strictness (affects when Codex pauses) and sandbox level (affects file/network access). See [Sandbox & approvals](https://developers.openai.com/codex/security) for deeper examples.
```toml
approval_policy = "untrusted" # Other options: on-request, on-failure, never
sandbox_mode = "workspace-write"
[sandbox_workspace_write]
exclude_tmpdir_env_var = false # Allow $TMPDIR
exclude_slash_tmp = false # Allow /tmp
writable_roots = ["/Users/YOU/.pyenv/shims"]
network_access = false # Opt in to outbound network
```
In workspace-write mode, some environments keep `.git/` and `.codex/` read-only even when the rest of the workspace is writable. This is why commands like `git commit` may still require approval to run outside the sandbox. If you want Codex to skip specific commands (for example, block `git commit` outside the sandbox), use rules.
Disable sandboxing entirely (use only if your environment already isolates processes):
```toml
sandbox_mode = "danger-full-access"
```
## Shell environment policy
`shell_environment_policy` controls which environment variables Codex passes to any subprocess it launches (for example, when running a tool command the model proposes). Start from a clean slate (`inherit = "none"`) or a trimmed set (`inherit = "core"`), then layer on excludes, includes, and overrides to avoid leaking secrets while still providing the paths, keys, or flags your tasks need.
```toml
[shell_environment_policy]
inherit = "none"
set = { PATH = "/usr/bin", MY_FLAG = "1" }
ignore_default_excludes = false
exclude = ["AWS_*", "AZURE_*"]
include_only = ["PATH", "HOME"]
```
Patterns are case-insensitive globs (`*`, `?`, `[A-Z]`); `ignore_default_excludes = false` keeps the automatic KEY/SECRET/TOKEN filter before your includes/excludes run.
## MCP servers
See the dedicated [MCP documentation](https://developers.openai.com/codex/mcp) for configuration details.
## Observability and telemetry
Enable OpenTelemetry (OTel) log export to track Codex runs (API requests, SSE/events, prompts, tool approvals/results). Disabled by default; opt in via `[otel]`:
```toml
[otel]
environment = "staging" # defaults to "dev"
exporter = "none" # set to otlp-http or otlp-grpc to send events
log_user_prompt = false # redact user prompts unless explicitly enabled
```
Choose an exporter:
```toml
[otel]
exporter = { otlp-http = {
endpoint = "https://otel.example.com/v1/logs",
protocol = "binary",
headers = { "x-otlp-api-key" = "${OTLP_TOKEN}" }
}}
```
```toml
[otel]
exporter = { otlp-grpc = {
endpoint = "https://otel.example.com:4317",
headers = { "x-otlp-meta" = "abc123" }
}}
```
If `exporter = "none"` Codex records events but sends nothing. Exporters batch asynchronously and flush on shutdown. Event metadata includes service name, CLI version, env tag, conversation id, model, sandbox/approval settings, and per-event fields (see [Config Reference](https://developers.openai.com/codex/config-reference)).
### What gets emitted
Codex emits structured log events for runs and tool usage. Representative event types include:
- `codex.conversation_starts` (model, reasoning settings, sandbox/approval policy)
- `codex.api_request` and `codex.sse_event` (durations, status, token counts)
- `codex.user_prompt` (length; content redacted unless explicitly enabled)
- `codex.tool_decision` (approved/denied and whether the decision came from config vs user)
- `codex.tool_result` (duration, success, output snippet)
For more security and privacy guidance around telemetry, see [Security](https://developers.openai.com/codex/security#monitoring-and-telemetry).
### Metrics
By default, Codex periodically sends a small amount of anonymous usage and health data back to OpenAI. This helps detect when Codex isn't working correctly and shows what features and configuration options are being used, so the Codex team can focus on what matters most. These metrics don't contain any personally identifiable information (PII). Metrics collection is independent of OTel log/trace export.
If you want to disable metrics collection entirely across Codex surfaces on a machine, set the analytics flag in your config:
```toml
[analytics]
enabled = false
```
Each metric includes its own fields plus the default context fields below.
#### Default context fields (applies to every event/metric)
- `auth_mode`: `swic` | `api` | `unknown`.
- `model`: name of the model used.
- `app.version`: Codex version.
#### Metrics catalog
Each metric includes the required fields plus the default context fields above. Every metric is prefixed by `codex.`.
If a metric includes the `tool` field, it reflects the internal tool used (for example, `apply_patch` or `shell`) and doesn't contain the actual shell command or patch `codex` is trying to apply.
| Metric | Type | Fields | Description |
| ---------------------------------------- | --------- | ------------------ | ----------------------------------------------------------------------------------------------------------------------------- |
| `feature.state` | counter | `feature`, `value` | Feature values that differ from defaults (emit one row per non-default). |
| `thread.started` | counter | `is_git` | New thread created. |
| `task.compact` | counter | `type` | Number of compactions per type (`remote` or `local`), including manual and auto. |
| `task.user_shell` | counter | | Number of user shell actions (`!` in the TUI for example). |
| `task.review` | counter | | Number of reviews triggered. |
| `task.undo` | counter | | Number of undo actions triggered. |
| `approval.requested` | counter | `tool`, `approved` | Tool approval request result (`approved`, `approved_with_amendment`, `approved_for_session`, `denied`, `abort`). |
| `conversation.turn.count` | counter | | User/assistant turns per thread, recorded at the end of the thread. |
| `turn.e2e_duration_ms` | histogram | | End-to-end time for a full turn. |
| `mcp.call` | counter | `status` | MCP tool invocation result (`ok` or error string). |
| `model_warning` | counter | | Warning sent to the model. |
| `tool.call` | counter | `tool`, `success` | Tool invocation result (`success`: `true` or `false`). |
| `tool.call.duration_ms` | histogram | `tool`, `success` | Tool execution time. |
| `remote_models.fetch_update.duration_ms` | histogram | | Time to fetch remote model definitions. |
| `remote_models.load_cache.duration_ms` | histogram | | Time to load the remote model cache. |
| `shell_snapshot` | counter | `success` | Whether taking a shell snapshot succeeded. |
| `shell_snapshot.duration_ms` | histogram | `success` | Time to take a shell snapshot. |
| `db.init` | counter | `status` | State DB initialization outcomes (`opened`, `created`, `open_error`, `init_error`). |
| `db.backfill` | counter | `status` | Initial state DB backfill results (`upserted`, `failed`). |
| `db.backfill.duration_ms` | histogram | `status` | Duration of the initial state DB backfill, tagged with `success`, `failed`, or `partial_failure`. |
| `db.error` | counter | `stage` | Errors during state DB operations (for example, `extract_metadata_from_rollout`, `backfill_sessions`, `apply_rollout_items`). |
| `db.compare_error` | counter | `stage`, `reason` | State DB discrepancies detected during reconciliation. |
### Feedback controls
By default, Codex lets users send feedback from `/feedback`. To disable feedback collection across Codex surfaces on a machine, update your config:
```toml
[feedback]
enabled = false
```
When disabled, `/feedback` shows a disabled message and Codex rejects feedback submissions.
### Hide or surface reasoning events
If you want to reduce noisy "reasoning" output (for example in CI logs), you can suppress it:
```toml
hide_agent_reasoning = true
```
If you want to surface raw reasoning content when a model emits it:
```toml
show_raw_agent_reasoning = true
```
Enable raw reasoning only if it's acceptable for your workflow. Some models/providers (like `gpt-oss`) don't emit raw reasoning; in that case, this setting has no visible effect.
## Notifications
Use `notify` to trigger an external program whenever Codex emits supported events (currently only `agent-turn-complete`). This is handy for desktop toasts, chat webhooks, CI updates, or any side-channel alerting that the built-in TUI notifications don't cover.
```toml
notify = ["python3", "/path/to/notify.py"]
```
Example `notify.py` (truncated) that reacts to `agent-turn-complete`:
```python
#!/usr/bin/env python3
import json, subprocess, sys
def main() -> int:
notification = json.loads(sys.argv[1])
if notification.get("type") != "agent-turn-complete":
return 0
title = f"Codex: {notification.get('last-assistant-message', 'Turn Complete!')}"
message = " ".join(notification.get("input-messages", []))
subprocess.check_output([
"terminal-notifier",
"-title", title,
"-message", message,
"-group", "codex-" + notification.get("thread-id", ""),
"-activate", "com.googlecode.iterm2",
])
return 0
if __name__ == "__main__":
sys.exit(main())
```
The script receives a single JSON argument. Common fields include:
- `type` (currently `agent-turn-complete`)
- `thread-id` (session identifier)
- `turn-id` (turn identifier)
- `cwd` (working directory)
- `input-messages` (user messages that led to the turn)
- `last-assistant-message` (last assistant message text)
Place the script somewhere on disk and point `notify` to it.
#### `notify` vs `tui.notifications`
- `notify` runs an external program (good for webhooks, desktop notifiers, CI hooks).
- `tui.notifications` is built into the TUI and can optionally filter by event type (for example, `agent-turn-complete` and `approval-requested`).
- `tui.notification_method` controls how the TUI emits terminal notifications (`auto`, `osc9`, or `bel`).
In `auto` mode, Codex prefers OSC 9 notifications (a terminal escape sequence some terminals interpret as a desktop notification) and falls back to BEL (`\x07`) otherwise.
See [Configuration Reference](https://developers.openai.com/codex/config-reference) for the exact keys.
## History persistence
By default, Codex saves local session transcripts under `CODEX_HOME` (for example, `~/.codex/history.jsonl`). To disable local history persistence:
```toml
[history]
persistence = "none"
```
To cap the history file size, set `history.max_bytes`. When the file exceeds the cap, Codex drops the oldest entries and compacts the file while keeping the newest records.
```toml
[history]
max_bytes = 104857600 # 100 MiB
```
## Clickable citations
If you use a terminal/editor integration that supports it, Codex can render file citations as clickable links. Configure `file_opener` to pick the URI scheme Codex uses:
```toml
file_opener = "vscode" # or cursor, windsurf, vscode-insiders, none
```
Example: a citation like `/home/user/project/main.py:42` can be rewritten into a clickable `vscode://file/...:42` link.
## Project instructions discovery
Codex reads `AGENTS.md` (and related files) and includes a limited amount of project guidance in the first turn of a session. Two knobs control how this works:
- `project_doc_max_bytes`: how much to read from each `AGENTS.md` file
- `project_doc_fallback_filenames`: additional filenames to try when `AGENTS.md` is missing at a directory level
For a detailed walkthrough, see [Custom instructions with AGENTS.md](https://developers.openai.com/codex/guides/agents-md).
## TUI options
Running `codex` with no subcommand launches the interactive terminal UI (TUI). Codex exposes some TUI-specific configuration under `[tui]`, including:
- `tui.notifications`: enable/disable notifications (or restrict to specific types)
- `tui.notification_method`: choose `auto`, `osc9`, or `bel` for terminal notifications
- `tui.animations`: enable/disable ASCII animations and shimmer effects
- `tui.alternate_screen`: control alternate screen usage (set to `never` to keep terminal scrollback)
- `tui.show_tooltips`: show or hide onboarding tooltips on the welcome screen
`tui.notification_method` defaults to `auto`. In `auto` mode, Codex prefers OSC 9 notifications (a terminal escape sequence some terminals interpret as a desktop notification) when the terminal appears to support them, and falls back to BEL (`\x07`) otherwise.
See [Configuration Reference](https://developers.openai.com/codex/config-reference) for the full key list.
---
# Source: https://developers.openai.com/codex/config-basic.md
# Config basics
Codex reads configuration details from more than one location. Your personal defaults live in `~/.codex/config.toml`, and you can add project overrides with `.codex/config.toml` files. For security, Codex loads project config files only when you trust the project.
## Codex configuration file
Codex stores user-level configuration at `~/.codex/config.toml`. To scope settings to a specific project or subfolder, add a `.codex/config.toml` file in your repo.
To open the configuration file from the Codex IDE extension, select the gear icon in the top-right corner, then select **Codex Settings > Open config.toml**.
The CLI and IDE extension share the same configuration layers. You can use them to:
- Set the default model and provider.
- Configure [approval policies and sandbox settings](https://developers.openai.com/codex/security).
- Configure [MCP servers](https://developers.openai.com/codex/mcp).
## Configuration precedence
Codex resolves values in this order (highest precedence first):
1. CLI flags and `--config` overrides
2. [Profile](https://developers.openai.com/codex/config-advanced#profiles) values (from `--profile <name>`)
3. Project config files: `.codex/config.toml`, ordered from the project root down to your current working directory (closest wins; trusted projects only)
4. User config: `~/.codex/config.toml`
5. System config (if present): `/etc/codex/config.toml` on Unix
6. Built-in defaults
Use that precedence to set shared defaults at the top level and keep profiles focused on the values that differ.
If you mark a project as untrusted, Codex skips project-scoped `.codex/` layers (including `.codex/config.toml`) and falls back to user, system, and built-in defaults.
For one-off overrides via `-c`/`--config` (including TOML quoting rules), see [Advanced Config](https://developers.openai.com/codex/config-advanced#one-off-overrides-from-the-cli).
On managed machines, your organization may also enforce constraints via
`requirements.toml` (for example, disallowing `approval_policy = "never"` or
`sandbox_mode = "danger-full-access"`). See [Security](https://developers.openai.com/codex/security).
## Common configuration options
Here are a few options people change most often:
#### Default model
Choose the model Codex uses by default in the CLI and IDE.
```toml
model = "gpt-5.2"
```
#### Approval prompts
Control when Codex pauses to ask before running generated commands.
```toml
approval_policy = "on-request"
```
#### Sandbox level
Adjust how much filesystem and network access Codex has while executing commands.
```toml
sandbox_mode = "workspace-write"
```
#### Web search mode
Codex enables web search by default for local tasks and serves results from a web search cache. The cache is an OpenAI-maintained index of web results, so cached mode returns pre-indexed results instead of fetching live pages. This reduces exposure to prompt injection from arbitrary live content, but you should still treat web results as untrusted. If you are using `--yolo` or another [full access sandbox setting](https://developers.openai.com/codex/security), web search defaults to live results. Choose a mode with `web_search`:
- `"cached"` (default) serves results from the web search cache.
- `"live"` fetches the most recent data from the web (same as `--search`).
- `"disabled"` turns off the web search tool.
```toml
web_search = "cached" # default; serves results from the web search cache
# web_search = "live" # fetch the most recent data from the web (same as --search)
# web_search = "disabled"
```
#### Reasoning effort
Tune how much reasoning effort the model applies when supported.
```toml
model_reasoning_effort = "high"
```
#### Command environment
Control which environment variables Codex forwards to spawned commands.
```toml
[shell_environment_policy]
include_only = ["PATH", "HOME"]
```
## Feature flags
Use the `[features]` table in `config.toml` to toggle optional and experimental capabilities.
```toml
[features]
shell_snapshot = true # Speed up repeated commands
```
### Supported features
| Key | Default | Maturity | Description |
| ------------------------------ | :-----: | ------------ | ------------------------------------------------------------- |
| `apply_patch_freeform` | false | Experimental | Include the freeform `apply_patch` tool |
| `elevated_windows_sandbox` | false | Experimental | Use the elevated Windows sandbox pipeline |
| `exec_policy` | true | Experimental | Enforce rules checks for `shell`/`unified_exec` |
| `experimental_windows_sandbox` | false | Experimental | Use the Windows restricted-token sandbox |
| `remote_compaction` | true | Experimental | Enable remote compaction (ChatGPT auth only) |
| `remote_models` | false | Experimental | Refresh remote model list before showing readiness |
| `request_rule` | true | Stable | Enable Smart approvals (`prefix_rule` suggestions) |
| `shell_snapshot` | false | Beta | Snapshot your shell environment to speed up repeated commands |
| `shell_tool` | true | Stable | Enable the default `shell` tool |
| `unified_exec` | false | Beta | Use the unified PTY-backed exec tool |
| `undo` | true | Stable | Enable undo via per-turn git ghost snapshots |
| `web_search` | true | Deprecated | Legacy toggle; prefer the top-level `web_search` setting |
| `web_search_cached` | true | Deprecated | Legacy toggle that maps to `web_search = "cached"` when unset |
| `web_search_request` | true | Deprecated | Legacy toggle that maps to `web_search = "live"` when unset |
The Maturity column uses feature maturity labels such as Experimental, Beta,
and Stable. See [Feature Maturity](https://developers.openai.com/codex/feature-maturity) for how to
interpret these labels.
Omit feature keys to keep their defaults.
### Enabling features
- In `config.toml`, add `feature_name = true` under `[features]`.
- From the CLI, run `codex --enable feature_name`.
- To enable more than one feature, run `codex --enable feature_a --enable feature_b`.
- To disable a feature, set the key to `false` in `config.toml`.
---
# Source: https://developers.openai.com/codex/config-reference.md
# Configuration Reference
Use this page as a searchable reference for Codex configuration files. For conceptual guidance and examples, start with [Config basics](https://developers.openai.com/codex/config-basic) and [Advanced Config](https://developers.openai.com/codex/config-advanced).
## `config.toml`
User-level configuration lives in `~/.codex/config.toml`. You can also add project-scoped overrides in `.codex/config.toml` files. Codex loads project-scoped config files only when you trust the project.
",
description:
'Additional writable roots when `sandbox_mode = "workspace-write"`.',
},
{
key: "sandbox_workspace_write.network_access",
type: "boolean",
description:
"Allow outbound network access inside the workspace-write sandbox.",
},
{
key: "sandbox_workspace_write.exclude_tmpdir_env_var",
type: "boolean",
description:
"Exclude `$TMPDIR` from writable roots in workspace-write mode.",
},
{
key: "sandbox_workspace_write.exclude_slash_tmp",
type: "boolean",
description:
"Exclude `/tmp` from writable roots in workspace-write mode.",
},
{
key: "notify",
type: "array",
description:
"Command invoked for notifications; receives a JSON payload from Codex.",
},
{
key: "check_for_update_on_startup",
type: "boolean",
description:
"Check for Codex updates on startup (set to false only when updates are centrally managed).",
},
{
key: "feedback.enabled",
type: "boolean",
description:
"Enable feedback submission via `/feedback` across Codex surfaces (default: true).",
},
{
key: "instructions",
type: "string",
description:
"Reserved for future use; prefer `model_instructions_file` or `AGENTS.md`.",
},
{
key: "developer_instructions",
type: "string",
description:
"Additional developer instructions injected into the session (optional).",
},
{
key: "compact_prompt",
type: "string",
description: "Inline override for the history compaction prompt.",
},
{
key: "model_instructions_file",
type: "string (path)",
description:
"Replacement for built-in instructions instead of `AGENTS.md`.",
},
{
key: "experimental_compact_prompt_file",
type: "string (path)",
description:
"Load the compaction prompt override from a file (experimental).",
},
{
key: "skills.config",
type: "array