# Promptfoo

> Promptfoo is an open-source CLI and library for evaluating and red-teaming LLM applications.

## Pages

- [Adaline Gateway](adaline.md): Adaline Gateway is a fully local production-grade Super SDK that provides a simple, unified, and powerful interface f...
- [Aegis: NVIDIA AI Content Safety Dataset](aegis.md): The Aegis plugin uses NVIDIA's [Aegis AI Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Cont...
- [Age Bias Plugin](age-bias.md): The Age Bias plugin (`bias:age`) tests whether your AI system reinforces age-based stereotypes or discrimination.
- [How to red team LLM Agents](agents.md): LLM agents are capable of interacting with their environment and executing complex tasks using natural language inter...
- [AI21 Labs](ai21.md): The [AI21 Labs API](https://docs.ai21.com/reference/chat-completion) offers access to AI21 models such as `jamba-1.5-...
- [AI/ML API](aimlapi.md): [AI/ML API](https://aimlapi.com/) provides access to 300+ AI models through a unified OpenAI-compatible interface, in...
- [Alibaba Cloud (Qwen)](alibaba.md): [Alibaba Cloud's DashScope API](https://www.alibabacloud.com/help/en/model-studio/getting-started/models) provides Op...
- [Answer Relevance](answer-relevance.md): The `answer-relevance` assertion evaluates whether an LLM's output is relevant to the original query. It uses a combi...
- [Anthropic](anthropic.md): This provider supports the [Anthropic Claude](https://www.anthropic.com/claude) series of models.
- [API Reference | Promptfoo](api-reference.md): - [Red Teaming](/red-teaming/)
- [Architecture](architecture.md): Promptfoo automated red teaming consists of three main components: **plugins**, **strategies**, and **targets**.
- [ASCII Smuggling for LLMs](ascii-smuggling.md): ASCII smuggling is a technique that uses a special set of Unicode code points from the Tags Unicode Block to embed in...
- [attack-generation](attack-generation.md): Sometimes attacks may not be generated as expected. This is usually due to the `Purpose` property not being clear eno...
- [Audio Jailbreaking](audio.md): The Audio strategy converts prompt text into speech audio and then encodes that audio as a base64 string. This allows...
- [Audit Logging](audit-logging.md): Audit Logging is a feature of promptfoo Enterprise that provides forensic access information at the organization leve...
- [Authentication](authentication.md): Promptfoo supports both basic authentication and SSO through SAML 2.0 and OIDC. To configure SSO with Promptfoo Enter...
- [Authoritative Markup Injection Strategy](authoritative-markup-injection.md): The Authoritative Markup Injection strategy tests whether AI systems are more susceptible to harmful requests when th...
- [AWS Bedrock | Promptfoo](aws-bedrock.md): 1. **Model Access**: Amazon Bedrock provides automatic access to serverless foundation models with no manual approval...
- [Azure Pipelines Integration](azure-pipelines.md): This guide demonstrates how to set up promptfoo with Azure Pipelines to run evaluations as part of your CI pipeline.
- [OpenAI vs Azure: How to benchmark](azure-vs-openai.md): Whether you use GPT through the OpenAI or Azure APIs, the results are pretty similar. But there are some key differen...
- [Azure OpenAI Provider | Promptfoo](azure.md): There are three ways to authenticate with Azure OpenAI:
- [Base64 Encoding Strategy](base64.md): The Base64 Encoding strategy tests an AI system's ability to resist encoded inputs that might bypass security control...
- [Basic Strategy](basic.md): The basic strategy controls whether the original plugin-generated test cases (without any strategies applied) are inc...
- [BeaverTails Dataset for LLM Safety Testing](beavertails.md): The BeaverTails plugin uses the [BeaverTails dataset](https://huggingface.co/datasets/PKU-Alignment/BeaverTails), a d...
- [AWS Bedrock Agents](bedrock-agents.md): The AWS Bedrock Agents provider enables you to test and evaluate AI agents built with Amazon Bedrock Agents. Amazon B...
- [Best-of-N (BoN) Jailbreaking Strategy](best-of-n.md): Best-of-N (BoN) is a simple but effective black-box jailbreaking algorithm that works by repeatedly sampling variatio...
- [Best Practices for Configuring AI Red Teaming](best-practices.md): To successfully use AI red teaming automation, you **must** provide rich application context and a diverse set of att...
- [Broken Function Level Authorization (BFLA) Plugin](bfla.md): The BFLA (Broken Function Level Authorization) red teaming plugin is designed to test an AI system's ability to maint...
- [Bias Detection Plugins](bias.md): Test whether your AI system produces or reinforces stereotypes, biases, or discrimination across different protected ...
- [Bitbucket Pipelines Integration](bitbucket-pipelines.md): This guide demonstrates how to set up promptfoo with Bitbucket Pipelines to run evaluations as part of your CI pipeline.
- [BOLA (Broken Object Level Authorization) Plugin](bola.md): The BOLA (Broken Object Level Authorization) red teaming plugin is designed to test an AI system's vulnerability to a...
- [Browser Provider](browser.md): The Browser Provider enables automated web browser interactions for testing complex web applications and JavaScript-h...
- [Building trust in AI with Portkey and Promptfoo](building-trust-in-ai-with-portkey-and-promptfoo.md): This guide was written by **Drishti Shah** from [Portkey](https://portkey.ai/), a guest author contributing to the Pr...
- [Finding LLM Jailbreaks with Burp Suite](burp.md): This guide shows how to integrate Promptfoo's application-level jailbreak creation with Burp Suite's Intruder feature...
- [Caching](caching.md): promptfoo caches the results of API calls to LLM providers to help save time and cost.
- [Cerebras](cerebras.md): This provider enables you to use Cerebras models through their [Inference API](https://docs.cerebras.ai/).
- [changelog](changelog.md): One doc tagged with "changelog"
- [Chat Conversations / Threads](chat.md): The [prompt file](/docs/configuration/prompts/#file-based-prompts) supports a message in OpenAI's JSON prompt format....
- [Red teaming a Chatbase Chatbot](chatbase-redteam.md): [Chatbase](https://www.chatbase.co/) is a platform for building custom AI chatbots that can be embedded into websites...
- [Choosing the best GPT model: benchmark on your own data](choosing-best-gpt-model.md): This guide will walk you through how to compare OpenAI's GPT-4o and GPT-4.1-mini, top contenders for the most powerfu...
- [CI/CD Integration for LLM Evaluation and Security](ci-cd.md): Integrate promptfoo into your CI/CD pipelines to automatically evaluate prompts, test for security vulnerabilities, a...
- [Setting up Promptfoo with CircleCI](circle-ci.md): This guide shows how to integrate promptfoo's LLM evaluation into your CircleCI pipeline. This allows you to automati...
- [Authority-based Jailbreaking](citation.md): The Citation strategy is a red teaming technique that uses academic citations and references to potentially bypass an...
- [Classifier grading](classifier.md): Use the `classifier` assert type to run the LLM output through any [HuggingFace text classifier](https://huggingface....
- [Claude Agent SDK](claude-agent-sdk.md): This provider makes [Claude Agent SDK](https://docs.claude.com/en/api/agent-sdk/overview) available for evals through...
- [Claude 3.7 vs GPT-4.1: Benchmark on Your Own Data](claude-vs-gpt.md): When evaluating the performance of LLMs, generic benchmarks will only get you so far. This is especially the case for...
- [CLI Command](cli.md): The `promptfoo code-scans` command scans code changes for LLM-related security vulnerabilities, helping you identify ...
- [Cloudera](cloudera.md): The Cloudera provider allows you to interact with Cloudera's AI endpoints using the OpenAI protocol. It supports chat...
- [Cloudflare Workers AI](cloudflare-ai.md): This provider supports the [models](https://developers.cloudflare.com/workers-ai/models/) provided by Cloudflare Work...
- [Code Scanning](code-scanning.md): Promptfoo Code Scanning uses AI agents to find LLM-related vulnerabilities in your codebase and helps you fix them be...
- [Command R vs GPT vs Claude: create your own benchmark](cohere-command-r-benchmark.md): While public benchmarks provide a general sense of capability, the only way to truly understand which model will perf...
- [Cohere](cohere.md): The `cohere` provider is an interface to Cohere AI's [chat inference API](https://docs.cohere.com/reference/chat), wi...
- [CometAPI](cometapi.md): The `cometapi` provider lets you use [CometAPI](https://www.cometapi.com/?utm_source=promptfoo&utm_campaign=integrati...
- [Command line](command-line.md): The `promptfoo` command line utility supports the following subcommands:
- [Llama 3.1 vs GPT: Benchmark on your own data](compare-llama2-vs-gpt.md): This guide describes how to compare three models - Llama 3.1 405B, GPT 4o, and gpt-5-mini - using the `promptfoo` CLI.
- [Competitors Plugin](competitors.md): The Competitors red teaming plugin is designed to test whether an AI system can be influenced to mention or recommend...
- [Composite Jailbreaks Strategy](composite-jailbreaks.md): The Composite Jailbreaks strategy combines multiple jailbreak techniques from top research papers to create more soph...
- [configuration](configuration.md): Configuration
- [Connecting to Targets](connecting-to-targets.md): When setting up your target, use these best practices:
- [Context Compliance Attack Plugin](context-compliance-attack.md): Context Compliance Attacks (CCAs) exploit a dangerous flaw in many LLM deployments: **the failure to verify conversat...
- [Context faithfulness](context-faithfulness.md): Checks if the LLM's response only makes claims that are supported by the provided context.
- [Context recall](context-recall.md): Checks if your retrieved context contains the information needed to generate a known correct answer.
- [Context relevance](context-relevance.md): Measures what fraction of retrieved context is minimally needed to answer the query.
- [Contracts Plugin](contracts.md): The Contracts red teaming plugin is designed to test whether an AI system can be influenced to enter into unintended ...
- [Contributing to promptfoo](contributing.md): We welcome contributions from the community to help make promptfoo better. This guide will help you get started. If y...
- [Conversation Relevance](conversation-relevance.md): The `conversation-relevance` assertion evaluates whether responses in a conversation remain relevant throughout the d...
- [COPPA](coppa.md): The COPPA (Children's Online Privacy Protection Act) red teaming plugin tests whether AI systems properly protect chi...
- [cross-session-leak](cross-session-leak.md): Cross-Session Leak Plugin
- [Javascript Provider](custom-api.md): Custom Javascript providers let you create providers in JavaScript or TypeScript to integrate with any API or service...
- [Custom Scripts](custom-script.md): You may use any shell command as an API provider. This is particularly useful when you want to use a language or fram...
- [Custom Strategy](custom-strategy.md): Write natural language instructions to create powerful multi-turn red team strategies. No coding required.
- [One doc tagged with "custom"](custom.md): Create reusable red team strategies by writing natural language instructions that guide AI through multi-turn convers...
- [CyberSecEval Dataset for LLM Security Testing](cyberseceval.md): The CyberSecEval plugin uses Meta's [Purple Llama CyberSecEval dataset](https://meta-llama.github.io/PurpleLlama/docs...
- [Data Handling and Privacy](data-handling.md): This page explains what data leaves your machine during red team testing and how to control it.
- [Databricks Foundation Model APIs](databricks.md): The Databricks provider integrates with Databricks' Foundation Model APIs, offering access to state-of-the-art models...
- [Dataset generation](datasets.md): Your dataset is the heart of your LLM eval. To the extent possible, it should closely represent true inputs into your...
- [DBRX vs Mixtral vs GPT: create your own benchmark](dbrx-benchmark.md): There are many generic benchmarks that measure LLMs like DBRX, Mixtral, and others in a similar performance class. Bu...
- [Debug Access Plugin](debug-access.md): The Debug Access red teaming plugin is designed to test whether an AI system has an exposed debugging interface or re...
- [Deepseek vs GPT vs O3 vs Llama: Run a Custom Benchmark](deepseek-benchmark.md): Deepseek is a new Mixture-of-Experts (MoE) model that's all the rage due to its impressive performance, especially in...
- [deepseek](deepseek.md): [DeepSeek](https://platform.deepseek.com/) provides an OpenAI-compatible API for their language models, with speciali...
- [Deterministic Metrics for LLM Output Validation](deterministic.md): These metrics are created by logical tests that are run on LLM output.
- [Disability Bias Plugin](disability-bias.md): The Disability Bias plugin (`bias:disability`) tests whether your AI system reinforces disability stereotypes or disc...
- [Target Discovery](discovery.md): Promptfoo's **Target Discovery Agent** automatically extracts useful information about generative AI systems that you...
- [Divergent Repetition Plugin](divergent-repetition.md): The Divergent Repetition red teaming plugin is designed to test whether an AI system can be manipulated into revealin...
- [Docker Model Runner](docker.md): [Docker Model Runner](https://docs.docker.com/ai/model-runner/) makes it easy to manage, run, and deploy AI models us...
- [DoNotAnswer Dataset](donotanswer.md): The DoNotAnswer plugin tests how well LLMs handle harmful queries. The dataset contains questions that responsible AI...
- [Echo Provider](echo.md): The Echo Provider is a simple utility provider that returns the input prompt as the output. It's particularly useful ...
- [E-commerce Red Teaming Plugins](ecommerce.md): The e-commerce red teaming plugins are designed to test AI systems deployed in online retail contexts for critical vu...
- [ElevenLabs](elevenlabs.md): The ElevenLabs provider integrates multiple AI audio capabilities for comprehensive voice AI testing and evaluation.
- [Promptfoo Enterprise](enterprise.md): Promptfoo offers two deployment options to meet your security needs:
- [Envoy AI Gateway](envoy.md): [Envoy AI Gateway](https://aigateway.envoyproxy.io/) is an open-source AI gateway that provides a unified proxy layer...
- [EU AI Act](eu-ai-act.md): The EU Artificial Intelligence Act (AI Act) is the world's first comprehensive legal framework specifically regulatin...
- [Evaluating LLM safety with HarmBench](evaling-with-harmbench.md): Recent research has shown that even the most advanced LLMs [remain vulnerable](https://unit42.paloaltonetworks.com/ja...
- [Evaluate Coding Agents](evaluate-coding-agents.md): Coding agents present a different evaluation challenge than standard LLMs. A chat model transforms input to output in...
- [Red Teaming a CrewAI Agent](evaluate-crewai.md): [CrewAI](https://github.com/joaomdmoura/crewai) is a cutting-edge multi-agent platform designed to help teams streaml...
- [Evaluating ElevenLabs voice AI](evaluate-elevenlabs.md): This guide walks you through testing ElevenLabs voice AI capabilities using Promptfoo, from basic text-to-speech qual...
- [LLM evaluation techniques for JSON outputs](evaluate-json.md): Getting an LLM to output valid JSON can be a difficult task. There are a few failure modes:
- [Evaluate LangGraph: Red Teaming and Testing Stateful Agents](evaluate-langgraph.md): [LangGraph](https://github.com/langchain-ai/langgraph) is an advanced framework built on top of LangChain, designed t...
- [Choosing the right temperature for your LLM](evaluate-llm-temperature.md): The `temperature` setting in language models is like a dial that adjusts how predictable or surprising the responses ...
- [How to evaluate OpenAI Assistants](evaluate-openai-assistants.md): OpenAI recently released an [Assistants API](https://platform.openai.com/docs/assistants/overview) that offers simpli...
- [Evaluating RAG pipelines](evaluate-rag.md): Retrieval-augmented generation is a method for enriching LLM prompts with relevant data. Typically, the user prompt w...
- [How to evaluate GPT 3.5 vs Llama2-70b with Replicate Lifeboat](evaluate-replicate-lifeboat.md): Replicate put together a ["Lifeboat" OpenAI proxy](https://lifeboat.replicate.dev/) that allows you to swap to their ...
- [Excessive Agency Plugin](excessive-agency.md): The Excessive Agency red teaming plugin tests whether an AI is aware of its own capabilities and limitations by promp...
- [Assertions & metrics](expected-outputs.md): Assertions are used to compare the LLM output against expected values or conditions. While assertions are not require...
- [F5](f5.md): [F5](https://f5.ai/) provides an interface for a handful of LLM APIs.
- [Evaluating factuality](factuality-eval.md): Factuality is the measure of how accurately an LLM's response aligns with established facts or reference information....
- [Factuality](factuality.md): The `factuality` assertion evaluates the factual consistency between an LLM output and a reference answer. It uses a ...
- [fal.ai](fal.md): The `fal` provider supports the [fal.ai](https://fal.ai/) inference API using the [fal-js](https://github.com/fal-ai/...
- [Preventing False Positives](false-positives.md): False positives occur when a test case is marked as passing when it should have been marked as failing or vice versa....
- [Frequently asked questions](faq.md): Promptfoo is a local-first, open-source tool designed to help evaluate (eval) large language models (LLMs). Promptfoo...
- [features](features.md): One doc tagged with "features"
- [FERPA](ferpa.md): The FERPA (Family Educational Rights and Privacy Act) red teaming plugin tests whether AI systems properly protect st...
- [Financial Red-Teaming Plugins](financial.md): The Financial Red-Teaming Plugins are a specialized suite of tests designed for AI systems operating in financial ins...
- [Findings and Reports](findings.md): Promptfoo Enterprise allows you to review findings and reports from scans within the Promptfoo application.
- [fireworks](fireworks.md): Fireworks AI
- [How to Red Team Foundation Models](foundation-models.md): LLM security starts at the foundation model level. Assessing the security of foundation models is the first step to b...
- [G-Eval](g-eval.md): G-Eval is a framework that uses LLMs with chain-of-thoughts (CoT) to evaluate LLM outputs based on custom criteria. I...
- [Greedy Coordinate Gradient (GCG)](gcg.md): The GCG strategy implements the attack method described in "[Universal and Transferable Adversarial Attacks on Aligne...
- [GDPR](gdpr.md): The EU General Data Protection Regulation (GDPR) is the world's most comprehensive data privacy and security law. Whi...
- [Gemini vs GPT: benchmark on your own data](gemini-vs-gpt.md): When comparing Gemini with GPT, you'll find plenty of evals and opinions online. Model capabilities set a _ceiling_ on...
- [Gemma vs Llama: benchmark on your own data](gemma-vs-llama.md): Comparing Google's Gemma and Meta's Llama involves more than just looking at their specs and reading about generic be...
- [Gemma vs Mistral: benchmark on your own data](gemma-vs-mistral.md): When comparing the performance of LLMs, it's best not to rely on generic benchmarks. This guide shows you how to set ...
- [gender-bias](gender-bias.md): The Gender Bias plugin (`bias:gender`) tests whether your AI system reinforces gender stereotypes or discrimination.
- [Getting started](getting-started.md): After [installing](/docs/installation/) promptfoo, you can set up your first config file in two ways:
- [GitHub Action](github-action.md): Automatically scan pull requests for LLM security vulnerabilities with promptfoo's [code scanning GitHub action](/cod...
- [GitHub Models](github.md): [GitHub Models](https://github.com/marketplace/models/) provides access to industry-leading AI models from OpenAI, An...
- [Setting up Promptfoo with GitLab CI](gitlab-ci.md): This guide shows how to integrate Promptfoo's LLM evaluation into your GitLab CI pipeline. This allows you to automat...
- [Custom Go Provider](go.md): The Go (`golang`) provider allows you to use Go code as an API provider for evaluating prompts. This is useful when y...
- [Goal Misalignment Plugin](goal-misalignment.md): The Goal Misalignment Plugin tests whether AI systems recognize when optimizing measurable proxy metrics might not al...
- [GOAT Technique for Jailbreaking LLMs](goat.md): The GOAT (Generative Offensive Agent Tester) strategy is an advanced automated red teaming technique that uses an "at...
- [Testing Google Cloud Model Armor](google-cloud-model-armor.md): [Model Armor](https://cloud.google.com/security-command-center/docs/model-armor-overview) is a Google Cloud service t...
- [Google Sheets Integration](google-sheets.md): promptfoo allows you to import eval test cases directly from Google Sheets. This can be done either unauthenticated (...
- [Google AI / Gemini](google.md): The `google` provider enables integration with Google AI Studio and the Gemini API. It provides access to Google's st...
- [GPT 3.5 vs GPT 4: benchmark on your own data](gpt-35-vs-gpt-4.md): This guide will walk you through how to compare OpenAI's GPT-3.5 and GPT-4 using promptfoo. This testing framework wi...
- [GPT-4o vs GPT-4.1-mini: Benchmark on Your Own Data](gpt-4-vs-gpt-4o.md): OpenAI released [gpt-5-mini](https://openai.com/index/gpt-5-mini-advancing-cost-efficient-intelligence/), a highly co...
- [GPT-4.1 vs GPT-4o: MMLU Benchmark Comparison](gpt-41-vs-gpt-4o-mmlu.md): OpenAI's [GPT-4.1](https://openai.com/index/introducing-gpt-4-1-in-the-api/) scores **90.2% on MMLU** vs GPT-4o's 85....
- [gpt-5 vs o1: Benchmark on Your Own Data](gpt-vs-o1.md): OpenAI has released a new model series called o1 designed to spend more time thinking before responding and excel at ...
- [About the Grader](grading-results.md): When you run a red team scan against a target, Promptfoo will evaluate the results of the output and determine whethe...
- [Groq](groq.md): [Groq](https://groq.com/) is an extremely fast inference API compatible with all the options provided by Promptfoo's ...
- [Guardrails](guardrails.md): Use the `guardrails` assert type to ensure that LLM outputs pass safety checks based on the provider's built-in guard...
- [Configuration Guide - Getting Started with Promptfoo](guide.md): The YAML configuration format runs each prompt through a series of example inputs (aka "test case") and checks if the...
- [Guides](guides.md): Learn how to test complex LLM chains and RAG systems with unit tests and end-to-end validation to ensure reliable out...
- [Hallucination Plugin](hallucination.md): The Hallucination red teaming plugin is designed to test whether a large language model can be led to fabricate answe...
- [HarmBench Plugin](harmbench.md): The HarmBench plugin uses the [HarmBench dataset](https://github.com/centerforaisafety/HarmBench/tree/main/data/behav...
- [Harmful Content Plugin](harmful.md): The Harmful Content plugin is designed to test an AI system's ability to handle and respond appropriately to various ...
- [helicone](helicone.md): Helicone integration
- [Hex Encoding Strategy](hex.md): The Hex Encoding strategy tests an AI system's ability to resist ASCII hex-encoded inputs that might bypass security ...
- [Hijacking Plugin](hijacking.md): The Hijacking red teaming plugin is designed to test whether a large language model can be distracted from its intend...
- [Testing Humanity's Last Exam with Promptfoo](hle-benchmark.md): [Humanity's Last Exam (HLE)](https://arxiv.org/abs/2501.14249) is a challenging benchmark commissioned by Scale AI an...
- [homoglyph](homoglyph.md): Homoglyph Encoding Strategy
- [HTTP/HTTPS API | Promptfoo](http.md): Setting the provider ID to a URL sends an HTTP request to the endpoint. This provides a general-purpose way to use an...
- [HuggingFace Datasets](huggingface-datasets.md): Promptfoo can import test cases directly from [HuggingFace datasets](https://huggingface.co/docs/datasets) using the ...
- [HuggingFace](huggingface.md): Promptfoo includes support for the [HuggingFace Inference Providers](https://huggingface.co/docs/inference-providers)...
- [Hydra Multi-turn Strategy](hydra.md): The Hydra strategy (`jailbreak:hydra`) runs a multi-turn attacker agent that adapts to every response from your targe...
- [Hyperbolic](hyperbolic.md): The `hyperbolic` provider supports [Hyperbolic's API](https://docs.hyperbolic.xyz/), which provides access to various...
- [ibm-bam](ibm-bam.md): IBM BAM (Deprecated)
- [Image Jailbreaking](image.md): The Image strategy converts prompt text into an image and then encodes that image as a base64 string. This approach e...
- [Imitation Plugin](imitation.md): The Imitation red teaming plugin is designed to test whether an AI system can be influenced to imitate a specific per...
- [Indirect Prompt Injection Plugin](indirect-prompt-injection.md): Tests whether untrusted data (RAG context, emails, user profiles) can hijack your model when placed into the prompt.
- [Configuring Inference](inference-limit.md): Promptfoo open-source red teaming requires inference to generate probes and grade results. When using Promptfoo’s ope...
- [Installation](installation.md): - Node.js 20 or newer
- [Insurance Red-Teaming Plugins](insurance.md): The Insurance Red-Teaming Plugins are a specialized suite designed for AI systems operating in health insurance conte...
- [Integrations](integrations.md): Use Python for promptfoo evals - providers, assertions, test generators, and prompts. Integrates with LangChain, Lang...
- [Intent (Custom Prompts) Plugin](intent.md): The Intent plugin is designed to make it easy to test preset inputs to see if they can successfully manipulate an AI ...
- [Intro](intro.md): `promptfoo` is an [open-source](https://github.com/promptfoo/promptfoo) CLI and library for evaluating and red-teamin...
- [ISO 42001](iso-42001.md): ISO/IEC 42001:2023 is the international standard for AI Management Systems. It provides organizations with a structur...
- [Iterative Jailbreaks Strategy](iterative.md): The Iterative Jailbreaks strategy is a technique designed to systematically probe and potentially bypass an AI system...
- [Javascript assertions](javascript.md): The `javascript` [assertion](https://www.promptfoo.dev/docs/configuration/expected-outputs/) allows you to provide a ...
- [Setting up Promptfoo with Jenkins](jenkins.md): This guide demonstrates how to integrate Promptfoo's LLM evaluation into your Jenkins pipeline. This setup enables au...
- [Testing prompts with Jest and Vitest](jest.md): `promptfoo` can be integrated with test frameworks like [Jest](https://jestjs.io/) and [Vitest](https://vitest.dev/) ...
- [JFrog ML](jfrog.md): This documentation covers the **JFrog ML** provider for AI model inference (formerly known as Qwak). This is differen...
- [Using LangChain PromptTemplate with Promptfoo](langchain-prompttemplate.md): LangChain PromptTemplate is commonly used to format prompts by injecting variables. Promptfoo allows you to evaluat...
- [Langfuse integration](langfuse.md): [Langfuse](https://langfuse.com/) is an open-source LLM engineering platform that includes collaborative prompt manag...
- [Layer Strategy](layer.md): The Layer strategy allows you to compose multiple red team strategies sequentially, creating sophisticated attack cha...
- [Leetspeak Strategy](leetspeak.md): The Leetspeak strategy tests an AI system's ability to resist encoded inputs that might bypass security controls by r...
- [Likert-based Jailbreaks Strategy](likert.md): The Likert-based Jailbreaks strategy is an advanced technique that leverages an LLM's evaluation capabilities by fram...
- [linking-targets](linking-targets.md): When using custom providers (Python, JavaScript, HTTP), link your local configuration to a cloud target using `linked...
- [LiteLLM](litellm.md): [LiteLLM](https://docs.litellm.ai/docs/) provides access to 400+ LLMs through a unified OpenAI-compatible interface.
- [Llama.cpp](llamacpp.md): The `llama` provider is compatible with the HTTP server bundled with [llama.cpp](https://github.com/ggerganov/llama.c...
- [How to benchmark Llama2 Uncensored vs. GPT-3.5 on your own inputs](llama2-uncensored-benchmark-ollama.md): Most LLMs go through fine-tuning that prevents them from answering questions like "_How do you make Tylenol_", "_Who ...
- [Meta Llama API](llamaapi.md): The Llama API provider enables you to use Meta's hosted Llama models through their official API service. This include...
- [llamafile](llamafile.md): Llamafile has an [OpenAI-compatible HTTP endpoint](https://github.com/Mozilla-Ocho/llamafile?tab=readme-ov-file#json-...
- [How to red team LLM applications](llm-redteaming.md): Promptfoo is a popular open source evaluation framework that includes LLM red team and penetration testing capabilities.
- [LLM Rubric](llm-rubric.md): `llm-rubric` is promptfoo's general-purpose grader for "LLM as a judge" evaluation.
- [LLM Supply Chain Security](llm-supply-chain.md): Secure your LLM supply chain with static model scanning and dynamic behavioral testing to detect trojans, backdoors, ...
- [Types of LLM vulnerabilities](llm-vulnerability-types.md): This page documents categories of potential LLM vulnerabilities and failure modes.
- [Local AI](localai.md): LocalAI is an API wrapper for open-source LLMs that is compatible with OpenAI. You can run LocalAI for compatibility ...
- [Setting up Promptfoo with Looper](looper.md): This guide shows you how to integrate **Promptfoo** evaluations into a Looper CI/CD workflow so that every pull‑reque...
- [Malicious Code Plugin](malicious-code.md): The Malicious Code plugin tests an AI system's ability to resist generating harmful code, exploits, or providing tech...
- [Manual Input Provider](manual-input.md): The Manual Input Provider allows you to manually enter responses for each prompt during the evaluation process. This ...
- [Math Prompt Strategy](math-prompt.md): The Math Prompt strategy tests an AI system's ability to handle harmful inputs using mathematical concepts like set t...
- [Max Score](max-score.md): The `max-score` assertion selects the output with the highest aggregate score from other assertions. Unlike `select-b...
- [MCP Security Testing Guide](mcp-security-testing.md): This guide covers security testing approaches for Model Context Protocol (MCP) servers.
- [Promptfoo MCP Server](mcp-server.md): - Node.js installed on your system
- [Using MCP (Model Context Protocol) in Promptfoo](mcp.md): Promptfoo supports the Model Context Protocol (MCP) for advanced tool use and agentic workflows. MCP allows you to c...
- [Medical Red-Teaming Plugins](medical.md): The Medical Red-Teaming Plugins are a comprehensive suite of tests designed specifically for AI systems operating in ...
- [Memory Poisoning Plugin](memory-poisoning.md): The Memory Poisoning plugin tests whether stateful agents are vulnerable to memory poisoning attacks that manipulate ...
- [Meta-Agent Jailbreaks Strategy](meta.md): The Meta-Agent Jailbreaks strategy (`jailbreak:meta`) uses strategic decision-making to test your system's resilience...
- [Mischievous User Strategy](mischievous-user.md): The **Mischievous User** simulates a multi-turn conversation between a user who is innocently mischievous and likes t...
- [Recreating Mistral Magistral AIME2024 Benchmarks](mistral-magistral-aime2024.md): Mistral's [Magistral models](https://mistral.ai/news/magistral/) achieved **73.6% on AIME2024** (Medium) and **70.7%*...
- [Mistral vs Llama: benchmark on your own data](mistral-vs-llama.md): When Mistral was released, it was the "best 7B model to date" based on a [number of evals](https://mistral.ai/news/an...
- [Mistral AI](mistral.md): The [Mistral AI API](https://docs.mistral.ai/api/) provides access to cutting-edge language models that deliver excep...
- [MITRE ATLAS](mitre-atlas.md): MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is a knowledge base of adversary tacti...
- [Mixtral vs GPT: Run a benchmark with your own data](mixtral-vs-gpt.md): In this guide, we'll walk through the steps to compare three large language models (LLMs): Mixtral, GPT-4.1-mini, and...
- [Testing prompts with Mocha/Chai](mocha-chai.md): `promptfoo` can be integrated with test frameworks like [Mocha](https://mochajs.org/) and assertion libraries like [C...
- [Model Scanning](model-audit.md): ModelAudit is a lightweight static security scanner for machine learning models accessible through Promptfoo. It scan...
- [Detecting Model Drift with Red Teaming](model-drift.md): Model drift occurs when an LLM's behavior changes over time. This can happen due to provider model updates, fine-tuni...
- [Model-graded Closed QA](model-graded-closedqa.md): `model-graded-closedqa` is a criteria-checking evaluation that uses OpenAI's public evals prompt to determine if an L...
- [Model-graded metrics](model-graded.md): promptfoo supports several types of model-graded assertions:
- [Moderation](moderation.md): Use the `moderation` assert type to ensure that LLM outputs are safe.
- [Managing Large Promptfoo Configurations](modular-configs.md): As your Promptfoo evaluations grow more complex, you'll need strategies to keep your configurations manageable, maint...
- [Session Management](multi-turn-sessions.md): Session management is important for our multi-turn strategies like Crescendo and GOAT. In these cases you want to mak...
- [One doc tagged with "multi-turn"](multi-turn.md): Create reusable red team strategies by writing natural language instructions that guide AI through multi-turn convers...
- [Multi-Modal Red Teaming](multimodal-red-team.md): Large language models with multi-modal capabilities (vision, audio, etc.) present unique security challenges compared...
- [Handling Multiple Response Types](multiple-response-types.md): There are cases where your target could respond with multiple object types. This is usually the case when the target ...
- [Using Promptfoo in n8n Workflows](n8n.md): This guide shows how to run Promptfoo evaluations from an **n8n** workflow so you can:
- [NIST AI Risk Management Framework](nist-ai-rmf.md): The NIST AI Risk Management Framework (AI RMF) is a voluntary framework developed by the U.S. National Institute of S...
- [Using the node package](node-package.md): promptfoo is available as a node package [on npm](https://www.npmjs.com/package/promptfoo):
- [Nscale](nscale.md): The Nscale provider enables you to use [Nscale's Serverless Inference API](https://nscale.com/serverless) models with...
- [Off-Topic Plugin](off-topic.md): The Off-Topic Plugin tests whether AI systems can be manipulated to go off-topic from their intended purpose by perfo...
- [Ollama](ollama.md): The `ollama` provider is compatible with [Ollama](https://github.com/jmorganca/ollama), which enables access to Llama...
- [OpenAI Agents](openai-agents.md): Test multi-turn agentic workflows built with the [@openai/agents](https://github.com/openai/openai-agents-js) SDK. Ev...
- [OpenAI ChatKit](openai-chatkit.md): Evaluate [ChatKit](https://platform.openai.com/docs/guides/chatkit) workflows from OpenAI's Agent Builder. This provi...
- [OpenAI Codex SDK](openai-codex-sdk.md): This provider makes OpenAI's Codex SDK available for evals. The Codex SDK supports code generation and manipulation w...
- [OpenAI](openai.md): To use the OpenAI API, set the `OPENAI_API_KEY` environment variable to your OpenAI API key.
- [OpenCode SDK](opencode-sdk.md): This provider integrates [OpenCode](https://opencode.ai/), an open-source AI coding agent for the terminal with suppo...
- [OpenLLM](openllm.md): To use [OpenLLM](https://github.com/bentoml/OpenLLM) with promptfoo, we take advantage of OpenLLM's support for [Open...
- [OpenRouter](openrouter.md): [OpenRouter](https://openrouter.ai/) provides a unified interface for accessing various LLM APIs, including models fr...
- [Other Encodings](other-encodings.md): The other-encodings strategy collection provides multiple text transformation methods to test model resilience agains...
- [Output Formats](outputs.md): Save and analyze your evaluation results in various formats.
- [Overreliance Plugin](overreliance.md): The Overreliance red teaming plugin helps identify vulnerabilities where an AI model might accept and act upon incorr...
- [Red Team Troubleshooting Guide](overview.md): Common issues encountered when red teaming LLM applications with promptfoo.
- [OWASP Top 10 for Agentic Applications](owasp-agentic-ai.md): The [OWASP Top 10 for Agentic Applications](https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications/) ...
- [OWASP API Security Top 10](owasp-api-top-10.md): The OWASP API Security Top 10 is a security awareness document that identifies the most critical security risks to AP...
- [OWASP LLM Top 10](owasp-llm-top-10.md): The OWASP Top 10 for Large Language Model Applications educates developers about security risks in deploying and mana...
- [Prompts, tests, and outputs](parameters.md): Configure how promptfoo evaluates your LLM applications.
- [Perplexity](perplexity.md): The [Perplexity API](https://blog.perplexity.ai/blog/introducing-pplx-api) provides chat completion models with built...
- [Pharmacy Red-Teaming Plugins](pharmacy.md): The Pharmacy Red-Teaming Plugins are a specialized suite designed for AI systems operating in pharmacy and pharmaceut...
- [Phi vs Llama: Benchmark on your own data](phi-vs-llama.md): When choosing between LLMs like Phi 3 and Llama 3.1, it's important to benchmark them on your specific use cases rath...
- [Pi Scorer](pi.md): `pi` is an alternative approach to model grading that uses a dedicated scoring model instead of the "LLM as a judge" ...
- [PII Plugin](pii.md): The PII (Personally Identifiable Information) plugin tests an AI system's ability to protect sensitive personal data....
- [Pliny prompt injections for LLMs](pliny.md): The Pliny plugin is designed to test LLM systems using a curated collection of prompts from the [L1B3RT4S repository]...
- [Red Team Plugins](plugins.md): Plugins are Promptfoo's modular system for testing a variety of risks and vulnerabilities in LLM models and LLM-power...
- [Policy Plugin](policy.md): The Policy red teaming plugin is a customizable tool designed to test whether an AI system adheres to specific polici...
- [Politics Plugin](politics.md): The Politics red teaming plugin is designed to test whether an AI system can be influenced to make political statemen...
- [Portkey AI integration](portkey.md): Portkey is an AI observability suite that includes prompt management capabilities.
- [How to Measure and Prevent LLM Hallucinations](prevent-llm-hallucinations.md): LLMs have great potential, but they are prone to generating incorrect or misleading information, a phenomenon known a...
- [Prompt Extraction Plugin](prompt-extraction.md): The Prompt Extraction red teaming plugin tests an AI system's vulnerability to attacks aimed at extracting the system...
- [Prompt Injection Strategy](prompt-injection.md): The Prompt Injection strategy tests common direct prompt injection vulnerabilities in LLMs.
- [Prompt Configuration](prompts.md): Define what you send to your LLMs - from simple strings to complex multi-turn conversations.
- [LLM Providers](providers.md): Providers in promptfoo are the interfaces to various language models and AI services. This guide will help you unders...
- [Python Assertions](python.md): The `python` assertion allows you to provide a custom Python function to validate the LLM output.
- [Quickstart](quickstart.md): Promptfoo is an [open-source](https://github.com/promptfoo/promptfoo) tool for red teaming gen AI applications.
- [Qwen vs Llama vs GPT: Run a Custom Benchmark](qwen-benchmark.md): As a product developer using LLMs, you are likely focused on a specific use case. Generic benchmarks are easily gamed...
- [Race Bias Plugin](race-bias.md): The Race Bias plugin (`bias:race`) tests whether your AI system reinforces racial stereotypes or discrimination.
- [RAG Document Exfiltration Plugin](rag-document-exfiltration.md): The RAG Document Exfiltration plugin is designed to identify vulnerabilities where an AI model might inadvertently ex...
- [RAG Poisoning](rag-poisoning.md): Promptfoo includes a RAG Poisoning utility that tests your system's resilience against adversarial attacks on the doc...
- [How to red team RAG applications](rag.md): Retrieval-Augmented Generation (RAG) is an increasingly popular LLM-based architecture for knowledge-based AI product...
- [Role-Based Access Control (RBAC) Plugin](rbac.md): The RBAC (Role-Based Access Control) red teaming plugin is designed to test an AI system's ability to maintain proper...
- [Reasoning DoS Plugin](reasoning-dos.md): Reasoning DoS (Denial of Service) is a new vulnerability introduced by reasoning models.
- [One doc tagged with "red-team"](red-team.md): Create reusable red team strategies by writing natural language instructions that guide AI through multi-turn convers...
- [Red teaming](red-teaming.md): Red team LLM systems through systematic adversarial testing to detect content policy violations, information leakage,...
- [Running Red Teams](red-teams.md): Promptfoo Enterprise allows you to configure targets, plugin collections, and scan configurations that can be shared ...
- [Configuration Reference - Complete API Documentation | Promptfoo](reference.md): - Config
- [releases](releases.md): One doc tagged with "releases"
- [Religion Plugin](religion.md): The Religion red teaming plugin is designed to test whether an AI system can be influenced to make potentially contro...
- [Remediation Reports](remediation-reports.md): Promptfoo Enterprise automatically generates remediation reports after each red team scan. These reports provide acti...
- [Remote Generation Errors](remote-generation.md): You may encounter connection issues due to corporate firewalls or security policies. Since our service generates pote...
- [Replicate](replicate.md): Replicate is an API for machine learning models. It currently hosts models like [Llama v2](https://replicate.com/repl...
- [Retry Strategy](retry.md): The retry strategy automatically incorporates previously failed test cases into your test suite, creating a regressio...
- [Risk Scoring](risk-scoring.md): Promptfoo provides a risk scoring system that quantifies the severity and likelihood of vulnerabilities in your LLM a...
- [ROT13 Encoding Strategy](rot13.md): The ROT13 Encoding strategy tests an AI system's ability to resist encoded inputs that might bypass security controls...
- [Ruby Assertions](ruby.md): The `ruby` assertion allows you to provide a custom Ruby function to validate the LLM output.
- [Amazon SageMaker AI](sagemaker.md): The `sagemaker` provider allows you to use Amazon SageMaker AI endpoints in your evals. This enables testing and eval...
- [Sandboxed Evaluations of LLM-Generated Code](sandboxed-code-evals.md): You're using LLMs to generate code snippets, functions, or even entire programs. Blindly trusting and executing this ...
- [ModelAudit Scanners](scanners.md): ModelAudit includes specialized scanners for different model formats and file types. Each scanner is designed to iden...
- [Scenarios](scenarios.md): The `scenarios` configuration lets you group a set of data along with a set of tests that should be run on that data....
- [Search-Rubric](search-rubric.md): The `search-rubric` assertion type is like `llm-rubric` but with web search capabilities. It evaluates outputs accord...
- [Select Best](select-best.md): The `select-best` assertion compares multiple outputs in the same test case and selects the one that best meets a spe...
- [Self-hosting Promptfoo](self-hosting.md): Promptfoo provides a basic Docker image that allows you to host a server that stores evals. This guide covers various...
- [Sequence Provider](sequence.md): The Sequence Provider allows you to send a series of prompts to another provider in sequence, collecting and combinin...
- [Service Accounts](service-accounts.md): Service accounts allow you to create API keys for programmatic access to Promptfoo Enterprise. These are useful for C...
- [SharePoint Integration](sharepoint.md): promptfoo allows you to import eval test cases directly from Microsoft SharePoint CSV files using certificate-based a...
- [Sharing](sharing.md): Share your eval results with others using the `share` command or the web interface.
- [Shell Injection Plugin](shell-injection.md): The Shell Injection plugin is designed to test an AI system's vulnerability to attacks that attempt to execute unauth...
- [Similarity (embeddings)](similar.md): The `similar` assertion checks if an embedding of the LLM's output is semantically similar to the expected value, usi...
- [Simulated User](simulated-user.md): The Simulated User Provider enables testing of multi-turn conversations between an AI agent and a simulated user. Thi...
- [Slack Provider](slack.md): The Slack provider enables human-in-the-loop evaluations by sending prompts to Slack channels or users and collecting...
- [snowflake](snowflake.md): [Snowflake Cortex](https://docs.snowflake.com/en/user-guide/snowflake-cortex/overview) is Snowflake's AI and ML platf...
- [Integrate Promptfoo with SonarQube](sonarqube.md): This guide demonstrates how to integrate Promptfoo's scanning results into SonarQube, allowing red team findings to a...
- [Special Token Injection for LLMs](special-token-injection.md): Special Token Injection (STI) is a technique that exploits conversation format delimiters to manipulate LLM behavior....
- [SQL Injection Plugin](sql-injection.md): The SQL Injection red teaming plugin is designed to test an AI system's vulnerability to attacks that attempt to exec...
- [Server-Side Request Forgery (SSRF) Plugin](ssrf.md): The SSRF (Server-Side Request Forgery) red teaming plugin is designed to test an AI system's vulnerability to attacks...
- [One doc tagged with "strategies"](strategies.md): Create reusable red team strategies by writing natural language instructions that guide AI through multi-turn convers...
- [System Prompt Override Plugin](system-prompt-override.md): System prompts serve as the foundation of LLM security and behavior control. They define how a model should behave, w...
- [Tags](tags.md): - [changelog](/docs/tags/changelog/) (1)
- [Managing Roles and Teams](teams.md): Promptfoo Enterprise supports a flexible role-based access control (RBAC) system that allows you to manage user acces...
- [Telemetry](telemetry.md): `promptfoo` collects basic anonymous telemetry by default. This telemetry helps us decide how to spend time on develo...
- [Test Case Configuration](test-cases.md): Define evaluation scenarios with variables, assertions, and test data.
- [Testing and Validating Guardrails](testing-guardrails.md): Guardrails are security filters that help protect your AI applications from misuse. This guide explains how to test a...
- [Testing LLM chains](testing-llm-chains.md): Prompt chaining is a common pattern used to perform more complex reasoning with LLMs. It's used by libraries like [La...
- [text-generation-webui](text-generation-webui.md): promptfoo can run evals on oobabooga's gradio based [text-generation-webui](https://github.com/oobabooga/text-generat...
- [Evaluating LLM text-to-SQL performance](text-to-sql-evaluation.md): Promptfoo is a command-line tool that allows you to test and validate text-to-SQL conversions.
- [Together AI](togetherai.md): [Together AI](https://www.together.ai/) provides access to open-source models through an API compatible with OpenAI's...
- [Tool Discovery](tool-discovery.md): The Tool Discovery plugin tests if an AI system reveals the list of tools, functions, or API calls that it has access...
- [toxic-chat](toxic-chat.md): The ToxicChat plugin tests your model's ability to handle toxic user prompts and resist jailbreaking attempts using t...
- [Tracing](tracing.md): Promptfoo supports OpenTelemetry (OTLP) tracing to help you understand the internal operations of your LLM providers ...
- [Travis CI Integration](travis-ci.md): This guide demonstrates how to set up promptfoo with Travis CI to run evaluations as part of your CI pipeline.
- [Tree-based Jailbreaks Strategy](tree.md): The Tree-based Jailbreaks strategy is an advanced technique designed to systematically explore and potentially bypass...
- [Troubleshooting](troubleshooting.md): Before troubleshooting specific issues, you can access detailed logs to help diagnose problems:
- [TrueFoundry](truefoundry.md): [TrueFoundry](https://www.truefoundry.com/ai-gateway) is an LLM gateway that provides unified access to 1000+ LLMs th...
- [UnsafeBench Plugin](unsafebench.md): The UnsafeBench plugin tests multi-modal models with potentially unsafe images from the [UnsafeBench dataset](https:/...
- [Unverifiable Claims Plugin](unverifiable-claims.md): The Unverifiable Claims plugin tests whether AI systems make claims about information that cannot be verified or meas...
- [updates](updates.md): One doc tagged with "updates"
- [usage](usage.md): Usage
- [Google Vertex](vertex.md): The `vertex` provider enables integration with Google's official Vertex AI platform, which provides access to foundat...
- [Video Jailbreaking](video.md): The Video strategy converts prompt text into a video with text overlay and then encodes that video as a base64 string...
- [VLGuard Plugin](vlguard.md): The VLGuard plugin tests multi-modal models with potentially unsafe images from the [VLGuard dataset](https://hugging...
- [vllm](vllm.md): vllm's [OpenAI-compatible server](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-se...
- [Voyage AI](voyage.md): [Voyage AI](https://www.voyageai.com/) is Anthropic's [recommended](https://docs.anthropic.com/en/docs/embeddings) em...
- [VS Code Extension](vscode-extension.md): The Promptfoo Security Scanner for VS Code detects LLM security vulnerabilities directly in your editor. It finds pro...
- [WatsonX](watsonx.md): [IBM WatsonX](https://www.ibm.com/watsonx) offers a range of enterprise-grade foundation models optimized for various...
- [Using the web viewer](web-ui.md): After [running an eval](/docs/getting-started/), view results in your browser:
- [Generic Webhook](webhook.md): The webhook provider can be useful for triggering more complex flows or prompt chains end to end in your app.
- [Webhook Integration](webhooks.md): Promptfoo Enterprise provides webhooks to notify external systems when security vulnerabilities (issues) are created ...
- [WebSockets](websocket.md): The WebSocket provider allows you to connect to a WebSocket endpoint for inference. This is useful for real-time, bid...
- [Wordplay Plugin](wordplay.md): The Wordplay red teaming plugin tests whether an AI system can be tricked into generating profanity or offensive lang...
- [Write for Promptfoo](write-for-promptfoo.md): If you enjoy Promptfoo, want to help others learn it, and would like to build your reputation as a writer, this is fo...
- [xAI (Grok) Provider](xai.md): The `xai` provider supports xAI's Grok models through an API interface compatible with OpenAI's format. The provider ...
- [XSTest Homonym Dataset](xstest.md): The XSTest plugin tests how well LLMs handle ambiguous words (homonyms) that can have both harmful and benign interpr...