# Promptfoo

> Promptfoo is an open-source CLI and library for evaluating and red-teaming LLM applications.

## Pages

- [Adaline Gateway](adaline.md): Adaline Gateway is a fully local production-grade Super SDK that provides a simple, unified, and powerful interface f...
- [Aegis: NVIDIA AI Content Safety Dataset](aegis.md): The Aegis plugin uses NVIDIA's [Aegis AI Content Safety Dataset](https://huggingface.co/datasets/nvidia/Aegis-AI-Cont...
- [Age Bias Plugin](age-bias.md): The Age Bias plugin (`bias:age`) tests whether your AI system reinforces age-based stereotypes or discrimination.
- [How to red team LLM Agents](agents.md): LLM agents are capable of interacting with their environment and executing complex tasks using natural language inter...
- [AI21 Labs](ai21.md): The [AI21 Labs API](https://docs.ai21.com/reference/chat-completion) offers access to AI21 models such as `jamba-1.5-...
- [AI/ML API](aimlapi.md): [AI/ML API](https://aimlapi.com/) provides access to 300+ AI models through a unified OpenAI-compatible interface, in...
- [Alibaba Cloud (Qwen)](alibaba.md): [Alibaba Cloud's DashScope API](https://www.alibabacloud.com/help/en/model-studio/getting-started/models) provides Op...
- [Answer Relevance](answer-relevance.md): The `answer-relevance` assertion evaluates whether an LLM's output is relevant to the original query. It uses a combi...
- [Anthropic](anthropic.md): This provider supports the [Anthropic Claude](https://www.anthropic.com/claude) series of models.
- [API Reference | Promptfoo](api-reference.md): - [Red Teaming](/red-teaming/)
- [Architecture](architecture.md): Promptfoo automated red teaming consists of three main components: **plugins**, **strategies**, and **targets**.
- [ASCII Smuggling for LLMs](ascii-smuggling.md): ASCII smuggling is a technique that uses a special set of Unicode code points from the Tags Unicode Block to embed in...
- [attack-generation](attack-generation.md): Sometimes attacks may not be generated as expected. This is usually due to the `Purpose` property not being clear eno...
- [Audio Jailbreaking](audio.md): The Audio strategy converts prompt text into speech audio and then encodes that audio as a base64 string. This allows...
- [Audit Logging](audit-logging.md): Audit Logging is a feature of promptfoo Enterprise that provides forensic access information at the organization leve...
- [Authentication](authentication.md): Promptfoo supports both basic authentication and SSO through SAML 2.0 and OIDC. To configure SSO with Promptfoo Enter...
- [Authoritative Markup Injection Strategy](authoritative-markup-injection.md): The Authoritative Markup Injection strategy tests whether AI systems are more susceptible to harmful requests when th...
- [AWS Bedrock | Promptfoo](aws-bedrock.md): 1. **Model Access**: Amazon Bedrock provides automatic access to serverless foundation models with no manual approval...
- [Azure Pipelines Integration](azure-pipelines.md): This guide demonstrates how to set up promptfoo with Azure Pipelines to run evaluations as part of your CI pipeline.
- [OpenAI vs Azure: How to benchmark](azure-vs-openai.md): Whether you use GPT through the OpenAI or Azure APIs, the results are pretty similar. But there are some key differen...
- [Azure OpenAI Provider | Promptfoo](azure.md): There are three ways to authenticate with Azure OpenAI:
- [Base64 Encoding Strategy](base64.md): The Base64 Encoding strategy tests an AI system's ability to resist encoded inputs that might bypass security control...
- [Basic Strategy](basic.md): The basic strategy controls whether the original plugin-generated test cases (without any strategies applied) are inc...
- [BeaverTails Dataset for LLM Safety Testing](beavertails.md): The BeaverTails plugin uses the [BeaverTails dataset](https://huggingface.co/datasets/PKU-Alignment/BeaverTails), a d...
- [AWS Bedrock Agents](bedrock-agents.md): The AWS Bedrock Agents provider enables you to test and evaluate AI agents built with Amazon Bedrock Agents. Amazon B...
- [Best-of-N (BoN) Jailbreaking Strategy](best-of-n.md): Best-of-N (BoN) is a simple but effective black-box jailbreaking algorithm that works by repeatedly sampling variatio...
- [Best Practices for Configuring AI Red Teaming](best-practices.md): To successfully use AI red teaming automation, you **must** provide rich application context and a diverse set of att...
- [Broken Function Level Authorization (BFLA) Plugin](bfla.md): The BFLA (Broken Function Level Authorization) red teaming plugin is designed to test an AI system's ability to maint...
- [Bias Detection Plugins](bias.md): Test whether your AI system produces or reinforces stereotypes, biases, or discrimination across different protected ...
- [Bitbucket Pipelines Integration](bitbucket-pipelines.md): This guide demonstrates how to set up promptfoo with Bitbucket Pipelines to run evaluations as part of your CI pipeline.
- [BOLA (Broken Object Level Authorization) Plugin](bola.md): The BOLA (Broken Object Level Authorization) red teaming plugin is designed to test an AI system's vulnerability to a...
- [Browser Provider](browser.md): The Browser Provider enables automated web browser interactions for testing complex web applications and JavaScript-h...
- [Building trust in AI with Portkey and Promptfoo](building-trust-in-ai-with-portkey-and-promptfoo.md): This guide was written by **Drishti Shah** from [Portkey](https://portkey.ai/), a guest author contributing to the Pr...
- [Finding LLM Jailbreaks with Burp Suite](burp.md): This guide shows how to integrate Promptfoo's application-level jailbreak creation with Burp Suite's Intruder feature...
- [Caching](caching.md): promptfoo caches the results of API calls to LLM providers to help save time and cost.
- [Cerebras](cerebras.md): This provider enables you to use Cerebras models through their [Inference API](https://docs.cerebras.ai/).
- [changelog](changelog.md): One doc tagged with "changelog"
- [Chat Conversations / Threads](chat.md): The [prompt file](/docs/configuration/prompts/#file-based-prompts) supports a message in OpenAI's JSON prompt format....
- [Red teaming a Chatbase Chatbot](chatbase-redteam.md): [Chatbase](https://www.chatbase.co/) is a platform for building custom AI chatbots that can be embedded into websites...
- [Choosing the best GPT model: benchmark on your own data](choosing-best-gpt-model.md): This guide will walk you through how to compare OpenAI's GPT-4o and GPT-4.1-mini, top contenders for the most powerfu...
- [CI/CD Integration for LLM Evaluation and Security](ci-cd.md): Integrate promptfoo into your CI/CD pipelines to automatically evaluate prompts, test for security vulnerabilities, a...
- [Setting up Promptfoo with CircleCI](circle-ci.md): This guide shows how to integrate promptfoo's LLM evaluation into your CircleCI pipeline. This allows you to automati...
- [Authority-based Jailbreaking](citation.md): The Citation strategy is a red teaming technique that uses academic citations and references to potentially bypass an...
- [Classifier grading](classifier.md): Use the `classifier` assert type to run the LLM output through any [HuggingFace text classifier](https://huggingface....
- [Claude Agent SDK](claude-agent-sdk.md): This provider makes [Claude Agent SDK](https://docs.claude.com/en/api/agent-sdk/overview) available for evals through...
- [Claude 3.7 vs GPT-4.1: Benchmark on Your Own Data](claude-vs-gpt.md): When evaluating the performance of LLMs, generic benchmarks will only get you so far. This is especially the case for...
- [CLI Command](cli.md): The `promptfoo code-scans` command scans code changes for LLM-related security vulnerabilities, helping you identify ...
- [Cloudera](cloudera.md): The Cloudera provider allows you to interact with Cloudera's AI endpoints using the OpenAI protocol. It supports chat...
- [Cloudflare Workers AI](cloudflare-ai.md): This provider supports the [models](https://developers.cloudflare.com/workers-ai/models/) provided by Cloudflare Work...
- [Code Scanning](code-scanning.md): Promptfoo Code Scanning uses AI agents to find LLM-related vulnerabilities in your codebase and helps you fix them be...
- [Command R vs GPT vs Claude: create your own benchmark](cohere-command-r-benchmark.md): While public benchmarks provide a general sense of capability, the only way to truly understand which model will perf...
- [Cohere](cohere.md): The `cohere` provider is an interface to Cohere AI's [chat inference API](https://docs.cohere.com/reference/chat), wi...
- [CometAPI](cometapi.md): The `cometapi` provider lets you use [CometAPI](https://www.cometapi.com/?utm_source=promptfoo&utm_campaign=integrati...
- [Command line](command-line.md): The `promptfoo` command line utility supports the following subcommands:
- [Llama 3.1 vs GPT: Benchmark on your own data](compare-llama2-vs-gpt.md): This guide describes how to compare three models - Llama 3.1 405B, GPT 4o, and gpt-5-mini - using the `promptfoo` CLI.
- [Competitors Plugin](competitors.md): The Competitors red teaming plugin is designed to test whether an AI system can be influenced to mention or recommend...
- [Composite Jailbreaks Strategy](composite-jailbreaks.md): The Composite Jailbreaks strategy combines multiple jailbreak techniques from top research papers to create more soph...
- [configuration](configuration.md): Configuration
- [Connecting to Targets](connecting-to-targets.md): When setting up your target, use these best practices:
- [Context Compliance Attack Plugin](context-compliance-attack.md): Context Compliance Attacks (CCAs) exploit a dangerous flaw in many LLM deployments: **the failure to verify conversat...
- [Context faithfulness](context-faithfulness.md): Checks if the LLM's response only makes claims that are supported by the provided context.
- [Context recall](context-recall.md): Checks if your retrieved context contains the information needed to generate a known correct answer.
- [Context relevance](context-relevance.md): Measures what fraction of retrieved context is minimally needed to answer the query.
- [Contracts Plugin](contracts.md): The Contracts red teaming plugin is designed to test whether an AI system can be influenced to enter into unintended ...
- [Contributing to promptfoo](contributing.md): We welcome contributions from the community to help make promptfoo better. This guide will help you get started. If y...
- [Conversation Relevance](conversation-relevance.md): The `conversation-relevance` assertion evaluates whether responses in a conversation remain relevant throughout the d...
- [COPPA](coppa.md): The COPPA (Children's Online Privacy Protection Act) red teaming plugin tests whether AI systems properly protect chi...
- [cross-session-leak](cross-session-leak.md): Cross-Session Leak Plugin
- [Javascript Provider](custom-api.md): Custom Javascript providers let you create providers in JavaScript or TypeScript to integrate with any API or service...
- [Custom Scripts](custom-script.md): You may use any shell command as an API provider. This is particularly useful when you want to use a language or fram...
- [Custom Strategy](custom-strategy.md): Write natural language instructions to create powerful multi-turn red team strategies. No coding required.
- [One doc tagged with "custom"](custom.md): Create reusable red team strategies by writing natural language instructions that guide AI through multi-turn convers...
- [CyberSecEval Dataset for LLM Security Testing](cyberseceval.md): The CyberSecEval plugin uses Meta's [Purple Llama CyberSecEval dataset](https://meta-llama.github.io/PurpleLlama/docs...
- [Data Handling and Privacy](data-handling.md): This page explains what data leaves your machine during red team testing and how to control it.
- [Databricks Foundation Model APIs](databricks.md): The Databricks provider integrates with Databricks' Foundation Model APIs, offering access to state-of-the-art models...
- [Dataset generation](datasets.md): Your dataset is the heart of your LLM eval. To the extent possible, it should closely represent true inputs into your...
- [DBRX vs Mixtral vs GPT: create your own benchmark](dbrx-benchmark.md): There are many generic benchmarks that measure LLMs like DBRX, Mixtral, and others in a similar performance class. Bu...
- [Debug Access Plugin](debug-access.md): The Debug Access red teaming plugin is designed to test whether an AI system has an exposed debugging interface or re...
- [Deepseek vs GPT vs O3 vs Llama: Run a Custom Benchmark](deepseek-benchmark.md): Deepseek is a new Mixture-of-Experts (MoE) model that's all the rage due to its impressive performance, especially in...
- [deepseek](deepseek.md): [DeepSeek](https://platform.deepseek.com/) provides an OpenAI-compatible API for their language models, with speciali...
- [Deterministic Metrics for LLM Output Validation](deterministic.md): These metrics are created by logical tests that are run on LLM output.
- [Disability Bias Plugin](disability-bias.md): The Disability Bias plugin (`bias:disability`) tests whether your AI system reinforces disability stereotypes or disc...
- [Target Discovery](discovery.md): Promptfoo's **Target Discovery Agent** automatically extracts useful information about generative AI systems that you...
- [Divergent Repetition Plugin](divergent-repetition.md): The Divergent Repetition red teaming plugin is designed to test whether an AI system can be manipulated into revealin...
- [Docker Model Runner](docker.md): [Docker Model Runner](https://docs.docker.com/ai/model-runner/) makes it easy to manage, run, and deploy AI models us...
- [DoNotAnswer Dataset](donotanswer.md): The DoNotAnswer plugin tests how well LLMs handle harmful queries. The dataset contains questions that responsible AI...
- [Echo Provider](echo.md): The Echo Provider is a simple utility provider that returns the input prompt as the output. It's particularly useful ...
- [E-commerce Red Teaming Plugins](ecommerce.md): The e-commerce red teaming plugins are designed to test AI systems deployed in online retail contexts for critical vu...
- [ElevenLabs](elevenlabs.md): The ElevenLabs provider integrates multiple AI audio capabilities for comprehensive voice AI testing and evaluation.
- [Promptfoo Enterprise](enterprise.md): Promptfoo offers two deployment options to meet your security needs:
- [Envoy AI Gateway](envoy.md): [Envoy AI Gateway](https://aigateway.envoyproxy.io/) is an open-source AI gateway that provides a unified proxy layer...
- [EU AI Act](eu-ai-act.md): The EU Artificial Intelligence Act (AI Act) is the world's first comprehensive legal framework specifically regulatin...
- [Evaluating LLM safety with HarmBench](evaling-with-harmbench.md): Recent research has shown that even the most advanced LLMs [remain vulnerable](https://unit42.paloaltonetworks.com/ja...
- [Evaluate Coding Agents](evaluate-coding-agents.md): Coding agents present a different evaluation challenge than standard LLMs. A chat model transforms input to output in...
- [Red Teaming a CrewAI Agent](evaluate-crewai.md): [CrewAI](https://github.com/joaomdmoura/crewai) is a cutting-edge multi-agent platform designed to help teams streaml...
- [Evaluating ElevenLabs voice AI](evaluate-elevenlabs.md): This guide walks you through testing ElevenLabs voice AI capabilities using Promptfoo, from basic text-to-speech qual...
- [LLM evaluation techniques for JSON outputs](evaluate-json.md): Getting an LLM to output valid JSON can be a difficult task. There are a few failure modes:
- [Evaluate LangGraph: Red Teaming and Testing Stateful Agents](evaluate-langgraph.md): [LangGraph](https://github.com/langchain-ai/langgraph) is an advanced framework built on top of LangChain, designed t...
- [Choosing the right temperature for your LLM](evaluate-llm-temperature.md): The `temperature` setting in language models is like a dial that adjusts how predictable or surprising the responses ...
- [How to evaluate OpenAI Assistants](evaluate-openai-assistants.md): OpenAI recently released an [Assistants API](https://platform.openai.com/docs/assistants/overview) that offers simpli...
- [Evaluating RAG pipelines](evaluate-rag.md): Retrieval-augmented generation is a method for enriching LLM prompts with relevant data. Typically, the user prompt w...
- [How to evaluate GPT 3.5 vs Llama2-70b with Replicate Lifeboat](evaluate-replicate-lifeboat.md): Replicate put together a ["Lifeboat" OpenAI proxy](https://lifeboat.replicate.dev/) that allows you to swap to their ...
- [Excessive Agency Plugin](excessive-agency.md): The Excessive Agency red teaming plugin tests whether an AI is aware of its own capabilities and limitations by promp...
- [Assertions & metrics](expected-outputs.md): Assertions are used to compare the LLM output against expected values or conditions. While assertions are not require...
- [F5](f5.md): [F5](https://f5.ai/) provides an interface for a handful of LLM APIs.
- [Evaluating factuality](factuality-eval.md): Factuality is the measure of how accurately an LLM's response aligns with established facts or reference information....
- [Factuality](factuality.md): The `factuality` assertion evaluates the factual consistency between an LLM output and a reference answer. It uses a ...
- [fal.ai](fal.md): The `fal` provider supports the [fal.ai](https://fal.ai/) inference API using the [fal-js](https://github.com/fal-ai/...
- [Preventing False Positives](false-positives.md): False positives occur when a test case is marked as passing when it should have been marked as failing or vice versa....
- [Frequently asked questions](faq.md): Promptfoo is a local-first, open-source tool designed to help evaluate (eval) large language models (LLMs). Promptfoo...
- [features](features.md): One doc tagged with "features"
- [FERPA](ferpa.md): The FERPA (Family Educational Rights and Privacy Act) red teaming plugin tests whether AI systems properly protect st...
- [Financial Red-Teaming Plugins](financial.md): The Financial Red-Teaming Plugins are a specialized suite of tests designed for AI systems operating in financial ins...
- [Findings and Reports](findings.md): Promptfoo Enterprise allows you to review findings and reports from scans within the Promptfoo application.
- [fireworks](fireworks.md): Fireworks AI
- [How to Red Team Foundation Models](foundation-models.md): LLM security starts at the foundation model level. Assessing the security of foundation models is the first step to b...
- [G-Eval](g-eval.md): G-Eval is a framework that uses LLMs with chain-of-thoughts (CoT) to evaluate LLM outputs based on custom criteria. I...
- [Greedy Coordinate Gradient (GCG)](gcg.md): The GCG strategy implements the attack method described in "[Universal and Transferable Adversarial Attacks on Aligne...
- [GDPR](gdpr.md): The EU General Data Protection Regulation (GDPR) is the world's most comprehensive data privacy and security law. Whi...
- [Gemini vs GPT: benchmark on your own data](gemini-vs-gpt.md): When comparing Gemini with GPT, you'll find plenty of evals and opinions online. Model capabilities set a _ceiling_ on...
- [Gemma vs Llama: benchmark on your own data](gemma-vs-llama.md): Comparing Google's Gemma and Meta's Llama involves more than just looking at their specs and reading about generic be...
- [Gemma vs Mistral: benchmark on your own data](gemma-vs-mistral.md): When comparing the performance of LLMs, it's best not to rely on generic benchmarks. This guide shows you how to set ...
- [gender-bias](gender-bias.md): The Gender Bias plugin (`bias:gender`) tests whether your AI system reinforces gender stereotypes or discrimination.
- [Getting started](getting-started.md): After [installing](/docs/installation/) promptfoo, you can set up your first config file in two ways:
- [GitHub Action](github-action.md): Automatically scan pull requests for LLM security vulnerabilities with promptfoo's [code scanning GitHub action](/cod...
- [GitHub Models](github.md): [GitHub Models](https://github.com/marketplace/models/) provides access to industry-leading AI models from OpenAI, An...
- [Setting up Promptfoo with GitLab CI](gitlab-ci.md): This guide shows how to integrate Promptfoo's LLM evaluation into your GitLab CI pipeline. This allows you to automat...
- [Custom Go Provider](go.md): The Go (`golang`) provider allows you to use Go code as an API provider for evaluating prompts. This is useful when y...
- [Goal Misalignment Plugin](goal-misalignment.md): The Goal Misalignment Plugin tests whether AI systems recognize when optimizing measurable proxy metrics might not al...
- [GOAT Technique for Jailbreaking LLMs](goat.md): The GOAT (Generative Offensive Agent Tester) strategy is an advanced automated red teaming technique that uses an "at...
- [Testing Google Cloud Model Armor](google-cloud-model-armor.md): [Model Armor](https://cloud.google.com/security-command-center/docs/model-armor-overview) is a Google Cloud service t...
- [Google Sheets Integration](google-sheets.md): promptfoo allows you to import eval test cases directly from Google Sheets. This can be done either unauthenticated (...
- [Google AI / Gemini](google.md): The `google` provider enables integration with Google AI Studio and the Gemini API. It provides access to Google's st...
- [GPT 3.5 vs GPT 4: benchmark on your own data](gpt-35-vs-gpt-4.md): This guide will walk you through how to compare OpenAI's GPT-3.5 and GPT-4 using promptfoo. This testing framework wi...
- [GPT-4o vs GPT-4.1-mini: Benchmark on Your Own Data](gpt-4-vs-gpt-4o.md): OpenAI released [gpt-5-mini](https://openai.com/index/gpt-5-mini-advancing-cost-efficient-intelligence/), a highly co...
- [GPT-4.1 vs GPT-4o: MMLU Benchmark Comparison](gpt-41-vs-gpt-4o-mmlu.md): OpenAI's [GPT-4.1](https://openai.com/index/introducing-gpt-4-1-in-the-api/) scores **90.2% on MMLU** vs GPT-4o's 85....
- [gpt-5 vs o1: Benchmark on Your Own Data](gpt-vs-o1.md): OpenAI has released a new model series called o1 designed to spend more time thinking before responding and excel at ...
- [About the Grader](grading-results.md): When you run a red team scan against a target, Promptfoo will evaluate the results of the output and determine whethe...
- [Groq](groq.md): [Groq](https://groq.com/) is an extremely fast inference API compatible with all the options provided by Promptfoo's ...
- [Guardrails](guardrails.md): Use the `guardrails` assert type to ensure that LLM outputs pass safety checks based on the provider's built-in guard...
- [Configuration Guide - Getting Started with Promptfoo](guide.md): The YAML configuration format runs each prompt through a series of example inputs (aka "test case") and checks if the...
- [Guides](guides.md): Learn how to test complex LLM chains and RAG systems with unit tests and end-to-end validation to ensure reliable out...
- [Hallucination Plugin](hallucination.md): The Hallucination red teaming plugin is designed to test whether a large language model can be led to fabricate answe...
- [HarmBench Plugin](harmbench.md): The HarmBench plugin uses the [HarmBench dataset](https://github.com/centerforaisafety/HarmBench/tree/main/data/behav...
- [Harmful Content Plugin](harmful.md): The Harmful Content plugin is designed to test an AI system's ability to handle and respond appropriately to various ...
- [helicone](helicone.md): Helicone integration
- [Hex Encoding Strategy](hex.md): The Hex Encoding strategy tests an AI system's ability to resist ASCII hex-encoded inputs that might bypass security ...
- [Hijacking Plugin](hijacking.md): The Hijacking red teaming plugin is designed to test whether a large language model can be distracted from its intend...
- [Testing Humanity's Last Exam with Promptfoo](hle-benchmark.md): [Humanity's Last Exam (HLE)](https://arxiv.org/abs/2501.14249) is a challenging benchmark commissioned by Scale AI an...
- [homoglyph](homoglyph.md): Homoglyph Encoding Strategy
- [HTTP/HTTPS API | Promptfoo](http.md): Setting the provider ID to a URL sends an HTTP request to the endpoint. This provides a general-purpose way to use an...
- [HuggingFace Datasets](huggingface-datasets.md): Promptfoo can import test cases directly from [HuggingFace datasets](https://huggingface.co/docs/datasets) using the ...
- [HuggingFace](huggingface.md): Promptfoo includes support for the [HuggingFace Inference Providers](https://huggingface.co/docs/inference-providers)...
- [Hydra Multi-turn Strategy](hydra.md): The Hydra strategy (`jailbreak:hydra`) runs a multi-turn attacker agent that adapts to every response from your targe...
- [Hyperbolic](hyperbolic.md): The `hyperbolic` provider supports [Hyperbolic's API](https://docs.hyperbolic.xyz/), which provides access to various...
- [ibm-bam](ibm-bam.md): IBM BAM (Deprecated)
- [Image Jailbreaking](image.md): The Image strategy converts prompt text into an image and then encodes that image as a base64 string. This approach e...
- [Imitation Plugin](imitation.md): The Imitation red teaming plugin is designed to test whether an AI system can be influenced to imitate a specific per...
- [Indirect Prompt Injection Plugin](indirect-prompt-injection.md): Tests whether untrusted data (RAG context, emails, user profiles) can hijack your model when placed into the prompt.
- [Configuring Inference](inference-limit.md): Promptfoo open-source red teaming requires inference to generate probes and grade results. When using Promptfoo’s ope...
- [Installation](installation.md): - Node.js 20 or newer
- [Insurance Red-Teaming Plugins](insurance.md): The Insurance Red-Teaming Plugins are a specialized suite designed for AI systems operating in health insurance conte...
- [Integrations](integrations.md): Use Python for promptfoo evals - providers, assertions, test generators, and prompts. Integrates with LangChain, Lang...
- [Intent (Custom Prompts) Plugin](intent.md): The Intent plugin is designed to make it easy to test preset inputs to see if they can successfully manipulate an AI ...
- [Intro](intro.md): `promptfoo` is an [open-source](https://github.com/promptfoo/promptfoo) CLI and library for evaluating and red-teamin...
- [ISO 42001](iso-42001.md): ISO/IEC 42001:2023 is the international standard for AI Management Systems. It provides organizations with a structur...
- [Iterative Jailbreaks Strategy](iterative.md): The Iterative Jailbreaks strategy is a technique designed to systematically probe and potentially bypass an AI system...
- [Javascript assertions](javascript.md): The `javascript` [assertion](https://www.promptfoo.dev/docs/configuration/expected-outputs/) allows you to provide a ...
- [Setting up Promptfoo with Jenkins](jenkins.md): This guide demonstrates how to integrate Promptfoo's LLM evaluation into your Jenkins pipeline. This setup enables au...
- [Testing prompts with Jest and Vitest](jest.md): `promptfoo` can be integrated with test frameworks like [Jest](https://jestjs.io/) and [Vitest](https://vitest.dev/) ...
- [JFrog ML](jfrog.md): This documentation covers the **JFrog ML** provider for AI model inference (formerly known as Qwak). This is differen...
- [Using LangChain PromptTemplate with Promptfoo](langchain-prompttemplate.md): LangChain PromptTemplate is commonly used to format prompts by injecting variables. Promptfoo allows you to evaluat...
- [Langfuse integration](langfuse.md): [Langfuse](https://langfuse.com/) is an open-source LLM engineering platform that includes collaborative prompt manag...
- [Layer Strategy](layer.md): The Layer strategy allows you to compose multiple red team strategies sequentially, creating sophisticated attack cha...
- [Leetspeak Strategy](leetspeak.md): The Leetspeak strategy tests an AI system's ability to resist encoded inputs that might bypass security controls by r...
- [Likert-based Jailbreaks Strategy](likert.md): The Likert-based Jailbreaks strategy is an advanced technique that leverages an LLM's evaluation capabilities by fram...
- [linking-targets](linking-targets.md): When using custom providers (Python, JavaScript, HTTP), link your local configuration to a cloud target using `linked...
- [LiteLLM](litellm.md): [LiteLLM](https://docs.litellm.ai/docs/) provides access to 400+ LLMs through a unified OpenAI-compatible interface.
- [Llama.cpp](llamacpp.md): The `llama` provider is compatible with the HTTP server bundled with [llama.cpp](https://github.com/ggerganov/llama.c...
- [How to benchmark Llama2 Uncensored vs. GPT-3.5 on your own inputs](llama2-uncensored-benchmark-ollama.md): Most LLMs go through fine-tuning that prevents them from answering questions like "_How do you make Tylenol_", "_Who ...
- [Meta Llama API](llamaapi.md): The Llama API provider enables you to use Meta's hosted Llama models through their official API service. This include...
- [llamafile](llamafile.md): Llamafile has an [OpenAI-compatible HTTP endpoint](https://github.com/Mozilla-Ocho/llamafile?tab=readme-ov-file#json-...
- [How to red team LLM applications](llm-redteaming.md): Promptfoo is a popular open source evaluation framework that includes LLM red team and penetration testing capabilities.
- [LLM Rubric](llm-rubric.md): `llm-rubric` is promptfoo's general-purpose grader for "LLM as a judge" evaluation.
- [LLM Supply Chain Security](llm-supply-chain.md): Secure your LLM supply chain with static model scanning and dynamic behavioral testing to detect trojans, backdoors, ...
- [Types of LLM vulnerabilities](llm-vulnerability-types.md): This page documents categories of potential LLM vulnerabilities and failure modes.
- [Local AI](localai.md): LocalAI is an API wrapper for open-source LLMs that is compatible with OpenAI. You can run LocalAI for compatibility ...
- [Setting up Promptfoo with Looper](looper.md): This guide shows you how to integrate **Promptfoo** evaluations into a Looper CI/CD workflow so that every pull‑reque...
- [Malicious Code Plugin](malicious-code.md): The Malicious Code plugin tests an AI system's ability to resist generating harmful code, exploits, or providing tech...
- [Manual Input Provider](manual-input.md): The Manual Input Provider allows you to manually enter responses for each prompt during the evaluation process. This ...
- [Math Prompt Strategy](math-prompt.md): The Math Prompt strategy tests an AI system's ability to handle harmful inputs using mathematical concepts like set t...
- [Max Score](max-score.md): The `max-score` assertion selects the output with the highest aggregate score from other assertions. Unlike `select-b...
- [MCP Security Testing Guide](mcp-security-testing.md): This guide covers security testing approaches for Model Context Protocol (MCP) servers.
- [Promptfoo MCP Server](mcp-server.md): - Node.js installed on your system
- [Using MCP (Model Context Protocol) in Promptfoo](mcp.md): Promptfoo supports the Model Context Protocol (MCP) for advanced tool use and agentic workflows. MCP allows you to c...
- [Medical Red-Teaming Plugins](medical.md): The Medical Red-Teaming Plugins are a comprehensive suite of tests designed specifically for AI systems operating in ...
- [Memory Poisoning Plugin](memory-poisoning.md): The Memory Poisoning plugin tests whether stateful agents are vulnerable to memory poisoning attacks that manipulate ...
- [Meta-Agent Jailbreaks Strategy](meta.md): The Meta-Agent Jailbreaks strategy (`jailbreak:meta`) uses strategic decision-making to test your system's resilience...
- [Mischievous User Strategy](mischievous-user.md): The **Mischievous User** simulates a multi-turn conversation between a user who is innocently mischievous and likes t...
- [Recreating Mistral Magistral AIME2024 Benchmarks](mistral-magistral-aime2024.md): Mistral's [Magistral models](https://mistral.ai/news/magistral/) achieved **73.6% on AIME2024** (Medium) and **70.7%*...
- [Mistral vs Llama: benchmark on your own data](mistral-vs-llama.md): When Mistral was released, it was the "best 7B model to date" based on a [number of evals](https://mistral.ai/news/an...
- [Mistral AI](mistral.md): The [Mistral AI API](https://docs.mistral.ai/api/) provides access to cutting-edge language models that deliver excep...
- [MITRE ATLAS](mitre-atlas.md): MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is a knowledge base of adversary tacti...
- [Mixtral vs GPT: Run a benchmark with your own data](mixtral-vs-gpt.md): In this guide, we'll walk through the steps to compare three large language models (LLMs): Mixtral, GPT-4.1-mini, and...
- [Testing prompts with Mocha/Chai](mocha-chai.md): `promptfoo` can be integrated with test frameworks like [Mocha](https://mochajs.org/) and assertion libraries like [C...
- [Model Scanning](model-audit.md): ModelAudit is a lightweight static security scanner for machine learning models accessible through Promptfoo. It scan...
- [Detecting Model Drift with Red Teaming](model-drift.md): Model drift occurs when an LLM's behavior changes over time. This can happen due to provider model updates, fine-tuni...
- [Model-graded Closed QA](model-graded-closedqa.md): `model-graded-closedqa` is a criteria-checking evaluation that uses OpenAI's public evals prompt to determine if an L...
- [Model-graded metrics](model-graded.md): promptfoo supports several types of model-graded assertions:
- [Moderation](moderation.md): Use the `moderation` assert type to ensure that LLM outputs are safe.
- [Managing Large Promptfoo Configurations](modular-configs.md): As your Promptfoo evaluations grow more complex, you'll need strategies to keep your configurations manageable, maint...
- [Session Management](multi-turn-sessions.md): Session management is important for our multi-turn strategies like Crescendo and GOAT. In these cases you want to mak...
- [One doc tagged with "multi-turn"](multi-turn.md): Create reusable red team strategies by writing natural language instructions that guide AI through multi-turn convers...
- [Multi-Modal Red Teaming](multimodal-red-team.md): Large language models with multi-modal capabilities (vision, audio, etc.) present unique security challenges compared...
- [Handling Multiple Response Types](multiple-response-types.md): There are cases where your target could respond with multiple object types. This is usually the case when the target ...
- [Using Promptfoo in n8n Workflows](n8n.md): This guide shows how to run Promptfoo evaluations from an **n8n** workflow so you can:
- [NIST AI Risk Management Framework](nist-ai-rmf.md): The NIST AI Risk Management Framework (AI RMF) is a voluntary framework developed by the U.S. National Institute of S...
- [Using the node package](node-package.md): promptfoo is available as a node package [on npm](https://www.npmjs.com/package/promptfoo):
- [Nscale](nscale.md): The Nscale provider enables you to use [Nscale's Serverless Inference API](https://nscale.com/serverless) models with...
- [Off-Topic Plugin](off-topic.md): The Off-Topic Plugin tests whether AI systems can be manipulated to go off-topic from their intended purpose by perfo...
- [Ollama](ollama.md): The `ollama` provider is compatible with [Ollama](https://github.com/jmorganca/ollama), which enables access to Llama...
- [OpenAI Agents](openai-agents.md): Test multi-turn agentic workflows built with the [@openai/agents](https://github.com/openai/openai-agents-js) SDK. Ev...
- [OpenAI ChatKit](openai-chatkit.md): Evaluate [ChatKit](https://platform.openai.com/docs/guides/chatkit) workflows from OpenAI's Agent Builder. This provi...
- [OpenAI Codex SDK](openai-codex-sdk.md): This provider makes OpenAI's Codex SDK available for evals. The Codex SDK supports code generation and manipulation w...
- [OpenAI](openai.md): To use the OpenAI API, set the `OPENAI_API_KEY` environment variable to your OpenAI API key.
- [OpenCode SDK](opencode-sdk.md): This provider integrates [OpenCode](https://opencode.ai/), an open-source AI coding agent for the terminal with suppo...
- [OpenLLM](openllm.md): To use [OpenLLM](https://github.com/bentoml/OpenLLM) with promptfoo, we take advantage of OpenLLM's support for [Open...
- [OpenRouter](openrouter.md): [OpenRouter](https://openrouter.ai/) provides a unified interface for accessing various LLM APIs, including models fr...
- [Other Encodings](other-encodings.md): The other-encodings strategy collection provides multiple text transformation methods to test model resilience agains...
- [Output Formats](outputs.md): Save and analyze your evaluation results in various formats.
- [Overreliance Plugin](overreliance.md): The Overreliance red teaming plugin helps identify vulnerabilities where an AI model might accept and act upon incorr...
- [Red Team Troubleshooting Guide](overview.md): Common issues encountered when red teaming LLM applications with promptfoo.
- [OWASP Top 10 for Agentic Applications](owasp-agentic-ai.md): The [OWASP Top 10 for Agentic Applications](https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications/) ...
- [OWASP API Security Top 10](owasp-api-top-10.md): The OWASP API Security Top 10 is a security awareness document that identifies the most critical security risks to AP...
- [OWASP LLM Top 10](owasp-llm-top-10.md): The OWASP Top 10 for Large Language Model Applications educates developers about security risks in deploying and mana...
- [Prompts, tests, and outputs](parameters.md): Configure how promptfoo evaluates your LLM applications.
- [Perplexity](perplexity.md): The [Perplexity API](https://blog.perplexity.ai/blog/introducing-pplx-api) provides chat completion models with built...
- [Pharmacy Red-Teaming Plugins](pharmacy.md): The Pharmacy Red-Teaming Plugins are a specialized suite designed for AI systems operating in pharmacy and pharmaceut...
- [Phi vs Llama: Benchmark on your own data](phi-vs-llama.md): When choosing between LLMs like Phi 3 and Llama 3.1, it's important to benchmark them on your specific use cases rath...
- [Pi Scorer](pi.md): `pi` is an alternative approach to model grading that uses a dedicated scoring model instead of the "LLM as a judge" ...
- [PII Plugin](pii.md): The PII (Personally Identifiable Information) plugin tests an AI system's ability to protect sensitive personal data....
- [Pliny prompt injections for LLMs](pliny.md): The Pliny plugin is designed to test LLM systems using a curated collection of prompts from the [L1B3RT4S repository]...
- [Red Team Plugins](plugins.md): Plugins are Promptfoo's modular system for testing a variety of risks and vulnerabilities in LLM models and LLM-power...
- [Policy Plugin](policy.md): The Policy red teaming plugin is a customizable tool designed to test whether an AI system adheres to specific polici...
- [Politics Plugin](politics.md): The Politics red teaming plugin is designed to test whether an AI system can be influenced to make political statemen...
- [Portkey AI integration](portkey.md): Portkey is an AI observability suite that includes prompt management capabilities.
- [How to Measure and Prevent LLM Hallucinations](prevent-llm-hallucinations.md): LLMs have great potential, but they are prone to generating incorrect or misleading information, a phenomenon known a...
- [Prompt Extraction Plugin](prompt-extraction.md): The Prompt Extraction red teaming plugin tests an AI system's vulnerability to attacks aimed at extracting the system...
- [Prompt Injection Strategy](prompt-injection.md): The Prompt Injection strategy tests common direct prompt injection vulnerabilities in LLMs.
- [Prompt Configuration](prompts.md): Define what you send to your LLMs - from simple strings to complex multi-turn conversations.
- [LLM Providers](providers.md): Providers in promptfoo are the interfaces to various language models and AI services. This guide will help you unders...
- [Python Assertions](python.md): The `python` assertion allows you to provide a custom Python function to validate the LLM output.
- [Quickstart](quickstart.md): Promptfoo is an [open-source](https://github.com/promptfoo/promptfoo) tool for red teaming gen AI applications.
- [Qwen vs Llama vs GPT: Run a Custom Benchmark](qwen-benchmark.md): As a product developer using LLMs, you are likely focused on a specific use case. Generic benchmarks are easily gamed...
- [Race Bias Plugin](race-bias.md): The Race Bias plugin (`bias:race`) tests whether your AI system reinforces racial stereotypes or discrimination.
- [RAG Document Exfiltration Plugin](rag-document-exfiltration.md): The RAG Document Exfiltration plugin is designed to identify vulnerabilities where an AI model might inadvertently ex...
- [RAG Poisoning](rag-poisoning.md): Promptfoo includes a RAG Poisoning utility that tests your system's resilience against adversarial attacks on the doc...
- [How to red team RAG applications](rag.md): Retrieval-Augmented Generation (RAG) is an increasingly popular LLM-based architecture for knowledge-based AI product...
- [Role-Based Access Control (RBAC) Plugin](rbac.md): The RBAC (Role-Based Access Control) red teaming plugin is designed to test an AI system's ability to maintain proper...
- [Reasoning DoS Plugin](reasoning-dos.md): Reasoning DoS (Denial of Service) is a new vulnerability introduced by reasoning models.
- [One doc tagged with "red-team"](red-team.md): Create reusable red team strategies by writing natural language instructions that guide AI through multi-turn convers...
- [Red teaming](red-teaming.md): Red team LLM systems through systematic adversarial testing to detect content policy violations, information leakage,...
- [Running Red Teams](red-teams.md): Promptfoo Enterprise allows you to configure targets, plugin collections, and scan configurations that can be shared ...
- [Configuration Reference - Complete API Documentation | Promptfoo](reference.md): - Config
- [releases](releases.md): One doc tagged with "releases"
- [Religion Plugin](religion.md): The Religion red teaming plugin is designed to test whether an AI system can be influenced to make potentially contro...
- [Remediation Reports](remediation-reports.md): Promptfoo Enterprise automatically generates remediation reports after each red team scan. These reports provide acti...
- [Remote Generation Errors](remote-generation.md): You may encounter connection issues due to corporate firewalls or security policies. Since our service generates pote...
- [Replicate](replicate.md): Replicate is an API for machine learning models. It currently hosts models like [Llama v2](https://replicate.com/repl...
- [Retry Strategy](retry.md): The retry strategy automatically incorporates previously failed test cases into your test suite, creating a regressio...
- [Risk Scoring](risk-scoring.md): Promptfoo provides a risk scoring system that quantifies the severity and likelihood of vulnerabilities in your LLM a...
- [ROT13 Encoding Strategy](rot13.md): The ROT13 Encoding strategy tests an AI system's ability to resist encoded inputs that might bypass security controls...
- [Ruby Assertions](ruby.md): The `ruby` assertion allows you to provide a custom Ruby function to validate the LLM output.
- [Amazon SageMaker AI](sagemaker.md): The `sagemaker` provider allows you to use Amazon SageMaker AI endpoints in your evals. This enables testing and eval...
- [Sandboxed Evaluations of LLM-Generated Code](sandboxed-code-evals.md): You're using LLMs to generate code snippets, functions, or even entire programs. Blindly trusting and executing this ...
- [ModelAudit Scanners](scanners.md): ModelAudit includes specialized scanners for different model formats and file types. Each scanner is designed to iden...
- [Scenarios](scenarios.md): The `scenarios` configuration lets you group a set of data along with a set of tests that should be run on that data....
- [Search-Rubric](search-rubric.md): The `search-rubric` assertion type is like `llm-rubric` but with web search capabilities. It evaluates outputs accord...
- [Select Best](select-best.md): The `select-best` assertion compares multiple outputs in the same test case and selects the one that best meets a spe...
- [Self-hosting Promptfoo](self-hosting.md): Promptfoo provides a basic Docker image that allows you to host a server that stores evals. This guide covers various...
- [Sequence Provider](sequence.md): The Sequence Provider allows you to send a series of prompts to another provider in sequence, collecting and combinin...
- [Service Accounts](service-accounts.md): Service accounts allow you to create API keys for programmatic access to Promptfoo Enterprise. These are useful for C...
- [SharePoint Integration](sharepoint.md): promptfoo allows you to import eval test cases directly from Microsoft SharePoint CSV files using certificate-based a...
- [Sharing](sharing.md): Share your eval results with others using the `share` command or the web interface.
- [Shell Injection Plugin](shell-injection.md): The Shell Injection plugin is designed to test an AI system's vulnerability to attacks that attempt to execute unauth...
- [Similarity (embeddings)](similar.md): The `similar` assertion checks if an embedding of the LLM's output is semantically similar to the expected value, usi...
- [Simulated User](simulated-user.md): The Simulated User Provider enables testing of multi-turn conversations between an AI agent and a simulated user. Thi...
- [Slack Provider](slack.md): The Slack provider enables human-in-the-loop evaluations by sending prompts to Slack channels or users and collecting...
- [snowflake](snowflake.md): [Snowflake Cortex](https://docs.snowflake.com/en/user-guide/snowflake-cortex/overview) is Snowflake's AI and ML platf...
- [Integrate Promptfoo with SonarQube](sonarqube.md): This guide demonstrates how to integrate Promptfoo's scanning results into SonarQube, allowing red team findings to a...
- [Special Token Injection for LLMs](special-token-injection.md): Special Token Injection (STI) is a technique that exploits conversation format delimiters to manipulate LLM behavior....
- [SQL Injection Plugin](sql-injection.md): The SQL Injection red teaming plugin is designed to test an AI system's vulnerability to attacks that attempt to exec...
- [Server-Side Request Forgery (SSRF) Plugin](ssrf.md): The SSRF (Server-Side Request Forgery) red teaming plugin is designed to test an AI system's vulnerability to attacks...
- [One doc tagged with "strategies"](strategies.md): Create reusable red team strategies by writing natural language instructions that guide AI through multi-turn convers...
- [System Prompt Override Plugin](system-prompt-override.md): System prompts serve as the foundation of LLM security and behavior control. They define how a model should behave, w...
- [Tags](tags.md): - [changelog](/docs/tags/changelog/) (1)
- [Managing Roles and Teams](teams.md): Promptfoo Enterprise supports a flexible role-based access control (RBAC) system that allows you to manage user acces...
- [Telemetry](telemetry.md): `promptfoo` collects basic anonymous telemetry by default. This telemetry helps us decide how to spend time on develo...
- [Test Case Configuration](test-cases.md): Define evaluation scenarios with variables, assertions, and test data.
- [Testing and Validating Guardrails](testing-guardrails.md): Guardrails are security filters that help protect your AI applications from misuse. This guide explains how to test a...
- [Testing LLM chains](testing-llm-chains.md): Prompt chaining is a common pattern used to perform more complex reasoning with LLMs. It's used by libraries like [La...
- [text-generation-webui](text-generation-webui.md): promptfoo can run evals on oobabooga's gradio based [text-generation-webui](https://github.com/oobabooga/text-generat...
- [Evaluating LLM text-to-SQL performance](text-to-sql-evaluation.md): Promptfoo is a command-line tool that allows you to test and validate text-to-SQL conversions.
- [Together AI](togetherai.md): [Together AI](https://www.together.ai/) provides access to open-source models through an API compatible with OpenAI's...
- [Tool Discovery](tool-discovery.md): The Tool Discovery plugin tests if an AI system reveals the list of tools, functions, or API calls that it has access...
- [toxic-chat](toxic-chat.md): The ToxicChat plugin tests your model's ability to handle toxic user prompts and resist jailbreaking attempts using t...
- [Tracing](tracing.md): Promptfoo supports OpenTelemetry (OTLP) tracing to help you understand the internal operations of your LLM providers ...
- [Travis CI Integration](travis-ci.md): This guide demonstrates how to set up promptfoo with Travis CI to run evaluations as part of your CI pipeline.
- [Tree-based Jailbreaks Strategy](tree.md): The Tree-based Jailbreaks strategy is an advanced technique designed to systematically explore and potentially bypass...
- [Troubleshooting](troubleshooting.md): Before troubleshooting specific issues, you can access detailed logs to help diagnose problems:
- [TrueFoundry](truefoundry.md): [TrueFoundry](https://www.truefoundry.com/ai-gateway) is an LLM gateway that provides unified access to 1000+ LLMs th...
- [UnsafeBench Plugin](unsafebench.md): The UnsafeBench plugin tests multi-modal models with potentially unsafe images from the [UnsafeBench dataset](https:/...
- [Unverifiable Claims Plugin](unverifiable-claims.md): The Unverifiable Claims plugin tests whether AI systems make claims about information that cannot be verified or meas...
- [updates](updates.md): One doc tagged with "updates"
- [usage](usage.md): Usage
- [Google Vertex](vertex.md): The `vertex` provider enables integration with Google's official Vertex AI platform, which provides access to foundat...
- [Video Jailbreaking](video.md): The Video strategy converts prompt text into a video with text overlay and then encodes that video as a base64 string...
- [VLGuard Plugin](vlguard.md): The VLGuard plugin tests multi-modal models with potentially unsafe images from the [VLGuard dataset](https://hugging...
- [vllm](vllm.md): vllm's [OpenAI-compatible server](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-se...
- [Voyage AI](voyage.md): [Voyage AI](https://www.voyageai.com/) is Anthropic's [recommended](https://docs.anthropic.com/en/docs/embeddings) em...
- [VS Code Extension](vscode-extension.md): The Promptfoo Security Scanner for VS Code detects LLM security vulnerabilities directly in your editor. It finds pro...
- [WatsonX](watsonx.md): [IBM WatsonX](https://www.ibm.com/watsonx) offers a range of enterprise-grade foundation models optimized for various...
- [Using the web viewer](web-ui.md): After [running an eval](/docs/getting-started/), view results in your browser:
- [Generic Webhook](webhook.md): The webhook provider can be useful for triggering more complex flows or prompt chains end to end in your app.
- [Webhook Integration](webhooks.md): Promptfoo Enterprise provides webhooks to notify external systems when security vulnerabilities (issues) are created ...
- [WebSockets](websocket.md): The WebSocket provider allows you to connect to a WebSocket endpoint for inference. This is useful for real-time, bid...
- [Wordplay Plugin](wordplay.md): The Wordplay red teaming plugin tests whether an AI system can be tricked into generating profanity or offensive lang...
- [Write for Promptfoo](write-for-promptfoo.md): If you enjoy Promptfoo, want to help others learn it, and would like to build your reputation as a writer, this is fo...
- [xAI (Grok) Provider](xai.md): The `xai` provider supports xAI's Grok models through an API interface compatible with OpenAI's format. The provider ...
- [XSTest Homonym Dataset](xstest.md): The XSTest plugin tests how well LLMs handle ambiguous words (homonyms) that can have both harmful and benign interpr...