| Technique / Lever | Token Savings | Cost Impact | Risk Level | Maturity | Best Use Case |
|---|---|---|---|---|---|
| Chain of Draft (CoD) | ~92% reasoning tokens | High (lower latency) | Low | Research-validated | Multi-step reasoning tasks |
| LLMLingua compression | Up to 20x | Very high | Medium | Production-ready (Microsoft) | High-volume API pipelines |
| LongLLMLingua | 4x with quality gain | High + quality boost | Medium | Research-validated | Long-context RAG |
| sinc-LLM decomposition | ~97% (reported 80k→2.5k) | Very high | Medium-High | Early framework | Enterprise prompt standardization |
| Prompt caching | Up to 90% input cost | Very high | Low | Production (all providers) | Repetitive system prompts |
| System prompt offloading | Variable (near-zero user prompt) | High | Low | Mature | Chatbot/agent architectures |
| Abbreviation / symbol substitution | 30-60% | Medium | Low-Medium | Practitioner-native | All prompt types |
| SecurityLingua compression | 100x vs guardrails | Defense ROI | Dual-use | Research (arXiv) | Safety pipeline integration |
| Adversarial compression | N/A (offensive) | N/A | Critical | Emerging | Red team / security testing |
Prompt golf is the practice of optimizing LLM prompts for maximum brevity and efficiency — achieving a desired model output using the fewest possible tokens or characters. The name is a direct analogy to code golf, the recreational programming challenge where participants solve problems in the fewest bytes of source code. Where code golf optimizes for program length, prompt golf optimizes for prompt length while preserving output fidelity.
The discipline sits at the intersection of prompt engineering, token economics, and information theory. It differs from general prompt engineering in a critical way: prompt engineering seeks the best output quality, while prompt golf seeks the shortest prompt that still produces acceptable output. This inversion creates unique constraints and techniques.
Prompt golf employs several scoring metrics, each with different implications:
| Metric | Description | Advantage | Limitation |
|---|---|---|---|
| Token count | Number of tokens after tokenizer processing | Directly maps to API cost and latency | Model-specific tokenizers produce different counts |
| Character count | Raw character length of the prompt | Simple, universal, portable | Poor proxy for actual API cost; character-to-token mapping is non-linear |
| Semantic density | Information content per token | Captures efficiency of expression | Hard to quantify; requires semantic similarity scoring |
| Prompt count | Number of discrete prompts/messages sent | Measures conversational efficiency | Ignores message length; can be gamed |
| Composite score | Weighted combination of above | Balanced evaluation | Requires calibration per use case |
A critical subtlety: character deletion can increase token count. As documented by promptgolf.app, removing a letter from a common word like "hello" can split a single token into two, increasing cost. True prompt golf requires understanding the tokenizer, not just trimming characters.
| Platform | Format | Scoring | Model | Key Feature |
|---|---|---|---|---|
| promptgolf.app | Browser game | Token count (shortest wins) | Llama-3.1-70B-Instruct | Leaderboard, entropy-aware challenges |
| kaliprompt.com | Browser game (part of KaliPrompt suite) | Prompt count per hole | Validator-based | "Holes" with pass/fail validation |
| GPTWorld (GitHub srush/GPTWorld) | Grid-world puzzle | Prompt length | GPT-4 (originally) | Educational grounding environment |
promptgolf.app is the most prominent dedicated platform. It runs Meta's Llama-3.1-70B-Instruct without a system prompt, at temperature 0.7, with responses capped at 128 output tokens and prompts limited to 999 characters. Challenges ("holes") require eliciting a specific target output. Due to LLM non-determinism, only a single successful attempt is required for a prompt to count — adding a probabilistic element to the strategy.
kaliprompt.com is part of the KaliPrompt collection of AI browser games, daily puzzles, and retro terminal mini-games. Its prompt golf variant uses validator-based holes where prompts must pass automated checks, scored by prompt count rather than token count specifically.
kaliprompt.com is part of the KaliPrompt collection of AI browser games, daily puzzles, and retro terminal mini-games.
GPTWorld (srush/GPTWorld) takes a different approach: a hexagonal grid-world environment where the player instructs an LLM to generate code that navigates a game. The objective is constructing a prompt that gets GPT-4 to solve a complex spatial puzzle — grounding abstract prompting skill in a concrete, visual challenge. It is explicitly subtitled "Prompt Golf."
Prompt golf practitioners employ a toolkit of specific techniques to reduce token count without losing semantic intent:
| Technique | Example | Approx. Savings | Notes |
|---|---|---|---|
| Abbreviation | "temperature" → "temp", "maximum" → "max" | 30-50% | Must stay within model's recognized vocabulary |
| Symbol substitution | "and" → "&", "number" → "#", "at" → "@" | 10-20% | Risk of tokenizer-inefficient symbols in some models |
| Implicit context | Remove obvious domain terms the model can infer | 15-40% | Works best with domain-expert models |
| Constraint stacking | Combine multiple rules: "no profanity, no slang, formal tone" → "formal" | 20-30% | Higher risk of underspecification |
| Instruction fusion | Merge two instructions into one phrase | 10-25% | Dual-meaning language, advanced skill |
| Output format shorthand | "Return as JSON" → "→JSON" | 5-15% | Model-dependent; some require explicit format specs |
| Article/preposition removal | Drop "the", "a", "in", "of" where unambiguous | 10-20% | English prepositions are often single tokens, so savings are modest |
Paper: Chain of Draft: Thinking Faster by Writing Less (arXiv 2502.18600, Xu & Xie, Zoom Communications, February 2025)
Chain of Draft is the most significant prompt-golf-adjacent technique to emerge from academic research. The core insight: humans solving problems don't write out every step in full prose — they draft concise intermediate thoughts capturing only essential information. CoD applies this to LLM reasoning.
Chain of Draft is the most significant prompt-golf-adjacent technique to emerge from academic research.
Results:
| Task | CoT Tokens | CoD Tokens | Token Reduction | Accuracy Change |
|---|---|---|---|---|
| GSM8K (math) | ~200 | ~15 | 92.4% | Maintained |
| BIG-Bench Hard | ~180 | ~14 | 92.2% | Maintained |
| Date understanding | ~150 | ~12 | 92.0% | Slight improvement |
| Sports understanding | ~120 | ~10 | 91.7% | Slight improvement |
CoD uses as little as 7.6% of the tokens compared to Chain-of-Thought while matching or surpassing accuracy. The technique works by instructing the model to produce minimal intermediate reasoning — a few words per step rather than full sentences.
Prompt implementation:
CoT prompt: "Think step by step and explain your reasoning."
CoD prompt: "Think step by step, but only keep a minimum draft for each thinking step, with around 5 words."
Source: sinc-LLM (sincllm.com)
sinc-LLM applies the Nyquist-Shannon sampling theorem to prompt engineering, treating a complete task specification as a "signal" that must be "sampled" at sufficient rate to be faithfully reconstructed. The framework decomposes any prompt into 6 Nyquist-compliant specification bands:
| Band | Index | Content | Typical Token Allocation |
|---|---|---|---|
| PERSONA | n=0 | Who should answer; role, expertise, perspective | ~5% |
| CONTEXT | n=1 | Situational facts, environment, background | ~20% |
| DATA | n=2 | Specific inputs, numbers, references | ~20% |
| CONSTRAINTS | n=3 | Rules, boundaries, exclusions | ~25% |
| FORMAT | n=4 | Output structure, length, style | ~25% |
| TASK | n=5 | The actual objective | ~5% |
The key finding: CONSTRAINTS and FORMAT drive 42.7% of output quality but are typically under-specified. A prompt covering fewer than 6 bands is "under-sampled" and will alias — manifesting as hallucination or output drift. The reconstruction formula parallels signal processing: $x(t) = \sum x(nT) \cdot \text{sinc}((t - nT)/T)$.
Reported compression: prompts of 80k tokens reduced to 2.5k tokens (~97% reduction) by restructuring into the 6-band framework while maintaining specification completeness. However, this claim requires independent validation.
Cost optimization tip from sinc-LLM: Use a cheap model (e.g., Claude Haiku) for the decomposition/scatter step (converting raw prompt to sinc JSON), and a capable model (e.g., Claude Sonnet) for execution — since Haiku costs 75% less than Sonnet.
Different model families exhibit distinct behaviors when presented with compressed, minimal prompts. Understanding these differences is essential for both competitive prompt golf and production optimization.
| Model Family | Strength with Short Prompts | Weakness | Free Leverage |
|---|---|---|---|
| GPT-4o / GPT-5 | Strong instruction following; infers intent from fragments | Can override implicit constraints with learned priors | Instruction obedience — minimal instructions suffice if unambiguous |
| Claude (Anthropic) | Exceptional system prompt adherence; respects persona framing | Over-cautious with ambiguous short prompts | System prompt weight — pre-load everything, send near-zero user prompt |
| Gemini (Google) | Strong multilingual compression; handles symbolic shorthand | May misinterpret extreme abbreviation in English | Multilingual compression — non-English tokens can be shorter for same semantics |
| Llama (Meta, open-source) | Flexible; tokenizer well-documented | Less robust to non-standard syntax | Tokenizer predictability — tiktoken-compatible, easy to count exactly |
| Mistral | Good at concise reasoning tasks | Narrower context window limits implicit inference | European language efficiency |
Claude (Anthropic): Claude's weighting of system prompts relative to user messages is the strongest among major providers. A well-crafted system prompt can offload 90%+ of instruction tokens, leaving the user prompt as a single keyword or short phrase. This makes Claude the ideal platform for system-prompt-based prompt golf. Additionally, Claude adheres closely to format specifications in the system prompt, reducing the need for per-message format reminders.
GPT-4o/GPT-5 (OpenAI): OpenAI models are the most instruction-following on a per-token basis. A single clearly-worded instruction often suffices where other models need reinforcement. The trade-off: GPT models may "fill in" unstated preferences from training data, which can be either helpful (inferring intent) or harmful (contradicting implicit wishes). Few-shot examples in GPT need fewer demonstrations — often 1-2 shots match the quality of 3-5 shots in other models.
GPT-4o/GPT-5 (OpenAI): OpenAI models are the most instruction-following on a per-token basis.
Gemini (Google): Gemini's multilingual capability provides a unique compression vector: some concepts express in fewer tokens in other languages. For example, German compound words or Chinese characters can encode complex meanings in single tokens. This cross-lingual compression can reduce token count by 30-60% for certain semantic targets.
| Provider | Caching Mechanism | Minimum Prefix | Cache Write Cost | Cache Read Cost (vs. base) | Key Structural Requirement |
|---|---|---|---|---|---|
| Anthropic | Developer-controlled (explicit cache points) | 1024 tokens | +25% above base | 90% discount | Static system prompt first; mark cache control breakpoints |
| OpenAI | Automatic (no config needed) | 1024 tokens | Base rate (no surcharge) | 50% discount | Exact prefix match; system message must stay identical |
| Context Caching API | Varies by model | Token storage cost | ~75% discount | Explicit cached content objects; TTL management |
Structural pattern for all providers: Place the most stable content at the beginning of the prompt (system instructions, persona, format spec). Never modify the prefix between requests. Append variable content at the end. This maximizes cache hit rates regardless of provider.
The most powerful prompt golf technique for production systems is system prompt offloading: moving all stable instructions into the system prompt or cached context, leaving the user prompt as a minimal trigger.
Architecture:
System Prompt (cached, 2000+ tokens):
- Persona definition
- Output format specification
- Domain constraints
- Behavior rules
- Few-shot examples
User Prompt (variable, 5-20 tokens):
- Task-specific input only
- e.g., "Summarize: [article]" or just "[article]"
This pattern is especially powerful when combined with prompt caching, as the system prompt is computed once and reused across thousands of requests.
This pattern is especially powerful when combined with prompt caching, as the system prompt is computed once and reused across thousands of requests.
Prompt caching — reusing KV-cache tensors from identical prompt prefixes — is now supported by all three major providers, each with different implementations:
Developer-controlled with explicit cache breakpoints. Cache writes cost +25% above base input price, but cache reads cost 90% less than base. Requires minimum 1024 tokens for caching. Best for applications with long, stable system prompts.
Automatic caching on GPT-4o and newer. No configuration required. Cache hits provide 50% discount on input tokens. Requires minimum 1024 tokens. Exact prefix match required — any change to earlier messages invalidates the cache.
Explicit Context Caching API. Cached content objects have a storage cost and TTL. Cache hits provide ~75% discount. Most flexible for multi-turn conversations.
Combined with prompt golf techniques: A production pipeline using system prompt offloading + prompt caching + token compression can achieve 90%+ input cost reduction compared to naive prompting.
Paper: SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression (arXiv 2506.12707, Li & Ahn, Microsoft Research, 2025)
Paper: SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression (arXiv 2506.12707, Li & Ahn, Microsoft Research, 2025)
SecurityLingua bridges prompt compression with security. It trains a prompt compressor to discern the true intention of input prompts, with particular focus on detecting malicious intent behind adversarial/jailbreak prompts.
How it works:
1. A security-aware compressor processes the user's prompt
2. It extracts and highlights suspicious instructions
3. The compressed representation is presented to the target LLM in a way that stimulates the model's built-in safety guardrails
4. The original prompt remains intact — users see no disruption
Key results:
Prompt golf's creative frontier lies in polysemy exploitation — using words and phrases that carry multiple simultaneous meanings to compress several instructions into one expression.
| Technique | Example | Instructions Compressed |
|---|---|---|
| Double meaning | "Draft" (noun: document; verb: create) | Output type + action |
| Ambiguous scope | "Fix" (repair, prepare, secure) | Intent + method |
| Logical compression | "If X → not Y" (3 tokens) vs. "In the case where X is true, you should ensure that Y does not happen" (17 tokens) | Conditional + consequence |
| Implicit negation | "Formal only" (2 tokens) vs. "Do not use profanity, slang, or informal language" (8 tokens) | Multiple exclusions via scope |
| Domain metonymy | "SOAP" (1 token) in API context implies protocol spec + format + rules | Entire specification by association |
The redteams.ai community hosts a prompt golf challenge specifically focused on achieving jailbreaks with the shortest possible prompts. Like code golf, the score is based on character count of the successful jailbreak. This represents the adversarial extreme of prompt golf — where optimization for brevity intersects with optimization for safety bypass.
Challenge parameters:
Intermediate
4-8 hours
Understanding of jailbreak techniques
Fewest characters in a successful jailbreak prompt
Paper: Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs (arXiv 2505.04806, Pathade, 2025)
This study cataloged and categorized over 1,400 adversarial prompts, analyzing their success rates against GPT-4, Claude 2, Mistral 7B, and Vicuna. Key findings:
Paper: When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models (arXiv 2603.19247, 2026)
This paper identifies the critical boundary where legitimate prompt optimization crosses into adversarial territory. The authors repurposed black-box prompt optimization techniques — originally designed to improve performance on benign tasks — to systematically search for safety failures.
Core insight: The same automated prompt optimization tools used for prompt golf (iterative refinement, token reduction, constraint reformulation) can be trivially redirected to find jailbreaks. The optimization landscape for brevity and for safety bypass overlap substantially, because both involve finding minimal prompts that produce specific outputs — the difference is only in the target output.
Core insight: The same automated prompt optimization tools used for prompt golf (iterative refinement, token reduction, constraint reformulation) can be trivially redirected to find jailbreaks.
Implication for the prompt golf community: Competitive prompt golf platforms must consider safety guardrails. A "shortest prompt that produces output X" challenge is structurally identical to a jailbreak challenge when X is harmful content.
LLM API prices declined approximately 80% between early 2025 and early 2026, driven by model efficiency gains, competitive pressure, and architectural innovations (MoE, distillation, quantization). Current pricing for key models:
| Model | Input ($/MTok) | Output ($/MTok) | Cached Input ($/MTok) | Context Window |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | ~$1.25 | 128K |
| GPT-5 | $10.00 | $30.00 | ~$5.00 | 256K+ |
| o4 Mini | $0.55 | $2.20 | ~$0.28 | 128K |
| Claude Sonnet 4 | $3.00 | $15.00 | ~$0.30 (cache read) | 200K |
| Claude Haiku 3.5 | $0.80 | $4.00 | ~$0.08 (cache read) | 200K |
| Gemini 2.5 Flash | $0.15 | $0.60 | ~$0.04 | 1M |
| Gemini 3 Flash | $0.50 | $3.00 | ~$0.13 | 1M+ |
| DeepSeek V3 | $0.27 | $1.10 | N/A | 128K |
Assumptions: ~500 input tokens/conversation (after caching), ~200 output tokens, 30-day month.
| Model | Monthly Input Cost | Monthly Output Cost | Total Monthly Cost |
|---|---|---|---|
| Gemini 2.5 Flash | $2.25 | $3.60 | ~$12 |
| o4 Mini | $8.25 | $13.20 | ~$21 |
| Claude Haiku 3.5 | $12.00 | $24.00 | ~$36 |
| GPT-4o | $37.50 | $60.00 | ~$98 |
| Claude Sonnet 4 | $45.00 | $90.00 | ~$135 |
| GPT-5 | $150.00 | $180.00 | ~$1,050 |
Prompt golf compression at scale turns even the most expensive models into viable options for high-volume applications.
Prompt golf compression at scale turns even the most expensive models into viable options for high-volume applications.
Papers: LLMLingua (EMNLP 2023), LLMLingua-2 (ACL 2024), Microsoft Research
LLMLingua uses a small language model to identify and remove non-essential tokens from prompts, achieving:
LongLLMLingua extends this to long-context scenarios:
| Optimization Layer | Typical Savings | Implementation Effort |
|---|---|---|
| Prompt caching (static prefixes) | 50-90% input cost | Low |
| Token-level compression (LLMLingua) | 80-95% input tokens | Medium |
| Chain of Draft (reasoning tasks) | ~92% reasoning tokens | Low |
| System prompt offloading | 60-80% user prompt tokens | Low |
| Abbreviation + constraint stacking | 20-40% total tokens | Low |
| Full pipeline (all combined) | 40-80% total cost | Medium-High |
The fundamental risk of prompt golf: brevity optimization can destroy the very intent the prompt was meant to convey. The compression-performance tradeoff is not monotonic — beyond a critical threshold, each token removed degrades output quality at an accelerating rate.
Common failure patterns:
Removing disambiguating context causes the model to misinterpret intent. "Summarize formally" might become "Summarize" — model outputs informal summary.
Stacked constraints like "no profanity, formal tone, third person" compressed to "formal" — model uses profanity in formal register.
Implicit context that seemed obvious to the prompter is not obvious to the model. Domain-specific abbreviations are misinterpreted.
One ambiguous token causes downstream drift, amplified by autoregressive generation.
Paper: An Empirical Study on Prompt Compression for Large Language Models (arXiv 2505.00019, Zhang & Li, ICLR 2025 Building Trust Workshop)
This systematic study of six prompt compression methods found that all compression methods increased hallucinations, with the primary cause being information loss.
This systematic study of six prompt compression methods found that all compression methods increased hallucinations, with the primary cause being information loss.
Key findings:
Compressed prompts are inherently more brittle than verbose ones:
| Risk Factor | Verbose Prompt | Compressed Prompt |
|---|---|---|
| Model update tolerance | High (redundant signals survive) | Low (single signal removed = failure) |
| Cross-model portability | High (shared vocabulary) | Low (model-specific tokenizer behavior) |
| Temperature sensitivity | Low (robust to sampling variation) | High (marginal signals may not fire) |
| Context window pressure | Low (unaffected by other content) | Medium (competing signals can overpower) |
A prompt that works perfectly on GPT-4o may fail on GPT-5 if the tokenizer changes or instruction-following weights shift. Competitive prompt golf entries often break within weeks of model updates.
SecurityLingua demonstrates that prompt compression can both reveal and mask malicious intent:
The curve exhibits three regimes:
The practical implication: prompt golf techniques should be applied to the "safe" and "marginal" zones in production, with the "extreme" zone reserved for competitive play or tasks where quality floors are low.
Prompt golf is rapidly developing an organized competitive ecosystem, blending elements of code golf, competitive programming, and security CTFs. The gamification of prompt engineering serves both entertainment and serious purposes: it trains intuition for token efficiency, surfaces novel compression techniques, and stress-tests model instruction-following at the margin.
| Platform | URL | Format | Scoring | Validator | Community |
|---|---|---|---|---|---|
| promptgolf.app | promptgolf.app | Browser game | Token count | Target output match | Reddit (r/PromptEngineering) |
| KaliPrompt | kaliprompt.com | Browser game + daily puzzles | Prompt count per hole | Automated pass/fail | KaliPrompt community |
| GPTWorld | github.com/srush/GPTWorld | Jupyter notebook | Prompt length | Grid-world task completion | Hugging Face Spaces |
| redteams.ai | redteams.ai | Community challenge | Character count | Successful jailbreak | Red team community |
promptgolf.app is the most established competitive platform. Key mechanics:
Each challenge provides a target output that the LLM must produce
Llama-3.1-70B-Instruct (no system prompt, temperature 0.7)
Max 999 characters per prompt, 128 output tokens
Fewest tokens wins; leaderboard per hole
Non-determinism means players can gamble — a slightly ambiguous prompt might work on lucky sampling, accepting lower reliability for higher score
Simply deleting characters is counterproductive if it creates high-entropy token splits. Skilled players count tokens, not characters
KaliPrompt takes a different approach with validator-based holes:
Srush's GPTWorld is designed as an educational tool rather than a pure competition:
Prompt golf has spread beyond dedicated platforms into existing communities:
Traditional code golf community where prompt golf challenges occasionally appear; strong cultural alignment
The original announcement of promptgolf.app generated significant engagement; ongoing discussion of techniques
Prompt golf framing applied to algorithm problems — "get the solution in the fewest tokens"
The world's largest AI red-teaming hackathon (by LearnPrompting.org) incorporates prompt golf elements in competitive tracks
Organizations are adopting the prompt golf format for internal hackathons:
The corporate adoption signals a shift: prompt golf is not just a game but a training methodology that develops high-value skills in token efficiency and model instruction.
The corporate adoption signals a shift: prompt golf is not just a game but a training methodology that develops high-value skills in token efficiency and model instruction.
Competitive prompt golf has produced several innovations that feed back into production prompt engineering: