AGENT ZERO

Prompt Golf: A Comprehensive Research Brief

Opportunity Matrix

Technique / Lever	Token Savings	Cost Impact	Risk Level	Maturity	Best Use Case
Chain of Draft (CoD)	~92% reasoning tokens	High (lower latency)	Low	Research-validated	Multi-step reasoning tasks
LLMLingua compression	Up to 20x	Very high	Medium	Production-ready (Microsoft)	High-volume API pipelines
LongLLMLingua	4x with quality gain	High + quality boost	Medium	Research-validated	Long-context RAG
sinc-LLM decomposition	~97% (reported 80k→2.5k)	Very high	Medium-High	Early framework	Enterprise prompt standardization
Prompt caching	Up to 90% input cost	Very high	Low	Production (all providers)	Repetitive system prompts
System prompt offloading	Variable (near-zero user prompt)	High	Low	Mature	Chatbot/agent architectures
Abbreviation / symbol substitution	30-60%	Medium	Low-Medium	Practitioner-native	All prompt types
SecurityLingua compression	100x vs guardrails	Defense ROI	Dual-use	Research (arXiv)	Safety pipeline integration
Adversarial compression	N/A (offensive)	N/A	Critical	Emerging	Red team / security testing

[Figure: Token Savings by Compression Technique]

A. Definition & Core Mechanics

What Is Prompt Golf?

Prompt golf is the practice of optimizing LLM prompts for maximum brevity and efficiency — achieving a desired model output using the fewest possible tokens or characters. The name is a direct analogy to code golf, the recreational programming challenge where participants solve problems in the fewest bytes of source code. Where code golf optimizes for program length, prompt golf optimizes for prompt length while preserving output fidelity.

The discipline sits at the intersection of prompt engineering, token economics, and information theory. It differs from general prompt engineering in a critical way: prompt engineering seeks the best output quality, while prompt golf seeks the shortest prompt that still produces acceptable output. This inversion creates unique constraints and techniques.

Scoring Heuristics

Prompt golf employs several scoring metrics, each with different implications:

Metric	Description	Advantage	Limitation
Token count	Number of tokens after tokenizer processing	Directly maps to API cost and latency	Model-specific tokenizers produce different counts
Character count	Raw character length of the prompt	Simple, universal, portable	Poor proxy for actual API cost; character-to-token mapping is non-linear
Semantic density	Information content per token	Captures efficiency of expression	Hard to quantify; requires semantic similarity scoring
Prompt count	Number of discrete prompts/messages sent	Measures conversational efficiency	Ignores message length; can be gamed
Composite score	Weighted combination of above	Balanced evaluation	Requires calibration per use case

A critical subtlety: character deletion can increase token count. As documented by promptgolf.app, removing a letter from a common word like "hello" can split a single token into two, increasing cost. True prompt golf requires understanding the tokenizer, not just trimming characters.
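
The effect can be reproduced with a toy tokenizer. The sketch below is a minimal illustration only: the vocabulary is hypothetical and the greedy longest-match rule is a simplification, so it does not reproduce the tokenizer of Llama or any other real model. It shows only why a shorter string can cost more tokens.

```python
# Toy vocabulary: hypothetical, NOT the vocabulary of any real model.
TOY_VOCAB = {"hello", "hel", "he", "lo", "h", "e", "l", "o"}


def toy_tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Greedy longest-match segmentation over a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a vocab entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


print(toy_tokenize("hello"))  # ['hello']    -> 1 token for 5 characters
print(toy_tokenize("helo"))   # ['hel', 'o'] -> 2 tokens for 4 characters
```

For real scoring, the count should come from the target model's own tokenizer rather than from a stand-in like this one.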

Prompt Golf Platforms

Platform	Format	Scoring	Model	Key Feature
promptgolf.app	Browser game	Token count (shortest wins)	Llama-3.1-70B-Instruct	Leaderboard, entropy-aware challenges
kaliprompt.com	Browser game (part of KaliPrompt suite)	Prompt count per hole	Validator-based	"Holes" with pass/fail validation
GPTWorld (GitHub srush/GPTWorld)	Grid-world puzzle	Prompt length	GPT-4 (originally)	Educational grounding environment

promptgolf.app is the most prominent dedicated platform. It runs Meta's Llama-3.1-70B-Instruct without a system prompt, at temperature 0.7, with responses capped at 128 output tokens and prompts limited to 999 characters. Challenges ("holes") require eliciting a specific target output. Due to LLM non-determinism, only a single successful attempt is required for a prompt to count — adding a probabilistic element to the strategy.

kaliprompt.com is part of the KaliPrompt collection of AI browser games, daily puzzles, and retro terminal mini-games. Its prompt golf variant uses validator-based holes where prompts must pass automated checks, and it is scored by prompt count rather than token count.

GPTWorld (srush/GPTWorld) takes a different approach: a hexagonal grid-world environment where the player instructs an LLM to generate code that navigates a game. The objective is constructing a prompt that gets GPT-4 to solve a complex spatial puzzle — grounding abstract prompting skill in a concrete, visual challenge. It is explicitly subtitled "Prompt Golf."

[Figure: Chain of Draft vs Chain of Thought: Token Comparison]

B. Token-Economy Techniques

Practical Token-Stripping Tricks

Prompt golf practitioners employ a toolkit of specific techniques to reduce token count without losing semantic intent:

Technique	Example	Approx. Savings	Notes
Abbreviation	"temperature" → "temp", "maximum" → "max"	30-50%	Must stay within model's recognized vocabulary
Symbol substitution	"and" → "&", "number" → "#", "at" → "@"	10-20%	Risk of tokenizer-inefficient symbols in some models
Implicit context	Remove obvious domain terms the model can infer	15-40%	Works best with domain-expert models
Constraint stacking	Combine multiple rules: "no profanity, no slang, formal tone" → "formal"	20-30%	Higher risk of underspecification
Instruction fusion	Merge two instructions into one phrase	10-25%	Dual-meaning language, advanced skill
Output format shorthand	"Return as JSON" → "→JSON"	5-15%	Model-dependent; some require explicit format specs
Article/preposition removal	Drop "the", "a", "in", "of" where unambiguous	10-20%	English prepositions are often single tokens, so savings are modest
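
A minimal sketch of the first two rows (abbreviation and symbol substitution), assuming a small hand-written substitution table. The table entries come from the examples above; the function name `strip_tokens` is hypothetical. Matching on whole words avoids corrupting longer words that merely contain a target string.

```python
import re

# Illustrative substitution table built from the examples in the table above.
SUBSTITUTIONS = {
    "temperature": "temp",
    "maximum": "max",
    "and": "&",
    "number": "#",
}


def strip_tokens(prompt: str, table: dict[str, str] = SUBSTITUTIONS) -> str:
    """Replace whole words only, so 'android' is not turned into '&roid'."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], prompt)


before = "Set the maximum temperature and number of retries"
after = strip_tokens(before)
print(after)                    # Set the max temp & # of retries
print(len(before), len(after))  # character counts only, not token counts
```

Character savings are only a proxy here. Whether a substitution actually saves tokens depends on the target tokenizer, so each entry in the table should be checked against real token counts before it is adopted.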

Chain of Draft (CoD)

Paper: Chain of Draft: Thinking Faster by Writing Less (arXiv 2502.18600, Xu & Xie, Zoom Communications, February 2025)

Chain of Draft is the most significant prompt-golf-adjacent technique to emerge from academic research. The core insight: humans solving problems don't write out every step in full prose — they draft concise intermediate thoughts capturing only essential information. CoD applies this to LLM reasoning.

Results:

Task	CoT Tokens	CoD Tokens	Token Reduction	Accuracy Change
GSM8K (math)	~200	~15	92.4%	Maintained
BIG-Bench Hard	~180	~14	92.2%	Maintained
Date understanding	~150	~12	92.0%	Slight improvement
Sports understanding	~120	~10	91.7%	Slight improvement

CoD uses as little as 7.6% of the tokens compared to Chain-of-Thought while matching or surpassing accuracy. The technique works by instructing the model to produce minimal intermediate reasoning — a few words per step rather than full sentences.

Prompt implementation:

CoT prompt: "Think step by step and explain your reasoning."
CoD prompt: "Think step by step, but only keep a minimum draft for each thinking step, with around 5 words."

In the results above, this one-line instruction change cuts reasoning tokens by roughly 92% while accuracy is reported as maintained or slightly improved, which makes it one of the highest-ROI prompt modifications documented in the literature.
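
The reduction percentages follow directly from the token counts. The sketch below recomputes them from the approximate counts in the results table; because those counts are rounded, the GSM8K figure comes out at 92.5% rather than the 92.4% listed.

```python
def token_reduction(cot_tokens: float, cod_tokens: float) -> float:
    """Percentage of reasoning tokens saved by CoD relative to CoT."""
    return 100.0 * (1.0 - cod_tokens / cot_tokens)


# Approximate per-task token counts (CoT, CoD) from the results table above.
results = {
    "GSM8K": (200, 15),
    "BIG-Bench Hard": (180, 14),
    "Date understanding": (150, 12),
    "Sports understanding": (120, 10),
}

for task, (cot, cod) in results.items():
    print(f"{task}: {token_reduction(cot, cod):.1f}% fewer reasoning tokens")
```

The "7.6% of the tokens" figure is the same quantity viewed from the other side: a 92.4% reduction leaves 7.6% of the original reasoning tokens.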

Signal-Theoretic Decomposition (sinc-LLM)

Source: sinc-LLM (sincllm.com)

sinc-LLM applies the Nyquist-Shannon sampling theorem to prompt engineering, treating a complete task specification as a "signal" that must be "sampled" at sufficient rate to be faithfully reconstructed. The framework decomposes any prompt into 6 Nyquist-compliant specification bands:

Band	Index	Content	Typical Token Allocation
PERSONA	n=0	Who should answer; role, expertise, perspective	~5%
CONTEXT	n=1	Situational facts, environment, background	~20%
DATA	n=2	Specific inputs, numbers, references	~20%
CONSTRAINTS	n=3	Rules, boundaries, exclusions	~25%
FORMAT	n=4	Output structure, length, style	~25%
TASK	n=5	The actual objective	~5%

The key finding: CONSTRAINTS and FORMAT drive 42.7% of output quality but are typically under-specified. A prompt covering fewer than 6 bands is "under-sampled" and will alias — manifesting as hallucination or output drift. The reconstruction formula parallels signal processing: $x(t) = \sum x(nT) \cdot \text{sinc}((t - nT)/T)$.

Reported compression: prompts of 80k tokens reduced to 2.5k tokens (~97% reduction) by restructuring into the 6-band framework while maintaining specification completeness. However, this claim requires independent validation.
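
A minimal sketch of how the six-band check could be automated. The band names and typical allocations come from the table above; the dictionary layout, the function names, and the example specification are hypothetical and are not taken from sinc-LLM's actual JSON format.

```python
BANDS = ["PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK"]

# Typical allocation from the table above, as fractions of the token budget.
ALLOCATION = {
    "PERSONA": 0.05, "CONTEXT": 0.20, "DATA": 0.20,
    "CONSTRAINTS": 0.25, "FORMAT": 0.25, "TASK": 0.05,
}


def missing_bands(spec: dict[str, str]) -> list[str]:
    """Return bands that are absent or empty; a non-empty result means 'under-sampled'."""
    return [band for band in BANDS if not spec.get(band, "").strip()]


def band_budgets(total_tokens: int) -> dict[str, int]:
    """Split a token budget across the six bands using the typical allocation."""
    return {band: round(total_tokens * ALLOCATION[band]) for band in BANDS}


# Hypothetical specification that covers only three of the six bands.
spec = {
    "PERSONA": "Senior tax accountant",
    "CONTEXT": "Small business, annual filing",
    "TASK": "Summarize the filing",
}
print(missing_bands(spec))  # ['DATA', 'CONSTRAINTS', 'FORMAT']
print(band_budgets(2500))
```

For the reported 2.5k-token target, the typical allocation gives 625 tokens each to CONSTRAINTS and FORMAT and 125 each to PERSONA and TASK.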

Cost optimization tip from sinc-LLM: Use a cheap model (e.g., Claude Haiku) for the decomposition/scatter step (converting raw prompt to sinc JSON), and a capable model (e.g., Claude Sonnet) for execution, since Haiku costs roughly 73% less than Sonnet at the prices listed in Section F ($0.80 vs. $3.00 per MTok input, $4.00 vs. $15.00 per MTok output).

C. Model-Specific Opportunities

How Model Families Respond to Ultra-Short Prompts

Different model families exhibit distinct behaviors when presented with compressed, minimal prompts. Understanding these differences is essential for both competitive prompt golf and production optimization.

Model Family	Strength with Short Prompts	Weakness	Free Leverage
GPT-4o / GPT-5	Strong instruction following; infers intent from fragments	Can override implicit constraints with learned priors	Instruction obedience — minimal instructions suffice if unambiguous
Claude (Anthropic)	Exceptional system prompt adherence; respects persona framing	Over-cautious with ambiguous short prompts	System prompt weight — pre-load everything, send near-zero user prompt
Gemini (Google)	Strong multilingual compression; handles symbolic shorthand	May misinterpret extreme abbreviation in English	Multilingual compression — non-English tokens can be shorter for same semantics
Llama (Meta, open-source)	Flexible; tokenizer well-documented	Less robust to non-standard syntax	Tokenizer predictability — tiktoken-compatible, easy to count exactly
Mistral	Good at concise reasoning tasks	Narrower context window limits implicit inference	European language efficiency

Model-Specific Leverage Details

Claude (Anthropic): Claude's weighting of system prompts relative to user messages is the strongest among major providers. A well-crafted system prompt can offload 90%+ of instruction tokens, leaving the user prompt as a single keyword or short phrase. This makes Claude the ideal platform for system-prompt-based prompt golf. Additionally, Claude adheres closely to format specifications in the system prompt, reducing the need for per-message format reminders.

GPT-4o/GPT-5 (OpenAI): OpenAI models follow instructions closely relative to the tokens spent on them. A single clearly worded instruction often suffices where other models need reinforcement. The trade-off: GPT models may "fill in" unstated preferences from training data, which can be either helpful (inferring intent) or harmful (contradicting implicit wishes). Few-shot prompts for GPT models also need fewer demonstrations: often 1-2 shots match the quality of 3-5 shots in other models.

Gemini (Google): Gemini's multilingual capability provides a unique compression vector: some concepts express in fewer tokens in other languages. For example, German compound words or Chinese characters can encode complex meanings in single tokens. This cross-lingual compression can reduce token count by 30-60% for certain semantic targets.

Cache-Friendly Prompt Structures Per Provider

Provider	Caching Mechanism	Minimum Prefix	Cache Write Cost	Cache Read Cost (vs. base)	Key Structural Requirement
Anthropic	Developer-controlled (explicit cache points)	1024 tokens	+25% above base	90% discount	Static system prompt first; mark cache control breakpoints
OpenAI	Automatic (no config needed)	1024 tokens	Base rate (no surcharge)	50% discount	Exact prefix match; system message must stay identical
Google	Context Caching API	Varies by model	Token storage cost	~75% discount	Explicit cached content objects; TTL management

Structural pattern for all providers: Place the most stable content at the beginning of the prompt (system instructions, persona, format spec). Never modify the prefix between requests. Append variable content at the end. This maximizes cache hit rates regardless of provider.
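
A minimal sketch of the "stable prefix first" pattern, with a fingerprint check that detects prefix drift between requests. The prefix text and function names are hypothetical. Provider caches match on tokens rather than characters, so this character-level hash is only a cheap proxy for cache-friendliness.

```python
import hashlib

# Hypothetical stable content: persona and format spec, identical on every request.
STABLE_PREFIX = (
    "You are a support assistant.\n"
    "Answer in JSON with keys 'answer' and 'confidence'.\n"
)


def build_prompt(variable_input: str) -> str:
    """Stable content first, per-request content last."""
    return STABLE_PREFIX + variable_input


def prefix_fingerprint(prompt: str, prefix_len: int = len(STABLE_PREFIX)) -> str:
    """Hash the leading characters so prefix drift between requests is detectable."""
    return hashlib.sha256(prompt[:prefix_len].encode("utf-8")).hexdigest()


a = build_prompt("Summarize: first article")
b = build_prompt("Summarize: second article")
# Anti-pattern: per-request data placed before the stable content.
drifted = "request-id=42 " + b

print(prefix_fingerprint(a) == prefix_fingerprint(b))        # True
print(prefix_fingerprint(a) == prefix_fingerprint(drifted))  # False
```

A check like this can run in tests or logging to catch accidental prefix changes, such as a timestamp or request ID injected at the top of the system prompt.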

[Figure: sinc-LLM Specification Band Token Allocation]

D. Leveraging System Prompts & Context

The System Prompt Offload Strategy

The most powerful prompt golf technique for production systems is system prompt offloading: moving all stable instructions into the system prompt or cached context, leaving the user prompt as a minimal trigger.

Architecture:

System Prompt (cached, 2000+ tokens):
  - Persona definition
  - Output format specification
  - Domain constraints
  - Behavior rules
  - Few-shot examples

User Prompt (variable, 5-20 tokens):
  - Task-specific input only
  - e.g., "Summarize: [article]" or just "[article]"

This pattern is especially powerful when combined with prompt caching, as the system prompt is computed once and reused across thousands of requests.

Prompt Caching Savings

Prompt caching — reusing KV-cache tensors from identical prompt prefixes — is now supported by all three major providers, each with different implementations:

Anthropic: Developer-controlled with explicit cache breakpoints. Cache writes cost +25% above base input price, but cache reads cost 90% less than base. Requires minimum 1024 tokens for caching. Best for applications with long, stable system prompts.
OpenAI: Automatic caching on GPT-4o and newer. No configuration required. Cache hits provide 50% discount on input tokens. Requires minimum 1024 tokens. Exact prefix match required — any change to earlier messages invalidates the cache.
Google: Explicit Context Caching API. Cached content objects have a storage cost and TTL. Cache hits provide ~75% discount. Most flexible for multi-turn conversations.
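
The write surcharge and read discount can be combined into one amortized figure. The sketch below is a simplified model under stated assumptions: the first request pays the write price, every later request hits the cache, and storage or TTL costs (relevant to Google's Context Caching) are ignored. The multipliers are derived from the percentages quoted above.

```python
def amortized_input_multiplier(
    n_requests: int, write_multiplier: float, read_multiplier: float
) -> float:
    """Average cost of the cached prefix per request, relative to the uncached base price.

    The first request pays the write price; the remaining n-1 pay the read price.
    """
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    total = write_multiplier + read_multiplier * (n_requests - 1)
    return total / n_requests


# Multipliers derived from the figures quoted above.
ANTHROPIC = {"write_multiplier": 1.25, "read_multiplier": 0.10}  # +25% write, 90% off reads
OPENAI = {"write_multiplier": 1.00, "read_multiplier": 0.50}     # no surcharge, 50% off reads

for n in (1, 2, 10, 1000):
    print(
        n,
        round(amortized_input_multiplier(n, **ANTHROPIC), 3),
        round(amortized_input_multiplier(n, **OPENAI), 3),
    )
```

Under these assumptions the Anthropic write surcharge is recovered by the second request: two requests average 0.675x the base price, against 0.75x for a 50% read discount with no surcharge.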

Combined with prompt golf techniques: A production pipeline using system prompt offloading + prompt caching + token compression can achieve 90%+ input cost reduction compared to naive prompting.

SecurityLingua: Security-Aware Prompt Compression

Paper: SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression (arXiv 2506.12707, Li & Ahn, Microsoft Research, 2025)

SecurityLingua bridges prompt compression with security. It trains a prompt compressor to discern the true intention of input prompts, with particular focus on detecting malicious intent behind adversarial/jailbreak prompts.

How it works:
1. A security-aware compressor processes the user's prompt
2. It extracts and highlights suspicious instructions
3. The compressed representation is presented to the target LLM in a way that stimulates the model's built-in safety guardrails
4. The original prompt remains intact — users see no disruption

Key results:
- 100x less token cost compared to state-of-the-art LLM guardrail approaches
- Reveals malicious intent that obfuscated prompts attempt to hide
- Negligible overhead added to inference pipeline
- Highlights a dual-use concern: the same compression that reveals malicious intent could theoretically be inverted to mask intent

E. Adversarial & Creative Opportunities

Linguistic Compression and Ambiguity

Prompt golf's creative frontier lies in polysemy exploitation — using words and phrases that carry multiple simultaneous meanings to compress several instructions into one expression.

Technique	Example	Instructions Compressed
Double meaning	"Draft" (noun: document; verb: create)	Output type + action
Ambiguous scope	"Fix" (repair, prepare, secure)	Intent + method
Logical compression	"If X → not Y" (3 tokens) vs. "In the case where X is true, you should ensure that Y does not happen" (17 tokens)	Conditional + consequence
Implicit negation	"Formal only" (2 tokens) vs. "Do not use profanity, slang, or informal language" (8 tokens)	Multiple exclusions via scope
Domain metonymy	"SOAP" (1 token) in API context implies protocol spec + format + rules	Entire specification by association

Red Team Prompt Golf

The redteams.ai community hosts a prompt golf challenge specifically focused on achieving jailbreaks with the shortest possible prompts. Like code golf, the score is based on character count of the successful jailbreak. This represents the adversarial extreme of prompt golf — where optimization for brevity intersects with optimization for safety bypass.

Challenge parameters:
- Difficulty: Intermediate
- Estimated time: 4-8 hours
- Prerequisites: Understanding of jailbreak techniques
- Scoring: Fewest characters in a successful jailbreak prompt

Systematic Adversarial Prompt Analysis

Paper: Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs (arXiv 2505.04806, Pathade, 2025)

This study cataloged and categorized over 1,400 adversarial prompts, analyzing their success rates against GPT-4, Claude 2, Mistral 7B, and Vicuna. Key findings:

Jailbreak strategies cluster into distinct categories: role-playing, prompt injection, context manipulation, multi-turn exploitation, and encoding-based attacks
Shorter adversarial prompts were not consistently less effective — some of the most successful attacks used fewer than 20 tokens
Model susceptibility varies significantly; no single model was immune to all attack categories
The study proposes layered mitigation strategies as the most effective defense

When Optimization Becomes Jailbreaking

Paper: When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models (arXiv 2603.19247, 2026)

This paper identifies the critical boundary where legitimate prompt optimization crosses into adversarial territory. The authors repurposed black-box prompt optimization techniques — originally designed to improve performance on benign tasks — to systematically search for safety failures.

Core insight: The same automated prompt optimization tools used for prompt golf (iterative refinement, token reduction, constraint reformulation) can be trivially redirected to find jailbreaks. The optimization landscapes for brevity and for safety bypass overlap substantially, because both involve finding minimal prompts that produce specific outputs; the difference is only in the target output.

Implication for the prompt golf community: Competitive prompt golf platforms must consider safety guardrails. A "shortest prompt that produces output X" challenge is structurally identical to a jailbreak challenge when X is harmful content.

[Figure: Provider Prompt Caching Discount Comparison]

F. Cost & Latency Savings

LLM API Price Evolution

LLM API prices declined approximately 80% between early 2025 and early 2026, driven by model efficiency gains, competitive pressure, and architectural innovations (MoE, distillation, quantization). Current pricing for key models:

Model	Input ($/MTok)	Output ($/MTok)	Cached Input ($/MTok)	Context Window
GPT-4o	$2.50	$10.00	~$1.25	128K
GPT-5	$10.00	$30.00	~$5.00	256K+
o4 Mini	$0.55	$2.20	~$0.28	128K
Claude Sonnet 4	$3.00	$15.00	~$0.30 (cache read)	200K
Claude Haiku 3.5	$0.80	$4.00	~$0.08 (cache read)	200K
Gemini 2.5 Flash	$0.15	$0.60	~$0.04	1M
Gemini 3 Flash	$0.50	$3.00	~$0.13	1M+
DeepSeek V3	$0.27	$1.10	N/A	128K

Monthly Cost by Model (1,000 conversations/day)

Assumptions: ~500 input tokens/conversation (after caching), ~200 output tokens, 30-day month.

Model	Monthly Input Cost	Monthly Output Cost	Total Monthly Cost
Gemini 2.5 Flash	$2.25	$3.60	~$6
o4 Mini	$8.25	$13.20	~$21
Claude Haiku 3.5	$12.00	$24.00	~$36
GPT-4o	$37.50	$60.00	~$98
Claude Sonnet 4	$45.00	$90.00	~$135
GPT-5	$150.00	$180.00	~$330
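
The monthly figures follow from the stated assumptions (1,000 conversations/day, ~500 input and ~200 output tokens each, 30 days) and the per-MTok prices in the pricing table. A minimal sketch of the arithmetic, in which the function name and defaults are illustrative:

```python
# Prices in USD per million tokens (input, output), copied from the pricing table above.
PRICES = {
    "Gemini 2.5 Flash": (0.15, 0.60),
    "o4 Mini": (0.55, 2.20),
    "Claude Haiku 3.5": (0.80, 4.00),
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "GPT-5": (10.00, 30.00),
}


def monthly_cost(
    input_price: float,
    output_price: float,
    conversations_per_day: int = 1000,
    input_tokens: int = 500,
    output_tokens: int = 200,
    days: int = 30,
) -> tuple[float, float, float]:
    """Return (input_cost, output_cost, total) in USD for one month."""
    conversations = conversations_per_day * days
    input_cost = conversations * input_tokens / 1_000_000 * input_price
    output_cost = conversations * output_tokens / 1_000_000 * output_price
    return input_cost, output_cost, input_cost + output_cost


for model, (inp, out) in PRICES.items():
    i, o, total = monthly_cost(inp, out)
    print(f"{model}: input ${i:.2f}, output ${o:.2f}, total ${total:.2f}")
```

With these defaults the workload is 15M input tokens and 6M output tokens per month, so each total is simply 15 times the input price plus 6 times the output price.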

Prompt golf compression at scale turns even the most expensive models into viable options for high-volume applications.

LLMLingua: 20x Compression with Minimal Performance Loss

Papers: LLMLingua (EMNLP 2023), LLMLingua-2 (ACL 2024), Microsoft Research

LLMLingua uses a small language model to identify and remove non-essential tokens from prompts, achieving:

Up to 20x compression with only 1.5% accuracy drop on GSM8K
On GSM8K at 20x compression: LLMLingua achieves 77.33 exact match vs. Selective-Context's 44.20 and GPT4-Generation's 56.33
Coarse-to-fine compression: budget controller maintains semantic integrity, token-level iterative compression removes low-information tokens
Works with black-box LLMs (no model modification required)

LongLLMLingua extends this to long-context scenarios:
- 21.4% performance improvement on NaturalQuestions benchmark with 4x fewer tokens
- 17.1% average improvement across long-context tasks at 4x compression
- Specifically designed for RAG and document Q&A where context length is the bottleneck
- 94% cost reduction demonstrated on LooGLE (long context RAG) while improving GPT-3.5-Turbo performance
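
The idea of pruning low-information tokens under a budget can be illustrated with a toy stand-in. The sketch below is not LLMLingua's algorithm: LLMLingua scores tokens with a small language model, whereas this version swaps in unigram self-information estimated from the prompt itself, and the function name `prune_to_budget` is hypothetical.

```python
import math
from collections import Counter


def prune_to_budget(prompt: str, budget: int) -> str:
    """Keep the `budget` most informative words, preserving original order.

    Informativeness is unigram self-information, -log2(p(word)), estimated
    from the prompt itself: a toy stand-in for a language-model score.
    """
    words = prompt.split()
    if budget >= len(words):
        return prompt
    counts = Counter(w.lower() for w in words)
    total = len(words)
    scored = [
        (-math.log2(counts[w.lower()] / total), idx) for idx, w in enumerate(words)
    ]
    # Highest information first; ties are broken by earlier position.
    keep = sorted(scored, key=lambda s: (-s[0], s[1]))[:budget]
    kept_indices = sorted(idx for _, idx in keep)
    return " ".join(words[i] for i in kept_indices)


text = "the cat sat on the mat and the dog sat on the rug"
print(prune_to_budget(text, 5))  # cat mat and dog rug
```

Repeated words such as "the" carry the least self-information and are dropped first. A real compressor makes the same kind of decision with far better probability estimates and with safeguards for instructions and questions.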

Real-World Pipeline Savings

Production deployments consistently report 40-80% token cost savings through systematic optimization:

Optimization Layer	Typical Savings	Implementation Effort
Prompt caching (static prefixes)	50-90% input cost	Low
Token-level compression (LLMLingua)	80-95% input tokens	Medium
Chain of Draft (reasoning tasks)	~92% reasoning tokens	Low
System prompt offloading	60-80% user prompt tokens	Low
Abbreviation + constraint stacking	20-40% total tokens	Low
Full pipeline (all combined)	40-80% total cost	Medium-High

[Figure: Token-Stripping Technique Effectiveness]

G. Risks & Failure Modes

Over-Compression and Intent Alignment Failure

The fundamental risk of prompt golf: brevity optimization can destroy the very intent the prompt was meant to convey. The compression-performance tradeoff is not linear: beyond a critical threshold, each token removed degrades output quality at an accelerating rate.

Common failure patterns:

Ambiguity introduction: Removing disambiguating context causes the model to misinterpret intent. "Summarize formally" might be cut to "Summarize", and the model then returns an informal summary.
Constraint erosion: Stacked constraints like "no profanity, formal tone, third person" are compressed to "formal", and the model uses profanity in an otherwise formal register.
Scope collapse: Implicit context that seemed obvious to the prompter is not obvious to the model. Domain-specific abbreviations are misinterpreted.
Cascading misinterpretation: One ambiguous token causes downstream drift, amplified by autoregressive generation.

Compression Increases Hallucinations

Paper: An Empirical Study on Prompt Compression for Large Language Models (arXiv 2505.00019, Zhang & Li, ICLR 2025 Building Trust Workshop)

This systematic study of six prompt compression methods found that all compression methods increased hallucinations, with the primary cause being information loss.

Key findings:

Compression has a greater impact in long-context settings than short ones
Hallucination increase is consistent across all tested methods — not a weakness of specific algorithms
The relationship between compression ratio and hallucination rate is approximately linear up to 10x compression, then accelerates non-linearly
Removing constraint tokens causes more hallucination than removing context tokens
The sinc-LLM framework's Nyquist analogy is relevant here: under-sampling (removing too many specification bands) causes aliasing, which manifests as hallucination

Brittleness Across Model Updates

Compressed prompts are inherently more brittle than verbose ones:

Risk Factor	Verbose Prompt	Compressed Prompt
Model update tolerance	High (redundant signals survive)	Low (single signal removed = failure)
Cross-model portability	High (shared vocabulary)	Low (model-specific tokenizer behavior)
Temperature sensitivity	Low (robust to sampling variation)	High (marginal signals may not fire)
Context window pressure	Low (unaffected by other content)	Medium (competing signals can overpower)

A prompt that works perfectly on GPT-4o may fail on GPT-5 if the tokenizer changes or instruction-following weights shift. Competitive prompt golf entries often break within weeks of model updates.

SecurityLingua's Dual-Use Risk

SecurityLingua demonstrates that prompt compression can both reveal and mask malicious intent:

Defensive use: Compressor extracts true intention, exposing jailbreaks that were obfuscated in verbose prompts
Offensive inversion: An adversary could train an inverse compressor that takes a malicious instruction and compresses it into a form that appears benign to safety filters while still triggering the target LLM's compliance
This dual-use nature means prompt compression tools must be treated as safety-sensitive infrastructure

The Compression-Performance Tradeoff Curve

The curve exhibits three regimes:

Safe compression (1x-4x): Performance maintained or slightly improved (LongLLMLingua's 21.4% boost lives here)
Marginal compression (4x-10x): Gradual degradation begins; hallucination risk grows linearly
Extreme compression (10x-20x+): Accelerating quality loss; only viable for highly constrained tasks or when paired with recovery mechanisms

The practical implication: prompt golf techniques should be applied to the "safe" and "marginal" zones in production, with the "extreme" zone reserved for competitive play or tasks where quality floors are low.
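
The three regimes can be turned into a simple gate for production pipelines. In the sketch below the thresholds come from the list above, but the handling of the exact boundaries (4x counted as safe, 10x as marginal) and the function names are assumptions.

```python
def compression_regime(ratio: float) -> str:
    """Classify a compression ratio (original tokens / compressed tokens)."""
    if ratio < 1:
        raise ValueError("compression ratio must be at least 1")
    if ratio <= 4:
        return "safe"
    if ratio <= 10:
        return "marginal"
    return "extreme"


def allowed_in_production(original_tokens: int, compressed_tokens: int) -> bool:
    """Apply the rule of thumb above: only safe and marginal compression ships."""
    return compression_regime(original_tokens / compressed_tokens) in {"safe", "marginal"}


print(compression_regime(4))                 # safe
print(compression_regime(80_000 / 2_500))    # extreme (32x)
print(allowed_in_production(80_000, 2_500))  # False
```

By this rule the 80k to 2.5k restructuring reported for sinc-LLM is a 32x ratio and falls in the extreme regime. That is consistent with the note above that the claim requires independent validation.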

H. Competitive Prompt Golf

The Emerging Sport of Prompt Golf

Prompt golf is rapidly developing an organized competitive ecosystem, blending elements of code golf, competitive programming, and security CTFs. The gamification of prompt engineering serves both entertainment and serious purposes: it trains intuition for token efficiency, surfaces novel compression techniques, and stress-tests model instruction-following at the margin.

Platform Comparison

Platform	URL	Format	Scoring	Validator	Community
promptgolf.app	promptgolf.app	Browser game	Token count	Target output match	Reddit (r/PromptEngineering)
KaliPrompt	kaliprompt.com	Browser game + daily puzzles	Prompt count per hole	Automated pass/fail	KaliPrompt community
GPTWorld	github.com/srush/GPTWorld	Jupyter notebook	Prompt length	Grid-world task completion	Hugging Face Spaces
redteams.ai	redteams.ai	Community challenge	Character count	Successful jailbreak	Red team community

promptgolf.app — The Standard-Bearer

promptgolf.app is the most established competitive platform. Key mechanics:

Holes: Each challenge provides a target output that the LLM must produce
Model: Llama-3.1-70B-Instruct (no system prompt, temperature 0.7)
Constraints: Max 999 characters per prompt, 128 output tokens
Scoring: Fewest tokens wins; leaderboard per hole
Strategy: Non-determinism means players can gamble — a slightly ambiguous prompt might work on lucky sampling, accepting lower reliability for higher score
Entropy awareness: Simply deleting characters is counterproductive if it creates high-entropy token splits. Skilled players count tokens, not characters
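
The "gamble" in the Strategy item can be quantified. Because only one successful attempt is required, a prompt that succeeds with probability p on a single sample succeeds at least once in k attempts with probability 1 - (1 - p)^k. This is a minimal sketch assuming independent attempts; the function names are hypothetical.

```python
import math


def p_at_least_one_success(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent samples succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


def attempts_needed(p_single: float, target: float = 0.95) -> int:
    """Smallest number of attempts giving at least `target` chance of one success."""
    if not 0.0 < p_single < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p_single and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


print(attempts_needed(0.10))  # 29
print(round(p_at_least_one_success(0.10, 29), 3))
```

A prompt that works only one time in ten still has a better than 95% chance of scoring within 29 submissions, which is why low-reliability entries can be competitive on a leaderboard even though they would be unusable in production.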

KaliPrompt — Validator-Based Competition

KaliPrompt takes a different approach with validator-based holes:

Each "hole" has an automated validator that checks whether the model's output meets the challenge criteria
Scoring is by prompt count rather than token count
Part of a broader platform including daily puzzles, typing challenges, and retro terminal mini-games
Designed for "quick sessions and shareable scores" — lower barrier to entry than promptgolf.app

GPTWorld — Educational Grounding

GPTWorld, by GitHub user srush, is designed as an educational tool rather than a pure competition:

Players prompt an LLM to generate Python code that navigates a hexagonal grid-world
The AI agent (red dot) must pick up a key and reach the exit
Teaches prompting through a concrete, visual, grounded environment
Available as Jupyter notebook and Hugging Face Space
The "golf" element: solve the puzzle with the shortest possible prompt

Community Ecosystem

Prompt golf has spread beyond dedicated platforms into existing communities:

r/codegolf: Traditional code golf community where prompt golf challenges occasionally appear; strong cultural alignment
r/PromptEngineering: The original announcement of promptgolf.app generated significant engagement; ongoing discussion of techniques
r/leetcode: Prompt golf framing applied to algorithm problems — "get the solution in the fewest tokens"
HackAPrompt 2.0: The world's largest AI red-teaming hackathon (by LearnPrompting.org) incorporates prompt golf elements in competitive tracks

Corporate Hackathons

Organizations are adopting the prompt golf format for internal hackathons:

Advance Intelligence Group has used prompt golf-style competitions to train teams on prompt efficiency
OutSystems documented their prompt engineering hackathon experience, emphasizing gamification's effectiveness for skill development
AWS PartyRock Hackathon (7,600+ participants) included efficiency challenges with prompt golf elements

The corporate adoption signals a shift: prompt golf is not just a game but a training methodology that develops high-value skills in token efficiency and model instruction.

How Gamification Drives Innovation

Competitive prompt golf has produced several innovations that feed back into production prompt engineering:

Token-aware writing: Competitors develop intuition for tokenizer behavior, leading to more efficient production prompts
Adversarial testing: Red team prompt golf surfaces model vulnerabilities that inform safety research
Compression techniques: Techniques discovered in competition (abbreviation patterns, constraint stacking) transfer to production
Model comparison: Comparing performance across models under extreme compression reveals architectural strengths and weaknesses
Community benchmarking: Leaderboards create standardized benchmarks for prompt efficiency that did not exist before

[Figure: Model Family Short-Prompt Capability Radar]

References

Academic Papers

Xu, S. & Xie, D. (2025). Chain of Draft: Thinking Faster by Writing Less. arXiv:2502.18600. https://arxiv.org/abs/2502.18600
Jiang, H. et al. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. EMNLP 2023. https://aclanthology.org/2023.emnlp-main.825/
Jiang, H. et al. (2024). LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. ACL 2024. https://www.llmlingua.com/
Zhang, Y. & Li, W. (2025). An Empirical Study on Prompt Compression for Large Language Models. arXiv:2505.00019. https://arxiv.org/abs/2505.00019
Pathade, C. (2025). Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs. arXiv:2505.04806. https://arxiv.org/abs/2505.04806
Li, Z. & Ahn, H. (2025). SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression. arXiv:2506.12707. https://arxiv.org/abs/2506.12707
When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models. arXiv:2603.19247. https://arxiv.org/abs/2603.19247

Platforms & Tools

promptgolf.app — https://promptgolf.app/
KaliPrompt — https://kaliprompt.com/
GPTWorld (srush) — https://github.com/srush/GPTWorld
redteams.ai Prompt Golf Challenge — https://redteams.ai/topics/community-challenges/contributed/contrib-prompt-golf
sinc-LLM — https://sincllm.com/
Microsoft LLMLingua — https://github.com/microsoft/LLMLingua

Pricing & Caching References

LLM API Pricing Comparison 2026 — https://benchlm.ai/llm-pricing
Prompt Caching: How Anthropic, OpenAI, and Google Cut LLM Costs by 90% — https://www.morphllm.com/prompt-caching
OpenAI Prompt Caching Documentation — https://platform.openai.com/docs/guides/prompt-caching
Anthropic Prompt Caching Documentation — https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

Community

r/PromptEngineering — https://reddit.com/r/PromptEngineering
r/codegolf — https://reddit.com/r/codegolf
HackAPrompt 2.0 — https://learnprompting.org/blog/announce-hackaprompt-2