AGENT ZERO

Prompt Golf: A Comprehensive Research Brief

Opportunity Matrix

Technique / Lever | Token Savings | Cost Impact | Risk Level | Maturity | Best Use Case
Chain of Draft (CoD) | ~92% reasoning tokens | High (lower latency) | Low | Research-validated | Multi-step reasoning tasks
LLMLingua compression | Up to 20x | Very high | Medium | Production-ready (Microsoft) | High-volume API pipelines
LongLLMLingua | 4x with quality gain | High + quality boost | Medium | Research-validated | Long-context RAG
sinc-LLM decomposition | ~97% (reported 80k→2.5k) | Very high | Medium-High | Early framework | Enterprise prompt standardization
Prompt caching | Up to 90% input cost | Very high | Low | Production (all providers) | Repetitive system prompts
System prompt offloading | Variable (near-zero user prompt) | High | Low | Mature | Chatbot/agent architectures
Abbreviation / symbol substitution | 30-60% | Medium | Low-Medium | Practitioner-native | All prompt types
SecurityLingua compression | 100x vs guardrails | Defense ROI | Dual-use | Research (arXiv) | Safety pipeline integration
Adversarial compression | N/A (offensive) | N/A | Critical | Emerging | Red team / security testing

[Figure: Token Savings by Compression Technique]

A. Definition & Core Mechanics

What Is Prompt Golf?

Prompt golf is the practice of optimizing LLM prompts for maximum brevity and efficiency — achieving a desired model output using the fewest possible tokens or characters. The name is a direct analogy to code golf, the recreational programming challenge where participants solve problems in the fewest bytes of source code. Where code golf optimizes for program length, prompt golf optimizes for prompt length while preserving output fidelity.

The discipline sits at the intersection of prompt engineering, token economics, and information theory. It differs from general prompt engineering in a critical way: prompt engineering seeks the best output quality, while prompt golf seeks the shortest prompt that still produces acceptable output. This inversion creates unique constraints and techniques.

Scoring Heuristics

Prompt golf employs several scoring metrics, each with different implications:

Metric | Description | Advantage | Limitation
Token count | Number of tokens after tokenizer processing | Directly maps to API cost and latency | Model-specific tokenizers produce different counts
Character count | Raw character length of the prompt | Simple, universal, portable | Poor proxy for actual API cost; character-to-token mapping is non-linear
Semantic density | Information content per token | Captures efficiency of expression | Hard to quantify; requires semantic similarity scoring
Prompt count | Number of discrete prompts/messages sent | Measures conversational efficiency | Ignores message length; can be gamed
Composite score | Weighted combination of above | Balanced evaluation | Requires calibration per use case

A critical subtlety: character deletion can increase token count. As documented by promptgolf.app, removing a letter from a common word like "hello" can split a single token into two, increasing cost. True prompt golf requires understanding the tokenizer, not just trimming characters.
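The effect can be reproduced with a toy tokenizer. The sketch below is an illustration only: the vocabulary is hypothetical, and greedy longest-match is a simplification of how real BPE tokenizers merge symbols, so actual counts should always be checked against the target model's tokenizer.

```python
# Toy vocabulary: hypothetical, chosen only to illustrate the effect.
TOY_VOCAB = {"hello", "he", "lo", "h", "e", "l", "o"}


def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


full = greedy_tokenize("hello", TOY_VOCAB)
trimmed = greedy_tokenize("helo", TOY_VOCAB)
print(len("hello"), full)    # 5 characters, ['hello']
print(len("helo"), trimmed)  # 4 characters, ['he', 'lo']
```

The shorter string costs more tokens because the whole-word entry no longer matches and the text falls back to smaller pieces.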

Prompt Golf Platforms

Platform | Format | Scoring | Model | Key Feature
promptgolf.app | Browser game | Token count (shortest wins) | Llama-3.1-70B-Instruct | Leaderboard, entropy-aware challenges
kaliprompt.com | Browser game (part of KaliPrompt suite) | Prompt count per hole | Validator-based | "Holes" with pass/fail validation
GPTWorld (GitHub srush/GPTWorld) | Grid-world puzzle | Prompt length | GPT-4 (originally) | Educational grounding environment

promptgolf.app is the most prominent dedicated platform. It runs Meta's Llama-3.1-70B-Instruct without a system prompt, at temperature 0.7, with responses capped at 128 output tokens and prompts limited to 999 characters. Challenges ("holes") require eliciting a specific target output. Due to LLM non-determinism, only a single successful attempt is required for a prompt to count — adding a probabilistic element to the strategy.
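Because a single success is enough, the value of a risky prompt depends on how many attempts a player is willing to make. A minimal sketch of that arithmetic, assuming attempts are independent and that each has the same success probability (both are simplifying assumptions, and the function name is illustrative):

```python
def p_at_least_one_success(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


# A prompt that works only 10% of the time, retried 20 times.
print(round(p_at_least_one_success(0.10, 20), 2))  # 0.88
```

Under these assumptions, a shorter prompt that succeeds only occasionally can still be a rational leaderboard entry, which is why reliability and score trade off on this platform.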

kaliprompt.com is part of the KaliPrompt collection of AI browser games, daily puzzles, and retro terminal mini-games. Its prompt golf variant uses validator-based holes where prompts must pass automated checks, and it is scored by prompt count rather than token count.

GPTWorld (srush/GPTWorld) takes a different approach: a hexagonal grid-world environment where the player instructs an LLM to generate code that navigates a game. The objective is constructing a prompt that gets GPT-4 to solve a complex spatial puzzle — grounding abstract prompting skill in a concrete, visual challenge. It is explicitly subtitled "Prompt Golf."


B. Token-Economy Techniques

Practical Token-Stripping Tricks

Prompt golf practitioners employ a toolkit of specific techniques to reduce token count without losing semantic intent:

Technique | Example | Approx. Savings | Notes
Abbreviation | "temperature" → "temp", "maximum" → "max" | 30-50% | Must stay within model's recognized vocabulary
Symbol substitution | "and" → "&", "number" → "#", "at" → "@" | 10-20% | Risk of tokenizer-inefficient symbols in some models
Implicit context | Remove obvious domain terms the model can infer | 15-40% | Works best with domain-expert models
Constraint stacking | Combine multiple rules: "no profanity, no slang, formal tone" → "formal" | 20-30% | Higher risk of underspecification
Instruction fusion | Merge two instructions into one phrase | 10-25% | Dual-meaning language, advanced skill
Output format shorthand | "Return as JSON" → "→JSON" | 5-15% | Model-dependent; some require explicit format specs
Article/preposition removal | Drop "the", "a", "in", "of" where unambiguous | 10-20% | English prepositions are often single tokens, so savings are modest
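The abbreviation and symbol rows in the table above can be applied mechanically. The following is a minimal sketch using only the substitutions listed there; the function name `golf` is illustrative, and the comparison is by characters, so any saving should be confirmed with the target model's tokenizer.

```python
import re

# Substitutions taken from the examples in the table above.
SUBSTITUTIONS = {
    "temperature": "temp",
    "maximum": "max",
    "and": "&",
    "number": "#",
}

# Match whole words only, so "and" inside "command" is left alone.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SUBSTITUTIONS)) + r")\b",
    re.IGNORECASE,
)


def golf(prompt: str) -> str:
    """Replace each listed word with its shorter form."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0).lower()], prompt)


before = "Set the maximum temperature and report the number"
after = golf(before)
print(after)                    # Set the max temp & report the #
print(len(before), len(after))  # character counts only, not tokens
```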

Chain of Draft (CoD)

Paper: Chain of Draft: Thinking Faster by Writing Less (arXiv 2502.18600, Xu & Xie, Zoom Communications, February 2025)

Chain of Draft is the most significant prompt-golf-adjacent technique to emerge from academic research. The core insight: humans solving problems don't write out every step in full prose — they draft concise intermediate thoughts capturing only essential information. CoD applies this to LLM reasoning.

Results:

Task | CoT Tokens | CoD Tokens | Token Reduction | Accuracy Change
GSM8K (math) | ~200 | ~15 | 92.4% | Maintained
BIG-Bench Hard | ~180 | ~14 | 92.2% | Maintained
Date understanding | ~150 | ~12 | 92.0% | Slight improvement
Sports understanding | ~120 | ~10 | 91.7% | Slight improvement

CoD uses as little as 7.6% of the tokens compared to Chain-of-Thought while matching or surpassing accuracy. The technique works by instructing the model to produce minimal intermediate reasoning — a few words per step rather than full sentences.

Prompt implementation:

CoT prompt: "Think step by step and explain your reasoning."
CoD prompt: "Think step by step, but only keep a minimum draft for each thinking step, with around 5 words."
This simple instruction change yields a ~92% token reduction in reasoning with no accuracy loss.
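The two instructions differ only in their wording, so switching between them is a one-line change in a pipeline. The sketch below pairs the prompts quoted above with a helper that converts token counts into a reduction percentage; the helper names and the question format are illustrative assumptions, not part of the paper.

```python
COT_INSTRUCTION = "Think step by step and explain your reasoning."
COD_INSTRUCTION = (
    "Think step by step, but only keep a minimum draft for each "
    "thinking step, with around 5 words."
)


def build_prompt(instruction: str, question: str) -> str:
    """Prefix a question with the chosen reasoning instruction."""
    return f"{instruction}\n\nQ: {question}"


def token_reduction_pct(baseline_tokens: float, reduced_tokens: float) -> float:
    """Percentage of tokens saved relative to the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (1.0 - reduced_tokens / baseline_tokens)


# Using 7.6% of the baseline tokens corresponds to a 92.4% reduction.
print(round(token_reduction_pct(1000, 76), 1))  # 92.4
```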

Signal-Theoretic Decomposition (sinc-LLM)

Source: sinc-LLM (sincllm.com)

sinc-LLM applies the Nyquist-Shannon sampling theorem to prompt engineering, treating a complete task specification as a "signal" that must be "sampled" at sufficient rate to be faithfully reconstructed. The framework decomposes any prompt into 6 Nyquist-compliant specification bands:

Band | Index | Content | Typical Token Allocation
PERSONA | n=0 | Who should answer; role, expertise, perspective | ~5%
CONTEXT | n=1 | Situational facts, environment, background | ~20%
DATA | n=2 | Specific inputs, numbers, references | ~20%
CONSTRAINTS | n=3 | Rules, boundaries, exclusions | ~25%
FORMAT | n=4 | Output structure, length, style | ~25%
TASK | n=5 | The actual objective | ~5%

The key finding: CONSTRAINTS and FORMAT drive 42.7% of output quality but are typically under-specified. A prompt covering fewer than 6 bands is "under-sampled" and will alias — manifesting as hallucination or output drift. The reconstruction formula parallels signal processing: $x(t) = \sum x(nT) \cdot \text{sinc}((t - nT)/T)$.
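A band-coverage check is straightforward to sketch. The code below is an assumption about how such a check might look, not the sinc-LLM implementation: the function names and the plain-text rendering are hypothetical, and only the six band names come from the framework.

```python
# The six band names come from the framework; everything else is illustrative.
BANDS = ("PERSONA", "CONTEXT", "DATA", "CONSTRAINTS", "FORMAT", "TASK")


def missing_bands(spec: dict[str, str]) -> list[str]:
    """Return the bands that are absent or empty, in band order."""
    return [band for band in BANDS if not spec.get(band, "").strip()]


def render(spec: dict[str, str]) -> str:
    """Render a fully specified prompt, refusing under-sampled specs."""
    missing = missing_bands(spec)
    if missing:
        raise ValueError(f"under-sampled prompt, missing: {', '.join(missing)}")
    return "\n".join(f"{band}: {spec[band].strip()}" for band in BANDS)


spec = {"PERSONA": "Senior tax accountant", "TASK": "Summarize the filing"}
print(missing_bands(spec))  # ['CONTEXT', 'DATA', 'CONSTRAINTS', 'FORMAT']
```

Failing fast on missing bands makes the under-specification visible before any tokens are spent on a model call.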

Reported compression: prompts of 80k tokens reduced to 2.5k tokens (~97% reduction) by restructuring into the 6-band framework while maintaining specification completeness. However, this claim requires independent validation.

Cost optimization tip from sinc-LLM: Use a cheap model (e.g., Claude Haiku) for the decomposition/scatter step (converting raw prompt to sinc JSON), and a capable model (e.g., Claude Sonnet) for execution — since Haiku costs 75% less than Sonnet.


[Figure: Chain of Draft: Token Reduction Across Tasks]

C. Model-Specific Opportunities

How Model Families Respond to Ultra-Short Prompts

Different model families exhibit distinct behaviors when presented with compressed, minimal prompts. Understanding these differences is essential for both competitive prompt golf and production optimization.

Model Family | Strength with Short Prompts | Weakness | Free Leverage
GPT-4o / GPT-5 | Strong instruction following; infers intent from fragments | Can override implicit constraints with learned priors | Instruction obedience: minimal instructions suffice if unambiguous
Claude (Anthropic) | Exceptional system prompt adherence; respects persona framing | Over-cautious with ambiguous short prompts | System prompt weight: pre-load everything, send near-zero user prompt
Gemini (Google) | Strong multilingual compression; handles symbolic shorthand | May misinterpret extreme abbreviation in English | Multilingual compression: non-English tokens can be shorter for same semantics
Llama (Meta, open-source) | Flexible; tokenizer well-documented | Less robust to non-standard syntax | Tokenizer predictability: tiktoken-compatible, easy to count exactly
Mistral | Good at concise reasoning tasks | Narrower context window limits implicit inference | European language efficiency

Model-Specific Leverage Details

Claude (Anthropic): Claude's weighting of system prompts relative to user messages is the strongest among major providers. A well-crafted system prompt can offload 90%+ of instruction tokens, leaving the user prompt as a single keyword or short phrase. This makes Claude the ideal platform for system-prompt-based prompt golf. Additionally, Claude adheres closely to format specifications in the system prompt, reducing the need for per-message format reminders.

GPT-4o/GPT-5 (OpenAI): OpenAI models are the most instruction-following on a per-token basis. A single clearly-worded instruction often suffices where other models need reinforcement. The trade-off: GPT models may "fill in" unstated preferences from training data, which can be either helpful (inferring intent) or harmful (contradicting implicit wishes). Few-shot examples in GPT need fewer demonstrations — often 1-2 shots match the quality of 3-5 shots in other models.

Gemini (Google): Gemini's multilingual capability provides a unique compression vector: some concepts express in fewer tokens in other languages. For example, German compound words or Chinese characters can encode complex meanings in single tokens. This cross-lingual compression can reduce token count by 30-60% for certain semantic targets.

Cache-Friendly Prompt Structures Per Provider

Provider | Caching Mechanism | Minimum Prefix | Cache Write Cost | Cache Read Cost (vs. base) | Key Structural Requirement
Anthropic | Developer-controlled (explicit cache points) | 1024 tokens | +25% above base | 90% discount | Static system prompt first; mark cache control breakpoints
OpenAI | Automatic (no config needed) | 1024 tokens | Base rate (no surcharge) | 50% discount | Exact prefix match; system message must stay identical
Google | Context Caching API | Varies by model | Token storage cost | ~75% discount | Explicit cached content objects; TTL management

Structural pattern for all providers: Place the most stable content at the beginning of the prompt (system instructions, persona, format spec). Never modify the prefix between requests. Append variable content at the end. This maximizes cache hit rates regardless of provider.
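The effect of ordering on the shared prefix can be illustrated without any provider API. This is a sketch under stated assumptions: it compares characters, whereas providers match on tokens, and the system text and helper names are hypothetical.

```python
import os


def build_prompt(stable_prefix: str, variable_input: str) -> str:
    """Stable content first, per-request content last."""
    return f"{stable_prefix}\n\n{variable_input}"


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


STABLE = "You are a support agent. Answer in JSON."  # hypothetical system text

# Cache-friendly: the stable text leads, so both requests share it.
good_a = build_prompt(STABLE, "Order 123 is late")
good_b = build_prompt(STABLE, "Refund request")

# Cache-hostile: variable text leads, so the prefixes diverge immediately.
bad_a = f"Order 123 is late\n\n{STABLE}"
bad_b = f"Refund request\n\n{STABLE}"

print(shared_prefix_chars(good_a, good_b))  # the whole stable block is shared
print(shared_prefix_chars(bad_a, bad_b))    # 0
```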


D. Leveraging System Prompts & Context

The System Prompt Offload Strategy

The most powerful prompt golf technique for production systems is system prompt offloading: moving all stable instructions into the system prompt or cached context, leaving the user prompt as a minimal trigger.

Architecture:

System Prompt (cached, 2000+ tokens):
  - Persona definition
  - Output format specification
  - Domain constraints
  - Behavior rules
  - Few-shot examples

User Prompt (variable, 5-20 tokens):
  - Task-specific input only
  - e.g., "Summarize: [article]" or just "[article]"

This pattern is especially powerful when combined with prompt caching, as the system prompt is computed once and reused across thousands of requests.

Prompt Caching Savings

Prompt caching — reusing KV-cache tensors from identical prompt prefixes — is now supported by all three major providers, each with different implementations:

Anthropic

Developer-controlled with explicit cache breakpoints. Cache writes cost +25% above base input price, but cache reads cost 90% less than base. Requires minimum 1024 tokens for caching. Best for applications with long, stable system prompts.

OpenAI

Automatic caching on GPT-4o and newer. No configuration required. Cache hits provide 50% discount on input tokens. Requires minimum 1024 tokens. Exact prefix match required — any change to earlier messages invalidates the cache.

Google

Explicit Context Caching API. Cached content objects have a storage cost and TTL. Cache hits provide ~75% discount. Most flexible for multi-turn conversations.

Combined with prompt golf techniques: A production pipeline using system prompt offloading + prompt caching + token compression can achieve 90%+ input cost reduction compared to naive prompting.
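The caching arithmetic can be checked with a short calculation. The sketch below uses the Anthropic-style figures quoted above (writes at +25%, reads at a 90% discount) and assumes one cache write followed by hits on every later request, which ignores cache expiry; the function name and the example workload are illustrative.

```python
def prefix_cost(
    prefix_tokens: int,
    requests: int,
    base_price_per_mtok: float,
    write_multiplier: float = 1.25,  # cache write: +25% above base
    read_multiplier: float = 0.10,   # cache read: 90% discount
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in dollars for the stable prefix.

    Assumes a single cache write and a hit on every later request.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    per_request = prefix_tokens * base_price_per_mtok / 1_000_000
    uncached = requests * per_request
    cached = (write_multiplier + (requests - 1) * read_multiplier) * per_request
    return uncached, cached


# Hypothetical workload: 2,000-token system prompt, 1,000 requests, $3.00/MTok.
uncached, cached = prefix_cost(2000, 1000, 3.00)
print(round(uncached, 4), round(cached, 4))
print(round(100 * (1 - cached / uncached), 1))  # savings on the prefix, in percent
```

Under these assumptions caching already pays off on the second request, since one write plus one read costs 1.35x a single uncached pass instead of 2x.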

SecurityLingua: Security-Aware Prompt Compression

Paper: SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression (arXiv 2506.12707, Li & Ahn, Microsoft Research, 2025)

SecurityLingua bridges prompt compression with security. It trains a prompt compressor to discern the true intention of input prompts, with particular focus on detecting malicious intent behind adversarial/jailbreak prompts.

How it works:
1. A security-aware compressor processes the user's prompt
2. It extracts and highlights suspicious instructions
3. The compressed representation is presented to the target LLM in a way that stimulates the model's built-in safety guardrails
4. The original prompt remains intact — users see no disruption

Key results:

100x less token cost compared to state-of-the-art LLM guardrail approaches

[Figure: sinc-LLM Band Token Allocation]

E. Adversarial & Creative Opportunities

Linguistic Compression and Ambiguity

Prompt golf's creative frontier lies in polysemy exploitation — using words and phrases that carry multiple simultaneous meanings to compress several instructions into one expression.

Technique | Example | Instructions Compressed
Double meaning | "Draft" (noun: document; verb: create) | Output type + action
Ambiguous scope | "Fix" (repair, prepare, secure) | Intent + method
Logical compression | "If X → not Y" (3 tokens) vs. "In the case where X is true, you should ensure that Y does not happen" (17 tokens) | Conditional + consequence
Implicit negation | "Formal only" (2 tokens) vs. "Do not use profanity, slang, or informal language" (8 tokens) | Multiple exclusions via scope
Domain metonymy | "SOAP" (1 token) in API context implies protocol spec + format + rules | Entire specification by association

Red Team Prompt Golf

The redteams.ai community hosts a prompt golf challenge specifically focused on achieving jailbreaks with the shortest possible prompts. Like code golf, the score is based on character count of the successful jailbreak. This represents the adversarial extreme of prompt golf — where optimization for brevity intersects with optimization for safety bypass.

Challenge parameters:

Difficulty: Intermediate
Estimated time: 4-8 hours
Prerequisites: Understanding of jailbreak techniques
Scoring: Fewest characters in a successful jailbreak prompt

Systematic Adversarial Prompt Analysis

Paper: Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs (arXiv 2505.04806, Pathade, 2025)

This study cataloged and categorized over 1,400 adversarial prompts, analyzing their success rates against GPT-4, Claude 2, Mistral 7B, and Vicuna. Key findings:

When Optimization Becomes Jailbreaking

Paper: When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models (arXiv 2603.19247, 2026)

This paper identifies the critical boundary where legitimate prompt optimization crosses into adversarial territory. The authors repurposed black-box prompt optimization techniques — originally designed to improve performance on benign tasks — to systematically search for safety failures.

Core insight: The same automated prompt optimization tools used for prompt golf (iterative refinement, token reduction, constraint reformulation) can be trivially redirected to find jailbreaks. The optimization landscapes for brevity and for safety bypass overlap substantially, because both involve finding minimal prompts that produce specific outputs; the only difference is the target output.

Implication for the prompt golf community: Competitive prompt golf platforms must consider safety guardrails. A "shortest prompt that produces output X" challenge is structurally identical to a jailbreak challenge when X is harmful content.



F. Cost & Latency Savings

LLM API Price Evolution

LLM API prices declined approximately 80% between early 2025 and early 2026, driven by model efficiency gains, competitive pressure, and architectural innovations (MoE, distillation, quantization). Current pricing for key models:

Model | Input ($/MTok) | Output ($/MTok) | Cached Input ($/MTok) | Context Window
GPT-4o | $2.50 | $10.00 | ~$1.25 | 128K
GPT-5 | $10.00 | $30.00 | ~$5.00 | 256K+
o4 Mini | $0.55 | $2.20 | ~$0.28 | 128K
Claude Sonnet 4 | $3.00 | $15.00 | ~$0.30 (cache read) | 200K
Claude Haiku 3.5 | $0.80 | $4.00 | ~$0.08 (cache read) | 200K
Gemini 2.5 Flash | $0.15 | $0.60 | ~$0.04 | 1M
Gemini 3 Flash | $0.50 | $3.00 | ~$0.13 | 1M+
DeepSeek V3 | $0.27 | $1.10 | N/A | 128K

Monthly Cost by Model (1,000 conversations/day)

Assumptions: ~500 input tokens/conversation (after caching), ~200 output tokens, 30-day month.

Model | Monthly Input Cost | Monthly Output Cost | Total Monthly Cost
Gemini 2.5 Flash | $2.25 | $3.60 | ~$6
o4 Mini | $8.25 | $13.20 | ~$21
Claude Haiku 3.5 | $12.00 | $24.00 | ~$36
GPT-4o | $37.50 | $60.00 | ~$98
Claude Sonnet 4 | $45.00 | $90.00 | ~$135
GPT-5 | $150.00 | $180.00 | ~$330
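The monthly figures follow directly from the stated assumptions (1,000 conversations per day, ~500 input and ~200 output tokens each, 30 days). A minimal sketch of that arithmetic, with an illustrative function name and the per-MTok prices passed in from the pricing table:

```python
def monthly_cost(
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    conversations_per_day: int = 1000,
    input_tokens: int = 500,
    output_tokens: int = 200,
    days: int = 30,
) -> tuple[float, float]:
    """Return (input_cost, output_cost) in dollars per month."""
    input_mtok = conversations_per_day * input_tokens * days / 1_000_000
    output_mtok = conversations_per_day * output_tokens * days / 1_000_000
    return input_mtok * input_price_per_mtok, output_mtok * output_price_per_mtok


# o4 Mini at $0.55 input and $2.20 output per MTok.
inp, out = monthly_cost(0.55, 2.20)
print(round(inp, 2), round(out, 2), round(inp + out, 2))  # 8.25 13.2 21.45
```

At this volume the workload is 15M input tokens and 6M output tokens per month, so each total is simply 15 times the input price plus 6 times the output price.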

Prompt golf compression at scale turns even the most expensive models into viable options for high-volume applications.

LLMLingua: 20x Compression with Minimal Performance Loss

Papers: LLMLingua (EMNLP 2023), LLMLingua-2 (ACL 2024), Microsoft Research

LLMLingua uses a small language model to identify and remove non-essential tokens from prompts, achieving:

  - Up to 20x compression with only a 1.5% accuracy drop on GSM8K
  - On GSM8K at 20x compression: LLMLingua achieves 77.33 exact match vs. Selective-Context's 44.20

LongLLMLingua extends this to long-context scenarios:

  - 21.4% performance improvement on NaturalQuestions benchmark with 4x fewer tokens
  - 17.1% average improvement across long-context tasks at 4x compression
  - 94% cost reduction demonstrated on LooGLE (long context RAG) while improving GPT-3.5-Turbo performance

Real-World Pipeline Savings

Production deployments consistently report 40-80% token cost savings through systematic optimization.

Optimization Layer | Typical Savings | Implementation Effort
Prompt caching (static prefixes) | 50-90% input cost | Low
Token-level compression (LLMLingua) | 80-95% input tokens | Medium
Chain of Draft (reasoning tasks) | ~92% reasoning tokens | Low
System prompt offloading | 60-80% user prompt tokens | Low
Abbreviation + constraint stacking | 20-40% total tokens | Low
Full pipeline (all combined) | 40-80% total cost | Medium-High

[Figure: Token-Stripping Techniques: Savings Potential]

G. Risks & Failure Modes

Over-Compression and Intent Alignment Failure

The fundamental risk of prompt golf: brevity optimization can destroy the very intent the prompt was meant to convey. The compression-performance tradeoff is not linear: beyond a critical threshold, each token removed degrades output quality at an accelerating rate.

Common failure patterns:

Ambiguity introduction

Removing disambiguating context causes the model to misinterpret intent. "Summarize formally" might become "Summarize" — model outputs informal summary.

Constraint erosion

Stacked constraints like "no profanity, formal tone, third person" compressed to "formal" — model uses profanity in formal register.

Scope collapse

Implicit context that seemed obvious to the prompter is not obvious to the model. Domain-specific abbreviations are misinterpreted.

Cascading misinterpretation

One ambiguous token causes downstream drift, amplified by autoregressive generation.

Compression Increases Hallucinations

Paper: An Empirical Study on Prompt Compression for Large Language Models (arXiv 2505.00019, Zhang & Li, ICLR 2025 Building Trust Workshop)

This systematic study of six prompt compression methods found that all compression methods increased hallucinations, with the primary cause being information loss.

Key findings:

The relationship between compression ratio and hallucination rate is approximately linear up to 10x compression.

Brittleness Across Model Updates

Compressed prompts are inherently more brittle than verbose ones:

Risk Factor | Verbose Prompt | Compressed Prompt
Model update tolerance | High (redundant signals survive) | Low (single signal removed = failure)
Cross-model portability | High (shared vocabulary) | Low (model-specific tokenizer behavior)
Temperature sensitivity | Low (robust to sampling variation) | High (marginal signals may not fire)
Context window pressure | Low (unaffected by other content) | Medium (competing signals can overpower)

A prompt that works perfectly on GPT-4o may fail on GPT-5 if the tokenizer changes or instruction-following weights shift. Competitive prompt golf entries often break within weeks of model updates.

SecurityLingua's Dual-Use Risk

SecurityLingua demonstrates that prompt compression can both reveal and mask malicious intent:

The Compression-Performance Tradeoff Curve

The curve exhibits three regimes:

1. Safe compression (1x-4x): Performance maintained or slightly improved (LongLLMLingua's 21.4% boost at 4x falls here)
2. Marginal compression (4x-10x): Gradual degradation begins; hallucination risk grows linearly
3. Extreme compression (10x-20x+): Accelerating quality loss; only viable for highly constrained tasks

The practical implication: prompt golf techniques should be applied to the "safe" and "marginal" zones in production, with the "extreme" zone reserved for competitive play or tasks where quality floors are low.
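The three regimes can be encoded as a simple guard in a compression pipeline. This is a sketch only: the 4x and 10x boundaries are read from the regime list above, and the function names and the production policy are illustrative assumptions rather than measured thresholds.

```python
def compression_regime(ratio: float) -> str:
    """Classify a compression ratio (original tokens / compressed tokens)."""
    if ratio < 1:
        raise ValueError("compression ratio must be at least 1")
    if ratio <= 4:
        return "safe"
    if ratio <= 10:
        return "marginal"
    return "extreme"


def allowed_in_production(original_tokens: int, compressed_tokens: int) -> bool:
    """Permit only the safe and marginal zones for production traffic."""
    ratio = original_tokens / compressed_tokens
    return compression_regime(ratio) in {"safe", "marginal"}


print(compression_regime(3))             # safe
print(compression_regime(8))             # marginal
print(compression_regime(20))            # extreme
print(allowed_in_production(8000, 500))  # 16x compression: False
```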


H. Competitive Prompt Golf

The Emerging Sport of Prompt Golf

Prompt golf is rapidly developing an organized competitive ecosystem, blending elements of code golf, competitive programming, and security CTFs. The gamification of prompt engineering serves both entertainment and serious purposes: it trains intuition for token efficiency, surfaces novel compression techniques, and stress-tests model instruction-following at the margin.

Platform Comparison

Platform | URL | Format | Scoring | Validator | Community
promptgolf.app | promptgolf.app | Browser game | Token count | Target output match | Reddit (r/PromptEngineering)
KaliPrompt | kaliprompt.com | Browser game + daily puzzles | Prompt count per hole | Automated pass/fail | KaliPrompt community
GPTWorld | github.com/srush/GPTWorld | Jupyter notebook | Prompt length | Grid-world task completion | Hugging Face Spaces
redteams.ai | redteams.ai | Community challenge | Character count | Successful jailbreak | Red team community

promptgolf.app — The Standard-Bearer

promptgolf.app is the most established competitive platform. Key mechanics:

Holes

Each challenge provides a target output that the LLM must produce

Model

Llama-3.1-70B-Instruct (no system prompt, temperature 0.7)

Constraints

Max 999 characters per prompt, 128 output tokens

Scoring

Fewest tokens wins; leaderboard per hole

Strategy

Non-determinism means players can gamble — a slightly ambiguous prompt might work on lucky sampling, accepting lower reliability for higher score

Entropy awareness

Simply deleting characters is counterproductive if it creates high-entropy token splits. Skilled players count tokens, not characters

KaliPrompt — Validator-Based Competition

KaliPrompt takes a different approach with validator-based holes:

GPTWorld — Educational Grounding

Srush's GPTWorld is designed as an educational tool rather than a pure competition:

Community Ecosystem

Prompt golf has spread beyond dedicated platforms into existing communities:

r/codegolf

Traditional code golf community where prompt golf challenges occasionally appear; strong cultural alignment

r/PromptEngineering

The original announcement of promptgolf.app generated significant engagement; ongoing discussion of techniques

r/leetcode

Prompt golf framing applied to algorithm problems — "get the solution in the fewest tokens"

HackAPrompt 2.0

The world's largest AI red-teaming hackathon (by LearnPrompting.org) incorporates prompt golf elements in competitive tracks

Corporate Hackathons

Organizations are adopting the prompt golf format for internal hackathons:

The corporate adoption signals a shift: prompt golf is not just a game but a training methodology that develops high-value skills in token efficiency and model instruction.

How Gamification Drives Innovation

Competitive prompt golf has produced several innovations that feed back into production prompt engineering:

  1. Token-aware writing: Competitors develop intuition for tokenizer behavior, leading to more efficient production prompts
  2. Adversarial testing: Red team prompt golf surfaces model vulnerabilities that inform safety research
  3. Compression techniques: Techniques discovered in competition (abbreviation patterns, constraint stacking) transfer to production
  4. Model comparison: Comparing performance across models under extreme compression reveals architectural strengths and weaknesses
  5. Community benchmarking: Leaderboards create standardized benchmarks for prompt efficiency that did not exist before

[Figure: Model Family Capability Comparison]

References

Academic Papers

  1. Xu, S., Xie, W., Zhao, L. & He, P. (2025). Chain of Draft: Thinking Faster by Writing Less. arXiv:2502.18600. https://arxiv.org/abs/2502.18600
  2. Jiang, H. et al. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. EMNLP 2023. https://aclanthology.org/2023.emnlp-main.825/
  3. Jiang, H. et al. (2024). LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. ACL 2024. https://www.llmlingua.com/
  4. Zhang, Y. & Li, W. (2025). An Empirical Study on Prompt Compression for Large Language Models. arXiv:2505.00019. https://arxiv.org/abs/2505.00019
  5. Pathade, C. (2025). Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs. arXiv:2505.04806. https://arxiv.org/abs/2505.04806
  6. Li, Z. & Ahn, H. (2025). SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression. arXiv:2506.12707. https://arxiv.org/abs/2506.12707
  7. When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models. arXiv:2603.19247. https://arxiv.org/abs/2603.19247

Platforms & Tools

  1. promptgolf.app — https://promptgolf.app/
  2. KaliPrompt — https://kaliprompt.com/
  3. GPTWorld (srush) — https://github.com/srush/GPTWorld
  4. redteams.ai Prompt Golf Challenge — https://redteams.ai/topics/community-challenges/contributed/contrib-prompt-golf
  5. sinc-LLM — https://sincllm.com/
  6. Microsoft LLMLingua — https://github.com/microsoft/LLMLingua

Pricing & Caching References

  1. LLM API Pricing Comparison 2026 — https://benchlm.ai/llm-pricing
  2. Prompt Caching: How Anthropic, OpenAI, and Google Cut LLM Costs by 90% — https://www.morphllm.com/p
  3. OpenAI Prompt Caching Documentation — https://platform.openai.com/docs/guides/prompt-caching
  4. Anthropic Prompt Caching Documentation — https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

Community

  1. r/PromptEngineering — https://reddit.com/r/PromptEngineering
  2. r/codegolf — https://reddit.com/r/codegolf
  3. HackAPrompt 2.0 — https://learnprompting.org/blog/announce-hackaprompt-2

The most startling revelation is that prompts can be compressed by up to 97% while preserving acceptable output, shattering the assumption that verbosity ensures quality. This inversion—optimizing for brevity over perfection—exposes that LLMs thrive on far less semantic scaffolding than practitioners believe, making excess tokens a tax on efficiency rather than a guarantee of accuracy.
Less tokens, same truth
Ninety-seven percent falls
Brevity unlocks