LLM cost optimization is fundamentally a token economics problem. This tutorial covers four distinct techniques (prompt compression, semantic caching, chain-of-thought pruning, and output length constraints) that, when combined, can reduce LLM API costs by as much as 63%.
How to Reduce LLM API Costs
- Instrument token logging on every API call to establish a cost baseline before optimizing.
- Compress system prompts by eliminating hedge language, consolidating instructions into structured formats, and using tools like LLMLingua.
- Constrain output length with max_completion_tokens or max_tokens and enforce structured JSON schemas.
- Prune chain-of-thought reasoning in production by instructing the model to return only the final answer.
- Implement semantic caching using embedding similarity to skip redundant API calls entirely.
- Leverage provider-native prompt caching from OpenAI, Anthropic, or Google for automatic input token discounts.
- Validate output quality against your evaluation set after each optimization to ensure accuracy holds.
Why Standard Prompting Is Burning Your Budget
LLM cost optimization is fundamentally a token economics problem. Every API call to OpenAI, Anthropic, or Google Gemini bills by the token, and most production systems send far more tokens than the task actually requires. Verbose system prompts padded with hedge language, repeated context across conversation turns, unconstrained output lengths, and chain-of-thought reasoning left enabled in production all contribute to bills that run two to three times higher than necessary.
This tutorial covers four distinct techniques for reducing that waste: prompt compression, semantic caching, chain-of-thought pruning, and output length constraints. When combined, these techniques can reduce LLM API costs by as much as 63%, though the exact figure depends on use case, model selection, and traffic patterns. The techniques are not theoretical. Each section includes working code examples in Python and Node.js that target the OpenAI and Anthropic APIs directly, with measured token counts showing the before and after.
The audience here is developers already calling LLM APIs in production or at scale, not those experimenting with chat completions for the first time.
Understanding Token Economics Across Providers
How OpenAI, Anthropic, and Google Gemini Price Tokens
All three major providers split billing into input tokens and output tokens, but the ratio between them varies. Output tokens cost more than input tokens, by a factor of 2x to 5x depending on the model. For GPT-4o, OpenAI charges $2.50 per million input tokens and $10.00 per million output tokens, a 4x ratio. Anthropic's Claude 3.5 Sonnet is priced at $3.00 per million input and $15.00 per million output, a 5x ratio. Google's Gemini 1.5 Flash costs roughly 33x less than GPT-4o on both input ($0.075 per million) and output ($0.30 per million) for prompts under 128K tokens.
Note: All pricing figures in this article are as of the time of writing. Verify current pricing at openai.com/pricing, anthropic.com/pricing, and Google's Generative AI pricing page before running cost projections.
This asymmetry has a direct consequence for optimization priority: reducing output tokens yields disproportionately larger cost savings per token eliminated.
Each provider also offers cached token discounts. OpenAI's automatic prompt caching provides a 50% discount on cached input tokens. Anthropic's explicit prompt caching offers a 90% discount on cache reads (though cache writes cost 25% more than base input). Google Gemini's context caching charges about 25% of the standard input rate for cached content.
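To make the input/output asymmetry concrete, here is a minimal sketch of the per-call cost arithmetic using the GPT-4o list prices quoted above. The function name `call_cost` and the token counts are illustrative assumptions, not part of any provider SDK.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price: float, output_price: float) -> float:
    """Return the dollar cost of one call; prices are dollars per million tokens."""
    return (input_tokens / 1_000_000) * input_price \
        + (output_tokens / 1_000_000) * output_price


# GPT-4o prices as quoted in this article (verify current pricing before use).
GPT4O_INPUT, GPT4O_OUTPUT = 2.50, 10.00

# Hypothetical call sizes, chosen only for illustration.
base = call_cost(1000, 500, GPT4O_INPUT, GPT4O_OUTPUT)
fewer_input = call_cost(900, 500, GPT4O_INPUT, GPT4O_OUTPUT)    # trim 100 input tokens
fewer_output = call_cost(1000, 400, GPT4O_INPUT, GPT4O_OUTPUT)  # trim 100 output tokens

print(f"Saved by trimming 100 input tokens:  ${base - fewer_input:.6f}")
print(f"Saved by trimming 100 output tokens: ${base - fewer_output:.6f}")
```

At a 4x price ratio, each output token removed is worth four input tokens removed, which is why the later sections prioritize output-side techniques.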
Where Tokens Are Wasted in a Typical API Call
Four categories account for the bulk of unnecessary token spend:
- System prompt bloat. Instructions contain filler phrases, excessive examples, and redundant guardrails that often double the prompt length without improving output quality.
- Repeated context across conversation turns. Multi-turn flows resend the same background information with every request.
- Uncontrolled output verbosity. When output length is not capped, models generate explanations, caveats, and preambles that the consuming application immediately discards.
- Chain-of-thought reasoning left active in production. Lengthy intermediate reasoning steps that served their purpose during development add no value in a deployed pipeline.
Technique 1: Prompt Compression
What Prompt Compression Means in Practice
Prompt compression reduces the token count of a prompt while preserving the information the model needs to produce an accurate response. There are two categories. Lossy compression removes content entirely, such as dropping optional examples or eliminating edge-case instructions that apply to a small fraction of requests. Lossless compression rephrases the same content more concisely, such as converting prose instructions into structured YAML or JSON, or replacing multi-sentence explanations with terse directives.
Compression hurts quality when it removes disambiguation that the model genuinely needs. For tasks with narrow, well-defined outputs like entity extraction or classification, aggressive compression is safe. For tasks requiring nuanced judgment, such as open-ended writing or complex reasoning, over-compression can degrade results. Monitor output quality metrics (F1 score for extraction, human evaluation scores for generation) alongside token counts; if quality drops more than 2-3% on your eval set, you have compressed too far.
Manual Prompt Compression Techniques
Three manual techniques yield the largest gains with the least risk:
- Eliminate hedge language and politeness tokens. Phrases like “Please kindly ensure that you carefully consider” become “Ensure.”
- Consolidate multi-sentence instructions into structured formats. A five-sentence paragraph explaining a desired JSON output shape becomes the JSON schema itself, which is both shorter and more precise.
- Use reference tokens instead of repeating context. Rather than restating a product description in both the system prompt and the user message, define it once and refer to it by label.
Programmatic Prompt Compression with LLMLingua
Microsoft Research's LLMLingua approach uses a small language model to identify and remove the tokens in a prompt that contribute least to the model's ability to produce correct outputs. The original LLMLingua scores tokens by perplexity and prunes the low-information ones; LLMLingua-2, which the example here uses, instead relies on a BERT-style token classifier to decide which tokens to keep.
Install the required dependencies first:
pip install openai "llmlingua>=0.2.2" numpy
Note: The first run will download a transformer model checkpoint (~500MB) from Hugging Face. Ensure sufficient disk space and allow several minutes for the download.
Note: The checkpoint microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank used below is optimized for meeting transcripts (MeetingBank dataset). Validate compressed output quality on your domain before production use. For other text types, evaluate alternative LLMLingua-2 checkpoints and compare entity extraction accuracy before and after compression.
import time

from llmlingua import PromptCompressor
from openai import OpenAI, RateLimitError, APIError

client = OpenAI()

original_prompt = """You are an expert product review analyst. Your job is to carefully
read product reviews submitted by customers and extract structured information from them.
You should identify the key entities mentioned in the review, including product names,
brand names, and specific features that the reviewer discusses. Please make sure to
consider both positive and negative sentiments expressed about each entity. When you
find an entity, classify it into one of the following categories: product, brand, or
feature. Also determine the sentiment as positive, negative, or neutral. Return your
analysis as a JSON object with an array called 'entities', where each entity has the
fields 'name', 'type', and 'sentiment'. Be thorough but concise in your extraction.
Do not include entities that are only mentioned in passing without any opinion expressed.
Focus on entities where the reviewer has expressed a clear opinion or evaluation.
Ensure that your JSON is valid and properly formatted. Do not include any explanation
or commentary outside the JSON object. Only return the JSON.
You should handle reviews in English. If the review contains multiple products being
compared, extract entities for all of them. If a feature is mentioned for multiple
products, create separate entity entries for each product-feature combination.
Make sure that entity names are normalized; for example, use the full brand name rather
than abbreviations when possible. If the reviewer uses slang or informal language,
interpret it to the best of your ability and use standard terminology in your output."""

# LLMLingua-2 compressor; the checkpoint is downloaded on first use.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True
)

# Keep roughly 40% of the tokens and never drop the output-format keywords.
compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.4,
    force_tokens=["JSON", "entities", "name", "type", "sentiment"]
)
compressed_prompt = compressed["compressed_prompt"]

# Result keys can differ between library versions, so read them defensively.
origin_tokens = compressed.get("origin_tokens", "UNVERIFIED")
compressed_tokens = compressed.get("compressed_tokens", "UNVERIFIED")
ratio = compressed.get("compressed_tokens_ratio", "UNVERIFIED")
print(f"Available keys: {list(compressed.keys())}")
print(f"Original tokens: {origin_tokens}")
print(f"Compressed tokens: {compressed_tokens}")
print(f"Compression ratio: {ratio}")

# Call the API with exponential backoff on rate limits.
max_retries = 3
response = None
for attempt in range(max_retries):
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": compressed_prompt},
                {"role": "user", "content": "The new Sony WH-1000XM5 headphones have amazing noise cancellation but the build quality feels cheaper than the XM4. Battery life is stellar though."}
            ]
        )
        break
    except RateLimitError:
        wait = 2 ** attempt
        print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
        time.sleep(wait)
    except APIError as e:
        print(f"API error on attempt {attempt + 1}: {e}")
        if attempt == max_retries - 1:
            raise

if response is None:
    raise RuntimeError("Exceeded max retries for OpenAI API call")
if response.usage is None:
    raise ValueError("response.usage is None: streaming mode is not supported here")

print(f"Prompt tokens used: {response.usage.prompt_tokens}")
print(f"Completion tokens used: {response.usage.completion_tokens}")
print(response.choices[0].message.content)
The force_tokens parameter ensures that critical terms survive the compression pass. With a rate of 0.4, the compressed prompt retains about 200 tokens from the original ~500 while preserving the extraction instructions and output format requirements.
Measuring Compression Impact
Systematic measurement requires logging token usage on every call and comparing against a known baseline.
Note: These JavaScript examples use top-level await and require Node.js 14.8+ with ES modules. Add "type": "module" to your package.json or wrap the code in (async () => { ... })();.
npm install openai @anthropic-ai/sdk
import OpenAI from "openai";

const openai = new OpenAI();

// Dollars per million tokens; verify against current provider pricing.
const PRICING = {
  "gpt-4o": { input: 2.5, output: 10.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

// Runs a chat completion, retries on HTTP 429, and logs token usage and cost.
async function trackedCompletion(model, messages, label = "default") {
  const pricing = PRICING[model];
  if (!pricing) {
    throw new Error(
      `Model "${model}" not found in PRICING table. ` +
      `Add it or verify the model name. Known models: ${Object.keys(PRICING).join(", ")}`
    );
  }

  let response;
  const MAX_RETRIES = 3;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      response = await openai.chat.completions.create({ model, messages });
      break;
    } catch (err) {
      if (err?.status === 429 && attempt < MAX_RETRIES - 1) {
        const wait = Math.pow(2, attempt) * 1000;
        console.warn(`[${label}] Rate limited. Retrying in ${wait}ms`);
        await new Promise(r => setTimeout(r, wait));
      } else {
        throw err;
      }
    }
  }

  if (!response?.usage) {
    throw new Error(`[${label}] response.usage is null: check for streaming mode`);
  }

  const { prompt_tokens, completion_tokens } = response.usage;
  const inputCost = (prompt_tokens / 1_000_000) * pricing.input;
  const outputCost = (completion_tokens / 1_000_000) * pricing.output;
  const totalCost = inputCost + outputCost;

  console.log(`[${label}] Model: ${model}`);
  console.log(`  Prompt tokens: ${prompt_tokens}`);
  console.log(`  Completion tokens: ${completion_tokens}`);
  console.log(`  Input cost: $${inputCost.toFixed(6)}`);
  console.log(`  Output cost: $${outputCost.toFixed(6)}`);
  console.log(`  Total cost: $${totalCost.toFixed(6)}`);

  return { response, prompt_tokens, completion_tokens, totalCost };
}

const baseline = await trackedCompletion(
  "gpt-4o",
  [
    { role: "system", content: "Your original 500-token system prompt here..." },
    { role: "user", content: "Review text here..." },
  ],
  "baseline"
);

const compressed = await trackedCompletion(
  "gpt-4o",
  [
    { role: "system", content: "Your compressed 200-token prompt here..." },
    { role: "user", content: "Review text here..." },
  ],
  "compressed"
);

const savings = ((baseline.totalCost - compressed.totalCost) / baseline.totalCost) * 100;
console.log(`\nCost reduction: ${savings.toFixed(1)}%`);
You can drop this wrapper into any production pipeline to continuously monitor token spend and validate that compression delivers the expected savings.
Technique 2: Semantic Caching
What Semantic Caching Is and How It Differs from Exact-Match Caching
Exact-match caching returns a stored result only when the incoming request is identical, character for character, to a previously seen request. Semantic caching uses embedding-based similarity to recognize that “What's the capital of France?” and “Tell me France's capital city” should return the same cached response. This increases cache hit rates considerably for applications where users phrase similar questions in different ways.
Provider-native caching and application-layer semantic caching solve different problems. OpenAI's and Anthropic's prompt caching discounts the cost of resending identical prompt prefixes. Application-layer semantic caching avoids the API call entirely when a sufficiently similar query has already been answered.
Implementing Application-Layer Semantic Caching
Note: The in-memory cache below is for demonstration only and is not production-safe. It has no TTL and uses a simple size cap for eviction, so it handles neither expiration nor more sophisticated eviction strategies. For production use, replace it with Redis (using RediSearch for vector similarity) or a dedicated vector database with TTL and eviction configured.
import threading
import time

import numpy as np
from openai import OpenAI, RateLimitError, APIError

client = OpenAI()

_cache_lock = threading.Lock()
_cache: list[dict] = []
CACHE_MAX_SIZE = 10_000
SIMILARITY_THRESHOLD = 0.95


def get_embedding(text: str) -> np.ndarray:
    """Embed a query with OpenAI's text-embedding-3-small model."""
    result = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(result.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity that returns 0.0 for zero-length vectors."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))


def cached_completion(user_query: str, system_prompt: str, model: str = "gpt-4o") -> str:
    query_embedding = get_embedding(user_query)

    # Linear scan: return the first cached answer that is similar enough.
    with _cache_lock:
        for entry in _cache:
            similarity = cosine_similarity(query_embedding, entry["embedding"])
            if similarity >= SIMILARITY_THRESHOLD:
                print(f"Cache HIT (similarity: {similarity:.4f})")
                return entry["response"]

    print("Cache MISS, calling API")
    response = None
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_query},
                ]
            )
            break
        except RateLimitError:
            wait = 2 ** attempt
            print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        except APIError as e:
            print(f"API error on attempt {attempt + 1}: {e}")
            if attempt == max_retries - 1:
                raise

    if response is None:
        raise RuntimeError("Exceeded max retries for API call")
    if response.usage is None:
        raise ValueError("response.usage is None: streaming mode is not supported here")

    result = response.choices[0].message.content

    # Store the new answer, evicting the oldest entry once the cache is full.
    with _cache_lock:
        if len(_cache) >= CACHE_MAX_SIZE:
            _cache.pop(0)
        _cache.append({
            "embedding": query_embedding,
            "query": user_query,
            "response": result
        })
    return result


result1 = cached_completion(
    "What are the main features of the iPhone 15 Pro?",
    "You are a product expert. Answer concisely."
)
result2 = cached_completion(
    "Tell me the key features of Apple's iPhone 15 Pro",
    "You are a product expert. Answer concisely."
)
For production use, replacing the in-memory list with Redis and its vector search capability (RediSearch) or with a dedicated vector database provides persistence and scalability. The embedding call itself is very cheap: OpenAI's text-embedding-3-small costs $0.02 per million tokens (as of the time of writing; verify current pricing at openai.com/pricing before projecting costs).
Using Provider-Native Prompt Caching
OpenAI's prompt caching is automatic. When the first 1,024 or more tokens of a prompt match a previous request exactly, the cached tokens are billed at a 50% discount. No code changes are required, but structuring prompts so that the static system instructions appear first and variable content appears last maximizes cache hit rates.
Note: OpenAI's automatic prompt caching only activates when the matching prompt prefix is at least 1,024 tokens. Prompts shorter than this threshold will not benefit from caching.
Anthropic's prompt caching is explicit and offers steeper discounts. Cache reads cost 90% less than base input pricing. Cache writes cost 25% more, which matters for low-traffic deployments where cache writes may outnumber reads. The developer places cache_control breakpoints to mark which prompt segments should be cached.
Note: Anthropic requires the cached segment to be at least 1,024 tokens for cache_control to take effect. The example below uses a shortened prompt for readability; in practice, expand or combine segments to meet the ≥1,024 token threshold. Confirm that caching activated by checking cache_creation_input_tokens > 0 in the response.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

const systemPrompt = `You are an expert product review analyst. Extract entities
from reviews as JSON with fields: name, type (product/brand/feature), sentiment
(positive/negative/neutral). Return only valid JSON. Handle comparisons by creating
separate entries. Normalize entity names to full brand names.`;

async function analyzeReview(reviewText) {
  let response;
  try {
    response = await anthropic.messages.create({
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 1024,
      // Mark the static system prompt as a cacheable prefix.
      system: [
        {
          type: "text",
          text: systemPrompt,
          cache_control: { type: "ephemeral" },
        },
      ],
      messages: [{ role: "user", content: reviewText }],
    });
  } catch (err) {
    if (err?.status === 429) {
      console.warn("Rate limited by Anthropic. Implement retry logic for production use.");
    }
    throw err;
  }

  console.log("Input tokens:", response.usage.input_tokens);
  console.log("Cache creation tokens:", response.usage.cache_creation_input_tokens || 0);
  console.log("Cache read tokens:", response.usage.cache_read_input_tokens || 0);

  if (!response.content || response.content.length === 0 || response.content[0].type !== "text") {
    throw new Error("Unexpected response content format from Anthropic API");
  }
  return response.content[0].text;
}

await analyzeReview("The Sony WH-1000XM5 has great ANC but feels flimsy.");
await analyzeReview("Samsung Galaxy S24 Ultra camera is incredible, battery is mediocre.");
await analyzeReview("MacBook Pro M3 performance is outstanding but it runs hot.");
Anthropic's cached prompt content has a minimum length requirement of 1,024 tokens and a default time-to-live of 5 minutes, which is refreshed each time the cached content is read. For high-throughput applications making multiple calls per minute with the same system prompt, the 90% read discount accumulates quickly. In low-traffic scenarios, be aware that cache writes cost 25% more than standard input pricing, so infrequent usage patterns may not see net savings from caching.
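The write premium and read discount above imply a break-even point that is easy to work out. The following is a minimal sketch of that arithmetic, assuming the 1.25x write and 0.10x read multipliers quoted in this article; the function name `cached_cost_ratio` is hypothetical, and the calculation covers only the cached prefix tokens, not the rest of the request.

```python
def cached_cost_ratio(reads_per_write: float,
                      write_multiplier: float = 1.25,
                      read_multiplier: float = 0.10) -> float:
    """Cost of the cached prefix relative to sending it uncached every time.

    Assumes one cache write is followed by `reads_per_write` cache reads
    before the entry expires. A result below 1.0 means caching is cheaper.
    """
    cached = write_multiplier + read_multiplier * reads_per_write
    uncached = 1.0 + reads_per_write  # every request pays the full input price
    return cached / uncached


for reads in (0, 1, 5, 20):
    ratio = cached_cost_ratio(reads)
    print(f"{reads:>2} reads per write -> {ratio:.2f}x of the uncached prefix cost")
```

Under these assumptions, a cache entry that is never read costs 25% more than not caching at all, while a single read before expiry is already enough to come out ahead.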
Cache Invalidation and Freshness
Set TTLs based on how frequently the underlying data or instructions change. For static system prompts, long TTLs or no expiration are appropriate. For queries against rapidly changing data, such as real-time pricing or inventory, semantic caching introduces the risk of stale responses. User-specific dynamic queries with personal context should bypass the cache entirely.
Technique 3: Chain-of-Thought Pruning for Production
Why CoT Reasoning Inflates Output Costs
Chain-of-thought prompting is valuable during development and evaluation because it makes the model's reasoning auditable. In production, however, downstream systems consume only the final answer. CoT reasoning can inflate output length by 3x to 5x (a commonly observed range that varies by task), and since output tokens carry the highest per-token price, this represents a 3x to 5x increase in output cost that adds no value to the deployed system.
Techniques for Pruning CoT in Production
The most direct approach is to instruct the model to return only the final answer. Combining this with structured output mode (JSON) constrains the response shape and eliminates explanatory prose.
Anthropic's extended thinking feature (available on Claude 3.7 Sonnet and later compatible models) provides a budget_tokens parameter that caps the number of tokens the model can spend on internal reasoning. Verify model support in Anthropic's extended thinking documentation before use. This allows controlled reasoning depth without unbounded output expansion.
import time

from openai import OpenAI, RateLimitError, APIError

client = OpenAI()

review = """The Bose QuietComfort Ultra earbuds deliver exceptional sound quality
with deep bass and clear highs. The noise cancellation is top-tier, rivaling
over-ear headphones. However, the fit can be uncomfortable during long sessions,
and the case is unnecessarily bulky. Battery life of 6 hours is decent but not
class-leading. At $299, they're expensive but justified for audiophiles."""

max_retries = 3

# Variant 1: chain-of-thought enabled, unconstrained output.
cot_response = None
for attempt in range(max_retries):
    try:
        cot_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract product entities with sentiment. Think step by step."},
                {"role": "user", "content": review}
            ]
        )
        break
    except RateLimitError:
        wait = 2 ** attempt
        print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
        time.sleep(wait)
    except APIError as e:
        print(f"API error on attempt {attempt + 1}: {e}")
        if attempt == max_retries - 1:
            raise

if cot_response is None:
    raise RuntimeError("Exceeded max retries for CoT API call")
if cot_response.usage is None:
    raise ValueError("cot_response.usage is None: streaming mode is not supported here")

# Variant 2: direct JSON answer with a hard cap on completion tokens.
direct_response = None
for attempt in range(max_retries):
    try:
        direct_response = client.chat.completions.create(
            model="gpt-4o",
            max_completion_tokens=256,
            response_format={"type": "json_object"},
            messages=[
                # Single-quoted so the embedded JSON example can use double quotes.
                {"role": "system", "content": 'Extract entities as JSON: {"entities": [{"name": str, "type": str, "sentiment": str}]}. Return ONLY the JSON.'},
                {"role": "user", "content": review}
            ]
        )
        break
    except RateLimitError:
        wait = 2 ** attempt
        print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
        time.sleep(wait)
    except APIError as e:
        print(f"API error on attempt {attempt + 1}: {e}")
        if attempt == max_retries - 1:
            raise

if direct_response is None:
    raise RuntimeError("Exceeded max retries for direct API call")
if direct_response.usage is None:
    raise ValueError("direct_response.usage is None: streaming mode is not supported here")

print(f"CoT output tokens: {cot_response.usage.completion_tokens}")
print(f"Direct output tokens: {direct_response.usage.completion_tokens}")

OUTPUT_PRICE_PER_MILLION = 10.0  # GPT-4o output price; verify before use
cot_cost = (cot_response.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION
direct_cost = (direct_response.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION
print(f"CoT output cost: ${cot_cost:.6f}")
print(f"Direct output cost: ${direct_cost:.6f}")
The CoT version returns several paragraphs of reasoning followed by the extraction, while the direct version returns only the JSON object. On a task like this, expect a 3x or greater difference in output token count.
Keeping CoT for Debugging Without Paying for It
A practical pattern is to gate CoT behind an environment variable or feature flag. Enable CoT during development and in error-analysis pipelines, and disable it in production. When production errors surface for investigation, replay the specific failing input with CoT enabled, generating the reasoning trace on demand rather than on every request.
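The gating pattern described above can be sketched in a few lines. This is an illustrative sketch only: the environment variable name `ENABLE_COT`, the prompt wording, and the helper `build_system_prompt` are all assumptions introduced here, not part of any SDK.

```python
import os
from typing import Mapping, Optional

BASE_INSTRUCTION = "Extract product entities with sentiment."


def build_system_prompt(env: Optional[Mapping[str, str]] = None) -> str:
    """Append a reasoning instruction only when ENABLE_COT is set to "1".

    `env` defaults to the process environment; passing a dict makes the
    function easy to exercise in tests or when replaying a failing input.
    """
    env = os.environ if env is None else env
    if env.get("ENABLE_COT", "0") == "1":
        return BASE_INSTRUCTION + " Think step by step, then give the final answer."
    return BASE_INSTRUCTION + " Return ONLY the final JSON, with no reasoning."


# Production default: reasoning disabled.
print(build_system_prompt({"ENABLE_COT": "0"}))
# Debug replay of a failing input: reasoning enabled on demand.
print(build_system_prompt({"ENABLE_COT": "1"}))
```

Keeping the switch in one helper means the debugging replay path and the production path share every other part of the request.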
Technique 4: Output Length Constraints
Using max_tokens / max_completion_tokens Strategically
Most developers leave the maximum output length unset, allowing the model to generate as many tokens as it deems appropriate. This is expensive. For tasks with predictable output shapes, such as classification, extraction, or short-answer responses, setting a ceiling prevents runaway generation.
The parameter names differ by provider: OpenAI uses max_completion_tokens, Anthropic uses max_tokens, and Google Gemini uses maxOutputTokens. To find the right ceiling, sample outputs from representative inputs during development and set the limit at 1.5x to 2x the observed p95 output length (the 95th percentile, i.e., the length exceeded by only 5% of outputs in your sample).
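The p95 rule of thumb above can be turned into a small helper. This is a minimal sketch using the nearest-rank percentile definition; the function name `suggest_max_tokens` and the sample lengths are made up for illustration.

```python
import math


def suggest_max_tokens(observed_lengths: list[int], headroom: float = 1.5) -> int:
    """Return `headroom` times the nearest-rank p95 of observed completion lengths."""
    if not observed_lengths:
        raise ValueError("need at least one observed output length")
    ordered = sorted(observed_lengths)
    # Nearest-rank p95: ceil(0.95 * n), computed with integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    return math.ceil(p95 * headroom)


# Hypothetical completion-token counts logged during development.
sample = [88, 92, 95, 101, 104, 110, 112, 118, 120, 131,
          97, 99, 103, 107, 109, 115, 90, 93, 125, 140]

print("Suggested ceiling at 1.5x:", suggest_max_tokens(sample))
print("Suggested ceiling at 2.0x:", suggest_max_tokens(sample, headroom=2.0))
```

A ceiling set this way still truncates the rare outlier, so it is worth logging the finish reason of each response and revisiting the limit if truncations start to appear.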
Structured Output as a Cost Control Mechanism
Function calling and tool use schemas act as implicit output constraints. When the model must conform to a defined schema, it cannot generate preambles, explanations, or unnecessary fields. Note that when using tool_choice to force a function call, the model's response content will be null; the actual payload is in tool_calls[0].function.arguments, which must be parsed as JSON.
import OpenAI from "openai";

const openai = new OpenAI();
const OUTPUT_PRICE_PER_MILLION = 10.0; // GPT-4o output price; verify before use

const review = `The Dyson V15 Detect has incredible suction power and the laser dust
detection is genuinely useful. But at $750 it's overpriced, and the battery only
lasts 25 minutes on max power. The attachments are well-designed.`;

// Variant 1: unconstrained prose response.
let proseResponse;
try {
  proseResponse = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "Extract product entities with sentiment from this review." },
      { role: "user", content: review },
    ],
  });
} catch (err) {
  if (err?.status === 429) {
    console.warn("Rate limited. Implement retry logic for production use.");
  }
  throw err;
}
if (!proseResponse?.usage) {
  throw new Error("proseResponse.usage is null: check for streaming mode");
}

// Variant 2: forced function call that must conform to the schema.
let structuredResponse;
try {
  structuredResponse = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "Extract product entities with sentiment." },
      { role: "user", content: review },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "extract_entities",
          description: "Extract entities from a product review",
          parameters: {
            type: "object",
            properties: {
              entities: {
                type: "array",
                items: {
                  type: "object",
                  properties: {
                    name: { type: "string" },
                    type: { type: "string", enum: ["product", "brand", "feature"] },
                    sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
                  },
                  required: ["name", "type", "sentiment"],
                },
              },
            },
            required: ["entities"],
          },
        },
      },
    ],
    tool_choice: { type: "function", function: { name: "extract_entities" } },
  });
} catch (err) {
  if (err?.status === 429) {
    console.warn("Rate limited. Implement retry logic for production use.");
  }
  throw err;
}
if (!structuredResponse?.usage) {
  throw new Error("structuredResponse.usage is null: check for streaming mode");
}

// With a forced tool call, the payload is in tool_calls, not message.content.
const message = structuredResponse.choices[0].message;
if (!message.tool_calls || message.tool_calls.length === 0) {
  throw new Error("No tool_calls returned. Check tool_choice config.");
}
const rawArgs = message.tool_calls[0].function.arguments;
let entities;
try {
  entities = JSON.parse(rawArgs).entities;
} catch (e) {
  throw new Error(`Failed to parse tool arguments as JSON: ${rawArgs}`);
}

console.log(`Prose completion tokens: ${proseResponse.usage.completion_tokens}`);
console.log(`Structured completion tokens: ${structuredResponse.usage.completion_tokens}`);
console.log("Extracted entities:", entities);

const proseCost = (proseResponse.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION;
const structuredCost = (structuredResponse.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION;
console.log(`Prose output cost: $${proseCost.toFixed(6)}`);
console.log(`Structured output cost: $${structuredCost.toFixed(6)}`);
The structured response constrains the model to populating only the defined fields, while the prose response includes introductory text, explanations of each entity, and a closing summary. In practice, structured output produces 2x to 4x fewer tokens than unconstrained prose for extraction tasks. Run the code above on your own inputs and log the difference.
Cost Comparison Table: Before and After Across Five Models
The following table shows estimated costs for a standardized task, extracting three entities from a two-paragraph product review, run 1,000 times. Baseline uses a verbose 500-token system prompt with unconstrained output. Optimized uses a compressed 200-token prompt with structured JSON output. Token columns are per call; cost columns are per 1,000 calls.
Note on pricing: GPT-4o: $2.50/$10.00 per million input/output tokens. GPT-4o mini: $0.15/$0.60. Claude 3.5 Sonnet: $3.00/$15.00. Claude 3.5 Haiku (Anthropic's lower-cost model tier): $0.80/$4.00. Gemini 1.5 Flash: $0.075/$0.30 (under 128K tokens). All prices are as of the time of writing; verify at each provider's pricing page before projecting costs.
| Model | Baseline Input (tokens) | Compressed Input (tokens) | Baseline Output (tokens) | Constrained Output (tokens) | Baseline Cost / 1K Calls | Optimized Cost / 1K Calls | Savings |
|---|---|---|---|---|---|---|---|
| GPT-4o | 580 | 280 | 350 | 120 | $4.95 | $1.90 | 62% |
| GPT-4o mini | 580 | 280 | 350 | 120 | $0.30 | $0.11 | 63% |
| Claude 3.5 Sonnet | 580 | 280 | 350 | 120 | $6.99 | $2.64 | 62% |
| Claude 3.5 Haiku | 580 | 280 | 350 | 120 | $1.86 | $0.70 | 62% |
| Gemini 1.5 Flash | 580 | 280 | 350 | 120 | $0.15 | $0.06 | 60% |
The savings percentages are consistent by construction, since the token reductions are fixed and pricing scales linearly. The small spread between rows (60% to 63%) mostly reflects rounding the displayed costs to the nearest cent; computed from unrounded costs, every model lands at roughly 62%. Models with a higher output-to-input price ratio, like Claude 3.5 Sonnet at 5x, benefit slightly more, because output tokens shrink proportionally more (350 to 120) than input tokens (580 to 280), and their absolute dollar savings are larger because their base prices are higher. The Gemini 1.5 Flash savings, while proportionally similar, represent a much smaller absolute dollar figure because the base pricing is already very low. These figures do not include additional savings from semantic caching, which can further reduce costs in proportion to the cache hit rate.
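The table's arithmetic can be reproduced with a few lines of Python. This is a minimal sketch that takes the per-million prices from the pricing note and the per-call token counts from the table; the names `PRICES`, `cost_per_batch`, and `savings_table` are illustrative and do not come from any provider SDK.

```python
# Prices in USD per million tokens (input, output), copied from the pricing note.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}

RUNS = 1_000
BASELINE = (580, 350)   # (input tokens, output tokens) per call
OPTIMIZED = (280, 120)


def cost_per_batch(tokens, prices, runs=RUNS):
    """Return the USD cost of `runs` calls with the given per-call token counts."""
    input_tokens, output_tokens = tokens
    input_price, output_price = prices
    return runs * (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def savings_table():
    """Map each model to (baseline cost, optimized cost, fractional savings)."""
    rows = {}
    for model, prices in PRICES.items():
        baseline = cost_per_batch(BASELINE, prices)
        optimized = cost_per_batch(OPTIMIZED, prices)
        rows[model] = (baseline, optimized, (baseline - optimized) / baseline)
    return rows


if __name__ == "__main__":
    for model, (baseline, optimized, savings) in savings_table().items():
        print(f"{model:<18} ${baseline:.2f} -> ${optimized:.2f}  ({savings:.1%} saved)")
```

Because the script works from unrounded costs, its percentages can differ by a point or so from a table that derives savings from costs already rounded to the cent. Swap in current prices and your own measured token counts before relying on the output.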
Combining All Four Techniques: A Real-World Optimization Pipeline
Recommended Order of Operations
Apply the techniques in order of effort-to-impact ratio:
- Compress prompts. This delivers the largest input savings and takes the least effort, since you only rewrite prompts.
- Constrain outputs using max_completion_tokens (OpenAI) or max_tokens (Anthropic) and structured output schemas. This targets the most expensive token category with minimal code changes.
- Prune chain-of-thought for production. This requires a conditional flag but yields 3x to 5x output token reductions.
- Add semantic caching. This demands the most infrastructure (embedding generation, a vector store) but delivers the highest long-term savings at scale because it eliminates API calls entirely.
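One way the four steps could fit together at request time is sketched below. It is an illustration under stated assumptions, not a reference implementation: `embed` and `call_llm` are hypothetical callables you would back with a real embedding model and provider SDK, the in-memory list stands in for a vector store, and the 0.92 threshold and 150-token cap are placeholder values to tune against your own evaluation set. Note that although caching is the last technique to adopt, it runs first on each request.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class OptimizedPipeline:
    embed: Callable[[str], Vector]            # hypothetical embedding function
    call_llm: Callable[[str, str, int], str]  # hypothetical: (system, user, max_tokens) -> text
    system_prompt: str                        # step 1: an already-compressed prompt
    max_output_tokens: int = 150              # step 2: output cap (placeholder value)
    production: bool = True                   # step 3: prune chain-of-thought when True
    similarity_threshold: float = 0.92        # step 4: cache-hit cutoff (placeholder value)
    _cache: List[Tuple[Vector, str]] = field(default_factory=list)

    def _lookup(self, vector: Vector) -> Optional[str]:
        """Return the cached answer of the most similar past query, if close enough."""
        best_score, best_answer = 0.0, None
        for cached_vector, cached_answer in self._cache:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_answer = score, cached_answer
        return best_answer if best_score >= self.similarity_threshold else None

    def answer(self, query: str) -> str:
        vector = self.embed(query)
        cached = self._lookup(vector)
        if cached is not None:
            return cached  # cache hit: no API call is made
        system = self.system_prompt
        if self.production:
            system += " Return only the final answer."
        result = self.call_llm(system, query, self.max_output_tokens)
        self._cache.append((vector, result))
        return result
```

The linear scan over `_cache` is fine for a demonstration but would need replacing with an indexed vector store at production volume.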
Estimating Your Savings
The savings formula: (baseline_cost - optimized_cost) / baseline_cost. As an estimate based on the token reductions demonstrated above, prompt compression saves 20% to 40% on input tokens. Output constraints save 30% to 50% on output tokens. Caching saves in proportion to hit rate: even a 30% hit rate eliminates nearly a third of all API calls.
The 60%+ combined figure is realistic when at least three of the four techniques target a workload with repeated query patterns and predictable output shapes. Workloads with highly unique queries and variable-length outputs will see lower caching benefits but can still achieve 40% to 50% savings from compression and output constraints alone.
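The savings formula can be turned into a small estimator. The sketch below is a simplification: `estimate_savings` is an illustrative name, and the model assumes a cache hit costs nothing, which ignores embedding and lookup overhead.

```python
def estimate_savings(
    input_tokens: int,
    output_tokens: int,
    input_price: float,
    output_price: float,
    input_reduction: float = 0.0,
    output_reduction: float = 0.0,
    cache_hit_rate: float = 0.0,
) -> float:
    """Return (baseline_cost - optimized_cost) / baseline_cost.

    Prices are per million tokens; reductions and hit rate are fractions in [0, 1].
    Assumption: cache hits are free (embedding and lookup costs are ignored).
    """
    baseline = input_tokens * input_price + output_tokens * output_price
    per_call = (
        input_tokens * (1 - input_reduction) * input_price
        + output_tokens * (1 - output_reduction) * output_price
    )
    # Only cache misses reach the API, so scale the per-call cost by the miss rate.
    optimized = per_call * (1 - cache_hit_rate)
    return (baseline - optimized) / baseline


if __name__ == "__main__":
    # GPT-4o prices with the upper end of each range quoted in the text.
    combined = estimate_savings(580, 350, 2.50, 10.00, 0.40, 0.50, 0.30)
    no_cache = estimate_savings(580, 350, 2.50, 10.00, 0.40, 0.50)
    print(f"With caching: {combined:.0%}, without caching: {no_cache:.0%}")
```

With a 40% input reduction, a 50% output reduction, and a 30% hit rate at GPT-4o prices, the estimator returns about 63%; dropping the cache leaves about 47%. Substitute measured reductions from your own logs instead of these range endpoints.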
Start With the Lowest-Hanging Fruit
The four techniques covered here (prompt compression, semantic caching, chain-of-thought pruning, and output length constraints) form a practical framework for LLM token optimization that works across providers and models. The highest-priority first step is not implementing any technique but instrumenting token logging on every API call. Without a baseline measurement, savings cannot be quantified or validated.
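A baseline can start as a very small logger. The sketch below keeps records in memory as a stand-in for a real metrics sink; `TokenLog` and its methods are illustrative names, not part of any SDK.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TokenLog:
    """In-memory log of per-call token usage (a stand-in for a real metrics sink)."""
    records: List[Dict] = field(default_factory=list)

    def record(self, model: str, prompt_tokens: int, completion_tokens: int, label: str = "") -> None:
        self.records.append({
            "timestamp": time.time(),
            "model": model,
            "label": label,  # e.g. "baseline" or "compressed", to compare runs
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
        })

    def totals(self) -> Dict[str, int]:
        return {
            "calls": len(self.records),
            "prompt_tokens": sum(r["prompt_tokens"] for r in self.records),
            "completion_tokens": sum(r["completion_tokens"] for r in self.records),
        }

    def cost(self, input_price: float, output_price: float) -> float:
        """USD cost of all logged calls, given prices per million tokens."""
        totals = self.totals()
        return (
            totals["prompt_tokens"] * input_price
            + totals["completion_tokens"] * output_price
        ) / 1_000_000
```

With OpenAI's Chat Completions API the counts come from `response.usage.prompt_tokens` and `response.usage.completion_tokens`; Anthropic's Messages API reports `usage.input_tokens` and `usage.output_tokens`. Confirm the field names against the SDK version you have installed.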
For implementation details, see the LLMLingua repository, OpenAI’s prompt caching guide, Anthropic’s prompt caching documentation, and Google’s context caching reference. Check current pricing on each provider’s pricing page before running cost projections.