“Many developers use the same context repeatedly across multiple API calls when building AI applications, like when making edits to a codebase or having long, multi-turn conversations with a chatbot,” OpenAI explained, adding that the goal is to reduce token consumption when sending a request to the LLM.
What this means is that when a new request comes in, the LLM checks whether parts of the request are already cached. If they are, it uses the cached version; otherwise, it processes the full request.
OpenAI’s new prompt caching capability works on the same basic principle, which can help developers save on both cost and time.
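To illustrate the idea, here is a minimal sketch, assuming the official `openai` Python client: a long, unchanging system prompt is sent as an identical prefix on every call, so subsequent requests can reuse the cached portion. The model name, the `ask` helper, and the usage fields read at the end are illustrative assumptions, not confirmed details of OpenAI's implementation.

```python
# Sketch: reusing the same long context across multiple API calls so the
# shared prefix is a candidate for prompt caching on later requests.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A long, static system prompt acts as the shared prefix that stays
# identical across requests.
SYSTEM_PROMPT = (
    "You are a coding assistant for a large codebase. "
    "Follow the project's style guide and answer concisely. ..."
)

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model; assumption
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # identical prefix on every call
            {"role": "user", "content": question},          # only this part changes
        ],
    )
    # Some API responses report how many prompt tokens were served from the
    # cache; the exact field name here is an assumption and may differ.
    usage = response.usage
    cached = getattr(getattr(usage, "prompt_tokens_details", None), "cached_tokens", None)
    print(f"prompt tokens: {usage.prompt_tokens}, cached: {cached}")
    return response.choices[0].message.content

# Repeated calls share the same prefix, so later calls can benefit from the cache.
print(ask("How do I add a new API route?"))
print(ask("Where is the authentication middleware defined?"))
```

The design point is simply that the static, reusable part of the prompt comes first and the changing part comes last, so the cached prefix stays identical from one request to the next.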