AWS Mistral Model Caching: A How-To Guide

Select Language:

If you’re working with Mistral models, specifically Mistral Large, and want to reduce token usage, caching your prompts is a great way to do it. The idea is to store the common parts of your prompts—like the system prompt—so you don’t have to send them every time. Instead, you can cache these prompts in Bedrock, saving on tokens and improving efficiency.

However, many users run into a problem when trying to implement prompt caching with the Converse API using Boto3. For example, if you include the system prompt in the request like this:

json
system=[
{“text”: _system_prompt},
{“cachePoint”: {“type”: “default”}}
],

you might get an error message like this:

“AccessDeniedException: You invoked an unsupported model or your request did not allow prompt caching.”

This happens because certain models, including Mistral, don’t support prompt caching through this method. The API is designed to support prompt caching for some models, but not all, and unfortunately, Mistral falls into this unsupported category.

The good news is that documentation from AWS indicates that prompt caching is supported for Mistral models. You can check the AWS model card for Mistral Large here. Still, in practice, attempting to use the cachePoint parameter with Mistral models often results in an error.

Interestingly, when testing similar models like Amazon Nova 2 Lite, passing the cachePoint parameter in the invoke_model API does work. It recognizes the cache, and subsequent calls with the same input significantly reduce token usage, confirming that caching can be effective.

So, why doesn’t it work with Mistral? It might be that support for prompt caching is available in the backend, but the specific API calls or models you’re using haven’t implemented this feature yet. As of now, there doesn’t seem to be clear guidance on whether AWS plans to support caching for Mistral in the future.

In summary, while prompt caching can be a useful way to save tokens, it’s not currently supported for Mistral models in the way you might expect. Keep an eye on updates from AWS, as they might introduce support in the future. For now, optimizing your prompts to minimize repetition and data length is your best approach to control token usage with Mistral models.