[FEAT]: Add Support to Anthropic & OpenAI Batch APIs #1622

Open
opened 2026-02-28 05:34:33 -05:00 by deekerman · 2 comments
Owner

Originally created by @MichaelYochpaz on GitHub (Oct 15, 2024).

What would you like to see?

Hey there!
First off, thank you for working on this great project :)

Is it possible to add support for the Batch APIs provided by [Anthropic](https://docs.anthropic.com/en/docs/build-with-claude/message-batches) and [OpenAI](https://platform.openai.com/docs/guides/batch)?
These APIs offer a 50% discount on API calls in exchange for allowing responses to take up to 24 hours (so the providers can run them when their servers aren't overloaded).

This is useful for saving money on calls whose results aren't needed immediately (for example, a request to summarize a book, suggest a design for a piece of software, etc.), especially when using the more expensive models, like Claude Opus and OpenAI o1, with a large context.

The way it works seems to be that once the request is sent, you poll it to check whether it has finished (so it needs to be queried at an interval, for example every minute; maybe make this configurable in the settings), and once the status says it's ready, you fetch the output.

This probably won't be simple, as it requires implementing a new mechanism for waiting on a response (polling) and adding a way to communicate that in the UI (maybe a spinner showing the response hasn't been generated yet), but I do think it would be a great, very useful addition.
Plus, since OpenAI introduced it and Anthropic followed, we might see similar APIs from other providers (which also means that, if implemented, writing generic code that can support similar APIs in the future might be a good idea).
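The generic polling mechanism described above could be sketched roughly as below. This is only an illustration, not anything-llm's actual code: `fetch_status` and `fetch_result` are hypothetical stand-ins for whichever provider SDK calls are used (e.g. retrieving a batch by id from OpenAI or Anthropic), and the status strings are assumptions for the sake of the example.

```python
import time

def poll_batch(fetch_status, fetch_result, interval_s=60,
               timeout_s=24 * 3600, sleep=time.sleep):
    """Poll a batch job until it finishes, then return its output.

    fetch_status:  callable returning a status string such as
                   "in_progress", "completed", or "failed" (illustrative names).
    fetch_result:  callable returning the finished output.
    interval_s:    polling interval; the issue suggests making this
                   configurable in the settings.
    sleep:         injectable for testing.
    """
    waited = 0
    while waited <= timeout_s:
        status = fetch_status()
        if status == "completed":
            return fetch_result()
        if status in ("failed", "expired", "cancelled"):
            raise RuntimeError(f"batch ended with status {status!r}")
        sleep(interval_s)
        waited += interval_s
    raise TimeoutError("batch did not complete within the timeout")
```

Keeping the provider-specific calls behind the two callables is what would let the same loop serve Anthropic, OpenAI, and any future provider that ships a similar batch endpoint.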


@timothycarambat commented on GitHub (Oct 30, 2024):

Following up on this: since the natural expectation when chatting through a UI is to send a request and receive a result in a reasonable timeframe, how could batching help with our use case? Sending a request and getting a response of "we will return a response later" may be frustrating or useless to many users.

Can you expand with a specific use case in mind? I am not seeing clear value for users in getting responses minutes or possibly hours after the request.


@MichaelYochpaz commented on GitHub (Oct 31, 2024):

Hey @timothycarambat, thank you for responding :)

This feature isn't really intended for general chatting with simple, short questions (which are quite cheap anyway), but for more complex ones that include a massive context and might use more expensive models (like o1-preview, for example), where a 50% discount is quite meaningful and could save a few bucks on a single request.

The specific use case I'd like to use it for:

  • A prompt using [repopack](https://github.com/yamadashy/repopack) to add a large codebase (could be 100K+ tokens) as context, asking the AI to generate unit tests for the whole project, suggest a better architecture, etc.

This type of prompt includes a huge context that will cost a meaningful amount of money (especially with expensive models like o1-preview and o1-mini), and I wouldn't mind getting the results a few hours later to save a decent amount of money.
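To put rough numbers on the savings: the rates below are assumptions for illustration (roughly o1-preview's published per-token pricing at the time this was filed, not figures from this thread), applied to a 150K-token prompt with a 10K-token reply.

```python
# Illustrative rates, NOT authoritative pricing.
INPUT_PER_M = 15.00   # USD per 1M input tokens (assumed o1-preview rate)
OUTPUT_PER_M = 60.00  # USD per 1M output tokens (assumed)

def cost_usd(input_tokens, output_tokens, batch_discount=0.0):
    """Estimated cost of one request, optionally with the 50% batch discount."""
    full = (input_tokens / 1e6 * INPUT_PER_M
            + output_tokens / 1e6 * OUTPUT_PER_M)
    return full * (1 - batch_discount)

regular = cost_usd(150_000, 10_000)        # $2.85 per request
batched = cost_usd(150_000, 10_000, 0.5)   # $1.425 with the batch discount
```

Even at these assumed rates, the discount saves over a dollar per request, which adds up quickly when iterating on a large codebase.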

Another possible case that comes to mind is asking it to write a summary about a topic while adding several relevant books / research papers as context.

Now, I don't think this should apply to the entire chat; there should be an option per message (for example, if I have a follow-up question after getting the result and don't want to wait again, I can untick a "batch request" checkbox, and the next message will be sent as a regular API request without the setting that marks it as a batch request).
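The per-message toggle could be routed as in this sketch. Everything here (`send_sync`, `enqueue_batch`, the message dict shape) is hypothetical and exists only to show the branch point between the two send paths:

```python
def dispatch(message, send_sync, enqueue_batch):
    """Route one chat message based on its per-message 'batch' flag.

    message:        dict with 'text' and an optional boolean 'batch' key
                    (the checkbox from the issue; defaults to False).
    send_sync:      callable for the regular, immediate API request.
    enqueue_batch:  callable that submits to the provider's batch endpoint
                    and returns a job id to be polled later.
    """
    if message.get("batch", False):
        job_id = enqueue_batch(message["text"])
        # The UI would show a pending indicator and poll this job id.
        return {"pending": True, "job_id": job_id}
    return {"pending": False, "reply": send_sync(message["text"])}
```

The `pending` flag is what the UI would key off to show the "response not yet generated" spinner suggested earlier in the thread.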
