Async Batch Inference: When Not Needing Real-Time Saves You 90%

There's a quiet pricing decision baked into every cloud GPU product, and most teams pay for it without noticing: you're charged for the GPU to be warm and waiting, not for the work it actually does. Real-time inference needs that. A batch job at 3am does not. The gap between those two worlds is roughly 10x in cost.

This post walks through what asynchronous batch inference actually means, the workloads it fits, and the cost arithmetic that makes it the right call when sub-second response isn't on the critical path.

What "async batch" actually means.

Real-time inference is request and response: the GPU is warm, the model is loaded, the worker is reserved, and the response streams back in milliseconds. You pay for the GPU to be ready whether or not a request of yours lands in that second.

Asynchronous batch inference inverts the model. You submit a job and get a job ID. The work joins a queue. The scheduler matches it to an available worker, one that wasn't going to be used in that moment anyway. The worker pulls the job, runs it, and returns the result. You retrieve the result via webhook or polling.

End to end, a job typically takes 30 seconds to 5 minutes under normal load. For a workload that runs overnight or off the user's critical path, that latency is invisible. For a workload where a human is waiting for the answer right now, it's not the right tool.
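
To make the flow concrete, here is a toy, in-process sketch of the submit, queue, worker, retrieve cycle. It uses only Python's standard `queue` and `threading` modules and is not how MicroDC.ai's scheduler is implemented. The names `submit`, `worker`, `fake_model`, and `run_demo` are illustrative, and the "model" simply upper-cases the prompt.

```python
import queue
import threading
import uuid

jobs = queue.Queue()            # pending work, shared by every worker
results = {}                    # job_id -> output, filled in by workers
results_lock = threading.Lock()


def submit(prompt):
    """Enqueue a job and return its ID right away, without waiting for the result."""
    job_id = str(uuid.uuid4())
    jobs.put((job_id, prompt))
    return job_id


def worker(run_model):
    """Pull jobs until a None sentinel arrives; store each output under its job ID."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        job_id, prompt = item
        output = run_model(prompt)
        with results_lock:
            results[job_id] = output
        jobs.task_done()


def fake_model(prompt):
    # Stand-in for real inference (assumption for the sketch).
    return prompt.upper()


def run_demo(prompts, n_workers=3):
    threads = [threading.Thread(target=worker, args=(fake_model,))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    ids = [submit(p) for p in prompts]   # the caller gets IDs back immediately
    jobs.join()                          # block until every queued job is processed
    for _ in threads:
        jobs.put(None)                   # one sentinel per worker so each one exits
    for t in threads:
        t.join()
    return {job_id: results[job_id] for job_id in ids}
```

The point of the shape is that `submit` never waits on a worker: the caller holds only a job ID, and whichever worker is free next picks the job up.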

Workloads that fit.

The pattern is "human is not waiting for this specific result in real time." That covers a lot more than people initially assume:

Document processing. Summarize tens of thousands of PDFs, extract structure from contracts, normalize OCR output. Submit at 11pm, and the results land by morning standup.
Dataset enrichment. Classify, tag, embed, score records at warehouse scale. Webhook back into the ETL pipeline.
Content generation at scale. Personalized summaries, briefs, translations. Push 10,000 jobs in an hour, get them back in a few hours.
Research and experimentation. Grids over model × prompt × dataset. No reservation, no spin-up time, no idle burn between runs.
Automation pipelines. LLM steps in Airflow, Temporal, n8n. Idempotent retries, signed receipts, queued execution.
Overnight processing. Anything where the deadline is "before business hours start" and not "before this user clicks away."

If your workload is in this list, you're a candidate for batch.

Workloads that don't.

Worth being upfront about the boundary:

User-facing chat. A chatbot needs to start streaming tokens within a second. Don't put a queued job in front of a conversation.
Interactive autocomplete. Code completion, search-as-you-type, anything where milliseconds matter.
Mission-critical real-time control loops. Live monitoring, fraud scoring on a user payment, anything where a delay has a real-world consequence.

For these, use a real-time inference provider. The premium pays for capacity guarantees, and you actually need them.

The cost math.

Take a concrete example. Llama 3.1 8B, 1,000 output tokens per job, 100,000 jobs per month.

Provider model	Per-job cost	Monthly bill
Major cloud serverless inference	~$0.15	~$15,000
Dedicated reserved GPU (H100, full month)	n/a	~$3,500 base + your ops cost
MicroDC.ai async batch	~$0.011	~$1,100

The dedicated GPU is cheaper than serverless if you can keep it utilized. Most teams can't: you'd need a steady ~30 jobs per minute, all month, with no overhead for redeployments, model swaps, or maintenance. Empirically, dedicated-GPU customers run at 15–40% utilization, which means the effective per-job cost is roughly 2.5–7x the rate you'd get at full utilization.

The async batch number is the per-job cost on a marketplace. You don't pay between jobs. You don't pay for the worker to be warm. You don't pay for the model load. You pay for the work that ran. At 100k jobs/month, that's roughly an order of magnitude cheaper than serverless and competitive with dedicated even at perfect utilization — without you having to manage the GPU.
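
If you want to rerun the arithmetic with your own numbers, a few lines of Python will do it. The prices below are the approximate figures from the table; `monthly_bill` and `idle_multiplier` are helper names made up for this sketch, and the multiplier is simply one divided by utilization.

```python
JOBS_PER_MONTH = 100_000

SERVERLESS_PER_JOB = 0.15    # ~$ per job, from the table
BATCH_PER_JOB = 0.011        # ~$ per job, from the table
DEDICATED_MONTHLY = 3_500    # ~$ per month for the reserved H100, before ops cost


def monthly_bill(per_job, jobs=JOBS_PER_MONTH):
    """Monthly spend for a pay-per-job provider."""
    return per_job * jobs


def idle_multiplier(utilization):
    """How much more each job costs on a reserved GPU that is busy only part of the time."""
    return 1 / utilization


serverless = monthly_bill(SERVERLESS_PER_JOB)
batch = monthly_bill(BATCH_PER_JOB)
savings = 1 - batch / serverless

print(f"serverless ${serverless:,.0f}/month, batch ${batch:,.0f}/month, savings {savings:.0%}")
print(f"reserved GPU base cost spread over {JOBS_PER_MONTH:,} jobs: "
      f"${DEDICATED_MONTHLY / JOBS_PER_MONTH:.3f} per job")
for u in (1.0, 0.40, 0.15):
    print(f"utilization {u:.0%}: {idle_multiplier(u):.1f}x the full-utilization per-job cost")
```

Swap in your own job count and quoted prices; the comparison is only as good as those inputs.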

What about burst?

One legitimate worry about batch is variable wait time under load. If the queue backs up, your jobs take longer. Two things help:

First, the queue is multi-region and multi-worker. A flood of jobs from one customer doesn't bottleneck on a single GPU; it spreads across the available pool. The system scales horizontally with the worker network.

Second, batch is for workloads where you don't need a tight latency SLA. If you submit 10,000 jobs and they take three hours instead of two, that's usually fine for a batch use case. If it isn't, you have a real-time workload, not a batch one — pick the right tool.

For workloads that sometimes need real-time and sometimes can wait, the right pattern is two providers: a real-time API for the user-facing path and a batch queue for everything else. In practice, most LLM spend sits in the everything-else bucket.
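
A minimal sketch of that split, assuming each request can say whether a person is blocked on it and, optionally, how long it can wait. The `Request` fields, the `choose_path` function, and the 5-minute threshold (the upper end of the typical range quoted earlier) are illustrative choices, not part of any provider's API.

```python
from dataclasses import dataclass
from typing import Optional

# Upper end of the typical batch turnaround under normal load (assumed threshold).
BATCH_TYPICAL_MAX_SECONDS = 5 * 60


@dataclass
class Request:
    prompt: str
    user_is_waiting: bool                    # True when a person is blocked on this response
    deadline_seconds: Optional[int] = None   # how long the caller can wait, if known


def choose_path(request):
    """Return 'realtime' or 'batch' for a single request."""
    if request.user_is_waiting:
        return "realtime"
    if (request.deadline_seconds is not None
            and request.deadline_seconds < BATCH_TYPICAL_MAX_SECONDS):
        return "realtime"                    # too tight for a queued job
    return "batch"
```

A chat turn would route to `realtime`, while a nightly summarization job with an eight-hour deadline would route to `batch`.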

Submitting your first batch.

Practically, async batch on MicroDC.ai is one API call:

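import os

from microdc import Client, LLMCall

# Read the API key from the environment rather than hard-coding it.
client = Client(api_key=os.environ["MICRODC_API_KEY"])

# Build one job: pick the model, then add the prompt.
job = LLMCall(model="llama-3.1-8b")
job.add_user_message("Summarize the attached contract...")

# Submit the job; you get a job ID back, and the result goes to the callback URL.
job_id = client.send_job(job, callback_url="https://your-api.example.com/done")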

Loop that over your input set and you have a batch. Add a webhook handler at the callback URL and you're hands-off. Or skip the callback and poll with client.wait_for_job(job_id).
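
Here is one way the loop and the webhook side might fit together. Only `send_job` with a `callback_url` argument comes from the snippet above. The `submit_batch` and `record_result` helpers, the `build_job` factory, and the webhook body shape (JSON with `job_id` and `result` keys) are assumptions for illustration, so check the actual payload format before relying on this.

```python
import json


def submit_batch(client, build_job, prompts, callback_url):
    """Send one job per prompt and return {job_id: prompt} for later matching.

    `client` is anything with a send_job(job, callback_url=...) method;
    `build_job` turns a prompt into a job object (hypothetical factory).
    """
    pending = {}
    for prompt in prompts:
        job_id = client.send_job(build_job(prompt), callback_url=callback_url)
        pending[job_id] = prompt
    return pending


def record_result(raw_body, pending, done):
    """Process one webhook delivery; return True if a new result was stored.

    Assumes the body is JSON with "job_id" and "result" keys (hypothetical shape).
    Repeated deliveries and unknown job IDs are ignored, so retries are harmless.
    """
    payload = json.loads(raw_body)
    job_id = payload["job_id"]
    if job_id in done or job_id not in pending:
        return False
    done[job_id] = payload["result"]
    return True
```

In real use, `build_job` would create an `LLMCall` and call `add_user_message` as in the snippet above, and `record_result` would be called from whatever web framework serves your callback URL. Making the handler idempotent matters because a webhook may be delivered more than once.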

For OpenAI-shaped code, the same pattern works through our OpenAI-compatible endpoint — submit a chat completion synchronously and the work runs through the same async queue underneath.

The takeaway.

Real-time inference is expensive because you're paying for a guarantee. Most LLM workloads at scale don't need that guarantee — they need the answer within a reasonable window, with cost-per-job as the optimization target. For those workloads, async batch is the right shape, and the price difference is large enough to matter for any meaningful volume.

Audit your bill. If a meaningful fraction of it is overnight, scheduled, queued, or otherwise non-interactive, that's the slice that should be running async on a distributed network.
