I’m currently deploying a serverless API on Runpod and running into an unexpected latency spike.
The setup is simple: a script runs on a serverless endpoint with concurrency set to 32 workers. When I send a batch of 8 parallel requests (repeated 4 times, for 32 requests total), the first batch averages around 0.5 seconds of latency, as expected. Subsequent batches, however, jump to around 5 seconds and stay there. I’m unsure what causes this behavior.
I've attached the sample inference/ping code. If you have experience with similar issues, please share your hypothesis on the cause and how you would approach solving it.
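Since the attached code isn't shown here, a minimal sketch of the measurement pattern described above (4 sequential batches of 8 parallel requests, averaging latency per batch) might look like the following. The `fn` callable is a placeholder for the actual Runpod endpoint call; the function and parameter names are illustrative, not from the attachment.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(fn):
    """Run one request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def batch_latencies(fn, batch_size=8, num_batches=4):
    """Fire `batch_size` parallel requests, `num_batches` times in sequence,
    and return the average latency of each batch."""
    averages = []
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for _ in range(num_batches):
            latencies = list(pool.map(timed_call, [fn] * batch_size))
            averages.append(sum(latencies) / len(latencies))
    return averages
```

In practice `fn` would wrap an HTTP POST to the serverless endpoint (e.g. via `requests`); comparing the per-batch averages returned here should reproduce the 0.5 s → 5 s jump described above.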
Location: Anywhere
Posted: Oct. 8, 2024, 7:53 p.m.