Hacker's Guide: Cost Considerations and Comparing Providers

July 11, 2024

I tested every model and every adaptation of it so you don't have to scratch your head figuring out configs and VMs while simultaneously solving the coding challenges of fine-tuning, pre-training, and cost estimation.

The Stakes: High-res Dreams vs. Wallet-crushing Realities

Picture this: You've just cooked up a sweet Stable Diffusion model that can turn your buddy's selfies into Renaissance masterpieces. You're ready to unleash it on the world, but here's the kicker – running this bad boy at scale could either make you the next tech mogul or leave you broke and living in a cardboard box. Let's crunch some numbers:

# Quick and dirty cost estimator

def estimate_monthly_cost(requests_per_minute, cost_per_request, hours_per_day):
    daily_requests = requests_per_minute * 60 * hours_per_day
    monthly_requests = daily_requests * 30
    return monthly_requests * cost_per_request

# Let's say we're handling 30 req/min, $0.10 per req, 24/7
monthly_damage = estimate_monthly_cost(30, 0.10, 24)
print(f"Monthly damage: ${monthly_damage:,.2f}")

Run this, and you'll see we're looking at a cool $129,600 per month. Ouch! šŸ”„šŸ’ø That was a joke, though. Actually, fine-tuning on an A100 PCIe with 120 GB of RAM and 12 vCPUs on RunPod costs only about $2/hr. But fear not, dear reader. We're about to dive deep into the trenches of cloud providers, GPU instances, and serverless voodoo to turn this potential money pit into a lean, mean, any-model-serving machine.

A Highly Customizable Market of Providers

When it comes to hosting our ML models, we have a plethora of options. Let's break down some of the key players and their offerings:


| Provider        | A10G Cost/hr | A100 Cost/hr | CPU + Memory |
|-----------------|--------------|--------------|--------------|
| Modal           | $1.10        | $3.17        | $0.159/hr    |
| Oracle          | $2.00        | $3.05 (8x)   | Included     |
| AWS             | Varies       | Varies       | Varies       |
| Lambda Labs     | $0.75        | $1.29        | Included     |
| Prime Intellect | $0.32 (4090) | $0.79 (80GB) | Included     |

Each provider offers a unique blend of performance, cost, and additional services. For instance, while Prime Intellect offers the most competitive GPU pricing, Modal provides substantial free credits which could be a significant factor in short-term decision making.

Serverless vs. Dedicated Instances

One key decision is whether to opt for a serverless architecture or dedicated instances. Our current setup on Modal leverages serverless capabilities, which offers benefits like:

  • Automatic scaling
  • Pay-per-use pricing
  • No need for infrastructure management

However, it also comes with challenges:

  • Cold start times (500-2000ms for Modal)
  • Potential for higher costs at scale
  • Less control over underlying infrastructure

Dedicated instances, on the other hand, provide:

  • Consistent performance
  • More control over the environment
  • Potentially lower costs at high, steady workloads

The choice between serverless and dedicated instances will depend on our workload patterns and scaling needs.
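
To put numbers on that trade-off, here's a rough break-even sketch using the Modal and Lambda Labs A100 rates from the table above. The 720-hour month and the idea that serverless bills only for active compute are my simplifying assumptions, not provider quotes.

# Rough break-even between serverless (pay-per-use) and dedicated (always-on)
# Rates come from the provider table above; everything else is an assumption.
SERVERLESS_RATE = 3.17 + 0.159  # Modal A100 + CPU/mem, $ per hour of active compute
DEDICATED_RATE = 1.29           # Lambda Labs A100, $ per hour, billed 24/7

def break_even_hours(hours_in_month=720):
    # Active hours per month at which the dedicated box becomes cheaper
    return hours_in_month * DEDICATED_RATE / SERVERLESS_RATE

hours = break_even_hours()
print(f"Break-even: {hours:.0f} active hours/month ({hours / 720:.0%} utilization)")

Past roughly 39% utilization, the always-on box wins on raw cost; below that, serverless does.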

Performance Considerations

The performance of our API layer is crucial for real-time image and video manipulation. Let's compare some popular options:

| Metric                | Go     | Node.js | Python (FastAPI) | Modal  |
|-----------------------|--------|---------|------------------|--------|
| Requests/second       | 100k+  | 15-20k  | 5-10k            | Varies |
| Latency (ms)          | 0.5-2  | 2-10    | 5-15             | 10-50  |
| Memory usage          | Low    | Medium  | Medium-High      | Varies |
| Cold start time (ms)  | 10-50  | 50-200  | 50-200           | 500-2000 |
| Concurrent connections| 100k+  | 10-20k  | 5-10k            | Varies |

Go stands out with its impressive performance metrics, particularly in terms of throughput and latency. This aligns well with our need for real-time processing. However, it's worth noting that Python (with FastAPI) offers a rich ecosystem of ML libraries, which could be beneficial for certain aspects of our pipeline.
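
For context, here's a minimal sketch of what the Python (FastAPI) variant of that API layer could look like; the handler is a placeholder stand-in, not our actual pipeline code.

# Minimal FastAPI sketch of the API layer -- the handler is a placeholder
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: GenerateRequest):
    # A real deployment would call the diffusion pipeline here
    return {"image": "masterpiece.png", "prompt": req.prompt}

# Run with: uvicorn main:app --workers 4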

In our current setup, we've observed:

  • Container uptime for an A10G or L4 running SDXL Lightning overnight comes to around $50 ($40 with the L4)
  • We handle approximately 30 requests per minute
  • Each request costs about 5-10 cents if a container ends up serving only a single request

These metrics provide a baseline for comparing different infrastructure and API options.
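
As a sanity check, here's the arithmetic behind those numbers; the 8-hour overnight window is my assumption, the rest comes from the observations above.

# Back out an amortized per-request cost from observed uptime spend
overnight_cost = 50.0       # A10G container uptime overnight, observed
overnight_hours = 8         # assumed length of "overnight"
requests_per_minute = 30    # observed

requests_served = requests_per_minute * 60 * overnight_hours  # 14,400
print(f"Amortized cost per request: ${overnight_cost / requests_served:.4f}")

At steady load the amortized cost lands well under a cent, which is why the 5-10 cent figure only applies when a warm container sits around serving a single request.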

Cost Breakdown

Understanding the granular costs associated with our ML model deployment is crucial for optimization. Let's break down the main cost components:

GPU Costs

GPU costs vary significantly across providers and instance types. For our use case, we're primarily interested in A10G and A100 GPUs:

| Provider       | A10G Cost/hr | A100 (40GB) Cost/hr |
|----------------|--------------|---------------------|
| Modal          | $1.10        | $3.17               |
| Lambda Labs    | $0.75        | $1.29               |
| Prime Intellect| $0.32 (4090) | $0.79 (80GB)        |

The variation in pricing is substantial, with Prime Intellect offering the most competitive rates. However, it's important to consider factors beyond raw GPU cost, such as included CPU and memory resources.

CPU and Memory Costs

While some providers bundle CPU and memory costs with GPU instances, others charge separately. For example, Modal charges $0.159/hr for CPU and memory in addition to GPU costs.
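
That separate line item matters when comparing providers, so it's worth folding it into an effective hourly rate. A quick sketch using the numbers above:

# Fold Modal's separate CPU/mem charge into an effective hourly rate
modal_a10g_gpu = 1.10    # $/hr, from the provider table
modal_cpu_mem = 0.159    # $/hr, billed separately
lambda_a10g = 0.75       # $/hr, CPU and memory included

print(f"Modal A10G effective rate: ${modal_a10g_gpu + modal_cpu_mem:.3f}/hr")
print(f"Lambda Labs A10G rate:     ${lambda_a10g:.2f}/hr")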

Storage and Data Transfer

Storage and data transfer costs can add up quickly, especially when dealing with large image and video files. Most cloud providers charge for both storage (e.g., S3 buckets) and data transfer (ingress is usually free, but egress is charged).

For example, AWS charges:

  • S3 storage: ~$0.023 per GB per month
  • Data transfer out: $0.09 per GB (first 10TB/month)

These costs need to be factored into our overall economic analysis, especially as our data volumes grow.
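
To get a feel for the magnitude, here's a rough estimator using the AWS prices above; the storage and egress volumes in the example are made-up numbers for illustration.

# Rough monthly storage + egress estimate at the AWS rates above
S3_PER_GB_MONTH = 0.023
EGRESS_PER_GB = 0.09  # first 10 TB/month tier

def storage_transfer_cost(stored_gb, egress_gb):
    return stored_gb * S3_PER_GB_MONTH + egress_gb * EGRESS_PER_GB

# Example: 500 GB of stored images, 2 TB served out per month (illustrative)
print(f"${storage_transfer_cost(500, 2000):,.2f}/month")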

Back to Beast Mode

Summary

šŸžļø The Battlefield: Cloud Providers and Their Shiny GPUs šŸ’»

Let's talk hardware. We're not messing around with measly CPUs here; we need the big guns. I'm talking A100s and A10Gs.

Here's the lay of the land:

| Provider               | GPU          | Cost/hr | What you get                                  |
|------------------------|--------------|---------|-----------------------------------------------|
| šŸ› ļø **Modal**           | A100 (40GB)  | $3.17   | Serverless magic + cold starts ā„ļø             |
| šŸ”§ **Lambda Labs**     | A100 (40GB)  | $1.29   | Bare metal āš™ļø, bring your own code šŸ’»         |
| šŸ’° **Prime Intellect** | RTX 4090     | $0.32   | Budget option, but can it handle the heat? šŸ”„ |

Now, I hear you asking, "But which one should I choose?" šŸ¤” Well, buckle up, buttercup 🧈, because that depends on your specific use case. Let's break it down. šŸ‘‡


The Code: Serverless vs. Dedicated Throwdown

Alright, let's get our hands dirty with some code. We'll compare a serverless setup on Modal with a dedicated instance on Lambda Labs.

Serverless on Modal:

import modal

stub = modal.Stub("sd-api")

@stub.function(gpu="A100")
def generate_image(prompt):
    # Your fancy ML code here
    return "masterpiece.png"

@stub.local_entrypoint()
def main():
    image = generate_image.remote("a hacker in a hoodie, surrounded by holographic displays")
    print(f"Generated: {image}")

Dedicated on Lambda Labs:

import torch
from diffusers import StableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

def generate_image(prompt):
    image = pipe(prompt).images[0]
    return image

if __name__ == "__main__":
    image = generate_image("a hacker in a hoodie, surrounded by holographic displays")
    image.save("masterpiece.png")

The Modal code is slick – it handles scaling for you. But with Lambda Labs, you have more control and potentially lower costs if you're running 24/7.

The Numbers: Show Me The Money šŸ’°

Let's break down the costs for handling 1 million requests per month:

  1. Modal (Serverless):

    • Cost per hour (A100): $3.17
    • Assuming 2 seconds per request: 2,000,000 seconds = 555.56 hours
    • Total: $1,761.11 + 555.56 hrs Ɨ $0.159/hr CPU/mem (~$88.33) = $1,849.44
  2. Lambda Labs (Dedicated):

    • Cost per hour (A100): $1.29
    • Running 24/7 for a month: 720 hours
    • Total: $928.80
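
If you want to re-derive those numbers yourself, here's the whole comparison in one quick script (same rates, same 2-seconds-per-request assumption):

# Re-derive the 1M requests/month comparison
requests = 1_000_000
seconds_per_request = 2

# Modal (serverless): pay only for active compute, plus CPU/mem
active_hours = requests * seconds_per_request / 3600   # ~555.56 hours
modal_total = active_hours * (3.17 + 0.159)

# Lambda Labs (dedicated): the box runs 24/7 regardless of traffic
lambda_total = 720 * 1.29

print(f"Modal:       ${modal_total:,.2f}")   # ~$1,849.44
print(f"Lambda Labs: ${lambda_total:,.2f}")  # $928.80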

But wait, there's more! Don't forget about:

  • Storage costs (those images gotta live somewhere)
  • Data transfer (pushing pixels ain't free)
  • API gateway costs (gotta serve 'em somehow)

The Verdict: Hacker's Choice

So, what's the play? If you're running hot 24/7, dedicated instances might be your jam. But if you're dealing with spiky traffic or want to sleep at night without worrying about scaling, serverless could be your new best friend.

Remember, this is just the tip of the iceberg. We haven't even talked about multi-GPU setups, distributed inference, or the dark arts of GPU memory optimization.

Stay tuned, because in the next part, we're diving into the nitty-gritty of API performance, cold start nightmares, and how to squeeze every last FLOP out of your GPU without setting your wallet on fire.

Until then, keep hacking, stay curious, and may your GPUs run cool and your bills run low! šŸš€šŸ”§šŸ’»