This guide will walk you through deploying a powerful, open-source AI chatbot using FastAPI to create an API and RunPod to serve it on a high-performance GPU. We'll use a practical, widely-used model to make this a real-world example.
Prerequisites
Before you begin, make sure you have the following:
A RunPod account with billing set up.
Docker installed on your local machine.
Basic knowledge of Python, FastAPI, and the command line.
Step 1: Set Up Your FastAPI Application
First, create a project directory on your local machine. Inside this directory, you will create three essential files: main.py for your API logic, requirements.txt to list your dependencies, and a Dockerfile to containerize the application.
1. Create the requirements.txt File
This file will list all the Python packages needed for your chatbot to run. We'll use the transformers library from Hugging Face, torch for PyTorch, and FastAPI with uvicorn as the server.
fastapi
uvicorn
torch
transformers
accelerate
bitsandbytes
Note: accelerate and bitsandbytes are included for model optimization and quantization, which helps the model run more efficiently on the GPU.
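They're optional here; the main.py below loads the model in full precision and works without them. If you do want to experiment with quantization, the following is a rough sketch of how the model load in main.py could be swapped for an 8-bit load (whether this actually helps for a model of DialoGPT's size is something to verify for your own setup):

# Optional alternative to the plain from_pretrained() call in main.py (sketch only)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/DialoGPT-large",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
    device_map="auto",  # accelerate places the model on the available GPU
)
# Note: with device_map="auto" you skip the manual model.to(device) call.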
2. Write the FastAPI Code (main.py)
This Python script will load your chosen AI model and create an API endpoint to interact with it. For this example, we'll use a powerful yet manageable model like microsoft/DialoGPT-large.
from fastapi import FastAPI
from pydantic import BaseModel
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize the FastAPI app
app = FastAPI()

# --- Model Loading ---
# Load the tokenizer and model from Hugging Face
# This will run once when the application starts
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# --- API Data Models ---
class ChatInput(BaseModel):
    text: str
    chat_history_ids: list[int] | None = None

class ChatOutput(BaseModel):
    response: str
    chat_history_ids: list[int]

# --- API Endpoints ---
@app.get("/")
def read_root():
    return {"status": "AI Chatbot is running"}

@app.post("/chat", response_model=ChatOutput)
def chat_with_bot(payload: ChatInput):
    """
    Handles a chat interaction with the DialoGPT model.
    """
    # 1. Encode the new user input (with the end-of-sequence token appended)
    new_user_input_ids = tokenizer.encode(
        payload.text + tokenizer.eos_token, return_tensors="pt"
    ).to(device)

    # 2. Append the new user input to the chat history
    # If chat history exists, append to it; otherwise start a new one
    if payload.chat_history_ids:
        # Wrap the history in a list to restore the batch dimension (shape [1, seq_len])
        history = torch.LongTensor([payload.chat_history_ids]).to(device)
        bot_input_ids = torch.cat([history, new_user_input_ids], dim=-1)
    else:
        bot_input_ids = new_user_input_ids

    # 3. Generate a response
    chat_history_ids = model.generate(
        bot_input_ids,
        max_length=1000,  # total length of history plus newly generated tokens
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        do_sample=True,
        top_k=100,
        top_p=0.7,
        temperature=0.8,
    )

    # 4. Decode only the newly generated tokens (everything after the input)
    response_text = tokenizer.decode(
        chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True
    )

    # 5. Return the response and the updated history
    return {
        "response": response_text,
        "chat_history_ids": chat_history_ids.tolist()[0],
    }
This code sets up a /chat endpoint that takes user text and conversation history, generates a response using the model on the GPU, and sends back the chatbot's reply along with the updated history.
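Before containerizing, it's worth a quick local smoke test. Here's a minimal sketch using FastAPI's TestClient (run it from the project directory; depending on your FastAPI version you may also need httpx installed for TestClient, and the first run will be slow because importing main.py downloads and loads the model):

# smoke_test.py -- optional local check (filename is just a suggestion)
from fastapi.testclient import TestClient

from main import app  # importing main.py loads the model, so this takes a moment

client = TestClient(app)

# Health check
print(client.get("/").json())  # expect {"status": "AI Chatbot is running"}

# One chat turn with no prior history
reply = client.post("/chat", json={"text": "Hello, how are you today?"}).json()
print("Bot:", reply["response"])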
Step 2: Containerize with Docker
Now, create a Dockerfile to package your application and its dependencies into a container. This makes it portable and easy to deploy on RunPod.
# Use an official NVIDIA CUDA runtime as a parent image
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
# Set the working directory
WORKDIR /app
# Install Python and Pip
RUN apt-get update && apt-get install -y python3 python3-pip
# Copy the requirements file into the container
COPY requirements.txt .
# Install the Python dependencies
RUN pip3 install --no-cache-dir -r requirements.txt
# Copy the rest of the application code
COPY . .
# Expose the port the app runs on
EXPOSE 8000
# Command to run the application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Step 3: Build and Push the Docker Image
With the Dockerfile ready, you can build your container image and push it to a container registry like Docker Hub.
Build the image: Open your terminal in the project directory and run:
docker build -t your-dockerhub-username/runpod-chatbot:latest .
Log in to Docker Hub:
docker login
Push the image:
docker push your-dockerhub-username/runpod-chatbot:latest
Step 4: Deploy on RunPod
Now it's time to deploy your container on RunPod. You can choose between a persistent GPU Pod and a Serverless endpoint. For a chatbot API, Serverless is often more cost-effective.
Navigate to Serverless in your RunPod dashboard and click New Endpoint.
Select a GPU: Choose a suitable GPU. For a model of this size, something like an NVIDIA RTX A4000 is a good starting point.
Configure the Endpoint:
Container Image: Enter the name of the Docker image you just pushed (e.g., your-dockerhub-username/runpod-chatbot:latest).
Container Disk: Allocate at least 15 GB to be safe.
Min/Max Workers: Set the minimum number of workers (e.g., 0 or 1) and the maximum. The system will autoscale within this range based on demand.
Expose the Port: Under "Network," set the "Container Port" to 8000, which is the port our FastAPI app is running on inside the container.
Deploy: Click Deploy. RunPod will now pull your container and set up the serverless endpoint. This may take a few minutes as the model needs to be downloaded and loaded into memory on the first run.
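While that first worker spins up, a quick health check against the root route is an easy way to confirm the container is serving before you move on to the chat route. Here's a sketch using the requests library, with the URL left as a placeholder for the one RunPod gives you:

import requests

ENDPOINT_URL = "YOUR_RUNPOD_ENDPOINT_URL"  # placeholder: use your endpoint's URL

resp = requests.get(f"{ENDPOINT_URL}/", timeout=60)
print(resp.status_code, resp.json())  # expect 200 and {"status": "AI Chatbot is running"}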
Step 5: Test Your Deployed Chatbot
Once your endpoint is active, RunPod provides you with a unique URL. You can use any API client, such as Postman, or a simple curl command to test it.
curl -X 'POST' \
'YOUR_RUNPOD_ENDPOINT_URL/chat' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"text": "Hello, how are you today?",
"chat_history_ids": null
}'
You should receive a JSON response containing the chatbot's reply and the conversation history IDs. For subsequent turns in the conversation, you would pass the received chat_history_ids back into the next request.
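Here's a hedged sketch of what that looks like as a small multi-turn client in Python with the requests library (again, YOUR_RUNPOD_ENDPOINT_URL is a placeholder):

import requests

ENDPOINT_URL = "YOUR_RUNPOD_ENDPOINT_URL"  # placeholder for your endpoint's URL

chat_history_ids = None  # no history before the first turn
for text in ["Hello, how are you today?", "What do you like to talk about?"]:
    resp = requests.post(
        f"{ENDPOINT_URL}/chat",
        json={"text": text, "chat_history_ids": chat_history_ids},
        timeout=120,
    )
    resp.raise_for_status()
    data = resp.json()
    print("You:", text)
    print("Bot:", data["response"])
    chat_history_ids = data["chat_history_ids"]  # feed the history back into the next turn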
And there you have it! You've successfully navigated the entire lifecycle from local Python code to a globally deployed, GPU-accelerated AI chatbot. This combination of FastAPI and RunPod isn't just a technical exercise; it's a powerful and cost-effective blueprint for building serious AI applications, whether for a personal project, a startup, or an enterprise tool. I'd love to see what you build with this. If you run into questions or create something cool, drop a comment below!
For more practical guides on building and deploying real-world AI systems, make sure to subscribe.