From CPU to GPU: A Practical Guide to LLM Deployment
“For decades, we thought of the CPU as the powerhouse of a computer, but that picture was incomplete: the CPU is the powerhouse of sequential computing (e.g., processing a Word document), while the GPU is the powerhouse of parallel computing (e.g., machine learning). This is where the distinction between Vidya (knowledge) and Avidya (ignorance) pays off for anyone aiming to build high-performance, scalable, and efficient applications.”
To better understand the increasing demand for GPUs, we must examine their unique architectural advantages.
If you look at the detailed architecture of a GPU, three key differences make it unique compared to a CPU:
- Core — First, the GPU core is where logic and computation take place. As of August 3, 2024, GPUs have far more cores than CPUs: a GPU like the RTX 4090 has over 16,000 cores, while a top consumer CPU like the Intel Core i9 has around 24 (see the quick check after this list).
- Memory — Typically, a CPU has small but fast hierarchical caches (L1, L2, L3) for rapid data access, backed by larger (though slower) RAM, whereas a GPU has high-bandwidth DRAM shared by all of its cores.
- Instruction Set Architecture — Computers perform tasks using instructions defined by an ISA. CPU ISAs focus on general-purpose arithmetic and complex floating-point operations, while GPU ISAs focus on vector and matrix operations.
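To see this asymmetry on a live machine, you can query both processors from the shell. This is a minimal check, assuming a Linux box with an NVIDIA GPU and the nvidia-smi utility installed; note that nvidia-smi reports the GPU model and memory, and the CUDA core count for that model is then looked up in NVIDIA's spec sheets:

```bash
# CPU: logical core count visible to the OS
lscpu | grep -E '^CPU\(s\):'

# GPU: model name and total memory
nvidia-smi --query-gpu=name,memory.total --format=csv
```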
Let’s see how these differences play out in the real world. GPUs excel at tasks that need massive numbers of calculations at once, such as Large Language Models (LLMs). To harness this power, let’s run Llama 3 on an NVIDIA GPU.
Set up your own custom GPT using Open WebUI on Hyperstack:
1. Deploy a new virtual machine
- OS Image — Ubuntu Server 22.04 LTS R535 CUDA 12.2
- Flavor Details — A100-80G-PCIe
- Create new SSH Key
- Enable Public IP (to access the GPT)
- Modify the networking configuration
- Under the firewall options, allow SSH access
- Add an inbound rule for the port range 1–65535 (this opens every port; for anything beyond a quick test, restrict it to the ports you actually need, such as 22 and 3000)
2. Connect to the VM instance through a terminal
- Example — ssh -i ~/.ssh/my_key user@Public-IP-Address
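Once connected, it's worth confirming that the VM actually sees the GPU before installing anything. The OS image above ships with the R535 driver, so this should work out of the box:

```bash
# Should report an A100 with ~80 GB of memory, plus the driver and CUDA versions
nvidia-smi
```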
3. Install Ollama (a platform for running Large Language Models, or LLMs, locally)
- curl -fsSL https://ollama.com/install.sh | sh
- Pull and run the Llama 3 model
- - ollama run llama3
- Give it a few prompts (observe the difference in response speed between running the same task on a CPU versus a GPU)
- Check the response through Ollama’s REST API, as shown below
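Ollama listens on port 11434 by default and exposes a simple HTTP API; a minimal request looks like this (the model name matches the one pulled above, and the prompt is just an example):

```bash
# Generate a completion from the locally running llama3 model
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```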
4. Create a ChatGPT-like interface using Open WebUI (an extensible, self-hosted interface designed to interact with LLMs)
- Install Docker (Open WebUI and its dependencies are distributed as a prebuilt Docker image; see the install sketch below)
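Docker's convenience script is the quickest way to get it onto Ubuntu (one option among several; Docker's documentation lists package-manager alternatives):

```bash
# Download and run Docker's official convenience install script
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
```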
5. Install the NVIDIA driver (if not already present) and the NVIDIA Container Toolkit, so Docker containers can utilize the GPU effectively
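Since the OS image above already bundles the R535 driver, typically only the container toolkit still needs installing. A short sketch, assuming NVIDIA's apt repository has been configured per their installation guide:

```bash
# Install the NVIDIA Container Toolkit and register it with Docker
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```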
6. Run the bundled Docker container
- docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
- This command starts the container and serves Open WebUI on port 3000 (mapped from the container’s internal port 8080)
- Access it at public-ip-address:3000 in your browser
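If the page doesn't load, two standard Docker commands will show whether the container came up cleanly:

```bash
# Confirm the container is running and the port mapping is in place
docker ps --filter name=open-webui

# Tail the logs for startup errors
docker logs -f open-webui
```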
Enjoy and understand!
I’d love to hear your feedback on these insights.