How to quantize a large language model for edge devices
Convert an 8B-parameter model to 4-bit GGUF format for inference on a Raspberry Pi 4/5 or Jetson Nano using llama.cpp.
This guide converts a large language model into a quantized format optimized for low-power hardware. The steps target Ubuntu 24.04 LTS and similar Linux distributions with a recent llama.cpp checkout; the project ships frequent rolling builds, so clone the latest source rather than pinning a specific version.
Prerequisites
- Linux OS: Ubuntu 24.04 LTS, Debian 12, or similar with 8GB+ RAM.
- Hardware: x86_64 CPU or ARM64 (Raspberry Pi 4/5, Jetson Nano).
- Software: Git, CMake, make, python3, python3-pip, and a C++ compiler (gcc/g++).
- Model file: A Hugging Face model in .safetensors format (e.g., Meta-Llama-3-8B-Instruct).
Step 1: Clone the llama.cpp repository
Download the source code for llama.cpp from GitHub to your local machine. This repository contains the quantization tools and inference engine required for edge deployment.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
Ensure the repository is cloned successfully by checking the directory contents.
ls -la
# Expected output (truncated):
# drwxr-xr-x 20 user user 4096 src
# -rw-r--r-- 1 user user 1234 README.md
Step 2: Build llama.cpp from source
Compile the project using CMake. This process builds the C++ inference library and the command-line tools needed for conversion and running models on your device.
cmake -B build
cmake --build build --config Release -j$(nproc)
The build produces the llama-cli and llama-quantize binaries under build/bin. On ARM devices like the Raspberry Pi, the default CPU build is correct; NEON optimizations are detected automatically, so no CUDA-specific options are needed.
# Expected output ends with the tool targets, e.g.:
# [100%] Built target llama-cli
# [100%] Built target llama-quantize
Step 3: Download the base model
Retrieve the original model weights from Hugging Face. This guide uses Meta-Llama-3-8B-Instruct as the baseline for quantization. Download the safetensors weights together with the tokenizer and config files, which the conversion script needs. Note that the meta-llama repositories are gated: accept the license on Hugging Face and run huggingface-cli login first.
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "*.safetensors" "*.json" --local-dir /path/to/model
Alternatively, use wget or curl against the repository's file URLs if huggingface-cli is unavailable. Store the model files in a dedicated directory for easy access.
ls /path/to/model
# Expected output (shard count varies by model):
# config.json
# model-00001-of-00002.safetensors
# model-00002-of-00002.safetensors
# tokenizer.json
Step 4: Convert and quantize the model to Q4_K_M
First convert the safetensors checkpoint into a single full-precision GGUF file using the conversion script bundled with llama.cpp, then run the quantization tool on the result. Q4_K_M stores most weights as 4-bit integers in blocks that share scale factors (the "K-quant" scheme), reducing memory usage by roughly 75% relative to 16-bit weights while maintaining good accuracy.
pip install -r requirements.txt
python3 convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile /path/to/model/Llama-3-8B-Instruct.F16.gguf
./build/bin/llama-quantize /path/to/model/Llama-3-8B-Instruct.F16.gguf /path/to/model/Llama-3-8B-Instruct.Q4_K_M.gguf Q4_K_M
The conversion script reads every safetensors shard and writes one GGUF file, so there are no shards to merge afterward. The quantized output is significantly smaller than the F16 intermediate.
ls -lh /path/to/model/*.gguf
# Expected output:
# -rw-r--r-- 1 user user  15G Llama-3-8B-Instruct.F16.gguf
# -rw-r--r-- 1 user user 4.9G Llama-3-8B-Instruct.Q4_K_M.gguf
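The ~75% figure can be sanity-checked with simple arithmetic. A minimal sketch, where the bits-per-weight values are approximations (real GGUF files also carry metadata and keep some tensors, such as embeddings, at higher precision):

```python
# Back-of-the-envelope size estimate for quantized model files.
# Approximate effective bits per weight, including block scales:
# F16 = 16, Q8_0 ~ 8.5, Q4_K_M ~ 4.8.
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{estimate_size_gb(8e9, bpw):.1f} GB")
```

For an 8B-parameter model this prints roughly 16, 8.5, and 4.8 GB, which lines up with the file sizes you should see on disk.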
Step 5: Quantize to Q8_0 for higher precision
Some edge devices benefit from Q8_0 quantization, which stores weights as 8-bit integers with per-block scales. This format roughly halves memory relative to 16-bit weights but requires noticeably more RAM than Q4_K_M.
./build/bin/llama-quantize /path/to/model/Llama-3-8B-Instruct.F16.gguf /path/to/model/Llama-3-8B-Instruct.Q8_0.gguf Q8_0
Compare the file sizes and performance of Q4_K_M versus Q8_0 on your specific hardware. Q4_K_M is generally recommended for devices with limited RAM, while Q8_0 offers better fidelity for complex tasks.
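To make the Q4_K_M-versus-Q8_0 decision concrete, here is a small helper that picks the highest-fidelity quantization fitting a RAM budget. The bits-per-weight values and the 1.5 GB headroom for the OS and KV cache are rough assumptions of this sketch, not figures reported by llama.cpp:

```python
# Ordered from highest to lowest fidelity; bits per weight are approximate.
QUANTS = [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q3_K_M", 3.9), ("Q2_K", 3.35)]

def pick_quant(n_params: float, ram_gb: float, headroom_gb: float = 1.5):
    """Return the first quantization whose weights fit in RAM with headroom."""
    for name, bpw in QUANTS:
        size_gb = n_params * bpw / 8 / 1e9
        if size_gb + headroom_gb <= ram_gb:
            return name
    return None  # nothing fits; pick a smaller base model

print(pick_quant(8e9, 8.0))  # 8GB Raspberry Pi -> Q4_K_M
print(pick_quant(8e9, 4.0))  # 4GB board -> None
```

The second call returning None reflects the point above: an 8B model does not fit on a 4GB device at any standard quantization, so a smaller base model is the better choice there.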
Verify the quantized model
Run the quantized model with llama-cli to confirm it works correctly on your edge device. Use a simple prompt to test inference speed and memory footprint.
./build/bin/llama-cli -m /path/to/model/Llama-3-8B-Instruct.Q4_K_M.gguf -p "What is the capital of France?" -n 128
Observe the output. The model should answer "Paris", though on a Raspberry Pi 4 expect generation on the order of one to a few tokens per second rather than an instant reply. Note that the ~4.9GB Q4_K_M file requires an 8GB Raspberry Pi 4/5; on a 4GB board, use a smaller base model or a more aggressive quantization. Check memory usage with free -h during inference to ensure it fits within your device's limits.
free -h
# Expected output on Raspberry Pi 4 (8GB RAM) during inference:
#        total   used   free   shared   buff/cache   available
# Mem:   7.6Gi   5.6Gi  0.2Gi  0.1Gi    1.8Gi        1.8Gi
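Resident memory during inference is roughly the weight file plus the KV cache. A back-of-the-envelope estimate, assuming Llama-3-8B's architecture (32 layers, 8 KV heads, head dimension 128) and an FP16 cache:

```python
# Rough inference-RAM estimate: quantized weights + FP16 KV cache.
# Architecture defaults below are Llama-3-8B; adjust for other models.
def kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128,
                n_ctx=8192, bytes_per_elt=2):
    # K and V caches: 2 tensors per layer, each n_ctx * n_kv_heads * head_dim
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt / 1e9

weights_gb = 4.9  # approximate Q4_K_M file size for an 8B model
total = weights_gb + kv_cache_gb(n_ctx=2048)
print(f"~{total:.1f} GB")  # ~5.2 GB at a 2048-token context
```

The KV cache grows linearly with context length, which is why shrinking the context (llama-cli's -c flag) is an effective lever when memory is tight.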
Troubleshooting
- Error: "No such file or directory: 'llama-quantize' or 'llama-cli'". Ensure the build completed successfully and that the binaries exist under build/bin. Run ls build/bin/ to verify.
- Error: "Out of memory" during inference. Switch to a more aggressive quantization such as Q3_K_M or Q2_K, or reduce the context size (-c) and batch size (-b) in the llama-cli command.
- Error: "Invalid model format". Verify that the model file has a .gguf extension and was produced by convert_hf_to_gguf.py and llama-quantize, not a third-party converter.
- Slow inference on ARM. NEON optimizations are enabled automatically when you build natively on an ARM device; if you cross-compiled for the wrong target, rebuild on the device itself.
- Model crashes on startup. Check that the model file is not corrupted. Re-download the model or try a different quantization level.
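For the "Invalid model format" case, one quick sanity check is the file's magic number: every valid GGUF file begins with the four bytes GGUF. A minimal sketch (the helper name is illustrative, not part of llama.cpp):

```python
# Check the 4-byte GGUF magic at the start of a file.
# A truncated download or a mis-converted file fails this check.
def looks_like_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```

Run it against your quantized file, e.g. looks_like_gguf("/path/to/model/Llama-3-8B-Instruct.Q4_K_M.gguf"); a False result means the file is not a GGUF model and should be re-converted or re-downloaded.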
Follow these steps to deploy large language models on resource-constrained edge devices efficiently.