With advancements in Generative AI happening every day, more organizations are incorporating these models into their services to meet business requirements. As this trend grows, fine-tuning generative models for specific use cases has become increasingly important. In this article, we share our findings from deploying customer-specific fine-tuned LLMs in production.
We evaluated several frameworks for self-hosting large language models, including Hugging Face, NVIDIA Triton, and vLLM. While both NVIDIA Triton and vLLM emerged as leading solutions, we preferred vLLM due to a more favorable experience during our initial testing.
Using a fine-tuned 7B Mistral model, we demonstrate vLLM's performance in production by considering input tokens, output tokens, batch size, and parallel requests. Our results indicate that vLLM, with minimal manual tweaking, achieves a throughput of up to 130 tokens per second on an A100 with large text inputs (averaging 8k tokens). It can handle up to 32 concurrent requests, simulating a real-time workflow where requests are processed in parallel without prior batching. This makes vLLM ideal for hosting models used in real-time applications such as document classification, extraction, and summarization for large documents (averaging 8-10 pages). This setup can process about 20-30 million documents annually at an on-demand cost of roughly $30,000 per year, offering a more cost-effective solution compared to alternatives and reducing dependency on third parties and their API quotas. We also compare performance with a less expensive T4 option (~$10,000 annually) and a consumer-grade RTX 3090 (~$5,000 annually, though typically not suitable for most businesses).
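As a rough sanity check on the annual-volume figure, here is a minimal back-of-envelope calculation. It assumes a sustained 130 output tokens per second and roughly 150 output tokens per document; the per-document figure is an assumption chosen for illustration, not a measured number.

```python
# Back-of-envelope check of the annual document volume on a single A100.
throughput_tps = 130          # sustained output tokens/sec (from our benchmarks)
seconds_per_year = 365 * 24 * 3600
tokens_per_doc = 150          # assumed average output tokens per document

docs_per_year = throughput_tps * seconds_per_year / tokens_per_doc
print(f"~{docs_per_year / 1e6:.0f} million documents per year")   # ~27 million at these assumptions
```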
While the low cost and high throughput of self-hosting are appealing, our main motivation is accuracy. Fine-tuned models consistently outperform GPT-4 on specialized tasks such as business-specific entity extraction. Further details on the training process and accuracy improvements will be explored in a subsequent article.
In this article, we dive into the practicalities of deploying fine-tuned large language models (LLMs) in production environments. You'll take away three key insights:
Before we start, let's address the question:
Why do we need to fine-tune instead of using other methods such as prompt engineering [1] or RAG [15]?
Each approach has its strengths and is suitable for different scenarios, and sometimes they are used together in a complementary manner. Prompt engineering involves carefully designing prompts to elicit the best possible output from the model, while retrieval-augmented generation (RAG) combines LLMs with external knowledge sources to incorporate up-to-date information.
Fine-tuning, on the other hand, is crucial when you need to teach a model "new skills" and capture the nuances of specific use cases and domains, such as healthcare, insurance, and finance. For example, in the insurance industry, fine-tuning can help an LLM accurately identify business-specific claim numbers and claimant names in documents, achieving the level of accuracy necessary for business operations.
The following sections focus on fine-tuning and, specifically, on deploying a fine-tuned model in production using vLLM.
vLLM [4] is an open-source inference optimization framework for LLMs developed at UC Berkeley. Introduced in June 2023, it has become a popular framework, rivaling the likes of NVIDIA's TensorRT-LLM [5].
The developers of vLLM are also the main authors of PagedAttention [4], a memory-efficient attention algorithm that stores the KV cache in non-contiguous, fixed-size blocks, mirroring the idea of paging in operating systems. This has contributed significantly to vLLM's popularity and its status as one of the most efficient frameworks for LLM inference. During inference, a decoder-only LLM uses the previous context (the tokens seen so far) to generate a new token, and it repeats this step autoregressively until it reaches the maximum output length or produces an end-of-sequence (EOS) token or a stop sequence.
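To make the decoding loop concrete, here is a minimal sketch of greedy autoregressive generation with Hugging Face Transformers; the model name, prompt, and stopping settings are illustrative placeholders rather than our production configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base model; our production model is a fine-tuned Mistral 7B variant.
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Summarize the following claim document:\n..."
generated = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

max_new_tokens = 400
past_key_values = None   # KV cache: avoids recomputing attention over already-seen tokens

for _ in range(max_new_tokens):
    outputs = model(
        input_ids=generated if past_key_values is None else generated[:, -1:],
        past_key_values=past_key_values,
        use_cache=True,
    )
    past_key_values = outputs.past_key_values
    next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
    generated = torch.cat([generated, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:                     # stop on EOS
        break

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```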
Metrics
Model
Data
This comparison studies the throughput of vLLM inference against that of the native Hugging Face (HF) model. The results indicate that vLLM improves generation speed by approximately 25x, even with KV caching enabled for the HF model.
Note that these results are obtained from offline inference. Therefore, there are no asynchronous requests to the inference server.
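For illustration, here is a minimal sketch of how such an offline comparison can be set up; the model name, prompts, and output length are placeholders rather than our exact benchmark configuration, and the two measurements should be run in separate processes so each has the full GPU to itself.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "mistralai/Mistral-7B-v0.1"   # illustrative; ours is a fine-tuned Mistral 7B
PROMPTS = ["Summarize the following claim document:\n..."] * 4  # placeholder documents

def hf_throughput() -> float:
    """Native HF generation with KV caching enabled."""
    tok = AutoTokenizer.from_pretrained(MODEL)
    tok.pad_token, tok.padding_side = tok.eos_token, "left"
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tok(PROMPTS, return_tensors="pt", padding=True).to(model.device)
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=400, do_sample=False, use_cache=True)
    # Approximate count: sequences that stop early are padded to the longest output.
    new_tokens = (out.shape[1] - inputs["input_ids"].shape[1]) * out.shape[0]
    return new_tokens / (time.perf_counter() - start)

def vllm_throughput() -> float:
    """vLLM offline generation on the same prompts."""
    llm = LLM(model=MODEL)
    start = time.perf_counter()
    outputs = llm.generate(PROMPTS, SamplingParams(temperature=0.0, max_tokens=400))
    new_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    return new_tokens / (time.perf_counter() - start)
```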
We study the throughput (in tokens/second) of the quantized vLLM model ("vLLM+awq") across varying input token sizes.
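The "vLLM+awq" setup can be sketched roughly as follows; the checkpoint path is a placeholder for our fine-tuned, AWQ-quantized Mistral 7B, and the input-size sweep simply truncates a long document to the target token count.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

CKPT = "path/to/mistral-7b-finetuned-awq"   # hypothetical AWQ-quantized checkpoint

llm = LLM(
    model=CKPT,
    quantization="awq",            # use the AWQ 4-bit weight kernels
    max_model_len=8192 + 400,      # room for the longest prompt plus 400 output tokens
)
tokenizer = AutoTokenizer.from_pretrained(CKPT)
params = SamplingParams(temperature=0.0, max_tokens=400)

document = open("sample_document.txt").read()   # placeholder long document

for input_tokens in (1024, 2048, 4096, 8192):
    ids = tokenizer(document)["input_ids"][:input_tokens]   # truncate to target input size
    prompt = tokenizer.decode(ids, skip_special_tokens=True)
    outputs = llm.generate([prompt], params)
```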
This experiment analyzes vLLM's performance at different batch sizes.
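A sketch of how the batch-size sweep can be run with the same offline engine; `load_documents` is a hypothetical helper that returns a list of document prompts.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/mistral-7b-finetuned-awq",  # hypothetical checkpoint path
    quantization="awq",
    max_model_len=8192 + 400,
)
params = SamplingParams(temperature=0.0, max_tokens=400)
documents = load_documents()   # hypothetical helper returning prompt strings

for batch_size in (1, 2, 4, 8, 16, 32):
    batch = documents[:batch_size]
    start = time.perf_counter()
    outputs = llm.generate(batch, params)   # vLLM schedules the whole batch internally
    elapsed = time.perf_counter() - start
    out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch_size}: {out_tokens / elapsed:.1f} output tokens/sec")
```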
This study compares vLLM's performance on different GPU machines. Note that the evaluation was conducted under specific conditions: the maximum input prompt length and the maximum output length were set to 8192 and 400 tokens, respectively. The model tested is the AWQ-quantized variant, as the non-quantized version does not fit on the lower-end GPUs. vLLM demonstrates strong performance across all tested machines, with the NVIDIA A100 delivering the best throughput of the group. We did not include H100 GPUs in this study, as they are not used for inference in our production workflow, despite being our choice for training all fine-tuned models.
| # | GPU Machine | GPU VRAM | Annual Cost (on demand) | Throughput (out tokens/sec) |
|---|-------------|----------|-------------------------|-----------------------------|
| 1 | A100 | 80GB | ~ $30,000 USD | 83.00 |
| 2 | T4 | 16GB | ~ $10,000 USD | 21.96 |
| 3 | RTX 3090 | 24GB | ~ $5,000 USD | 72.14 |
Until now, the analysis has focused on vLLM's offline inference. This section explores vLLM's ability to handle online inference by assessing its throughput with different numbers of concurrent requests. It is important to distinguish concurrent requests from batched requests: unlike batch processing, concurrency involves the server handling multiple simultaneous requests, each with a batch size of one, which is a scenario more reflective of real-time production environments.
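A sketch of how concurrency can be exercised against a running vLLM server (started, e.g., with vLLM's OpenAI-compatible entrypoint, `python -m vllm.entrypoints.openai.api_server --model <checkpoint> --quantization awq`); the endpoint, served model name, and prompt below are placeholders.

```python
import asyncio
import time
import aiohttp

URL = "http://localhost:8000/v1/completions"     # default OpenAI-compatible endpoint
PAYLOAD = {
    "model": "path/to/mistral-7b-finetuned-awq",  # hypothetical served model name
    "prompt": "Summarize the following claim document:\n...",
    "max_tokens": 400,
    "temperature": 0.0,
}

async def one_request(session: aiohttp.ClientSession) -> int:
    async with session.post(URL, json=PAYLOAD) as resp:
        body = await resp.json()
        return body["usage"]["completion_tokens"]

async def benchmark(concurrency: int) -> float:
    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        # Each request carries a batch size of one; vLLM's continuous batching
        # interleaves them on the server side.
        tokens = await asyncio.gather(*(one_request(session) for _ in range(concurrency)))
        return sum(tokens) / (time.perf_counter() - start)

for n in (1, 4, 8, 16, 32):
    print(f"{n} concurrent requests: {asyncio.run(benchmark(n)):.1f} output tokens/sec")
```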
The introduction of the vLLM framework represents a significant advancement in optimizing LLM deployments for efficiency and scalability. The framework improves memory usage and computational efficiency, particularly through its PagedAttention mechanism. While NVIDIA Triton is a strong competitor to vLLM, we chose vLLM due to a more favorable experience during the initial testing phase.
A100 GPUs, offering high scalability and throughput, are ideal for demanding applications but come at a high cost (up to $30,000 annually), limiting their use to high-volume, high-ROI projects. Conversely, the T4 and RTX 3090 present more budget-friendly options at roughly $10,000 and $5,000 per year, respectively, suitable for businesses with stricter inference budgets. The RTX 3090's performance, closely mirroring that of the A100, suggests the untapped potential of consumer-grade hardware.
The article then shifts focus from batch sizes to managing concurrent requests, aligning more closely with real-world production scenarios. We hope it helps readers strike a balance between computational efficiency and response speed in LLM applications, and that it assists them in selecting the optimal GPU configuration for their needs by considering key factors such as input size, output size, incoming request volume, and budget constraints.
The article discusses the vLLM framework's benefits and its transformative role in generative AI, touching on applications in diverse fields such as customer service and predictive analytics. However, it also notes a gap in research on GPU memory usage for vLLM, marking this as an area for future exploration.
Rohith Mukku is an AI Researcher at Roots Automation, where he is developing a universal document understanding model with a focus on optimizing inference for large language and vision models. Prior to joining Roots, he earned his master's in computer science from New York University, with a focus on advancing Behavioral Cloning in robotics and evaluating the effects of red-teaming on large language models (LLMs). Rohith completed his undergraduate degree in computer science at IIT Kanpur and previously worked as a software engineer at Samsung R&D Institute Delhi, focusing on Tizen kernel and Visual Display applications.
[1] White, Jules, et al. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv:2302.11382, arXiv, 21 Feb. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2302.11382.
[2] Mangrulkar, Sourab, et al. “PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning Methods.” PEFT, 2022, https://github.com/huggingface/peft.
[3] Hu, Edward J., et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, arXiv, 16 Oct. 2021. arXiv.org, https://doi.org/10.48550/arXiv.2106.09685.
[4] Kwon, Woosuk, et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180, arXiv, 12 Sept. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2309.06180.
[5] Overview — Tensorrt_llm Documentation. https://nvidia.github.io/TensorRT-LLM/overview.html. Accessed 9 July 2024.
[6] “Achieve 23x LLM Inference Throughput & Reduce P50 Latency.” Anyscale, https://www.anyscale.com/blog/continuous-batching-llm-inference. Accessed 9 July 2024.
[7] Leviathan, Yaniv, et al. Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192, arXiv, 18 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2211.17192.
[8] Frantar, Elias, et al. GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers. arXiv:2210.17323, arXiv, 22 Mar. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2210.17323.
[9] Lin, Ji, et al. AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978, arXiv, 23 Apr. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2306.00978.
[10] Bitsandbytes. https://huggingface.co/docs/bitsandbytes/main/en/index. Accessed 9 July 2024.
[11] Su, Jianlin, et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864, arXiv, 8 Nov. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2104.09864.
[12] Chen, Shouyuan, et al. Extending Context Window of Large Language Models via Positional Interpolation. arXiv:2306.15595, arXiv, 28 June 2023. arXiv.org, https://doi.org/10.48550/arXiv.2306.15595.
[13] LocalLLaMA. "NTK-aware scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation." Reddit, 2023. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_%20rope_allows_llama_models_to_have/
[14] Dao, Tri, et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135, arXiv, 23 June 2022. arXiv.org, https://doi.org/10.48550/arXiv.2205.14135.
[15] Lewis, Patrick, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401, arXiv, 2020. arXiv.org, https://doi.org/10.48550/arXiv.2005.11401.