Blog

What We Learned from Deploying Fine-Tuned LLMs in Production

Written by Roots Experts | August 26, 2024

Executive Summary

With advancements in Generative AI happening every day, more organizations are incorporating these models into their services to meet business requirements. As this trend grows, fine-tuning generative models for specific use cases has become increasingly important. In this article, we share our findings from deploying customer-specific fine-tuned LLMs in production.

We evaluated several frameworks for self-hosting large language models, including Hugging Face, NVIDIA Triton, and vLLM. While both NVIDIA Triton and vLLM emerged as leading solutions, we preferred vLLM due to a more favorable experience during our initial testing.

Using a fine-tuned 7B Mistral model, we demonstrate vLLM's performance in production by considering input tokens, output tokens, batch size, and parallel requests. Our results indicate that vLLM, with minimal manual tweaking, achieves a throughput of up to 130 tokens per second on an A100 with large text inputs (averaging 8k tokens). It can handle up to 32 concurrent requests, simulating a real-time workflow where requests are processed in parallel without prior batching. This makes vLLM ideal for hosting models used in real-time applications such as document classification, extraction, and summarization for large documents (averaging 8-10 pages). This setup can process about 20-30 million documents annually at an on-demand cost of $30,000, offering a more cost-effective solution compared to alternatives and reducing dependency on third parties and their API quotas. We also compare performance with a less expensive T4 option ($5000 annually) and a consumer-grade GPU like the RTX 3090 (typically not suitable for most businesses).

While the low cost and high volume of self-hosting are appealing, the main motivation is accuracy. Fine-tuned models consistently outperform GPT-4 models in specialized tasks such as business-specific entity extraction. Further details on the training process and accuracy enhancements will be explored in a subsequent article.

Introduction

In this article, we dive into the practicalities of deploying fine-tuned large language models (LLMs) in production environments. You'll take away three key insights:

  • Understanding the vLLM framework and its advantages.
  • Comparing vLLM's performance as a function of input/output tokens, batch size, GPU machines, and more.
  • Practical advice on selecting GPU configurations for business needs.

Before we start, let's address the question:

Why do we need to fine-tune instead of using other methods like Prompt Engineering [1] or RAG [15]?

Each approach has its strengths and is suitable for different scenarios, and sometimes they are used together in a complementary manner. Prompt engineering involves designing prompts to get the best possible output from the model, while retrieval-augmented generation (RAG) combines LLMs with external knowledge sources to incorporate up-to-date information.

Fine-tuning, on the other hand, is crucial when you need to teach a model "new skills" and capture the nuances of specific use cases and domains, such as healthcare, insurance, and finance. For example, in the insurance industry, fine-tuning can help an LLM accurately identify business-specific claim numbers and claimant names in documents, achieving a level of accuracy necessary for business operations.

The following sections focus on fine-tuning and, specifically, on deploying a fine-tuned model in production using vLLM.

vLLM

vLLM [4] (Virtual Large Language Model) is an open-source inference optimization framework for LLMs, developed at UC Berkeley. Introduced in June 2023, it has become a popular framework, rivalling the likes of TensorRT-LLM [5] from NVIDIA.

The developers of vLLM are also the main authors of PagedAttention [4], a memory-efficient attention algorithm that mirrors the idea of paging in operating systems. This has significantly contributed to vLLM's popularity and its status as one of the most efficient frameworks for LLM inference. During inference, a decoder LLM uses the previous context (the tokens seen so far) to generate a new token and repeats this step autoregressively until it reaches the maximum output length or produces an end-of-sequence (EOS) token or stop sequence.
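To make this decoding loop concrete, here is a minimal sketch of greedy autoregressive generation with KV caching using Hugging Face transformers; the checkpoint name, prompt, and token budget are illustrative assumptions rather than our production setup.

```python
# Minimal sketch of greedy autoregressive decoding with KV caching (Hugging Face
# transformers). Checkpoint, prompt, and token budget are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

generated = tok("Summarize this claim document: ...", return_tensors="pt").input_ids.to(model.device)
past_key_values = None

for _ in range(200):  # maximum output tokens
    with torch.no_grad():
        out = model(
            input_ids=generated if past_key_values is None else generated[:, -1:],
            past_key_values=past_key_values,
            use_cache=True,
        )
    past_key_values = out.past_key_values               # reuse cached keys/values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = torch.cat([generated, next_token], dim=-1)
    if next_token.item() == tok.eos_token_id:           # stop at end-of-sequence
        break

print(tok.decode(generated[0], skip_special_tokens=True))
```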

How does vLLM work?

  • KV Caching:
    • At each decoding step, attention scores are computed by taking the scaled dot product between the current token's query vector and the key vectors of all previously seen tokens; the softmax-normalized scores then weight the corresponding value vectors. Caching these key and value vectors avoids recomputing them for every new token, which speeds up inference, but storing the cache contiguously can waste memory. To combat this, vLLM adopts PagedAttention, drawing inspiration from virtual memory paging in operating systems: the KV cache is segmented into fixed-size blocks, each storing the keys and values for a set number of tokens. Because blocks need not be contiguous in memory, fragmentation is avoided and the KV cache is used far more efficiently. A toy sketch of this block-based bookkeeping appears after this list.
  • Continuous Batching [6]
    • This is another popular technique that is used by almost all inference frameworks. It helps enhance the efficiency and throughput of LLM inference by dynamically managing incoming requests through batching. The server handles incoming requests, which arrive asynchronously and in varying sizes, by grouping them efficiently for next token prediction. This grouping, or batching, can occur in two ways: either by assembling requests in the order they arrive until a batch is complete, or by setting a time limit to wait for additional requests before forming a batch.
  • Speculative Decoding [7]
    • Speculative decoding aims to speed up the model's generation by having a smaller draft model propose several candidate tokens in parallel. In essence, to boost the throughput of a larger model, a smaller counterpart runs alongside it and forecasts multiple potential continuations. These draft predictions are then verified against the larger model to identify the longest sequence of tokens it agrees with. Once determined, this sequence allows the larger model to skip generating those tokens itself, effectively speeding up its text generation process.
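To illustrate the block-based bookkeeping behind PagedAttention, here is a toy sketch; it is not vLLM's implementation, and the block size, class, and method names are invented for illustration. Each sequence keeps a small "block table" mapping its cache slots to physical blocks drawn from a shared pool, so no large contiguous allocation is ever required.

```python
# Toy sketch of paged KV-cache bookkeeping (not vLLM's implementation).
BLOCK_SIZE = 16  # tokens stored per physical block (illustrative)

class ToyPagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))  # shared pool of physical blocks
        self.block_tables = {}   # seq_id -> list of physical block ids (the "block table")
        self.lengths = {}        # seq_id -> number of tokens cached so far

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve the (physical block, offset) slot for one new token's key/value vectors."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:                 # current block is full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())     # any free block works; no contiguity needed
        self.lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = ToyPagedKVCache(num_physical_blocks=4)
for _ in range(20):                                  # cache 20 tokens for one request
    block_id, offset = cache.append_token("request-1")
print(cache.block_tables["request-1"])               # two blocks, not necessarily contiguous
cache.free("request-1")
```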

Why vLLM?

  • vLLM is easy to install with few additional dependencies.
  • vLLM includes an OpenAI-style server implementation that can serve as a drop-in replacement for the OpenAI API (see the example after this list).
  • vLLM supports various quantization methods: GPTQ [8], AWQ [9], Bitsandbytes [10].
  • vLLM also supports RoPE [11] scaling (linear interpolation [12] and dynamic [13]) to extend the context length of models. This is useful when you need to run inference on texts longer than the model's native context window.
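As an example of the OpenAI-style server mentioned above, here is a minimal sketch of querying a locally hosted vLLM endpoint with the openai Python client. The served model name and launch command are assumptions, and the exact flags vary between vLLM versions.

```python
# Sketch of calling a vLLM OpenAI-compatible endpoint. Assumes a server was
# started separately, e.g. (flags vary by vLLM version):
#   python -m vllm.entrypoints.openai.api_server --model <your-awq-model> --quantization awq
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local server, dummy key

resp = client.chat.completions.create(
    model="<your-awq-model>",   # must match the model name the server exposes
    messages=[{"role": "user", "content": "Extract the claim number from: ..."}],
    max_tokens=200,
    temperature=0.0,
)
print(resp.choices[0].message.content)
```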

Experiment Setup

Metrics

  • Throughput (Output tokens generated per second)
  • Latency (Time taken per request); a brief sketch of how we compute both follows this list
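The sketch below shows how we interpret these two metrics for a single request; the generate_fn interface is a hypothetical placeholder for whatever inference call is being measured.

```python
# Hypothetical helper: latency is wall-clock seconds per request, throughput is
# generated (output) tokens per second for that request.
import time

def timed_generate(generate_fn, prompt: str):
    start = time.perf_counter()
    output_text, num_output_tokens = generate_fn(prompt)   # assumed interface
    latency_s = time.perf_counter() - start                # latency (seconds per request)
    throughput = num_output_tokens / latency_s             # throughput (output tokens/sec)
    return output_text, latency_s, throughput
```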

Model

  • We use a Mistral 7B Instruct v2 model, as it performs very well for its size and fits comfortably on a single GPU.
  • We compare this model against a baseline: the native Hugging Face Mistral model, which can be treated as a minimal, unoptimized inference setup.

Data

  • Internal dataset of around 200 diverse samples whose input tokens range from 1,000 to over 30,000. (This dataset is used for Analyses 1 through 5; a different, larger dataset is used for Analysis 6.)
  • This dataset contains documents used for entity extraction. Expected output token length usually varies from 100 to 200 (in practice, predicted output length can sometimes exceed 200).
  • Most of the documents have less than 20 pages and less than 20,000 input tokens.

Analysis 1: vLLM vs Hugging Face models

This comparison studies the generation throughput of vLLM inference against that of the native Hugging Face (HF) model. The results indicate that vLLM improves generation speed by approximately 25 times, even with KV caching enabled on the HF model.

Key Observations:

  • Across all vLLM and HF configurations, there is a general trend of a slight decrease in generation speed as the number of input tokens increases. This decrease is more pronounced for vLLM compared to HF chat models.
  • The quantized version of vLLM ("vLLM+awq") consistently shows higher generation speeds compared to the unquantized version.
  • Surprisingly, the quantized version of HF models ("hf_chat+awq") has a much lower generation speed compared to its unquantized counterpart.
  • The difference in generation speed between quantized and unquantized versions is much more significant in the HF chat models than in vLLM models.

Note that these results are obtained from offline inference. Therefore, there are no asynchronous requests to the inference server.
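For reference, offline inference here means calling vLLM's Python API directly rather than sending requests to a server. A minimal sketch, with the model path and sampling settings as assumptions:

```python
# Minimal sketch of vLLM offline inference (no server, no asynchronous requests).
from vllm import LLM, SamplingParams

llm = LLM(model="<path-to-awq-quantized-mistral>", quantization="awq")  # assumed model path
params = SamplingParams(temperature=0.0, max_tokens=200)

prompts = ["Extract the claim number and claimant name from the document below:\n..."]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)  # first completion for each prompt
```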

Analysis 2: vLLM's Throughput vs Input Context Length

We study the throughput (in tokens/second) of the quantized vLLM model ("vLLM+awq") across varying input token sizes.

Configuration:

  • Inference method: vLLM + AWQ
  • Batch Size: 1
  • Offline inference
  • GPU machine: 80GB A100
  • Output tokens range: 100-200

Key Observations

  • As the number of input tokens increases, there's a noticeable decline in the model's throughput.
  • A significant drop in throughput occurs around 8k input tokens. The exact reason for this drop is unclear.
  • Interestingly, the increase in processing time is not linear, suggesting some efficiency in handling larger inputs.

Analysis 3: vLLM's Throughput vs Output Length

Configuration:

  • Inference method: vLLM + AWQ
  • Batch Size: 1
  • Offline inference
  • GPU machine: 80GB A100
  • Input tokens range: 4096 to 16384

Key Observations

  • We notice a gradual increase in throughput as the number of output tokens increases, which suggests an efficiency gain as the model generates more tokens, potentially due to amortizing fixed overheads over a larger number of tokens.
  • The performance also depends on the input token range: smaller input token counts yield higher throughput than larger ones.

 

Analysis 4: vLLM's Throughput vs Batch Size

This experiment analyzes vLLM's performance at different batch sizes; a sketch of how such a sweep can be run offline follows the configuration below.

Configuration:

  • Inference method: vLLM + AWQ
  • Max Input Tokens: 16384
  • Max Out Tokens: 400
  • Offline inference
  • GPU machine: 80GB A100
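A batch-size sweep with this configuration can be scripted against vLLM's offline API roughly as follows; the model path, the hypothetical load_prompts helper, and the exact batch sizes are assumptions.

```python
# Hedged sketch of an offline batch-size sweep with vLLM.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="<your-awq-model>", quantization="awq", max_model_len=16384)
params = SamplingParams(temperature=0.0, max_tokens=400)
prompts = load_prompts()  # hypothetical helper returning a list of document prompts

for batch_size in (1, 2, 4, 8, 16, 32, 64):
    batch = prompts[:batch_size]
    start = time.perf_counter()
    outputs = llm.generate(batch, params)
    elapsed = time.perf_counter() - start
    out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch_size}: {out_tokens / elapsed:.1f} out tokens/sec")
```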

 

Key Observations

  • The total processing time increases non-linearly as a function of batch size, pointing to a complex relationship between batch size and processing efficiency.
  • Average generation speed increases up to a batch size of 8 or 16; we are not sure why it plateaus beyond that point.
  • For shorter inputs (1024 tokens), larger batch sizes (e.g., 32) significantly improve efficiency. However, as input length increases, the efficiency gains from larger batch sizes become less pronounced.
  • Identifying the optimal batch size involves balancing throughput maximization against computational costs, especially as input lengths vary. There is a delicate balance between processing speed and computational load, with efficiency gains plateauing or diminishing beyond certain batch sizes.
  • Out-of-memory errors occur when batch sizes exceed 64.

Analysis 5: vLLM's Performance on Different GPU Machines

This study compares vLLM's performance on different GPU machines. Note that the evaluation was conducted under specific conditions: the maximum input prompt length and the maximum output length were set to 8192 and 400 tokens, respectively. The model tested is an AWQ-quantized variant, as the non-quantized version does not fit on lower-end GPUs. vLLM performs well across all tested machines, with the NVIDIA A100 GPU delivering the best performance of the group. We did not include H100 GPUs in this study, as they are not used for inference in our production workflow, despite being our choice for training all fine-tuned models.

 

Configuration used:

  • Inference method: vLLM + AWQ
  • Max Input Tokens: 16384
  • Max Out Tokens: 400
  • Batch Size: 1
  • Offline inference

| # | GPU Machine | GPU VRAM | Annual Cost (on demand) | Throughput (out tokens/sec) |
|---|-------------|----------|-------------------------|-----------------------------|
| 1 | A100        | 80GB     | ~ $30,000 USD           | 83.00                       |
| 2 | T4          | 16GB     | ~ $10,000 USD           | 21.96                       |
| 3 | RTX 3090    | 24GB     | ~ $5,000 USD            | 72.14                       |

 

Key Observations:

  • The FlashAttention-2 [14] backend is not supported on Volta and Turing GPUs.
  • The V100 GPUs lack AWQ support, precluding the possibility of running quantized vLLM inference on them. However, the T4 GPUs do not face this issue.
  • Interestingly, the RTX 3090 GPU achieves performance levels comparable to the A100, which is impressive. On the other hand, the T4 GPU's performance is markedly lower, which is expected due to the absence of flash attention support.
  • The NVIDIA A100 GPU stands out as the leading choice for running vLLM based on our tests, with the RTX 3090 and T4 following in performance. Note that the H100, which would likely outperform the others, was not included in our tests. When choosing a GPU, consider the demands of the specific task (performance, power efficiency, and budget) as well as the operational environment's infrastructure, such as cooling and power supply.
  • Although the A100 provides unmatched performance, its higher cost may not be justifiable for all organizations. The RTX 3090 or T4 may represent more cost-effective alternatives that still strike a workable balance between performance and efficiency. However, a consumer-grade RTX 3090 may not be an option for most businesses.

Analysis 6: vLLM's Memory Usage, Throughput on Concurrent Requests

Until now, the analysis has focused on vLLM's offline inference. This section explores vLLM's ability to handle online inference by assessing its throughput with different numbers of concurrent requests. It is important to distinguish concurrent requests from batched requests: unlike batch processing, concurrency involves the server handling multiple simultaneous requests, each with a batch size of one, which is a scenario more reflective of real-time production environments.
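To make the distinction concrete, such a concurrency test can be driven against the vLLM server roughly as sketched below, using asyncio with a semaphore to cap the number of in-flight requests; the endpoint URL, model name, and prompts are assumptions.

```python
# Sketch of issuing many concurrent single requests (batch size 1 each) against
# a locally hosted vLLM OpenAI-compatible server.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(sem: asyncio.Semaphore, prompt: str) -> int:
    async with sem:  # cap the number of in-flight requests
        resp = await client.chat.completions.create(
            model="<your-awq-model>",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=400,
            temperature=0.0,
        )
        return resp.usage.completion_tokens  # output tokens, for throughput accounting

async def run(prompts: list[str], parallelism: int) -> None:
    sem = asyncio.Semaphore(parallelism)
    start = time.perf_counter()
    out_tokens = await asyncio.gather(*(one_request(sem, p) for p in prompts))
    elapsed = time.perf_counter() - start
    print(f"{parallelism} parallel: {sum(out_tokens) / elapsed:.1f} out tokens/sec")

# Example: 256 total requests at 32-way parallelism (mirroring our test setup).
# asyncio.run(run(["Summarize: ..."] * 256, parallelism=32))
```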

 

Configuration:

  • Inference method: vLLM + AWQ
  • Online inference
  • Max Input Tokens: 16384
  • Max Out Tokens: 400
  • Number of parallel requests: 1, 2, 4, 8, 16, 32, 64
  • Total number of requests: 256
  • GPU machine: 80GB A100

Key Observations

  • The vLLM framework exhibits excellent scalability, demonstrating efficient utilization of GPU resources as the number of parallel requests increases.
  • Compared to the A100, the T4 GPU has limited scalability, handling only up to 4 parallel requests before server errors occur, whereas the A100 can manage up to 32 concurrent requests.
  • Throughput on the T4 GPU starts at approximately 10 tokens/sec for a single request and rises to 12 tokens/sec with four parallel requests. In contrast, the A100's throughput jumps from 55 tokens/sec for a single request to 130 tokens/sec for 32 requests.

 

Conclusion

The introduction of the vLLM framework represents a significant advancement in optimizing LLM deployments for efficiency and scalability. The framework enhances memory usage and computational efficiency, particularly through its PagedAttention feature. While NVIDIA Triton is a strong competitor to vLLM, we chose vLLM due to a more favorable experience during the initial testing phase.

A100 GPUs, offering high scalability and throughput, are ideal for demanding applications but come with a high cost (up to $30,000 annually), limiting their use to high-volume, high-ROI projects. Conversely, T4 GPUs present a more budget-friendly option at $5,000 to $10,000, suitable for businesses with stricter inference budget constraints. The RTX 3090's performance, closely mirroring that of the A100, suggests the untapped potential of consumer-grade hardware.

The article shifts focus from batch sizes to managing concurrent requests, aligning more closely with real-world production scenarios. We hope it helps readers find a balance between computational efficiency and response speed in LLM applications and assists them in selecting the optimal GPU configuration for their needs by considering key factors such as input size, output size, incoming request volume, and budget constraints.

The article discusses the vLLM framework's benefits and its transformative role in generative AI, touching on its applications in diverse fields such as customer service and predictive analysis. However, it also notes a gap in research on GPU memory usage for vLLM, marking this as an area for future exploration.

About the Author

Rohith Mukku is an AI Researcher at Roots Automation, where he is developing a universal document understanding model with a focus on optimizing inference for large language and vision models. Prior to joining Roots, he earned his master's in computer science from New York University, with a focus on advancing Behavioral Cloning in robotics and evaluating the effects of red-teaming on large language models (LLMs). Rohith completed his undergraduate degree in computer science at IIT Kanpur and previously worked as a software engineer at Samsung R&D Institute Delhi, focusing on Tizen kernel and Visual Display applications.

 

References

[1] White, Jules, et al. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv:2302.11382, arXiv, 21 Feb. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2302.11382.

[2] Mangrulkar, Sourab, et al. “PEFT: State-of-the-Art Parameter-Efficient Fine-Tuning Methods.” PEFT, 2022, https://github.com/huggingface/peft.

[3] Hu, Edward J., et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, arXiv, 16 Oct. 2021. arXiv.org, https://doi.org/10.48550/arXiv.2106.09685.

[4] Kwon, Woosuk, et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180, arXiv, 12 Sept. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2309.06180.

[5] Overview — Tensorrt_llm Documentation. https://nvidia.github.io/TensorRT-LLM/overview.html. Accessed 9 July 2024.

[6] “Achieve 23x LLM Inference Throughput & Reduce P50 Latency.” Anyscale, https://www.anyscale.com/blog/continuous-batching-llm-inference. Accessed 9 July 2024.

[7] Leviathan, Yaniv, et al. Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192, arXiv, 18 May 2023. arXiv.org, https://doi.org/10.48550/arXiv.2211.17192.

[8] Frantar, Elias, et al. GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers. arXiv:2210.17323, arXiv, 22 Mar. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2210.17323.

[9] Lin, Ji, et al. AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978, arXiv, 23 Apr. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2306.00978.

[10] Bitsandbytes. https://huggingface.co/docs/bitsandbytes/main/en/index. Accessed 9 July 2024.

[11] Su, Jianlin, et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864, arXiv, 8 Nov. 2023. arXiv.org, https://doi.org/10.48550/arXiv.2104.09864.

[12] Chen, Shouyuan, et al. Extending Context Window of Large Language Models via Positional Interpolation. arXiv:2306.15595, arXiv, 28 June 2023. arXiv.org, https://doi.org/10.48550/arXiv.2306.15595.

[13] LocalLLaMA. "NTK-Aware Scaled RoPE Allows LLaMA Models to Have Extended (8k+) Context Size without Any Fine-Tuning and Minimal Perplexity Degradation." Reddit, 2023, https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/.

[14] Dao, Tri, et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135, arXiv, 23 June 2022. arXiv.org, https://doi.org/10.48550/arXiv.2205.14135.

[15] Lewis, Patrick, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401, arXiv, 2020. arXiv.org, https://doi.org/10.48550/arXiv.2005.11401.