News and Updates
Understanding LLM Inference Speed
Curiosity
Jun 2, 2024


When a language model generates text, it processes one token at a time. The model calculates probabilities for each token in the vocabulary and selects the next token based on these probabilities. This process happens sequentially, without parallelism.
Understanding Language Model Inference Mechanics
When a language model generates tokens, it does so sequentially, one token at a time. Language models, specifically decoder-only transformer models, act as a function that takes a token as input and produces a probability distribution over all tokens in the vocabulary. The vocabulary typically consists of 50K-250K tokens, each representing a few characters. The model samples the next token from this distribution, appends it to the sequence, and repeats the process. This sequential generation means there is no parallelism across the tokens of a generated sequence.
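As a minimal sketch of that generation loop (assuming a hypothetical model callable that returns the next-token probability distribution; this is not any specific library's API):

import numpy as np

def generate(model, prompt_tokens, n_new_tokens, seed=0):
    # Autoregressive decoding: every new token depends on all the previous ones,
    # so this loop cannot be parallelized across output positions.
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        probs = model(tokens)                              # one forward pass -> probabilities over the vocabulary
        next_token = int(rng.choice(len(probs), p=probs))  # sample the next token
        tokens.append(next_token)                          # feed it back in and repeat
    return tokens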
During token processing, the language model performs two main operations: matrix-vector multiplication and attention computation. The model can access not only the current token's state but also internal states from all preceding tokens in the sequence. These states are stored in a structure known as the "KV-cache" (key-value cache), which holds a key vector and a value vector for each previous position in the text. In the attention computation, the query vector of the current token is dotted with the key vectors of all previous positions, and the resulting (softmax-normalized) scores are used as weights in a sum of all the value vectors from those positions.
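As a rough single-head illustration in NumPy (not the actual Llama kernel), note how each cached element is read once and used in roughly two floating-point operations:

import numpy as np

def attend_one_token(q, k_cache, v_cache):
    # q: (head_dim,) query vector of the current token
    # k_cache, v_cache: (n_prev_tokens, head_dim) keys and values of all previous positions
    scores = k_cache @ q / np.sqrt(q.shape[0])  # dot product with every cached key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over previous positions
    return weights @ v_cache                    # weighted sum of the cached value vectors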
Both matrix-vector multiplication and attention perform only about two floating-point operations (a multiply and an add) for every element read from the weight matrices or the KV-cache. Modern CPUs and GPUs, by contrast, can execute far more ALU operations (multiplications, additions) than they can read inputs from memory, as the flops to byte ratios of some common hardware show:
AMD Ryzen 7950X: With 67 GB/s memory bandwidth and 2735 GFLOPS, it boasts a 40:1 flops to byte ratio.
NVidia RTX 4090: Featuring 1008 GB/s memory bandwidth and 83 TFLOPS, it achieves an 82:1 flops to byte ratio.
NVidia H100: Its plain FP32 throughput gives a seemingly more modest 20:1 flops to byte ratio, but the tensor cores deliver approximately 494 TFLOPS without sparsity, a 147:1 flops to byte ratio for matrix multiplication workloads.
The imbalance grows even larger for smaller floating-point formats such as FP16 or FP8. For instance, the H100 tensor cores can theoretically reach 1979 TFLOPS for dense FP8 matrices, a flops to byte ratio of 590:1. In every one of these configurations, with or without tensor cores and regardless of the floating-point format, ALU throughput is abundant relative to memory bandwidth.
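These ratios follow directly from the published specs. A quick sanity check (the raw H100 SXM figures of roughly 67 TFLOPS FP32 and 3350 GB/s of HBM3 bandwidth are assumed here, since only the resulting 20:1 ratio is quoted above):

# Peak arithmetic throughput (GFLOPS) and memory bandwidth (GB/s) for the examples above.
specs = {
    "Ryzen 7950X (FP32)":       (2_735,        67),
    "RTX 4090 (FP32)":          (83_000,    1_008),
    "H100 (FP32)":              (67_000,    3_350),  # assumed spec-sheet values
    "H100 (TF32 tensor cores)": (494_000,   3_350),
    "H100 (FP8 tensor cores)":  (1_979_000, 3_350),
}
for name, (gflops, gbps) in specs.items():
    print(f"{name:26} ~{gflops / gbps:.0f} flops per byte read")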
Therefore, any workload that performs only about two operations per element read is bound by memory bandwidth, not compute. Given the model configuration, the KV-cache size, and the available bandwidth, we can estimate a lower bound on the time needed for inference.
Analyzing the theoretical bounds for Llama3
Take the smaller Llama3 model, which contains approximately 8 billion parameters. These break down roughly as follows:
525M parameters for the embedding matrix, which is not used in matrix-vector multiplication.
6979M parameters in the transformer layers, which process hidden states through attention and feed-forward networks.
525M parameters for converting the final hidden state into token probabilities, which is another matrix-vector multiplication.
In total, roughly 7.5B of the model's ~8.03B parameters are "active" in matrix-vector multiplications for each generated token. If the model stores matrix elements in FP16, around 15 GB of data needs to be read per token, and because the weights are far too large to fit in on-chip caches, the process cannot run faster than memory bandwidth allows.
For the attention computation, the model also needs to read the KV-cache for every position up to the current token. How much data that is depends on how many tokens have been processed so far: the system prompt, the user's prompts, previous model outputs, and, in extended chat sessions, many accumulated turns.
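To put a rough number on the cache traffic, here is a back-of-the-envelope estimate assuming the published Llama3 8B configuration (32 layers, 8 KV heads of dimension 128 thanks to grouped-query attention) and an FP16 cache:

# KV-cache bytes that must be read per generated token, for each token already in the context.
n_layers, n_kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2  # Llama3 8B, FP16 cache
kv_bytes_per_context_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem  # keys + values
print(f"{kv_bytes_per_context_token / 1024:.0f} KiB per context token")  # ~128 KiB
# With 8192 tokens of context, that is roughly 1 GiB of cache to read for every generated token.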
Based on these figures, we can put a lower bound on inference time. For example, on an NVidia RTX 4090 with a memory bandwidth of 1008 GB/s, reading 15 GB takes approximately 15 ms, so we should expect no better than roughly 15 ms per token at low position numbers, where the KV-cache contribution is still negligible. If 8-bit weights are used, reading 7.5 GB takes about 7.4 ms. These estimates represent the theoretical minimum time per token.
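The same arithmetic in a few lines, with the RTX 4090 bandwidth from above and the KV-cache ignored (i.e. short contexts only):

# Lower bound on per-token latency: bytes of weights read / memory bandwidth.
active_params = 7.5e9  # parameters touched by the matrix-vector multiplications
bandwidth = 1008e9     # RTX 4090, bytes per second
for fmt, bytes_per_param in [("FP16", 2), ("8-bit", 1)]:
    weight_bytes = active_params * bytes_per_param
    print(f"{fmt}: {weight_bytes / 1e9:.1f} GB -> {weight_bytes / bandwidth * 1e3:.1f} ms/token minimum")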
Conclusion
Because the amount of computation and memory access per token is known in advance, having a solid theoretical speed for the model is crucial. It helps select the target hardware, validate the quality of an implementation, and anticipate the consequences of hardware and architectural changes.
