JIEAS OPEN ACCESS

Journal of Industrial Engineering and Applied Science

ISSN: 3005-608X (print) | ISSN: 3005-6071 (online) | Publication Frequency: Bimonthly

OPEN ACCESS | Research Article | 4 February 2026

KV Cache and Inference Scheduling: Energy Modeling for High-QPS Services

* Corresponding Author: Wenwen Liu, E-Mail: liuwenwen.jessica@bytedance.com

Publication

Accepted 30 January 2026; Published 4 February 2026

Journal of Industrial Engineering and Applied Science, 2026, 4(1), 34-41.

Abstract

High-QPS (Queries Per Second) services, such as large language model (LLM) inference and real-time recommendation systems, are increasingly pervasive in AI-driven applications, but their energy consumption has become a critical challenge—accounting for up to 40% of data center operational costs. Existing optimization efforts primarily focus on latency reduction and throughput improvement, overlooking the intricate interplay between KV cache management (a core component of transformer-based model inference) and inference scheduling in energy efficiency. Traditional energy modeling methods (e.g., linear regression, hardware-centric power meters) fail to capture dynamic dependencies between cache behavior, scheduling policies, and workload volatility, leading to inaccurate energy predictions and suboptimal resource allocation. To address these gaps, this study proposes a hybrid energy modeling framework tailored for high-QPS services, integrating KV cache characteristics and inference scheduling dynamics. First, we construct a multi-dimensional energy factor system encompassing four core dimensions: KV Cache Configuration (e.g., cache size, eviction policy, hit ratio), Inference Scheduling Strategy (e.g., batching size, task prioritization, resource partitioning), System Environment (e.g., CPU/GPU utilization, memory bandwidth, power capping), and Workload Traits (e.g., QPS volatility, request complexity, sequence length distribution). Second, we design a two-stage modeling approach: a data-driven component (Gradient Boosting Tree, GBT) to capture non-linear relationships between factor interactions and energy consumption, and an analytical component (Queueing Theory-based latency-energy tradeoff model) to ensure QPS and latency constraints are satisfied. Third, we validate the framework using a real-world dataset from an LLM inference service (2022–2024) with QPS ranging from 5k to 30k, comparing it against three baseline methods. 
Experimental results show that the proposed framework outperforms traditional models: it achieves an energy prediction accuracy of 92.7% (vs. 78.3% for linear regression and 83.5% for hardware-centric modeling), reduces energy consumption by 18.9%–25.3% while maintaining target QPS and latency SLAs (Service Level Agreements), and identifies key optimization levers (e.g., adaptive KV cache resizing based on QPS fluctuations reduces energy by 14.2%). This study provides a practical tool for system administrators to balance performance and energy efficiency, supporting the development of sustainable high-QPS AI services.
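To make the analytical component of the framework concrete, the latency-energy tradeoff under batched inference can be sketched with a coarse M/M/1 approximation. Everything below — the sub-linear batch-throughput scaling law, the batch-forming delay term, and the power figures — is an illustrative assumption for exposition, not the paper's fitted model:

```python
from dataclasses import dataclass


@dataclass
class ServerConfig:
    base_service_rate: float  # requests/s served at batch size 1
    static_power_w: float     # static (idle) power draw, watts
    dynamic_energy_j: float   # incremental energy per request, joules


def pick_batch_size(cfg, arrival_rate, latency_sla_s, batch_sizes,
                    scaling=0.8):
    """Return the batch size minimizing per-request energy under a latency SLA.

    Assumes effective service rate mu(b) = base_service_rate * b**scaling
    (hypothetical sub-linear batching gain), an M/M/1 sojourn time
    1 / (mu - lambda), and a mean batch-forming delay of b / (2 * lambda).
    Per-request energy amortizes static power over service capacity:
    E(b) = static_power_w / mu(b) + dynamic_energy_j.
    """
    best_b, best_e = None, float("inf")
    for b in batch_sizes:
        mu = cfg.base_service_rate * b ** scaling
        if mu <= arrival_rate:
            continue  # queue is unstable at this batch size
        # batch-forming wait plus M/M/1 queueing + service time
        latency = b / (2.0 * arrival_rate) + 1.0 / (mu - arrival_rate)
        if latency > latency_sla_s:
            continue  # violates the latency SLA
        energy = cfg.static_power_w / mu + cfg.dynamic_energy_j
        if energy < best_e:
            best_b, best_e = b, energy
    return best_b


# Illustrative numbers: 200 W static draw, 0.5 J per request,
# 500 req/s offered load, candidate batch sizes 1..32.
cfg = ServerConfig(base_service_rate=100.0, static_power_w=200.0,
                   dynamic_energy_j=0.5)
tight = pick_batch_size(cfg, 500.0, 0.03, [1, 2, 4, 8, 16, 32])   # -> 16
loose = pick_batch_size(cfg, 500.0, 0.05, [1, 2, 4, 8, 16, 32])   # -> 32
```

The sketch reproduces the qualitative lever the abstract identifies: larger batches amortize static power over more work (lower energy per request), but the batch-forming delay eventually violates the latency SLA, so tightening the SLA from 50 ms to 30 ms forces a smaller, less energy-efficient batch.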

Keywords

KV Cache, Inference Scheduling, Energy Modeling, High-QPS Services, AI System Optimization, Energy Efficiency.

Metadata

Pages: 34-41

References: 11

Disciplines: Computer Science

Subjects: Artificial Intelligence

Cite This Article

APA Style

Liu, W. (2026). KV cache and inference scheduling: Energy modeling for high-QPS services. Journal of Industrial Engineering and Applied Science, 4(1), 34-41. https://doi.org/10.70393/6a69656173.333930

Acknowledgments

Not Applicable.

FUNDING

Not Applicable.

INSTITUTIONAL REVIEW BOARD STATEMENT

Not Applicable.

DATA AVAILABILITY STATEMENT

Not Applicable.

INFORMED CONSENT STATEMENT

Not Applicable.

CONFLICT OF INTEREST

Not Applicable.

AUTHOR CONTRIBUTIONS

Not Applicable.

References

1. Chen, Y., Zhang, S., & Li, J. (2023). Dynamic batching for throughput optimization in LLM inference. IEEE Transactions on Parallel and Distributed Systems, 34(7), 2015–2028.

2. Lee, H., Kim, S., & Park, J. (2022). Energy-aware cache eviction for edge AI inference. In Proceedings of the 2022 ACM SIGPLAN International Symposium on Memory Management (pp. 123–136). ACM.

3. Li, M., Wang, H., & Zhang, L. (2024). Energy modeling for large language model training: A data-driven approach. Journal of Parallel and Distributed Computing, 201, 56–70.

4. Liu, C., Yu, T., & Chen, W. (2022). Multi-tenant scheduling for GPU inference in cloud environments. IEEE Cloud Computing, 9(4), 89–97.

5. NVIDIA Corporation. (2023). NVIDIA system management interface (NVML) documentation. https://docs.nvidia.com/deploy/nvml-api/index.html

6. OpenAI. (2024). GPT-4 API documentation. https://platform.openai.com/docs/models/gpt-4

7. Prometheus. (2024). Prometheus monitoring system documentation. https://prometheus.io/docs/

8. Wang, C., Zhao, Y., & Li, S. (2024). Lossless KV cache compression for LLM inference. In Proceedings of the 2024 IEEE International Conference on Artificial Intelligence and Engineering Applications (pp. 345–352). IEEE.

9. Zhang, H., Liu, Y., & Wang, Z. (2023). Dynamic KV cache sizing for low-latency LLM inference. ACM Transactions on Intelligent Systems and Technology, 14(3), Article 1.

10. PyTorch. (2023). PyTorch 2.1 documentation. https://pytorch.org/docs/stable/index.html

11. XGBoost Developers. (2023). XGBoost 2.0 documentation. https://xgboost.readthedocs.io/en/stable/

PUBLISHER'S NOTE

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2025 The Author(s). Published by Southern United Academy of Sciences.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.