
JIEAS OPEN ACCESS
Journal of Industrial Engineering and Applied Science
ISSN:3005-608X (print) | ISSN:3005-6071 (online) | Publication Frequency: Bimonthly
KV Cache and Inference Scheduling: Energy Modeling for High-QPS Services
* Corresponding Author: Wenwen Liu, E-mail: liuwenwen.jessica@bytedance.com
Publication
Accepted 2026 January 30 ; Published 2026 February 4
Journal of Industrial Engineering and Applied Science, 2026, 4(1), 34-41.
Abstract
High-QPS (queries per second) services, such as large language model (LLM) inference and real-time recommendation systems, are increasingly pervasive in AI-driven applications, but their energy consumption has become a critical challenge, accounting for up to 40% of data center operational costs. Existing optimization efforts focus primarily on latency reduction and throughput improvement, overlooking how KV cache management (a core component of transformer-based inference) and inference scheduling jointly shape energy efficiency. Traditional energy modeling methods (e.g., linear regression, hardware-centric power meters) fail to capture the dynamic dependencies among cache behavior, scheduling policies, and workload volatility, leading to inaccurate energy predictions and suboptimal resource allocation. To address these gaps, this study proposes a hybrid energy modeling framework tailored to high-QPS services that integrates KV cache characteristics and inference scheduling dynamics. First, we construct a multi-dimensional energy factor system spanning four core dimensions: KV cache configuration (e.g., cache size, eviction policy, hit ratio), inference scheduling strategy (e.g., batch size, task prioritization, resource partitioning), system environment (e.g., CPU/GPU utilization, memory bandwidth, power capping), and workload traits (e.g., QPS volatility, request complexity, sequence length distribution). Second, we design a two-stage modeling approach: a data-driven component (gradient boosting trees, GBT) that captures non-linear relationships between factor interactions and energy consumption, and an analytical component (a queueing-theory-based latency-energy tradeoff model) that ensures QPS and latency constraints are satisfied. Third, we validate the framework on a real-world dataset from an LLM inference service (2022–2024) with QPS ranging from 5k to 30k, comparing it against three baseline methods.
Experimental results show that the proposed framework outperforms traditional models: it achieves an energy prediction accuracy of 92.7% (vs. 78.3% for linear regression and 83.5% for hardware-centric modeling), reduces energy consumption by 18.9%–25.3% while maintaining target QPS and latency SLAs (service level agreements), and identifies key optimization levers (e.g., adaptive KV cache resizing based on QPS fluctuations, which alone reduces energy by 14.2%). This study provides a practical tool for system administrators to balance performance and energy efficiency, supporting the development of sustainable high-QPS AI services.
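The analytical component described above balances latency against energy under a queueing model. As a minimal sketch of the idea (not the paper's actual model), the following code treats a serving instance as an M/M/1 queue and picks the smallest provisioned service capacity that still meets a latency SLA, then reports the power draw under a simple linear utilization-to-power model. All function names, the candidate-capacity search, and the idle/peak wattage figures are illustrative assumptions.

```python
def mm1_latency(arrival_qps: float, service_qps: float) -> float:
    """Mean response time (seconds) of an M/M/1 queue; requires utilization < 1."""
    if arrival_qps >= service_qps:
        raise ValueError("unstable: arrival rate must be below service rate")
    return 1.0 / (service_qps - arrival_qps)

def power_watts(utilization: float, p_idle: float = 150.0, p_peak: float = 400.0) -> float:
    """Linear power model: idle draw plus utilization-proportional dynamic draw."""
    return p_idle + (p_peak - p_idle) * utilization

def min_energy_capacity(arrival_qps: float, latency_sla_s: float,
                        capacities_qps: list[float]) -> tuple[float, float]:
    """Choose the smallest capacity option meeting the latency SLA.

    Returns (chosen capacity in QPS, resulting power draw in watts).
    Lower capacity means higher utilization, hence lower idle waste,
    which is the latency-energy tradeoff the analytical model captures.
    """
    for mu in sorted(capacities_qps):
        if arrival_qps < mu and mm1_latency(arrival_qps, mu) <= latency_sla_s:
            return mu, power_watts(arrival_qps / mu)
    raise ValueError("no capacity option satisfies the SLA")
```

For example, at 10k QPS with a 5 ms mean-latency SLA, a 10.5k-QPS instance already satisfies the constraint (mean latency 1/500 = 2 ms) at about 95% utilization, so larger, more energy-hungry options are never selected.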
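One optimization lever identified above is adaptive KV cache resizing driven by QPS fluctuations. The sketch below shows one plausible shape of such a policy: scale the cache allocation proportionally to observed load, clamped to hardware bounds, so that low-traffic periods free HBM for larger batches. The proportional rule, parameter names, and bounds are assumptions for illustration, not the paper's published policy.

```python
def adaptive_cache_size(current_qps: float, baseline_qps: float,
                        base_cache_gb: float,
                        min_gb: float = 2.0, max_gb: float = 64.0) -> float:
    """Scale the KV cache allocation with QPS, clamped to [min_gb, max_gb].

    base_cache_gb is the allocation tuned for baseline_qps; when traffic
    drops, the cache shrinks proportionally and the freed memory can be
    reclaimed by the scheduler, cutting per-query energy.
    """
    scaled = base_cache_gb * (current_qps / baseline_qps)
    return max(min_gb, min(max_gb, scaled))
```

For instance, with a 16 GB cache tuned for 10k QPS, a dip to 5k QPS would halve the allocation to 8 GB, while a spike to 30k QPS would grow it to 48 GB, still under the 64 GB ceiling.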
Keywords
KV Cache, Inference Scheduling, Energy Modeling, High-QPS Services, AI System Optimization, Energy Efficiency
Metadata
Pages: 34-41
References: 11
Disciplines: Computer Science
Subjects: Artificial Intelligence
Cite This Article
APA Style
Liu, W. (2026). KV cache and inference scheduling: Energy modeling for high-QPS services. Journal of Industrial Engineering and Applied Science, 4(1), 34-41. https://doi.org/10.70393/6a69656173.333930
Acknowledgments
Not Applicable.
FUNDING
Not Applicable.
INSTITUTIONAL REVIEW BOARD STATEMENT
Not Applicable.
DATA AVAILABILITY STATEMENT
Not Applicable.
INFORMED CONSENT STATEMENT
Not Applicable.
CONFLICT OF INTEREST
Not Applicable.
AUTHOR CONTRIBUTIONS
Not Applicable.
PUBLISHER'S NOTE
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Copyright © 2025 The Author(s). Published by Southern United Academy of Sciences. This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.