KAYTUS Enhances KSManage for Intelligent Management of Liquid-Cooled AI Data Centers

26 Jun 2025
SINGAPORE

KAYTUS, a leading provider of end-to-end AI and liquid cooling solutions, has announced the release of KSManage V2.3, an enhanced version of its device management platform for AI data centers. The latest version introduces expanded monitoring and control capabilities tailored for GB200 and B200 systems, including integrated liquid cooling detection features. Leveraging intelligent automation, KSManage V2.3 enables AI data centers to operate with greater precision, efficiency, and sustainability, delivering comprehensive, fine-grained management across IT infrastructure and maximizing overall performance.

As generative AI technology accelerates, AI data centers have emerged as critical infrastructure for enabling innovations in artificial intelligence and big data. Next-generation devices such as NVIDIA’s B200 and GB200 are being rapidly adopted to meet growing AI compute demands. However, their advanced architectures differ substantially from traditional systems, driving the need for more sophisticated management solutions. For instance, the GB200 integrates two B200 Blackwell GPUs with an Arm-based Grace CPU, creating a high-performance configuration that poses new management challenges. From hardware status monitoring to software scheduling, more precise and intelligent control mechanisms are essential to maintain operational efficiency. Moreover, the elevated computing power of these devices leads to higher energy consumption, increasing the risk of performance bottlenecks or even system outages in the event of failures. As a result, energy efficiency and real-time system monitoring have become mission-critical for ensuring the stability and sustainability of AI data center operations.

KSManage Provides Intelligent, Refined Management for AI Data Centers

KSManage builds on a wealth of experience in traditional device management and supports more than 5,000 device models. Its comprehensive management framework spans IT, network, security, and other infrastructure components. The platform enables real-time monitoring of critical server components, including CPU, memory, and storage drives. Leveraging intelligent algorithms, KSManage can predict potential faults, issue early warnings, and support preventive maintenance, helping ensure servers operate at peak performance and reducing the risk of unplanned downtime.

The upgraded KSManage delivers comprehensive monitoring of key performance indicators for GB200 and B200 devices, including GPU performance, CPU utilization, and memory bandwidth. Through 3D real-time modeling, it dynamically visualizes resource distribution and intelligently adjusts allocation based on workload demands. The platform also features automated network topology management, which enables real-time optimization of NVLink connectivity and contributes to a 90% boost in operational efficiency. During large model training, KSManage automatically allocates more computing resources to relevant tasks, optimizing the distribution of CPU, GPU, and other components. This ensures higher device utilization, improved computational efficiency, and significantly faster training times.

For intelligent fault detection, the upgraded KSManage introduces a three-tier monitoring framework spanning the component, machine, and cluster levels. At the component level, it leverages the PLDM protocol to enable precise monitoring of critical metrics such as GPU memory status. When computational errors are detected in B200 GPUs, KSManage rapidly analyzes error logs to distinguish between hardware faults and software conflicts, achieving over 92% accuracy in fault localization, and takes timely corrective action. At the machine level, KSManage integrates both BMC out-of-band logs and OS in-band logs to support fast and reliable hardware diagnostics. At the cluster level, federated management technology enables cross-domain alarm correlation and analysis, and triggers self-healing engines capable of responding to risks within seconds. The system also synchronizes with a high-precision liquid leak monitoring solution to enhance equipment safety. Collectively, these capabilities significantly reduce Mean Time to Repair (MTTR) and improve Mean Time Between Failures (MTBF), ensuring higher stability and resilience across AI data center operations.
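
For readers less familiar with the two reliability metrics named above, the following minimal Python sketch shows how MTBF and MTTR are conventionally computed and how they combine into steady-state availability. The function names and the sample figures are illustrative assumptions and do not come from KSManage.

```python
def mtbf_hours(total_uptime_hours: float, failure_count: int) -> float:
    """Mean Time Between Failures: operating time divided by number of failures."""
    if failure_count <= 0:
        raise ValueError("failure_count must be positive")
    return total_uptime_hours / failure_count


def mttr_hours(total_repair_hours: float, repair_count: int) -> float:
    """Mean Time to Repair: total repair time divided by number of repairs."""
    if repair_count <= 0:
        raise ValueError("repair_count must be positive")
    return total_repair_hours / repair_count


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability: fraction of time the system is operational."""
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    # Hypothetical figures: 1,980 h of uptime and 20 h of repair across 2 failures.
    mtbf = mtbf_hours(1980.0, 2)
    mttr = mttr_hours(20.0, 2)
    print(f"MTBF={mtbf} h, MTTR={mttr} h, availability={availability(mtbf, mttr):.4f}")
```

Shortening MTTR or lengthening MTBF both raise availability, which is why the two metrics are usually reported together.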

Intelligent Management of Green, Liquid-Cooled AI Data Centers

As power density in AI data centers continues to increase, cooling has become a critical factor influencing both device performance and operational lifespan. To address this challenge, liquid cooling technology—recognized for its high thermal efficiency—has been widely adopted across next-generation AI infrastructure.

The upgraded KSManage introduces a new liquid cooling detection feature that enhances both the efficiency and safety of liquid cooling operations in AI data centers. The system provides real-time monitoring of key parameters such as coolant flow rate, temperature, and pressure, ensuring stable and optimal performance of the liquid cooling infrastructure. By analyzing data from chip power consumption and cooling circuit pressure, KSManage employs a multi-objective optimization algorithm to dynamically adjust flow rates and calculate the optimal coolant distribution under varying workloads. Powered by AI-driven precision control, the platform achieves a 50% improvement in flow utilization and delivers up to 10% additional energy savings in the liquid cooling system.
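
The multi-objective algorithm itself is not described in this release. As a much simpler illustration of how chip power relates to coolant flow, the Python sketch below applies the standard heat-balance relation (heat removed = mass flow × specific heat × temperature rise) and then splits a fixed flow budget in proportion to each device's power draw. The water-like coolant properties, the function names, and the proportional split are all assumptions made for illustration, not KSManage behavior.

```python
from typing import List


def required_flow_lpm(power_w: float, delta_t_c: float,
                      cp_j_per_kg_k: float = 4186.0,
                      density_kg_per_l: float = 1.0) -> float:
    """Litres per minute of coolant needed to absorb power_w watts
    with a coolant temperature rise of delta_t_c degrees Celsius.
    Defaults approximate water (an assumption)."""
    if delta_t_c <= 0:
        raise ValueError("delta_t_c must be positive")
    mass_flow_kg_s = power_w / (cp_j_per_kg_k * delta_t_c)
    return mass_flow_kg_s / density_kg_per_l * 60.0


def distribute_flow(total_flow_lpm: float, powers_w: List[float]) -> List[float]:
    """Split a fixed flow budget across devices in proportion to power draw."""
    total_power = sum(powers_w)
    if total_power <= 0:
        raise ValueError("total power must be positive")
    return [total_flow_lpm * p / total_power for p in powers_w]


if __name__ == "__main__":
    # Hypothetical example: a 1,000 W device with a 10 degree C allowed rise.
    print(round(required_flow_lpm(1000.0, 10.0), 3), "L/min")
    print(distribute_flow(30.0, [1000.0, 2000.0]))
```

A production optimizer would also weigh pump energy and circuit pressure limits; this sketch covers only the thermal side.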

In addition, KSManage enhances operational reliability by providing real-time anomaly detection in the liquid cooling system. When issues such as abnormal flow rates, pressure fluctuations, temperature control failures, or condensation are detected, the system triggers instant alerts and delivers detailed fault diagnostics, enabling maintenance teams to quickly identify and resolve problems. In the event of a critical coolant leak, KSManage coordinates with the Coolant Distribution Unit (CDU) to deliver a millisecond-level response. Upon detection, the system immediately shuts off coolant flow and initiates an automatic power-down of the CDU, ensuring maximum protection of devices and infrastructure.
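
The alerting and leak-response behavior described above can be pictured with a minimal Python sketch. It assumes simple threshold checks and a leak flag supplied by a sensor. The class, the limits, and the action names are hypothetical and are not KSManage or CDU interfaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoolingReading:
    flow_lpm: float        # coolant flow rate
    pressure_kpa: float    # circuit pressure
    supply_temp_c: float   # coolant supply temperature
    dew_point_c: float     # dew point of the surrounding air
    leak_detected: bool    # leak sensor state


# Hypothetical operating limits, chosen only for illustration.
FLOW_RANGE_LPM = (20.0, 60.0)
PRESSURE_RANGE_KPA = (100.0, 300.0)
MAX_SUPPLY_TEMP_C = 45.0


def evaluate(reading: CoolingReading) -> List[str]:
    """Return the actions or alerts implied by one sensor reading."""
    # A leak takes priority: stop coolant flow first, then power down the CDU.
    if reading.leak_detected:
        return ["shut_off_coolant_flow", "power_down_cdu"]

    alerts = []
    if not FLOW_RANGE_LPM[0] <= reading.flow_lpm <= FLOW_RANGE_LPM[1]:
        alerts.append("alert:abnormal_flow")
    if not PRESSURE_RANGE_KPA[0] <= reading.pressure_kpa <= PRESSURE_RANGE_KPA[1]:
        alerts.append("alert:abnormal_pressure")
    if reading.supply_temp_c > MAX_SUPPLY_TEMP_C:
        alerts.append("alert:temperature_control_failure")
    # Condensation can form when coolant is at or below the air's dew point.
    if reading.supply_temp_c <= reading.dew_point_c:
        alerts.append("alert:condensation_risk")
    return alerts
```

Handling the leak case before any other check reflects the ordering in the text, where a leak triggers an immediate shut-off rather than a routine alert.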

For high-power devices such as the GB200 and B200, KSManage delivers fine-grained energy consumption management at the GPU level. It dynamically adjusts the Thermal Design Power (TDP) thresholds of H100/B200 GPUs, while integrating intelligent temperature regulation technologies—such as variable-frequency fluorine pumps—within the liquid cooling system. These optimizations help reduce Power Usage Effectiveness (PUE) to below 1.3. Additionally, the platform’s power-environment interaction module leverages AI algorithms to predict potential cooling system failures. Through synergistic optimization of computing power and energy consumption, KSManage reduces the power usage per cabinet by 20%, effectively lowering device failure rates and improving overall energy efficiency.
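
As a reminder of what the PUE figure measures, the short Python sketch below computes it as total facility power divided by IT equipment power and checks it against the 1.3 level mentioned above. The function names and the sample wattages are illustrative assumptions, not measured data.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power over IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("it_equipment_kw must be positive")
    return total_facility_kw / it_equipment_kw


def below_target(value: float, target: float = 1.3) -> bool:
    """True when the PUE value is strictly below the target."""
    return value < target


if __name__ == "__main__":
    # Hypothetical site: 1,000 kW of IT load plus 250 kW of cooling and other overhead.
    value = pue(1250.0, 1000.0)
    print(value, below_target(value))
```

A PUE of 1.0 would mean every watt goes to IT equipment, so cutting cooling overhead moves the ratio toward that floor.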

KSManage has been successfully deployed across a wide range of industries globally, including internet, finance, and telecommunications. With its intelligent, refined, and sustainable management capabilities, it has become an essential tool for overseeing device operations in AI data centers. In one notable case, an AI data center in Central Asia achieved more than a fourfold increase in operational efficiency by leveraging KSManage’s intelligent diagnostic features. Device fault handling time was also reduced by 80%. Monitoring and control of the liquid cooling system, together with firmware optimization, contributed to a 20% reduction in energy consumption. Additionally, the hardware service lifespan was extended by one to two years.

KSManage continues to play a critical role in ensuring the efficient, stable, and sustainable operation of AI data center infrastructure.

About KAYTUS

KAYTUS is a leading provider of end-to-end AI and liquid cooling solutions, delivering a diverse range of innovative, open, and eco-friendly products for cloud, AI, edge computing, and other emerging applications. With a customer-centric approach, KAYTUS is agile and responsive to user needs through its adaptable business model. Discover more at KAYTUS.com and follow us on LinkedIn and X.

© Business Wire, Inc.

Disclaimer:
This press release is not a document produced by Agence France-Presse (AFP). AFP bears no responsibility for its content. Please contact the persons or organizations named in the press release if you have any questions about it.