AN ENGINEERING APPROACH TO ARTIFICIAL INTELLIGENCE–BASED ANOMALY DETECTION IN CLOUD COMPUTING ENVIRONMENTS
Abstract
The reliability and security of large-scale cloud computing infrastructures depend critically on the timely identification of operational anomalies. While Artificial Intelligence (AI) and Machine Learning (ML) techniques have demonstrated significant algorithmic promise in detecting irregularities, translating these algorithms into robust, production-grade systems remains a substantial engineering challenge. This research reformulates AI-based anomaly detection not merely as a classification task but as a systems engineering problem constrained by scalability, latency, and resource heterogeneity. Adopting this engineering perspective, the study proposes a modular, scalable architecture that integrates heterogeneous data streams (system logs, distributed traces, and metric time series) into a cohesive detection framework. It investigates the architectural decomposition required to support high-throughput inference, the trade-offs between model complexity and operational latency, and the validation criteria necessary for deployment in dynamic environments. Findings indicate that architectural choices, specifically decoupling state management from inference logic and implementing adaptive sampling, influence system reliability more than marginal gains in algorithmic precision. This article contributes a formalized engineering methodology for designing, validating, and sustaining AI-driven anomaly detection systems in cloud environments, bridging the gap between theoretical ML efficacy and practical operational resilience.