Other Information
Abstract
Domain Generation Algorithms (DGAs) automatically produce large numbers of pseudo-random domain names for command-and-control (C2) communication, thereby posing a substantial challenge to network security mechanisms. While machine learning–based detectors have demonstrated high accuracy for DGAs represented in the training data, only 26% of prior studies explicitly evaluate cross-family generalization, and model performance frequently deteriorates when confronted with previously unseen DGA families. To address this gap, we develop a Random Forest classifier employing a split-ensemble training scheme, comprising 24 stratified sub-ensembles aggregated via majority voting. The model relies on 12 domain-level features, encompassing entropy-based, structural, linguistic, and sequential characteristics.
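As a rough illustration of two ingredients named above, the following minimal sketch computes a character-level Shannon entropy (one plausible entropy-based feature) and aggregates binary votes by simple majority. The function names, the exact entropy definition, and the tie-breaking rule are assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Character-level Shannon entropy, in bits, of a domain label.

    Assumption: the source only says "entropy-based" features; this
    particular definition is a common choice, not necessarily the author's.
    """
    if not label:
        return 0.0
    counts = Counter(label)
    n = len(label)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def majority_vote(votes: list[int]) -> int:
    """Aggregate binary votes (1 = DGA, 0 = benign) from sub-ensembles.

    Assumption: a tie falls back to 0 (benign). The source does not say
    how ties among the 24 sub-ensembles are resolved.
    """
    return 1 if sum(votes) * 2 > len(votes) else 0
```

With 24 voters an exact 12-12 split is possible, so any real implementation needs an explicit tie rule; the one used here is arbitrary.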
The experimental evaluation is conducted on 120 DGA families, of which 65 families are strictly held out from the training phase to emulate a realistic zero-day scenario. The proposed split-ensemble model attains a Matthews Correlation Coefficient (MCC) of 0.965 (95% CI: 0.963–0.967) on these zero-day families, with all 65 held-out families surpassing the generalization threshold of MCC = 0.70. Entropy-related features account for 60.2% of the aggregated feature importance and exhibit stable relative rankings across evaluation settings (Spearman's ρ = 1.0). The split-ensemble training strategy reduces performance degradation under distribution shift by 73% relative to a single-model baseline (ΔMCC = 0.095, p < 0.001).
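For readers unfamiliar with the metric, the sketch below computes the MCC from binary confusion-matrix counts using its standard definition. The function name and the zero-denominator convention are assumptions for illustration; the abstract does not describe how the reported values were computed.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Standard definition:
        (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Assumption: return 0.0 when the denominator is zero, a common
    convention when a whole row or column of the matrix is empty.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which is why a per-family threshold such as 0.70 can serve as a generalization criterion.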
The resulting false positive rate of 0.17% lies within an acceptable range for operational deployment. Overall, the findings indicate that diversity in the training process, rather than increased architectural complexity, is the primary factor governing generalization to zero-day DGA families. Consequently, a lightweight, interpretable ensemble model can achieve competitive zero-day detection performance, offering a resource-efficient and operationally transparent alternative to deep learning–based methods, and is thus well suited for integration into Security Operations Center workflows.
Keywords: Domain Generation Algorithm, zero-day detection, machine learning, Random Forest ensemble, cybersecurity, malware detection
Collection & Circulation
1 of 1 copies available