Publication: Preprocessing Impact Analysis for Machine Learning-Based Network Intrusion Detection
Date
2023
Publisher
Sakarya University
Abstract
Machine learning (ML) has been widely studied as a way to build intelligent systems in many problem domains. In cybersecurity, one application area is the development of intelligent intrusion detection systems (IDSs) that detect malicious network activity. However, developing an intelligent IDS is challenging because the literature offers many candidate methods, including different types of classification algorithms and preprocessing techniques. Identifying the best-fitting methods for intrusion detection would therefore help practitioners develop efficient detection systems. For this purpose, this study conducted extensive experiments using the support vector machine (SVM) classifier, an SVM-based feature selection (FS) technique, several data normalisation techniques, and a classifier optimisation algorithm to analyse the impact of preprocessing techniques on classification. These methods were tested on three open network intrusion datasets: NSL-KDD, UNSW-NB15, and CICIDS2017. The results were then analysed to investigate each method’s impact on model performance and to extract insights for building intelligent IDSs. The optimised model achieved an accuracy of 81.51% with two features, 85.27% with 32 features, and 99.43% with 16 features on the NSL-KDD, UNSW-NB15, and CICIDS2017 testing datasets, respectively. Furthermore, the results showed that data preprocessing improved classification performance and that the log-scaling normalisation technique outperformed z-score and min-max normalisation. Additionally, the results suggested that SVM-based FS improved classification performance and significantly reduced model complexity. The study also concluded that classifier optimisation could enhance the performance of a classifier-dependent FS technique such as SVM FS.
However, it was observed that an inadequate feature set in the classifier optimisation process could result in worse performance; this problem must therefore be addressed during the optimisation process for the optimisation to be accurate. In conclusion, this study provided insights into data preprocessing in ML applications and showed the significance of data preprocessing for building accurate and efficient IDSs. © 2025 Elsevier B.V., All rights reserved.
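
The three normalisation techniques compared in the abstract can be illustrated with a minimal sketch. This is not the study's implementation: the function names are hypothetical, log-scaling is assumed here to mean log(1 + x) applied to non-negative feature values (the abstract does not give the exact formula), and the z-score uses the population standard deviation. The sample values are invented purely to show how each technique treats a heavily skewed feature of the kind common in network traffic data.

```python
import math
from statistics import fmean, pstdev


def log_scale(values):
    """Log-scaling, ASSUMED here to be log(1 + x) for non-negative inputs."""
    return [math.log1p(v) for v in values]


def z_score(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant feature carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def min_max(values):
    """Min-max: rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has zero range; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Invented, heavily skewed sample (for example, byte counts per connection).
    sample = [0.0, 10.0, 100.0, 1_000_000.0]
    for name, fn in (("log", log_scale), ("z-score", z_score), ("min-max", min_max)):
        print(name, [round(x, 4) for x in fn(sample)])
```

On a skewed feature like this, min-max and z-score leave the small values crowded together next to one extreme value, whereas the logarithm spreads them out. This is one plausible reason log-scaling might suit such data, although the abstract reports only the outcome and not the cause.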
