Ensemble of feature selection models for malware datasets

CÜREBAL, FARUK

dc.contributor.advisor	DAG, HASAN	en_US
dc.contributor.author	CÜREBAL, FARUK
dc.date.accessioned	2023-08-02T10:42:43Z
dc.date.available	2023-08-02T10:42:43Z
dc.date.issued	2022-09
dc.identifier.uri	https://hdl.handle.net/20.500.12469/4453
dc.description.abstract	While the development of technology has made our lives easier, our dependence on it has also increased. Cybercriminals develop various types of malware to exploit this dependence. Thus, malware classification is essential for security researchers and incident response teams to take action against malware and accelerate mitigation. In this study, we selected seven feature selection methods based on their popularity, effectiveness, and complexity: LOFO Importance (Leave One Feature Out), FRUFS (Feature Relevance based Unsupervised Feature Selection), AGRM (A General Framework for Auto-Weighted Feature Selection with Global Redundancy Minimization), MI (Mutual Information), the Chi-square test, mRMR (Minimum Redundancy and Maximum Relevance), and BoostARoota. We performed all the experiments in this study using the XGBoost (Extreme Gradient Boosting), RF (Random Forest), and HGB (Histogram-Based Gradient Boosting) machine learning classifiers, evaluated with the accuracy, F1-score, and AUC-score (Area under the ROC Curve) metrics. For the feature selection methods that have adjustable parameters, we measured parameter sensitivity on two high-dimensional datasets: the Microsoft Malware Prediction dataset and the API Call Sequences dataset. These methods and their parameters are FRUFS (model-c, random-state), BoostARoota (clf, iters), and LOFO (model). Among the adjustable parameters, only the ‘model’ parameter of the LOFO algorithm significantly affects the accuracy and F1-score results. We then compared the seven feature selection algorithms using two high-dimensional malware datasets: the Microsoft Malware Prediction dataset and the API Import dataset. Overall results show that AGRM obtained better metric results than the other feature selection methods. After AGRM, FRUFS, LOFO, MI, and mRMR each achieved the best results in different metrics.
Compared to MI and mRMR, LOFO is much less used in the malware domain, while FRUFS has not been used before. Since AGRM performs better and FRUFS and LOFO are newer than the other algorithms, we decided to continue our work with these three feature selection methods. Finally, we combined the three selected feature selection methods, LOFO Importance, FRUFS, and AGRM, to find the most important features and to work with fewer features by reducing dimensionality. We trained the three feature subsets produced by these methods with three classifiers, XGBoost, RF, and HGB, using a stacking ensemble on the Microsoft Malware Prediction dataset and the API Import dataset. From the nine prediction probabilities we obtained, we eliminated those containing the same information by setting a threshold on the correlation matrix. We passed the remaining prediction probabilities to an SVM (Support Vector Machine) meta-classifier. On one of the well-known malware datasets (the Microsoft Malware Prediction dataset), our model obtained an average of 1.2% better classification accuracy than the three selected feature selection methods. For the API Import dataset, our model obtained an average of 8% better classification accuracy than the LOFO and FRUFS feature selection algorithms; AGRM could not be used in that comparison due to insufficient RAM. Thus, our proposed model was trained with fewer features and achieved better results.	en_US
dc.language.iso	eng	en_US
dc.publisher	Kadir Has Üniversitesi	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Feature Selection	en_US
dc.subject	Ensemble	en_US
dc.subject	FRUFS	en_US
dc.subject	AGRM	en_US
dc.subject	LOFO	en_US
dc.subject	Malware Classification	en_US
dc.title	Ensemble of feature selection models for malware datasets	en_US
dc.type	masterThesis	en_US
dc.department	Institutes, School of Graduate Studies, Department of Business Administration	en_US
dc.relation.publicationcategory	Thesis	en_US
dc.identifier.yoktezid	766000	en_US
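
The abstract describes removing redundant base-model outputs by thresholding a correlation matrix before passing them to the meta-classifier. The following is a minimal sketch of that one step only, under stated assumptions: the column names, the toy probabilities, the 0.95 threshold, and the greedy keep-first rule are all hypothetical illustrations, not values or procedures taken from the thesis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column is treated as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Greedily keep a column only if its |r| with every kept column
    is at or below the threshold.

    `columns` maps a (hypothetical) name such as "LOFO+RF" to the
    probabilities predicted by that feature-subset/classifier pair.
    The threshold default is illustrative, not the thesis setting.
    """
    kept = {}
    for name, probs in columns.items():
        if all(abs(pearson(probs, other)) <= threshold
               for other in kept.values()):
            kept[name] = probs
    return kept


# Toy predicted probabilities for five samples (made-up numbers).
base_predictions = {
    "LOFO+XGBoost": [0.90, 0.20, 0.80, 0.10, 0.70],
    "LOFO+RF": [0.91, 0.21, 0.79, 0.12, 0.69],  # near-duplicate of the first
    "FRUFS+HGB": [0.60, 0.50, 0.90, 0.40, 0.20],
}

# The surviving columns would form the meta-classifier's input features.
meta_features = drop_correlated(base_predictions, threshold=0.95)
print(sorted(meta_features))
```

In this toy run the second column is discarded because it is almost perfectly correlated with the first. In the setting the abstract describes, the surviving columns would then be used to train the SVM meta-classifier.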

Files in this item

Name: Faruk_Cürebal.pdf
Size: 895.0Kb
Format: PDF
Description: Ensemble of Feature Selection ...

This item appears in the following Collection(s)

Tez Koleksiyonu [1348]
Thesis Collection