This is our public malware dataset of Windows OS API calls, generated with Cuckoo Sandbox. It is provided in CSV format for cybersecurity researchers working on malware analysis and machine learning applications.
If you find these results useful, please cite them:
author = "Catak, FÖ. and Yazi, AF.",
title = "A Benchmark API Call Dataset for Windows PE Malware Classification",
year = "2019",
url = "https://arxiv.org/abs/1905.01999",
note = "[arXiv:1905.01999]"
This study seeks to obtain data that will help address gaps in machine learning-based malware research. Its specific objective is to build a benchmark dataset of Windows operating system API calls made by various malware. This is the first study to use metamorphic malware to build sequential API call data. We hope that this research will contribute to a deeper understanding of how metamorphic malware changes its behaviour (i.e. API calls) by adding meaningless opcodes with its own disassembler/assembler parts.
Malware Types and System Overview
In our research, we mapped the family labels reported by each antivirus product to 8 main malware families: Trojan, Backdoor, Downloader, Worms, Spyware, Adware, Dropper, and Virus. Table 1 shows the number of samples in each malware family in our dataset. As the table shows, the sample counts are quite close to each other for every family except Adware. The difference arises because we did not find many samples belonging to the Adware family.
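As an illustration of this kind of label consolidation, the following Python sketch maps raw antivirus labels to the eight families named above and picks the most frequent one. The keyword rules, their priority order, and the majority-vote rule are all assumptions made for this example; the exact rule used to build the dataset is not described here.

```python
from collections import Counter

# Hypothetical rules: (main family, keyword looked for in the raw label).
# More specific keywords come first so that a label such as
# "Trojan-Downloader.Agent" is counted as Downloader rather than Trojan.
FAMILY_RULES = [
    ("Downloader", "downloader"),
    ("Dropper", "dropper"),
    ("Backdoor", "backdoor"),
    ("Spyware", "spy"),
    ("Adware", "adware"),
    ("Worms", "worm"),
    ("Virus", "virus"),
    ("Trojan", "trojan"),
]


def normalize_label(label):
    """Map one raw antivirus label to a main family, or None if nothing matches."""
    if not label:
        return None
    lowered = label.lower()
    for family, keyword in FAMILY_RULES:
        if keyword in lowered:
            return family
    return None


def majority_family(labels):
    """Return the most common main family among the raw labels, or None."""
    votes = Counter(
        family for family in map(normalize_label, labels) if family is not None
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For example, `majority_family(["Worm.Win32.A", "Net-Worm.B", "Trojan.Generic"])` counts two votes for Worms and one for Trojan. Ties are resolved by whichever family was seen first, which a real pipeline would probably want to handle more deliberately.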
The figure shows the general flow for generating the malware dataset. As shown in the figure, we first obtained the MD5 hash values of the malware samples we collected from GitHub. We then looked up these hash values using the VirusTotal API and obtained each sample's family labels from the reports of 67 different antivirus engines in VirusTotal. We observed that the families reported by these 67 antivirus engines differ from one another.
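The first two steps of this flow can be sketched in Python as follows. This is a minimal sketch, not the code used to build the dataset: the function names are hypothetical, and the report layout (`data.attributes.last_analysis_results`) is assumed to follow the VirusTotal v3 file report, which may differ from the API version used at the time. The HTTP request itself is left out, so the second function only parses a report that has already been fetched.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def engine_labels(report):
    """Collect {engine name: label} from a report dict.

    Assumes the v3-style layout data -> attributes -> last_analysis_results.
    Engines that reported no label are skipped.
    """
    results = (
        report.get("data", {})
        .get("attributes", {})
        .get("last_analysis_results", {})
    )
    return {
        engine: entry["result"]
        for engine, entry in results.items()
        if entry.get("result")
    }
```

Because different engines name the same sample differently, the dictionary returned by `engine_labels` will usually contain several distinct labels per sample, which is why a consolidation step into the main families is needed.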