CN110704840A

CN110704840A - Convolutional neural network CNN-based malicious software detection method

Info

Publication number: CN110704840A
Application number: CN201910854560.6A
Authority: CN
Inventors: 芦天亮; 杜彦辉; 李国友; 傅依娴; 吴警; 张翼翔; 暴雨轩
Original assignee: CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
Current assignee: CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
Priority date: 2019-09-10
Filing date: 2019-09-10
Publication date: 2020-01-17

Abstract

The invention provides a convolutional neural network (CNN)-based malicious software detection method, which comprises the following steps: step 1: collecting and analyzing a training set, and generating report files in json format through a Cuckoo sandbox; step 2: vectorizing the json reports to obtain feature vectors; step 3: feeding the feature vectors obtained in step 2 as input into an untrained CNN for training and learning to obtain a trained CNN; and step 4: processing the software to be tested in the same way as the training set in steps 1 and 2 to obtain its feature vector, feeding the feature vector into the CNN trained in step 3, and judging through CNN model detection whether the software to be tested is malicious software or normal software. Compared with other machine learning algorithms and antivirus software, the method provided by the application achieves a better detection rate and accuracy.

Description

Convolutional neural network CNN-based malicious software detection method

Technical Field

The application relates to the field of network security, and in particular to a convolutional neural network (CNN)-based malicious software detection method.

Background

The explosive growth of malicious software and the serious threat it poses to user machines and the network environment have gradually become the central problem in the field of cyberspace security.

The malware detection method is currently mainly divided into two phases: a feature extraction stage and a detection stage. In the feature extraction stage, the extracted features mainly include static features and dynamic features, the corresponding extraction means are respectively a static feature extraction technology and a dynamic feature extraction technology, and the corresponding detection methods are respectively a static detection method and a dynamic detection method.

The static detection method is mainly represented by static signatures. Static signature-based malware detection exploits the static properties of programs to distinguish benign software from malware. It requires examining malware and creating a different signature for each newly discovered sample. The signature may be based on bytecode, binary assembly instructions, imported Dynamic Link Libraries (DLLs), or function and system calls. Schultz W et al combine features such as malware resource segments, DLLs, program string constants, and bytecode information with data mining algorithms for static analysis.

In dynamic analysis, benign software and malware are distinguished by software behavior, which involves two key problems: behavior characterization and software behavior analysis techniques. In behavior characterization research, software behavior is generally described using API call sequences, system feature counts, or software code. In research on software behavior analysis techniques, the behavior feature descriptions are mainly extracted by dynamic analysis. Abroad, Bayer U et al implemented a sandbox named TTAnalyze to run samples and analyze their execution flow information; Willem obtained malicious code behavior analysis reports using CWSandbox, designed in the distributed systems laboratory of the University of Mannheim, Germany, and proposed a learning method based on the Vector Space Model (VSM).

In early work in China, HanXiao et al identified the malicious behavior type of a program based on user-layer function calls combined with machine learning algorithms such as SVM. XiaoLinGui et al extracted the traffic information of a running program to identify its malicious behavior. Bye et al proposed a fuzzy inference method that calculates the probability that a program is malicious based on Bayes' theorem. In recent years, researchers have increasingly tried to apply deep learning algorithms to malware detection, which has become an active research trend. Saxe et al used a feedforward neural network to classify static analysis results; however, they did not consider dynamic analysis results in their study, and when binaries are obfuscated, static analysis may not provide a satisfactory classification output. Huang et al, on the other hand, used feedforward neural networks of up to four layers but focused on evaluating multi-task learning.

At present, traditional malware analysis and detection technology cannot effectively protect the security of computer systems and the Internet, mainly for the following reasons:

(1) Malware authors use various obfuscation techniques to generate new variants of the same malware in order to circumvent signature-based identification. Taking antivirus software as an example of traditional protection technology, current antivirus software mainly uses signature-based detection; its defect is that it can only detect known malware and cannot detect malware modified by code obfuscation and polymorphic techniques.

(2) With the rapid increase in the amount of malicious software, conventional signature-based static scanning and behavior-based malware detection are prone to false positives and false negatives, and increasingly fail to meet new requirements in the field of information security.

(3) In malware detection based on traditional machine learning algorithms, most machine learning structures are shallow and cannot train high-level abstract models of malware, so the detection effect needs to be improved.

Disclosure of Invention

Aiming at the defects existing in the traditional malicious software detection, the invention provides a novel malicious software detection method based on a Convolutional Neural Network (CNN).

A convolutional neural network CNN-based malware detection method, the method comprising the steps of:

step 1: collecting and analyzing a training set, wherein the training set is composed of normal software samples and malicious software samples, and the analyzing comprises generating report files in json format from the training set through a Cuckoo sandbox;

step 2: extracting dynamic API sequences from the json reports, and vectorizing the extracted software features to obtain feature vectors;

step 3: constructing a convolutional neural network (CNN) model, feeding the feature vectors obtained in step 2 as input into the untrained CNN model for training and learning, and training the CNN to an optimal state by adjusting parameters to obtain a trained CNN;

step 4: processing the software to be tested in the same way as the training set in steps 1 and 2 to obtain a feature vector of the software to be tested, feeding the feature vector into the CNN model trained in step 3, and judging through CNN model detection whether the software to be tested is malicious software or normal software.

Step 2 specifically comprises: writing a Python script to extract API behaviors from the json report, generating a txt document, and converting the txt document into a feature vector through One-Hot encoding.

Further, the convolutional neural network CNN model in step 3 includes an input layer, a convolutional layer, a pooling layer, and an output layer, where the convolutional layer converts text information into picture features, and specifically includes the following operations:

let $d^{(l)}$ and $d^{(l-1)}$ be the output and input of the $l$-th layer, respectively, i.e. $d^{(0)}$ is the matrix of the original image and $d^{(L)}$ is the output of the last layer $L$; let $N_I^{(l)}$ and $N_O^{(l)}$ be the numbers of input and output feature matrices of the $l$-th layer, respectively; the input of the $l$-th layer is the output of the $(l-1)$-th layer, i.e. $N_I^{(l)} = N_O^{(l-1)}$; denote the $j$-th output feature matrix of the $l$-th layer as $d_j^{(l)}$; then the $i$-th input feature matrix of the $l$-th layer is also the $i$-th output feature matrix of the $(l-1)$-th layer, denoted as $d_i^{(l-1)}$; the output calculation formula of the convolutional layer is as follows:

$$d_j^{(l)} = f\left(\sum_{i} d_i^{(l-1)} * w_{ij}^{(l)} + b_j^{(l)}\right)$$

wherein $0 \le i \le N_I^{(l-1)}$, $0 \le j \le N_O^{(l-1)}$, and the mapping $f$ is a nonlinear activation function, namely the sigmoid function;

$w_{ij}^{(l)}$ is the weight matrix in the convolution kernel connecting the $i$-th input feature matrix to the $j$-th output feature matrix;

$b_j^{(l)}$ is the offset term of the $j$-th output feature matrix of the $l$-th layer.

Drawing on the application of convolutional neural networks in image and natural language processing, the method uses a convolutional neural network to detect malicious software, and uses One-Hot encoding in place of the commonly used Word2Vec Skip-gram model, achieving a better software detection effect.

Drawings

FIG. 1 is a CNN-based malware detection flow;

FIG. 2 is a Cuckoo Sandbox structure;

FIG. 3 is an overview of a convolutional neural network architecture;

FIG. 4 is a convolution and pooling layer;

FIG. 5 is a comparison of accuracy under different filters.

Detailed Description

In order to solve the problems in the prior art, the application provides a method for detecting and analyzing malicious software based on a convolutional neural network (CNN). Specifically, drawing on the application of convolutional neural networks in image and natural language processing, the application uses a convolutional neural network to detect malicious software, and uses One-Hot encoding in place of the commonly used Word2Vec Skip-gram model, achieving a better software detection effect.

The methods for detecting and analyzing malware proposed by the present application are described in detail below.

1 System framework

The present application detects and analyzes malware by using Convolutional Neural Networks (CNN). Fig. 1 is a CNN-based malware detection flow.

As shown in FIG. 1, in step 1 after the start, the training set is collected and analyzed; this part of the work corresponds to the "collect reports" stage of FIG. 1. The training set is composed of normal PE (Portable Executable) file samples and malicious PE file samples. A normal PE file sample is non-malicious, relatively safe software; a malicious PE file sample is malware that is harmful to the computer. The collected PE file samples, i.e. the training PE file set, are submitted by the Host (host machine) of the Cuckoo sandbox to the Guest (guest machine) for analysis; the Guest returns its monitoring records, and the Host generates a report file in json format through an analysis component. Normal and malicious PE file samples are both collected so that the deep learning CNN model can perform supervised learning in step 3 below and fully learn the features of normal samples and of malicious samples, and can therefore effectively distinguish normal from malicious PE samples under test.

In step 2, dynamic API sequences are extracted from the collected json reports; this part of the work corresponds to the "data preprocessing" stage of FIG. 1. A Python script is written to extract API behaviors and export them to a txt document, and the API sequence is vectorized through a word vector model, so that the function information on each line of the txt document is expressed with 0s and 1s.

In step 3, a convolutional neural network (CNN) model is constructed, the feature vectors preprocessed in step 2 are fed as input into the untrained CNN model (i.e. the initialized CNN) for training and learning, and the model is trained to an optimal state through parameter adjustment to obtain the trained CNN. The learning rate, convolution kernel stride, dropout rate, activation function, and bias term of the model can be adjusted; after that, several filter heights are set for experimental comparison, and the parameter values that perform best across these experiments are selected, which represents the convolutional neural network trained to its optimal state, as shown in the technical effects section below.

The CNN model mainly comprises an input layer, a convolutional layer, a pooling layer, and an output layer. The input layer receives the incoming feature vectors; the convolutional layer mainly converts the output of the word vector model into a two-dimensional matrix similar to an image; the pooling layer extracts the more salient features from the two-dimensional matrix, which are finally fed into a softmax function for classification. This part of the work corresponds to the "convolutional neural network training" stage of FIG. 1.

In step 4, the malware detection stage, the PE files to be tested, i.e. the test set, undergo the same processing as in step 1 (collecting reports) and step 2 (data preprocessing); the resulting feature vectors are fed into the CNN model trained in step 3, and CNN model detection judges whether each PE file to be tested is malicious software or normal software. This part of the work corresponds to the "malware detection" stage of FIG. 1.

The CNN-based malware detection process of the present application is briefly described above, and steps 1 to 4 of the steps shown in fig. 1 are described in detail below.

2 Collecting reports

Collecting reports is the process of collecting and analyzing the training set. The purpose of this step is to let the deep learning CNN model perform supervised learning in step 3 and fully learn the features of normal samples and of malicious samples, so that it can finally distinguish normal from malicious PE samples under test.

The collected training set is composed of normal PE (Portable Executable) file samples and malicious PE file samples. A normal PE file sample is non-malicious, relatively safe software; a malicious PE file sample is malware that is harmful to the computer. The collected PE file samples, i.e. the training PE file set, are submitted by the Host (host machine) of the Cuckoo sandbox to the Guest (guest machine) for analysis; the Guest returns its monitoring records, and the Host generates a report file in json format through an analysis component.

A Cuckoo Sandbox environment needs to be set up before the sample reports are collected. FIG. 2 shows the structure of Cuckoo Sandbox. As shown in FIG. 2, Cuckoo Sandbox mainly consists of two parts, the Host Machine and the Guest Machine, which communicate with each other over a virtual network. The Host Machine contains the Cuckoo Sandbox software, the virtual machine software VirtualBox, and various analysis components, and is mainly responsible for starting the analysis of a sample, monitoring behavior, generating report files, and so on. The Guest Machine is primarily responsible for executing the malware and reporting the analysis results to the Host Machine.

The work flow of Cuckoo Sandbox is as follows:

First, Cuckoo Sandbox executes the main program script and at the same time starts the Guest Machine virtual machine.

The PE file sample to be analyzed is then sent into the Guest Machine for execution by means of the upload script and the monitoring script.

The PE file sample to be analyzed is executed in the Guest Machine while the monitoring script records various information about the sample.

After the sample has been executed, the monitoring script sends the records to the Host Machine, where the virtual machine software is located, through the virtual network and the sharing function of the virtual machine software. The results are recorded and a report file in json format is generated through the analysis component. The report file contains the behavior information of the malicious software; at this point, the behavior of the malicious software has been successfully extracted by the behavior extraction engine.

Finally, the virtual machine software uses a snapshot to restore the Guest Machine virtual machine to its initial state.

3 Data preprocessing

Because the convolutional neural network model cannot take raw text as input and can only process numerical data, the purpose of this step is to convert the extracted description of software behavior into data that the convolutional neural network model can process.

Data preprocessing extracts the dynamic API sequences from the collected json reports: a Python script is written to extract the API behaviors, and the API sequences are then vectorized through a word vector model so that they take a form the convolutional neural network model can process.

The specific operation of extracting the API behavior is as follows: the sequence of API function calls extracted by Python is stored in a txt document, with each line of the document representing one API call. Each line is divided into two parts separated by a space. The first part corresponds to the category field in the original json report, and the second part is the name of the called API.

For example, shown below is a json analysis report fragment:

Referring to the json analysis report fragment above, the main information extracted is the "category" and "api" fields of the fragment; other parameter information is omitted. Because json files are primarily composed of dictionaries and lists, the extraction first locates the "category" and "api" fields in the json fragment and then extracts the specific function information. The following is the extraction code fragment.
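As a minimal sketch of this extraction step (not the applicant's original script), the following Python assumes that the report nests API calls under `behavior` → `processes` → `calls`, as Cuckoo 2.x reports do, and that each call carries `category` and `api` fields. The function name `extract_api_calls` and the sample report are hypothetical and serve only as illustration.

```python
import json


def extract_api_calls(report):
    """Return one 'category api' line per API call in a parsed report dict."""
    lines = []
    # Assumed layout: report["behavior"]["processes"][i]["calls"][j]
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            category = call.get("category")
            api = call.get("api")
            if category and api:
                # Other fields (arguments, return values, ...) are ignored
                lines.append(f"{category} {api}")
    return lines


# Hypothetical report fragment, for illustration only
sample_report = json.loads("""
{"behavior": {"processes": [{"calls": [
  {"category": "system", "api": "LdrGetProcedureAddress", "arguments": {}},
  {"category": "process", "api": "NtAllocateVirtualMemory", "arguments": {}}
]}]}}
""")

txt_lines = extract_api_calls(sample_report)
print("\n".join(txt_lines))
```

In practice the report would be loaded with `json.load` from the file produced by the sandbox, and the resulting lines written to the txt document.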

In the json analysis report fragment, the content of the "category" field is "system" and the content of the "api" field is "LdrGetProcedureAddress"; combined, they form "system LdrGetProcedureAddress", which constitutes one line of the txt document. The whole json report records a complex behavior process, and all function call sequences are finally assembled into txt format. A truncated txt fragment is as follows:

.......

system LdrGetProcedureAddress

process NtAllocateVirtualMemory

exception SetUnhandledExceptionFilter

registry RegOpenKeyExW

registry RegQueryValueExW

registry RegCloseKey

resource FindResourceExW

resource FindResourceExA

resource FindResourceExW

.......

Next, the text must be converted to numerical values. Text vectorization is the process of converting text into a numerical tensor, and One-Hot encoding is one method of text vectorization. One-Hot encoding uses an N-bit status register to encode N states, with only one bit valid at a time. It associates each word with a unique integer index i and then converts this index into a binary vector of length N (N being the dictionary size, corresponding to the N-bit status register above) in which only the i-th element is 1 and all other elements are 0.

Assume that there are the following three API call sequences in our sample:

API1 API2 API7

API3 API2 API5

API4 API2 API6

(1) First, the three API call sequences are segmented into words to obtain a dictionary, and each word is given an index number:

API1: 1; API2: 2; API3: 3; API4: 4; API5: 5; API6: 6; API7: 7

(2) Then, the vector of each feature word is obtained as follows:

API1->(1,0,0,0,0,0,0)

API2->(0,1,0,0,0,0,0)

API3->(0,0,1,0,0,0,0)

API4->(0,0,0,1,0,0,0)

API5->(0,0,0,0,1,0,0)

API6->(0,0,0,0,0,1,0)

API7->(0,0,0,0,0,0,1)

(3) Finally, the feature vectors of the three API call sequences are obtained:

API1 API2 API7->(1,1,0,0,0,0,1)

API3 API2 API5->(0,1,1,0,1,0,0)

API4 API2 API6->(0,1,0,1,0,1,0)

One-Hot encoding extends the values of discrete features into Euclidean space; in classification, distances or similarities between features are generally computed in Euclidean space. In the present invention, an API function call sequence is taken as input and converted into a mathematical vector using One-Hot encoding. For discrete features used with a distance-based model, One-Hot encoding handles sparse features well.
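The three steps of the worked example can be sketched in Python as follows. The helper names are hypothetical, the dictionary is assumed to be numbered in sorted order of API name, and indices are 0-based in the code where the text numbers them from 1.

```python
def build_dictionary(sequences):
    """Step (1): map each distinct API name to an index (sorted by name)."""
    vocab = sorted({api for seq in sequences for api in seq})
    return {api: i for i, api in enumerate(vocab)}


def one_hot(api, index):
    """Step (2): vector with a single 1 at the position of the API."""
    vec = [0] * len(index)
    vec[index[api]] = 1
    return vec


def sequence_vector(seq, index):
    """Step (3): element-wise OR of the one-hot vectors of all calls."""
    vec = [0] * len(index)
    for api in seq:
        vec[index[api]] = 1
    return vec


sequences = [
    ["API1", "API2", "API7"],
    ["API3", "API2", "API5"],
    ["API4", "API2", "API6"],
]
index = build_dictionary(sequences)
vectors = [sequence_vector(seq, index) for seq in sequences]
for seq, vec in zip(sequences, vectors):
    print(" ".join(seq), "->", tuple(vec))
```

Stacking the per-call vectors from `one_hot` row by row, instead of combining them, gives the matrix form in which each row represents one API call.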

4 Convolutional neural network training

The convolutional neural network is the model built in the "convolutional neural network training" stage of FIG. 1. Like the common multi-layer perceptron, the convolutional neural network is a feedforward network. The convolutional layer uses convolution filters to extract features from the data samples. In image processing, convolution filters are used primarily to identify features in images. Similarly, in text processing (e.g., sentence classification, search, recommendation), convolution filters can be used for information extraction and high-level feature detection on short texts. Because a log of the instructions of a malicious executable consists of sequences of words from a predefined dictionary, it is clearly similar to a text document when choosing a modeling method; drawing on the application of convolutional neural networks in image and natural language processing, this application uses a convolutional neural network to detect malicious software.

FIG. 3 shows an overview of the convolutional neural network architecture. As shown in FIG. 3, the convolutional neural network CNN model mainly includes an input layer, a convolutional layer, a pooling layer, and an output layer. The input layer receives the incoming feature vectors; the convolutional layer mainly converts the output of the word vector model into a two-dimensional matrix similar to an image; the pooling layer extracts the more salient features from the two-dimensional matrix and passes them to the output layer, whose output is finally fed into a softmax function for classification to judge whether the sample is malicious software.

The implementation of the convolutional layer and the pooling layer is described in detail below.

4.1 Convolutional layer dynamic feature extraction

The convolutional layer is the key feature-extraction layer of the whole network structure. Its main properties are local perception, weight sharing, and multiple convolution kernels; the first two reduce dimensionality, and the last provides the means for re-extracting features at different granularities. With local perception and weight sharing, the same convolution kernel extracts the same feature from different sub-matrices of the whole image. On its own this extracts too few features; to address this, the convolutional neural network introduces multiple convolution kernels, and kernels with different weights extract different features from the input. For example, processing one image with 100 convolution kernels yields 100 feature matrices, so 100 features can be learned. The main calculation of the convolutional layer is described below; its purpose is to convert text information into picture features, which in this application means converting the output of the word vector model into a two-dimensional matrix similar to a picture.

Let $d^{(l)}$ and $d^{(l-1)}$ be the output and input of the $l$-th layer, respectively, i.e. $d^{(0)}$ is the matrix of the original image and $d^{(L)}$ is the output of the last layer $L$. Let $N_I^{(l)}$ and $N_O^{(l)}$ be the numbers of input and output feature matrices of the $l$-th layer, respectively. The input of the $l$-th layer is the output of the $(l-1)$-th layer, i.e. $N_I^{(l)} = N_O^{(l-1)}$. Denote the $j$-th output feature matrix of the $l$-th layer as $d_j^{(l)}$; then the $i$-th input feature matrix of the $l$-th layer is also the $i$-th output feature matrix of the $(l-1)$-th layer, denoted as $d_i^{(l-1)}$. The output calculation formula of the convolutional layer is as follows:

$$d_j^{(l)} = f\left(\sum_{i} d_i^{(l-1)} * w_{ij}^{(l)} + b_j^{(l)}\right)$$

where $0 \le i \le N_I^{(l-1)}$, $0 \le j \le N_O^{(l-1)}$, and the mapping $f$ is a nonlinear activation function, i.e. the sigmoid function. $w_{ij}^{(l)}$ is the weight matrix in the convolution kernel connecting the $i$-th input feature matrix to the $j$-th output feature matrix.

$b_j^{(l)}$ is the offset term of the $j$-th output feature matrix of the $l$-th layer.
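The layer computation described above (each output feature matrix is the sigmoid of the summed convolutions of all input feature matrices with their weight matrices, plus the offset term) can be sketched in plain Python. This is an illustration under assumptions: the function names and toy values are hypothetical, "valid" borders are used, and the sliding product is computed without flipping the kernel, as most CNN libraries do.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv2d_valid(x, k):
    """Slide kernel k over matrix x (no padding, no kernel flip)."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + a][c + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv_layer(inputs, weights, biases):
    """inputs[i]: i-th input matrix; weights[i][j]: kernel from input i
    to output j; biases[j]: offset term of output j."""
    outputs = []
    for j in range(len(biases)):
        maps = [conv2d_valid(inputs[i], weights[i][j])
                for i in range(len(inputs))]
        rows, cols = len(maps[0]), len(maps[0][0])
        outputs.append([[sigmoid(sum(m[r][c] for m in maps) + biases[j])
                         for c in range(cols)]
                        for r in range(rows)])
    return outputs


# Hypothetical toy values: one 3x3 input, one 2x2 kernel, zero offset
d_prev = [[[1, 0, 1], [0, 1, 0], [1, 0, 1]]]
w = [[[[1, 0], [0, 1]]]]
b = [0.0]
d_out = conv_layer(d_prev, w, b)
print(d_out[0])
```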

FIG. 4 shows the structure of the convolutional and pooling layers. With good feature extraction, the convolutional neural network architecture helps distinguish the usage patterns of benign software from those of malware families; indeed, convolution filters help find higher-order local features that are invariant to small changes in the data.

As shown in FIG. 4, the input in the leftmost box is an API call function sequence; the next step generates a word vector matrix, features are then re-extracted through the convolution filters and the pooling layer, and finally the most salient features are sent to a softmax classifier.

The rows of the input matrix represent discrete API call functions, and each filter slides over entire rows of the matrix, as in natural language processing. We choose a filter width of 128, which equals the dimension of the API call function vector. The input sample matrix is convolved with several convolution kernels whose heights can be chosen arbitrarily; the processing is similar to an N-Gram algorithm. For example, a filter of height 3 extracts features from the dynamic behavior of 3 adjacent API calls.
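A minimal sketch of such a full-width filter, assuming a toy vector dimension of 4 in place of 128 and the ReLU activation listed in Table 2; the function name `conv_text` and the example values are hypothetical. Each window of `h` adjacent rows yields one feature value, which is what makes a height-3 filter behave like a 3-gram detector over API calls.

```python
def conv_text(matrix, filt, bias=0.0):
    """Slide a filter of height len(filt) and full row width down the rows
    of the input matrix; return one ReLU-activated value per window."""
    h = len(filt)
    features = []
    for start in range(len(matrix) - h + 1):
        total = bias
        for r in range(h):
            total += sum(x * w for x, w in zip(matrix[start + r], filt[r]))
        features.append(max(0.0, total))  # ReLU
    return features


def one_hot_row(i, dim):
    return [1 if k == i else 0 for k in range(dim)]


# Hypothetical sample: five API calls with dictionary indices 0, 1, 2, 1, 3
sample = [one_hot_row(i, 4) for i in (0, 1, 2, 1, 3)]
# Hypothetical height-3 filter that responds to the call pattern 0, 1, 2
pattern_filter = [one_hot_row(i, 4) for i in (0, 1, 2)]

feature_map = conv_text(sample, pattern_filter)
print(feature_map)
```

The first window matches the pattern in all three rows, so it produces the largest response; later windows match partially or not at all.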

4.2 Secondary feature extraction by the pooling layer

In image recognition, images are sometimes too large and the number of training parameters needs to be reduced; the only purpose of the pooling layer is to reduce the spatial size of the image. Pooling is performed independently on each depth dimension, so the depth of the image remains unchanged.

The pooling layer takes the local features extracted by the convolutional layer as input and further extracts the most salient features. There are two main pooling functions: Max Pooling and Average Pooling. Max Pooling takes the maximum value in the pooling window as the sampled value representing the region; Average Pooling takes the weighted average value in the pooling window as the sampled value representing the region. Through these summary statistics, the dimension of the feature matrix is reduced and overfitting of the model is mitigated; the pooling layer also gives the feature matrix invariance to deformation.

In general, Max-pooling takes the maximum value of a small area. Assuming the pooling window size is 2x2, Max-pooling takes the largest value in each 2x2 window of the matrix on the left side of the arrow, giving the 2x2 matrix shown on the right side of the arrow:

Average-pooling averages a small area. Assuming the pooling window size is 2x2, as shown on the left side of the arrow, the average for the top-left window is 7/4 and the average for the top-right window is 5/4; processing the remaining windows in turn gives the 2x2 matrix on the right side of the arrow:
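Both pooling operations can be sketched with one helper. The 4x4 matrix below is hypothetical: it is chosen only so that its top-left and top-right 2x2 windows average to 7/4 and 5/4 as in the description; the remaining values are assumptions. Windows are taken without overlap, and the average is unweighted.

```python
def pool2d(matrix, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window."""
    rows, cols = len(matrix), len(matrix[0])
    return [[reduce_fn([matrix[r + a][c + b]
                        for a in range(size) for b in range(size)])
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]


def mean(values):
    return sum(values) / len(values)


# Hypothetical input matrix
image = [
    [1, 3, 2, 1],
    [2, 1, 1, 1],
    [4, 0, 3, 2],
    [1, 2, 0, 1],
]

max_pooled = pool2d(image, 2, max)
avg_pooled = pool2d(image, 2, mean)
print(max_pooled)
print(avg_pooled)
```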

In the present invention, Max-pooling is applied to the convolution results, which reduces the output dimension while preserving the important global information captured by each filter: the maximum value of each column vector is taken, so that each column vector is converted to a single 1x1 value.
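As a sketch of this step, the following Python reduces each filter's column vector to its maximum and then feeds the pooled values to a softmax over two classes. The feature maps and the output-layer weights are hypothetical values chosen for illustration, not trained parameters.

```python
import math


def max_over_column(feature_maps):
    """Reduce each filter's column vector to its single largest value."""
    return [max(column) for column in feature_maps]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical column vectors produced by two convolution filters
feature_maps = [[3.0, 0.0, 1.0], [0.5, 2.5, 0.0]]
pooled = max_over_column(feature_maps)

# Hypothetical output-layer weights: one row per class (normal, malicious)
weights = [[0.5, -0.2], [-0.5, 0.2]]
logits = [sum(w * x for w, x in zip(row, pooled)) for row in weights]
probabilities = softmax(logits)
print(pooled, probabilities)
```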

5 Malware detection

In the "malware detection" stage, the model trained in step 3 is verified. First, a batch of malicious PE files and normal PE files is downloaded, preprocessed, and fed directly into the trained model; the detection model judges each test sample based on what it has learned and concludes whether the sample is malicious.

6 Technical effects

The technical effect of the technical solution of the present application is described below by a specific example.

6.1 Data set and sandbox environment configuration

Malicious samples for the Windows platform, which has the largest user base, are selected as experimental objects. Windows malicious PE files are mainly downloaded from two public websites, https://viral-analysis.com and www.malware-traffic-analysis.net, 2400 samples in total; in addition, a normal sample set is downloaded from the official 360 application store, 1000 samples in total covering 16 types of software (such as browsers, file editors, office software, and media players) in proportion to their usage, giving 3400 samples overall.

Programming and operating environment (computer A) of the convolutional neural network-based malware detection technique: deep learning framework TensorFlow, Win10 x64, Core CPU at 2.4 GHz, 8 GB DDR4 2800 memory, 512 GB SSD.

The Cuckoo sandbox environment (computer B) is configured as in table 1:

TABLE 1 sandbox configuration Environment

6.2.1 Comparison of CNN results under different filters

In experiment one, the main parameter values of the CNN are first adjusted:

TABLE 2 CNN parameter settings

Parameter	Value	Description
μ	0.001	Learning rate
Stride	1	Convolution kernel step size
Dropout	0.1	Discard rate
Activation	ReLU	Activation function
Bias	Constant	Bias term

Four filter heights (3, 4, 5, and 6) are set. A comparison of the accuracy of the CNN classification results using One-Hot encoding with that of the CNN classification results using the Skip-gram model is shown in FIG. 5.

Across the different filter heights, the CNN model with One-Hot encoding achieves better accuracy than the CNN model with Skip-gram. When the filter height is 6, the detection result of the CNN model with One-Hot encoding is the highest, and the CNN is in its best state.

6.2.2 Comparison with other conventional machine learning algorithms

In experiment two, we performed three-fold cross-validation to estimate the results on new data. We randomly partitioned the data set into three equal-sized partitions, trained on two partitions, and tested on the remaining partition; this process was repeated three times, leaving a different partition for testing each time. We calculated the average of the three tests to obtain a reliable metric of the performance of the proposed convolutional neural network model over the entire data set. MLP, NaiveBayes, and SVM were selected for comparison with CNN (One-Hot). The performance of the models was quantitatively evaluated using three indices: accuracy (Precision), recall (Recall), and F1-score. The experimental results are shown in Table 3.
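The evaluation protocol can be sketched as follows. This is a minimal illustration: the function names, the random seed, and the toy labels are hypothetical, and the first index is assumed to be precision as the parenthetical term indicates, with the positive class assumed to be "malicious".

```python
import random


def three_fold_splits(n_samples, seed=0):
    """Yield (train_indices, test_indices) for three-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::3] for k in range(3)]
    for k in range(3):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical toy labels (1 = malicious, 0 = normal)
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
scores = precision_recall_f1(y_true, y_pred)

splits = list(three_fold_splits(9))
for train, test in splits:
    print(len(train), len(test))
print(scores)
```

In a full run, a model would be trained on `train` and scored on `test` for each split, and the three score tuples averaged.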

TABLE 3 different machine learning algorithm test results

Algorithm	Accuracy	Recall	F1-score
MLP	0.92	0.91	0.91
NaiveBayes	0.84	0.75	0.76
SVM	0.90	0.89	0.88
CNN	0.94	0.92	0.93

As can be seen from the results in Table 3, the CNN model scores higher in Accuracy, Recall, and F1-score than the other common machine learning algorithms; CNN therefore has an advantage over these machine learning algorithms.

6.2.3 Comparison with common antivirus software

In experiment three, we downloaded another 100 malicious PE samples that do not overlap with those used in the preceding experiments. The samples were submitted to VirusTotal, and the detection rate R of all submitted samples was calculated for the antivirus software ClamAV, TotalDefense, ZoneAlarm, and Malwarebytes, where R = n/t, n is the count detected as malicious samples by the antivirus software, and t is the total number of all antivirus software on VirusTotal. CNN (One-Hot) was used for detection and compared with ClamAV, TotalDefense, ZoneAlarm, and Malwarebytes. The results are shown in Table 4.

TABLE 4 comparison of detection rates for unknown malware

Antivirus software	R
ClamAV	64%
TotalDefense	52%
ZoneAlarm	82%
Malwarebytes	43%
CNN	91%

As can be seen from the table, the detection rate of the CNN model to unknown malware reaches more than 90%, and compared with other common antivirus software, the method has a better detection rate.

Claims

1. A convolutional neural network CNN-based malware detection method, the method comprising the steps of:

step 1: collecting and analyzing a training set, wherein the training set is composed of normal software samples and malicious software samples, and the analyzing comprises generating report files in json format from the training set through a Cuckoo sandbox;

step 2: extracting dynamic API sequences from the json reports, and vectorizing the extracted software features to obtain feature vectors;

step 3: constructing a convolutional neural network (CNN) model, feeding the feature vectors obtained in step 2 as input into the untrained CNN model for training and learning, training the CNN to an optimal state by adjusting parameters, and finally obtaining a trained CNN model;

step 4: processing the software to be tested in the same way as the training set in steps 1 and 2 to obtain a feature vector of the software to be tested, feeding the feature vector into the CNN model trained in step 3, and finally judging through CNN model detection whether the software to be tested is malicious software or normal software.

2. The malware detection method of claim 1, wherein step 2 specifically comprises: writing a Python script to extract API behaviors from the json report, generating a txt format document, and converting the txt format document into a feature vector through One-Hot encoding.

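3. The malware detection method as claimed in claim 1 or 2, wherein the convolutional neural network CNN model in step 3 includes an input layer, a convolutional layer, a pooling layer, and an output layer, wherein the convolutional layer converts text information into picture features, and specifically includes the following operations:

let $d^{(l)}$ and $d^{(l-1)}$ be the output and input of the $l$-th layer, respectively, i.e. $d^{(0)}$ is the matrix of the original image and $d^{(L)}$ is the output of the last layer $L$; let $N_I^{(l)}$ and $N_O^{(l)}$ be the numbers of input and output feature matrices of the $l$-th layer, respectively; the input of the $l$-th layer is the output of the $(l-1)$-th layer, i.e. $N_I^{(l)} = N_O^{(l-1)}$; denote the $j$-th output feature matrix of the $l$-th layer as $d_j^{(l)}$; then the $i$-th input feature matrix of the $l$-th layer is also the $i$-th output feature matrix of the $(l-1)$-th layer, denoted as $d_i^{(l-1)}$; the output calculation formula of the convolutional layer is as follows:

$$d_j^{(l)} = f\left(\sum_{i} d_i^{(l-1)} * w_{ij}^{(l)} + b_j^{(l)}\right)$$

wherein $0 \le i \le N_I^{(l-1)}$, $0 \le j \le N_O^{(l-1)}$, and the mapping $f$ is a nonlinear activation function, namely the sigmoid function;

$w_{ij}^{(l)}$ is the weight matrix in the convolution kernel connecting the $i$-th input feature matrix to the $j$-th output feature matrix;

$b_j^{(l)}$ is the offset term of the $j$-th output feature matrix of the $l$-th layer.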