CN111382783A

CN111382783A - Malicious software identification method and device and storage medium

Info

Publication number: CN111382783A
Application number: CN202010134497.1A
Authority: CN
Inventors: 张九经; 李树栋; 吴晓波; 韩伟红; 方滨兴; 田志宏; 殷丽华; 顾钊铨; 仇晶; 王乐; 李默涵; 唐可可
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2020-02-28
Filing date: 2020-02-28
Publication date: 2020-07-07

Abstract

The invention relates to the technical field of software security, and discloses a malicious software identification method, a malicious software identification device and a storage medium, wherein the malicious software identification method comprises the following steps: extracting sample software execution sequence characteristics; wherein the sample software execution sequence features include an API feature, a PID feature, and a RET feature; training a GCforest model by using the API characteristics, the PID characteristics and the RET characteristics; the GCforest model comprises a cascade forest module, and a final prediction result of the GCforest model is output by a final decision learner; and identifying the malicious software by using the trained GCforest model. The malicious software identification method, the malicious software identification device and the storage medium can improve the identification accuracy of malicious software.

Description

Malicious software identification method and device and storage medium

Technical Field

The present invention relates to the field of software security technologies, and in particular, to a method, an apparatus, and a storage medium for identifying malicious software.

Background

With the popularization and development of networks, society has entered the information era. However, as network attack technology has developed, the threat that malicious code (represented by computer viruses, computer worms, Trojan horses, and the like) poses to networks and information systems has become a significant problem for national security, military security, and social security, and software security has become an important topic in current computer research. Malware identification is a method for judging the security of computer software and is a key part of software security research.

In the prior art, malware identification is mainly performed with deep neural network algorithms. After a malicious sample is analyzed, a conversion algorithm can turn the malware file into an image data set or a text sequence data set, so that deep learning models that perform well on image and text tasks, for example CNN (Convolutional Neural Network) and GRU (Gated Recurrent Unit), can be applied. Traditional forest-based machine learning algorithms achieve good results in data classification tasks; among them, XGBoost (eXtreme Gradient Boosting) and LightGBM (Light Gradient Boosting Machine) are currently used for many problems in the field of network security, such as DDoS (Distributed Denial of Service) attack detection, malicious intrusion detection, and click fraud detection.

However, the existing recognition method based on deep learning has the defect of low recognition accuracy; the existing recognition method based on the traditional forest machine learning directly takes the average value of the class probability vectors of the last layer as output, and has the defects of low accuracy rate and the like.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: the method and the device for identifying the malicious software and the storage medium are provided, the malicious software is identified by adopting an improved GCforest model, and the identification accuracy is improved.

In order to solve the technical problem, in a first aspect, the present invention provides a malware identification method, including:

extracting sample software execution sequence characteristics; wherein the sample software execution sequence features include an API feature, a PID feature, and a RET feature;

training a GCforest model by using the API characteristics, the PID characteristics and the RET characteristics; the GCforest model comprises a cascade forest module, and a final prediction result of the GCforest model is output by a final decision learner;

and identifying the malicious software by using the trained GCforest model.

Preferably, the extracting of the sample software execution sequence features specifically comprises:

grabbing the api_name, call_pid and ret_value fields from an xml file of the sample software;

and extracting the API feature, the PID feature and the RET feature of the sample software from the api_name, call_pid and ret_value fields by using rule matching and frequency statistics.

Specifically, extracting the API feature, the PID feature and the RET feature of the sample software from the api_name, call_pid and ret_value fields by using rule matching and frequency statistics comprises:

when the api_name of the sample software contains a first character string, determining that the value of the API feature of the sample software is 1, and otherwise 0; wherein the first character string is any character string in the api_name of the malicious software;

when the call_pid of the sample software contains a second character string, determining the value of the PID feature of the sample software as the frequency of occurrence of the second character string; wherein the second character string is any character string in the call_pid of the malicious software;

when the ret_value of the sample software contains a third character string, determining the value of the RET feature of the sample software as the frequency of occurrence of the third character string; wherein the third character string is any character string in the ret_value of the malicious software.

Preferably, the training of the GCForest model by using the API features, the PID features, and the RET features specifically includes:

S21: merging and standardizing the extracted API feature, PID feature and RET feature into a first feature vector, dividing the first feature vector into a training set and a cross-validation set, feeding the training set into the GCForest model, and training the base learners of a first forest layer of the GCForest model and a final decision learner;

S22: connecting the first forest layer with the final decision learner to obtain a first GCForest model, predicting the cross-validation set by using the first GCForest model, comparing the prediction result with a preset label, and calculating a first accuracy;

S23: connecting the class probability vector output by the previous forest layer with the first feature vector of the training set to obtain a new feature vector as the input of a next forest layer, training the next forest layer by using the new feature vector, connecting the new feature vector with the final decision learner to obtain a new GCForest model, predicting the cross-validation set by using the new GCForest model, comparing the prediction result with the preset label, and calculating the current accuracy;

S24: if the current accuracy is greater than the accuracy of the previous forest layer, updating the current highest accuracy and the forest layer corresponding to the current highest accuracy, and repeating step S23;

S25: when the accuracy no longer increases, stopping training, and connecting the forest layer with the highest accuracy with the final decision learner to obtain the trained GCForest model.
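The data preparation at the start of the training procedure (merge, standardize, then split) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names `standardize` and `split`, the z-score form of standardization, the 25% validation fraction and the fixed seed are all hypothetical choices that the text does not specify.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column; constant columns are left at 0 (assumed convention)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # avoid division by zero
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]


def split(rows, labels, val_fraction=0.25, seed=0):
    """Shuffle indices and split into a training set and a cross-validation set."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_val = max(1, int(len(rows) * val_fraction))
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    def pick(ids):
        return [rows[i] for i in ids], [labels[i] for i in ids]

    return pick(train_idx), pick(val_idx)


# Merged feature rows (API part + PID part + RET part) for four toy samples.
X = [[1, 0, 2], [3, 0, 4], [5, 0, 6], [7, 0, 8]]
y = [1, 0, 1, 0]
Z = standardize(X)
(X_train, y_train), (X_val, y_val) = split(Z, y)
```

The sketch standardizes before splitting because that is the order the text gives; in practice one might instead fit the scaling statistics on the training set only.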

Preferably, the base learner of any forest layer of the GCForest model is at least one of the following algorithms: random forest, extremely randomized trees, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), categorical boosting (CatBoost), and logistic regression.

In order to solve the same technical problem, in a second aspect, the present invention provides a malware identification apparatus, including a feature extraction module, a model training module and a software identification module, wherein:

the characteristic extraction module is used for extracting the characteristics of the execution sequence of the sample software; wherein the sample software execution sequence features include an API feature, a PID feature, and a RET feature;

the model training module is used for training a GCforest model by utilizing the API characteristics, the PID characteristics and the RET characteristics; the GCforest model comprises a cascade forest module, and a final prediction result of the GCforest model is output by a final decision learner;

the software identification module is used for identifying malicious software by using the trained GCforest model.

Preferably, the feature extraction module is configured to extract features of a sample software execution sequence, specifically:

extracting the API feature, the PID feature and the RET feature of the sample software from the api_name, the call_pid and the ret_value by using rule matching and frequency statistics specifically includes:

Preferably, the model training module is configured to train the GCForest model by using the API features, the PID features, and the RET features, specifically:

a: merging and standardizing the extracted results of the API characteristic, the PID characteristic and the RET characteristic into a first characteristic vector, dividing the first characteristic vector into a training set and a cross validation set, sending the training set into the GCForest model, and training a base learner and a final decision learner of a first forest layer of the GCForest model;

b: connecting the first forest layer with the final decision learner to obtain a first GCforest model, predicting the cross validation set by using the first GCforest model, comparing and validating a prediction result with a preset label, and calculating a first accuracy;

c: connecting a class probability vector output by a previous forest layer with a first feature vector of the training set to obtain a new feature vector as input of a next forest layer, training the next forest layer by using the new feature vector, connecting the new feature vector with the final decision learner to obtain a new GCForest model, predicting the cross validation set by using the new GCForest model, comparing and validating a prediction result with the preset label, and calculating the current accuracy;

d: if the current accuracy is greater than the accuracy of the previous forest layer, updating the current highest accuracy and the forest layer corresponding to the current highest accuracy, and repeating the step c;

e: and when the accuracy is not increased any more, stopping training, and connecting the forest layer with the highest accuracy with the final decision learner to obtain the trained GCforest model.

In order to solve the same technical problem, in a third aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program for implementing the malware identification method of the first aspect described above.

Compared with the prior art, the malware identification method, apparatus and storage medium provided by the invention have the following advantages. Execution sequence features of sample software are extracted and used to train a GCForest model that comprises a cascade forest module and whose final prediction result is output by a final decision learner, and the trained GCForest model is then used to identify malicious software. Because the final prediction result is output by the final decision learner, the model achieves higher identification accuracy in malware identification than the existing GCForest model. Compared with deep neural network methods, identifying malicious software with the improved GCForest model offers faster training, fewer parameters to tune and greater robustness, and the model adapts its complexity to the data set, so that a relatively lightweight model can be obtained without pruning.

Drawings

In order to more clearly illustrate the technical features of the embodiments of the present invention, the drawings needed to be used in the embodiments of the present invention will be briefly described below, and it is apparent that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on the drawings without inventive labor.

FIG. 1 is a schematic structural diagram of a GCforest model in the prior art;

FIG. 2 is a schematic structural diagram of a GCforest model according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a malware identification method according to a first embodiment of the present invention;

FIG. 4 is a schematic flowchart of a specific process of feature extraction in the malware identification method according to the first embodiment of the present invention;

FIG. 5 is a flowchart illustrating a specific process of step S2 in the malware identification method according to the first embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a malware identification device according to a second embodiment of the present invention.

Detailed Description

In order to clearly understand the technical features, objects and effects of the present invention, the following detailed description of the embodiments of the present invention is provided with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention. Other embodiments, which can be derived by those skilled in the art from the embodiments of the present invention without inventive step, shall fall within the scope of the present invention.

It should be noted that the multi-grained cascade forest (GCForest) is an algorithm that learns a cascade structure built from ensembles of decision trees. It mainly includes a cascade forest module and a multi-grained scanning module; in this document, the multi-grained scanning module is not used, and the cascade forest module is used directly.

Fig. 1 is a schematic structural diagram of a GCForest model in the prior art.

As shown in FIG. 1, the prior-art cascade forest module takes forests as its basic units and has a multi-layer cascade structure. Each layer is composed of base learners such as random forests and completely random forests. The input of each base learner is the original data input or the class probability vector generated by the previous layer, and the output of a layer is the combination of the outputs of its base learners. K-fold validation is then performed at each layer, and the cascade stops growing as soon as the cross-validation accuracy no longer increases.
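For comparison with the improved model, the prior-art output rule mentioned in the Background (taking the average of the last layer's class probability vectors as the output) can be sketched as below. The function name `average_vote` and the toy probabilities are illustrative assumptions, not values from the source.

```python
def average_vote(class_vectors):
    """Average the class probability vectors of the last layer's base learners
    and return the index of the most probable class together with the mean vector."""
    n = len(class_vectors)
    mean = [sum(col) / n for col in zip(*class_vectors)]
    return mean.index(max(mean)), mean


# One sample, three base learners, two classes (index 0 and index 1).
last_layer = [[0.7, 0.3], [0.4, 0.6], [0.8, 0.2]]
label, mean = average_vote(last_layer)
```

Every base learner contributes equally to this average, which is the behavior the invention replaces with a trained final decision learner.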

Fig. 2 is a schematic structural diagram of the GCForest model according to the embodiment of the present invention.

As shown in FIG. 2, in the embodiment of the present invention, the cascade forest module likewise takes forests as its basic units and has a multi-layer cascade structure. The input of each base learner is the original data input or the class probability vector generated by the previous layer, and the output of a layer is the combination of the outputs of its base learners. Validation is performed at each layer, and the cascade stops growing as soon as the cross-validation accuracy no longer improves. The base learners of each layer may consist of at least one of random forest, extremely randomized trees (Extra-Trees), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), categorical boosting (CatBoost), and logistic regression. In addition, a final decision learner is added after the last layer of the cascade forest module: it takes the class probability vector of the last layer as its input and outputs the final predicted value. The final decision learner is itself also a base learner.

Fig. 3 is a flowchart illustrating a malware identification method according to an embodiment of the present invention.

As shown in fig. 3, the malware identification method includes the following steps:

S1: extracting sample software execution sequence features; wherein the sample software execution sequence features include an API feature, a PID feature, and a RET feature;

S2: training a GCForest model by using the API feature, the PID feature and the RET feature; wherein the GCForest model comprises a cascade forest module, and a final prediction result of the GCForest model is output by a final decision learner;

S3: identifying malicious software by using the trained GCForest model.

In step S1, the specific process is as follows:

Fig. 4 is a schematic diagram illustrating a specific flow of feature extraction in the malware identification method according to the first embodiment of the present invention.

As shown in fig. 4, the feature extraction process specifically includes:

S11: when the api_name of the sample software contains a first character string, determining that the value of the API feature of the sample software is 1, and otherwise 0; wherein the first character string is any character string in the api_name of the malicious software.

It can be understood that the dynamic behavior of software is mainly realized by calling system APIs, so the API feature is the most important one to consider. The api_name is composed of a plurality of character strings, and these character strings do not appear repeatedly within a single sample, so each character string in the api_name of the malicious software is turned into a feature in a one-hot manner: when the api_name of the sample software contains the first character string, the value of the corresponding API feature is 1, and otherwise 0.

S12: when the call_pid of the sample software contains a second character string, determining the value of the PID feature of the sample software as the frequency of occurrence of the second character string; wherein the second character string is any character string in the call_pid of the malicious software.

It can be understood that the PID feature represents information such as the type of process that the software executes. The call_pid is composed of a plurality of character strings that appear many times within a single sample, so each character string in the call_pid of the malicious software is turned into a feature: when the call_pid of the sample software contains the second character string, the value of the corresponding PID feature is the frequency of occurrence of that character string.

S13: when the ret_value of the sample software contains a third character string, determining the value of the RET feature of the sample software as the frequency of occurrence of the third character string; wherein the third character string is any character string in the ret_value of the malicious software.

It can be understood that the RET feature shows the execution result of a system call. The ret_value is composed of a plurality of character strings that appear many times within a single sample, so each character string in the ret_value of the malicious software is turned into a feature: when the ret_value of the sample software contains the third character string, the value of the corresponding RET feature is the frequency of occurrence of that character string.

In the first embodiment of the present invention, the Beautiful Soup library in Python is used to capture the api _ name, call _ pid, and ret _ value in the xml file exported from the sandbox, but the present invention is not limited thereto.
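As a rough illustration of the grabbing step, the sketch below collects the three fields from a sandbox report. Two things here are assumptions rather than facts from the source: the report layout (an `<action>` element carrying `api_name`, `call_pid` and `ret_value` attributes) is hypothetical, since the patent does not give the xml schema, and the standard-library `xml.etree.ElementTree` parser is swapped in for Beautiful Soup only so that the example has no third-party dependency.

```python
import xml.etree.ElementTree as ET

# Hypothetical sandbox export; the real schema is not given in the text.
SAMPLE_XML = """
<report>
  <action api_name="CreateFileW" call_pid="1204" ret_value="0"/>
  <action api_name="WriteFile" call_pid="1204" ret_value="1"/>
  <action api_name="RegSetValueExW" call_pid="2310" ret_value="0"/>
</report>
"""


def grab_fields(xml_text):
    """Collect every api_name, call_pid and ret_value attribute in document order."""
    root = ET.fromstring(xml_text.strip())
    fields = {"api_name": [], "call_pid": [], "ret_value": []}
    for node in root.iter():
        for key in fields:
            value = node.attrib.get(key)
            if value is not None:
                fields[key].append(value)
    return fields


fields = grab_fields(SAMPLE_XML)
```

Keeping the values as strings matches the text, which treats all three fields as character strings for rule matching and frequency statistics.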

Through the processing of the above steps, a matrix whose features are expressed as only numbers can be extracted.
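The rules of steps S11 to S13 can be sketched as one numeric row per sample. This is an illustrative reading under assumptions: "frequency of occurrence" is taken to mean a raw count, the vocabularies are taken to be the sorted distinct strings seen in malware samples, and the names `build_vocab` and `extract_features` as well as the sample dictionaries are hypothetical.

```python
from collections import Counter


def build_vocab(malware_samples, key):
    """All distinct strings of one field across the malware samples, in sorted order."""
    return sorted({s for sample in malware_samples for s in sample[key]})


def extract_features(sample, api_vocab, pid_vocab, ret_vocab):
    """One-hot API features followed by PID and RET occurrence counts."""
    api_seen = set(sample["api_name"])
    pid_counts = Counter(sample["call_pid"])
    ret_counts = Counter(sample["ret_value"])
    api_part = [1 if s in api_seen else 0 for s in api_vocab]
    pid_part = [pid_counts[s] for s in pid_vocab]  # Counter returns 0 if absent
    ret_part = [ret_counts[s] for s in ret_vocab]
    return api_part + pid_part + ret_part


malware = [{"api_name": ["CreateFileW", "WriteFile"],
            "call_pid": ["1204", "1204"],
            "ret_value": ["0", "1"]}]
api_vocab = build_vocab(malware, "api_name")
pid_vocab = build_vocab(malware, "call_pid")
ret_vocab = build_vocab(malware, "ret_value")

sample = {"api_name": ["WriteFile", "ReadFile"],
          "call_pid": ["1204", "1204", "1204", "88"],
          "ret_value": ["0", "0"]}
features = extract_features(sample, api_vocab, pid_vocab, ret_vocab)
```

Strings that never occur in malware (here `ReadFile` and PID `88`) produce no column, which follows from the rule that each feature corresponds to a string found in the malicious software.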

Fig. 5 is a schematic flowchart illustrating a specific flow of step S2 in the malware identification method according to the first embodiment of the present invention.

As shown in fig. 5, step S2 specifically includes:

In the first embodiment of the present invention, the data are labeled as malicious or benign. Malicious software includes, but is not limited to: software that secretly acquires private information such as user terminal information and location; software that changes system settings or installs further malicious software; and software that uses fraudulent links or endangers the property security of users. The present invention does not limit the number or proportion of malicious and benign samples, or the ratio of the training set to the test set.

It should be understood that each time step S23 is performed, the class probability vector output by the last forest layer of the existing model is concatenated with the first feature vector to obtain a new feature vector, and a new forest layer is trained with this new feature vector. After training, the newly trained forest layer is placed after the last forest layer of the existing model and is then connected to the final decision learner to form a new GCForest model.
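The layer-by-layer growth described here can be sketched in plain Python. This is a simplified illustration under several assumptions that are not in the source: `CentroidLearner` is a toy stand-in for a real forest so the example runs without third-party libraries, a fresh final decision learner is fitted for each candidate depth, class probability vectors are produced in-sample rather than by K-fold prediction, and all function names and the `max_layers` cap are hypothetical.

```python
class CentroidLearner:
    """Toy stand-in for a forest: scores each class by closeness to its centroid."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.centroids_ = []
        for c in self.classes_:
            rows = [x for x, label in zip(X, y) if label == c]
            self.centroids_.append([sum(col) / len(rows) for col in zip(*rows)])
        return self

    def predict_proba(self, X):
        probs = []
        for x in X:
            w = [1.0 / (1e-9 + sum((a - b) ** 2 for a, b in zip(x, c)))
                 for c in self.centroids_]
            probs.append([v / sum(w) for v in w])
        return probs

    def predict(self, X):
        return [self.classes_[p.index(max(p))] for p in self.predict_proba(X)]


def accuracy(y_pred, y_true):
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)


def layer_output(layer, X):
    """Concatenate the class probability vectors of all base learners in one layer."""
    per_learner = [learner.predict_proba(X) for learner in layer]
    return [[v for probs in per_learner for v in probs[i]] for i in range(len(X))]


def last_class_vector(layers, X):
    """Run X through the cascade; each later layer sees [previous output + original features]."""
    layer_in, out = X, None
    for layer in layers:
        out = layer_output(layer, layer_in)
        layer_in = [o + list(x) for o, x in zip(out, X)]
    return out


def train_cascade(make_layer, make_final, X_tr, y_tr, X_val, y_val, max_layers=10):
    layers, finals = [], []
    p_best, index_best = -1.0, -1
    layer_in = X_tr
    while len(layers) < max_layers:
        layer = [learner.fit(layer_in, y_tr) for learner in make_layer()]
        layers.append(layer)
        out_tr = layer_output(layer, layer_in)
        finals.append(make_final().fit(out_tr, y_tr))  # final decision learner
        p = accuracy(finals[-1].predict(last_class_vector(layers, X_val)), y_val)
        if p <= p_best:  # accuracy no longer increases: stop growing
            break
        p_best, index_best = p, len(layers) - 1
        layer_in = [o + list(x) for o, x in zip(out_tr, X_tr)]
    return layers[:index_best + 1], finals[index_best], p_best, index_best


X_tr, y_tr = [[0, 0], [0, 1], [5, 5], [6, 5]], [0, 0, 1, 1]
X_val, y_val = [[1, 0], [5, 6]], [0, 1]
layers, final, p_best, index_best = train_cascade(
    lambda: [CentroidLearner(), CentroidLearner()], CentroidLearner,
    X_tr, y_tr, X_val, y_val)
prediction = final.predict(last_class_vector(layers, [[6, 6]]))
```

On this toy data the first layer already classifies the validation set perfectly, so the second layer brings no improvement and the returned model keeps only the first layer plus its final decision learner. Any object with `fit`, `predict_proba` and `predict` methods could be passed in place of `CentroidLearner`.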

wherein the accuracy is

$$p = \frac{1}{Y}\sum_{i=1}^{Y} I\left(y_{Pred}^{(i)},\, y_{True}^{(i)}\right)$$

Y is the number of samples in the cross-validation set, $y_{Pred}^{(i)}$ is the predicted class of the i-th sample in the cross-validation set, and $y_{True}^{(i)}$ is the true class of the i-th sample in the cross-validation set. The function I(x, y) is an indicator function: when the values of x and y are the same, the function value is 1; otherwise, the function value is 0. At the same time, p_Best is used to record the current highest cross-validation accuracy, and Index_Best is used to record the forest layer corresponding to the current highest accuracy.
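The accuracy defined here, the fraction of cross-validation samples whose predicted class equals the true class, can be checked with a small worked example. The function names and the label values below are illustrative assumptions.

```python
def indicator(x, y):
    """I(x, y): 1 when the two values are the same, otherwise 0."""
    return 1 if x == y else 0


def accuracy(y_pred, y_true):
    """Mean of the indicator over all samples in the cross-validation set."""
    return sum(indicator(p, t) for p, t in zip(y_pred, y_true)) / len(y_true)


y_true = ["malicious", "benign", "benign", "malicious"]
y_pred = ["malicious", "benign", "malicious", "malicious"]
p = accuracy(y_pred, y_true)  # three of the four predictions match
```

With Y = 4 samples and three matches, the indicator sum is 3 and p = 3/4 = 0.75.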

It should be understood that updating the value of Index_Best in step S24 means that the number of layers of the GCForest model increases. Moreover, by the time step S23 is repeated from step S24, p_Best has already been updated to the accuracy with which the new GCForest model, which includes the forest layer obtained in this round of training, predicts the cross-validation set, and Index_Best has already been updated to the index of that forest layer. For example, after the second forest layer is trained in step S23, if step S24 finds that the accuracy of the current model (which includes the first forest layer, the second forest layer and the final decision learner) is higher than that of the model with only the first forest layer and the final decision learner, the values of p_Best and Index_Best are updated; otherwise, the method proceeds to step S25. In that case, Index_Best still points to the first forest layer, and p_Best is still the first accuracy calculated in step S22.

In the process of training the GCForest model, the base learner of any forest layer of the GCForest model consists of at least one of the following algorithms: random forest, extremely randomized trees, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), categorical boosting (CatBoost), and logistic regression.

After step S2, an improved GCForest model is obtained. The final prediction result of the model is output by the final decision learner, and the base learner of any forest layer consists of at least one of the following algorithms: random forest, extremely randomized trees, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), categorical boosting (CatBoost), and logistic regression.

After the improved GCforest model is obtained, feature extraction is carried out on software to be recognized, the extracted features are input into the improved GCforest model for recognition, and therefore the software can be recognized and classified into malicious software or benign software.

The malware identification method provided by the embodiment of the invention extracts execution sequence features of sample software, trains a GCForest model that comprises a cascade forest module and whose final prediction result is output by a final decision learner, and identifies malicious software by using the trained GCForest model. Because the final prediction result is output by the final decision learner, the model achieves higher identification accuracy in malware identification than the existing GCForest model. Compared with deep neural network methods, identifying malicious software with the improved GCForest model offers faster training, fewer parameters to tune, simpler parameter tuning and greater robustness, and the model adapts itself to the data set, so that a relatively lightweight model can be obtained without pruning.

It should be understood that all or part of the processes of the malware identification method described above may be implemented by a computer program, which may be stored in a computer readable storage medium and executed by a processor, so as to implement the steps of the malware identification method described above. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, etc. It should be noted that the computer readable medium may contain other components which may be suitably increased or decreased as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.

As shown in FIG. 6, the malware identification apparatus includes a feature extraction module 61, a model training module 62 and a software identification module 63, wherein:

the feature extraction module 61 is configured to extract features of a sample software execution sequence; wherein the sample software execution sequence features include an API feature, a PID feature, and a RET feature;

the model training module 62 is configured to train a GCForest model using the API features, the PID features, and the RET features; the GCforest model comprises a cascade forest module, and a final prediction result of the GCforest model is output by a final decision learner;

the software identification module 63 is configured to identify malware by using the trained GCForest model.

Preferably, the feature extraction module 61 is configured to extract features of the sample software execution sequence, specifically:

Further, extracting the API feature, the PID feature and the RET feature of the sample software from the api_name, the call_pid and the ret_value by using rule matching and frequency statistics specifically includes:

Through the above steps, a matrix whose features appear to be only numbers can be extracted.

Preferably, the model training module 62 is configured to train a GCForest model, specifically:

the model training module 62 is configured to train a GCForest model using the API features, the PID features, and the RET features, specifically:

wherein the accuracy is

$$p = \frac{1}{Y}\sum_{i=1}^{Y} I\left(y_{Pred}^{(i)},\, y_{True}^{(i)}\right)$$

Y is the number of samples in the cross-validation set, $y_{Pred}^{(i)}$ is the predicted class of the i-th sample in the cross-validation set, and $y_{True}^{(i)}$ is the true class of the i-th sample in the cross-validation set. The function I(x, y) is an indicator function: when the values of x and y are the same, the function value is 1; otherwise, the function value is 0. At the same time, p_Best is used to record the current highest cross-validation accuracy, and Index_Best is used to record the forest layer corresponding to the current highest accuracy.

It is understood that updating the value of Index_Best in step d means that the number of layers of the GCForest model increases. Moreover, by the time step c is repeated from step d, p_Best has already been updated to the accuracy with which the new GCForest model, which includes the forest layer obtained in this round of training, predicts the cross-validation set, and Index_Best has already been updated to the index of that forest layer. For example, after the second forest layer is trained in step c, if step d finds that the accuracy of the current model (which includes the first forest layer, the second forest layer and the final decision learner) is higher than that of the model with only the first forest layer and the final decision learner, the values of p_Best and Index_Best are updated; otherwise, the apparatus proceeds to step e. In that case, Index_Best still points to the first forest layer, and p_Best is still the first accuracy calculated in step b.

In the process of training the GCForest model, the base learner of any forest layer of the GCForest model consists of at least one of the following algorithms: random forest, extremely randomized trees, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), categorical boosting (CatBoost), and logistic regression.

After the improved GCforest model is obtained, feature extraction is carried out on software to be recognized, features are input into the improved GCforest model for recognition, and then the software can be recognized and classified, and the software is classified into malicious software or benign software.

Compared with the conventional GCforest model, the device for identifying the malicious software has higher identification accuracy in the identification of the malicious software; the improved GCforest model is adopted to identify the malicious software, compared with a deep neural network method, the method has the advantages of higher training speed, fewer parameters to be adjusted, relatively simple difficulty in parameter adjustment, robustness and capability of adaptively adjusting the model according to a data set, so that a relatively light model can be obtained without pruning.

The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and it should be noted that, for those skilled in the art, several equivalent obvious modifications and/or equivalent substitutions can be made without departing from the technical principle of the present invention, and these obvious modifications and/or equivalent substitutions should also be regarded as the scope of the present invention.

Claims

1. A malware identification method, comprising:

and identifying the malicious software by using the trained GCforest model.

2. The malware identification method according to claim 1, wherein the extracting of the sample software execution sequence features specifically is:

3. The malware identification method of claim 2, wherein the extracting of the API feature, the PID feature, and the RET feature of the sample software from the api_name, the call_pid, and the ret_value by using rule matching and frequency statistics specifically comprises:

4. The malware identification method according to claim 1, wherein the training of the GCForest model using the API features, the PID features, and the RET features specifically comprises:

5. The malware identification method of claim 4, wherein the base learner of any forest layer of the GCForest model consists of at least one of the following algorithms: random forest, extremely randomized trees, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), categorical boosting (CatBoost), and logistic regression.

6. A malware identification device, comprising a feature extraction module, a model training module and a software identification module, wherein:

7. The malware identification device of claim 6, wherein the feature extraction module is configured to extract sample software execution sequence features, specifically:

8. The malware identification device of claim 7, wherein the API feature, the PID feature, and the RET feature of the sample software are extracted from the api_name, the call_pid, and the ret_value by using rule matching and frequency statistics, specifically:

9. The malware identification device of claim 6, wherein the model training module is configured to train a GCForest model using the API features, the PID features, and the RET features, specifically:

10. A computer-readable storage medium, characterized in that a computer program implementing the malware identification method according to any one of claims 1 to 5 is stored thereon.