CN114969732A

CN114969732A - Malicious code detection method and device, computer equipment and storage medium

Info

Publication number: CN114969732A
Application number: CN202210461822.4A
Authority: CN
Inventors: 袁俊杰; 王波; 吴潇; 潘彭丹; 裴军; 段云
Original assignee: Guoke Huadun Beijing Technology Co ltd
Current assignee: Guoke Huadun Beijing Technology Co ltd
Priority date: 2022-04-28
Filing date: 2022-04-28
Publication date: 2022-08-30
Anticipated expiration: 2042-04-28
Also published as: CN114969732B

Abstract

The application relates to a malicious code detection method and apparatus, a computer device, a storage medium and a computer program product. The method comprises the following steps: acquiring a code to be detected, and extracting a feature data set from the code to be detected, each feature data set comprising a plurality of feature data subsets; for each feature data subset contained in the feature data set, calculating a hash sequence corresponding to each feature data in the feature data subset and a weight corresponding to each feature data; determining a feature value sequence corresponding to each feature data according to the hash sequence and the weight of each feature data, and performing bitwise fusion processing on the feature value sequences to obtain fusion data corresponding to the feature data subset; and determining a target data set according to the fusion data corresponding to each feature data subset, and inputting the target data set to a trained malicious code detection model to obtain a code detection result. The detection efficiency for unknown malicious code is thereby improved.

Description

Malicious code detection method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of network information security technologies, and in particular, to a method and an apparatus for detecting malicious codes, a computer device, and a storage medium.

Background

With the development of network information security technology, a malicious code detection technology appears, which can detect codes in a computer and determine whether the detected codes are malicious codes, thereby protecting the network information security.

In the traditional malicious code detection technology, static features and dynamic features of a code to be detected are extracted, and then the features are input into a trained classifier to obtain a detection result.

However, current malicious code detection techniques either leave the extracted high-dimensional feature data set unprocessed or still yield a relatively high-dimensional feature data set after processing, so the trained classifier takes a relatively long time to run and detection efficiency is reduced.

Disclosure of Invention

In view of the above, it is necessary to provide a malicious code detection method, apparatus, computer device, computer readable storage medium, and computer program product capable of improving detection efficiency.

In a first aspect, the present application provides a method for detecting malicious code. The method comprises the following steps:

acquiring a code to be detected, and extracting a characteristic data set from the code to be detected; each of the feature data sets comprises a plurality of feature data subsets;

calculating a hash sequence corresponding to each feature data in the feature data subsets and a weight corresponding to each feature data for each feature data subset contained in the feature data set;

determining a characteristic value sequence corresponding to each characteristic data according to the Hash sequence and the weight of each characteristic data, and performing bit-wise fusion processing on the characteristic value sequences corresponding to the characteristic data to obtain fusion data corresponding to the characteristic data subsets;

and determining a target data set according to the fusion data corresponding to each feature data subset, and inputting the target data set to the trained malicious code detection model to obtain a code detection result.

In one embodiment, the determining, according to the hash sequence and the weight of each piece of feature data, a feature value sequence corresponding to each piece of feature data includes:

for each hash value contained in the hash sequence, determining the sign of the weight according to the hash value, and determining a characteristic value according to the sign of the weight and the weight;

and obtaining a characteristic value sequence corresponding to the characteristic data according to the characteristic value corresponding to each hash value, wherein the arrangement sequence of each characteristic value in the characteristic value sequence is the same as the arrangement sequence of the hash value corresponding to each characteristic value.

In one embodiment, the performing bit-wise fusion processing based on the feature value sequence corresponding to each feature data to obtain fusion data corresponding to the feature data subset includes:

performing bit-wise linear superposition on the characteristic value sequence corresponding to each characteristic data aiming at the characteristic data contained in one characteristic data subset to obtain a fusion characteristic value sequence corresponding to the characteristic data subset;

and aiming at the fusion characteristic value sequence corresponding to the characteristic data subset, obtaining fusion data corresponding to the characteristic data subset according to a mapping strategy corresponding to the fusion characteristic value sequence.

In one embodiment, before calculating, for each of the feature data subsets included in the feature data set, a hash sequence corresponding to each feature data in the feature data subset and a weight corresponding to each feature data, the method further includes:

aiming at each feature data contained in each feature data subset, calculating an information gain rate corresponding to each feature data;

and aiming at one characteristic data subset, screening the characteristic data of which the information gain rate meets a first preset screening strategy in the characteristic data subset to obtain the screened characteristic data subset.

In one embodiment, the extracting a feature data set from the code to be detected includes:

respectively extracting a feature data set corresponding to each feature type from the code to be detected according to the extraction strategy corresponding to each feature type;

wherein the feature categories comprise file features, behavior features, and traffic features; the file features comprise image texture features; the traffic features comprise protocol features, stream features, and encryption suite features.

In one embodiment, the determining a target data set according to fusion data corresponding to each feature data subset, and inputting the target data set to a trained malicious code detection model to obtain a code detection result includes:

constructing a fusion data set based on fusion data corresponding to each characteristic data subset;

sampling the fusion data set by using a bootstrap aggregating (bagging) algorithm to obtain a first preset number of initial data sets;

for the fusion data in each initial data set, randomly extracting, with replacement, a second preset number of fusion data from the initial data set, and deleting the fusion data whose weights do not satisfy a second preset screening strategy, to obtain the corresponding first preset number of target data sets;

and respectively inputting each target data set into a decision tree in the trained malicious code detection model to obtain a code detection result.

In one embodiment, the method further comprises:

acquiring malicious codes, and extracting a feature data set from the malicious codes; each of the feature data sets comprises a plurality of feature data subsets;

determining a sample data set according to the fusion data corresponding to each feature data subset, and inputting the sample data set to a malicious code detection model to be trained to obtain a code prediction result;

and adjusting parameters of the malicious code detection model to be trained according to the code prediction result so as to enable the adjusted malicious code detection model to reach a preset precision condition, and obtaining the trained malicious code detection model.

In a second aspect, the application further provides a malicious code detection apparatus. The device comprises:

a first acquisition module, configured to acquire a code to be detected and extract a feature data set from the code to be detected; each of the feature data sets comprises a plurality of feature data subsets;

a first calculation module, configured to calculate, for each feature data subset included in the feature data set, a hash sequence corresponding to each feature data in the feature data subset and a weight corresponding to each feature data;

the first determining module is used for determining a characteristic value sequence corresponding to each characteristic data according to the hash sequence and the weight of each characteristic data, and performing bitwise fusion processing on the characteristic value sequence corresponding to each characteristic data to obtain fusion data corresponding to the characteristic data subset;

and the second determining module is used for determining a target data set according to the fusion data corresponding to each characteristic data subset, and inputting the target data set to the trained malicious code detection model to obtain a code detection result.

In one embodiment, the first determining module is specifically configured to:

performing bit-wise linear superposition on characteristic value sequences corresponding to the characteristic data aiming at the characteristic data contained in one characteristic data subset to obtain a fusion characteristic value sequence corresponding to the characteristic data subset;

In one embodiment, the apparatus further comprises:

a second calculating module, configured to calculate, for each feature data included in each feature data subset, an information gain rate corresponding to each feature data;

and the screening module is used for screening the characteristic data of which the information gain rate meets a first preset screening strategy in the characteristic data subset aiming at one characteristic data subset to obtain the screened characteristic data subset.

In one embodiment, the first obtaining module is specifically configured to:

In one embodiment, the second determining module is specifically configured to:

constructing a fusion data set based on fusion data corresponding to each feature data subset;

In one embodiment, the apparatus further comprises:

the second acquisition module is used for acquiring malicious codes and extracting a feature data set from the malicious codes; each of the feature data sets comprises a plurality of feature data subsets;

a third calculating module, configured to calculate, for each subset of the feature data included in the feature data set, a hash sequence corresponding to each feature data in the subset of the feature data and a weight corresponding to each feature data;

a third determining module, configured to determine, according to the hash sequence and the weight of each feature data, a feature value sequence corresponding to each feature data, and perform bitwise fusion processing based on the feature value sequence corresponding to each feature data to obtain fusion data corresponding to the feature data subset;

the fourth determining module is used for determining a sample data set according to the fusion data corresponding to each feature data subset, and inputting the sample data set to a malicious code detection model to be trained to obtain a code prediction result;

and the adjusting module is used for adjusting the parameters of the malicious code detection model to be trained according to the code prediction result so as to enable the adjusted malicious code detection model to reach the preset precision condition and obtain the trained malicious code detection model.

In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the first aspect when executing the computer program.

In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps recited in the first aspect.

In a fifth aspect, the present application further provides a computer program product. The computer program product comprises a computer program that, when executed by a processor, performs the steps recited in the first aspect.

With the above malicious code detection method, apparatus, computer device, storage medium and computer program product, a code to be detected is acquired and a feature data set is extracted from it, each feature data set comprising a plurality of feature data subsets; for each feature data subset contained in the feature data set, a hash sequence corresponding to each feature data in the feature data subset and a weight corresponding to each feature data are calculated; a feature value sequence corresponding to each feature data is determined according to the hash sequence and the weight of each feature data, and bitwise fusion processing is performed on the feature value sequences to obtain fusion data corresponding to the feature data subset; and a target data set is determined according to the fusion data corresponding to each feature data subset and input to the trained malicious code detection model to obtain a code detection result. In this technical scheme, the feature value sequence corresponding to each feature data is generated from its hash sequence and weight, the feature value sequences are then fused, and the target data set is determined from the fusion data. This effectively reduces the dimension of the input data of the malicious code detection model, shortens the running time of the model, and improves detection efficiency.

Drawings

FIG. 1 is a flowchart illustrating a method for malicious code detection according to an embodiment;

FIG. 2 is a flow chart illustrating a method for determining a sequence of eigenvalues in one embodiment;

FIG. 3 is a flowchart illustrating a method for determining fused data according to another embodiment;

FIG. 4 is a flowchart illustrating a method for determining a code detection result according to an embodiment;

FIG. 5 is a flowchart illustrating a method for training a malicious code detection model according to an embodiment;

FIG. 6 is a block diagram of an apparatus for malicious code detection in one embodiment;

FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present application and do not limit it.

In an embodiment, as shown in FIG. 1, a method for detecting malicious code is provided. This embodiment is illustrated by applying the method to a terminal. It is to be understood that the method may also be applied to a server, or to a system including the terminal and the server, in which case it is implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:

Step 102, acquiring a code to be detected, and extracting a feature data set from the code to be detected.

Wherein each feature data set comprises a plurality of feature data subsets, each feature data subset comprising a plurality of feature data.

In the embodiment of the application, the terminal acquires the code to be detected and extracts the feature data set of the code to be detected. The number of feature data sets may be one or more. In one example, the features of the code to be detected may include a file fingerprint, a behavior fingerprint, and a traffic fingerprint, and accordingly, the feature data sets include a feature data set corresponding to the file fingerprint, a feature data set corresponding to the behavior fingerprint, and a feature data set corresponding to the traffic fingerprint.

The file fingerprint may specifically include a function call and Application Programming Interface (API) call feature, a file directory structure feature, a script information feature, a string information feature, an application signature feature, an application system access right, an application system component feature, and an image texture feature. Accordingly, the feature data set corresponding to the file fingerprint includes one feature data subset for each of these features.

The behavior fingerprint may specifically include a system call feature, a command execution feature, a critical path feature, a data access feature, and a network access behavior feature. Accordingly, the feature data set corresponding to the behavior fingerprint includes one feature data subset for each of these features.

The traffic fingerprint may specifically include a protocol feature, a stream feature, and an encryption suite feature. Accordingly, the feature data set corresponding to the traffic fingerprint includes a feature data subset corresponding to the protocol feature, a feature data subset corresponding to the stream feature, and a feature data subset corresponding to the encryption suite feature. The feature data subset corresponding to the stream feature comprises a plurality of feature data, which may be the number of data packets, the packet length of the data packets, the interval between data packets, and the size of the data packets.

Step 104, for each feature data subset contained in the feature data set, calculating a hash sequence corresponding to each feature data in the feature data subset and a weight corresponding to each feature data.

In the embodiment of the application, for each feature data subset contained in the feature data set, the terminal maps each feature data to a corresponding hash sequence by using a text hash algorithm. Optionally, the hash sequence is a sequence of a preset fixed length. The terminal searches the code to be detected for the called instructions of the feature corresponding to each feature data, determines the number of called instructions, and takes that number as the feature frequency. For each feature data in the feature data subset, the terminal calculates the weight corresponding to the feature data according to the feature frequency corresponding to the feature data, the dimension of the feature data, and the dimension of the highest-dimensional feature data in the feature data subset to which the feature data belongs. Specifically, the terminal substitutes the feature frequency, the dimension of the feature data, and the dimension of the highest-dimensional feature data into the following formula (1) to calculate the weight corresponding to the feature data.

Wherein W is the weight corresponding to the feature data, k is a preset parameter, t is the feature frequency, D_max is the dimension of the highest-dimensional feature data in the feature data subset to which the feature data belongs, and D is the dimension of the feature data.
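As an illustration of the hash mapping in step 104, the following minimal sketch maps a feature string to a fixed-length sequence of 0/1 hash values. The source only says that a text hash algorithm is used, so the choice of SHA-256, the function name `hash_sequence`, and the default length of 64 bits are assumptions made for this example. The weight W is not computed here because it depends on formula (1).

```python
import hashlib


def hash_sequence(feature: str, n_bits: int = 64) -> list[int]:
    """Map a feature string to a fixed-length list of 0/1 hash values.

    Assumption: SHA-256 stands in for the unspecified text hash algorithm.
    """
    if not 0 < n_bits <= 256:
        raise ValueError("n_bits must be between 1 and 256 for SHA-256")
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    # Read the digest bit by bit, most significant bit first.
    return [(value >> (255 - i)) & 1 for i in range(n_bits)]


# Hypothetical feature data: the name of an API called by the code under test.
bits = hash_sequence("CreateRemoteThread", n_bits=16)
print(len(bits))  # 16
```

The same input always yields the same sequence, so identical feature data contribute identically to the later fusion step.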

Step 106, determining a feature value sequence corresponding to each feature data according to the hash sequence and the weight of each feature data, and performing bitwise fusion processing on the feature value sequences corresponding to the feature data to obtain fusion data corresponding to the feature data subset.

In the embodiment of the application, for the feature data in each feature data subset, the terminal determines the feature value sequence corresponding to each feature data according to the hash sequence and the weight of each feature data. And aiming at a characteristic data subset, the terminal performs bitwise fusion based on the characteristic value sequence to obtain fusion data corresponding to the characteristic data subset.

Step 108, determining a target data set according to the fusion data corresponding to each feature data subset, and inputting the target data set to the trained malicious code detection model to obtain a code detection result.

Wherein the target data set comprises a plurality of target data subsets. The trained malicious code detection model may be a trained random forest model. The trained random forest model includes a plurality of trained decision trees.

In the embodiment of the present application, a single extraction is taken as an example; multiple extractions are handled similarly and are not described again. For each feature data subset, the terminal randomly extracts fusion data with replacement according to a preset sampling strategy to obtain a target data subset. It can be understood that, for each feature data subset, the terminal obtains a plurality of target data subsets through multiple extractions. The terminal inputs the target data subsets into the decision trees in the trained malicious code detection model respectively, and each decision tree outputs a code detection result. The trained malicious code detection model counts all code detection results according to a preset result output strategy, and takes the code detection result output by the largest number of decision trees as the final code detection result.

In the above malicious code detection method, the terminal extracts the features of the code to be detected, performs fusion processing on the extracted feature data to determine a target data set, and inputs the target data set to the trained malicious code detection model to obtain a code detection result. It can be understood that fusing a plurality of feature data reduces their dimension, and the target data set obtained from the fused feature data is correspondingly reduced in dimension. The input of the trained malicious code detection model is therefore a low-dimensional target data set, which shortens the running time of the model and improves detection efficiency.
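The sampling with replacement and the majority vote described for step 108 can be sketched as follows. This is a minimal illustration, not the trained random forest itself: the function names `bootstrap_sample` and `majority_vote`, the label strings, and the example data are hypothetical, and the per-tree results are supplied by hand instead of coming from real decision trees.

```python
import random
from collections import Counter


def bootstrap_sample(fusion_data: list, size: int, rng: random.Random) -> list:
    """Randomly extract `size` items with replacement from the fusion data."""
    return [rng.choice(fusion_data) for _ in range(size)]


def majority_vote(results: list[str]) -> str:
    """Return the detection result output by the largest number of trees.

    Ties are broken in favour of the result seen first (Counter ordering).
    """
    return Counter(results).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
fusion_data = [[1, 1, 0, 0, 0, 1], [0, 1, 1, 0, 1, 0], [1, 0, 0, 1, 1, 1]]
target_subset = bootstrap_sample(fusion_data, size=3, rng=rng)

# Hypothetical outputs of three decision trees for one code sample.
print(majority_vote(["malicious", "benign", "malicious"]))  # malicious
```

Because sampling is done with replacement, the same fusion data may appear more than once in a target data subset, which is what lets each decision tree see a different view of the data.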

In one embodiment, as shown in fig. 2, determining the characteristic value sequence corresponding to each characteristic data according to the hash sequence and the weight of each characteristic data includes:

step 202, for each hash value contained in the hash sequence, determining a sign of the weight according to the hash value, and determining a feature value according to the sign of the weight and the weight.

In the embodiment of the application, for each hash value contained in the hash sequence, the terminal determines the sign of the weight corresponding to the hash value according to a preset correspondence between hash values and weight signs, and determines the feature value corresponding to the hash value according to the sign of the weight and the numerical value of the weight. Specifically, for each hash value contained in the hash sequence, the terminal obtains the hash value and determines whether it is 0 or 1. When the hash value is 0, the terminal determines that the sign of the weight corresponding to the hash value is negative, and determines that the feature value corresponding to the hash value is the negative weight; when the hash value is 1, the terminal determines that the sign of the weight corresponding to the hash value is positive, and determines that the feature value corresponding to the hash value is the positive weight.

Step 204, obtaining a feature value sequence corresponding to the feature data according to the feature value corresponding to each hash value.

In the embodiment of the application, the terminal sorts the characteristic values corresponding to the hash values according to the characteristic values corresponding to the hash values and the arrangement sequence of the hash values to obtain the characteristic value sequence corresponding to the characteristic data. And the arrangement sequence of each characteristic value in the characteristic value sequence is the same as the arrangement sequence of the hash value corresponding to each characteristic value.

The hash sequence corresponding to one feature data is taken as an example; other hash sequences are handled similarly and are not described again. Suppose a hash sequence is 100110 and the weight corresponding to the hash sequence is W₁. For the hash sequence {100110}, the terminal obtains the hash values in order, determines that the first hash value is 1, and therefore determines that the weight sign corresponding to the first hash value is positive. For the first hash value, the terminal determines, according to the weight W₁ of the hash sequence {100110} and the positive weight sign, that the feature value corresponding to the first hash value is W₁. Similarly, the terminal determines that the feature value corresponding to the second hash value, 0, is -W₁, and so on, obtaining the feature value sequence {W₁, -W₁, -W₁, W₁, W₁, -W₁} corresponding to the hash sequence {100110}. It is to be understood that a 6-bit hash sequence is taken as an example in this embodiment, and the specific number of bits of the hash sequence is not limited.

In this embodiment, the terminal determines the characteristic value sequence corresponding to the characteristic data according to the hash sequence and the weight of the characteristic data, so that the hash information and the weight information are fused in the characteristic value sequence, and the dimensionality of subsequent data participating in operation is reduced.
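The sign rule of steps 202 and 204 can be sketched in a few lines. The function name `feature_value_sequence` and the numeric weight 2.5 are assumptions chosen for illustration; the hash sequence 100110 is the one used in the example above.

```python
def feature_value_sequence(hash_bits: list[int], weight: float) -> list[float]:
    """Turn a 0/1 hash sequence into a signed feature value sequence.

    A hash value of 1 yields +weight and a hash value of 0 yields -weight,
    preserving the order of the hash values.
    """
    if any(bit not in (0, 1) for bit in hash_bits):
        raise ValueError("hash values must be 0 or 1")
    return [weight if bit == 1 else -weight for bit in hash_bits]


# Hash sequence 100110 with an illustrative weight W1 = 2.5.
print(feature_value_sequence([1, 0, 0, 1, 1, 0], 2.5))
# [2.5, -2.5, -2.5, 2.5, 2.5, -2.5]
```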

In one embodiment, as shown in fig. 3, performing bitwise fusion processing based on the feature value sequence corresponding to each feature data to obtain fusion data corresponding to the feature data subset includes:

step 302, performing bit-wise linear superposition on the characteristic value sequences corresponding to the characteristic data aiming at the characteristic data contained in one characteristic data subset to obtain fusion characteristic value sequences corresponding to the characteristic data subset.

In the embodiment of the application, for the feature data included in one feature data subset, the terminal adds the feature value sequences corresponding to the feature data bit by bit, and uses the sum of the feature values at each bit as a fused feature value of the fused feature value sequence. The fused feature value sequence comprises a plurality of fused feature values. The terminal arranges the fused feature values in the same order as the feature values in the feature value sequences to obtain the fused feature value sequence corresponding to the feature data subset. Specifically, the terminal determines the feature value of the i-th bit in each feature value sequence, where i is a positive integer in the range [1, number of bits of the feature value sequence]. The terminal then sums the determined feature values of the i-th bit to obtain the fused feature value of the i-th bit. Suppose a feature data subset has two corresponding feature value sequences, one being {W₁, -W₁, -W₁, W₁, W₁, -W₁} and the other being {W₂, W₂, -W₂, -W₂, -W₂, -W₂}. The terminal determines that the feature values of the 1st bit in the two feature value sequences are W₁ and W₂ respectively, and calculates the sum of W₁ and W₂ to obtain the fused feature value of the 1st bit, (W₁ + W₂). Similarly, the terminal calculates the fused feature value sequence corresponding to the feature data subset as {W₁ + W₂, -W₁ + W₂, -W₁ - W₂, W₁ - W₂, W₁ - W₂, -W₁ - W₂}. It is to be understood that the above feature value sequences are only examples, and neither the number of bits of a feature value sequence nor the feature values in it are limited.

Step 304, for the fused feature value sequence corresponding to the feature data subset, obtaining fusion data corresponding to the feature data subset according to the mapping strategy corresponding to the fused feature value sequence.

In this embodiment of the application, for the fused feature value sequence corresponding to the feature data subset, the terminal obtains the fusion data corresponding to the feature data subset according to the mapping strategy corresponding to the fused feature value sequence. The fused feature value sequence of one feature data subset is taken as an example below; the other feature data subsets are handled similarly and are not described again. For the fused feature value sequence, the terminal determines the fused feature value of the k-th bit, where k is a positive integer in the range [1, number of bits of the fused feature value sequence]. The terminal then maps the fused feature value of the k-th bit according to the mapping strategy corresponding to the fused feature value sequence to obtain the k-th bit mapping value of the fusion data. The terminal arranges the mapping values in the same order as the corresponding fused feature values in the fused feature value sequence to obtain the fusion data corresponding to the feature data subset, where the fusion data comprises a plurality of mapping values. Specifically, for the fused feature value sequence corresponding to the feature data subset, the terminal determines whether each fused feature value in the sequence is positive or negative. If the fused feature value is positive, the terminal maps it to the mapping value 1; if the fused feature value is negative, the terminal maps it to the mapping value 0.
For example, suppose the fused feature value sequence corresponding to one feature data subset is {13, 108, -22, -5, -32, 55}. The terminal determines that the fused feature value of the 1st bit is 13, maps 13 to 1 according to the mapping strategy corresponding to the fused feature value sequence, and takes 1 as the 1st bit mapping value of the fusion data. Similarly, the terminal maps the remaining fused feature values to obtain the fusion data {1, 1, 0, 0, 0, 1} corresponding to the feature data subset. It will be appreciated that the above fused feature value sequence is exemplary only; neither the number of bits in the fused feature value sequence nor the fused feature values themselves are limited.
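
The sign-based mapping described above can be illustrated with a minimal Python sketch. The function name `map_fused_values` is hypothetical and not part of the application. The text only specifies the positive and negative cases, so mapping a fused feature value of exactly zero to 0 is an assumption made here.

```python
from typing import List


def map_fused_values(fused_values: List[float]) -> List[int]:
    """Map each fused feature value to a bit, preserving the original order."""
    # Positive -> 1 and negative -> 0 follow the text.
    # Assumption: a value of exactly 0 is also mapped to 0 (not specified in the text).
    return [1 if value > 0 else 0 for value in fused_values]


# The example sequence from the description.
fusion_data = map_fused_values([13, 108, -22, -5, -32, 55])
print(fusion_data)
```

For the example sequence, the result matches the fusion data {1, 1, 0, 0, 0, 1} given in the description.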

In this embodiment, the terminal performs bitwise fusion processing on the feature value sequences to obtain the fusion data corresponding to the feature data subset. This technical scheme therefore reduces a high-dimensional feature data subset to lower-dimensional fusion data and then performs detection on the lower-dimensional fusion data, which improves detection efficiency.

In one embodiment, before calculating, for each feature data subset included in the feature data set, a hash sequence corresponding to each feature data in the feature data subset and a weight corresponding to each feature data, the method further includes:

calculating, for each feature data contained in each feature data subset, the information gain rate corresponding to the feature data; and, for each feature data subset, screening the feature data in the feature data subset whose information gain rate satisfies a first preset screening strategy to obtain a screened feature data subset.

In this embodiment of the application, the terminal calculates, for each feature data contained in each feature data subset, the information gain rate corresponding to the feature data. Optionally, any algorithm capable of calculating the information gain rate of feature data may be applied; this embodiment of the application is not limited in this respect. For each feature data subset, the terminal screens the feature data whose information gain rate satisfies the first preset screening strategy to obtain the screened feature data subset. Specifically, for a feature data subset, the terminal determines the number of feature data in the subset and compares it with a preset number. If the number of feature data is less than the preset number, the terminal takes the feature data subset itself as the screened feature data subset. If the number of feature data is greater than or equal to the preset number, the terminal sorts the feature data in the subset in descending order of information gain rate to obtain a sorted feature data sequence, selects the first preset number of feature data from that sequence, and forms the screened feature data subset from the selected feature data. Optionally, the preset number may be 3000.
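
The screening procedure above can be sketched in Python as follows. The application leaves the gain-rate algorithm open, so this sketch assumes the common definition used by C4.5-style decision trees (information gain divided by the split information of the feature), discrete feature values, and labelled samples. The names `entropy`, `gain_ratio` and `screen_features` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of `feature` about `labels`, divided by the feature's split information."""
    total = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # Assumption: a constant feature (zero split information) gets a gain rate of 0.
    return info_gain / split_info if split_info > 0 else 0.0


def screen_features(rates: Dict[str, float], preset_number: int) -> List[str]:
    """Keep the subset unchanged if it is small; otherwise keep the top `preset_number` features."""
    names = list(rates)
    if len(names) < preset_number:
        return names
    ranked = sorted(names, key=lambda name: rates[name], reverse=True)
    return ranked[:preset_number]


# Toy data: four samples with a binary label (1 = malicious, 0 = safe).
labels = [1, 1, 0, 0]
features = {"a": [1, 1, 0, 0], "b": [1, 0, 1, 0], "c": [0, 0, 0, 0]}
rates = {name: gain_ratio(column, labels) for name, column in features.items()}
print(screen_features(rates, preset_number=2))
```

In practice the preset number would be much larger (the description mentions 3000 as an option); a value of 2 is used here only to keep the toy example readable.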

In this embodiment, the terminal obtains the screened feature data subset by calculating the information gain rate corresponding to each feature data and applying the first preset screening strategy. This technical scheme screens the feature data in a high-dimensional feature data subset: feature data with little influence on the code detection result are removed and feature data with a large influence are retained. The dimension of the feature data subset is thereby reduced while the accuracy of code detection is maintained, which improves the efficiency of code detection.

In one embodiment, extracting a corresponding feature data set from a code to be detected includes:

extracting, from the code to be detected, the feature data set corresponding to each feature category according to the extraction strategy corresponding to that feature category.

The characteristic category comprises file characteristics, behavior characteristics and flow characteristics; the file features comprise image texture features; the traffic characteristics include protocol characteristics, flow characteristics, and encryption suite characteristics.

In this embodiment of the application, the terminal extracts the feature data set corresponding to each feature category from the code to be detected according to the extraction strategy corresponding to that feature category. Optionally, any method capable of extracting feature data from the code to be detected may be applied; this embodiment of the application is not limited in this respect. Specifically, without running the code to be detected, the terminal performs feature extraction on the file storing the code to be detected to obtain the feature data set corresponding to the file fingerprint. Optionally, the file fingerprint includes function call features, Application Programming Interface (API) call features, file directory structure features, script information features, string information features, application signature features, application system access permissions, application system component features, and image texture features. The terminal uses a sandbox to simulate the running environment of the code to be detected, records the operation behavior of the code during execution, and performs feature extraction on the operation behavior to obtain the feature data set corresponding to the behavior fingerprint. Optionally, the behavior fingerprint includes system call features, command execution features, critical path features, data access features, and network access behavior features. The terminal reverse-analyzes the communication protocol of the network traffic of the code to be detected and performs feature extraction on the traffic during the reverse analysis to obtain the feature data set corresponding to the network traffic fingerprint. Optionally, the network traffic fingerprint includes protocol features, flow features, and encryption suite features.

In this embodiment, the terminal extracts the feature data set corresponding to each feature category from the code to be detected according to the extraction strategy corresponding to that feature category. This ensures that the extracted features are comprehensive and sufficient, so that the target data set subsequently obtained from the feature data sets covers more feature categories, which helps ensure the accuracy of malicious code detection.

In an embodiment, as shown in fig. 4, determining a target data set according to fusion data corresponding to each feature data subset, and inputting the target data set to a trained malicious code detection model to obtain a code detection result, including:

Step 402, forming a fusion data set based on the fusion data corresponding to each feature data subset.

In the embodiment of the application, the terminal forms the fusion data set based on the fusion data corresponding to each feature data subset.

Step 404, sampling the fusion data set using a bootstrap aggregation algorithm to obtain a first preset number of initial data sets.

In this embodiment of the application, the terminal samples the fusion data set using a bootstrap aggregation (bagging) algorithm to obtain a first preset number of initial data sets.

Step 406, for each initial data set, randomly extracting a second preset number of fusion data from the initial data set with replacement, and deleting the fusion data whose weight does not satisfy a second preset screening strategy, to obtain a corresponding first preset number of target data sets.

In this embodiment of the application, for each initial data set, the terminal performs random extraction with replacement from the initial data set a second preset number of times to obtain a second preset number of fusion data. For the second preset number of fusion data corresponding to each initial data set, the terminal determines whether the weight corresponding to each fusion data is greater than or equal to a preset threshold. If the weight corresponding to a fusion data is greater than or equal to the preset threshold, the terminal determines that the weight satisfies the second preset screening strategy and leaves the fusion data unchanged. If the weight corresponding to a fusion data is less than the preset threshold, the terminal determines that the weight does not satisfy the second preset screening strategy and deletes the fusion data. Through this processing, the terminal obtains the first preset number of target data sets corresponding to the initial data sets.
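
Steps 404 and 406 can be sketched in Python as follows. This is a minimal illustration rather than the claimed implementation. Three things are assumptions: the representation of a fusion data item as a `(bits, weight)` tuple, the size of each initial data set (taken to be the size of the fusion data set), and the function name `build_target_sets`. The description does not state how a weight is associated with a fusion data item, so the weights here are placeholder values.

```python
import random
from typing import List, Tuple

# Hypothetical representation: (fusion data bits, weight of that fusion data).
FusionItem = Tuple[Tuple[int, ...], float]


def build_target_sets(
    fusion_set: List[FusionItem],
    first_preset_number: int,
    second_preset_number: int,
    weight_threshold: float,
    seed: int = 0,
) -> List[List[FusionItem]]:
    """Bootstrap-sample initial data sets, then resample and filter each one by weight."""
    rng = random.Random(seed)
    target_sets = []
    for _ in range(first_preset_number):
        # Step 404: bootstrap sample (drawn with replacement) forming one initial data set.
        # Assumption: each initial data set has the same size as the fusion data set.
        initial_set = rng.choices(fusion_set, k=len(fusion_set))
        # Step 406: draw the second preset number of items, again with replacement.
        drawn = rng.choices(initial_set, k=second_preset_number)
        # Keep items whose weight is >= the threshold; delete the others.
        target_sets.append([item for item in drawn if item[1] >= weight_threshold])
    return target_sets


fusion_set = [((1, 1, 0, 0, 0, 1), 0.9), ((0, 1, 1, 0, 1, 0), 0.2), ((1, 0, 1, 1, 0, 0), 0.6)]
for target_set in build_target_sets(fusion_set, 3, 5, weight_threshold=0.5):
    print(target_set)
```

Because low-weight items are deleted after sampling, a target data set may contain fewer than the second preset number of items.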

Step 408, inputting each target data set into a decision tree in the trained malicious code detection model to obtain a code detection result.

In this embodiment of the application, the terminal inputs each target data set into a decision tree in the trained malicious code detection model, and each decision tree outputs a code detection result. The trained malicious code detection model counts all the code detection results according to a preset result output strategy, determines the code detection result that was output most often, and takes that result as the final code detection result. Specifically, taking a trained malicious code detection model composed of three decision trees as an example, the terminal inputs three target data sets into the corresponding decision trees, and each decision tree outputs a code detection result. Suppose the first decision tree outputs malicious code, the second decision tree outputs malicious code, and the third decision tree outputs safe code. The trained malicious code detection model counts the results output by the three decision trees and finds that the most frequent result is malicious code. According to the preset result output strategy, the trained malicious code detection model outputs malicious code as the code detection result, so the code detection result obtained by the terminal is malicious code.
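
The counting step of the three-tree example can be sketched as a simple majority vote. The function name `majority_vote` and the label strings are placeholders, and the description does not say how ties are broken, so the tie behavior of this sketch (the result seen first among the tied ones wins) is an assumption.

```python
from collections import Counter
from typing import List


def majority_vote(tree_results: List[str]) -> str:
    """Return the code detection result output by the largest number of decision trees."""
    # Counter.most_common(1) yields the single most frequent result and its count.
    return Counter(tree_results).most_common(1)[0][0]


# Outputs of the three decision trees in the example above.
print(majority_vote(["malicious", "malicious", "safe"]))
```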

In this embodiment, the terminal determines the target data sets according to the bootstrap aggregation algorithm and the second preset screening strategy, and inputs the target data sets into the trained malicious code detection model to obtain the code detection result.

In one embodiment, as shown in fig. 5, the method for detecting malicious code further includes:

Step 502, obtaining malicious code, and extracting a feature data set from the malicious code.

In this embodiment of the application, the terminal extracts the feature data set from the malicious code. Optionally, any method capable of extracting feature data from malicious code may be applied; this embodiment of the application is not limited in this respect. Specifically, without running the malicious code, the terminal performs feature extraction on the file storing the malicious code to obtain the feature data set corresponding to the file fingerprint. Optionally, the file fingerprint includes function call features, Application Programming Interface (API) call features, file directory structure features, script information features, string information features, application signature features, application system access permissions, application system component features, and image texture features. The terminal uses a sandbox to simulate the running environment of the malicious code, records the operation behavior of the malicious code during execution, and performs feature extraction on the operation behavior to obtain the feature data set corresponding to the behavior fingerprint. Optionally, the behavior fingerprint includes system call features, command execution features, critical path features, data access features, and network access behavior features. The terminal reverse-analyzes the communication protocol of the network traffic of the malicious code and performs feature extraction on the traffic during the reverse analysis to obtain the feature data set corresponding to the network traffic fingerprint. Optionally, the network traffic fingerprint includes protocol features, flow features, and encryption suite features.

Step 504, for each feature data subset included in the feature data set, a hash sequence corresponding to each feature data in the feature data subset and a weight corresponding to each feature data are calculated.

In this embodiment, the processing procedure of the terminal in step 504 is similar to the processing procedure in step 104, and is not described again.

Step 506, determining a characteristic value sequence corresponding to each characteristic data according to the hash sequence and the weight of each characteristic data, and performing bit-wise fusion processing based on the characteristic value sequence corresponding to each characteristic data to obtain fusion data corresponding to the characteristic data subset.

In this embodiment, the processing procedure of step 506 by the terminal is similar to the processing procedure of step 106, and is not described again.

Step 508, determining a sample data set according to the fusion data corresponding to each feature data subset, and inputting the sample data set into the malicious code detection model to be trained to obtain a code prediction result.

Wherein the sample data set comprises a plurality of sample data subsets. The malicious code detection model to be trained may be a random forest model to be trained. The random forest model to be trained comprises a plurality of decision trees to be trained.

In this embodiment of the application, the terminal forms a fusion data set based on the fusion data corresponding to each feature data subset. The terminal samples the fusion data set using a bootstrap aggregation algorithm to obtain a first preset number of initial data sets. For each initial data set, the terminal performs random extraction with replacement from the initial data set a second preset number of times to obtain a second preset number of fusion data. For the second preset number of fusion data corresponding to each initial data set, the terminal determines whether the weight corresponding to each fusion data is greater than or equal to a preset threshold. If the weight corresponding to a fusion data is greater than or equal to the preset threshold, the terminal determines that the weight satisfies the second preset screening strategy and leaves the fusion data unchanged. If the weight corresponding to a fusion data is less than the preset threshold, the terminal determines that the weight does not satisfy the second preset screening strategy and deletes the fusion data. Through this processing, the terminal obtains the first preset number of sample data subsets corresponding to the initial data sets. The terminal inputs each sample data subset into a decision tree in the malicious code detection model to be trained, and each decision tree outputs a code detection result.

Step 510, adjusting the parameters of the malicious code detection model to be trained according to the code prediction result, so that the adjusted malicious code detection model satisfies a preset precision condition, to obtain the trained malicious code detection model.

In this embodiment of the application, for each decision tree in the malicious code detection model to be trained, the terminal obtains the code detection result output by the decision tree and determines whether it is the same as the real result. If the code detection result is the same as the real result, the terminal does not adjust the parameters of the corresponding decision tree and takes it as a trained decision tree. If the code detection result differs from the real result, the terminal adjusts the parameters of the corresponding decision tree until the adjusted decision tree satisfies the preset precision condition, to obtain a trained decision tree. When every decision tree satisfies the preset precision condition, the terminal obtains the trained malicious code detection model.

In this embodiment, the terminal extracts feature data sets of multiple categories from known malicious code, screens the feature data sets and reduces their dimension to obtain fusion data, randomly samples and screens the fusion data to obtain a sample data set, and trains the malicious code detection model with the sample data set, finally obtaining a trained malicious code detection model that satisfies the preset precision condition. In other words, this technical scheme can train a malicious code detection model that satisfies the preset precision requirement.

It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the sequence indicated by the arrows, the steps are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, the execution of the steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time and may be performed at different times. They are also not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, an embodiment of the present application further provides a malicious code detection apparatus for implementing the malicious code detection method described above. The solution provided by the apparatus is similar to the solution described in the method. Therefore, for the specific limitations in the one or more embodiments of the malicious code detection apparatus provided below, reference may be made to the limitations of the malicious code detection method above, and details are not repeated here.

In one embodiment, as shown in fig. 6, there is provided a malicious code detection apparatus, including:

a first obtaining module 602, configured to obtain a code to be detected, and extract a feature data set from the code to be detected; each feature data set comprises a plurality of feature data subsets;

a first calculating module 604, configured to calculate, for each feature data subset included in the feature data set, a hash sequence corresponding to each feature data in the feature data subset and a weight corresponding to each feature data;

a first determining module 606, configured to determine a feature value sequence corresponding to each feature data according to the hash sequence and weight of each feature data, and perform bitwise fusion processing based on the feature value sequence corresponding to each feature data to obtain fusion data corresponding to a feature data subset;

a second determining module 608, configured to determine a target data set according to the fusion data corresponding to each feature data subset, and input the target data set into the trained malicious code detection model to obtain a code detection result.

In one embodiment, the first determining module 606 is specifically configured to:

for each hash value contained in the hash sequence, determine the sign of the weight according to the hash value, and determine a feature value according to the sign and the weight;

for the feature data contained in one feature data subset, perform bitwise linear superposition on the feature value sequences corresponding to the feature data to obtain the fused feature value sequence corresponding to the feature data subset;

and, for the fused feature value sequence corresponding to the feature data subset, obtain the fusion data corresponding to the feature data subset according to the mapping strategy corresponding to the fused feature value sequence.
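
The three operations of the first determining module resemble a SimHash-style fingerprint and can be sketched end to end in Python. The sketch makes several assumptions that this passage does not state: the hash sequence is taken from a SHA-256 digest, a hash value of 1 gives the weight a positive sign and a hash value of 0 a negative sign, and a fused feature value of exactly zero maps to 0. All function names and the sample feature strings are hypothetical.

```python
import hashlib
from typing import Dict, List


def hash_sequence(feature: str, n_bits: int = 8) -> List[int]:
    """Return the top `n_bits` bits of the SHA-256 digest of `feature` (assumed hash function)."""
    value = int.from_bytes(hashlib.sha256(feature.encode("utf-8")).digest(), "big")
    return [(value >> (255 - i)) & 1 for i in range(n_bits)]


def feature_value_sequence(hash_bits: List[int], weight: float) -> List[float]:
    """Sign the weight by each hash value: bit 1 -> +weight, bit 0 -> -weight (assumed rule)."""
    return [weight if bit == 1 else -weight for bit in hash_bits]


def fuse_subset(subset: Dict[str, float], n_bits: int = 8) -> List[int]:
    """Superpose the feature value sequences bit by bit, then map each sum to 1 or 0."""
    fused = [0.0] * n_bits
    for feature, weight in subset.items():
        values = feature_value_sequence(hash_sequence(feature, n_bits), weight)
        fused = [total + value for total, value in zip(fused, values)]
    # Positive sums map to 1; negative sums (and, by assumption, zero) map to 0.
    return [1 if total > 0 else 0 for total in fused]


# A toy feature data subset: feature data mapped to its weight.
subset = {"CreateRemoteThread": 3.0, "VirtualAllocEx": 2.0, "GetTickCount": 0.5}
print(fuse_subset(subset))
```

Whatever the number of feature data in the subset, the fusion data always has `n_bits` entries, which is the dimension reduction effect described for the method.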

In one embodiment, the apparatus further comprises:

a second calculating module, configured to calculate, for each feature data contained in each feature data subset, the information gain rate corresponding to the feature data;

In one embodiment, the first obtaining module 602 is specifically configured to:

extract, from the code to be detected, the feature data set corresponding to each feature category according to the extraction strategy corresponding to that feature category;

In one embodiment, the second determining module 608 is specifically configured to:

sample the fusion data set using a bootstrap aggregation algorithm to obtain a first preset number of initial data sets;

for each initial data set, randomly extract a second preset number of fusion data from the initial data set with replacement, and delete the fusion data whose weight does not satisfy a second preset screening strategy, to obtain a corresponding first preset number of target data sets;

and input each target data set into a decision tree in the trained malicious code detection model to obtain a code detection result.

In one embodiment, the apparatus further comprises:

a second obtaining module, configured to obtain malicious code, and extract a feature data set from the malicious code; each feature data set comprises a plurality of feature data subsets;

a third calculating module, configured to calculate, for each feature data subset contained in the feature data set, a hash sequence corresponding to each feature data in the feature data subset and a weight corresponding to each feature data;

a third determining module, configured to determine a feature value sequence corresponding to each feature data according to the hash sequence and the weight of the feature data, and perform bitwise fusion processing on the feature value sequences corresponding to the feature data to obtain fusion data corresponding to the feature data subset;

a fourth determining module, configured to determine a sample data set according to the fusion data corresponding to each feature data subset, and input the sample data set into the malicious code detection model to be trained to obtain a code prediction result;

and an adjusting module, configured to adjust the parameters of the malicious code detection model to be trained according to the code prediction result, so that the adjusted malicious code detection model satisfies the preset precision condition, to obtain the trained malicious code detection model.

The modules in the malicious code detection apparatus described above may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 7. The computer device comprises a processor, a memory, a communication interface, a display screen and an input device which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of malicious code detection. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above-described method embodiments when executing the computer program.

In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.

It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, a database, or another medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, or data processing logic devices based on quantum computing.

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.

The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. A method for detecting malicious code, the method comprising:

determining a characteristic value sequence corresponding to each characteristic data according to the hash sequence and the weight of each characteristic data, and performing bit-wise fusion processing on the characteristic value sequence corresponding to each characteristic data to obtain fusion data corresponding to the characteristic data subset;

2. The method according to claim 1, wherein the determining a feature value sequence corresponding to each feature data according to the hash sequence and the weight of each feature data comprises:

3. The method according to claim 1, wherein performing bitwise fusion processing based on the feature value sequence corresponding to each feature data to obtain fused data corresponding to the feature data subset comprises:

4. The method according to claim 1, wherein before calculating, for each of the feature data subsets included in the feature data set, a hash sequence corresponding to each feature data in the feature data subset and a weight corresponding to each feature data, the method further comprises:

5. The method according to claim 1, wherein the extracting the feature data set from the code to be detected comprises:

6. The method of claim 1, wherein determining a target data set according to the fusion data corresponding to each of the feature data subsets, and inputting the target data set to a trained malicious code detection model to obtain a code detection result comprises:

7. The method of claim 1, further comprising:

8. An apparatus for detecting malicious code, the apparatus comprising:

and the second determining module is used for determining a target data set according to the fusion data corresponding to each feature data subset, and inputting the target data set to the trained malicious code detection model to obtain a code detection result.

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.