CN115859290B

CN115859290B - Malicious code detection method based on static characteristics and storage medium

Info

Publication number: CN115859290B
Application number: CN202310049009.0A
Authority: CN
Inventors: 王平; 荣星; 严锦立; 吴流丽; 毛建辉; 汪文晓; 严亚伟; 王耀; 贾雄; 刘筱明; 谷广宇; 王秋实; 尹韧达; 杜丽; 宋健
Original assignee: UNIT 61660 OF PLA
Current assignee: UNIT 61660 OF PLA
Priority date: 2023-02-01
Filing date: 2023-02-01
Publication date: 2023-05-16
Anticipated expiration: 2043-02-01
Also published as: CN115859290A

Abstract

The invention discloses a malicious code detection method based on static features and a storage medium. The method comprises: obtaining code to be detected and extracting a static feature array of the code to be detected; performing a first separable convolution and a second separable convolution on the static feature array to obtain a first size array, wherein the first separable convolution and the second separable convolution each comprise performing multichannel spatial information processing on an input array using a depthwise convolution layer; the convolution stride of the depthwise convolution layer in the first separable convolution is smaller than that in the second separable convolution; the number of cycles in which the first separable convolution and the second separable convolution alternate is greater than or equal to 2; and determining the code type using a fully connected layer and a Softmax layer. The method uses static features for comprehensive identification and raw feature extraction and separates the spatial information and the channel information of the static feature array, so that the spatial correlation among features is better captured during training, which facilitates malicious code detection.

Description

Malicious code detection method based on static characteristics and storage medium

Technical Field

The invention relates to the technical field of computer program security, and more particularly to a malicious code detection method based on static features and a storage medium.

Background

With the rapid development of computer network technology, network security has become a complex, practical and serious non-traditional security problem.

Malicious code attacks are one of the main factors affecting network security. By various deceptive means, an attacker injects computer viruses or Trojan horse programs into the attack target in order to destroy the resources of the target system or obtain information about them. In practice, most of the major network security incidents of recent years have used malicious code as the core attack component and have thereby caused substantial harm. The adverse effects of malicious code attacks are gradually expanding and pose a serious security risk to the country and society. Research on techniques for countering malicious code attacks has therefore become an urgent need for maintaining network security.

In the prior art, machine-learning-based malicious code detection methods can detect both known and unknown malicious code, but most of them suffer from an excessive number of algorithm parameters and incomplete extraction of code information, which affects detection accuracy.

Disclosure of Invention

The present invention has been made in view of the above-mentioned needs of the prior art, and an object of the present invention is to quickly and accurately identify malicious codes.

To solve the above problems, the invention adopts the following technical solution:

A malicious code detection method based on static features comprises the following steps:

acquiring a code to be detected, and extracting a static feature array of the code to be detected;

performing a first separable convolution and a second separable convolution on the static feature array to obtain a first size array, wherein the first separable convolution and the second separable convolution each comprise: performing multichannel spatial information processing on an input array using a depthwise convolution layer; and performing channel information processing on the output of the depthwise convolution layer using a pointwise convolution layer; the convolution stride of the depthwise convolution layer in the first separable convolution is smaller than that of the depthwise convolution layer in the second separable convolution; and the number of cycles in which the first separable convolution and the second separable convolution alternate is greater than or equal to 2;

processing the first size array using a fully connected layer and a Softmax layer to determine the code type.

Optionally, after extracting the static feature array of the code to be detected, the method further includes: normalizing the feature values in the static feature array.

Optionally, before the first size array is processed using the fully connected layer and the Softmax layer, a pooling layer performs max pooling on the first size array and extracts the maximum value in each pooling region.

Optionally, the depthwise convolution layer comprises a depthwise convolution kernel, and a BN layer and a ReLU layer connected in sequence after the depthwise convolution kernel; the output of the ReLU layer serves as the output of the depthwise convolution layer;

the pointwise convolution layer comprises a pointwise convolution kernel, and a BN layer and a ReLU layer connected in sequence after the pointwise convolution kernel; the output of the ReLU layer serves as the output of the pointwise convolution layer.

Optionally, acquiring the code to be detected and extracting the static feature array of the code to be detected includes:

disassembling the code to be detected into assembly language to obtain a disassembled file; extracting the static feature array of the code to be detected from the disassembled file;

the values in the static feature array comprise punctuation frequency, register character frequency, opcode frequency, function call information and keyword frequency.

Optionally, the method further comprises:

constructing a separable convolution model from the first separable convolution layer, the second separable convolution layer, the fully connected layer and the Softmax layer; wherein the first separable convolution layer performs the first separable convolution and the second separable convolution layer performs the second separable convolution; and training the separable convolution model using static feature arrays with code type labels.

Optionally, the method further comprises: optimizing the parameters of the separable convolution model using a stochastic gradient descent algorithm.

Optionally, before the first separable convolution and the second separable convolution are alternately performed on the static feature array to obtain the first size array, the static feature array is preprocessed using a convolution kernel of size 5*5 with a stride of 2.

Optionally, the method further comprises: the accuracy of the separable convolution model is tested using test set data.

A computer readable storage medium having stored thereon a static-feature-based malicious code detection program which, when executed by a processor, implements the steps of any one of the static-feature-based malicious code detection methods described above.

Compared with the prior art, the invention provides a method for extracting a static feature array of a file for computer-readable code; by counting the occurrence frequencies of various characters in the disassembled file, the method supports comprehensive identification and raw feature extraction of the file's static features. The separable convolution network designed in the embodiment of the invention has few parameters and trains quickly. By applying separable convolution to the static feature array, it separates the spatial information and the channel information of the array, so that the spatial correlation among features is better captured during training, which facilitates malicious code detection.

Drawings

In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the embodiments of the present description, and other drawings may be obtained according to these drawings for a person having ordinary skill in the art.

FIG. 1 is a flow chart of a method for detecting malicious code based on static features provided in an embodiment of the present invention;

FIG. 2 is a block diagram of a method for detecting malicious code based on static features according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a separable convolutional network of a method for detecting malicious code based on static features according to an embodiment of the present invention;

FIG. 4 is a network schematic diagram of a first separable convolutional layer of a malicious code detection method based on static features according to an embodiment of the present invention;

FIG. 5 is a network schematic diagram of a second separable convolution layer of a malicious code detection method based on static features according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

For the purpose of facilitating an understanding of the embodiments of the present invention, reference will now be made to the following description of specific embodiments, taken in conjunction with the accompanying drawings, which are not intended to limit the scope of the invention.

Example 1

This embodiment provides a malicious code detection method based on static features which, as shown in FIGS. 1-2, comprises the following steps:

S1: acquiring code to be detected, and extracting a static feature array of the code to be detected.

In the embodiment of the invention, this step specifically comprises: obtaining the code to be detected and disassembling it into assembly language to obtain a disassembled file; extracting the static feature array of the code to be detected from the disassembled file; the values in the static feature array comprise punctuation frequency, register character frequency, opcode frequency, function call information and keyword frequency.

Specifically:

The punctuation frequency includes: the number of times punctuation marks such as '-', '+', '[', ']' and '@' appear in the disassembled file.

The register character frequency includes: the number of times register symbols such as 'edx', 'esi', 'es', 'fs', 'ds', 'ss', 'gs', 'cs', 'ah', 'al', 'ax', 'bh', 'bl', 'bx', 'ch', 'cl', 'cx', 'dh', 'dl', 'dx', 'eax', 'ebp', etc. appear in the disassembled file.

The opcode frequency includes: the number of times opcodes such as 'add', 'al', 'bt', 'call', 'cdq', 'cld', 'cli', 'cmc', 'cmp', 'const', 'cwd', 'daa', 'db', 'dec', 'dw', 'enddp', 'ends', 'faddp', 'fchs', 'fdiv', 'fdvp', 'fdivp', 'fdivr', 'fild', 'fistp', 'fld', 'fstcw', 'fstcwil', 'fstwimul', 'fstpp', 'fmul', 'in', 'inc', 'minc', 'mins', 'jb', 'je', 'jg', 'jge', 'jl', 'jz', 'lea', 'loope', 'mov', 'movzx', 'mul', 'near', 'neg', 'not', 'or', 'out', 'outer', 'pop', 'popf', 'proc', 'push', 'pushf', 'rcl', 'rcr', 'rdtsc', 'rep', 'ret', 'retn', 'rol', 'ror', 'sal', 'sar', 'sbb', 'scas', 'setb', 'setle', 'setnle', 'setnz', 'setz', 'shl', 'shot', 'shr', 'sidt', 'stc', 'std', 'sti', 'sts', 'sub', 'test', 'wait', 'xchg', 'xor', etc. appear in the disassembled file.

The function call information includes: first generating a function call graph of the file, and then counting the number of occurrences of each character in the function call graph of the file.

The keyword frequency includes: the number of times character strings such as 'Virtual', 'Offset', 'loc', 'Import', 'Imports', 'var', 'Forwarder', 'UINT', 'LONG', 'BOOL', 'WORD', 'BYTES', 'large', 'short', 'dd', 'db', 'dw', 'XREF', 'PTR', 'DATA', 'FUNCTION', 'Extrn', 'byte', 'WORD', 'DWORD', 'char', 'DWORD', 'stdcall', 'arg', 'loblet', 'asc', 'align', 'WinMain', 'unk', 'cookie', 'off', 'nullsub', 'DllEntryPoint', 'System32', 'DLL', 'trunk', 'BASS', 'HMENU', 'DLL', 'LPWSTR', 'void', 'hreshout', 'HDC', 'LRESULT', 'HANDLE', 'HWND', 'LPSTR', 'INT', 'HLOCAL', 'farprroc', 'ATOM', 'HMODULE', 'WPARAM', 'HGLOBAL', 'entry', 'rva', 'colmap', 'exe', 'DATA', 'text', 'case', 'instrument', 'microsoft', 'polies', 'proc', 'scrollwindow', 'search', 'track', 'visual', '___security_cookie', 'assume', 'callvisual-jar', 'exportedenty', 'hardware', 'hkey_current', 'hkey_local_map', 'sp-analytically fast', 'unab iotato' appear in the disassembled file.
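The frequency counting described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the embodiment: the vocabularies are shortened, the tokenizing regular expression and the name `count_features` are hypothetical, and the function call graph feature is omitted.

```python
import re
from collections import Counter

# Hypothetical, shortened vocabularies; the embodiment lists many more entries.
PUNCTUATION = ["-", "+", "[", "]", "@"]
REGISTERS = ["eax", "edx", "esi", "ebp", "al", "ax"]
OPCODES = ["add", "call", "cmp", "mov", "push", "pop", "xor"]
KEYWORDS = ["Virtual", "Offset", "DWORD", "stdcall", "WinMain"]
VOCABULARY = PUNCTUATION + REGISTERS + OPCODES + KEYWORDS


def count_features(disassembly):
    """Return one occurrence count per vocabulary entry, in a fixed order."""
    # Whole-token counts for registers, opcodes and keywords (assumed tokenizer).
    tokens = Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", disassembly))
    # Single-character counts for punctuation marks.
    chars = Counter(disassembly)
    counts = [chars[p] for p in PUNCTUATION]
    counts += [tokens[t] for t in REGISTERS + OPCODES + KEYWORDS]
    return counts


# A made-up disassembly fragment used only for illustration.
sample = """
push ebp
mov ebp, esp
mov eax, [ebp+8]
add eax, [ebp+12]
pop ebp
retn
"""
features = dict(zip(VOCABULARY, count_features(sample)))
print(features)
```

Keeping the vocabulary order fixed means that the same position in the feature vector always refers to the same symbol, which is what allows the counts of different files to be compared.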

After the binary file is disassembled, code features can be extracted more effectively; compared with the binary file, the disassembled file better reflects the functions of the code, which facilitates extraction of the static features of the code.

By extracting the static features of the code to be detected, an identifiable malicious code feature array can be constructed, which facilitates recognition by the convolutional network.

Normalization confines the data in the static feature array to a certain range, eliminating the influence of differing scales among the indicators.

In the embodiment of the invention, 16384-dimensional features are obtained by combining the five types of features. Because the numerical ranges of the features differ, all feature values are normalized, so as to avoid bias in the classification results caused by differences in data span or order of magnitude. This serves as data preprocessing for the input of the convolutional network.

In this embodiment, according to the normalized feature value, the feature of each file is represented as a static feature array with a size of 128×128, so as to complete extraction of the static feature array of the code to be detected.
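The normalization and reshaping steps can be illustrated with a short sketch. The embodiment does not state which normalization is used; per-feature min-max scaling is assumed here, and the function names are hypothetical.

```python
def min_max_by_feature(samples):
    """Scale every feature column of `samples` into [0, 1] independently.

    Assumption: min-max scaling per feature across all files; the source
    only says that the feature values are normalized.
    """
    columns = list(zip(*samples))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in samples
    ]


def to_square_array(values, side=128):
    """Reshape a flat feature vector of length side*side into side rows."""
    if len(values) != side * side:
        raise ValueError("expected %d values, got %d" % (side * side, len(values)))
    return [values[r * side:(r + 1) * side] for r in range(side)]


# Two made-up 16384-dimensional feature vectors standing in for two files.
raw = [[(i * k) % 11 for i in range(128 * 128)] for k in (1, 3)]
arrays = [to_square_array(v) for v in min_max_by_feature(raw)]
print(len(arrays[0]), len(arrays[0][0]))
```

A constant feature column is mapped to 0.0 in this sketch to avoid division by zero; that choice is also an assumption.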

Furthermore, the static feature arrays of code of known types are recorded in a feature library to obtain a training feature library used for training the Softmax layer.

S2: performing a first separable convolution and a second separable convolution on the static feature array to obtain a first size array, wherein the first separable convolution and the second separable convolution each comprise: performing multichannel spatial information processing on an input array using a depthwise convolution layer; and performing channel information processing on the output of the depthwise convolution layer using a pointwise convolution layer; the convolution stride of the depthwise convolution layer in the first separable convolution is smaller than that of the depthwise convolution layer in the second separable convolution; and the number of cycles in which the first separable convolution and the second separable convolution alternate is greater than or equal to 2.

The first separable convolution means that the input array is fed into a first separable convolution layer for convolution, and the second separable convolution means that the input array is fed into a second separable convolution layer for convolution.

In the embodiment of the invention, alternating means that the first separable convolution and the second separable convolution process the input array in turn: the first separable convolution processes the static feature array, its result is processed by the second separable convolution, the result of the second separable convolution is processed by the first separable convolution again, and so on in sequence. The number of cycles is the number of times the first separable convolution layer and the second separable convolution layer alternate, with one first separable convolution layer and one adjacently connected second separable convolution layer forming one cycle.

In the embodiment of the invention, the first separable convolution layer and the second separable convolution layer each comprise a depthwise convolution layer and a pointwise convolution layer. The depthwise convolution layer comprises a plurality of input channels and output channels in one-to-one correspondence, and its output channels also serve as the input channels of the pointwise convolution layer. Each output channel of the depthwise convolution layer corresponds to one h x w convolution kernel. The pointwise convolution layer comprises a plurality of output channels, each of which corresponds to a 1*1 convolution kernel. During convolution, each h x w kernel in the depthwise convolution layer is convolved with the data features of one input channel, and the result is output through the corresponding output channel; the data features output by all output channels of the depthwise convolution layer are then fused by the 1*1 convolution kernels.
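The two stages of a separable convolution can be sketched in plain Python. This is an illustrative sketch only: it uses no padding, omits the BN and ReLU layers, computes the cross-correlation that deep learning frameworks usually call convolution, and the function names and toy values are assumptions.

```python
def depthwise_conv(channels, kernels, stride=1):
    """Filter each input channel with its own h x w kernel (no padding)."""
    outputs = []
    for image, kernel in zip(channels, kernels):
        kh, kw = len(kernel), len(kernel[0])
        rows = range(0, len(image) - kh + 1, stride)
        cols = range(0, len(image[0]) - kw + 1, stride)
        outputs.append([
            [sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in cols]
            for r in rows
        ])
    return outputs


def pointwise_conv(channels, weights):
    """Fuse channels with 1*1 kernels; weights[o][i] maps input i to output o."""
    height, width = len(channels[0]), len(channels[0][0])
    return [
        [[sum(w * channels[i][r][c] for i, w in enumerate(row_w))
          for c in range(width)]
         for r in range(height)]
        for row_w in weights
    ]


# Toy input: two 4x4 channels, one 3x3 kernel per channel.
inputs = [[[1] * 4 for _ in range(4)], [[2] * 4 for _ in range(4)]]
kernels = [[[1] * 3 for _ in range(3)] for _ in range(2)]
spatial = depthwise_conv(inputs, kernels, stride=1)  # spatial information only
fused = pointwise_conv(spatial, [[1, 1], [1, -1]])   # channel information only
print(spatial)
print(fused)
```

The depthwise stage never mixes channels and the pointwise stage never looks at neighbouring positions, which is the separation of spatial and channel information that the description refers to.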

The embodiment of the invention provides a process for alternating the first separable convolution and the second separable convolution with a number of cycles greater than 2, specifically as follows: each channel of the depthwise convolution layer in the first separable convolution performs convolution on the static feature array to obtain first data features, which are output from the corresponding output channels, so that the spatial information of the code to be detected is separated for the first time; the input channels of the pointwise convolution layer in the first separable convolution receive the result of the depthwise convolution layer, and the 1*1 convolution kernels fuse the first data features for the first time to obtain a first data packet.

Each channel of the depthwise convolution layer in the second separable convolution performs convolution on the first data packet to obtain second data features, which are output from the corresponding output channels, so that the spatial information of the code to be detected is separated for the second time; the input channels of the pointwise convolution layer in the second separable convolution receive the result of the depthwise convolution layer, and the 1*1 convolution kernels fuse the second data features for the second time to obtain a second data packet.

Each channel of the depthwise convolution layer in the first separable convolution performs convolution on the second data packet to obtain third data features, which are output from the corresponding output channels, so that the spatial information of the code to be detected is separated for the third time; the input channels of the pointwise convolution layer in the first separable convolution receive the result of the depthwise convolution layer, and the 1*1 convolution kernels fuse the third data features for the third time to obtain the first size array.

In the embodiment of the present invention, each output channel of the depthwise convolution layer in the first separable convolution corresponds to a 3*3 convolution kernel with a stride of 1, and each output channel of the depthwise convolution layer in the second separable convolution corresponds to a 3*3 convolution kernel with a stride of 2.

According to the embodiment of the invention, the parameters are reduced by separating the channel information and the space information of the feature array, so that the processing of invalid information can be avoided, and the redundant information in the convolution processing result is greatly reduced.
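The parameter reduction can be made concrete with a small calculation. The sketch below compares the weight count of a standard convolution with that of a depthwise plus pointwise pair; biases are ignored, and the channel counts 64 and 128 are hypothetical values chosen for illustration, since the source does not give channel counts.

```python
def standard_conv_params(kh, kw, c_in, c_out):
    """Weights in a standard convolution: one kh x kw x c_in kernel per output."""
    return kh * kw * c_in * c_out


def separable_conv_params(kh, kw, c_in, c_out):
    """Weights in a depthwise (kh x kw per input channel) plus pointwise (1*1) pair."""
    return kh * kw * c_in + c_in * c_out


# Hypothetical layer: 3*3 kernels, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 3, 64, 128)    # 73728
separable = separable_conv_params(3, 3, 64, 128)  # 8768
print(standard, separable, round(standard / separable, 2))
```

For these assumed sizes the separable form needs roughly one eighth of the weights, and the saving grows with the number of output channels.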

The first separable convolution layers and the second separable convolution layers are arranged alternately, and the convolution stride of the depthwise convolution layer in the first separable convolution is smaller than that of the depthwise convolution layer in the second separable convolution. This reduces the spatial size of the output array while allowing deep extraction, so that code feature information is extracted more comprehensively and is not directly discarded because features are insufficiently fused after successive convolutions.

Preferably, four first separable convolutions and three second separable convolutions are alternately arranged with a number of cycles greater than 3.
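The alternating arrangement and its effect on spatial size can be traced with a short sketch. The padding of 1 is an assumption (the source specifies only the 3*3 kernels and the strides of 1 and 2), any preprocessing layer is left out, and the function names are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def alternating_plan(n_first=4, n_second=3):
    """Interleave first (stride 1) and second (stride 2) separable layers."""
    plan = []
    for i in range(max(n_first, n_second)):
        if i < n_first:
            plan.append(("first", 1))
        if i < n_second:
            plan.append(("second", 2))
    return plan


size = 128  # side length of the static feature array in the embodiment
sizes = []
for name, stride in alternating_plan():
    size = conv_output_size(size, stride=stride)
    sizes.append(size)
    print(f"{name:>6} separable convolution, stride {stride}: {size} x {size}")
```

Under these assumptions each first separable convolution keeps the spatial size and each second separable convolution halves it, so the stride-1 layers extract features at a given resolution before the stride-2 layers reduce it.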

This arrangement avoids both overfitting caused by too many parameters and inaccurate classification caused by too few parameters; it retains as much useful information from the static features as possible and improves the code detection effect.

S3: processing the first size array using the fully connected layer and the Softmax layer to determine the code type.

The Softmax layer is used for carrying out data classification or regression on the data characteristics after the characteristic fusion and outputting classification or regression results.
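A fully connected layer followed by Softmax can be sketched as below. The weights, input values and class labels are made up for illustration; in the method they would be learned from labeled static feature arrays.

```python
import math


def fully_connected(inputs, weights, biases):
    """One output per weight row: dot product with the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Convert scores to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical class labels and toy parameters.
labels = ["benign", "trojan", "virus"]
logits = fully_connected([1.0, 2.0], [[1, 0], [0, 1], [1, 1]], [0.0, 0.0, 0.0])
probabilities = softmax(logits)
predicted = labels[probabilities.index(max(probabilities))]
print(probabilities, predicted)
```

The code type is taken as the class with the highest probability.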

Preferably, before the first size array is processed using the fully connected layer and the Softmax layer, a pooling layer performs max pooling on the first size array and extracts the maximum value in each pooling region.

The maximum pooling is to divide an input image into a plurality of rectangular areas and extract the maximum value of each rectangular area, and in the embodiment of the invention, the maximum value of each pooling area in the first size array is extracted by using one-dimensional maximum pooling, and all the maximum values are integrated to obtain a pooling data packet.
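One-dimensional max pooling can be sketched in a few lines. The pool size and the default of non-overlapping regions are assumptions, since the source does not state them.

```python
def max_pool_1d(values, pool_size, stride=None):
    """Return the maximum of each pooling region along one dimension."""
    stride = stride or pool_size  # assumed default: non-overlapping regions
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, stride)]


pooled = max_pool_1d([1, 5, 2, 8, 3, 4], pool_size=2)
print(pooled)
```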

Preferably, the depthwise convolution layer comprises a depthwise convolution kernel, and a BN layer and a ReLU layer connected in sequence after the depthwise convolution kernel; the output of the ReLU layer serves as the output of the depthwise convolution layer.

The BN layer performs batch normalization, which mitigates changes in the data distribution of intermediate layers during training of the separable convolution model, prevents vanishing or exploding gradients, and speeds up training.

The ReLU layer is an activation function that sets negative values in the convolution result to zero and keeps positive values unchanged.
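The BN and ReLU steps that follow a convolution kernel can be sketched as below. The sketch normalizes a single list of values with batch statistics; the scale `gamma`, shift `beta` and the small constant `eps` are the usual batch normalization parameters, and their values here are assumptions.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize to zero mean and unit variance, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def relu(values):
    """Set negative values to zero and keep positive values unchanged."""
    return [max(0.0, v) for v in values]


activated = relu(batch_norm([-2.0, 0.0, 2.0]))
print(activated)
```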

Preferably, the method further comprises:

constructing a separable convolution model from the first separable convolution layer, the second separable convolution layer, the fully connected layer and the Softmax layer; wherein the first separable convolution layer performs the first separable convolution and the second separable convolution layer performs the second separable convolution; and training the separable convolution model using static feature arrays with code type labels. The network structure of the separable convolution model is shown in FIG. 3, the network structure of the first separable convolution layer is shown in FIG. 4, and the network structure of the second separable convolution layer is shown in FIG. 5.

Preferably, the method further comprises: optimizing the parameters of the separable convolution model using a stochastic gradient descent algorithm.

Preferably, before the first separable convolution and the second separable convolution are alternately performed on the static feature array to obtain the first size array, the static feature array is preprocessed using a convolution kernel of size 5*5 with a stride of 2.

In this case, the separable convolution model further includes a preprocessing layer, which applies a plurality of convolution kernels of size 5*5 with a stride of 2 to the input array.

Preferably, the method further comprises: the accuracy of the separable convolution model is tested using test set data.

When training the separable convolution model, the labeled training feature array set is divided into a training set, a validation set and a test set in a ratio of 7:3. Throughout training, the network parameters with the highest accuracy on the validation set are saved, and these parameters are used to evaluate the test set data.

The network parameters are optimized using a stochastic gradient descent algorithm, with the initial learning rate set to 0.001, the momentum set to 0.9, and the learning rate decay set to 0.000001. The model input is batched, and each batch may be set to contain 32 static feature arrays. After all the static feature arrays have been trained on once, the order of the static feature arrays is shuffled.
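The optimizer settings above can be illustrated with a minimal sketch of stochastic gradient descent with momentum. The hyperparameter values come from the description; the time-based decay formula and all function names are assumptions, since the source does not say how the decay is applied.

```python
import random


def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9):
    """One update: the velocity accumulates past gradients, then moves the weights."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def decayed_lr(initial_lr, decay, step):
    """Assumed time-based schedule: lr shrinks slowly as the step count grows."""
    return initial_lr / (1.0 + decay * step)


def shuffled_batches(samples, batch_size=32, rng=random):
    """Shuffle once per pass over the data, then yield consecutive batches."""
    order = list(samples)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


# Toy run: minimize f(w) = w^2 (gradient 2w) with the values from the description.
weights, velocity = [1.0], [0.0]
for step in range(3):
    lr = decayed_lr(0.001, 0.000001, step)
    grads = [2.0 * w for w in weights]
    weights, velocity = sgd_momentum_step(weights, grads, velocity, lr)
print(weights)
print([len(b) for b in shuffled_batches(range(100))])
```

Calling `shuffled_batches` again at the start of each pass over the data reproduces the shuffling after every full pass that the description mentions.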

Further preferably, a malicious code detection method based on feature code matching can be combined with the static-feature-based malicious code detection method provided by the embodiment of the invention to improve detection accuracy. Static feature codes are obtained by format analysis and code analysis of the code of malicious software: unique fixed data fragments, together with the position information of those fragments, are extracted, where a data fragment is a section of special code or character string information, and are abstracted into static feature codes that serve as features of the malicious software. The code to be detected is matched against the known static feature codes stored in a virus feature library to determine whether it contains malicious code. Used alone, feature code matching can accurately detect and remove a given piece of malicious code once its static feature code has been determined, but it cannot effectively detect unknown malicious code for which no predefined feature code exists; that is, it lags behind and cannot identify unknown malicious software. In addition, the features of malicious software are easily altered by encryption and obfuscation, so the extracted features must be updated frequently. When feature code matching is combined with the separable convolution model designed in the embodiment of the invention, code missed by matching is detected a second time, achieving fast and accurate malicious code detection.
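The combined two-stage detection can be sketched as follows. Everything in the sketch is hypothetical: the representation of a feature code as an (offset, fragment) pair, the function names, and the `classify` callable that stands in for the trained separable convolution model.

```python
def matches_signature(code, signatures):
    """Check known feature codes given as (offset, fragment) pairs.

    An offset of None means the fragment may appear anywhere in the code.
    """
    for offset, fragment in signatures:
        if offset is None:
            if fragment in code:
                return True
        elif code[offset:offset + len(fragment)] == fragment:
            return True
    return False


def detect(code, signatures, classify):
    """First stage: feature code matching. Second stage: model on the remainder."""
    if matches_signature(code, signatures):
        return ("malicious", "feature code match")
    return (classify(code), "separable convolution model")


# Made-up feature codes and a stand-in classifier used only for illustration.
known = [(0, b"\xde\xad"), (None, b"evil_payload")]
stand_in_model = lambda code: "malicious" if len(code) > 20 else "benign"

print(detect(b"\xde\xad\x00\x01", known, stand_in_model))
print(detect(b"short and clean", known, stand_in_model))
```

Only code that passes the first stage reaches the model, which matches the description of the model as a secondary check on code that feature code matching misses.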

The computer readable code may take the form of software, data, or images and other computer readable data.

Compared with the prior art, the embodiment of the invention provides a method for extracting a static feature array of a file for computer-readable code; by counting the occurrence frequencies of various characters in the disassembled file, the method supports comprehensive identification and raw feature extraction of the file's static features. The separable convolution network designed in the embodiment of the invention has few parameters and trains quickly. By applying separable convolution to the static feature array, it separates the spatial information and the channel information of the array, so that the spatial correlation among features is better captured during training, which facilitates malicious code detection. The method provided by the embodiment of the invention can serve as a complement to traditional malicious code detection: combined with the malicious code detection method based on feature code matching, it performs secondary detection on code that was missed, achieving fast and accurate malicious code detection.

A computer readable storage medium having stored thereon a static-feature-based malicious code detection program which, when executed by a processor, implements the steps of the static-feature-based malicious code detection method.

The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not intended to limit the scope of the invention to the particular embodiments; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A method for detecting malicious code based on static features, comprising:

processing the first size array by using a full connection layer and a Softmax layer to determine code types;

the obtaining the code to be detected, extracting the static feature array of the code to be detected, includes:

2. The method for detecting malicious code based on static features according to claim 1, further comprising, after extracting the static feature array of the code to be detected: normalizing the feature values in the static feature array.

3. The method for detecting malicious code based on static features according to claim 1, wherein before the first size array is processed by using a full connection layer and a Softmax layer, the first size array is subjected to maximum pooling processing by a pooling layer, and a maximum value in each pooling area is extracted.

4. The method for detecting malicious code based on static features according to claim 1, wherein the depth-wise convolution layer comprises a depth-wise convolution kernel, and a BN layer and a ReLU layer sequentially connected after the depth-wise convolution kernel; the output of the ReLU layer is used as the output result of the depth-by-depth convolution layer;

5. The method for detecting malicious code based on static features according to claim 1, wherein the method further comprises:

6. The method for detecting malicious code based on static features according to claim 5, wherein the method further comprises: optimizing the parameters of the separable convolution model using a stochastic gradient descent algorithm.

7. The method for detecting malicious code based on static features according to claim 1, wherein before the static feature array is alternately processed by a first separable convolution and a second separable convolution to obtain a first size array, the static feature array is preprocessed with a step length of 2 by using a convolution kernel with a size of 5*5.

8. The method for detecting malicious code based on static features according to claim 5, wherein the method further comprises: the accuracy of the separable convolution model is tested using test set data.

9. A computer readable storage medium having stored thereon a static-feature-based malicious code detection program which, when executed by a processor, implements the steps of the static-feature-based malicious code detection method of any one of claims 1 to 8.