CN116028936B - Malicious code detection method, medium and device based on neural network - Google Patents

Malicious code detection method, medium and device based on neural network

Info

Publication number
CN116028936B
CN116028936B CN202310160086.3A CN202310160086A CN116028936B CN 116028936 B CN116028936 B CN 116028936B CN 202310160086 A CN202310160086 A CN 202310160086A CN 116028936 B CN116028936 B CN 116028936B
Authority
CN
China
Prior art keywords
vector
sequence
detected
operation code
malicious
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310160086.3A
Other languages
Chinese (zh)
Other versions
CN116028936A (en)
Inventor
李峰
孙晓鹏
李仲举
刘鹏
郭举
石广军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Yuntian Safety Technology Co ltd
Original Assignee
Shandong Yuntian Safety Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Yuntian Safety Technology Co ltd filed Critical Shandong Yuntian Safety Technology Co ltd
Priority to CN202310160086.3A priority Critical patent/CN116028936B/en
Publication of CN116028936A publication Critical patent/CN116028936A/en
Application granted granted Critical
Publication of CN116028936B publication Critical patent/CN116028936B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention relates to the field of network security, and in particular to a neural-network-based method, medium, and device for detecting malicious code. The method comprises: obtaining a detection sequence of the PE file to be detected, the detection sequence including an operation code sequence; slicing the operation code sequence to generate a plurality of operation code subsequences; performing feature transformation on each operation code subsequence to generate a corresponding feature sub-vector to be detected; preprocessing the feature sub-vectors to be detected to determine at least one initial target sub-vector; and inputting each initial target sub-vector into a malicious detection neural network model to generate a corresponding detection result. If any detection result is malicious, the code corresponding to the PE file to be detected is determined to be malicious. Preprocessing preliminarily identifies the feature sub-vectors that may contain malicious code, which reduces the number of feature sub-vectors input into the neural network and therefore the consumption of computing resources.

Description

Malicious code detection method, medium and device based on neural network
Technical Field
The present invention relates to the field of network security, and in particular, to a method, medium, and apparatus for detecting malicious codes based on a neural network.
Background
In antivirus research and the fight against malicious code, the continuous development of malicious code technology drives the continuous progress of malicious program detection technology, and new techniques for writing malicious code lead to new detection techniques.
In the prior art, malicious code can be identified by feature recognition based on machine learning, which allows it to be detected quickly and with high precision. Because malicious code contains a large number of similar operation features, existing machine-learning-based detection generally extracts the opcode (operation code) sequence of the code as the feature to be detected and inputs it into a neural network for identification.
However, antivirus work today is adversarial. To counter this kind of detection, authors pad malicious code after writing it with a large amount of normal code, or useless code that has no effect, to weaken the malicious features the code exhibits. As a result, existing detection methods cannot detect such malicious code accurately: detection accuracy falls and a large amount of computing resources is consumed.
Disclosure of Invention
To address the above technical problems, the invention adopts the following technical solution:
According to a first aspect of the present invention, there is provided a malicious code detection method based on a neural network, the method including the following steps:
Obtaining a detection sequence of the PE file to be detected. The detection sequence includes an operation code sequence.
Slicing the operation code sequence to generate a plurality of operation code subsequences. The length of each operation code subsequence is less than or equal to the length of the operation code sequence.
Performing feature transformation processing on each operation code subsequence to generate a feature sub-vector to be detected corresponding to each operation code subsequence.
Preprocessing each feature sub-vector to be detected, and determining at least one initial target sub-vector. The initial target sub-vectors are at least one of the plurality of feature sub-vectors to be detected.
Inputting each initial target sub-vector into a malicious detection neural network model to generate a detection result corresponding to each initial target sub-vector. The malicious detection neural network model comprises a gated convolutional network and a fully connected neural network. The gated convolutional network is used to extract the data features of the initial target sub-vector. The fully connected neural network is used to generate a corresponding classification detection result according to the data features of the initial target sub-vector.
If any detection result is malicious, determining that the code corresponding to the PE file to be detected is malicious.
The preprocessing comprises the following steps:
Calculating the similarity between each feature sub-vector to be detected and the feature vector to be detected. The feature vector to be detected is the feature vector corresponding to the whole operation code sequence.
Determining each feature sub-vector to be detected whose similarity is smaller than a first similarity threshold as an initial target sub-vector.
According to a second aspect of the present invention, there is provided a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements a neural network-based malicious code detection method as described above.
According to a third aspect of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing a neural network-based malicious code detection method as described above when executing the computer program.
The invention has at least the following beneficial effects:
In the invention, the feature vector to be detected is segmented into a plurality of feature sub-vectors to be detected, and each feature sub-vector to be detected is input into the trained neural network model for separate identification and detection. For malicious code to which a large amount of normal or useless code has been added, the part of the disassembled operation code sequence that belongs to normal code is far longer than the part that belongs to malicious code. Segmenting the feature vector to be detected therefore yields shorter feature sub-vectors to be detected. In the feature sub-vectors that contain the malicious code, the proportion of interleaved normal or useless code is lower, so the malicious features are more prominent. The neural network model can then extract the malicious features more easily, which improves detection accuracy.
In addition, most of the feature vector to be detected reflects normal code. In the similarity calculation, the similarity between a feature sub-vector containing malicious code and the feature vector to be detected is therefore lower than the similarity between a feature sub-vector containing only normal code and the feature vector to be detected. Preprocessing can thus preliminarily identify which of the feature sub-vectors to be detected may contain malicious code, namely the initial target sub-vectors, and only these are input into the neural network for detection. Because fewer feature sub-vectors are input into the neural network, the consumption of computing resources is reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a malicious code detection method based on a neural network according to an embodiment of the present invention.
Fig. 2 is a flowchart of a malicious code detection method based on a neural network according to another embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
As a possible embodiment of the present invention, as shown in fig. 1, there is provided a malicious code detection method based on a neural network, the method including the steps of:
S100, acquiring a detection sequence of a PE (Portable Executable) file to be detected. The detection sequence includes an operation code sequence. Further, the detection sequence also includes an API (Application Programming Interface) call sequence of the PE file to be detected.
Typically, malicious programs reside in PE files in order to steal information or damage devices. Common file types such as EXE, DLL, OCX, SYS and COM are PE files.
The operation code sequence of the PE file to be detected is obtained by disassembling the file. The corresponding API call sequence is obtained from the list of API calls generated when the PE file to be detected runs.
In the subsequent S200 and S300, the operation code sequence and the API call sequence may each be processed to remove the information produced by useless code. After this denoising, the feature sub-vector to be detected corresponding to the operation codes and the feature sub-vector to be detected corresponding to the API calls may each be used as input for detection in S400. Alternatively, the two may be spliced into a single feature sub-vector to be detected and used as the input in S400.
The feature information contained in the spliced feature sub-vector to be detected is more comprehensive, and the accuracy of the final detection result of the malicious detection neural network model can be correspondingly improved.
S200, slicing the operation code sequence with a sliding window to generate a plurality of operation code subsequences, wherein the sliding step length L of the sliding window satisfies:
L = S/n.
S is the total length of the sliding window. n is a sliding coefficient, and n is a positive integer. Preferably, n = 2.
This step is illustrated by the following example:
Suppose a malicious PE file has 10000 operation codes, of which only those at positions 5000-6000 correspond to malicious code. The total length of the sliding window can be set freely according to the typical operation code sequence length of malicious code encountered in the actual usage scenario, for example 1000.
The corresponding sliding step is then L = 1000/2 = 500. With this sliding window, the operation code sequence of the malicious PE file in this example is divided into 19 segments of length 1000 each. The 10th, 11th and 12th segments contain operation codes corresponding to the malicious code: the proportion of malicious operation codes is 0.5 in the 10th segment (4500-5500), 1 in the 11th segment (5000-6000) and 0.5 in the 12th segment (5500-6500).
Segmenting the feature vector to be detected with the sliding window therefore lowers the proportion of interleaved normal or useless code in the feature sub-vectors that contain the malicious code, so the malicious features of that code are more prominent.
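The slicing arithmetic of this example can be checked with a short sketch. It assumes zero-based, half-open index ranges and drops any tail shorter than one window; the function names `sliding_windows` and `malicious_ratio` are illustrative and do not come from the patent. Under this indexing the 12th segment also overlaps the malicious range by half.

```python
def sliding_windows(seq_len, window, n=2):
    """Return (start, end) half-open index pairs, stepping by L = S / n."""
    step = window // n
    if seq_len <= window:
        return [(0, seq_len)]
    # Assumption: a trailing remainder shorter than one window is dropped.
    return [(s, s + window) for s in range(0, seq_len - window + 1, step)]


def malicious_ratio(span, mal_start, mal_end):
    """Fraction of a window that overlaps the range [mal_start, mal_end)."""
    start, end = span
    overlap = max(0, min(end, mal_end) - max(start, mal_start))
    return overlap / (end - start)


windows = sliding_windows(10000, 1000, n=2)
ratios = [malicious_ratio(w, 5000, 6000) for w in windows]
# Keep only segments that touch the malicious range, numbered from 1.
hits = {i + 1: r for i, r in enumerate(ratios) if r > 0}
print(len(windows), hits)
```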
Further, S satisfies the following condition:
S = W/k, where W is the total length of the operation code sequence corresponding to the feature vector to be detected, k is a window coefficient, and k is a positive integer greater than or equal to 2. Preferably, k = 2, 4 or 8.
In general, useless code at least as long as the malicious code itself is added to weaken the features of the malicious code. In addition, because malicious code is easily discovered while it is running, its running time must be kept as short as possible, which means the amount of added useless code has to be limited. These two characteristics allow a suitable value of S to be estimated.
S300, performing feature transformation processing on each operation code subsequence to generate a feature sub-vector to be detected corresponding to each operation code subsequence.
Using the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, an operation code subsequence or API call sequence of the PE file to be detected can be converted into a corresponding sequence of TF-IDF values. A stacked denoising autoencoder then reduces the dimensionality, yielding for each operation code subsequence a feature sub-vector to be detected of a preset dimension.
The feature sub-vector to be detected in this step may be the feature vector obtained by encoding the operation code subsequence alone; the feature vector obtained by encoding a combined feature formed by splicing the operation code subsequence with the corresponding API call sequence; or the feature vector obtained by encoding a combined feature formed by splicing the operation code subsequence, the corresponding API call sequence and the PE file header fields.
The combined features formed by splicing the operation code sequence, the API call sequence and the PE file header fields can contain more comprehensive and rich feature information, and the accuracy of the final detection result of the malicious detection neural network model can be correspondingly improved.
S400, inputting each feature sub-vector to be detected into a malicious detection neural network model respectively, and generating a detection result corresponding to each feature sub-vector to be detected. The malicious detection neural network model comprises a gating convolutional network and a fully-connected neural network. The gating convolution network is used for extracting the data characteristics of the feature sub-vector to be detected. The fully-connected neural network is used for generating a corresponding classification detection result according to the data characteristics of the feature sub-vector to be detected.
The malicious detection neural network model in the step is a neural network model which is trained in advance and has the capability of identifying malicious codes and normal codes, and can be generated by taking the existing model as a basic model and training through a training sample.
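The patent does not spell out the gate formulation, so the following is only a rough sketch of how a gated convolution followed by a fully connected output could be wired. It assumes a GLU-style gate (a linear convolution multiplied elementwise by the sigmoid of a second convolution), a single channel, and random untrained weights; all names and sizes except the 25-dimensional input are hypothetical.

```python
import math
import random


def conv1d(x, kernel, bias):
    """Valid 1-D cross-correlation of list x with list kernel, plus a bias."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_conv(x, w, b, v, c):
    """GLU-style gate (assumed): conv(x, w) scaled by sigmoid(conv(x, v))."""
    linear = conv1d(x, w, b)
    gate = conv1d(x, v, c)
    return [a * sigmoid(g) for a, g in zip(linear, gate)]


def classify(x, params):
    """Gated convolution, then one fully connected unit with sigmoid output."""
    feats = gated_conv(x, params["w"], params["b"], params["v"], params["c"])
    logit = sum(f * u for f, u in zip(feats, params["fc"])) + params["fc_b"]
    return sigmoid(logit)  # score in (0, 1); higher means more suspicious


rng = random.Random(0)
dim, k = 25, 3  # 25-dimensional input as in the description; kernel size assumed
params = {
    "w": [rng.uniform(-1, 1) for _ in range(k)], "b": 0.0,
    "v": [rng.uniform(-1, 1) for _ in range(k)], "c": 0.0,
    "fc": [rng.uniform(-1, 1) for _ in range(dim - k + 1)], "fc_b": 0.0,
}
score = classify([rng.random() for _ in range(dim)], params)
```

A practical model would learn these weights from labelled samples and would typically use several channels and layers; the sketch only shows the data flow from feature sub-vector to classification score.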
According to the generated detection result corresponding to each feature sub-vector to be detected, whether the corresponding PE file to be detected is malicious code or not can be determined, and the specific determination method is as follows:
S500, if any detection result is malicious, determining that the code corresponding to the PE file to be detected is malicious code.
Further, S501, if all the detection results are normal, determining the code corresponding to the PE file to be detected as a normal code.
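The file-level decision in S500 and S501 amounts to an "any segment is malicious" rule. A minimal sketch, assuming each segment yields a score in [0, 1] and that 0.5 is the cut-off (both are assumptions, as is the name `file_verdict`):

```python
def file_verdict(segment_scores, threshold=0.5):
    """'malicious' if any segment score reaches the threshold, else 'normal'."""
    flagged = any(s >= threshold for s in segment_scores)
    return "malicious" if flagged else "normal"


print(file_verdict([0.1, 0.9, 0.2]))
print(file_verdict([0.1, 0.2]))
```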
According to the invention, the feature vector to be detected is segmented into a plurality of feature sub-vectors to be detected by means of the sliding window, and each feature sub-vector to be detected is input into the trained neural network model for separate identification and detection. For malicious code to which a large amount of normal or useless code has been added, the part of the disassembled operation code sequence that belongs to normal code is far longer than the part that belongs to malicious code. The sliding window therefore yields shorter feature sub-vectors to be detected. In the feature sub-vectors that contain the malicious code, the proportion of interleaved normal or useless code is lower, so the malicious features are more prominent. The neural network model can then extract the malicious features more easily, which improves detection accuracy.
As another possible embodiment of the present invention, sliding windows of several total lengths are used, and these total lengths are obtained as follows:
S210, disassembling each malicious PE file in a malicious code library to generate an operation code sequence corresponding to each malicious PE file.
Existing malicious code libraries collect a large amount of malicious code, so disassembling them yields the operation code sequences of a large number of malicious samples. Because malicious code of the same type uses highly consistent tactics and techniques when attacking, the code is similar to a certain extent. This characteristic allows the total length of the sliding window to be determined more accurately.
S220, carrying out K-Means clustering processing on the operation code sequences corresponding to the malicious PE files to generate G cluster groups.
S230, taking the average sequence length of all the operation code sequences in each cluster group as a total length of the sliding window, giving $A_1, A_2, \dots, A_i, \dots, A_G$, where $A_i$ is the average sequence length of the $i$-th cluster group, $i = 1, 2, \dots, G$, and $G$ is the total number of cluster groups.
Preferably, S240, taking the mode or median of all the operation code sequence lengths in each cluster group as a total length of the sliding window, giving $B_1, B_2, \dots, B_i, \dots, B_G$, where $B_i$ is the mode or median of the operation code sequence lengths of the $i$-th cluster group, $i = 1, 2, \dots, G$, and $G$ is the total number of cluster groups.
In this embodiment, the K-Means clustering algorithm groups the existing malicious code into several clusters according to code length, and the mean, mode or median of the malicious code lengths in each cluster is used as a total length of the sliding window. The lengths that malicious code is most likely to have are thus estimated from prior experience with existing malicious code, so the sliding-window total lengths are determined more accurately and the windows segment malicious code more precisely.
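As an illustration of S220 to S240, the sketch below clusters opcode sequence lengths with a one-dimensional K-Means (Lloyd's algorithm) and derives one window length per cluster. Clustering on length alone follows the paragraph above; the deterministic initialisation, the sample lengths and the function name `kmeans_1d` are assumptions made for the example.

```python
from statistics import mean, median


def kmeans_1d(values, g, iters=100):
    """Lloyd's algorithm on scalars; centres start at evenly spaced quantiles."""
    ordered = sorted(values)
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * g)] for i in range(g)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(g)]
        for v in values:
            nearest = min(range(g), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous centre.
        updated = [mean(grp) if grp else centres[i] for i, grp in enumerate(groups)]
        if updated == centres:
            break
        centres = updated
    return groups


# Hypothetical opcode sequence lengths of samples from a malicious code library.
lengths = [480, 500, 520, 980, 1000, 1020, 1990, 2000, 2010]
groups = kmeans_1d(lengths, g=3)
window_means = [round(mean(grp)) for grp in groups]   # A_1 .. A_G (S230)
window_medians = [median(grp) for grp in groups]      # B_1 .. B_G (S240)
print(window_means, window_medians)
```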
As another possible embodiment of the present invention, slicing the operation code sequence using a sliding window includes:
S201, slicing the operation code sequence G times, using sliding windows of total length $A_1, A_2, \dots, A_i, \dots, A_G$ respectively.
When several sliding-window lengths have been determined, the operation code sequence of the PE file to be detected is sliced once with each window length. Feature transformation is applied to every segment obtained, and each resulting feature sub-vector is identified individually by the malicious detection neural network model.
As another possible embodiment of the present invention, obtaining an operation code sequence of a PE file to be tested includes:
S110, unpacking the PE file to be detected to generate binary data corresponding to the PE file to be detected.
S120, disassembling the binary data corresponding to the PE file to be detected to generate the operation code sequence of the PE file to be detected.
Many PE files are packed (the executable's resources are compressed and encrypted, yet the compressed program can still be run directly), so the real instructions in the file cannot be seen directly. The PE file therefore has to be unpacked before feature extraction to obtain the code that is actually executed. PEiD is used to identify the packer of each PE file and label the file accordingly; the unpacking tool for each type of packer is then chosen according to the label and used to unpack the file, yielding the binary data of each PE file.
Features are then extracted from the binary data of each PE file, specifically the corresponding operation codes, the API (application programming interface) call sequence and the PE file header fields.
The binary data of the PE file is disassembled with IDA Pro to obtain the disassembled assembly instructions, namely the operation code sequence of the PE file to be detected.
As another possible embodiment of the present invention, the feature transformation process includes:
S310, performing 3-gram grouping on the operation code subsequence to generate a plurality of operation code sub-group sequences;
S320, calculating the TF-IDF of each operation code sub-group sequence within the whole operation code sequence;
S330, encoding the TF-IDF of each operation code sub-group sequence within the whole operation code sequence to generate the feature sub-vector to be detected corresponding to the operation code subsequence.
In this embodiment, consecutive operation codes of the operation code subsequence are grouped in order into 3-grams, namely the operation code sub-group sequences, and the TF-IDF of each operation code sub-group sequence within the whole operation code sequence is then calculated.
A stacked denoising autoencoder can then reduce the dimensionality to obtain a feature vector of a preset dimension, for example 25. In S330, the combined feature formed by splicing the operation code subsequence, the corresponding API call sequence and the PE file header fields may also be encoded to generate the corresponding feature sub-vector to be detected.
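The 3-gram and TF-IDF steps (S310 and S320) can be sketched as follows. The sketch treats each operation code subsequence as one document and the set of subsequences of a file as the corpus, and uses a smoothed IDF, log((1 + N)/(1 + df)) + 1; the patent fixes neither choice, so both are assumptions, as are the function names. The autoencoder-based dimensionality reduction is left out.

```python
import math
from collections import Counter


def three_grams(opcodes):
    """Consecutive opcode triples taken in order from the subsequence."""
    return [tuple(opcodes[i:i + 3]) for i in range(len(opcodes) - 2)]


def tfidf_vectors(subsequences):
    """One sparse TF-IDF dict per subsequence; the subsequences form the corpus."""
    docs = [Counter(three_grams(s)) for s in subsequences]
    n_docs = len(docs)
    df = Counter(g for d in docs for g in d)  # number of subsequences containing g
    vectors = []
    for d in docs:
        total = sum(d.values()) or 1
        vectors.append({
            g: (c / total) * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)
            for g, c in d.items()
        })
    return vectors


subs = [
    ["push", "mov", "call", "push", "mov", "call"],
    ["push", "mov", "call", "xor", "jmp"],
]
vecs = tfidf_vectors(subs)
```

In this toy input the 3-gram `("call", "xor", "jmp")` occurs in only one subsequence, so it receives a higher weight there than `("push", "mov", "call")`, which occurs in both.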
As another possible embodiment of the present invention, as shown in fig. 2, there is further provided a method for detecting malicious code based on a neural network, where a preprocessing step is added between S300 and S400, and the specific contents are as follows:
S100, acquiring a detection sequence of the PE file to be detected. The detection sequence includes an operation code sequence.
S200, slicing the operation code sequence to generate a plurality of operation code subsequences. The length of each operation code subsequence is less than or equal to the length of the operation code sequence.
S300, performing feature transformation processing on each operation code subsequence to generate a feature sub-vector to be detected corresponding to each operation code subsequence.
S301, preprocessing each feature sub-vector to be detected, and determining at least one initial target sub-vector. The initial target sub-vector is at least one of a plurality of feature sub-vectors to be tested.
S401, inputting each initial target sub-vector into a malicious detection neural network model respectively, and generating a detection result corresponding to each initial target sub-vector. The malicious detection neural network model comprises a gating convolutional network and a fully-connected neural network. The gated convolutional network is used to extract the data features of the initial target sub-vector. The fully connected neural network is used for generating a corresponding classification detection result according to the data characteristics of the initial target sub-vector.
S500, if any detection result is malicious, determining that codes corresponding to PE files to be detected are malicious codes.
The preprocessing comprises the following steps:
S311, calculating the similarity between each feature sub-vector to be detected and the feature vector to be detected. The feature vector to be detected is the feature vector corresponding to the whole operation code sequence.
The feature vector to be detected can be obtained in the same way as the feature sub-vectors to be detected: dimensionality reduction by the stacked denoising autoencoder produces a 25-dimensional feature vector.
Thus, the similarity between any two vectors can be calculated by a cosine similarity algorithm.
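A minimal cosine similarity sketch for S311 follows. Returning 0.0 for a zero vector is a convention chosen here, not something the patent specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```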
S312, determining the feature sub-vector to be detected with the similarity smaller than the first similarity threshold value as an initial target sub-vector.
The first similarity threshold in this embodiment may be set according to the actual usage scenario, for example to a value between 30% and 80%. Alternatively, the 3rd or 4th smallest of the similarity values may be used as the first similarity threshold. For example, if the similarities are 20%, 22%, 25%, 25.6%, 40%, 41%, 42%, 44% and 45%, then 25% or 25.6% may be used as the first similarity threshold.
Specifically, the first similarity threshold X may be determined as follows.
Sorting the obtained similarity values to obtain a similarity sequence $C_1, C_2, \dots, C_d, \dots, C_w$, where $C_1 < C_2 < \dots < C_d < \dots < C_w$, $C_d$ is the similarity value ranked $d$-th in ascending order, $d = 1, 2, \dots, w$, and $w$ is the total number of feature sub-vectors to be detected.
Obtaining the interval value sequence $D_1, D_2, \dots, D_d, \dots, D_{w-1}$ corresponding to the similarity sequence, where $D_d = C_{d+1} - C_d$ is the $d$-th interval value.
Generating the first similarity threshold X from the similarity sequence and the interval value sequence, where X satisfies:
$X = (Y_1 + Y_2)/2$, where $Y_1$ and $Y_2$ are the two similarity values corresponding to $\mathrm{MAX}(D_1, D_2, \dots, D_d, \dots, D_{w-1})$, that is, the adjacent pair $C_d$ and $C_{d+1}$ whose interval $D_d$ is largest.
The threshold algorithm of this embodiment separates a group of small, closely spaced similarity values from the rest more accurately. For example, suppose that in the similarity calculation 8 feature sub-vectors to be detected have similarities in the range 50%-60% and only two have similarities in the range 20%-30%. The first similarity threshold obtained by this algorithm falls in the gap between the two groups, so exactly the two values in the range 20%-30% are selected.
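The largest-gap threshold can be sketched in a few lines. The input reuses the nine similarity values listed earlier in this embodiment; picking the first widest gap when several are tied is an assumption, and the function name is illustrative.

```python
def first_similarity_threshold(similarities):
    """Midpoint of the widest gap between neighbouring sorted similarity values."""
    c = sorted(similarities)
    if len(c) < 2:
        raise ValueError("need at least two similarity values")
    gaps = [c[d + 1] - c[d] for d in range(len(c) - 1)]   # D_d = C_{d+1} - C_d
    d_max = max(range(len(gaps)), key=gaps.__getitem__)   # first widest gap
    return (c[d_max] + c[d_max + 1]) / 2                  # X = (Y_1 + Y_2) / 2


sims = [0.20, 0.22, 0.25, 0.256, 0.40, 0.41, 0.42, 0.44, 0.45]
x = first_similarity_threshold(sims)
# Indices of the feature sub-vectors kept as initial target sub-vectors.
targets = [i for i, s in enumerate(sims) if s < x]
print(x, targets)
```

Here the widest gap lies between 25.6% and 40%, so the threshold sits at their midpoint and the four lowest values are selected.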
Most of the feature vector to be detected reflects normal code. In the similarity calculation, the similarity between a feature sub-vector containing malicious code and the feature vector to be detected is therefore lower than the similarity between a feature sub-vector containing only normal code and the feature vector to be detected. Preprocessing can thus preliminarily identify which of the feature sub-vectors to be detected may contain malicious code, namely the initial target sub-vectors, and only these are input into the neural network for detection. Because fewer feature sub-vectors are input into the neural network, the consumption of computing resources is reduced.
As another possible embodiment of the present invention, S312, determining the feature sub-vector to be measured with the similarity smaller than the first similarity threshold as the initial target sub-vector includes:
S3121, taking each feature sub-vector to be detected whose similarity is smaller than the first similarity threshold as a first splicing vector.
S3122, splicing the first splicing vector and the two feature sub-vectors to be detected adjacent to it into an initial target sub-vector.
Sliding-window slicing usually cannot place all of the malicious code in exactly one segment. More often, a window boundary cuts through the middle of the malicious code, so part of the malicious code also lies in the adjacent segments. In this embodiment, splicing the first splicing vector with its two adjacent feature sub-vectors to be detected lets the initial target sub-vector contain the malicious code more completely. The features of the malicious code in the initial target sub-vector are therefore more complete, which can improve the accuracy of the subsequent detection result.
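The splicing in S3121 and S3122 might look like the sketch below. It concatenates the previous, current and next feature sub-vectors in segment order; how the first and last segments are handled is not stated in the patent, so using the single available neighbour there is an assumption, and the function name is illustrative.

```python
def splice_with_neighbours(sub_vectors, similarities, threshold):
    """For each low-similarity segment, concatenate previous + current + next."""
    targets = []
    last = len(sub_vectors) - 1
    for i, sim in enumerate(similarities):
        if sim >= threshold:
            continue  # not a first splicing vector
        lo, hi = max(0, i - 1), min(last, i + 1)  # edge segments have one neighbour
        spliced = [x for vec in sub_vectors[lo:hi + 1] for x in vec]
        targets.append((i, spliced))
    return targets


vectors = [[1], [2], [3], [4]]
sims = [0.5, 0.2, 0.6, 0.1]
print(splice_with_neighbours(vectors, sims, threshold=0.3))
```

Note that a spliced vector is up to three times as long as a single feature sub-vector, so the detection model's input size would have to accommodate it.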
As another possible embodiment of the present invention, before slicing the operation code sequence, the method further includes:
S111, if the length of the operation code sequence is smaller than a comparison length, inputting the feature vector to be detected generated from the operation code sequence into the malicious detection neural network model to generate a detection result corresponding to the operation code sequence. The comparison length is the minimum of $A_1, A_2, \dots, A_i, \dots, A_G$.
When the length of the operation code sequence is smaller than the minimum length of existing malicious code, the code has evidently not been padded with useless code. The operation code sequence therefore does not need to be sliced and is detected directly, which avoids redundant operations and saves computing resources.
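The short-sequence shortcut of S111 combines naturally with slicing by several window lengths (S201). The following sketch shows one possible dispatch; skipping windows longer than the sequence and dropping incomplete tail segments are assumptions, and `plan_segments` is a hypothetical name.

```python
def plan_segments(opcodes, window_lengths, n=2):
    """Return the opcode segments that go on to feature transformation."""
    if len(opcodes) < min(window_lengths):   # S111: below the comparison length
        return [opcodes]                     # detect the whole sequence directly
    segments = []
    for size in window_lengths:              # S201: one slicing pass per A_i
        if size > len(opcodes):
            continue                         # assumed: skip oversized windows
        step = max(1, size // n)             # L = S / n
        for start in range(0, len(opcodes) - size + 1, step):
            segments.append(opcodes[start:start + size])
    return segments


short = plan_segments(list(range(3)), [4, 8])
longer = plan_segments(list(range(8)), [4, 8])
print(len(short), len(longer))
```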
Embodiments of the present invention also provide a non-transitory computer readable storage medium that may be disposed in an electronic device to store at least one instruction or at least one program for implementing one of the method embodiments, the at least one instruction or the at least one program being loaded and executed by a processor to implement the methods provided by the embodiments described above.
Embodiments of the present invention also provide an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.
Embodiments of the present invention also provide a computer program product comprising program code for causing an electronic device to carry out the steps of the method according to the various exemplary embodiments of the invention described in the present specification when the program product is run on the electronic device.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A method for detecting malicious code based on a neural network, the method comprising the steps of:
acquiring a detection sequence of a PE file to be detected; the detection sequence includes an operation code sequence;
slicing the operation code sequence to generate a plurality of operation code subsequences; the length of each operation code subsequence is less than or equal to the length of the operation code sequence;
performing feature transformation processing on each operation code sub-sequence to generate a feature sub-vector to be detected corresponding to each operation code sub-sequence;
preprocessing each feature sub-vector to be detected, and determining at least one initial target sub-vector; the initial target sub-vector is at least one of a plurality of feature sub-vectors to be detected;
inputting each initial target sub-vector into a malicious detection neural network model respectively, and generating a detection result corresponding to each initial target sub-vector; the malicious detection neural network model comprises a gating convolutional network and a fully-connected neural network; the gating convolution network is used for extracting the data characteristics of the initial target sub-vector; the fully-connected neural network is used for generating a corresponding classification detection result according to the data characteristics of the initial target sub-vector;
if any detection result is malicious, determining that the code corresponding to the PE file to be detected is malicious;
the preprocessing comprises the following steps:
calculating the similarity between the feature sub-vector to be detected and the feature vector to be detected; the feature vector to be detected is the feature vector corresponding to the operation code sequence;
and determining the feature sub-vector to be detected, of which the similarity is smaller than a first similarity threshold value, as an initial target sub-vector.
2. The method of claim 1, wherein determining the feature sub-vector to be detected whose similarity is smaller than the first similarity threshold as the initial target sub-vector comprises:
taking the feature sub-vector to be detected whose similarity is smaller than the first similarity threshold as a first splicing vector;
and splicing the first splicing vector and the two feature sub-vectors to be detected adjacent to the first splicing vector into the initial target sub-vector.
3. The method of claim 1, wherein slicing the operation code sequence to generate a plurality of operation code subsequences comprises:
slicing the operation code sequence by using a sliding window to generate a plurality of operation code subsequences; wherein the sliding step length L of the sliding window satisfies:
L = S/n;
S is the total length of the sliding window; n is a sliding coefficient, and n is a positive integer.
4. The method of claim 3, wherein the sliding window has a plurality of total lengths, and the total lengths of the sliding window are obtained as follows:
performing disassembly operation on each malicious PE file in a malicious code library, and generating an operation code sequence corresponding to each malicious PE file;
K-Means clustering processing is carried out on the operation code sequences corresponding to the malicious PE files, and G cluster groups are generated;
taking the average sequence length of all the operation code sequences in each cluster group as a total length of the sliding window, giving the total lengths A_1, A_2, …, A_i, …, A_G; wherein A_i is the average sequence length corresponding to the i-th cluster group; i = 1, 2, …, G; G is the total number of cluster groups.
5. The method of claim 4, wherein the slicing the operation code sequence using a sliding window comprises:
slicing the operation code sequence G times, using sliding windows whose total lengths are A_1, A_2, …, A_i, …, A_G respectively.
6. The method of claim 4, wherein prior to slicing the opcode sequence, the method further comprises:
if the length of the operation code sequence is smaller than the comparison length, inputting the feature vector to be detected generated from the operation code sequence into the malicious detection neural network model to generate a detection result corresponding to the operation code sequence; the comparison length is the minimum value of A_1, A_2, …, A_i, …, A_G;
if any detection result is malicious, determining that the code corresponding to the PE file to be detected is malicious.
7. The method of claim 1, wherein obtaining the operation code sequence of the PE file to be detected comprises:
unpacking the PE file to be detected to generate binary data corresponding to the PE file to be detected;
and disassembling the binary data corresponding to the PE file to be detected to generate the operation code sequence of the PE file to be detected.
8. The method of claim 1, wherein the feature transformation process comprises:
performing 3-gram grouping processing on the operation code subsequence to generate a plurality of operation code sub-group sequences;
calculating the TF-IDF of each operation code sub-group sequence in the whole operation code sequence;
and encoding the TF-IDF of each operation code sub-group sequence in the whole operation code sequence to generate the feature sub-vector to be detected corresponding to the operation code subsequence.
9. A non-transitory computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements a neural network-based malicious code detection method according to any one of claims 1 to 8.
10. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements a neural network-based malicious code detection method as claimed in any one of claims 1 to 8.
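As an illustration only, the slicing of claim 3 and the preprocessing of claim 1 can be sketched in Python as follows. Several simplifications are assumptions and not part of the claims: raw 3-gram counts stand in for the TF-IDF encoding of claim 8, cosine similarity stands in for the unspecified similarity measure, the step L = S/n is rounded down to an integer, and trailing opcodes that do not fill a further step are ignored. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def slide(opcodes: Sequence[str], window: int, n: int) -> List[List[str]]:
    """Slice with total window length S=window and step L = S/n (floored)."""
    step = max(window // n, 1)
    if len(opcodes) <= window:
        return [list(opcodes)]
    return [list(opcodes[s:s + window])
            for s in range(0, len(opcodes) - window + 1, step)]


def trigram_counts(opcodes: Sequence[str]) -> Counter:
    """Count opcode 3-grams (a simplified stand-in for TF-IDF features)."""
    return Counter(tuple(opcodes[i:i + 3]) for i in range(len(opcodes) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def initial_target_indices(opcodes: Sequence[str], window: int, n: int,
                           threshold: float) -> List[int]:
    """Indices of subsequences whose similarity to the whole sequence
    is smaller than the first similarity threshold."""
    whole = trigram_counts(opcodes)
    subs = slide(opcodes, window, n)
    return [i for i, sub in enumerate(subs)
            if cosine(trigram_counts(sub), whole) < threshold]
```

Under these assumptions, a subsequence whose opcode 3-grams differ markedly from those of the whole file receives a low similarity and is the only kind of subsequence forwarded to the neural network model, which is how the preprocessing reduces the number of model inputs.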
CN202310160086.3A 2023-02-24 2023-02-24 Malicious code detection method, medium and device based on neural network Active CN116028936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310160086.3A CN116028936B (en) 2023-02-24 2023-02-24 Malicious code detection method, medium and device based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310160086.3A CN116028936B (en) 2023-02-24 2023-02-24 Malicious code detection method, medium and device based on neural network

Publications (2)

Publication Number Publication Date
CN116028936A CN116028936A (en) 2023-04-28
CN116028936B true CN116028936B (en) 2023-05-30

Family

ID=86077765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310160086.3A Active CN116028936B (en) 2023-02-24 2023-02-24 Malicious code detection method, medium and device based on neural network

Country Status (1)

Country Link
CN (1) CN116028936B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975865B (en) * 2023-08-11 2024-05-28 北京天融信网络安全技术有限公司 Malicious Office document detection method, device, equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9361458B1 (en) * 2014-10-08 2016-06-07 Trend Micro Incorporated Locality-sensitive hash-based detection of malicious codes
CN107908963A (en) * 2018-01-08 2018-04-13 北京工业大学 A kind of automatic detection malicious code core feature method
CN109886021A (en) * 2019-02-19 2019-06-14 北京工业大学 A kind of malicious code detecting method based on API overall situation term vector and layered circulation neural network
CN111079143A (en) * 2019-11-25 2020-04-28 北京理工大学 Trojan horse detection method based on multi-dimensional feature map
CN111737694A (en) * 2020-05-19 2020-10-02 华南理工大学 Behavior tree-based malicious software homology analysis method
CN112559978A (en) * 2020-12-18 2021-03-26 北京邮电大学 Multithreading program plagiarism detection method based on dynamic birthmarks and related equipment
CN113239354A (en) * 2021-04-30 2021-08-10 武汉科技大学 Malicious code detection method and system based on recurrent neural network
CN114519187A (en) * 2022-02-17 2022-05-20 北京工业大学 Multi-dimensional hybrid feature-based Android malicious application detection method and system
CN115630358A (en) * 2022-07-20 2023-01-20 哈尔滨工业大学(深圳) Malicious software classification method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
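C. Solomonides. Intelligent network application programming interface server architecture. Proceedings 2000 IEEE Intelligent Network Workshop. 2002, pp. 29-36. *
苏晗舶 (Su Hanbo). Research on a malicious code clustering analysis method based on N-gram feature extraction. China Master's Theses Full-text Database, Information Science and Technology. 2020, (No. 8), pp. I138-16. *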

Also Published As

Publication number Publication date
CN116028936A (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN109063055B (en) Method and device for searching homologous binary files
CN109815487B (en) Text quality inspection method, electronic device, computer equipment and storage medium
CN110175851B (en) Cheating behavior detection method and device
CN116028936B (en) Malicious code detection method, medium and device based on neural network
CN111930610B (en) Software homology detection method, device, equipment and storage medium
CN116089951B (en) Malicious code detection method, readable storage medium and electronic equipment
CN115221516B (en) Malicious application program identification method and device, storage medium and electronic equipment
CN113988061A (en) Sensitive word detection method, device and equipment based on deep learning and storage medium
CN111222137A (en) Program classification model training method, program classification method and device
CN112070506A (en) Risk user identification method, device, server and storage medium
CN112464248A (en) Processor exploit threat detection method and device
CN108090117B (en) A kind of image search method and device, electronic equipment
CN112733140A (en) Detection method and system for model tilt attack
CN111259396A (en) Computer virus detection method based on deep learning convolutional neural network and compression method of deep learning neural network
CN110968702B (en) Method and device for extracting rational relation
CN109670304B (en) Malicious code family attribute identification method and device and electronic equipment
CN113762294B (en) Feature vector dimension compression method, device, equipment and medium
CN113111350A (en) Malicious PDF file detection method and device and electronic equipment
CN112163217B (en) Malware variant identification method, device, equipment and computer storage medium
JP6427480B2 (en) IMAGE SEARCH DEVICE, METHOD, AND PROGRAM
CN110059180B (en) Article author identity recognition and evaluation model training method and device and storage medium
CN111108516B (en) Evaluating input data using a deep learning algorithm
CN106650443B (en) Malicious code family identification method based on incremental DBSCAN algorithm
CN117113352B (en) Method, system, equipment and medium for detecting malicious executable file of DCS upper computer
CN115718696B (en) Source code cryptography misuse detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant