CN115577362A - Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code - Google Patents

Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code

Info

Publication number
CN115577362A
CN115577362A (application CN202211105496.XA)
Authority
CN
China
Prior art keywords
code
assembly
source code
cross
slicing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211105496.XA
Other languages
Chinese (zh)
Inventor
苏小红
陶文鑫
魏宏巍
郑伟宁
万佳元
王甜甜
张彦航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202211105496.XA priority Critical patent/CN115577362A/en
Publication of CN115577362A publication Critical patent/CN115577362A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577Assessing vulnerabilities and evaluating computer system security

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a vulnerability detection method based on cross-modal feature enhancement of source code and assembly code. The method extracts syntactic and semantic features related to control dependence and data dependence from the source code, extracts syntactic and semantic features related to memory operations from the assembly code, and then feeds the assembly code, statement-aligned with the high-level-language program source code, into a bimodal representation learning model with cross-modal feature enhancement and fusion for software vulnerability detection. The method performs representation learning on the two program modalities of high-level-language source code and assembly code: using the statement alignment relation between source code and assembly code, vulnerability-related semantic features are extracted from the source code modality and the assembly code modality respectively; different deep learning networks and a cross-attention mechanism are used to learn the semantic relevance between the two modalities; and feature-level fusion makes full use of the feature complementarity of the two modalities, thereby improving the accuracy of software vulnerability detection.

Description

Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code
Technical Field
The invention belongs to the field of software vulnerability detection, relates to a method for automatically detecting software vulnerabilities, and particularly relates to a software vulnerability detection method based on cross-modal feature enhancement of source codes and assembly codes.
Background
A software vulnerability refers to a defect in software that may be exploited by an attacker at any point in the software life cycle; once exploited, it can cause serious harm such as privacy leakage, unauthorized access, and ransomware attacks. Software vulnerability detection technology is an important method and means for reducing software security flaws and software security risks. Program representation learning is the basis and key of software vulnerability detection. The process of learning syntactic and semantic feature representations of code from an intermediate representation of the program using deep learning techniques is referred to as representation learning of code.
At present, deep-learning-based software vulnerability detection performs program representation learning and vulnerability detection either at the level of source code written in a high-level language or at the assembly code level.
At the high-level-language program level, deep-learning-based vulnerability detection methods mostly adopt language models: the code is treated as natural-language text and word-embedded, deep neural networks (such as LSTM and GRU) are then used to represent and learn the code, vulnerability features are automatically extracted, and the features are sent to a classifier for vulnerability detection. At the assembly code level, RNN models are mostly used for representation learning of function-level assembly code; vulnerability features are extracted from the assembly code, and the hidden vector representation output by the model is sent to a classifier for vulnerability detection. Assembly code can be obtained from high-level-language source code in a forward or a reverse manner. The reverse method uses a decompilation tool or a machine emulator to capture the assembly instructions of an executable program in execution order; however, because the compiler optimizes the program during compilation and removes code not covered by execution, reversely obtained assembly code yields only the program's call context tree (CCT) and loses control-flow information. This approach suits vulnerability detection scenarios in which source code is unavailable. The forward method compiles the program into complete assembly code with a compiler at compile time. Existing methods for detecting vulnerabilities in forward-generated assembly code do not consider statement-level alignment between the assembly code and the source code: they can only realize function-level vulnerability detection, the detection granularity is coarse, the source code location where the vulnerability occurs cannot be detected or localized, only decision-level fusion can be performed, and feature-level fusion is difficult to achieve. For example, although the article "A Vulnerability Detection System Based on Fusion of Assembly Code and Source Code" uses three modalities (high-level-language source code, assembly code, and mixed source/assembly code), it has the following shortcomings. First, it only achieves function-level vulnerability detection, and the detection granularity is coarse. Second, the three modalities share the same program representation learning model; no deep learning model is designed for the characteristics of each modality to extract their complementary, modality-specific vulnerability features. Third, a voting layer is used for decision-level fusion of the detection results of the three modalities, which is too simple a fusion scheme.
At present, no method has been found that performs feature-level fusion of high-level-language code and assembly code and detects vulnerabilities at code-segment granularity. The difficulty of such feature-level fusion lies in performing statement-level feature alignment and bimodal feature enhancement between assembly code and high-level-language code.
Disclosure of Invention
The object of the invention is to provide a vulnerability detection method based on cross-modal feature enhancement of source code and assembly code. The method performs representation learning on the two program modalities of high-level-language source code and assembly code, extracts vulnerability-related semantic features from the source code modality and the assembly code modality respectively by using the statement alignment relation between source code and assembly code, learns the semantic relevance between the two modalities with different deep learning networks and a cross-attention mechanism, and performs feature-level fusion by fully using the feature complementarity of the two modalities, thereby improving the accuracy of software vulnerability detection.
The purpose of the invention is realized by the following technical scheme:
A vulnerability detection method based on cross-modal feature enhancement of source code and assembly code extracts syntactic and semantic features related to control dependence and data dependence from the source code, extracts syntactic and semantic features related to memory operations from the assembly code, and then feeds the assembly code, statement-aligned with the high-level-language program source code, into a bimodal representation learning model with cross-modal feature enhancement and fusion, composed of a source code representation learning network, an assembly code representation learning network, and a cross-attention layer, for software vulnerability detection. The method specifically comprises the following steps:
Step 1: convert the high-level-language program source code into assembly code that is statement-aligned and annotated with source code variables;
Step 2: generate an abstract syntax tree (AST) and a program dependency graph (PDG) from the high-level-language program source code using a static analysis tool;
Step 3: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the source code;
Step 4: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the assembly code;
Step 5: merge the source code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of source code slice code segments; similarly, merge the assembly code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of assembly code slice code segments;
Step 6: use word2vec to perform word embedding on the tokens in a source code slice code segment to obtain the initial vector representation of each statement, send it to a statement encoding network composed of a CNN to obtain the hidden vector representation of each statement, and then send the statement hidden vectors to a program encoding network composed of a bidirectional GRU (Gated Recurrent Unit) to obtain the hidden vector representation of the source code slice code segment; similarly, use word2vec to perform word embedding on the tokens in an assembly code slice code segment to obtain the initial vector representation of each statement, and send it to an assembly code representation learning network composed of two layers of bidirectional GRUs to obtain the hidden vector representation of the assembly code slice code segment;
Step 7: use a cross-attention mechanism (Cross-Attention) to perform feature enhancement on the hidden vector representations of the source code slice code segment and the assembly slice code segment obtained in step 6, generating more accurate slice code segment vector representations, and then apply attention-weighted aggregation and concatenation to the two hidden vector representations to obtain the cross-modal feature-enhanced and fused hidden vector representation;
Step 8: send the cross-modal cross-attention feature-fused and enhanced vector representation obtained in step 7 into a classifier composed of a fully connected layer (FCN) and Softmax, compute the cross-entropy loss from the classifier output and the actual label of the code segment, and back-propagate to update the parameters of the bimodal representation learning model with cross-modal feature enhancement and fusion, composed of the source code representation learning network, the assembly code representation learning network, and the cross-attention layer, until training of the model is finished;
Step 9: use the trained cross-modal feature-enhanced and fused bimodal representation learning model and the classifier network to perform vulnerability detection on the code under test.
Compared with the prior art, the invention has the following advantages:
(1) A memory operation instruction is introduced for the first time as a slicing criterion. By locating the memory operation instructions related to source code variables, the slice-code vulnerability data set generated by the memory-operation-based slicing technique extracts more accurate slices for data-flow-related vulnerabilities, which makes it easier for the code representation learning model to learn vulnerability-related statements.
(2) Cross-modal code element alignment is realized using compile-time preprocessing information and debugging trace information, generating assembly code annotated with source code variables. This reduces the vocabulary gap between the assembly code and the source code and helps the model fuse cross-modal semantics quickly.
(3) Vulnerability detection based on source code alone focuses only on the data flow between variables and can hardly capture data-flow information at the memory level, whereas assembly code better captures data-flow-related vulnerability features such as memory operations. Jointly using data from the two program modalities of high-level-language source code and assembly code, and exploiting their complementary advantages, captures the vulnerability features of the code more comprehensively and improves the detection accuracy of the model.
Drawings
Fig. 1 is a schematic overall flow chart of the vulnerability detection method of the present invention.
FIG. 2 is an example of bimodal code statement alignment.
FIG. 3 is a schematic diagram of generating bimodal code slicing code segments according to source code slicing criteria.
FIG. 4 is a schematic diagram of generating bimodal code slicing code sections according to assembly code slicing criteria.
FIG. 5 is a process of cross-modal feature enhanced vulnerability detection.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but is not limited thereto; any modification or equivalent replacement of the technical solution of the present invention that does not depart from its spirit and scope shall be covered by the protection scope of the present invention.
The invention provides a vulnerability detection method based on cross-modal feature enhancement of source code and assembly code. First, the high-level program language code to be detected is converted into statement-aligned assembly code, yielding statement-aligned bimodal program code (as shown in FIG. 2). The high-level-language source code is sliced according to the slicing criterion of the source code, and the corresponding assembly code slice segments are found from the resulting source code slice segments (as shown in FIG. 3). Then, using the assembly code annotated with source code variables, the source code statements corresponding to the memory operation instructions related to source code variables are found and added to the slice code segment data set of the source code, and the complete assembly code corresponding to the statements in these source code segments is found (as shown in FIG. 4) and added to the slice code segment data set of the assembly code. Next, according to the different statement characteristics of the two modalities, different deep neural networks are designed to perform representation learning on the program slice code segments of the two modalities and obtain the corresponding hidden vector representations; a cross-attention network is used for feature enhancement, and attention-weighted aggregation and concatenation of the bimodal vectors completes the feature fusion. Finally, the fused features are sent into a classifier composed of a fully connected layer (FCN) and Softmax to judge whether the program under test contains a vulnerability.
As shown in FIG. 1, the method comprises the following specific steps:
Step 1: convert the high-level-language program source code into assembly code with statement alignment and source code variable annotations, with the following specific steps:
Step 1.1: use a code compiler to debug the source code, generate assembly code statement by statement, and output the source code and the assembly code aligned by statement;
Step 1.2: generate assembly code with source code variable annotations using the command-line compilation mode of the code compiler;
Step 1.3: fill the assembly code annotations back into the corresponding source code statements generated in step 1.1.
Step 2: a static analysis tool is used to generate Abstract Syntax Trees (AST) and Program Dependency Graphs (PDG) from high-level language program source code.
Step 3: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the source code; the slice generation comprises the following specific steps:
Step 3.1: extract the vulnerability candidate key points of the source code according to the slicing criterion of the source code, traverse the program dependency graph obtained in step 2 both forward and backward to obtain a bidirectional set of slice statements, and generate the slice code segment in high-level-language form;
Step 3.2: according to the statement correspondence between the assembly code and the high-level-language source code obtained in step 1, find the assembly code block corresponding to each statement in the source code slice code segment, thereby obtaining the slice code segment in assembly language form.
Step 4: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the assembly code, comprising the following specific steps:
Step 4.1: extract the memory operation instructions related to source code variables according to the statement alignment result of the source code and the assembly code generated in step 1 and the source code variable annotations of the assembly code;
Step 4.2: find the source code statements corresponding to the memory operation instructions as the slice code segment of the source code;
Step 4.3: find the assembly instruction sequences corresponding to the statements in the slice code segment generated in step 4.2 as the slice code segment of the assembly code.
Step 5: merge the same-modality slice sets generated in steps 3 and 4 and label them, obtaining the training data set of source code slice code segments and the training data set of assembly code slice code segments, with the following specific steps:
Step 5.1: merge the source code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of source code slice code segments;
Step 5.2: merge the assembly code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of assembly code slice code segments.
Step 6: use word2vec to perform word embedding on the tokens in a source code slice code segment to obtain the initial vector representation of each statement, send it to a statement encoding network composed of a CNN to obtain the hidden vector representation of each statement, and then send the statement hidden vectors to a program encoding network composed of a bidirectional GRU to obtain the hidden vector representation of the source code slice code segment; similarly, use word2vec to perform word embedding on the tokens in an assembly code slice code segment to obtain the initial vector representation of each statement, and send it to an assembly code representation learning network composed of two layers of bidirectional GRUs to obtain the hidden vector representation of the assembly code slice code segment. The specific steps are as follows:
Step 6.1: perform cross-modal joint word-embedding training on the multimodal corpus formed by the source code and the assembly code, and represent the tokens in the source code and assembly code slice code segments as word embeddings.
Step 6.2: token feature extraction stage. The importance of different tokens is resolved by a self-attention network. The token word-embedding sequence T corresponding to a given statement is sent into a self-attention layer to generate the hidden vector representation of the token sequence. The specific calculation is as follows:
Q = W_Q · T;
K = W_K · T;
V = W_V · T;
H = Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V;
where Q is the query sequence, (K, V) is the key-value pair sequence, W_Q, W_K, W_V are learnable weights, H is the token hidden vector sequence, Attention() is the self-attention operation, and d_k is the dimension of Q.
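The token-level self-attention of step 6.2 can be sketched in PyTorch as follows (single head, no dropout); the PyTorch realization is an illustration, not the claimed implementation.

import math
import torch
import torch.nn as nn

class TokenSelfAttention(nn.Module):
    """Self-attention over the token embedding sequence T of one statement.
    T: (num_tokens, emb_dim); the returned H has the same shape."""
    def __init__(self, emb_dim: int):
        super().__init__()
        self.w_q = nn.Linear(emb_dim, emb_dim, bias=False)  # W_Q
        self.w_k = nn.Linear(emb_dim, emb_dim, bias=False)  # W_K
        self.w_v = nn.Linear(emb_dim, emb_dim, bias=False)  # W_V

    def forward(self, T: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.w_q(T), self.w_k(T), self.w_v(T)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # Q·K^T / sqrt(d_k)
        return torch.softmax(scores, dim=-1) @ V                  # H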
Step 6.3: statement feature extraction stage. For the token hidden vector sequence H obtained in step 6.2, convolution layers with different kernel sizes are used to capture local semantic features of different scales within a source code statement, and the feature vectors obtained through a max-pooling layer are concatenated to obtain the vector representation of the source code statement. The specific calculation is as follows:
H_S1 = CNN_{k=4}(H);
H_S2 = CNN_{k=5}(H);
H_S3 = CNN_{k=6}(H);
H_P1 = MaxPooling(H_S1);
H_P2 = MaxPooling(H_S2);
H_P3 = MaxPooling(H_S3);
s = Concat(H_P1, H_P2, H_P3);
where CNN is a one-dimensional convolutional neural network, k is the convolution kernel size, H_S1, H_S2, H_S3 are the hidden vectors obtained by convolutional neural networks with kernel sizes 4, 5, and 6 respectively, MaxPooling() is the max-pooling layer, H_P1, H_P2, H_P3 are the results of H_S1, H_S2, H_S3 after max pooling, Concat() denotes concatenation along the feature dimension, and s is the final vector representation of the source code statement obtained after concatenation.
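A PyTorch sketch of the statement encoder of step 6.3, with kernel sizes 4/5/6 as in the formulas above; the channel width and padding are illustrative choices.

import torch
import torch.nn as nn

class StatementCNN(nn.Module):
    """Multi-kernel 1-D CNN over the token hidden sequence H of one statement,
    followed by max pooling and concatenation into the statement vector s."""
    def __init__(self, emb_dim: int, channels: int = 128):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, channels, kernel_size=k, padding=k - 1) for k in (4, 5, 6))

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        x = H.t().unsqueeze(0)                                        # (1, emb_dim, num_tokens)
        pooled = [conv(x).max(dim=-1).values for conv in self.convs]  # MaxPooling(H_Sk)
        return torch.cat(pooled, dim=-1).squeeze(0)                   # s = Concat(H_P1, H_P2, H_P3)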
Step 6.4: code segment feature extraction stage. For a source code program slice S = [s_1, s_2, ..., s_i, ..., s_n], where s_i is the final vector representation of the i-th source code statement obtained in step 6.3 and n is the number of statement vector representations contained in S, the program slice representation formed by the set of final statement vectors is sent to a program encoding network composed of a bidirectional GRU to obtain the hidden representation P_C of the source code slice code segment. The specific formula is as follows:
P_C = BiGRU(S);
where BiGRU() is a bidirectional GRU network, P_C is the vector representation of the source code program slice, and S is the sequence of source code statement vector representations.
Step 6.5: the assembly code slice code segment obtained in step 6.1 is denoted A = [a_1, a_2, ..., a_i, ..., a_m], where a_i is the initial vector representation of the i-th assembly statement and m is the number of assembly statement vector representations contained in A. A is sent into the assembly code representation learning model composed of two layers of bidirectional GRUs to obtain the vector representation P_A of the assembly code slice code segment. The formula is as follows:
P_A = BiGRU(BiGRU(A));
where BiGRU() is a bidirectional GRU network, P_A is the hidden representation of the assembly code program slice, and A is the representation of the assembly slice code segment.
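A PyTorch sketch of the two program-level encoders of steps 6.4 and 6.5; hidden size and batch handling are illustrative assumptions.

import torch
import torch.nn as nn

class SliceEncoders(nn.Module):
    """P_C = BiGRU(S) for the source-code slice and P_A = BiGRU(BiGRU(A)) for the
    assembly slice, realized as a one-layer and a two-layer bidirectional GRU."""
    def __init__(self, stmt_dim: int, hidden: int = 128):
        super().__init__()
        self.src_gru = nn.GRU(stmt_dim, hidden, num_layers=1,
                              bidirectional=True, batch_first=True)
        self.asm_gru = nn.GRU(stmt_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)

    def forward(self, S: torch.Tensor, A: torch.Tensor):
        # S: (1, n, stmt_dim) statement vectors; A: (1, m, stmt_dim) assembly vectors.
        P_C, _ = self.src_gru(S)   # hidden vectors of the source-code slice
        P_A, _ = self.asm_gru(A)   # hidden vectors of the assembly-code slice
        return P_C, P_A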
Step 7: use a cross-attention mechanism (Cross-Attention) to perform feature enhancement on the hidden vector representation of the source code slice code segment and the hidden vector representation of the assembly slice code segment obtained in step 6, generating more accurate slice code segment vector representations, and then apply attention-weighted aggregation and concatenation to the two hidden vector representations to obtain the cross-modal feature-enhanced and fused hidden vector representation. The specific steps are as follows:
Step 7.1: perform feature enhancement on the hidden vector representation P_C of the source code slice code segment and the hidden vector representation P_A of the assembly slice code segment obtained in step 6 with a cross-attention mechanism, obtaining the attention value of each modality with respect to the other modality and the feature-enhanced bimodal vector representations.
The cross-attention of the source code modality with respect to the assembly code modality is calculated as follows:
Q_C = W_Q^C · P_C;
K_A = W_K^A · P_A;
V_A = W_V^A · P_A;
H_C = Attention(Q_C, K_A, V_A) = softmax(Q_C · K_A^T / √d_k) · V_A;
The cross-attention of the assembly code modality with respect to the source code modality is calculated as follows:
Q_A = W_Q^A · P_A;
K_C = W_K^C · P_C;
V_C = W_V^C · P_C;
H_A = Attention(Q_A, K_C, V_C) = softmax(Q_A · K_C^T / √d_k) · V_C;
where Q_C is the query sequence of the source code modality and (K_A, V_A) is the key-value pair sequence of the assembly code modality; Q_A is the query sequence of the assembly code modality and (K_C, V_C) is the key-value pair sequence of the source code modality; W_Q^C, W_K^A, W_V^A, W_Q^A, W_K^C, W_V^C are all learnable weights. After cross attention, the feature-enhanced source code representation H_C = [h_1^C, h_2^C, ..., h_i^C, ..., h_N^C] is obtained, where h_i^C is the feature-enhanced hidden vector representation of the i-th source code statement and N is the number of feature-enhanced statement hidden vector representations in H_C; likewise, the feature-enhanced assembly code representation H_A = [h_1^A, h_2^A, ..., h_j^A, ..., h_M^A] is obtained, where h_j^A is the feature-enhanced hidden vector representation of the j-th assembly instruction and M is the number of feature-enhanced hidden vector representations of assembly code statements in H_A.
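A PyTorch sketch of the cross-attention of step 7.1, using the same scaled dot-product form as step 6.2; the exact parameterization is an illustrative assumption.

import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Source attends over assembly (H_C) and assembly attends over source (H_A).
    P_C: (n, d) statement hidden vectors; P_A: (m, d) instruction hidden vectors."""
    def __init__(self, d: int):
        super().__init__()
        self.w_qc = nn.Linear(d, d, bias=False)  # W_Q^C
        self.w_ka = nn.Linear(d, d, bias=False)  # W_K^A
        self.w_va = nn.Linear(d, d, bias=False)  # W_V^A
        self.w_qa = nn.Linear(d, d, bias=False)  # W_Q^A
        self.w_kc = nn.Linear(d, d, bias=False)  # W_K^C
        self.w_vc = nn.Linear(d, d, bias=False)  # W_V^C

    @staticmethod
    def _attend(Q, K, V):
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
        return torch.softmax(scores, dim=-1) @ V

    def forward(self, P_C: torch.Tensor, P_A: torch.Tensor):
        H_C = self._attend(self.w_qc(P_C), self.w_ka(P_A), self.w_va(P_A))  # enhanced source
        H_A = self._attend(self.w_qa(P_A), self.w_kc(P_C), self.w_vc(P_C))  # enhanced assembly
        return H_C, H_A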
Step 7.2: apply attention-weighted aggregation to the bimodal code segment vectors and then concatenate them to obtain the fused feature vector, specifically: the feature-enhanced hidden vector representations of the two modalities obtained in the previous step are each sent into an attention-weighted aggregation layer. The attention-weighted aggregation is a weighted summation based on the attention weights a_i, a_j; it yields the attention-weighted aggregate vector representation x_C of the source code slice and the attention-weighted aggregate vector representation x_A of the assembly code slice, and the feature-fused vector x_t is then obtained by concatenating x_C and x_A, where:
The attention-weighted aggregation of the source code modality is as follows:
u_i^C = tanh(W_C · h_i^C + b_C);
a_i = exp((u_i^C)^T · u_C) / Σ_i exp((u_i^C)^T · u_C);
x_C = Σ_i a_i · h_i^C;
The attention-weighted aggregation of the assembly code modality is as follows:
u_j^A = tanh(W_A · h_j^A + b_A);
a_j = exp((u_j^A)^T · u_A) / Σ_j exp((u_j^A)^T · u_A);
x_A = Σ_j a_j · h_j^A;
x_t = Concat(x_C, x_A);
where u_i^C and u_j^A are the vector representations of h_i^C and h_j^A after mapping by a fully connected layer and the activation function tanh(), (u_i^C)^T and (u_j^A)^T are their transposed vectors, and u_C and u_A are trainable vectors.
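The attention-weighted aggregation and concatenation of step 7.2 can be sketched as follows; the pooling layer is written in the style of a hierarchical-attention pooling layer, which matches the tanh / transposed-vector / trainable-vector description above, and the parameter shapes are illustrative.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Weight the feature-enhanced statement vectors H (n, d) with attention weights
    a_i and sum them into one slice vector x."""
    def __init__(self, d: int):
        super().__init__()
        self.proj = nn.Linear(d, d)                    # fully connected layer before tanh
        self.context = nn.Parameter(torch.randn(d))    # trainable vector

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        u = torch.tanh(self.proj(H))                   # u_i = tanh(W h_i + b)
        a = torch.softmax(u @ self.context, dim=0)     # a_i from u_i^T and the trainable vector
        return (a.unsqueeze(-1) * H).sum(dim=0)        # x = sum_i a_i h_i

# Feature-level fusion: x_t = Concat(x_C, x_A)
# pool_c, pool_a = AttentionPooling(d), AttentionPooling(d)
# x_t = torch.cat([pool_c(H_C), pool_a(H_A)], dim=-1)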
Step 8: send the cross-modal cross-attention feature-fused and enhanced vector representation obtained in step 7 into a classifier composed of a fully connected layer (FCN) and Softmax, compute the cross-entropy loss from the classifier output and the actual label of the code segment, and back-propagate to update the parameters of the cross-modal feature-enhanced and fused bimodal representation learning model until its training is finished. The specific steps are as follows:
Step 8.1: send the feature-fused vector representation x_t obtained in step 7 into the fully connected layer and obtain the prediction result vector y' through the Softmax function.
Step 8.2: compute the cross-entropy loss using the label information of the slice code segments, specifically: using the label values marked in step 5, a label value of 0 (no vulnerability) is initialized as the vector [1,0] and a label value of 1 (vulnerability) as the vector [0,1]; the cross-entropy loss between this vector and the prediction result vector given by the cross-modal feature-enhanced and fused bimodal representation learning model is computed, and the model parameters are adjusted by error back-propagation until the loss value no longer decreases and training is finished, where:
The cross-entropy loss is calculated as follows:
Loss = - Σ_i y_i · log(y'_i);
where y' is the model prediction result obtained in step 8.1 and y is the vulnerability label value of the slice code segment after vector initialization.
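A minimal training-step sketch for step 8, assuming x_t is the fused vector from step 7 and the label is 0 (non-vulnerable) or 1 (vulnerable); the input dimension, optimizer, and learning rate are illustrative assumptions, and in practice the optimizer would also cover the encoders and the cross-attention layer so that back-propagation updates the whole bimodal model.

import torch
import torch.nn as nn

classifier = nn.Linear(512, 2)                  # FCN; Softmax is folded into the loss below
criterion = nn.CrossEntropyLoss()               # cross-entropy: -sum_i y_i log y'_i
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def train_step(x_t: torch.Tensor, label: int) -> float:
    logits = classifier(x_t.unsqueeze(0))       # (1, 2) class scores
    loss = criterion(logits, torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()                             # back-propagate the loss
    optimizer.step()
    return loss.item()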
Step 9: use the trained cross-modal feature-enhanced and fused bimodal representation learning model and the classifier network to perform vulnerability detection on the code under test.
Example:
as shown in FIG. 2, which is an example of source code and assembly code bimodal statement alignment, the section of program vulnerability conforms to CWE (Common Weakness Enummation Common Defect Table) 121. The error type is Stack Overflow, which is a defect that C language has no built-in checking mechanism to ensure that the data copied to the buffer can be larger than the size of the buffer, if the data to be copied is larger than the buffer, the range of the buffer will be overflowed, and other storage units will be rewritten when the buffer overflows, resulting in program crash accidents and the like, which cause unpredictable consequences. As can be seen from fig. 2, the program code with aligned source code and assembly code statements is generated by using the method in step 1, wherein the head of each statement is the line number of the corresponding modality, the line number of the source code is a number and then "", such line number is consistent with the line number of the program code of the source code modality, the line number of the assembly code is a pure number, and one source code statement generally corresponds to 0 to 20 assembly codes. In the assembly code, besides the conventional assembly instruction part, the program variable annotation of the source code corresponding to the register is carried out, and after "#", the program variable annotation is divided into "" and "". In the statement of source code vulnerability, 80 characters of a str string are copied to buf by using a strncpy function in the 6 th line of a code segment, and the assembly code modality of the source code vulnerability statement and register comments related to source code variables are shown in the 18 th to 23 th lines of FIG. 2, so that cross-modality statement alignment is realized.
As can be seen from FIG. 3, which shows the process of finding the corresponding assembly code segments from C-language source code slice code segments, the example slices on the source code function parameter char str as the slicing criterion. The function parameter to be tracked is found through the abstract syntax tree, the statements on which the parameter depends are found through the program dependency graph to obtain the source-code-modality program slice, and the assembly code lines aligned with each source code statement are combined to obtain the assembly-code-modality program slice. In the source code slice it can be seen that the macro definition value on line 3 of the C code is not present in the slice, so the length of buf cannot be obtained from the code statements in the slice. However, in assembly instruction line 9, corresponding to line 4 of the C code, the value 40 is assigned to the temporary register tmp90, so the real length of buf can be found in the assembly code modality; a vulnerability that cannot be detected in the source code modality can therefore be found and detected in the assembly code modality.
As can be seen from FIG. 4, the memory-operation slicing criterion also slices on the parameter char str and thereby shows its own advantage. The annotation part of the assembly instructions marks which registers have a cross-modal alignment relation with the variable str; str corresponds to the register location [-56rbp], and the assembly code lines related to the variable are lines 8 and 14. These two instructions correspond to lines 4 and 6 of the C source code respectively, and those statements form the source-code-modality program slice. The assembly code lines aligned with each source code statement are then combined to obtain the assembly-code-modality program slice. As can be seen from FIG. 4, in the str-based bimodal program slice the memory-operation slicing criterion selects only one statement besides the declaration, and that statement is the vulnerability statement; the cause of the vulnerability can be seen in the assembly code modality. Using the memory-operation slicing criterion, the slices extracted for data-flow-related vulnerabilities are more accurate.

Claims (5)

1. A vulnerability detection method based on cross-modal feature enhancement of source code and assembly code, characterized by comprising the following steps:
Step 1: convert the high-level-language program source code into assembly code that is statement-aligned and annotated with source code variables;
Step 2: generate an abstract syntax tree and a program dependency graph from the high-level-language program source code using a static analysis tool;
Step 3: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the source code;
Step 4: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the assembly code;
Step 5: merge the source code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of source code slice code segments; similarly, merge the assembly code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of assembly code slice code segments;
Step 6: use word2vec to perform word embedding on the tokens in a source code slice code segment to obtain the initial vector representation of each statement, send it to a statement encoding network composed of a CNN to obtain the hidden vector representation of each statement, and then send the statement hidden vectors to a program encoding network composed of a bidirectional GRU to obtain the hidden vector representation of the source code slice code segment; similarly, use word2vec to perform word embedding on the tokens in an assembly code slice code segment to obtain the initial vector representation of each statement, and send it to an assembly code representation learning network composed of two layers of bidirectional GRUs to obtain the hidden vector representation of the assembly code slice code segment;
Step 7: use a cross-attention mechanism to perform feature enhancement on the hidden vector representations of the source code slice code segment and the assembly slice code segment obtained in step 6, generating more accurate slice code segment vector representations, and then apply attention-weighted aggregation and concatenation to the two hidden vector representations to obtain the cross-modal feature-enhanced and fused hidden vector representation;
Step 8: send the cross-modal cross-attention feature-fused and enhanced vector representation obtained in step 7 into a classifier composed of a fully connected layer and Softmax, compute the cross-entropy loss from the classifier output and the actual label of the code segment, and back-propagate to update the parameters of the bimodal representation learning model with cross-modal feature enhancement and fusion, composed of the source code representation learning network, the assembly code representation learning network, and the cross-attention layer, until training of the model is finished;
Step 9: use the trained cross-modal feature-enhanced and fused bimodal representation learning model and the classifier network to perform vulnerability detection on the code under test.
2. The vulnerability detection method based on cross-modal feature enhancement of source code and assembly code according to claim 1, characterized in that the specific steps of step 1 are as follows:
Step 1.1: use a program compiler to debug the source code, generate assembly code statement by statement, and output the source code and the assembly code aligned by statement;
Step 1.2: generate assembly code with source code variable annotations using the command-line compilation mode of the compiler with specific parameters;
Step 1.3: fill the assembly code annotations back into the corresponding source code statements generated in step 1.1.
3. The vulnerability detection method based on cross-modal feature enhancement of source code and assembly code according to claim 1, characterized in that in step 3 the slice code segments are generated as follows:
Step 3.1: extract the vulnerability candidate key points of the source code according to the slicing criterion of the source code, traverse the program dependency graph obtained in step 2 both forward and backward to obtain a bidirectional set of slice statements, and generate the slice code segment in high-level-language form;
Step 3.2: according to the statement correspondence between the assembly code and the high-level-language source code obtained in step 1, find the assembly code block corresponding to each statement in the source code slice code segment, thereby obtaining the slice code segment in assembly language form.
4. The vulnerability detection method based on cross-modal feature enhancement of source code and assembly code according to claim 1, characterized in that the specific steps of step 4 are as follows:
Step 4.1: extract the memory operation instructions related to source code variables according to the statement alignment result of the source code and the assembly code generated in step 1 and the source code variable annotations of the assembly code;
Step 4.2: find the source code statements corresponding to the memory operation instructions as the slice code segment of the source code;
Step 4.3: find the assembly instruction sequences corresponding to the statements in the slice code segment generated in step 4.2 as the slice code segment of the assembly code.
5. The vulnerability detection method based on cross-modal feature enhancement of source code and assembly code according to claim 1, characterized in that the specific steps of step 7 are as follows:
Step 7.1: perform feature enhancement on the hidden vector representation P_C of the source code slice code segment and the hidden vector representation P_A of the assembly slice code segment obtained in step 6 with a cross-attention mechanism, obtaining the attention value of each modality with respect to the other modality and the feature-enhanced bimodal vector representations;
the cross-attention of the source code modality with respect to the assembly code modality is calculated as follows:
Q_C = W_Q^C · P_C;
K_A = W_K^A · P_A;
V_A = W_V^A · P_A;
H_C = Attention(Q_C, K_A, V_A) = softmax(Q_C · K_A^T / √d_k) · V_A;
the cross-attention of the assembly code modality with respect to the source code modality is calculated as follows:
Q_A = W_Q^A · P_A;
K_C = W_K^C · P_C;
V_C = W_V^C · P_C;
H_A = Attention(Q_A, K_C, V_C) = softmax(Q_A · K_C^T / √d_k) · V_C;
where Q_C is the query sequence of the source code modality and (K_A, V_A) is the key-value pair sequence of the assembly code modality, Q_A is the query sequence of the assembly code modality and (K_C, V_C) is the key-value pair sequence of the source code modality, and W_Q^C, W_K^A, W_V^A, W_Q^A, W_K^C, W_V^C are all learnable weights; after cross attention the feature-enhanced source code representation H_C = [h_1^C, h_2^C, ..., h_i^C, ..., h_N^C] is obtained, where h_i^C is the feature-enhanced hidden vector representation of the i-th source code statement and N is the number of feature-enhanced statement hidden vector representations in H_C, and the feature-enhanced assembly code representation H_A = [h_1^A, h_2^A, ..., h_j^A, ..., h_M^A] is obtained, where h_j^A is the feature-enhanced hidden vector representation of the j-th assembly instruction and M is the number of feature-enhanced hidden vector representations of assembly code statements in H_A;
Step 7.2: apply attention-weighted aggregation to the bimodal code segment vectors and then concatenate them to obtain the fused feature vector, specifically: send the feature-enhanced hidden vector representations of the two modalities obtained in the previous step into an attention-weighted aggregation layer; the attention-weighted aggregation is a weighted summation based on the attention weights a_i, a_j, yielding the attention-weighted aggregate vector representation x_C of the source code slice and the attention-weighted aggregate vector representation x_A of the assembly code slice, and the feature-fused vector x_t is obtained by concatenating x_C and x_A, where:
the attention-weighted aggregation of the source code modality is as follows:
u_i^C = tanh(W_C · h_i^C + b_C);
a_i = exp((u_i^C)^T · u_C) / Σ_i exp((u_i^C)^T · u_C);
x_C = Σ_i a_i · h_i^C;
the attention-weighted aggregation of the assembly code modality is as follows:
u_j^A = tanh(W_A · h_j^A + b_A);
a_j = exp((u_j^A)^T · u_A) / Σ_j exp((u_j^A)^T · u_A);
x_A = Σ_j a_j · h_j^A;
x_t = Concat(x_C, x_A);
where u_i^C and u_j^A are the vector representations of h_i^C and h_j^A after mapping by a fully connected layer and the activation function tanh(), (u_i^C)^T and (u_j^A)^T are their transposed vectors, and u_C and u_A are trainable vectors.
CN202211105496.XA 2022-09-09 2022-09-09 Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code Pending CN115577362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211105496.XA CN115577362A (en) 2022-09-09 2022-09-09 Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211105496.XA CN115577362A (en) 2022-09-09 2022-09-09 Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code

Publications (1)

Publication Number Publication Date
CN115577362A true CN115577362A (en) 2023-01-06

Family

ID=84581955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211105496.XA Pending CN115577362A (en) 2022-09-09 2022-09-09 Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code

Country Status (1)

Country Link
CN (1) CN115577362A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795487A (en) * 2023-02-07 2023-03-14 深圳开源互联网安全技术有限公司 Vulnerability detection method, device, equipment and storage medium
CN116628707A (en) * 2023-07-19 2023-08-22 山东省计算中心(国家超级计算济南中心) Interpretable multitasking-based source code vulnerability detection method
CN116627429A (en) * 2023-07-20 2023-08-22 无锡沐创集成电路设计有限公司 Assembly code generation method and device, electronic equipment and storage medium
CN116627429B (en) * 2023-07-20 2023-10-20 无锡沐创集成电路设计有限公司 Assembly code generation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN115577362A (en) Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code
US11972365B2 (en) Question responding apparatus, question responding method and program
Zhou et al. AMR parsing with latent structural information
CN111382574B (en) Semantic parsing system combining syntax under virtual reality and augmented reality scenes
First et al. TacTok: Semantics-aware proof synthesis
CN112183094A (en) Chinese grammar debugging method and system based on multivariate text features
CN113010209A (en) Binary code similarity comparison technology for resisting compiling difference
CN114201406B (en) Code detection method, system, equipment and storage medium based on open source component
CN113255295B (en) Automatic generation method and system for formal protocol from natural language to PPTL
CN115510236A (en) Chapter-level event detection method based on information fusion and data enhancement
CN112732264A (en) Automatic code conversion method between high-level programming languages
Gao et al. Generating natural adversarial examples with universal perturbations for text classification
Liu et al. Language model augmented relevance score
CN117251522A (en) Entity and relationship joint extraction model method based on latent layer relationship enhancement
Gao et al. Chinese causal event extraction using causality‐associated graph neural network
CN116560890A (en) Automatic program repairing method combining lexical and grammatical information
CN114462045B (en) Intelligent contract vulnerability detection method
CN116595537A (en) Vulnerability detection method of generated intelligent contract based on multi-mode features
He et al. Named entity recognition method in network security domain based on BERT-BiLSTM-CRF
Kaili et al. A simple but effective classification model for grammatical error correction
CN111562943B (en) Code clone detection method and device based on event embedded tree and GAT network
CN114239555A (en) Training method of keyword extraction model and related device
CN114298032A (en) Text punctuation detection method, computer device and storage medium
Li et al. Word segmentation and morphological parsing for sanskrit
Yan et al. A survey of human-machine collaboration in fuzzing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination