CN115577362A - Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code - Google Patents

Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code

Info

Publication number
CN115577362A
CN115577362A (application CN202211105496.XA)
Authority
CN
China
Prior art keywords
code
assembly
source code
cross
slicing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211105496.XA
Other languages
Chinese (zh)
Inventor
苏小红
陶文鑫
魏宏巍
郑伟宁
万佳元
王甜甜
张彦航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202211105496.XA priority Critical patent/CN115577362A/en
Publication of CN115577362A publication Critical patent/CN115577362A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577Assessing vulnerabilities and evaluating computer system security

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a vulnerability detection method based on cross-modal feature enhancement of source code and assembly code. The method extracts syntactic and semantic features related to control dependence and data dependence from the source code, extracts syntactic and semantic features related to memory operations from the assembly code, and then feeds the assembly code, statement-aligned with the high-level-language program source code, into a bimodal representation learning model with cross-modal feature enhancement and fusion for software vulnerability detection. The method performs representation learning on the two program modalities of high-level-language source code and assembly code: using the statement alignment relation between source code and assembly code, vulnerability-related semantic features are extracted from the source code modality and the assembly code modality respectively; different deep learning networks and a cross-attention mechanism are used to learn the semantic relevance between the two modalities; and feature-level fusion makes full use of the feature complementarity of the two modalities, thereby improving the accuracy of software vulnerability detection.

Description

Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code
Technical Field
The invention belongs to the field of software vulnerability detection, relates to a method for automatically detecting software vulnerabilities, and particularly relates to a software vulnerability detection method based on cross-modal feature enhancement of source codes and assembly codes.
Background
A software vulnerability refers to a defect in software that may be exploited by an attacker at any point in the software life cycle; once exploited, it can cause serious harm such as privacy leakage, unauthorized access, and ransomware attacks. Software vulnerability detection technology is an important method and means for reducing software security flaws and software security risks. Program representation learning is the basis and key of software vulnerability detection. The process of learning syntactic and semantic feature representations of code from an intermediate representation of the program using deep learning techniques is referred to as representation learning of code.
At present, deep-learning-based software vulnerability detection performs program representation learning and vulnerability detection either at the level of source code written in a high-level language or at the assembly code level.
At the high-level-language program level, deep-learning-based vulnerability detection methods mostly adopt language models: the code is treated as natural-language text and word-embedded, deep neural networks (such as LSTM and GRU) are then used to represent and learn the code, vulnerability features are automatically extracted, and the features are sent to a classifier for vulnerability detection. At the assembly code level, RNN models are mostly used for representation learning of function-level assembly code; vulnerability features are extracted from the assembly code, and the hidden vector representation output by the model is sent to a classifier for vulnerability detection. Assembly code can be obtained from high-level-language source code in a forward or a reverse manner. The reverse method uses a decompilation tool or a machine emulator to capture the assembly instructions of an executable program in execution order; however, because the compiler optimizes the program during compilation and removes code not covered by execution, reversely obtained assembly code yields only the program's call context tree (CCT) and loses control-flow information. This approach suits vulnerability detection scenarios in which source code is unavailable. The forward method compiles the program into complete assembly code with a compiler at compile time. Existing methods for detecting vulnerabilities in forward-generated assembly code do not consider statement-level alignment between the assembly code and the source code: they can only realize function-level vulnerability detection, the detection granularity is coarse, the source code location where the vulnerability occurs cannot be detected or localized, only decision-level fusion can be performed, and feature-level fusion is difficult to achieve. For example, although the article "A Vulnerability Detection System Based on Fusion of Assembly Code and Source Code" uses three modalities (high-level-language source code, assembly code, and mixed source/assembly code), it has the following shortcomings. First, it only achieves function-level vulnerability detection, and the detection granularity is coarse. Second, the three modalities share the same program representation learning model; no deep learning model is designed for the characteristics of each modality to extract their complementary, modality-specific vulnerability features. Third, a voting layer is used for decision-level fusion of the detection results of the three modalities, which is too simple a fusion scheme.
At present, no method has been found that performs feature-level fusion of high-level-language code and assembly code and detects vulnerabilities at code-segment granularity. The difficulty of such feature-level fusion lies in performing statement-level feature alignment and bimodal feature enhancement between assembly code and high-level-language code.
Disclosure of Invention
The object of the invention is to provide a vulnerability detection method based on cross-modal feature enhancement of source code and assembly code. The method performs representation learning on the two program modalities of high-level-language source code and assembly code, extracts vulnerability-related semantic features from the source code modality and the assembly code modality respectively by using the statement alignment relation between source code and assembly code, learns the semantic relevance between the two modalities with different deep learning networks and a cross-attention mechanism, and performs feature-level fusion by fully using the feature complementarity of the two modalities, thereby improving the accuracy of software vulnerability detection.
The purpose of the invention is realized by the following technical scheme:
A vulnerability detection method based on cross-modal feature enhancement of source code and assembly code extracts syntactic and semantic features related to control dependence and data dependence from the source code, extracts syntactic and semantic features related to memory operations from the assembly code, and then feeds the assembly code, statement-aligned with the high-level-language program source code, into a bimodal representation learning model with cross-modal feature enhancement and fusion, composed of a source code representation learning network, an assembly code representation learning network, and a cross-attention layer, for software vulnerability detection. The method specifically comprises the following steps:
Step 1: convert the high-level-language program source code into assembly code that is statement-aligned and annotated with source code variables;
Step 2: generate an abstract syntax tree (AST) and a program dependency graph (PDG) from the high-level-language program source code using a static analysis tool;
Step 3: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the source code;
Step 4: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the assembly code;
Step 5: merge the source code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of source code slice code segments; similarly, merge the assembly code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of assembly code slice code segments;
Step 6: use word2vec to perform word embedding on the tokens in a source code slice code segment to obtain the initial vector representation of each statement, send it to a statement encoding network composed of a CNN to obtain the hidden vector representation of each statement, and then send the statement hidden vectors to a program encoding network composed of a bidirectional GRU (Gated Recurrent Unit) to obtain the hidden vector representation of the source code slice code segment; similarly, use word2vec to perform word embedding on the tokens in an assembly code slice code segment to obtain the initial vector representation of each statement, and send it to an assembly code representation learning network composed of two layers of bidirectional GRUs to obtain the hidden vector representation of the assembly code slice code segment;
Step 7: use a cross-attention mechanism (Cross-Attention) to perform feature enhancement on the hidden vector representations of the source code slice code segment and the assembly slice code segment obtained in step 6, generating more accurate slice code segment vector representations, and then apply attention-weighted aggregation and concatenation to the two hidden vector representations to obtain the cross-modal feature-enhanced and fused hidden vector representation;
Step 8: send the cross-modal cross-attention feature-fused and enhanced vector representation obtained in step 7 into a classifier composed of a fully connected layer (FCN) and Softmax, compute the cross-entropy loss from the classifier output and the actual label of the code segment, and back-propagate to update the parameters of the bimodal representation learning model with cross-modal feature enhancement and fusion, composed of the source code representation learning network, the assembly code representation learning network, and the cross-attention layer, until training of the model is finished;
Step 9: use the trained cross-modal feature-enhanced and fused bimodal representation learning model and the classifier network to perform vulnerability detection on the code under test.
Compared with the prior art, the invention has the following advantages:
(1) A memory operation instruction is introduced for the first time as a slicing criterion. By locating the memory operation instructions related to source code variables, the slice-code vulnerability data set generated by the memory-operation-based slicing technique extracts more accurate slices for data-flow-related vulnerabilities, which makes it easier for the code representation learning model to learn vulnerability-related statements.
(2) Cross-modal code element alignment is realized using compile-time preprocessing information and debugging trace information, generating assembly code annotated with source code variables. This reduces the vocabulary gap between the assembly code and the source code and helps the model fuse cross-modal semantics quickly.
(3) Vulnerability detection based on source code alone focuses only on the data flow between variables and can hardly capture data-flow information at the memory level, whereas assembly code better captures data-flow-related vulnerability features such as memory operations. Jointly using data from the two program modalities of high-level-language source code and assembly code, and exploiting their complementary advantages, captures the vulnerability features of the code more comprehensively and improves the detection accuracy of the model.
Drawings
Fig. 1 is a schematic overall flow chart of the vulnerability detection method of the present invention.
FIG. 2 is an example of bimodal code statement alignment.
FIG. 3 is a schematic diagram of generating bimodal code slicing code segments according to source code slicing criteria.
FIG. 4 is a schematic diagram of generating bimodal code slicing code sections according to assembly code slicing criteria.
FIG. 5 is a process of cross-modal feature enhanced vulnerability detection.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but is not limited thereto; any modification or equivalent replacement of the technical solution of the present invention that does not depart from its spirit and scope shall be covered by the protection scope of the present invention.
The invention provides a vulnerability detection method based on cross-modal feature enhancement of source code and assembly code. First, the high-level program language code to be detected is converted into statement-aligned assembly code, yielding statement-aligned bimodal program code (as shown in FIG. 2). The high-level-language source code is sliced according to the slicing criterion of the source code, and the corresponding assembly code slice segments are found from the resulting source code slice segments (as shown in FIG. 3). Then, using the assembly code annotated with source code variables, the source code statements corresponding to the memory operation instructions related to source code variables are found and added to the slice code segment data set of the source code, and the complete assembly code corresponding to the statements in these source code segments is found (as shown in FIG. 4) and added to the slice code segment data set of the assembly code. Next, according to the different statement characteristics of the two modalities, different deep neural networks are designed to perform representation learning on the program slice code segments of the two modalities and obtain the corresponding hidden vector representations; a cross-attention network is used for feature enhancement, and attention-weighted aggregation and concatenation of the bimodal vectors completes the feature fusion. Finally, the fused features are sent into a classifier composed of a fully connected layer (FCN) and Softmax to judge whether the program under test contains a vulnerability.
As shown in FIG. 1, the method comprises the following specific steps:
Step 1: convert the high-level-language program source code into assembly code with statement alignment and source code variable annotations, with the following specific steps:
Step 1.1: use a code compiler to debug the source code, generate assembly code statement by statement, and output the source code and the assembly code aligned by statement;
Step 1.2: generate assembly code with source code variable annotations using the command-line compilation mode of the code compiler;
Step 1.3: fill the assembly code annotations back into the corresponding source code statements generated in step 1.1.
Step 2: a static analysis tool is used to generate Abstract Syntax Trees (AST) and Program Dependency Graphs (PDG) from high-level language program source code.
Step 3: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the source code; the slice generation comprises the following specific steps:
Step 3.1: extract the vulnerability candidate key points of the source code according to the slicing criterion of the source code, traverse the program dependency graph obtained in step 2 both forward and backward to obtain a bidirectional set of slice statements, and generate the slice code segment in high-level-language form;
Step 3.2: according to the statement correspondence between the assembly code and the high-level-language source code obtained in step 1, find the assembly code block corresponding to each statement in the source code slice code segment, thereby obtaining the slice code segment in assembly language form.
Step 4: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the assembly code, comprising the following specific steps:
Step 4.1: extract the memory operation instructions related to source code variables according to the statement alignment result of the source code and the assembly code generated in step 1 and the source code variable annotations of the assembly code;
Step 4.2: find the source code statements corresponding to the memory operation instructions as the slice code segment of the source code;
Step 4.3: find the assembly instruction sequences corresponding to the statements in the slice code segment generated in step 4.2 as the slice code segment of the assembly code.
Step 5: merge the same-modality slice sets generated in steps 3 and 4 and label them, obtaining the training data set of source code slice code segments and the training data set of assembly code slice code segments, with the following specific steps:
Step 5.1: merge the source code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of source code slice code segments;
Step 5.2: merge the assembly code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of assembly code slice code segments.
Step 6: use word2vec to perform word embedding on the tokens in a source code slice code segment to obtain the initial vector representation of each statement, send it to a statement encoding network composed of a CNN to obtain the hidden vector representation of each statement, and then send the statement hidden vectors to a program encoding network composed of a bidirectional GRU to obtain the hidden vector representation of the source code slice code segment; similarly, use word2vec to perform word embedding on the tokens in an assembly code slice code segment to obtain the initial vector representation of each statement, and send it to an assembly code representation learning network composed of two layers of bidirectional GRUs to obtain the hidden vector representation of the assembly code slice code segment. The specific steps are as follows:
Step 6.1: perform cross-modal joint word-embedding training on the multimodal corpus formed by the source code and the assembly code, and represent the tokens in the source code and assembly code slice code segments as word embeddings.
Step 6.2: token feature extraction stage. The importance of different tokens is resolved by a self-attention network. The token word-embedding sequence T corresponding to a given statement is sent into a self-attention layer to generate the hidden vector representation of the token sequence. The specific calculation is as follows:
Q = W_Q · T;
K = W_K · T;
V = W_V · T;
H = Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V;
where Q is the query sequence, (K, V) is the key-value pair sequence, W_Q, W_K, W_V are learnable weights, H is the token hidden vector sequence, Attention() is the self-attention operation, and d_k is the dimension of Q.
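The token-level self-attention of step 6.2 can be sketched in PyTorch as follows (single head, no dropout); the PyTorch realization is an illustration, not the claimed implementation.

import math
import torch
import torch.nn as nn

class TokenSelfAttention(nn.Module):
    """Self-attention over the token embedding sequence T of one statement.
    T: (num_tokens, emb_dim); the returned H has the same shape."""
    def __init__(self, emb_dim: int):
        super().__init__()
        self.w_q = nn.Linear(emb_dim, emb_dim, bias=False)  # W_Q
        self.w_k = nn.Linear(emb_dim, emb_dim, bias=False)  # W_K
        self.w_v = nn.Linear(emb_dim, emb_dim, bias=False)  # W_V

    def forward(self, T: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.w_q(T), self.w_k(T), self.w_v(T)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # Q·K^T / sqrt(d_k)
        return torch.softmax(scores, dim=-1) @ V                  # H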
Step 6.3: statement feature extraction stage. For the token hidden vector sequence H obtained in step 6.2, convolution layers with different kernel sizes are used to capture local semantic features of different scales within a source code statement, and the feature vectors obtained through a max-pooling layer are concatenated to obtain the vector representation of the source code statement. The specific calculation is as follows:
H_S1 = CNN_{k=4}(H);
H_S2 = CNN_{k=5}(H);
H_S3 = CNN_{k=6}(H);
H_P1 = MaxPooling(H_S1);
H_P2 = MaxPooling(H_S2);
H_P3 = MaxPooling(H_S3);
s = Concat(H_P1, H_P2, H_P3);
where CNN is a one-dimensional convolutional neural network, k is the convolution kernel size, H_S1, H_S2, H_S3 are the hidden vectors obtained by convolutional neural networks with kernel sizes 4, 5, and 6 respectively, MaxPooling() is the max-pooling layer, H_P1, H_P2, H_P3 are the results of H_S1, H_S2, H_S3 after max pooling, Concat() denotes concatenation along the feature dimension, and s is the final vector representation of the source code statement obtained after concatenation.
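A PyTorch sketch of the statement encoder of step 6.3, with kernel sizes 4/5/6 as in the formulas above; the channel width and padding are illustrative choices.

import torch
import torch.nn as nn

class StatementCNN(nn.Module):
    """Multi-kernel 1-D CNN over the token hidden sequence H of one statement,
    followed by max pooling and concatenation into the statement vector s."""
    def __init__(self, emb_dim: int, channels: int = 128):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, channels, kernel_size=k, padding=k - 1) for k in (4, 5, 6))

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        x = H.t().unsqueeze(0)                                        # (1, emb_dim, num_tokens)
        pooled = [conv(x).max(dim=-1).values for conv in self.convs]  # MaxPooling(H_Sk)
        return torch.cat(pooled, dim=-1).squeeze(0)                   # s = Concat(H_P1, H_P2, H_P3)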
Step 6.4: code segment feature extraction stage. For a source code program slice S = [s_1, s_2, ..., s_i, ..., s_n], where s_i is the final vector representation of the i-th source code statement obtained in step 6.3 and n is the number of statement vector representations contained in S, the program slice representation formed by the set of final statement vectors is sent to a program encoding network composed of a bidirectional GRU to obtain the hidden representation P_C of the source code slice code segment. The specific formula is as follows:
P_C = BiGRU(S);
where BiGRU() is a bidirectional GRU network, P_C is the vector representation of the source code program slice, and S is the sequence of source code statement vector representations.
Step 6.5: the assembly code slice code segment obtained in step 6.1 is denoted A = [a_1, a_2, ..., a_i, ..., a_m], where a_i is the initial vector representation of the i-th assembly statement and m is the number of assembly statement vector representations contained in A. A is sent into the assembly code representation learning model composed of two layers of bidirectional GRUs to obtain the vector representation P_A of the assembly code slice code segment. The formula is as follows:
P_A = BiGRU(BiGRU(A));
where BiGRU() is a bidirectional GRU network, P_A is the hidden representation of the assembly code program slice, and A is the representation of the assembly slice code segment.
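A PyTorch sketch of the two program-level encoders of steps 6.4 and 6.5; hidden size and batch handling are illustrative assumptions.

import torch
import torch.nn as nn

class SliceEncoders(nn.Module):
    """P_C = BiGRU(S) for the source-code slice and P_A = BiGRU(BiGRU(A)) for the
    assembly slice, realized as a one-layer and a two-layer bidirectional GRU."""
    def __init__(self, stmt_dim: int, hidden: int = 128):
        super().__init__()
        self.src_gru = nn.GRU(stmt_dim, hidden, num_layers=1,
                              bidirectional=True, batch_first=True)
        self.asm_gru = nn.GRU(stmt_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)

    def forward(self, S: torch.Tensor, A: torch.Tensor):
        # S: (1, n, stmt_dim) statement vectors; A: (1, m, stmt_dim) assembly vectors.
        P_C, _ = self.src_gru(S)   # hidden vectors of the source-code slice
        P_A, _ = self.asm_gru(A)   # hidden vectors of the assembly-code slice
        return P_C, P_A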
Step 7: use a cross-attention mechanism (Cross-Attention) to perform feature enhancement on the hidden vector representation of the source code slice code segment and the hidden vector representation of the assembly slice code segment obtained in step 6, generating more accurate slice code segment vector representations, and then apply attention-weighted aggregation and concatenation to the two hidden vector representations to obtain the cross-modal feature-enhanced and fused hidden vector representation. The specific steps are as follows:
Step 7.1: perform feature enhancement on the hidden vector representation P_C of the source code slice code segment and the hidden vector representation P_A of the assembly slice code segment obtained in step 6 with a cross-attention mechanism, obtaining the attention value of each modality with respect to the other modality and the feature-enhanced bimodal vector representations.
The cross-attention of the source code modality with respect to the assembly code modality is calculated as follows:
Q_C = W_Q^C · P_C;
K_A = W_K^A · P_A;
V_A = W_V^A · P_A;
H_C = Attention(Q_C, K_A, V_A) = softmax(Q_C · K_A^T / √d_k) · V_A;
The cross-attention of the assembly code modality with respect to the source code modality is calculated as follows:
Q_A = W_Q^A · P_A;
K_C = W_K^C · P_C;
V_C = W_V^C · P_C;
H_A = Attention(Q_A, K_C, V_C) = softmax(Q_A · K_C^T / √d_k) · V_C;
where Q_C is the query sequence of the source code modality and (K_A, V_A) is the key-value pair sequence of the assembly code modality; Q_A is the query sequence of the assembly code modality and (K_C, V_C) is the key-value pair sequence of the source code modality; W_Q^C, W_K^A, W_V^A, W_Q^A, W_K^C, W_V^C are all learnable weights. After cross attention, the feature-enhanced source code representation H_C = [h_1^C, h_2^C, ..., h_i^C, ..., h_N^C] is obtained, where h_i^C is the feature-enhanced hidden vector representation of the i-th source code statement and N is the number of feature-enhanced statement hidden vector representations in H_C; likewise, the feature-enhanced assembly code representation H_A = [h_1^A, h_2^A, ..., h_j^A, ..., h_M^A] is obtained, where h_j^A is the feature-enhanced hidden vector representation of the j-th assembly instruction and M is the number of feature-enhanced hidden vector representations of assembly code statements in H_A.
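A PyTorch sketch of the cross-attention of step 7.1, using the same scaled dot-product form as step 6.2; the exact parameterization is an illustrative assumption.

import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Source attends over assembly (H_C) and assembly attends over source (H_A).
    P_C: (n, d) statement hidden vectors; P_A: (m, d) instruction hidden vectors."""
    def __init__(self, d: int):
        super().__init__()
        self.w_qc = nn.Linear(d, d, bias=False)  # W_Q^C
        self.w_ka = nn.Linear(d, d, bias=False)  # W_K^A
        self.w_va = nn.Linear(d, d, bias=False)  # W_V^A
        self.w_qa = nn.Linear(d, d, bias=False)  # W_Q^A
        self.w_kc = nn.Linear(d, d, bias=False)  # W_K^C
        self.w_vc = nn.Linear(d, d, bias=False)  # W_V^C

    @staticmethod
    def _attend(Q, K, V):
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
        return torch.softmax(scores, dim=-1) @ V

    def forward(self, P_C: torch.Tensor, P_A: torch.Tensor):
        H_C = self._attend(self.w_qc(P_C), self.w_ka(P_A), self.w_va(P_A))  # enhanced source
        H_A = self._attend(self.w_qa(P_A), self.w_kc(P_C), self.w_vc(P_C))  # enhanced assembly
        return H_C, H_A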
Step 7.2: apply attention-weighted aggregation to the bimodal code segment vectors and then concatenate them to obtain the fused feature vector, specifically: the feature-enhanced hidden vector representations of the two modalities obtained in the previous step are each sent into an attention-weighted aggregation layer. The attention-weighted aggregation is a weighted summation based on the attention weights a_i, a_j; it yields the attention-weighted aggregate vector representation x_C of the source code slice and the attention-weighted aggregate vector representation x_A of the assembly code slice, and the feature-fused vector x_t is then obtained by concatenating x_C and x_A, where:
The attention-weighted aggregation of the source code modality is as follows:
u_i^C = tanh(W_C · h_i^C + b_C);
a_i = exp((u_i^C)^T · u_C) / Σ_i exp((u_i^C)^T · u_C);
x_C = Σ_i a_i · h_i^C;
The attention-weighted aggregation of the assembly code modality is as follows:
u_j^A = tanh(W_A · h_j^A + b_A);
a_j = exp((u_j^A)^T · u_A) / Σ_j exp((u_j^A)^T · u_A);
x_A = Σ_j a_j · h_j^A;
x_t = Concat(x_C, x_A);
where u_i^C and u_j^A are the vector representations of h_i^C and h_j^A after mapping by a fully connected layer and the activation function tanh(), (u_i^C)^T and (u_j^A)^T are their transposed vectors, and u_C and u_A are trainable vectors.
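The attention-weighted aggregation and concatenation of step 7.2 can be sketched as follows; the pooling layer is written in the style of a hierarchical-attention pooling layer, which matches the tanh / transposed-vector / trainable-vector description above, and the parameter shapes are illustrative.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Weight the feature-enhanced statement vectors H (n, d) with attention weights
    a_i and sum them into one slice vector x."""
    def __init__(self, d: int):
        super().__init__()
        self.proj = nn.Linear(d, d)                    # fully connected layer before tanh
        self.context = nn.Parameter(torch.randn(d))    # trainable vector

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        u = torch.tanh(self.proj(H))                   # u_i = tanh(W h_i + b)
        a = torch.softmax(u @ self.context, dim=0)     # a_i from u_i^T and the trainable vector
        return (a.unsqueeze(-1) * H).sum(dim=0)        # x = sum_i a_i h_i

# Feature-level fusion: x_t = Concat(x_C, x_A)
# pool_c, pool_a = AttentionPooling(d), AttentionPooling(d)
# x_t = torch.cat([pool_c(H_C), pool_a(H_A)], dim=-1)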
Step 8: send the cross-modal cross-attention feature-fused and enhanced vector representation obtained in step 7 into a classifier composed of a fully connected layer (FCN) and Softmax, compute the cross-entropy loss from the classifier output and the actual label of the code segment, and back-propagate to update the parameters of the cross-modal feature-enhanced and fused bimodal representation learning model until its training is finished. The specific steps are as follows:
Step 8.1: send the feature-fused vector representation x_t obtained in step 7 into the fully connected layer and obtain the prediction result vector y' through the Softmax function.
Step 8.2: compute the cross-entropy loss using the label information of the slice code segments, specifically: using the label values marked in step 5, a label value of 0 (no vulnerability) is initialized as the vector [1,0] and a label value of 1 (vulnerability) as the vector [0,1]; the cross-entropy loss between this vector and the prediction result vector given by the cross-modal feature-enhanced and fused bimodal representation learning model is computed, and the model parameters are adjusted by error back-propagation until the loss value no longer decreases and training is finished, where:
The cross-entropy loss is calculated as follows:
Loss = - Σ_i y_i · log(y'_i);
where y' is the model prediction result obtained in step 8.1 and y is the vulnerability label value of the slice code segment after vector initialization.
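A minimal training-step sketch for step 8, assuming x_t is the fused vector from step 7 and the label is 0 (non-vulnerable) or 1 (vulnerable); the input dimension, optimizer, and learning rate are illustrative assumptions, and in practice the optimizer would also cover the encoders and the cross-attention layer so that back-propagation updates the whole bimodal model.

import torch
import torch.nn as nn

classifier = nn.Linear(512, 2)                  # FCN; Softmax is folded into the loss below
criterion = nn.CrossEntropyLoss()               # cross-entropy: -sum_i y_i log y'_i
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def train_step(x_t: torch.Tensor, label: int) -> float:
    logits = classifier(x_t.unsqueeze(0))       # (1, 2) class scores
    loss = criterion(logits, torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()                             # back-propagate the loss
    optimizer.step()
    return loss.item()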
Step 9: use the trained cross-modal feature-enhanced and fused bimodal representation learning model and the classifier network to perform vulnerability detection on the code under test.
Example:
as shown in FIG. 2, which is an example of source code and assembly code bimodal statement alignment, the section of program vulnerability conforms to CWE (Common Weakness Enummation Common Defect Table) 121. The error type is Stack Overflow, which is a defect that C language has no built-in checking mechanism to ensure that the data copied to the buffer can be larger than the size of the buffer, if the data to be copied is larger than the buffer, the range of the buffer will be overflowed, and other storage units will be rewritten when the buffer overflows, resulting in program crash accidents and the like, which cause unpredictable consequences. As can be seen from fig. 2, the program code with aligned source code and assembly code statements is generated by using the method in step 1, wherein the head of each statement is the line number of the corresponding modality, the line number of the source code is a number and then "", such line number is consistent with the line number of the program code of the source code modality, the line number of the assembly code is a pure number, and one source code statement generally corresponds to 0 to 20 assembly codes. In the assembly code, besides the conventional assembly instruction part, the program variable annotation of the source code corresponding to the register is carried out, and after "#", the program variable annotation is divided into "" and "". In the statement of source code vulnerability, 80 characters of a str string are copied to buf by using a strncpy function in the 6 th line of a code segment, and the assembly code modality of the source code vulnerability statement and register comments related to source code variables are shown in the 18 th to 23 th lines of FIG. 2, so that cross-modality statement alignment is realized.
As can be seen from FIG. 3, which shows the process of finding the corresponding assembly code segments from C-language source code slice code segments, the example slices on the source code function parameter char str as the slicing criterion. The function parameter to be tracked is found through the abstract syntax tree, the statements on which the parameter depends are found through the program dependency graph to obtain the source-code-modality program slice, and the assembly code lines aligned with each source code statement are combined to obtain the assembly-code-modality program slice. In the source code slice it can be seen that the macro definition value on line 3 of the C code is not present in the slice, so the length of buf cannot be obtained from the code statements in the slice. However, in assembly instruction line 9, corresponding to line 4 of the C code, the value 40 is assigned to the temporary register tmp90, so the real length of buf can be found in the assembly code modality; a vulnerability that cannot be detected in the source code modality can therefore be found and detected in the assembly code modality.
As can be seen from FIG. 4, the memory-operation slicing criterion also slices on the parameter char str and thereby shows its own advantage. The annotation part of the assembly instructions marks which registers have a cross-modal alignment relation with the variable str; str corresponds to the register location [-56rbp], and the assembly code lines related to the variable are lines 8 and 14. These two instructions correspond to lines 4 and 6 of the C source code respectively, and those statements form the source-code-modality program slice. The assembly code lines aligned with each source code statement are then combined to obtain the assembly-code-modality program slice. As can be seen from FIG. 4, in the str-based bimodal program slice the memory-operation slicing criterion selects only one statement besides the declaration, and that statement is the vulnerability statement; the cause of the vulnerability can be seen in the assembly code modality. Using the memory-operation slicing criterion, the slices extracted for data-flow-related vulnerabilities are more accurate.

Claims (5)

1. A vulnerability detection method based on cross-modal feature enhancement of source code and assembly code, characterized by comprising the following steps:
Step 1: convert the high-level-language program source code into assembly code that is statement-aligned and annotated with source code variables;
Step 2: generate an abstract syntax tree and a program dependency graph from the high-level-language program source code using a static analysis tool;
Step 3: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the source code;
Step 4: generate slice code segments of the source code and of the assembly code according to the slicing criterion of the assembly code;
Step 5: merge the source code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of source code slice code segments; similarly, merge the assembly code slice code segment sets generated in steps 3 and 4, mark the slice code segments that contain vulnerability statements as "vulnerable" and the remaining slice code segments as "non-vulnerable", thus forming the training data set of assembly code slice code segments;
Step 6: use word2vec to perform word embedding on the tokens in a source code slice code segment to obtain the initial vector representation of each statement, send it to a statement encoding network composed of a CNN to obtain the hidden vector representation of each statement, and then send the statement hidden vectors to a program encoding network composed of a bidirectional GRU to obtain the hidden vector representation of the source code slice code segment; similarly, use word2vec to perform word embedding on the tokens in an assembly code slice code segment to obtain the initial vector representation of each statement, and send it to an assembly code representation learning network composed of two layers of bidirectional GRUs to obtain the hidden vector representation of the assembly code slice code segment;
Step 7: use a cross-attention mechanism to perform feature enhancement on the hidden vector representations of the source code slice code segment and the assembly slice code segment obtained in step 6, generating more accurate slice code segment vector representations, and then apply attention-weighted aggregation and concatenation to the two hidden vector representations to obtain the cross-modal feature-enhanced and fused hidden vector representation;
Step 8: send the cross-modal cross-attention feature-fused and enhanced vector representation obtained in step 7 into a classifier composed of a fully connected layer and Softmax, compute the cross-entropy loss from the classifier output and the actual label of the code segment, and back-propagate to update the parameters of the bimodal representation learning model with cross-modal feature enhancement and fusion, composed of the source code representation learning network, the assembly code representation learning network, and the cross-attention layer, until training of the model is finished;
Step 9: use the trained cross-modal feature-enhanced and fused bimodal representation learning model and the classifier network to perform vulnerability detection on the code under test.
2. The vulnerability detection method based on cross-modal feature enhancement of source code and assembly code according to claim 1, characterized in that the specific steps of step 1 are as follows:
Step 1.1: use a program compiler to debug the source code, generate assembly code statement by statement, and output the source code and the assembly code aligned by statement;
Step 1.2: generate assembly code with source code variable annotations using the command-line compilation mode of the compiler with specific parameters;
Step 1.3: fill the assembly code annotations back into the corresponding source code statements generated in step 1.1.
3. The vulnerability detection method based on cross-modal feature enhancement of source code and assembly code according to claim 1, characterized in that in step 3 the slice code segments are generated as follows:
Step 3.1: extract the vulnerability candidate key points of the source code according to the slicing criterion of the source code, traverse the program dependency graph obtained in step 2 both forward and backward to obtain a bidirectional set of slice statements, and generate the slice code segment in high-level-language form;
Step 3.2: according to the statement correspondence between the assembly code and the high-level-language source code obtained in step 1, find the assembly code block corresponding to each statement in the source code slice code segment, thereby obtaining the slice code segment in assembly language form.
4. The vulnerability detection method based on cross-modal feature enhancement of source code and assembly code according to claim 1, characterized in that the specific steps of step 4 are as follows:
Step 4.1: extract the memory operation instructions related to source code variables according to the statement alignment result of the source code and the assembly code generated in step 1 and the source code variable annotations of the assembly code;
Step 4.2: find the source code statements corresponding to the memory operation instructions as the slice code segment of the source code;
Step 4.3: find the assembly instruction sequences corresponding to the statements in the slice code segment generated in step 4.2 as the slice code segment of the assembly code.
5. The vulnerability detection method based on cross-modal feature enhancement of source code and assembly code according to claim 1, characterized in that the specific steps of step 7 are as follows:
Step 7.1: perform feature enhancement on the hidden vector representation P_C of the source code slice code segment and the hidden vector representation P_A of the assembly slice code segment obtained in step 6 with a cross-attention mechanism, obtaining the attention value of each modality with respect to the other modality and the feature-enhanced bimodal vector representations;
the cross-attention of the source code modality with respect to the assembly code modality is calculated as follows:
Q_C = W_Q^C · P_C;
K_A = W_K^A · P_A;
V_A = W_V^A · P_A;
H_C = Attention(Q_C, K_A, V_A) = softmax(Q_C · K_A^T / √d_k) · V_A;
the cross-attention of the assembly code modality with respect to the source code modality is calculated as follows:
Q_A = W_Q^A · P_A;
K_C = W_K^C · P_C;
V_C = W_V^C · P_C;
H_A = Attention(Q_A, K_C, V_C) = softmax(Q_A · K_C^T / √d_k) · V_C;
where Q_C is the query sequence of the source code modality and (K_A, V_A) is the key-value pair sequence of the assembly code modality, Q_A is the query sequence of the assembly code modality and (K_C, V_C) is the key-value pair sequence of the source code modality, and W_Q^C, W_K^A, W_V^A, W_Q^A, W_K^C, W_V^C are all learnable weights; after cross attention the feature-enhanced source code representation H_C = [h_1^C, h_2^C, ..., h_i^C, ..., h_N^C] is obtained, where h_i^C is the feature-enhanced hidden vector representation of the i-th source code statement and N is the number of feature-enhanced statement hidden vector representations in H_C, and the feature-enhanced assembly code representation H_A = [h_1^A, h_2^A, ..., h_j^A, ..., h_M^A] is obtained, where h_j^A is the feature-enhanced hidden vector representation of the j-th assembly instruction and M is the number of feature-enhanced hidden vector representations of assembly code statements in H_A;
Step 7.2: apply attention-weighted aggregation to the bimodal code segment vectors and then concatenate them to obtain the fused feature vector, specifically: send the feature-enhanced hidden vector representations of the two modalities obtained in the previous step into an attention-weighted aggregation layer; the attention-weighted aggregation is a weighted summation based on the attention weights a_i, a_j, yielding the attention-weighted aggregate vector representation x_C of the source code slice and the attention-weighted aggregate vector representation x_A of the assembly code slice, and the feature-fused vector x_t is obtained by concatenating x_C and x_A, where:
the attention-weighted aggregation of the source code modality is as follows:
u_i^C = tanh(W_C · h_i^C + b_C);
a_i = exp((u_i^C)^T · u_C) / Σ_i exp((u_i^C)^T · u_C);
x_C = Σ_i a_i · h_i^C;
the attention-weighted aggregation of the assembly code modality is as follows:
u_j^A = tanh(W_A · h_j^A + b_A);
a_j = exp((u_j^A)^T · u_A) / Σ_j exp((u_j^A)^T · u_A);
x_A = Σ_j a_j · h_j^A;
x_t = Concat(x_C, x_A);
where u_i^C and u_j^A are the vector representations of h_i^C and h_j^A after mapping by a fully connected layer and the activation function tanh(), (u_i^C)^T and (u_j^A)^T are their transposed vectors, and u_C and u_A are trainable vectors.
CN202211105496.XA 2022-09-09 2022-09-09 Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code Pending CN115577362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211105496.XA CN115577362A (en) 2022-09-09 2022-09-09 Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211105496.XA CN115577362A (en) 2022-09-09 2022-09-09 Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code

Publications (1)

Publication Number Publication Date
CN115577362A true CN115577362A (en) 2023-01-06

Family

ID=84581955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211105496.XA Pending CN115577362A (en) 2022-09-09 2022-09-09 Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code

Country Status (1)

Country Link
CN (1) CN115577362A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795487A (en) * 2023-02-07 2023-03-14 深圳开源互联网安全技术有限公司 Vulnerability detection method, device, equipment and storage medium
CN116628707A (en) * 2023-07-19 2023-08-22 山东省计算中心(国家超级计算济南中心) Interpretable multitasking-based source code vulnerability detection method
CN116627429A (en) * 2023-07-20 2023-08-22 无锡沐创集成电路设计有限公司 Assembly code generation method and device, electronic equipment and storage medium
CN116627429B (en) * 2023-07-20 2023-10-20 无锡沐创集成电路设计有限公司 Assembly code generation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN115577362A (en) Vulnerability detection method based on cross-modal characteristic enhancement of source code and assembly code
US11972365B2 (en) Question responding apparatus, question responding method and program
Zhou et al. AMR parsing with latent structural information
CN111382574B (en) Semantic parsing system combining syntax under virtual reality and augmented reality scenes
First et al. TacTok: Semantics-aware proof synthesis
CN112183094A (en) Chinese grammar debugging method and system based on multivariate text features
CN113010209A (en) Binary code similarity comparison technology for resisting compiling difference
CN114201406B (en) Code detection method, system, equipment and storage medium based on open source component
CN113255295B (en) Automatic generation method and system for formal protocol from natural language to PPTL
CN115510236A (en) Chapter-level event detection method based on information fusion and data enhancement
CN112732264A (en) Automatic code conversion method between high-level programming languages
Gao et al. Generating natural adversarial examples with universal perturbations for text classification
Liu et al. Language model augmented relevance score
CN117251522A (en) Entity and relationship joint extraction model method based on latent layer relationship enhancement
Gao et al. Chinese causal event extraction using causality‐associated graph neural network
CN116560890A (en) Automatic program repairing method combining lexical and grammatical information
CN114462045B (en) Intelligent contract vulnerability detection method
CN116595537A (en) Vulnerability detection method of generated intelligent contract based on multi-mode features
He et al. Named entity recognition method in network security domain based on BERT-BiLSTM-CRF
Kaili et al. A simple but effective classification model for grammatical error correction
CN111562943B (en) Code clone detection method and device based on event embedded tree and GAT network
CN114239555A (en) Training method of keyword extraction model and related device
CN114298032A (en) Text punctuation detection method, computer device and storage medium
Li et al. Word segmentation and morphological parsing for sanskrit
Yan et al. A survey of human-machine collaboration in fuzzing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination