CN113901474B - Vulnerability detection method based on function-level code similarity

Info

Publication number: CN113901474B
Application number: CN202111071388.0A
Authority: CN
Inventors: 黄诚; 赵倩崇; 郭勇延
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2021-09-13
Filing date: 2021-09-13
Publication date: 2022-07-26
Anticipated expiration: 2041-09-13
Also published as: CN113901474A

Abstract

The invention discloses a vulnerability detection method based on function-level code similarity, belonging to the technical field of computer network security. The method comprises the following steps: first, open-source vulnerability functions and the source code of the functions to be detected are preprocessed with custom syntax abstraction and normalization rules; then, vulnerability function fingerprints and fingerprints of the functions to be detected are generated from the vulnerability function body together with the added and deleted code lines in the corresponding patch file; finally, vulnerabilities are detected from the function fingerprints by fuzzy matching based on the Wagner-Fischer algorithm and multi-pattern exact matching based on the Aho-Corasick algorithm. The invention avoids generating a complex intermediate representation while retaining the basic syntactic structure, which preserves the performance of the detection model and, in particular, keeps detection precision unaffected by syntactically meaningless modifications. The scalability of vulnerability detection is improved while a low false-positive rate and a low false-negative rate are maintained.

Description

Vulnerability detection method based on function-level code similarity

Technical Field

The invention relates to the field of computer network security, and in particular to a vulnerability detection method based on function-level code similarity. The invention avoids generating a complex intermediate representation while retaining the basic syntactic structure, which preserves the performance of the detection model and, in particular, keeps detection precision unaffected by syntactically meaningless modifications. It can detect Type-1 to Type-3 clones and automatically distinguishes vulnerable code from patched code. The scalability of vulnerability detection is improved while a low false-positive rate and a low false-negative rate are maintained.

Background

Over the past few years, the number of open-source software (OSS) programs has increased rapidly. This growth naturally leads to more software vulnerabilities caused by code cloning, which poses a serious threat to the security of software systems. Software vulnerabilities include missing validation of user input, inadequate logging mechanisms, fail-open error handling, failure to properly close database connections, and so on. Code cloning is the practice of copying and pasting existing code from other software. Used correctly, it can greatly improve development efficiency and shorten the development cycle. In practice, however, code cloning is often regarded as poor programming practice because it can increase maintenance costs, reduce code quality, create potential legal conflicts, and even propagate software vulnerabilities. In particular, since OSS programs are widely used as code libraries in software development, code cloning is becoming one of the main causes of software vulnerabilities.

Conventional code similarity detection generally converts the target code into an intermediate representation, such as a parse tree or a program control graph, and then analyzes this representation and checks whether it matches a predefined vulnerability rule in order to decide whether the source program contains the corresponding vulnerability. A complex intermediate representation helps improve detection accuracy but also leads to a higher computational cost. A more abstract code representation improves efficiency, but part of the vulnerability's semantic information may be lost, so that vulnerable code and patched code can no longer be distinguished.

Disclosure of Invention

In view of this, the embodiment of the present application provides a vulnerability detection method based on function-level code similarity, which aims to balance efficiency and accuracy at an acceptable cost and to effectively detect the modifications commonly made during code cloning. The scalability of vulnerability detection is improved while a low false-positive rate and a low false-negative rate are maintained.

The relevant definitions referred to in the present invention are as follows.

Definition 1: ANTLR4 is an open-source parser generator written in Java that can generate a parser from a grammar rule file.

Definition 2: An abstract syntax tree (AST) is a tree representation of the syntactic structure of program code, used to analyze the structure of source code from the point of view of its syntax tree. For example, an if-else conditional statement can be represented using two branch nodes in the AST.

Definition 3: The Fowler-Noll-Vo hash (FNV hash) can hash large amounts of data quickly while keeping the collision rate low. Its high dispersion makes it suitable for hashing very similar strings, such as URLs, hostnames, filenames, text, and IP addresses.

Definition 4: Difference line code. A patch file consists of one or more difference blocks, each of which is a sequence of specially marked code lines. Lines beginning with "+" are added code and lines beginning with "-" are deleted code; the two are collectively referred to herein as difference line code.

Definition 5: Context-triggered piecewise hashing (the CTPH algorithm) uses a rolling hash to set the boundaries of traditional piecewise hashes according to the content. It can be used to identify ordered homologous sequences between an unknown input and a known file, even if the unknown input is a modified version of the known file.

Definition 6: The Wagner-Fischer algorithm (WF algorithm) finds a least-cost series of edit operations that converts a string a into a string b; the allowed edit operations are character insertion, character deletion, and character replacement. For example, the WF value between the string S1 "Angel" and the string S2 "Angle" is 2.

Definition 7: The Aho-Corasick automaton (AC automaton) is a multi-pattern matching algorithm that finds, in an input string, the substrings belonging to a finite "dictionary" set. Unlike ordinary string matching, it matches against all dictionary strings at the same time.

The technical scheme of the invention is as follows: a vulnerability detection method based on function-level code similarity comprises the following steps.

Step one, building a vulnerability function fingerprint database.

(1) For C-language code containing vulnerabilities, a Python regular expression is used to match and remove the comments in the extracted source code.

(2) The commit files of all CVE vulnerabilities and the corresponding patch files are collected from CVE project repositories on GitHub to build a vulnerability database, and all vulnerability functions and the added and deleted lines in the corresponding patch files are extracted.

(3) A C grammar rule file is written with ANTLR4; abstract syntax trees of all functions are generated from the C source files and converted into token sequences; and the function body, the vulnerability source (i.e. file location), function name, formal parameter list, local variable list, data type list, and function call list are extracted from the abstract syntax trees.

(4) Syntax abstraction is performed as follows: the function name is replaced with the symbol FUNNAME and each parameter variable appearing in the function body with the symbol FORPAR; each local variable appearing in the function body is replaced with the symbol LOVAR; every custom data type declaration, other than those declared in the ISO C standard, is replaced with CUSTYPE; and each function call, other than calls to C standard library functions, is replaced with the symbol FUNCALL.

(5) Normalization is performed: spaces, tabs, and line feeds are deleted, all "{" and "}" are deleted, and all characters are converted to lowercase.

(6) After syntax abstraction and normalization, the resulting vulnerability function syntax structure comprises two parts: a difference structure and a function body structure. For the latter, a fuzzy hash value based on the CTPH algorithm is generated. Segmentation uses a rolling hash algorithm. Suppose the input has $n$ characters and the $i$-th byte of the input is denoted $b_i$, so that the input as a whole consists of the bytes $b_1, b_2, \ldots, b_n$. At any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file. The rolling hash value $r_p$ can therefore be expressed as a function of the last few bytes, as shown in the following equation:

$$r_p = F(b_p, b_{p-1}, b_{p-2}, \ldots, b_{p-s})$$

Step two, generating the target function fingerprint.

(1) The comments are removed from the C source code to be detected.

(2) Abstract syntax trees of all target functions are generated from the C source files and converted into token sequences, and the target function body, the target function source (i.e. file location), function name, formal parameter list, local variable list, data type list, and function call list are extracted from the abstract syntax trees.

(3) The function name, formal parameters, local variables, custom data types, and custom function calls in the body of the target function are replaced in turn to perform syntax abstraction.

(4) The target function is normalized after the replacement.

(5) After syntax abstraction and normalization, the syntax structure of the target function is retained and a fuzzy hash based on the CTPH algorithm is generated; the two together form the intermediate representation of the target function.

Segmentation uses a rolling hash algorithm. Suppose the input has $n$ characters and the $i$-th byte of the input is denoted $b_i$, so that the input as a whole consists of the bytes $b_1, b_2, \ldots, b_n$. At any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file. The rolling hash value is therefore

$$r_p = F(b_p, b_{p-1}, b_{p-2}, \ldots, b_{p-s})$$

Step three, detecting vulnerabilities based on the function fingerprints.

(1) The Wagner-Fischer value between the fuzzy hash of the function body in the vulnerability function fingerprint and the fuzzy hash in the target function fingerprint is calculated to obtain a similarity score, which is used to judge whether the two functions are similar.

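The Wagner-Fischer value of two strings $a$ and $b$ can be described mathematically as $\mathrm{lev}_{a,b}(|a|, |b|)$. Here $\mathrm{lev}_{a,b}(i, j)$ denotes the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$, and $|a|$ is the length of $a$. Since string indices start from 1, the final edit distance is the distance at $i = |a|$, $j = |b|$:

$$\mathrm{lev}_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0, \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$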

(2) The syntax structure of the target function is taken as the main string and all difference code lines in the vulnerability function fingerprint are taken as the pattern strings; an AC automaton is constructed to perform multi-pattern exact matching. The AC algorithm is implemented in two steps, preprocessing and matching. Preprocessing: the keywords are built into a finite-state pattern matching machine, that is, an automaton containing all the keywords. Matching: the given text is run through the constructed automaton to find all matching words. Starting from the first character of the main string and the initial state 0 of the automaton, if a character matches, the automaton moves to the next state according to its goto function; if the state reached has an output, the matched pattern strings are output; if the character does not match, the automaton falls back recursively according to its failure function.

The main advantages of the invention are as follows.

Custom syntax abstraction and normalization rules are provided. They avoid generating a complex intermediate representation while retaining the basic syntactic structure, eliminate the effect of the modifications made during code cloning, and preserve the performance of the detection model; in particular, detection precision is unaffected by syntactically meaningless modifications.

Vulnerability function fingerprints based on code differences are provided. In many cases there is little difference between vulnerable code and patched code, and a vulnerability can be eliminated by inserting a single if statement. Many security vulnerabilities are also very sensitive to constants and statement order. Therefore, when detecting Type-1, Type-2, and Type-3 clones of vulnerable code, failing to distinguish vulnerable code from patched code leads to a high false negative rate. The method generates the vulnerability fingerprint from the vulnerability function body and the added and deleted code lines (collectively called difference line code) in the corresponding patch file, so it can detect Type-1 to Type-3 clones while distinguishing vulnerable code from patched code.

Drawings

Fig. 1 is a system flow diagram.

Fig. 2 is a flowchart of building the vulnerability function fingerprint database in block 1001 of Fig. 1.

Fig. 3 is a flowchart of the target function fingerprint generation process in block 1002 of Fig. 1.

Fig. 4 is a flowchart of the vulnerability detection process based on function fingerprints in block 1003 of Fig. 1.

Detailed Description

The present invention will be further explained below with reference to the drawings and examples.

FIG. 1 is a flow chart of the overall system of the present invention.

The vulnerability function fingerprint database construction module collects the commit files and corresponding patch files of all CVE vulnerabilities from CVE project repositories on GitHub to build a vulnerability database, generates vulnerability function fingerprints based on the CTPH algorithm and code differences, and builds the vulnerability function fingerprint database.

The target function fingerprint generation module generates target function fingerprints based on the CTPH algorithm.

Vulnerability detection based on function fingerprints comprises two matching steps, fuzzy matching and exact matching. A vulnerability is reported only when both steps succeed.

Fig. 2 is a flowchart of the vulnerability function fingerprint database construction in fig. 1, which illustrates how to construct the vulnerability function fingerprint database.

The process begins by using a Python regular expression to match and remove the comments in the extracted vulnerability function source code. There are two kinds of comments: single-line comments, introduced by "//", and multi-line comments, enclosed between "/*" and "*/".
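As an illustration of this step, the following is a minimal sketch of regex-based comment removal in Python. The function name `strip_comments` and the choice to protect string and character literals are assumptions made for the example; the patent only states that a Python regular expression is used.

```python
import re

# One alternation: literals are matched first so that comment markers
# inside them are left alone; comments are matched afterwards.
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'      # string literal (kept)
    r"|'(?:\\.|[^'\\])*'"     # character literal (kept)
    r"|//[^\n]*"              # single-line comment (removed)
    r"|/\*.*?\*/",            # multi-line comment (removed)
    re.DOTALL,
)


def strip_comments(source: str) -> str:
    """Remove C comments from source (hypothetical helper)."""
    def repl(match):
        text = match.group(0)
        # Only comments start with "/"; literals start with a quote.
        return " " if text.startswith("/") else text
    return _TOKEN.sub(repl, source)
```

Each comment is replaced by a single space so that the tokens on either side are not merged before parsing.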

Secondly, the commit files and corresponding patch files of all CVE vulnerabilities are collected from CVE project repositories on GitHub to build a vulnerability database, and all vulnerability functions and the added and deleted lines in the corresponding patch files are extracted.
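Extracting the added and deleted lines can be sketched as follows, assuming the patch is in unified diff format with "@@" hunk headers. The function name `split_diff_lines` is hypothetical, and collecting the patches from GitHub is out of scope here.

```python
def split_diff_lines(patch_text: str):
    """Return (added, deleted) code lines from a unified diff (a sketch)."""
    added, deleted = [], []
    in_hunk = False
    for line in patch_text.splitlines():
        if line.startswith("diff "):
            in_hunk = False            # a new file section begins
        elif line.startswith("@@"):
            in_hunk = True             # hunk header; code lines follow
        elif in_hunk and line.startswith("+"):
            if line[1:].strip():
                added.append(line[1:].strip())
        elif in_hunk and line.startswith("-"):
            if line[1:].strip():
                deleted.append(line[1:].strip())
    return added, deleted
```

Tracking whether the parser is inside a hunk keeps the "---" and "+++" file header lines from being mistaken for deleted or added code.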

Thirdly, a C grammar rule file is written with ANTLR4; abstract syntax trees of all vulnerability functions are generated from the C source files and converted into token sequences; and the function body, the vulnerability source (i.e. file location), function name, formal parameter list, local variable list, data type list, and function call list are extracted.

Fourthly, syntax abstraction is performed as follows: the function name is replaced with the symbol FUNNAME and each parameter variable appearing in the function body with the symbol FORPAR; each local variable appearing in the function body is replaced with the symbol LOVAR; every custom data type declaration, other than those declared in the ISO C standard, is replaced with CUSTYPE; and each function call, other than calls to C standard library functions, is replaced with the symbol FUNCALL.

Fifthly, spaces, tabs, and line feeds are deleted, all "{" and "}" characters are deleted, and all characters are converted to lowercase, so as to normalize the vulnerability function body.
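The abstraction and normalization rules can be sketched in Python as below. This sketch assumes the identifier lists have already been extracted by the parser and substitutes at the identifier level with a regular expression rather than on the ANTLR4 token sequence; the function names and the exact spelling of the placeholder symbols (FUNNAME, FORPAR, LOVAR, CUSTYPE, FUNCALL) are assumptions.

```python
import re


def abstract_function(body, func_name, params, local_vars,
                      custom_types, custom_calls):
    """Replace user-chosen identifiers with placeholder symbols (a sketch)."""
    mapping = {}
    mapping.update({name: "FUNCALL" for name in custom_calls})
    mapping.update({name: "CUSTYPE" for name in custom_types})
    mapping.update({name: "LOVAR" for name in local_vars})
    mapping.update({name: "FORPAR" for name in params})
    mapping[func_name] = "FUNNAME"
    # Single pass over identifiers, so a placeholder is never re-replaced.
    # Note: identifiers inside string literals are replaced too.
    return re.sub(r"[A-Za-z_]\w*",
                  lambda m: mapping.get(m.group(0), m.group(0)),
                  body)


def normalize(text: str) -> str:
    """Drop whitespace and braces, then lowercase everything."""
    return re.sub(r"[ \t\r\n{}]", "", text).lower()
```

Because standard types and library calls such as `int` and `memcpy` are absent from the lists, they pass through unchanged, which is what lets the fingerprint keep the basic syntactic structure.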

Finally, after syntax abstraction and normalization, the resulting vulnerability function syntax structure comprises a difference structure and a function body structure. For the latter, a fuzzy hash value based on the CTPH algorithm is generated; the specific process is as follows.

Segmentation: part of the content of the vulnerability function is read and a hash value is computed with a weak hash algorithm.

The file cannot be split into fixed-length blocks. Instead, a rolling hash algorithm generates a pseudo-random value from the current context of the input. The rolling hash works by maintaining a state that depends only on the last few bytes of the input: each byte is added to the state as it is processed and removed from the state after a set number of further bytes have been processed.

Suppose the input has $n$ characters and the $i$-th byte of the input is denoted $b_i$, so that the input as a whole consists of the bytes $b_1, b_2, \ldots, b_n$. At any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file. The rolling hash value $r_p$ can therefore be written as a function $F$ of the last few bytes. The rolling hash function $F$ is constructed so that the influence of one of its terms can be removed. Thus, given $r_p$, the value $r_{p+1}$ can be computed by removing the influence of $b_{p-s}$, expressed as the function $X(b_{p-s})$, and adding the influence of $b_{p+1}$, expressed as the function $Y(b_{p+1})$, as shown in the following equations.

$$r_p = F(b_p, b_{p-1}, b_{p-2}, \ldots, b_{p-s})$$

$$r_{p+1} = r_p - X(b_{p-s}) + Y(b_{p+1})$$
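The incremental update of a rolling hash can be illustrated with a deliberately simple choice of functions. In the sketch below the rolling hash is just the sum of the bytes currently in the window, so removing a byte's influence is a subtraction and adding one is an addition. This choice, the `window` parameter, and the trigger condition `r % block_size == block_size - 1` are assumptions for illustration; the patent does not specify them.

```python
from collections import deque


def rolling_hashes(data: bytes, window: int = 7):
    """Yield the rolling hash at each position (sum over a sliding window)."""
    state = 0
    recent = deque()
    for byte in data:
        recent.append(byte)
        state += byte                    # add the incoming byte's influence
        if len(recent) > window:
            state -= recent.popleft()    # remove the outgoing byte's influence
        yield state


def split_on_trigger(data: bytes, block_size: int, window: int = 7):
    """Cut data into pieces wherever the rolling hash hits a trigger value."""
    pieces, start = [], 0
    for pos, r in enumerate(rolling_hashes(data, window)):
        if r % block_size == block_size - 1:   # assumed trigger condition
            pieces.append(data[start:pos + 1])
            start = pos + 1
    if start < len(data):
        pieces.append(data[start:])            # trailing piece, if any
    return pieces
```

Since each boundary depends only on the bytes in the window, inserting or deleting text in one place shifts only the nearby boundaries, which is why the piecewise hashes of similar functions stay mostly aligned.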

After the segments are determined, the Adler-32 algorithm is used as the weak hash. The final checksum is obtained by computing two 16-bit checksums A and B and concatenating their bits into a single 32-bit result. A is the sum of all bytes plus 1, and B is the sum of the values of A after each step. At the start of an Adler-32 computation, A is initialized to 1 and B to 0. The sums are taken modulo 65521 (the largest prime number smaller than $2^{16}$) and stored in big-endian byte order, with B occupying the two most significant bytes.
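The Adler-32 computation described above is short enough to write out directly. The sketch below is a plain Python version; the standard library's `zlib.adler32` computes the same checksum and can be used to cross-check it.

```python
import zlib

MOD_ADLER = 65521  # largest prime smaller than 2**16


def adler32(data: bytes) -> int:
    """Compute the Adler-32 checksum of data."""
    a, b = 1, 0                      # A starts at 1, B at 0
    for byte in data:
        a = (a + byte) % MOD_ADLER   # running sum of bytes, plus 1
        b = (b + a) % MOD_ADLER      # running sum of the A values
    return (b << 16) | a             # B in the two most significant bytes
```

For example, `adler32(b"Wikipedia")` and `zlib.adler32(b"Wikipedia")` both give `0x11E60398`.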

Hashing each segment: after the vulnerability function is segmented, a hash value is computed for each segment. The invention uses the Fowler-Noll-Vo (FNV) hash, which can hash large amounts of data quickly while keeping the collision rate low; its high dispersion makes it suitable for hashing very similar strings.

Compression mapping: after a hash value is computed for each segment of the vulnerability function, the result can optionally be shortened. The method keeps only the lowest 6 bits of the FNV hash and represents them with one ASCII character, which is the final hash result of the segment.
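The per-segment hash and the 6-bit compression can be sketched as follows, using the 32-bit FNV-1a variant with its standard offset basis and prime. Using the Base64 alphabet as the 64-character output set is an assumption; the patent only says that the lowest 6 bits are represented by one ASCII character.

```python
# Assumed 64-character output alphabet (the Base64 alphabet).
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5                          # FNV offset basis
    for byte in data:
        h ^= byte                           # xor first (the "1a" order)
        h = (h * 0x01000193) & 0xFFFFFFFF   # multiply by the FNV prime
    return h


def piece_char(piece: bytes) -> str:
    """Map a segment to one character using the lowest 6 bits of its hash."""
    return B64[fnv1a_32(piece) & 0x3F]
```

Keeping only 6 of 32 bits means unrelated segments collide with probability 1/64, which is why the later matching stage requires a sufficiently long common substring before scoring.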

Output: the final hash results of all segments are concatenated to obtain the fuzzy hash value of the vulnerability function. The fuzzy hash has the form BS:hash1:hash2, where:

BS is the block size. Only hash values with the same block size can be compared.

hash1 is the concatenation of the FNV-1a results (each mapped to one of 64 characters) for each block in the file.

hash2 is the same as hash1 but uses twice the block size. It is included because a small change can halve or double the block size; if that happens, at least one part of the two signatures can still be compared.

The difference structure is then processed. The vulnerability function fingerprint consists of the fuzzy hash of the function body structure and the difference structure.

Fig. 3 is a flowchart of the target function fingerprint generation in Fig. 1, illustrating how the target function is processed to generate its fingerprint.

The process begins by using a Python regular expression to match and remove the comments in the source code to be detected.

Secondly, abstract syntax trees of all target functions are generated from the C source files and converted into token sequences, and the target function body, the target function source (i.e. the file location to which the target function belongs), function name, formal parameter list, local variable list, data type list, and function call list are extracted from the token sequences.

Thirdly, the function name, formal parameters, local variables, custom data types, and custom function calls in the body of the target function are replaced in turn to perform syntax abstraction.

Fourthly, in the target function after replacement, spaces, tabs, and line feeds are deleted, all "{" and "}" characters are deleted, and all characters are converted to lowercase.

Finally, after syntax abstraction and normalization, the syntax structure of the target function is retained and a fuzzy hash based on the CTPH algorithm is generated, by a process similar to that used for the vulnerability function body structure. The retained syntax structure and the fuzzy hash together form the intermediate representation of the target function.

Fig. 4 is a flowchart of vulnerability detection based on function fingerprints in fig. 1, which illustrates how vulnerability detection is performed according to function fingerprints.

The flow starts with fuzzy matching based on the Wagner-Fischer algorithm, which is carried out in five steps.

The first step compares the block sizes. Only hash values computed for the same block size can be compared, and a fuzzy hash string contains hash values for both the block size and twice the block size. Therefore an attempt is made to match at least one pair of hash values; if the two fuzzy hashes have no block size in common, the comparison returns 0.
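The block-size check can be sketched as a small helper that, given two signatures of the form BS:hash1:hash2, returns the pairs of hash strings computed at a common block size. The function name `comparable_pairs` is hypothetical.

```python
def comparable_pairs(sig_a: str, sig_b: str):
    """Return the (hash_a, hash_b) pairs that share a block size (a sketch)."""
    bs_a, a1, a2 = sig_a.split(":", 2)
    bs_b, b1, b2 = sig_b.split(":", 2)
    bs_a, bs_b = int(bs_a), int(bs_b)
    if bs_a == bs_b:
        return [(a1, b1), (a2, b2)]   # both halves line up
    if bs_a == 2 * bs_b:
        return [(a1, b2)]             # a's hash1 and b's hash2 share a size
    if bs_b == 2 * bs_a:
        return [(a2, b1)]             # a's hash2 and b's hash1 share a size
    return []                         # no common block size: score is 0
```

An empty result corresponds to the case in which the comparison returns 0 without any further work.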

The second step deletes sequences of three or more identical characters, which carry little information about the input and bias the matching score.

The third step tests for a common substring of at least 7 characters; 7 is the default value and can be changed. If the longest common substring is shorter than this length, the function returns 0. Because the 32-bit FNV value is mapped to one of only 64 output characters, many collisions occur, and this check is one way to eliminate false positives.
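These two filtering steps might look like the following sketch. It assumes that a long run of identical characters is shortened to three characters rather than removed entirely, which is one reading of the second step; both function names and the `max_run` and `min_len` parameters are hypothetical.

```python
import re


def collapse_runs(s: str, max_run: int = 3) -> str:
    """Shorten runs of one character longer than max_run (assumed behavior)."""
    pattern = r"(.)\1{%d,}" % max_run     # a character repeated > max_run times
    return re.sub(pattern, lambda m: m.group(1) * max_run, s)


def has_common_substring(a: str, b: str, min_len: int = 7) -> bool:
    """True if a and b share any substring of at least min_len characters."""
    if len(a) < min_len or len(b) < min_len:
        return False
    windows = {a[i:i + min_len] for i in range(len(a) - min_len + 1)}
    return any(b[j:j + min_len] in windows
               for j in range(len(b) - min_len + 1))
```

Checking fixed-length windows is enough here, because any common substring longer than `min_len` necessarily contains a common window of exactly `min_len` characters.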

In the fourth step, the Wagner-Fischer algorithm is used to compute the Levenshtein distance, with the costs given by the recurrence below. The edit distance of two strings $a$ and $b$ is denoted $\mathrm{lev}_{a,b}(|a|, |b|)$, where $|a|$ and $|b|$ are the lengths of $a$ and $b$ respectively. The edit distance problem is to find a least-cost series of edit operations that converts string $a$ into string $b$; the allowed edit operations are character insertion, character deletion, and character replacement.

The edit distance of two strings $a$ and $b$ can be described mathematically as follows. Let $\mathrm{lev}_{a,b}(i, j)$ denote the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$, where $|a|$ is the length of $a$. Since string indices start from 1, the final edit distance is the distance at $i = |a|$, $j = |b|$:

$$\mathrm{lev}_{a,b}(i, j) = \begin{cases} \max(i, j) & \text{if } \min(i, j) = 0, \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$

When $\min(i, j) = 0$, one of the two prefixes (the first $i$ characters of $a$ or the first $j$ characters of $b$) is the empty string, so converting one into the other requires only $\max(i, j)$ single-character edit operations. The edit distance between them is therefore $\max(i, j)$, the larger of $i$ and $j$.

When $\min(i, j) \neq 0$, $\mathrm{lev}_{a,b}(i, j)$ is the minimum of the following three cases:

$\mathrm{lev}_{a,b}(i-1, j) + 1$, which corresponds to deleting $a_i$;

$\mathrm{lev}_{a,b}(i, j-1) + 1$, which corresponds to inserting $b_j$;

$\mathrm{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)}$, which corresponds to replacing $a_i$ with $b_j$.

Here $1_{(a_i \neq b_j)}$ is an indicator function that equals 0 when $a_i = b_j$ and 1 when $a_i \neq b_j$.
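The recurrence translates directly into the Wagner-Fischer dynamic program. The sketch below uses unit costs for insertion, deletion, and replacement and keeps only two rows of the table; the function name is an assumption.

```python
def wagner_fischer(a: str, b: str) -> int:
    """Levenshtein distance between a and b with unit edit costs."""
    # prev[j] is the distance between the first i-1 chars of a
    # and the first j chars of b; row 0 is the empty-prefix case.
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)            # distance to the empty prefix of b
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # indicator term
            curr[j] = min(prev[j] + 1,         # delete a_i
                          curr[j - 1] + 1,     # insert b_j
                          prev[j - 1] + cost)  # replace a_i with b_j
        prev = curr
    return prev[len(b)]
```

For the strings "Angel" and "Angle" this returns 2, since two replacements are needed.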

In the fifth step, the edit distance is scaled to an output between 0 and 100, following the original fuzzy hash algorithm: 100 means that the vulnerability function and the target function are identical, and 0 means that they are not similar at all.

The resulting similarity score is used to judge whether the vulnerability function and the target function are similar.

A similarity threshold is set for fuzzy matching. If the similarity between the target function and the vulnerability function is lower than the threshold, the target function is directly judged to be dissimilar to the vulnerability function.
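The scaling formula of the original fuzzy hash algorithm is not reproduced in the text, so the sketch below substitutes a simple normalization by the longer string's length. This formula, both function names, and the `threshold` parameter are assumptions for illustration only.

```python
def similarity_score(distance: int, len_a: int, len_b: int) -> int:
    """Scale an edit distance to 0..100 (assumed normalization)."""
    longest = max(len_a, len_b)
    if longest == 0:
        return 100                      # two empty strings are identical
    score = 100 - (100 * distance) // longest
    return max(0, min(100, score))      # clamp to the documented range


def passes_fuzzy_filter(score: int, threshold: int) -> bool:
    """Keep a candidate only if its score reaches the chosen threshold."""
    return score >= threshold
```

With unit edit costs the distance never exceeds the longer length, so identical strings score 100 and strings with no characters in common score 0.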

Finally, exact matching is performed with multi-pattern matching based on the Aho-Corasick algorithm.

The syntax structure of the target function is taken as the main string, and all difference line codes in the vulnerability function fingerprint are taken as the pattern strings. If every deleted-line fingerprint is matched exactly in the syntax structure of the target function and no added-line fingerprint is matched there, the vulnerability is shown to exist in the target function.

The AC algorithm is implemented in two steps, preprocessing and matching.

Preprocessing: the keywords are built into a finite-state pattern matching machine, that is, an automaton containing all the keywords. The automaton has the following three main functions.

Goto: this function stores the edges of the trie built from all keywords. It is represented as a two-dimensional array that stores, for each current state and input character, the next state.

Failure: this function stores the state to fall back to when the current character has no edge in the trie. It is represented as a one-dimensional array that stores the fallback state for each state.

Output: this function stores the indices of all keywords that end at the current state. It is represented as a one-dimensional array in which the indices of all matching words are stored as a bitmap for each state.

Matching: the given text is run through the constructed automaton to find all matching words. Starting from the first character of the main string and the initial state 0 of the automaton, if a character matches, the automaton moves to the next state according to its goto function; if the state reached has an output, the matched pattern strings are output; if the character does not match, the automaton falls back recursively according to its failure function.
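A compact sketch of the automaton and of the decision rule follows. It uses dictionaries and sets in place of the two-dimensional array and bitmap mentioned above, which is an implementation choice made here for brevity; the function names are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for the given patterns."""
    goto, fail, output = [{}], [0], [set()]
    for idx, pat in enumerate(patterns):          # build the trie
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(idx)
    queue = deque(goto[0].values())               # depth-1 states fail to 0
    while queue:                                  # breadth-first failure links
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] |= output[fail[nxt]]      # inherit suffix matches
    return goto, fail, output


def find_patterns(text, patterns):
    """Return the indices of all patterns that occur in text."""
    goto, fail, output = build_automaton(patterns)
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]                   # follow failure links
        state = goto[state].get(ch, 0)
        found |= output[state]
    return found


def is_vulnerable(target_structure, deleted_lines, added_lines):
    """All deleted lines present and no added line present."""
    deleted = [p for p in deleted_lines if p]     # skip empty patterns
    added = [p for p in added_lines if p]
    found = find_patterns(target_structure, deleted + added)
    has_all_deleted = all(i in found for i in range(len(deleted)))
    has_any_added = any(i + len(deleted) in found for i in range(len(added)))
    return has_all_deleted and not has_any_added
```

A single pass over the target's syntax structure checks every difference line at once, so the cost of exact matching grows with the length of the target rather than with the number of patterns.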

Following the idea and implementation steps of the vulnerability detection method based on function-level code similarity, a prototype vulnerability detection system was built and run on five selected open-source projects.

The project sizes range from 13.1 MB to 965 MB, and the number of C functions ranges from 1,161 to 435,734.

The code clone detection precision was 77.3% and the recall was 75.6%.

Fingerprint generation took only 28 hours, and clone detection was performed on a 1 GB target.

The model can extend detection to large-scale code repositories and addresses the problem of balancing cost and performance in scenarios with frequent incremental code changes.

Based on open-source function-level vulnerability source code, and combining it with fuzzy hash values to design vulnerability function fingerprints, the invention provides a vulnerability detection method based on function-level code similarity.

First, open-source vulnerability functions and target function source code are preprocessed with custom syntax abstraction and normalization rules. Then, vulnerability function fingerprints and target function fingerprints are generated from the vulnerability function body and the added and deleted code lines in the corresponding patch file. The vulnerability detection process based on function fingerprints comprises two matching steps: fuzzy matching based on the Wagner-Fischer algorithm and multi-pattern exact matching based on the Aho-Corasick algorithm. In experimental verification, the code clone detection precision was 77.3% and the recall was 75.6%.

Fingerprint generation took only 28 hours and clone detection was performed on a 1 GB target, which verifies the correctness and effectiveness of the method.

The invention avoids generating a complex intermediate representation while retaining the basic syntactic structure, which preserves the performance of the detection model and, in particular, keeps detection precision unaffected by syntactically meaningless modifications. It can detect Type-1 to Type-3 clones and automatically distinguishes vulnerable code from patched code.

The scalability of vulnerability detection is improved while a low false-positive rate and a low false-negative rate are maintained.

The model can extend detection to large-scale code repositories and balances cost and performance in scenarios with frequent incremental code changes.

Claims

1. A vulnerability detection method based on function-level code similarity is characterized by comprising the following steps:

A. the vulnerability function fingerprint library construction module is oriented to C language codes containing vulnerabilities, adopts Python regular expressions to carry out matching and remove comments in extracted source codes, collects commit files and corresponding patch files of all CVE vulnerabilities from a CVE project library of Github to establish a vulnerability database, and extracts increase rows and delete rows in all vulnerability functions and corresponding patch files;

B. writing a syntax rule file of a C language by using Antlr4, generating an abstract syntax tree of all vulnerability functions from a source code file of the C language, then converting the abstract syntax tree into a token sequence, and carrying out syntax abstraction and normalization, wherein the concrete steps of syntax abstraction and normalization are as follows:

b-1. syntax abstraction: extracting a function body, namely a vulnerability source, namely a file position, a function name, a form parameter list, a local variable list, a data type list and a function call list from the token sequence; replacing function names with the notation funneame, replacing each parameter variable appearing within a function body with the notation forpar, replacing each local variable appearing within a function body with the notation LOVAR, replacing all custom data type declarations except those declared in the ISO C standard with the notation CUSTYPE, replacing each function call with the notation funclean, except for the C standard library functions;

b-2. normalization: deleting spaces, tab characters and line breaks, deleting all '{' and '}' characters, and converting all characters to lower case, so as to normalize the vulnerability function body;
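The abstraction and normalization rules above can be sketched as follows. This is a simplified illustration that assumes the token sequence and the identifier lists have already been extracted by a parser; the function names, the short standard-library list and the keyword list are assumptions made for the example. The placeholder spellings are copied from the claim text as published.

```python
import re

# Placeholder spellings copied from the claim text as published.
PLACEHOLDERS = {
    "function": "funneame",
    "param": "forpar",
    "local": "LOVAR",
    "type": "CUSTYPE",
    "call": "funclean",
}
# Illustrative subsets only; a real list would cover the whole C standard library.
C_STDLIB_CALLS = {"memcpy", "strlen", "malloc", "free", "printf"}
C_CALL_LIKE_KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}


def abstract_tokens(tokens, func_name, params, local_vars, custom_types):
    """Replace user-chosen identifiers with fixed placeholder symbols."""
    out = []
    for i, tok in enumerate(tokens):
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == func_name:
            out.append(PLACEHOLDERS["function"])
        elif tok in params:
            out.append(PLACEHOLDERS["param"])
        elif tok in local_vars:
            out.append(PLACEHOLDERS["local"])
        elif tok in custom_types:
            out.append(PLACEHOLDERS["type"])
        elif (next_tok == "(" and tok.isidentifier()
              and tok not in C_STDLIB_CALLS
              and tok not in C_CALL_LIKE_KEYWORDS):
            out.append(PLACEHOLDERS["call"])  # call to a non-standard function
        else:
            out.append(tok)
    return out


def normalize(tokens):
    """Drop whitespace and braces, then lower-case everything."""
    text = "".join(tokens)
    text = re.sub(r"[ \t\r\n]+", "", text)
    text = text.replace("{", "").replace("}", "")
    return text.lower()
```

Because the placeholders replace every user-chosen name, renaming a variable or a helper function in a cloned function leaves the normalized string unchanged.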

C. the syntax structure of the vulnerability function generated after syntax abstraction and normalization comprises a difference structure and a function body structure; for the latter, a fuzzy hash value based on the CTPH (context-triggered piecewise hashing) algorithm is generated as follows:

c-1. slicing: reading part of the content of the vulnerability function and computing a hash value with a weak hash algorithm; a rolling hash algorithm is used to generate a pseudo-random value from the current context of the input. Assume an input of $n$ characters, where the $i$-th byte of the input is denoted $b_i$; the input as a whole therefore consists of the bytes $b_1, b_2, \ldots, b_n$. At any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file; hence the rolling hash value is $r_p = F(b_p, b_{p-1}, \ldots, b_{p-s+1})$;

after the pieces are determined, the Adler-32 algorithm is used as the weak hash;

c-2. hashing each piece: after the vulnerability function is sliced, a hash value is computed for each piece using the Fowler-Noll-Vo (FNV) hash algorithm;

c-3. compression mapping: after a hash value has been computed for each piece of the vulnerability function, the result may optionally be compressed: only the lowest 6 bits of the FNV hash value are kept and represented by one ASCII character, which serves as the final hash result of that piece;

c-4. output: the final hash results of all pieces are concatenated to obtain the fuzzy hash value of the vulnerability function, which has the form BS:hash1:hash2, where BS is the block size, and only hashes with the same block size can be compared; hash1 is the concatenation of the final hash results of all pieces in the file; and hash2 is the same as hash1 but computed with twice the block size;
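The slicing, per-piece hashing, compression and output steps can be sketched together as below. This is an illustrative sketch under several assumptions: the rolling hash is a simplified Adler-32-style running sum over a 7-byte window rather than a verified port of any library, the 6-bit value is mapped to a Base64 alphabet character, and the block size is passed in as a parameter because the claim does not state how it is chosen. All names are hypothetical.

```python
WINDOW = 7  # number of trailing bytes the rolling hash depends on (assumed)
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
FNV_PRIME = 0x01000193   # 32-bit FNV prime
FNV_OFFSET = 0x811C9DC5  # 32-bit FNV offset basis
MASK = 0xFFFFFFFF


class RollingHash:
    """Weak hash whose value depends only on the last WINDOW bytes."""

    def __init__(self):
        self.window = [0] * WINDOW
        self.h1 = self.h2 = self.h3 = 0
        self.count = 0

    def update(self, byte: int) -> int:
        self.h2 = (self.h2 - self.h1 + WINDOW * byte) & MASK
        self.h1 = (self.h1 + byte - self.window[self.count % WINDOW]) & MASK
        self.window[self.count % WINDOW] = byte
        self.count += 1
        self.h3 = ((self.h3 << 5) & MASK) ^ byte
        return (self.h1 + self.h2 + self.h3) & MASK


def piece_hashes(data: bytes, block_size: int) -> str:
    """Slice data at rolling-hash trigger points and emit one char per piece."""
    roll = RollingHash()
    fnv = FNV_OFFSET
    pending = False  # True while the current piece holds unhashed-out bytes
    out = []
    for byte in data:
        fnv = ((fnv * FNV_PRIME) & MASK) ^ byte  # FNV-1 update step
        pending = True
        if roll.update(byte) % block_size == block_size - 1:
            out.append(B64[fnv & 0x3F])  # keep only the lowest 6 bits
            fnv = FNV_OFFSET
            pending = False
    if pending:
        out.append(B64[fnv & 0x3F])  # trailing piece without a trigger point
    return "".join(out)


def fuzzy_hash(data: bytes, block_size: int = 3) -> str:
    """Return a signature of the form BS:hash1:hash2."""
    h1 = piece_hashes(data, block_size)
    h2 = piece_hashes(data, 2 * block_size)
    return f"{block_size}:{h1}:{h2}"
```

Since the piece boundaries are triggered by local context, editing one region of the input changes only the characters for the pieces around that region, which is what makes the signatures comparable by edit distance.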

D. a target function fingerprint generation module generates the target function fingerprint: for the C source code to be detected, comments are removed, an abstract syntax tree is generated for every target function from the C source files and converted into a token sequence, and syntax abstraction and normalization are performed on the target function to be detected as follows:

d-1. syntax abstraction: extracting from the token sequence the target function body and the source of the target function, namely the location of the file to which it belongs, the function name, the formal parameter list, the local variable list, the data type list and the function call list; replacing, in turn, the function name, the formal parameters, the local variables, the custom data types and the custom function calls in the body of the target function to achieve syntax abstraction;

d-2. normalization: deleting spaces, tab characters and line breaks from the target function after variable replacement, deleting all '{' and '}' characters, and converting all characters to lower case;

after syntax abstraction and normalization, the syntax structure of the generated target function is retained and a fuzzy hash based on the CTPH algorithm is also generated; the two together form the intermediate representation of the target function, wherein the fuzzy hash value based on the CTPH algorithm is generated as follows:

d-3. slicing: reading part of the content of the target function and computing a hash value with a weak hash algorithm; a rolling hash algorithm is used to generate a pseudo-random value from the current context of the input. Assume an input of $n$ characters, where the $i$-th byte of the input is denoted $b_i$; the input as a whole therefore consists of the bytes $b_1, b_2, \ldots, b_n$. At any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file; hence the rolling hash value is $r_p = F(b_p, b_{p-1}, \ldots, b_{p-s+1})$;

after the pieces are determined, the Adler-32 algorithm is used as the weak hash;

d-4. hashing each piece: after the target function is sliced, a hash value is computed for each piece using the Fowler-Noll-Vo (FNV) hash algorithm;

d-5. compression mapping: after a hash value has been computed for each piece of the target function, the result may optionally be compressed: only the lowest 6 bits of the FNV hash value are kept and represented by one ASCII character, which serves as the final hash result of that piece;

d-6. output: the final hash results of all pieces are concatenated to obtain the fuzzy hash value of the target function, which has the form BS:hash1:hash2, where BS is the block size, and only hash values with the same block size can be compared; hash1 is the concatenation of the final hash results of all pieces in the file; and hash2 is the same as hash1 but computed with twice the block size;

E. vulnerability detection based on function fingerprints comprises two matching steps: fuzzy matching based on the Wagner-Fischer algorithm and multi-pattern exact matching based on the Aho-Corasick algorithm; a vulnerability is reported as detected only after both matching steps succeed;

the fuzzy matching based on the Wagner-Fischer algorithm specifically comprises the following steps:

e-1. comparing block sizes: only hash values computed with the same block size can be compared; since one fuzzy hash string contains hash values for both the block size and twice the block size, the comparison attempts to match at least one pair of hash values that share a block size, and returns 0 if the two strings have no block size in common;

e-2. deleting sequences of three or more identical characters, which carry almost no information about the document and would bias the matching score;

e-3. testing for a common substring of at least 7 characters (the default value, which can be changed), and returning 0 if the longest common substring is shorter than this length;

e-4. the edit distance between two strings $a$ and $b$ is expressed as $\mathrm{lev}_{a,b}(|a|,|b|)$, where $|a|$ and $|b|$ are the lengths of $a$ and $b$ respectively; the edit distance problem is to find the least costly sequence of edit operations that converts string $a$ into string $b$, the allowed edit operations being character insertion, character deletion and character replacement; for the two strings $a$ and $b$, $\mathrm{lev}_{a,b}(i,j)$ is defined as the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$;

e-5. following the original fuzzy hash algorithm, the edit distance is scaled so that the output lies between 0 and 100, where 100 means that the vulnerability function and the target function are identical and 0 means that they are completely dissimilar; a similarity threshold is set for the fuzzy matching, and if the similarity is below this threshold, the target function is directly judged to have no similarity to the vulnerability function;
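The fuzzy matching stage (block-size check, removal of repeated-character runs, common-substring test, edit distance, scaling) can be sketched as below. Several points are assumptions made for illustration: the run removal is read literally as deleting every run of three or more identical characters, the 0 to 100 score uses a simple length-based normalization rather than the scaling of the original fuzzy hash algorithm, and all function names are hypothetical. Only `wagner_fischer` is the standard dynamic-programming edit distance.

```python
import re


def drop_runs(s: str) -> str:
    """Delete every run of three or more identical characters."""
    return re.sub(r"(.)\1{2,}", "", s)


def has_common_substring(a: str, b: str, n: int) -> bool:
    """True if a and b share at least one substring of length n."""
    grams = {a[i:i + n] for i in range(len(a) - n + 1)}
    return any(b[j:j + n] in grams for j in range(len(b) - n + 1))


def wagner_fischer(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and replacement."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # replace or keep
        prev = cur
    return prev[-1]


def compare(sig_a: str, sig_b: str, min_common: int = 7) -> int:
    """Return a similarity score from 0 to 100 for two BS:hash1:hash2 strings."""
    bs_a, a1, a2 = sig_a.split(":")
    bs_b, b1, b2 = sig_b.split(":")
    bs_a, bs_b = int(bs_a), int(bs_b)
    if bs_a == bs_b:
        pairs = [(a1, b1), (a2, b2)]
    elif bs_a == 2 * bs_b:
        pairs = [(a1, b2)]
    elif bs_b == 2 * bs_a:
        pairs = [(a2, b1)]
    else:
        return 0  # no block size in common
    best = 0
    for x, y in pairs:
        x, y = drop_runs(x), drop_runs(y)
        if not has_common_substring(x, y, min_common):
            continue
        longest = max(len(x), len(y))
        distance = wagner_fischer(x, y)
        # Assumed scaling: the distance never exceeds the longer length.
        best = max(best, 100 * (longest - distance) // longest)
    return best
```

A caller would compare the returned score with the chosen similarity threshold and proceed to exact matching only when the score reaches it.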

the multi-pattern matching based on the Aho-Corasick algorithm proceeds as follows: the syntax structure of the target function is taken as the main string and all difference code lines in the vulnerability function fingerprint are taken as the pattern strings, and an AC automaton is constructed to perform multi-pattern exact matching, specifically:

e-6. preprocessing: constructing a finite-state pattern matching machine from the keywords, that is, an automaton containing all the keywords;

e-7. matching: traversing the given text on the constructed automaton to find all matching words: starting from the first character of the main string and the initial state 0 of the automaton, if a character matches, moving to the next state according to the goto function of the automaton; if the state reached has an associated output function, outputting the matched pattern strings; and if a character fails to match, following the failure function of the automaton recursively.
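The preprocessing and matching steps of the Aho-Corasick stage can be sketched with a compact automaton. This is a generic textbook-style implementation offered for illustration, not the patented implementation; the class and attribute names are hypothetical, and patterns are assumed to be non-empty.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern exact matcher with goto, failure and output functions."""

    def __init__(self, patterns):
        self.goto = [{}]     # goto[state][char] -> next state
        self.fail = [0]      # failure function per state
        self.output = [[]]   # patterns recognised on reaching each state
        for pattern in patterns:
            self._add(pattern)
        self._build_failure_links()

    def _add(self, pattern):
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.output[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first order guarantees shallower states are finished first.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.output[nxt] = self.output[nxt] + self.output[self.fail[nxt]]

    def search(self, text):
        """Return (start_index, pattern) for every occurrence in text."""
        state = 0
        found = []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]  # follow failure links on mismatch
            state = self.goto[state].get(ch, 0)
            for pattern in self.output[state]:
                found.append((i - len(pattern) + 1, pattern))
        return found
```

In the setting of this claim, the text would be the normalized syntax structure of the target function and the patterns would be the normalized difference code lines of one vulnerability fingerprint. How the resulting matches are combined into a final verdict is not modeled in this sketch.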