CN113901474B - Vulnerability detection method based on function-level code similarity - Google Patents
- Publication number: CN113901474B (application CN202111071388.0A)
- Authority: CN (China)
- Legal status: Active (an assumption by Google, not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/033—Test or assess software
Abstract
The invention discloses a vulnerability detection method based on function-level code similarity, belonging to the technical field of computer network security and comprising the following steps: first, the source code of open-source vulnerability functions and of the functions to be detected is preprocessed using self-defined syntax abstraction and normalization rules; then, vulnerability-function fingerprints and fingerprints of the functions to be detected are generated from each vulnerability function body together with the added-line and deleted-line code in the corresponding patch file; finally, vulnerability detection over the function fingerprints is realized through fuzzy matching based on the Wagner-Fischer algorithm and multi-pattern exact matching based on the Aho-Corasick algorithm. The invention avoids generating a complex intermediate representation while preserving the basic syntax structure, ensuring the performance of the detection model and, in particular, ensuring that detection precision is not affected by syntactically meaningless modifications. The scalability of vulnerability detection is improved while low false-positive and false-negative rates are ensured.
Description
Technical Field
The invention relates to the field of computer network security, in particular to a vulnerability detection method based on function-level code similarity. The invention avoids generating a complex intermediate representation while preserving the basic syntax structure, ensuring the performance of the detection model and, in particular, ensuring that detection precision is not affected by syntactically meaningless modifications; it can perform type 1-3 clone detection while automatically distinguishing vulnerable code from patched code. The scalability of vulnerability detection is improved while low false-positive and false-negative rates are ensured.
Background
Over the past few years, the number of open-source software (OSS) programs has increased rapidly. This significant increase naturally leads to more software vulnerabilities propagated through code cloning, posing a serious threat to the security of software systems. Typical software vulnerabilities include missing validation of user input, inadequate logging mechanisms, improper error handling, and failure to properly close database connections. Code cloning is the act of copying and pasting existing code from other software; used correctly, it can greatly improve development efficiency and shorten the development cycle. In practice, however, code cloning is often viewed as poor programming practice because it increases maintenance costs, reduces code quality, creates potential legal conflicts, and can even propagate software vulnerabilities. In particular, since OSS programs are widely reused as code libraries in software development, code cloning is becoming one of the main causes of software vulnerabilities.
Conventional code-similarity detection generally converts the target code into an intermediate representation, such as a parse tree or a program control-flow graph, then analyzes that representation and checks whether it matches some predefined vulnerability rule to determine whether the source program contains the corresponding vulnerability. A complex intermediate representation helps improve detection accuracy but incurs a higher computational cost, while a more abstract code representation improves efficiency but loses part of the vulnerability's semantic information and cannot distinguish vulnerable code from patched code.
Disclosure of Invention
In view of this, the embodiments of the present application provide a vulnerability detection method based on function-level code similarity, which aims to balance efficiency and accuracy at an acceptable cost and to effectively detect the common variation patterns in code cloning. The scalability of vulnerability detection is improved while low false-positive and false-negative rates are ensured.
The relevant definitions referred to in the present invention are as follows.
Definition 1: Antlr4 is an open-source parser-generation tool written in Java that can generate a corresponding parser from a grammar rule file.
Definition 2: abstract Syntax Tree (AST) refers to a Tree-like representation that describes the Syntax structure of a program code to analyze the source code structure from the Syntax Tree point of view. For example, a conditional statement in the form of an if-else may be represented using two branch nodes in the AST.
Definition 3: the Fowler-Noll-Vo hash (abbreviated FNV hash) can quickly hash large amounts of data while maintaining a low collision rate; its high dispersion makes it suitable for hashing very similar strings, such as URLs, hostnames, filenames, text, and IP addresses.
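The FNV hash of Definition 3 can be sketched in a few lines. The constants below are the published 32-bit FNV-1a offset basis and prime; the function name is ours, and the patent does not specify which FNV variant is used, so this is an illustrative assumption:

```python
# Illustrative 32-bit FNV-1a hash (variant assumed; constants are the
# published offset basis and prime for the 32-bit FNV family).
FNV_PRIME_32 = 0x01000193
FNV_OFFSET_32 = 0x811C9DC5

def fnv1a_32(data: bytes) -> int:
    """Hash a byte string with 32-bit FNV-1a."""
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte                          # xor the byte into the state
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # multiply by the FNV prime, mod 2^32
    return h
```

Note how even near-identical inputs such as "Angel" and "Angle" produce widely separated hash values — the high dispersion the definition refers to.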
Definition 4: difference-line code. A patch file is composed of one or more difference blocks, each of which is a sequence of specially marked code lines: lines beginning with "+" indicate added code and lines beginning with "-" indicate deleted code, collectively referred to herein as difference-line code.
Definition 5: context-triggered segment hashes are created by setting the boundaries of traditional segment hashes using rolling hashes based on a content-segmented hashing (CTPH algorithm), which can be used to identify ordered homologous sequences between unknown inputs and known files, even if the unknown files are modified versions of known files.
Definition 6: the Wagner-Fischer algorithm (abbreviated WF algorithm) finds a series of least-costly edit operations that convert a string a into a string b, where the allowed edit operations are character insertion, character deletion, and character replacement. For example, the WF value between the string S1 "Angel" and the string S2 "Angle" is 2.
Definition 7: the AC automaton (Aho-Corasick automaton) implements a multi-pattern matching algorithm used to find, within an input string, all substrings belonging to a finite "dictionary" set. It differs from ordinary string matching in that it matches against all dictionary strings simultaneously.
The technical scheme of the invention is as follows: a vulnerability detection method based on function-level code similarity comprises the following steps.
Step one, building a vulnerability function fingerprint database.
(1) For C-language code containing vulnerabilities, match and remove the comments in the extracted source code using Python regular expressions.
(2) Collect the commit files of all CVE vulnerabilities and the corresponding patch files from the CVE project repository on Github to establish a vulnerability database, and extract all vulnerability functions together with the added and deleted lines in the corresponding patch files.
(3) Write a C-language grammar rule file using Antlr4, generate abstract syntax trees for all functions from the C source files, convert them into token sequences, and extract from them the function body, the vulnerability source (i.e., the file location), the function name, the formal parameter list, the local variable list, the data type list, and the function call list.
(4) Perform syntax abstraction according to the following steps: replace the function name with the symbol FUNCNAME; replace each formal parameter variable appearing within the function body with the symbol FORPAR; replace each local variable appearing within the function body with the symbol LOVAR; replace all custom data type declarations, except those declared in the ISO C standard, with the symbol CUSTYPE; and replace each function call, except calls to C standard library functions, with the symbol FUNCALL.
(5) Delete spaces, tabs, and line feeds, delete all "{" and "}" characters, and convert all characters to lowercase to normalize the function body.
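The abstraction and normalization steps above can be sketched as follows — a minimal illustration that assumes the name lists have already been extracted from the AST by the Antlr4 parser, and renders the replacement symbols as FUNCNAME, FORPAR, and LOVAR (the exact spellings and the CUSTYPE/FUNCALL handling are omitted for brevity):

```python
import re

def abstract_and_normalize(body, func_name, params, local_vars):
    """Sketch of the patent's syntax abstraction (steps 4) and
    normalization (step 5), given name lists extracted from the AST."""
    # Step (4): replace identifiers with fixed symbols, whole words only.
    out = re.sub(rf'\b{re.escape(func_name)}\b', 'FUNCNAME', body)
    for p in params:
        out = re.sub(rf'\b{re.escape(p)}\b', 'FORPAR', out)
    for v in local_vars:
        out = re.sub(rf'\b{re.escape(v)}\b', 'LOVAR', out)
    # Step (5): drop whitespace and braces, then lowercase everything.
    out = re.sub(r'[ \t\n]', '', out)
    out = out.replace('{', '').replace('}', '')
    return out.lower()
```

For example, `abstract_and_normalize("int add(int a, int b) { return a + b; }", "add", ["a", "b"], [])` yields a compact string in which renaming variables or reformatting whitespace no longer changes the result — the "syntactically meaningless modifications" the patent aims to neutralize.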
(6) The vulnerability-function syntax structure generated after syntax abstraction and normalization comprises two parts, a difference structure and a function-body structure; for the latter, a fuzzy hash value based on the CTPH algorithm is generated, with segmentation performed by a rolling hash algorithm. Suppose the input consists of $n$ bytes, the $i$-th of which is denoted $b_i$, so that the input as a whole is $b_1 b_2 \ldots b_n$. At any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file, so the rolling hash value $r_p$ can be expressed as a function of those bytes, as shown in the following equation: $r_p = F(b_p, b_{p-1}, \ldots, b_{p-s+1})$.
Step two, generating the fingerprint of the target function.
(1) Remove the comments from the C-language source code to be detected.
(2) Generate abstract syntax trees for all target functions from the C source files, convert them into token sequences, and extract from them the target function body, the target function source (i.e., the file location), the function name, the formal parameter list, the local variable list, the data type list, and the function call list.
(3) Replace, in order, the function name, formal parameters, local variables, custom data types, and custom function calls in the target function body to realize syntax abstraction.
(4) Normalize the target function after variable replacement.
(5) After syntax abstraction and normalization, the syntax structure of the target function is preserved and a CTPH-based fuzzy hash is generated at the same time; together these form the intermediate representation of the target function.
Here segmentation again uses the rolling hash algorithm: suppose the input consists of $n$ bytes, the $i$-th of which is denoted $b_i$, so that the input as a whole is $b_1 b_2 \ldots b_n$. At any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file, so the rolling hash value $r_p$ can be expressed as a function of those bytes: $r_p = F(b_p, b_{p-1}, \ldots, b_{p-s+1})$.
Step three, detecting vulnerabilities based on function fingerprints.
(1) Compute the Wagner-Fischer value between the fuzzy hash of the function body in each vulnerability-function fingerprint and the fuzzy hash in the target-function fingerprint to obtain a similarity score, and judge whether the two functions are similar.
The Wagner-Fischer value of two strings $a$ and $b$ can be described in the following mathematical language: define $\mathrm{lev}_{a,b}(i, j)$ as the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$, where $|a|$ is the length of $a$. Since string indices start from 1, the final edit distance is the distance when $i = |a|$ and $j = |b|$, namely $\mathrm{lev}_{a,b}(|a|, |b|)$.
(2) Take the syntax structure of the target function as the main string and all difference code lines in the vulnerability-function fingerprint as multiple pattern strings, and construct an AC automaton to perform multi-pattern exact matching. The AC algorithm is realized in two steps, preprocessing and matching. Preprocessing: construct the keywords into a finite-state pattern-matching machine, i.e., an automaton containing all keywords. Matching: traverse the given text on the built automaton to find all matching words. Starting from the first character of the main string and initial state 0 of the automaton, if a character matches successfully, transfer to the next state according to the automaton's goto function; if the reached state has an associated output function, output the matched pattern string; if the character fails to match, fall back recursively according to the automaton's failure function.
The advantages of the invention are mainly as follows.
Self-defined syntax abstraction and normalization rules are provided, which avoid generating a complex intermediate representation while keeping the basic syntax structure, eliminating the influence of code-clone modifications, ensuring the performance of the detection model, and in particular ensuring that detection precision is not affected by syntactically meaningless modifications.
Vulnerability-function fingerprints based on code differences are provided. In many cases there is little difference between code containing a vulnerability and the patched code; a vulnerability may be eliminated by inserting a single if statement. Many security vulnerabilities are also highly sensitive to constants and statement order, so detecting type 1-3 code-clone vulnerabilities without distinguishing vulnerable code from patched code results in a high false-negative rate. The method generates the vulnerability fingerprint from the vulnerability function body together with the added-line and deleted-line code in the corresponding patch file (collectively, difference-line code), enabling type 1-3 clone detection while distinguishing vulnerable code from patched code.
Drawings
FIG. 1 is a system flow diagram.
Fig. 2 is a flowchart of building a vulnerability function fingerprint database in block 1001 of fig. 1.
Fig. 3 is a flowchart of the target function fingerprint generation process in block 1002 of Fig. 1.
Fig. 4 is a flowchart illustrating a vulnerability detection process based on a function fingerprint in block 1003 of fig. 1.
Detailed Description
The present invention will be further explained below with reference to the drawings and examples.
FIG. 1 is a flow chart of the overall system of the present invention.
The vulnerability-function fingerprint database construction module collects the commit files and corresponding patch files of all CVE vulnerabilities from the CVE project repository on Github to establish a vulnerability database, generates vulnerability-function fingerprints based on the CTPH algorithm and code differences, and builds the vulnerability-function fingerprint database.
The target-function fingerprint generation module generates target-function fingerprints based on the CTPH algorithm.
Vulnerability detection based on function fingerprints comprises two matching steps, fuzzy matching and exact matching; a vulnerability is successfully detected only after both matching steps succeed.
Fig. 2 is a flowchart of the vulnerability function fingerprint database construction in fig. 1, which illustrates how to construct the vulnerability function fingerprint database.
The process begins by using Python regular expressions to match and remove the comments in the extracted vulnerability-function source code. One kind is the single-line comment introduced by the "//" characters, and the other is the multi-line comment bracketed by "/*" and "*/".
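This comment-removal step can be sketched with a Python regular expression of the kind the patent describes; the pattern below is an assumed minimal version that handles the two comment forms but, for simplicity, does not guard against comment-like text inside string literals:

```python
import re

# Match /* ... */ block comments (non-greedy, across lines thanks to
# DOTALL) and // line comments up to the end of the line.
COMMENT_RE = re.compile(
    r'/\*.*?\*/'      # multi-line comments bracketed by /* and */
    r'|//[^\n]*',     # single-line comments introduced by //
    re.DOTALL,
)

def strip_comments(source: str) -> str:
    """Remove both C comment styles from a source string."""
    return COMMENT_RE.sub('', source)
```

A production implementation would additionally need to skip matches occurring inside string and character literals.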
Second, collect the commit files and corresponding patch files of all CVE vulnerabilities from the CVE project repository on Github to establish a vulnerability database, and extract all vulnerability functions together with the added and deleted lines in the corresponding patch files.
Third, write a C-language grammar rule file using Antlr4, generate abstract syntax trees for all vulnerability functions from the C source files, convert them into token sequences, and extract the function body, the vulnerability source (i.e., the file location), the function name, the formal parameter list, the local variable list, the data type list, and the function call list.
Fourth, perform syntax abstraction according to the following steps: replace the function name with the symbol FUNCNAME; replace each formal parameter variable appearing within the function body with the symbol FORPAR; replace each local variable appearing within the function body with the symbol LOVAR; replace all custom data type declarations, except those declared in the ISO C standard, with the symbol CUSTYPE; and replace each function call, except calls to C standard library functions, with the symbol FUNCALL.
Fifth, delete spaces, tabs, and line feeds, delete all "{" and "}" characters, and convert all characters to lowercase so as to normalize the vulnerability function body.
Finally, the vulnerability-function syntax structure generated after syntax abstraction and normalization comprises a difference structure and a function-body structure; for the latter, a fuzzy hash value based on the CTPH algorithm is generated, with the concrete process as follows.
Fragmenting: reading a part of content in the loophole function, and calculating by a weak hash algorithm to obtain a hash value.
Files cannot be split using fixed-length blocks; instead, a pseudo-random value is generated from the current context of the input using a rolling hash algorithm. The rolling hash works by maintaining a state that depends only on the last few bytes of the input: each byte is added to the state as it is processed and removed from the state after a fixed number of subsequent bytes have been processed.
Suppose the input consists of $n$ bytes, the $i$-th of which is denoted $b_i$, so that the input as a whole is $b_1 b_2 \ldots b_n$. At any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file, so the rolling hash value $r_p$ can be expressed as a function of those bytes: $r_p = F(b_p, b_{p-1}, \ldots, b_{p-s+1})$. The rolling hash function is constructed so that the influence of individual terms can be removed. Thus, given $r_p$, the next value $r_{p+1}$ can be computed by removing the influence of the oldest byte $b_{p-s+1}$ (expressed as a function $X$) and adding the influence of the new byte $b_{p+1}$ (expressed as a function $Y$), as shown in the following equation: $r_{p+1} = Y(X(r_p, b_{p-s+1}), b_{p+1})$.
Once the segmentation boundaries are determined, the Adler-32 algorithm is used as the weak hash. The final checksum is obtained by computing two 16-bit checksums $A$ and $B$ and concatenating their bits into a single 32-bit result. In this algorithm, $A$ is 1 plus the sum of all bytes, and $B$ is the sum of the values of $A$ after each step; $A$ is initialized to 1 and $B$ to 0. The sums are taken modulo 65521 (the largest prime number not exceeding $2^{16}$), and the result is stored in big-endian byte order, with $B$ occupying the 2 most significant bytes.
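The rolling-window behavior described above can be sketched as follows. This is an Adler-style checksum adapted to roll over a fixed window, so that the state depends only on the last few bytes (the functions $X$ and $Y$ above); the window size, the byte weighting, and the exact update rule are illustrative assumptions, not the patent's precise construction:

```python
from collections import deque

class RollingHash:
    """Minimal sketch of a rolling checksum whose state depends only on
    the last `window` bytes (rsync/Adler style; constants are assumed)."""
    MOD = 65521  # largest prime below 2**16, as in Adler-32

    def __init__(self, window=4):
        self.window = window
        self.buf = deque()
        self.a = 0  # sum of the bytes currently in the window
        self.b = 0  # position-weighted sum of the bytes in the window

    def roll(self, byte):
        if len(self.buf) == self.window:
            old = self.buf.popleft()
            # X: remove the influence of the byte leaving the window
            self.a = (self.a - old) % self.MOD
            self.b = (self.b - self.window * old) % self.MOD
        # Y: add the influence of the incoming byte
        self.buf.append(byte)
        self.a = (self.a + byte) % self.MOD
        self.b = (self.b + self.a) % self.MOD
        # 32-bit digest with B in the two most significant bytes
        return (self.b << 16) | self.a
```

The key property, which the test below exercises, is that after processing a long stream the hash value equals the value obtained from just the last `window` bytes alone.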
Hashing each segment: after the vulnerability function is segmented, a hash value must be computed for each segment. The present invention uses the hash algorithm called the Fowler-Noll-Vo (FNV) hash, which can quickly hash large amounts of data while maintaining a low collision rate; its high dispersion makes it suitable for hashing very similar strings.
Compression mapping: after a hash value is computed for each vulnerability-function segment, the result can optionally be shortened; the method takes only the lowest 6 bits of the FNV value and represents them as a single ASCII character, which serves as the final hash result of the segment.
Output: the final hash results of all segments are concatenated to obtain the fuzzy hash value of the vulnerability function, which has the form BS:hash1:hash2. BS is the block size; only hash values with the same block size can be compared. hash1 is the concatenation of the FNV-1a results (each mapped to one of 64 characters) for every block of the file. hash2 is the same as hash1 but computed with twice the block size. This second result is recorded because a small change to the input can halve or double the chosen block size; if that occurs, at least part of the two signatures can still be compared. The difference structure is processed likewise, and the vulnerability-function fingerprint consists of the fuzzy hashes of the function-body structure and the difference structure.
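The compression mapping and output format described above can be sketched as follows. This is a hypothetical composition step that assumes the segments have already been produced by the rolling hash; the 64-character alphabet and the low-6-bit mapping follow the text, and the FNV-1a helper is restated here so the sketch is self-contained:

```python
# 64-character alphabet for the 6-bit compression mapping (assumed order).
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash of a byte string."""
    h = 0x811C9DC5
    for byte in data:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h

def ctph_digest(pieces_small, pieces_big, block_size):
    """Hypothetical BS:hash1:hash2 digest: one character per segment,
    using only the lowest 6 bits of each segment's FNV hash."""
    h1 = "".join(B64[fnv1a_32(p) & 0x3F] for p in pieces_small)
    h2 = "".join(B64[fnv1a_32(p) & 0x3F] for p in pieces_big)
    return f"{block_size}:{h1}:{h2}"
```

For example, a function split into two segments at block size 3 (and one segment at block size 6) yields a digest whose first field is "3" and whose second and third fields contain one character per segment.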
Fig. 3 is a flowchart of the target-function fingerprint generation in Fig. 1, illustrating how the target function is processed to generate its fingerprint.
The process begins by using Python regular expressions to match and remove the comments in the source code to be detected.
Second, generate abstract syntax trees for all target functions from the C source files, convert them into token sequences, and extract from them the target function body, the target function source (i.e., the file location), the function name, the formal parameter list, the local variable list, the data type list, and the function call list.
Third, replace, in order, the function name, formal parameters, local variables, custom data types, and custom function calls in the target function body to realize syntax abstraction.
Fourth, delete the spaces, tabs, and line feeds in the target function after variable replacement, delete all "{" and "}" characters, and convert all characters to lowercase.
Finally, after syntax abstraction and normalization, the syntax structure of the target function is preserved and a CTPH-based fuzzy hash is generated at the same time; the concrete process is the same as for the vulnerability function body, and together they form the intermediate representation of the target function.
Fig. 4 is a flowchart of vulnerability detection based on function fingerprints in fig. 1, which illustrates how vulnerability detection is performed according to function fingerprints.
The flow starts with fuzzy matching based on the Wagner-Fischer algorithm, which proceeds in five steps.
The first step compares block sizes. Only hash values computed with the same block size can be compared, and each fuzzy-hash string contains hash values for both the block size and double the block size. The comparison therefore tries to match hash values at a common block size; if the two hashes have no block size in common, the comparison returns 0.
The second step deletes sequences of three or more identical characters, which carry little information about the document and bias the matching score.
The third step tests for a common substring of at least 7 characters (the default value, which can be altered). If the longest common substring is shorter than this length, the comparison returns 0; since the 32-bit FNV values are mapped into an output alphabet of 64 characters, many collisions occur, and this check is one way to eliminate false positives.
In the fourth step, the Wagner-Fischer algorithm is used to compute the Levenshtein distance with the following weights. Denote the edit distance of two strings $a$ and $b$ as $\mathrm{lev}_{a,b}(|a|, |b|)$, where $|a|$ and $|b|$ are the lengths of $a$ and $b$ respectively. The edit-distance problem is to find a series of least-costly edit operations that convert string $a$ into string $b$; the allowed edit operations are character insertion, character deletion, and character replacement.
The edit distance of two strings $a$ and $b$ can be described in the following mathematical language.
Define $\mathrm{lev}_{a,b}(i, j)$ as the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$. Since string indices start from 1, the final edit distance is the distance when $i = |a|$ and $j = |b|$, namely $\mathrm{lev}_{a,b}(|a|, |b|)$.
When $\min(i, j) = 0$, one of the prefixes (the first $i$ characters of $a$ or the first $j$ characters of $b$) is the empty string; converting one into the other then requires only single-character edit operations, so the edit distance is $\max(i, j)$, i.e., the larger of $i$ and $j$.
When $\min(i, j) \neq 0$, $\mathrm{lev}_{a,b}(i, j)$ is the minimum of the following three cases:
$$\mathrm{lev}_{a,b}(i, j) = \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1 & \text{(deletion)} \\ \mathrm{lev}_{a,b}(i, j-1) + 1 & \text{(insertion)} \\ \mathrm{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)} & \text{(replacement)} \end{cases}$$
where $1_{(a_i \neq b_j)}$ is an indicator function whose value is 0 when $a_i = b_j$ and 1 otherwise.
The fifth step scales the edit distance to an output between 0 and 100, following the original fuzzy-hash algorithm, where 100 means the vulnerability function and the target function are completely identical and 0 means they are not similar at all.
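Steps four and five can be sketched together as follows. The dynamic program implements the recurrence above exactly; the scaling to 0..100 is an assumed simple normalization by total string length, not the fuzzy-hash library's exact formula:

```python
def wagner_fischer(a: str, b: str) -> int:
    """Edit distance via the Wagner-Fischer dynamic program
    (insertion, deletion, and replacement each cost 1)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i              # delete all of a's first i characters
    for j in range(n + 1):
        d[0][j] = j              # insert all of b's first j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1  # indicator function
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + sub)    # replacement
    return d[m][n]

def similarity_score(a: str, b: str) -> int:
    """Assumed scaling of the edit distance to 0..100
    (100 = identical, 0 = no similarity)."""
    if not a and not b:
        return 100
    dist = wagner_fischer(a, b)
    return max(0, 100 - (200 * dist) // (len(a) + len(b)))
```

With the example from Definition 6, `wagner_fischer("Angel", "Angle")` returns 2.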
The final similarity score is thus obtained and can be used to judge whether the vulnerability function and the target function are similar.
A similarity threshold $\theta$ is set for the fuzzy matching; if the similarity score between the target function and the vulnerability function is below the threshold, the target function is directly judged not to be similar to that vulnerability function.
The final exact-matching step is multi-pattern matching based on the Aho-Corasick algorithm.
The method takes the syntax structure of the target function as the main string and all difference-line codes in the vulnerability-function fingerprint as multiple pattern strings: if every deleted-line difference fingerprint can be exactly matched in the target function's syntax structure and no added-line difference fingerprint can be matched there, the vulnerability is proven to exist in the target function.
The AC algorithm implementation is divided into two steps of preprocessing and matching.
Preprocessing: the keywords are built into a finite-state pattern-matching machine, i.e., an automaton containing all keywords, which mainly has the following three functions.
Goto: this function stores the edges of the trie constructed from all keywords. It is represented as a two-dimensional array that stores, for the current state and input character, the next state.
Failure: this function stores, for states where the trie has no matching edge for the current character, the state to fall back to. It is represented as a one-dimensional array that stores the fallback state of each state.
Output: this function stores the indices of all keywords that end at the current state. It is represented as a one-dimensional array in which the indices of the matched words are stored as a bitmap for each state.
Matching: traverse the given text on the built automaton to find all matching words. Starting from the first character of the main string and initial state 0 of the automaton, if a character matches successfully, transfer to the next state according to the automaton's goto function; if the reached state has an associated output function, output the matched pattern strings; if the character fails to match, fall back recursively according to the automaton's failure function.
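The preprocessing and matching steps above can be sketched as a minimal Aho-Corasick automaton, together with the detection rule that all deleted-line fingerprints must match and no added-line fingerprint may match. This is an illustrative implementation (dictionaries instead of the two-dimensional array, and pattern sets instead of bitmaps); the helper `target_is_vulnerable` and its arguments are our names, not the patent's:

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton with the three functions described
    above: goto (transitions), failure links, and output sets."""

    def __init__(self, patterns):
        self.goto = [{}]          # goto function: state -> {char: state}
        self.out = [set()]        # output function: state -> matched patterns
        self.fail = [0]           # failure function: state -> fallback state
        for pat in patterns:      # preprocessing: build the trie of keywords
            state = 0
            for ch in pat:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.out.append(set())
                    self.fail.append(0)
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].add(pat)
        # BFS to compute failure links and merge output sets
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for ch, nxt in self.goto[s].items():
                queue.append(nxt)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] |= self.out[self.fail[nxt]]

    def search(self, text):
        """Matching: return the set of patterns occurring in `text`."""
        found, state = set(), 0
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]   # follow failure links
            state = self.goto[state].get(ch, 0)
            found |= self.out[state]
        return found

def target_is_vulnerable(target_structure, deleted_fps, added_fps):
    """Detection rule: vulnerable iff every deleted-line fingerprint
    matches the target's syntax structure and no added-line one does."""
    hits = AhoCorasick(deleted_fps + added_fps).search(target_structure)
    return (all(d in hits for d in deleted_fps)
            and not any(a in hits for a in added_fps))
```

All pattern strings are matched in a single pass over the main string, which is the advantage of the AC automaton over matching each difference line separately.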
Following the ideas and implementation steps of the vulnerability detection method based on function-level code similarity, a prototype vulnerability detection system was built and five open-source projects were selected for vulnerability detection.
The project sizes range from 13.1 MB to 965 MB, and the number of C functions ranges from 1,161 to 435,734.
The code-clone detection precision was 77.3% and the recall was 75.6%.
Fingerprint generation was completed in only 28 hours, and clones in a 1 GB target were detected.
The model can extend the detection scope to large-scale code repositories and addresses the problem of balancing cost and performance in scenarios with high-frequency incremental code.
Based on open-source function-level vulnerability source code combined with fuzzy hash values, vulnerability-function fingerprints are designed, yielding the proposed vulnerability detection method based on function-level code similarity.
First, the source code of the open-source vulnerability functions and the target functions is preprocessed using self-defined syntax abstraction and normalization rules; then, vulnerability-function fingerprints and target-function fingerprints are generated from the vulnerability function bodies together with the added-line and deleted-line code in the corresponding patch files. The vulnerability detection process based on function fingerprints comprises two matching steps: fuzzy matching based on the Wagner-Fischer algorithm and multi-pattern exact matching based on the Aho-Corasick algorithm. Experimental verification shows a code-clone detection precision of 77.3% and a recall of 75.6%.
Fingerprint generation was completed in only 28 hours and clones in a 1 GB target were detected, verifying the correctness and effectiveness of the method.
The invention avoids generating a complex intermediate representation while preserving the basic syntax structure, ensuring the performance of the detection model and, in particular, ensuring that detection precision is not affected by syntactically meaningless modifications; it can perform type 1-3 clone detection while automatically distinguishing vulnerable code from patched code.
The expandability of vulnerability detection is improved while the low false alarm rate and the low missing report rate are ensured.
The model can expand the detection range into a large-scale code warehouse, and achieves the balance of cost and performance under the scene of high-frequency incremental codes.
Claims (1)
1. A vulnerability detection method based on function-level code similarity is characterized by comprising the following steps:
A. the vulnerability-function fingerprint-library construction module, which targets C-language code containing vulnerabilities, uses Python regular expressions to match and remove comments from the extracted source code, collects the commit files and corresponding patch files of all CVE vulnerabilities from the CVE project repository on GitHub to build a vulnerability database, and extracts the added and deleted lines from all vulnerability functions and their corresponding patch files;
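The comment-removal step in step A can be sketched with a Python regular expression. This is an illustrative implementation, not the patent's exact expression: it matches string and character literals first (so that `//` inside a literal is preserved) and strips `/* ... */` block comments and `//` line comments.

```python
import re

# Hypothetical sketch of step A's comment removal. Literals are matched
# first and kept verbatim; comments are replaced with a single space.
_COMMENT_RE = re.compile(
    r'("(?:\\.|[^"\\])*")'     # string literals - kept
    r"|('(?:\\.|[^'\\])*')"    # character literals - kept
    r'|/\*.*?\*/'              # block comments - removed (non-greedy)
    r'|//[^\n]*',              # line comments - removed, newline kept
    re.DOTALL,
)

def strip_c_comments(source: str) -> str:
    """Remove C comments from source while preserving literals."""
    return _COMMENT_RE.sub(lambda m: m.group(1) or m.group(2) or ' ', source)
```

Because the literal alternatives appear before the comment alternatives, a `//` sequence inside a string is never mistaken for a comment.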
B. writing a C-language grammar file with Antlr4, generating abstract syntax trees for all vulnerability functions from the C source files, converting the trees into token sequences, and carrying out syntax abstraction and normalization, the concrete steps of which are as follows:
b-1. syntax abstraction: extracting the function body, the vulnerability source (the file location), the function name, the formal-parameter list, the local-variable list, the data-type list and the function-call list from the token sequence; replacing the function name with the symbol FUNCNAME, each formal parameter appearing in the function body with the symbol FORPAR, each local variable appearing in the function body with the symbol LOVAR, every custom data-type declaration other than those declared in the ISO C standard with the symbol CUSTYPE, and every function call, except calls to C standard-library functions, with the symbol FUNCCALL;
b-2. normalization: deleting spaces, tab characters and line breaks, deleting all '{' and '}' characters, and converting all characters to lower case, so as to normalize the vulnerability function body;
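Steps b-1 and b-2 can be illustrated with a small Python sketch. The real method extracts the identifier lists by walking the Antlr4 abstract syntax tree; here they are supplied directly as arguments, and the replacement symbols FUNCNAME/FORPAR/LOVAR/CUSTYPE/FUNCCALL follow the claim text.

```python
import re

# Illustrative sketch of syntax abstraction (b-1) and normalization (b-2),
# assuming the name lists were already extracted from the token sequence.
def abstract_and_normalize(body, func_name, params, local_vars, types, calls):
    def replace_all(text, names, symbol):
        # longest names first, so overlapping identifiers are handled safely
        for name in sorted(names, key=len, reverse=True):
            text = re.sub(r'\b%s\b' % re.escape(name), symbol, text)
        return text
    body = replace_all(body, [func_name], 'FUNCNAME')
    body = replace_all(body, params, 'FORPAR')
    body = replace_all(body, local_vars, 'LOVAR')
    body = replace_all(body, types, 'CUSTYPE')
    body = replace_all(body, calls, 'FUNCCALL')
    # b-2: delete whitespace and braces, convert to lower case
    return re.sub(r'[ \t\n{}]', '', body).lower()
```

For example, `mytype b; b = bar(a); return b;` with parameter `a`, local `b`, type `mytype` and call `bar` normalizes to `custypelovar;lovar=funccall(forpar);returnlovar;`.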
C. the vulnerability-function syntactic structure generated after syntax abstraction and normalization comprises a difference structure and a function-body structure; for the latter, a fuzzy hash value based on the CTPH (context-triggered piecewise hashing) algorithm is generated, the concrete process being as follows:
c-1. slicing: reading part of the vulnerability function's content and computing a hash value with a weak hash algorithm; a rolling hash algorithm is used to generate a pseudo-random value from the current input context. Suppose the input consists of n characters, and the i-th byte of the input is denoted b_i, so that the input as a whole is the byte sequence b_1, b_2, …, b_n. At any position p in the input, the state of the rolling hash depends only on the last s bytes of the file; the rolling hash value r_p can therefore be expressed as a function of the last few bytes, as in the following equation: r_p = F(b_p, b_(p-1), …, b_(p-s+1)). After the slices are determined, the Adler-32 algorithm is used as the weak hash;
c-2. hashing each slice: after the vulnerability function is sliced, a hash value is computed for each slice using the Fowler-Noll-Vo (FNV) hash algorithm;
c-3. compression mapping: after a hash value is computed for each vulnerability-function slice, the result can optionally be compressed: only the lowest 6 bits of the FNV hash value are kept and represented as a single ASCII character, which becomes the slice's final hash result;
c-4. output: the final hash results of all slices are concatenated to obtain the fuzzy hash value of the vulnerability function, which has the form BS:hash1:hash2, where BS is the block size (only hashes with the same block size can be compared), hash1 is the concatenation of the final hash results of all blocks in the file, and hash2 is computed in the same way as hash1 but with twice the block size;
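Steps c-1 through c-4 can be sketched in Python. This is a simplified illustration of the CTPH pipeline, not the patented implementation: the rolling hash below is a toy weighted sum (real ssdeep-style CTPH combines three sub-hashes and chooses the block size adaptively), while the FNV-style piece hash, the low-6-bit compression and the BS:hash1:hash2 output format follow the claim text.

```python
# Base64 alphabet used to render the low 6 bits of each piece hash.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def ctph(data: bytes, block_size: int) -> str:
    def signature(bs):
        h = 0x28021967          # FNV-style seed (value used by ssdeep)
        window, sig = [], []
        for byte in data:
            window.append(byte)         # c-1: rolling hash over a 7-byte window
            if len(window) > 7:
                window.pop(0)
            rolling = sum((i + 1) * b for i, b in enumerate(window))  # toy rolling hash
            h = ((h * 0x01000193) ^ byte) & 0xFFFFFFFF  # c-2: FNV piece hash
            if rolling % bs == bs - 1:  # context trigger: close the current piece
                sig.append(B64[h & 0x3F])   # c-3: keep only the lowest 6 bits
                h = 0x28021967
        sig.append(B64[h & 0x3F])       # final (possibly short) piece
        return ''.join(sig)
    # c-4: output format BS:hash1:hash2, hash2 with double the block size
    return f"{block_size}:{signature(block_size)}:{signature(block_size * 2)}"
```

Because only hashes with the same block size are comparable, carrying a second signature at twice the block size lets two fingerprints with adjacent block sizes still be compared.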
D. the target-function fingerprint-generation module generates target-function fingerprints: for the C source code to be examined, comments are removed, abstract syntax trees of all target functions are generated from the C source files and converted into token sequences, and the target functions to be examined are syntax-abstracted and normalized, the concrete steps being as follows:
d-1. syntax abstraction: extracting the target-function body from the token sequence, together with the target function's source (the file location to which it belongs), function name, formal-parameter list, local-variable list, data-type list and function-call list; the function name, formal parameters, local variables, custom data types and custom function calls in the target-function body are replaced in turn to achieve syntax abstraction;
d-2. normalization: deleting spaces, tab characters and line breaks in the target function after variable replacement, deleting all '{' and '}' symbols, and converting all characters to lower case;
after syntax abstraction and normalization, the syntactic structure of the generated target function is preserved and a CTPH-based fuzzy hash is generated; the two together form the intermediate representation of the target function. The CTPH-based fuzzy hash value is generated by the following concrete process:
d-3. slicing: reading part of the target function's content and computing a hash value with a weak hash algorithm; a rolling hash algorithm is used to generate a pseudo-random value from the current input context. Suppose the input consists of n characters, and the i-th byte of the input is denoted b_i, so that the input as a whole is the byte sequence b_1, b_2, …, b_n. At any position p in the input, the state of the rolling hash depends only on the last s bytes of the file; the rolling hash value r_p can therefore be expressed as a function of the last few bytes, as in the following equation: r_p = F(b_p, b_(p-1), …, b_(p-s+1)). After the slices are determined, the Adler-32 algorithm is used as the weak hash;
d-4. hashing each slice: after the target function is sliced, a hash value is computed for each slice using the Fowler-Noll-Vo (FNV) hash algorithm;
d-5. compression mapping: after a hash value is computed for each target-function slice, the result can optionally be compressed: only the lowest 6 bits of the FNV hash value are kept and represented as a single ASCII character, which becomes the slice's final hash result;
d-6. output: the final hash results of all slices are concatenated to obtain the fuzzy hash value of the target function, which has the form BS:hash1:hash2, where BS is the block size (only hash values with the same block size can be compared), hash1 is the concatenation of the final hash results of all blocks in the file, and hash2 is computed in the same way as hash1 but with twice the block size;
E. vulnerability detection based on function fingerprints comprises two matching steps: fuzzy matching based on the Wagner-Fischer algorithm and multi-pattern exact matching based on the Aho-Corasick algorithm; a vulnerability is detected only when both matching steps succeed;
fuzzy matching based on the Wagner-Fischer algorithm proceeds as follows:
e-1. compare block sizes: only hash values computed with the same block size can be compared; each fuzzy-hash string carries hash values for both its block size and double that block size, so the comparison attempts to match at least one of them, and if the two strings have no common block size the comparison returns 0;
e-2. delete runs of three or more identical characters, which carry almost no information about the document and would bias the matching score;
e-3. test for a common substring of at least 7 characters (the default value, which can be changed); if the two hashes do not share a common substring of at least this length, the comparison returns 0;
e-4. the two character strings are denoted s1 and s2, with lengths m and n respectively; the edit-distance problem is to find a least-cost sequence of editing operations that converts s1 into s2, the allowed operations being character insertion, character deletion and character substitution. The edit distance of s1 and s2 can be described in the following mathematical language:
define D[i, j] as the distance between the first i characters of s1 and the first j characters of s2; then D[i, 0] = i, D[0, j] = j, and D[i, j] = min( D[i-1, j] + 1, D[i, j-1] + 1, D[i-1, j-1] + cost ), where cost = 0 if s1[i] = s2[j] and cost = 1 otherwise, and D[m, n] is the edit distance of s1 and s2;
e-5. following the original fuzzy-hash algorithm, the edit distance is scaled so that the output lies between 0 and 100, where 100 means the vulnerability function and the target function are identical and 0 means they are completely dissimilar; a similarity threshold t is set for fuzzy matching, and if the similarity falls below the threshold the target function is directly judged to bear no similarity to the vulnerability function;
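Steps e-4 and e-5 can be sketched directly from the recurrence. The Wagner-Fischer dynamic program below is standard; the scaling in `similarity` is one plausible choice (normalizing by the longer string's length), not necessarily the exact formula used by the patented method.

```python
# Wagner-Fischer edit distance (step e-4) and a 0-100 similarity score
# (step e-5, with an assumed length-normalized scaling).
def edit_distance(s1: str, s2: str) -> int:
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        D[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + cost) # substitution
    return D[m][n]

def similarity(h1: str, h2: str) -> int:
    if not h1 and not h2:
        return 100
    d = edit_distance(h1, h2)
    return round(100 * (1 - d / max(len(h1), len(h2))))
```

For example, `edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion), and two identical fuzzy-hash strings score 100.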
multi-pattern matching based on the Aho-Corasick algorithm proceeds as follows: the syntactic structure of the target function is taken as the main string, all difference code lines in the vulnerability-function fingerprint are taken as multiple pattern strings, and an AC automaton is constructed for multi-pattern exact matching, specifically comprising:
e-6. preprocessing: a finite-state pattern-matching machine is built from the multiple keywords, yielding an automaton that contains all keywords;
e-7. matching: the given text is traversed over the constructed automaton to find all matched words. Starting from the first character of the main string and the automaton's initial state 0, if a character matches successfully, the automaton moves to the next state according to its goto function; if the reached state has an associated output function, the matched pattern strings are output; if a character fails to match, the automaton falls back recursively according to its failure function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111071388.0A CN113901474B (en) | 2021-09-13 | 2021-09-13 | Vulnerability detection method based on function-level code similarity |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113901474A CN113901474A (en) | 2022-01-07 |
CN113901474B true CN113901474B (en) | 2022-07-26 |
Family
ID=79028063
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111071388.0A Active CN113901474B (en) | 2021-09-13 | 2021-09-13 | Vulnerability detection method based on function-level code similarity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113901474B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114356405B (en) * | 2022-03-21 | 2022-05-17 | 思探明信息科技(南京)有限公司 | Matching method and device of open source component function, computer equipment and storage medium |
CN114781008B (en) * | 2022-04-15 | 2022-10-28 | 山东省计算中心(国家超级计算济南中心) | Data identification method and device for security detection of terminal firmware of Internet of things |
CN114491566B (en) * | 2022-04-18 | 2022-07-05 | 中国长江三峡集团有限公司 | Fuzzy test method and device based on code similarity and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107688748A (en) * | 2017-09-05 | 2018-02-13 | 中国人民解放军信息工程大学 | Fragility Code Clones detection method and its device based on leak fingerprint |
CN109635569A (en) * | 2018-12-10 | 2019-04-16 | 国家电网有限公司信息通信分公司 | A kind of leak detection method and device |
US10754958B1 (en) * | 2016-09-19 | 2020-08-25 | Nopsec Inc. | Vulnerability risk mitigation platform apparatuses, methods and systems |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10514909B2 (en) * | 2017-03-29 | 2019-12-24 | Technion Research & Development Foundation Limited | Similarity of binaries |
CN108491228B (en) * | 2018-03-28 | 2020-03-17 | 清华大学 | Binary vulnerability code clone detection method and system |
Non-Patent Citations (1)
Title |
---|
Qiu Yaoyao et al. "Malicious JavaScript Code Detection Method Based on Semantic Analysis." Journal of Sichuan University (Natural Science Edition), 2019, vol. 56, no. 2, pp. 273-277. * |
Also Published As
Publication number | Publication date |
---|---|
CN113901474A (en) | 2022-01-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113901474B (en) | Vulnerability detection method based on function-level code similarity | |
CN109445834B (en) | Program code similarity rapid comparison method based on abstract syntax tree | |
CN101978348B (en) | Manage the archives about approximate string matching | |
CN109359439B (en) | software detection method, device, equipment and storage medium | |
CN111290784B (en) | Program source code similarity detection method suitable for large-scale samples | |
CN109885479B (en) | Software fuzzy test method and device based on path record truncation | |
US8391614B2 (en) | Determining near duplicate “noisy” data objects | |
KR101627592B1 (en) | Detection of confidential information | |
Breitinger et al. | Approximate matching: definition and terminology | |
CN112651028B (en) | Vulnerability code clone detection method based on context semantics and patch verification | |
CN111310178B (en) | Firmware vulnerability detection method and system in cross-platform scene | |
Liu et al. | Vfdetect: A vulnerable code clone detection system based on vulnerability fingerprint | |
CN113297580B (en) | Code semantic analysis-based electric power information system safety protection method and device | |
CN109858025B (en) | Word segmentation method and system for address standardized corpus | |
US20230418578A1 (en) | Systems and methods for detection of code clones | |
Sheneamer et al. | Code clone detection using coarse and fine-grained hybrid approaches | |
CN116149669B (en) | Binary file-based software component analysis method, binary file-based software component analysis device and binary file-based medium | |
CN113961768B (en) | Sensitive word detection method and device, computer equipment and storage medium | |
CN113688240B (en) | Threat element extraction method, threat element extraction device, threat element extraction equipment and storage medium | |
CN114266046A (en) | Network virus identification method and device, computer equipment and storage medium | |
CN114510717A (en) | ELF file detection method and device and storage medium | |
CN111562943B (en) | Code clone detection method and device based on event embedded tree and GAT network | |
WO2021160822A1 (en) | A method for linking a cve with at least one synthetic cpe | |
Ullah et al. | Efficient features for function matching in multi-architecture binary executables | |
KR102449691B1 (en) | Binary diffing algorithm and system thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||