CN113901474B - Vulnerability detection method based on function-level code similarity - Google Patents


Info

Publication number
CN113901474B
Authority
CN
China
Prior art keywords
function
hash
vulnerability
algorithm
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111071388.0A
Other languages
Chinese (zh)
Other versions
CN113901474A (en)
Inventor
黄诚
赵倩崇
郭勇延
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202111071388.0A
Publication of CN113901474A
Application granted
Publication of CN113901474B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a vulnerability detection method based on function-level code similarity, belonging to the technical field of computer network security. The method comprises the following steps: first, the source code of open-source vulnerability functions and of the functions to be detected is preprocessed using custom syntax abstraction and normalization rules; then, vulnerability function fingerprints and fingerprints of the functions to be detected are generated from the vulnerability function body together with the added and deleted lines in the corresponding patch file; finally, vulnerability detection is performed on the function fingerprints through fuzzy matching based on the Wagner-Fischer algorithm and multi-pattern exact matching based on the Aho-Corasick algorithm. The invention avoids generating a complex intermediate representation while retaining the basic syntax structure, which preserves the performance of the detection model and, in particular, keeps detection accuracy unaffected by syntactically meaningless modifications. It improves the scalability of vulnerability detection while maintaining a low false positive rate and a low false negative rate.

Description

Vulnerability detection method based on function-level code similarity
Technical Field
The invention relates to the field of computer network security, and in particular to a vulnerability detection method based on function-level code similarity. The invention avoids generating a complex intermediate representation while retaining the basic syntax structure, which preserves the performance of the detection model and, in particular, keeps detection accuracy unaffected by syntactically meaningless modifications. It can detect Type-1 to Type-3 code clones and automatically distinguishes vulnerable code from patched code. It improves the scalability of vulnerability detection while maintaining a low false positive rate and a low false negative rate.
Background
Over the past few years, the number of open-source software (OSS) programs has increased rapidly. This growth naturally leads to an increase in software vulnerabilities caused by code cloning, which poses a serious threat to the security of software systems. Software vulnerabilities include missing validation of user input, inadequate logging mechanisms, fail-open error handling, failure to close database connections properly, and so on. Code cloning is the practice of copying and pasting existing code from other software. Used correctly, it can greatly improve development efficiency and shorten the development cycle. In practice, however, code cloning is often regarded as a poor programming practice because it can increase maintenance costs, reduce code quality, create potential legal conflicts, and even propagate software vulnerabilities. In particular, because OSS programs are widely used as code libraries in software development, code cloning is becoming one of the main causes of software vulnerabilities.
Conventional code similarity detection generally converts the target code into an intermediate representation, such as a parse tree or a program control graph, then analyzes the intermediate representation and checks whether it matches a predefined vulnerability rule in order to determine whether the source program contains the corresponding vulnerability. A complex intermediate representation helps improve detection accuracy but also incurs a higher computational cost. A more abstract code representation improves efficiency, but part of the vulnerability's semantic information may be lost, and vulnerable code can no longer be distinguished from patched code.
Disclosure of Invention
In view of this, the embodiments of the present application provide a vulnerability detection method based on function-level code similarity, which aims to balance efficiency and accuracy at an acceptable cost and to effectively detect the common kinds of modification introduced by code cloning. It improves the scalability of vulnerability detection while maintaining a low false positive rate and a low false negative rate.
The definitions used in the present invention are as follows.
Definition 1: Antlr4 is an open-source parser generator written in Java that generates a parser from a grammar rule file.
Definition 2: An Abstract Syntax Tree (AST) is a tree representation of the syntactic structure of program code, used to analyze the structure of source code from the point of view of its syntax. For example, a conditional statement of the form if-else can be represented in the AST by a node with two branches.
Definition 3: The Fowler-Noll-Vo hash (abbreviated FNV hash) can quickly hash large amounts of data while keeping the collision rate low. Its high dispersion makes it suitable for hashing very similar strings, such as URLs, hostnames, filenames, text, and IP addresses.
Definition 4: Difference line code. A patch file consists of one or more difference blocks (hunks), each of which is a sequence of specially marked code lines. Lines beginning with "+" indicate added code and lines beginning with "-" indicate deleted code; together they are referred to herein as difference line code.
Definition 5: Context-triggered piecewise hashing (the CTPH algorithm) uses a rolling hash to set the boundaries of traditional piecewise hashes. It can be used to identify ordered homologous sequences between an unknown input and a known file, even if the unknown file is a modified version of the known file.
Definition 6: The Wagner-Fischer algorithm (abbreviated WF algorithm) finds the least costly series of edit operations that converts a string a into a string b, where the allowed edit operations are character insertion, character deletion, and character replacement. For example, the WF value between the string S1 "Angel" and the string S2 "Angle" is 2.
Definition 7: The Aho-Corasick automaton (abbreviated AC automaton) is a multi-pattern matching algorithm used to find, in an input string, the substrings that belong to a finite "dictionary" of strings. It differs from ordinary string matching in that the input is matched against all dictionary strings simultaneously.
The technical solution of the invention is as follows: a vulnerability detection method based on function-level code similarity, comprising the following steps.
Step one: build the vulnerability function fingerprint database.
(1) For vulnerable code written in C, match and remove the comments in the extracted source code using Python regular expressions.
(2) Collect the commit files of all CVE vulnerabilities and the corresponding patch files from the CVE project repositories on GitHub to establish a vulnerability database, and extract all vulnerability functions together with the added and deleted lines in the corresponding patch files.
(3) Write a C grammar rule file using Antlr4, generate the abstract syntax trees of all functions from the C source files, convert the abstract syntax trees into token sequences, and extract from them the function body, the vulnerability source (i.e., the file location), the function name, the formal parameter list, the local variable list, the data type list, and the function call list.
(4) Perform syntax abstraction as follows: replace the function name with the symbol FUNNAME; replace each formal parameter appearing in the function body with the symbol FORPAR; replace each local variable appearing in the function body with the symbol LOVAR; replace every custom data type declaration, i.e., every type not declared in the ISO C standard, with CUSTYPE; and replace every function call other than calls to C standard library functions with the symbol FUNCALL.
(5) Delete spaces, tabs, and line feeds, delete all "{" and "}" characters, and convert all characters to lowercase.
(6) After syntax abstraction and normalization, the resulting syntax structure of the vulnerability function consists of two parts: a difference structure and a function body structure. For the latter, a fuzzy hash value based on the CTPH algorithm is generated. Piecewise splitting uses a rolling hash algorithm: suppose the input has $n$ characters and the $i$-th byte of the input is denoted $b_i$, so that the input as a whole consists of the bytes $b_1, b_2, \ldots, b_n$. At any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file. The rolling hash value $r_p$ can therefore be expressed as a function of the last few bytes:
$$r_p = F(b_p, b_{p-1}, b_{p-2}, \ldots, b_{p-s})$$
Step two: generate the target function fingerprint.
(1) Remove the comments from the C source code to be checked.
(2) Generate the abstract syntax trees of all target functions from the C source files, convert the abstract syntax trees into token sequences, and extract from them the target function body, the target function source (i.e., the file location), the function name, the formal parameter list, the local variable list, the data type list, and the function call list.
(3) Replace, in order, the function name, the formal parameters, the local variables, the custom data types, and the custom function calls in the body of the target function to perform syntax abstraction.
(4) Normalize the target function after the replacements.
(5) After syntax abstraction and normalization, retain the resulting syntax structure of the target function and also generate a fuzzy hash from it based on the CTPH algorithm; the two together form the intermediate representation of the target function.
Piecewise splitting again uses a rolling hash algorithm: suppose the input has $n$ characters and the $i$-th byte of the input is denoted $b_i$, so that the input as a whole consists of the bytes $b_1, b_2, \ldots, b_n$. At any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file. The rolling hash value $r_p$ can therefore be expressed as a function of the last few bytes:
$$r_p = F(b_p, b_{p-1}, b_{p-2}, \ldots, b_{p-s})$$
Step three: detect vulnerabilities based on the function fingerprints.
(1) Compute the Wagner-Fischer value between the fuzzy hash of the function body in the vulnerability function fingerprint and the fuzzy hash in the target function fingerprint to obtain a similarity score, and use it to judge whether the two functions are similar.
The Wagner-Fischer value of two strings $a$ and $b$ can be described mathematically as:
$$\operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\ \operatorname{lev}_{a,b}(i,j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$
Here $\operatorname{lev}_{a,b}(i,j)$ denotes the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$; $i$ and $j$ can be regarded as the lengths of the prefixes of $a$ and $b$ under consideration. Since the first character index of a string starts from 1, the final edit distance is the distance when $i = |a|$ and $j = |b|$:
$$\operatorname{lev}_{a,b}(|a|, |b|)$$
(2) Take the syntax structure of the target function as the main string and all difference code lines in the vulnerability function fingerprint as the pattern strings, and construct an AC automaton to perform multi-pattern exact matching. The AC algorithm is implemented in two steps, preprocessing and matching. Preprocessing: the keywords are built into a finite-state pattern matching machine, i.e., an automaton containing all keywords is constructed. Matching: the given text is run through the constructed automaton to find all matching words. Starting from the first character of the main string and the initial state 0 of the automaton, if a character matches, the automaton moves to the next state according to its goto function; if the new state has an output, the matched pattern string is output; if the character does not match, the automaton follows its failure function recursively.
The main advantages of the invention are as follows.
Custom syntax abstraction and normalization rules are provided. They avoid generating a complex intermediate representation while retaining the basic syntax structure, eliminate the effects of modifications made when code is cloned, and preserve the performance of the detection model; in particular, detection accuracy is unaffected by syntactically meaningless modifications.
Vulnerability function fingerprints based on code differences are provided. In many cases there is very little difference between vulnerable code and patched code, and a vulnerability can be eliminated by inserting a single if statement. Many security vulnerabilities are also very sensitive to constants and to statement order. Consequently, when detecting vulnerable clones of Type 1, 2, and 3, failing to distinguish vulnerable code from patched code leads to a high false negative rate. The method generates the vulnerability fingerprint from the vulnerability function body together with the added and deleted lines in the corresponding patch file (collectively called difference line code), so it can detect Type-1 to Type-3 clones while distinguishing vulnerable code from patched code.
Drawings
FIG. 1 is the system flow diagram.
FIG. 2 is a flowchart of building the vulnerability function fingerprint database in block 1001 of FIG. 1.
FIG. 3 is a flowchart of the target function fingerprint generation process in block 1002 of FIG. 1.
FIG. 4 is a flowchart of the function-fingerprint-based vulnerability detection process in block 1003 of FIG. 1.
Detailed Description
The present invention is further explained below with reference to the drawings and examples.
FIG. 1 is a flowchart of the overall system of the present invention.
The vulnerability function fingerprint database construction module collects the commit files and corresponding patch files of all CVE vulnerabilities from the CVE project repositories on GitHub to establish a vulnerability database, generates vulnerability function fingerprints based on the CTPH algorithm and code differences, and builds the vulnerability function fingerprint database.
The target function fingerprint generation module generates target function fingerprints based on the CTPH algorithm.
Vulnerability detection based on function fingerprints comprises two matching steps, fuzzy matching and exact matching; a vulnerability is reported only when both steps succeed.
Fig. 2 is a flowchart of the vulnerability function fingerprint database construction in fig. 1, which illustrates how to construct the vulnerability function fingerprint database.
The process begins by using Python regular expressions to match and remove the comments in the extracted vulnerability function source code. There are two kinds of comment: single-line comments introduced by "//", and multi-line comments enclosed by "/*" and "*/".
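The comment-removal step can be sketched as follows. This is a minimal illustration rather than the patent's actual expression: the pattern, the function name `strip_c_comments`, and the decision to leave string and character literals untouched are all assumptions made for the example.

```python
import re

# Hypothetical pattern: literals are matched first so that "//" or "/*"
# appearing inside a string is not mistaken for a comment.
_COMMENT_RE = re.compile(
    r"""
    ("(?:\\.|[^"\\])*")      # group 1: double-quoted string literal (kept)
  | ('(?:\\.|[^'\\])*')      # group 2: character literal (kept)
  | (/\*.*?\*/)              # group 3: multi-line comment (removed)
  | (//[^\n]*)               # group 4: single-line comment (removed)
    """,
    re.DOTALL | re.VERBOSE,
)


def strip_c_comments(source: str) -> str:
    """Return C source text with // and /* */ comments replaced by a space."""

    def replace(match: re.Match) -> str:
        if match.group(3) is not None or match.group(4) is not None:
            return " "           # a comment: keep a separator so tokens stay apart
        return match.group(0)    # a literal: return it unchanged

    return _COMMENT_RE.sub(replace, source)
```

Replacing each comment with a single space, rather than deleting it outright, keeps tokens such as `int/* x */a` from fusing; the later normalization step removes whitespace anyway.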
Second, the commit files and corresponding patch files of all CVE vulnerabilities are collected from the CVE project repositories on GitHub to establish a vulnerability database, and all vulnerability functions are extracted together with the added and deleted lines in the corresponding patch files.
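Extracting the added and deleted lines from a patch can be sketched as below. The function name `split_diff_lines` and the assumption that the patch is a git-style unified diff (file headers before the first "@@" hunk marker) are illustrative choices, not details taken from the patent.

```python
def split_diff_lines(patch_text: str):
    """Return (added, deleted) code lines from a git-style unified diff.

    Lines are only collected inside hunks (after an "@@" marker), so the
    "--- a/file" and "+++ b/file" headers are never mistaken for code.
    """
    added, deleted = [], []
    in_hunk = False
    for line in patch_text.splitlines():
        if line.startswith("diff "):
            in_hunk = False              # a new file section starts
        elif line.startswith("@@"):
            in_hunk = True               # hunk header: code lines follow
        elif in_hunk and line.startswith("+"):
            added.append(line[1:].strip())
        elif in_hunk and line.startswith("-"):
            deleted.append(line[1:].strip())
    # Blank added/deleted lines carry no information for matching.
    return [s for s in added if s], [s for s in deleted if s]


# Made-up patch used purely for illustration.
EXAMPLE_PATCH = """\
diff --git a/foo.c b/foo.c
--- a/foo.c
+++ b/foo.c
@@ -10,3 +10,3 @@ int copy(char *dst, char *src, int n)
 {
-    strcpy(dst, src);
+    strncpy(dst, src, n);
 }
"""
```

For this example, `split_diff_lines(EXAMPLE_PATCH)` yields one added line and one deleted line; in the method described here those lines would then go through the same abstraction and normalization as the function body.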
Third, a C grammar rule file is written using Antlr4, the abstract syntax trees of all vulnerability functions are generated from the C source files, the abstract syntax trees are converted into token sequences, and the function body, the vulnerability source (i.e., the file location), the function name, the formal parameter list, the local variable list, the data type list, and the function call list are extracted.
Fourth, syntax abstraction is performed as follows: the function name is replaced with the symbol FUNNAME; each formal parameter appearing in the function body is replaced with the symbol FORPAR; each local variable appearing in the function body is replaced with the symbol LOVAR; every custom data type declaration, i.e., every type not declared in the ISO C standard, is replaced with CUSTYPE; and every function call other than calls to C standard library functions is replaced with the symbol FUNCALL.
Fifth, spaces, tabs, and line feeds are deleted, all "{" and "}" characters are deleted, and all characters are converted to lowercase, which normalizes the vulnerability function body.
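The abstraction and normalization rules above can be sketched as follows. The sketch assumes the identifier lists have already been extracted from the AST (the Antlr4 step is not reproduced here) and substitutes identifiers with a regular expression; the function name `abstract_and_normalize` and its parameters are hypothetical.

```python
import re

# An identifier: a letter or underscore followed by word characters.
# The leading \b prevents matching inside numeric literals such as 0x1F.
_IDENT_RE = re.compile(r"\b[A-Za-z_]\w*")


def abstract_and_normalize(body, func_name, params, local_vars,
                           custom_types, custom_calls):
    """Apply symbol replacement, then strip whitespace/braces and lowercase."""
    mapping = {}
    mapping.update({name: "FUNCALL" for name in custom_calls})
    mapping.update({name: "CUSTYPE" for name in custom_types})
    mapping.update({name: "LOVAR" for name in local_vars})
    mapping.update({name: "FORPAR" for name in params})
    mapping[func_name] = "FUNNAME"

    # Syntax abstraction: swap each known identifier for its symbol.
    abstracted = _IDENT_RE.sub(
        lambda m: mapping.get(m.group(0), m.group(0)), body)

    # Normalization: remove whitespace and braces, then lowercase.
    no_whitespace = re.sub(r"\s+", "", abstracted)
    no_braces = no_whitespace.replace("{", "").replace("}", "")
    return no_braces.lower()
```

Standard library calls such as `strlen` or `memcpy` stay as they are simply because they are not in the `custom_calls` list, which matches the rule that only non-standard calls are abstracted.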
Finally, the syntax structure of the vulnerability function produced by syntax abstraction and normalization consists of a difference structure and a function body structure. A fuzzy hash value based on the CTPH algorithm is generated for the latter; the specific process is as follows.
Piecewise splitting: a portion of the vulnerability function's content is read and a hash value is computed over it with a weak hash algorithm.
The file cannot be split into fixed-length blocks. Instead, a rolling hash algorithm generates pseudo-random values from the current context of the input. It works by maintaining a state that depends only on the last few bytes of the input: each byte is added to the state as it is processed and removed from the state after a set number of other bytes have been processed.
Suppose the input has $n$ characters and the $i$-th byte of the input is denoted $b_i$, so that the input as a whole consists of the bytes $b_1, b_2, \ldots, b_n$. At any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file. The rolling hash value $r_p$ can therefore be expressed as a function of the last few bytes:
$$r_p = F(b_p, b_{p-1}, b_{p-2}, \ldots, b_{p-s})$$
The rolling hash function $F$ is constructed so that the influence of one of its terms can be removed. Thus, given $r_p$, $r_{p+1}$ can be computed by removing the influence of $b_{p-s}$, expressed as the function $X(b_{p-s})$, and adding the influence of $b_{p+1}$, expressed as the function $Y(b_{p+1})$, as shown in the following equations:
$$r_{p+1} = r_p - X(b_{p-s}) + Y(b_{p+1})$$
$$r_{p+1} = F(b_{p+1}, b_p, b_{p-1}, \ldots, b_{(p-s)+1})$$
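The update rule can be illustrated with a deliberately simple rolling hash in which the window function is a plain sum of the last few bytes, so that removing and adding a byte's influence are both just the byte value. This is a sketch under that assumption, not the rolling hash of the CTPH algorithm itself; the names `RollingSum` and `trigger_points`, the window size of 7, and the trigger condition are illustrative.

```python
from collections import deque


class RollingSum:
    """Rolling hash whose state depends only on the last `window` bytes."""

    def __init__(self, window: int = 7):
        self.window = window
        self.recent = deque()   # bytes currently inside the window
        self.value = 0

    def update(self, byte: int) -> int:
        self.recent.append(byte)
        self.value += byte                        # add influence of the new byte
        if len(self.recent) > self.window:
            self.value -= self.recent.popleft()   # remove the byte leaving the window
        return self.value


def trigger_points(data: bytes, block_size: int, window: int = 7):
    """Positions where the rolling value hits the (assumed) trigger condition.

    A piece boundary is placed wherever value % block_size == block_size - 1,
    so boundaries depend on local content rather than on fixed offsets.
    """
    roller = RollingSum(window)
    return [i for i, byte in enumerate(data)
            if roller.update(byte) % block_size == block_size - 1]
```

Because each boundary depends only on the bytes in the window, inserting or deleting code early in a function shifts later boundaries along with the content instead of changing every piece, which is what makes piecewise hashes of similar functions comparable.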
After the pieces are determined, the Adler-32 algorithm is used as the weak hash. The final checksum is obtained by computing two 16-bit checksums A and B and concatenating their bits into a single 32-bit result. A is the sum of all bytes plus 1, and B is the sum of the values of A after each step. At the start of an Adler-32 run, A is initialized to 1 and B to 0. The sums are computed modulo 65521 (the largest prime number smaller than 2^16). The result is stored in big-endian (network) byte order, with B occupying the two most significant bytes.
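The Adler-32 description above translates almost line for line into code. The following sketch implements it directly and can be cross-checked against `zlib.adler32` from the Python standard library; the function name `adler32` is chosen for the example.

```python
import zlib

MOD_ADLER = 65521  # largest prime smaller than 2**16


def adler32(data: bytes) -> int:
    """Compute the Adler-32 checksum of `data` as an unsigned 32-bit integer."""
    a, b = 1, 0                      # initial values of the two 16-bit sums
    for byte in data:
        a = (a + byte) % MOD_ADLER   # A: running sum of bytes, plus 1
        b = (b + a) % MOD_ADLER      # B: running sum of the values of A
    return (b << 16) | a             # B occupies the two most significant bytes


# The hand-written version agrees with the standard library implementation.
assert adler32(b"Wikipedia") == zlib.adler32(b"Wikipedia")
```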
Hashing each piece: after the vulnerability function has been split into pieces, a hash value is computed for each piece. The present invention uses the Fowler-Noll-Vo (FNV) hash, which can quickly hash large amounts of data while keeping the collision rate low; its high dispersion makes it suitable for hashing very similar strings.
Compression mapping: after a hash value has been computed for each piece of the vulnerability function, the result may optionally be shortened. This method keeps only the lowest 6 bits of the FNV hash and represents them with a single ASCII character, which is the final hash result of the piece.
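Per-piece hashing and compression mapping can be sketched as follows, assuming the 32-bit FNV-1a variant and the standard Base64 alphabet as the 64-character output set. Both are assumptions for illustration: the text states only that the lowest 6 bits are kept and shown as one ASCII character. The names `fnv1a_32` and `piece_char` are hypothetical.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193

# Assumed output alphabet: 64 characters, one per possible 6-bit value.
ALPHABET_64 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               "abcdefghijklmnopqrstuvwxyz"
               "0123456789+/")


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF   # keep the result in 32 bits
    return h


def piece_char(piece: bytes) -> str:
    """Map a piece to one character using the lowest 6 bits of its FNV hash."""
    return ALPHABET_64[fnv1a_32(piece) & 0x3F]
```

Keeping only 6 of 32 bits means unrelated pieces collide with probability about 1/64, which is why the later matching stage requires a run of several matching characters before it trusts a comparison.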
Output: the final hash results of all pieces are concatenated to obtain the fuzzy hash value of the vulnerability function. The fuzzy hash has the form BS:hash1:hash2, where:
BS is the block size. Only hash values computed with the same block size can be compared.
hash1 is the concatenation of the FNV-1a results (each mapped to one of 64 characters) for every block in the file.
hash2 is the same as hash1 but uses twice the block size. It is written because a small change can halve or double the block size; if that happens, at least one part of the two signatures can still be compared.
The difference structure is then processed. The vulnerability function fingerprint consists of the fuzzy hash of the function body structure together with the difference structure.
FIG. 3 is a flowchart of the target function fingerprint generation in FIG. 1, illustrating how the target function is processed to generate its fingerprint.
The process begins by using Python regular expressions to match and remove the comments in the source code to be checked.
Second, the abstract syntax trees of all target functions are generated from the C source files and converted into token sequences, from which the target function body, the target function source (i.e., the file location), the function name, the formal parameter list, the local variable list, the data type list, and the function call list are extracted.
Third, the function name, the formal parameters, the local variables, the custom data types, and the custom function calls in the body of the target function are replaced in turn to perform syntax abstraction.
Fourth, spaces, tabs, and line feeds are deleted from the target function after the replacements, all "{" and "}" characters are deleted, and all characters are converted to lowercase.
Finally, after syntax abstraction and normalization, the resulting syntax structure of the target function is retained and a fuzzy hash based on the CTPH algorithm is generated from it, following the same process as for the vulnerability function body structure. The syntax structure and the fuzzy hash together form the intermediate representation of the target function.
FIG. 4 is a flowchart of the function-fingerprint-based vulnerability detection in FIG. 1, illustrating how vulnerabilities are detected from function fingerprints.
The flow starts with fuzzy matching based on the Wagner-Fischer algorithm, which is divided into five steps.
The first step compares the block sizes. Only hash values computed with the same block size can be compared, and a fuzzy hash string contains hash values for both the block size and twice the block size. The comparison therefore tries to match at least one pair of hash values; if the two fuzzy hashes have no block size in common, the comparison returns 0.
The second step removes runs of three or more identical characters, which carry little information about the content and bias the matching score.
The third step checks that the two hashes share a common substring of at least 7 characters; 7 is the default value and can be changed. If the longest common substring is shorter than this length, the function returns 0. Because each 32-bit FNV value is mapped onto one of only 64 output characters, many collisions occur, and this check is one way to eliminate false positives.
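The first three fuzzy-matching steps can be sketched as small helper functions. All names here are hypothetical, and two details are assumptions rather than statements from the text: fuzzy hashes are parsed as plain `BS:hash1:hash2` strings, and long runs are trimmed to at most three identical characters (as ssdeep-style implementations do) rather than removed entirely.

```python
def parse_fuzzy_hash(fuzzy_hash: str):
    """Split 'BS:hash1:hash2' into (block_size, hash1, hash2)."""
    block_size, hash1, hash2 = fuzzy_hash.split(":", 2)
    return int(block_size), hash1, hash2


def comparable_pairs(fh_a: str, fh_b: str):
    """Step 1: return the hash pairs that were computed at equal block sizes."""
    bs_a, a1, a2 = parse_fuzzy_hash(fh_a)
    bs_b, b1, b2 = parse_fuzzy_hash(fh_b)
    if bs_a == bs_b:
        return [(a1, b1), (a2, b2)]
    if bs_a == 2 * bs_b:
        return [(a1, b2)]    # a's hash1 and b's hash2 share block size bs_a
    if bs_b == 2 * bs_a:
        return [(a2, b1)]    # a's hash2 and b's hash1 share block size bs_b
    return []                # no common block size: similarity is 0


def squeeze_runs(text: str, max_run: int = 3) -> str:
    """Step 2 (assumed variant): trim runs of identical characters to max_run."""
    out = []
    for ch in text:
        if len(out) >= max_run and all(c == ch for c in out[-max_run:]):
            continue
        out.append(ch)
    return "".join(out)


def has_common_substring(s1: str, s2: str, min_len: int = 7) -> bool:
    """Step 3: do the two strings share a substring of at least min_len chars?"""
    if len(s1) < min_len or len(s2) < min_len:
        return False
    grams = {s1[i:i + min_len] for i in range(len(s1) - min_len + 1)}
    return any(s2[j:j + min_len] in grams
               for j in range(len(s2) - min_len + 1))
```

Checking for a shared substring of exactly `min_len` characters is equivalent to checking for one of at least that length, since any longer common substring contains a common one of length `min_len`.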
In the fourth step, the Wagner-Fischer algorithm is used to compute the Levenshtein distance, with the operation costs defined by the recurrence below. The edit distance of two strings $a$ and $b$ is denoted $\operatorname{lev}_{a,b}(|a|, |b|)$, where $|a|$ and $|b|$ are the lengths of $a$ and $b$ respectively. The edit distance problem is to find the least costly series of edit operations that converts string $a$ into string $b$; the allowed edit operations are character insertion, character deletion, and character replacement.
The edit distance of two strings $a$ and $b$ can be described mathematically as:
$$\operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\ \operatorname{lev}_{a,b}(i,j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$
Here $\operatorname{lev}_{a,b}(i,j)$ denotes the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$; $i$ and $j$ can be regarded as the lengths of the prefixes of $a$ and $b$ under consideration. Since the first character index of a string starts from 1, the final edit distance is the distance when $i = |a|$ and $j = |b|$:
$$\operatorname{lev}_{a,b}(|a|, |b|)$$
When $\min(i,j) = 0$, one of $i$ and $j$ is 0, which means that one of the two prefixes (the first $i$ characters of $a$ and the first $j$ characters of $b$) is an empty string. Converting one into the other then requires only $\max(i,j)$ single-character edit operations, so the edit distance between them is $\max(i,j)$, i.e., the larger of $i$ and $j$.
When $\min(i,j) \neq 0$, $\operatorname{lev}_{a,b}(i,j)$ is the minimum of the following three cases:
$\operatorname{lev}_{a,b}(i-1,j) + 1$, which represents deleting $a_i$;
$\operatorname{lev}_{a,b}(i,j-1) + 1$, which represents inserting $b_j$;
$\operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)}$, which represents replacing $a_i$ with $b_j$.
$1_{(a_i \neq b_j)}$ is an indicator function that equals 0 when $a_i = b_j$ and 1 when $a_i \neq b_j$.
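The recurrence can be evaluated bottom-up with a table, which is the Wagner-Fischer dynamic programme. The sketch below uses unit cost for all three operations, as the indicator function in the recurrence implies; the function name `wagner_fischer` is chosen for the example.

```python
def wagner_fischer(a: str, b: str) -> int:
    """Levenshtein distance between a and b via the Wagner-Fischer table."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds lev(i, j): distance between a[:i] and b[:j].
    dist = [[0] * cols for _ in range(rows)]

    # Base cases, min(i, j) == 0: the distance is max(i, j).
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else 1  # indicator function
            dist[i][j] = min(
                dist[i - 1][j] + 1,                  # delete a_i
                dist[i][j - 1] + 1,                  # insert b_j
                dist[i - 1][j - 1] + substitution,   # replace a_i with b_j
            )
    return dist[-1][-1]   # lev(|a|, |b|)
```

Calling `wagner_fischer("Angel", "Angle")` returns 2, the value quoted in Definition 6: the last two characters are each replaced once.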
The fifth step scales the edit distance to an output between 0 and 100, following the original fuzzy hash algorithm. A score of 100 means that the vulnerability function and the target function are identical, and 0 means that they are not similar at all.
The result is a similarity score that can be used to judge whether the vulnerability function and the target function are similar.
A similarity threshold is set for fuzzy matching [threshold symbol or value given only in a missing figure]. If the similarity between the target function and the vulnerability function is lower than the threshold, the target function is directly judged not to be similar to the vulnerability function.
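Scaling and thresholding can be sketched as below. The exact scaling formula of the original fuzzy hash algorithm is not given in the text, so this example assumes a simple normalization by the longer string's length; the function names and any threshold value passed in are likewise hypothetical.

```python
def similarity_score(distance: int, len_a: int, len_b: int) -> int:
    """Map an edit distance onto 0..100 (assumed normalization, see lead-in).

    With unit edit costs the distance never exceeds the longer length,
    so the score stays within 0..100.
    """
    longest = max(len_a, len_b)
    if longest == 0:
        return 100                       # two empty hashes are identical
    return 100 - (100 * distance) // longest


def passes_fuzzy_match(score: int, threshold: int) -> bool:
    """A candidate moves on to exact matching only at or above the threshold."""
    return score >= threshold
```

Under this normalization, two five-character hashes at edit distance 2 score 60; whether that passes depends entirely on the threshold chosen, which the text does not specify.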
The final exact matching step is multi-pattern matching based on the Aho-Corasick (AC) algorithm.
The method takes the syntax structure of the target function as the main string and all difference code lines in the vulnerability function fingerprint as the pattern strings. The vulnerability is deemed to exist in the target function if every deleted-line difference fingerprint is matched exactly in the syntax structure of the target function and no added-line difference fingerprint is matched exactly in it.
The AC algorithm is implemented in two steps: preprocessing and matching.
Preprocessing: a finite-state pattern matching machine is constructed from the keywords, that is, an automaton containing all the keywords. The automaton relies on the following three functions.
Goto: this function stores the edges of the Trie built from all keywords in the automaton. It is represented as a two-dimensional array that stores, for each current state and input character, the next state.
Failure: this function stores the state to fall back to when the current character has no edge in the Trie. It is represented as a one-dimensional array that stores the fallback state of each state.
Output: this function stores the indices of all keywords that end at the current state. It is represented as a one-dimensional array in which the indices of all matching keywords are stored as a bitmap for each state.
Matching: the given text is traversed on the constructed automaton to find all matching keywords, starting from the first character of the main string and the initial state 0 of the automaton. If a character matches, the automaton moves to the next state according to its goto function; if the state reached has an output, the matched pattern strings are output; if a character fails to match, the automaton falls back recursively according to its failure function.
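The preprocessing and matching steps above, together with the deleted-line and added-line decision rule, can be sketched as follows. This is a minimal illustration under assumptions: the names `build_automaton`, `find_patterns` and `is_vulnerable` are hypothetical, dictionaries and sets stand in for the two-dimensional array and bitmap mentioned in the text, and the example fingerprint strings are invented.

```python
from collections import deque


def build_automaton(patterns):
    """Preprocessing: build the goto, failure and output functions."""
    goto = [{}]    # goto[state][char] -> next state (Trie edges)
    fail = [0]     # fail[state] -> fallback state
    out = [set()]  # out[state] -> indices of patterns ending here
    for idx, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(idx)
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches from the fallback state
    return goto, fail, out


def find_patterns(text, patterns):
    """Matching: return the indices of all patterns that occur in text."""
    goto, fail, out = build_automaton(patterns)
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]  # follow failure links until an edge exists
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


def is_vulnerable(target_syntax, deleted_lines, added_lines):
    """All deleted lines must match and no added line may match."""
    patterns = list(deleted_lines) + list(added_lines)
    found = find_patterns(target_syntax, patterns)
    n_deleted = len(deleted_lines)
    all_deleted = all(i in found for i in range(n_deleted))
    no_added = not any(i in found for i in range(n_deleted, len(patterns)))
    return all_deleted and no_added
```

A single pass over the target syntax structure checks every difference line at once, which is the reason for using a multi-pattern automaton instead of one search per line.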
Following the idea and implementation steps of the vulnerability detection method based on function-level code similarity, a prototype system was run on five open-source projects selected for vulnerability detection.
The project sizes range from 13.1 MB to 965 MB, and the number of C functions ranges from 1,161 to 435,734.
The code clone detection precision was 77.3% and the recall was 75.6%.
Fingerprint generation was completed in only 28 hours, and clones were detected in a target of 1 GB.
The model can extend the detection range to large-scale code repositories and balances overhead and performance in scenarios with high-frequency incremental code.
Based on open-source function-level vulnerability source code and combined with fuzzy hash values, the invention designs vulnerability function fingerprints and provides a vulnerability detection method based on function-level code similarity.
First, the source code of open-source vulnerability functions and of target functions is preprocessed with custom syntax abstraction and normalization rules. Next, the vulnerability function fingerprint and the target function fingerprint are generated, using the vulnerability function body and the added and deleted code lines in the corresponding patch file. The fingerprint-based vulnerability detection process then performs two matching steps: fuzzy matching based on the Wagner-Fischer algorithm and multi-pattern exact matching based on the Aho-Corasick algorithm. Finally, experiments show a code clone detection precision of 77.3% and a recall of 75.6%.
Fingerprint generation was completed in only 28 hours and clones were detected in a 1 GB target, which verifies the correctness and effectiveness of the method.
The invention avoids generating complex intermediate representations while retaining the basic syntax structure, which safeguards the performance of the detection model. In particular, detection precision is not affected by syntactically meaningless modifications, clones of Type 1 to Type 3 can be detected, and vulnerable code is automatically distinguished from patched code.
The scalability of vulnerability detection is improved while low false positive and false negative rates are maintained.
The model can extend the detection range to large-scale code repositories and balances overhead and performance in scenarios with high-frequency incremental code.
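As a rough illustration of the preprocessing summarized above, the comment removal and normalization rules (remove comments with Python regular expressions, delete spaces, tabs, line feeds and braces, convert to lower case) might look like the sketch below. The names `strip_comments` and `normalize` are hypothetical, the brace removal follows our reading of the normalization rule, and syntax abstraction is left out because it needs the parser output.

```python
import re

# Matches string and character literals (kept) or comments (dropped).
_COMMENT_RE = re.compile(
    r'"(?:\\.|[^"\\])*"'   # string literal
    r"|'(?:\\.|[^'\\])*'"  # character literal
    r"|/\*.*?\*/"          # block comment
    r"|//[^\n]*",          # line comment
    re.DOTALL,
)


def strip_comments(source: str) -> str:
    """Remove C comments while leaving literals such as "a//b" intact."""
    def replace(match):
        text = match.group(0)
        # Only comments start with '/'; literals start with a quote.
        return " " if text.startswith("/") else text
    return _COMMENT_RE.sub(replace, source)


def normalize(body: str) -> str:
    """Delete whitespace and braces, then convert to lower case."""
    return re.sub(r"[\s{}]+", "", body).lower()


if __name__ == "__main__":
    code = "int F(int A) { /* check */\n\treturn A; // done\n}"
    print(normalize(strip_comments(code)))
```

Matching literals in the same pattern as comments is a design choice that keeps a `//` inside a string from being treated as a comment.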

Claims (1)

1. A vulnerability detection method based on function-level code similarity is characterized by comprising the following steps:
A. the vulnerability function fingerprint library construction module, directed at C language code containing vulnerabilities, uses Python regular expressions to match and remove comments in the extracted source code, collects the commit files and corresponding patch files of all CVE vulnerabilities from the CVE project repositories on GitHub to establish a vulnerability database, and extracts all vulnerability functions and the added lines and deleted lines in the corresponding patch files;
B. a grammar rule file for the C language is written with Antlr4, abstract syntax trees of all vulnerability functions are generated from the C source code files, the abstract syntax trees are then converted into token sequences, and syntax abstraction and normalization are performed, with the following specific steps:
b-1. syntax abstraction: extracting from the token sequence the function body, the vulnerability source (that is, the location of the file it belongs to), the function name, the formal parameter list, the local variable list, the data type list and the function call list; replacing the function name with the symbol funneame, replacing each parameter variable appearing in the function body with the symbol forpar, replacing each local variable appearing in the function body with the symbol LOVAR, replacing all custom data type declarations, except those declared in the ISO C standard, with the symbol CUSTYPE, and replacing each function call, except calls to C standard library functions, with the symbol funclean;
b-2. normalization: deleting spaces, tab characters and line feed characters, deleting all '{' and '}' characters, and converting all characters to lower case, so as to normalize the vulnerability function body;
C. the vulnerability function syntax structure generated after syntax abstraction and normalization comprises a difference structure and a function body structure; for the latter, a fuzzy hash value based on the CTPH algorithm is generated, with the following specific process:
c-1. slicing: reading part of the content of the vulnerability function and computing a hash value with a weak hash algorithm, using a rolling hash algorithm to generate a pseudo-random value from the current context of the input; assume an input of $n$ characters, where the $i$-th byte of the input is denoted by $b_i$, so that the input as a whole consists of the bytes $b_1, b_2, \ldots, b_n$; at any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file; hence the rolling hash value $r$ can be expressed as a function of the last few bytes, as shown in the following equation:
$$r_p = F(b_p, b_{p-1}, b_{p-2}, \ldots, b_{p-s})$$
after the slices are determined, the Adler-32 algorithm is used as the weak hash;
c-2. hashing each slice: after the vulnerability function is sliced, a hash value is computed for each slice using the Fowler-Noll-Vo (FNV) hash algorithm;
c-3. compression mapping: after a hash value has been computed for each slice of the vulnerability function, the result may optionally be shortened: only the lowest 6 bits of the FNV hash value are kept and represented by one ASCII character, which serves as the final hash result of the slice;
c-4. output: the final hash results of all slices are concatenated to obtain the fuzzy hash value of the vulnerability function, which has the form BS:hash1:hash2, where BS is the block size (only hashes with the same block size can be compared), hash1 is the concatenation of the final hash results of every block in the file, and hash2 is the same as hash1 but computed with twice the block size;
D. the target function fingerprint generation module generates the target function fingerprint: for the C language source code to be detected, comments are removed, abstract syntax trees of all target functions are generated from the C source code files, the abstract syntax trees are converted into token sequences, and syntax abstraction and normalization are performed on the target functions to be detected, with the following specific steps:
d-1. syntax abstraction: extracting from the token sequence the target function body, the source of the target function (that is, the location of the file it belongs to), the function name, the formal parameter list, the local variable list, the data type list and the function call list; replacing, in turn, the function name, formal parameters, local variables, custom data types and custom function calls in the body of the target function to realize syntax abstraction;
d-2. normalization: after variable replacement, deleting spaces, tab characters and line feed characters in the target function, deleting all '{' and '}' characters, and converting all characters to lower case;
after syntax abstraction and normalization, the generated syntax structure of the target function is retained and a fuzzy hash based on the CTPH algorithm is generated; the two together form the intermediate representation of the target function, where the fuzzy hash value based on the CTPH algorithm is generated by the following specific process:
d-3. slicing: reading part of the content of the target function and computing a hash value with a weak hash algorithm, using a rolling hash algorithm to generate a pseudo-random value from the current context of the input; assume an input of $n$ characters, where the $i$-th byte of the input is denoted by $b_i$, so that the input as a whole consists of the bytes $b_1, b_2, \ldots, b_n$; at any position $p$ in the input, the state of the rolling hash depends only on the last $s$ bytes of the file; hence the rolling hash value $r$ can be expressed as a function of the last few bytes, as shown in the following equation:
$$r_p = F(b_p, b_{p-1}, b_{p-2}, \ldots, b_{p-s})$$
after the slices are determined, the Adler-32 algorithm is used as the weak hash;
d-4. hashing each slice: after the target function is sliced, a hash value is computed for each slice using the Fowler-Noll-Vo (FNV) hash algorithm;
d-5. compression mapping: after a hash value has been computed for each slice of the target function, the result may optionally be shortened: only the lowest 6 bits of the FNV hash value are kept and represented by one ASCII character, which serves as the final hash result of the slice;
d-6. output: the final hash results of all slices are concatenated to obtain the fuzzy hash value of the target function, which has the form BS:hash1:hash2, where BS is the block size (only hash values with the same block size can be compared), hash1 is the concatenation of the final hash results of every block in the file, and hash2 is the same as hash1 but computed with twice the block size;
E. vulnerability detection based on function fingerprints comprises two matching steps: fuzzy matching based on the Wagner-Fischer algorithm and multi-pattern exact matching based on the Aho-Corasick algorithm; a vulnerability is detected only when both matching steps succeed;
fuzzy matching based on the Wagner-Fischer algorithm specifically comprises the following steps:
e-1. comparing block sizes: only hash values computed with the same block size can be compared; a fuzzy hash string contains hash values for both the block size and twice the block size, so an attempt is made to match at least one pair of hash values, and if the two strings have no block size in common, the comparison returns 0;
e-2. deleting sequences of three or more identical characters, which carry almost no information about the file and bias the matching score;
e-3. testing for a common substring of at least 7 characters (7 is the default value and can be changed), and returning 0 if the longest common substring is shorter than this length;
e-4. the edit distance between two strings $a, b$ is expressed as $\mathrm{lev}_{a,b}(|a|, |b|)$, where $|a|$ and $|b|$ are the lengths of $a$ and $b$ respectively; the edit distance problem is to find the least costly series of editing operations that converts string $a$ into string $b$, the allowed editing operations being character insertion, character deletion and character replacement; the edit distance of the two strings $a$ and $b$ can be described mathematically as follows:
$$\mathrm{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1,j) + 1 \\ \mathrm{lev}_{a,b}(i,j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise,} \end{cases}$$
where $\mathrm{lev}_{a,b}(i,j)$ is defined as the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$, and [Figure DEST_PATH_IMAGE019] is the length of $a, b$;
e-5. following the original fuzzy hash algorithm, the edit distance is scaled so that the output lies between 0 and 100, where 100 means that the vulnerability function and the target function are completely identical and 0 means that they are completely dissimilar; a similarity threshold is set for fuzzy matching
Figure DEST_PATH_IMAGE020
If the similarity is lower than the threshold value, directly judging that the target function and the vulnerability function have no similarity;
multi-pattern matching based on the Aho-Corasick algorithm specifically comprises: taking the syntax structure of the target function as the main string and all difference code lines in the vulnerability function fingerprint as the pattern strings, and constructing an AC automaton to perform multi-pattern exact matching, with the following specific steps:
e-6. preprocessing: constructing a finite-state pattern matching machine from the keywords, that is, an automaton containing all the keywords;
e-7. matching: traversing the given text on the constructed automaton to find all matching keywords, starting from the first character of the main string and the initial state 0 of the automaton; if a character matches, moving to the next state according to the goto function of the automaton; if the state reached has an output, outputting the matched pattern strings; if a character fails to match, falling back recursively according to the failure function of the automaton.
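The fuzzy hash generation of steps c-1 to c-4 (and d-3 to d-6) can be sketched as follows. This is a simplified illustration, not the claimed implementation. Several details are assumptions: the 7-byte window, the trigger rule `weak % block_size == block_size - 1`, the Base64 alphabet used for the 6-bit mapping, the 32-bit FNV-1 constants, and the fixed default block size. The names `fnv1_32`, `piece_digest` and `fuzzy_hash` are hypothetical. For brevity, Adler-32 is recomputed over each window with `zlib.adler32` instead of being updated incrementally.

```python
import zlib

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7  # assumed context size for the weak hash


def fnv1_32(data: bytes) -> int:
    """32-bit FNV-1 hash: multiply by the prime, then XOR each byte."""
    h = 0x811C9DC5
    for byte in data:
        h = (h * 0x01000193) & 0xFFFFFFFF
        h ^= byte
    return h


def piece_digest(data: bytes, block_size: int) -> str:
    """Slice at trigger points and map each slice to one character."""
    chars, start = [], 0
    for pos in range(len(data)):
        window = data[max(0, pos - WINDOW + 1): pos + 1]
        weak = zlib.adler32(window)  # weak hash of the recent context
        if weak % block_size == block_size - 1:  # trigger: close the slice
            chars.append(B64[fnv1_32(data[start:pos + 1]) & 0x3F])
            start = pos + 1
    if start < len(data):  # trailing slice without a trigger
        chars.append(B64[fnv1_32(data[start:]) & 0x3F])
    return "".join(chars)


def fuzzy_hash(text: str, block_size: int = 3) -> str:
    """Return 'BS:hash1:hash2', with hash2 using twice the block size."""
    data = text.encode("utf-8")
    hash1 = piece_digest(data, block_size)
    hash2 = piece_digest(data, 2 * block_size)
    return f"{block_size}:{hash1}:{hash2}"
```

Because slice boundaries depend only on the local context, a small change in a function body alters only the characters for the affected slices, which is what makes the later edit distance comparison meaningful.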
CN202111071388.0A 2021-09-13 2021-09-13 Vulnerability detection method based on function-level code similarity Active CN113901474B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111071388.0A CN113901474B (en) 2021-09-13 2021-09-13 Vulnerability detection method based on function-level code similarity

Publications (2)

Publication Number Publication Date
CN113901474A CN113901474A (en) 2022-01-07
CN113901474B true CN113901474B (en) 2022-07-26

Family

ID=79028063

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111071388.0A Active CN113901474B (en) 2021-09-13 2021-09-13 Vulnerability detection method based on function-level code similarity

Country Status (1)

Country Link
CN (1) CN113901474B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356405B (en) * 2022-03-21 2022-05-17 思探明信息科技(南京)有限公司 Matching method and device of open source component function, computer equipment and storage medium
CN114781008B (en) * 2022-04-15 2022-10-28 山东省计算中心(国家超级计算济南中心) Data identification method and device for security detection of terminal firmware of Internet of things
CN114491566B (en) * 2022-04-18 2022-07-05 中国长江三峡集团有限公司 Fuzzy test method and device based on code similarity and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN107688748A (en) * 2017-09-05 2018-02-13 中国人民解放军信息工程大学 Fragility Code Clones detection method and its device based on leak fingerprint
CN109635569A (en) * 2018-12-10 2019-04-16 国家电网有限公司信息通信分公司 A kind of leak detection method and device
US10754958B1 (en) * 2016-09-19 2020-08-25 Nopsec Inc. Vulnerability risk mitigation platform apparatuses, methods and systems

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10514909B2 (en) * 2017-03-29 2019-12-24 Technion Research & Development Foundation Limited Similarity of binaries
CN108491228B (en) * 2018-03-28 2020-03-17 清华大学 Binary vulnerability code clone detection method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10754958B1 (en) * 2016-09-19 2020-08-25 Nopsec Inc. Vulnerability risk mitigation platform apparatuses, methods and systems
CN107688748A (en) * 2017-09-05 2018-02-13 中国人民解放军信息工程大学 Fragility Code Clones detection method and its device based on leak fingerprint
CN109635569A (en) * 2018-12-10 2019-04-16 国家电网有限公司信息通信分公司 A kind of leak detection method and device

Non-Patent Citations (1)

Title
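Qiu Yaoyao et al. Malicious JavaScript code detection method based on semantic analysis. Journal of Sichuan University (Natural Science Edition). 2019, Vol. 56 (No. 2), pp. 273-277. *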

Also Published As

Publication number Publication date
CN113901474A (en) 2022-01-07

Similar Documents

Publication Publication Date Title
CN113901474B (en) Vulnerability detection method based on function-level code similarity
CN109445834B (en) Program code similarity rapid comparison method based on abstract syntax tree
CN101978348B (en) Manage the archives about approximate string matching
CN109359439B (en) software detection method, device, equipment and storage medium
CN111290784B (en) Program source code similarity detection method suitable for large-scale samples
CN109885479B (en) Software fuzzy test method and device based on path record truncation
US8391614B2 (en) Determining near duplicate “noisy” data objects
KR101627592B1 (en) Detection of confidential information
Breitinger et al. Approximate matching: definition and terminology
CN112651028B (en) Vulnerability code clone detection method based on context semantics and patch verification
CN111310178B (en) Firmware vulnerability detection method and system in cross-platform scene
Liu et al. Vfdetect: A vulnerable code clone detection system based on vulnerability fingerprint
CN113297580B (en) Code semantic analysis-based electric power information system safety protection method and device
CN109858025B (en) Word segmentation method and system for address standardized corpus
US20230418578A1 (en) Systems and methods for detection of code clones
Sheneamer et al. Code clone detection using coarse and fine-grained hybrid approaches
CN116149669B (en) Binary file-based software component analysis method, binary file-based software component analysis device and binary file-based medium
CN113961768B (en) Sensitive word detection method and device, computer equipment and storage medium
CN113688240B (en) Threat element extraction method, threat element extraction device, threat element extraction equipment and storage medium
CN114266046A (en) Network virus identification method and device, computer equipment and storage medium
CN114510717A (en) ELF file detection method and device and storage medium
CN111562943B (en) Code clone detection method and device based on event embedded tree and GAT network
WO2021160822A1 (en) A method for linking a cve with at least one synthetic cpe
Ullah et al. Efficient features for function matching in multi-architecture binary executables
KR102449691B1 (en) Binary diffing algorithm and system thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant