CN115904486A

CN115904486A - Code similarity detection method and device

Info

Publication number: CN115904486A
Application number: CN202110932314.5A
Authority: CN
Inventors: 周艳; 施勇
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2021-08-13
Filing date: 2021-08-13
Publication date: 2023-04-04

Abstract

The application discloses a code similarity detection method comprising: acquiring source code to be detected; compiling the source code with a Low Level Virtual Machine (LLVM) compiler to obtain a target intermediate representation (IR) of the source code; identifying opcode-related fragments in the target IR to obtain a target token sequence; and comparing the target token sequence against a code library to obtain a similarity detection result for the source code. Because the target IR produced by the LLVM compiler fully preserves the semantic information of the source code and is unaffected by non-functional content such as whitespace and comments, high similarity recognition accuracy can be ensured. Information such as opcodes and called functions is extracted as feature vectors, and, without requiring complex semantic extraction, the optimization capability of the LLVM compiler can be used to uncover deeper clone pairs, enabling detection over massive amounts of code.

Description

Code similarity detection method and device

Technical Field

The present application relates to the field of information technologies, and in particular, to a method and an apparatus for detecting code similarity.

Background

To improve efficiency, software developers often copy and paste code (i.e., code cloning) during development to reuse the same functionality. Although this helps development to some extent, it makes maintenance harder.

Code cloning improves productivity but also introduces serious security risks. For example, if the original code contains a bug, the bug may remain in the cloned copies even after it has been fixed in the original, which can pose a greater security risk than the original bug. Code clone detection is therefore required, and it can be applied in many fields and scenarios. In a plagiarism detection scenario, it can determine whether target code is similar to existing copyrighted code. In a malware analysis scenario, it can determine whether target code is similar to known malicious code, and thus whether the target code is malware. In a known-vulnerability matching scenario, it can determine whether target code is similar to code with a known vulnerability, and thus whether that vulnerability exists in the target code.

In practice, similar code must be found quickly among a large amount of code. In a software copyright scenario, for example, it may be necessary to determine whether plagiarism exists within an entire project or even across different projects. A code similarity detection method capable of handling massive amounts of code is therefore urgently needed.

Disclosure of Invention

In a first aspect, the present application provides a code similarity detection method, including:

acquiring source code to be detected; compiling the source code to be detected with a Low Level Virtual Machine (LLVM) compiler to obtain a target intermediate representation (IR) of the source code to be detected; identifying opcode-related segments in the target IR to obtain a target token sequence; and comparing the target token sequence against a code library to obtain a similarity detection result for the source code to be detected.

An embodiment of the application provides a code similarity detection method comprising the following steps: acquiring source code to be detected; compiling the source code to be detected with an LLVM compiler to obtain a target IR of the source code to be detected; identifying opcode-related segments in the target IR to obtain a target token sequence; and comparing the target token sequence against a code library to obtain a similarity detection result for the source code to be detected.

Binaries generated on different platforms, for different CPU architectures, or with different compiler versions and compile options often differ greatly, which strongly affects similarity comparison. The IR produced by the LLVM compiler hides these back-end differences and therefore addresses this problem well.

The target IR produced by the LLVM compiler fully preserves the semantic information of the source code and is unaffected by non-functional content such as whitespace and comments, so high similarity recognition accuracy can be ensured. Information such as opcodes and called functions is extracted as feature vectors, and, without requiring complex semantic extraction, the optimization capability of the LLVM compiler can be used to uncover deeper clone pairs, enabling detection over massive amounts of code.

In one possible implementation, before the compiling the source code to be detected by the low-level virtual machine LLVM compiler, the method further includes:

and eliminating invalid code segments in the source code to be detected, wherein the invalid code segments are code segments which have no influence on an operation result in the source code to be detected.

Specifically, compiler optimization techniques may be used to apply deep optimizations such as dead code elimination (i.e., removing invalid code segments). An invalid code segment, also called dead code, may be a statement whose computed result is never used.

Invalid code segments have no substantive effect on the code. Removing them, together with content such as whitespace and comments, before the similarity comparison improves the accuracy of code similarity detection.

In one possible implementation, before the source code to be detected is compiled by the LLVM compiler, the method further includes: removing loop invariants from the loop bodies in the source code to be detected and moving them outside the loop bodies.

A loop invariant is an expression whose value does not change from one iteration to the next; the optimizer computes it once outside the loop and reuses the result inside the loop. The compiler's optimizer can find loop invariants and move them out of the loop body by means of code motion.

Taking loop invariant extraction as an example, res = 3 is the loop invariant. The code before optimization is:

int loop(int num)
{
    int i = 0;
    int res;
    for (i = 0; i < num; i++) {
        res = 3;
    }
    return res;
}

The optimized code is:

int loop1(int num)
{
    int i = 0;
    int res = 3;
    for (i = 0; i < num; i++) {
    }
    return res;
}

In one possible implementation, the opcodes contained in the target IR may be identified to obtain the target token sequence, where the target token sequence includes the opcodes. For example, the IR file (.bc) generated by the build may be parsed function by function, instruction by instruction, to extract the relevant semantic information (for an ordinary IR instruction, the opcode is extracted).

In one possible implementation, instructions containing pointer operations in the target IR may be identified to obtain the target token sequence, where the target token sequence includes the opcode of each pointer-operation instruction and the object that the pointer operation points to. For example, the IR file (.bc) generated by the build may be parsed function by function to extract the relevant semantic information (for a pointer-type operation, the opcode and the type pointed to by the pointer are extracted).

In one possible implementation, call instructions in the target IR may be identified to obtain the target token sequence, where the target token sequence includes the function called by each call instruction. For example, the IR file (.bc) generated by the build may be parsed function by function to extract the relevant semantic information (for a call instruction, the name of the called function is extracted).
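The three extraction rules above can be illustrated with a minimal sketch. It makes several assumptions that are not in the application: it reads textual IR (as produced, for example, by disassembling a .bc file) instead of parsing bitcode through the LLVM API, it uses simple regular expressions, and the names extract_tokens, POINTER_OPS and the token spelling "opcode:detail" are hypothetical.

```python
import re

# Opcodes treated as pointer operations in this sketch (an assumption).
POINTER_OPS = {"load", "store", "getelementptr", "alloca"}
# Keywords that may appear between the opcode and the first type.
FLAGS = {"volatile", "atomic", "inbounds"}

# Optional "%name = " prefix, optional tail-call marker, then the opcode.
INSTR = re.compile(
    r"^\s*(?:%[\w.]+\s*=\s*)?(?:(?:tail|musttail|notail)\s+)?([a-z_]+)\b(.*)$")
CALLEE = re.compile(r"@([\w.$]+)\s*\(")

def extract_tokens(ir_text):
    """Return {function name: token list} for a textual LLVM IR module."""
    functions, current = {}, None
    for line in ir_text.splitlines():
        line = line.split(";", 1)[0].rstrip()      # drop IR comments
        if line.startswith("define"):
            current = functions.setdefault(CALLEE.search(line).group(1), [])
            continue
        if line.startswith("}"):
            current = None
            continue
        if current is None or not line.strip() or line.endswith(":"):
            continue                               # outside a function, blank, or a label
        match = INSTR.match(line)
        if not match:
            continue
        opcode, rest = match.group(1), match.group(2)
        if opcode == "call":                       # call: keep the callee name
            callee = CALLEE.search(rest)
            current.append("call:" + (callee.group(1) if callee else "<indirect>"))
        elif opcode in POINTER_OPS:                # pointer op: opcode plus accessed type
            words = [w for w in rest.replace(",", " ").split() if w not in FLAGS]
            current.append(opcode + ":" + (words[0] if words else "?"))
        else:                                      # ordinary instruction: opcode only
            current.append(opcode)
    return functions

SAMPLE_IR = """
define i32 @sum(i32* %p, i32 %n) {
entry:
  %0 = load i32, i32* %p, align 4
  %add = add nsw i32 %0, %n
  %call = call i32 @abs(i32 %add)
  store i32 %call, i32* %p, align 4
  ret i32 %call
}
"""

print(extract_tokens(SAMPLE_IR))
```

Whitespace, comments, and local variable names do not appear in the resulting tokens, which is the property the text relies on for robust comparison.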

In one possible implementation, the code library includes a plurality of documents, each of the documents corresponds to a token sequence, each of the token sequences is obtained according to a candidate code, and each of the documents includes a minimum hash signature of the corresponding token sequence;

the similarity comparison of the target token sequence and a code library comprises:

determining a target minimum hash signature of the target token sequence by means of the minimum hash (MinHash) algorithm;

and comparing the target minimum hash signature with the minimum hash signature included in each document.

The code library may comprise documents made up of token sequences, and may be stored as a key-value structure in which the key is the document name and the value is the MinHash signature of the document.

In one possible implementation, comparing the target minimum hash signature with the minimum hash signature included in each document includes:

comparing the target minimum hash signature with the minimum hash signature included in each document by means of a locality-sensitive hashing (LSH) algorithm.

The MinHash algorithm turns token comparison into signature comparison, which is both accurate and efficient, and the MinHash LSH algorithm makes comparison over massive data feasible. In the embodiments of the application, source code is converted into IR by LLVM, which preserves the semantic information of the source code, and combining this with the MinHash LSH algorithm enables detection over massive amounts of code.

In one possible implementation, the similarity detection result includes at least one candidate code, and the similarity between the token sequence corresponding to each candidate code and the target token sequence is higher than a threshold;

the method further comprises the following steps:

presenting target information, wherein the target information is used for indicating difference information between each candidate code in the at least one candidate code and the code to be detected.
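The application does not specify how the difference information is computed or displayed. One possible form, shown here purely as an assumption, is a unified text diff between a candidate code and the code to be detected; the function name difference_report is hypothetical.

```python
import difflib

def difference_report(candidate_name, candidate_code, detected_code):
    """Return a unified diff between one candidate code and the code to be detected."""
    diff = difflib.unified_diff(
        candidate_code.splitlines(),
        detected_code.splitlines(),
        fromfile=candidate_name,
        tofile="code to be detected",
        lineterm="",
    )
    return "\n".join(diff)

candidate = "int add(int a, int b) {\n    return a + b;\n}"
detected = "int add(int x, int y) {\n    return x + y;\n}"
print(difference_report("lib/math.c", candidate, detected))
```

An empty report means the two texts are identical line by line.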

In one possible implementation, the source code to be detected includes a compiling instruction of a code, and the compiling instruction is a clang compiling command.

In a second aspect, the present application provides a code similarity detection apparatus, including:

the acquisition module is used for acquiring a source code to be detected;

the compiling module is used for compiling the source code to be detected with a Low Level Virtual Machine (LLVM) compiler to obtain a target intermediate representation (IR) of the source code to be detected;

the identification module is used for identifying opcode-related segments in the target IR to obtain a target token sequence;

and the similarity detection module is used for comparing the target token sequence against a code library to obtain a similarity detection result for the source code to be detected.

The target IR produced by the LLVM compiler fully preserves the semantic information of the source code and is unaffected by non-functional content such as whitespace and comments, so high similarity recognition accuracy can be ensured. Information such as opcodes and called functions is extracted as feature vectors, and, without requiring complex semantic extraction, the optimization capability of the LLVM compiler can be used to uncover deeper clone pairs, enabling detection over massive amounts of code.

In one possible implementation, the apparatus further comprises:

and the code optimization module is used for eliminating invalid code fragments in the source code to be detected before the source code to be detected is compiled through the LLVM compiler, wherein the invalid code fragments are code fragments which have no influence on an operation result in the source code to be detected.

In one possible implementation, the apparatus further comprises:

and the code optimization module is used for removing loop invariants from the loop bodies in the source code to be detected and moving them outside the loop bodies before the source code to be detected is compiled by the LLVM compiler.

In a possible implementation, the identification module is specifically configured to:

identifying the opcodes contained in the target IR to obtain a target token sequence, wherein the target token sequence includes the opcodes; and/or

identifying instructions containing pointer operations in the target IR to obtain a target token sequence, wherein the target token sequence includes the opcode of each instruction containing a pointer operation and the object pointed to by the pointer operation; and/or

identifying call instructions in the target IR to obtain a target token sequence, wherein the target token sequence includes the function called by each call instruction.

the similarity detection module is specifically configured to:

determining a target minimum hash signature of the target token sequence by means of the minimum hash (MinHash) algorithm;

and comparing the target minimum hash signature with the minimum hash signature included in each document.

In a possible implementation, the similarity detection module is specifically configured to:

compare the target minimum hash signature with the minimum hash signature included in each document by means of a locality-sensitive hashing (LSH) algorithm.

the device further comprises:

and the result presenting module is used for presenting target information, and the target information is used for indicating the difference information between each candidate code in the at least one candidate code and the code to be detected.

In a third aspect, an embodiment of the present application provides a code similarity detection apparatus, which may include a memory, a processor, and a bus system, where the memory is used to store a program, and the processor is used to execute the program in the memory to perform the foregoing first aspect and any optional method thereof.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the first aspect and any optional method thereof.

In a fifth aspect, embodiments of the present application provide a computer program, which when run on a computer, causes the computer to perform the first aspect and any optional method thereof.

In a sixth aspect, the present application provides a chip system, which includes a processor configured to support a code similarity detection apparatus in implementing the functions referred to in the above aspects, for example, transmitting or processing the data and/or information referred to in the above methods. In one possible design, the chip system further includes a memory for storing the program instructions and data necessary for the code similarity detection apparatus. The chip system may consist of a chip, or may include a chip and other discrete devices.

An embodiment of the application provides a code similarity detection method comprising the following steps: acquiring source code to be detected; compiling the source code to be detected with an LLVM compiler to obtain a target IR of the source code to be detected; identifying opcode-related fragments in the target IR to obtain a target token sequence; and comparing the target token sequence against a code library to obtain a similarity detection result for the source code to be detected.

The target IR produced by the LLVM compiler fully preserves the semantic information of the source code and is unaffected by non-functional content such as whitespace and comments, so high similarity recognition accuracy can be ensured. Information such as opcodes and called functions is extracted as feature vectors, and the optimization capability of the LLVM compiler is used to uncover deeper clone pairs.

Drawings

Fig. 1a is a schematic structural diagram of a server according to an embodiment of the present application;

fig. 1b is a schematic diagram of an LLVM provided in an embodiment of the present application;

fig. 2 is a schematic diagram of a code similarity detection method provided in an embodiment of the present application;

fig. 3 is a schematic diagram of a code similarity detection method provided in an embodiment of the present application;

fig. 4a is a schematic diagram of a code similarity detection method provided in an embodiment of the present application;

fig. 4b is a schematic storage format of a code library provided in an embodiment of the present application;

fig. 5 is a schematic diagram of a code similarity detection method provided in an embodiment of the present application;

fig. 6a is a schematic diagram of a code similarity detection method provided in an embodiment of the present application;

fig. 6b is a schematic diagram of a detection result provided in the embodiment of the present application;

fig. 6c is a schematic diagram of a detection result provided in an embodiment of the present application;

fig. 6d is a schematic diagram of a detection result provided in the embodiment of the present application;

fig. 6e is a schematic diagram of a detection result provided in an embodiment of the present application;

fig. 6f is a schematic diagram of a detection result provided in the embodiment of the present application;

fig. 6g is a schematic diagram of a detection result provided in the embodiment of the present application;

fig. 7 is a schematic structural diagram of a code similarity detection apparatus according to an embodiment of the present application;

fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application;

fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application;

fig. 10 is a schematic structural diagram of a chip according to an embodiment of the present disclosure.

Detailed Description

The embodiments of the present invention will be described below with reference to the drawings. The terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments of the invention only and is not intended to be limiting of the invention.

Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenes, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.

The terms "first," "second," and the like in the description and claims of this application and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the manner in which objects of the same nature are distinguished in the embodiments of the application. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

The method in the embodiments of the application is applied to one or more code similarity detection devices. A code similarity detection device may be a terminal or a server, and implements processes such as acquiring and processing data through software and/or hardware. Taking a server as an example, refer to fig. 1a, which is a schematic diagram of a server structure provided in an embodiment of the present application; the steps executed by the server in the embodiments of the present application may be based on the server structure shown in fig. 1a.

The server 100 may vary greatly in configuration or performance, and may include one or more Central Processing Units (CPUs) 122 (e.g., one or more processors) and memory 132, one or more storage media 130 (e.g., one or more mass storage devices) storing applications 142 or data 144. Memory 132 and storage medium 130 may be, among other things, transient or persistent storage. The program stored in the storage medium 130 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 122 may be configured to communicate with the storage medium 130 to execute a series of instruction operations in the storage medium 130 on the server 100.

The server 100 may also include one or more power supplies 126, one or more wired or wireless network interfaces 150, one or more input-output interfaces 158, and/or one or more operating systems 141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.

In this embodiment, the CPU 122 may acquire code from the storage medium 130 to execute the code similarity detection method provided in this embodiment.

The following description will first discuss relevant terms related to embodiments of the present application.

(1) Low Level Virtual Machine (LLVM)

A traditional compiler is not designed in layers, so the front end and the back end are tightly coupled through shared data, which makes supporting a new programming language or a new target architecture particularly difficult. LLVM adopts the three-stage design shown in fig. 1b. As a result, supporting a new programming language only requires implementing a new front end, and supporting a new target architecture only requires implementing a new back end. The hub connecting the front end and the back end is the LLVM intermediate representation (IR).

LLVM IR is essentially a generic intermediate representation that is independent of both the source programming language and the target machine architecture. It is the core design, and the greatest advantage, of the LLVM project.

(2) Minimum hash (MinHash)

A basic problem in data mining is comparing the similarity of two sets. The similarity is usually obtained by traversing all elements of both sets and counting the elements they have in common, as is done, for example, when computing the similarity between feature vectors (Euclidean distance or cosine distance).

MinHash is an algorithm based on Jaccard similarity. According to the Jaccard formula, the similarity between sets A and B is $S(A, B) = |A \cap B| / |A \cup B|$. Computing the intersection and union of the two sets directly consumes considerable computing resources and becomes infeasible on massive data.
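For reference, the exact Jaccard computation that MinHash is meant to approximate can be sketched as follows. The function name jaccard and the sample token lists are illustrative assumptions, as is the convention of returning 1.0 for two empty sets.

```python
def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen in this sketch for two empty sets
    return len(a & b) / len(a | b)

tokens_a = ["load", "add", "call:abs", "store", "ret"]
tokens_b = ["load", "mul", "call:abs", "store", "ret"]
print(jaccard(tokens_a, tokens_b))  # 4 shared tokens out of 6 distinct tokens
```

Each call materializes both sets in full, which is the cost that signature-based comparison avoids when the number of documents is large.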

MinHash approximates the Jaccard similarity by permuting rows: the probability that the minimum hash values of two sets are equal after a random row permutation equals the Jaccard similarity of the two sets. For example, the rows of the feature matrix are first shuffled at random (i.e., the positions of the rows are randomly swapped). The minimum hash value of a column is then the number of the first row, after shuffling, in which that column has the value 1, with row numbers starting from 0.

(3) Feature matrix

Each column of the feature matrix corresponds to one set, and the rows together correspond to the universal set of all elements of all the sets. If a set contains an element, the corresponding position in the matrix is 1; otherwise it is 0.

(4) Minimum hash signatures

Multiple permutations may be applied, and a minimum hash value is computed for a set under each permutation; the resulting sequence of minimum hash values forms the minimum hash signature of the set. It can be proved that the probability that the minimum hash values of two sets are equal after a random row permutation equals the Jaccard similarity of the two sets.
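The definitions of the feature matrix, the minimum hash value, and the minimum hash signature can be tied together in a small sketch. It follows the row-permutation description literally, which is practical only for small universes; the function names, the fixed seed, and the choice of 500 permutations are assumptions made for illustration.

```python
import random

def feature_matrix(sets):
    """Rows are the elements of the universal set, columns are the sets (1 = present)."""
    universe = sorted(set().union(*sets))
    return [[1 if element in s else 0 for s in sets] for element in universe]

def minhash_signatures(matrix, num_permutations, seed=0):
    """Per permutation, record for each column the first shuffled row (from 0) holding a 1."""
    rng = random.Random(seed)
    rows, cols = len(matrix), len(matrix[0])
    signatures = [[] for _ in range(cols)]
    for _ in range(num_permutations):
        order = list(range(rows))
        rng.shuffle(order)                     # random row permutation
        for col in range(cols):
            first = next((pos for pos, row in enumerate(order)
                          if matrix[row][col] == 1), rows)
            signatures[col].append(first)
    return signatures

def estimate_similarity(sig_a, sig_b):
    """Fraction of permutations under which the two minimum hash values agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

set_a = {"load", "add", "call:abs", "store", "ret"}
set_b = {"load", "mul", "call:abs", "store", "ret"}
sig_a, sig_b = minhash_signatures(feature_matrix([set_a, set_b]), num_permutations=500)
print(estimate_similarity(sig_a, sig_b))       # close to the exact value 4/6
```

The estimate converges to the exact Jaccard similarity as the number of permutations grows; hash functions of the form (a·x + b) mod c are commonly used in place of explicit permutations for efficiency.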

(5) Locality Sensitive Hashing (LSH)

Even when MinHash is used to compress large documents into signatures, computing the similarity of every pair remains too expensive when the number of documents is huge, as with massive code data, so the most similar documents still cannot be found efficiently. Locality-sensitive hashing (LSH) divides the minimum hash signatures of the given sets into bands and hashes each band into buckets, and then computes the similarity only for pairs of sets that agree in at least one band. With a suitable band size, comparisons between most pairs that do not meet the similarity requirement are eliminated, which speeds up MinHash processing.
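The effect of the band size can be made concrete with the standard banding analysis, which is not stated in the application: if a signature is split into b bands of r rows each and rows are treated as independent, two sets with Jaccard similarity s agree in at least one band with probability 1 - (1 - s^r)^b. The split of 200 signature values into 40 bands of 5 rows below is an assumed example.

```python
def candidate_probability(s, bands, rows):
    """Probability that two sets with Jaccard similarity s agree in at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

# Assumed example: a 200-value signature split into 40 bands of 5 rows each.
for s in (0.2, 0.5, 0.8):
    print(s, candidate_probability(s, bands=40, rows=5))
```

Increasing the number of rows per band makes the filter stricter, while increasing the number of bands makes it more permissive, so the pair (b, r) sets the similarity level at which pairs start to be compared.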

Referring to fig. 2, fig. 2 is a schematic diagram of an embodiment of a code similarity detection method provided in an embodiment of the present application, where the code similarity detection method provided in the embodiment of the present application may be applied to a server or a terminal device, as shown in fig. 2, the code similarity detection method provided in the embodiment of the present application includes:

201. acquiring a source code to be detected;

In a possible implementation, when a user needs code similarity detection, the code to be subjected to similarity detection may be uploaded to a cloud-side server, and the server then acquires the source code to be detected that the user uploaded from the device side.

The source code to be detected may include a code and a compiling command of the code.

In one possible implementation, the code to be detected may be optimized.

In a possible implementation, invalid code segments in the source code to be detected may be removed, where an invalid code segment is a code segment that has no influence on the result of execution, and loop invariants in the loop bodies of the source code to be detected may be removed from the loop bodies and moved outside them.

Specifically, compiler optimization techniques may be used to apply deep optimizations such as dead code elimination (i.e., removing invalid code segments) and loop invariant extraction. An invalid code segment, also called dead code, may be a statement whose computed result is never used. A loop invariant is an expression whose value does not change from one iteration to the next; the optimizer computes it once outside the loop and reuses the result inside the loop. The compiler's optimizer can find loop invariants and move them out of the loop body by means of code motion.

Taking loop invariant extraction as an example, the code before optimization is:

int loop(int num)
{
    int i = 0;
    int res;
    for (i = 0; i < num; i++) {
        res = 3;
    }
    return res;
}

The optimized code is as follows:

int loop1(int num)
{
    int i = 0;
    int res = 3;
    for (i = 0; i < num; i++) {
    }
    return res;
}

That is, res = 3 is removed from the loop body and moved outside the loop.

In addition, referring to fig. 3, during the build the GCC compile commands among the compile commands of the code may be intercepted and converted into Clang compile commands. In addition to adding the generation command of the Clang front-end compiler (clang -c -emit-llvm), the compile options may be filtered through a whitelist mechanism to mask differences in the generated LLVM intermediate language. The build module checks whether each compile option is in the whitelist; if so, the option is retained, and otherwise it is discarded, which completes the reassembly of the Clang build command.
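The command rewriting described above might look like the following sketch. The whitelist contents, the handled compiler names, and the function name to_clang_command are assumptions chosen for illustration; the application does not list which options its whitelist contains. The options -c and -emit-llvm are existing Clang options that make the compiler emit LLVM bitcode instead of a native object file.

```python
import shlex

# Map of intercepted GCC driver names to their Clang counterparts.
GCC_TO_CLANG = {"gcc": "clang", "cc": "clang", "g++": "clang++", "c++": "clang++"}
# Hypothetical whitelist: only options starting with these prefixes are kept.
OPTION_WHITELIST = ("-I", "-D", "-std=", "-include")
# Whitelisted options whose value may follow as a separate argument.
TAKES_VALUE = {"-I", "-D", "-include"}
SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".cxx")

def to_clang_command(command):
    """Rebuild a GCC compile command as a Clang command that emits LLVM bitcode."""
    args = shlex.split(command)
    compiler = args[0].rsplit("/", 1)[-1]
    if compiler not in GCC_TO_CLANG:
        return args                        # not a GCC invocation: leave unchanged
    kept, sources = [], []
    i = 1
    while i < len(args):
        arg = args[i]
        if arg in TAKES_VALUE and i + 1 < len(args):
            kept += [arg, args[i + 1]]     # e.g. "-I include"
            i += 1
        elif arg.startswith(OPTION_WHITELIST):
            kept.append(arg)               # e.g. "-Iinclude", "-std=c11"
        elif not arg.startswith("-") and arg.endswith(SOURCE_SUFFIXES):
            sources.append(arg)
        i += 1                             # everything else (e.g. -O2, -o out.o) is dropped
    return [GCC_TO_CLANG[compiler], "-c", "-emit-llvm"] + kept + sources

print(to_clang_command("gcc -O2 -Wall -Iinclude -DNDEBUG -c src/a.c -o build/a.o"))
```

Dropping options such as optimization levels means that two builds of the same source with different flags produce the same IR, which is the stated purpose of the whitelist.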

202. And compiling the source code to be detected through a low-level virtual machine LLVM compiler to obtain a target intermediate representation IR of the source code to be detected.

In the embodiments of the present application, the LLVM front-end compiler Clang may recompile the source code (e.g., .c or .cpp files) to generate the target intermediate representation IR (also referred to as LLVM IR) based on the LLVM framework. Compared with a high-level programming language, LLVM IR exposes more of the underlying operations and data flow, and contains more semantic information. Building and compiling in this way converts the source code into IR.

203. And identifying a segment related to the operation code in the target IR to obtain a target identification token sequence.

In the embodiment of the application, after the target IR is obtained, a segment related to the operation code in the target IR may be identified to obtain a target identification token sequence.

For example, fig. 4a shows the extraction flow of a token sequence.

Illustratively, an LLVM IR instruction may have the following format: <result> = add <ty> <op1>, <op2>

where add is the opcode, op1 and op2 are the operands, and ty is the operand type. For example, in <result> = add i32 4, %var, add is the opcode, 4 and %var are the operands, and i32 is the operand type.

204. And comparing the similarity of the target token sequence with a code library to obtain a similarity detection result of the source code to be detected.

In the embodiments of the application, after the target token sequence of the code to be detected is obtained, the target minimum hash signature of the target token sequence may be determined by means of the minimum hash (MinHash) algorithm.

In the embodiments of the application, the code library may include a plurality of documents and serves as the fingerprint library for retrieval. Each document corresponds to one token sequence, each token sequence is obtained from one candidate code, and each document includes the minimum hash signature of the corresponding token sequence.

Specifically, the code library may comprise documents made up of token sequences, and may be stored as a key-value structure in which the key is the document name and the value is the MinHash signature of the document (see, for example, fig. 4b).
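A query against such a key-value library can be sketched as a linear scan that estimates similarity as the fraction of matching signature positions and keeps the documents above a threshold. The document names, the short four-value signatures, and the function names are hypothetical and serve only to show the data layout.

```python
def signature_similarity(sig_a, sig_b):
    """Fraction of positions at which two equal-length MinHash signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def query(code_library, target_signature, threshold):
    """Return (document name, similarity) pairs above the threshold, most similar first."""
    scored = ((name, signature_similarity(sig, target_signature))
              for name, sig in code_library.items())
    hits = [(name, score) for name, score in scored if score > threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# key: document name, value: MinHash signature (shortened to 4 values here)
code_library = {
    "a.c::foo": [3, 7, 1, 9],
    "b.c::bar": [3, 7, 2, 9],
    "c.c::baz": [5, 0, 4, 8],
}
print(query(code_library, [3, 7, 1, 9], threshold=0.7))
```

This scan touches every document, so its cost grows linearly with the library size for one query and quadratically for all-pairs comparison.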

During source code compilation, Clang, the front-end compiler of LLVM, is used to generate a .bc file. The .bc file is parsed to generate an IR opcode sequence, and after preprocessing the result is stored in a database; queries are processed in the same way. In this manner the source code is converted into an IR opcode sequence. It should be emphasized that the opcode sequence in this application may incorporate one layer of dependency information, that is, it may include the name of the called function for each call instruction.

Illustratively, the pseudo code for generating the code base may be as follows:

global configuration:

hashNums = 200

maxShingle = 4294967295

nextPrimeForHash = 4294967311

shingleLen = 6

1. Random hash function: h(x) = (a * x + b) % c

2. Generate the hash function parameters a (coeffA) and b (coeffB)

3. Generate the MinHash signature matrix signature

4. Generate the document-to-MinHash-signature mapping docToSig

docToSig = {}

signature = []

get_signature(doc, signature)

docToSig[doc] = signature
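The pseudo code above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the constants follow the global configuration, while hashing each shingle of shingleLen consecutive tokens with CRC32, the fixed random seeds, and the helper names make_coeffs, shingles and add_document are assumptions introduced for illustration.

```python
import random
import zlib

HASH_NUMS = 200              # hashNums: number of hash functions
MAX_SHINGLE = 4294967295     # maxShingle: 2**32 - 1
NEXT_PRIME = 4294967311      # nextPrimeForHash: a prime above maxShingle
SHINGLE_LEN = 6              # shingleLen: tokens per shingle


def make_coeffs(count, seed):
    """Draw `count` distinct coefficients (seeded for reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(range(1, MAX_SHINGLE + 1), count)


COEFF_A = make_coeffs(HASH_NUMS, seed=1)   # parameter a of each h(x)
COEFF_B = make_coeffs(HASH_NUMS, seed=2)   # parameter b of each h(x)


def shingles(tokens, size=SHINGLE_LEN):
    """Map every run of `size` consecutive tokens to a 32-bit shingle ID."""
    if not tokens:
        return set()
    last = max(len(tokens) - size, 0)
    windows = (tokens[i:i + size] for i in range(last + 1))
    # Assumption: CRC32 of the space-joined tokens serves as the shingle ID.
    return {zlib.crc32(" ".join(w).encode("utf-8")) for w in windows}


def get_signature(tokens):
    """MinHash signature: minimum of h(x) = (a*x + b) % c over all shingles."""
    ids = shingles(tokens)
    return [
        min(((a * x + b) % NEXT_PRIME for x in ids), default=NEXT_PRIME)
        for a, b in zip(COEFF_A, COEFF_B)
    ]


doc_to_sig = {}  # docToSig: document name -> MinHash signature


def add_document(name, tokens):
    doc_to_sig[name] = get_signature(tokens)
```

Because the coefficients are generated from fixed seeds, the same token sequence always maps to the same signature, which is what allows signatures stored in the code library to be compared with signatures computed later for a query.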

In one possible implementation, the target minimum hash signature may be compared for similarity with the minimum hash signature included in each of the documents. Specifically, this comparison may be performed by using a locality-sensitive hashing (LSH) algorithm.
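As a brief illustration of how two minimum hash signatures can be compared, the fraction of positions at which the signatures agree is a standard estimate of the Jaccard similarity of the underlying shingle sets. The sketch below assumes signatures are equal-length lists of integers; the function name estimated_similarity is hypothetical.

```python
def estimated_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    if not sig_a:
        return 0.0
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

For instance, two signatures of length 4 that agree at 3 positions give an estimate of 0.75.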

The Minhash algorithm can convert the token comparison into the signature comparison, and has the advantages of accuracy and high efficiency, and the Minhash LSH algorithm can solve the comparison problem of mass data. According to the embodiment of the application, the source codes are converted into the IR language through the LLVM, semantic information of the source codes is reserved, and meanwhile, the detection capability of massive codes is realized by combining a MinhashLSH algorithm.

In the embodiment of the present application, if similar pairs are to be found among n documents, all pairs of sets still need to be traversed, and the complexity is O(n^2). The core idea of LSH, which addresses this problem, is therefore described next. The basic idea is to gather similar sets together, which narrows the search range and avoids comparing dissimilar sets.

For example, there are now 5 sets:

A signature matrix is obtained by the above method, and the matrix is then divided into b bands, each consisting of r rows. For each band there is a hash function that maps the column vector of r integers in that band (each column of the band) into a bucket. The same hash function can be used for all bands, but a separate set of buckets is used for each band, so that identical column vectors in different bands are not hashed into the same bucket. Thus, as long as the columns of two sets fall into the same bucket in at least one band, the two sets are considered likely to be highly similar and are taken as a candidate pair for subsequent calculation; two sets whose columns do not fall into the same bucket in any band are considered unlikely to be similar and are ignored directly.
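The effect of the banding strategy can be made concrete with the standard analysis of this technique, which is not stated in the application and is given here only as background. It assumes that two sets have Jaccard similarity s, so that each of the n = b · r signature positions agrees independently with probability s:

```latex
P(\text{candidate pair}) = 1 - \left(1 - s^{r}\right)^{b},
\qquad n = b \cdot r,
\qquad s^{*} \approx \left(\tfrac{1}{b}\right)^{1/r}
```

Here s^r is the probability that all r rows of one band agree, and s* is the approximate similarity at which the probability of becoming a candidate pair rises most steeply. Choosing b and r therefore sets the effective similarity threshold of the candidate search.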

The LSH acceleration table may be, for example:

lsh_dicts = [lsh_dict1, ..., lsh_dictb]

Each lsh_dict = {} is also a key-value structure, in which the key is obtained by transforming the MinHash signature into a b-dimensional hash, e.g., X: [x1, x2, ..., xn] -> Y: [y1, y2, ..., yb], where n = b * r, X is the original MinHash signature, and key = crc32(yi).
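A structure of this kind can be sketched in Python as follows. This is an illustrative sketch rather than the pseudo code of the application: it assumes each band yi is serialized as a comma-separated string before CRC32 is applied, that each bucket holds a set of document names, and the function names band_keys, build_lsh_dicts and query_candidates are hypothetical.

```python
import zlib
from collections import defaultdict


def band_keys(signature, bands):
    """Split a signature of n = b * r values into b bands and hash each one."""
    if bands <= 0 or len(signature) % bands != 0:
        raise ValueError("signature length must be a multiple of bands")
    rows = len(signature) // bands
    keys = []
    for i in range(bands):
        band = signature[i * rows:(i + 1) * rows]
        data = ",".join(str(v) for v in band).encode("ascii")
        keys.append(zlib.crc32(data))            # key = crc32(yi)
    return keys


def build_lsh_dicts(doc_to_sig, bands):
    """One dictionary per band, so buckets of different bands never mix."""
    lsh_dicts = [defaultdict(set) for _ in range(bands)]
    for name, signature in doc_to_sig.items():
        for lsh_dict, key in zip(lsh_dicts, band_keys(signature, bands)):
            lsh_dict[key].add(name)
    return lsh_dicts


def query_candidates(lsh_dicts, signature):
    """Return documents sharing a bucket with the query in at least one band."""
    candidates = set()
    keys = band_keys(signature, len(lsh_dicts))
    for lsh_dict, key in zip(lsh_dicts, keys):
        candidates |= lsh_dict.get(key, set())
    return candidates
```

Only the returned candidates need to be compared against the query signature in full, which avoids the pairwise traversal over all documents.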

The pseudo-code may be as follows:

global configuration

Referring to fig. 5, fig. 5 shows a similarity detection flow. The similarity detection result may include at least one candidate code, where the similarity between the token sequence corresponding to each candidate code and the target token sequence is higher than a threshold. After the similarity detection result is obtained, it may be delivered to the end side.

For example, a code similarity detection method provided in the embodiment of the present application is described below with reference to the flow shown in fig. 6a. For a code to be detected, a Clang compile command may be used in place of the Gcc compile command to complete the version build. Specifically, a compilation log may first be generated by compiling the source code (command line example: log). Clang compile commands are then generated, for example by replacing Gcc with Clang and adding uniform build options by using a whitelist mechanism (command line example: sh rebuild.sh). The version build is performed with the reconstructed Clang compile commands, and BC files are generated under a specified output directory (by default, the BC_out directory under the project directory). In the data preprocessing stage, function-level token sequences are generated under a specified output directory (by default, the ir_out directory under the project directory) (command line example: parseIR /home/zhouy/pro/BC_out). Similarity detection is then performed. First, configuration information is generated under the specified output directory (by default, the ir_out directory under the project directory): the parameters required by the Minhash algorithm, such as the number N of hash functions, the coefficients required to form the N hash functions, the number of bands in the banding strategy, and the similarity threshold, are stored in a database, which ensures that repeated similarity comparisons of the same file yield the same result. The signatures are then calculated and stored.
The similarity values may then be queried, taking the detection of similarity values inside a project as an example (command line example: python3 queryredu.py /home/zhouy/project/ir_out/ /home/zhouy/project/).
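The step of reconstructing a Clang compile command from a logged Gcc command can be illustrated as follows. This is a hedged sketch only: the whitelist contents, the function name to_clang_command, and the handling of options are assumptions made for illustration, and the scripts named in the command line examples above are not reproduced. The options -emit-llvm and -c are existing Clang options that produce an LLVM bitcode file for one source file.

```python
import shlex
from pathlib import PurePosixPath

GCC_NAMES = {"gcc", "cc"}
# Hypothetical whitelist: only these option families are carried over.
ALLOWED_PREFIXES = ("-I", "-D", "-std=")
SOURCE_SUFFIXES = (".c", ".cc", ".cpp")


def to_clang_command(gcc_command, out_dir="BC_out"):
    """Rewrite one logged gcc compile command into a clang bitcode command.

    Returns an argument list, or None if the line is not a single-source
    gcc compile command.
    """
    args = shlex.split(gcc_command)
    if not args or PurePosixPath(args[0]).name not in GCC_NAMES:
        return None
    kept, sources = [], []
    rest = iter(args[1:])
    for arg in rest:
        if arg in ("-I", "-D"):                   # option with separate value
            value = next(rest, None)
            if value is not None:
                kept += [arg, value]
        elif arg.startswith(ALLOWED_PREFIXES):
            kept.append(arg)
        elif arg == "-o":
            next(rest, None)                      # drop the original output
        elif arg.endswith(SOURCE_SUFFIXES):
            sources.append(arg)
    if len(sources) != 1:
        return None
    stem = PurePosixPath(sources[0]).stem
    output = str(PurePosixPath(out_dir) / (stem + ".bc"))
    return ["clang", "-emit-llvm", "-c", *kept, sources[0], "-o", output]
```

The returned argument list could be passed to a process runner to produce the BC file; options outside the whitelist, such as optimization levels, are dropped so that all files are built with uniform options.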

In one possible implementation, after obtaining the similarity result, target information indicating difference information between each candidate code of the at least one candidate code and the code to be detected may be presented based on the similarity result.

The code detection method provided by the application can detect clone function pairs of Type-1 to Type-4, where Type-1 means that the code is identical to the source program except for spaces and comments; Type-2 means that, on the basis of Type-1, identifiers in the code fragment such as variable names, constants, and class names differ but the syntactic structure is the same; Type-3 means that, on the basis of Type-2, statements of the program are modified, deleted, or added, but the overall structure of the program is basically unchanged; and Type-4 means that two pieces of code implement the same function and produce the same result for the same input. The detection time on 10 million lines of code (1K × 10,000 lines) is about 2 minutes.

Referring to fig. 6b, fig. 6b is a schematic diagram of a Type-2 detection result and shows a clone function pair that differs in code-fragment constants. Referring to fig. 6c, fig. 6c is a schematic diagram of a Type-2 detection result and shows a clone function pair that differs in parameters. Referring to fig. 6d, fig. 6d is a schematic diagram of a Type-3 detection result and shows a clone function pair resulting from inserting a line of useless code. Referring to fig. 6e, fig. 6e is a schematic diagram of a Type-3 detection result and shows a clone function pair with consistent semantics, such as the check that a pointer is not null. Referring to fig. 6f, fig. 6f is a schematic diagram of a Type-3 detection result and shows a clone function pair resulting from splitting an OR condition of an if expression into two if conditions. Referring to fig. 6g, fig. 6g is a schematic diagram of a Type-4 detection result and shows a clone function pair that is consistent after function expansion.

The embodiment of the application provides a code similarity detection method, which comprises the following steps: acquiring a source code to be detected; compiling the source code to be detected through a low-level virtual machine LLVM compiler to obtain a target intermediate representation IR of the source code to be detected; identifying fragments related to the operation codes in the target IR to obtain a target identification token sequence; and comparing the similarity of the target token sequence with a code library to obtain a similarity detection result of the source code to be detected.

The target IR compiled by the LLVM compiler can fully reserve semantic information of a source code, is not influenced by invalid codes such as spaces, comments and the like, and can ensure higher similarity recognition accuracy. And extracting information such as operation codes and called functions as characteristic vectors, and utilizing the optimization capability of the LLVM compiler to mine deeper clone code pairs.

The binaries generated on different platforms, with different CPU architectures, compiler versions, compilation options and the like often differ greatly, which strongly affects similarity comparison. The IR compiled by the LLVM compiler can fully shield the differences of the back end, so this problem can be well solved.

Referring to fig. 7, fig. 7 is a schematic diagram of a code similarity detection apparatus 700 provided in an embodiment of the present application, and as shown in fig. 7, the code similarity detection apparatus 700 provided in the present application includes:

an obtaining module 701, configured to obtain a source code to be detected;

for a detailed description of the obtaining module 701, reference may be made to the description of step 201, which is not described herein again.

A compiling module 702, configured to compile the source code to be detected through a low-level virtual machine LLVM compiler, so as to obtain a target intermediate representation IR of the source code to be detected;

for a detailed description of the compiling module 702, reference may be made to the description of step 202, which is not described herein again.

The identifying module 703 is configured to identify a segment related to the operation code in the target IR to obtain a target identifier token sequence;

for a detailed description of the identifying module 703, reference may be made to the description of step 203, which is not described herein again.

And the similarity detection module 704 is configured to compare the similarity between the target token sequence and the code library to obtain a similarity detection result of the source code to be detected.

For a detailed description of the similarity detection module 704, reference may be made to the description of step 204, which is not described herein again.

The target IR compiled by the LLVM compiler can fully reserve semantic information of a source code, is not influenced by invalid codes such as spaces, comments and the like, and can ensure higher similarity recognition accuracy. And extracting information such as operation codes, called functions and the like as characteristic vectors, and under the premise of not needing complex semantic extraction, deeper clone code pairs can be mined by utilizing the optimization capability of the LLVM compiler, so that the detection capability of massive codes is realized.

In one possible implementation, the apparatus further comprises:

and the code optimization module is used for eliminating invalid code segments in the source code to be detected before the source code to be detected is compiled through the LLVM compiler of the low-level virtual machine, wherein the invalid code segments are code segments which have no influence on an operation result in the source code to be detected.

In one possible implementation, the apparatus further comprises:

identifying an operation code contained in the target IR to obtain a target token sequence, wherein the target token sequence comprises the operation code; and/or,

In one possible implementation, the codebase includes a plurality of documents, each document corresponds to a token sequence, each token sequence is obtained according to a candidate code, and each document includes a minimum hash signature of the corresponding token sequence;

the similarity detection module is specifically configured to:

the device further comprises:

Referring to fig. 8, fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present disclosure, and the terminal device 800 may be embodied as a mobile phone, a tablet, a notebook computer, an intelligent wearable device, and the like, which is not limited herein. The terminal device 800 may be configured to implement the steps related to the end side in the code similarity detection method in the foregoing embodiments. Specifically, the terminal apparatus 800 includes: a receiver 801, a transmitter 802, a processor 803 and a memory 804 (wherein the number of the processors 803 in the terminal device 800 may be one or more, and one processor is taken as an example in fig. 8), wherein the processor 803 may include an application processor 8031 and a communication processor 8032. In some embodiments of the present application, the receiver 801, the transmitter 802, the processor 803, and the memory 804 may be connected by a bus or other means.

The memory 804 may include a read-only memory and a random access memory, and provides instructions and data to the processor 803. A portion of the memory 804 may also include non-volatile random access memory (NVRAM). The memory 804 stores the processor and operating instructions, executable modules or data structures, or a subset or an expanded set thereof, wherein the operating instructions may include various operating instructions for performing various operations.

The processor 803 controls the operation of the terminal device. In a specific application, the various components of the terminal device are coupled together by a bus system, wherein the bus system may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.

The method disclosed in the embodiments of the present application can be applied to the processor 803 or implemented by the processor 803. The processor 803 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 803. The processor 803 may be a general-purpose processor, a Digital Signal Processor (DSP), a microprocessor or a microcontroller, and may further include an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The processor 803 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 804, and the processor 803 reads the information in the memory 804 to complete the steps of the method in combination with the hardware thereof.

Receiver 801 may be used to receive input numeric or character information and generate signal inputs related to the associated settings and function controls of the terminal device. The transmitter 802 may be configured to output numeric or character information via a first interface; the transmitter 802 may also be configured to send instructions to the disk pack through the first interface to modify data in the disk pack; the transmitter 802 may also include a display device such as a display screen.

In this embodiment, in one case, the processor 803 is configured to implement the steps related to the end side in the code similarity detection method in the foregoing embodiments.

Referring to fig. 9, fig. 9 is a schematic structural diagram of a server provided in the embodiment of the present application. Specifically, the server 900 is implemented by one or more servers. The server 900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 99 (e.g., one or more processors), a memory 932, and one or more storage media 930 (e.g., one or more mass storage devices) storing an application 942 or data 944. The memory 932 and the storage medium 930 may be transient storage or persistent storage. The program stored on the storage medium 930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processor 99 may be configured to communicate with the storage medium 930 and to execute, on the server 900, the series of instruction operations in the storage medium 930.

The server 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input-output interfaces 958; or, one or more operating systems 941, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

Specifically, the server may execute the code similarity detection method described in the foregoing embodiment.

The embodiment of the present application further provides a computer program product, which when running on a computer, causes the computer to execute the steps performed by the code similarity detection apparatus.

The present invention also provides a computer-readable storage medium, in which a program for signal processing is stored, and when the program runs on a computer, the computer is caused to execute the steps executed by the code similarity detection apparatus.

The code similarity detection device provided by the embodiment of the application can be specifically a chip, and the chip comprises: a processing unit, which may be, for example, a processor, and a communication unit, which may be, for example, an input/output interface, a pin or a circuit, etc. The processing unit can execute the computer execution instructions stored in the storage unit to make the chip in the code similarity detection device execute the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the radio access device, such as a read-only memory (ROM) or another type of static storage device that may store static information and instructions, a Random Access Memory (RAM), and the like.

Specifically, referring to fig. 10, fig. 10 is a schematic structural diagram of a chip provided in the embodiment of the present application, where the chip may be represented as a neural network processor NPU 1000, and the NPU 1000 is mounted on a main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks. The core portion of the NPU is an arithmetic circuit 1003, and the controller 1004 controls the arithmetic circuit 1003 to extract matrix data in the memory and perform multiplication.

In some implementations, the arithmetic circuit 1003 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuit 1003 is a two-dimensional systolic array. The arithmetic circuit 1003 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1003 is a general-purpose matrix processor.

For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1002 and buffers it in each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 1001 and performs a matrix operation with the matrix B, and a partial result or a final result of the obtained matrix is stored in an accumulator 1008.

The unified memory 1006 is used for storing input data and output data. The weight data is transferred directly into the weight memory 1002 through a direct memory access controller (DMAC) 1005. The input data is also transferred into the unified memory 1006 through the DMAC.

The bus interface unit (BIU) 1010 is used for interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1009.

The bus interface unit 1010 is used by the instruction fetch buffer 1009 to obtain instructions from the external memory, and is further used by the direct memory access controller 1005 to obtain the raw data of the input matrix A or the weight matrix B from the external memory.

The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1006, to transfer weight data to the weight memory 1002, or to transfer input data to the input memory 1001.

The vector calculation unit 1007 includes a plurality of operation processing units and, if necessary, further processes the output of the arithmetic circuit 1003, for example by vector multiplication, vector addition, exponential operation, logarithmic operation, or magnitude comparison. It is mainly used for non-convolution/fully-connected layer network calculation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of a feature plane.

In some implementations, the vector calculation unit 1007 can store the processed output vector to the unified memory 1006. For example, the vector calculation unit 1007 may calculate a linear function; alternatively, a non-linear function is applied to the output of the arithmetic circuit 1003, such as performing linear interpolation on the feature planes extracted from the convolutional layers, and then accumulating the vectors of values to generate the activation values. In some implementations, the vector calculation unit 1007 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuit 1003, for example, for use in subsequent layers in a neural network.

Instruction fetch memory (instruction fetch buffer) 1009 connected to controller 1004, for storing instructions used by controller 1004;

the unified memory 1006, the input memory 1001, the weight memory 1002, and the instruction fetch memory 1009 are On-Chip memories. The external memory is private to the NPU hardware architecture.

The processor mentioned in any of the above may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.

It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including special-purpose integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and the specific hardware structures for implementing the same function may be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, implementation by a software program is in most cases the preferable embodiment. Based on such understanding, the technical solutions of the present application may be substantially embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the method according to the embodiments of the present application.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.

Claims

1. A code similarity detection method, characterized in that the method comprises:

acquiring a source code to be detected;

compiling the source code to be detected through a LLVM compiler of a low-level virtual machine to obtain a target Intermediate Representation (IR) of the source code to be detected;

identifying a segment related to an operation code in the target IR to obtain a target identification token sequence;

and comparing the similarity of the target token sequence with a code library to obtain a similarity detection result of the source code to be detected.

2. The method according to claim 1, wherein before the compiling the source code to be detected by the low-level virtual machine LLVM compiler, the method further comprises:

3. The method according to claim 1 or 2, wherein before the compiling the source code to be detected by the low-level virtual machine LLVM compiler, the method further comprises:

and eliminating a loop invariant (loop invariant) positioned in a loop body in the source code to be detected and moving the loop invariant to the outside of the loop body.

4. The method according to any one of claims 1 to 3, wherein the identifying the segment related to the operation code in the target IR to obtain the target identification token sequence comprises:

identifying an instruction containing a pointer operation in the target IR to obtain a target token sequence, wherein the target token sequence comprises an operation code contained in the instruction containing the pointer operation and an object pointed to by the pointer operation; and/or,

5. The method according to any one of claims 1 to 4, wherein the code library comprises a plurality of documents, each document corresponding to a token sequence, each token sequence being derived from a candidate code, and each document comprising a minimum hash signature of the corresponding token sequence;

6. The method of claim 5, wherein the comparing the similarity of the target minimum hash signature to the minimum hash signature included in each of the documents comprises:

7. The method according to claim 5 or 6, wherein the similarity detection result comprises at least one candidate code, and the similarity between the token sequence corresponding to each candidate code and the target token sequence is higher than a threshold value;

the method further comprises the following steps:

presenting target information indicating difference information between each candidate code of the at least one candidate code and the code to be detected.

8. The method according to any one of claims 1 to 7, wherein the source code to be detected comprises a compiling instruction of a code, and the compiling instruction is a clang compiling command.

9. A code similarity detection apparatus, characterized in that the apparatus comprises:

the acquisition module is used for acquiring a source code to be detected;

the compiling module is used for compiling the source code to be detected through a low-level virtual machine LLVM compiler to obtain a target intermediate representation IR of the source code to be detected;

the identification module is used for identifying fragments related to the operation codes in the target IR so as to obtain a target identification token sequence;

10. The apparatus of claim 9, further comprising:

11. The apparatus of claim 9 or 10, further comprising:

12. The apparatus according to any one of claims 9 to 11, wherein the identification module is specifically configured to:

13. The apparatus according to any one of claims 9 to 12, wherein the code library comprises a plurality of documents, each document corresponding to a token sequence, each token sequence being derived from a candidate code, and each document comprising a minimum hash signature of the corresponding token sequence;

the similarity detection module is specifically configured to:

14. The apparatus of claim 13, wherein the similarity detection module is specifically configured to:

15. The apparatus according to claim 13 or 14, wherein the similarity detection result comprises at least one candidate code, and the similarity between the token sequence corresponding to each candidate code and the target token sequence is higher than a threshold value;

the device further comprises:

16. The apparatus according to any one of claims 9 to 15, wherein the source code to be detected comprises a compiling instruction of a code, and the compiling instruction is a clang compiling command.

17. A code similarity detection apparatus, characterized in that the apparatus comprises a memory and a processor; the memory stores code, and the processor is configured to retrieve the code and perform the method of any of claims 1 to 8.

18. A computer storage medium, characterized in that the computer storage medium stores one or more instructions that, when executed by one or more computers, cause the one or more computers to implement the method of any one of claims 1 to 8.

19. A computer program product comprising instructions for causing a computer to perform the method of any of claims 1-8 when the computer program product is run on the computer.