CN114880023A - Technical feature oriented source code comparison method, system and program product - Google Patents
- Publication number
- CN114880023A CN202210808406.7A
- Authority
- CN
- China
- Prior art keywords
- function
- vector
- code
- built
- class
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/75—Structural analysis for program understanding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/43—Checking; Contextual analysis
- G06F8/436—Semantic checking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A technical-feature-oriented source code comparison method, system, and program product, belonging to the technical field of natural language processing. The invention uses a semantic encoding method based on the function call structure and analyzes code similarity in terms of the function call structure, function names, and built-in classes. A graph autoencoder based on a graph convolutional neural network performs graph semantic encoding, so that code structure semantics can be compared using semantic vectors. Function similarity and built-in class similarity are compared using call-information vectors of the function names and built-in class names. Finally, the structure vector, the function vector, and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity. By jointly considering technical feature information such as function names, call structure, and built-in classes, the invention can better compare code according to its technical features.
Description
Technical Field
The invention discloses a technical-feature-oriented source code comparison method, system, and program product, and belongs to the technical field of natural language processing.
Background
Open source platforms give researchers an environment for sharing and exchanging code. More and more deep learning models and code are shared on such platforms, creating an ecosystem of reusable code, so researchers need to search for existing solutions to specific problems. Modern algorithm design builds code in a modular way: code usually contains a large number of basic utility functions, and information such as function names, call structure, and built-in classes provides important technical features of the code. The same problem can have several solutions. In text classification, for example, different neural network structures may be used, such as convolutional neural networks, recurrent neural networks, and attention mechanisms, and their performance differs. It is therefore necessary to further analyze aspects such as code structure and execution efficiency, and to compare newly designed code with reusable code.
Code similarity is an important index in code comparison analysis. Current mainstream code similarity calculation methods fall into three categories: methods based on call structure, methods based on static features, and methods based on binary code.
Code similarity calculation methods based on call structure mainly calculate the similarity between pieces of code from their logical structure, which generally includes the abstract syntax tree, the program call graph, and the like. For example, the invention patent with publication number CN110737469A discloses a source code similarity evaluation method based on semantic information at function granularity, which considers the structure and node information of the control flow graph by calculating embedding vectors of the identifiers and the control flow graph corresponding to each function, but does not consider the semantic information of technical features such as the built-in classes imported by the code.
The code similarity calculation method based on static features is to extract some metric values from a source code to form feature vectors, and then to take the similarity between the code feature vectors as the similarity of the codes, for example: the invention patent with publication number CN111290784A discloses a program source code similarity detection method suitable for large-scale samples, which calculates a locality sensitive hash value for a text feature sequence and a feature weight sequence of each sample to be detected, and uses the value as a sample feature vector.
Code similarity calculation methods based on binary code generally disassemble the binary code to obtain the instruction sequence of each function, vectorize the instruction features, and finally calculate code similarity from the feature vectors. For example, Chinese patent publication No. CN113554101A discloses a binary code similarity detection method based on deep learning, which uses Structure2Vec to generate a graph embedding of the control flow graph of a binary function, and introduces a CNN to process the sequential structure information between the basic blocks of the control flow graph, thereby better capturing the precedence relationship between blocks within the function.
To address these problems, the method jointly considers the semantic information of technical features. It uses a semantic encoding method based on the function call structure and analyzes code similarity in terms of the function call structure, function names, and built-in classes. A graph autoencoder based on a graph convolutional neural network performs graph semantic encoding, so that code structure semantics can be compared using semantic vectors. Function similarity and built-in class similarity are compared using call-information vectors of the function names and built-in class names. Finally, the structure vector, the function vector, and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity.
Disclosure of Invention
To address the technical problems in the prior art, the invention discloses a technical-feature-oriented source code comparison method.
The invention also discloses a system for realizing the comparison method.
The invention also discloses a program product loaded with the method.
The invention also discloses a computer readable storage medium loaded with the method.
The invention discloses an application method utilizing the method.
Summary of the invention:
The invention relates to a technical-feature-oriented source code comparison method. Modern algorithm design builds code in a modular way: code usually contains a large number of basic utility functions, and information such as function names, call structure, and built-in classes provides important technical features of the code. The method aims to jointly consider the semantic information of these technical features and to analyze aspects such as code structure and execution efficiency in terms of the function call structure, function names, and built-in classes.
Interpretation of terms:
1. Technical features: keywords that describe the functionality a piece of code implements and the techniques it uses, including the function call structure, function names, built-in class names, loss functions, etc.
2. Function call structure similarity: object-oriented programming has three characteristics, namely encapsulation, inheritance, and polymorphism. Most code follows the idea of modular design, that is, it is designed with functions as the unit, and functions are what make the code modular. If the functionality implemented by two pieces of code is similar, the function call structures among their technical features are similar.
3. Function similarity: code that implements a specific capability usually uses functions with well-defined behavior, input parameters, and return values. If the functionality implemented by two pieces of code is similar, the function names among their technical features are similar.
4. Built-in class similarity: functions are generally encapsulated in built-in classes. If the functionality implemented by two pieces of code is similar, the built-in classes imported among their technical features are also similar.
The detailed technical scheme of the invention is as follows:
a technical-feature-oriented source code comparison method, characterized by comprising the following steps:
a code file preprocessing stage, which outputs the function call structure, function names, and built-in class names of the code;
a function call structure semantic encoding stage, in which a graph autoencoder based on a graph convolutional neural network semantically encodes the function call structure to obtain a function call structure vector, so that code structure semantics can be compared using semantic vectors;
a call information encoding stage, in which the TF-IDF algorithm is applied to the call information of the function names and the built-in class names to obtain a function vector and a built-in class vector respectively, so that function similarity and built-in class similarity can be compared using these call-information vectors;
and finally, the function call structure vector, the function vector, and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity.
Preferably, according to the present invention, the code file preprocessing stage includes:
a DOT file of the function call structure among the code's technical features is obtained using a function call structure generation tool; the DOT file describes the nodes in the graph representation of the function call structure and the relationships among them. For example, the Pycallgraph tool can generate a graph representation of the function call structure of a Python application, and the Graphviz tool can convert a DOT file into a picture of the function call structure. The DOT file is preprocessed to obtain the graph representation $G$ of the function call structure:
$G = (V, A_V, E, A_E)$ (I)
In formula (I), $V$ represents the set of function names; $A_V$ represents the set of function attributes; $E$ represents the set of function call edges; $A_E$ represents the set of edge attributes. From the graph representation $G$ of the function call structure, the adjacency matrix $A$ of the graph representation is further obtained.
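The preprocessing step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the DOT text, the regular expression, and the helper names `parse_call_graph` and `adjacency_matrix` are all hypothetical, and only simple `"caller" -> "callee"` edge lines are handled.

```python
import re

# Hypothetical DOT text of the kind a call-graph tool might emit (not from the patent).
DOT_TEXT = '''
digraph G {
    "main" -> "nextInt";
    "main" -> "for";
    "for" -> "print";
}
'''

# Matches edges of the form "caller" -> "callee", with or without quotes.
EDGE_PATTERN = re.compile(r'"?([\w.]+)"?\s*->\s*"?([\w.]+)"?')


def parse_call_graph(dot_text):
    """Return (V, E): function names in first-seen order and directed call edges."""
    nodes, edges = [], []
    for caller, callee in EDGE_PATTERN.findall(dot_text):
        for name in (caller, callee):
            if name not in nodes:
                nodes.append(name)
        edges.append((caller, callee))
    return nodes, edges


def adjacency_matrix(nodes, edges):
    """Build A, with A[i][j] = 1 when function i calls function j."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for caller, callee in edges:
        matrix[index[caller]][index[callee]] = 1
    return matrix


V, E = parse_call_graph(DOT_TEXT)
A = adjacency_matrix(V, E)
```

A real DOT file also carries node and edge attributes, which would populate $A_V$ and $A_E$; a full DOT parser would be the safer choice outside a sketch.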
Preferably, according to the present invention, the function call structure semantic encoding stage includes:
the code file is preprocessed to obtain the graph representation $G$ of the function call structure, the function names, and the built-in class names. A graph autoencoder based on a Graph Convolutional Neural Network (GCN) performs semantic encoding on the graph representation $G$ to obtain the structure semantic vector $h_s$, which contains the function attribute information and the function call edge information. The encoder is calculated as shown in formulas (II) to (IV):
in formulas (II), (III), and (IV), $h_s$ is the function call structure semantic vector; $A_V$ is the set of function attributes; $A$ is the adjacency matrix of the graph representation of the function call structure; $\tilde{A}$ is the symmetric normalized adjacency matrix;
ReLU is the activation function; $W_0$ and $W_1$ are the parameters to be learned, obtained by training the graph autoencoder; the graph autoencoder comprises an encoder and a decoder, and the encoder is calculated as shown in formulas (II) to (IV);
$D$ is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices;
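Formulas (II) to (IV) are not reproduced in this text. One reconstruction consistent with the symbol definitions above, assuming the common two-layer graph autoencoder encoder, is given below; the split across the three formula numbers and the exact form are assumptions, not confirmed by the source.

```latex
\begin{aligned}
h_s &= \mathrm{GCN}(A_V, A) && \text{(II)} \\
\mathrm{GCN}(A_V, A) &= \tilde{A}\,\mathrm{ReLU}\!\left(\tilde{A}\, A_V W_0\right) W_1 && \text{(III)} \\
\tilde{A} &= D^{-1/2} A\, D^{-1/2} && \text{(IV)}
\end{aligned}
```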
the graph autoencoder uses a sigmoid function as the decoder to reconstruct the original graph; the decoder is calculated as shown in formula (V):
in formula (V), $\hat{A}$ is the reconstructed adjacency matrix of the original graph; $\sigma$ is the sigmoid function.
According to the invention, preferably, in order for the structure semantic vector to contain rich node and edge information and for the reconstructed adjacency matrix to be as similar as possible to the original adjacency matrix, the cross entropy between the reconstructed adjacency matrix and the adjacency matrix of the original graph is used as the loss function, calculated as shown in formula (VI):
in formula (VI), $N$ represents the number of elements in the set of function call edges; $a_{ij}$ represents an element of the adjacency matrix $A$ of the original graph; $\hat{a}_{ij}$ represents an element of the reconstructed adjacency matrix $\hat{A}$.
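Formulas (V) and (VI) are likewise not reproduced in this text. Assuming the usual inner-product decoder of a graph autoencoder and a standard binary cross-entropy over adjacency entries, they could take the following form; this is a hedged reconstruction from the definitions above, and the element symbols $a_{ij}$, $\hat{a}_{ij}$ and the normalization by $N$ are assumptions.

```latex
\begin{aligned}
\hat{A} &= \sigma\!\left(h_s h_s^{\top}\right) && \text{(V)} \\
L &= -\frac{1}{N} \sum_{i,j} \left[ a_{ij} \log \hat{a}_{ij} + (1 - a_{ij}) \log\!\left(1 - \hat{a}_{ij}\right) \right] && \text{(VI)}
\end{aligned}
```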
According to the present invention, preferably, the call information encoding stage for the function names and built-in class names includes: in the function similarity part, function similarity is compared using the call-information vector of the function names. Algorithm code contains a large number of functions that provide basic functionality, and these functions have well-defined behavior, input parameters, and return values, so code with similar functionality also uses similar functions. The function call-information vector is obtained from the TF-IDF calculation module for function names, and the function vector $h_f$ is calculated as shown in formula (VII):
$h_f = [f(fun_1), f(fun_2), \ldots, f(fun_i), \ldots]$ (VII)
in formula (VII), $h_f$ represents the function vector; $f(fun_i)$ represents the TF-IDF value of the $i$-th called function in the code;
in the built-in class similarity part, built-in class similarity is compared using the call-information vector of the built-in class names. Functions are generally encapsulated in built-in classes, so if the functionality implemented by two pieces of code is similar, the built-in classes they import are also similar. The built-in class call-information vector is obtained from the TF-IDF calculation module for built-in classes, and the built-in class vector $h_c$ is calculated as shown in formula (VIII):
$h_c = [f(cls_1), f(cls_2), \ldots, f(cls_i), \ldots]$ (VIII)
in formula (VIII), $h_c$ represents the built-in class vector; $f(cls_i)$ represents the TF-IDF value of the $i$-th built-in class imported in the code.
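The TF-IDF encoding of called function names can be illustrated with a small sketch. The source does not state which TF-IDF weighting variant is used, so the formula here (term frequency times `log(N / df) + 1`) is an assumption, as are the helper name `tfidf_vectors` and the sample call lists. The same routine would apply to built-in class names.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each document (a list of names) to a TF-IDF vector over a shared vocabulary.

    Assumed weighting: tf = count / len(doc), idf = log(N / df) + 1.
    """
    vocabulary = sorted({name for doc in documents for name in doc})
    # Document frequency: in how many documents each name appears.
    doc_freq = Counter(name for doc in documents for name in set(doc))
    n_docs = len(documents)
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty document
        vectors.append([
            (counts[name] / total) * (math.log(n_docs / doc_freq[name]) + 1.0)
            for name in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical called-function lists for two code files.
code_1_calls = ["main", "nextInt", "print", "print"]
code_2_calls = ["main", "readLine", "print"]

vocab, (h_f_1, h_f_2) = tfidf_vectors([code_1_calls, code_2_calls])
```

Sharing one vocabulary across both files keeps the two vectors aligned position by position, which the later cosine comparison relies on.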
According to the invention, preferably, the function call structure vector, the function vector, and the built-in class vector are concatenated into the technical feature vector $h$, calculated as shown in formula (IX):
$h = h_s \oplus h_f \oplus h_c$ (IX)
in formula (IX), $h$ represents the technical feature vector and $\oplus$ is the vector concatenation operator; because the lengths of the function call structure vector, the function vector, and the built-in class vector all depend on the number of functions called and the number of built-in classes imported in the code, each of the three vectors is linearly transformed into a vector of fixed dimension;
finally, the cosine similarity $sim_\alpha$ of the technical feature vectors is calculated as the similarity between the retrieved reusable code and the newly designed code, as shown in formula (X):
$sim_\alpha = \cos(h_{sea}, h_{des})$ (X)
in formula (X), $sim_\alpha$ is the similarity value between the retrieved reusable code and the newly designed code; $\cos$ is the cosine similarity function; $h_{sea}$ is the technical feature vector of the retrieved reusable code; $h_{des}$ is the technical feature vector of the newly designed code.
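The concatenation and cosine comparison can be sketched in a few lines. The component vectors below are hypothetical and are assumed to have already been brought to a fixed dimension; the linear transformation mentioned in the text is not shown, and the function names are illustrative only.

```python
import math


def concatenate(*vectors):
    """Splice vectors end to end (the concatenation step)."""
    return [value for vector in vectors for value in vector]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical fixed-dimension structure, function, and built-in class vectors.
h_des = concatenate([0.2, 0.5], [0.6, 0.7], [0.8, 0.6])  # newly designed code
h_sea = concatenate([0.3, 0.4], [0.5, 0.9], [0.7, 0.2])  # retrieved reusable code

sim_alpha = cosine_similarity(h_sea, h_des)
```

Because cosine similarity needs equal-length inputs, the fixed-dimension transformation has to happen before concatenation or the comparison is undefined.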
A system loaded with the above method, comprising:
a code file preprocessing module; a function call structure semantic encoding module; a call information encoding module for function names and built-in class names; a concatenation module that splices the structure vector, the function vector, and the built-in class vector into an overall technical feature vector; and a code comparison module based on the technical feature vector.
A program product loaded with the above method, comprising: the computer program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the above-described method.
A computer-readable storage medium loaded with the above method, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of any of the methods described herein.
The invention discloses an application method using the method, which comprises the following steps: performing graph semantic encoding with a graph autoencoder based on a graph convolutional neural network, and comparing code structure semantics using semantic vectors; comparing function similarity and built-in class similarity using call-information vectors of the function names and built-in class names; and finally, concatenating the structure vector, the function vector, and the built-in class vector into an overall technical feature vector and comparing code similarity. The method does not require a labeled corpus, extracts correlation and difference information between pieces of code, and compares code according to technical features.
The invention has the technical effects that:
1. Compared with traditional methods, the code comparison method of the invention does not require a labeled corpus, jointly considers the semantic information of technical features such as function names, call structure, and built-in classes, and can effectively extract correlation and difference information between pieces of code.
2. Compared with traditional methods, the invention uses a graph autoencoder based on a graph convolutional neural network for graph semantic encoding in the function call structure semantic encoding stage, and can capture the semantic information of the function call structure.
3. Compared with traditional methods, the invention compares function similarity and built-in class similarity using call-information vectors of the function names and built-in class names, which preserves the logical call information of the code and further improves the precision of the method.
Drawings
FIG. 1 is a flow chart of a source code comparison method for technical features of the present invention;
FIG. 2 is a diagram of the code comparison model framework oriented to technical features of the present invention.
Detailed Description
The following detailed description is made with reference to the embodiments and the accompanying drawings, but not limited thereto.
Example 1
As shown in fig. 1 and fig. 2, a method for comparing source codes of technical features includes:
a code file preprocessing stage, which outputs the function call structure, function names, and built-in class names of the code;
a function call structure semantic encoding stage, in which a graph autoencoder based on a graph convolutional neural network semantically encodes the function call structure to obtain a function call structure vector, so that code structure semantics can be compared using semantic vectors; preferably, the graph semantic encoding is implemented according to the following document: William L. Hamilton, Rex Ying, Jure Leskovec. Inductive representation learning on large graphs [C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 1025-1035;
a call information encoding stage, in which the TF-IDF algorithm is applied to the call information of the function names and the built-in class names to obtain a function vector and a built-in class vector respectively, so that function similarity and built-in class similarity can be compared using these call-information vectors;
and finally, the function call structure vector, the function vector, and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity.
The code file preprocessing stage comprises:
a DOT file of the function call structure among the code's technical features is obtained using a function call structure generation tool; the DOT file describes the nodes in the graph representation of the function call structure and the relationships among them. For example, the Pycallgraph tool can generate a graph representation of the function call structure of a Python application, and the Graphviz tool can convert a DOT file into a picture of the function call structure. The DOT file is preprocessed to obtain the graph representation $G$ of the function call structure:
$G = (V, A_V, E, A_E)$ (I)
In formula (I), $V$ represents the set of function names; $A_V$ represents the set of function attributes; $E$ represents the set of function call edges; $A_E$ represents the set of edge attributes. From the graph representation $G$ of the function call structure, the adjacency matrix $A$ of the graph representation is further obtained.
The semantic coding stage of the function call structure comprises the following steps:
the code file is preprocessed to obtain the graph representation $G$ of the function call structure, the function names, and the built-in class names. A graph autoencoder based on a Graph Convolutional Neural Network (GCN) performs semantic encoding on the graph representation $G$ to obtain the structure semantic vector $h_s$, which contains the function attribute information and the function call edge information. The encoder is calculated as shown in formulas (II) to (IV):
in formulas (II), (III), and (IV), $h_s$ is the function call structure semantic vector; $A_V$ is the set of function attributes; $A$ is the adjacency matrix of the graph representation of the function call structure; $\tilde{A}$ is the symmetric normalized adjacency matrix;
ReLU is the activation function; $W_0$ and $W_1$ are the parameters to be learned, obtained by training the graph autoencoder; the graph autoencoder comprises an encoder and a decoder, and the encoder is calculated as shown in formulas (II) to (IV);
$D$ is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices;
the graph autoencoder uses a sigmoid function as the decoder to reconstruct the original graph; the decoder is calculated as shown in formula (V):
in formula (V), $\hat{A}$ is the reconstructed adjacency matrix of the original graph; $\sigma$ is the sigmoid function.
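A forward pass through the encoder and decoder described in this example might look like the sketch below. It assumes the common graph autoencoder form (two graph-convolution layers followed by a sigmoid over the inner product of the embeddings); the source's exact formulas are not reproduced here, so this form is an assumption. The weights are untrained placeholders, the fourth attribute row is invented, and the call graph is treated as undirected so that the symmetric normalization is well defined.

```python
import math


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}, D being the diagonal degree matrix."""
    degrees = [sum(row) for row in A]
    scale = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    n = len(A)
    return [[scale[i] * A[i][j] * scale[j] for j in range(n)] for i in range(n)]


def relu(X):
    return [[max(0.0, v) for v in row] for row in X]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(A, A_V, W0, W1):
    """Assumed encoder: A_norm * ReLU(A_norm * A_V * W0) * W1."""
    A_norm = normalize_adjacency(A)
    hidden = relu(matmul(matmul(A_norm, A_V), W0))
    return matmul(matmul(A_norm, hidden), W1)


def decode(H):
    """Assumed decoder: element-wise sigmoid of H * H^T."""
    H_T = [list(col) for col in zip(*H)]
    return [[sigmoid(v) for v in row] for row in matmul(H, H_T)]


# Undirected version of the edges main-nextInt, main-for, for-print.
A = [[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]]
A_V = [[0.32, 0.41], [0.85, 0.20], [0.17, 0.66], [0.50, 0.50]]  # last row hypothetical
W0 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]    # untrained placeholder weights
W1 = [[0.7, -0.8], [0.9, 0.1], [-0.2, 0.3]]  # untrained placeholder weights

h_s = encode(A, A_V, W0, W1)
A_hat = decode(h_s)
```

In practice $W_0$ and $W_1$ would be fitted by minimizing the reconstruction loss, typically with an automatic-differentiation library rather than hand-written matrix code.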
Example 2
In order for the structure semantic vector to contain rich node and edge information and for the reconstructed adjacency matrix to be as similar as possible to the original adjacency matrix, the cross entropy between the reconstructed adjacency matrix and the adjacency matrix of the original graph is used as the loss function, calculated as shown in formula (VI):
in formula (VI), $N$ represents the number of elements in the set of function call edges; $a_{ij}$ represents an element of the adjacency matrix $A$ of the original graph; $\hat{a}_{ij}$ represents an element of the reconstructed adjacency matrix $\hat{A}$.
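The cross-entropy reconstruction loss can be sketched as below. This version averages binary cross-entropy over every adjacency entry; the source normalizes by $N$, so the normalizer used here is an assumption, and the function name and sample matrices are hypothetical.

```python
import math


def reconstruction_loss(A, A_hat, eps=1e-12):
    """Mean binary cross-entropy between the entries of A and A_hat."""
    total, count = 0.0, 0
    for row, row_hat in zip(A, A_hat):
        for a, a_hat in zip(row, row_hat):
            # Clamp probabilities away from 0 and 1 so log() stays finite.
            p = min(max(a_hat, eps), 1.0 - eps)
            total -= a * math.log(p) + (1 - a) * math.log(1.0 - p)
            count += 1
    return total / count


# Hypothetical 2x2 adjacency matrix and two candidate reconstructions.
A = [[0, 1], [1, 0]]
close_reconstruction = [[0.1, 0.9], [0.9, 0.1]]
poor_reconstruction = [[0.9, 0.1], [0.1, 0.9]]

loss_close = reconstruction_loss(A, close_reconstruction)
loss_poor = reconstruction_loss(A, poor_reconstruction)
```

A reconstruction closer to the original adjacency matrix gives a smaller loss, which is the property training exploits.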
Example 3
The technical-feature-oriented source code comparison method as described in Example 1, wherein
the call information encoding stage for the function names and built-in class names includes: in the function similarity part, function similarity is compared using the call-information vector of the function names. Algorithm code contains a large number of functions that provide basic functionality, and these functions have well-defined behavior, input parameters, and return values, so code with similar functionality also uses similar functions. The function call-information vector is obtained from the TF-IDF calculation module for function names, and the function vector $h_f$ is calculated as shown in formula (VII):
$h_f = [f(fun_1), f(fun_2), \ldots, f(fun_i), \ldots]$ (VII)
in formula (VII), $h_f$ represents the function vector; $f(fun_i)$ represents the TF-IDF value of the $i$-th called function in the code;
in the built-in class similarity part, built-in class similarity is compared using the call-information vector of the built-in class names. Functions are generally encapsulated in built-in classes, so if the functionality implemented by two pieces of code is similar, the built-in classes they import are also similar. The built-in class call-information vector is obtained from the TF-IDF calculation module for built-in classes, and the built-in class vector $h_c$ is calculated as shown in formula (VIII):
$h_c = [f(cls_1), f(cls_2), \ldots, f(cls_i), \ldots]$ (VIII)
in formula (VIII), $h_c$ represents the built-in class vector; $f(cls_i)$ represents the TF-IDF value of the $i$-th built-in class imported in the code.
Example 4
In the technical-feature-oriented source code comparison method, the function call structure vector, the function vector, and the built-in class vector are concatenated into the technical feature vector $h$, calculated as shown in formula (IX):
$h = h_s \oplus h_f \oplus h_c$ (IX)
in formula (IX), $h$ represents the technical feature vector and $\oplus$ is the vector concatenation operator; because the lengths of the function call structure vector, the function vector, and the built-in class vector all depend on the number of functions called and the number of built-in classes imported in the code, each of the three vectors is linearly transformed into a vector of fixed dimension;
finally, the cosine similarity $sim_\alpha$ of the technical feature vectors is calculated as the similarity between the retrieved reusable code and the newly designed code, as shown in formula (X):
$sim_\alpha = \cos(h_{sea}, h_{des})$ (X)
in formula (X), $sim_\alpha$ is the similarity value between the retrieved reusable code and the newly designed code; $\cos$ is the cosine similarity function; $h_{sea}$ is the technical feature vector of the retrieved reusable code; $h_{des}$ is the technical feature vector of the newly designed code.
Example 5
A system loaded with the method of embodiments 1-4, comprising:
a code file preprocessing module; a function call structure semantic encoding module; a call information encoding module for function names and built-in class names; a concatenation module that splices the structure vector, the function vector, and the built-in class vector into an overall technical feature vector; and a code comparison module based on the technical feature vector.
Example 6
A program product loaded with the method of embodiments 1-4, comprising: the computer program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the above-described method.
Example 7
A computer-readable storage medium loaded with the method according to embodiments 1-4, having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of any of the methods described herein.
Example 8
An application method using the method as described in Examples 1-4: performing graph semantic encoding with a graph autoencoder based on a graph convolutional neural network, and comparing code structure semantics using semantic vectors; comparing function similarity and built-in class similarity using call-information vectors of the function names and built-in class names; and finally, concatenating the structure vector, the function vector, and the built-in class vector into an overall technical feature vector and comparing code similarity.
Using the methods described in Examples 1-5 and 8, and taking "syntactic analysis" code as an example, the retrieved reusable code and the newly designed code are processed as follows:
First, the code files are input:
the newly designed code is $code_1$, and the retrieved reusable code is $code_2$;
For code $code_1$, file preprocessing yields the graph representation $G$ of the function call structure, the function names, and the built-in class names:
Function name set $V$ = {main, nextInt, for, print, while, if}
Function attribute set $A_V$ = {[0.32,0.41],[0.85,0.20],[0.17,0.66]…}
Function call edge set $E$ = {(main,nextInt),(main,for),(for,print)…}
Edge attribute set $A_E$ = {1,1,1…}
The built-in class name set is {java.io.BufferedReader, java.util.HashMap, java.io.IOException …}
Function call structure semantic encoding stage:
function call structure semantic encoding is performed using a graph autoencoder based on a Graph Convolutional Neural Network (GCN).
Call information encoding stage for the function names and built-in class names:
the function call-information vector obtained from the TF-IDF calculation module for function names is $h_f$ = [0.61,0.73,0.82,0.53…]
The built-in class call-information vector obtained from the TF-IDF calculation module for built-in class names is $h_c$ = [0.86,0.62,0.45,0.72…]
Concatenation:
the structure semantic vector, the function vector, and the built-in class vector are concatenated into the technical feature vector $h_{des}$ = [0.24,0.56,0.46,0.89…,0.61,0.73,0.82,0.53…,0.86,0.62,0.45,0.72…]
Comparison of the retrieved reusable code $code_2$ with the newly designed code $code_1$:
the technical feature vector $h_{des}$ of the newly designed code $code_1$ is obtained from the concatenation result, and the technical feature vector $h_{sea}$ of the reusable code $code_2$ is obtained in the same way; the cosine similarity of the technical feature vectors is taken as the code similarity: $sim_\alpha = \cos(h_{sea}, h_{des}) = 0.75$.
Claims (10)
1. A technical-feature-oriented source code comparison method, characterized by comprising the following steps:
a code file preprocessing stage, which outputs the function call structure, function names, and built-in class names of the code;
a function call structure semantic encoding stage, in which a graph autoencoder based on a graph convolutional neural network semantically encodes the function call structure to obtain a function call structure vector;
a call information encoding stage, in which the TF-IDF algorithm is applied to the call information of the function names and the built-in class names to obtain a function vector and a built-in class vector respectively;
and finally, the function call structure vector, the function vector, and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity.
2. The technical-feature-oriented source code comparison method according to claim 1, wherein the code file preprocessing stage comprises:
obtaining, with a function call structure generation tool, a DOT file of the function call structure in the technical features of the code, the DOT file describing the nodes in the graph representation of the function call structure and the relations among the nodes; and preprocessing the DOT file to obtain the graph representation of the function call structure, shown in formula (I):
In formula (I), V represents the set of function names; A_V represents the set of function attributes; E represents the set of function call edges; A_E represents the set of edge attributes. The adjacency matrix A of the graph representation is then obtained from the graph representation of the function call structure.
3. The technical-feature-oriented source code comparison method according to claim 1, wherein the function call structure semantic encoding stage comprises:
the code file is preprocessed to obtain the graph representation of the function call structure, the function names, and the built-in class names; an autoencoder based on a graph convolutional network (GCN) semantically encodes the graph representation of the function call structure; the structural semantic vector h_s produced by the encoder contains the function attributes and the function call edge information, and is calculated as shown in formulas (II) to (IV):
In formulas (II), (III), and (IV), h_s is the function call structure semantic vector; A_V is the set of function attributes; A is the adjacency matrix of the graph representation of the function call structure; Ã is the symmetrically normalized adjacency matrix;
ReLU is the activation function; W_0 and W_1 are the parameters to be learned, obtained by training the graph autoencoder, which comprises an encoder and a decoder; the calculation of the encoder is shown in formulas (II) to (IV);
D is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices;
the graph autoencoder uses the sigmoid function as the decoder to reconstruct the original graph; the calculation of the decoder is shown in formula (V):
4. The technical-feature-oriented source code comparison method according to claim 3, wherein the cross entropy between the adjacency matrix of the reconstructed graph and the adjacency matrix of the original graph is used as the loss function, calculated as shown in formula (VI):
5. The technical-feature-oriented source code comparison method according to claim 1, wherein the function and built-in class call information encoding stage comprises: obtaining the function call information vector with the function-name TF-IDF calculation module, the function vector h_f being calculated as shown in formula (VII):
In formula (VII), h_f represents the function vector; f(fun_i) represents the TF-IDF value of the i-th function called in the code;
obtaining the built-in class call information vector with the built-in-class TF-IDF calculation module, the built-in class vector h_c being calculated as shown in formula (VIII):
In formula (VIII), h_c represents the built-in class vector; f(cls_i) represents the TF-IDF value of the i-th built-in class imported in the code.
6. The technical-feature-oriented source code comparison method according to claim 1, wherein the function call structure vector, the function vector, and the built-in class vector are concatenated to form the technical feature vector h, calculated as shown in formula (IX):
In formula (IX), h represents the technical feature vector, and the operator joining the three vectors denotes vector concatenation;
finally, the cosine similarity sim_α of the technical feature vectors is calculated and taken as the similarity between the retrieved reusable code and the newly designed code, as shown in formula (X):
In formula (X), sim_α is the similarity between the retrieved reusable code and the newly designed code; cos is the cosine similarity function; h_sea is the technical feature vector of the retrieved reusable code; h_des is the technical feature vector of the newly designed code.
7. A system implementing the method of any one of claims 1-6, comprising:
a code file preprocessing stage processing module; a function call structure semantic encoding stage processing module; a function and built-in class name call information encoding stage processing module; a module that concatenates the structure vector, the function vector, and the built-in class vector into an overall technical feature vector; and a code comparison module based on the technical feature vector.
8. A program product implementing the method of any one of claims 1-6, wherein the program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the method.
9. A computer-readable storage medium implementing the method of any one of claims 1-6, having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of any of the methods described in the present invention.
10. A method of using the method of any one of claims 1-6, comprising: performing graph semantic encoding with an autoencoder based on a graph convolutional network, and comparing code structure semantics based on the semantic vectors; comparing function similarity and built-in class similarity based on the call information vectors of the functions and built-in class names; and finally, concatenating the structure vector, the function vector, and the built-in class vector into an overall technical feature vector and comparing code similarity.
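Read against the symbol definitions given in claims 2 to 6, formulas (I) to (X) can be sketched as follows. This is a reconstruction from those definitions and from the standard graph autoencoder formulation, offered as an assumption and not as the authoritative claim text. The symbols G, \hat{A}, \mathcal{L}, N (the number of adjacency matrix entries), n, m, and the concatenation operator \Vert do not appear in the text above and are chosen here for illustration.

```latex
\begin{align}
G &= (V, A_V, E, A_E) \tag{I} \\
h_s &= \mathrm{GCN}(A_V, A) \tag{II} \\
\mathrm{GCN}(A_V, A) &= \tilde{A}\,\mathrm{ReLU}\!\left(\tilde{A} A_V W_0\right) W_1 \tag{III} \\
\tilde{A} &= D^{-1/2} A D^{-1/2} \tag{IV} \\
\hat{A} &= \mathrm{sigmoid}\!\left(h_s h_s^{\top}\right) \tag{V} \\
\mathcal{L} &= -\frac{1}{N} \sum_{i,j} \left[ A_{ij} \log \hat{A}_{ij}
  + (1 - A_{ij}) \log\!\left(1 - \hat{A}_{ij}\right) \right] \tag{VI} \\
h_f &= \left[ f(fun_1), f(fun_2), \dots, f(fun_n) \right] \tag{VII} \\
h_c &= \left[ f(cls_1), f(cls_2), \dots, f(cls_m) \right] \tag{VIII} \\
h &= h_s \,\Vert\, h_f \,\Vert\, h_c \tag{IX} \\
\mathrm{sim}_{\alpha} &= \cos(h_{sea}, h_{des}) \tag{X}
\end{align}
```

Only formula (X) is stated explicitly in the worked example; the remaining lines should be checked against the original publication before being relied on.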
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210808406.7A CN114880023B (en) | 2022-07-11 | 2022-07-11 | Technical feature oriented source code comparison method, system and program product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114880023A (en) | 2022-08-09 |
CN114880023B (en) | 2022-09-30 |
Family
ID=82683640
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210808406.7A Active CN114880023B (en) | 2022-07-11 | 2022-07-11 | Technical feature oriented source code comparison method, system and program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114880023B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101976318A (en) * | 2010-11-15 | 2011-02-16 | Beijing Institute of Technology | Detection method of code similarity based on digital fingerprints |
CN110147235A (en) * | 2019-03-29 | 2019-08-20 | Institute of Information Engineering, Chinese Academy of Sciences | Semantic comparison method and device between a kind of source code and binary code |
CN110502361A (en) * | 2019-08-29 | 2019-11-26 | Yangzhou University | Fine granularity defect positioning method towards bug report |
CN111897970A (en) * | 2020-07-27 | 2020-11-06 | Ping An Technology (Shenzhen) Co., Ltd. | Text comparison method, device and equipment based on knowledge graph and storage medium |
CN112286575A (en) * | 2020-10-20 | 2021-01-29 | Hangzhou Yunxiang Network Technology Co., Ltd. | Intelligent contract similarity detection method and system based on graph matching model |
CN112800172A (en) * | 2021-02-07 | 2021-05-14 | Chongqing University | Code searching method based on two-stage attention mechanism |
Non-Patent Citations (1)
Title |
---|
YUQING SUN et al.: "Constraint-based Authorization Management for Mobile Collaboration Services", 《2009 CONGRESS ON SERVICES》 * |
Also Published As
Publication number | Publication date |
---|---|
CN114880023B (en) | 2022-09-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||