CN114880023B - Technical feature oriented source code comparison method, system and program product - Google Patents
- Publication number
- CN114880023B (application CN202210808406.7A)
- Authority
- CN
- China
- Prior art keywords
- function
- vector
- built
- code
- class
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/75—Structural analysis for program understanding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/43—Checking; Contextual analysis
- G06F8/436—Semantic checking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A technical-feature-oriented source code comparison method, system and program product, belonging to the technical field of natural language processing. The method uses semantic encoding based on the function call structure and analyzes code similarity in terms of the function call structure, function names and built-in classes. A graph autoencoder based on a graph convolutional neural network performs graph semantic encoding, and code structure semantics are compared using the resulting semantic vectors. Function similarity and built-in class similarity are compared using the call information vectors of the function names and built-in class names. Finally, the structure vector, the function vector and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity. By jointly considering technical feature information such as function names, the call structure and built-in classes, the invention can better compare code according to its technical features.
Description
Technical Field
The invention discloses a technical-feature-oriented source code comparison method, system and program product, and belongs to the technical field of natural language processing.
Background
Open-source platforms give researchers an environment for sharing and exchanging code. As more and more deep learning models and code are shared there, an ecosystem of reusable code has formed, and researchers need to search it for solutions to specific problems. Modern algorithm design builds code in a modular way: a program usually contains a large number of basic utility functions, and information such as function names, the call structure and built-in classes provides important technical features of the code. The same problem can have various solutions. In text classification, for example, different neural network structures may be used (convolutional neural networks, recurrent neural networks, attention mechanisms and so on), and their performance differs. It is therefore necessary to analyze aspects such as code structure and execution efficiency, and to compare newly designed code with reusable code.
Code similarity is an important index in code comparison analysis. Current mainstream code similarity calculation methods fall into three categories: methods based on the call structure, methods based on static features, and methods based on binary code.
Methods based on the call structure calculate the similarity between codes from their logical structure, which generally includes the abstract syntax tree, the program call graph and the like. For example, the invention patent with publication number CN110737469A discloses a source code similarity evaluation method based on semantic information at function granularity. It computes embedding vectors for the identifiers and the control flow graph of each function and so takes the structure and node information of the control flow graph into account, but it does not consider the semantic information of technical features such as the built-in classes imported by the code.
Methods based on static features extract metric values from the source code to form feature vectors, and take the similarity between the feature vectors as the similarity of the codes. For example, the invention patent with publication number CN111290784A discloses a program source code similarity detection method suitable for large-scale samples, which calculates a locality-sensitive hash value over the text feature sequence and the feature weight sequence of each sample to be detected and uses that value as the sample feature vector.
Methods based on binary code generally disassemble the binary code to obtain the instruction sequence of each function, vectorize the instruction features, and finally calculate code similarity from the feature vectors. For example, Chinese patent publication No. CN113554101A discloses a binary code similarity detection method based on deep learning, which uses Structure2Vec to generate a graph embedding of the control flow graph of a binary function and introduces a CNN to process the sequential structure information between the basic blocks of the control flow graph, thereby better capturing the order of the blocks within the function.
To address these problems, the present method takes the semantic information of technical features fully into account. It uses semantic encoding based on the function call structure and analyzes code similarity in terms of the function call structure, function names and built-in classes. A graph autoencoder based on a graph convolutional neural network performs graph semantic encoding, and code structure semantics are compared using the resulting semantic vectors. Function similarity and built-in class similarity are compared using the call information vectors of the function names and built-in class names. Finally, the structure vector, the function vector and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity.
Disclosure of Invention
In view of the technical problems in the prior art, the invention discloses a technical-feature-oriented source code comparison method.
The invention also discloses a system for implementing the comparison method.
The invention also discloses a program product loaded with the method.
The invention also discloses a computer-readable storage medium loaded with the method.
The invention also discloses an application method using the method.
Summary of the invention:
The invention relates to a technical-feature-oriented source code comparison method. Modern algorithm design builds code in a modular way; the code usually contains a large number of basic utility functions, and information such as function names, the call structure and built-in classes provides important technical features of the code. The method aims to take the semantic information of these technical features fully into account and to analyze aspects such as code structure and execution efficiency in terms of the function call structure, function names and built-in classes.
Interpretation of terms:
1. Technical features: keywords that describe the functionality a code implements and the techniques it uses, such as the function call structure, function names, built-in class names and loss functions.
2. Function call structure similarity: object-oriented programming has three characteristics (encapsulation, inheritance and polymorphism), and most code follows a modular design, that is, it is designed with functions as units and modularized through them. If the functionality implemented by two codes is similar, the function call structures among their technical features are also similar.
3. Function similarity: code that implements a specific functionality usually uses functions with well-defined behavior, entry call parameters and return values. If the functionality implemented by two codes is similar, the function names among their technical features are also similar.
4. Built-in class similarity: functions are generally encapsulated in built-in classes. If the functionality implemented by two codes is similar, the built-in classes imported among their technical features are also similar.
The detailed technical scheme of the invention is as follows:
A technical-feature-oriented source code comparison method, characterized by comprising:
a code file preprocessing stage, which outputs the function call structure, the function names and the built-in class names of the code;
a function call structure semantic encoding stage, in which a graph autoencoder based on a graph convolutional neural network semantically encodes the function call structure to obtain a function call structure vector, used to compare code structure semantics on the basis of semantic vectors;
a call information encoding stage for function names and built-in class names, in which the TF-IDF algorithm encodes the call information of the function names and the built-in class names to obtain a function vector and a built-in class vector respectively, used to compare function similarity and built-in class similarity;
and finally, concatenating the function call structure vector, the function vector and the built-in class vector into an overall technical feature vector with which code similarity is compared.
Preferably, according to the present invention, the code file preprocessing stage includes:
a function call structure generation tool is used to obtain a DOT file of the function call structure among the technical features of the code; the DOT file describes the nodes of the graph representation of the function call structure and the relationships among them. For example, the Pycallgraph tool can generate a graph representation of the function call structure of a Python application, and the Graphviz tool can convert a DOT file into a picture of the function call structure. The DOT file is preprocessed to obtain the graph representation $G$ of the function call structure:
$$G = (V, A_V, E, A_E) \tag{I}$$
In formula (I), $V$ is the set of function names; $A_V$ is the set of function attributes; $E$ is the set of function call edges; $A_E$ is the set of edge attributes. The adjacency matrix $A$ of the graph representation is then obtained from $G$.
Preferably, according to the present invention, the function call structure semantic encoding stage includes:
preprocessing the code file yields the graph representation $G$ of the function call structure, the function names and the built-in class names. A graph autoencoder based on a Graph Convolutional Network (GCN) semantically encodes $G$, so that the structure semantic vector $h_s$ contains the function attributes and the function call edge information. The encoder is computed as shown in formulas (II) to (IV):
$$h_s = \mathrm{GCN}(A_V, A) \tag{II}$$
$$\mathrm{GCN}(A_V, A) = \tilde{A}\,\mathrm{ReLU}(\tilde{A} A_V W_0)\,W_1 \tag{III}$$
$$\tilde{A} = D^{-1/2} A D^{-1/2} \tag{IV}$$
In formulas (II) to (IV), $h_s$ is the function call structure semantic vector; $A_V$ is the set of function attributes; $A$ is the adjacency matrix of the graph representation of the function call structure; $\tilde{A}$ is the symmetrically normalized adjacency matrix;
$\mathrm{ReLU}$ is the activation function; $W_0$ and $W_1$ are the parameters to be learned, obtained by training the graph autoencoder, which consists of an encoder and a decoder;
$D$ is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices;
the graph autoencoder uses a sigmoid function as the decoder to reconstruct the original graph, computed as shown in formula (V):
$$\hat{A} = \sigma(h_s h_s^{\top}) \tag{V}$$
In formula (V), $\hat{A}$ is the reconstructed adjacency matrix of the original graph; $\sigma$ is the sigmoid function.
Preferably, according to the invention, so that the structure semantic vector contains rich node and edge information and the reconstructed adjacency matrix is as close as possible to the original one, the cross entropy between the reconstructed adjacency matrix and the adjacency matrix of the original graph is used as the loss function, computed as shown in formula (VI):
$$L = -\frac{1}{N}\sum \left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right] \tag{VI}$$
In formula (VI), $N$ is the number of elements in the function call edge set; $y$ is an element of the adjacency matrix $A$ of the original graph; $\hat{y}$ is the corresponding element of the reconstructed adjacency matrix $\hat{A}$.
Preferably, according to the present invention, the call information encoding stage for function names and built-in class names includes: in the function similarity part, function similarity is compared using the call information vector of the function names. Algorithm code contains a large number of functions that provide basic functionality, each with well-defined behavior, entry call parameters and return values, so codes with similar functionality also use similar functions. The function call information vector is obtained from the TF-IDF calculation module for function names, and the function vector $h_f$ is computed as shown in formula (VII):
$$h_f = [\, f(fun_1), f(fun_2), \dots \,] \tag{VII}$$
In formula (VII), $h_f$ is the function vector; $f(fun_i)$ is the TF-IDF value of the $i$-th function called in the code;
in the built-in class similarity part, built-in class similarity is compared using the call information vector of the built-in class names. Functions are generally encapsulated in built-in classes, so if the functionality implemented by two codes is similar, the built-in classes they import are also similar. The built-in class call information vector is obtained from the TF-IDF calculation module for built-in classes, and the built-in class vector $h_c$ is computed as shown in formula (VIII):
$$h_c = [\, f(cls_1), f(cls_2), \dots \,] \tag{VIII}$$
In formula (VIII), $h_c$ is the built-in class vector; $f(cls_i)$ is the TF-IDF value of the $i$-th built-in class imported by the code.
Preferably, according to the invention, the function call structure vector, the function vector and the built-in class vector are concatenated into the technical feature vector $h$, computed as shown in formula (IX):
$$h = [\, h_s ; h_f ; h_c \,] \tag{IX}$$
In formula (IX), $h$ is the technical feature vector and $[\,;\,]$ denotes vector concatenation. Because the lengths of the function call structure vector, the function vector and the built-in class vector all depend on the number of functions called in the code and the number of built-in classes imported, each of the three vectors is linearly transformed into a vector of fixed dimension;
finally, the cosine similarity $sim_\alpha$ of the technical feature vectors is calculated as the similarity between the retrieved reusable code and the newly designed code, as shown in formula (X):
$$sim_\alpha = \cos(h_{sea}, h_{des}) \tag{X}$$
In formula (X), $sim_\alpha$ is the similarity between the retrieved reusable code and the newly designed code; $\cos$ is the cosine similarity function; $h_{sea}$ is the technical feature vector of the retrieved reusable code; $h_{des}$ is the technical feature vector of the newly designed code.
A system loaded with the above method, comprising:
a processing module for the code file preprocessing stage; a processing module for the function call structure semantic encoding stage; a processing module for the call information encoding stage for function names and built-in class names; a module that concatenates the structure vector, the function vector and the built-in class vector into an overall technical feature vector; and a code comparison module based on the technical feature vector.
A program product loaded with the above method, comprising: the computer program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the above-described method.
A computer-readable storage medium loaded with the above method, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of any of the methods described herein.
The invention discloses an application method using the method, which comprises the following steps: performing graph semantic encoding with a graph autoencoder based on a graph convolutional neural network, and comparing code structure semantics using the semantic vectors; comparing function similarity and built-in class similarity using the call information vectors of the function names and built-in class names; and finally, concatenating the structure vector, the function vector and the built-in class vector into an overall technical feature vector and comparing code similarity. The method does not require a labeled corpus; it extracts the correlation and difference information among codes and compares the codes according to their technical features.
The invention has the technical effects that:
1. Compared with traditional methods, the code comparison method does not require a labeled corpus, takes into account the semantic information of technical features such as function names, the call structure and built-in classes, and can effectively extract the correlation and difference information among codes.
2. Compared with traditional methods, the method uses a graph autoencoder based on a graph convolutional neural network for graph semantic encoding in the function call structure semantic encoding stage, and can capture the semantic information of the function call structure.
3. Compared with traditional methods, the method compares function similarity and built-in class similarity using the call information vectors of the function names and built-in class names, which preserves the logical call information of the code and further improves the precision of the method.
Drawings
FIG. 1 is a flow chart of a source code comparison method for technical features of the present invention;
FIG. 2 is a diagram of the code comparison model framework oriented to technical features of the present invention.
Detailed Description
The invention is described in detail below with reference to the embodiments and the drawings, but is not limited thereto.
Example 1
As shown in FIG. 1 and FIG. 2, a technical-feature-oriented source code comparison method includes:
a code file preprocessing stage, which outputs the function call structure, the function names and the built-in class names of the code;
a function call structure semantic encoding stage, in which a graph autoencoder based on a graph convolutional neural network semantically encodes the function call structure to obtain a function call structure vector, used to compare code structure semantics on the basis of semantic vectors. Preferably, the graph semantic encoding is implemented according to the following document: William L. Hamilton, Rex Ying, Jure Leskovec. Inductive representation learning on large graphs [C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 1025-1035;
a call information encoding stage for function names and built-in class names, in which the TF-IDF algorithm encodes the call information of the function names and the built-in class names to obtain a function vector and a built-in class vector respectively, used to compare function similarity and built-in class similarity;
and finally, concatenating the function call structure vector, the function vector and the built-in class vector into an overall technical feature vector with which code similarity is compared.
The code file preprocessing stage comprises:
a function call structure generation tool is used to obtain a DOT file of the function call structure among the technical features of the code; the DOT file describes the nodes of the graph representation of the function call structure and the relationships among them. For example, the Pycallgraph tool can generate a graph representation of the function call structure of a Python application, and the Graphviz tool can convert a DOT file into a picture of the function call structure. The DOT file is preprocessed to obtain the graph representation $G$ of the function call structure:
$$G = (V, A_V, E, A_E) \tag{I}$$
In formula (I), $V$ is the set of function names; $A_V$ is the set of function attributes; $E$ is the set of function call edges; $A_E$ is the set of edge attributes. The adjacency matrix $A$ of the graph representation is then obtained from $G$.
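As an illustration of this preprocessing step, the following minimal Python sketch reads call edges from DOT text and builds the function name set and the adjacency matrix. The regular expression, the names `parse_dot_edges` and `build_adjacency`, and the sample DOT text are assumptions made for illustration; they are not taken from the patent or from any call-graph tool's API, and real DOT files can contain syntax this sketch does not handle.

```python
import re

# Hypothetical pattern: matches edges such as  "main" -> "nextInt";
EDGE_RE = re.compile(r'"?([\w.]+)"?\s*->\s*"?([\w.]+)"?')


def parse_dot_edges(dot_text):
    """Return the (caller, callee) pairs found in DOT text."""
    return EDGE_RE.findall(dot_text)


def build_adjacency(edges):
    """Build the node list V and the adjacency matrix A from call edges."""
    nodes = []
    for caller, callee in edges:
        for name in (caller, callee):
            if name not in nodes:
                nodes.append(name)
    index = {name: i for i, name in enumerate(nodes)}
    adjacency = [[0] * len(nodes) for _ in nodes]
    for caller, callee in edges:
        # Directed edge: row = caller, column = callee.
        adjacency[index[caller]][index[callee]] = 1
    return nodes, adjacency


# Sample DOT text, invented for this sketch.
SAMPLE_DOT = '''
digraph G {
    "main" -> "nextInt";
    "main" -> "for";
    "for" -> "print";
}
'''

edges = parse_dot_edges(SAMPLE_DOT)
nodes, adjacency = build_adjacency(edges)
```

Nodes are numbered in order of first appearance, so row and column `i` of `adjacency` correspond to `nodes[i]`.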
The semantic coding stage of the function call structure comprises the following steps:
preprocessing the code file yields the graph representation $G$ of the function call structure, the function names and the built-in class names. A graph autoencoder based on a Graph Convolutional Network (GCN) semantically encodes $G$, so that the structure semantic vector $h_s$ contains the function attributes and the function call edge information. The encoder is computed as shown in formulas (II) to (IV):
$$h_s = \mathrm{GCN}(A_V, A) \tag{II}$$
$$\mathrm{GCN}(A_V, A) = \tilde{A}\,\mathrm{ReLU}(\tilde{A} A_V W_0)\,W_1 \tag{III}$$
$$\tilde{A} = D^{-1/2} A D^{-1/2} \tag{IV}$$
In formulas (II) to (IV), $h_s$ is the function call structure semantic vector; $A_V$ is the set of function attributes; $A$ is the adjacency matrix of the graph representation of the function call structure; $\tilde{A}$ is the symmetrically normalized adjacency matrix;
$\mathrm{ReLU}$ is the activation function; $W_0$ and $W_1$ are the parameters to be learned, obtained by training the graph autoencoder, which consists of an encoder and a decoder;
$D$ is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices;
the graph autoencoder uses a sigmoid function as the decoder to reconstruct the original graph, computed as shown in formula (V):
$$\hat{A} = \sigma(h_s h_s^{\top}) \tag{V}$$
In formula (V), $\hat{A}$ is the reconstructed adjacency matrix of the original graph; $\sigma$ is the sigmoid function.
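The encoder described above can be sketched in plain Python as a two-layer graph convolution over a symmetrically normalized adjacency matrix. This is only an illustrative forward pass under stated assumptions: the weights are supplied by the caller rather than trained, the adjacency matrix is assumed symmetric (call edges treated as undirected), and the helper names `matmul`, `normalize_adjacency`, `relu` and `gcn_encode` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def normalize_adjacency(adjacency):
    """Symmetric normalization D^{-1/2} A D^{-1/2}.

    Assumptions of this sketch: A is symmetric, and nodes with
    degree 0 get a scaling factor of 0 to avoid division by zero.
    """
    degrees = [sum(row) for row in adjacency]
    scale = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    n = len(adjacency)
    return [[scale[i] * adjacency[i][j] * scale[j] for j in range(n)]
            for i in range(n)]


def relu(matrix):
    """Element-wise max(0, x)."""
    return [[max(0.0, x) for x in row] for row in matrix]


def gcn_encode(adjacency, features, w0, w1):
    """Two-layer encoder: A_norm * ReLU(A_norm * X * W0) * W1."""
    a_norm = normalize_adjacency(adjacency)
    hidden = relu(matmul(matmul(a_norm, features), w0))
    return matmul(matmul(a_norm, hidden), w1)


# Toy input: two functions connected by one edge, 2-d attributes,
# identity weights standing in for trained parameters.
adjacency = [[0, 1], [1, 0]]
features = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
embedding = gcn_encode(adjacency, features, identity, identity)
```

A practical implementation would normally use a numerical or graph-learning library and learn `w0` and `w1` by minimizing the reconstruction loss; the list-based version here only makes the matrix products explicit.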
Example 2
So that the structure semantic vector contains rich node and edge information and the reconstructed adjacency matrix is as close as possible to the original one, the cross entropy between the reconstructed adjacency matrix and the adjacency matrix of the original graph is used as the loss function, computed as shown in formula (VI):
$$L = -\frac{1}{N}\sum \left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right] \tag{VI}$$
In formula (VI), $N$ is the number of elements in the function call edge set; $y$ is an element of the adjacency matrix $A$ of the original graph; $\hat{y}$ is the corresponding element of the reconstructed adjacency matrix $\hat{A}$.
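A small Python sketch of the sigmoid decoder and the cross-entropy reconstruction loss may make the training objective more concrete. It is a sketch under assumptions, not the patented implementation: the mean is taken over all matrix entries (the normalizer may be defined differently in the original formula), the clipping constant `eps` is added for numerical safety, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode(embedding):
    """Reconstruct the adjacency matrix as sigmoid(Z * Z^T)."""
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in embedding]
            for zi in embedding]


def reconstruction_loss(adjacency, reconstructed, eps=1e-12):
    """Mean binary cross entropy between A and its reconstruction.

    Assumption: the mean runs over all matrix entries; eps keeps
    log() away from zero.
    """
    total, count = 0.0, 0
    for row, row_hat in zip(adjacency, reconstructed):
        for y, y_hat in zip(row, row_hat):
            y_hat = min(max(y_hat, eps), 1.0 - eps)
            total += y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
            count += 1
    return -total / count


# With an all-zero embedding every reconstructed entry is 0.5,
# so the loss equals log(2) whatever the true adjacency is.
adjacency = [[0, 1], [1, 0]]
zero_embedding = [[0.0, 0.0], [0.0, 0.0]]
reconstructed = decode(zero_embedding)
loss = reconstruction_loss(adjacency, reconstructed)
```

The all-zero case is a convenient sanity check: an uninformative embedding gives the chance-level loss, and training should push the loss below it.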
Example 3
The technical-feature-oriented source code comparison method as described in embodiment 1, wherein
the call information encoding stage for function names and built-in class names includes: in the function similarity part, function similarity is compared using the call information vector of the function names. Algorithm code contains a large number of functions that provide basic functionality, each with well-defined behavior, entry call parameters and return values, so codes with similar functionality also use similar functions. The function call information vector is obtained from the TF-IDF calculation module for function names, and the function vector $h_f$ is computed as shown in formula (VII):
$$h_f = [\, f(fun_1), f(fun_2), \dots \,] \tag{VII}$$
In formula (VII), $h_f$ is the function vector; $f(fun_i)$ is the TF-IDF value of the $i$-th function called in the code;
in the built-in class similarity part, built-in class similarity is compared using the call information vector of the built-in class names. Functions are generally encapsulated in built-in classes, so if the functionality implemented by two codes is similar, the built-in classes they import are also similar. The built-in class call information vector is obtained from the TF-IDF calculation module for built-in classes, and the built-in class vector $h_c$ is computed as shown in formula (VIII):
$$h_c = [\, f(cls_1), f(cls_2), \dots \,] \tag{VIII}$$
In formula (VIII), $h_c$ is the built-in class vector; $f(cls_i)$ is the TF-IDF value of the $i$-th built-in class imported by the code.
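The TF-IDF encoding of called function names (and, in the same way, of imported built-in class names) can be sketched as follows. The text does not state which TF-IDF variant is used, so the choice below (term frequency as count divided by list length, inverse document frequency as log(N / document frequency)) is an assumption, as are the function name `tfidf_vectors` and the sample call lists; each input list is assumed to be non-empty.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for a list of name lists.

    Assumed variant: tf = count / len(document),
    idf = log(N / document frequency).
    """
    vocabulary = sorted({name for doc in documents for name in doc})
    n_docs = len(documents)
    # Number of documents in which each name occurs at least once.
    doc_freq = Counter(name for doc in documents for name in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            (counts[name] / len(doc)) * math.log(n_docs / doc_freq[name])
            for name in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical call lists extracted from two code files.
calls_code1 = ["main", "print", "print"]
calls_code2 = ["main", "read"]
vocabulary, (h_f1, h_f2) = tfidf_vectors([calls_code1, calls_code2])
```

With this variant a name that appears in every file, such as `main` here, gets weight 0, so the vectors emphasize the functions that distinguish one code from another.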
Example 4
In the technical-feature-oriented source code comparison method described in embodiment 1, the function call structure vector, the function vector and the built-in class vector are concatenated into the technical feature vector $h$, computed as shown in formula (IX):
$$h = [\, h_s ; h_f ; h_c \,] \tag{IX}$$
In formula (IX), $h$ is the technical feature vector and $[\,;\,]$ denotes vector concatenation. Because the lengths of the function call structure vector, the function vector and the built-in class vector all depend on the number of functions called in the code and the number of built-in classes imported, each of the three vectors is linearly transformed into a vector of fixed dimension;
finally, the cosine similarity $sim_\alpha$ of the technical feature vectors is calculated as the similarity between the retrieved reusable code and the newly designed code, as shown in formula (X):
$$sim_\alpha = \cos(h_{sea}, h_{des}) \tag{X}$$
In formula (X), $sim_\alpha$ is the similarity between the retrieved reusable code and the newly designed code; $\cos$ is the cosine similarity function; $h_{sea}$ is the technical feature vector of the retrieved reusable code; $h_{des}$ is the technical feature vector of the newly designed code.
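The concatenation and cosine comparison can be illustrated with a short Python sketch. The component vectors and the projection weights below are invented for the example; how the fixed-dimension linear transformation is parameterized is not stated in the text, so `project` simply applies a caller-supplied weight matrix, and returning 0 for a zero vector is an assumption of this sketch.

```python
import math


def project(vector, weights):
    """Linear map to a fixed dimension; weights has one row per output."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def concatenate(*vectors):
    """Join several vectors end to end."""
    return [x for vector in vectors for x in vector]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: similarity is 0 when a vector is all zeros
    return dot / (norm_u * norm_v)


# Hypothetical, already fixed-length structure, function and class vectors.
h_des = concatenate([1.0, 0.0], [0.5, 0.5], [0.0, 1.0])
h_sea = concatenate([1.0, 0.0], [0.5, 0.5], [1.0, 0.0])
sim = cosine_similarity(h_sea, h_des)

# Example of mapping a 2-d vector to 3 dimensions with made-up weights.
projected = project([1.0, 2.0], [[1, 0], [0, 1], [1, 1]])
```

In this toy case the two codes share their structure and function components and differ only in the class component, which lowers the similarity below 1.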
Example 5
A system loaded with the method of embodiments 1-4, comprising:
a processing module for the code file preprocessing stage; a processing module for the function call structure semantic encoding stage; a processing module for the call information encoding stage for function names and built-in class names; a module that concatenates the structure vector, the function vector and the built-in class vector into an overall technical feature vector; and a code comparison module based on the technical feature vector.
Example 6
A program product loaded with the method of embodiments 1-4, comprising: the computer program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the above-described method.
Example 7
A computer-readable storage medium loaded with the method according to embodiments 1-4, having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of any of the methods described herein.
Example 8
An application method using the method described in embodiments 1-4: performing graph semantic encoding with a graph autoencoder based on a graph convolutional neural network, and comparing code structure semantics using the semantic vectors; comparing function similarity and built-in class similarity using the call information vectors of the function names and built-in class names; and finally, concatenating the structure vector, the function vector and the built-in class vector into an overall technical feature vector and comparing code similarity.
Using the methods described in embodiments 1-5 and 8, and taking "syntactic analysis" code as an example, the retrieved reusable code and the newly designed code are processed as follows:
First, the code files are input:
the newly designed code is $code_1$ and the retrieved reusable code is $code_2$;
For $code_1$, file preprocessing yields the graph representation $G$ of the function call structure, the function names and the built-in class names:
Function name set $V$ = {main, nextInt, for, print, while, if}
Function attribute set $A_V$ = {[0.32,0.41],[0.85,0.20],[0.17,0.66]…}
Function call edge set $E$ = {(main,nextInt),(main,for),(for,print)…}
Edge attribute set $A_E$ = {1,1,1…}
The built-in class name set is {java.io.BufferedReader, java.util.HashMap, java.io.IOException…}
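A built-in class name set like the one listed above could be collected by scanning the import statements of a Java source file. The following Python sketch shows one way to do so; the regular expression, the function name `extract_imported_classes` and the sample source are assumptions for illustration (static and wildcard imports are deliberately not matched), and the patent does not state how the extraction is actually performed.

```python
import re

# Assumed pattern for plain (non-static, non-wildcard) Java imports.
IMPORT_RE = re.compile(r'^\s*import\s+([\w.]+)\s*;', re.MULTILINE)


def extract_imported_classes(java_source):
    """Return the fully qualified class names imported by Java source."""
    return IMPORT_RE.findall(java_source)


# Sample source, invented for this sketch.
SAMPLE_SOURCE = '''
import java.io.BufferedReader;
import java.util.HashMap;
import java.io.IOException;

public class Demo {}
'''

classes = extract_imported_classes(SAMPLE_SOURCE)
```

The resulting list can be fed to the TF-IDF stage in the same way as the list of called function names.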
Function call structure semantic coding stage:
function call structure semantic coding is performed using Graph Convolutional Neural network (GCN) based self-encoder methods.
And a calling information encoding stage of the function and the built-in class name:
the TF-IDF calculation module according to the function name can obtain the function call information vectorh f =[0.61,0.73,0.82,0.53…]
The built-in class calling information vector can be obtained by the TF-IDF calculation module according to the built-in class nameh c =[0.86,0.62,0.45,0.72…]
Splicing:
splicing a structural semantic vector, a function vector and a built-in class vector to serve as a technical feature vectorh des =[0.24,0.56,0.46,0.89…,0.61,0.73,0.82,0.53…,0.86,0.62,0.45,0.72…]
Comparison of the retrieved reusable code $code_2$ with the newly designed code $code_1$:
the technical feature vector $h_{des}$ of the newly designed code $code_1$ is obtained from the concatenation result, and the technical feature vector $h_{sea}$ of the reusable code $code_2$ is obtained in the same way. The cosine similarity of the two technical feature vectors is taken as the code similarity: $\mathrm{sim}_\alpha = \cos(h_{sea}, h_{des}) = 0.75$.
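The concatenation and comparison steps can be sketched as follows. The values for `h_des` are only the leading entries shown in the example (the elided parts are left out), and the values for `h_sea` are hypothetical because the text does not list them, so this sketch does not reproduce the reported 0.75.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Leading entries of the three vectors for code_1, taken from the example.
h_s = [0.24, 0.56, 0.46, 0.89]
h_f = [0.61, 0.73, 0.82, 0.53]
h_c = [0.86, 0.62, 0.45, 0.72]
h_des = h_s + h_f + h_c  # list concatenation: structure, function, built-in class

# Hypothetical feature vector for code_2, built in the same order.
h_sea = [0.30, 0.50, 0.40, 0.80] + [0.55, 0.70, 0.90, 0.40] + [0.80, 0.10, 0.45, 0.70]

sim_alpha = cosine_similarity(h_sea, h_des)
print(f"sim_alpha = {sim_alpha:.3f}")
```

Both vectors must be built with the same segment order and the same per-segment dimensions; otherwise the positions being compared would not correspond.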
Claims (7)
1. A technical-feature-oriented source code comparison method, characterized by comprising the following steps:
a code file preprocessing stage, which outputs the function call structure, the function names, and the built-in class names of the code;
a function call structure semantic encoding stage: semantically encoding the function call structure using an autoencoder method based on a graph convolutional neural network to obtain a function call structure vector;
a calling information encoding stage: encoding the calling information of the function names and the built-in class names using the TF-IDF algorithm to obtain a function vector and a built-in class vector, respectively;
finally, concatenating the function call structure vector, the function vector, and the built-in class vector into an overall technical feature vector, which is used to compare code similarity;
the code file preprocessing stage comprises:
obtaining a DOT file of the function call structure in the code's technical features using a function call structure generation tool, wherein the DOT file describes the nodes in the graph representation of the function call structure and the relations among those nodes, and preprocessing the DOT file to obtain the graph representation of the function call structure:
In formula (I), $V$ represents the set of function names; $A_V$ represents the set of function attributes; $E$ represents the set of function call edges; $A_E$ represents the set of edge attributes; from the graph representation of the function call structure, the adjacency matrix $A$ of that graph representation is further obtained;
The function call structure semantic encoding stage comprises:
the code file is preprocessed to obtain the graph representation of the function call structure, the function names, and the built-in class names; the graph representation of the function call structure is semantically encoded using an autoencoder method based on a graph convolutional neural network (GCN); the structure semantic vector $h_s$ produced by the encoder contains the function attributes and the function call edge information, and is calculated as shown in formulas (II) to (IV):
in formulas (II), (III), and (IV), $h_s$ is the function call structure semantic vector; $A_V$ is the set of function attributes; $A$ is the adjacency matrix of the graph representation of the function call structure, from which the symmetrically normalized adjacency matrix is derived;
$\mathrm{ReLU}$ is the activation function; $W_0$ and $W_1$ are parameters to be learned, obtained by training the graph autoencoder, wherein the graph autoencoder comprises an encoder and a decoder, and the calculation of the encoder is shown in formulas (II) to (IV);
$D$ is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices;
the graph autoencoder uses a sigmoid function as the decoder to reconstruct the original graph, and the calculation of the decoder is shown in formula (V):
in formula (V), the output is the reconstructed adjacency matrix of the original graph; $\sigma$ is the sigmoid function;
the cross entropy between the reconstructed adjacency matrix and the adjacency matrix of the original graph is used as the loss function, calculated as shown in formula (VI):
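The decoder and loss described in this claim can be illustrated with the sketch below. It assumes the usual graph autoencoder form, in which the decoder applies a sigmoid to the inner products of node embeddings and the loss is the binary cross entropy averaged over all matrix entries. The averaging, the clipping constant `eps`, and the embedding values `Z` are assumptions for illustration and are not taken from the claim.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(Z):
    """Reconstruct edge probabilities as sigmoid(Z Z^T) from node embeddings Z."""
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in Z] for zi in Z]


def reconstruction_loss(A, A_rec, eps=1e-12):
    """Binary cross entropy between A and A_rec, averaged over all entries (assumed)."""
    n = len(A)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = min(max(A_rec[i][j], eps), 1.0 - eps)  # clip to keep log() finite
            total -= A[i][j] * math.log(p) + (1 - A[i][j]) * math.log(1.0 - p)
    return total / (n * n)


# Small undirected example graph and hypothetical 2-dimensional node embeddings.
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
Z = [[0.9, 0.1], [0.8, -0.2], [0.7, 0.3]]

A_rec = decode(Z)
print("loss =", round(reconstruction_loss(A, A_rec), 4))
```

Training would adjust the encoder weights so that this loss decreases, which pushes the embeddings of connected functions toward larger inner products.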
2. The method of claim 1, wherein the calling information encoding stage for the function names and built-in class names comprises: obtaining the function call information vector using the TF-IDF calculation module for function names, the function vector $h_f$ being calculated as shown in formula (VII):
in formula (VII), $h_f$ represents the function vector; $f(fun_i)$ represents the TF-IDF value of the $i$-th function called in the code;
obtaining the built-in class call information vector using the TF-IDF calculation module for built-in classes, the built-in class vector $h_c$ being calculated as shown in formula (VIII):
in formula (VIII), $h_c$ represents the built-in class vector; $f(cls_i)$ represents the TF-IDF value of the $i$-th built-in class imported in the code.
3. The technical-feature-oriented source code comparison method of claim 1, wherein the function call structure vector, the function vector, and the built-in class vector are concatenated as the technical feature vector $h$, calculated as shown in formula (IX):
in formula (IX), $h$ represents the technical feature vector, and the operator joining the three vectors is the vector concatenation operator;
finally, the cosine similarity $\mathrm{sim}_\alpha$ of the technical feature vectors is taken as the similarity between the retrieved reusable code and the newly designed code, calculated as shown in formula (X):
in formula (X), $\mathrm{sim}_\alpha$ is the similarity between the retrieved reusable code and the newly designed code; $\cos$ is the cosine similarity function; $h_{sea}$ is the technical feature vector of the retrieved reusable code; $h_{des}$ is the technical feature vector of the newly designed code.
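Based on the symbol definitions in this claim, formulas (IX) and (X) plausibly take the following form. This is a reading of the definitions, not a reproduction of the original formulas: the concatenation symbol $\oplus$ is an assumption, and the expansion of $\cos$ is simply the standard definition of cosine similarity.

```latex
% (IX) technical feature vector as a concatenation (the symbol \oplus is assumed)
h = h_s \oplus h_f \oplus h_c

% (X) code similarity as the cosine of the two technical feature vectors
\mathrm{sim}_\alpha = \cos(h_{sea}, h_{des})
  = \frac{h_{sea} \cdot h_{des}}{\lVert h_{sea} \rVert \, \lVert h_{des} \rVert}
```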
4. A system loaded with the method of any of claims 1-3, comprising:
a code file preprocessing stage processing module; a function call structure semantic encoding stage processing module; a calling information encoding stage processing module for function names and built-in class names; a module for finally concatenating the structure vector, the function vector, and the built-in class vector into an overall technical feature vector; and a code comparison function based on the technical feature vector.
5. A program product loaded with the method of any one of claims 1-3, wherein the program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the method.
6. A computer-readable storage medium loaded with the method of any of claims 1-3, having a computer program stored thereon, wherein the computer program, when executed by a processor, carries out the steps of the method.
7. An application method using the method of any one of claims 1 to 3, comprising: performing graph semantic encoding using an autoencoder method based on a graph convolutional neural network, and comparing code structure semantics based on the resulting semantic vectors; comparing function similarity and built-in class similarity based on the calling information vectors of the function names and built-in class names; and finally, concatenating the structure vector, the function vector, and the built-in class vector into an overall technical feature vector and comparing code similarity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210808406.7A CN114880023B (en) | 2022-07-11 | 2022-07-11 | Technical feature oriented source code comparison method, system and program product |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114880023A CN114880023A (en) | 2022-08-09 |
CN114880023B CN114880023B (en) | 2022-09-30 |
Family
ID=82683640
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210808406.7A Active CN114880023B (en) | 2022-07-11 | 2022-07-11 | Technical feature oriented source code comparison method, system and program product |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114880023B (en) |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101976318A (en) * | 2010-11-15 | 2011-02-16 | 北京理工大学 | Detection method of code similarity based on digital fingerprints |
CN110147235B (en) * | 2019-03-29 | 2021-01-01 | 中国科学院信息工程研究所 | Semantic comparison method and device between source code and binary code |
CN110502361B (en) * | 2019-08-29 | 2023-05-30 | 扬州大学 | Fine granularity defect positioning method for bug report |
CN111897970B (en) * | 2020-07-27 | 2024-05-10 | 平安科技(深圳)有限公司 | Text comparison method, device, equipment and storage medium based on knowledge graph |
CN112286575A (en) * | 2020-10-20 | 2021-01-29 | 杭州云象网络技术有限公司 | Intelligent contract similarity detection method and system based on graph matching model |
CN112800172B (en) * | 2021-02-07 | 2022-07-12 | 重庆大学 | Code searching method based on two-stage attention mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN114880023A (en) | 2022-08-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Conneau et al. | Very deep convolutional networks for natural language processing | |
Kushman et al. | Using semantic unification to generate regular expressions from natural language | |
CN112487812B (en) | Nested entity identification method and system based on boundary identification | |
CN109871955A (en) | A kind of aviation safety accident causality abstracting method | |
CN109840322A (en) | It is a kind of based on intensified learning cloze test type reading understand analysis model and method | |
CN106599041A (en) | Text processing and retrieval system based on big data platform | |
CN111124487B (en) | Code clone detection method and device and electronic equipment | |
CN101751385B (en) | Multilingual information extraction method adopting hierarchical pipeline filter system structure | |
CN112100401B (en) | Knowledge graph construction method, device, equipment and storage medium for science and technology services | |
CN116661805B (en) | Code representation generation method and device, storage medium and electronic equipment | |
CN110555305A (en) | Malicious application tracing method based on deep learning and related device | |
CN115357904B (en) | Multi-class vulnerability detection method based on program slicing and graph neural network | |
CN113743119B (en) | Chinese named entity recognition module, method and device and electronic equipment | |
CN113255320A (en) | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism | |
CN114881043B (en) | Deep learning model-based legal document semantic similarity evaluation method and system | |
CN114329225A (en) | Search method, device, equipment and storage medium based on search statement | |
CN115033890A (en) | Comparison learning-based source code vulnerability detection method and system | |
CN112925904A (en) | Lightweight text classification method based on Tucker decomposition | |
CN109446299A (en) | The method and system of searching email content based on event recognition | |
CN113704473A (en) | Media false news detection method and system based on long text feature extraction optimization | |
CN111831624A (en) | Data table creating method and device, computer equipment and storage medium | |
CN114880023B (en) | Technical feature oriented source code comparison method, system and program product | |
CN117312559A (en) | Method and system for extracting aspect-level emotion four-tuple based on tree structure information perception | |
CN112270358A (en) | Code annotation generation model robustness improving method based on deep learning | |
CN111931461A (en) | Variational self-encoder for text generation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |