CN114880023A - Technical feature oriented source code comparison method, system and program product - Google Patents

Publication number
CN114880023A
Authority
CN
China
Prior art keywords
function
vector
code
built
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210808406.7A
Other languages
Chinese (zh)
Other versions
CN114880023B (en)
Inventor
龚斌
宁祥东
孙宇清
万林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202210808406.7A
Publication of CN114880023A
Application granted
Publication of CN114880023B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis
    • G06F8/436 Semantic checking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A technical-feature-oriented source code comparison method, system and program product, belonging to the technical field of natural language processing. The invention comprises the following steps: a semantic encoding method based on the function call structure is used, and code similarity is analyzed in terms of the function call structure, function names, built-in classes and the like; graph semantic encoding is performed with an autoencoder based on a graph convolutional neural network, and code structure semantics are compared on the basis of the resulting semantic vectors; function similarity and built-in class similarity are compared on the basis of the call information vectors of the function names and built-in class names; finally, the structure vector, the function vector and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity. The invention jointly considers technical feature information such as function names, call structure and built-in classes, and can therefore better compare code according to its technical features.

Description

Technical feature oriented source code comparison method, system and program product
Technical Field
The invention discloses a technical-feature-oriented source code comparison method, system and program product, and belongs to the technical field of natural language processing.
Background
Open source platforms provide researchers with an environment for sharing and exchanging code. More and more deep learning models and their code are shared on such platforms, creating an ecosystem in which code can be reused, so researchers need to search for existing solutions to specific problems. Modern algorithm design builds code in a modular fashion; such code usually contains a large number of basic utility functions, and information such as function names, call structure and built-in classes provides important technical features of the code. Different solutions to the same problem, for example text classification, may use different neural network structures such as convolutional neural networks, recurrent neural networks and attention mechanisms, and their performance also varies. It is therefore necessary to further analyze aspects such as code structure and execution efficiency, and to compare newly designed code with reusable code.
Code similarity is an important indicator in code comparison analysis. Current mainstream code similarity calculation methods fall into three categories: methods based on call structure, methods based on static features, and methods based on binary code.
Methods based on call structure calculate the similarity between pieces of code mainly from the logical structure of the code, which generally includes the abstract syntax tree, the program call graph and the like. For example, the invention patent with publication number CN110737469A discloses a function-granularity source code similarity evaluation method based on semantic information, which takes into account the structure and node information of the control flow graph by computing embedding vectors for the identifiers of each function and for its control flow graph, but does not consider the semantic information of technical features such as the built-in classes imported by the code.
Methods based on static features extract metric values from the source code to form feature vectors, and then take the similarity between the feature vectors as the similarity of the code. For example, the invention patent with publication number CN111290784A discloses a program source code similarity detection method suitable for large-scale samples, which computes a locality-sensitive hash value from the text feature sequence and feature weight sequence of each sample to be detected and uses that value as the sample feature vector.
Methods based on binary code generally disassemble the binary code to obtain the instruction sequence of each function, vectorize the instruction features, and finally calculate code similarity from the feature vectors. For example, Chinese patent publication No. CN113554101A discloses a binary code similarity detection method based on deep learning, which uses Structure2Vec to generate a graph embedding of the control flow graph of a binary function and introduces a CNN to process the sequential structure information between the basic blocks of the control flow graph, thereby better capturing the order of the blocks within the function.
To address the above problems, the present method jointly considers the semantic information of technical features. It uses a semantic encoding method based on the function call structure and analyzes code similarity in terms of the function call structure, function names, built-in classes and the like; it performs graph semantic encoding with an autoencoder based on a graph convolutional neural network and compares code structure semantics on the basis of the resulting semantic vectors; it compares function similarity and built-in class similarity on the basis of the call information vectors of the function names and built-in class names; finally, it concatenates the structure vector, the function vector and the built-in class vector into an overall technical feature vector and compares code similarity.
Disclosure of Invention
To address the technical problems in the prior art, the invention discloses a technical-feature-oriented source code comparison method.
The invention also discloses a system implementing the comparison method.
The invention also discloses a program product carrying the method.
The invention also discloses a computer-readable storage medium carrying the method.
The invention also discloses an application method that makes use of the method.
Summary of the invention:
The invention relates to a technical-feature-oriented source code comparison method. Modern algorithm design builds code in a modular fashion; such code usually contains a large number of basic utility functions, and information such as function names, call structure and built-in classes provides important technical features of the code. The method aims to jointly consider the semantic information of these technical features and to analyze aspects such as code structure and execution efficiency in terms of the function call structure, function names, built-in classes and the like.
Interpretation of terms:
1. Technical features: keywords that describe the functionality implemented by the code and the techniques it uses, including the function call structure, function names, built-in class names, loss functions and the like.
2. Function call structure similarity: object-oriented programming has the three characteristics of encapsulation, inheritance and polymorphism, and most code follows a modular design in which functions are the basic units. If the functionality implemented by two pieces of code is similar, the function call structures among their technical features are also similar.
3. Function similarity: a specific piece of functionality is usually implemented by calling functions with a well-defined purpose, input parameters and return values. If the functionality implemented by two pieces of code is similar, the function names among their technical features are also similar.
4. Built-in class similarity: functions are generally encapsulated in built-in classes. If the functionality implemented by two pieces of code is similar, the built-in classes imported among their technical features are also similar.
The detailed technical scheme of the invention is as follows:
A technical-feature-oriented source code comparison method, characterized by comprising:
a code file preprocessing stage, which outputs the function call structure, function names and built-in class names of the code;
a function call structure semantic encoding stage, in which the function call structure is semantically encoded with an autoencoder based on a graph convolutional neural network to obtain a function call structure vector, so that code structure semantics can be compared on the basis of the semantic vector;
a call information encoding stage, in which the call information of the function names and built-in class names is encoded with the TF-IDF algorithm to obtain a function vector and a built-in class vector respectively, so that function similarity and built-in class similarity can be compared on the basis of these call information vectors;
and finally, concatenating the function call structure vector, the function vector and the built-in class vector into an overall technical feature vector used to compare code similarity.
Preferably, according to the present invention, the code file preprocessing stage includes:
obtaining a DOT file of the function call structure among the technical features of the code by using a function call structure generation tool, where the DOT file describes the nodes in the graph representation of the function call structure and the relationships among them. For example, the Pycallgraph tool can generate a graph representation of the function call structure of a Python application, and the Graphviz tool can convert a DOT file into a picture of the function call structure. The DOT file is preprocessed to obtain the graph representation $G$ of the function call structure:
$$G = (V, A_V, E, A_E) \tag{I}$$
In formula (I), $V$ denotes the set of function names; $A_V$ denotes the set of function attributes; $E$ denotes the set of function call edges; $A_E$ denotes the set of edge attributes. The adjacency matrix $A$ of the graph representation of the function call structure is then obtained from $G$.
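The step from a DOT file to an adjacency matrix can be sketched as follows. This is a minimal illustration rather than the patented preprocessing: the DOT text, the regular expression `EDGE_RE` and the helper `dot_to_adjacency` are hypothetical, and the sketch only handles simple `"a" -> "b"` edge statements without attributes or subgraphs.

```python
import re

# Hypothetical DOT text of the kind a call-graph tool might emit (assumption).
DOT_TEXT = '''
digraph G {
    "main" -> "nextInt";
    "main" -> "for";
    "for" -> "print";
}
'''

# Matches simple edge statements such as "main" -> "nextInt" (quotes optional).
EDGE_RE = re.compile(r'"?([\w.]+)"?\s*->\s*"?([\w.]+)"?')


def dot_to_adjacency(dot_text):
    """Return (nodes, adjacency) for the directed edges found in dot_text."""
    edges = EDGE_RE.findall(dot_text)
    nodes = []
    for src, dst in edges:
        for name in (src, dst):
            if name not in nodes:
                nodes.append(name)  # keep first-seen order for stable indices
    index = {name: i for i, name in enumerate(nodes)}
    adjacency = [[0] * len(nodes) for _ in nodes]
    for src, dst in edges:
        adjacency[index[src]][index[dst]] = 1  # 1 marks a call from src to dst
    return nodes, adjacency


nodes, A = dot_to_adjacency(DOT_TEXT)
```

Here `nodes` plays the role of the function name set and `A` is a directed adjacency matrix; whether the method symmetrizes the matrix before encoding is not stated in the source.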
Preferably, according to the present invention, the function call structure semantic encoding stage includes:
after preprocessing, the code file yields the graph representation $G$ of the function call structure, the function names and the built-in class names. An autoencoder based on a graph convolutional neural network (GCN) semantically encodes the graph representation $G$ into a structure semantic vector $h_s$ that contains the function attributes and the function call edge information. The encoder is calculated as shown in formulas (II) to (IV):
$$h_s = \mathrm{GCN}(A_V, A) \tag{II}$$
$$\mathrm{GCN}(A_V, A) = \tilde{A}\,\mathrm{ReLU}(\tilde{A} A_V W_0)\, W_1 \tag{III}$$
$$\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \tag{IV}$$
In formulas (II), (III) and (IV), $h_s$ is the function call structure semantic vector; $A_V$ is the set of function attributes; $A$ is the adjacency matrix of the graph representation of the function call structure; $\tilde{A}$ is the symmetric normalized adjacency matrix; ReLU is the activation function; $W_0$ and $W_1$ are the parameters to be learned, obtained by training the graph autoencoder, which consists of an encoder and a decoder, the encoder being calculated as in formulas (II) to (IV); $D$ is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices.
The graph autoencoder uses the sigmoid function as the decoder to reconstruct the original graph. The decoder is calculated as shown in formula (V):
$$\hat{A} = \sigma(h_s h_s^{\top}) \tag{V}$$
In formula (V), $\hat{A}$ is the reconstructed adjacency matrix of the original graph, and $\sigma$ is the sigmoid function.
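A forward pass of such an encoder and decoder can be sketched in plain Python. The sketch assumes the common two-layer graph autoencoder form (symmetric normalization of the adjacency matrix, a ReLU hidden layer, and an inner-product decoder followed by a sigmoid); the toy graph and the weight values `W0` and `W1` are hypothetical and untrained, and the helper names are illustrative only.

```python
import math


def matmul(X, Y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; zero-degree rows stay zero."""
    inv_sqrt = [1.0 / math.sqrt(sum(row)) if sum(row) > 0 else 0.0 for row in A]
    n = len(A)
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def relu(X):
    return [[max(0.0, v) for v in row] for row in X]


def sigmoid(X):
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in X]


def encode(A, A_V, W0, W1):
    """Two-layer GCN encoder: A_norm * ReLU(A_norm * A_V * W0) * W1."""
    A_norm = normalize_adjacency(A)
    hidden = relu(matmul(matmul(A_norm, A_V), W0))
    return matmul(matmul(A_norm, hidden), W1)


def decode(Z):
    """Inner-product decoder: sigmoid(Z * Z^T)."""
    return sigmoid(matmul(Z, transpose(Z)))


# Toy undirected call graph with three functions (hypothetical values).
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
A_V = [[0.32, 0.41], [0.85, 0.20], [0.17, 0.66]]  # one attribute row per function
W0 = [[0.5, -0.3], [0.2, 0.8]]   # untrained, illustrative weights
W1 = [[0.4, 0.1], [-0.6, 0.7]]

Z = encode(A, A_V, W0, W1)   # one embedding row per function
A_hat = decode(Z)            # reconstructed adjacency, entries in (0, 1)
```

The encoder output `Z` has one row per function; how the rows are flattened or pooled into a single structure vector is not specified in the source, so the sketch stops at the node embeddings.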
Preferably, according to the invention, so that the structure semantic vector contains rich node and edge information and the reconstructed adjacency matrix is as similar as possible to the original adjacency matrix, the cross entropy between the reconstructed adjacency matrix and the adjacency matrix of the original graph is used as the loss function, calculated as shown in formula (VI):
$$L = -\frac{1}{N} \sum_{i,j} \left[ a_{ij} \log \hat{a}_{ij} + (1 - a_{ij}) \log (1 - \hat{a}_{ij}) \right] \tag{VI}$$
In formula (VI), $N$ denotes the number of elements in the function call edge set; $a_{ij}$ denotes an element of the adjacency matrix $A$ of the original graph; $\hat{a}_{ij}$ denotes the corresponding element of the reconstructed adjacency matrix $\hat{A}$.
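The reconstruction loss can be illustrated with a short sketch. It assumes the usual binary cross-entropy between the entries of the original and reconstructed adjacency matrices and, as one possible reading of the normalizer, averages over all matrix entries; the function name `reconstruction_loss` and the clipping constant `eps` are hypothetical.

```python
import math


def reconstruction_loss(A, A_hat, eps=1e-12):
    """Mean binary cross-entropy between adjacency A and reconstruction A_hat."""
    total, count = 0.0, 0
    for row, row_hat in zip(A, A_hat):
        for a, a_hat in zip(row, row_hat):
            p = min(max(a_hat, eps), 1.0 - eps)  # clip to keep log() finite
            total += a * math.log(p) + (1 - a) * math.log(1.0 - p)
            count += 1
    return -total / count  # averaged over all entries (assumption)


A = [[0, 1], [1, 0]]
loss_uninformed = reconstruction_loss(A, [[0.5, 0.5], [0.5, 0.5]])
loss_close = reconstruction_loss(A, [[0.1, 0.9], [0.9, 0.1]])
```

A reconstruction that is closer to the original adjacency matrix gives a smaller loss, which is the quantity the autoencoder training would minimize.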
Preferably, according to the present invention, the call information encoding stage for function names and built-in class names includes:
in the function similarity part, function similarity is compared on the basis of the call information vector of the function names. Algorithm code contains a large number of functions that provide basic functionality, each with a well-defined purpose, input parameters and return values, so pieces of code with similar functionality call similar functions. The function call information vector, i.e. the function vector $h_f$, is obtained from the TF-IDF calculation module for function names, as shown in formula (VII):
$$h_f = [f(fun_1), f(fun_2), \ldots, f(fun_n)] \tag{VII}$$
In formula (VII), $h_f$ denotes the function vector; $f(fun_i)$ denotes the TF-IDF value of the $i$-th function called in the code.
In the built-in class similarity part, built-in class similarity is compared on the basis of the call information vector of the built-in class names. Functions are generally encapsulated in built-in classes, so if the functionality implemented by two pieces of code is similar, the built-in classes they import are also similar. The built-in class call information vector, i.e. the built-in class vector $h_c$, is obtained from the TF-IDF calculation module for built-in classes, as shown in formula (VIII):
$$h_c = [f(cls_1), f(cls_2), \ldots, f(cls_m)] \tag{VIII}$$
In formula (VIII), $h_c$ denotes the built-in class vector; $f(cls_i)$ denotes the TF-IDF value of the $i$-th built-in class imported in the code.
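A TF-IDF encoding of called function names can be sketched as below; the same routine would apply to built-in class names. The sketch assumes one common TF-IDF variant (term frequency as count divided by list length, inverse document frequency as the logarithm of the number of code files over the number of files containing the name); the source does not say which variant is used, and the call lists are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocab, vectors), one TF-IDF vector per list of names."""
    vocab = sorted({name for doc in documents for name in doc})
    n_docs = len(documents)
    # Document frequency: in how many code files each name appears.
    df = Counter(name for doc in documents for name in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[name] / total) * math.log(n_docs / df[name]) if total else 0.0
            for name in vocab
        ])
    return vocab, vectors


# Hypothetical lists of called function names for two code files.
calls_code1 = ["main", "nextInt", "print", "print"]
calls_code2 = ["main", "readLine", "print"]
vocab, (h_f1, h_f2) = tfidf_vectors([calls_code1, calls_code2])
```

Indexing both vectors by a shared, sorted vocabulary keeps them the same length, which makes them directly comparable; names that occur in every file receive weight zero under this variant.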
Preferably, according to the invention, the function call structure vector, the function vector and the built-in class vector are concatenated into the technical feature vector $h$, calculated as shown in formula (IX):
$$h = h_s \oplus h_f \oplus h_c \tag{IX}$$
In formula (IX), $h$ denotes the technical feature vector and $\oplus$ is the vector concatenation operator. Because the lengths of the function call structure vector, the function vector and the built-in class vector all depend on the number of functions called and the number of built-in classes imported in the code, each of the three vectors is linearly transformed into a vector of fixed dimension.
Finally, the cosine similarity $sim_\alpha$ of the technical feature vectors is taken as the similarity between the retrieved reusable code and the newly designed code, calculated as shown in formula (X):
$$sim_\alpha = \cos(h_{sea}, h_{des}) \tag{X}$$
In formula (X), $sim_\alpha$ is the similarity between the retrieved reusable code and the newly designed code; $\cos$ is the cosine similarity function; $h_{sea}$ is the technical feature vector of the retrieved reusable code; $h_{des}$ is the technical feature vector of the newly designed code.
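The concatenation and cosine comparison can be sketched as follows. The sketch assumes the three component vectors have already been brought to fixed dimensions (the linear transformation itself is not shown), and all numeric values and the helper names `concat` and `cosine_similarity` are hypothetical.

```python
import math


def concat(*vectors):
    """Concatenate several vectors into one technical feature vector."""
    out = []
    for v in vectors:
        out.extend(v)
    return out


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical fixed-dimension structure, function and built-in class vectors.
h_des = concat([0.2, 0.5], [0.6, 0.7], [0.8, 0.6])  # newly designed code
h_sea = concat([0.3, 0.4], [0.5, 0.9], [0.7, 0.5])  # retrieved reusable code
sim_alpha = cosine_similarity(h_sea, h_des)
```

Because cosine similarity ignores vector magnitude, two code files whose feature vectors point in the same direction score close to 1 regardless of how many functions each one calls.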
A system implementing the above method, comprising:
a processing module for the code file preprocessing stage; a processing module for the function call structure semantic encoding stage; a processing module for the call information encoding stage of function names and built-in class names; a module that concatenates the structure vector, the function vector and the built-in class vector into the overall technical feature vector; and a code comparison module based on the technical feature vector.
A program product carrying the above method: the computer program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the above method.
A computer-readable storage medium carrying the above method, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any of the methods described herein.
The invention also discloses an application method using the above method, comprising: performing graph semantic encoding with an autoencoder based on a graph convolutional neural network, and comparing code structure semantics on the basis of the semantic vectors; comparing function similarity and built-in class similarity on the basis of the call information vectors of the function names and built-in class names; and finally, concatenating the structure vector, the function vector and the built-in class vector into an overall technical feature vector and comparing code similarity. The method requires no annotated corpus, extracts the correlation and difference information among pieces of code, and compares code according to its technical features.
The technical effects of the invention are as follows:
1. Compared with traditional methods, the code comparison method requires no annotated corpus, jointly considers the semantic information of technical features such as function names, call structure and built-in classes, and can effectively extract the correlation and difference information among pieces of code.
2. Compared with traditional methods, the method performs graph semantic encoding with an autoencoder based on a graph convolutional neural network in the function call structure semantic encoding stage, and can thus capture the semantic information of the function call structure.
3. Compared with traditional methods, the method compares function similarity and built-in class similarity on the basis of the call information vectors of the function names and built-in class names, which preserves the logical call information of the code and further improves the precision of the method.
Drawings
FIG. 1 is a flow chart of a source code comparison method for technical features of the present invention;
FIG. 2 is a diagram of the code comparison model framework oriented to technical features of the present invention.
Detailed Description
The following detailed description is made with reference to the embodiments and the accompanying drawings, but not limited thereto.
Example 1
As shown in FIG. 1 and FIG. 2, a technical-feature-oriented source code comparison method includes:
a code file preprocessing stage, which outputs the function call structure, function names and built-in class names of the code;
a function call structure semantic encoding stage, in which the function call structure is semantically encoded with an autoencoder based on a graph convolutional neural network to obtain a function call structure vector, so that code structure semantics can be compared on the basis of the semantic vector. Preferably, the graph semantic encoding is implemented according to the following reference: William L. Hamilton, Rex Ying, Jure Leskovec. Inductive representation learning on large graphs [C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 1025-1035;
a call information encoding stage, in which the call information of the function names and built-in class names is encoded with the TF-IDF algorithm to obtain a function vector and a built-in class vector respectively, so that function similarity and built-in class similarity can be compared on the basis of these call information vectors;
and finally, concatenating the function call structure vector, the function vector and the built-in class vector into an overall technical feature vector used to compare code similarity.
The code file preprocessing stage comprises:
obtaining a DOT file of the function call structure among the technical features of the code by using a function call structure generation tool, where the DOT file describes the nodes in the graph representation of the function call structure and the relationships among them. For example, the Pycallgraph tool can generate a graph representation of the function call structure of a Python application, and the Graphviz tool can convert a DOT file into a picture of the function call structure. The DOT file is preprocessed to obtain the graph representation $G$ of the function call structure:
$$G = (V, A_V, E, A_E) \tag{I}$$
In formula (I), $V$ denotes the set of function names; $A_V$ denotes the set of function attributes; $E$ denotes the set of function call edges; $A_E$ denotes the set of edge attributes. The adjacency matrix $A$ of the graph representation of the function call structure is then obtained from $G$.
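Besides the call graph, the preprocessing stage also outputs function names and built-in class names. For Python source, one way to collect comparable information is the standard-library `ast` module, as sketched below. This is an illustrative analogue, not the tool chain named in the source: the sample program and the helper `extract_features` are hypothetical, and imported module or class names stand in for built-in class names.

```python
import ast

# Hypothetical sample program to analyze.
SOURCE = '''
import os
from collections import OrderedDict

def main():
    data = OrderedDict()
    print(os.getcwd())
    helper(data)

def helper(d):
    return len(d)
'''


def extract_features(source):
    """Return sorted (defined functions, called names, imported names)."""
    tree = ast.parse(source)
    defined, called, imported = set(), set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            defined.add(node.name)
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):         # plain call: helper(data)
                called.add(func.id)
            elif isinstance(func, ast.Attribute):  # attribute call: os.getcwd()
                called.add(func.attr)
        elif isinstance(node, ast.Import):
            imported.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""             # empty for "from . import x"
            imported.update(f"{module}.{alias.name}" for alias in node.names)
    return sorted(defined), sorted(called), sorted(imported)


defined, called, imported = extract_features(SOURCE)
```

The resulting name lists are the kind of input a TF-IDF encoding of function names and built-in class names would consume.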
The function call structure semantic encoding stage comprises:
after preprocessing, the code file yields the graph representation $G$ of the function call structure, the function names and the built-in class names. An autoencoder based on a graph convolutional neural network (GCN) semantically encodes the graph representation $G$ into a structure semantic vector $h_s$ that contains the function attributes and the function call edge information. The encoder is calculated as shown in formulas (II) to (IV):
$$h_s = \mathrm{GCN}(A_V, A) \tag{II}$$
$$\mathrm{GCN}(A_V, A) = \tilde{A}\,\mathrm{ReLU}(\tilde{A} A_V W_0)\, W_1 \tag{III}$$
$$\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \tag{IV}$$
In formulas (II), (III) and (IV), $h_s$ is the function call structure semantic vector; $A_V$ is the set of function attributes; $A$ is the adjacency matrix of the graph representation of the function call structure; $\tilde{A}$ is the symmetric normalized adjacency matrix; ReLU is the activation function; $W_0$ and $W_1$ are the parameters to be learned, obtained by training the graph autoencoder, which consists of an encoder and a decoder, the encoder being calculated as in formulas (II) to (IV); $D$ is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices.
The graph autoencoder uses the sigmoid function as the decoder to reconstruct the original graph. The decoder is calculated as shown in formula (V):
$$\hat{A} = \sigma(h_s h_s^{\top}) \tag{V}$$
In formula (V), $\hat{A}$ is the reconstructed adjacency matrix of the original graph, and $\sigma$ is the sigmoid function.
Example 2
So that the structure semantic vector contains rich node and edge information and the reconstructed adjacency matrix is as similar as possible to the original adjacency matrix, the cross entropy between the reconstructed adjacency matrix and the adjacency matrix of the original graph is used as the loss function, calculated as shown in formula (VI):
$$L = -\frac{1}{N} \sum_{i,j} \left[ a_{ij} \log \hat{a}_{ij} + (1 - a_{ij}) \log (1 - \hat{a}_{ij}) \right] \tag{VI}$$
In formula (VI), $N$ denotes the number of elements in the function call edge set; $a_{ij}$ denotes an element of the adjacency matrix $A$ of the original graph; $\hat{a}_{ij}$ denotes the corresponding element of the reconstructed adjacency matrix $\hat{A}$.
Example 3
In the technical-feature-oriented source code comparison method described in Example 1,
the call information encoding stage for function names and built-in class names comprises: in the function similarity part, function similarity is compared on the basis of the call information vector of the function names. Algorithm code contains a large number of functions that provide basic functionality, each with a well-defined purpose, input parameters and return values, so pieces of code with similar functionality call similar functions. The function call information vector, i.e. the function vector $h_f$, is obtained from the TF-IDF calculation module for function names, as shown in formula (VII):
$$h_f = [f(fun_1), f(fun_2), \ldots, f(fun_n)] \tag{VII}$$
In formula (VII), $h_f$ denotes the function vector; $f(fun_i)$ denotes the TF-IDF value of the $i$-th function called in the code.
In the built-in class similarity part, built-in class similarity is compared on the basis of the call information vector of the built-in class names. Functions are generally encapsulated in built-in classes, so if the functionality implemented by two pieces of code is similar, the built-in classes they import are also similar. The built-in class call information vector, i.e. the built-in class vector $h_c$, is obtained from the TF-IDF calculation module for built-in classes, as shown in formula (VIII):
$$h_c = [f(cls_1), f(cls_2), \ldots, f(cls_m)] \tag{VIII}$$
In formula (VIII), $h_c$ denotes the built-in class vector; $f(cls_i)$ denotes the TF-IDF value of the $i$-th built-in class imported in the code.
Example 4
In the technical-feature-oriented source code comparison method, the function call structure vector, the function vector and the built-in class vector are concatenated into the technical feature vector $h$, calculated as shown in formula (IX):
$$h = h_s \oplus h_f \oplus h_c \tag{IX}$$
In formula (IX), $h$ denotes the technical feature vector and $\oplus$ is the vector concatenation operator. Because the lengths of the function call structure vector, the function vector and the built-in class vector all depend on the number of functions called and the number of built-in classes imported in the code, each of the three vectors is linearly transformed into a vector of fixed dimension.
Finally, the cosine similarity $sim_\alpha$ of the technical feature vectors is taken as the similarity between the retrieved reusable code and the newly designed code, calculated as shown in formula (X):
$$sim_\alpha = \cos(h_{sea}, h_{des}) \tag{X}$$
In formula (X), $sim_\alpha$ is the similarity between the retrieved reusable code and the newly designed code; $\cos$ is the cosine similarity function; $h_{sea}$ is the technical feature vector of the retrieved reusable code; $h_{des}$ is the technical feature vector of the newly designed code.
Example 5
A system implementing the method of Examples 1-4, comprising:
a processing module for the code file preprocessing stage; a processing module for the function call structure semantic encoding stage; a processing module for the call information encoding stage of function names and built-in class names; a module that concatenates the structure vector, the function vector and the built-in class vector into the overall technical feature vector; and a code comparison module based on the technical feature vector.
Example 6
A program product carrying the method of Examples 1-4: the computer program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the above method.
Example 7
A computer-readable storage medium carrying the method of Examples 1-4, on which a computer program is stored, wherein the computer program, when executed by a processor, performs the steps of any of the methods described herein.
Example 8
An application method using the method described in Examples 1-4: performing graph semantic encoding with an autoencoder based on a graph convolutional neural network, and comparing code structure semantics on the basis of the semantic vectors; comparing function similarity and built-in class similarity on the basis of the call information vectors of the function names and built-in class names; and finally, concatenating the structure vector, the function vector and the built-in class vector into an overall technical feature vector and comparing code similarity.
Using the methods described in Examples 1-5 and 8, and taking "syntax analysis" code as an example, the retrieved reusable code and the newly designed code are processed as follows:
First, the code files are input:
The newly designed code is $code_1$ and the retrieved reusable code is $code_2$.
For $code_1$, file preprocessing yields the graph representation $G$ of the function call structure, the function names and the built-in class names, where
$$G = (V, A_V, E, A_E)$$
Set of function names: $V$ = {main, nextInt, for, print, while, if}
Set of function attributes: $A_V$ = {[0.32, 0.41], [0.85, 0.20], [0.17, 0.66], …}
Set of function call edges: $E$ = {(main, nextInt), (main, for), (for, print), …}
Set of edge attributes: $A_E$ = {1, 1, 1, …}
Set of built-in class names: {java.io.BufferedReader, java.util.HashMap, java.io.IOException, …}
From these, the adjacency matrix $A$ of the function call graph is obtained.
[Figure: adjacency matrix $A$]
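How the adjacency matrix follows from the node and edge sets can be sketched as below. This is a minimal illustration, not the patented implementation: the helper `build_adjacency` is a hypothetical name, only the three call edges that the example spells out are used (the source elides the rest), and treating the graph as undirected is an assumption.

```python
from typing import List, Tuple

# Function names V as listed in the example.
V: List[str] = ["main", "nextInt", "for", "print", "while", "if"]
# Only the three edges quoted in the example; the remaining edges are elided in the source.
E: List[Tuple[str, str]] = [("main", "nextInt"), ("main", "for"), ("for", "print")]


def build_adjacency(nodes: List[str], edges: List[Tuple[str, str]],
                    symmetric: bool = True) -> List[List[int]]:
    """Build a 0/1 adjacency matrix whose rows and columns follow the order of `nodes`."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for caller, callee in edges:
        i, j = index[caller], index[callee]
        A[i][j] = 1            # edge attribute is 1 for every call edge in the example
        if symmetric:
            A[j][i] = 1        # assumption: the call graph is treated as undirected
    return A


A = build_adjacency(V, E)
```

With only these three edges, the rows for `while` and `if` stay all zero; the full example would fill them from the elided edges.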
Function call structure semantic encoding stage:
Semantic encoding of the function call structure is performed with an autoencoder method based on a graph convolutional neural network (GCN), which gives
$h_s$ = [0.24, 0.56, 0.46, 0.89, …]
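The encoding stage can be sketched with NumPy, assuming the standard two-layer GCN autoencoder: a symmetrically normalized adjacency matrix, a ReLU hidden layer, an inner-product sigmoid decoder and a cross-entropy reconstruction loss. The weights below are random and untrained, the node attributes are random placeholders, and the mean-pooling readout that turns node embeddings into a single vector is a hypothetical choice not stated in the text.

```python
import numpy as np


def normalize_adjacency(A: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix."""
    A = np.asarray(A, dtype=float)
    degree = A.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    connected = degree > 0                      # isolated nodes keep zero rows
    inv_sqrt[connected] = degree[connected] ** -0.5
    return A * inv_sqrt[:, None] * inv_sqrt[None, :]


def encode(A_V: np.ndarray, A_norm: np.ndarray,
           W0: np.ndarray, W1: np.ndarray) -> np.ndarray:
    """Two-layer GCN encoder: A_norm @ ReLU(A_norm @ A_V @ W0) @ W1."""
    hidden = np.maximum(A_norm @ A_V @ W0, 0.0)  # ReLU activation
    return A_norm @ hidden @ W1


def decode(Z: np.ndarray) -> np.ndarray:
    """Inner-product decoder: sigmoid(Z Z^T) as the reconstructed adjacency."""
    return 1.0 / (1.0 + np.exp(-(Z @ Z.T)))


def reconstruction_loss(A: np.ndarray, A_hat: np.ndarray, eps: float = 1e-9) -> float:
    """Binary cross-entropy, averaged over all matrix entries (an assumption)."""
    A = np.asarray(A, dtype=float)
    terms = A * np.log(A_hat + eps) + (1.0 - A) * np.log(1.0 - A_hat + eps)
    return float(-np.mean(terms))


# Toy call graph with edges main-nextInt, main-for, for-print (undirected).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
A_V = rng.random((4, 2))                         # placeholder node attributes
W0 = rng.normal(scale=0.5, size=(2, 8))          # untrained weights
W1 = rng.normal(scale=0.5, size=(8, 4))

A_norm = normalize_adjacency(A)
Z = encode(A_V, A_norm, W0, W1)                  # one embedding row per function
A_hat = decode(Z)
loss = reconstruction_loss(A, A_hat)
h_s = Z.mean(axis=0)                             # hypothetical readout to one vector
```

In practice `W0` and `W1` would be fitted by minimizing `loss` with a gradient-based optimizer; the sketch only shows the forward pass.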
Encoding stage for the call information of function names and built-in class names:
The TF-IDF calculation module for function names yields the function call information vector $h_f$ = [0.61, 0.73, 0.82, 0.53, …]
The TF-IDF calculation module for built-in class names yields the built-in class call information vector $h_c$ = [0.86, 0.62, 0.45, 0.72, …]
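A TF-IDF vector over function names (or, equally, built-in class names) can be sketched as follows. The exact TF-IDF variant is not stated in this passage, so the sketch assumes term frequency as a relative count and inverse document frequency as log(N / df); the helper `tfidf_vector` and the toy corpus are illustrative.

```python
import math
from collections import Counter
from typing import List


def tfidf_vector(names: List[str], corpus: List[List[str]],
                 vocabulary: List[str]) -> List[float]:
    """TF-IDF value of every vocabulary term for one code file.

    names      -- function (or class) names occurring in the file being encoded
    corpus     -- name lists of all code files, used for document frequencies
    vocabulary -- fixed term order, shared by every file so vectors are comparable
    """
    counts = Counter(names)
    total = len(names)
    n_docs = len(corpus)
    vector = []
    for term in vocabulary:
        tf = counts[term] / total if total else 0.0
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / df) if df else 0.0   # assumed IDF variant
        vector.append(tf * idf)
    return vector


# Toy corpus of three code files; the names are illustrative.
corpus = [
    ["main", "nextInt", "print", "print"],
    ["main", "print"],
    ["main", "while"],
]
vocabulary = ["main", "nextInt", "print", "while"]
h_f = tfidf_vector(corpus[0], corpus, vocabulary)
```

A name such as `main` that appears in every file receives a weight of zero under this variant, so the vector emphasizes the calls that distinguish one code file from the others.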
Splicing:
The structural semantic vector, the function vector and the built-in class vector are spliced into the technical feature vector $h_{des}$ = [0.24, 0.56, 0.46, 0.89, …, 0.61, 0.73, 0.82, 0.53, …, 0.86, 0.62, 0.45, 0.72, …]
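The splicing step is a plain vector concatenation, sketched here with NumPy. Only the leading components quoted in the example are used; the source elides the remaining ones, so the result is a truncated illustration rather than the actual feature vector.

```python
import numpy as np

# Leading components quoted in the example; the rest are elided in the source.
h_s = np.array([0.24, 0.56, 0.46, 0.89])   # structural semantic vector
h_f = np.array([0.61, 0.73, 0.82, 0.53])   # function call information vector
h_c = np.array([0.86, 0.62, 0.45, 0.72])   # built-in class call information vector

# Order matters: structure first, then functions, then built-in classes.
h_des = np.concatenate([h_s, h_f, h_c])
```

For two codes to be comparable, each part must have the same length and term order in both feature vectors, which is why a shared vocabulary is needed for the TF-IDF parts.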
Comparison of the retrieved reusable code $code_2$ with the newly designed code $code_1$:
The spliced result gives the technical feature vector $h_{des}$ of the newly designed code $code_1$; the technical feature vector $h_{sea}$ of the reusable code $code_2$ is obtained in the same way. The cosine similarity of the two technical feature vectors is taken as the code similarity: $sim_\alpha = \cos(h_{sea}, h_{des}) = 0.75$.

Claims (10)

1. A technical-feature-oriented source code comparison method, characterized by comprising:
a code file preprocessing stage, which outputs the function call structure, the function names and the built-in class names of the code;
a function call structure semantic encoding stage, in which an autoencoder method based on a graph convolutional neural network semantically encodes the function call structure to obtain a function call structure vector;
an encoding stage for the call information of the function names and the built-in class names, in which the TF-IDF algorithm is applied to obtain a function vector and a built-in class vector respectively;
and finally, splicing the function call structure vector, the function vector and the built-in class vector into an overall technical feature vector, which is used to compare code similarity.
2. The method as claimed in claim 1, wherein the code file preprocessing stage comprises:
obtaining, with a function call structure generation tool, a DOT file of the function call structure among the technical features of the code, wherein the DOT file describes the nodes in the graph representation of the function call structure and the relations among those nodes; and preprocessing the DOT file to obtain the graph representation $G$ of the function call structure:
$$G = (V, A_V, E, A_E) \tag{I}$$
In formula (I), $V$ denotes the set of function names; $A_V$ denotes the set of function attributes; $E$ denotes the set of function call edges; $A_E$ denotes the set of edge attributes. From the graph representation $G$ of the function call structure, the adjacency matrix $A$ of that graph representation is further obtained.
3. The method of claim 1, wherein the function call structure semantic encoding stage comprises:
after the code file preprocessing yields the graph representation $G$ of the function call structure, the function names and the built-in class names, semantically encoding the graph representation $G$ with an autoencoder method based on a graph convolutional neural network (GCN), wherein the structural semantic vector $h_s$ produced by the encoder contains the function attributes and the function call edge information, and is calculated as shown in formulas (II) to (IV):
$$h_s = \mathrm{GCN}(A_V, A) \tag{II}$$
$$\mathrm{GCN}(A_V, A) = \tilde{A}\,\mathrm{ReLU}(\tilde{A} A_V W_0)\,W_1 \tag{III}$$
$$\tilde{A} = D^{-1/2} A D^{-1/2} \tag{IV}$$
In formulas (II) to (IV), $h_s$ is the function call structure semantic vector; $A_V$ is the set of function attributes; $A$ is the adjacency matrix of the graph representation of the function call structure; $\tilde{A}$ is the symmetric normalized adjacency matrix;
$\mathrm{ReLU}$ is the activation function; $W_0$ and $W_1$ are the parameters to be learned, obtained by training the graph autoencoder, wherein the graph autoencoder comprises an encoder and a decoder, and the encoder is calculated as shown in formulas (II) to (IV);
$D$ is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices;
the graph autoencoder uses a sigmoid function as the decoder to reconstruct the original graph, and the decoder is calculated as shown in formula (V):
$$\hat{A} = \sigma(h_s h_s^{T}) \tag{V}$$
In formula (V), $\hat{A}$ is the reconstructed adjacency matrix of the original graph; $\sigma$ is the sigmoid function.
4. The method as claimed in claim 3, wherein the cross entropy between the adjacency matrix of the reconstructed graph and the adjacency matrix of the original graph is used as the loss function, calculated as shown in formula (VI):
$$L = -\frac{1}{N}\sum \left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right] \tag{VI}$$
In formula (VI), $N$ denotes the size of the set of function call edges; $y$ denotes an element of the adjacency matrix $A$ of the original graph; $\hat{y}$ denotes the corresponding element of the reconstructed adjacency matrix $\hat{A}$.
5. The method of claim 1, wherein encoding the call information of the function names and the built-in class names comprises: obtaining a function call information vector with the TF-IDF calculation module for function names, wherein the function vector $h_f$ is calculated as shown in formula (VII):
$$h_f = [\, f(fun_1), f(fun_2), \ldots, f(fun_n) \,] \tag{VII}$$
In formula (VII), $h_f$ denotes the function vector; $f(fun_i)$ denotes the TF-IDF value of the $i$-th function called in the code;
obtaining a built-in class call information vector with the TF-IDF calculation module for built-in classes, wherein the built-in class vector $h_c$ is calculated as shown in formula (VIII):
$$h_c = [\, f(cls_1), f(cls_2), \ldots, f(cls_m) \,] \tag{VIII}$$
In formula (VIII), $h_c$ denotes the built-in class vector; $f(cls_i)$ denotes the TF-IDF value of the $i$-th built-in class imported in the code.
6. The method of claim 1, wherein the function call structure vector, the function vector and the built-in class vector are spliced into the technical feature vector $h$, calculated as shown in formula (IX):
$$h = h_s \oplus h_f \oplus h_c \tag{IX}$$
In formula (IX), $h$ denotes the technical feature vector and $\oplus$ is the vector concatenation operator;
finally, the cosine similarity $sim_\alpha$ of the technical feature vectors is taken as the similarity between the retrieved reusable code and the newly designed code, calculated as shown in formula (X):
$$sim_\alpha = \cos(h_{sea}, h_{des}) \tag{X}$$
In formula (X), $sim_\alpha$ is the similarity between the retrieved reusable code and the newly designed code; $\cos$ is the cosine similarity function; $h_{sea}$ is the technical feature vector of the retrieved reusable code; $h_{des}$ is the technical feature vector of the newly designed code.
7. A system loaded with the method of any of claims 1-6, comprising:
a processing module for the code file preprocessing stage; a processing module for the function call structure semantic encoding stage; and a processing module for the stage that encodes the call information of function names and built-in class names; wherein the structure vector, the function vector and the built-in class vector are finally spliced into an overall technical feature vector, and the codes are compared on the basis of that technical feature vector.
8. A program product loaded with the method of any one of claims 1-6, wherein the program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the above-described method.
9. A computer-readable storage medium loaded with the method of any one of claims 1-6, having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of any of the methods recited in the present invention.
10. An application method using the method of any one of claims 1 to 6, comprising: performing graph semantic encoding with an autoencoder method based on a graph convolutional neural network, and comparing code structure semantics on the basis of the resulting semantic vectors; comparing function similarity and built-in class similarity on the basis of the call information vectors of the function names and built-in class names; and finally, splicing the structure vector, the function vector and the built-in class vector into an overall technical feature vector, which is used to compare code similarity.
CN202210808406.7A 2022-07-11 2022-07-11 Technical feature oriented source code comparison method, system and program product Active CN114880023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210808406.7A CN114880023B (en) 2022-07-11 2022-07-11 Technical feature oriented source code comparison method, system and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210808406.7A CN114880023B (en) 2022-07-11 2022-07-11 Technical feature oriented source code comparison method, system and program product

Publications (2)

Publication Number Publication Date
CN114880023A true CN114880023A (en) 2022-08-09
CN114880023B CN114880023B (en) 2022-09-30

Family

ID=82683640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210808406.7A Active CN114880023B (en) 2022-07-11 2022-07-11 Technical feature oriented source code comparison method, system and program product

Country Status (1)

Country Link
CN (1) CN114880023B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976318A (en) * 2010-11-15 2011-02-16 北京理工大学 Detection method of code similarity based on digital fingerprints
CN110147235A (en) * 2019-03-29 2019-08-20 中国科学院信息工程研究所 Semantic comparison method and device between a kind of source code and binary code
CN110502361A (en) * 2019-08-29 2019-11-26 扬州大学 Fine granularity defect positioning method towards bug report
CN111897970A (en) * 2020-07-27 2020-11-06 平安科技(深圳)有限公司 Text comparison method, device and equipment based on knowledge graph and storage medium
CN112286575A (en) * 2020-10-20 2021-01-29 杭州云象网络技术有限公司 Intelligent contract similarity detection method and system based on graph matching model
CN112800172A (en) * 2021-02-07 2021-05-14 重庆大学 Code searching method based on two-stage attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yuqing Sun et al.: "Constraint-based Authorization Management for Mobile Collaboration Services", 2009 Congress on Services *

Also Published As

Publication number Publication date
CN114880023B (en) 2022-09-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant