CN114880023B - Technical feature oriented source code comparison method, system and program product

Publication number
CN114880023B
Authority
CN
China
Prior art keywords
function
vector
built
code
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210808406.7A
Other languages
Chinese (zh)
Other versions
CN114880023A (en
Inventor
龚斌
宁祥东
孙宇清
万林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202210808406.7A
Publication of CN114880023A
Application granted
Publication of CN114880023B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis
    • G06F8/436 Semantic checking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A technical-feature-oriented source code comparison method, system and program product, belonging to the technical field of natural language processing. The invention comprises the following: a semantic encoding method based on the function call structure is used, and code similarity is analyzed from the aspects of the function call structure, function names and built-in classes; graph semantic encoding is performed with an autoencoder based on a graph convolutional neural network, and code structure semantics are compared on the basis of the resulting semantic vectors; function similarity and built-in class similarity are compared on the basis of the call information vectors of the function names and built-in class names; finally, the structure vector, the function vector and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity. The invention jointly considers technical feature information such as function names, call structure and built-in classes, and can therefore compare codes more effectively according to their technical features.

Description

Technical feature oriented source code comparison method, system and program product
Technical Field
The invention discloses a technical-feature-oriented source code comparison method, system and program product, and belongs to the technical field of natural language processing.
Background
Open source platforms provide researchers with an environment for sharing and exchanging code. More and more deep learning models and code are shared on these platforms, creating an ecosystem in which code can be reused, so researchers need to search for existing solutions to specific problems. Modern algorithm design builds code in a modular way: the code usually contains a large number of basic utility functions, and information such as function names, call structures and built-in classes provides important technical features of the code. The same problem can have various solutions. In text classification, for example, different neural network structures may be used (convolutional neural networks, recurrent neural networks, attention mechanisms and so on), and their performance also varies. It is therefore necessary to further analyze aspects such as code structure and execution efficiency, and to compare newly designed code with reusable code.
Code similarity is an important indicator in code comparison analysis. Current mainstream code similarity calculation methods fall into three categories: methods based on the call structure, methods based on static features, and methods based on binary code.
Methods based on the call structure calculate the similarity between codes mainly from the logical structure of the code, which generally includes the abstract syntax tree, the program call graph and the like. For example, the invention patent with publication number CN110737469A discloses a function-granularity source code similarity evaluation method based on semantic information, which takes the structure and node information of the control flow graph into account by computing embedding vectors for the identifiers of each function and for its control flow graph, but does not consider the semantic information of technical features such as the built-in classes imported by the code.
Methods based on static features extract a number of metric values from the source code to form feature vectors, and then take the similarity between the feature vectors as the similarity of the codes. For example, the invention patent with publication number CN111290784A discloses a program source code similarity detection method suitable for large-scale samples, which computes a locality-sensitive hash value from the text feature sequence and the feature weight sequence of each sample to be detected and uses that value as the sample feature vector.
Methods based on binary code generally disassemble the binary code to obtain the instruction sequence of each function, vectorize the instruction features, and finally calculate code similarity from the feature vectors. For example, Chinese patent publication No. CN113554101A discloses a binary code similarity detection method based on deep learning, which uses Structure2Vec to generate a graph embedding of the control flow graph of a binary function and introduces a CNN to process the sequential structure information between the basic blocks of the control flow graph, thereby better capturing the order of the blocks within the function.
To address the above problems, the invention jointly considers the semantic information of technical features. It uses a semantic encoding method based on the function call structure and analyzes code similarity from the aspects of the function call structure, function names and built-in classes; it performs graph semantic encoding with an autoencoder based on a graph convolutional neural network and compares code structure semantics on the basis of the resulting semantic vectors; it compares function similarity and built-in class similarity on the basis of the call information vectors of the function names and built-in class names; and finally it concatenates the structure vector, the function vector and the built-in class vector into an overall technical feature vector, which is used to compare code similarity.
Disclosure of Invention
In view of the technical problems in the prior art, the invention discloses a technical-feature-oriented source code comparison method.
The invention also discloses a system implementing the comparison method.
The invention also discloses a program product loaded with the method.
The invention also discloses a computer-readable storage medium loaded with the method.
The invention also discloses an application method that uses the method.
Summary of the invention:
The invention relates to a technical-feature-oriented source code comparison method. Modern algorithm design builds code in a modular way: the code usually contains a large number of basic utility functions, and information such as function names, call structures and built-in classes provides important technical features of the code. The aim of the method is to jointly consider the semantic information of these technical features and to analyze aspects such as code structure and execution efficiency from the perspective of the function call structure, function names and built-in classes.
Interpretation of terms:
1. Technical features: keywords that describe the functionality a code implements and the techniques it uses; the function call structure; function names; built-in class names; loss functions; and so on.
2. Function call structure similarity: object-oriented programming has the three characteristics of encapsulation, inheritance and polymorphism, and most code follows a modular design, that is, it is designed with functions as units, and functions are what make the code modular. If the functionality implemented by two codes is similar, the function call structures among their technical features are similar.
3. Function similarity: a specific piece of functionality is usually implemented with functions that have a well-defined purpose, entry call parameters and return values. If the functionality implemented by two codes is similar, the function names among their technical features are similar.
4. Built-in class similarity: functions are generally encapsulated in built-in classes. If the functionality implemented by two codes is similar, the built-in classes imported among their technical features are also similar.
The detailed technical scheme of the invention is as follows:
A technical-feature-oriented source code comparison method, characterized by comprising the following steps:
a code file preprocessing stage, which outputs the function call structure, the function names and the built-in class names of the code;
a function call structure semantic encoding stage: the function call structure is semantically encoded with an autoencoder based on a graph convolutional neural network to obtain a function call structure vector, which is used to compare code structure semantics on the basis of semantic vectors;
a stage in which the call information of the function names and built-in class names is encoded with the TF-IDF algorithm to obtain a function vector and a built-in class vector respectively, which are used to compare function similarity and built-in class similarity on the basis of the call information vectors of the function names and built-in class names;
finally, the function call structure vector, the function vector and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity.
Preferably, according to the present invention, the code file preprocessing stage includes:
A function call structure generation tool is used to obtain a DOT file of the function call structure among the technical features of the code; the DOT file describes the nodes in the graph representation of the function call structure and the relationships between them. For example, the Pycallgraph tool can generate a graph representation of the function call structure of a Python application, and the Graphviz tool can convert a DOT file into a picture of the function call structure. The DOT file is preprocessed to obtain the graph representation $G$ of the function call structure:
$$G = (V, A_V, E, A_E) \qquad \text{(I)}$$
In formula (I), $V$ is the set of function names; $A_V$ is the set of function attributes; $E$ is the set of function call edges; $A_E$ is the set of edge attributes. From the graph representation $G$ of the function call structure, the adjacency matrix $A$ of the graph is then obtained.
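The preprocessing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool chain used by the invention: the DOT text, the regular expression and the helper names `parse_dot_edges` and `build_adjacency` are hypothetical, only simple `"a" -> "b"` edge statements are handled, and edges are treated as directed.

```python
import re

# Hypothetical DOT text, similar in shape to what a call-graph tool might emit.
DOT_TEXT = '''
digraph G {
    "main" -> "nextInt";
    "main" -> "for";
    "for" -> "print";
}
'''

# Matches simple edge statements such as "caller" -> "callee" (quotes optional).
EDGE_RE = re.compile(r'"?([\w.]+)"?\s*->\s*"?([\w.]+)"?')


def parse_dot_edges(dot_text):
    """Return the directed (caller, callee) pairs found in a DOT string."""
    return EDGE_RE.findall(dot_text)


def build_adjacency(edges):
    """Return the node list V (first-seen order) and a 0/1 adjacency matrix A."""
    nodes = []
    for caller, callee in edges:
        for name in (caller, callee):
            if name not in nodes:
                nodes.append(name)
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for caller, callee in edges:
        matrix[index[caller]][index[callee]] = 1  # directed edge caller -> callee
    return nodes, matrix


edges = parse_dot_edges(DOT_TEXT)
V, A = build_adjacency(edges)
```

With the sample text above, `V` is `['main', 'nextInt', 'for', 'print']` and row 0 of `A` marks the calls from `main` to `nextInt` and `for`. If an undirected view of the call graph is wanted, the matrix can be symmetrized before normalization.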
Preferably, according to the present invention, the function call structure semantic encoding stage includes:
Preprocessing the code file yields the graph representation $G$ of the function call structure, the function names and the built-in class names. An autoencoder based on a graph convolutional neural network (GCN) is used to semantically encode the graph representation $G$ of the function call structure. The structure semantic vector $h_s$ contains the function attributes and the information about which function call edges exist. The encoder is computed as shown in formulas (II) to (IV):
$$h_s = \mathrm{GCN}(A_V, A) \qquad \text{(II)}$$
$$\mathrm{GCN}(A_V, A) = \tilde{A}\,\mathrm{ReLU}(\tilde{A} A_V W_0)\,W_1 \qquad \text{(III)}$$
$$\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \qquad \text{(IV)}$$
In formulas (II), (III) and (IV), $h_s$ is the function call structure semantic vector; $A_V$ is the set of function attributes; $A$ is the adjacency matrix of the graph representation of the function call structure; $\tilde{A}$ is the symmetrically normalized adjacency matrix;
ReLU is the activation function; $W_0$ and $W_1$ are the parameters to be learned, obtained by training the graph autoencoder, which comprises an encoder and a decoder, the encoder being computed as shown in formulas (II) to (IV);
$D$ is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices;
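The encoder can be illustrated with a small pure-Python sketch. It assumes that formulas (II) to (IV) take the usual two-layer graph autoencoder form, $\tilde{A}\,\mathrm{ReLU}(\tilde{A} A_V W_0) W_1$ with $\tilde{A} = D^{-1/2} A D^{-1/2}$; the helper names, the undirected 3-node graph and the identity weights are hypothetical placeholders, since real values of $W_0$ and $W_1$ would come from training.

```python
import math


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def relu(X):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in X]


def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; zero-degree vertices stay zero."""
    deg = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    n = len(A)
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def gcn_encode(A, A_V, W0, W1):
    """Two-layer GCN encoder: A_norm * ReLU(A_norm * A_V * W0) * W1."""
    A_norm = normalize_adjacency(A)
    hidden = relu(matmul(matmul(A_norm, A_V), W0))
    return matmul(matmul(A_norm, hidden), W1)


# Hypothetical 3-node undirected call graph and 2-dimensional node attributes.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_V = [[0.32, 0.41], [0.85, 0.20], [0.17, 0.66]]
# Untrained placeholder weights (identity); training would learn these.
W0 = [[1.0, 0.0], [0.0, 1.0]]
W1 = [[1.0, 0.0], [0.0, 1.0]]

h_s = gcn_encode(A, A_V, W0, W1)  # one embedding row per function node
```

Each row of `h_s` mixes a node's attributes with those of its neighbors twice, which is how the call structure enters the embedding.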
The graph autoencoder uses a sigmoid function as the decoder to reconstruct the original graph. The decoder is computed as shown in formula (V):
$$\hat{A} = \sigma(h_s h_s^{\top}) \qquad \text{(V)}$$
In formula (V), $\hat{A}$ is the reconstructed adjacency matrix of the original graph; $\sigma$ is the sigmoid function.
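A minimal sketch of such a decoder is given below, assuming the common inner-product form in which each reconstructed entry is the sigmoid of the dot product of two node embeddings. The function names and the sample embeddings are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode(h_s):
    """Reconstruct the adjacency matrix: A_hat[i][j] = sigmoid(<h_s[i], h_s[j]>)."""
    return [[sigmoid(sum(a * b for a, b in zip(u, v))) for v in h_s] for u in h_s]


# Hypothetical node embeddings, one row per function node.
h_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A_hat = decode(h_s)
```

Nodes with orthogonal embeddings get a reconstructed edge probability of 0.5, while nodes with aligned embeddings get a value closer to 1.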
Preferably, according to the invention, so that the structure semantic vector contains rich node and edge information and the reconstructed adjacency matrix is as similar as possible to the original adjacency matrix, the cross entropy between the reconstructed adjacency matrix and the adjacency matrix of the original graph is used as the loss function. The loss function is computed as shown in formula (VI):
$$L = -\frac{1}{N}\sum \left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right] \qquad \text{(VI)}$$
In formula (VI), $N$ is the size of the set of function call edges; $y$ is an element of the adjacency matrix $A$ of the original graph; $\hat{y}$ is the corresponding element of the reconstructed adjacency matrix $\hat{A}$.
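The cross-entropy reconstruction loss can be sketched as follows. This is an illustration only: the function name is hypothetical, the clipping constant `eps` is added for numerical safety, and the sketch averages over all matrix entries, which is an assumption and may differ from the normalizer $N$ used in formula (VI).

```python
import math


def reconstruction_loss(A, A_hat, eps=1e-12):
    """Mean binary cross entropy between original and reconstructed adjacency."""
    total, count = 0.0, 0
    for row, row_hat in zip(A, A_hat):
        for y, y_hat in zip(row, row_hat):
            y_hat = min(max(y_hat, eps), 1.0 - eps)  # avoid log(0)
            total += y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
            count += 1
    return -total / count


A = [[0, 1], [1, 0]]
uninformed = [[0.5, 0.5], [0.5, 0.5]]  # decoder that knows nothing
perfect = [[0.0, 1.0], [1.0, 0.0]]     # decoder that reproduces A exactly

loss_uninformed = reconstruction_loss(A, uninformed)
loss_perfect = reconstruction_loss(A, perfect)
```

An uninformed decoder yields a loss of $\ln 2$ per entry, and the loss approaches zero as the reconstruction approaches the original matrix, which is the behavior the training objective relies on.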
Preferably, according to the invention, the stage that encodes the call information of the function names and built-in class names comprises the following. In the function similarity part, function similarity is compared on the basis of the call information vector of the function names. Algorithm code contains a large number of functions that provide basic functionality, and these functions have a well-defined purpose, entry call parameters and return values, so codes with similar functionality also use similar functions. The function call information vector is obtained with the TF-IDF calculation module for function names, and the function vector $h_f$ is computed as shown in formula (VII):
$$h_f = [\, f(fun_1), f(fun_2), \ldots, f(fun_n) \,] \qquad \text{(VII)}$$
In formula (VII), $h_f$ is the function vector; $f(fun_i)$ is the TF-IDF value of the $i$-th function called in the code;
In the built-in class similarity part, built-in class similarity is compared on the basis of the call information vector of the built-in class names. Functions are generally encapsulated in built-in classes, so if the functionality implemented by two codes is similar, the built-in classes they import are also similar. The built-in class call information vector is obtained with the TF-IDF calculation module for built-in classes, and the built-in class vector $h_c$ is computed as shown in formula (VIII):
$$h_c = [\, f(cls_1), f(cls_2), \ldots, f(cls_m) \,] \qquad \text{(VIII)}$$
In formula (VIII), $h_c$ is the built-in class vector; $f(cls_i)$ is the TF-IDF value of the $i$-th built-in class imported in the code.
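The TF-IDF encoding of function names and built-in class names can be sketched in pure Python. TF-IDF has several variants; this sketch assumes term frequency as a relative count and inverse document frequency as $\ln(N_{docs}/df)$, which may differ from the variant the invention uses. The call lists, class lists and the name `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(names, corpus):
    """Return the distinct names (first-seen order) and their TF-IDF values."""
    counts = Counter(names)
    terms = list(dict.fromkeys(names))  # distinct names, order preserved
    n_docs = len(corpus)
    vector = []
    for term in terms:
        tf = counts[term] / len(names)
        df = sum(1 for doc in corpus if term in doc)  # documents containing term
        idf = math.log(n_docs / df) if df else 0.0
        vector.append(tf * idf)
    return terms, vector


# Hypothetical call information for two codes.
calls_1 = ["main", "nextInt", "print", "print"]
calls_2 = ["main", "readLine", "print"]
classes_1 = ["java.io.BufferedReader", "java.util.HashMap"]
classes_2 = ["java.util.HashMap"]

fun_terms, h_f = tfidf_vector(calls_1, [calls_1, calls_2])
cls_terms, h_c = tfidf_vector(classes_1, [classes_1, classes_2])
```

Names shared by every code in the corpus (here `main`, `print` and `java.util.HashMap`) receive a weight of zero, so the vectors emphasize the functions and classes that distinguish one code from another.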
Preferably, according to the invention, the function call structure vector, the function vector and the built-in class vector are concatenated into the technical feature vector $h$, which is computed as shown in formula (IX):
$$h = h_s \oplus h_f \oplus h_c \qquad \text{(IX)}$$
In formula (IX), $h$ is the technical feature vector and $\oplus$ is the vector concatenation operator. Because the lengths of the function call structure vector, the function vector and the built-in class vector all depend on the number of functions called in the code and the number of built-in classes it imports, each of the three vectors is linearly transformed into a vector of fixed dimension;
Finally, the cosine similarity $sim_\alpha$ of the technical feature vectors is computed as the similarity between the retrieved reusable code and the newly designed code. The code similarity is computed as shown in formula (X):
$$sim_\alpha = \cos(h_{sea}, h_{des}) \qquad \text{(X)}$$
In formula (X), $sim_\alpha$ is the similarity value between the retrieved reusable code and the newly designed code; $\cos$ is the cosine similarity function; $h_{sea}$ is the technical feature vector of the retrieved reusable code; $h_{des}$ is the technical feature vector of the newly designed code.
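The final comparison step can be sketched as follows. The sketch assumes that the three component vectors have already been mapped to fixed lengths by the linear transformation mentioned above, so that both codes produce feature vectors of the same dimension; all vector values and helper names are hypothetical.

```python
import math


def concat(h_s, h_f, h_c):
    """Concatenate structure, function and built-in class vectors into h."""
    return list(h_s) + list(h_f) + list(h_c)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical fixed-length component vectors for the two codes.
h_sea = concat([0.2, 0.5], [0.0, 0.17], [0.35])  # retrieved reusable code
h_des = concat([0.2, 0.5], [0.0, 0.17], [0.35])  # newly designed code
h_other = concat([0.0, 0.0], [0.9, 0.0], [0.0])  # code sharing no features

sim_same = cosine_similarity(h_sea, h_des)
sim_other = cosine_similarity(h_sea, h_other)
```

Identical feature vectors give a similarity of 1, and vectors with no overlapping non-zero components give 0, so higher values indicate codes whose call structure, functions and built-in classes are closer.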
A system loaded with the above method, comprising:
a processing module for the code file preprocessing stage; a processing module for the function call structure semantic encoding stage; a processing module for the stage that encodes the call information of function names and built-in class names; a module that concatenates the structure vector, the function vector and the built-in class vector into the overall technical feature vector; and a code comparison function based on the technical feature vector.
A program product loaded with the above method, comprising: a computer program product that is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the above method.
A computer-readable storage medium loaded with the above method, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of any of the methods described herein.
The invention discloses an application method using the above method, which comprises: performing graph semantic encoding with an autoencoder based on a graph convolutional neural network, and comparing code structure semantics on the basis of the semantic vectors; comparing function similarity and built-in class similarity on the basis of the call information vectors of the function names and built-in class names; and finally concatenating the structure vector, the function vector and the built-in class vector into an overall technical feature vector and comparing code similarity. The method does not require a labeled corpus; it extracts the correlation and difference information among codes and compares the codes according to their technical features.
The invention has the technical effects that:
1. Compared with traditional methods, the code comparison method does not require a labeled corpus, jointly considers the semantic information of technical features such as function names, call structures and built-in classes, and can effectively extract the correlation and difference information among codes.
2. Compared with traditional methods, the method performs graph semantic encoding in the function call structure semantic encoding stage with an autoencoder based on a graph convolutional neural network, and can thus capture the semantic information of the function call structure.
3. Compared with traditional methods, the method compares function similarity and built-in class similarity on the basis of the call information vectors of the function names and built-in class names, which preserves the logical call information of the code and further improves the precision of the method.
Drawings
FIG. 1 is a flow chart of a source code comparison method for technical features of the present invention;
FIG. 2 is a diagram of the code comparison model framework oriented to technical features of the present invention.
Detailed Description
The invention is described in detail below with reference to the embodiments and the drawings of the specification, but is not limited thereto.
Example 1
As shown in FIG. 1 and FIG. 2, a technical-feature-oriented source code comparison method comprises:
a code file preprocessing stage, which outputs the function call structure, the function names and the built-in class names of the code;
a function call structure semantic encoding stage: the function call structure is semantically encoded with an autoencoder based on a graph convolutional neural network to obtain a function call structure vector, which is used to compare code structure semantics on the basis of semantic vectors; preferably, the graph semantic encoding is implemented according to the following document: William L. Hamilton, Rex Ying, Jure Leskovec. Inductive representation learning on large graphs [C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 1025-1035;
a stage in which the call information of the function names and built-in class names is encoded with the TF-IDF algorithm to obtain a function vector and a built-in class vector respectively, which are used to compare function similarity and built-in class similarity on the basis of the call information vectors of the function names and built-in class names;
finally, the function call structure vector, the function vector and the built-in class vector are concatenated into an overall technical feature vector, which is used to compare code similarity.
The code file preprocessing stage comprises:
A function call structure generation tool is used to obtain a DOT file of the function call structure among the technical features of the code; the DOT file describes the nodes in the graph representation of the function call structure and the relationships between them. For example, the Pycallgraph tool can generate a graph representation of the function call structure of a Python application, and the Graphviz tool can convert a DOT file into a picture of the function call structure. The DOT file is preprocessed to obtain the graph representation $G$ of the function call structure:
$$G = (V, A_V, E, A_E) \qquad \text{(I)}$$
In formula (I), $V$ is the set of function names; $A_V$ is the set of function attributes; $E$ is the set of function call edges; $A_E$ is the set of edge attributes. From the graph representation $G$ of the function call structure, the adjacency matrix $A$ of the graph is then obtained.
The semantic coding stage of the function call structure comprises the following steps:
Preprocessing the code file yields the graph representation $G$ of the function call structure, the function names and the built-in class names. An autoencoder based on a graph convolutional neural network (GCN) is used to semantically encode the graph representation $G$ of the function call structure. The structure semantic vector $h_s$ contains the function attributes and the information about which function call edges exist. The encoder is computed as shown in formulas (II) to (IV):
$$h_s = \mathrm{GCN}(A_V, A) \qquad \text{(II)}$$
$$\mathrm{GCN}(A_V, A) = \tilde{A}\,\mathrm{ReLU}(\tilde{A} A_V W_0)\,W_1 \qquad \text{(III)}$$
$$\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \qquad \text{(IV)}$$
In formulas (II), (III) and (IV), $h_s$ is the function call structure semantic vector; $A_V$ is the set of function attributes; $A$ is the adjacency matrix of the graph representation of the function call structure; $\tilde{A}$ is the symmetrically normalized adjacency matrix;
ReLU is the activation function; $W_0$ and $W_1$ are the parameters to be learned, obtained by training the graph autoencoder, which comprises an encoder and a decoder, the encoder being computed as shown in formulas (II) to (IV);
$D$ is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices;
The graph autoencoder uses a sigmoid function as the decoder to reconstruct the original graph. The decoder is computed as shown in formula (V):
$$\hat{A} = \sigma(h_s h_s^{\top}) \qquad \text{(V)}$$
In formula (V), $\hat{A}$ is the reconstructed adjacency matrix of the original graph; $\sigma$ is the sigmoid function.
Example 2
So that the structure semantic vector contains rich node and edge information and the reconstructed adjacency matrix is as similar as possible to the original adjacency matrix, the cross entropy between the reconstructed adjacency matrix and the adjacency matrix of the original graph is used as the loss function. The loss function is computed as shown in formula (VI):
$$L = -\frac{1}{N}\sum \left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right] \qquad \text{(VI)}$$
In formula (VI), $N$ is the size of the set of function call edges; $y$ is an element of the adjacency matrix $A$ of the original graph; $\hat{y}$ is the corresponding element of the reconstructed adjacency matrix $\hat{A}$.
Example 3
In the technical-feature-oriented source code comparison method described in Example 1,
the stage that encodes the call information of the function names and built-in class names comprises the following. In the function similarity part, function similarity is compared on the basis of the call information vector of the function names. Algorithm code contains a large number of functions that provide basic functionality, and these functions have a well-defined purpose, entry call parameters and return values, so codes with similar functionality also use similar functions. The function call information vector is obtained with the TF-IDF calculation module for function names, and the function vector $h_f$ is computed as shown in formula (VII):
$$h_f = [\, f(fun_1), f(fun_2), \ldots, f(fun_n) \,] \qquad \text{(VII)}$$
In formula (VII), $h_f$ is the function vector; $f(fun_i)$ is the TF-IDF value of the $i$-th function called in the code;
In the built-in class similarity part, built-in class similarity is compared on the basis of the call information vector of the built-in class names. Functions are generally encapsulated in built-in classes, so if the functionality implemented by two codes is similar, the built-in classes they import are also similar. The built-in class call information vector is obtained with the TF-IDF calculation module for built-in classes, and the built-in class vector $h_c$ is computed as shown in formula (VIII):
$$h_c = [\, f(cls_1), f(cls_2), \ldots, f(cls_m) \,] \qquad \text{(VIII)}$$
In formula (VIII), $h_c$ is the built-in class vector; $f(cls_i)$ is the TF-IDF value of the $i$-th built-in class imported in the code.
Example 4
In the technical-feature-oriented source code comparison method described in Example 1, the function call structure vector, the function vector and the built-in class vector are concatenated into the technical feature vector $h$, which is computed as shown in formula (IX):
$$h = h_s \oplus h_f \oplus h_c \qquad \text{(IX)}$$
In formula (IX), $h$ is the technical feature vector and $\oplus$ is the vector concatenation operator. Because the lengths of the function call structure vector, the function vector and the built-in class vector all depend on the number of functions called in the code and the number of built-in classes it imports, each of the three vectors is linearly transformed into a vector of fixed dimension;
Finally, the cosine similarity $sim_\alpha$ of the technical feature vectors is computed as the similarity between the retrieved reusable code and the newly designed code. The code similarity is computed as shown in formula (X):
$$sim_\alpha = \cos(h_{sea}, h_{des}) \qquad \text{(X)}$$
In formula (X), $sim_\alpha$ is the similarity value between the retrieved reusable code and the newly designed code; $\cos$ is the cosine similarity function; $h_{sea}$ is the technical feature vector of the retrieved reusable code; $h_{des}$ is the technical feature vector of the newly designed code.
Example 5
A system loaded with the method of Examples 1-4, comprising:
a processing module for the code file preprocessing stage; a processing module for the function call structure semantic encoding stage; a processing module for the stage that encodes the call information of function names and built-in class names; a module that concatenates the structure vector, the function vector and the built-in class vector into the overall technical feature vector; and a code comparison function based on the technical feature vector.
Example 6
A program product loaded with the method of Examples 1-4, comprising: a computer program product that is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the above method.
Example 7
A computer-readable storage medium loaded with the method according to embodiments 1-4, having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of any of the methods described herein.
Example 8
An application method using the method described in examples 1-4: performing graph semantic encoding with an autoencoder method based on a graph convolutional neural network, and comparing code structure semantics based on the semantic vectors; comparing function similarity and built-in class similarity based on the call information vectors of the function names and built-in class names; and finally, concatenating the structure vector, the function vector and the built-in class vector into an overall technical feature vector and comparing code similarity.
Using the methods described in examples 1-5 and 8, and taking the "syntactic analyzed" code as an example, the retrieved reusable code and the newly designed code are processed as follows:
First, the code files are input:
the newly designed code is $code_1$, and the retrieved reusable code is $code_2$.
For $code_1$, file preprocessing yields the graph representation $G$ of the function call structure, the function names and the built-in class names, where
$G = (V, A_V, E, A_E)$
Function name set $V$ = {main, nextInt, for, print, while, if}
Function attribute set $A_V$ = {[0.32, 0.41], [0.85, 0.20], [0.17, 0.66], …}
Function call edge set $E$ = {(main, nextInt), (main, for), (for, print), …}
Edge attribute set $A_E$ = {1, 1, 1, …}
The built-in class name set is {java.io.BufferedReader, java.util.HashMap, java.io.IOException, …}
From these, the adjacency matrix $A$ of the function call graph is obtained (the matrix figure is not reproduced in this text).
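As the matrix figure is missing, the step from the edge set to an adjacency matrix can be illustrated with a small sketch. It uses only the function names and call edges that are actually listed above (the source elides the rest with "…"), so the result is not the patent's matrix. Treating calls as undirected edges is an assumption made here because the encoder later uses a symmetric normalized adjacency matrix; the helper name `adjacency_matrix` is hypothetical.

```python
# Function names and call edges listed in the worked example. Further entries
# are elided in the source, so this matrix is illustrative, not the original.
V = ["main", "nextInt", "for", "print", "while", "if"]
E = [("main", "nextInt"), ("main", "for"), ("for", "print")]

def adjacency_matrix(nodes, edges, symmetric=True):
    """Build a 0/1 adjacency matrix from caller/callee pairs.

    symmetric=True mirrors every edge (an assumption, see lead-in).
    """
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    a = [[0] * n for _ in range(n)]
    for caller, callee in edges:
        i, j = index[caller], index[callee]
        a[i][j] = 1
        if symmetric:
            a[j][i] = 1
    return a

A = adjacency_matrix(V, E)
for name, row in zip(V, A):
    print(f"{name:8s} {row}")
```

Rows for `while` and `if` stay all-zero here only because no edge involving them is listed in the text.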
Function call structure semantic encoding stage:
semantic encoding of the function call structure is performed using the autoencoder method based on a graph convolutional neural network (GCN), giving the structure semantic vector
$h_s$ = [0.24, 0.56, 0.46, 0.89, …]
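For orientation, the forward pass of a graph autoencoder of this kind can be sketched as follows. This follows the widely used two-layer GCN encoder with an inner-product sigmoid decoder and is not claimed to be the patent's exact implementation. The weights are random and untrained, no self-loops are added, the loss is averaged over all matrix entries, and only the three function attributes listed in the example are used; each of these is an assumption. How the per-node embeddings are reduced to the single vector $h_s$ is not stated in this section, so the sketch stops at the node embeddings.

```python
import math
import random

def matmul(x, y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def normalize_adjacency(a):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes stay zero."""
    inv_sqrt = [1.0 / math.sqrt(sum(row)) if sum(row) > 0 else 0.0 for row in a]
    n = len(a)
    return [[inv_sqrt[i] * a[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

def relu(x):
    return [[max(0.0, v) for v in row] for row in x]

def encode(a, features, w0, w1):
    """Two-layer GCN encoder: A_norm * ReLU(A_norm * X * W0) * W1."""
    a_norm = normalize_adjacency(a)
    hidden = relu(matmul(matmul(a_norm, features), w0))
    return matmul(matmul(a_norm, hidden), w1)

def decode(z):
    """Inner-product decoder: sigmoid(Z Z^T) as the reconstructed adjacency."""
    logits = matmul(z, transpose(z))
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]

def reconstruction_loss(a, a_hat, eps=1e-9):
    """Binary cross entropy, averaged over all entries (averaging is an assumption)."""
    n = len(a)
    total = 0.0
    for i in range(n):
        for j in range(n):
            y, p = a[i][j], a_hat[i][j]
            total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / (n * n)

# Sub-graph of main, nextInt and for, with the three attribute rows listed above.
A = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
X = [[0.32, 0.41], [0.85, 0.20], [0.17, 0.66]]

random.seed(0)  # untrained, random weights purely for illustration
w0 = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
w1 = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)]

z = encode(A, X, w0, w1)   # one 2-dimensional embedding per function node
a_hat = decode(z)          # reconstructed adjacency probabilities
loss = reconstruction_loss(A, a_hat)
print(z, loss)
```

In training, `w0` and `w1` would be updated to reduce `loss`; that optimization loop is omitted here.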
Call information encoding stage for function names and built-in class names:
the TF-IDF calculation module for function names yields the function call information vector $h_f$ = [0.61, 0.73, 0.82, 0.53, …]
the TF-IDF calculation module for built-in class names yields the built-in class call information vector $h_c$ = [0.86, 0.62, 0.45, 0.72, …]
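A TF-IDF vector over called function names can be sketched as below. The exact TF-IDF variant is not given in this section, so the smoothed form `log((1 + N) / (1 + df)) + 1` used here is an assumption, as are the helper name `tfidf_vector` and the three toy call lists. The same routine would apply unchanged to built-in class names.

```python
import math
from collections import Counter

def tfidf_vector(doc, corpus):
    """Return (name, tf-idf) pairs for each distinct name in doc, in first-seen order.

    tf  = occurrences of the name in doc / total names in doc
    idf = log((1 + N) / (1 + df)) + 1   (smoothed variant, an assumption)
    """
    counts = Counter(doc)
    n_docs = len(corpus)
    vector = []
    for name in dict.fromkeys(doc):  # de-duplicate while keeping order
        tf = counts[name] / len(doc)
        df = sum(1 for other in corpus if name in other)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        vector.append((name, tf * idf))
    return vector

# Hypothetical call lists for three code files; only code_1's names echo the example.
code_1 = ["main", "nextInt", "for", "print", "print", "while", "if"]
code_2 = ["main", "readLine", "for", "print"]
code_3 = ["main", "put", "get"]
corpus = [code_1, code_2, code_3]

h_f = tfidf_vector(code_1, corpus)
for name, weight in h_f:
    print(f"{name:8s} {weight:.4f}")
```

A name that occurs in every file, such as `main`, receives the lowest weight per occurrence, while a name unique to one file, such as `nextInt`, is weighted higher.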
Concatenation:
the structure semantic vector, the function vector and the built-in class vector are concatenated to form the technical feature vector $h_{des}$ = [0.24, 0.56, 0.46, 0.89, …, 0.61, 0.73, 0.82, 0.53, …, 0.86, 0.62, 0.45, 0.72, …]
Comparison of the retrieved reusable code $code_2$ with the newly designed code $code_1$:
the concatenation result gives the technical feature vector $h_{des}$ of the newly designed code $code_1$; the technical feature vector $h_{sea}$ of the reusable code $code_2$ is obtained in the same way, and the cosine similarity of the two technical feature vectors is taken as the code similarity: $sim_\alpha = \cos(h_{sea}, h_{des}) = 0.75$.

Claims (7)

1. A technical feature oriented source code comparison method, characterized by comprising:
a code file preprocessing stage, which outputs the function call structure, the function names and the built-in class names of the code;
a function call structure semantic encoding stage: semantically encoding the function call structure with an autoencoder method based on a graph convolutional neural network to obtain a function call structure vector;
a call information encoding stage for function names and built-in class names: encoding with the TF-IDF algorithm to obtain a function vector and a built-in class vector respectively;
finally, concatenating the function call structure vector, the function vector and the built-in class vector into an overall technical feature vector, which is used to compare code similarity;
the code file preprocessing stage comprises:
obtaining, with a function call structure generation tool, a DOT file of the function call structure in the technical features of the code, the DOT file describing the nodes in the graph representation of the function call structure and the relations among the nodes, and preprocessing the DOT file to obtain the graph representation $G$ of the function call structure, as shown in formula (I):
$$G = (V, A_V, E, A_E) \quad \text{(I)}$$
In formula (I), $V$ represents the set of function names; $A_V$ represents the set of function attributes; $E$ represents the set of function call edges; $A_E$ represents the set of edge attributes; from the graph representation $G$ of the function call structure, the adjacency matrix $A$ of the graph representation is further obtained;
The semantic coding stage of the function call structure comprises the following steps:
the code file is preprocessed to obtain the graph representation $G$ of the function call structure, the function names and the built-in class names; the graph representation $G$ of the function call structure is semantically encoded using the autoencoder method based on a graph convolutional neural network (GCN) to obtain the structure semantic vector $h_s$; the encoder takes in the function attributes and the function call edge information, and its calculation is shown in formulas (II) to (IV):
$$h_s = \mathrm{GCN}(A_V, A) \quad \text{(II)}$$
$$\mathrm{GCN}(A_V, A) = \tilde{A}\,\mathrm{ReLU}(\tilde{A} A_V W_0)\,W_1 \quad \text{(III)}$$
$$\tilde{A} = D^{-1/2} A D^{-1/2} \quad \text{(IV)}$$
In formulas (II), (III) and (IV), $h_s$ is the function call structure semantic vector; $A_V$ is the set of function attributes; $A$ is the adjacency matrix of the graph representation of the function call structure; $\tilde{A}$ is the symmetric normalized adjacency matrix;
$\mathrm{ReLU}$ is the activation function; $W_0$ and $W_1$ are parameters to be learned, obtained by training the graph autoencoder, wherein the graph autoencoder comprises an encoder and a decoder, and the calculation of the encoder is shown in formulas (II) to (IV);
$D$ is the degree matrix, a diagonal matrix whose diagonal elements are the degrees of the vertices;
the graph autoencoder uses a sigmoid function as the decoder to reconstruct the original graph, and the calculation of the decoder is shown in formula (V):
$$\hat{A} = \sigma(h_s h_s^{\top}) \quad \text{(V)}$$
In formula (V), $\hat{A}$ is the reconstructed adjacency matrix of the original graph; $\sigma$ is the sigmoid function;
the cross entropy between the reconstructed adjacency matrix and the adjacency matrix of the original graph is used as the loss function, and the calculation of the loss function is shown in formula (VI):
$$L = -\frac{1}{N} \sum \left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right] \quad \text{(VI)}$$
In formula (VI), $N$ represents the number of function call edge sets; $y$ represents an element of the adjacency matrix $A$ of the original graph; $\hat{y}$ represents the corresponding element of the reconstructed adjacency matrix $\hat{A}$ of the original graph.
2. The method of claim 1, wherein the call information encoding stage for function names and built-in class names comprises: obtaining the function call information vector from the TF-IDF calculation module for function names, the function vector $h_f$ being calculated as shown in formula (VII):
$$h_f = [f(fun_1), f(fun_2), \ldots, f(fun_i), \ldots] \quad \text{(VII)}$$
In formula (VII), $h_f$ represents the function vector; $f(fun_i)$ represents the TF-IDF value of the $i$-th function called in the code;
obtaining the built-in class call information vector from the TF-IDF calculation module for built-in class names, the built-in class vector $h_c$ being calculated as shown in formula (VIII):
$$h_c = [f(cls_1), f(cls_2), \ldots, f(cls_i), \ldots] \quad \text{(VIII)}$$
In formula (VIII), $h_c$ represents the built-in class vector; $f(cls_i)$ represents the TF-IDF value of the $i$-th built-in class imported in the code.
3. The method of claim 1, wherein the function call structure vector, the function vector and the built-in class vector are concatenated to form the technical feature vector $h$, calculated as shown in formula (IX):
$$h = h_s \oplus h_f \oplus h_c \quad \text{(IX)}$$
In formula (IX), $h$ represents the technical feature vector, and $\oplus$ is the vector concatenation operator;
finally, the cosine similarity $sim_\alpha$ of the technical feature vectors is calculated as the similarity between the retrieved reusable code and the newly designed code, as shown in formula (X):
$$sim_\alpha = \cos(h_{sea}, h_{des}) \quad \text{(X)}$$
In formula (X), $sim_\alpha$ is the similarity value of the retrieved reusable code and the newly designed code; $\cos$ is the cosine similarity function; $h_{sea}$ is the technical feature vector of the retrieved reusable code; $h_{des}$ is the technical feature vector of the newly designed code.
4. A system implementing the method of any of claims 1-3, comprising:
a code file preprocessing module; a function call structure semantic encoding module; a call information encoding module for function names and built-in class names; a concatenation module that joins the structure vector, the function vector and the built-in class vector into an overall technical feature vector; and a code comparison module based on the technical feature vector.
5. A program product implementing the method of any one of claims 1-3, wherein the program product is tangibly stored on a non-transitory computer-readable medium and includes machine-executable instructions for performing the method.
6. A computer-readable storage medium loaded with a method according to any of claims 1-3, having a computer program stored thereon, wherein the computer program, when being executed by a processor, is adapted to carry out the steps of the method.
7. An application method using the method of any one of claims 1 to 3, comprising: performing graph semantic encoding with an autoencoder method based on a graph convolutional neural network, and comparing code structure semantics based on the semantic vectors; comparing function similarity and built-in class similarity based on the call information vectors of the function names and built-in class names; and finally, concatenating the structure vector, the function vector and the built-in class vector into an overall technical feature vector and comparing code similarity.
CN202210808406.7A 2022-07-11 2022-07-11 Technical feature oriented source code comparison method, system and program product Active CN114880023B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210808406.7A CN114880023B (en) 2022-07-11 2022-07-11 Technical feature oriented source code comparison method, system and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210808406.7A CN114880023B (en) 2022-07-11 2022-07-11 Technical feature oriented source code comparison method, system and program product

Publications (2)

Publication Number Publication Date
CN114880023A CN114880023A (en) 2022-08-09
CN114880023B CN114880023B (en) 2022-09-30

Family

ID=82683640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210808406.7A Active CN114880023B (en) 2022-07-11 2022-07-11 Technical feature oriented source code comparison method, system and program product

Country Status (1)

Country Link
CN (1) CN114880023B (en)

Also Published As

Publication number Publication date
CN114880023A (en) 2022-08-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant