CN112035165B - Code clone detection method and system based on isomorphic network - Google Patents

Code clone detection method and system based on isomorphic network

Publication number
CN112035165B
Authority
CN
China
Prior art keywords
abstract syntax
program source
encoder
vector
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010871698.XA
Other languages
Chinese (zh)
Other versions
CN112035165A (en)
Inventor
姚金龙
谷晶中
左洪强
程杰
张阳光
郑宏亮
高军涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Valley Network Polytron Technologies Inc
Original Assignee
Valley Network Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Valley Network Polytron Technologies Inc
Priority to CN202010871698.XA
Publication of CN112035165A
Application granted
Publication of CN112035165B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of code detection, and particularly relates to a code clone detection method and system based on an isomorphic network. Abstract syntax trees are extracted at the function level from the program source codes to be compared, wherein all leaf nodes of an abstract syntax tree correspond to sentence text information of the program source code and non-leaf nodes correspond to structural information of the program source code. The abstract syntax tree is rearranged to obtain a reconstructed abstract syntax tree that forms an isomorphic network model with a recursive self-encoder, wherein each node of the hidden layer of the recursive self-encoder is associated with a non-leaf node of the reconstructed abstract syntax tree. The recursive self-encoder model is trained until its parameters converge, the model parameters used to obtain intermediate vectors being trained through a loss function. Text semantic vectors of the syntax tree are extracted as input, and intermediate vectors of the program source codes to be compared are obtained using the converged model parameters. The similarity of the program source codes to be compared is judged according to the degree of approximation of the intermediate vectors. By automatically learning the hidden features of code, the invention improves the efficiency and accuracy of code clone detection.

Description

Code clone detection method and system based on isomorphic network
Technical Field
The invention belongs to the technical field of code detection, and particularly relates to a code clone detection method and system based on an isomorphic network.
Background
Code cloning occurs when a piece of code is copied, with or without modification, so that two or more code segments are similar to each other. Code cloning can accelerate software development, which is a common demand in industry at present. However, code cloning also leads to the widespread duplication of defects. When the original code has defects, the cloned code usually has the same defects, so the defects spread through the software system and introduce vulnerabilities that pose potential security risks. Code clone detection techniques are therefore widely used to search for known vulnerabilities in unknown code. As a basic analysis technique, code clone detection is of great significance for maintaining software quality. Code clones are classified into four levels, Type1 to Type4. Type1: the code segments are identical except for differences in spaces, layout or comments. Type2: the structure or syntactic composition of the code is the same, but user-defined identifiers, constant values, types, layout or comments differ. Type3: copied-and-pasted code fragments are further modified; in addition to the Type2 changes, statements may be added, deleted or modified. Type4: the code is functionally similar, but its syntactic features differ.
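To make the clone levels concrete, the following sketch shows a Type2 clone and a Type4 clone of one small routine. The functions are hypothetical illustrations written for this purpose and do not come from the patent.

```python
# Illustrative only: function names and bodies are hypothetical examples,
# not code taken from the patent.

def sum_list(values):
    """Original fragment."""
    total = 0
    for v in values:
        total += v
    return total


def add_all(items):
    """Type2 clone: identical structure, only identifiers renamed."""
    acc = 0
    for item in items:
        acc += item
    return acc


def sum_recursive(values):
    """Type4 clone: same function, different syntax (recursion, no loop)."""
    if not values:
        return 0
    return values[0] + sum_recursive(values[1:])


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5]
    print(sum_list(data), add_all(data), sum_recursive(data))
```

All three functions return the same result for the same input, yet only the first two share a syntax tree shape, which is why Type4 clones are the hard case.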
Traditional detection methods use source code representations at different levels, mainly based on text, tokens, abstract syntax trees, program dependency graphs, metrics and the like. These representations use intermediate products of the compilation process that capture the syntactic and structural features of the program to represent the source code, and then compute the degree of similarity between representations by means such as Euclidean distance. Existing methods based on traditional code representations extract structural and syntactic features well and achieve a high recognition rate for Type1 to Type3 clones, because such clones leave the interior of most code blocks unchanged and largely preserve the syntactic features of the code, so that detection techniques based on syntactic representation can identify them effectively. For Type4 clones, however, detection techniques based on syntactic representation have a low recognition rate, because a Type4 clone destroys much of the syntactic information when it is constructed and preserves only functional similarity, that is, semantic similarity. Traditional code clone detection techniques that lack semantic representation information therefore struggle to identify Type4 clones effectively.
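As a rough illustration of a traditional token-level representation compared by Euclidean distance, the sketch below counts identifier-like tokens and measures the distance between the count vectors. The tokenizer regex and the sample fragments are assumptions chosen for brevity; real detectors use far richer representations.

```python
import math
import re
from collections import Counter


def token_counts(source: str) -> Counter:
    """Count identifier and keyword tokens in a source string."""
    return Counter(re.findall(r"[A-Za-z_]\w*", source))


def euclidean_distance(a: Counter, b: Counter) -> float:
    """Euclidean distance between two token-count vectors."""
    keys = set(a) | set(b)
    # A Counter returns 0 for tokens it has not seen.
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))


if __name__ == "__main__":
    original = "total = 0\nfor v in values: total += v"
    reformatted = "total=0\nfor v in values:   total+=v"   # layout change only
    functional = "return sum(values)"                      # same purpose, new syntax
    base = token_counts(original)
    print(euclidean_distance(base, token_counts(reformatted)))
    print(euclidean_distance(base, token_counts(functional)))
```

The layout-only variant is at distance zero from the original, while the functionally equivalent variant is far away, which mirrors the weakness on Type4 clones described above.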
Disclosure of Invention
Therefore, the invention provides a code clone detection method and system based on an isomorphic network, which can intuitively and effectively learn the hidden features of code through automatic learning and improve the efficiency and accuracy of code clone detection.
According to the design scheme provided by the invention, the code clone detection method based on the isomorphic network comprises the following:
extracting abstract syntax trees at the function level from the program source codes to be compared, wherein all leaf nodes of an abstract syntax tree correspond to sentence text information of the program source code, and non-leaf nodes correspond to structural information of the program source code;
rearranging the abstract syntax tree to obtain a reconstructed abstract syntax tree that forms an isomorphic network model with the recursive self-encoder, wherein each node of the hidden layer of the recursive self-encoder is associated with a non-leaf node of the reconstructed abstract syntax tree; and extracting text semantic vectors of the reconstructed abstract syntax trees of the program source codes to be compared;
training the recursive self-encoder model until its parameters converge, the model parameters used to obtain intermediate vectors being trained through a loss function;
inputting the extracted text semantic vectors into the trained recursive self-encoder, and obtaining intermediate vectors of the program source codes to be compared using the converged model parameters; and judging the similarity of the program source codes to be compared according to the degree of approximation of the intermediate vectors.
Further, in the code clone detection method based on the isomorphic network, during rearrangement of the abstract syntax tree the sentence text content is filled in from right to left and from bottom to top, generating a reconstructed abstract syntax tree whose leaf nodes, from low to high, form the function statement sequence and whose non-leaf nodes form the program structure sequence.
Further, in the rearrangement of the abstract syntax tree, all leaf nodes are first removed; the remaining binary tree is then traversed in post-order and all remaining nodes are connected from low to high, the first node being the right subtree of the second node, and so on, with the left subtree branch of each node reserved.
Further, in extracting the text semantic vectors, a word vector model is used to train on the words in the program, converting the words into vectors containing semantic information and obtaining a vector set.
Further, word training is performed using the Skip-Gram method to obtain a vector set with vector length n, wherein each vector contains the semantic information of its corresponding word.
Further, the model loss function of the recursive self-encoder is expressed as:
[Formula image BDA0002651285230000021 in the original publication; not reproduced in this text]
where n is the vector length, $x_n$ denotes the node input and $x'_n$ denotes the node output.
Further, the model parameters of the recursive self-encoder are iteratively updated using a gradient descent algorithm to obtain the converged model parameters.
Further, an approximation threshold for judging similarity is set according to the convergence value of the loss function.
Further, the Euclidean distance between the text semantic vectors of the program source codes to be compared is obtained, and if the Euclidean distance is smaller than the approximation threshold, the program source codes to be compared are judged to be similar.
Further, the invention also provides a code clone detection system based on an isomorphic network, comprising an extraction module, a word vector generation module, a training module and a judging module, wherein:
the extraction module is used for extracting abstract syntax trees at the function level from the program source codes to be compared, wherein all leaf nodes correspond to sentence text information of the program source code and non-leaf nodes correspond to structural information of the program source code;
the word vector generation module is used for rearranging the abstract syntax tree to obtain a reconstructed abstract syntax tree that forms an isomorphic network model with the recursive self-encoder, each node of the hidden layer of the recursive self-encoder being associated with a non-leaf node of the reconstructed abstract syntax tree, and for extracting text semantic vectors of the reconstructed abstract syntax trees of the program source codes to be compared;
the training module is used for training the recursive self-encoder model until its parameters converge, the model parameters used to obtain intermediate vectors being trained through a loss function;
the judging module is used for inputting the extracted text semantic vectors into the trained recursive self-encoder, obtaining intermediate vectors of the program source codes to be compared using the converged model parameters, and judging the similarity of the program source codes to be compared according to the degree of approximation of the intermediate vectors.
The invention has the following beneficial effects:
The abstract syntax tree and the recursive self-encoder together form an isomorphic network for extracting and identifying program semantics, so that Type4 clones can be detected and the efficiency and accuracy of code clone detection are improved. The invention provides guidance for identifying software vulnerabilities and has good application prospects.
Description of the drawings:
FIG. 1 is a flow chart of a code clone detection method in an embodiment;
FIG. 2 is a schematic diagram of a recursive self-encoder structure in an embodiment;
FIG. 3 is an example program source code in an embodiment;
FIG. 4 is an illustration of an original abstract syntax tree in an embodiment;
FIG. 5 is a schematic diagram of the abstract syntax tree reconstructed from the non-leaf nodes in an embodiment;
FIG. 6 is a schematic diagram of the reconstructed abstract syntax tree in an embodiment;
FIG. 7 is a schematic diagram of the hidden-layer portion of the recursive self-encoder in an embodiment.
Detailed description of the embodiments:
The present invention is described in further detail below with reference to the drawings and the technical scheme, in order to make its objects, technical schemes and advantages clearer.
Deep learning has been successful in fields such as natural language processing and image processing, and it can also be applied to program analysis. Its biggest advantage is that it removes the difficult problem of feature engineering, since the features of the data are learned automatically. Drawing on results of deep learning in natural language processing, program code can be treated as natural language in order to extract semantic features, which are an important representation for identifying Type4 clones. Deep learning techniques from natural language processing can therefore advance Type4 clone detection. Referring to fig. 1, the embodiment of the invention provides a code clone detection method based on an isomorphic network, which comprises the following steps:
S101, extracting abstract syntax trees at the function level from the program source codes to be compared, wherein all leaf nodes of an abstract syntax tree correspond to sentence text information of the program source code, and non-leaf nodes correspond to structural information of the program source code;
S102, rearranging the abstract syntax tree to obtain a reconstructed abstract syntax tree that forms an isomorphic network model with the recursive self-encoder, wherein each node of the hidden layer of the recursive self-encoder is associated with a non-leaf node of the reconstructed abstract syntax tree; and extracting text semantic vectors of the reconstructed abstract syntax trees of the program source codes to be compared;
S103, training the recursive self-encoder model until its parameters converge, the model parameters used to obtain intermediate vectors being trained through a loss function;
S104, inputting the extracted text semantic vectors into the trained recursive self-encoder, and obtaining intermediate vectors of the program source codes to be compared using the converged model parameters; and judging the similarity of the program source codes to be compared according to the degree of approximation of the intermediate vectors.
Deep learning based solutions to the code clone detection problem rely mainly on techniques from natural language processing. Research in natural language processing focuses on problems such as word semantic learning, sentiment analysis and machine translation. From a natural language perspective, code is also a kind of language, with its own vocabulary, syntactic expressions and grammar. Word semantic learning methods from natural language processing can therefore help extract the semantic features of code, which are in turn used to build a semantic representation of the code.
A deep learning based solution to the code similarity problem mainly comprises two steps. (1) Representation of code semantic information: a representation capable of expressing the semantic information of the code is learned by a deep learning method. The granularity may be a word, a basic block or a function, and the representation may be a matrix or a vector. (2) Measurement of semantic similarity: whether two pieces of code are similar is judged by measuring the difference between their semantic representations.
In particular, the deep learning based solution to the code similarity measurement problem learns hidden features of code using a recursive self-encoder model. The recursive self-encoder is an unsupervised sentence conversion model formed by combining a recursive neural network with a self-encoder; it can convert sentences of variable length into sentence vectors. Its structure is shown in fig. 2 and works as follows.
(1) Recursive encoding from the input vectors to the hidden layer. The word vectors x4 and x3 of the sentence are passed through the weight matrix W1 and the bias vector b to generate the first hidden-layer node y1; the same operation is then applied to y1 and x2 to obtain y2, and so on. The tanh function in the figure is the activation function; tanh is used here as an example and may be replaced by ReLU or another function as required.
(2) Decoding from the hidden layer to the reconstructed vectors. y1 is decoded through the weight matrix W2 and a bias vector to obtain the reconstructions x4' and x3'; y2 is decoded in the same way to give x2' and y1', and so on, finally yielding a series of reconstructed vectors.
(3) Updating the weight matrices. If the difference between the reconstructed vectors and the original input vectors is small, the hidden layer is considered to have learned the features of the input layer well. This difference is expressed by a loss function, which may be the mean square error loss, the exponential loss or another loss commonly used in machine learning, chosen according to actual needs. Once the loss function is chosen, its value should be made as small as possible, since a small value means that the reconstructed vectors are very close to the input vectors. A gradient descent algorithm is used to compute the direction in which the loss function decreases, and the weight matrices are iteratively updated along that direction to reduce the loss. Model training is complete once the result converges.
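The update loop in step (3) can be sketched on a deliberately tiny model. The example below is an assumption-laden toy, not the patent's network: it uses scalars instead of vectors, omits the activation function and biases, and picks a squared-error loss, a learning rate of 0.1 and 200 steps arbitrarily.

```python
# Toy sketch: a scalar self-encoder with two inputs and one hidden value.
# The model, the loss, the learning rate and the step count are assumptions.

def loss(params, x1, x2):
    a, b, c, d = params
    y = a * x1 + b * x2          # encode two inputs into one hidden value
    r1, r2 = c * y, d * y        # decode the hidden value into two reconstructions
    return 0.5 * ((x1 - r1) ** 2 + (x2 - r2) ** 2)


def gradient(params, x1, x2):
    a, b, c, d = params
    y = a * x1 + b * x2
    e1, e2 = x1 - c * y, x2 - d * y      # reconstruction errors
    dy = -(e1 * c + e2 * d)              # dE/dy by the chain rule
    return [dy * x1, dy * x2, -e1 * y, -e2 * y]


def train(params, x1, x2, lr=0.1, steps=200):
    """Plain gradient descent; returns final parameters and the loss history."""
    history = [loss(params, x1, x2)]
    for _ in range(steps):
        grads = gradient(params, x1, x2)
        params = [p - lr * g for p, g in zip(params, grads)]
        history.append(loss(params, x1, x2))
    return params, history


if __name__ == "__main__":
    _, history = train([0.5, 0.5, 0.5, 0.5], x1=1.0, x2=0.5)
    print(f"initial loss {history[0]:.6f}, final loss {history[-1]:.6f}")
```

The loss history makes the convergence criterion visible: training stops being useful once successive values no longer change appreciably.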
To address the blindness with which deep learning models are often designed, the embodiment of the application, aimed at the code clone detection problem, provides a well-founded deep learning model that learns code features: the nodes of the hidden layer of the recursive self-encoder are not opaque features but features of the nodes corresponding to the program structure. The method uses deep learning to learn the abstract features of code effectively and uses these features to perform the similarity comparison.
Further, in the code clone detection method based on the isomorphic network in the embodiment of the invention, during rearrangement of the abstract syntax tree the sentence text content is filled in from right to left and from bottom to top, generating a reconstructed abstract syntax tree whose leaf nodes, from low to high, form the function statement sequence and whose non-leaf nodes form the program structure sequence. By associating each node in the hidden layer of the recursive self-encoder with a node in the program structure, it can be determined exactly which position in the code structure the feature learned by a given hidden-layer node belongs to. Further, in the rearrangement of the abstract syntax tree, all leaf nodes are first removed; the remaining binary tree is then traversed in post-order and all remaining nodes are connected from low to high, the first node being the right subtree of the second node, and so on, with the left subtree branch of each node reserved. Further, in extracting the text semantic vectors, a word vector model is used to train on the words in the program, converting the words into vectors containing semantic information and obtaining a vector set.
To prepare the programs to be compared for input to the recursive self-encoder structure, the two pieces of program source code to be compared are first input, and a function-level abstract syntax tree is extracted from each. Taking the program source code shown in fig. 3 as an example, after generic information is replaced it is expressed as the sentence: "public void print String text <STRING> System out println text". Extracting the abstract syntax tree of this function gives the original structure shown in fig. 4. The abstract syntax tree is a full binary tree, i.e. every non-leaf node has two children. All leaf nodes of the syntax tree correspond to sentence text information of the program, and the non-leaf nodes correspond to structural information of the program. The abstract syntax tree is then rearranged: all leaf nodes are removed, the remaining binary tree is traversed in post-order, and all remaining nodes are connected from low to high, the first node being the right subtree of the second node, and so on, with the left subtree branch of each node reserved, giving the structure shown in fig. 5. The sentence text is then filled in, in reverse order, from right to left and from bottom to top, generating the reconstructed abstract syntax tree shown in fig. 6, in which the leaf nodes from low to high form the function statement sequence and the non-leaf nodes form the program structure sequence. A program contains a plurality of functions, each corresponding to one abstract syntax tree.
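The rearrangement just described can be sketched as follows. This is one possible reading, not the patent's implementation: the `Node` class, the labels `S1` to `S3` and the four-word sentence are hypothetical, and the sketch assumes that the lowest node's right leaf receives the first word of the sentence so that bottom-up recursion follows the program text order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def internal_postorder(node: Optional[Node]) -> List[str]:
    """Post-order labels of the non-leaf nodes, i.e. with all leaves removed."""
    if node is None or node.is_leaf():
        return []
    return internal_postorder(node.left) + internal_postorder(node.right) + [node.label]


def rebuild(structure: List[str], words: List[str]) -> Node:
    """Chain the structure nodes from low to high and refill the words."""
    if len(words) != len(structure) + 1:
        raise ValueError("a full binary tree has one more leaf than internal nodes")
    # Lowest node: its right leaf takes the first word, its left leaf the second
    # (assumed fill order, see the lead-in).
    current = Node(structure[0], left=Node(words[1]), right=Node(words[0]))
    for label, word in zip(structure[1:], words[2:]):
        # The previous node becomes the right subtree; the reserved left
        # branch takes the next word.
        current = Node(label, left=Node(word), right=current)
    return current


if __name__ == "__main__":
    original = Node("S3",
                    Node("S1", Node("public"), Node("void")),
                    Node("S2", Node("print"), Node("text")))
    order = internal_postorder(original)
    root = rebuild(order, ["public", "void", "print", "text"])
    print(order, root.label, root.left.label)
```

The rebuilt tree is a right-leaning chain, which is the same shape as a recursive self-encoder that folds one new word vector into the running hidden node at each level.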
Further, in the code clone detection method based on the isomorphic network in the embodiment of the invention, word training is performed using the Skip-Gram method to obtain a vector set with vector length n, wherein each vector contains the semantic information of its corresponding word.
The main purpose of word vector training is to convert words into vectors containing semantic information. The words in the program are trained using the Skip-Gram method of Google's Word2Vec algorithm to obtain a vector set. The number of vectors equals the number of words in the training corpus, i.e. the program text, and each vector has length n. Each vector contains the semantic information of its corresponding word and is used as input to the subsequent recursive self-encoder network.
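The sketch below shows only the first stage of Skip-Gram training, namely turning a token sequence into (center, context) pairs; the embedding training itself is omitted. The window size is an assumed hyperparameter that the patent does not specify, and the function name is hypothetical.

```python
from typing import List, Tuple


def skip_gram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    sentence = "public void print String text <STRING> System out println text"
    tokens = sentence.split()
    for center, context in skip_gram_pairs(tokens, window=1)[:5]:
        print(center, "->", context)
```

A Skip-Gram model is then trained to predict each context word from its center word, and the learned input weights serve as the length-n word vectors.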
Further, in the code clone detection method based on the isomorphic network in the embodiment of the present invention, the model loss function of the recursive self-encoder is expressed as:
[Formula image BDA0002651285230000051 in the original publication; not reproduced in this text]
where n is the vector length, $x_n$ denotes the node input and $x'_n$ denotes the node output. Further, the model parameters of the recursive self-encoder are iteratively updated using a gradient descent algorithm to obtain a trained and converged recursive self-encoder model. Further, an approximation threshold for judging similarity is set according to the convergence value of the loss function. Further, the Euclidean distance between the text semantic vectors of the program source codes to be compared is obtained, and if the Euclidean distance is smaller than the approximation threshold, the program source codes to be compared are judged to be similar.
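The loss formula itself survives only as an image reference and cannot be recovered from this text. Purely as an assumption consistent with the surrounding description (a reconstruction loss over the n components of a node input and its reconstruction, with mean square error named elsewhere as one admissible choice), a squared-error form would read:

```latex
% Assumed form only; the original formula image is not available.
% x_i is the i-th component of the node input, x'_i of the node output.
E = \frac{1}{2} \sum_{i=1}^{n} \left( x_i - x'_i \right)^2
```

The constant factor and the choice of summation index are illustrative; the published formula may differ.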
Because the network structures of the full binary tree and the recursive self-encoder are identical, deleting the reconstruction part from the original recursive self-encoder structure diagram gives the structure shown in fig. 7, in which x4 to x1 are the leaf nodes of the full binary tree. The structures of fig. 5 and fig. 4 are identical, and the vectors corresponding to the words of the leaf nodes in fig. 4 are filled into the corresponding positions in fig. 5. The word vectors of length n corresponding to x4 and x3 are passed through the calculation formula for y1 in fig. 1 to obtain y1, where the weight matrix W1 and the bias vector b1 are randomly initialized. All hidden-layer y nodes are calculated in turn in this way; this is the encoding process. Each hidden node is decoded through the weight matrix W' and the bias b' into reconstructed nodes, denoted x'; this is the decoding process. The input of a hidden node is two x nodes, and the calculation formula is $y_1 = f(W[x_1, x_2] + b)$, where W is a 2n×n weight matrix, b is a bias vector of length n, and f is the activation function. A reconstruction node outputs either two reconstructed x nodes or one reconstructed x node and one reconstructed y node, and the calculation formula is $[x'_1, x'_2] = W' y_1 + b'$, where W' is the n×2n weight matrix of the decoder and b' is a bias vector of length 2n.
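The two formulas above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: tanh is used as the activation f, the biases start at zero for brevity, the helper names are hypothetical, and the matrices are stored row-wise as output-by-input (n rows of 2n values for W, 2n rows of n values for W'), which is the transpose of the 2n×n and n×2n shapes quoted in the text.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-wise matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random initial weights (the range is an arbitrary choice)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def encode(W: Matrix, b: Vector, left: Vector, right: Vector) -> Vector:
    """y = f(W[x1, x2] + b) with f = tanh."""
    z = matvec(W, left + right)
    return [math.tanh(zi + bi) for zi, bi in zip(z, b)]


def decode(W_dec: Matrix, b_dec: Vector, y: Vector) -> List[Vector]:
    """[x1', x2'] = W'y + b', split back into two vectors of length n."""
    out = [oi + bi for oi, bi in zip(matvec(W_dec, y), b_dec)]
    n = len(y)
    return [out[:n], out[n:]]


def encode_sequence(W: Matrix, b: Vector, vectors: List[Vector]) -> List[Vector]:
    """Fold the word vectors bottom-up: y1 = f(x4, x3), y2 = f(y1, x2), ..."""
    current = encode(W, b, vectors[0], vectors[1])
    hidden = [current]
    for x in vectors[2:]:
        current = encode(W, b, current, x)
        hidden.append(current)
    return hidden


if __name__ == "__main__":
    n = 4
    rng = random.Random(0)
    W, b = random_matrix(n, 2 * n, rng), [0.0] * n
    W_dec, b_dec = random_matrix(2 * n, n, rng), [0.0] * (2 * n)
    x4, x3, x2, x1 = ([rng.uniform(-1, 1) for _ in range(n)] for _ in range(4))
    hidden = encode_sequence(W, b, [x4, x3, x2, x1])
    print(len(hidden), "hidden nodes of length", len(hidden[0]))
```

Because the same W and b are reused at every level, a function of any length can be encoded; only the number of hidden nodes changes.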
The loss function of the model is:
[Formula image BDA0002651285230000052 in the original publication; not reproduced in this text]
The loss function describes the difference between the model input and output; if the value of the loss function is low, the hidden layer y can be considered to have extracted the hidden information of the input well. To reduce the value of the loss function, the parameters W, W', b and b' are iteratively updated using a gradient descent algorithm. Once the model converges, training of the recursive self-encoder model is complete. Ten times the convergence value of the loss function E of the trained model may be taken as the threshold, and the degree of approximation between the hidden vectors obtained for two functions through the recursive self-encoder model is calculated, using the loss function formula, to judge whether the two functions are similar. Specifically, the text word vectors of the two functions are input into the trained recursive self-encoder model. Functions of different lengths produce models of different depth during calculation, but every hidden-layer node is computed with the same trained parameters W and b, and a hidden-layer vector of the same dimension is obtained by adding together the hidden-layer node vectors, whose number varies with the function. The Euclidean distance between the two resulting vectors is then compared with the threshold: if the distance is less than 10E the functions are judged similar, otherwise dissimilar.
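The decision rule in this paragraph can be sketched as below. The function names and the sample numbers are hypothetical; the factor of 10 comes from the text, while the converged loss value 0.01 is an invented placeholder.

```python
import math
from typing import List

Vector = List[float]


def sum_vectors(hidden: List[Vector]) -> Vector:
    """Add the hidden-layer node vectors into one fixed-length function vector."""
    return [sum(component) for component in zip(*hidden)]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_clone(hidden_a: List[Vector], hidden_b: List[Vector],
             converged_loss: float, factor: float = 10.0) -> bool:
    """Judge two functions similar if their summed vectors are closer than factor * E."""
    threshold = factor * converged_loss
    return euclidean(sum_vectors(hidden_a), sum_vectors(hidden_b)) < threshold


if __name__ == "__main__":
    # Two functions with different numbers of hidden nodes (made-up values).
    hidden_a = [[0.1, 0.2], [0.3, 0.1]]
    hidden_b = [[0.2, 0.1], [0.1, 0.1], [0.1, 0.1]]
    hidden_c = [[1.0, 1.0]]
    print(is_clone(hidden_a, hidden_b, converged_loss=0.01))
    print(is_clone(hidden_a, hidden_c, converged_loss=0.01))
```

Summing the node vectors is what lets functions of different lengths be compared in a space of the same dimension.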
In this embodiment, by reconstructing the program's abstract syntax tree, the reconstructed abstract syntax tree is made to correspond to the structure of the deep learning based recursive self-encoder, forming an isomorphic network, so that the structural features at the corresponding node positions can be learned more accurately and intuitively. Like fig. 6, the structure shown in fig. 7 is also a full binary tree: each leaf node corresponds to a text node of the reconstructed abstract syntax tree, and the structure of the hidden layer corresponds to the non-leaf nodes of the abstract syntax tree, which represent the structural features of the program. The hidden-layer features learned by the recursive self-encoder therefore correspond explicitly to the logical positions of the hidden-layer nodes in the abstract syntax tree. The program words are arranged in reverse order so that the bottom-up recursion proceeds in the order of the program text during computation. The recursive self-encoder used here corresponds to the input reconstructed abstract syntax tree and has a consistent architecture, so that the hidden features of the function can be learned more effectively and intuitively.
Further, based on the above method, an embodiment of the present invention also provides a code clone detection system based on an isomorphic network, comprising an extraction module, a word vector generation module, a training module and a judging module, wherein:
the extraction module is used for extracting the function-level abstract syntax trees of the program source codes to be compared, in which all leaf nodes correspond to the statement text information of the program source code and the non-leaf nodes correspond to its structural information;
the word vector generation module is used for rearranging the abstract syntax tree to obtain a reconstructed abstract syntax tree that forms an isomorphic network model with the recursive self-encoder, each hidden-layer node of the recursive self-encoder being associated with a non-leaf node of the reconstructed abstract syntax tree, and for extracting the text semantic vectors of the reconstructed abstract syntax trees of the program source codes to be compared;
the training module is used for training the recursive self-encoder model until its parameters converge, the model parameters used to obtain the intermediate vector being trained through the loss function;
the judging module is used for feeding the extracted text semantic vectors into the trained recursive self-encoder as input, obtaining the intermediate vectors of the program source codes to be compared through the converged model parameters, and judging the similarity of the program source codes to be compared according to how close the intermediate vectors are.
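The rearrangement performed by the word vector generation module can be sketched as follows, based on the rule stated for the method: strip the leaf nodes, traverse the remaining structure nodes in post-order, chain them from low to high so that each node becomes the right subtree of the next, and fill the statement text into the reserved left branches from bottom to top. The Node class, the function names and the tiny example tree are hypothetical, and the sketch assumes the input is already a full binary tree (every non-leaf node has exactly two children).

```python
class Node:
    """A tree node; a node without children is a leaf holding statement text."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])

def structure_postorder(node):
    """Labels of the non-leaf nodes in post-order, i.e. with all leaves removed."""
    if not node.children:
        return []
    labels = []
    for child in node.children:
        labels.extend(structure_postorder(child))
    labels.append(node.label)
    return labels

def leaf_texts(node):
    """Leaf texts from left to right, i.e. in program-text order for the input tree."""
    if not node.children:
        return [node.label]
    texts = []
    for child in node.children:
        texts.extend(leaf_texts(child))
    return texts

def rebuild(root):
    """Chain the structure nodes low to high and refill the leaves bottom-up."""
    structures = structure_postorder(root)
    words = leaf_texts(root)
    if len(words) != len(structures) + 1:
        raise ValueError("expected a full binary tree")  # assumption of this sketch
    subtree = Node(words[0])  # lowest, right-most leaf
    for label, word in zip(structures, words[1:]):
        # The chain built so far becomes the right subtree;
        # the left branch is reserved for the next leaf.
        subtree = Node(label, [Node(word), subtree])
    return subtree

def spine(node):
    """Structure labels read from the root down the right branches."""
    labels = []
    while node.children:
        labels.append(node.label)
        node = node.children[1]
    return labels

# Hypothetical full binary tree for the statement `a = b + c`.
example = Node("Assign", [Node("a"), Node("Add", [Node("b"), Node("c")])])
rebuilt = rebuild(example)
```

Reading the leaves of the rebuilt tree from left to right gives the words in reverse program order, which is what lets a bottom-up recursion visit them in text order.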
The relative steps, numerical expressions and numerical values of the components and steps set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
Based on the above system, an embodiment of the present invention further provides a server, comprising: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
Based on the above system, an embodiment of the present invention further provides a computer readable medium storing a computer program which, when executed by a processor, implements the method described above.
The device provided by the embodiment of the present invention has the same implementation principle and technical effects as the foregoing method embodiment; for brevity, where the device embodiment omits a detail, reference may be made to the corresponding content of the method embodiment.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the system and device described above may refer to the corresponding procedures in the foregoing method embodiments, which are not repeated here.
Any particular values in all examples shown and described herein are to be construed as merely illustrative and not a limitation, and thus other examples of exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interface, device or unit, and may be electrical, mechanical or of another form.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a processor-executable non-volatile computer readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Finally, it should be noted that the above embodiments are only specific implementations of the present invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A code clone detection method based on an isomorphic network, characterized by comprising the following steps:
extracting the function-level abstract syntax trees of the program source codes to be compared, wherein all leaf nodes of an abstract syntax tree correspond to the statement text information of the program source code, and the non-leaf nodes correspond to the structural information of the program source code;
rearranging the abstract syntax tree to obtain a reconstructed abstract syntax tree that forms an isomorphic network model with the recursive self-encoder, wherein each hidden-layer node of the recursive self-encoder is associated with a non-leaf node of the reconstructed abstract syntax tree, so that the hidden-layer features learned by the recursive self-encoder correspond to the logical positions of the hidden-layer nodes in the abstract syntax tree; and extracting the text semantic vectors of the reconstructed abstract syntax trees of the program source codes to be compared;
training the recursive self-encoder model until its parameters converge, the model parameters used to obtain the intermediate vector being trained through a loss function;
feeding the extracted text semantic vectors into the trained recursive self-encoder as input, and obtaining the intermediate vectors of the program source codes to be compared through the converged model parameters; and judging the similarity of the program source codes to be compared according to how close the intermediate vectors are;
wherein, in the rearrangement of the abstract syntax tree, the statement text content is filled in from right to left and from bottom to top, generating a reconstructed abstract syntax tree whose leaf nodes, from low to high, follow the statement order of the function and whose non-leaf nodes follow the structural order of the program;
and wherein, in the rearrangement of the abstract syntax tree, all leaf nodes are first removed, the remaining binary tree is then traversed in post-order, and all remaining nodes are connected from low to high, the first node being the right subtree of the second node, and so on, with the left subtree branch of each node reserved.
2. The isomorphic network-based code clone detection method according to claim 1, wherein, in extracting the text semantic vectors, the words in the program are trained with a word vector model and converted into vectors containing semantic information, so as to obtain a vector set.
3. The isomorphic network-based code clone detection method according to claim 2, wherein the words are trained with the Skip-Gram method to obtain a vector set with vector length n, each vector containing the semantic information of its corresponding word.
4. The isomorphic network-based code clone detection method according to claim 1, wherein the recursive self-encoder model loss function is expressed as:
Figure FDA0004072209150000011
wherein n is the vector length, x_n represents the node input, and x'_n represents the node output.
5. The isomorphic network-based code clone detection method according to claim 1 or 4, wherein the recursive self-encoder model parameters are iteratively updated with a gradient descent algorithm to obtain the converged recursive self-encoder model parameters.
6. The isomorphic network-based code clone detection method according to claim 1 or 4, wherein an approximation threshold for judging similarity is set according to the convergence value of the loss function.
7. The isomorphic network-based code clone detection method according to claim 6, wherein the Euclidean distance between the text semantic vectors of the program source codes to be compared is obtained, and if the Euclidean distance is smaller than the approximation threshold, the program source codes to be compared are judged to be similar.
8. An isomorphic network-based code clone detection system, characterized by being implemented based on the method of claim 1 and comprising an extraction module, a word vector generation module, a training module and a judging module, wherein:
the extraction module is used for extracting the function-level abstract syntax trees of the program source codes to be compared, in which all leaf nodes correspond to the statement text information of the program source code and the non-leaf nodes correspond to its structural information;
the word vector generation module is used for rearranging the abstract syntax tree to obtain a reconstructed abstract syntax tree that forms an isomorphic network model with the recursive self-encoder, each hidden-layer node of the recursive self-encoder being associated with a non-leaf node of the reconstructed abstract syntax tree, and for extracting the text semantic vectors of the reconstructed abstract syntax trees of the program source codes to be compared;
the training module is used for training the recursive self-encoder model until its parameters converge, the model parameters used to obtain the intermediate vector being trained through the loss function;
the judging module is used for feeding the extracted text semantic vectors into the trained recursive self-encoder as input, obtaining the intermediate vectors of the program source codes to be compared through the converged model parameters, and judging the similarity of the program source codes to be compared according to how close the intermediate vectors are.
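The similarity decision of claims 6 and 7 can be illustrated with a short Python sketch. The function names and the example threshold are hypothetical; how the threshold is derived from the convergence value of the loss function is not specified here, so the sketch simply takes it as a parameter.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_clone(vec_a, vec_b, threshold):
    """Judge two functions similar when their vectors are closer than the threshold."""
    return euclidean_distance(vec_a, vec_b) < threshold

if __name__ == "__main__":
    # Hypothetical vectors for two functions and an assumed threshold.
    print(is_clone([0.10, 0.20, 0.30], [0.12, 0.18, 0.31], threshold=0.1))
```

The comparison is strict, matching the wording "smaller than the approximation threshold"; a pair whose distance equals the threshold is not reported as similar.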
CN202010871698.XA 2020-08-26 2020-08-26 Code clone detection method and system based on isomorphic network Active CN112035165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010871698.XA CN112035165B (en) 2020-08-26 2020-08-26 Code clone detection method and system based on isomorphic network

Publications (2)

Publication Number Publication Date
CN112035165A CN112035165A (en) 2020-12-04
CN112035165B true CN112035165B (en) 2023-06-09

Family

ID=73581580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010871698.XA Active CN112035165B (en) 2020-08-26 2020-08-26 Code clone detection method and system based on isomorphic network

Country Status (1)

Country Link
CN (1) CN112035165B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112698831B (en) * 2020-12-25 2022-08-09 昆明理工大学 Code automatic generation quality evaluation method
CN113296784B (en) * 2021-05-18 2023-11-14 中国人民解放军国防科技大学 Container base mirror image recommendation method and system based on configuration code characterization
CN113535229B (en) * 2021-06-30 2022-12-02 中国人民解放军战略支援部队信息工程大学 Anti-confusion binary code clone detection method based on software gene
CN113656066B (en) * 2021-08-16 2022-08-05 南京航空航天大学 Clone code detection method based on feature alignment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111090461A (en) * 2019-11-18 2020-05-01 中山大学 Code annotation generation method based on machine translation model

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140053285A1 (en) * 2012-08-16 2014-02-20 Infosys Limited Methods for detecting plagiarism in software code and devices thereof
US10042740B2 (en) * 2015-12-04 2018-08-07 Microsoft Technology Licensing, Llc Techniques to identify idiomatic code in a code base
US20170212829A1 (en) * 2016-01-21 2017-07-27 American Software Safety Reliability Company Deep Learning Source Code Analyzer and Repairer
US10534604B1 (en) * 2018-03-20 2020-01-14 Architecture Technology Corporation Software refactoring systems and methods
CN110347428A (en) * 2018-04-08 2019-10-18 北京京东尚科信息技术有限公司 A kind of detection method and device of code similarity
CN109445834B (en) * 2018-10-30 2021-04-30 北京计算机技术及应用研究所 Program code similarity rapid comparison method based on abstract syntax tree
CN111124487B (en) * 2018-11-01 2022-01-21 浙江大学 Code clone detection method and device and electronic equipment
CN110990273B (en) * 2019-11-29 2024-04-23 中国银行股份有限公司 Clone code detection method and device
CN111459491B (en) * 2020-03-17 2021-11-05 南京航空航天大学 Code recommendation method based on tree neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research progress in code clone detection; 陈秋远; 李善平; 鄢萌; 夏鑫; 软件学报 (Journal of Software) (No. 04); 102-120 *

Also Published As

Publication number Publication date
CN112035165A (en) 2020-12-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant