CN111124487B - Code clone detection method and device and electronic equipment

Info

Publication number
CN111124487B
Authority
CN
China
Prior art keywords
code
training
versions
source
training data
Prior art date
Legal status
Active
Application number
CN201811295180.5A
Other languages
Chinese (zh)
Other versions
CN111124487A (en)
Inventor
傅珉
杨昕立
鄢萌
李元平
章修琳
吴芮
杨小虎
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201811295180.5A
Publication of CN111124487A
Application granted
Publication of CN111124487B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/70 - Software maintenance or management
    • G06F8/75 - Structural analysis for program understanding
    • G06F8/751 - Code clone detection

Abstract

The application discloses a code clone detection method, which comprises the following steps: acquiring a source code set consisting of at least two versions of source code; converting the source code of each of the at least two versions into a corresponding code feature vector; and inputting the code feature vectors corresponding to the at least two versions of source code into an integrated classification model for clone detection to obtain a clone detection result. In this method, the feature information of each version of source code is extracted and converted into a code feature vector, and clone detection is performed with an integrated classification model on the basis of these vectors, so that the loss of source-code features is reduced and the code clone detection achieved on this basis is more accurate and more effective.

Description

Code clone detection method and device and electronic equipment
Technical Field
The application relates to the technical field of software cloning, and in particular to a code clone detection method. The application also relates to a code clone detection device and an electronic device.
Background
As software systems grow in scale and complexity, software development becomes increasingly burdensome. During development, software developers often refer to existing code to implement similar functions, or directly copy and paste existing code for "reuse", producing multiple semantically and functionally similar code fragments, i.e., code clones. Although this style of code reuse can improve development efficiency to some extent, code cloning spreads software bugs, increases the difficulty of software maintenance, and can even trigger the risk of license violations; code clone detection therefore becomes increasingly important as software is maintained and developed.
Currently, many methods and tools for code clone detection have been proposed, and they can mainly be classified into three categories: text-based, token-based, and tree-based code clone detection. Text-based tools first perform light preprocessing of the source code, namely formatting and standardized layout, and then dynamically cluster potentially cloned code through simple text-line comparison. Token-based code clone detection tools mainly include CCFinder and SourcererCC, both of which detect code clones on the basis of token sequences. The main tree-based tool is DECKARD, which detects code clones by recognizing similar abstract syntax trees (ASTs).
Although the code clone detection tools above are fairly practical, they detect code clones only through similarity at the text, token, and tree (syntax) levels; for code that is semantically similar but not necessarily syntactically similar, they cannot detect clones at the semantic level, which is a major drawback.
Disclosure of Invention
The application provides a code clone detection method to overcome the defects in the prior art. The application also relates to a code clone detection device and an electronic device.
The application provides a code clone detection method, which comprises the following steps:
acquiring a source code set consisting of at least two versions of source codes;
respectively converting the source codes of the at least two versions into corresponding code feature vectors;
and inputting the code feature vectors corresponding to the source codes of the at least two versions into an integrated classification model for clone detection to obtain a clone detection result.
Optionally, the converting the source codes of the at least two versions into corresponding code feature vectors respectively includes:
for at least two versions of source code in the set of source code, performing the following:
extracting, based on a code character corpus, the code character units of the corpus contained in the source code;
constructing a semantic vector of the code character unit;
generating a semantic matrix of the source code according to the semantic vector of the code character unit;
and converting the semantic matrix of the source code into a semantic vector as the code feature vector corresponding to the source code.
Optionally, the integrated classification model is obtained by training in the following way:
acquiring training source codes of at least two versions in an original training code set;
respectively converting the training source codes of the at least two versions into corresponding training feature vectors;
and performing model training by using at least two training feature vectors obtained by conversion.
Optionally, before the substep of performing model training using the at least two training feature vectors obtained by the conversion, the method includes:
carrying out balance processing on positive training data and negative training data in the original training code set;
wherein the positive training data refers to cases in which clone code exists in the training source codes of the at least two versions, and the negative training data refers to cases in which no clone code exists in the training source codes of the at least two versions.
Optionally, the balancing the positive training data and the negative training data in the original training code set includes:
calculating the ratio of positive training data to negative training data in the original training code set;
if the ratio of the positive training data to the negative training data is smaller than a target ratio, randomly selecting positive training data from the original training code set and adding it into the original training code set;
and if the ratio of the positive training data to the negative training data is larger than a target ratio, randomly selecting negative training data from the original training code set and adding it into the original training code set.
Optionally, the converting the training source codes of the at least two versions into corresponding training feature vectors respectively includes:
extracting, based on a code character corpus, the code character units of the corpus contained in the training source codes;
constructing a semantic vector of the code character unit;
generating a semantic matrix of the training source code according to a semantic vector of a code character unit contained in the training source code;
and converting the semantic matrix into a semantic vector as a training feature vector corresponding to the training source code.
Optionally, the clone detection result carries a code clone type between the at least two versions of the source code;
wherein the code clone type comprises at least one of: text clone, token clone, syntactic clone, and semantic clone.
Optionally, the integrated classification model comprises: a random forest classification model obtained by training with an ensemble learning method;
wherein the base learner of the random forest classification model adopts at least one of the following classification techniques: decision tree, naive Bayes, support vector machine, linear discriminant analysis, and k-nearest neighbor classifier;
and the integration method of the random forest classification model adopts at least one of the following: bagging, boosting, and stacking.
Optionally, in the step of converting the source codes of the at least two versions into corresponding code feature vectors, a word embedding technology is adopted to convert the source codes of the at least two versions in the source code set into corresponding code feature vectors;
wherein the word embedding technique comprises at least one of: word vectorization and text vectorization.
The present application also provides a code clone detection device, comprising:
the source code set acquisition unit is used for acquiring a source code set consisting of at least two versions of source codes;
the code feature vector conversion unit is used for respectively converting the source codes of the at least two versions into corresponding code feature vectors;
and the clone detection unit is used for inputting the code feature vectors corresponding to the source codes of the at least two versions into the integrated classification model for clone detection to obtain a clone detection result.
The present application further provides an electronic device, comprising:
a memory and a processor;
the memory is to store computer-executable instructions, and the processor is to execute the computer-executable instructions to:
acquiring a source code set consisting of at least two versions of source codes;
respectively converting the source codes of the at least two versions into corresponding code feature vectors;
and inputting the code feature vectors corresponding to the source codes of the at least two versions into an integrated classification model for clone detection to obtain a clone detection result.
Compared with the prior art, the method has the following advantages:
the code clone detection method provided by the application comprises the following steps: acquiring a source code set consisting of at least two versions of source codes; respectively converting the source codes of the at least two versions into corresponding code feature vectors; and inputting the code characteristic vectors corresponding to the source codes of the at least two versions into an integrated classification model for clone detection to obtain a clone detection result.
According to the code clone detection method, in the process of clone detection of the source codes of at least two versions, the source codes are converted into code feature vectors corresponding to the source codes by extracting the feature information of the source codes of the at least two versions, and the clone detection of the source codes of the at least two versions is realized by utilizing the integrated classification model on the basis of the code feature vectors, so that the feature loss of the source codes is reduced, and the code clone detection realized on the basis is more accurate and more effective.
Drawings
FIG. 1 is a process flow diagram of an embodiment of a code clone detection method provided herein;
FIG. 2 is a schematic diagram of a code clone detection framework provided herein;
FIG. 3 is a schematic diagram of an embodiment of a code clone detection device provided by the present application;
fig. 4 is a schematic diagram of an electronic device provided by the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The application provides a code clone detection method, a code clone detection device, and an electronic device. Each is described in detail below, step by step, with reference to the drawings of the embodiments provided in the present application.
The embodiment of the code clone detection method provided by the application is as follows:
referring to fig. 1, a flow chart of a process of an embodiment of the code clone detection method provided by the present application is shown, and referring to fig. 2, a schematic diagram of a code clone detection framework provided by the present application is shown.
Step S101, a source code set composed of at least two versions of source codes is obtained.
In general, code clones can be classified into four types according to their degree of similarity: (1) Type-1: the source code fragments of the multiple versions are identical except for comments and layout; this can be called the text clone type. (2) Type-2: in addition to the differences of the text clone type, the multiple versions of source code also differ in identifier names and literal constant values; this can be called the token clone type. (3) Type-3: in addition to the differences of the token clone type, statements have been added, modified, or deleted across the versions, i.e., the code fragments of the multiple versions are similar at the syntax level; this can be called the syntactic clone type. (4) Type-4: beyond the differences of the syntactic clone type, the source code fragments of the multiple versions implement the same function but have no syntactic similarity; this can be called the semantic clone type. Although many code clone detection methods and tools have been proposed for text clone, token clone, and syntactic clone detection, and deep learning models have even been proposed for code clone detection, there is still room for improvement in model training, detection efficiency, and detection precision; concretely, such approaches do not scale well to slightly larger source code fragments, because deep learning models take a long time to train. A hypothetical Type-4 pair is sketched below.
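For illustration only (not part of the original disclosure), the following hypothetical pair of fragments is a Type-4 (semantic) clone: both compute the same result but share essentially no text, token sequence, or syntax tree, which is why text-, token-, and tree-based tools miss such pairs.

```python
# Hypothetical Type-4 (semantic) clone pair, for illustration only:
# the two functions compute the same result but differ in text,
# tokens, and syntax-tree structure.

def total_loop(values):
    # Imperative version: explicit accumulator loop.
    result = 0
    for v in values:
        result += v
    return result

def total_builtin(values):
    # Declarative version: built-in reduction.
    return sum(values)

assert total_loop([1, 2, 3]) == total_builtin([1, 2, 3]) == 6
```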
The code clone detection method provided by the application does not use a complex deep learning model; instead, the model is trained by ensemble learning, so that the resulting integrated classification model is more effective and more scalable. At the same time, the original feature information of the multiple versions of source code is preserved: a word embedding technique is used to mine lexical, syntactic, and other feature information in the source code, and on top of this feature extraction the feature information of the source code is converted into feature vectors. An integrated classification model with lower training cost then learns the source-code feature information in these vectors, so that the information loss of the source code is reduced, more complete feature information is retained, and a more accurate and more effective code clone detection effect is achieved on this basis.
It should be noted that the code clone detection method provided by the present application can detect similarity between source code of two versions as well as of more than two versions. The embodiment of the present application takes clone detection of two versions of source code as an example; the implementation for three, four, or more versions is similar, and reference may be made to the two-version case, which is not repeated here. In addition, in a specific implementation, source code of three, four, or more versions can be combined pairwise, converting the clone detection of many versions into clone detection between two versions.
In practical applications, the code clone detection method provided by the embodiment of the present application comprises a source code clone detection phase, in which the code feature vectors corresponding to at least two versions of source code are input into the integrated classification model for clone detection, and an integrated classification model training phase, in which the integrated classification model is trained on training data. The two phases can be interleaved, with the integrated classification model continuously retrained during clone detection so that its detection accuracy keeps improving. For example, in a code clone testing scenario, many source code fragments to be tested may be submitted every day, and the number of source code pairs matched for clone testing may be very large; both the model training time and the clone testing time therefore matter. The training time must allow the integrated classification model to be updated every day, and the testing time must allow all source code pairs submitted each day to be classified as clone or non-clone in a timely manner.
In this embodiment, the integrated classification model training phase is first described in detail with reference to fig. 2. In this phase, an integrated classification model is constructed by ensemble learning, and a highly reliable integrated classification model is trained from source code samples known to be cloned or non-cloned. In a preferred implementation provided by the embodiment of the present application, the integrated classification model is trained as follows:
(1) acquiring training source codes of at least two versions in an original training code set;
the original training code set is a training sample set of an integrated classification model, and the training source code of each group of at least two versions in the original training code set is a training sample for model training, and it is noted that whether the training source code of each group of at least two versions in the original training code set is clone/non-clone is known, so that the model training is performed by using the known training sample in the original training code set.
For example, source code pairs in the original training code set, each labeled as clone or non-clone, can serve as training samples for model training.
Preferably, the integrated classification model in the embodiment of the application is a random forest classification model obtained by training with an ensemble learning method; the base learner of the random forest classification model adopts at least one of the following classification techniques: decision tree, naive Bayes, support vector machine, linear discriminant analysis, and k-nearest neighbor classifier; and the integration method of the random forest classification model adopts at least one of the following: bagging, boosting, and stacking.
For example, a random forest classification model is constructed by combining decision trees with bagging. A decision tree performs hierarchical decisions using a group of features arranged in a tree structure, and during tree construction the feature variables that best separate the different classes can be quickly found and used as branch attributes. In addition, a decision tree can generate corresponding explicit rules for different classes. To avoid constructing one very large decision tree to handle all situations, and to save time and space, ensemble learning is used to build several medium-scale decision trees instead of a single huge one; specifically, bagging is used to build a random forest. A random forest is an advanced ensemble technique based on decision trees: it introduces randomness into the construction of each decision tree so that the trees are less correlated, and on this basis the ensemble performance achieved with bagging is better. An illustrative sketch follows.
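For illustration only, a minimal sketch of this bagged-decision-tree construction, assuming scikit-learn's RandomForestClassifier and synthetic placeholder data (X as code-pair feature vectors, y as clone/non-clone labels); none of these names come from the patent itself.

```python
# Random-forest sketch, assuming scikit-learn as the ensemble backend;
# X and y are synthetic placeholders for code-pair features (e.g., two
# concatenated d-dimensional code feature vectors) and clone labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 200))    # 200 code pairs, 2 x 100-dim vectors each
y = rng.integers(0, 2, size=200)   # 1 = clone pair, 0 = non-clone pair

# Many medium-scale trees instead of one huge tree: bootstrap=True gives
# the bagging resampling, and max_features="sqrt" injects the per-split
# randomness that decorrelates the individual trees.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                             bootstrap=True, random_state=0)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))    # per-pair clone probability estimates
```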
(2) Respectively converting the training source codes of the at least two versions into corresponding training feature vectors;
Preferably, the at least two versions of training source code are respectively converted into corresponding training feature vectors as follows: extracting, based on a code character corpus, the code character units of the corpus contained in the training source code; constructing semantic vectors of the code character units; generating a semantic matrix of the training source code from the semantic vectors of the code character units it contains; and converting the semantic matrix into a semantic vector that serves as the training feature vector corresponding to the training source code.
It should be noted that converting the at least two versions of training source code into training feature vectors during model training and converting the at least two versions of source code into code feature vectors in step S102 below can preferably be implemented with the same word embedding technique, where the word embedding technique comprises at least one of the following: word vectorization (Word2Vec) and text vectorization (Doc2Vec).
For example, each source code fragment is converted into a corresponding code feature vector using a word embedding technique, specifically the skip-gram model of word vectorization (Word2Vec), implemented as follows:
Given a token $t$ in the source code (e.g., a character in the source code), the set of surrounding tokens of $t$ (e.g., the neighboring tokens of the given token) is denoted $C_t$. The objective function $J$ of the skip-gram model (to be maximized), which is the sum of the log-probabilities of the surrounding tokens given each token, can be expressed as:
$$J = \frac{1}{n}\sum_{i=1}^{n}\sum_{t_j \in C_{t_i}} \log p(t_j \mid t_i)$$
where $n$ represents the length of the token sequence. Furthermore, $p(t_j \mid t_i)$ is a conditional probability defined using the following softmax function:
$$p(t_j \mid t_i) = \frac{\exp(v_{t_j}^{\top} v_{t_i})}{\sum_{t \in T} \exp(v_t^{\top} v_{t_i})}$$
where $v_t$ is the vector representation of token $t$, and $T$ is the vocabulary of all tokens.
Each character in the code character corpus can be represented as a d-dimensional vector, where d is a tunable parameter typically set to an integer such as 100. With the skip-gram model, each token is converted into a fixed-length (d-dimensional) vector, and on this basis the source code can be represented as a matrix in which each row represents one token. Since different source code fragments contain different numbers of tokens, such a matrix is difficult to feed directly into the integrated classification model, so it is converted into a vector by averaging all the token vectors contained in the source code.
The average is computed numerically for each dimension of the vectors. Specifically, given a source code matrix with $n$ rows in total, the $i$-th row of the matrix is denoted $r_i$, and the vector $v_d$ representing the transformed source code is generated as follows:
$$v_d = \frac{1}{n}\sum_{i=1}^{n} r_i$$
In the above manner, each source code fragment can be represented as a vector, i.e., a code feature vector, which represents the features of the source code. A minimal end-to-end sketch of this pipeline is given below.
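For illustration only, a minimal end-to-end sketch of this embedding-and-averaging pipeline, assuming gensim's Word2Vec in skip-gram mode and a naive whitespace tokenizer; the tokenizer, the toy fragments, and the helper name are assumptions made for the example.

```python
# Skip-gram embedding plus row-averaging, assuming gensim's Word2Vec;
# the whitespace tokenizer and toy corpus are illustrative only.
import numpy as np
from gensim.models import Word2Vec

fragments = [
    "int add ( int a , int b ) { return a + b ; }",
    "int sum ( int x , int y ) { return x + y ; }",
]
corpus = [fragment.split() for fragment in fragments]  # naive tokenization

# sg=1 selects the skip-gram objective; vector_size is the dimension d
# (100 here, matching the example value in the description above).
model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 sg=1, min_count=1, seed=0)

def code_feature_vector(tokens):
    # Stack the d-dimensional token vectors into an n x d matrix,
    # then average the rows: v_d = (1/n) * sum_i r_i.
    matrix = np.stack([model.wv[t] for t in tokens])
    return matrix.mean(axis=0)

print(code_feature_vector(corpus[0]).shape)  # (100,)
```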
(3) performing model training using the at least two training feature vectors obtained by the conversion.
The integrated classification model can be constructed from the training feature vectors obtained by the conversion above. However, if the training samples in the original training code set are unbalanced (that is, the number of training samples containing clone code differs greatly from the number without clone code, as is common in practice), the detection precision and accuracy of the trained integrated classification model are directly affected. Therefore, before constructing the integrated classification model, the training samples in the original training code set need to be balanced, so as to train a more accurate integrated classification model.
In a preferred implementation provided by the embodiment of the present application, before the integrated classification model is constructed, balance processing is performed on the positive training data and negative training data in the original training code set; the positive training data refers to cases in which clone code exists in the at least two versions of training source code (e.g., a pair of two versions of source code that are clones of each other), and the negative training data refers to cases in which no clone code exists (e.g., a pair that are not clones).
Preferably, the balance processing of the positive training data and negative training data in the original training code set is implemented as follows:
calculating the ratio of positive training data to negative training data in the original training code set; if the ratio of positive to negative training data is smaller than a target ratio, randomly selecting positive training data from the original training code set and adding it back into the set; and if the ratio of positive to negative training data is larger than a target ratio, randomly selecting negative training data from the original training code set and adding it back into the set.
For example, oversampling is used to balance the positive and negative training data in the original training code set. A target ratio p is first set, defined as the target proportion of the minority class of training data in the total amount of training data. The following two steps are then repeated (a sketch follows the list) until the proportion of minority-class training data in the total reaches the target ratio p:
the method comprises the following steps: randomly selecting training data belonging to a few classes in an original training code set;
step two: and adding the training data selected in the step one to the original training code set.
Step S102, the source codes of the at least two versions are respectively converted into corresponding code feature vectors.
In an embodiment of the present application, the source codes of the at least two versions are respectively converted into corresponding code feature vectors, and the following method is adopted:
for at least two versions of source code in the source code set, performing the following: extracting, based on a code character corpus, the code character units of the corpus contained in the source code; constructing semantic vectors of the code character units; generating a semantic matrix of the source code from the semantic vectors of the code character units; and converting the semantic matrix of the source code into a semantic vector that serves as the code feature vector corresponding to the source code.
It should be noted that the word embedding technique used in this step to convert the at least two versions of source code into corresponding code feature vectors is the same as the technique used in the integrated classification model training phase to convert the training source code into training feature vectors; for the specific implementation of this step, reference may be made to the corresponding description of that conversion in the training phase above, which is not repeated here.
Step S103, inputting the code feature vectors corresponding to the source codes of the at least two versions into the integrated classification model for clone detection to obtain a clone detection result.
As described above, on the basis of the code feature vectors corresponding to the at least two versions of source code obtained by the conversion in step S102, the code feature vectors are input into the integrated classification model obtained by the training above for clone detection, that is, to detect whether a code clone exists between the at least two versions of source code; after detection, the integrated classification model outputs a clone detection result indicating whether a code clone exists between them.
Preferably, the clone detection result output by the integrated classification model also carries the code clone type between the at least two versions of source code, wherein the code clone type comprises at least one of: text clone, token clone, syntactic clone, and semantic clone.
For example, the code feature vectors corresponding to two versions of source code to be detected are input into a random forest classification model. If, after detection by the model, the code similarity of the two versions is in [0.7, 1], the model outputs a clone detection result of text clone; if the similarity is in [0.5, 0.7], the model outputs token clone; and if the similarity is in [0, 0.5], the model outputs syntactic clone or semantic clone. A sketch of this mapping follows.
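For illustration only, the similarity-to-type mapping of this example expressed as a small helper; the function name and the assumption that the model exposes a similarity score in [0, 1] are illustrative, and half-open intervals resolve the shared boundary values.

```python
# Mapping from the model's similarity score to the reported clone type,
# following the example thresholds above; names are illustrative only.
def clone_type(similarity: float) -> str:
    if similarity >= 0.7:
        return "text clone"                   # similarity in [0.7, 1]
    if similarity >= 0.5:
        return "token clone"                  # similarity in [0.5, 0.7)
    return "syntactic or semantic clone"      # similarity in [0, 0.5)

print(clone_type(0.82))  # -> text clone
```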
To sum up, in the process of performing clone detection on at least two versions of source code, the code clone detection method extracts the feature information of each version, converts the source code into corresponding code feature vectors, and performs clone detection between the versions with the integrated classification model on the basis of these vectors, so that the loss of source-code feature information is reduced and the resulting code clone detection is more accurate and more effective.
The embodiment of the code clone detection device provided by the application is as follows:
in the above embodiments, a code clone detection method is provided, and correspondingly, a code clone detection device is also provided in the present application, which is described below with reference to the accompanying drawings.
Referring to fig. 3, a schematic diagram of an embodiment of a code clone detection device provided in the present application is shown.
Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to the corresponding description of the method embodiments provided above for relevant portions. The device embodiments described below are merely illustrative.
The application provides a code clone detection device, including:
a source code set obtaining unit 301, configured to obtain a source code set composed of at least two versions of source codes;
a code feature vector converting unit 302, configured to convert the source codes of the at least two versions into corresponding code feature vectors, respectively;
a clone detection unit 303, configured to input the code feature vectors corresponding to the source codes of the at least two versions into the integrated classification model for clone detection, so as to obtain a clone detection result.
Optionally, the code feature vector converting unit 302 is specifically configured to invoke the following subunits for at least two versions of source code in the source code set:
a code character unit extracting subunit, configured to extract, based on a code character corpus, a code character unit in the code character corpus included in the source code;
a semantic vector constructing subunit, configured to construct a semantic vector of the code character unit;
a semantic matrix generating subunit, configured to generate a semantic matrix of the source code according to the semantic vector of the code character unit;
and the semantic vector conversion subunit is used for converting the semantic matrix of the source code into a semantic vector as a code feature vector corresponding to the source code.
Optionally, the integrated classification model is obtained by performing the following unit training:
the training source code acquisition unit is used for acquiring training source codes of at least two versions in an original training code set;
the training feature vector conversion unit is used for respectively converting the training source codes of the at least two versions into corresponding training feature vectors;
and the model training unit is used for performing model training by using at least two training feature vectors obtained by conversion.
Optionally, the integrated classification model is further obtained by performing the following unit training:
the training data balance processing unit is used for carrying out balance processing on positive training data and negative training data in the original training code set;
wherein the positive training data refers to cases in which clone code exists in the training source codes of the at least two versions, and the negative training data refers to cases in which no clone code exists in the training source codes of the at least two versions.
Optionally, the training data balance processing unit is specifically configured to calculate a ratio of positive training data to negative training data in the original training code set;
if the ratio of the positive training data to the negative training data is smaller than a target ratio, randomly select positive training data from the original training code set and add it into the original training code set;
and if the ratio of the positive training data to the negative training data is larger than a target ratio, randomly select negative training data from the original training code set and add it into the original training code set.
Optionally, the training feature vector converting unit includes:
the extraction subunit is used for extracting the code character units in the code character corpus contained in the training source code based on the code character corpus;
the vector construction subunit is used for constructing the semantic vector of the code character unit;
a matrix generation subunit, configured to generate a semantic matrix of the training source code according to a semantic vector of a code character unit included in the training source code;
and the training feature vector generating subunit is used for converting the semantic matrix into a semantic vector as a training feature vector corresponding to the training source code.
Optionally, the clone detection result carries a code clone type between the at least two versions of the source code;
wherein the code clone type comprises at least one of: text clone, token clone, syntactic clone, and semantic clone.
Optionally, the integrated classification model comprises: a random forest classification model obtained by training with an ensemble learning method;
wherein the base learner of the random forest classification model adopts at least one of the following classification techniques: decision tree, naive Bayes, support vector machine, linear discriminant analysis, and k-nearest neighbor classifier;
and the integration method of the random forest classification model adopts at least one of the following: bagging, boosting, and stacking.
Optionally, the code feature vector converting unit 302 is configured to convert source codes of at least two versions in a source code set into corresponding code feature vectors by using a word embedding technique;
wherein the word embedding technique comprises at least one of: word vectorization and text vectorization.
The embodiment of the electronic equipment provided by the application is as follows:
in the above embodiment, a code clone detection method is provided, and in addition, the present application also provides an electronic device for implementing the code clone detection method, which is described below with reference to the accompanying drawings.
Referring to fig. 4, a schematic diagram of an electronic device provided in the present embodiment is shown.
The embodiments of the electronic device provided in the present application are described relatively briefly; for related parts, reference may be made to the corresponding descriptions of the embodiments of the code clone detection method provided above. The embodiments described below are merely illustrative.
The application provides an electronic device, including:
a memory 401 and a processor 402;
the memory 401 is configured to store computer-executable instructions, and the processor 402 is configured to execute the following computer-executable instructions:
acquiring a source code set consisting of at least two versions of source codes;
respectively converting the source codes of the at least two versions into corresponding code feature vectors;
and inputting the code feature vectors corresponding to the source codes of the at least two versions into an integrated classification model for clone detection to obtain a clone detection result.
Optionally, the converting the source codes of the at least two versions into corresponding code feature vectors respectively includes:
for at least two versions of source code in the set of source code, performing the following:
extracting, based on a code character corpus, the code character units of the corpus contained in the source code;
constructing a semantic vector of the code character unit;
generating a semantic matrix of the source code according to the semantic vector of the code character unit;
and converting the semantic matrix of the source code into a semantic vector as the code feature vector corresponding to the source code.
Optionally, the integrated classification model is obtained by training in the following way:
acquiring training source codes of at least two versions in an original training code set;
respectively converting the training source codes of the at least two versions into corresponding training feature vectors;
and performing model training by using at least two training feature vectors obtained by conversion.
Optionally, before model training is performed using the at least two training feature vectors obtained by the conversion, the processor 402 is further configured to execute the following computer-executable instructions:
carrying out balance processing on positive training data and negative training data in the original training code set;
wherein the positive training data refers to cases in which clone code exists in the training source codes of the at least two versions, and the negative training data refers to cases in which no clone code exists in the training source codes of the at least two versions.
Optionally, the balancing the positive training data and the negative training data in the original training code set includes:
calculating the ratio of positive training data to negative training data in the original training code set;
if the ratio of the positive training data to the negative training data is smaller than a target ratio, randomly selecting positive training data from the original training code set and adding it into the original training code set;
and if the ratio of the positive training data to the negative training data is larger than a target ratio, randomly selecting negative training data from the original training code set and adding it into the original training code set.
Optionally, the converting the training source codes of the at least two versions into corresponding training feature vectors respectively includes:
extracting, based on a code character corpus, the code character units of the corpus contained in the training source codes;
constructing a semantic vector of the code character unit;
generating a semantic matrix of the training source code according to a semantic vector of a code character unit contained in the training source code;
and converting the semantic matrix into a semantic vector as a training feature vector corresponding to the training source code.
Optionally, the clone detection result carries a code clone type between the at least two versions of the source code;
wherein the code clone type comprises at least one of: text clone, token clone, syntactic clone, and semantic clone.
Optionally, the integrated classification model comprises: a random forest classification model obtained by training with an ensemble learning method;
wherein the base learner of the random forest classification model adopts at least one of the following classification techniques: decision tree, naive Bayes, support vector machine, linear discriminant analysis, and k-nearest neighbor classifier;
and the integration method of the random forest classification model adopts at least one of the following: bagging, boosting, and stacking.
Optionally, in converting the source codes of the at least two versions into corresponding code feature vectors, a word embedding technique is adopted to convert the source codes of the at least two versions in the source code set into corresponding code feature vectors respectively;
wherein the word embedding technique comprises at least one of: word vectorization and text vectorization.
Although the present application has been described with reference to preferred embodiments, these are not intended to limit the present application; those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application, and the scope of protection of the present application should therefore be determined by the appended claims.
In a typical configuration, a computing device includes one or more processors, input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (10)

1. A code clone detection method, comprising:
acquiring a source code set consisting of at least two versions of source codes;
respectively converting the source codes of the at least two versions into corresponding code feature vectors;
inputting the code feature vectors corresponding to the source codes of the at least two versions into an integrated classification model for clone detection to obtain clone detection results;
the integrated classification model is obtained by performing model training by using training feature vectors corresponding to at least two versions of training source codes in an original training code set;
the method further comprises the following steps: and carrying out balance processing on positive training data and negative training data in the original training code set, wherein the balance processing comprises the following steps: adjusting the original training code set by comparing the ratio of the positive training data to the negative training data to a target ratio; wherein the positive training data refers to that clone codes exist in the training source codes of the at least two versions, and the negative training data refers to that clone codes do not exist in the training source codes of the at least two versions.
2. The method according to claim 1, wherein the converting the at least two versions of source code into corresponding code feature vectors respectively comprises:
for at least two versions of source code in the set of source code, performing the following:
extracting, based on a code character corpus, the code character units of the corpus contained in the source code;
constructing a semantic vector of the code character unit;
generating a semantic matrix of the source code according to the semantic vector of the code character unit;
and converting the semantic matrix of the source code into a semantic vector as the code feature vector corresponding to the source code.
3. The method according to claim 1, wherein the integrated classification model is obtained by performing model training using training feature vectors corresponding to at least two versions of training source codes in an original training code set, and comprises:
acquiring training source codes of at least two versions in an original training code set;
respectively converting the training source codes of the at least two versions into corresponding training feature vectors;
and performing model training by using at least two training feature vectors obtained by conversion.
4. The method according to claim 1, wherein the balancing of positive and negative training data in the original training code set comprises:
calculating the ratio of positive training data to negative training data in the original training code set;
if the ratio of the positive training data to the negative training data is smaller than a target ratio, randomly selecting positive training data from the original training code set and adding it into the original training code set;
and if the ratio of the positive training data to the negative training data is larger than a target ratio, randomly selecting negative training data from the original training code set and adding it into the original training code set.
5. The method according to claim 3, wherein the converting the at least two versions of the training source code into corresponding training feature vectors respectively comprises:
extracting, based on a code character corpus, the code character units of the corpus contained in the training source codes;
constructing a semantic vector of the code character unit;
generating a semantic matrix of the training source code according to a semantic vector of a code character unit contained in the training source code;
and converting the semantic matrix into a semantic vector as a training feature vector corresponding to the training source code.
6. The code clone detection method of any of claims 1 to 5, wherein said clone detection result carries a code clone type between said at least two versions of source code;
wherein the code clone type comprises at least one of: text clone, token clone, syntactic clone, and semantic clone.
7. The code clone detection method of claim 6, wherein said integrated classification model comprises: a random forest classification model obtained by training with an ensemble learning method;
wherein the base learner of the random forest classification model adopts at least one of the following classification techniques: decision tree, naive Bayes, support vector machine, linear discriminant analysis, and k-nearest neighbor classifier;
and the integration method of the random forest classification model adopts at least one of the following: bagging, boosting, and stacking.
8. The method according to claim 6, wherein the step of converting the source codes of the at least two versions into the corresponding code feature vectors respectively adopts a word embedding technique to convert the source codes of the at least two versions in the source code set into the corresponding code feature vectors respectively;
wherein the word embedding technique comprises at least one of: word vectorization and text vectorization.
9. A code clone detection device, comprising:
the source code set acquisition unit is used for acquiring a source code set consisting of at least two versions of source codes;
the code feature vector conversion unit is used for respectively converting the source codes of the at least two versions into corresponding code feature vectors;
the clone detection unit is used for inputting the code feature vectors corresponding to the source codes of the at least two versions into the integrated classification model for clone detection to obtain a clone detection result;
the integrated classification model is obtained by performing model training by using training feature vectors corresponding to at least two versions of training source codes in an original training code set;
the device further comprises: and carrying out balance processing on positive training data and negative training data in the original training code set, wherein the balance processing comprises the following steps: adjusting the original training code set by comparing the ratio of the positive training data to the negative training data to a target ratio; wherein the positive training data refers to that clone codes exist in the training source codes of the at least two versions, and the negative training data refers to that clone codes do not exist in the training source codes of the at least two versions.
10. An electronic device, comprising:
a memory and a processor;
the memory is to store computer-executable instructions, and the processor is to execute the computer-executable instructions to:
acquiring a source code set consisting of at least two versions of source codes;
respectively converting the source codes of the at least two versions into corresponding code feature vectors;
inputting the code feature vectors corresponding to the source codes of the at least two versions into an integrated classification model for clone detection to obtain clone detection results;
the integrated classification model is obtained by performing model training by using training feature vectors corresponding to at least two versions of training source codes in an original training code set;
the apparatus further comprises: and carrying out balance processing on positive training data and negative training data in the original training code set, wherein the balance processing comprises the following steps: adjusting the original training code set by comparing the ratio of the positive training data to the negative training data to a target ratio; wherein the positive training data refers to that clone codes exist in the training source codes of the at least two versions, and the negative training data refers to that clone codes do not exist in the training source codes of the at least two versions.
CN201811295180.5A 2018-11-01 2018-11-01 Code clone detection method and device and electronic equipment Active CN111124487B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811295180.5A CN111124487B (en) 2018-11-01 2018-11-01 Code clone detection method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811295180.5A CN111124487B (en) 2018-11-01 2018-11-01 Code clone detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111124487A CN111124487A (en) 2020-05-08
CN111124487B true CN111124487B (en) 2022-01-21

Family

ID=70494816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811295180.5A Active CN111124487B (en) 2018-11-01 2018-11-01 Code clone detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111124487B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035165B (en) * 2020-08-26 2023-06-09 山谷网安科技股份有限公司 Code clone detection method and system based on isomorphic network
CN112433756B (en) * 2020-11-24 2021-09-07 北京京航计算通讯研究所 Rapid code clone detection method and device based on weighted recursive self-encoder
CN112214419A (en) * 2020-12-09 2021-01-12 深圳开源互联网安全技术有限公司 Method and device for detecting similarity of component codes
CN112835620B (en) * 2021-02-10 2022-03-25 中国人民解放军军事科学院国防科技创新研究院 Semantic similar code online detection method based on deep learning
CN113220301A (en) * 2021-04-13 2021-08-06 广东工业大学 Clone consistency change prediction method and system based on hierarchical neural network
CN113704108A (en) * 2021-08-27 2021-11-26 浙江树人学院(浙江树人大学) Similar code detection method and device, electronic equipment and storage medium
CN113986345A (en) * 2021-11-01 2022-01-28 天津大学 Pre-training enhanced code clone detection method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101697121A (en) * 2009-10-26 2010-04-21 哈尔滨工业大学 Method for detecting code similarity based on semantic analysis of program source code
US8296759B1 (en) * 2006-03-31 2012-10-23 Vmware, Inc. Offloading operations to a replicate virtual machine
CN104077147A (en) * 2014-07-11 2014-10-01 东南大学 Software reusing method based on code clone automatic detection and timely prompting
CN104407872A (en) * 2014-12-04 2015-03-11 北京邮电大学 Code clone detection method
CN108491228A (en) * 2018-03-28 2018-09-04 清华大学 A kind of binary vulnerability Code Clones detection method and system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120159434A1 (en) * 2010-12-20 2012-06-21 Microsoft Corporation Code clone notification and architectural change visualization
JP2015056140A (en) * 2013-09-13 2015-03-23 アイシン・エィ・ダブリュ株式会社 Clone detection method and clone common function method
CN107608732B (en) * 2017-09-13 2020-08-21 扬州大学 Bug searching and positioning method based on bug knowledge graph
US10114624B1 (en) * 2017-10-12 2018-10-30 Devfactory Fz-Llc Blackbox matching engine
CN108170468B (en) * 2017-12-28 2021-04-20 中山大学 Method and system for automatically detecting annotation and code consistency
CN108171050A (en) * 2017-12-29 2018-06-15 浙江大学 The fine granularity sandbox strategy method for digging of linux container

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8296759B1 (en) * 2006-03-31 2012-10-23 Vmware, Inc. Offloading operations to a replicate virtual machine
CN101697121A (en) * 2009-10-26 2010-04-21 哈尔滨工业大学 Method for detecting code similarity based on semantic analysis of program source code
CN104077147A (en) * 2014-07-11 2014-10-01 东南大学 Software reusing method based on code clone automatic detection and timely prompting
CN104407872A (en) * 2014-12-04 2015-03-11 北京邮电大学 Code clone detection method
CN108491228A (en) * 2018-03-28 2018-09-04 清华大学 A kind of binary vulnerability Code Clones detection method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An efficient code clone detection model on Java byte code using hybrid approach; Kanika Raheja et al.; Confluence 2013: The Next Generation Information Technology Summit; 2014-06-16; pp. 16-21 *
移动网络中恶意代码优化检测仿真研究 [Simulation research on optimized detection of malicious code in mobile networks]; 芦天亮 et al.; 《计算机仿真》 [Computer Simulation]; 2017-08-15; pp. 377-381 *

Also Published As

Publication number Publication date
CN111124487A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
CN111124487B (en) Code clone detection method and device and electronic equipment
Bui et al. Infercode: Self-supervised learning of code representations by predicting subtrees
US11106714B2 (en) Summary generating apparatus, summary generating method and computer program
US8457950B1 (en) System and method for coreference resolution
Xu et al. Post2vec: Learning distributed representations of Stack Overflow posts
CN110674297B (en) Public opinion text classification model construction method, public opinion text classification device and public opinion text classification equipment
CN110750297B (en) Python code reference information generation method based on program analysis and text analysis
CN110688540B (en) Cheating account screening method, device, equipment and medium
CN114416926A (en) Keyword matching method and device, computing equipment and computer readable storage medium
Savci et al. Comparison of pre-trained language models in terms of carbon emissions, time and accuracy in multi-label text classification using AutoML
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
Vu et al. Revising FUNSD dataset for key-value detection in document images
CN112417147A (en) Method and device for selecting training samples
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
Eppa et al. Source code plagiarism detection: A machine intelligence approach
Schirmer et al. A new dataset for topic-based paragraph classification in genocide-related court transcripts
Vu-Manh et al. Improving Vietnamese dependency parsing using distributed word representations
Fokam et al. Influence of contrastive learning on source code plagiarism detection through recursive neural networks
CN115495636A (en) Webpage searching method, device and storage medium
CN110968691B (en) Judicial hotspot determination method and device
CN113420127A (en) Threat information processing method, device, computing equipment and storage medium
Tang et al. Interpretability rules: Jointly bootstrapping a neural relation extractorwith an explanation decoder
Eppa et al. Machine Learning Techniques for Multisource Plagiarism Detection
Zaikis et al. DACL: A Domain-Adapted Contrastive Learning Approach to Low Resource Language Representations for Document Clustering Tasks
CN111126066A (en) Method and device for determining Chinese retrieval method based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant