CN112215013B - Clone code semantic detection method based on deep learning - Google Patents

Clone code semantic detection method based on deep learning

Info

Publication number
CN112215013B
CN112215013B (application CN202011205774.XA)
Authority
CN
China
Prior art keywords
tpe
code
clone
token
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011205774.XA
Other languages
Chinese (zh)
Other versions
CN112215013A (en
Inventor
Cheng Xiaoyun
Wang Jianrong
Wang Zan
Jia Yongzhe
Ma Guoning
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Thai Technology Co ltd
Tianjin University
Original Assignee
Tianjin University
Tianjin Thai Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University, Tianjin Thai Technology Co ltd
Priority to CN202011205774.XA priority Critical patent/CN112215013B/en
Publication of CN112215013A publication Critical patent/CN112215013A/en
Application granted granted Critical
Publication of CN112215013B publication Critical patent/CN112215013B/en
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semantic clone detection method based on deep learning. For a given pair of code blocks, each block is first preprocessed into a sequence of TPE basic units; word embedding is then applied to these units, and the embedded sequences are fed into a BiLSTM module combined with contextual features. Next, a self-attention mechanism concentrates the useful clone-related information learned by the neural network. Each code fragment is converted into a vector representation, the Euclidean distance between the vectors is computed as the classification feature, and the pair is classified into one of two classes, i.e. predicted as clone or non-clone: if two code blocks are similar, the vectors they produce through the neural network should also be similar. Compared with the prior art, the invention saves more time while capturing rich syntactic and semantic information; TPE also avoids the out-of-vocabulary (OOV) problem.

Description

Clone code semantic detection method based on deep learning
Technical Field
The invention relates to the fields of program analysis and machine learning, and in particular to a method for representing source code and detecting semantic clones.
Background
Clone code is duplicated or similar code. Code clones are commonly classified into four types: 1) type-one clones are two code segments that are identical except for differences in whitespace, formatting, and comments; 2) type-two clones are two code segments that are completely identical except for differences in identifiers, constants, and variable types; 3) type-three clones are two code segments in which the copied segment has been modified by a few statements, such as changing, adding, or deleting a few statements; 4) type-four clones are mainly associated with functional similarity, whereas the first three types are primarily related to textual similarity. The presence of code clones unnecessarily increases program size; changes to a code segment also require modifications to its clones, increasing maintenance effort; and copying code segments that contain errors leads to error propagation. Detecting code clones therefore helps to reduce software maintenance costs and prevent errors.
Among the various methods of detecting code clones, semantic clones are rarely detected because they are the most difficult to detect: they include clones that differ in syntax but still perform the same function. It is therefore necessary to propose a method for efficiently detecting semantic code clones.
One key issue in semantic clone detection is how to effectively learn a representation of the source code that captures its semantics. Tokens and abstract syntax trees (ASTs) are commonly used to detect code clones. However, Tokens cannot learn the semantic information contained in the code structure well, which is insufficient for the semantic clone detection task. Recent semantic clone detection efforts that use ASTs to represent code together with syntactic information have proven effective, but they are less efficient, because the AST of code is typically more complex than the parse tree of text. Code clone detection must consider not only accuracy but also efficiency.
In the code clone detection task, it is very common to use Tokens as the basic segmentation method for representing program source code. The normalized Token vocabulary is so small (typically no more than 300 distinct normalized tokens) that the learned vocabulary (external knowledge obtained by pre-training) has limited capacity, making external pre-training ineffective for neural models. The semantic clone detection task therefore typically employs unnormalized tokens, but Tokens may still fail to capture rich semantic information, especially when programs use meaningless variable names.
To extract more information from pre-training, a straightforward approach is to enlarge the input vocabulary. Sentence-level segmentation might be a natural choice, but due to the diversity of sentences its vocabulary would be infinite; it is not possible to train a vocabulary containing all possible sentence representations. Input statements may therefore be encountered for which no vector representation exists in the vocabulary, which is known as the OOV (out-of-vocabulary) problem. The OOV problem severely limits the effectiveness of code representation.
Disclosure of Invention
The invention aims to solve the problem of detecting semantic code clones in programs, and provides a semantic clone detection method based on deep learning.
The invention relates to a clone code semantic detection method based on deep learning, which specifically comprises the following processes:
step 1, determining the basic unit of the TPE (Token Pair Encoding) code representation for the semantic clone detection task, wherein the TPE generation process is as follows: first, each piece of code in the input corpus is cut into a Token sequence and the resulting Tokens initialize the vocabulary vocab; all adjacent Token pairs (binary groups) appearing in the current corpus are then counted, sorted, and marked; the Token pair with the highest combination frequency in the corpus identifies a new basic unit, which is added to the vocabulary, and the corpus is regenerated with the merged pair treated as a single new Token; this process is carried out iteratively, continually searching for higher-frequency Token combinations and updating the vocabulary; after the final vocabulary is obtained, code sentences are divided into TPE units according to the vocabulary using a backward maximum matching method;
obtaining TPE basic units for different programming languages by applying the TPE algorithm to the selected corpus;
step 2, establishing a neural network model suited to code clone detection: the TPE basic units obtained in step 1 are pre-trained with a Skip-Gram model to generate a vocabulary in the corresponding TPE-unit-to-word-vector format; the discrete sequences over this vocabulary are converted into continuous vector representations, and a standard BiLSTM model is implemented and trained; the vector representations of the TPE basic units learned by the BiLSTM model are placed into a matrix, which is multiplied by a weight matrix to obtain a vector of fixed dimension; by continually learning and updating the weight matrix, the weight of each TPE unit vector within the whole sentence is captured, yielding the vector representation of the whole code method; the specific formula is as follows:
$$s_t = v^{\top}\bar{h}_t,\qquad a_t = \frac{\exp(s_t)}{\sum_{j}\exp(s_j)},\qquad h_{CODE} = \sum_{t} a_t\,\bar{h}_t$$

where v denotes the learned parameter vector and ⊤ the transpose; each element s_t represents the importance of the sequence node at position t; \bar{h}_t denotes the hidden-layer output of the BiLSTM; a_t is the attention weight of the element at position t over the entire sequence; h_{CODE} is the final code vector representation; j is the summation index and t the current position;
step 3, designing a Siamese framework for clone-pair/non-clone-pair classification, specifically: the two vectors converted from two code blocks are given as input, the vector representation of each code block being obtained after Skip-Gram pre-training and encoding; the difference between the two output vectors, computed as the Euclidean distance between them, serves as the classification feature; vector pairs whose Euclidean distance is small are clone pairs, yielding the final clone/non-clone prediction.
Compared with the prior art, the invention has the following beneficial effects:
(1) the new source-code representation basic unit, TPE, saves more time than AST-based representations while capturing rich syntactic and semantic information;
(2) TPE also avoids the out-of-vocabulary (OOV) problem.
Drawings
FIG. 1 is an overall flow chart of a semantic clone detection method based on deep learning according to the present invention;
FIG. 2 is an exemplary diagram of TPE units generated using the TPE algorithm;
FIG. 3 is a structural diagram of the BiLSTM network.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
FIG. 1 shows the overall flowchart of the semantic clone detection method based on deep learning of the present invention. The process specifically comprises the following steps:
Step 1, determining the TPE basic units for code representation in the clone code detection task: latent syntactic/semantic information is learned from a large-scale raw code corpus through pre-trained embeddings, and an embedding matrix with stronger expressive capacity is pre-trained based on the vocabulary generated by TPE;
the TPE basic unit generation process is as follows: firstly, each code in an input corpus is cut into a Token sequence, the obtained Token is used for initializing a vocabulary vocab, the combination of all binary units Token appearing in the corpus is carried out, then all Token binary units in the current corpus are counted, the binary units Token are sorted and marked, then the Token with the highest iteration combination frequency in the corpus is used for identifying a new basic unit, the newly obtained unit is added into the vocabulary, the newly generated corpus is composed of the newly added binary units Token, the TPE regards the binary units Token as the new Token, the process is carried out iteratively, and the vocabulary is updated by continuously iteratively searching for the Token combination with higher frequency. After obtaining a final vocabulary table, dividing the code sentence into TPE units according to the obtained vocabulary table by utilizing a backward maximum matching method (for example, one code sentence is ABC, two pointers are arranged to respectively point to a first character A and a last character C of the sentence, firstly, whether the basic unit of ABC exists in the vocabulary table is searched, if so, the whole program sentence ABC is regarded as a TPE unit, if not, the first pointer is moved backwards, whether BC is in the vocabulary table is searched, if not, the process is continued, and finally, the whole code sentence is expressed as a TPE unit in the vocabulary table;
as shown in fig. 2, an exemplary diagram of generating TPE units using TPEs is used to demonstrate two iterations of the TPE algorithm. The dashed arrow labeled (iv) is the process of statistics of Token in the segmented corpus. The dashed arrows labeled c represent the frequency of occurrence of each marker after statistical data is obtained, and then the vocabulary is updated to incorporate the newly identified Token combinations. As shown. The dashed arrow labeled as (c) indicates that the last step of the iteration is to update the corpus with the newly obtained dyad Token units. V0A vocabulary obtained from the original code snippet is shown.
TPE (Token Pair Encoding) is an innovative segmentation method that constructs a new code representation using Tokens as basic components. TPE units carry rich code information, which helps to better exploit the advantages of deep-learning-based clone detection. At the same time, TPE avoids the out-of-vocabulary (OOV) problem. (A sketch of the backward maximum matching segmentation follows.)
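A matching sketch of the backward maximum matching segmentation (the ABC example above), assuming, as in the merge sketch, that vocabulary entries are concatenated Token strings; all names are illustrative.

```python
def backward_max_match(tokens, vocab):
    """Segment a Token sequence into TPE units, preferring the longest
    vocabulary match that ends at the current right boundary."""
    units, end = [], len(tokens)
    while end > 0:
        start = 0
        # Try the longest candidate tokens[start:end] first; if it is not in
        # the vocabulary, move the left pointer backward (toward the end).
        while start < end - 1 and "".join(tokens[start:end]) not in vocab:
            start += 1
        # A single Token always falls through, since base Tokens are in vocab.
        units.append("".join(tokens[start:end]))
        end = start
    units.reverse()  # matching ran right-to-left; restore left-to-right order
    return units
```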
Step 2, establishing a neural network model suitable for code clone detection, and training:
before training, pre-training by adopting a Skip-Gram model: firstly, obtaining TPE basic units of different languages by using a TPE algorithm according to a selected corpus; then, the TPE units are pre-trained by using a Skip-Gram model to obtain a vocabulary of the TPE units in a word vector representation format.
The principle of the Skip-Gram model is to predict the context given a central word, adjusting the central word's vector according to the predictions of the surrounding words. In Skip-Gram, each word is influenced by its surrounding words, and each word is predicted and adjusted when it serves as the central word; this repeated adjustment and prediction makes the learned word vectors more accurate. A Skip-Gram model is therefore selected as the training model for the TPE unit vectors, and the fastText tool is used to train it and generate the vocabulary of code basic units and their vectors. (Settings: learning rate lr = 0.025; word-vector dimension dim = 100; context window size ws = 5, the default; epoch and minimum word frequency minCount at their default value of 5; number of negative samples neg = 5, the default; loss function loss = ns; number of subword buckets bucket = 2,000,000; maximum and minimum character n-gram lengths maxn = 6 and minn = 3, the defaults; number of threads thread = 4. fastText finally generates two files, with suffixes .vec and .bin; the .bin file is a binary containing the model parameters and all hyper-parameters.)
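For illustration, the same training step expressed through the fastText Python API; the corpus file name and output path are assumptions, and the parameters mirror the settings listed above.

```python
import fasttext

# One TPE-segmented code method per line in corpus.txt (assumed file name).
model = fasttext.train_unsupervised(
    "corpus.txt",
    model="skipgram",
    lr=0.025, dim=100, ws=5, epoch=5,
    minCount=5, neg=5, loss="ns",
    bucket=2000000, minn=3, maxn=6, thread=4,
)
model.save_model("tpe_units.bin")        # binary model with all hyper-parameters
vec = model.get_word_vector("someUnit")  # 100-dimensional vector for a TPE unit
```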
TPE basic units for different languages are learned from the selected initial corpus following the idea of the BPE algorithm, and the TPE units are pre-trained with the Skip-Gram model to obtain the TPE-unit-to-word-vector vocabulary. Combined with this vocabulary, code sequences represented as TPE units are converted into the corresponding vector representations.
Preprocessing is performed: blank lines and comments are deleted from the source code, and the source code is represented as a sequence of TPE units combined according to the backward maximum matching algorithm. The preprocessed source code is looked up in the pre-trained vocabulary to obtain the specific vector representation of each TPE unit. After this preprocessing step, the source code has been converted into a sequence of TPE units. This series of discrete units must be converted into a continuous vector representation, so a standard BiLSTM model is implemented and trained, enhancing the TPE-embedding clone detection capability. The LSTM is a widely used network for encoding the sequential input of TPE units; the BiLSTM is a bidirectional extension of the LSTM that adds a right-to-left pass. The hyper-parameters of the model were determined through preliminary experiments: the hidden-layer dimension is set to 100; dropout is applied on the input embedding layer and on the BiLSTM hidden layer with a proportion of 0.33; parameters are optimized with the Adam algorithm with an initial learning rate of 5×10⁻⁴, a gradient clipping threshold of 5, and a minimum batch size of 32. The bidirectional long short-term memory network learns the semantic information of the element at each position in the sequence while recording both forward and reverse sequence information in the learned vectors. On this basis, the hidden-layer vectors generated by the bidirectional network are condensed through a global pooling layer, and self-attention pooling is used to achieve this goal. Using a layer of attention-based neural network, each Java method is converted by weighted summation into a vector that can be compared with others. (A PyTorch sketch of this encoder is given below.)
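A minimal PyTorch sketch of the encoder just described; the class name and argument names are illustrative, and the exact layer arrangement beyond the stated hyper-parameters is an assumption.

```python
import torch
import torch.nn as nn

class TPEEncoder(nn.Module):
    def __init__(self, embedding_matrix, hidden_size=100, dropout=0.33):
        super().__init__()
        # embedding_matrix: (vocab_size, 100) tensor from the fastText pre-training.
        self.embed = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        self.drop = nn.Dropout(dropout)  # dropout on the input embedding layer
        self.bilstm = nn.LSTM(embedding_matrix.size(1), hidden_size,
                              batch_first=True, bidirectional=True)

    def forward(self, unit_ids):         # unit_ids: (batch, seq_len) TPE unit indices
        x = self.drop(self.embed(unit_ids))
        h, _ = self.bilstm(x)            # h: (batch, seq_len, 2 * hidden_size)
        return h                         # hidden states for the attention pooling
```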
Using the self-attention mechanism, the vector representations of the TPE units learned by the BiLSTM are placed into a matrix, which is multiplied by a weight matrix to obtain a vector of fixed dimension; by continually learning and updating the weight matrix, the weight of each TPE unit vector within the whole sentence is captured, yielding the vector representation of the whole code method. The specific calculation formula is as follows:

$$s_t = v^{\top}\bar{h}_t,\qquad a_t = \frac{\exp(s_t)}{\sum_{j}\exp(s_j)},\qquad h_{CODE} = \sum_{t} a_t\,\bar{h}_t$$

where v denotes the learned parameter vector and ⊤ the transpose; each element s_t, expressed numerically, represents the importance of \bar{h}_t, i.e. the weight of the sequence node vector at each position within the whole sentence; \bar{h}_t denotes the hidden-layer output of the BiLSTM; the attention weight a_t of the position-t element over the entire sequence is computed from \bar{h}_t, normalized, and continually updated as the model trains; h_{CODE} is the final code vector representation; j is the summation index and t the current position.
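The pooling formula above, transcribed directly as a PyTorch module; realizing the learned vector v as a bias-free linear layer is an implementation assumption.

```python
import torch
import torch.nn as nn

class SelfAttentionPooling(nn.Module):
    def __init__(self, dim):                    # dim = 2 * hidden_size for a BiLSTM
        super().__init__()
        self.v = nn.Linear(dim, 1, bias=False)  # the parameter vector v

    def forward(self, h):                       # h: (batch, seq_len, dim)
        scores = self.v(h).squeeze(-1)          # s_t = v^T h_t
        a = torch.softmax(scores, dim=1)        # a_t = exp(s_t) / sum_j exp(s_j)
        h_code = torch.sum(a.unsqueeze(-1) * h, dim=1)  # h_CODE = sum_t a_t h_t
        return h_code                           # (batch, dim): one vector per method
```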
the bidirectional long and short memory neural network learns all information of the forward sequence and the reverse sequence, can capture the semantic meaning and the time sequence information of the sequence more effectively, and is more sufficient and accurate in prediction compared with the common long and short memory neural network.
FIG. 3 shows the structure of the BiLSTM network. The bidirectional long short-term memory network (BiLSTM) is composed of two ordinary long short-term memory networks (LSTMs): the forward LSTM uses information from past time steps, and the reverse LSTM uses information from future time steps. The BiLSTM recursively computes the hidden output vectors as follows:
$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$$
$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where x_1, x_2, …, x_n are the inputs, h_1, h_2, …, h_n are the outputs, and the other variables such as W and b are model parameters. The right-to-left direction simply performs the same computation in the opposite order.
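For concreteness, the six gate equations transcribed as a single LSTM step in NumPy; the weight shapes and the concatenation [h_{t-1}, x_t] follow the equations above, and all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)         # forget gate
    i_t = sigmoid(W_i @ z + b_i)         # input gate
    c_tilde = np.tanh(W_c @ z + b_c)     # candidate cell state
    o_t = sigmoid(W_o @ z + b_o)         # output gate
    c_t = f_t * c_prev + i_t * c_tilde   # elementwise product is the ⊙ above
    h_t = o_t * np.tanh(c_t)             # hidden output
    return h_t, c_t
```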
Step 3, designing a Siamese framework for clone-pair/non-clone-pair classification, which specifically comprises: comparing the code representations of the input fragment pair, computing the difference between the representations as the classification feature, and finally making the clone/non-clone prediction from that feature. The code clone detection problem is formalized as a supervised binary classification task: the two vectors converted from two code blocks are given as input, the difference between the two output vectors is computed using the Euclidean distance between them as the classification feature, and vector pairs with a small Euclidean distance are clone pairs. If two blocks form a clone pair, their label is set to 1; if not, their label is set to 0. The probability that a given input pair is a clone or a non-clone is ultimately obtained.
The main idea of the Siamese network is to map the inputs into a target space through a function and to compare similarity in that space using a simple distance. Code clone detection is thereby converted into a binary classification problem: a simple and effective BiLSTM model is proposed, and discrete code fragments are converted into low-dimensional continuous vector representations through the BiLSTM. A Siamese architecture is designed in which the two BiLSTM sub-networks have the same structure and share weights (see the sketch below).
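A sketch of the Siamese arrangement described above: both code blocks pass through the same encoder with shared weights, and the Euclidean distance between the two vectors is the classification feature. Mapping the distance to a clone probability through a small logistic layer is an assumption; the patent specifies only that the distance serves as the feature.

```python
import torch
import torch.nn as nn

class SiameseCloneDetector(nn.Module):
    def __init__(self, encoder, pooling):
        super().__init__()
        self.encoder = encoder     # one BiLSTM encoder shared by both branches
        self.pooling = pooling     # the self-attention pooling defined earlier
        self.out = nn.Linear(1, 1) # assumed logistic head over the distance

    def encode(self, unit_ids):
        return self.pooling(self.encoder(unit_ids))  # one vector per code block

    def forward(self, ids_a, ids_b):
        va, vb = self.encode(ids_a), self.encode(ids_b)
        dist = torch.norm(va - vb, p=2, dim=1, keepdim=True)  # Euclidean distance
        return torch.sigmoid(self.out(dist))  # probability that the pair is a clone

# Training sketch with the Adam settings given above (label 1 = clone, 0 = non-clone):
# model = SiameseCloneDetector(TPEEncoder(emb), SelfAttentionPooling(200))
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# loss = nn.BCELoss()(model(batch_a, batch_b).squeeze(1), labels)
```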
The invention adopts the Siamese framework and a standard BiLSTM classification model for code clone detection, improving efficiency while ensuring the effectiveness of clone detection. Rich syntactic and semantic information is obtained through the TPE units, enabling effective semantic clone detection.
The embodiments of the present invention are described below:
step 1, data selection and pretreatment:
1-1, preparing a data set, the data set comprising:
(1) BigCloneBench is one of the widely used evaluation benchmarks for clone detection, and many code clone detection tools use this dataset for evaluation. The BigCloneBench dataset was created for clone code in the Java language; the old version contains 6 million labeled true clone pairs and 260,000 labeled false clone pairs, covering 10 functionalities. The new version contains over 8 million Java code pairs labeled as clones (most of which are type-three and type-four clones) and 279,032 pairs labeled as non-clones. For the BigCloneBench dataset, 20,000 function pairs are selected from each type for the experiments; for types with fewer than 20,000 pairs, all pairs are used.
(2) OJClone, another code clone evaluation benchmark, was created for C-language programs. OJClone is based on an online programming Open Judge System, often referred to as an OJ system [8]; it is constructed by selecting 104 programming problems from the OJ system and treating code fragments submitted by different people for the same problem as clone pairs. The OJClone dataset does not explicitly specify clone types, but it is generally accepted that most OJClone clone pairs are type three or type four. 500 programs were selected from each of the first 15 programming problems of OJClone. The same programming problem yields 124,750 clone pairs, and different programming problems combine into more than 28 million non-clone pairs. 50,000 function pairs were randomly selected, with a clone to non-clone ratio of 1:14.
(3) Google Code Jam (GCJ) is an international online programming competition held by Google every year. The competition comprises a series of algorithmic problems that must be solved within a specified time; participants may answer using any programming language and development environment of their choice. Each submission for the same problem is implemented by a different programmer, and Google verifies its correctness, so solutions to the same problem should be functionally similar. 1,665 Java functions covering 12 problems were selected from the 2016 competition, forming about 270,000 clone pairs and 1 million non-clone pairs; finally, 50,000 function pairs were randomly selected, with a clone to non-clone ratio of 1:4.
1-2. For the data selected in step 1-1, functions are represented by TPE basic units using the aforementioned TPE algorithm. Separate TPE vocabularies were trained for the Java and C languages.
Step 2, dividing the training and test sets: each dataset was randomly divided into three parts of 60%, 20%, and 20% for training, validation, and testing, respectively (a sketch of this split follows).
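A minimal sketch of this 60/20/20 split using scikit-learn; the variable pairs holding the labeled function pairs and the fixed random seed are illustrative.

```python
from sklearn.model_selection import train_test_split

# pairs: list of (code_a, code_b, label) tuples for one dataset (assumed structure).
train, rest = train_test_split(pairs, train_size=0.6, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)  # 20% / 20%
```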
Step 3, training with the model:
The preprocessed TPE basic unit vector representations are input into the first-layer LSTM units to obtain features capturing the influence of the preceding unit on each TPE basic unit, and into the second-layer LSTM units to obtain the influence of the following unit on it. The outputs of the first-layer and second-layer LSTMs are then spliced and combined. Through training, the output feature vector contains the contextual information of the code units as well as their sequence information.
The bidirectional long short-term memory network learns the semantic information of the element at each position in the sequence, recording forward and reverse sequence information in the learned vectors. On this basis, the hidden-layer vectors generated by the bidirectional network are condensed through a global pooling layer, and self-attention pooling is used to achieve this goal. Using a layer of attention-based neural network, each Java method is converted by weighted summation into a vector that can be compared with others. At this point, the original plain code text has been converted into a numeric vector, and the difference between two such vectors is then calculated using the Euclidean distance formula.
The present invention was compared with two state-of-the-art clone detection methods, TBCCD and ASTNN, in terms of precision (P), recall (R), F1 score (F1), data processing time (data-time), and test time (test-time).
Table 1 shows the results of testing the prior-art ASTNN and TBCCD models and the TPE model on the BigCloneBench dataset.
TABLE 1
[Table 1 is rendered as an image in the original publication; it reports P, R, F1, data-time, and test-time for ASTNN, TBCCD, and the TPE model on BigCloneBench.]
It can be seen that on the BigCloneBench dataset, the model of the present invention achieves a higher F1 score than both of the other recent semantic clone detection tools. In particular, representing code by TPE basic units speeds up data processing (data-time) by nearly 2.5 times compared with the two tools that represent code by ASTs.
In terms of model detection speed, the standard BiLSTM model of the invention is nearly 3 times more efficient than the two baseline models. This is because TBCCD uses a tree-based convolution model with max-pooling, and ASTNN uses RvNN and RNN models for statement and code encoding, whereas the invention uses only a simple BiLSTM model; the BiLSTM learns all the information of the forward and reverse sequences and can strongly capture the semantics and temporal information of the sequence.
Table 2 lists the detection results of the model of the invention, ASTNN, and TBCCD on the OJClone dataset.
TABLE 2
[Table 2 is rendered as an image in the original publication; it reports the same metrics on OJClone.]
The detection effect of the invention on the OJClone dataset is still better than that of TBCCD, but the model does not perform as well as ASTNN. This is because, when representing code, ASTNN first constructs an AST for each code fragment and decomposes the entire AST into small statement trees (trees composed of the AST nodes of a statement, with the statement node as root); it then applies a recursive encoder over the multi-way statement trees to capture statement-level lexical and syntactic information and obtain statement vectors; finally, the code representation is obtained through a recurrent neural network. The captured code information is relatively comprehensive, but at the cost of considerable time.
Table 3 shows the detection results of the TPE model, the ASTNN model, and the TBCCD model on the GCJ dataset.
TABLE 3
[Table 3 is rendered as an image in the original publication; it reports the same metrics on the GCJ dataset.]
On the GCJ dataset, the model of the invention still performs better than TBCCD and achieves an effect similar to that of ASTNN, with an obvious advantage in time: the processing times of ASTNN and TBCCD are longer than that of TPE because both represent code based on ASTs.

Claims (1)

1. A semantic clone detection method based on deep learning is characterized by comprising the following specific steps:
step 1, determining the basic unit of the TPE (Token Pair Encoding) code representation for the semantic clone detection task, wherein the TPE generation process is as follows: first, each piece of code in the input corpus is cut into a Token sequence and the resulting Tokens initialize the vocabulary vocab; all adjacent Token pairs (binary groups) appearing in the current corpus are then counted, sorted, and marked; the Token pair with the highest combination frequency in the corpus identifies a new basic unit, which is added to the vocabulary, and the corpus is regenerated with the merged pair treated as a single new Token; this process is carried out iteratively, continually searching for higher-frequency Token combinations and updating the vocabulary; after the final vocabulary is obtained, code sentences are divided into TPE units according to the vocabulary using a backward maximum matching method;
obtaining TPE basic units for different programming languages by applying the TPE algorithm to the selected corpus;
step 2, establishing a neural network model suited to code clone detection: the TPE basic units obtained in step 1 are pre-trained with a Skip-Gram model to generate a vocabulary in the corresponding TPE-unit-to-word-vector format; the discrete sequences over this vocabulary are converted into continuous vector representations, and a standard BiLSTM model is implemented and trained; the vector representations of the TPE basic units learned by the BiLSTM model are placed into a matrix, which is multiplied by a weight matrix to obtain a vector of fixed dimension; by continually learning and updating the weight matrix, the weight of each TPE unit vector within the whole sentence is captured, yielding the vector representation of the whole code method; the specific formula is as follows:
$$s_t = v^{\top}\bar{h}_t,\qquad a_t = \frac{\exp(s_t)}{\sum_{j}\exp(s_j)},\qquad h_{CODE} = \sum_{t} a_t\,\bar{h}_t$$

where v denotes the learned parameter vector and ⊤ the transpose; each element s_t represents the importance of the sequence node at position t; \bar{h}_t denotes the hidden-layer output of the BiLSTM; a_t is the attention weight of the element at position t over the entire sequence; h_{CODE} is the final code vector representation; j is the summation index and t the current position;
step 3, designing a Siamese framework for clone-pair/non-clone-pair classification, specifically: the two vectors converted from two code blocks are given as input, the vector representation of each code block being obtained after Skip-Gram pre-training and encoding; the difference between the two output vectors, computed as the Euclidean distance between them, serves as the classification feature; vector pairs whose Euclidean distance is small are clone pairs, yielding the final clone/non-clone prediction.
CN202011205774.XA 2020-11-02 2020-11-02 Clone code semantic detection method based on deep learning Active CN112215013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011205774.XA CN112215013B (en) 2020-11-02 2020-11-02 Clone code semantic detection method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011205774.XA CN112215013B (en) 2020-11-02 2020-11-02 Clone code semantic detection method based on deep learning

Publications (2)

Publication Number Publication Date
CN112215013A CN112215013A (en) 2021-01-12
CN112215013B true CN112215013B (en) 2022-04-19

Family

ID=74057987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011205774.XA Active CN112215013B (en) 2020-11-02 2020-11-02 Clone code semantic detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN112215013B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835620B (en) * 2021-02-10 2022-03-25 中国人民解放军军事科学院国防科技创新研究院 Semantic similar code online detection method based on deep learning
CN113220301A (en) * 2021-04-13 2021-08-06 广东工业大学 Clone consistency change prediction method and system based on hierarchical neural network
CN113656066B (en) * 2021-08-16 2022-08-05 南京航空航天大学 Clone code detection method based on feature alignment
CN113704108A (en) * 2021-08-27 2021-11-26 浙江树人学院(浙江树人大学) Similar code detection method and device, electronic equipment and storage medium
CN113986345B (en) * 2021-11-01 2024-05-07 天津大学 Pre-training enhanced code clone detection method
CN114780103B (en) * 2022-04-26 2022-12-20 中国人民解放军国防科技大学 Semantic code clone detection method based on graph matching network
CN115373737B (en) * 2022-07-06 2023-05-26 武汉大学 Code clone detection method based on feature fusion
CN117435246B (en) * 2023-12-14 2024-03-05 四川大学 Code clone detection method based on Markov chain model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012079230A1 (en) * 2010-12-15 2012-06-21 Microsoft Corporation Intelligent code differencing using code clone detection
CN107943516A (en) * 2017-12-06 2018-04-20 南京邮电大学 Cloned codes detection method based on LLVM
JP2018136900A (en) * 2017-02-24 2018-08-30 東芝情報システム株式会社 Sentence analysis device and sentence analysis program
CN110851176A (en) * 2019-10-22 2020-02-28 天津大学 Clone code detection method capable of automatically constructing and utilizing pseudo clone corpus
CN111552969A (en) * 2020-04-21 2020-08-18 中国电力科学研究院有限公司 Embedded terminal software code vulnerability detection method and device based on neural network
CN111639344A (en) * 2020-07-31 2020-09-08 中国人民解放军国防科技大学 Vulnerability detection method and device based on neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101780233B1 (en) * 2016-04-26 2017-09-21 고려대학교 산학협력단 Apparatus and method for deteting code cloning of software

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012079230A1 (en) * 2010-12-15 2012-06-21 Microsoft Corporation Intelligent code differencing using code clone detection
JP2018136900A (en) * 2017-02-24 2018-08-30 東芝情報システム株式会社 Sentence analysis device and sentence analysis program
CN107943516A (en) * 2017-12-06 2018-04-20 南京邮电大学 Cloned codes detection method based on LLVM
CN110851176A (en) * 2019-10-22 2020-02-28 天津大学 Clone code detection method capable of automatically constructing and utilizing pseudo clone corpus
CN111552969A (en) * 2020-04-21 2020-08-18 中国电力科学研究院有限公司 Embedded terminal software code vulnerability detection method and device based on neural network
CN111639344A (en) * 2020-07-31 2020-09-08 中国人民解放军国防科技大学 Vulnerability detection method and device based on neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"TECCD: A Tree Embedding Approach for Code Clone Detection";Yi Gao etc.;《2019 IEEE International Conference on Software Maintenance and Evolution》;20191205;全文 *
"函数级别结构化克隆与语义克隆的检测";杨燕鸣;《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》;20200215;全文 *

Also Published As

Publication number Publication date
CN112215013A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN112215013B (en) Clone code semantic detection method based on deep learning
CN110929030B (en) Text abstract and emotion classification combined training method
CN112270379B (en) Training method of classification model, sample classification method, device and equipment
US7035789B2 (en) Supervised automatic text generation based on word classes for language modeling
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN110851176B (en) Clone code detection method capable of automatically constructing and utilizing pseudo-clone corpus
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN112800776A (en) Bidirectional GRU relation extraction data processing method, system, terminal and medium
CN113420296A (en) C source code vulnerability detection method based on Bert model and BiLSTM
CN110442880B (en) Translation method, device and storage medium for machine translation
CN111651974A (en) Implicit discourse relation analysis method and system
CN113505225B (en) Small sample medical relation classification method based on multi-layer attention mechanism
CN112507337A (en) Implementation method of malicious JavaScript code detection model based on semantic analysis
CN117151220B (en) Entity link and relationship based extraction industry knowledge base system and method
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN115392252A (en) Entity identification method integrating self-attention and hierarchical residual error memory network
CN115688784A (en) Chinese named entity recognition method fusing character and word characteristics
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN114742069A (en) Code similarity detection method and device
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
CN114327609A (en) Code completion method, model and tool
CN116340507A (en) Aspect-level emotion analysis method based on mixed weight and double-channel graph convolution
CN115840815A (en) Automatic abstract generation method based on pointer key information
CN114969763A (en) Fine-grained vulnerability detection method based on seq2seq code representation learning
CN114896966A (en) Method, system, equipment and medium for positioning grammar error of Chinese text

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210512

Address after: 300072 Tianjin City, Nankai District Wei Jin Road No. 92

Applicant after: Tianjin University

Applicant after: Tianjin Thai Technology Co.,Ltd.

Address before: 300072 Tianjin City, Nankai District Wei Jin Road No. 92

Applicant before: Tianjin University

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant