CN116166321B - Code clone detection method, system and computer readable storage medium - Google Patents


Info

Publication number
CN116166321B
Authority
CN
China
Prior art keywords
source code
code
segmentation
detected
codes
Prior art date
Legal status
Active
Application number
CN202310457759.1A
Other languages
Chinese (zh)
Other versions
CN116166321A (en)
Inventor
陈晓莉
国毓芯
朱崇
赵祥廷
林建洪
Current Assignee
Zhejiang Ponshine Information Technology Co ltd
Original Assignee
Zhejiang Ponshine Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Ponshine Information Technology Co ltd filed Critical Zhejiang Ponshine Information Technology Co ltd
Priority to CN202310457759.1A
Publication of CN116166321A
Application granted
Publication of CN116166321B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G06F8/751 Code clone detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a code clone detection method, a system and a computer readable storage medium. The code clone detection method comprises the following steps: S1, collecting a source code data set, performing cluster analysis, and outputting class labels and label features of n classes of source codes; S2, sequentially processing the code to be detected to obtain segmentation matrices; S3, matching the segmentation matrices against the source codes to obtain the target source code class label corresponding to each segmentation matrix; S4, traversing all source code fragment matrices under the corresponding target source code class label to calculate cosine similarity with each segmentation matrix, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-N scores; S5, inputting the retained source code fragments and the code to be detected into an LSTM-DSSM network model to calculate similarity scores, and outputting the source code fragment with the highest similarity. The invention can effectively detect whether source code cloning exists.

Description

Code clone detection method, system and computer readable storage medium
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a code clone detection method, a code clone detection system and a computer readable storage medium.
Background
Source code cloning refers to the presence of two or more identical or similar source code fragments in a code base, and is a common phenomenon in software development. Source code cloning can improve developer efficiency to a certain extent, but it also easily introduces external vulnerabilities and causes a series of security problems.
Disclosure of Invention
Based on the foregoing deficiencies in the art, the present invention is directed to a code clone detection method, system and computer readable storage medium.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
a code clone detection method comprises the following steps:
S1, acquiring a source code data set, and sequentially performing fragment segmentation and data preprocessing on the source codes of the source code data set to obtain a source code fragment matrix data set; performing cluster analysis on the source code fragment matrices of the source code fragment matrix data set, and outputting class labels and label features of n classes of source codes; n is an integer greater than 1;
S2, sequentially performing fragment segmentation and data preprocessing on the code to be detected to obtain s/x segmentation matrices; wherein s represents the length of the code to be detected, and x represents the fragment segmentation length;
S3, matching the segmentation matrices against the n classes of source codes respectively to obtain the target source code class label corresponding to each segmentation matrix;
S4, traversing all source code fragment matrices under the corresponding target source code class label to calculate cosine similarity with each segmentation matrix, giving each segmentation matrix the weight x/s, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-N scores; wherein top-N denotes the first N entries in the descending order;
S5, inputting the source code fragments corresponding to the top-N scores and the code to be detected into an LSTM-DSSM network model, calculating the similarity scores between those source code fragments and the code to be detected, and outputting the source code fragment with the highest similarity.
Preferably, in step S5, the processing procedure of the LSTM-DSSM network model comprises:
S51, performing word segmentation on the input code based on a BERT model, and converting it through a token embedding layer to obtain the LSTM input;
S52, passing the LSTM input through an LSTM model to output a latent semantic vector;
S53, inputting the latent semantic vector corresponding to the source code fragment and the latent semantic vector corresponding to the code to be detected into a DSSM model, and calculating the similarity score between the source code fragment and the code to be detected.
As a preferred scheme, there are N DSSM models; the output of each DSSM model is connected to a fully connected layer, and a Softmax layer follows the fully connected layer so as to output the similarity proportion between the code to be detected and each source code fragment corresponding to the top-N scores.
Preferably, the data preprocessing comprises data cleaning, text word segmentation and vectorization, and matrices with missing values after vectorization are padded with 0.
As a preferred scheme, the text word segmentation uses the word segmentation library NLTK to tokenize English code.
Preferably, in step S3, the source code class with the largest attribution weight is selected, according to a voting method, as the target source code class label corresponding to the segmentation matrix.
In a preferred embodiment, in step S1, the cluster analysis uses a GMM clustering algorithm, and the GMM parameters are estimated by the EM algorithm.
Preferably, the value of N is 3-6.
The invention also provides a code clone detection system, which applies the code clone detection method according to any of the above schemes and comprises:
the acquisition module, which is used for acquiring the source code data set or the code to be detected;
the data processing module, which is used for sequentially performing fragment segmentation and data preprocessing on the source code data set or the code to be detected;
the cluster analysis module, which is used for performing cluster analysis on the source code fragment matrices of the source code fragment matrix data set and outputting the class labels and label features of the n classes of source codes;
the matching module, which is used for matching the segmentation matrices against the n classes of source codes respectively to obtain the target source code class label corresponding to each segmentation matrix;
the calculation and sorting module, which is used for traversing all source code fragment matrices under the target source code class label corresponding to each segmentation matrix to calculate cosine similarity, giving each segmentation matrix the weight x/s, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-N scores;
the detection module, which is used for inputting the source code fragments corresponding to the top-N scores and the code to be detected into the LSTM-DSSM network model, calculating their similarity scores, and outputting the source code fragment with the highest similarity.
The present invention also provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the code clone detection method according to any one of the above aspects.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention can effectively detect whether source code cloning or tampering exists; by inputting the code to be detected, the user can obtain the source code with the highest similarity;
(2) The invention recognizes that the bag-of-words (BOW) model used in the representation layer of the DSSM model loses word order information and context information, and therefore introduces an LSTM-DSSM network model, using the LSTM to capture long-range context features and word order information.
Drawings
FIG. 1 is a flow chart of a code clone detection method of embodiment 1 of the present invention;
FIG. 2 is a block diagram of the LSTM-DSSM network model of embodiment 1 of the invention;
FIG. 3 is a network architecture diagram of the LSTM fused with BERT in embodiment 1 of the invention;
FIG. 4 is a block diagram of a code clone detection system of embodiment 1 of the present invention.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, specific embodiments of the present invention will be described below with reference to the accompanying drawings. It is evident that the drawings in the following description are only examples of the invention, from which other drawings and other embodiments can be obtained by a person skilled in the art without inventive effort.
Example 1:
As shown in fig. 1, the code clone detection method of the present embodiment includes the following steps:
S1, constructing a source code repository. The construction process of the source code repository in this embodiment comprises the following steps:
S11, collecting a source code data set, and sequentially performing fragment segmentation and data preprocessing on each source code file in the source code data set to obtain a source code fragment matrix data set.
The fragment segmentation divides the source code into fragments at intervals of x lines of code, where x can be 100 lines, 200 lines, etc., and can be set according to the actual situation; the data preprocessing comprises data cleaning, text word segmentation and vectorization to obtain source code fragment matrices, and fragment matrices with missing values after vectorization are padded with 0.
The specific processes of data cleaning and vectorization in this embodiment may refer to the prior art and are not described here in detail. The text word segmentation uses the common word segmentation library NLTK to tokenize the English code, and the tokenization result is then vectorized to obtain a data set consisting of source code fragment matrices, i.e., the source code fragment matrix data set, as sketched below.
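For illustration only, a minimal sketch of step S11 is given below. It assumes NLTK's word_tokenize for the English code word segmentation and a TF-IDF vectorizer for the vectorization (the embodiment does not fix a particular vectorization method); the function names, the regular-expression cleaning rules and the parameter X_LINES are illustrative and not part of the claimed method.

```python
# Sketch of S11: fragment segmentation and preprocessing (assumptions: NLTK tokenization, TF-IDF vectorization).
import re
import numpy as np
from nltk.tokenize import word_tokenize               # requires nltk.download("punkt")
from sklearn.feature_extraction.text import TfidfVectorizer

X_LINES = 100  # fragment segmentation interval x (100 lines, 200 lines, etc., per the embodiment)

def split_into_fragments(source, x=X_LINES):
    """Cut a source file into fragments of x lines each."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + x]) for i in range(0, len(lines), x)]

def clean(fragment):
    """Illustrative data cleaning: strip comments and collapse whitespace."""
    fragment = re.sub(r"//.*|/\*.*?\*/", " ", fragment, flags=re.S)
    return re.sub(r"\s+", " ", fragment).strip()

def preprocess(fragments, vec=None):
    """Tokenize with NLTK, vectorize, and pad missing values with 0.
    Pass the vectorizer fitted on the repository when preprocessing the code to be
    detected, so that repository and detection matrices share the same dimensions."""
    tokenized = [" ".join(word_tokenize(clean(f))) for f in fragments]
    if vec is None:
        vec = TfidfVectorizer(token_pattern=r"\S+")
        matrix = vec.fit_transform(tokenized)
    else:
        matrix = vec.transform(tokenized)
    return np.nan_to_num(matrix.toarray(), nan=0.0), vec
```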
S12, to avoid the complexity and time cost of traversing the full source code repository with similarity calculations, this embodiment performs cluster analysis on the source code fragment matrix data set, and outputs n source code class labels, the n classes of label features, and the source code fragment matrices corresponding to the n labels, which together form the source code repository; n is an integer greater than 1.
The cluster analysis uses the Gaussian mixture model clustering algorithm with EM estimation (GMM-EM). A GMM is a linear combination of several Gaussian distribution functions; in theory a GMM can fit any type of distribution, so it is usually used when the data in one set contain several different distributions. The GMM parameters are estimated by the EM algorithm, which typically alternates two steps: the first step finds rough values of the estimated parameters, and the second step maximizes the likelihood function using the values from the first step.
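As an illustration only, the clustering of step S12 could be sketched with scikit-learn's GaussianMixture, which performs the EM estimation internally; treating the mixture means as the "label features" and the choice of covariance type are assumptions made for this sketch.

```python
# Sketch of S12: GMM-EM cluster analysis over the source code fragment matrices (assumption: scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

def build_repository(fragment_matrix, n_classes):
    """Cluster the fragment vectors into n classes; return labels and per-class label features."""
    gmm = GaussianMixture(n_components=n_classes, covariance_type="diag", random_state=0)
    class_labels = gmm.fit_predict(fragment_matrix)   # EM estimation happens inside fit
    label_features = gmm.means_                       # assumption: the class means serve as label features
    return class_labels, label_features, gmm
```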
S2, inputting a code A to be detected;
S3, fragment segmentation is carried out on the code A to be detected. Specifically, assuming the length of the code to be detected is s lines, the code is segmented at intervals of x lines to obtain a set of s/x segmentation fragments.
S4, data cleaning and data preprocessing are carried out on the set of s/x segmentation fragments to obtain s/x segmentation matrices. The specific processes of data cleaning and data preprocessing are the same as those applied to the source codes.
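Reusing the illustrative helpers from the sketch of step S11, steps S3 and S4 could look as follows; the input path and the variable repo_vec (the vectorizer fitted when the repository was preprocessed) are assumptions of the sketch.

```python
# Sketch of S3-S4: segment and preprocess the code A to be detected.
code_a = open("code_to_detect.java").read()              # illustrative input path
fragments_a = split_into_fragments(code_a, X_LINES)      # the s/x segmentation fragments
seg_matrices, _ = preprocess(fragments_a, vec=repo_vec)  # reuse the vectorizer fitted on the repository
s_lines = len(code_a.splitlines())
weight = X_LINES / s_lines                               # the x/s weight used later in S6
```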
S5, each segmentation matrix is compared against the class labels of the n classes of source codes, and source code class matching is carried out according to the n classes of label features to obtain the target class label t to which the segmentation matrix belongs. Specifically, following the voting method of machine learning, the source code class with the largest attribution weight is selected as the target class label of the segmentation matrix. The specific procedure of the machine-learning voting method may refer to the prior art and is not described here.
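One possible reading of this voting step, continuing the earlier sketch, is to take the GMM posterior probability of each class as its attribution weight and vote for the largest one; this interpretation is an assumption of the sketch, not a definition given by the embodiment.

```python
# Sketch of S5: source code class matching by voting (assumption: GMM posterior probabilities as class weights).
import numpy as np

def match_target_class(seg_row, gmm):
    """Return the target class label t with the largest attribution weight."""
    weights = gmm.predict_proba(seg_row.reshape(1, -1))[0]   # attribution weight of each class
    return int(np.argmax(weights))                           # voting: the largest weight wins

target_labels = [match_target_class(row, gmm) for row in seg_matrices]
```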
S6, all source code fragment matrices under the target class label t corresponding to each segmentation matrix are traversed to calculate cosine similarity.
Specifically, the cosine similarity calculation proceeds as follows: each cosine similarity value obtained for a segmentation matrix is given the weight x/s, the similarity score between the code to be detected and each source code is computed as the weighted average, the similarity scores are sorted in descending order, and the top-5 source code fragments (i.e., the first five in the ranking) are retained to form the filtered source code detection data set D.
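A minimal sketch of step S6 follows, under the same assumptions as the earlier sketches; the flat dictionary of per-fragment scores and the fragment identifiers are an illustrative data layout, not the embodiment's storage format.

```python
# Sketch of S6: x/s-weighted cosine similarity and top-5 filtering.
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def top5_candidates(seg_matrices, target_labels, repo_matrix, repo_labels, repo_ids, weight):
    """Score every repository fragment under the matched class and keep the five best."""
    scores = {}
    for seg_row, t in zip(seg_matrices, target_labels):
        mask = repo_labels == t
        for frag_vec, frag_id in zip(repo_matrix[mask], np.asarray(repo_ids)[mask]):
            # each segmentation matrix contributes its cosine similarity, weighted by x/s
            scores[frag_id] = scores.get(frag_id, 0.0) + weight * cosine(seg_row, frag_vec)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [frag_id for frag_id, _ in ranked[:5]]            # data set D: the top-5 fragments
```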
S7, the code A to be detected and the source code detection data set D output in step S6 are input into a pre-trained LSTM-DSSM network model, which outputs the source code fragment with the highest matching degree and traces the source code position.
Specifically, the LSTM-DSSM network model is an improvement on the DSSM algorithm. DSSM is a method commonly used in recommendation or retrieval systems to calculate text similarity; it has a three-layer structure from bottom to top: an input layer, a representation layer and a matching layer. Its principle is to take the two sentences whose similarity is to be calculated, map each into a space vector, convert it into a low-dimensional semantic vector with a DNN, and calculate the distance between the two semantic vectors with the cosine distance. The input layer converts the input text into a vector format that can be fed into a deep network; English scenarios are generally handled with word hashing, mainly letter-based n-grams, chiefly to reduce the dimensionality of the input vectors, while Chinese scenarios usually require word segmentation or a pre-trained Chinese BERT model. The representation layer is usually a complex deep learning network such as a CNN or RNN. The two representation branches finally enter the matching layer, where the similarity is calculated with cosine or a similar measure and the distance is output. The specific logical structure of the DSSM model may refer to the prior art and is not described here.
The DSSM algorithm model uses the bag-of-words (BOW) model in its representation layer. BOW is a simplifying assumption in natural language processing and information retrieval: text or paragraphs are treated as unordered collections of words, ignoring grammar and even word order, and thereby losing word order information and context information. The LSTM-DSSM network model of this embodiment therefore uses an LSTM to encode the text into a vector. The LSTM requires the text to be preprocessed: the sentence is segmented into words, and an embedding is obtained from a pre-trained BERT model mapping. An embedding is a way of converting discrete tokens into continuous vector representations; in a neural network, an embedding can both reduce the spatial dimensionality of discrete variables and represent them as vectors. The whole sentence is input into the LSTM, the LSTM output is trained, and the LSTM latent semantic vector is obtained; the latent semantic vector output by the LSTM is input into the DSSM model, and the subsequent logic is consistent with the DSSM model. The LSTM solves the problem of long-range context features and word order information.
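For illustration, the encoding and matching idea just described could be sketched in PyTorch as follows; the use of the Hugging Face transformers library, the bert-base-uncased checkpoint, the frozen BERT embeddings and the hidden size are assumptions of the sketch, not specifics of the embodiment.

```python
# Sketch of the LSTM-DSSM idea (assumptions: PyTorch, Hugging Face BERT for the token embedding layer).
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

class LstmDssmEncoder(nn.Module):
    """Encode a code snippet into an LSTM latent semantic vector."""
    def __init__(self, hidden_size=128, bert_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden_size, batch_first=True)

    def forward(self, text):
        tokens = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():                  # token embedding layer (kept frozen here for brevity)
            emb = self.bert(**tokens).last_hidden_state
        _, (h_n, _) = self.lstm(emb)           # latent semantic vector = final hidden state
        return h_n[-1]                         # shape: (1, hidden_size)

def dssm_similarity(encoder, code_a, fragment):
    """Cosine similarity between the two latent vectors, as in the DSSM matching layer."""
    return F.cosine_similarity(encoder(code_a), encoder(fragment)).item()
```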
As shown in fig. 2 and 3, the processing procedure of the LSTM-DSSM network model of the present embodiment includes:
S71, word segmentation is carried out on the input code based on the BERT model, and the LSTM input is obtained through the token embedding layer conversion; the input codes comprise the code A to be detected and the source code detection data set D, where the source code detection data set D comprises source code fragments D1, D2, D3, D4 and D5;
S72, the LSTM input is passed through the LSTM model to output latent semantic vectors; specifically, the latent semantic vectors corresponding to the code A to be detected and the source code fragments D1, D2, D3, D4 and D5 are A*, D1*, D2*, D3*, D4* and D5*, respectively;
S73, the latent semantic vector corresponding to each source code fragment and the latent semantic vector corresponding to the code to be detected are input into a DSSM model, and the similarity score between the source code fragment and the code to be detected is calculated to obtain the target source code with the highest similarity score.
The number of DSSM models in this embodiment is five, namely DSSM1, DSSM2, DSSM3, DSSM4 and DSSM5; the output of each DSSM model is connected to the fully connected layer FC, and the Softmax layer follows the fully connected layer FC so as to output the similarity proportion between the code to be detected and each source code fragment corresponding to the top-5 scores, namely P(D1|A), P(D2|A), P(D3|A), P(D4|A) and P(D5|A).
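Continuing the PyTorch sketch above, the five DSSM branches, the fully connected layer FC and the Softmax output could be assembled roughly as follows; sharing a single encoder across the five branches and the dimensions of FC are assumptions of the sketch.

```python
# Sketch of S71-S73: five DSSM branches feeding the fully connected layer FC and a Softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LstmDssmTop5(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder          # shared LSTM encoder from the previous sketch
        self.fc = nn.Linear(5, 5)       # fully connected layer FC over the five branch scores

    def forward(self, code_a, fragments):
        a_vec = self.encoder(code_a)                                            # A*
        d_vecs = [self.encoder(d) for d in fragments]                           # D1*..D5*
        cos = torch.stack([F.cosine_similarity(a_vec, d)[0] for d in d_vecs])   # DSSM1..DSSM5
        return torch.softmax(self.fc(cos), dim=0)                               # P(D1|A)..P(D5|A)

# Usage: probs = LstmDssmTop5(LstmDssmEncoder())(code_a, [d1, d2, d3, d4, d5]); best = int(probs.argmax())
```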
Based on the above code clone detection method, as shown in fig. 4, the code clone detection system of this embodiment comprises an acquisition module, a data processing module, a cluster analysis module, a matching module, a calculation and sorting module, and a detection module. Specifically, the acquisition module of this embodiment is used for acquiring the source code data set or the code to be detected; the data processing module is used for sequentially performing fragment segmentation and data preprocessing on the source code data set or the code to be detected; the cluster analysis module is used for performing cluster analysis on the source code fragment matrices of the source code fragment matrix data set and outputting the class labels and label features of the n classes of source codes; the matching module is used for matching the segmentation matrices against the n classes of source codes respectively to obtain the target source code class label corresponding to each segmentation matrix; the calculation and sorting module is used for traversing all source code fragment matrices under the target source code class label corresponding to each segmentation matrix to calculate cosine similarity, giving each segmentation matrix the weight x/s, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-5 scores; the detection module is used for inputting the source code fragments corresponding to the top-5 scores and the code to be detected into the LSTM-DSSM network model, calculating their similarity scores, and outputting the source code fragment with the highest similarity.
The specific processing procedure of each functional module may refer to the specific description in the code clone detection method, which is not repeated herein.
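Purely as an illustration of how these modules fit together, the earlier sketches can be chained end to end as follows; source_files, the input path and the fragment identifiers are assumptions of the sketch.

```python
# End-to-end sketch tying the illustrative helpers from the earlier sketches together.
repo_fragments = [frag for path in source_files                    # source_files: assumed list of file paths
                  for frag in split_into_fragments(open(path).read())]
repo_matrix, repo_vec = preprocess(repo_fragments)
repo_labels, label_features, gmm = build_repository(repo_matrix, n_classes=4)

code_a = open("code_to_detect.java").read()                         # illustrative input path
seg_matrices, _ = preprocess(split_into_fragments(code_a), vec=repo_vec)
weight = X_LINES / len(code_a.splitlines())                         # the x/s weight
targets = [match_target_class(row, gmm) for row in seg_matrices]
top5_ids = top5_candidates(seg_matrices, targets, repo_matrix, repo_labels,
                           list(range(len(repo_fragments))), weight)

model = LstmDssmTop5(LstmDssmEncoder())
probs = model(code_a, [repo_fragments[i] for i in top5_ids])
print("most similar fragment:", top5_ids[int(probs.argmax())])
```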
The computer readable storage medium of this embodiment stores instructions which, when run on a computer, cause the computer to execute the above code clone detection method, thereby realizing intelligent detection of codes.
Example 2:
The code clone detection method of the present embodiment differs from that of embodiment 1 in that:
the retained source code fragments are not limited to the top 5 defined in embodiment 1; they may also be the top 3, top 4, top 6, etc., determined according to the actual application requirements;
the remaining steps are the same as in embodiment 1.
The code clone detection system of this embodiment is adapted accordingly to the code clone detection method of this embodiment;
the computer readable storage medium of this embodiment stores instructions which, when run on a computer, cause the computer to execute the code clone detection method of this embodiment, thereby realizing intelligent detection of codes.
The foregoing is only illustrative of the preferred embodiments and principles of the present invention, and changes in specific embodiments will occur to those skilled in the art upon consideration of the teachings provided herein, and such changes are intended to be included within the scope of the invention as defined by the claims.

Claims (10)

1. A code clone detection method, characterized by comprising the following steps:
S1, acquiring a source code data set, and sequentially performing fragment segmentation and data preprocessing on the source codes of the source code data set to obtain a source code fragment matrix data set; performing cluster analysis on the source code fragment matrices of the source code fragment matrix data set, and outputting class labels and label features of n classes of source codes; n is an integer greater than 1;
S2, sequentially performing fragment segmentation and data preprocessing on the code to be detected to obtain s/x segmentation matrices; wherein s represents the length of the code to be detected, and x represents the fragment segmentation length;
S3, matching the segmentation matrices against the n classes of source codes respectively to obtain the target source code class label corresponding to each segmentation matrix;
S4, traversing all source code fragment matrices under the corresponding target source code class label to calculate cosine similarity with each segmentation matrix, giving each segmentation matrix the weight x/s, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-N scores; wherein top-N denotes the first N entries in the descending order;
S5, inputting the source code fragments corresponding to the top-N scores and the code to be detected into an LSTM-DSSM network model, calculating the similarity scores between those source code fragments and the code to be detected, and outputting the source code fragment with the highest similarity.
2. The code clone detection method according to claim 1, wherein in step S5, the processing procedure of the LSTM-DSSM network model comprises:
S51, performing word segmentation on the input code based on a BERT model, and converting it through a token embedding layer to obtain the LSTM input;
S52, passing the LSTM input through an LSTM model to output a latent semantic vector;
S53, inputting the latent semantic vector corresponding to the source code fragment and the latent semantic vector corresponding to the code to be detected into a DSSM model, and calculating the similarity score between the source code fragment and the code to be detected.
3. The code clone detection method according to claim 2, wherein there are N DSSM models, the output of each DSSM model is connected to a fully connected layer, and a Softmax layer follows the fully connected layer so as to output the similarity proportion between the code to be detected and each source code fragment corresponding to the top-N scores.
4. The code clone detection method according to claim 1, wherein the data preprocessing comprises data cleaning, text word segmentation and vectorization, and matrices with missing values after vectorization are padded with 0.
5. The code clone detection method according to claim 4, wherein the text word segmentation uses the word segmentation library NLTK for English code word segmentation.
6. The code clone detection method according to claim 1, wherein in step S3, the source code class with the largest attribution weight is selected, according to a voting method, as the target source code class label corresponding to the segmentation matrix.
7. The code clone detection method according to claim 1, wherein in the step S1, a GMM clustering algorithm is adopted for cluster analysis, and GMM parameters are estimated by an EM algorithm.
8. The code clone detection method according to claim 1, wherein the value of N is 3 to 6.
9. A code clone detection system applying the code clone detection method according to any one of claims 1 to 8, characterized in that the code clone detection system comprises:
the acquisition module, which is used for acquiring the source code data set or the code to be detected;
the data processing module, which is used for sequentially performing fragment segmentation and data preprocessing on the source code data set or the code to be detected;
the cluster analysis module, which is used for performing cluster analysis on the source code fragment matrices of the source code fragment matrix data set and outputting the class labels and label features of the n classes of source codes;
the matching module, which is used for matching the segmentation matrices against the n classes of source codes respectively to obtain the target source code class label corresponding to each segmentation matrix;
the calculation and sorting module, which is used for traversing all source code fragment matrices under the target source code class label corresponding to each segmentation matrix to calculate cosine similarity, giving each segmentation matrix the weight x/s, computing the weighted similarity score between the code to be detected and each source code, sorting in descending order of similarity score, and retaining the source code fragments corresponding to the top-N scores;
the detection module, which is used for inputting the source code fragments corresponding to the top-N scores and the code to be detected into the LSTM-DSSM network model, calculating their similarity scores, and outputting the source code fragment with the highest similarity.
10. A computer readable storage medium having instructions stored therein, which when run on a computer, cause the computer to perform the code clone detection method according to any one of claims 1-8.
CN202310457759.1A 2023-04-26 2023-04-26 Code clone detection method, system and computer readable storage medium Active CN116166321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310457759.1A CN116166321B (en) 2023-04-26 2023-04-26 Code clone detection method, system and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310457759.1A CN116166321B (en) 2023-04-26 2023-04-26 Code clone detection method, system and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN116166321A CN116166321A (en) 2023-05-26
CN116166321B (en) 2023-06-27

Family

ID=86416785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310457759.1A Active CN116166321B (en) 2023-04-26 2023-04-26 Code clone detection method, system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116166321B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297750A (en) * 2018-03-22 2019-10-01 北京京东尚科信息技术有限公司 The method and apparatus of program similitude detection
CN108829780A (en) * 2018-05-31 2018-11-16 北京万方数据股份有限公司 Method for text detection, calculates equipment and computer readable storage medium at device
CN110825877A (en) * 2019-11-12 2020-02-21 中国石油大学(华东) Semantic similarity analysis method based on text clustering
CN111290784A (en) * 2020-01-21 2020-06-16 北京航空航天大学 Program source code similarity detection method suitable for large-scale samples
CN112257068A (en) * 2020-11-17 2021-01-22 南方电网科学研究院有限责任公司 Program similarity detection method and device, electronic equipment and storage medium
CN112306494A (en) * 2020-12-03 2021-02-02 南京航空航天大学 Code classification and clustering method based on convolution and cyclic neural network
CN112698861A (en) * 2021-03-25 2021-04-23 深圳开源互联网安全技术有限公司 Source code clone identification method and system
WO2023028721A1 (en) * 2021-08-28 2023-03-09 Huawei Technologies Co.,Ltd. Systems and methods for detection of code clones
CN114153496A (en) * 2021-09-08 2022-03-08 北京天德科技有限公司 Block chain-based high-speed parallelizable code similarity comparison method and system
CN114925702A (en) * 2022-06-13 2022-08-19 深圳市北科瑞声科技股份有限公司 Text similarity recognition method and device, electronic equipment and storage medium
CN115562721A (en) * 2022-10-28 2023-01-03 南开大学 Clone code detection method and system for mining features from assembly language

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Research progress on code clone detection; Chen Qiuyuan; Li Shanping; Yan Meng; Xia Xin; Journal of Software (No. 04); full text *
A code provenance analysis method based on code clone detection; Li Suo; Wu Yijian; Zhao Wenyun; Computer Applications and Software (No. 02); full text *
Analysis of string-based code clone detection methods; An Diwen; Tang Yanbin; Computer Knowledge and Technology (No. 31); full text *
A survey of management-oriented clone code research; Su Xiaohong; Zhang Fanlong; Chinese Journal of Computers (No. 03); full text *

Also Published As

Publication number Publication date
CN116166321A (en) 2023-05-26

Similar Documents

Publication Publication Date Title
US11631007B2 (en) Method and device for text-enhanced knowledge graph joint representation learning
CN111325028B (en) Intelligent semantic matching method and device based on deep hierarchical coding
CN110609891A (en) Visual dialog generation method based on context awareness graph neural network
CN109472024A (en) A kind of file classification method based on bidirectional circulating attention neural network
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN109977416A (en) A kind of multi-level natural language anti-spam text method and system
CN107480132A (en) A kind of classic poetry generation method of image content-based
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN112232053B (en) Text similarity computing system, method and storage medium based on multi-keyword pair matching
CN110796160A (en) Text classification method, device and storage medium
CN110222184A (en) A kind of emotion information recognition methods of text and relevant apparatus
CN113076465A (en) Universal cross-modal retrieval model based on deep hash
CN113221571B (en) Entity relation joint extraction method based on entity correlation attention mechanism
CN111858878A (en) Method, system and storage medium for automatically extracting answer from natural language text
CN112651940A (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN113157919A (en) Sentence text aspect level emotion classification method and system
CN114492661B (en) Text data classification method and device, computer equipment and storage medium
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN115169429A (en) Lightweight aspect-level text emotion analysis method
CN114492459A (en) Comment emotion analysis method and system based on convolution of knowledge graph and interaction graph
CN113779966A (en) Mongolian emotion analysis method of bidirectional CNN-RNN depth model based on attention
CN117113094A (en) Semantic progressive fusion-based long text similarity calculation method and device
CN116680407A (en) Knowledge graph construction method and device
CN116166321B (en) Code clone detection method, system and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant