CN112835620A - Semantic similar code online detection method based on deep learning - Google Patents
Semantic similar code online detection method based on deep learning
- Publication number
- CN112835620A (application number CN202110184538.2)
- Authority
- CN
- China
- Prior art keywords
- information
- function
- semantic
- codes
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/75—Structural analysis for program understanding
- G06F8/751—Code clone detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a deep-learning-based online detection method for semantically similar code, belonging to the field of software defect prediction in software engineering. The method determines online whether two given code fragments (functions) are semantically similar, based on identifier text similarity and semantic similarity. First, the identifier text and function structure information of each function are extracted from a sample library. The identifier text undergoes natural language processing, including word segmentation, abbreviation expansion and part-of-speech tagging, and the program structure information is abstracted. Both are then converted into vector representations; the vector representations of two functions are concatenated and fed into a deep neural network for semantic similarity learning and automatic detection. The invention uses deep learning to mine the implicit semantic similarity information in program text, ensuring that online detection is both real-time and accurate.
Description
Technical Field
The invention relates to an online detection method for semantic similar codes, in particular to an online detection method for semantic similar codes based on deep learning, and belongs to the technical field of software defect prediction of software engineering.
Background
In the field of software defect prediction, clone code refers to code fragments with the same or similar functionality. Clones can be divided into four types: clones with identical text, clones with similar text, clones with similar syntax, and clones with similar semantics.
Due to the existence of the clone codes, the redundancy of the software is increased, and certain difficulty is brought to software maintenance and software evolution. To this end, different methods have been proposed for detecting, eliminating and managing clone codes. The effective detection of clone codes is the premise and the basis for managing and eliminating the clone codes.
Existing clone detection methods mainly target three types of clones, namely clones with identical text, clones with similar text and clones with similar syntax, and can effectively detect clones with similar text. However, they have difficulty detecting redundant clones whose program structure and algorithm differ but whose semantics are similar. Research results show that existing clone detection methods detect less than 1% of semantically similar code; that is, more than 99% of it goes undetected.
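To make the notion of a semantic clone concrete, the following minimal Python sketch shows two functions whose text, structure and algorithm differ while their behavior is the same. The function names are hypothetical and the example is illustrative only; it is not taken from the patent.

```python
def sum_to_n_loop(n: int) -> int:
    """Add the integers 1..n with an explicit loop."""
    total = 0
    for value in range(1, n + 1):
        total += value
    return total


def triangular_number(count: int) -> int:
    """Compute the same result with the closed-form formula n(n+1)/2."""
    return count * (count + 1) // 2
```

A text-based or syntax-based detector sees little in common between these two functions, although they return the same value for every non-negative input.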
Since detecting cloned codes with similar semantics involves complex program semantic analysis, how to effectively detect such cloned codes is still an unsolved problem.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and to solve the technical problem of detecting semantically similar clones by providing an online detection method for semantically similar code based on deep learning.
The innovation of the method is that it combines static code analysis and deep learning to mine information from big data, and supports real-time detection on any open source project.
The invention is realized by adopting the following technical scheme.
First, the program structure features and identifier text information of each function in the sample library are extracted by static code analysis. The sample library stores source code from open source projects, including code labeled as semantically similar and code labeled as semantically dissimilar.
Then, the extracted function information is preprocessed using natural language processing techniques, the structure information is symbolized, the information is vectorized, and the vectors of two functions are combined into one piece of training data.
Next, a neural network is designed, the training data are fed into it for model training, and the network parameters are adjusted and the model optimized according to the network output.
Finally, the project code to be detected is fed into the trained neural network model for detection.
Advantageous effects
Compared with the prior art, the method of the invention has the following advantages:
(1) It makes full use of static program analysis and natural language processing to mine the natural language and program semantic information hidden in the text identifiers of a program, and it is simple and easy to apply.
(2) It uses a deep neural network to fit the semantic correlation and similarity patterns of programs, and can therefore detect semantically similar code effectively.
Drawings
FIG. 1 is a schematic diagram of the operation of the method of the present invention.
Detailed Description
The method of the present invention will be described in further detail with reference to the accompanying drawings.
The method uses static code analysis and natural language processing to mine the natural language and program semantic information hidden in program text identifiers, then trains a neural network model with deep learning to learn a mapping of semantic similarity relations, and predicts semantic similarity from the network output.
As shown in fig. 1, an online detection method for semantic similar codes based on a deep learning technique includes the following steps:
Step 1: Extract function information from the sample library using static code analysis.
First, the source code in the sample library is analyzed by static analysis, and the structure information and text identifier information of each function are extracted. Each function's extracted information is then stored as one record in a database. The sample library stores source code from open source projects, including code labeled as semantically similar and code labeled as semantically dissimilar.
The structural information includes, but is not limited to, assignment statements, function call statements, selection statements, and loop statements; textual identifier information includes, but is not limited to, function names, argument names, all variable names accessed within the function, and all method names called within the function.
Step 2: Preprocess the extracted function information using natural language processing techniques.
Specifically:
Step 2.1: Classify the function structure information extracted in step 1 and represent it symbolically.
Step 2.2: Apply natural language processing to the function text identifier information extracted in step 1, including word segmentation, abbreviation expansion and part-of-speech tagging. The specific procedure is as follows:
Split each identifier Name into a sequence of words and abbreviations $\langle w_1, w_2, \ldots, w_i, \ldots, w_k \rangle$. If $w_i$ is an abbreviation, expand it into a word using an identifier expansion technique and then perform step 2.3; otherwise, perform step 2.3 directly.
Step 2.3: Tag the word sequence obtained in step 2.2 with parts of speech using a part-of-speech tagging technique.
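Steps 2.2 and 2.3 can be sketched as below. The patent does not specify the segmentation rules, the abbreviation resource or the tagger, so everything here is an assumption made for illustration: identifiers are split on underscores and camelCase boundaries, and `ABBREVIATIONS` and `POS_LEXICON` are tiny hypothetical lookup tables standing in for a real identifier expansion technique and a real part-of-speech tagger.

```python
import re

# Hypothetical abbreviation dictionary; a real system would use a much larger resource.
ABBREVIATIONS = {"cnt": "count", "msg": "message", "idx": "index",
                 "buf": "buffer", "num": "number"}

# Toy part-of-speech lexicon standing in for a real tagger (assumption).
POS_LEXICON = {"get": "VERB", "set": "VERB", "send": "VERB", "count": "NOUN",
               "message": "NOUN", "index": "NOUN", "buffer": "NOUN",
               "number": "NOUN", "max": "ADJ"}


def split_identifier(name: str) -> list[str]:
    """Split snake_case and camelCase identifiers into lowercase tokens."""
    tokens = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [token.lower() for token in tokens]


def expand_and_tag(name: str) -> list[tuple[str, str]]:
    """Expand abbreviations, then attach a part-of-speech tag to every word."""
    words = [ABBREVIATIONS.get(token, token) for token in split_identifier(name)]
    return [(word, POS_LEXICON.get(word, "UNK")) for word in words]
```

For example, `expand_and_tag("getMsgCnt")` yields the words `get`, `message` and `count` with the tags `VERB`, `NOUN` and `NOUN`. Words missing from the toy lexicon receive the placeholder tag `UNK`.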
Step 3: Vectorize the word information and the symbolized information.
Specifically:
Step 3.1: Map each word $w_i$ obtained in step 2.2 to a vector $V_i$. An identifier thus maps to a sequence of vectors $\mathit{nameV} = \langle V_1, V_2, \ldots, V_i, \ldots, V_k \rangle$, and the identifier text of the function maps to $\mathit{idV} = \langle \mathit{name}_1V, \mathit{name}_2V, \ldots, \mathit{name}_iV, \ldots, \mathit{name}_tV \rangle$, where the function has $t$ identifier names and each name comprises $k$ vectors $V$.
Step 3.2: Map the part-of-speech tagging information obtained in step 2.3 to a vector posV.
Step 3.3: Map the symbolized information obtained in step 2 to a vector sigV.
Step 3.4: Concatenate the vectors idV, posV and sigV to generate the vectorized representation methodV of the input function.
The order of vector concatenation in step 3.4 cannot be changed, since the order in which the words appear in an identifier affects the semantics of the identifier.
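The vectorization in step 3 can be sketched as follows. The patent does not say how words, tags or symbols are embedded, so this sketch makes several assumptions that should not be read as the patented method: a hash-based `word_vector` stands in for a learned word embedding, posV and sigV are simple count vectors, and the sizes `DIM`, `MAX_NAMES` (t) and `MAX_WORDS` (k) are arbitrary. All names are hypothetical.

```python
import hashlib

DIM = 4          # embedding size per word (assumption)
MAX_NAMES = 3    # t: identifiers kept per function (assumption)
MAX_WORDS = 2    # k: words kept per identifier (assumption)
POS_TAGS = ["NOUN", "VERB", "ADJ", "UNK"]
SYMBOLS = ["assignment", "call", "selection", "loop"]


def word_vector(word: str) -> list[float]:
    """Deterministic stand-in for a learned word embedding."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:DIM]]


def name_vector(words: list[str]) -> list[float]:
    """nameV: word vectors in order, zero-padded or truncated to MAX_WORDS."""
    padded = (words + [""] * MAX_WORDS)[:MAX_WORDS]
    vector = []
    for word in padded:
        vector += word_vector(word) if word else [0.0] * DIM
    return vector


def method_vector(names: list[list[str]], pos_tags: list[str],
                  structure: list[str]) -> list[float]:
    """methodV = idV followed by posV followed by sigV, always in this order."""
    padded_names = (names + [[]] * MAX_NAMES)[:MAX_NAMES]
    id_v = [x for words in padded_names for x in name_vector(words)]
    pos_v = [float(pos_tags.count(tag)) for tag in POS_TAGS]
    sig_v = [float(structure.count(symbol)) for symbol in SYMBOLS]
    return id_v + pos_v + sig_v
```

Padding every function to the same t and k is a design choice of this sketch: it gives every methodV the same length, which a fixed-size network input layer requires, while preserving word order inside each identifier.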
Step 4: Generate a training data set.
Specifically:
Step 4.1: Concatenate any two vectors $\mathit{method}_1V$ and $\mathit{method}_2V$ obtained in step 3 to form the input data InputV of the neural network.
Step 4.2: Take the label in the sample library that indicates whether the two functions are semantically similar code as the output data OutputV of the neural network.
Step 4.3: Combine the input data InputV from step 4.1 with the output data OutputV from step 4.2 to generate the training data trainV.
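Step 4 can be sketched as a pairing routine. The data layout is an assumption made for illustration: function vectors are held in a dictionary keyed by function name, and the sample library's labels are given as a set of unordered name pairs marked as semantically similar. The function name `build_training_data` is hypothetical.

```python
from itertools import combinations


def build_training_data(method_vectors: dict, similar_pairs: set) -> list:
    """Return trainV records of the form (InputV, OutputV)."""
    train = []
    ordered = sorted(method_vectors.items())       # stable pair order by function name
    for (name_a, vec_a), (name_b, vec_b) in combinations(ordered, 2):
        input_v = vec_a + vec_b                    # step 4.1: concatenate the two methodV
        is_similar = frozenset((name_a, name_b)) in similar_pairs
        output_v = 1.0 if is_similar else 0.0      # step 4.2: label from the sample library
        train.append((input_v, output_v))          # step 4.3: one trainV record
    return train
```

Because concatenation is order-sensitive, a practical variant might also add each pair in reversed order so that the model does not depend on which function comes first; the patent text does not state whether this is done.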
Step 5: Train the neural network to obtain the network model DeepCloneClassifier.
Specifically:
Step 5.1: Initialize a deep neural network comprising an input layer, a hidden layer and an output layer.
Step 5.2: Feed the training set data obtained in step 4 into the neural network in sequence and train the network to obtain the model DeepCloneClassifier.
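Step 5 can be illustrated with a very small network written in plain Python. The patent specifies only that the network has an input layer, a hidden layer and an output layer; the layer size, tanh and sigmoid activations, binary cross-entropy loss, learning rate and per-sample gradient descent below are all assumptions of this sketch, and a real system would more likely use a deep learning framework.

```python
import math
import random


class DeepCloneClassifier:
    """Tiny input -> hidden -> output network; all sizes are illustrative."""

    def __init__(self, n_inputs: int, n_hidden: int = 8, seed: int = 0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        z = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, 1.0 / (1.0 + math.exp(-z))      # sigmoid output in (0, 1)

    def predict(self, x) -> float:
        """Similarity score for one InputV."""
        return self._forward(x)[1]

    def loss(self, data) -> float:
        """Mean binary cross-entropy over (InputV, OutputV) pairs."""
        total = 0.0
        for x, y in data:
            p = min(max(self.predict(x), 1e-12), 1 - 1e-12)
            total -= y * math.log(p) + (1 - y) * math.log(1 - p)
        return total / len(data)

    def train(self, data, epochs: int = 200, lr: float = 0.1) -> None:
        for _ in range(epochs):
            for x, y in data:                          # records are fed in sequence
                hidden, out = self._forward(x)
                d_out = out - y                        # dLoss/dz for sigmoid + cross-entropy
                for j, h in enumerate(hidden):
                    # Back-propagate through w2[j] before updating it.
                    d_hidden = d_out * self.w2[j] * (1 - h * h)
                    self.w2[j] -= lr * d_out * h
                    self.b1[j] -= lr * d_hidden
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * d_hidden * xi
                self.b2 -= lr * d_out
```

A model is created with the InputV length, for example `DeepCloneClassifier(n_inputs=64)` for two 32-dimensional methodV vectors, and trained by calling `train` on the trainV records.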
Step 6: Generate a test set.
Specifically:
Step 6.1: For the open source code to be detected, repeat steps 1 to 3 to extract the feature information of each function and vectorize it.
Step 6.2: Concatenate each vector generated in step 6.1 with each vector generated in step 3, in sequence, to obtain the test set.
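Step 6.2 amounts to a Cartesian product between the functions under test and the functions in the sample library. A minimal sketch, assuming both sets of vectors are held in dictionaries keyed by function name (the name `build_test_set` and this layout are hypothetical):

```python
from itertools import product


def build_test_set(new_vectors: dict, library_vectors: dict) -> list:
    """Pair each function under test with each sample-library function."""
    return [((new_name, lib_name), new_vec + lib_vec)
            for (new_name, new_vec), (lib_name, lib_vec)
            in product(new_vectors.items(), library_vectors.items())]
```

Each entry keeps the pair of function names next to the concatenated vector so that a prediction can be traced back to the two functions it refers to. The number of pairs grows with the product of the two set sizes, which matters for online use on large sample libraries.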
Step 7: Query the neural network for prediction.
Specifically:
Step 7.1: Feed the test set data obtained in step 6 into the network model DeepCloneClassifier in sequence.
Step 7.2: Mark the test data whose model output exceeds the threshold T as semantically similar code, which completes the online detection of semantically similar code.
The threshold value T can be dynamically adjusted according to the learning and predicting results, so that the detection accuracy is continuously improved.
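The thresholding in step 7.2 and one possible way to adjust T can be sketched as follows. The patent does not say how T is adjusted, so `tune_threshold` is purely an assumption of this sketch: it picks, from a fixed list of candidate values, the threshold with the best F1 score on pairs whose true label is known. Both function names and the candidate list are hypothetical.

```python
def flag_similar(scores: dict, threshold: float) -> list:
    """Return the pairs whose model output is strictly above the threshold T."""
    return sorted(pair for pair, score in scores.items() if score > threshold)


def tune_threshold(scored_labels: list,
                   candidates=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8)) -> float:
    """Pick the candidate T with the best F1 on (score, is_similar) pairs."""
    def f1(t: float) -> float:
        tp = sum(1 for s, y in scored_labels if s > t and y)
        fp = sum(1 for s, y in scored_labels if s > t and not y)
        fn = sum(1 for s, y in scored_labels if s <= t and y)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # max() keeps the first candidate on ties, so the lowest best threshold wins.
    return max(candidates, key=f1)
```

In this sketch, `scores` maps a pair of function names to the output of the trained model, and `scored_labels` would come from pairs that have since been confirmed or rejected, for example by a reviewer.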
Claims (4)
1. A semantic similar code online detection method based on a deep learning technology is characterized by comprising the following steps:
step 1: extracting function information from a sample library by using a static code analysis technology;
firstly, analyzing source codes in a sample library by using a static analysis technology, extracting structure information and text identifier information of each function, and then storing each extracted function information as a record in a database; the sample library stores source code data of open source projects, wherein the source code data comprises labeled semantic similar codes and semantic dissimilar codes;
step 2: utilizing a natural language processing technology to preprocess the extracted function structure information;
step 2.1: classifying and symbolizing the function structure information extracted in the step 1;
step 2.2: performing natural language processing on the function text identifier information extracted in step 1, wherein the natural language processing comprises word segmentation, abbreviation expansion and part-of-speech tagging, specifically:
splitting each identifier Name into a sequence of words and abbreviations $\langle w_1, w_2, \ldots, w_i, \ldots, w_k \rangle$; if $w_i$ is an abbreviation, expanding it into a word by an identifier expansion technique and then performing step 2.3; otherwise, performing step 2.3 directly;
step 2.3: performing part-of-speech tagging on the word sequence obtained in the step 2.2 by using a part-of-speech tagging technology;
step 3: vectorizing the word information and the symbolized information;
step 3.1: mapping each word $w_i$ obtained in step 2.2 to a vector $V_i$, so that an identifier maps to a sequence of vectors $\mathit{nameV} = \langle V_1, V_2, \ldots, V_i, \ldots, V_k \rangle$ and the identifier text of the function maps to $\mathit{idV} = \langle \mathit{name}_1V, \mathit{name}_2V, \ldots, \mathit{name}_iV, \ldots, \mathit{name}_tV \rangle$, wherein there are $t$ names and each name comprises $k$ vectors $V$;
step 3.2: mapping the part-of-speech tagging information obtained in the step 2.3 into a vector posV;
step 3.3: mapping the symbolized information obtained in the step 2 into a vector sigV;
step 3.4: connecting vectors idV, posV and sigV to generate a vectorized representation of the input function methodV, and the order of vector connection in step 3.4 cannot be changed;
step 4: generating a training data set;
step 4.1: concatenating any two vectors $\mathit{method}_1V$ and $\mathit{method}_2V$ obtained in step 3 as the input data InputV of the neural network;
step 4.2: taking the label in the sample library that indicates whether the two functions are semantically similar code as the output data OutputV of the neural network;
step 4.3: connecting the input data InputV of the step 4.1 with the output data OutputV of the step 4.2 to generate training data trainV;
step 5: training the neural network to obtain a network model DeepCloneClassifier;
step 5.1: initializing a deep neural network comprising an input layer, a hidden layer and an output layer;
step 5.2: inputting the training set data obtained in the step 4 into a neural network in sequence, and training the network to obtain a model DeepCloneClassifier;
step 6: generating a test set;
step 6.1: for the open source code to be detected, repeatedly executing the steps 1 to 3, extracting the characteristic information of each function and performing vectorization representation;
step 6.2: sequentially connecting each vector generated in the step 6.1 with each vector generated in the step 3 to obtain a test set;
step 7: querying the neural network for prediction;
step 7.1: sequentially inputting the test set data obtained in the step 6 into a network model DeepCloneClassifier to query;
step 7.2: and marking the test data with the model output result higher than the threshold value T as the semantic similar codes, thereby completing the on-line detection of the semantic similar codes.
2. The method for detecting semantic similar codes on line based on deep learning technology as claimed in claim 1, wherein in step 1, the structural information includes assignment statements, function call statements, selection statements and loop statements.
3. The method for detecting semantic similar codes on line based on deep learning technology as claimed in claim 1, wherein in step 1, the text identifier information includes function name, argument name, all variable names accessed in the function and all method names called in the function.
4. The method for on-line detection of semantic similar codes based on deep learning technology as claimed in claim 1, wherein in step 7, the threshold T is dynamically adjusted according to the learning and prediction results.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110184538.2A CN112835620B (en) | 2021-02-10 | 2021-02-10 | Semantic similar code online detection method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112835620A | 2021-05-25 |
CN112835620B | 2022-03-25 |
Family
ID=75933489
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110184538.2A (Expired - Fee Related; CN112835620B) | 2021-02-10 | 2021-02-10 | Semantic similar code online detection method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112835620B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190243621A1 (en) * | 2018-02-06 | 2019-08-08 | Smartshift Technologies, Inc. | Systems and methods for code clustering analysis and transformation |
CN110413319A (en) * | 2019-08-01 | 2019-11-05 | 北京理工大学 | A kind of code function taste detection method based on deep semantic |
US20200104631A1 (en) * | 2018-09-27 | 2020-04-02 | International Business Machines Corporation | Generating vector representations of code capturing semantic similarity |
US20200133756A1 (en) * | 2018-10-26 | 2020-04-30 | EMC IP Holding Company LLC | Method, apparatus and computer storage medium for error diagnostics of an application |
CN111124487A (en) * | 2018-11-01 | 2020-05-08 | 浙江大学 | Code clone detection method and device and electronic equipment |
CN112215013A (en) * | 2020-11-02 | 2021-01-12 | 天津大学 | Clone code semantic detection method based on deep learning |
Non-Patent Citations (6)
Title |
---|
GUANGJIE LI 等: "A Deep Learning Based Approach to Detect Code Clones", 《2020 INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND HUMAN-COMPUTER INTERACTION (ICHCI)》 * |
HUI LIU 等: "Deep Learning Based Code Smell Detection", 《IEEE TRANSACTIONS ON SOFTWARE ENGINEERING》 * |
LIUQING LI 等: "CCLearner: A Deep Learning-Based Clone Detection Approach", 《2017 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION (ICSME)》 * |
STAY_FOOLISH12: "DSSM、CNN-DSSM、LSTM-DSSM等深度学习模型在计算语义相似度上的应用+距离运算", 《HTTPS://BLOG.CSDN.NET/STAY_FOOLISH12/ARTICLE/DETAILS/107484368》 * |
卜依凡 等: "一种基于深度学习的上帝类检测方法", 《软件学报》 * |
胡艺: "基于深度学习的代码漏洞检测方法", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116820564A (en) * | 2023-07-06 | 2023-09-29 | 四川大学 | Unified form semanticalization method of program language |
CN116820564B (en) * | 2023-07-06 | 2024-04-02 | 四川大学 | Unified form semanticalization method of program language |
Also Published As
Publication number | Publication date |
---|---|
CN112835620B (en) | 2022-03-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220325 |