CN112835620A - Semantic similar code online detection method based on deep learning - Google Patents

Publication number
CN112835620A
Authority
CN
China
Prior art keywords
information
function
semantic
codes
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110184538.2A
Other languages
Chinese (zh)
Other versions
CN112835620B (en)
Inventor
李光杰
唐艺
张翔
易比一
侯胜杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Defense Technology Innovation Institute PLA Academy of Military Science
Priority to CN202110184538.2A
Publication of CN112835620A
Application granted
Publication of CN112835620B
Expired - Fee Related
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/70Software maintenance or management
    • G06F8/75Structural analysis for program understanding
    • G06F8/751Code clone detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3338Query expansion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a deep-learning-based method for the online detection of semantically similar code, and belongs to the field of software defect prediction within software engineering. The method determines online whether two given code segments (functions) are semantically similar code, based on identifier text similarity and semantic similarity. First, the identifier text and the structural information of each function are extracted from a sample library. The identifier text undergoes natural language processing (word segmentation, abbreviation expansion and part-of-speech tagging), and the program structure information is abstracted. Both are then converted into vector representations; the vector representations of the two functions are concatenated and input into a deep neural network for semantic similarity learning and automatic detection. The invention makes full use of deep learning to mine the similar semantic information implicit in program text, and ensures real-time, accurate online detection.

Description

Semantic similar code online detection method based on deep learning
Technical Field
The invention relates to a method for the online detection of semantically similar code, in particular one based on deep learning, and belongs to the field of software defect prediction within software engineering.
Background
In the field of software defect prediction, clone code refers to code segments with the same or similar functionality. Clones can be divided into four types: clones with identical text, clones with similar text, clones with similar syntax, and clones with similar semantics.
Clone code increases software redundancy and makes software maintenance and evolution more difficult. For this reason, various methods have been proposed for detecting, eliminating and managing clone code. Effective detection is the precondition and basis for managing and eliminating clones.
Existing clone detection methods mainly target three types of clones (identical text, similar text and similar syntax) and are effective for clones with similar text. However, they have difficulty detecting redundant clones whose program structure and algorithm are dissimilar but whose semantics are similar. Research results show that existing clone detection methods can detect fewer than 1% of semantically similar code; that is, more than 99% of semantically similar code goes undetected.
Because detecting semantically similar clones involves complex program semantic analysis, how to detect such clones effectively remains an unsolved problem.
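As an illustration of the kind of pair at issue, the following Python sketch shows two functions whose structure and algorithm differ but whose behaviour is the same. Both functions are hypothetical examples written for this illustration and are not taken from the patent.

```python
def sum_up_to(limit):
    """Iterative version: add the integers 1..limit one at a time."""
    total = 0
    for value in range(1, limit + 1):
        total += value
    return total


def sum_first_n(n):
    """Closed-form version: the same sum via the formula n(n + 1) / 2."""
    return n * (n + 1) // 2
```

A text-based or syntax-based comparison sees a loop with an accumulator on one side and a single arithmetic expression on the other, even though the two functions return the same value for every non-negative input.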
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and to solve the technical problem of detecting semantically similar clones by providing an online detection method for semantically similar code based on deep learning.
The innovation of the method is that it makes full use of static code analysis and deep learning to mine information from large volumes of data, and supports real-time detection on any open source project.
The invention is realized by the following technical scheme.
First, using static code analysis, the program structure features and the identifier text information of each function in the sample library are extracted. The sample library stores source code from open source projects, including code labelled as semantically similar and code labelled as semantically dissimilar.
Then the extracted function information is preprocessed: the identifier text is processed with natural language processing techniques, the structure information is symbolized, both are represented as vectors, and the vectors of two functions are combined into one piece of training data.
Next, a neural network is designed, the training data are input into it for model training, and the network parameters are adjusted and the model optimized according to the network output.
Finally, the project code to be detected is input into the trained neural network model for detection.
Advantageous effects
Compared with the prior art, the method of the invention has the following advantages:
(1) It makes full use of static program analysis and natural language processing to mine the natural language and program semantic information hidden in the text identifiers of a program, and it is simple and easy to operate.
(2) It makes full use of a deep neural network to fit the semantic correlation and similarity rules of programs, so that semantically similar code can be detected effectively.
Drawings
FIG. 1 is a schematic diagram of the operation of the method of the present invention.
Detailed Description
The method of the present invention will be described in further detail with reference to the accompanying drawings.
The method uses static code analysis and natural language processing to mine the natural language and program semantic information hidden in the text identifiers of a program, then trains a neural network model with deep learning to learn the semantic similarity relation, and predicts semantic similarity from the network output.
As shown in FIG. 1, the deep-learning-based online detection method for semantically similar code includes the following steps:
Step 1: Extract function information from the sample library using static code analysis.
First, the source code in the sample library is analysed with static analysis to extract the structural information and the text identifier information of each function. The information extracted for each function is then stored as one record in a database. The sample library stores source code from open source projects, including code labelled as semantically similar and code labelled as semantically dissimilar.
The structural information includes, but is not limited to, assignment statements, function call statements, selection statements, and loop statements; the text identifier information includes, but is not limited to, function names, argument names, all variable names accessed within the function, and all method names called within the function.
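The extraction in step 1 can be sketched as follows. The patent does not name a source language or an analysis tool, so this sketch assumes Python source and uses the standard `ast` module. The sample function `calc_avg`, the category labels (`ASSIGN`, `CALL`, `SELECT`, `LOOP`) and the record layout are hypothetical choices made for illustration.

```python
import ast

# Hypothetical sample function standing in for one entry of the sample library.
SOURCE = '''
def calc_avg(num_list):
    total = 0
    for n in num_list:
        total += n
    if len(num_list) == 0:
        return 0
    return total / len(num_list)
'''

# Statement kinds named in the text, mapped to hypothetical category labels.
STRUCTURE_KINDS = {
    ast.Assign: "ASSIGN", ast.AugAssign: "ASSIGN", ast.AnnAssign: "ASSIGN",
    ast.Call: "CALL",
    ast.If: "SELECT",
    ast.For: "LOOP", ast.While: "LOOP",
}


def extract_function_info(func):
    """Build one record holding structural and identifier information."""
    structure, names, calls = [], [], []
    # ast.walk visits nodes breadth-first, so the lists are not in source order.
    for node in ast.walk(func):
        kind = STRUCTURE_KINDS.get(type(node))
        if kind is not None:
            structure.append(kind)
        if isinstance(node, ast.Name) and node.id not in names:
            names.append(node.id)
        if isinstance(node, ast.Call):
            callee = node.func
            # Plain calls expose .id, method calls expose .attr.
            called = callee.id if isinstance(callee, ast.Name) else getattr(callee, "attr", None)
            if called and called not in calls:
                calls.append(called)
    return {
        "name": func.name,
        "params": [arg.arg for arg in func.args.args],
        "variables": [n for n in names if n not in calls],
        "calls": calls,
        "structure": structure,
    }


records = [extract_function_info(node)
           for node in ast.walk(ast.parse(SOURCE))
           if isinstance(node, ast.FunctionDef)]
```

Each dictionary in `records` plays the role of one database record. A real system would need a parser for whatever language the sample library contains.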
Step 2: Preprocess the extracted function information using natural language processing.
Specifically:
Step 2.1: Classify the function structure information extracted in step 1 and represent it symbolically.
Step 2.2: Apply natural language processing to the function text identifier information extracted in step 1, including word segmentation, abbreviation expansion and part-of-speech tagging. The specific procedure is as follows:
Split each identifier name into a sequence of words and abbreviations <w_1, w_2, …, w_i, …, w_k>. If w_i is an abbreviation, expand it into a full word using an identifier expansion technique and then carry out step 2.3; otherwise, carry out step 2.3 directly.
Step 2.3: Tag the word sequence obtained in step 2.2 with parts of speech using a part-of-speech tagging technique.
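Steps 2.2 and 2.3 can be sketched as follows. The patent does not specify the segmentation rules, the expansion technique or the tagger, so the regular expression, the `ABBREVIATIONS` dictionary and the `POS_LEXICON` lookup below are hypothetical stand-ins. A practical system would likely use a trained part-of-speech tagger rather than a lookup table.

```python
import re

# Hypothetical abbreviation dictionary (an assumption, not from the patent).
ABBREVIATIONS = {
    "calc": "calculate", "avg": "average", "num": "number",
    "msg": "message", "idx": "index",
}

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
POS_LEXICON = {
    "calculate": "VERB", "get": "VERB", "set": "VERB",
    "average": "NOUN", "number": "NOUN", "list": "NOUN",
}


def split_identifier(name):
    """Split snake_case and camelCase identifiers into lower-case tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def expand_abbreviations(words):
    """Replace known abbreviations with full words; keep other tokens as-is."""
    return [ABBREVIATIONS.get(word, word) for word in words]


def tag_words(words):
    """Attach a part-of-speech tag to each word; unknown words default to NOUN."""
    return [(word, POS_LEXICON.get(word, "NOUN")) for word in words]


def preprocess_identifier(name):
    """Word segmentation, then abbreviation expansion, then tagging."""
    return tag_words(expand_abbreviations(split_identifier(name)))
```

For instance, `preprocess_identifier("calc_avg")` splits the name into `calc` and `avg`, expands both, and tags the results.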
Step 3: Represent the word information and the symbolized information as vectors.
Specifically:
Step 3.1: Map each word w_i obtained in step 2.2 to a vector V_i. An identifier thus maps to a sequence of vectors nameV = <V_1, V_2, …, V_i, …, V_k>, and the identifier text of the function maps to idV = <name_1V, name_2V, …, name_iV, …, name_tV>. Here idV contains t identifier names, and each nameV contains k word vectors.
Step 3.2: Map the part-of-speech tagging information obtained in step 2.3 to a vector posV.
Step 3.3: Map the symbolized information obtained in step 2.1 to a vector sigV.
Step 3.4: Concatenate the vectors idV, posV and sigV to generate the vectorized representation methodV of the input function.
The order of vector concatenation in step 3.4 must not be changed, because the order in which words appear in an identifier affects the identifier's semantics.
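A minimal sketch of step 3 follows. The patent does not say how words are embedded or how variable-length inputs are handled, so everything concrete here is an assumption: a deterministic hash-based toy embedding replaces a learned word embedding, identifiers are padded or truncated to `MAX_WORDS` words, part-of-speech tags become integer codes, and sigV is a count per structure category.

```python
import hashlib

DIM = 4        # toy embedding size (assumption)
MAX_WORDS = 6  # pad or truncate so every methodV has the same length (assumption)
POS_IDS = {"NOUN": 1, "VERB": 2, "ADJ": 3}  # 0 is reserved for padding/unknown
STRUCT_KINDS = ["ASSIGN", "CALL", "SELECT", "LOOP"]


def embed_word(word):
    """Deterministic toy embedding: the first DIM hash bytes scaled to [0, 1]."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:DIM]]


def vectorize_function(tagged_names, structure):
    """Build methodV from tagged identifiers and structure category labels.

    tagged_names: one list of (word, pos_tag) pairs per identifier.
    structure: list of category labels such as "ASSIGN" or "LOOP".
    """
    # Flatten identifiers while keeping word order, which carries meaning.
    pairs = [pair for name in tagged_names for pair in name][:MAX_WORDS]
    pad = MAX_WORDS - len(pairs)
    id_v = [x for word, _ in pairs for x in embed_word(word)] + [0.0] * (pad * DIM)
    pos_v = [float(POS_IDS.get(tag, 0)) for _, tag in pairs] + [0.0] * pad
    sig_v = [float(structure.count(kind)) for kind in STRUCT_KINDS]
    # Fixed concatenation order: idV, then posV, then sigV.
    return id_v + pos_v + sig_v


method_v = vectorize_function(
    [[("calculate", "VERB"), ("average", "NOUN")]],
    ["ASSIGN", "LOOP", "ASSIGN"],
)
```

With these settings every `methodV` has `MAX_WORDS * DIM + MAX_WORDS + len(STRUCT_KINDS)` components, which gives the network a fixed input width.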
Step 4: Generate the training data set.
Specifically:
Step 4.1: Concatenate any two vectors method_1V and method_2V obtained in step 3 to form the input data InputV of the neural network.
Step 4.2: Use the label in the sample library that marks whether the two functions are semantically similar code as the output data OutputV of the neural network.
Step 4.3: Combine the input data InputV of step 4.1 with the output data OutputV of step 4.2 to generate the training data trainV.
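Step 4 can be sketched as follows, assuming each function's `methodV` is stored under a function identifier and that the sample library's labels are available as a set of similar pairs. The names `build_training_set`, `method_vectors` and `similar_pairs`, and the 1.0/0.0 label encoding, are hypothetical.

```python
from itertools import combinations


def build_training_set(method_vectors, similar_pairs):
    """Pair up functions and attach the similarity label from the sample library.

    method_vectors: {function_id: methodV as a list of floats}
    similar_pairs: set of frozensets, each holding two function ids labelled similar
    """
    train = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(method_vectors.items()), 2):
        input_v = vec_a + vec_b  # InputV: the two methodV vectors concatenated
        output_v = 1.0 if frozenset((id_a, id_b)) in similar_pairs else 0.0  # OutputV
        train.append((input_v, output_v))  # one trainV entry
    return train


vectors = {"f": [1.0, 0.0], "g": [0.9, 0.1], "h": [0.0, 1.0]}
labels = {frozenset(("f", "g"))}
train_v = build_training_set(vectors, labels)
```

This sketch emits each unordered pair once. Whether both orderings of a pair should be included is a design choice the patent leaves open.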
Step 5: Train the neural network to obtain the network model DeepCloneClassifier.
Specifically:
Step 5.1: Initialize a deep neural network comprising an input layer, a hidden layer and an output layer.
Step 5.2: Input the training set data obtained in step 4 into the neural network in sequence and train the network to obtain the model DeepCloneClassifier.
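The patent names the model DeepCloneClassifier but does not give its layer sizes, activation functions, loss or optimizer. The sketch below is therefore only one possible reading: a single tanh hidden layer, a sigmoid output, binary cross-entropy, and plain stochastic gradient descent, written in pure Python on toy data. All hyperparameters and the four training pairs are assumptions.

```python
import math
import random


class DeepCloneClassifier:
    """Toy network: input layer -> one tanh hidden layer -> sigmoid output."""

    def __init__(self, n_inputs, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        z = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        """Return the similarity score for one concatenated pair."""
        return self._forward(x)[1]

    def train(self, data, epochs=2000, lr=0.1):
        """Stochastic gradient descent on binary cross-entropy."""
        for _ in range(epochs):
            for x, y in data:
                hidden, out = self._forward(x)
                d_out = out - y  # dLoss/dz for sigmoid plus cross-entropy
                for j, h in enumerate(hidden):
                    # Back-propagate through tanh before updating w2[j].
                    d_hidden = d_out * self.w2[j] * (1.0 - h * h)
                    self.w2[j] -= lr * d_out * h
                    self.b1[j] -= lr * d_hidden
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * d_hidden * xi
                self.b2 -= lr * d_out


def mean_loss(model, data):
    """Average binary cross-entropy over (InputV, OutputV) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = model.predict(x)
        total -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return total / len(data)


# Toy trainV: pairs whose two halves are close are labelled similar (1.0).
TRAIN = [
    ([1.0, 0.0, 0.9, 0.1], 1.0),
    ([0.0, 1.0, 0.1, 0.9], 1.0),
    ([1.0, 0.0, 0.0, 1.0], 0.0),
    ([0.0, 1.0, 1.0, 0.0], 0.0),
]

model = DeepCloneClassifier(n_inputs=4)
loss_before = mean_loss(model, TRAIN)
model.train(TRAIN)
loss_after = mean_loss(model, TRAIN)
```

The toy data are not linearly separable, since deciding whether two halves match is an XOR-like problem, which is why the sketch includes a hidden layer. A production model would normally be built with a deep learning framework and trained on far more data.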
Step 6: Generate the test set.
Specifically:
Step 6.1: For the open source code to be detected, perform steps 1 to 3 again to extract the feature information of each function and represent it as a vector.
Step 6.2: Concatenate each vector generated in step 6.1, in turn, with each vector generated in step 3 to obtain the test set.
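Step 6.2 amounts to a Cartesian product between the functions under test and the functions already in the sample library. The sketch below uses hypothetical names (`build_test_set`, `candidate_vectors`, `library_vectors`) and keeps the pair of identifiers next to each concatenated vector so that results can be traced back to functions.

```python
from itertools import product


def build_test_set(candidate_vectors, library_vectors):
    """Pair every function under test with every function from the sample library.

    Both arguments map a function id to its methodV (a list of floats).
    """
    return [
        ((cand_id, lib_id), cand_vec + lib_vec)
        for (cand_id, cand_vec), (lib_id, lib_vec)
        in product(candidate_vectors.items(), library_vectors.items())
    ]


candidates = {"new_fn": [0.2, 0.8]}
library = {"f": [1.0, 0.0], "h": [0.0, 1.0]}
test_set = build_test_set(candidates, library)
```

The test set grows as the number of candidate functions times the number of library functions, which is worth keeping in mind for large sample libraries.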
Step 7: Query the neural network for prediction.
Specifically:
Step 7.1: Input the test set data obtained in step 6 into the network model DeepCloneClassifier in sequence as queries.
Step 7.2: Mark the test data whose model output is higher than the threshold T as semantically similar code, thereby completing the online detection of semantically similar code.
The threshold T can be adjusted dynamically according to the learning and prediction results, so that detection accuracy improves continuously.
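Step 7.2 and the threshold adjustment can be sketched as follows. The patent does not say how T is adjusted, so `tune_threshold` is a hypothetical strategy: it picks, from a fixed list of candidate values, the T with the best F1 score on labelled validation pairs. The scores and labels below are made-up illustration data.

```python
def detect_similar(scored_pairs, threshold):
    """Keep the pairs whose model output is higher than the threshold T."""
    return [pair for pair, score in scored_pairs if score > threshold]


def tune_threshold(scores, labels, candidates=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return the candidate T with the best F1 on labelled validation pairs."""
    def f1(t):
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # max() returns the first candidate that reaches the best F1.
    return max(candidates, key=f1)


validation_scores = [0.95, 0.85, 0.65, 0.35, 0.2]
validation_labels = [1, 1, 0, 0, 0]
T = tune_threshold(validation_scores, validation_labels)

scored = [(("new_fn", "f"), 0.95), (("new_fn", "h"), 0.35)]
similar = detect_similar(scored, T)
```

Re-running `tune_threshold` as new labelled results arrive is one way to realize the dynamic adjustment. Other criteria, such as a fixed precision target, would fit the text equally well.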

Claims (4)

1. A semantic similar code online detection method based on a deep learning technology is characterized by comprising the following steps:
step 1: extracting function information from a sample library by using a static code analysis technology;
firstly, analyzing source codes in a sample library by using a static analysis technology, extracting structure information and text identifier information of each function, and then storing each extracted function information as a record in a database; the sample library stores source code data of open source projects, wherein the source code data comprises labeled semantic similar codes and semantic dissimilar codes;
step 2: utilizing a natural language processing technology to preprocess the extracted function structure information;
step 2.1: classifying and symbolizing the function structure information extracted in the step 1;
step 2.2: performing natural language processing on the function text identifier information extracted in step 1, wherein the natural language processing comprises word segmentation, abbreviation expansion and part-of-speech tagging, specifically:
dividing each identifier name into a series of words and abbreviations <w_1, w_2, …, w_i, …, w_k>; if w_i is an abbreviation, expanding it into a word by an identifier expansion technique and then carrying out step 2.3; otherwise, directly carrying out step 2.3;
step 2.3: performing part-of-speech tagging on the word sequence obtained in step 2.2 by using a part-of-speech tagging technique;
step 3: carrying out data vectorization representation on the word information and the symbolized information;
step 3.1: mapping each word w_i obtained in step 2.2 to a vector V_i, so that an identifier maps to a series of vectors nameV = <V_1, V_2, …, V_i, …, V_k> and the identifier text of the function maps to idV = <name_1V, name_2V, …, name_iV, …, name_tV>, wherein idV comprises t names and each name comprises k vectors V;
step 3.2: mapping the part-of-speech tagging information obtained in the step 2.3 into a vector posV;
step 3.3: mapping the symbolized information obtained in the step 2 into a vector sigV;
step 3.4: connecting vectors idV, posV and sigV to generate a vectorized representation of the input function methodV, and the order of vector connection in step 3.4 cannot be changed;
step 4: generating a training data set;
step 4.1: connecting any two vectors method_1V and method_2V obtained in step 3 as the input data InputV of the neural network;
step 4.2: setting whether two functions marked in the sample library are label information of semantic similar codes as output data OutputV of the neural network;
step 4.3: connecting the input data InputV of the step 4.1 with the output data OutputV of the step 4.2 to generate training data trainV;
step 5: training a neural network to obtain a network model DeepCloneClassifier;
step 5.1: initializing a deep neural network comprising an input layer, a hidden layer and an output layer;
step 5.2: inputting the training set data obtained in the step 4 into a neural network in sequence, and training the network to obtain a model DeepCloneClassifier;
step 6: generating a test set;
step 6.1: for the open source code to be detected, repeatedly executing the steps 1 to 3, extracting the characteristic information of each function and performing vectorization representation;
step 6.2: sequentially connecting each vector generated in the step 6.1 with each vector generated in the step 3 to obtain a test set;
step 7: querying the neural network for prediction;
step 7.1: sequentially inputting the test set data obtained in the step 6 into a network model DeepCloneClassifier to query;
step 7.2: and marking the test data with the model output result higher than the threshold value T as the semantic similar codes, thereby completing the on-line detection of the semantic similar codes.
2. The method for detecting semantic similar codes on line based on deep learning technology as claimed in claim 1, wherein in step 1, the structural information includes assignment statements, function call statements, selection statements and loop statements.
3. The method for detecting semantic similar codes on line based on deep learning technology as claimed in claim 1, wherein in step 1, the text identifier information includes function name, argument name, all variable names accessed in the function and all method names called in the function.
4. The method for on-line detection of semantic similar codes based on deep learning technology as claimed in claim 1, wherein in step 7, the threshold T is dynamically adjusted according to the learning and prediction results.
CN202110184538.2A 2021-02-10 2021-02-10 Semantic similar code online detection method based on deep learning Expired - Fee Related CN112835620B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110184538.2A CN112835620B (en) 2021-02-10 2021-02-10 Semantic similar code online detection method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110184538.2A CN112835620B (en) 2021-02-10 2021-02-10 Semantic similar code online detection method based on deep learning

Publications (2)

Publication Number Publication Date
CN112835620A (en) 2021-05-25
CN112835620B (en) 2022-03-25

Family

ID=75933489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110184538.2A Expired - Fee Related CN112835620B (en) 2021-02-10 2021-02-10 Semantic similar code online detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN112835620B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116820564A (en) * 2023-07-06 2023-09-29 四川大学 Unified form semanticalization method of program language

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190243621A1 (en) * 2018-02-06 2019-08-08 Smartshift Technologies, Inc. Systems and methods for code clustering analysis and transformation
US20200104631A1 (en) * 2018-09-27 2020-04-02 International Business Machines Corporation Generating vector representations of code capturing semantic similarity
US20200133756A1 (en) * 2018-10-26 2020-04-30 EMC IP Holding Company LLC Method, apparatus and computer storage medium for error diagnostics of an application
CN111124487A (en) * 2018-11-01 2020-05-08 浙江大学 Code clone detection method and device and electronic equipment
CN110413319A (en) * 2019-08-01 2019-11-05 北京理工大学 A kind of code function taste detection method based on deep semantic
CN112215013A (en) * 2020-11-02 2021-01-12 天津大学 Clone code semantic detection method based on deep learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GUANGJIE LI et al.: "A Deep Learning Based Approach to Detect Code Clones", 《2020 INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND HUMAN-COMPUTER INTERACTION (ICHCI)》 *
HUI LIU et al.: "Deep Learning Based Code Smell Detection", 《IEEE TRANSACTIONS ON SOFTWARE ENGINEERING》 *
LIUQING LI et al.: "CCLearner: A Deep Learning-Based Clone Detection Approach", 《2017 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION (ICSME)》 *
STAY_FOOLISH12: "Application of deep learning models such as DSSM, CNN-DSSM and LSTM-DSSM to computing semantic similarity, plus distance operations", 《HTTPS://BLOG.CSDN.NET/STAY_FOOLISH12/ARTICLE/DETAILS/107484368》 *
卜依凡 et al.: "A deep-learning-based method for detecting God classes", 《Journal of Software (软件学报)》 *
胡艺: "A code vulnerability detection method based on deep learning", 《China Master's Theses Full-text Database, Information Science and Technology》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116820564A (en) * 2023-07-06 2023-09-29 四川大学 Unified form semanticalization method of program language
CN116820564B (en) * 2023-07-06 2024-04-02 四川大学 Unified form semanticalization method of program language

Also Published As

Publication number Publication date
CN112835620B (en) 2022-03-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220325