CN112328469B - Function level defect positioning method based on embedding technology - Google Patents

Function level defect positioning method based on embedding technology

Publication number
CN112328469B
Authority
CN
China
Prior art keywords
defect
function
model
report
defect report
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011136892.XA
Other languages
Chinese (zh)
Other versions
CN112328469A (en)
Inventor
邹卫琴
李恩铭
房春荣
张静宣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202011136892.XA
Publication of CN112328469A
Application granted
Publication of CN112328469B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3684 Test management for test design, e.g. generating new test cases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3688 Test management for test execution, e.g. scheduling of test suites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a function-level defect localization method based on embedding technology. It uses a code embedding technique based on the abstract syntax tree to represent the functional semantics of function code, and a word embedding technique to represent the problem semantics of defect reports. A convolutional neural network then fuses the semantic features of the function and the defect report and predicts the suspicious functions related to a given defect. To address limited training data, the invention represents defect reports and code with pre-trained models; to address the imbalance in the number of instances per class, it applies random oversampling. Experiments on three mainstream Java projects show that, with a recommendation list of length 10, the accuracy of the method reaches 12.5% to 40%, which indicates potential for fine-grained defect localization on mainstream Java software projects.

Description

Function level defect positioning method based on embedding technology
Technical Field
The invention relates to a function level defect positioning method based on an embedding technology, and belongs to the technical field of defect positioning and repairing in a software development process.
Background
Software defect localization is an important part of software development and maintenance, and plays an essential role in assuring and improving the quality of software products. Defect localization techniques have received extensive attention from academia over the past decades, and software practitioners in industry and open-source communities also regard them as among the most valuable software techniques. A defect localization technique generally locates the suspicious code that contains a software defect from dynamic execution information of the program, static text information, or similar sources. The localization granularity may be the source code file, the function, the code line, and so on. Through questionnaires and interviews, Kochhar et al. found that practitioners consider the function to be the most important localization granularity. Function-level defect localization therefore has significant practical value in software development.
One of the core challenges of function-level defect localization is how to accurately extract the functional semantics implied by function code (a function contains far less code than a source file, which makes its semantics harder to extract) and the problem semantics described by a defect report (the software problem must be characterized from natural language text that is complicated and possibly noisy). Several researchers have worked on this. Lukins et al. propose building a topic model with LDA over a document set constructed from functions, and then querying the LDA model with keywords manually extracted from (or added to) the textual description of a defect report to retrieve the suspicious functions associated with that report. Zhang et al. propose combining the LDA topic model with the standard vector space model (VSM) at multiple abstraction levels to model defect reports and the function document set, and retrieving the suspicious functions for a given defect by computing the cosine similarity between the two. Youm et al. propose first locating suspicious source code files and then locating suspicious functions within the 10 most suspicious files. Zhang et al. propose using word2vec to represent defect reports and function code uniformly, and also propose expanding functions to mitigate the poor localization results on short functions.
Despite this body of research on function-level defect localization, existing work still has two shortcomings in semantic characterization.
First, the functional semantics of code are characterized insufficiently. When existing work extracts the semantics of a function, it treats the code as ordinary natural language text and ignores its structural characteristics. In general, if the same code statements are organized differently (e.g., A -> B -> C becomes B -> A -> C), the corresponding functionality is likely to differ. Ignoring the structure of the code loses the additional functional semantics that the structure carries, which reduces the accuracy of defect localization. Second, the extraction of defect report semantics ignores the contextual semantics of the text. Existing work mainly uses techniques such as VSM and LDA to represent the text of defect reports. These techniques often fail to capture the contextual semantics of the text, so the problem described in a defect report is characterized insufficiently, which in turn degrades defect localization. To solve these problems, the present invention proposes a function-level defect localization technique based on embedding technology.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a function-level defect localization method based on embedding technology. The method extracts the problem semantics of defect reports and the functional semantics of function code with word embedding and code embedding techniques, and then locates the function code related to a given defect, helping developers repair software defects more efficiently and improve software quality.
The invention adopts the following technical scheme for solving the technical problems:
A function-level defect localization method based on embedding technology comprises the following steps:
Step 1: acquire all defect reports of the software and preprocess each defect report to obtain the preprocessed defect reports.
Step 2: use a pre-trained model of the word embedding technique ELMO to extract semantic features for each word of a preprocessed defect report, obtaining one semantic feature vector per word; apply max pooling over the semantic feature vectors of all words of the report to obtain the semantic feature vector of that defect report.
Step 3: use a pre-trained model of the code2vec technique to extract semantic features for each function of the software, obtaining one semantic feature vector per function.
Step 4: pair one defect report, chosen from all defect reports, with one function, chosen from all functions of the software, to form an instance. The class label of an instance is 0 or 1, where 1 means the function is related to the defect report and 0 means it is unrelated. The features of an instance are the fusion of the semantic feature vectors of its function and its defect report. All instances obtained this way form the data set, which is divided into a training set, a validation set and a test set.
Step 5: train the convolutional neural network model on the training set, with a logistic regression classifier as the output layer. During training, handle the imbalance in the number of instances per class with a random oversampling strategy, and fine-tune the model on the validation set. The trained convolutional neural network model is the defect localization model.
Step 6: feed the test set into the defect localization model, which outputs a relevance value for each function and each defect report in the test set. For each defect report in the test set, sort all functions in the test set by their relevance value to that report in descending order, and output the functions with the top K relevance values as the problem function recommendation list of the report, where K is a preset value.
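The instance construction of step 4 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the names `Instance`, `build_instances` and `fixed_functions` are hypothetical, the toy vectors stand in for the real 768- and 384-dimensional vectors, and "fusion" is shown as plain concatenation, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Instance:
    report_id: str
    function_id: str
    features: List[float]  # report vector followed by function vector
    label: int             # 1 = function is related to the report, 0 = unrelated


def build_instances(
    report_vectors: Dict[str, List[float]],
    function_vectors: Dict[str, List[float]],
    fixed_functions: Dict[str, Set[str]],
) -> List[Instance]:
    """Pair every defect report with every function and attach a 0/1 label."""
    instances = []
    for report_id, report_vec in report_vectors.items():
        related = fixed_functions.get(report_id, set())
        for function_id, function_vec in function_vectors.items():
            instances.append(
                Instance(
                    report_id=report_id,
                    function_id=function_id,
                    # Assumption: fusion is shown here as plain concatenation.
                    features=list(report_vec) + list(function_vec),
                    label=1 if function_id in related else 0,
                )
            )
    return instances


# Toy data (hypothetical identifiers and tiny vectors).
reports = {"BUG-1": [0.1, 0.9], "BUG-2": [0.7, 0.2]}
functions = {"Foo.parse": [0.3], "Foo.render": [0.8], "Bar.size": [0.5]}
fixed = {"BUG-1": {"Foo.parse"}, "BUG-2": {"Bar.size", "Foo.render"}}

instances = build_instances(reports, functions, fixed)
print(len(instances), sum(i.label for i in instances))  # 6 3
```

Even in this toy case only half of the pairs are positive; with thousands of functions per project and one or two problem functions per defect, the positive class becomes very small, which is why step 5 includes oversampling.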
As a preferred embodiment of the present invention, preprocessing each defect report in step 1 to obtain a preprocessed defect report specifically comprises:
applying a series of preprocessing operations to each defect report, including word segmentation, stop word removal, symbol removal, digit removal, compound word splitting, stemming and conversion to lowercase, to obtain the preprocessed defect report.
As a preferred solution of the present invention, the dimension of the semantic feature vector corresponding to each word in step 2 is 768 dimensions.
As a preferred solution of the present invention, the dimension of the semantic feature vector corresponding to each function in step 3 is 384 dimensions.
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
1. the defect positioning method based on the embedding technology provided by the invention extracts the semantics of the defect report and the function code by utilizing the word embedding technology and the code embedding technology, and performs the defect positioning of the function granularity through semantic matching.
2. Aiming at the problem of limited training data, the invention provides the extraction of semantic features by using a pre-training model.
3. Aiming at the problem of unbalanced class instance number, the invention provides a random oversampling mechanism for processing, and the effect of the method is effectively improved.
4. The effectiveness of the method of the invention was evaluated by experiments on three major Java software projects. Experimental results show that when the number of the recommendation functions is 10, the method can accurately position 12.5% -40.0% of defect reports, and has great application potential in the field of fine-grained defect positioning.
Drawings
FIG. 1 is a flow chart of a function level defect positioning method based on an embedding technique according to the present invention.
FIG. 2 is a defect report pre-processing flow diagram.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As shown in FIG. 1, the present invention proposes a function-level defect localization method based on embedding technology. It aims to characterize the semantics of defect reports and function code with word embedding and code embedding techniques so that defects can be localized effectively at function granularity. The method covers extraction of the problem semantics of software defect reports and the functional semantics of function code, the use of pre-trained models, model construction and prediction, and the handling of class imbalance.
FIG. 1 shows the workflow of the proposed method, which comprises two phases: a semantic extraction phase and a model construction phase. The semantic extraction phase extracts semantic features of defect reports and function code with a word embedding model and a code embedding model, mainly using pre-trained embedding models. The model construction phase trains the defect localization model and uses it for prediction. We use a convolutional neural network to fuse the semantic feature vectors of defect reports and function code, and a logistic regression layer as the classifier that predicts the problem functions corresponding to a given defect report. Because problem functions are far fewer than normal functions, an imbalanced-class handling strategy, namely random oversampling, is built into model construction to address the imbalance in the number of instances per class.
Semantic feature extraction
Because word embedding and code embedding models have performed well in various software tasks, the invention uses them to extract the semantic features of defect reports and function code. A well-performing embedding model usually requires a large training set, whereas the data set of a typical software project is relatively small. We therefore propose to use pre-trained models to extract the semantics of defect reports and function code. A pre-trained model is built on an existing massive data set (such as large volumes of wiki text or GitHub code) and can be applied to a new software task either directly or after fine-tuning. Researchers have already applied pre-trained models in various software tasks.
Extracting the semantics of the defect report: Ye et al. found experimentally that, in the file-level defect localization task, a model trained on massive Wikipedia text did not differ significantly from a model trained on software project artifacts such as API documents. In the present invention, we use a pre-trained model of ELMO, a classical word embedding model that performs well in natural language processing (implemented by the Flair library, https://github.com/flairNLP/Flair), to extract the problem semantics of a defect report. The ELMO model uses a bidirectional LSTM (long short-term memory) network to extract features. ELMO first pre-trains word vectors with a language model and then dynamically adjusts them according to the context in the downstream task. The adjusted word vector expresses the specific meaning of the word in its context, so by exploiting context semantics ELMO addresses the polysemy problem (one word with several meanings) that word2vec cannot handle. The ELMO pre-trained model used in the invention outputs 768-dimensional semantic feature vectors; that is, every word in a defect report is represented by a 768-dimensional numeric vector. Max pooling over the 768-dimensional vectors of all words in a defect report yields the final semantic feature vector of the report.
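The max pooling step can be illustrated with a short Python sketch. It assumes the per-word vectors have already been produced by the embedding model; the function name `max_pool` is hypothetical, and 4-dimensional toy vectors stand in for the 768-dimensional ones.

```python
from typing import List, Sequence


def max_pool(word_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over all word vectors of one defect report."""
    if not word_vectors:
        raise ValueError("a defect report needs at least one word vector")
    dim = len(word_vectors[0])
    if any(len(v) != dim for v in word_vectors):
        raise ValueError("all word vectors must have the same dimension")
    # zip(*...) iterates over dimensions, so each column holds one
    # dimension's values across all words.
    return [max(column) for column in zip(*word_vectors)]


# Three words, four dimensions each (toy values).
toy_word_vectors = [
    [0.2, -1.0, 0.5, 0.0],
    [0.9, -0.3, 0.1, 0.0],
    [-0.4, 0.7, 0.3, -0.2],
]
report_vector = max_pool(toy_word_vectors)
print(report_vector)  # [0.9, 0.7, 0.5, 0.0]
```

Whatever the number of words in a report, the pooled vector has the same dimension as a single word vector, which gives the downstream network a fixed-size input.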
Before the pre-trained model is applied, the text of each defect report undergoes the series of preprocessing steps shown in FIG. 2, including word segmentation, stop word removal, symbol removal, compound word splitting, stemming with the Porter stemmer, and conversion to lowercase. The stop word list we use is the NLTK (https://www.nltk.org/) stop word list. We split compound words (e.g., WindowSize) into individual words according to capitalization (e.g., WindowSize -> Window + Size) and, following the practice of prior research, also retain the compound word in the data set. Some words differ in form but share the same stem (e.g., fetching and fetches); we use the Porter stemming method (https://tartarus.org/martin/PorterStemmer/) to reduce them to that stem (e.g., fetching and fetches both become fetch). All processed words are converted to lowercase. Finally, the preprocessed defect report is fed into the pre-trained word embedding model for semantic extraction.
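The preprocessing chain can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used in the experiments: the stop word set is a small hypothetical sample rather than the NLTK list, the stemmer is a pluggable callable that defaults to the identity (a Porter stemmer such as `nltk.stem.PorterStemmer().stem` could be passed in), and lowercasing is done before the stop word check for convenience.

```python
import re
from typing import Callable, List

# Small illustrative stop word sample; the text uses the NLTK list instead.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "when"}


def split_compound(token: str) -> List[str]:
    """Split a camel-case token at capital letters, e.g. WindowSize -> Window, Size."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", token)


def preprocess(text: str, stem: Callable[[str], str] = lambda w: w) -> List[str]:
    # Word segmentation: keep alphabetic runs only, which also drops
    # symbols and digits.
    tokens = re.findall(r"[A-Za-z]+", text)

    words: List[str] = []
    for token in tokens:
        parts = split_compound(token)
        # Compound word splitting: keep the original compound and add its parts.
        words.extend([token] + parts if len(parts) > 1 else [token])

    result = []
    for word in words:
        lower = word.lower()           # conversion to lowercase
        if lower in STOP_WORDS:        # stop word removal
            continue
        result.append(stem(lower))     # stemming (identity unless a stemmer is given)
    return result


print(preprocess("The WindowSize is 0 when fetching items!"))
# ['windowsize', 'window', 'size', 'fetching', 'items']
```

With a Porter stemmer passed as `stem`, the token fetching would be reduced to fetch.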
Extracting the semantics of the function code: existing function-level defect localization methods usually ignore the structural characteristics of function code when extracting its functional semantics, which greatly limits their effectiveness. In the present invention, we propose to use the code embedding technique code2vec (https://code2vec.org/) to represent the functional semantics of a function. code2vec is a deep learning model based on abstract syntax tree (AST) paths. It first converts the function code into an AST, then extracts all leaf-to-leaf paths from the AST, learns a weight and a vector representation for each path with an attention mechanism, and finally combines the vector representations of all paths through a fully connected layer to obtain the vector representation of the code snippet. By mining the structural information implied in the AST, code2vec can predict the function name of a given piece of code with high accuracy. This indicates to a large extent that the semantic vector code2vec generates for function code reflects what the function does. code2vec is therefore suitable for extracting the functional semantics of functions in function-level defect localization.
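The idea of leaf-to-leaf AST paths can be illustrated on a tiny hand-built tree. The sketch below is not the code2vec extractor: the `Node` class, the function names and the path notation (`^` for steps up, `_` for steps down) are assumptions chosen for illustration, and the tree stands for a statement such as `x = y + 1`.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def leaf_root_paths(node: Node, prefix: Tuple[Node, ...] = ()) -> List[Tuple[Node, ...]]:
    """Return, for every leaf, the chain of nodes from the root down to that leaf."""
    chain = prefix + (node,)
    if not node.children:
        return [chain]
    paths = []
    for child in node.children:
        paths.extend(leaf_root_paths(child, chain))
    return paths


def path_contexts(root: Node) -> List[Tuple[str, str, str]]:
    """Build (start leaf, connecting path, end leaf) triples for all leaf pairs."""
    contexts = []
    for a, b in combinations(leaf_root_paths(root), 2):
        # Count shared ancestors; the last shared node is the lowest common ancestor.
        shared = 0
        while shared < min(len(a), len(b)) and a[shared] is b[shared]:
            shared += 1
        up = [n.label for n in reversed(a[shared - 1:-1])]  # parent of leaf a up to the LCA
        down = [n.label for n in b[shared:-1]]              # below the LCA down to parent of leaf b
        path = "^".join(up)
        if down:
            path += "_" + "_".join(down)
        contexts.append((a[-1].label, path, b[-1].label))
    return contexts


# Toy AST for: x = y + 1
tree = Node("Assign", [Node("x"), Node("Add", [Node("y"), Node("1")])])
for context in path_contexts(tree):
    print(context)
# ('x', 'Assign_Add', 'y')
# ('x', 'Assign_Add', '1')
# ('y', 'Add', '1')
```

Reordering the statements of a function changes these paths even when the tokens stay the same, which is the structural information that a bag-of-words view of the code discards.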
In the invention, a pre-trained code2vec model is likewise used to extract the functional semantics of function code. The pre-trained code2vec embedding model was built on 14 million functions from the 10,000 most popular projects on GitHub. The diversity and popularity of the projects in the training data make it more plausible that the pre-trained model can capture the functionality of real software projects. With code2vec, each function is represented as a 384-dimensional real-valued vector that characterizes its functional semantics.
Model construction
In the present invention, we treat function-level defect localization as a learning task. In other words, for a given defect report we try to determine whether each function in the code base is relevant to repairing the defect. We use a convolutional neural network as the model training framework. The network takes the semantic vectors of the defect report and the function code, as output by the pre-trained models, as instance features; the class label of each instance states whether the function in the instance is related to the repair of the defect at hand. To counter the bias caused by the majority class, class imbalance is handled during model training. The details are as follows:
Model training: as described above, the pre-trained models produce semantic feature vectors of different dimensions for the defect report and the function code. To improve the learning ability and nonlinear expressiveness of the model, we design several fully connected layers in the convolutional neural network. Specifically, for the 384-dimensional semantic vector of the function code, the number of nodes per layer decreases as 384 -> 256 -> 128 -> 64 -> 32; the 768-dimensional semantic vector of the defect report passes through 5 fully connected layers for further feature extraction, namely 768 -> 512 -> 256 -> 128 -> 64 -> 32. A logistic regression classifier serves as the output layer of the convolutional neural network and predicts whether a function is relevant to the repair of a given software defect.
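The layer sizes above can be made concrete with a small forward-pass sketch in plain Python. It covers only the fully connected branches and the logistic output, and it rests on several assumptions that the text does not state: ReLU activations, fusion of the two 32-dimensional outputs by concatenation, and random untrained weights. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence

CODE_TOWER = [384, 256, 128, 64, 32]         # function-code branch
REPORT_TOWER = [768, 512, 256, 128, 64, 32]  # defect-report branch


def init_layers(sizes: Sequence[int], rng: random.Random):
    """Create one (weights, biases) pair per fully connected layer."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x: Sequence[float]) -> List[float]:
    """Apply each dense layer followed by ReLU (activation is an assumption)."""
    for weights, biases in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return list(x)


def relevance(report_vec, code_vec, report_layers, code_layers, out_w, out_b) -> float:
    """Fuse both branches (assumed: concatenation) and apply a logistic output unit."""
    fused = forward(report_layers, report_vec) + forward(code_layers, code_vec)  # 32 + 32 = 64
    z = sum(w * v for w, v in zip(out_w, fused)) + out_b
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
report_layers = init_layers(REPORT_TOWER, rng)
code_layers = init_layers(CODE_TOWER, rng)
out_w = [rng.uniform(-0.1, 0.1) for _ in range(64)]

score = relevance(
    [rng.random() for _ in range(768)],  # stand-in for an ELMO report vector
    [rng.random() for _ in range(384)],  # stand-in for a code2vec function vector
    report_layers, code_layers, out_w, 0.0,
)
print(len(report_layers), len(code_layers), 0.0 < score < 1.0)  # 5 4 True
```

Both branches end in 32 dimensions, so the two inputs contribute equally sized representations to the fused vector even though they start at 768 and 384 dimensions.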
During model training, the whole data set is divided into three parts: a training set, a validation set and a test set. The training set is used to train the model, the validation set to fine-tune it, and the test set to evaluate the final prediction model built on the training and validation sets. The features of each instance are the combination of the semantic feature vectors of a defect report and of a function. The class label of an instance is 0 or 1, where 1 means the function in the instance is related to the defect repair and 0 means it is unrelated. All instances are sorted by the time of their defect reports and divided into 10 equal parts. The first 8 parts are used as the training set, the 9th part as the validation set, and the last part as the test set. The evaluation criteria for the localization model are Accuracy@K, MAP and MRR.
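The chronological 8/1/1 split can be sketched as follows. The function name `chronological_split` and the toy report tuples are hypothetical; the input is assumed to be sorted by report time already.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(items: Sequence[T], parts: int = 10) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered items into train (first 8 parts), validation (9th), test (10th)."""
    n = len(items)
    bounds = [round(i * n / parts) for i in range(parts + 1)]
    train = list(items[: bounds[parts - 2]])
    validation = list(items[bounds[parts - 2]: bounds[parts - 1]])
    test = list(items[bounds[parts - 1]:])
    return train, validation, test


# Toy data: (ISO date, report id), deliberately given in reverse order.
reports = [(f"2020-01-{day:02d}", f"BUG-{day}") for day in range(20, 0, -1)]
ordered = sorted(reports)  # ISO dates sort chronologically as strings

train, validation, test = chronological_split(ordered)
print(len(train), len(validation), len(test))  # 16 2 2
print(test[-1])  # ('2020-01-20', 'BUG-20')
```

Splitting by time rather than at random keeps later defect reports out of the training data, so the evaluation resembles predicting for reports that arrive after the model was built.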
Handling class imbalance: model training faces the problem that the classes have very different numbers of instances. In other words, for a given defect report, the number of problem functions related to its repair is much smaller than the number of unrelated functions. Training directly on such an imbalanced data set biases the model against the minority class. When the minority class is the one that really matters (problem functions are the class the project actually cares about), the model is then unlikely to make reasonable predictions.
To solve this problem, the project data must be rebalanced. There are two traditional strategies for imbalanced data sets: random undersampling and random oversampling. Random undersampling balances the classes by reducing the number of majority-class instances (typically by randomly drawing as many majority-class instances as there are minority-class instances), whereas random oversampling balances them by increasing the number of minority-class instances (typically by repeated random sampling of minority-class instances). Because the minority class (the problem functions) is the class that defect localization cares about most, and because a convolutional neural network needs a certain amount of data, the invention uses random oversampling to handle class imbalance in defect localization.
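Random oversampling as described above can be sketched in a few lines of Python. This is a generic illustration, not the interface used in the experiments; `random_oversample` and the `Example` type are hypothetical names, and the sketch assumes binary labels 0 and 1.

```python
import random
from typing import List, Tuple

Example = Tuple[List[float], int]  # (features, label)


def random_oversample(data: List[Example], seed: int = 0) -> List[Example]:
    """Duplicate randomly chosen minority-class examples until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [ex for ex in data if ex[1] == 1]
    negatives = [ex for ex in data if ex[1] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        raise ValueError("cannot oversample an empty class")
    # Sample with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = data + extra
    rng.shuffle(balanced)
    return balanced


# Toy training set: 2 problem functions (label 1) and 8 unrelated ones (label 0).
toy = [([float(i)], 1 if i < 2 else 0) for i in range(10)]
balanced = random_oversample(toy, seed=42)
print(len(balanced), sum(label for _, label in balanced))  # 16 8
```

Oversampling is normally applied to the training set only, so that the validation and test sets keep the natural class distribution.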
Prediction: once the defect localization model has been trained, a new defect report first goes through word segmentation, stop word removal, symbol removal, compound word splitting and conversion to lowercase, and is then passed through the pre-trained word embedding model and max pooling to obtain its semantic features. These features are combined with the semantic vector of each function in the project code corresponding to the defect report to form the instances to be predicted. The trained localization model outputs, for each function, a relevance value between 0 and 1 with respect to the new defect report; the larger the value, the more relevant the function is to the report, that is, the more likely it is a problem function. Finally, the model sorts all functions by relevance value in descending order and outputs the problem function recommendation list.
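The final ranking step can be sketched as follows, assuming the relevance values have already been produced by the trained model. The function name `recommend_top_k`, the function identifiers and the scores are all made up for illustration.

```python
from typing import Dict, List, Tuple


def recommend_top_k(scores: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Sort functions by descending relevance value and keep the first k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical relevance values for one new defect report.
scores = {"Parser.parse": 0.12, "Lexer.next": 0.87, "Node.clone": 0.45, "Util.trim": 0.03}
top2 = recommend_top_k(scores, k=2)
print(top2)  # [('Lexer.next', 0.87), ('Node.clone', 0.45)]
```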
To evaluate the proposed function-level defect localization method, we conducted experiments on three mainstream Java software projects: closure-compiler (Closure), commons-lang (Lang) and commons-math (Math). The three projects come from different domains and differ in size, which provides a good data basis for evaluating the method. The data for the three projects comes from Defects4J (https://github.com/rjust/Defects4j), an authoritative project data set in the field of defect localization and repair. Defects4J contains a collection of real, reproducible software defects. For each defect, the maintainers of Defects4J manually removed the code modifications unrelated to its repair, which largely guarantees an accurate match between software defects and problem code. Defects4J thus provides a good benchmark and is used to evaluate a variety of defect localization and repair techniques. When using the Closure, Math and Lang data provided by Defects4J, we found that Defects4J provides only the code modifications related to each defect repair and does not directly provide the problem functions, so we had to extract the problem functions related to each repair ourselves. In the invention, the problem functions involved in the repair of each defect are extracted mainly by analyzing the project code versions before and after the repair with the git diff tool and the meld tool (https://meldmerge.org). During extraction, we treat every function touched by the code modification as a problem function related to the defect. Table 1 lists basic information about the three experimental projects.
TABLE 1 Basic information of the three mainstream Java projects
Project   Number of defects   Average number of files   Average number of functions   Average number of problem functions
Closure   156                 387.69                    5,662.25                      1.64
Math      85                  497.79                    3,326.40                      1.38
Lang      56                  89.34                     1,868.96                      1.23
In the experiments, when locating the possible problem functions of a software defect, we roll the project code back to the version just before that defect was repaired, so that defect localization matches real software development more closely. Each average in the table is the mean, over all defects of the project, of the number of the relevant code elements (such as files or functions) in the corresponding project code version.
To evaluate the function-level defect localization technique proposed in the present invention, we use three common evaluation metrics: Accuracy@K, MAP and MRR. Table 2 shows the experimental results of the proposed method on Closure, Math and Lang.
As Table 2 shows, the MAP of the method on Closure, Math and Lang is 11.36%, 7.13% and 10.81%, respectively, with the best result on Closure. The MRR on Closure, Math and Lang is 22.45%, 12.80% and 15.46%, respectively, again with the best result on Closure. On Accuracy@K, the method shows considerable potential on all three projects. For example, Accuracy@1 on the three projects is 20%, 12.5% and 13.33%, respectively. That is, when only 1 problem function is recommended, the method hits 20% of the defect reports in Closure, 12.5% in Math and 13.33% in Lang.
TABLE 2 Experimental evaluation of the method of the invention on three mainstream Java projects (%)
[Table 2 appears as an image in the original publication (Figure BDA0002736982560000091).]
When the recommendation list length K is increased to 10, Accuracy@10 on Closure, Math and Lang reaches 40%, 12.5% and 13.33%, respectively; that is, the hit rate on Closure reaches 40%. When K is increased to 20, the accuracy on Lang rises from 13.33% to 33.33%.
All experiments were run on an x86_64 Ubuntu server running Ubuntu 16.04.5, with a single 6-core Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz and 64 GB of memory. When constructing the data set, we applied preprocessing operations such as word segmentation and stop word removal to the text of each defect report using natural language processing interfaces of the Apache Lucene library (https://lucene.apache.org/), such as TokenStream and StopAnalyzer. The text was then fed into the ELMO pre-trained model and max-pooled to obtain its semantic feature vector. The function code of each project was fed into the code2vec pre-trained model to obtain the semantic feature vector of each function. To construct the defect localization model, we used the deep learning library Keras (https://keras.io/). The sampling strategy for handling class imbalance was implemented with the random oversampling function interface provided by Keras.
In the experiments, the feature of each training example is the combination of the semantic feature vectors of a defect report and a function, and the number of training examples is the product of the number of defect reports and the number of project functions (for the specific scale of the experimental data, see the basic statistics of the three experimental software projects listed in Table 1). To respect the chronological order of the defect reports, during model training and validation all examples are sorted by the time of their defect reports and cut into 10 equal parts, of which the first 8 parts are used for training, the next 1 part for validation, and the last 1 part for testing.
We use three evaluation metrics, Accuracy@K, MAP and MRR, to evaluate the performance of the method proposed in the present invention. All three are commonly used evaluation metrics for defect localization techniques. They are calculated as follows:
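The time-ordered split can be sketched as follows. This is a minimal illustration that assumes the examples are already sorted from oldest to newest defect report; the function name `chronological_split` and the integer toy data are hypothetical, and how ties between examples of the same report are ordered is not specified in the text.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    examples: Sequence[T], parts: int = 10, train_parts: int = 8, val_parts: int = 1
) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered examples into train/validation/test without shuffling."""
    if train_parts + val_parts >= parts:
        raise ValueError("at least one part must remain for testing")
    n = len(examples)
    # Integer boundaries of the training block and the validation block.
    train_end = n * train_parts // parts
    val_end = n * (train_parts + val_parts) // parts
    return (
        list(examples[:train_end]),
        list(examples[train_end:val_end]),
        list(examples[val_end:]),
    )


# Toy data: 20 examples already sorted by defect-report time (oldest first).
ordered = list(range(20))
train, val, test = chronological_split(ordered)
print(len(train), len(val), len(test))  # 16 2 2
```

Keeping the order intact means the model is always validated and tested on reports that are newer than the ones it was trained on.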
Figure BDA0002736982560000101
Figure BDA0002736982560000102
Figure BDA0002736982560000103
Figure BDA0002736982560000104
Figure BDA0002736982560000105
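The five equations are reproduced only as images in this copy. For reference, the standard forms of these metrics, which are consistent with the description that follows, are sketched below. The notation ($R_q$, $N$, $\mathrm{rel}(k)$, $\mathrm{rank}_q$) and the assignment of equation numbers are assumptions, not a transcription of the original figures.

```latex
\begin{align}
\mathrm{Accuracy@}K &= \frac{\left|\{\, q \in Q : \text{at least one relevant function is ranked in the top } K \,\}\right|}{|Q|} \times 100\% \tag{1}\\
\mathrm{MAP} &= \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AvgP}(q) \tag{2}\\
\mathrm{AvgP}(q) &= \frac{1}{|R_q|} \sum_{k=1}^{N} \mathrm{Precision@}k \cdot \mathrm{rel}(k),
\qquad \mathrm{Precision@}k = \frac{\text{number of relevant functions in the top } k}{k} \tag{3}\\
\mathrm{MRR} &= \frac{1}{|Q|} \sum_{q \in Q} \mathrm{RR}(q) \tag{4}\\
\mathrm{RR}(q) &= \frac{1}{\mathrm{rank}_q} \tag{5}
\end{align}
```

Here $Q$ is the set of defect reports used as queries, $R_q$ is the set of functions relevant to report $q$, $N$ is the number of ranked functions, $\mathrm{rel}(k)$ is 1 if the function at position $k$ is relevant and 0 otherwise, and $\mathrm{rank}_q$ is the position of the first relevant function for $q$.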
Accuracy@K in equation (1) is the percentage of defect reports for which the recommendation is accurate. For a given defect, the recommendation is deemed accurate if at least one of the K recommended functions is related to the defect. For example, suppose there are 500 defect reports and the length K of the recommended problem function list is 5. If, for 200 of the defect reports, at least one of the corresponding problem functions is among the 5 recommended suspicious functions, then the Accuracy@5 of the method is 200/500 × 100% = 40%. MAP, calculated by equations (2) and (3), is often used to evaluate the performance of information-retrieval-based techniques; it reflects the average precision over all retrieval queries. In equation (2), Q is the set of queries formed from all defect reports (i.e., each defect report is taken as one retrieval query), and Precision@k in equation (3) is the precision of the recommendation list truncated at length k. MRR, calculated by equations (4) and (5), evaluates the position at which the first relevant function (i.e., a problem function associated with the defect fix) appears in the recommended list of problem functions. The earlier the first relevant function appears in the list, the larger the value of the metric. For example, suppose there are three defect reports A, B and C, and the first problem functions associated with them appear at positions 1, 5 and 10 of their recommendation lists, respectively. The RR values of A, B and C are then 1/1, 1/5 and 1/10, so the final MRR for the set {A, B, C} is (1/1 + 1/5 + 1/10)/3 × 100% = 43.33%.
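The three metrics can be checked numerically with a short sketch that reproduces the MRR example above. The function names and the toy rankings are hypothetical, and `average_precision` normalizes by the total number of relevant functions, which is one common convention and an assumption here.

```python
from typing import Callable, Dict, List, Set


def accuracy_at_k(rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int) -> float:
    """Fraction of reports with at least one relevant function in the top k."""
    hits = sum(1 for q, ranked in rankings.items() if set(ranked[:k]) & relevant[q])
    return hits / len(rankings)


def reciprocal_rank(ranked: List[str], rel: Set[str]) -> float:
    """1 / position of the first relevant function, or 0.0 if none is ranked."""
    for position, func in enumerate(ranked, start=1):
        if func in rel:
            return 1.0 / position
    return 0.0


def average_precision(ranked: List[str], rel: Set[str]) -> float:
    """Mean of Precision@k taken at each position k that holds a relevant function."""
    hits, total = 0, 0.0
    for position, func in enumerate(ranked, start=1):
        if func in rel:
            hits += 1
            total += hits / position
    return total / len(rel) if rel else 0.0


def mean_over_reports(
    metric: Callable[[List[str], Set[str]], float],
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
) -> float:
    """Average a per-report metric over all reports (gives MRR or MAP)."""
    return sum(metric(rankings[q], relevant[q]) for q in rankings) / len(rankings)


def ranking_with_hit_at(position: int, length: int = 10) -> List[str]:
    """Toy ranking whose only relevant function, "bug", sits at the given position."""
    return ["bug" if i == position else f"f{i}" for i in range(1, length + 1)]


# Reports A, B, C with the first relevant function at positions 1, 5 and 10.
rankings = {"A": ranking_with_hit_at(1), "B": ranking_with_hit_at(5), "C": ranking_with_hit_at(10)}
relevant = {q: {"bug"} for q in rankings}

mrr = mean_over_reports(reciprocal_rank, rankings, relevant)
print(f"MRR = {mrr:.2%}")                                          # MRR = 43.33%
print(f"Accuracy@5 = {accuracy_at_k(rankings, relevant, 5):.2%}")  # Accuracy@5 = 66.67%
```

Because each toy report has exactly one relevant function, `mean_over_reports(average_precision, rankings, relevant)` gives the same value as the MRR in this example; the two metrics differ once a report has several relevant functions.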
The above embodiments merely illustrate the technical idea of the present invention and do not limit its scope of protection; any modification made to the technical solution in accordance with the technical idea of the present invention falls within the scope of protection of the present invention.

Claims (4)

1. A function-level defect positioning method based on embedding technology, characterized by comprising the following steps:
step 1, acquiring all defect reports of the software, and preprocessing each defect report to obtain preprocessed defect reports;
step 2, using a pre-trained model of the word embedding technique ELMo to extract semantic features of each word of each preprocessed defect report, obtaining a semantic feature vector for each word; performing max pooling over the semantic feature vectors of all words of each preprocessed defect report to obtain the semantic feature vector of that defect report;
step 3, using a pre-trained model of the code2vec technique to extract semantic features of each function of the software, obtaining a semantic feature vector for each function;
step 4, selecting one defect report from all defect reports and one function from all functions of the software to form an example, wherein the class label of the example is 0 or 1, 1 indicating that the function in the example is related to the defect report and 0 indicating that it is unrelated, and the feature of each example is the fusion of the semantic feature vectors of the function and the defect report in the example, thereby obtaining all examples; taking all examples as a data set, and dividing the data set into a training set, a validation set and a test set;
step 5, training a convolutional neural network model with the training set, using a logistic regression classifier as the output layer of the convolutional neural network model, handling the imbalance in the number of examples per class with a random oversampling strategy during training, and fine-tuning the model with the validation set, thereby obtaining the trained convolutional neural network model, namely the defect positioning model;
step 6, inputting the test set into the defect positioning model, which outputs a relevance value between each function and each defect report in the test set; for each defect report in the test set, sorting all functions in the test set by their relevance value to that defect report in descending order, and outputting the functions corresponding to the top K relevance values as the problem function recommendation list of the defect report, wherein K is a preset value.
2. The function-level defect positioning method based on embedding technology according to claim 1, wherein in step 1 each defect report is preprocessed to obtain a preprocessed defect report as follows:
performing a series of preprocessing operations on each defect report, including word segmentation, stop-word removal, symbol removal, digit removal, splitting of compound words, stemming and conversion of uppercase to lowercase, finally obtaining the preprocessed defect report.
3. The function-level defect positioning method based on embedding technology according to claim 1, wherein the semantic feature vector corresponding to each word in step 2 has 768 dimensions.
4. The function-level defect positioning method based on embedding technology according to claim 1, wherein the semantic feature vector corresponding to each function in step 3 has 384 dimensions.
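Steps 5 and 6 of claim 1 mention random oversampling and top-K recommendation; both can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the implementation used in the experiments: the helper names `random_oversample` and `recommend_top_k`, the toy feature vectors and the function names in `scores` are all hypothetical, and the relevance scores stand in for the outputs of a trained model.

```python
import random
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, class label 0 or 1)


def random_oversample(examples: List[Example], seed: int = 0) -> List[Example]:
    """Duplicate randomly chosen minority-class examples until both classes are the same size."""
    rng = random.Random(seed)
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        raise ValueError("both classes must be present")
    # Sample with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = examples + extra
    rng.shuffle(balanced)
    return balanced


def recommend_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k functions with the highest relevance scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy training set: one related (label 1) example and four unrelated (label 0) ones.
train = [([0.1], 1)] + [([float(i)], 0) for i in range(4)]
balanced = random_oversample(train)
labels = [label for _, label in balanced]
print(labels.count(1), labels.count(0))  # 4 4

# Toy relevance scores for one defect report (made-up values).
scores = {"parseArgs": 0.12, "visitNode": 0.87, "foldConstants": 0.55}
print(recommend_top_k(scores, 2))        # ['visitNode', 'foldConstants']
```

Oversampling is applied to the training set only, so the validation and test sets keep their natural class distribution.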
CN202011136892.XA 2020-10-22 2020-10-22 Function level defect positioning method based on embedding technology Active CN112328469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011136892.XA CN112328469B (en) 2020-10-22 2020-10-22 Function level defect positioning method based on embedding technology

Publications (2)

Publication Number Publication Date
CN112328469A CN112328469A (en) 2021-02-05
CN112328469B CN112328469B (en) 2022-03-18

Family

ID=74312163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011136892.XA Active CN112328469B (en) 2020-10-22 2020-10-22 Function level defect positioning method based on embedding technology

Country Status (1)

Country Link
CN (1) CN112328469B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant