CN112328469B

CN112328469B - Function level defect positioning method based on embedding technology

Info

Publication number: CN112328469B
Application number: CN202011136892.XA
Authority: CN
Inventors: 邹卫琴; 李恩铭; 房春荣; 张静宣
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2020-10-22
Filing date: 2020-10-22
Publication date: 2022-03-18
Anticipated expiration: 2040-10-22
Also published as: CN112328469A

Abstract

The invention discloses a function-level defect localization method based on embedding technology. It uses a code embedding technique based on the abstract syntax tree to represent the functional semantics of function code, and a word embedding technique to represent the problem semantics of defect reports. A convolutional neural network then fuses the semantic features of the function and the defect report and predicts the suspicious functions related to a given defect. To address limited training data, the invention uses pre-trained models to produce the feature representations of defect reports and code; to address the imbalance in the number of instances per class, it applies random oversampling. Experiments on three mainstream Java projects show that, with a recommendation list of length 10, the accuracy of the method reaches 12.5% to 40%, which indicates considerable potential for fine-grained defect localization on mainstream Java software projects.

Description

Function level defect positioning method based on embedding technology

Technical Field

The invention relates to a function level defect positioning method based on an embedding technology, and belongs to the technical field of defect positioning and repairing in a software development process.

Background

Software defect localization is an important part of software development and maintenance and plays a key role in assuring and improving the quality of software products. Defect localization techniques have received extensive attention from academia over the past decades, and practitioners in industry and open-source communities also regard them as the most valuable class of software techniques. A defect localization technique generally locates the suspicious code that contains a defect from dynamic program execution information, static text information, or similar sources, and the localization granularity may be the source file, the function, the code line, and so on. Through interviews and questionnaires, Kochhar et al. found that practitioners consider the function the most important of these granularities. Function-level defect localization therefore has significant practical value in software development.

One of the core challenges of function-level defect localization is how to accurately extract both the functional semantics implied by function code (a function contains far less code than a source file, which makes its semantics harder to extract) and the problem semantics described by a defect report (the software problem must be characterized from complicated and possibly noisy natural language text). Several researchers have worked on this. Lukins et al. propose building an LDA topic model over the document set constructed from functions, and then querying the LDA model with keywords manually extracted from (or added to) the textual description of a defect report to retrieve the suspicious functions related to that report. Zhang et al. propose combining LDA topic models at multiple levels of abstraction with the standard vector space model (VSM) to model the defect reports and the function document set, and retrieving the suspicious functions for a given defect by computing the cosine similarity between the two. Youm et al. propose first locating the suspicious source files and then locating the suspicious functions within the 10 most suspicious files. Zhang et al. propose using word2vec to represent defect reports and function code uniformly, together with function expansion to counter the poor localization performance on short functions.

Despite this body of research on function-level defect localization, existing work still has two shortcomings in semantic characterization.

First, the functional semantics of code are poorly characterized. When extracting the semantics of a function, current work treats code as ordinary natural language text and ignores its structural characteristics. In general, if the same code statements are organized differently (e.g., A -> B -> C becomes B -> A -> C), the resulting functionality is likely to differ. Ignoring code structure therefore loses the additional functional semantics that the structure carries, which reduces localization accuracy. Second, the extraction of defect report semantics ignores textual context. Existing work mainly uses techniques such as VSM and LDA to represent the text of defect reports. These techniques often fail to capture the contextual semantics of the text, so the problem described by the report is insufficiently characterized, which in turn degrades localization. To solve these problems, the present invention proposes a function-level defect localization technique based on embedding technology.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a function-level defect localization method based on embedding technology that uses word embedding and code embedding to extract the problem semantics of defect reports and the functional semantics of function code, and then locates the function code related to a given defect, helping developers repair software defects more efficiently and improving software quality.

The invention adopts the following technical scheme for solving the technical problems:

a function level defect positioning method based on an embedding technology comprises the following steps:

step 1, acquiring all defect reports of the software and preprocessing each defect report to obtain the preprocessed defect reports;

step 2, using a pre-trained model of the word embedding technique ELMO to extract semantic features for each word of a preprocessed defect report, obtaining a semantic feature vector for each word; performing maximum pooling over the semantic feature vectors of all words of the preprocessed defect report to obtain the semantic feature vector of that defect report;

step 3, using a pre-trained model of the code2vec technique to extract semantic features for each function of the software, obtaining a semantic feature vector for each function;

step 4, pairing a defect report selected from all defect reports with a function selected from all functions of the software to form an instance, wherein the class label of the instance is 0 or 1, 1 indicating that the function in the instance is related to the defect report and 0 indicating that it is unrelated, and the feature of each instance is the fusion of the semantic feature vectors of its function and its defect report, thereby obtaining all instances; taking all instances as the data set and dividing it into a training set, a validation set and a test set;

step 5, training a convolutional neural network model on the training set, with a logistic regression classifier as the output layer of the model, handling the imbalance in the number of instances per class with a random oversampling strategy during training, and fine-tuning the model on the validation set, thereby obtaining the trained convolutional neural network model, namely the defect localization model;

step 6, feeding the test set into the defect localization model, which outputs a relevance value for each function and each defect report in the test set; for each defect report in the test set, sorting all functions in descending order of their relevance value to that report and outputting the functions with the K highest relevance values as the problem function recommendation list of the defect report, wherein K is a preset value.

As a preferred embodiment of the present invention, step 1 of preprocessing each defect report to obtain a preprocessed defect report specifically includes the following steps:

performing a series of preprocessing operations on each defect report, including word segmentation, stop-word removal, symbol removal, digit removal, compound-word splitting, stemming, and conversion of upper case to lower case, to obtain the preprocessed defect report.

As a preferred solution of the present invention, the dimension of the semantic feature vector corresponding to each word in step 2 is 768 dimensions.

As a preferred solution of the present invention, the dimension of the semantic feature vector corresponding to each function in step 3 is 384 dimensions.

Compared with the prior art, the invention adopting the technical scheme has the following technical effects:

1. The defect localization method based on embedding technology provided by the invention extracts the semantics of defect reports and function code with word embedding and code embedding techniques, and localizes defects at function granularity through semantic matching.

2. Aiming at the problem of limited training data, the invention provides the extraction of semantic features by using a pre-training model.

3. Aiming at the problem of unbalanced class instance number, the invention provides a random oversampling mechanism for processing, and the effect of the method is effectively improved.

4. The effectiveness of the method of the invention was evaluated by experiments on three major Java software projects. Experimental results show that when the number of the recommendation functions is 10, the method can accurately position 12.5% -40.0% of defect reports, and has great application potential in the field of fine-grained defect positioning.

Drawings

FIG. 1 is a flow chart of a function level defect positioning method based on an embedding technique according to the present invention.

FIG. 2 is a defect report pre-processing flow diagram.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

As shown in FIG. 1, the present invention proposes a function-level defect localization method based on embedding technology, which uses word embedding and code embedding to characterize the semantics of defect reports and function code so that defects can be localized effectively at function granularity. The method covers extraction of the problem semantics of software defect reports and the functional semantics of function code, use of pre-trained models, model construction and prediction, and handling of class imbalance.

FIG. 1 shows the workflow of the proposed method, which comprises two phases: a semantic extraction phase and a model construction phase. The semantic extraction phase extracts the semantic features of defect reports and function code with a word embedding model and a code embedding model, mainly using pre-trained embedding models. The model construction phase trains the defect localization model and uses it for prediction. We use a convolutional neural network to fuse the semantic feature vectors of defect reports and function code, and a logistic regression layer as the classifier that predicts the problem functions corresponding to a given defect report. Because problem functions are far fewer than normal functions, an imbalanced-class handling strategy, namely random oversampling, is built into model construction to address the imbalance in the number of instances per class.

Semantic feature extraction

Because word embedding and code embedding models have performed well in a variety of software engineering tasks, the invention uses them to extract the semantic features of defect reports and function code. Obtaining a well-performing embedding model usually requires a large training set, whereas the data set of a typical software project is relatively small. We therefore propose using pre-trained models to extract the semantics of defect reports and function code. A pre-trained model is built on an existing massive data set (such as large volumes of wiki text or GitHub code) and can be applied to a new software task either directly or after fine-tuning. Researchers have already applied pre-trained models to various software tasks.

Semantic extraction of defect reports: Ye et al. found experimentally that, for file-level defect localization, a model trained on large amounts of Wikipedia text did not differ significantly from one trained on software project artifacts such as API documents. In the present invention, we use a pre-trained model of ELMO, a classical word embedding model that performs well in natural language processing (accessed through the Flair library, https://github.com/flairNLP/flair), to extract the problem semantics of defect reports. ELMO extracts features with a bidirectional LSTM (long short-term memory) network. It first pre-trains word vectors with a language model and then adjusts them dynamically according to the context of the downstream task. The adjusted word vector expresses the specific meaning of the word in its context, so by making full use of contextual semantics ELMO handles words whose meaning depends on context (polysemy), which word2vec cannot resolve. The ELMO pre-trained model used in the invention outputs 768-dimensional semantic feature vectors, that is, each word in a defect report is represented by a 768-dimensional numeric vector. Maximum pooling over the 768-dimensional vectors of all words in a defect report yields the final semantic feature vector of that report.
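The pooling step described above can be illustrated with a minimal sketch. It assumes the per-word vectors are already available as plain lists of floats; the function name `max_pool` and the toy 4-dimensional vectors are hypothetical stand-ins for the 768-dimensional ELMO output.

```python
from typing import List, Sequence


def max_pool(word_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over equally sized word vectors."""
    if not word_vectors:
        raise ValueError("a report needs at least one word vector")
    dim = len(word_vectors[0])
    if any(len(v) != dim for v in word_vectors):
        raise ValueError("all word vectors must have the same dimension")
    # zip(*...) walks the vectors dimension by dimension.
    return [max(column) for column in zip(*word_vectors)]


# Toy 4-dimensional vectors standing in for 768-dimensional word vectors.
toy_report = [
    [0.1, -0.3, 0.5, 0.0],
    [0.4, -0.1, 0.2, -0.7],
    [-0.2, 0.6, 0.1, -0.5],
]
report_vector = max_pool(toy_report)
print(report_vector)  # [0.4, 0.6, 0.5, 0.0]
```

The result has the same dimension as a single word vector regardless of how many words the report contains, which is what lets reports of different lengths share one fixed-size input.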

Before the pre-trained model is applied, the text of each defect report undergoes the series of preprocessing steps shown in FIG. 2: word segmentation, stop-word removal, symbol removal, compound-word splitting, Porter stemming, and conversion to lower case. The stop-word list we use is the NLTK list (https://www.nltk.org/). Compound words (e.g., WindowSize) are split into individual words according to capitalization (e.g., WindowSize -> Window + Size), and, following the practice of prior research, the compound word itself is also retained in the data set. Some words differ in form but share the same stem (e.g., fetching and fetches); we reduce them with the Porter stemming method (https://tartarus.org/martin/PorterStemmer/), so that fetching and fetches both become fetch. The processed words are all converted to lower case. Finally, the preprocessed defect report is fed into the word embedding pre-trained model for semantic extraction.
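A minimal sketch of these preprocessing steps, using only the Python standard library, is given below. Several parts are assumptions made for illustration: the tiny `STOP_WORDS` set stands in for the NLTK list, stemming is left out (a Porter implementation such as NLTK's `PorterStemmer` would be applied to each word), and lower-casing is done before stop-word filtering so that the comparison is case-insensitive.

```python
import re
from typing import List

# Tiny illustrative stop-word list (assumption); the text uses the NLTK list.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "when", "and"}

# Matches the parts of a compound identifier such as WindowSize or getHTTPResponse.
WORD_PART = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])")


def preprocess(text: str) -> List[str]:
    # Word segmentation on letters only, which also drops symbols and digits.
    tokens = re.findall(r"[A-Za-z]+", text)
    words: List[str] = []
    for token in tokens:
        parts = WORD_PART.findall(token)
        words.append(token)          # the compound word itself is retained
        if len(parts) > 1:
            words.extend(parts)      # plus its individual parts
    lowered = [w.lower() for w in words]
    return [w for w in lowered if w not in STOP_WORDS]


tokens = preprocess("Crash in WindowSize when the parser reads file 42!")
print(tokens)
# ['crash', 'windowsize', 'window', 'size', 'parser', 'reads', 'file']
```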

Semantic extraction of function code: existing function-level defect localization methods usually ignore the structural characteristics of function code when extracting its functional semantics, which considerably harms localization performance. In the present invention, we propose using the code embedding technique code2vec (https://code2vec.org/) to represent the functional semantics of a function. code2vec is a deep learning model based on abstract syntax tree (AST) paths. It first converts the function code into an AST, then extracts all leaf-to-leaf paths from the AST, then learns the weight and vector representation of each path with an attention mechanism, and finally combines the vector representations of all paths through a fully connected layer to obtain the final vector representation of the code fragment. By fully mining the structural information contained in the AST, code2vec can recommend the function name of a given piece of code with high accuracy. This indicates that the semantic vector code2vec generates for function code reflects the functionality of the function well, and it is therefore suitable for extracting functional semantics in function-level defect localization.

In the invention, the pre-trained model of code2vec is likewise used to extract the functional semantics of function code. The code2vec pre-trained model was built on 14 million functions from the 10,000 most popular projects on GitHub. The diversity and popularity of the projects in the training data largely ensure that the pre-trained model is suitable for extracting the functional semantics of functions in real software projects. With code2vec, each function is represented as a 384-dimensional real-valued vector that characterizes its functional semantics.
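To make the notion of leaf-to-leaf AST paths concrete, here is a small sketch on a toy tree. It is not the code2vec extractor: the tree encoding as `(label, children)` tuples, the function names, and the arrow notation for the path string are all assumptions chosen for illustration, and the learning part (attention and vector representations) is not shown.

```python
from itertools import combinations
from typing import List, Tuple

Tree = Tuple[str, list]  # (node label, list of child trees); a leaf has no children


def root_to_leaf(tree: Tree, node_id: Tuple[int, ...] = ()) -> List[list]:
    """Return, for every leaf, the list of (node_id, label) steps from the root."""
    label, children = tree
    step = (node_id, label)
    if not children:
        return [[step]]
    paths = []
    for i, child in enumerate(children):
        for tail in root_to_leaf(child, node_id + (i,)):
            paths.append([step] + tail)
    return paths


def path_contexts(tree: Tree) -> List[Tuple[str, str, str]]:
    """All (start leaf, path through the lowest common ancestor, end leaf) triples."""
    contexts = []
    for a, b in combinations(root_to_leaf(tree), 2):
        shared = 0  # number of steps both root-to-leaf paths have in common
        while a[shared][0] == b[shared][0]:
            shared += 1
        up = [label for _, label in reversed(a[shared:-1])]
        top = a[shared - 1][1]  # lowest common ancestor
        down = [label for _, label in b[shared:-1]]
        path = "↑".join(up + [top])
        if down:
            path += "↓" + "↓".join(down)
        contexts.append((a[-1][1], path, b[-1][1]))
    return contexts


# Toy AST for the statement: x = y + 1
toy_ast = ("Assign", [("x", []), ("Add", [("y", []), ("1", [])])])
contexts = path_contexts(toy_ast)
print(contexts)
# [('x', 'Assign↓Add', 'y'), ('x', 'Assign↓Add', '1'), ('y', 'Add', '1')]
```

Because the paths run through the tree rather than along the token sequence, reordering statements changes the extracted paths, which is the structural information a plain bag-of-words view of the code would miss.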

Model construction

In the present invention, we treat function-level defect localization as a learning task: for a given defect report, we try to determine whether each function in the code base is related to repairing the defect. We use a convolutional neural network as the model training framework. The network takes as instance features the semantic vectors of the defect report and the function code output by the pre-trained models, and the class label of each instance states whether the function is related to the repair of the defect. To counter the bias caused by the majority class, class imbalance is handled during model training. The details are as follows:

Model training: as described above, the semantic feature vectors of the defect report and the function code produced by the pre-trained models have different dimensions. To improve the learning capacity and nonlinear expressiveness of the model, several fully connected layers are designed into the convolutional neural network. Specifically, the 384-dimensional semantic vector of the function code passes through layers whose node counts decrease as 384 -> 256 -> 128 -> 64 -> 32; the 768-dimensional semantic vector of the defect report undergoes further feature extraction through 5 fully connected layers, namely 768 -> 512 -> 256 -> 128 -> 64 -> 32. A logistic regression classifier serves as the output layer of the convolutional neural network and predicts whether a function is related to the repair of a given software defect.
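The layer sizes above can be sketched as a forward pass in plain Python. This is only an illustration of the shapes: the weights are random and untrained, biases are omitted, and the ReLU activation and the fusion by concatenation of the two 32-dimensional outputs are assumptions, since the text does not specify them. All names are hypothetical.

```python
import math
import random
from typing import List

REPORT_LAYERS = [768, 512, 256, 128, 64, 32]   # 5 fully connected layers
FUNCTION_LAYERS = [384, 256, 128, 64, 32]      # 4 fully connected layers

Matrix = List[List[float]]


def init_tower(sizes: List[int], rng: random.Random) -> List[Matrix]:
    """One weight matrix per layer (one row per output node), small random values."""
    return [
        [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def run_tower(weights: List[Matrix], x: List[float]) -> List[float]:
    for layer in weights:
        # Dense layer followed by ReLU (activation is an assumption).
        x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in layer]
    return x


def relevance(report_vec, function_vec, report_tower, function_tower, out_weights):
    # Fusion by concatenation (assumption): 32 + 32 = 64 features.
    fused = run_tower(report_tower, report_vec) + run_tower(function_tower, function_vec)
    z = sum(w * v for w, v in zip(out_weights, fused))
    return 1.0 / (1.0 + math.exp(-z))  # logistic regression output in (0, 1)


rng = random.Random(0)
report_tower = init_tower(REPORT_LAYERS, rng)
function_tower = init_tower(FUNCTION_LAYERS, rng)
out_weights = [rng.uniform(-0.05, 0.05) for _ in range(64)]
score = relevance(
    [rng.random() for _ in range(768)],
    [rng.random() for _ in range(384)],
    report_tower, function_tower, out_weights,
)
print(0.0 < score < 1.0)  # True
```

In practice such a model would be defined and trained in a deep learning library; the pure-Python version only shows how the two vectors of different dimension are reduced to the same size before being fused.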

During model training, the whole data set is divided into three parts: a training set, a validation set and a test set. The training set is used to train the model, the validation set to fine-tune it, and the test set to evaluate the final prediction model built on the training and validation sets. Each instance is characterized by the combination of the semantic feature vectors of a defect report and a function. The class label of an instance is 0 or 1, where 1 indicates that the function in the instance is related to the defect repair and 0 indicates that it is unrelated. All instances are sorted by the time of their defect report and divided into 10 equal parts. The first 8 parts are used as the training set, the 9th part as the validation set, and the last part as the test set. The evaluation metrics for the localization model are Accuracy@K, MAP and MRR.
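The chronological 8/1/1 split can be sketched as follows. The function name and the use of integer indices as stand-in instances are hypothetical; the only assumption about the input is that it is already sorted by defect report time, oldest first.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(instances: Sequence[T], parts: int = 10) -> Tuple[List[T], List[T], List[T]]:
    """Split time-ordered instances into train (first 8 parts), validation (9th), test (10th)."""
    n = len(instances)
    bounds = [i * n // parts for i in range(parts + 1)]
    train = list(instances[: bounds[parts - 2]])
    validation = list(instances[bounds[parts - 2]: bounds[parts - 1]])
    test = list(instances[bounds[parts - 1]:])
    return train, validation, test


# 20 stand-in instances, already ordered from oldest to newest report.
train, validation, test = chronological_split(list(range(20)))
print(len(train), len(validation), len(test))  # 16 2 2
```

Keeping the order instead of shuffling means the model is always evaluated on reports that are newer than anything it was trained on.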

Imbalanced class handling: model training suffers from an imbalance in the number of instances per class. For a given defect report, the functions related to its repair are far fewer than the functions unrelated to it. Training directly on such an imbalanced data set biases the model against the minority class. When the minority class is the one that really matters (problem functions are the class the project actually cares about), the model is then unlikely to make reasonable predictions.

To solve this problem, the project data must be rebalanced. There are two traditional strategies for imbalanced data sets: random undersampling and random oversampling. Random undersampling balances the classes mainly by reducing the number of majority-class instances (typically by randomly drawing as many majority-class instances as there are minority-class instances), whereas random oversampling balances them mainly by increasing the number of minority-class instances (typically by repeatedly sampling minority-class instances at random). Because the minority class (the problem functions) is the class that defect localization cares about most, and because a convolutional neural network needs a certain amount of data, the invention uses random oversampling to handle class imbalance in defect localization.
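A minimal sketch of random oversampling is shown below, under the assumption that each instance is a `(features, label)` pair with label 1 for a problem function and 0 otherwise. The function name and the fixed seed are illustrative choices; the resampling would normally be applied to the training set only.

```python
import random
from typing import List, Sequence, Tuple

Instance = Tuple[Sequence[float], int]  # (feature vector, class label)


def random_oversample(instances: Sequence[Instance], seed: int = 0) -> List[Instance]:
    """Repeat randomly drawn minority (label 1) instances until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [x for x in instances if x[1] == 1]
    negatives = [x for x in instances if x[1] == 0]
    if not positives or len(positives) >= len(negatives):
        return list(instances)  # nothing to balance
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    balanced = list(instances) + extra
    rng.shuffle(balanced)
    return balanced


# 2 problem-function instances against 8 unrelated ones.
data = [([float(i)], 1) for i in range(2)] + [([float(i)], 0) for i in range(8)]
balanced = random_oversample(data)
print(sum(label for _, label in balanced), len(balanced))  # 8 16
```

No majority-class instance is discarded, so the amount of training data grows rather than shrinks, which matches the stated reason for preferring oversampling here.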

Prediction: once the defect localization model has been trained, a new defect report is first processed by word segmentation, stop-word removal, symbol removal, compound-word splitting and lower-case conversion, then passed through the word embedding pre-trained model and maximum pooling to obtain its semantic features. These features are combined with the semantic vector of each function in the project code corresponding to the defect report to form the instances to be predicted. The trained localization model outputs the relevance (a value between 0 and 1) of each function to the new defect report; the larger the value, the more relevant the function is to the report, that is, the more likely it is to be a problem function. Finally, the model sorts all functions in descending order of relevance and outputs the problem function recommendation list.
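The final ranking step amounts to sorting functions by their predicted relevance and keeping the top K. A short sketch, with hypothetical function names and made-up scores used purely as input data:

```python
from typing import Dict, List, Tuple


def recommend(scores: Dict[str, float], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k functions with the highest relevance, most relevant first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical relevance values for one defect report.
scores = {
    "Parser.parse": 0.12,
    "Window.resize": 0.91,
    "Cache.evict": 0.47,
    "Window.getSize": 0.78,
}
top2 = recommend(scores, k=2)
print(top2)  # [('Window.resize', 0.91), ('Window.getSize', 0.78)]
```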

To evaluate the function-level defect localization method based on embedding technology, experiments were carried out on three mainstream Java software projects: closure-compiler (Closure), commons-lang (Lang) and commons-math (Math). The three projects come from different domains and differ in size, which provides a sound data basis for evaluating the method. The data for the three projects come from Defects4J (https://github.com/rjust/defects4j), an authoritative project data set in the field of defect localization and repair. Defects4J contains a collection of real, reproducible software defects. For each defect, the maintainers of Defects4J manually removed the code modifications unrelated to its repair, which largely guarantees an accurate match between software defects and faulty code. This makes Defects4J a good benchmark, and it is used to evaluate a variety of defect localization and repair techniques. When using the Closure, Math and Lang data sets, we found that Defects4J provides only the code modifications related to each defect repair and does not directly provide the problem functions, so we had to extract the problem functions from Defects4J ourselves. In the invention, the problem functions involved in each defect repair are extracted mainly by comparing the project code versions before and after the repair with the git diff tool and the meld tool (https://meldmerge.org). During extraction, all functions touched by the code modification are treated as problem functions related to the defect. Table 1 shows the basic information of the three experimental projects.

TABLE 1 Basic information of the three mainstream Java projects

Project	Number of defects	Average number of files	Average number of functions	Average number of problem functions
Closure	156	387.69	5,662.25	1.64
Math	85	497.79	3,326.40	1.38
Lang	56	89.34	1,868.96	1.23

During the experiments, when locating the possible problem functions for a software defect, the project code is rolled back to the version before that defect was repaired, so that localization matches the situation in real software development. Each average in the table is the mean, over all defects of the project, of the number of the relevant code elements (such as files or functions) in the project code corresponding to each defect.

To evaluate the function-level defect localization technique proposed in the present invention, we used three common evaluation metrics: Accuracy@K, MAP and MRR. Table 2 shows the experimental results of the proposed method on Closure, Math and Lang.

Table 2 shows that the MAP of the method on Closure, Math and Lang is 11.36%, 7.13% and 10.81%, respectively, with the best result on Closure. The MRR on Closure, Math and Lang is 22.45%, 12.80% and 15.46%, respectively, again with the best result on Closure. On Accuracy@K, the method shows considerable potential on all three projects. For example, Accuracy@1 on the three projects is 20%, 12.5% and 13.33%, respectively. This means that when only one problem function is recommended, the method accurately hits 20% of the defect reports in Closure, 12.5% in Math and 13.33% in Lang.

TABLE 2 test evaluation of the inventive method on three mainstream Java projects (%)

When the recommendation list length K is increased to 10, Accuracy@10 reaches 40%, 12.5% and 13.33% on Closure, Math and Lang, respectively; that is, the hit rate on Closure reaches 40%. When K is increased to 20, the accuracy on Lang rises from 13.33% to 33.33%.

In the present invention, all experiments were run on an x86_64 Ubuntu server running Ubuntu 16.04.5, with a single 6-core Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz and 64 GB of memory. When constructing the data set, the text of each defect report is preprocessed (word segmentation, stop-word removal, and so on) with the natural language processing interfaces of the Apache Lucene library (https://lucene.apache.org/), such as TokenStream and StopAnalyzer, then fed into the ELMO pre-trained model and max-pooled to obtain its semantic feature vector. The project function code is fed into the code2vec pre-trained model, which outputs the semantic feature vector of each function. The defect localization model is built with the deep learning library Keras (https://keras.io/). The random oversampling function interface provided by Keras implements the sampling strategy for handling class imbalance.

In the experiments, each training instance is characterized by the combination of the semantic feature vectors of a defect report and a function, and the number of training instances is determined by the number of defect reports and the number of project functions, since reports are paired with the functions of the project (for the concrete data scale, see the basic statistics of the three experimental software projects listed in Table 1). To respect the chronology of defect reports, all training instances are ordered by the time of their defect report and cut into 10 equal parts during model training and validation; the first 8 parts are used for training, the 9th part for validation, and the last part for testing.

We use three evaluation metrics, Accuracy@K, MAP and MRR, to evaluate the performance of the method proposed in the present invention. All three are common metrics for defect localization techniques. They are calculated as follows:

Accuracy@K in equation (1) is the percentage of defect reports for which the recommendation is accurate. For a defect, the recommendation is deemed accurate if at least one of the K recommended functions is related to the defect. For example, given 500 defect reports and a recommended problem function list of length K = 5, if for 200 of the reports at least one of the corresponding problem functions is among the 5 recommended suspicious functions, the Accuracy@5 of the method is 200/500 × 100% = 40%. MAP, calculated by equations (2) and (3), is often used to evaluate information-retrieval-based techniques and reflects the average precision over all retrieval queries. In equation (2), Q is the query set of all defect reports (each defect report is treated as one retrieval query), and Precision@k in equation (3) is the precision of a recommendation list of length k. MRR, calculated by equations (4) and (5), evaluates the rank at which the first relevant function (a problem function related to the defect repair) appears in the problem function recommendation list. The nearer the top the first relevant function appears, the larger the value of the metric. For example, given three defects A, B and C whose first relevant problem functions appear at positions 1, 5 and 10 of their recommendation lists, the RR values for A, B and C are 1/1, 1/5 and 1/10, respectively. Thus, for the defect report set {A, B, C}, the final MRR is (1/1 + 1/5 + 1/10)/3 × 100% = 43.33%.
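The three metrics can be sketched in code as follows. The implementations follow the common information-retrieval definitions, which is an assumption: the equations themselves are not reproduced in this text, so the exact form of equations (1) to (5) may differ in detail. The function names and the `f1` ... `f10` identifiers are hypothetical. The example reuses the three defects A, B and C from the paragraph above.

```python
from typing import Sequence, Set


def accuracy_at_k(rankings: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int) -> float:
    """Share of reports with at least one relevant function among the top k."""
    hits = sum(1 for ranked, rel in zip(rankings, relevant)
               if any(f in rel for f in ranked[:k]))
    return hits / len(rankings)


def average_precision(ranked: Sequence[str], rel: Set[str]) -> float:
    """Mean of Precision@k taken at each position k that holds a relevant function."""
    hits, total = 0, 0.0
    for position, f in enumerate(ranked, start=1):
        if f in rel:
            hits += 1
            total += hits / position
    return total / len(rel) if rel else 0.0


def mean_average_precision(rankings, relevant) -> float:
    return sum(average_precision(r, s) for r, s in zip(rankings, relevant)) / len(rankings)


def reciprocal_rank(ranked: Sequence[str], rel: Set[str]) -> float:
    """1 / position of the first relevant function, or 0 if none is listed."""
    for position, f in enumerate(ranked, start=1):
        if f in rel:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, relevant) -> float:
    return sum(reciprocal_rank(r, s) for r, s in zip(rankings, relevant)) / len(rankings)


# Defects A, B, C: first relevant function at positions 1, 5 and 10.
ranking = [f"f{i}" for i in range(1, 11)]
rankings = [ranking, ranking, ranking]
relevant = [{"f1"}, {"f5"}, {"f10"}]

mrr = mean_reciprocal_rank(rankings, relevant)
print(round(mrr * 100, 2))  # 43.33
```

With exactly one relevant function per report, as in this example, average precision equals the reciprocal rank, so MAP and MRR coincide; they diverge once a defect has several problem functions.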

The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.

Claims

1. A function level defect positioning method based on an embedding technology is characterized by comprising the following steps:

2. The method for positioning function-level defects based on the embedding technique according to claim 1, wherein step 1 preprocesses each defect report to obtain a preprocessed defect report, and the specific process is as follows:

3. The method for locating functional defect based on embedding technology as claimed in claim 1, wherein the dimension of semantic feature vector corresponding to each word in step 2 is 768 dimensions.

4. The method for locating the functional defect based on the embedding technique as claimed in claim 1, wherein the dimension of the semantic feature vector corresponding to each function in step 3 is 384 dimensions.