CN110188047A

CN110188047A - A duplicate defect report detection method based on a dual-channel convolutional neural network

Info

Publication number: CN110188047A
Application number: CN201910474540.6A
Authority: CN
Inventors: 徐玲; 何健军; 帅鉴航; 杨梦宁; 张小洪; 洪明坚; 葛永新; 杨丹; 王洪星; 黄晟; 陈飞宇
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2019-06-20
Filing date: 2019-06-20
Publication date: 2019-08-30
Anticipated expiration: 2039-06-20
Also published as: CN110188047B

Abstract

The present invention relates to a duplicate defect report detection method based on a dual-channel convolutional neural network, comprising three steps: data preparation, building a CNN model, and prediction for the defect report to be predicted. In data preparation, the fields useful for duplicate detection are extracted from each defect report, and the structured and unstructured information of each report is placed together in one text document. After preprocessing, each report represented as text is converted into a single-channel matrix, and single-channel matrices are combined into dual-channel matrices; one part of these is used as the training set and the remainder as the validation set. In building the CNN model, the model is trained with the training set as input. In the prediction stage, the trained model is loaded to predict the similarity of a defect report pair formed by an unknown defect report and a known defect report; this similarity is the probability that the pair is a duplicate. The method of the present invention achieves relatively high prediction accuracy.

Description

A duplicate defect report detection method based on a dual-channel convolutional neural network

Technical field

The present invention relates to the technical field of software testing, and in particular to a duplicate defect report detection method based on a dual-channel convolutional neural network.

Background technique

Modern software projects use defect tracking systems such as Bugzilla [17] to store and manage defect reports. Software developers, testers and end users submit defect reports to describe the problems they encounter with the software. Defect reports help guide software maintenance and repair. As software systems grow, hundreds of defect reports may be submitted every day. When more than one person submits a defect report describing the same bug, duplicate defect reports arise. Because defect reports are written in natural language, the same bug may also be described in different ways.

Because the number of defect reports is so large, detecting duplicates manually is difficult. Moreover, because defect reports are written in natural language, providing a standard template is impractical. Automatic detection of duplicate defect reports is therefore worthwhile, since it avoids repairing the same bug repeatedly. In recent years, many automatic detection techniques for duplicate defect reports have been proposed to solve this problem. These methods fall roughly into two directions: information retrieval and machine learning.

Information retrieval methods usually compute the textual similarity of two defect reports, that is, they focus on the text descriptions to compute similarity.

For example, Hiew built a model using VSM (Vector Space Model), which represents a report as a vector with a TF-IDF (Term Frequency-Inverse Document Frequency) term weighting scheme. Based on VSM, Runeson et al. were the first to detect duplicate defect reports with natural language processing techniques. Wang et al. argued that natural language information alone cannot solve this problem well, so they also used execution information as a feature for duplicate detection. However, only a small fraction of reports contain execution information, which greatly limits this approach. Sun et al. proposed REP, which uses not only summary and description but also structured information such as product, component and version. To obtain a more accurate text similarity, they extended BM25F, an effective similarity measure in the field of information retrieval. In addition to textual similarity and structural similarity, Alipour et al. also considered the influence of contextual information on duplicate detection. They applied LDA to these features and achieved better results. Information-retrieval-based methods perform well in both accuracy and time efficiency, but the results are unsatisfactory when the same problem is described with different terms.

Machine learning methods extract latent features of reports through self-learning algorithms, but traditional machine learning methods cannot learn deep features of the input well. SVM is a classic machine learning method. Jalbert et al. used it to build a classification system that can filter duplicate reports. They also argued that previous methods did not make full use of the various features in defect reports, so they used surface features, textual semantics and graph clustering in their model. Building on the work of Jalbert et al., Tian et al. considered some new features and built a linear model. From the perspectives of features and imbalanced data, they improved the accuracy of duplicate detection. Sun et al. built a discriminative model with SVM and, for the first time, divided defect reports into two classes, duplicate and non-duplicate. L2R (learning to rank) is another very useful machine learning method. Based on it, Zhou et al. considered textual and statistical features and applied a stochastic gradient descent algorithm to them. This method performs better than traditional information retrieval methods such as VSM and BM25F. With the application of word embedding in natural language processing, more and more researchers use it to detect duplicate reports. Budhiraja et al. used word embedding to convert defect reports into vectors and then computed their similarity. The experimental results show that this method has the potential to improve duplicate detection accuracy.

Summary of the invention

The technical problem to be solved by the present invention is the automatic detection of duplicate reports. This problem can be reduced to judging the relationship between two defect reports, that is, whether a defect report pair formed by two reports is duplicate or non-duplicate.

To achieve the above object, the present invention adopts the following technical solution: a duplicate defect report detection method based on a dual-channel convolutional neural network, comprising the following steps:

S100: data preparation

S101: The defect reports of the software are extracted; every defect report consists of structured information and unstructured information, and for each defect report all of its structured and unstructured information is placed in a single text document;

S102: Each defect report is preprocessed, including tokenization, stemming, stop-word removal and case conversion;

S103: After preprocessing, the words of all defect reports are combined into a corpus; the existing Word2vec tool with the CBOW model is applied to the corpus to obtain a vector representation of each word, and hence a two-dimensional matrix representation of each defect report, referred to as the two-dimensional single-channel matrix of the defect report;

According to the known pairing information provided when the defect reports of the software are extracted (this pairing information is part of the data set and was prepared by the people who created the data set), the defect report pair formed by two defect reports is represented by a two-dimensional dual-channel matrix, which is composed of the two-dimensional single-channel matrices of the two defect reports; the dual-channel matrix is then labeled as duplicate or non-duplicate;

All labeled dual-channel matrices are divided into a training set and a validation set;

S200: Building the CNN model

S201: All dual-channel matrices in the training set and the validation set are input into the CNN model;

S202: In the first convolutional layer, a set of convolution kernels of size d × k_w is provided, where d is the length of a convolution kernel and k_w is its width; after the first convolution, the two channels of the dual-channel matrix are merged into one. The first-layer convolution formula is:

where C₁ denotes the output of the first convolutional layer, i denotes the i-th channel of the first-layer input I₁, j₁ denotes the j₁-th row of the input, b₁ denotes the bias, and f₁ denotes the nonlinear activation function. Given the input length l (l = n_w), padding P = 0 and stride S = 1, the output length O₁ can be calculated as:

$$O_1 = \frac{l - d + 2P}{S} + 1 \quad (2)$$

The output of the first convolutional layer has shape […]; it is reshaped to […] and then convolved again in the second convolutional layer, which is provided with convolution kernels of three sizes […], with […] kernels of each size. The formula of the second-layer convolution is:

where C₂ denotes the output of the second convolutional layer, j₂ denotes the j₂-th row of the second-layer input I₂, b₂ denotes the bias, and f₂ denotes the nonlinear activation function. After this convolution, feature maps of three shapes […] are obtained, where O₂ can be calculated by formula (2) from l (l = O₁) and the different kernel lengths d;

S203: Max pooling is applied to all feature maps;

S204: All feature maps are reshaped and concatenated into a vector of […] dimensions, which serves as the input of the fully connected layers;

After two fully connected layers, a single probability sim_predict is obtained, which represents the predicted similarity of the two reports;

In the last layer, sigmoid is used as the activation function to obtain sim_predict;

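Given the output T = {x₁, x₂, …, x₃₀₀} of the first fully connected layer and the weight vector W = {w₁, w₂, …, w₃₀₀}, sim_predict can be calculated as:

$$sim_{predict} = \frac{1}{1 + e^{-\left(\sum_{i=1}^{300} w_i x_i + b\right)}} \quad (4)$$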

where i indexes the i-th element of T and b denotes the bias;

S205: All defect report pairs in the training set are traversed, repeating S202-S204;

S206: Back-propagation is performed according to the loss function to update the hidden parameters of the model; the loss function is given by formula (5):

where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;

S207: After each training epoch, the model is validated using the validation set; when the validation loss has not decreased for 5 epochs, updating of the model parameters stops; otherwise, return to S201 and continue training the CNN model;

S300: Prediction for the defect report to be predicted

The defect report to be predicted is first preprocessed using the method in S102 and then converted, using the method in S103, into a two-dimensional single-channel matrix;

The two-dimensional single-channel matrix of the defect report to be predicted is paired with the two-dimensional single-channel matrix of each of the N existing defect reports of the software, giving N dual-channel matrices to be predicted; these N matrices form the prediction set, and each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;

Among the N probabilities, when a probability is greater than the threshold, the defect report corresponding to that probability and the defect report to be predicted are considered duplicates.

As an improvement, in S101 the structured information is product and component, and the unstructured information is summary and description.

As an improvement, all layers other than the last fully connected layer use ReLU as the activation function to extract more nonlinear features.

Compared with the prior art, the present invention has at least the following advantages:

The present invention proposes a new method, DC-CNN, for duplicate defect report detection. It combines two defect reports, each represented by a single-channel matrix, into a defect report pair represented by a dual-channel matrix. This dual-channel matrix is then input into a CNN model to extract latent features. The proposed method was validated on OpenOffice, Eclipse, NetBeans and their combined data set, Combined, and compared with the current state-of-the-art deep-learning-based duplicate report detection method; the method of the present invention is effective and, more importantly, performs better.

Brief description of the drawings

Fig. 1 shows the overall framework of the method of the present invention.

Fig. 2 shows the overall procedure for building the CNN model.

Fig. 3(a) shows the ROC curves of DC-CNN and SC-CNN on the OpenOffice data set, Fig. 3(b) shows the ROC curves of DC-CNN and SC-CNN on the Eclipse data set, Fig. 3(c) shows the ROC curves of DC-CNN and SC-CNN on the NetBeans data set, and Fig. 3(d) shows the ROC curves of DC-CNN and SC-CNN on the Combined data set.

Fig. 4 shows the influence of the word vector dimension.

Fig. 5 shows the influence of structured information.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 shows the overall framework of the method of the present invention, DC-CNN, which comprises three stages: data preparation, building the CNN model, and prediction for the defect report to be predicted. In the data preparation stage, the fields useful for duplicate detection, namely component, product, summary and description, are extracted from each defect report. For each report, the structured and unstructured information is placed together in one text document. After preprocessing, the text of all defect reports is collected to form a corpus. Word2vec is used to extract the semantic regularities of the corpus. Each report represented as text is converted into a single-channel matrix. To judge the relationship between two reports, the single-channel matrices representing the individual defect reports are combined into a dual-channel matrix representing the defect report pair. One part of these matrices is then used as the training set and the remainder as the validation set. In the training stage, a CNN model is trained with the training set as input. In the prediction stage, the trained model is loaded to predict the similarity of a defect report pair formed by an unknown defect report and a known defect report; this similarity is the probability that the pair is a duplicate.

A duplicate defect report detection method based on a dual-channel convolutional neural network comprises the following steps:

S100: data preparation

Structured information usually consists of optional attributes, and unstructured information is usually the textual description of the bug.

The present invention uses the StandardAnalyzer of Lucene to complete the above preprocessing steps. Stop words are removed using a standard English stop-word list. In addition, even two completely unrelated defect reports still share some identical words. These are usually technical terms such as java, com and org. Because they occur so frequently, they are also added to the stop-word list. After the above processing, some meaningless numbers remain in the text; they are removed as well.
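The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration and not the Lucene pipeline itself: the stop-word list is a shortened, assumed one, `crude_stem` is a hypothetical stand-in for a real stemmer, and lowercasing is done first for convenience.

```python
import re

# Assumed, abbreviated stop-word list: a few common English words plus the
# frequent technical tokens named in the text (java, com, org).
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "of", "to", "and", "when",
              "java", "com", "org"}


def crude_stem(word):
    """Very rough suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words and bare numbers, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]
    return [crude_stem(t) for t in kept]


print(preprocess("The editor crashes when opening org.eclipse documents 42"))
```

In practice a proper analyzer and stemmer would replace `crude_stem`; the sketch only shows the order of operations and the removal of frequent technical tokens and leftover numbers.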

Compared with a single-channel representation, the dual-channel representation of a defect report pair has the following benefits. First, the two reports can be processed by the CNN at the same time, which speeds up training. Second, training a CNN on dual-channel data has been shown to reach higher accuracy. With a dual-channel CNN, the convolution operation can capture the association between the two defect reports.

All labeled dual-channel matrices are divided into a training set and a validation set; in a specific implementation, 80% of the labeled dual-channel matrices form the training set and the remaining 20% form the validation set.
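As a rough sketch of the dual-channel representation, the following Python code builds a single-channel matrix for each report and stacks two of them into one pair. The zero-padding to a fixed number of rows `n_w`, the zero vector for unknown words, and the function names are assumptions made for illustration; the source only states that the number of words is fixed.

```python
def to_single_channel(tokens, vectors, n_w, m):
    """Build an n_w x m matrix whose rows are word vectors.

    Assumption: reports are truncated or zero-padded to n_w rows, and words
    missing from `vectors` map to a zero vector.
    """
    rows = [list(vectors.get(t, [0.0] * m)) for t in tokens[:n_w]]
    rows += [[0.0] * m for _ in range(n_w - len(rows))]
    return rows


def to_dual_channel(matrix_a, matrix_b):
    """Stack two single-channel matrices into a 2 x n_w x m pair."""
    return [matrix_a, matrix_b]


# Toy word vectors with dimension m = 2 (hypothetical values).
vectors = {"crash": [1.0, 2.0], "open": [3.0, 4.0]}
report_a = to_single_channel(["crash", "open"], vectors, n_w=3, m=2)
report_b = to_single_channel(["open"], vectors, n_w=3, m=2)
pair = to_dual_channel(report_a, report_b)
print(len(pair), len(pair[0]), len(pair[0][0]))
```

Each labeled pair would then carry its duplicate or non-duplicate label alongside this 2 x n_w x m structure.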

S200: Building the CNN model

To extract features from defect report pairs, the present invention provides convolution kernels of three different sizes in each convolutional layer. The first convolutional layer therefore has three branches. For each of these three branches, the second convolutional layer again has three new branches. Because the three branches are highly similar in structure, Fig. 2 shows only one branch of the first convolutional layer in the overall CNN workflow. Table 3 gives the specific parameter settings of the CNN model of the present invention.

Table 3

S202: In the first convolutional layer, a set of convolution kernels of size d × k_w is provided, where d is the length of a convolution kernel and k_w is its width; because each row of the input matrix represents one word, the kernel width equals the word vector dimension m. After the first convolution, the two channels of the dual-channel matrix are merged into one, so that the two defect reports can be treated as a whole when extracting features. The first-layer convolution formula is:

where C₁ denotes the output of the first convolutional layer, i denotes the i-th channel of the first-layer input I₁, j₁ denotes the j₁-th row of the input, b₁ denotes the bias, and f₁ denotes the nonlinear activation function, for which the present invention uses ReLU. Given the input length l (l = n_w), padding P = 0 and stride S = 1, the output length O₁ can be calculated as:

$$O_1 = \frac{l - d + 2P}{S} + 1 \quad (2)$$

The output of the first convolutional layer has shape […]. To further extract the associated features of the two reports, this output is reshaped to […] and then convolved again in the second convolutional layer, which is provided with convolution kernels of three sizes […], with […] kernels of each size. The formula of the second-layer convolution is:

where C₂ denotes the output of the second convolutional layer, j₂ denotes the j₂-th row of the second-layer input I₂, b₂ denotes the bias, and f₂ denotes the nonlinear activation function, for which the present invention uses ReLU. After this convolution, feature maps of three shapes […] are obtained, where O₂ can be calculated by formula (2) from l (l = O₁) and the different kernel lengths d.

S203: Max pooling is applied to all feature maps; in this way, each feature map is downsampled to shape […].
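The output-length rule of formula (2) and the channel-merging effect of the first convolution can be illustrated with a small pure-Python sketch. It assumes a standard multi-channel convolution with one kernel slice per channel and ReLU activation; the kernel values, the toy input and the function names are hypothetical, and the exact formula of the source is not reproduced here.

```python
def conv_output_length(l, d, p=0, s=1):
    """Output length for input length l, kernel length d, padding p, stride s."""
    return (l - d + 2 * p) // s + 1


def relu(x):
    return x if x > 0 else 0.0


def first_layer_conv(pair, kernel, bias=0.0):
    """Convolve a 2 x n_w x m pair with one 2 x d x m kernel.

    The kernel spans the full word-vector width m, so the result is a single
    column of length n_w - d + 1 in which both channels are merged.
    """
    n_w, d = len(pair[0]), len(kernel[0])
    out = []
    for j in range(conv_output_length(n_w, d)):
        total = bias
        for i in range(len(pair)):                  # input channels
            for r in range(d):                      # kernel rows
                for c, w in enumerate(kernel[i][r]):
                    total += w * pair[i][j + r][c]
        out.append(relu(total))
    return out


# Toy example: n_w = 3 words, m = 1, kernel length d = 2.
toy_pair = [[[1.0], [2.0], [3.0]], [[1.0], [1.0], [1.0]]]
toy_kernel = [[[1.0], [1.0]], [[-1.0], [-1.0]]]
print(first_layer_conv(toy_pair, toy_kernel))
print(conv_output_length(10, 3), conv_output_length(8, 2))
```

The second call shows how O₂ follows from O₁ by applying the same rule again with a second-layer kernel length, here an arbitrary value of 2.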

where i indexes the i-th element of T and b denotes the bias.

S205: All defect report pairs in the training set are traversed, repeating S202-S204.

where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs.

S207: After each training epoch, the model is validated using the validation set; when the validation loss has not decreased for 5 epochs, updating of the model parameters stops; otherwise, return to S201 and continue training the CNN model.
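The last layer, a weighted sum of the 300 outputs of the first fully connected layer followed by a sigmoid, can be sketched as follows. The function name mirrors the symbol sim_predict from the text; the input values are made up for illustration.

```python
import math


def sim_predict(x, w, b):
    """Sigmoid of the weighted sum of x with weights w and bias b."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# A zero input with zero bias sits exactly on the decision boundary.
print(sim_predict([0.0] * 300, [0.5] * 300, 0.0))
```

Because the sigmoid maps any real number into the interval (0, 1), the result can be read directly as the probability that the two reports are duplicates.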
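The loss formula itself is not reproduced in this text. As an illustration only, the sketch below assumes binary cross-entropy averaged over the n defect report pairs, which is a common choice for a sigmoid output but is not confirmed by the source; the clipping constant `eps` is also an assumption added for numerical safety.

```python
import math


def binary_cross_entropy(labels_real, sims_predicted, eps=1e-12):
    """Mean binary cross-entropy over all pairs (assumed loss, see lead-in)."""
    n = len(labels_real)
    total = 0.0
    for y, p in zip(labels_real, sims_predicted):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / n


print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.1, 0.9]))
```

Under this assumption, confident correct predictions give a small loss and confident wrong predictions give a large one, which is the signal back-propagation would use to update the parameters.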
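The stopping rule in S207 can be sketched as a simple patience counter. The sketch assumes that "no longer decreases" means the validation loss has not fallen below the best value seen so far for 5 consecutive epochs; the function name and the example losses are hypothetical.

```python
def epochs_until_stop(val_losses, patience=5):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0     # improvement: reset the counter
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


print(epochs_until_stop([0.9, 0.8, 0.7, 0.71, 0.72, 0.7, 0.75, 0.8]))
```

In a real training loop the same counter would be updated after each epoch's validation pass rather than over a precomputed list.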

S300: Prediction for the defect report to be predicted

For example, suppose a piece of software currently has N defect reports, each of which corresponds to a two-dimensional single-channel matrix after processing. The two-dimensional single-channel matrix of the defect report to be predicted is paired with each of the N two-dimensional single-channel matrices, giving N dual-channel matrices to be predicted. These N dual-channel matrices are input into the above CNN model one by one, giving N probabilities. When a probability is greater than the preset threshold, the defect report to be predicted and the corresponding existing defect report of the software are considered duplicates.
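The prediction stage can be sketched as a loop over the N known reports. Here `model` is any callable that maps a dual-channel pair to a probability; the toy model, the report identifiers, the default threshold of 0.5 and the sorting of hits are assumptions made for illustration.

```python
def find_duplicates(new_matrix, known_matrices, model, threshold=0.5):
    """Pair the new report with every known report and keep likely duplicates."""
    hits = []
    for report_id, known in known_matrices.items():
        probability = model([new_matrix, known])   # dual-channel input
        if probability > threshold:
            hits.append((report_id, probability))
    # Sorting by probability is a convenience added here, not part of the source.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical stand-in for the trained CNN: identical matrices score high.
def toy_model(pair):
    return 0.9 if pair[0] == pair[1] else 0.1


known = {101: [[1.0]], 102: [[2.0]]}
print(find_duplicates([[1.0]], known, toy_model))
```

With a trained network in place of `toy_model`, the returned list names the existing reports that the new report is predicted to duplicate.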

Experimental verification:

1. Data sets

For comparison, the present invention uses the same data set as Deshmukh et al., which was collected and processed by Lazar. It contains three large open-source projects: OpenOffice, Eclipse and NetBeans. OpenOffice is an office suite similar to Microsoft Office. Eclipse and NetBeans are open-source integrated development environments. To experiment with more training samples, a larger data set was obtained by merging these three data sets and named "Combined". These data sets also provide defect report pairing relationships; part of the pairing relationships in OpenOffice is shown in Table 4.

Table 4: defect report pair

Analysis of all pairing relationships in each data set revealed some problems. First, some pairs are repeated; for example, (200622, 197347, duplicate) occurs 5 times in OpenOffice. Second, some pairs express the same relationship, for example (159435, 164827, duplicate) and (164827, 159435, duplicate) in Eclipse. The present invention therefore removes these defect report pairs. Table 5 shows the number of pairs in the final data sets.

Table 5: complete data set

Dataset	Duplicate	Non-duplicate
OpenOffice	57340	41751
Eclipse	86385	160917
NetBeans	95066	89988
Combined	238791	292476
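The pair cleaning described before Table 5 can be sketched as follows. The sketch assumes that one copy of each repeated or mirrored pair is kept and the rest are dropped, which is one reading of the text; the function name is hypothetical.

```python
def clean_pairs(pairs):
    """Drop repeated pairs and mirrored pairs such as (a, b) and (b, a)."""
    seen = set()
    cleaned = []
    for id_a, id_b, label in pairs:
        key = (min(id_a, id_b), max(id_a, id_b))   # order-independent key
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((id_a, id_b, label))
    return cleaned


raw = [(200622, 197347, "duplicate")] * 5 + [
    (159435, 164827, "duplicate"),
    (164827, 159435, "duplicate"),
]
print(clean_pairs(raw))
```

Using the sorted pair of identifiers as the key treats a pair and its mirror image as the same relationship.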

Each data set is divided into a training set and a test set; the training set accounts for 80% (of which 10% is used as the validation set) and the test set accounts for 20%. In addition, so that the training set and test set reflect the distribution of the original data set, the ratio of duplicate to non-duplicate report pairs in both sets is kept the same as in the original data set. The pairs in the training set and test set are selected at random. Table 6 shows the detailed distribution of defect report pairs in the training set and test set.

Table 6: training set and test set
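A split that keeps the duplicate to non-duplicate ratio can be sketched by shuffling and cutting each class separately. The function name, the fixed seed and the toy data are assumptions; the further split of the training portion into training and validation data is omitted.

```python
import random


def stratified_split(pairs, test_fraction=0.2, seed=0):
    """Split (id_a, id_b, label) pairs so both parts keep the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({pair[2] for pair in pairs}):
        group = [pair for pair in pairs if pair[2] == label]
        rng.shuffle(group)                          # random selection per class
        cut = round(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test


toy_pairs = [(i, i + 1000, "duplicate") for i in range(50)]
toy_pairs += [(i, i + 2000, "non-duplicate") for i in range(100)]
train_set, test_set = stratified_split(toy_pairs)
print(len(train_set), len(test_set))
```

With 50 duplicate and 100 non-duplicate pairs, the test part receives 10 and 20 pairs respectively, so the 1:2 ratio of the full data is preserved.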

Evaluation criteria

In the model proposed by the present invention, the output represents the similarity of the two reports in a defect report pair, so its value lies between 0 and 1. For classification, a threshold is set. After sim_predict is obtained as described above, label_predict (the predicted label of a defect report pair) can be calculated according to the following formula:

$$label_{predict} = \begin{cases} 1, & sim_{predict} > threshold \\ 0, & \text{otherwise} \end{cases}$$

According to label_predict and label_real, report pairs are divided into four classes:

1) TP: label_real = 1, label_predict = 1

2) TN: label_real = 0, label_predict = 0

3) FP: label_real = 0, label_predict = 1

4) FN: label_real = 1, label_predict = 0

Here 1 indicates that a report pair is duplicate and 0 that it is non-duplicate. TP is the number of report pairs correctly predicted as duplicate, TN the number correctly predicted as non-duplicate, FP the number wrongly predicted as duplicate, and FN the number wrongly predicted as non-duplicate. These four counts are the basis for calculating the following evaluation criteria.

Accuracy:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Accuracy is the ratio of correctly predicted defect report pairs to all report pairs; it reflects how well the model classifies all defect report pairs. Because the sigmoid function is used for the output, the threshold is set to 0.5 when calculating Accuracy, Recall and Precision.

Recall:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

Recall is the ratio of defect report pairs correctly predicted as duplicate to all defect report pairs that are actually duplicate.

Precision:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Precision is the ratio of defect report pairs correctly predicted as duplicate to all report pairs predicted as duplicate.

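F₁-Score:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

F₁-Score is the harmonic mean of Recall and Precision.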
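The four counts and the metrics built on them can be computed as in the sketch below. The function names and the toy labels are hypothetical; the threshold of 0.5 follows the text, and the sketch does not guard against empty classes (which would cause a division by zero).

```python
def confusion_counts(labels_real, sims_predicted, threshold=0.5):
    """Count TP, TN, FP and FN for real labels and predicted similarities."""
    tp = tn = fp = fn = 0
    for label_real, sim in zip(labels_real, sims_predicted):
        label_predict = 1 if sim > threshold else 0
        if label_real == 1 and label_predict == 1:
            tp += 1
        elif label_real == 0 and label_predict == 0:
            tn += 1
        elif label_real == 0 and label_predict == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def evaluate(tp, tn, fp, fn):
    """Accuracy, Recall, Precision and F1-Score from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


counts = confusion_counts([1, 1, 0, 0, 1], [0.9, 0.4, 0.6, 0.1, 0.8])
print(counts, evaluate(*counts))
```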

ROC curve:

In practice, because the class distribution of defect report pairs in the data sets is imbalanced, traditional evaluation criteria such as Accuracy cannot assess the performance of the classifier well. The present invention therefore uses the ROC curve to further assess classifier performance. Different thresholds give different TPR and FPR values, from which the ROC curve can be drawn. TPR and FPR can be calculated according to the following formulas:

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

Plotting all FPR values on the horizontal axis and all TPR values on the vertical axis gives the ROC curve. The closer the curve is to the upper-left corner of the coordinate plane, the better the performance of the classifier.
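The construction of the ROC curve can be sketched by sweeping the threshold and recording one (FPR, TPR) point per value. The function name, the toy labels and the chosen thresholds are hypothetical, and the sketch assumes both classes are present.

```python
def roc_points(labels_real, sims_predicted, thresholds):
    """Return one (FPR, TPR) point per threshold."""
    points = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for label_real, sim in zip(labels_real, sims_predicted):
            predicted = 1 if sim > t else 0
            if predicted == 1 and label_real == 1:
                tp += 1
            elif predicted == 1 and label_real == 0:
                fp += 1
            elif predicted == 0 and label_real == 0:
                tn += 1
            else:
                fn += 1
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


curve = roc_points([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2], [0.0, 0.5, 0.8, 1.0])
print(curve)
```

Lowering the threshold moves along the curve toward (1, 1), and raising it moves toward (0, 0); a curve that bends toward the upper-left corner means high TPR is reached at low FPR.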

Experimental results

The technical effect of the method of the present invention is shown by answering the following questions.

Problem 1: Compared with the state-of-the-art deep-learning-based duplicate defect report detection method, is the DC-CNN of the present invention effective?

The research goal of the present invention is to propose a more effective deep-learning-based method. The method of the present invention is therefore compared with the method of Deshmukh et al. on the same data sets.

Table 7: the experimental result of the method for the present invention and Deshmukh et al. method

Results: Table 7 shows the experimental results of the method of the present invention and the method of Deshmukh et al. Using the same core method, a Siamese neural network, they built two similar models, a retrieval model and a classification model. For the classification model, the highest accuracy, 0.8275, occurs on the OpenOffice data set, while the accuracy on Eclipse is only 0.7268. Their retrieval model performs better than the classification model. For the retrieval model, the best-performing data set is again OpenOffice, with an accuracy of 0.9455; Eclipse is slightly worse, with an accuracy of 0.906. Compared with the Siamese classification model, the improvements of DC-CNN on OpenOffice, Eclipse, NetBeans and Combined are 11.54%, 24.17%, 17.89% and 13.33%, respectively. Compared with the Siamese retrieval model, the improvements of DC-CNN on Eclipse, NetBeans and Combined are 6.25%, 4.07% and 3.84%, respectively. On OpenOffice, the accuracy of DC-CNN is lower by less than 0.03%.

Implications: According to Table 7, on three data sets (Eclipse, NetBeans and Combined) DC-CNN outperforms both the classification model and the retrieval model that Deshmukh et al. built with a Siamese neural network. On OpenOffice, DC-CNN outperforms their classification model and performs very similarly to their retrieval model. Overall, DC-CNN achieves very good performance and surpasses the current state-of-the-art deep-learning-based duplicate report detection method.

Problem 2: Compared with SC-CNN, is DC-CNN effective?

To show that the dual-channel matrix representation of defect report pairs proposed by the present invention is effective, the single-channel matrix representation of a defect report is also used as a comparison baseline. The structure of the CNN is kept unchanged, including the number of convolution kernels, the kernel sizes and the number of convolutional layers; the CNN is used to extract features from the two reports of a pair separately, and their similarity is then computed. This method is called the single-channel convolutional neural network (Single-Channel Convolutional Neural Networks, SC-CNN).

Table 8: Experimental results of DC-CNN and SC-CNN

Results: The performance of the two methods is evaluated on Accuracy, Recall, Precision and F1-Score. The experimental results are shown in Table 8, where the best results are in bold. On all metrics of all data sets, DC-CNN is higher than SC-CNN. Compared with SC-CNN, on OpenOffice, Eclipse, NetBeans and Combined, the Accuracy of DC-CNN improves by 2.78%, 2.61%, 1.36% and 2.33%, the Recall by 2.73%, 0.51%, 1.49% and 3.17%, the Precision by 2.08%, 6.53%, 1.20% and 2.08%, and the F1-Score by 2.40%, 3.53%, 1.35% and 2.62%, respectively. Fig. 3(a) to Fig. 3(d) show the ROC curves of the two methods. On all data sets, the curve of DC-CNN lies above that of SC-CNN, which shows that DC-CNN also classifies better when the sample distribution is imbalanced.

Implications: All experimental results show that the dual-channel CNN model is more effective than the single-channel one. In SC-CNN, each report is converted into a matrix and input into the CNN to extract features, and the result is represented as a feature vector; whether two reports are duplicates is then judged by computing the similarity of the two feature vectors. In DC-CNN, the matrices of the two reports are combined into one dual-channel matrix and input into the CNN, so the two reports are convolved together. This can extract deep relationships between the two reports and makes full use of the CNN's ability to capture local features. Because the CNN model in DC-CNN focuses on extracting the association between the two reports, it performs better at detecting duplicate reports.

Problem 3: How do the experimental results change when the word vector dimension changes?

The invention proposes a kind of new defect reports to representation method --- binary channels matrix.Therefore, also explore with Influence of the relevant parameter to experimental result.For binary channels matrix because the quantity number of word it is fixed and for For CNN, two report positions (which is reported in first channel, which is reported in second channel) be it is indiscriminate, So the parameter for being most likely to occur change is the dimension of term vector.In order to answer when changing term vector dimension, experimental result is such as Term vector dimension is gradually changed from 10 to 100 and observation experiment result is in Open Office data set by what variation this problem On variation.

As a result: from fig. 4, it can be seen that when being gradually increased term vector dimension, under accuracy first increases and then shows Drop trend.When term vector dimension is 20, accuracy rate has reached maximum value, and 94.29%.

Influence: when term vector dimension increases to 20 from 10, accuracy is increased.When we continue to increase term vector dimension Degree, accuracy are reduced.Reason may be, when a term vector dimension characterizes a word enough.Continue to increase dimension Degree prevents it from indicating this word well instead.Although accuracy has reached maximum value when term vector dimension is equal to 20, But it is not higher by too much than the value under other conditions.On the one hand, term vector dimension increase can bring bigger data to deposit Storage problem；On the other hand, word insertion and complexity when CNN model training can all increase.Therefore, in the methods of the invention, 20 It is most suitable term vector dimension.

Problem 4: Is the method proposed by the present invention still effective when structured information is not used?

Structured information such as product, component and version provides very useful information when judging whether two reports are duplicates. Many methods treat structured information as a separate feature to improve the accuracy of duplicate defect report detection. Unstructured information is usually a natural-language description of the bug. For duplicate report detection, the CNN is mainly used to process unstructured text, and it performs well when handling long text. Unlike other methods, the present invention puts structured and unstructured information together into a single text document as text data, and then extracts features with the CNN. To answer Problem 4, a comparison experiment is set up in which structured information is removed from the input while all other conditions are left unchanged.
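The comparison described here amounts to building the text document with or without the structured fields. A minimal sketch follows, assuming each report is a dictionary keyed by field name; the field order, the `build_document` helper and the sample report are hypothetical.

```python
STRUCTURED_FIELDS = ("product", "component", "version")
UNSTRUCTURED_FIELDS = ("summary", "description")

def build_document(report, include_structured=True):
    """Join the selected fields of one report into a single text document."""
    fields = UNSTRUCTURED_FIELDS
    if include_structured:
        fields = STRUCTURED_FIELDS + UNSTRUCTURED_FIELDS
    # Skip fields that are missing or empty in this report.
    return " ".join(report[f] for f in fields if report.get(f))

# Hypothetical sample report.
report = {
    "product": "Writer",
    "component": "editing",
    "version": "4.1",
    "summary": "Crash on paste",
    "description": "Pasting a table crashes the editor.",
}

full_text = build_document(report)                               # default setting
ablated_text = build_document(report, include_structured=False)  # comparison setting
```

Because the structured fields contribute only a few tokens, the two documents differ only by a short prefix.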

Results: As can be seen from Fig. 5, after structured information is removed, the experimental results on all data sets decrease: by 1.74%, 3.79%, 3.38% and 2.56% on Open Office, Eclipse, Net Beans and Combined, respectively.

Implications: The experimental results show that feeding structured and unstructured information into the CNN together is effective. Note that although accuracy drops after structured information is removed, the drop is not fatal. The reason is that structured information accounts for only a small part of the entire text; the main part processed by the CNN is still the unstructured information.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently replaced without departing from its spirit and scope, and all such modifications and replacements shall fall within the scope of the claims of the present invention.

Claims

1. A duplicate defect report detection method based on a dual-channel convolutional neural network, characterized by comprising the following steps:

S100: data preparation

S101: Extract the defect reports of the software. Every defect report consists of structured information and unstructured information; for each defect report, all structured and unstructured information is put into a single text document;

S102: Perform a pre-processing step on each defect report, including tokenization, stemming, stop-word removal and case conversion;

S103: After pre-processing, the words in all defect reports are combined into a corpus. The existing Word2vec is applied to the corpus with the CBOW model selected to obtain a vector representation of each word, which yields a two-dimensional matrix representation of each defect report, referred to as the two-dimensional single-channel matrix of the defect report;

According to the known pairing information provided with the extracted defect reports (this pairing information is included in the data set and was prepared by the people who created the data set), a defect report pair consisting of two defect reports is represented by a two-dimensional dual-channel matrix, which is formed by combining the two-dimensional single-channel matrices of the two defect reports. The dual-channel matrix is then labeled as duplicate or non-duplicate;
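The data preparation of S102 and S103 can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the invention: the stop-word list is a tiny sample, `naive_stem` is a crude stand-in for a real stemmer, the word vectors come from a hand-written dictionary rather than a trained Word2vec CBOW model, and zero padding to a fixed number of words is assumed. All names are hypothetical.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "when", "to", "of", "in"}  # tiny illustrative list

def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, tokenize, drop stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def to_single_channel(tokens, vectors, n_words, dim):
    """Map tokens to word vectors; truncate or zero-pad to exactly n_words rows."""
    rows = [vectors.get(t, [0.0] * dim) for t in tokens[:n_words]]
    rows += [[0.0] * dim for _ in range(n_words - len(rows))]
    return rows

def make_pair(matrix_a, matrix_b, is_duplicate):
    """Combine two single-channel matrices into a labeled dual-channel sample."""
    return [matrix_a, matrix_b], 1 if is_duplicate else 0

# Hypothetical word vectors; in practice these would come from the trained embedding.
toy_vectors = {"editor": [0.1, 0.2], "crash": [0.3, 0.4]}

tokens = preprocess("The editor crashes when saving")
matrix = to_single_channel(tokens, toy_vectors, n_words=4, dim=2)
sample, label = make_pair(matrix, matrix, is_duplicate=True)
```

Words missing from the vector table are mapped to zero vectors here, which is one simple choice among several.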

S200: Establish the CNN model

S202: In the first convolutional layer, a set of convolution kernels is configured, where d is the length of a convolution kernel and k_w is its width. After the first convolution, the two channels of the dual-channel matrix are merged into one. The first-layer convolution formula is:

where C₁ denotes the output of the first convolutional layer, i denotes the i-th channel of the first convolutional layer's input I₁, j₁ denotes row j₁ of the input, b₁ denotes the bias, and f₁ denotes the nonlinear activation function. Given the input length l (l = n_w), padding P = 0 and stride S = 1, the output length O₁ can be calculated as:

O₁ = (l − d + 2P) / S + 1    (2)

where C₂ denotes the output of the second convolutional layer, j₂ denotes row j₂ of the second convolutional layer's input I₂, b₂ denotes the bias, and f₂ denotes the nonlinear activation function. After this convolution, feature maps of three different shapes are obtained, where O₂ can be calculated from l (l = O₁) and the different convolution kernel lengths d according to formula (2);

S203: Apply max pooling to all feature maps;

S204: Reshape and concatenate all feature maps to obtain a single vector, which is used as the input to the fully connected layers;

After two fully connected layers, a single probability sim_predict is obtained, which represents the predicted similarity of the two reports;

Given the output of the first fully connected layer T = {x₁, x₂, ..., x₃₀₀} and the weight vector W = {w₁, w₂, ..., w₃₀₀}, sim_predict can be calculated as:

where i denotes the i-th element of T and b denotes the bias;

S205: Traverse all defect report pairs in the training set, repeating S202-S204;

where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;
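Two of the quantities in S200 lend themselves to a short numerical sketch. The first function assumes the standard convolution output-length relation for input length l, kernel length d, padding P and stride S. The second assumes that the final fully connected layer applies a sigmoid to the weighted sum of T plus the bias b; the sigmoid is an assumption made here to obtain a probability and is not stated in the text above. The input length of 300 and the kernel lengths 1, 2 and 3 are hypothetical values.

```python
import math

def conv_output_length(l, d, padding=0, stride=1):
    """Standard output length of a convolution along one axis."""
    return (l - d + 2 * padding) // stride + 1

def sim_predict(T, W, b):
    """Weighted sum of T plus bias, squashed by a sigmoid (assumed) into (0, 1)."""
    z = sum(w * x for w, x in zip(W, T)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical sizes: 300 input rows, first-layer kernel length 1.
o1 = conv_output_length(300, 1)

# Three hypothetical second-layer kernel lengths give three output lengths.
o2_by_kernel = {d: conv_output_length(o1, d) for d in (1, 2, 3)}

# A 300-element first fully connected output with all-zero activations.
probability = sim_predict([0.0] * 300, [0.1] * 300, b=0.0)
```

With padding 0 and stride 1, each extra unit of kernel length shortens the output by one row, which is why different kernel lengths produce feature maps of different shapes.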

S300: Prediction for the defect report to be predicted

The defect report to be predicted is first pre-processed using the method in S102, and is then converted into its two-dimensional single-channel matrix using the method in S103;

The two-dimensional single-channel matrix of the defect report to be predicted is combined pairwise with the two-dimensional single-channel matrices of the N existing defect reports of the software to obtain N dual-channel matrices to be predicted, which form the prediction set. Each dual-channel matrix in the prediction set is fed into the CNN model as input to obtain a probability;

Among the N probabilities, if a probability is greater than the threshold, the defect report corresponding to that probability is considered a duplicate of the defect report to be predicted.
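The prediction stage of S300 can be sketched as a loop over the N known reports. The sketch assumes the trained model is available as a callable that maps a dual-channel matrix to a probability; `toy_model`, the threshold of 0.5, the report identifiers and the sorting of the results are all hypothetical choices made for illustration.

```python
def find_duplicates(new_matrix, known_matrices, model, threshold=0.5):
    """Pair the new report with every known report and keep pairs above the threshold."""
    hits = []
    for report_id, known_matrix in known_matrices.items():
        probability = model([new_matrix, known_matrix])  # one dual-channel input
        if probability > threshold:
            hits.append((report_id, probability))
    # Sorting by probability is a convenience, not part of the claimed method.
    return sorted(hits, key=lambda item: item[1], reverse=True)

def toy_model(pair):
    """Stand-in for the trained CNN: fraction of identical rows in the two channels."""
    first, second = pair
    same = sum(1 for row_a, row_b in zip(first, second) if row_a == row_b)
    return same / len(first)

# Hypothetical single-channel matrices of three known reports.
known = {
    "R1": [[1.0], [2.0]],
    "R2": [[1.0], [9.0]],
    "R3": [[7.0], [8.0]],
}
new_report = [[1.0], [2.0]]

duplicates = find_duplicates(new_report, known, toy_model)
```

Only pairs whose probability strictly exceeds the threshold are reported, matching the "greater than the threshold" condition.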

2. The duplicate defect report detection method based on a dual-channel convolutional neural network according to claim 1, characterized in that: in S101, the structured information is the product and component, and the unstructured information is the summary and description.

3. The duplicate defect report detection method based on a dual-channel convolutional neural network according to claim 1, characterized in that: in all layers except the last fully connected layer, ReLU is used as the activation function to extract more nonlinear features.