CN110188047A - Duplicate defect report detection method based on dual-channel convolutional neural networks
- Publication number
- CN110188047A (application number CN201910474540.6A)
- Authority
- CN
- China
- Prior art keywords
- defect report
- report
- dual-channel
- defect
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3692—Test management for test results analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The present invention relates to a duplicate defect report detection method based on dual-channel convolutional neural networks. The method comprises three steps: data preparation, building a CNN model, and prediction for the defect report to be predicted. In data preparation, the fields useful for duplicate detection are extracted from each defect report, and the structured and unstructured information of each report is placed together in one text document. After preprocessing, each report, represented as text, is converted into a single-channel matrix. Single-channel matrices are combined into dual-channel matrices; part of these form the training set and the rest form the validation set. In the model-building step, the model is trained with the training set as input. In the prediction stage, the trained model is loaded to predict the similarity of a defect report pair consisting of an unknown defect report and a known defect report; this similarity is a probability indicating how likely the pair is to be duplicates. The method achieves high prediction accuracy.
Description
Technical field
The present invention relates to the field of software testing, and in particular to a duplicate defect report detection method based on dual-channel convolutional neural networks.
Background
Modern software projects use defect tracking systems such as Bugzilla [17] to store and manage defect reports. Software developers, testers, and end users submit defect reports to describe the problems they encounter. Defect reports help guide software maintenance and repair. As software systems grow, hundreds of defect reports may be submitted every day. When more than one person submits a report describing the same bug, duplicate defect reports arise. Because defect reports are written in natural language, the same bug is likely to be described in different ways.
Because the number of defect reports is large, detecting duplicates manually is difficult. Moreover, since the reports are written in natural language, imposing a standard template is impractical. Automatic detection of duplicate defect reports is therefore valuable, because it avoids fixing the same bug more than once. In recent years, many automatic detection techniques have been proposed to solve this problem. They fall roughly into two directions: information retrieval and machine learning.
Information retrieval methods usually compute the textual similarity of two defect reports, that is, they compute similarity from the text descriptions.
For example, Hiew built a model using VSM (Vector Space Model), which represents a report as a vector with TF-IDF (Term Frequency-Inverse Document Frequency) term weighting. Based on VSM, Runeson et al. were the first to detect duplicate defect reports with natural language processing techniques. Wang et al. argued that natural language information alone cannot solve the problem well, so they also used execution information as a feature for duplicate detection. However, only a small fraction of reports contain execution information, which severely limits this approach. Sun et al. proposed REP, which uses not only summary and description but also structured information such as product, component, and version. To obtain a better text similarity, they extended BM25F, a similarity measure that is effective in information retrieval. In addition to textual and structural similarity, Alipour et al. also considered the influence of contextual information on duplicate detection; they applied LDA to these features and obtained better results. Information-retrieval-based methods perform well in both accuracy and time efficiency, but their results are unsatisfactory when the same problem is described in different terms.
Machine learning methods extract latent features of reports through self-learning algorithms, but traditional machine learning methods cannot learn deep features of the input well. SVM is a classic machine learning method. Jalbert et al. used it to build a classification system that filters duplicate reports. They also argued that earlier methods did not make full use of the various features in defect reports, so their model used surface features, textual semantics, and graph clustering. Building on the work of Jalbert et al., Tian et al. considered some new features and built a linear model; by addressing features and imbalanced data, they improved the accuracy of duplicate detection. Sun et al. built a discriminative model with SVM and were the first to divide defect reports into two classes, duplicate and non-duplicate. L2R (learning to rank) is another very useful machine learning method. Based on it, Zhou et al. considered textual and statistical features and applied stochastic gradient descent to them. This method outperforms traditional information retrieval methods such as VSM and BM25F. With the application of word embedding techniques in natural language processing, more and more researchers use them to detect duplicate reports. Budhiraja et al. used word embeddings to convert defect reports into vectors and then computed their similarity. The experimental results show that this method has the potential to improve duplicate detection accuracy.
Summary of the invention
The technical problem addressed by the present invention is the automatic detection of duplicate reports. This problem can be reduced to judging the relationship between two defect reports, that is, whether a defect report pair formed from two reports is duplicate or non-duplicate.
To achieve the above object, the present invention adopts the following technical solution: a duplicate defect report detection method based on dual-channel convolutional neural networks, comprising the following steps:
S100: Data preparation
S101: Extract the defect reports of the software. Every defect report consists of structured and unstructured information; for each defect report, all structured and unstructured information is placed in a separate text document;
S102: Preprocess each defect report, including tokenization, stemming, stop-word removal, and case conversion;
S103: After preprocessing, the words of all defect reports are combined into a corpus. The existing Word2vec with the CBOW model is applied to the corpus to obtain a vector representation of each word, and hence a two-dimensional matrix representation of each defect report, called the two-dimensional single-channel matrix of the defect report;
Based on the known pairing information provided when the defect reports were extracted (this pairing information is part of the data set and was prepared by the creators of the data set), the defect report pair formed by two defect reports is represented by a two-dimensional dual-channel matrix, which is composed of the two-dimensional single-channel matrices of the two defect reports. The dual-channel matrix is then labeled as duplicate or non-duplicate;
All labeled dual-channel matrices are divided into a training set and a validation set;
S200: Building the CNN model
S201: Input all dual-channel matrices in the training set and the validation set into the CNN model;
S202: In the first convolutional layer, convolution kernels are set, where d is the length of a kernel and k_w is its width. After the first convolution, the two channels of the dual-channel matrix are merged into one. The first-layer convolution formula is:
where C_1 denotes the output of the first convolutional layer, i denotes the i-th channel of the first-layer input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes the bias, and f_1 denotes the nonlinear activation function. Given input length l (l = n_w), padding P = 0, and stride S = 1, the output length O_1 can be computed as:
$$O_1 = \frac{l - d + 2P}{S} + 1 \qquad (2)$$
The output of the first convolutional layer is reshaped and then convolved again in the second convolutional layer, which uses convolution kernels of three sizes with a fixed number of kernels of each size. The second-layer convolution formula is:
where C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second-layer input I_2, b_2 denotes the bias, and f_2 denotes the nonlinear activation function. After this convolution, feature maps of three shapes are obtained, where O_2 can be computed from l (l = O_1) and the different kernel lengths d according to formula (2);
S203: Apply max pooling to all feature maps;
S204: Reshape and concatenate all feature maps into a single vector, which serves as the input to the fully connected layers;
After two fully connected layers, a single probability sim_predict is obtained, which represents the predicted similarity of the two reports;
In the last layer, sigmoid is used as the activation function to obtain sim_predict;
Given the output T = {x_1, x_2, …, x_300} of the first fully connected layer and the weight vector W = {w_1, w_2, …, w_300}, sim_predict can be computed as:
$$sim_{predict} = \mathrm{sigmoid}\left(\sum_{i=1}^{300} w_i x_i + b\right) \qquad (4)$$
where i denotes the i-th element of T and b denotes the bias;
S205: Traverse all defect report pairs in the training set, repeating S202-S204;
S206: Backpropagate according to the loss function to update the hidden parameters of the model. The loss function is given by formula (5):
where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;
S207: After each training epoch, validate the model on the validation set. When the validation loss has not decreased for 5 epochs, stop updating the model parameters; otherwise return to S201 and continue training the CNN model;
S300: Prediction for the defect report to be predicted
The defect report to be predicted is first preprocessed using the method of S102 and then converted, using the method of S103, into its two-dimensional single-channel matrix;
This matrix is paired with the two-dimensional single-channel matrix of each of the N existing defect reports of the software, yielding N dual-channel matrices to be predicted, which constitute the prediction set. Each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
Among the N probabilities, if a probability is greater than the threshold, the defect report corresponding to that probability and the defect report to be predicted are considered duplicates.
As an improvement, in S101 the structured information is product and component, and the unstructured information is summary and description.
As an improvement, all layers except the last fully connected layer use ReLU as the activation function to extract more nonlinear features.
Compared with the prior art, the present invention has at least the following advantages:
The invention proposes a new method, DC-CNN, for duplicate defect report detection. It combines two defect reports, each represented by a single-channel matrix, into a defect report pair represented by a dual-channel matrix. The dual-channel matrix is then fed into a CNN model to extract implicit features. The proposed method was validated on Open Office, Eclipse, Net Beans, and their combined data set, Combined. Compared with the current state-of-the-art deep-learning-based duplicate report detection method, the proposed method is effective and, more importantly, performs better.
Brief description of the drawings
Fig. 1 is the overall framework of the method of the invention.
Fig. 2 is the overall process of building the CNN model.
Fig. 3(a) is the ROC curve of DC-CNN and SC-CNN on the Open Office data set, Fig. 3(b) is the ROC curve of DC-CNN and SC-CNN on the Eclipse data set, Fig. 3(c) is the ROC curve of DC-CNN and SC-CNN on the Net Beans data set, and Fig. 3(d) is the ROC curve of DC-CNN and SC-CNN on the Combined data set.
Fig. 4 shows the influence of the word vector dimension.
Fig. 5 shows the influence of unstructured information.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings.
Fig. 1 shows the overall framework of the proposed method DC-CNN, which comprises three stages: data preparation, building the CNN model, and prediction for the defect report to be predicted. In the data preparation stage, the fields useful for duplicate detection, namely component, product, summary, and description, are extracted from the defect reports. For each report, the structured and unstructured information is placed together in one text document. After preprocessing, the text of all defect reports is collected to form a corpus. Word2vec is used to capture the semantic regularities of the corpus. Each report, represented as text, is converted into a single-channel matrix. To judge the relationship between two reports, the single-channel matrices representing individual defect reports are combined into dual-channel matrices representing defect report pairs. Part of these is then used as the training set and the rest as the validation set. In the training stage, a CNN model is trained with the training set as input. In the prediction stage, the trained model is loaded to predict the similarity of a defect report pair consisting of an unknown defect report and a known defect report; this similarity is a probability indicating how likely the pair is to be duplicates.
A duplicate defect report detection method based on dual-channel convolutional neural networks comprises the following steps:
S100: Data preparation
S101: Extract the defect reports of the software. Every defect report consists of structured and unstructured information; for each defect report, all structured and unstructured information is placed in a separate text document;
Structured information usually consists of selectable attributes, while unstructured information is usually the textual description of the bug.
S102: Preprocess each defect report, including tokenization, stemming, stop-word removal, and case conversion;
The invention uses Lucene's StandardAnalyzer to perform the above preprocessing. A standard English stop-word list is used for stop-word removal. In addition, even two completely unrelated defect reports still share some words; these are usually technical terms such as java, com, and org. Because they occur so frequently, they are also added to the stop-word list. After the above processing, some meaningless numbers remain in the text, and they are removed as well.
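The preprocessing of S102 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patent uses Lucene's StandardAnalyzer, whereas the stop-word list, the `naive_stem` suffix stripper, and the function names below are hypothetical stand-ins.

```python
import re

# Hypothetical, tiny stop-word list: a few standard English words plus the
# project-specific terms mentioned in the text (java, com, org).
STOP_WORDS = {"the", "a", "an", "is", "in", "on", "of", "to", "and", "when",
              "java", "com", "org"}


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words and pure numbers, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]
    return [naive_stem(t) for t in kept]


print(preprocess("Crashes when opening the file in org.eclipse.ui 3"))
# prints ['crash', 'open', 'file', 'eclipse', 'ui']
```

A production pipeline would replace `naive_stem` with a proper stemmer and use a complete stop-word list.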
S103: After preprocessing, the words of all defect reports are combined into a corpus. The existing Word2vec with the CBOW model is applied to the corpus to obtain a vector representation of each word, and hence a two-dimensional matrix representation of each defect report, called the two-dimensional single-channel matrix of the defect report;
Based on the known pairing information provided when the defect reports were extracted (this pairing information is part of the data set and was prepared by the creators of the data set), the defect report pair formed by two defect reports is represented by a two-dimensional dual-channel matrix, which is composed of the two-dimensional single-channel matrices of the two defect reports. The dual-channel matrix is then labeled as duplicate or non-duplicate;
Compared with a single channel, representing a defect report pair with two channels has the following benefits. First, the two reports can be processed by the CNN at the same time, which speeds up training. Second, training a CNN on dual-channel data has been shown to achieve higher accuracy. Through the convolution operation, a dual-channel CNN can capture the correlation between the two defect reports.
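A minimal sketch of how two reports might be turned into one dual-channel matrix is shown below. It assumes that every report is truncated or zero-padded to a fixed number of rows n_w and that word vectors are available as a plain dictionary; in the method they would come from Word2vec (CBOW). The helper names and the toy vectors are hypothetical.

```python
def to_single_channel(tokens, vectors, n_w, m):
    """Map tokens to rows of an n_w x m matrix; truncate or zero-pad."""
    zero = [0.0] * m
    rows = [list(vectors.get(t, zero)) for t in tokens[:n_w]]
    rows += [list(zero) for _ in range(n_w - len(rows))]
    return rows


def to_dual_channel(matrix_a, matrix_b):
    """Stack two single-channel matrices into shape (2, n_w, m)."""
    if len(matrix_a) != len(matrix_b):
        raise ValueError("both reports must have the same number of rows")
    return [matrix_a, matrix_b]


# Hypothetical 2-dimensional word vectors, for illustration only.
vectors = {"crash": [0.1, 0.9], "open": [0.4, 0.2], "file": [0.7, 0.3]}
a = to_single_channel(["crash", "open", "file"], vectors, n_w=4, m=2)
b = to_single_channel(["file", "crash"], vectors, n_w=4, m=2)
pair = to_dual_channel(a, b)
print(len(pair), len(pair[0]), len(pair[0][0]))  # prints 2 4 2
```

Unknown words are mapped to zero vectors here; that choice is also an assumption rather than something the text specifies.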
All labeled dual-channel matrices are divided into a training set and a validation set. In a specific implementation, 80% of the labeled dual-channel matrices form the training set and the remaining 20% form the validation set.
S200: Building the CNN model
To extract features from defect report pairs, each convolutional layer is provided with convolution kernels of three different sizes. The first convolutional layer therefore has three branches. For each of these three branches, there are again three new branches in the second convolutional layer. Because the structures of the three branches are highly similar, Fig. 2 shows only one branch of the first convolutional layer in the overall CNN workflow. Table 3 gives the specific parameter settings of the CNN model of the invention.
Table 3
S201: Input all dual-channel matrices in the training set and the validation set into the CNN model;
S202: In the first convolutional layer, convolution kernels are set, where d is the length of a kernel and k_w is its width. Because each row of the input matrix represents one word, the kernel width equals the word vector dimension m. After the first convolution, the two channels of the dual-channel matrix are merged into one, so that features can be extracted from the two defect reports as a whole. The first-layer convolution formula is:
where C_1 denotes the output of the first convolutional layer, i denotes the i-th channel of the first-layer input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes the bias, and f_1 denotes the nonlinear activation function, for which the invention uses ReLU. Given input length l (l = n_w), padding P = 0, and stride S = 1, the output length O_1 can be computed as:
$$O_1 = \frac{l - d + 2P}{S} + 1 \qquad (2)$$
To further extract the associated features of the two reports, the output of the first convolutional layer is reshaped and then convolved again in the second convolutional layer, which uses convolution kernels of three sizes with a fixed number of kernels of each size. The second-layer convolution formula is:
where C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second-layer input I_2, b_2 denotes the bias, and f_2 denotes the nonlinear activation function, for which the invention uses ReLU. After this convolution, feature maps of three shapes are obtained, where O_2 can be computed from l (l = O_1) and the different kernel lengths d according to formula (2).
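The output lengths O_1 and O_2 follow the usual relation for a convolution along the word axis, O = (l - d + 2P)/S + 1. The sketch below evaluates it for both layers; the report length of 300 words and the kernel lengths 1, 2, and 3 are hypothetical values chosen for illustration, and for simplicity each branch reuses the same kernel length in both layers.

```python
def conv_output_length(l, d, padding=0, stride=1):
    """Output length of a convolution along the word axis."""
    return (l - d + 2 * padding) // stride + 1


n_w = 300                  # hypothetical number of words per report
for d in (1, 2, 3):        # hypothetical kernel lengths
    o1 = conv_output_length(n_w, d)   # first layer, l = n_w
    o2 = conv_output_length(o1, d)    # second layer, l = O_1
    print(d, o1, o2)
```

With P = 0 and S = 1 the relation reduces to O = l - d + 1, so each layer shortens the sequence by d - 1 rows.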
S203: Apply max pooling to all feature maps. In this way, each feature map is down-sampled.
S204: Reshape and concatenate all feature maps into a single vector, which serves as the input to the fully connected layers;
After two fully connected layers, a single probability sim_predict is obtained, which represents the predicted similarity of the two reports;
In the last layer, sigmoid is used as the activation function to obtain sim_predict;
Given the output T = {x_1, x_2, …, x_300} of the first fully connected layer and the weight vector W = {w_1, w_2, …, w_300}, sim_predict can be computed as:
$$sim_{predict} = \mathrm{sigmoid}\left(\sum_{i=1}^{300} w_i x_i + b\right) \qquad (4)$$
where i denotes the i-th element of T and b denotes the bias.
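The final layer described above is a weighted sum of the 300 hidden activations followed by a sigmoid. A minimal sketch, with hypothetical input and weight values:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sim_predict(x, w, b):
    """Weighted sum of the hidden activations followed by sigmoid."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


x = [0.5] * 300      # hypothetical output T of the first fully connected layer
w = [0.01] * 300     # hypothetical weight vector W
print(sim_predict(x, w, b=-1.0))
```

Because sigmoid maps any real number into the interval (0, 1), the result can be read directly as the probability that the pair is duplicate.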
S205: Traverse all defect report pairs in the training set, repeating S202-S204.
S206: Backpropagate according to the loss function to update the hidden parameters of the model. The loss function is given by formula (5):
where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs.
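For a sigmoid output trained against 0/1 labels, a common loss is binary cross-entropy averaged over the n pairs. The sketch below assumes that choice; it is an illustration of a loss built from label_real, sim_predict, and n, and may differ from the patent's formula (5).

```python
import math


def binary_cross_entropy(labels_real, sims_predict, eps=1e-12):
    """Mean binary cross-entropy over n defect report pairs (assumed loss)."""
    n = len(labels_real)
    total = 0.0
    for y, p in zip(labels_real, sims_predict):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / n


# Hypothetical labels and predicted similarities.
print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))
```

The loss is small when duplicate pairs receive similarities near 1 and non-duplicate pairs receive similarities near 0.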
S207: After each training epoch, validate the model on the validation set. When the validation loss has not decreased for 5 epochs, stop updating the model parameters; otherwise return to S201 and continue training the CNN model.
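The stopping rule of S207 is a form of early stopping with a patience of 5 epochs. The sketch below shows only the bookkeeping: the list of validation losses stands in for the per-epoch train-and-validate cycle, and the function name and loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops."""
    best = float("inf")
    stale = 0                      # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)         # patience never exhausted


losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63, 0.64, 0.50]
print(stopping_epoch(losses))      # prints 8
```

In this example the best loss is reached at epoch 3, the next five epochs bring no improvement, and training stops at epoch 8 before the later value is ever seen.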
S300: Prediction for the defect report to be predicted
The defect report to be predicted is first preprocessed using the method of S102 and then converted, using the method of S103, into its two-dimensional single-channel matrix;
This matrix is paired with the two-dimensional single-channel matrix of each of the N existing defect reports of the software, yielding N dual-channel matrices to be predicted, which constitute the prediction set. Each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
Among the N probabilities, if a probability is greater than the threshold, the defect report corresponding to that probability and the defect report to be predicted are considered duplicates.
For example, suppose a software project currently has N defect reports. After processing, each defect report corresponds to a two-dimensional single-channel matrix. The two-dimensional single-channel matrix of the defect report to be predicted is paired with each of the N matrices, yielding N dual-channel matrices to be predicted, which are input one by one into the above CNN model to obtain N probabilities. When a probability exceeds the preset threshold, the defect report to be predicted and the corresponding existing defect report of the software are considered duplicates.
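The prediction loop can be sketched as follows. The trained CNN is represented by an arbitrary callable that takes a dual-channel matrix and returns a probability; `toy_model`, the report identifiers, and the tiny matrices are hypothetical placeholders used only to make the example runnable.

```python
def find_duplicates(new_matrix, existing, model, threshold=0.5):
    """Pair the new report with each existing one and keep likely duplicates.

    existing: dict mapping report id -> single-channel matrix
    model:    callable taking a dual-channel matrix, returning a probability
    """
    hits = []
    for report_id, matrix in existing.items():
        prob = model([new_matrix, matrix])
        if prob > threshold:
            hits.append((report_id, prob))
    return sorted(hits, key=lambda h: h[1], reverse=True)


def toy_model(pair):
    """Placeholder for the trained CNN: fraction of identical rows."""
    a, b = pair
    same = sum(1 for row_a, row_b in zip(a, b) if row_a == row_b)
    return same / len(a)


existing = {101: [[1, 0], [0, 1]], 102: [[0, 0], [0, 1]], 103: [[0, 0], [1, 1]]}
new_report = [[1, 0], [0, 1]]
print(find_duplicates(new_report, existing, toy_model))  # prints [(101, 1.0)]
```

Sorting by probability is a convenience added here so that the most likely duplicates appear first; the text only requires the threshold comparison.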
Verification experimental verification:
1. Data sets
For comparison, the invention uses the same data sets as Deshmukh et al., which were collected and processed by Lazar. They cover three large open-source projects: Open Office, Eclipse, and Net Beans. Open Office is an office suite similar to Microsoft Office. Eclipse and Net Beans are open-source integrated development environments. To experiment with more training samples, a larger data set was obtained by merging the three data sets and was named "Combined". The data sets also provide the pairing relationships between defect reports; some of the pairing relationships in Open Office are shown in Table 4.
Table 4: Defect report pairs
Analysis of all pairing relationships in each data set revealed some problems. First, some pairs are repeated; for example, (200622, 197347, duplicate) occurs 5 times in Open Office. Second, some pairs express the same relationship, for example (159435, 164827, duplicate) and (164827, 159435, duplicate) in Eclipse. The invention therefore removes these redundant defect report pairs. Table 5 shows the number of pairs in the resulting data sets.
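Both kinds of redundancy can be removed in one pass by treating a pair as unordered. The sketch below assumes that one copy of each relationship is kept; the function name is hypothetical, and the sample triples are the ones quoted above.

```python
def deduplicate_pairs(pairs):
    """Drop repeated pairs and pairs that differ only in the order of ids."""
    seen = set()
    unique = []
    for id_a, id_b, label in pairs:
        key = (min(id_a, id_b), max(id_a, id_b), label)  # order-independent
        if key not in seen:
            seen.add(key)
            unique.append((id_a, id_b, label))
    return unique


pairs = [
    (200622, 197347, "duplicate"),
    (200622, 197347, "duplicate"),
    (159435, 164827, "duplicate"),
    (164827, 159435, "duplicate"),
]
print(deduplicate_pairs(pairs))
# prints [(200622, 197347, 'duplicate'), (159435, 164827, 'duplicate')]
```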
Table 5: Complete data sets
Data set | Duplicate | Non-duplicate |
Open Office | 57340 | 41751 |
Eclipse | 86385 | 160917 |
Net Beans | 95066 | 89988 |
Combined | 238791 | 292476 |
Each data set is divided into a training set and a test set: the training set accounts for 80% (of which 10% is used as the validation set) and the test set accounts for 20%. In addition, so that the training and test sets reflect the distribution of the original data set, the ratio of duplicate to non-duplicate pairs in both sets is kept the same as in the original data set. The training and test sets are selected at random. Table 6 shows the detailed distribution of defect report pairs in the training and test sets.
Table 6: Training and test sets
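A random split that preserves the duplicate to non-duplicate ratio is usually called a stratified split. A minimal sketch, assuming pairs are stored as (id, id, label) triples; the function name, the seed, and the synthetic pairs are hypothetical.

```python
import random


def stratified_split(pairs, train_fraction=0.8, seed=0):
    """Randomly split pairs while keeping the label ratio in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted({p[2] for p in pairs}):
        group = [p for p in pairs if p[2] == label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Synthetic data: 10 duplicate pairs and 20 non-duplicate pairs.
pairs = [(i, i + 1000, "duplicate") for i in range(10)]
pairs += [(i, i + 2000, "non-duplicate") for i in range(20)]
train, test = stratified_split(pairs)
print(len(train), len(test))  # prints 24 6
```

The validation set could be carved out of the training part by applying the same function once more.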
Evaluation criteria
In the model proposed by the invention, the output represents the similarity of the two reports in a defect report pair, so its value lies between 0 and 1. For classification, a threshold is set. After sim_predict is obtained as described in the third section, label_predict (the predicted label of a defect report pair) can be computed by the following formula:
$$label_{predict} = \begin{cases} 1, & sim_{predict} > threshold \\ 0, & \text{otherwise} \end{cases}$$
According to label_predict and label_real, report pairs fall into four classes:
1) TP: label_real = 1, label_predict = 1
2) TN: label_real = 0, label_predict = 0
3) FP: label_real = 0, label_predict = 1
4) FN: label_real = 1, label_predict = 0
Here 1 means that the report pair is duplicate and 0 means that it is non-duplicate. TP is the number of pairs correctly predicted as duplicate, TN is the number of pairs correctly predicted as non-duplicate, FP is the number of pairs incorrectly predicted as duplicate, and FN is the number of pairs incorrectly predicted as non-duplicate. These four counts are the basis for the evaluation criteria below.
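The four counts can be obtained by thresholding each predicted similarity and comparing the result with the real label. A minimal sketch with hypothetical labels and similarities; a similarity strictly greater than the threshold is treated as a duplicate prediction.

```python
def confusion_counts(labels_real, sims_predict, threshold=0.5):
    """Return (TP, TN, FP, FN) for 0/1 labels and predicted similarities."""
    tp = tn = fp = fn = 0
    for y, s in zip(labels_real, sims_predict):
        predicted = 1 if s > threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 0 and predicted == 0:
            tn += 1
        elif y == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


labels = [1, 1, 0, 0, 1]
sims = [0.9, 0.4, 0.2, 0.7, 0.6]
print(confusion_counts(labels, sims))  # prints (2, 1, 1, 1)
```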
Accuracy:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy is the ratio of correctly predicted defect report pairs to all report pairs; it reflects how well the model classifies all defect report pairs. Because a sigmoid function is used for the final regression output, the threshold is set to 0.5 when computing Accuracy, Recall, and Precision.
Recall:
$$Recall = \frac{TP}{TP + FN}$$
Recall is the ratio of defect report pairs correctly predicted as duplicate to all pairs that are actually duplicate.
Precision:
$$Precision = \frac{TP}{TP + FP}$$
Precision is the ratio of defect report pairs correctly predicted as duplicate to all pairs predicted as duplicate.
F1-Score:
$$F1\text{-}Score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
F1-Score is the harmonic mean of Recall and Precision.
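The four metrics follow directly from the TP, TN, FP, and FN counts using their standard definitions. The counts in the example are hypothetical and are not results from the experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, Recall, Precision and F1-Score from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Hypothetical counts for illustration.
for name, value in classification_metrics(tp=80, tn=90, fp=10, fn=20).items():
    print(f"{name}: {value:.4f}")
```

The guards against empty denominators return 0.0, which is one common convention when a class is never predicted or never present.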
ROC curve:
Because the class distribution of defect report pairs in the data sets is imbalanced, traditional evaluation criteria such as Accuracy cannot assess classifier performance well. The invention therefore uses the ROC curve to further evaluate the classifier. Different thresholds yield different TPR and FPR values, from which the ROC curve can be drawn. TPR and FPR can be computed by the following formulas:
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$
Taking all FPR values as the horizontal axis and all TPR values as the vertical axis gives the ROC curve. The closer the curve is to the upper-left corner of the coordinate plane, the better the classifier performs.
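The points of a ROC curve can be computed by sweeping the threshold and recording TPR = TP / (TP + FN) and FPR = FP / (FP + TN) at each step. The labels, similarities, and thresholds below are hypothetical.

```python
def roc_points(labels_real, sims_predict, thresholds):
    """Return one (FPR, TPR) point per threshold."""
    points = []
    for t in thresholds:
        tp = fn = fp = tn = 0
        for y, s in zip(labels_real, sims_predict):
            predicted = 1 if s > t else 0
            if y == 1:
                tp += predicted
                fn += 1 - predicted
            else:
                fp += predicted
                tn += 1 - predicted
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, tpr))
    return points


labels = [1, 1, 0, 0, 1]
sims = [0.9, 0.4, 0.2, 0.7, 0.6]
for fpr, tpr in roc_points(labels, sims, thresholds=[0.0, 0.5, 1.0]):
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

A threshold of 0 labels every pair as duplicate, giving the point (1, 1), and a threshold of 1 labels none, giving (0, 0); intermediate thresholds trace the curve between them.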
Experimental result
Show the technical effect of the method for the present invention by answering following Railway Project.
Problem 1: compared with the state-of-the-art repeated defects report detection method based on deep learning, DC-CNN of the invention
Whether effectively?
Goal in research of the invention is proposition one more effectively based on the method for deep learning.It therefore, will be of the invention
Method and the method for Deshmukh et al. compare on identical data set.
Table 7: the experimental result of the method for the present invention and Deshmukh et al. method
Results: Table 7 shows the experimental results of the method of the present invention and the method of Deshmukh et al. Using the same core method, a Siamese neural network, they built two similar models: a retrieval model and a classification model. For the classification model, the highest accuracy appears on the Open Office data set, reaching 0.8275, while on Eclipse it is only 0.7268. Their retrieval model performs better than their classification model. For the retrieval model, the best-performing data set is still Open Office, with an accuracy of 0.9455. Likewise, Eclipse is slightly worse, with an accuracy of 0.906. Compared with the classification model built on the Siamese neural network, DC-CNN improves accuracy on Open Office, Eclipse, NetBeans and Combined by 11.54%, 24.17%, 17.89% and 13.33% respectively. Compared with the retrieval model built on the Siamese neural network, DC-CNN improves accuracy on Eclipse, NetBeans and Combined by 6.25%, 4.07% and 3.84% respectively. On Open Office, the accuracy of DC-CNN is lower by less than 0.3% (0.9429 versus 0.9455).
Implications: According to Table 7, on three data sets (Eclipse, NetBeans, Combined) DC-CNN outperforms both the classification model and the retrieval model that Deshmukh et al. built with the Siamese neural network. On Open Office, DC-CNN outperforms their classification model and performs very similarly to their retrieval model. Overall, DC-CNN achieves very good performance and surpasses the current state-of-the-art duplicate report detection method based on deep learning.
Question 2: Is DC-CNN effective compared with SC-CNN?
To demonstrate that the dual-channel matrix representation of a defect report pair proposed by the present invention is effective, the single-channel matrix representation of a defect report is also used as a comparison baseline. The structure of the CNN is kept unchanged, including the number of convolution kernels, the size of the convolution kernels and the number of convolutional layers. The CNN is used to extract the features of the two reports in a defect report pair separately, and their similarity is then calculated. This method is called Single-Channel Convolutional Neural Networks (SC-CNN).
Table 8: Experimental results of DC-CNN and SC-CNN
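The SC-CNN baseline can be pictured with the sketch below: each report matrix is passed through the same feature extractor on its own, and only the two resulting feature vectors are compared. The text does not say which similarity measure is used, so cosine similarity is an assumption here, and `extract_features` (a column-wise maximum) is only a stand-in for the shared CNN.

```python
import math


def extract_features(report_matrix):
    """Stand-in for the shared CNN: column-wise max over the word rows."""
    return [max(column) for column in zip(*report_matrix)]


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sc_similarity(report_a, report_b):
    """Single-channel style: features are extracted separately, then compared."""
    return cosine_similarity(extract_features(report_a),
                             extract_features(report_b))


# Toy 2-word, 2-dimensional report matrices (hypothetical values).
report_a = [[1.0, 0.0], [0.0, 2.0]]
report_b = [[2.0, 0.0], [0.0, 4.0]]
sim = sc_similarity(report_a, report_b)
```

In this arrangement the two reports never meet inside the network; any interaction between them happens only in the final similarity computation.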
Results: The performance of the two methods is evaluated on the metrics Accuracy, Recall, Precision and F1-Score. The experimental results are shown in Table 8, where the best results are in bold. It can be observed that DC-CNN is higher than SC-CNN on all metrics on all data sets. Compared with SC-CNN, on Open Office, Eclipse, NetBeans and Combined, the Accuracy of DC-CNN improves by 2.78%, 2.61%, 1.36% and 2.33% respectively; the Recall by 2.73%, 0.51%, 1.49% and 3.17%; the Precision by 2.08%, 6.53%, 1.20% and 2.08%; and the F1-Score by 2.40%, 3.53%, 1.35% and 2.62%. Fig. 3(a) to Fig. 3(d) show the ROC curves of the two methods. It can be observed that on all data sets the curve of DC-CNN lies above that of SC-CNN, which shows that DC-CNN also has better classification performance even when the sample distribution is unbalanced.
Implications: All experimental results show that the dual-channel CNN model is more effective than the single-channel one. For SC-CNN, each report is converted into a matrix and fed into the CNN to extract features, and the result is represented as a feature vector. Whether the two reports are duplicates is then judged by calculating the similarity of the two feature vectors. For DC-CNN, the matrices of the two reports are combined into one dual-channel matrix and fed into the CNN, so that the two reports are convolved together. This approach can extract deep relationships between the two reports and makes full use of the CNN's ability to capture local features. Because the CNN model in DC-CNN focuses on extracting the association between the two reports, it performs better when detecting duplicate reports.
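The idea of convolving the two reports together can be sketched as follows: the two report matrices are stacked as two channels, and a single kernel sums over both channels at every position, so each output value already mixes information from both reports. The kernel shape (2 channels x d rows x full word-vector width), the all-ones weights and the function names are assumptions chosen for illustration; padding 0 and stride 1 give an output length of n_w - d + 1.

```python
def dual_channel(report_a, report_b):
    """Stack two n_w x m report matrices into one 2 x n_w x m input."""
    assert len(report_a) == len(report_b), "reports must have the same length"
    return [report_a, report_b]


def conv_first_layer(pair, kernel, bias=0.0):
    """Slide one 2 x d x m kernel along the word axis (padding 0, stride 1).

    Every output value sums over both channels, so the two reports are
    merged into a single feature column of length n_w - d + 1.
    """
    n_w = len(pair[0])
    d = len(kernel[0])
    output = []
    for start in range(n_w - d + 1):
        total = bias
        for channel in range(2):
            for r in range(d):
                row = pair[channel][start + r]
                total += sum(x * w for x, w in zip(row, kernel[channel][r]))
        output.append(max(0.0, total))  # ReLU activation
    return output


# Toy reports with n_w = 3 words and m = 2 dimensions (hypothetical values).
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
kernel = [[[1.0, 1.0], [1.0, 1.0]],   # weights applied to channel 0
          [[1.0, 1.0], [1.0, 1.0]]]   # weights applied to channel 1
features = conv_first_layer(dual_channel(a, b), kernel)
```

A real model would learn many such kernels; this sketch only shows where the channel merge happens.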
Question 3: How do the experimental results change when the word vector dimension is changed?
The present invention proposes a new representation of a defect report pair: the dual-channel matrix. The influence of its related parameters on the experimental results is therefore also explored. For the dual-channel matrix, the number of words is fixed, and for the CNN the positions of the two reports (which report is in the first channel and which is in the second) make no difference, so the parameter most likely to have an effect is the word vector dimension. To answer this question, the word vector dimension is gradually changed from 10 to 100 and the change in the experimental results on the Open Office data set is observed.
Results: As can be seen from Fig. 4, as the word vector dimension is gradually increased, the accuracy first rises and then shows a downward trend. When the word vector dimension is 20, the accuracy reaches its maximum value of 94.29%.
Implications: When the word vector dimension increases from 10 to 20, the accuracy increases. When the dimension is increased further, the accuracy decreases. A possible reason is that once the dimension is sufficient to characterize a word, increasing it further prevents the vector from representing the word well. Although the accuracy reaches its maximum when the word vector dimension equals 20, it is not much higher than the values under other settings. On the one hand, a larger word vector dimension brings a larger data storage problem; on the other hand, the complexity of both word embedding and CNN model training increases. Therefore, in the method of the present invention, 20 is the most suitable word vector dimension.
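The storage argument can be made concrete with a small sketch: a report becomes a matrix with a fixed number of rows (one per word) and one column per word-vector dimension, so the size of a dual-channel matrix grows linearly with the dimension. Zero-padding, truncation, the zero vector for unknown words and the value n_w = 300 below are assumptions for illustration only.

```python
def report_matrix(tokens, vectors, n_w, m):
    """Build an n_w x m matrix: one word vector per row, zero-padded or truncated."""
    zero = [0.0] * m
    rows = [vectors.get(token, zero) for token in tokens[:n_w]]
    rows += [zero] * (n_w - len(rows))
    return rows


def dual_channel_floats(n_w, m):
    """Number of floats stored for one dual-channel matrix (2 x n_w x m)."""
    return 2 * n_w * m


# Toy word vectors with m = 2 (hypothetical values, not trained embeddings).
vectors = {"crash": [0.1, 0.2], "editor": [0.3, 0.4]}
matrix = report_matrix(["crash", "on", "editor"], vectors, n_w=4, m=2)

# Storage per report pair for several word vector dimensions, assuming n_w = 300.
sizes = {m: dual_channel_floats(300, m) for m in (10, 20, 100)}
```

Under these assumptions, going from dimension 20 to dimension 100 multiplies the stored values per pair by five.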
Question 4: Is the method proposed by the present invention still effective when structured information is not used?
Structured information such as product, component and version provides very useful information when judging whether two reports are duplicates. Many methods use structured information as a separate feature to improve the accuracy of duplicate defect report detection. Unstructured information is usually a natural-language description of the bug. For duplicate report detection, the CNN is mainly used to process unstructured text, and it therefore performs well on long text. Unlike other methods, the present invention puts structured information and unstructured information together into one text document as text data, and then extracts features with the CNN. To answer Question 4, structured information is removed from the input and a comparison experiment is set up with all other conditions unchanged.
Results: As can be seen from Fig. 5, after the structured information is removed, the experimental results on all data sets decrease, by 1.74%, 3.79%, 3.38% and 2.56% on Open Office, Eclipse, NetBeans and Combined respectively.
Implications: The experimental results show that feeding structured information and unstructured information into the CNN together is effective. Note that although the accuracy drops after the structured information is removed, the drop is not fatal. The reason is that structured information accounts for only a small part of the whole text; the main part processed by the CNN is still the unstructured information.
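The comparison in Question 4 amounts to building the input text with or without the structured fields before preprocessing. The sketch below illustrates this with the field names product, component, summary and description; the stop-word list and the suffix-stripping `crude_stem` are deliberately tiny stand-ins for a real stop-word list and stemmer, and all function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "in", "on", "when", "to", "of"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_text(report, use_structured=True):
    """Join the report fields into one text document."""
    fields = ["summary", "description"]
    if use_structured:
        fields = ["product", "component"] + fields
    return " ".join(report[field] for field in fields)


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


# Hypothetical defect report.
report = {"product": "Platform", "component": "UI",
          "summary": "Editor crashes when saving",
          "description": "The editor crashed on save."}
with_structured = preprocess(build_text(report))
without_structured = preprocess(build_text(report, use_structured=False))
```

In this toy report the structured fields contribute only two of the eight tokens, which mirrors the observation that they make up a small part of the whole text.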
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently replaced without departing from its purpose and scope, and all such modifications shall be covered by the scope of the claims of the present invention.
Claims (3)
1. A duplicate defect report detection method based on dual-channel convolutional neural networks, characterized by comprising the following steps:
S100: Data preparation
S101: Extract the defect reports of the software. Every defect report consists of structured information and unstructured information; for each defect report, all of its structured information and unstructured information are put into a single text document;
S102: For each defect report, perform preprocessing steps, including tokenization, stemming, stop-word removal and case conversion;
S103: After preprocessing, the words in all defect reports are combined into a corpus; the existing Word2vec with the CBOW model is applied to the corpus to obtain the vector representation of each word, which gives the two-dimensional matrix representation of each defect report, called the two-dimensional single-channel matrix of the defect report;
According to the known pairing information provided when the defect reports of the software were extracted (this pairing information is in the data set and was prepared by the people who created the data set), a defect report pair consisting of two defect reports is represented by a two-dimensional dual-channel matrix, which is formed by combining the two-dimensional single-channel matrices of the two defect reports; the dual-channel matrix is then labeled as duplicate or non-duplicate;
All labeled dual-channel matrices are divided into a training set and a validation set;
S200: Build the CNN model
S201: Input all dual-channel matrices in the training set and the validation set into the CNN model;
S202: In the first convolutional layer, […] convolution kernels […] are set, where d is the length of a convolution kernel and k_w is the width of a convolution kernel; after the first convolution, the two channels of the dual-channel matrix are merged into one. The first-layer convolution formula is: […]
where C_1 denotes the output of the first convolutional layer, i denotes the i-th channel of the first-layer input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes the bias, and f_1 denotes the nonlinear activation function. Given the input length l (l = n_w), padding P = 0 and stride S = 1, the output length O_1 can be calculated as:
$$O_1 = \frac{l - d + 2P}{S} + 1 \qquad (2)$$
The output shape of the first convolutional layer is […]. The output of the first convolutional layer is reshaped to […] and then convolved again in the second convolutional layer, which is provided with convolution kernels of three sizes […], with […] kernels of each size. The second-layer convolution formula is: […]
where C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second-layer input I_2, b_2 denotes the bias, and f_2 denotes the nonlinear activation function. After this convolution, feature maps of three shapes […] are obtained, where O_2 can be calculated according to formula (2) from l (l = O_1) and the different kernel lengths d;
S203: Apply max pooling to all feature maps;
S204: Reshape and concatenate all feature maps to obtain a […]-dimensional vector, which is used as the input of the fully connected layers;
After two fully connected layers, a single probability sim_predict is obtained, which represents the predicted similarity of the two reports;
In the last layer, sigmoid is used as the activation function to obtain sim_predict;
Given the output of the first fully connected layer T = {x_1, x_2, ..., x_300} and the weight vector W = {w_1, w_2, ..., w_300}, sim_predict can be calculated as:
$$sim_{predict} = \mathrm{sigmoid}\left(\sum_{i=1}^{300} w_i x_i + b\right)$$
where i denotes the i-th element of T and b denotes the bias;
S205: Traverse all defect report pairs in the training set, repeating S202-S204;
S206: Perform backpropagation according to the loss function to update the hidden parameters of the model; the loss function is given by formula (5): […]
where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;
S207: After each training epoch, the model is validated with the validation set; when the validation loss has not decreased for 5 epochs, stop updating the model parameters; otherwise return to S201 and continue training the CNN model;
S300: Prediction for a defect report to be predicted
First preprocess the defect report to be predicted using the method in S102, then convert it into the two-dimensional single-channel matrix of the defect report to be predicted using the method in S103;
Combine the two-dimensional single-channel matrix of the defect report to be predicted pairwise with the two-dimensional single-channel matrices of the N existing defect reports of the software to obtain N dual-channel matrices to be predicted; these N dual-channel matrices form the prediction set, and each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
Among the N probabilities, if a probability is greater than the threshold, the defect report corresponding to that probability and the defect report to be predicted are considered duplicates.
2. The duplicate defect report detection method based on dual-channel convolutional neural networks according to claim 1, characterized in that: in S101, the structured information is product and component, and the unstructured information is summary and description.
3. The duplicate defect report detection method based on dual-channel convolutional neural networks according to claim 1, characterized in that: in all layers other than the last fully connected layer, ReLU is used as the activation function in order to extract more nonlinear features.
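The output, training and prediction steps of claim 1 (S204, S206, S207 and S300) can be sketched in plain Python as below. Several points are assumptions rather than statements of the claimed method: the loss of formula (5) is not reproduced in this text, so the standard binary cross-entropy is used as a plausible stand-in; the early-stopping test is one possible reading of "has not decreased for 5 epochs"; and the threshold 0.5 and all function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sim_predict(x, w, b):
    """Last layer: sigmoid of the weighted sum of the previous layer's output."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def bce_loss(predictions, labels, eps=1e-12):
    """Binary cross-entropy averaged over n report pairs (assumed loss)."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(predictions)


def should_stop(val_losses, patience=5):
    """True when the last `patience` epochs did not beat the earlier best loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


def find_duplicates(probabilities, threshold=0.5):
    """Indices of existing reports whose probability exceeds the threshold."""
    return [i for i, p in enumerate(probabilities) if p > threshold]


# Toy usage with hypothetical values.
prob = sim_predict([0.0, 0.0], [1.0, 1.0], 0.0)
loss = bce_loss([0.5, 0.5], [1, 0])
stop = should_stop([0.9, 0.8, 0.81, 0.82, 0.8, 0.83, 0.85])
duplicates = find_duplicates([0.91, 0.2, 0.55])
```

In S300 the list passed to `find_duplicates` would hold the N probabilities obtained by pairing the new report with each of the N existing reports.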
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910474540.6A CN110188047B (en) | 2019-06-20 | 2019-06-20 | Double-channel convolutional neural network-based repeated defect report detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110188047A true CN110188047A (en) | 2019-08-30 |
CN110188047B CN110188047B (en) | 2023-04-18 |
Family
ID=67719718
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910474540.6A Active CN110188047B (en) | 2019-06-20 | 2019-06-20 | Double-channel convolutional neural network-based repeated defect report detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110188047B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN110188047B (en) | 2023-04-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |