CN110188047A - Duplicate defect report detection method based on dual-channel convolutional neural networks - Google Patents

Duplicate defect report detection method based on dual-channel convolutional neural networks

Info

Publication number
CN110188047A
Authority
CN
China
Prior art keywords
defect report
report
binary channels
defect
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910474540.6A
Other languages
Chinese (zh)
Other versions
CN110188047B (en)
Inventor
徐玲
何健军
帅鉴航
杨梦宁
张小洪
洪明坚
葛永新
杨丹
王洪星
黄晟
陈飞宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN201910474540.6A
Publication of CN110188047A
Application granted
Publication of CN110188047B
Active legal status
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3692 Test management for test results analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a duplicate defect report detection method based on dual-channel convolutional neural networks. The method comprises three steps: data preparation, building the CNN model, and prediction for the defect report to be predicted. In data preparation, the fields useful for duplicate detection are extracted from each defect report, and the structured and unstructured information of each report is placed together in one text document. After preprocessing, each report, represented as text, is converted into a single-channel matrix. Single-channel matrices are combined in pairs into dual-channel matrices, one part of which serves as the training set and the remainder as the validation set. In the model-building step, the model is trained with the training set as input. In the prediction step, the trained model is loaded to predict the similarity of a defect report pair formed by an unknown defect report and a known defect report; this similarity is a probability indicating how likely the pair is to be duplicates. The method of the invention achieves high prediction accuracy.

Description

Duplicate defect report detection method based on dual-channel convolutional neural networks
Technical field
The present invention relates to the field of software testing, and in particular to a duplicate defect report detection method based on dual-channel convolutional neural networks.
Background
Modern software projects use defect tracking systems such as Bugzilla [17] to store and manage defect reports. Software developers, testers and end users submit defect reports to describe the software problems they encounter. Defect reports help guide software maintenance and repair. As a software system grows, hundreds of defect reports may be submitted every day. When more than one person submits a report describing the same bug, duplicate defect reports arise. Because defect reports are written in natural language, the same bug may well be described in different ways.
Because of the large number of defect reports, detecting duplicates manually is difficult. Moreover, since reports are written in natural language, providing a standard template is impractical. Automatic detection of duplicate defect reports is therefore valuable work, as it avoids fixing the same bug repeatedly. In recent years, many automatic duplicate defect report detection techniques have been proposed. These methods fall roughly into two directions: information retrieval and machine learning.
Information retrieval methods usually compute the textual similarity of two defect reports, that is, the similarity is computed from their textual descriptions.
For example, Hiew built a model using the VSM (Vector Space Model), which represents a report as a vector under the TF-IDF (Term Frequency-Inverse Document Frequency) term weighting scheme. Building on VSM, Runeson et al. were the first to detect duplicate defect reports with natural language processing techniques. Wang et al. argued that natural language information alone cannot solve the problem well, so they also used execution information as a feature for duplicate detection. However, only a small fraction of reports contain execution information, which is a major limitation. Sun et al. proposed REP, which uses not only the summary and description but also structured information such as product, component and version. To obtain better text similarity, they extended BM25F, an effective similarity measure in information retrieval. Beyond textual and structural similarity, Alipour et al. also considered the effect of contextual information on duplicate detection; they applied LDA to these features and achieved better results. Information-retrieval-based methods perform well in both accuracy and time efficiency, but the results are unsatisfactory when the same problem is described with different words.
Machine learning methods extract latent features of reports through self-learning algorithms, but traditional machine learning methods cannot learn deep features of the input well. SVM is a classical machine learning method. Jalbert et al. used it to build a classification system that can filter out duplicate reports. They also argued that earlier methods did not make full use of the various features in defect reports, so their model used surface features, textual semantics and graph clustering. Building on the work of Jalbert et al., Tian et al. considered some new features and built a linear model. Working from the angles of features and imbalanced data, they improved the accuracy of duplicate detection. Sun et al. built a discriminative model with SVM, and they were also the first to divide defect reports into two classes, duplicate and non-duplicate. L2R (learning to rank) is another very useful machine learning method. Based on it, Zhou et al. considered textual and statistical features and applied a stochastic gradient descent algorithm to them. This method outperforms traditional information retrieval methods such as VSM and BM25F. With the application of word embedding techniques in natural language processing, more and more researchers use them to detect duplicate reports. Budhiraja et al. used word embeddings to convert defect reports into vectors and then computed their similarity. The experimental results showed that this approach has the potential to improve duplicate detection accuracy.
Summary of the invention
The technical problem to be solved by the invention is the automatic detection of duplicate reports. This problem can be reduced to judging the relationship between two defect reports, that is, whether a defect report pair formed by two reports is duplicate or non-duplicate.
To achieve the above object, the invention adopts the following technical solution: a duplicate defect report detection method based on dual-channel convolutional neural networks, comprising the following steps:
S100: Data preparation
S101: Extract the defect reports of the software. Every defect report consists of structured information and unstructured information. For each defect report, all structured and unstructured information is placed in a single text document;
S102: For each defect report, perform the preprocessing steps, which include tokenization, stemming, stop-word removal and case conversion;
S103: After preprocessing, the words of all defect reports are combined into a corpus. The existing word2vec tool, with the CBOW model selected, is applied to the corpus to obtain a vector representation of each word, which yields a two-dimensional matrix representation of each defect report, called the two-dimensional single-channel matrix of the defect report;
According to the known pairing information provided with the extracted defect reports (this pairing information is part of the data set and was prepared by the creators of the data set), each defect report pair formed by two defect reports is represented by a two-dimensional dual-channel matrix, which is composed of the two-dimensional single-channel matrices of the two defect reports. The dual-channel matrix is then labeled as duplicate or non-duplicate;
All labeled dual-channel matrices are divided into a training set and a validation set;
S200: Building the CNN model
S201: All dual-channel matrices in the training set and the validation set are input into the CNN model;
S202: In the first convolutional layer, a set of convolution kernels of size d × k_w is configured, where d is the length of a kernel and k_w is its width. After the first convolution, the two channels of the dual-channel matrix are merged into one.
In the first-layer convolution formula, formula (1), C_1 denotes the output of the first convolutional layer, i denotes the i-th channel of the first-layer input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes the bias, and f_1 denotes a nonlinear activation function. Given the input length l (l = n_w), padding P = 0 and stride S = 1, the output length O_1 is computed as:
$$O_1 = \frac{l - d + 2P}{S} + 1 \qquad (2)$$
The output of the first convolutional layer is reshaped and then convolved again in the second convolutional layer, which uses convolution kernels of three sizes, with a fixed number of kernels for each size.
In the second-layer convolution formula, formula (3), C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second-layer input I_2, b_2 denotes the bias, and f_2 denotes a nonlinear activation function. After this convolution, feature maps of three shapes are obtained, where O_2 is computed from l (l = O_1) and the different kernel lengths d according to formula (2);
S203: Max pooling is applied to all feature maps;
S204: All feature maps are reshaped and concatenated into a single vector, which serves as the input of the fully connected layers;
After two fully connected layers, a single probability sim_predict is obtained, which represents the predicted similarity of the two reports;
In the last layer, sigmoid is used as the activation function to obtain sim_predict.
Given the output T = {x_1, x_2, …, x_300} of the first fully connected layer and the weight vector W = {w_1, w_2, …, w_300}, sim_predict is computed as:
$$sim_{predict} = \mathrm{sigmoid}\Big(\sum_{i=1}^{300} w_i x_i + b\Big) \qquad (4)$$
where i indexes the i-th element of T and b denotes the bias;
S205: Traverse all defect report pairs in the training set, repeating S202-S204;
S206: Back-propagation is performed according to the loss function to update the hidden parameters of the model. The loss function is formula (5),
in which label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;
S207: After each training epoch, the model is validated on the validation set. When the validation loss has not decreased for 5 epochs, updating of the model parameters stops; otherwise the procedure returns to S201 and training of the CNN model continues;
S300: Prediction for the defect report to be predicted
The defect report to be predicted is first preprocessed using the method of S102 and then converted into its two-dimensional single-channel matrix using the method of S103;
The two-dimensional single-channel matrix of the report to be predicted is paired with the two-dimensional single-channel matrix of each of the N existing defect reports of the software, yielding N dual-channel matrices to be predicted, which form the prediction set. Each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
Among the N probabilities, whenever a probability is greater than the threshold, the existing defect report corresponding to that probability and the report to be predicted are considered duplicates.
As an improvement, in S101 the structured information is product and component, and the unstructured information is summary and description.
As an improvement, all layers except the last fully connected layer use ReLU as the activation function, in order to extract more nonlinear features.
Compared with the prior art, the invention has at least the following advantages:
The invention proposes a new method, DC-CNN, for duplicate defect report detection. It combines two defect reports, each represented by a single-channel matrix, into a defect report pair represented by a dual-channel matrix. This dual-channel matrix is then input into a CNN model to extract latent features. The proposed method was validated on Open Office, Eclipse, Net Beans and the data set Combined that merges them. Compared with the current state-of-the-art deep-learning-based duplicate report detection method, the method of the invention is effective and, more importantly, performs better.
Brief description of the drawings
Fig. 1 shows the overall framework of the method of the invention.
Fig. 2 shows the overall procedure for building the CNN model.
Fig. 3(a) shows the ROC curves of DC-CNN and SC-CNN on the Open Office data set, Fig. 3(b) on the Eclipse data set, Fig. 3(c) on the Net Beans data set, and Fig. 3(d) on the Combined data set.
Fig. 4 shows the effect of the word vector dimension.
Fig. 5 shows the effect of unstructured information.
Detailed description of the embodiments
The present invention is described in further detail below in conjunction with the accompanying drawings.
Fig. 1 shows the overall framework of the method of the invention, DC-CNN. It contains three stages: data preparation, building the CNN model, and prediction for the defect report to be predicted. In the data preparation stage, the fields useful for duplicate detection, namely component, product, summary and description, are extracted from the defect reports. For each report, the structured and unstructured information is placed together in one text document. After preprocessing, the text of all defect reports is collected to form a corpus, and word2vec is used to capture the semantic regularities of the corpus. Each report, represented as text, is converted into a single-channel matrix. To judge the relationship between two reports, the single-channel matrices representing the individual reports are combined into a dual-channel matrix representing the defect report pair. One part of the pairs is then used as the training set and the remainder as the validation set. In the training stage, a CNN model is trained with the training set as input. In the prediction stage, the trained model is loaded to predict the similarity of a defect report pair formed by an unknown defect report and a known defect report; this similarity is a probability indicating how likely the pair is to be duplicates.
A duplicate defect report detection method based on dual-channel convolutional neural networks comprises the following steps:
S100: Data preparation
S101: Extract the defect reports of the software. Every defect report consists of structured information and unstructured information. For each defect report, all structured and unstructured information is placed in a single text document;
Structured information usually consists of optional attributes, while unstructured information is usually the textual description of the bug.
S102: For each defect report, perform the preprocessing steps, which include tokenization, stemming, stop-word removal and case conversion;
The invention performs the above preprocessing steps with Lucene's StandardAnalyzer. Stop words are removed using a standard English stop-word list. In addition, even two completely unrelated defect reports still share some identical words. These are usually technical terms such as java, com and org. Because they occur so frequently, they are also added to the stop-word list. After the above processing, some meaningless numbers remain in the text, and these are removed as well.
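The preprocessing in S102 can be approximated in a few lines of Python. This is only a sketch: the patent relies on Lucene's StandardAnalyzer, whereas the stop-word sets and the `naive_stem` helper below are small hypothetical stand-ins chosen for illustration.

```python
import re

# Assumption: a tiny illustrative stop list. The patent uses a standard English
# stop-word list plus frequent project words such as java, com and org.
STOP_WORDS = {"the", "a", "an", "is", "in", "on", "of", "to", "and", "when"}
PROJECT_STOP_WORDS = {"java", "com", "org"}


def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lower-case, tokenize, drop stop words and pure numbers, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [
        t for t in tokens
        if t not in STOP_WORDS
        and t not in PROJECT_STOP_WORDS
        and not t.isdigit()
    ]
    return [naive_stem(t) for t in kept]


print(preprocess("Editor crashes when opening 2 tabs in org.eclipse.ui"))
```

A production pipeline would likely swap `naive_stem` for an established stemmer, since crude suffix stripping mangles many words.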
S103: After preprocessing, the words of all defect reports are combined into a corpus. The existing word2vec tool, with the CBOW model selected, is applied to the corpus to obtain a vector representation of each word, which yields a two-dimensional matrix representation of each defect report, called the two-dimensional single-channel matrix of the defect report;
According to the known pairing information provided with the extracted defect reports (this pairing information is part of the data set and was prepared by the creators of the data set), each defect report pair formed by two defect reports is represented by a two-dimensional dual-channel matrix, which is composed of the two-dimensional single-channel matrices of the two defect reports. The dual-channel matrix is then labeled as duplicate or non-duplicate;
Compared with a single channel, the dual-channel representation of a defect report pair has the following benefits. First, the two reports can be processed by the CNN simultaneously, which speeds up training. Second, training a CNN on dual-channel data has been shown to reach higher accuracy. With a dual-channel CNN, the convolution operation can capture the association between the two defect reports.
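As an illustration of how two single-channel matrices might be combined, the sketch below stacks them into a 2 × n_w × m nested list. Padding short reports with zero rows and truncating long ones to a fixed n_w is an assumption made here so that both channels share one shape; the function names are hypothetical.

```python
def pad_or_truncate(matrix, n_w, m):
    """Force a report matrix to exactly n_w rows of m-dimensional word vectors."""
    rows = [list(row) for row in matrix[:n_w]]
    rows += [[0.0] * m for _ in range(n_w - len(rows))]
    return rows


def make_dual_channel(report_a, report_b, n_w, m):
    """Stack two single-channel matrices into one 2 x n_w x m nested list."""
    return [pad_or_truncate(report_a, n_w, m), pad_or_truncate(report_b, n_w, m)]


# Two toy reports with 2-dimensional word vectors (values are made up).
a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
b = [[0.9, 0.8]]
pair = make_dual_channel(a, b, n_w=2, m=2)
print(pair)
```

Each channel keeps one word per row, so a convolution kernel spanning both channels sees the same row positions of both reports at once.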
All labeled dual-channel matrices are divided into a training set and a validation set. In a specific implementation, 80% of the labeled dual-channel matrices form the training set and the remaining 20% form the validation set.
S200: Building the CNN model
To extract features from defect report pairs, the invention provides convolution kernels of three different sizes in each convolutional layer. The first convolutional layer therefore has three branches, and for each of these three branches there are again three new branches in the second convolutional layer. Because the structures of the three branches are highly similar, Fig. 2 shows only one branch of the first convolutional layer in the overall CNN structure. Table 3 lists the specific parameter settings of the CNN model of the invention.
Table 3
S201: All dual-channel matrices in the training set and the validation set are input into the CNN model;
S202: In the first convolutional layer, a set of convolution kernels of size d × k_w is configured, where d is the length of a kernel and k_w is its width. Because each row of the input matrix represents one word, the kernel width equals the word vector dimension m. After the first convolution, the two channels of the dual-channel matrix are merged into one, so that the two defect reports can be treated as a whole for feature extraction.
In the first-layer convolution formula, formula (1), C_1 denotes the output of the first convolutional layer, i denotes the i-th channel of the first-layer input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes the bias, and f_1 denotes a nonlinear activation function, for which the invention uses ReLU. Given the input length l (l = n_w), padding P = 0 and stride S = 1, the output length O_1 is computed as:
$$O_1 = \frac{l - d + 2P}{S} + 1 \qquad (2)$$
To further extract the associated features of the two reports, the output of the first convolutional layer is reshaped and then convolved again in the second convolutional layer, which uses convolution kernels of three sizes, with a fixed number of kernels for each size.
In the second-layer convolution formula, formula (3), C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second-layer input I_2, b_2 denotes the bias, and f_2 denotes a nonlinear activation function, for which the invention uses ReLU. After this convolution, feature maps of three shapes are obtained, where O_2 is computed from l (l = O_1) and the different kernel lengths d according to formula (2).
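The output lengths in S202 can be checked numerically. The sketch below assumes the standard convolution output-length relation O = (l - d + 2P)/S + 1; the input length of 300 words and the kernel lengths 1, 2 and 3 are illustrative values, not settings taken from the patent.

```python
def conv_output_length(l, d, padding=0, stride=1):
    """Output length of a convolution over l rows with kernel length d."""
    return (l - d + 2 * padding) // stride + 1


# Assumed example values: 300 words per report, kernel lengths 1, 2 and 3.
n_w = 300
first_layer = {d: conv_output_length(n_w, d) for d in (1, 2, 3)}
print(first_layer)

# The second layer takes l = O_1 as its input length.
second_layer = {d: conv_output_length(first_layer[2], d) for d in (1, 2, 3)}
print(second_layer)
```

With P = 0 and S = 1 the relation reduces to l - d + 1, so each extra unit of kernel length shortens the feature map by one row.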
S203: Max pooling is applied to all feature maps; in this way, each feature map is downsampled.
S204: All feature maps are reshaped and concatenated into a single vector, which serves as the input of the fully connected layers;
After two fully connected layers, a single probability sim_predict is obtained, which represents the predicted similarity of the two reports;
In the last layer, sigmoid is used as the activation function to obtain sim_predict.
Given the output T = {x_1, x_2, …, x_300} of the first fully connected layer and the weight vector W = {w_1, w_2, …, w_300}, sim_predict is computed as:
$$sim_{predict} = \mathrm{sigmoid}\Big(\sum_{i=1}^{300} w_i x_i + b\Big) \qquad (4)$$
where i indexes the i-th element of T and b denotes the bias.
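The final similarity in S204 is a sigmoid applied to a weighted sum of the first fully connected layer's output plus a bias. A minimal sketch, assuming that form and using a made-up 3-dimensional example in place of the 300 dimensions named in the text:

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_similarity(x, w, b):
    """sim_predict = sigmoid(sum_i w_i * x_i + b) for equal-length x and w."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy 3-dimensional example; the weights and inputs are invented.
sim = predict_similarity([1.0, 0.0, 2.0], [0.5, -1.0, 0.25], b=-1.0)
print(sim)
```

Here the weighted sum is exactly zero, so the sigmoid returns 0.5, the midpoint between "duplicate" and "non-duplicate".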
S205: Traverse all defect report pairs in the training set, repeating S202-S204.
S206: Back-propagation is performed according to the loss function to update the hidden parameters of the model. The loss function is formula (5),
in which label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs.
S207: After each training epoch, the model is validated on the validation set. When the validation loss has not decreased for 5 epochs, updating of the model parameters stops; otherwise the procedure returns to S201 and training of the CNN model continues.
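The stopping rule in S207 can be sketched as a small helper. The interpretation used here, that training stops once none of the last 5 epochs improved on the best earlier validation loss, is an assumption; `should_stop` and the loss values are hypothetical.

```python
def should_stop(val_losses, patience=5):
    """True once the last `patience` epochs all failed to beat the best earlier loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before for loss in val_losses[-patience:])


# Made-up validation losses, one per epoch.
history = [0.70, 0.55, 0.50, 0.51, 0.52, 0.50, 0.53, 0.54]
print(should_stop(history))      # no epoch after the third went below 0.50
print(should_stop(history[:6]))  # too early: the loss was still falling
```

In a training loop this check would run after each epoch's validation pass, with the loop breaking as soon as it returns True.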
S300: Prediction for the defect report to be predicted
The defect report to be predicted is first preprocessed using the method of S102 and then converted into its two-dimensional single-channel matrix using the method of S103;
The two-dimensional single-channel matrix of the report to be predicted is paired with the two-dimensional single-channel matrix of each of the N existing defect reports of the software, yielding N dual-channel matrices to be predicted, which form the prediction set. Each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
Among the N probabilities, whenever a probability is greater than the threshold, the existing defect report corresponding to that probability and the report to be predicted are considered duplicates.
For example, suppose a software project currently has N defect reports. After processing, each defect report corresponds to a two-dimensional single-channel matrix. The two-dimensional single-channel matrix of the report to be predicted is paired with each of the N existing single-channel matrices, giving N dual-channel matrices to be predicted. These N matrices are then input one by one into the CNN model above, giving N probabilities. When a probability is greater than the preset threshold, the report to be predicted and the existing defect report corresponding to that probability are considered duplicates.
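The prediction stage amounts to a loop over the N known reports. The sketch below assumes the trained model is available as a callable returning a probability; `toy_model` is a hypothetical stand-in used only so the example runs, and the report ids and matrices are invented.

```python
def find_duplicates(new_matrix, known_reports, model, threshold=0.5):
    """Pair the new report with every known report and keep pairs above threshold.

    known_reports maps report id -> single-channel matrix; model is any callable
    that takes a dual-channel input [matrix_a, matrix_b] and returns a probability.
    """
    hits = []
    for report_id, matrix in known_reports.items():
        prob = model([new_matrix, matrix])
        if prob > threshold:
            hits.append((report_id, prob))
    return sorted(hits, key=lambda item: item[1], reverse=True)


def toy_model(pair):
    """Stand-in for the trained CNN: fraction of identical rows."""
    a, b = pair
    same = sum(1 for ra, rb in zip(a, b) if ra == rb)
    return same / max(len(a), len(b))


known = {101: [[1, 0], [0, 1]], 102: [[1, 0], [1, 1]], 103: [[0, 0], [0, 0]]}
result = find_duplicates([[1, 0], [0, 1]], known, toy_model)
print(result)
```

Sorting the hits by probability is a convenience added here; the text only requires comparing each probability with the threshold.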
Experimental verification:
1. Data set
For comparison, the invention uses the same data set as Deshmukh et al., which was collected and processed by Lazar. It contains three large open-source projects: Open Office, Eclipse and Net Beans. Open Office is an office suite similar to Microsoft Office. Eclipse and Net Beans are open-source integrated development environments. To experiment with more training samples, a larger data set was obtained by merging these three data sets and was named "Combined". These data sets also provide the pairing relationships between defect reports; part of the pairing relationships in Open Office is shown in Table 4.
Table 4: defect report pair
Analysis of all pairing relationships in each data set revealed some problems. First, some pairs are repeated; for example, (200622, 197347, duplicate) occurs 5 times in Open Office. Second, some pairs express the same relationship, for example (159435, 164827, duplicate) and (164827, 159435, duplicate) in Eclipse. The invention therefore removes these redundant defect report pairs. Table 5 shows the number of pairs in the resulting data sets.
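One way to clean such pairings is to normalize each pair to an order-independent key and keep the first occurrence. This is a sketch under that assumption; the text does not say which copy is kept, and `deduplicate_pairs` is a hypothetical name. The ids are the examples quoted above.

```python
def deduplicate_pairs(pairs):
    """Drop repeated pairs and mirrored pairs such as (a, b) / (b, a)."""
    seen = set()
    unique = []
    for id_a, id_b, label in pairs:
        # Order-independent key, so (a, b) and (b, a) collide.
        key = (min(id_a, id_b), max(id_a, id_b))
        if key not in seen:
            seen.add(key)
            unique.append((id_a, id_b, label))
    return unique


pairs = [
    (200622, 197347, "duplicate"),
    (200622, 197347, "duplicate"),
    (159435, 164827, "duplicate"),
    (164827, 159435, "duplicate"),
]
cleaned = deduplicate_pairs(pairs)
print(cleaned)
```

The key ignores the label, which assumes a given pair of reports never appears with conflicting labels.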
Table 5: Complete data set
Dataset        Duplicate    Non-duplicate
Open Office    57340        41751
Eclipse        86385        160917
Net Beans      95066        89988
Combined       238791       292476
Each data set is divided into a training set and a test set; the training set accounts for 80% (10% of which serves as the validation set) and the test set for 20%. In addition, so that the training and test sets reflect the distribution of the original data set, the ratio of duplicate to non-duplicate report pairs in each of them is kept the same as in the original data set. The training and test sets are both selected randomly. Table 6 shows the detailed distribution of defect report pairs in the training and test sets.
Table 6: training set and test set
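A split that preserves the duplicate to non-duplicate ratio can be sketched by splitting each class separately. The reading that the validation set is 10% of the training portion is an assumption, and the function name, seed and toy data are hypothetical.

```python
import random


def stratified_split(pairs, test_ratio=0.2, val_ratio=0.1, seed=0):
    """Split (pair, label) items per class so each subset keeps the class ratio."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted({lab for _, lab in pairs}):
        group = [item for item in pairs if item[1] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        rest = group[n_test:]
        n_val = round(len(rest) * val_ratio)
        test += group[:n_test]
        val += rest[:n_val]
        train += rest[n_val:]
    return train, val, test


# Toy data: 40 duplicate pairs (label 1) and 60 non-duplicate pairs (label 0).
data = [(i, 1) for i in range(40)] + [(i, 0) for i in range(40, 100)]
train, val, test = stratified_split(data)
print(len(train), len(val), len(test))
```

Because each class is shuffled and cut independently, the test set keeps the original 40:60 class ratio.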
Evaluation criteria
In the model proposed by the invention, the output represents the similarity of the two reports in a defect report pair, so its value lies between 0 and 1. For classification, a threshold is set. After sim_predict is obtained as described above, label_predict (the predicted label of a defect report pair) is computed as follows:
$$label_{predict} = \begin{cases} 1, & sim_{predict} > threshold \\ 0, & \text{otherwise} \end{cases}$$
According to label_predict and label_real, report pairs are divided into four classes:
1) TP: label_real = 1, label_predict = 1
2) TN: label_real = 0, label_predict = 0
3) FP: label_real = 0, label_predict = 1
4) FN: label_real = 1, label_predict = 0
Here 1 means the report pair is duplicate and 0 means it is non-duplicate. TP is the number of pairs correctly predicted as duplicate, TN the number of pairs correctly predicted as non-duplicate, FP the number of pairs wrongly predicted as duplicate, and FN the number of pairs wrongly predicted as non-duplicate. These four counts are the basis for the evaluation criteria below.
Accuracy:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy is the ratio of correctly predicted defect report pairs to all report pairs; it measures how well the model classifies all defect report pairs. Because the sigmoid function is used at the output, the threshold is set to 0.5 when computing Accuracy, Recall and Precision.
Recall:
$$Recall = \frac{TP}{TP + FN}$$
Recall is the ratio of defect report pairs correctly predicted as duplicate to all pairs that are actually duplicate.
Precision:
$$Precision = \frac{TP}{TP + FP}$$
Precision is the ratio of defect report pairs correctly predicted as duplicate to all pairs predicted as duplicate.
F1-Score:
$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
F1-Score is the harmonic mean of Recall and Precision.
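The four counts and the derived criteria can be computed directly from predicted similarities and real labels. The sketch below uses the standard definitions of these measures; the similarity scores and labels are made up for illustration.

```python
def confusion_counts(sims, labels, threshold=0.5):
    """Count TP, TN, FP, FN, predicting 1 when the similarity exceeds the threshold."""
    tp = tn = fp = fn = 0
    for sim, real in zip(sims, labels):
        pred = 1 if sim > threshold else 0
        if pred == 1 and real == 1:
            tp += 1
        elif pred == 0 and real == 0:
            tn += 1
        elif pred == 1 and real == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, Recall, Precision and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


sims = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
counts = confusion_counts(sims, labels)
scores = metrics(*counts)
print(counts, scores)
```

Returning 0.0 when a denominator is zero is a convention chosen here to keep the sketch from raising on degenerate inputs.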
ROC curve:
In practice, because the class distribution of defect report pairs in the data sets is imbalanced, traditional evaluation criteria such as Accuracy cannot assess the performance of a classifier well. The invention therefore also uses the ROC curve to assess classifier performance. Different thresholds give different TPR and FPR values, from which the ROC curve can be drawn. TPR and FPR are computed as follows:
$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$
Taking all FPR values as the horizontal axis and all TPR values as the vertical axis yields the ROC curve. The closer the curve is to the upper-left corner of the coordinate system, the better the classifier performs.
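The threshold sweep behind a ROC curve can be sketched as follows. Using each distinct score as a threshold with a "greater than or equal" comparison is a convention chosen here so that the lowest threshold yields the point (1, 1); the function assumes both classes are present, and the scores are made up.

```python
def roc_points(sims, labels):
    """Return (FPR, TPR) points obtained by sweeping the threshold over all scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(sims), reverse=True):
        tp = sum(1 for s, y in zip(sims, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(sims, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


sims = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
curve = roc_points(sims, labels)
for fpr, tpr in curve:
    print(round(fpr, 3), round(tpr, 3))
```

Plotting these points with FPR on the horizontal axis and TPR on the vertical axis gives the curve described in the text.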
Experimental results
The technical effect of the method of the invention is shown by answering the following questions.
Question 1: Is the DC-CNN of the invention effective compared with the state-of-the-art deep-learning-based duplicate defect report detection method?
The research goal of the invention is to propose a more effective deep-learning-based method. The method of the invention is therefore compared with the method of Deshmukh et al. on the same data sets.
Table 7: Experimental results of the method of the present invention and the method of Deshmukh et al.
Results: Table 7 shows the experimental results of the method of the present invention and the method of Deshmukh et al. Using one core method, the Siamese neural network, they built two similar models, a retrieval model and a classification model. For the classification model, the highest accuracy appears on the Open Office data set, reaching 0.8275, while on Eclipse it is only 0.7268. Their retrieval model performs better than their classification model. For the retrieval model, the best-performing data set is still Open Office, with an accuracy of 0.9455. Likewise, Eclipse is slightly worse, with an accuracy of 0.906. Compared with the Siamese classification model, DC-CNN improves accuracy by 11.54, 24.17, 17.89, and 13.33 percentage points on Open Office, Eclipse, Net Beans, and Combined, respectively. Compared with the Siamese retrieval model, DC-CNN improves accuracy by 6.25, 4.07, and 3.84 percentage points on Eclipse, Net Beans, and Combined, respectively. On Open Office, the accuracy of DC-CNN is lower by less than 0.3 percentage points (0.9429 versus 0.9455).
Implications: According to Table 7, on three data sets (Eclipse, Net Beans, Combined) DC-CNN outperforms both the classification model and the retrieval model that Deshmukh et al. built with the Siamese neural network. On Open Office, DC-CNN outperforms their Siamese classification model and performs very similarly to their retrieval model. Overall, DC-CNN achieves very good performance and surpasses the current state-of-the-art deep-learning-based duplicate report detection method.
Question 2: Compared with SC-CNN, is DC-CNN effective?
To demonstrate that the dual-channel matrix representation of defect report pairs proposed by the present invention is effective, the single-channel matrix representation of a defect report is also used as a comparison baseline. The structure of the CNN is kept unchanged, including the number of convolution kernels, the kernel sizes, and the number of convolutional layers. The CNN is used to extract the features of the two reports in a defect report pair separately, and their similarity is then computed. This method is called Single-Channel Convolutional Neural Networks (SC-CNN).
Table 8: Experimental results of DC-CNN and SC-CNN
Results: The performance of the two methods is evaluated on Accuracy, Recall, Precision, and F1-Score. The experimental results are shown in Table 8, with the best results in bold. DC-CNN is higher than SC-CNN on every metric of every data set. Compared with SC-CNN, on Open Office, Eclipse, Net Beans, and Combined, the accuracy of DC-CNN improves by 2.78%, 2.61%, 1.36%, and 2.33%, respectively; its recall improves by 2.73%, 0.51%, 1.49%, and 3.17%; its precision improves by 2.08%, 6.53%, 1.20%, and 2.08%; and its F1-Score improves by 2.40%, 3.53%, 1.35%, and 2.62%. Fig. 3(a) to Fig. 3(d) show the ROC curves of the two methods. On all data sets the DC-CNN curve lies above the SC-CNN curve, which shows that DC-CNN also has better classification performance when the sample distribution is imbalanced.
Implications: All experimental results show that the dual-channel CNN model is more effective than the single-channel one. For SC-CNN, each report is converted into a matrix and input into the CNN to extract features, and the result is represented as a feature vector; whether two reports are duplicates is then judged by computing the similarity of the two feature vectors. For DC-CNN, the matrices of the two reports are combined into one dual-channel matrix and input into the CNN, so the two reports are convolved together. This approach can extract deep relationships between the two reports and makes full use of the CNN's ability to capture local features. Because the CNN model in DC-CNN focuses on extracting the association between the two reports, it performs better at detecting duplicate reports.
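The difference between the two input schemes can be illustrated with a small pure-Python sketch. Everything here is a simplification under stated assumptions: the function names are hypothetical, the kernel is assumed to span the full word-vector width, ReLU is used as the activation, and cosine similarity stands in for the unspecified similarity measure of SC-CNN.

```python
import math


def stack_dual_channel(matrix_a, matrix_b):
    """Combine two single-channel report matrices into one 2-channel input."""
    if len(matrix_a) != len(matrix_b) or any(
        len(ra) != len(rb) for ra, rb in zip(matrix_a, matrix_b)
    ):
        raise ValueError("both report matrices must have the same shape")
    return [matrix_a, matrix_b]  # shape: (2, n_words, dim)


def conv_over_pair(pair, kernel, bias=0.0):
    """Slide one kernel of shape (2, d, dim) along the word axis.

    Both channels contribute to every output value, so the two reports
    are convolved together and the channels merge into one.
    """
    n_words, dim = len(pair[0]), len(pair[0][0])
    d = len(kernel[0])
    out = []
    for j in range(n_words - d + 1):
        total = bias
        for c in range(len(pair)):
            for r in range(d):
                for k in range(dim):
                    total += pair[c][j + r][k] * kernel[c][r][k]
        out.append(max(0.0, total))  # ReLU activation
    return out


def cosine_similarity(u, v):
    """SC-CNN style comparison of two separately extracted feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

In the single-channel scheme the two reports only meet at the final similarity computation, whereas in `conv_over_pair` every output value already mixes words from both reports.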
Question 3: How do the experimental results change when the word vector dimension is changed?
The present invention proposes a new representation of defect report pairs, the dual-channel matrix. Therefore, the influence of the related parameters on the experimental results is also explored. For the dual-channel matrix, the number of words is fixed and, for the CNN, the positions of the two reports (which report is in the first channel and which is in the second) make no difference, so the parameter most likely to vary is the word vector dimension. To answer the question of how the experimental results change with the word vector dimension, the dimension is varied gradually from 10 to 100 and the change in the experimental results on the Open Office data set is observed.
Results: As can be seen from Fig. 4, as the word vector dimension is gradually increased, accuracy first rises and then shows a downward trend. When the word vector dimension is 20, accuracy reaches its maximum of 94.29%.
Implications: When the word vector dimension increases from 10 to 20, accuracy rises. When the dimension is increased further, accuracy falls. The reason may be that once the dimension is large enough to characterize a word, increasing it further instead prevents the vector from representing the word well. Although accuracy reaches its maximum when the word vector dimension equals 20, it is not much higher than the values under other conditions. On the one hand, a larger word vector dimension brings a larger data storage problem; on the other hand, the complexity of word embedding and of CNN model training both increase. Therefore, in the method of the present invention, 20 is the most suitable word vector dimension.
Question 4: Is the method proposed by the present invention effective when structured information is not used?
Structured information such as product, component, and version provides very useful information when judging whether two reports are duplicates. Many methods treat structured information as a separate feature to improve the accuracy of duplicate defect report detection. Unstructured information is usually a natural-language description of the bug. For duplicate report detection, the CNN is mainly used to process unstructured text, and it therefore performs well on long text. Unlike other methods, the present invention places both structured and unstructured information into a single text document as text data and then extracts features with the CNN. To answer Question 4, a comparison experiment is set up in which structured information is removed from the input while all other conditions are left unchanged.
Results: As can be seen from Fig. 5, after structured information is removed the experimental results on all data sets decline, by 1.74%, 3.79%, 3.38%, and 2.56% on Open Office, Eclipse, Net Beans, and Combined, respectively.
Implications: The experimental results show that inputting structured and unstructured information together into the CNN is effective. Note that after structured information is removed, accuracy drops, but the drop is not fatal. The reason is that structured information accounts for only a small part of the entire text; the main part processed by the CNN is still the unstructured information.
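The two input variants compared in this experiment can be sketched as below. The field names follow claim 2 (product and component as structured information, summary and description as unstructured information); the function name `build_document`, the dictionary layout of a report, and the sample values are hypothetical.

```python
def build_document(report, use_structured=True):
    """Join a report's fields into one text document.

    With use_structured=False only the unstructured fields are kept,
    which corresponds to the comparison experiment for Question 4.
    """
    structured_fields = ("product", "component")      # structured information
    unstructured_fields = ("summary", "description")  # unstructured information
    fields = unstructured_fields
    if use_structured:
        fields = structured_fields + unstructured_fields
    # Missing or empty fields are skipped.
    return " ".join(str(report[name]) for name in fields if report.get(name))


# Hypothetical report used only for illustration.
example = {
    "product": "Writer",
    "component": "ui",
    "summary": "Crash on save",
    "description": "App crashes when saving.",
}
with_structured = build_document(example)
without_structured = build_document(example, use_structured=False)
```

Because the structured fields contribute only a few tokens, the two documents differ by a small prefix, which is consistent with the modest drop reported above.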
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements of the technical solution of the present invention that do not depart from its purpose and scope shall all be covered by the scope of the claims of the present invention.

Claims (3)

1. A duplicate defect report detection method based on a dual-channel convolutional neural network, characterized by comprising the following steps:
S100: data preparation
S101: the defect reports of the software are extracted; every defect report consists of structured information and unstructured information, and for each defect report all structured and unstructured information is put into a single text document;
S102: each defect report is preprocessed, including tokenization, stemming, stop-word removal, and case conversion;
S103: after preprocessing, the words of all defect reports are combined into a corpus; the existing Word2vec is applied to the corpus with the CBOW model selected to obtain a vector representation of each word, which yields a two-dimensional matrix representation of each defect report, referred to as the two-dimensional single-channel matrix of the defect report;
According to the known pairing information provided when the defect reports of the software are extracted (this pairing information is in the data set and was prepared by the people who created the data set), a defect report pair composed of two defect reports is represented by a two-dimensional dual-channel matrix, which is formed by combining the two-dimensional single-channel matrices of the two defect reports; the dual-channel matrix is then labeled as duplicate or non-duplicate;
All labeled dual-channel matrices are divided into a training set and a validation set;
S200: establishing the CNN model
S201: all dual-channel matrices in the training set and the validation set are input into the CNN model;
S202: in the first convolutional layer, a number of convolution kernels are set, where d is the length of a convolution kernel and k_w is its width; after the first convolution, the two channels of the dual-channel matrix are merged into one; the first-layer convolution formula is:
where C_1 denotes the output of the first convolutional layer, i denotes the i-th channel of the first convolutional layer's input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes the bias, and f_1 denotes the nonlinear activation function; given the input length l (l = n_w), padding P = 0, and stride S = 1, the output length O_1 can be calculated as:
$$O_1 = \frac{l - d + 2P}{S} + 1 \tag{2}$$
The output of the first convolutional layer is reshaped and then convolved again in the second convolutional layer, which is provided with convolution kernels of three sizes, with a number of kernels of each size; the formula of the second-layer convolution is:
where C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second convolutional layer's input I_2, b_2 denotes the bias, and f_2 denotes the nonlinear activation function; after this convolution, feature maps of three shapes are obtained, where O_2 can be calculated according to formula (2) from l (l = O_1) and the different convolution kernel lengths d;
S203: max pooling is applied to all feature maps;
S204: all feature maps are reshaped and concatenated into a single vector, which serves as the input of the fully connected layers;
After two fully connected layers, a single probability sim_predict is obtained, which represents the predicted similarity of the two reports;
In the last layer, sigmoid is used as the activation function to obtain sim_predict;
Given the output T = {x_1, x_2, ..., x_300} of the first fully connected layer and the weight vector W = {w_1, w_2, ..., w_300}, sim_predict can be calculated as:
$$sim_{predict} = \mathrm{sigmoid}\left(\sum_{i=1}^{300} w_i x_i + b\right)$$
where i denotes the i-th element of T and b denotes the bias;
S205: all defect report pairs in the training set are traversed, repeating S202-S204;
S206: backpropagation is performed according to the loss function to update the hidden parameters of the model; the loss function is given by formula (5):
where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;
S207: after each training epoch, the model is validated on the validation set; when the validation loss has not decreased over 5 epochs, updating of the model parameters stops; otherwise the method returns to S201 and continues training the CNN model;
S300: prediction for a defect report to be predicted
First, the defect report to be predicted is preprocessed by the method in S102 and then converted by the method in S103 into the two-dimensional single-channel matrix of the defect report to be predicted;
This matrix is combined pairwise with the two-dimensional single-channel matrices of the N existing defect reports of the software to obtain N dual-channel matrices to be predicted, which form a prediction set; each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
Among the N probabilities, if a probability is greater than the threshold, the defect report corresponding to that probability is considered a duplicate of the defect report to be predicted.
2. The duplicate defect report detection method based on a dual-channel convolutional neural network according to claim 1, characterized in that: in S101 the structured information is the product and component, and the unstructured information is the summary and description.
3. The duplicate defect report detection method based on a dual-channel convolutional neural network according to claim 1, characterized in that: in all layers except the last fully connected layer, ReLU is used as the activation function to extract more nonlinear features.
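Several steps of claim 1 can be illustrated with a short sketch. It is not the patented implementation: all function names are hypothetical, `model` is a stand-in for the trained CNN, the 0.5 default threshold is borrowed from the evaluation setup, and the output-length expression is the standard convolution formula assumed to correspond to formula (2).

```python
import math


def conv_output_length(l, d, padding=0, stride=1):
    """Output length along the word axis for input length l and kernel length d."""
    return (l - d + 2 * padding) // stride + 1


def sim_predict(x, w, b):
    """Sigmoid of the weighted sum of a fully connected layer's output (S204)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    if z >= 0:  # numerically stable form of the sigmoid
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def should_stop(val_losses, patience=5):
    """S207: stop when validation loss has not decreased over `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


def find_duplicates(new_matrix, existing, model, threshold=0.5):
    """S300: pair the new report with each existing report and score the pair.

    `existing` maps a report id to its single-channel matrix; `model` is any
    callable returning a probability for a dual-channel matrix.
    """
    hits = []
    for report_id, matrix in existing.items():
        pair = [new_matrix, matrix]  # dual-channel matrix to be predicted
        prob = model(pair)
        if prob > threshold:
            hits.append((report_id, prob))
    return hits
```

With P = 0 and S = 1, `conv_output_length` reduces to l - d + 1, so each additional convolution shortens the word axis by d - 1 positions.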
CN201910474540.6A 2019-06-20 2019-06-20 Double-channel convolutional neural network-based repeated defect report detection method Active CN110188047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910474540.6A CN110188047B (en) 2019-06-20 2019-06-20 Double-channel convolutional neural network-based repeated defect report detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910474540.6A CN110188047B (en) 2019-06-20 2019-06-20 Double-channel convolutional neural network-based repeated defect report detection method

Publications (2)

Publication Number Publication Date
CN110188047A true CN110188047A (en) 2019-08-30
CN110188047B CN110188047B (en) 2023-04-18

Family

ID=67719718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910474540.6A Active CN110188047B (en) 2019-06-20 2019-06-20 Double-channel convolutional neural network-based repeated defect report detection method

Country Status (1)

Country Link
CN (1) CN110188047B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177010A (en) * 2019-12-31 2020-05-19 杭州电子科技大学 Software defect severity identification method
CN111737107A (en) * 2020-05-15 2020-10-02 南京航空航天大学 Repeated defect report detection method based on heterogeneous information network
CN112328469A (en) * 2020-10-22 2021-02-05 南京航空航天大学 Function level defect positioning method based on embedding technology
CN112631898A (en) * 2020-12-09 2021-04-09 南京理工大学 Software defect prediction method based on CNN-SVM
CN113362305A (en) * 2021-06-03 2021-09-07 河南中烟工业有限责任公司 Smoke box strip missing mixed brand detection system and method based on artificial intelligence
CN113379746A (en) * 2021-08-16 2021-09-10 深圳荣耀智能机器有限公司 Image detection method, device, system, computing equipment and readable storage medium
CN113379685A (en) * 2021-05-26 2021-09-10 广东炬森智能装备有限公司 PCB defect detection method and device based on dual-channel feature comparison model
CN113486176A (en) * 2021-07-08 2021-10-08 桂林电子科技大学 News classification method based on secondary feature amplification
CN113791897A (en) * 2021-08-23 2021-12-14 湖北省农村信用社联合社网络信息中心 Method and system for displaying server baseline detection report of rural telecommunication system
US20230367967A1 (en) * 2022-05-16 2023-11-16 Jpmorgan Chase Bank, N.A. System and method for interpreting stuctured and unstructured content to facilitate tailored transactions

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130067312A1 (en) * 2006-06-22 2013-03-14 Digg, Inc. Recording and indicating preferences
CN103970666A (en) * 2014-05-29 2014-08-06 重庆大学 Method for detecting repeated software defect reports
CN106250311A (en) * 2016-07-27 2016-12-21 成都启力慧源科技有限公司 Repeated defects based on LDA model report detection method
US20170212829A1 (en) * 2016-01-21 2017-07-27 American Software Safety Reliability Company Deep Learning Source Code Analyzer and Repairer
CN108491835A (en) * 2018-06-12 2018-09-04 常州大学 Binary channels convolutional neural networks towards human facial expression recognition
CN108563556A (en) * 2018-01-10 2018-09-21 江苏工程职业技术学院 Software defect prediction optimization method based on differential evolution algorithm
CN108804558A (en) * 2018-05-22 2018-11-13 北京航空航天大学 A kind of defect report automatic classification method based on semantic model
CN109376092A (en) * 2018-11-26 2019-02-22 扬州大学 A kind of software defect reason automatic analysis method of facing defects patch code
CN109491914A (en) * 2018-11-09 2019-03-19 大连海事大学 Defect report prediction technique is influenced based on uneven learning strategy height

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
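Miao Haoran et al.: "Application of improved word vector features and CNN in sentence classification", 14th National Conference on Man-Machine Speech Communication *
Gong Yan et al.: "Reliability evaluation of embedded software for command automation systems", 13th Annual Academic Conference of the Reliability Branch of the Chinese Institute of Electronics *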

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177010B (en) * 2019-12-31 2023-12-15 杭州电子科技大学 Software defect severity identification method
CN111177010A (en) * 2019-12-31 2020-05-19 杭州电子科技大学 Software defect severity identification method
CN111737107B (en) * 2020-05-15 2021-10-26 南京航空航天大学 Repeated defect report detection method based on heterogeneous information network
CN111737107A (en) * 2020-05-15 2020-10-02 南京航空航天大学 Repeated defect report detection method based on heterogeneous information network
CN112328469A (en) * 2020-10-22 2021-02-05 南京航空航天大学 Function level defect positioning method based on embedding technology
CN112328469B (en) * 2020-10-22 2022-03-18 南京航空航天大学 Function level defect positioning method based on embedding technology
CN112631898A (en) * 2020-12-09 2021-04-09 南京理工大学 Software defect prediction method based on CNN-SVM
CN113379685A (en) * 2021-05-26 2021-09-10 广东炬森智能装备有限公司 PCB defect detection method and device based on dual-channel feature comparison model
CN113362305A (en) * 2021-06-03 2021-09-07 河南中烟工业有限责任公司 Smoke box strip missing mixed brand detection system and method based on artificial intelligence
CN113486176A (en) * 2021-07-08 2021-10-08 桂林电子科技大学 News classification method based on secondary feature amplification
CN113379746B (en) * 2021-08-16 2021-11-02 深圳荣耀智能机器有限公司 Image detection method, device, system, computing equipment and readable storage medium
CN113379746A (en) * 2021-08-16 2021-09-10 深圳荣耀智能机器有限公司 Image detection method, device, system, computing equipment and readable storage medium
CN113791897A (en) * 2021-08-23 2021-12-14 湖北省农村信用社联合社网络信息中心 Method and system for displaying server baseline detection report of rural telecommunication system
CN113791897B (en) * 2021-08-23 2022-09-06 湖北省农村信用社联合社网络信息中心 Method and system for displaying server baseline detection report of rural telecommunication system
US20230367967A1 (en) * 2022-05-16 2023-11-16 Jpmorgan Chase Bank, N.A. System and method for interpreting stuctured and unstructured content to facilitate tailored transactions

Also Published As

Publication number Publication date
CN110188047B (en) 2023-04-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant