CN106250311A - Duplicate defect report detection method based on an LDA model - Google Patents
Duplicate defect report detection method based on an LDA model
- Publication number
- CN106250311A (application CN201610599276.5A)
- Authority
- CN
- China
- Prior art keywords
- detection
- report
- topic
- duplicate defects
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3692—Test management for test results analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a duplicate defect report detection method based on an LDA model. The method comprises building a topic model, calculating the similarity of the execution information of the detection samples, calculating the similarity of the classification information of the detection samples, and comparing the weighted sum of the two similarities with a preset detection threshold to complete duplicate defect report detection. Based on the LDA model and aimed at duplicate defect reports that carry execution information, the invention builds a topic model, performs topic inference on the detection samples, and calculates the similarity of their execution information and of their classification information separately, thereby detecting duplicate defect reports. This effectively addresses the high dimensionality and high noise of traditional methods, and adding execution information and classification information can greatly improve the accuracy of the detection results.
Description
Technical field
The invention belongs to the technical field of defect report detection, and in particular relates to a duplicate defect report detection method based on an LDA model.
Background
A defect report is the descriptive data for a defect that arises during the software maintenance cycle, and it plays a vital role throughout software defect maintenance. A software defect report is a document, written during software maintenance by a software tester or user, that describes a defect or error that may exist in the software. Applying data mining techniques to defect reports can effectively address problems such as defect assignment, incorrect fixes, and defect classification. Defect assignment is the process of dispatching a user-submitted defect report to a designated developer, who analyzes the report, infers the module in which the defect lies, and finally fixes the defect.
Because these reports are usually written hastily by users who know little about the software itself, they not only suffer from vague wording, unprofessional language, incomplete information, and difficulty of understanding, but the same defect is also submitted repeatedly. As a result, the submitted defect reports contain a large number of redundant duplicates. If such duplicate defect reports are repeatedly assigned to developers, human resources are seriously wasted, and the problem is especially pronounced for large open source projects. For open source software such as Eclipse, Firefox, and Open Office, as the scale grows and versions are updated, different users submit defect reports to the bug tracking system every day, and the system accumulates a large number of duplicate defect reports. In the software defect database of the open source software Firefox, for example, the proportion of duplicate defect reports is as high as 30%. In 2005, a Mozilla programmer reported: "Everyday, almost 300 bugs appear that need triaging. This is far too much for only the Mozilla programmers to handle."
To relieve the burden of manually detecting duplicate defect reports, many experts and scholars at home and abroad have devoted themselves to research on duplicate defect report detection and have proposed a series of detection methods. Early research did not consider duplicate defect report detection itself but concentrated on defect assignment; the problem of duplicate defect report detection was essentially raised in order to solve the defect assignment problem. Traditional automatic duplicate report detection methods, however, generally take the vector space model as their theoretical basis: a vector space model is first built with TF-IDF, each defect report is represented as a vector in the term space, and the similarity between reports is then obtained by calculating the cosine of the angle between their vectors. Because real-world corpora are very large, typically more than 10,000 documents, the resulting vector space is high-dimensional, sparse, and noisy, which lowers detection efficiency and leaves recall and precision relatively low.
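The traditional vector-space baseline described above can be sketched as follows. This is a minimal illustration only, assuming whitespace tokenization and a plain `tf * log(N / df)` weighting; the function names and the three sample reports are hypothetical and do not come from the patent.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical reports: the first two describe the same crash.
reports = [
    "editor crashes when opening large file".split(),
    "crash in editor while opening a large file".split(),
    "toolbar icon is rendered with wrong color".split(),
]
vecs = tfidf_vectors(reports)
sim_dup = cosine(vecs[0], vecs[1])
sim_other = cosine(vecs[0], vecs[2])
```

Each vector has one dimension per distinct term, so the dimensionality grows with the vocabulary of the corpus, which is the sparsity problem the background section points to.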
Summary of the invention
The object of the invention is as follows: to solve the above problems in the prior art, the invention proposes a duplicate defect report detection method based on an LDA model.
The technical solution of the invention is a duplicate defect report detection method based on an LDA model, comprising the following steps:
A. Select a certain amount of data from the duplicate defect report data set to build a training sample space, and build a topic model on the training sample space using the LDA method;
B. Randomly select a certain amount of data from the duplicate defect report data set to build a detection sample space, perform topic inference on the detection sample space according to the topic model built in step A, and calculate the similarity of the execution information of the detection samples;
C. Extract the classification information of each defect from the detection sample space in step B and compare it level by level, obtaining the similarity of the classification information of the detection samples;
D. Compute the weighted sum of the execution-information similarity from step B and the classification-information similarity from step C, and compare it with a preset detection threshold to complete duplicate defect report detection.
Further, the topic model built in step A is specifically:
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$
where D is the corpus, α is the parameter of the prior distribution of topics, β is the parameter of the distribution of words within topics, M is the number of documents, d is the document index, θ_d is the topic distribution vector of document d, N_d is the number of words in document d, z_dn is the topic of the n-th word of document d, and w_dn is the n-th word of document d.
Further, in step B, performing topic inference on the detection sample space according to the topic model built in step A and calculating the similarity of the execution information of the detection samples specifically comprises the following sub-steps:
B1. Perform topic inference on the detection sample space to obtain the conditional probability distribution of the detection samples in the topic space;
B2. Convert the conditional probability distribution of the detection samples into a vector representation, and calculate the similarity of the execution information of the detection samples.
Further, the formula for calculating the similarity of the execution information of the detection samples is specifically:
$$\mathrm{sim}(d, q) = \frac{\sum_{i} d_i q_i}{\sqrt{\sum_{i} d_i^2}\,\sqrt{\sum_{i} q_i^2}}$$
where d_i is the value of detection sample d in the i-th dimension of the topic space, and q_i is the value of detection sample q in the i-th dimension of the topic space.
Further, comparing with the preset detection threshold in step D to complete duplicate defect report detection is specifically: if the weighted sum is greater than the detection threshold, the detection sample is a duplicate defect report; if it is less than or equal to the detection threshold, the detection sample is not a duplicate defect report.
The beneficial effects of the invention are as follows: based on the LDA model and aimed at duplicate defect reports that carry execution information, the invention builds a topic model, performs topic inference on the detection samples, and calculates the similarity of their execution information and of their classification information separately, thereby detecting duplicate defect reports. This effectively addresses the high dimensionality and high noise of traditional methods, and adding execution information and classification information can greatly improve the accuracy of the detection results.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the duplicate defect report detection method based on an LDA model according to the invention.
Detailed description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is further described below with reference to the drawing and the embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
Fig. 1 shows the schematic flowchart of the duplicate defect report detection method based on an LDA model according to the invention. The method comprises the following steps:
A. Select a certain amount of data from the duplicate defect report data set to build a training sample space, and build a topic model on the training sample space using the LDA method;
B. Randomly select a certain amount of data from the duplicate defect report data set to build a detection sample space, perform topic inference on the detection sample space according to the topic model built in step A, and calculate the similarity of the execution information of the detection samples;
C. Extract the classification information of each defect from the detection sample space in step B and compare it level by level, obtaining the similarity of the classification information of the detection samples;
D. Compute the weighted sum of the execution-information similarity from step B and the classification-information similarity from step C, and compare it with a preset detection threshold to complete duplicate defect report detection.
In step A, the invention selects a certain amount of data from the duplicate defect report data set to build the training sample space, and builds a topic model on the training sample space using the LDA method, specifically as follows:
Let the K-dimensional vector α be the parameter of the prior distribution of topics, and let the K × V matrix β be the parameter of the distribution of words within topics (V is the total number of words), i.e. $\beta_{ij} = p(w_j \mid z_i)$ is the probability that word w_j occurs in the i-th topic. The probability of generating a document's topic distribution, then N topics, and then the N words of the document can be expressed as:
$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$
where θ is the topic distribution vector of the document, z is the N-dimensional topic vector, and w is the vector of the N words. Because θ and z are latent variables that are not observed in the training data, they are eliminated from the left-hand side by marginalization:
$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$
For a corpus D containing M documents, $p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(w_d \mid \alpha, \beta)$, which gives:
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$
where D is the corpus, α is the parameter of the prior distribution of topics, β is the parameter of the distribution of words within topics, M is the number of documents, d is the document index, θ_d is the topic distribution vector of document d, N_d is the number of words in document d, z_dn is the topic of the n-th word of document d, and w_dn is the n-th word of document d. Building the topic model is the process of finding the values of the parameters α and β that maximize p(D | α, β).
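The generative story behind this likelihood (draw a topic distribution, then a topic for each word position, then a word from that topic) can be sketched in a few lines. This is an illustrative sketch only: the two-topic model, the four-word vocabulary, and the values of `alpha` and `beta` are hypothetical, and the Dirichlet draw is done by normalizing Gamma variates from the standard library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document: draw theta, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]  # z_n ~ p(z | theta)
        w = rng.choices(vocab, weights=beta[z])[0]            # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

vocab = ["crash", "editor", "font", "color"]  # hypothetical vocabulary, V = 4
alpha = [0.5, 0.5]                            # hypothetical prior parameter, K = 2
beta = [
    [0.45, 0.45, 0.05, 0.05],                 # topic 0 favors "crash" and "editor"
    [0.05, 0.05, 0.45, 0.45],                 # topic 1 favors "font" and "color"
]
rng = random.Random(0)
theta, topics, words = generate_document(alpha, beta, vocab, 8, rng)
```

Training runs this story in reverse: given only the observed words of many documents, it searches for the `alpha` and `beta` under which those words are most probable.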
In step B, because the duplicate defect report data set is too large to be used directly for duplicate defect report detection, a certain amount of data is randomly selected from it to build the detection sample space. Here the invention randomly selects 200 defect reports from 9600 defect reports to build the detection sample space; by count, 22 of the randomly selected reports are duplicate defect reports, accounting for 11%. Because some of the duplicate defect reports in the detection sample space are pure natural-language descriptions while others contain execution information, the following processing is performed to improve the accuracy of the experimental results:
S1. Delete the defect reports that are pure natural-language descriptions, and keep the defect reports that contain execution information;
S2. At the same time, randomly select duplicate defects again from the training sample space to bring the number of defect reports back up to 22;
S3. Manually judge the content attributes of the newly selected defect report data, and repeat step S1;
S4. End the loop when all 22 defect reports in the experimental base data contain execution information;
S5. Obtain the final detection sample space, which includes 22 defect reports that contain execution information.
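Steps S1 to S5 amount to a filter-and-top-up loop, which can be sketched as below. This is a hedged illustration, not the procedure used in the experiment: the manual judgment of S3 is replaced by a predicate, and the `has_trace` field, the pool of 60 reports, and all function names are hypothetical.

```python
import random

def build_detection_samples(candidates, pool, target, has_execution_info, rng):
    """Keep drawing reports until `target` of them carry execution information."""
    # S1: drop reports that are pure natural-language descriptions.
    kept = [r for r in candidates if has_execution_info(r)]
    remaining = [r for r in pool if r not in candidates]
    # S2-S4: top up from the pool and re-filter until the target is reached.
    while len(kept) < target and remaining:
        need = min(target - len(kept), len(remaining))
        drawn = rng.sample(remaining, need)
        remaining = [r for r in remaining if r not in drawn]
        kept.extend(r for r in drawn if has_execution_info(r))
    # S5: every report in the final set carries execution information.
    return kept

# Hypothetical pool in which two thirds of the reports contain a stack trace.
pool = [{"id": i, "has_trace": i % 3 != 0} for i in range(60)]
rng = random.Random(1)
initial = rng.sample(pool, 22)
samples = build_detection_samples(initial, pool, 22, lambda r: r["has_trace"], rng)
```

The loop never draws more reports than are still missing, so the result cannot exceed the target size; it falls short only if the pool runs out of reports with execution information.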
According to the statistics on the Eclipse official website, the actual duplicate reports corresponding to these 22 defect reports are obtained manually and added to the training sample space, with an AddOut field added to the file name identifier under the classification naming rule to show that they are externally added defect reports. By count, 44 duplicate reports are added externally. To further ensure the accuracy of the experimental results, the defect classification information needs to be added to the experiment in the form of a file name identifier: based on the original data set, the classification information of each report is added to its file name, and a classification data set corresponding to the original data set is built separately according to the file names, where each item in the classification data set is classification information. Finally, a small detection sample space containing 244 defect reports is obtained.
According to the topic model built in step A, topic inference is performed on the detection sample space obtained above and the similarity of the execution information of the detection samples is calculated, which specifically comprises the following sub-steps:
B1. Perform topic inference on the detection sample space to obtain the conditional probability distribution of the detection samples in the topic space;
B2. Convert the conditional probability distribution of the detection samples into a vector representation, and calculate the similarity of the execution information of the detection samples.
In step B1, the topic model built in step A is used to perform topic inference on the detection sample space and predict the corresponding topic distribution, yielding the conditional probability distribution of the detection samples in the topic space, expressed as:
$$p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$$
where θ is the topic distribution vector of the document, z is the topic vector, and w is the vector of words. In concrete form, the result is a document-topic matrix.
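One simple way to obtain a row of the document-topic matrix for an unseen report is an EM-style fold-in with the topic-word probabilities held fixed. The sketch below uses that simplification, in the spirit of PLSA fold-in with `alpha` acting as pseudo-counts; it is not the variational or Gibbs inference usually used with LDA, and the text does not say which inference procedure the invention uses. All names and values are hypothetical.

```python
def infer_topic_distribution(words, beta, vocab_index, alpha, n_iter=50):
    """Estimate a document's topic distribution with the topic-word matrix held fixed."""
    k = len(beta)
    ids = [vocab_index[w] for w in words if w in vocab_index]  # skip unseen words
    theta = [1.0 / k] * k
    for _ in range(n_iter):
        counts = [0.0] * k
        for w in ids:
            # Responsibility of each topic for this word: theta_t * beta[t][w].
            resp = [theta[t] * beta[t][w] for t in range(k)]
            total = sum(resp)
            if total == 0.0:
                continue
            for t in range(k):
                counts[t] += resp[t] / total
        # Smoothed update; alpha is used as pseudo-counts (a simplification).
        norm = sum(counts) + sum(alpha)
        theta = [(counts[t] + alpha[t]) / norm for t in range(k)]
    return theta

vocab_index = {"crash": 0, "editor": 1, "font": 2, "color": 3}  # hypothetical
beta = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0 favors "crash" and "editor"
    [0.05, 0.05, 0.45, 0.45],  # topic 1 favors "font" and "color"
]
alpha = [0.5, 0.5]
theta_doc = infer_topic_distribution(
    ["crash", "editor", "crash", "font"], beta, vocab_index, alpha
)
```

Stacking one such vector per detection sample gives the document-topic matrix that step B2 converts into vectors for comparison.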
In step B2, the conditional probability distribution of the detection samples is converted into a vector representation: the document-topic matrix obtained in step B1 is converted into a vector representation of each document in the topic space, and semantic similarity is expressed by the similarity between documents in the topic space. This gives the similarity of the execution information of the detection samples, calculated as:
$$\mathrm{sim}(d, q) = \frac{\sum_{i=1}^{K} d_i q_i}{\sqrt{\sum_{i=1}^{K} d_i^2}\,\sqrt{\sum_{i=1}^{K} q_i^2}}$$
where d_i is the value of detection sample d in the i-th dimension of the topic space, and q_i is the value of detection sample q in the i-th dimension of the topic space. The similarity of the execution information of the detection samples is assigned a maximum weight of 1. Because the quantity calculated is the cosine of the angle between two vectors, its range is between 0 and 1: the larger the angle, the smaller the cosine, and the smaller the angle, the larger the cosine. Mapped onto similarity, a larger cosine (a smaller angle) means that the two documents are closer in the topic space, and therefore that their semantic similarity is higher.
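The cosine comparison of two topic vectors can be sketched as follows. The three topic vectors are hypothetical values chosen for illustration; because topic proportions are non-negative, the result stays within the range of 0 to 1 stated above.

```python
import math

def topic_cosine(d, q):
    """Cosine of the angle between two documents' topic vectors."""
    if len(d) != len(q):
        raise ValueError("topic vectors must have the same number of dimensions")
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)

# Hypothetical topic distributions over K = 3 topics.
report_d = [0.7, 0.2, 0.1]
report_q = [0.6, 0.3, 0.1]   # similar topic mix to report_d
report_r = [0.1, 0.1, 0.8]   # dominated by a different topic
sim_close = topic_cosine(report_d, report_q)
sim_far = topic_cosine(report_d, report_r)
```

Compared with the term-space vectors of the traditional method, these vectors have only K dimensions, one per topic, regardless of vocabulary size.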
In step C, the classification information of each defect is extracted from the detection sample space in step B and attached to each detection sample as a label. Because the classification information is structured and organized in the order classification, product, component, version, it is compared level by level, and the comparison stops at the first difference. The similarity of the classification information of the detection samples is assigned a maximum weight of 1; depending on the number of identical levels found by the comparison, it takes the value 0.4, 0.55, 0.7, 0.85, or 1.
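The level-by-level comparison can be sketched as below. The mapping from the number of matching leading levels (0 through 4) to the five scores is an assumption read from the order in which the values are listed; the field names and the example reports are hypothetical.

```python
LEVELS = ("classification", "product", "component", "version")
# Assumed mapping: index = number of matching leading levels (0 to 4).
SCORES = (0.4, 0.55, 0.7, 0.85, 1.0)

def classification_similarity(a, b):
    """Compare two reports level by level and stop at the first difference."""
    matched = 0
    for level in LEVELS:
        if a[level] != b[level]:
            break
        matched += 1
    return SCORES[matched]

# Hypothetical classification records.
r1 = {"classification": "Eclipse", "product": "JDT", "component": "UI", "version": "4.5"}
r2 = {"classification": "Eclipse", "product": "JDT", "component": "UI", "version": "4.6"}
r3 = {"classification": "Eclipse", "product": "PDE", "component": "UI", "version": "4.5"}

same = classification_similarity(r1, r1)
version_differs = classification_similarity(r1, r2)
product_differs = classification_similarity(r1, r3)
```

Note that `r1` and `r3` agree on component and version, but those levels are never reached because the comparison stops at the differing product.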
In step D, the weighted sum of the topic-space similarity of the detection samples from step B and the classification-information similarity of the detection samples from step C is computed to obtain the final similarity of the defect report, which is compared with the preset detection threshold. If the final similarity is greater than the detection threshold, the detection sample is a duplicate defect report; if the final similarity is less than or equal to the detection threshold, the detection sample is not a duplicate defect report. This completes duplicate defect report detection.
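The decision rule of step D can be sketched as follows. The text does not give the weights or the threshold, so the equal weights of 0.5 and the threshold of 0.8 below are placeholder assumptions, as are the function and parameter names.

```python
def is_duplicate(exec_sim, class_sim, w_exec=0.5, w_class=0.5, threshold=0.8):
    """Return (decision, final_similarity) for one pair of defect reports.

    The weights and the threshold are hypothetical defaults, not values from the patent.
    """
    final = w_exec * exec_sim + w_class * class_sim
    # Strictly greater than the threshold means duplicate; equal does not.
    return final > threshold, final

likely_dup, score_dup = is_duplicate(0.95, 0.85)
unlikely_dup, score_other = is_duplicate(0.5, 0.4)
```

In practice the threshold would be tuned on labeled duplicates, since raising it trades recall for precision.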
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of protection of the invention is not limited to these specific statements and embodiments. Based on the technical teachings disclosed by the invention, those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the invention, and such variations and combinations remain within the scope of protection of the invention.
Claims (5)
1. A duplicate defect report detection method based on an LDA model, characterized by comprising the following steps:
A. Select a certain amount of data from the duplicate defect report data set to build a training sample space, and build a topic model on the training sample space using the LDA method;
B. Randomly select a certain amount of data from the duplicate defect report data set to build a detection sample space, perform topic inference on the detection sample space according to the topic model built in step A, and calculate the similarity of the execution information of the detection samples;
C. Extract the classification information of each defect from the detection sample space in step B and compare it level by level, obtaining the similarity of the classification information of the detection samples;
D. Compute the weighted sum of the execution-information similarity from step B and the classification-information similarity from step C, and compare it with a preset detection threshold to complete duplicate defect report detection.
2. The duplicate defect report detection method based on an LDA model according to claim 1, characterized in that the topic model built in step A is specifically:
$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$
where D is the corpus, α is the parameter of the prior distribution of topics, β is the parameter of the distribution of words within topics, M is the number of documents, d is the document index, θ_d is the topic distribution vector of document d, N_d is the number of words in document d, z_dn is the topic of the n-th word of document d, and w_dn is the n-th word of document d.
3. The duplicate defect report detection method based on an LDA model according to claim 2, characterized in that, in step B, performing topic inference on the detection sample space according to the topic model built in step A and calculating the similarity of the execution information of the detection samples specifically comprises the following sub-steps:
B1. Perform topic inference on the detection sample space to obtain the conditional probability distribution of the detection samples in the topic space;
B2. Convert the conditional probability distribution of the detection samples into a vector representation, and calculate the similarity of the execution information of the detection samples.
4. The duplicate defect report detection method based on an LDA model according to claim 3, characterized in that the formula for calculating the similarity of the execution information of the detection samples is specifically:
$$\mathrm{sim}(d, q) = \frac{\sum_{i} d_i q_i}{\sqrt{\sum_{i} d_i^2}\,\sqrt{\sum_{i} q_i^2}}$$
where d_i is the value of detection sample d in the i-th dimension of the topic space, and q_i is the value of detection sample q in the i-th dimension of the topic space.
5. The duplicate defect report detection method based on an LDA model according to claim 4, characterized in that comparing with the preset detection threshold in step D to complete duplicate defect report detection is specifically: if the weighted sum is greater than the detection threshold, the detection sample is a duplicate defect report; if it is less than or equal to the detection threshold, the detection sample is not a duplicate defect report.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610599276.5A CN106250311A (en) | 2016-07-27 | 2016-07-27 | Duplicate defect report detection method based on an LDA model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610599276.5A CN106250311A (en) | 2016-07-27 | 2016-07-27 | Duplicate defect report detection method based on an LDA model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106250311A (en) | 2016-12-21 |
Family
ID=57604107
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610599276.5A Pending CN106250311A (en) | 2016-07-27 | 2016-07-27 | Duplicate defect report detection method based on an LDA model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106250311A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106777237A (en) * | 2016-12-27 | 2017-05-31 | 武汉延锋时代检测技术服务有限公司 | A kind of analysis method of surface defect |
CN106874203A (en) * | 2017-02-15 | 2017-06-20 | 南京大学 | A kind of defect positioning method based on the analysis of defect report text subject |
CN108804558A (en) * | 2018-05-22 | 2018-11-13 | 北京航空航天大学 | A kind of defect report automatic classification method based on semantic model |
CN110188047A (en) * | 2019-06-20 | 2019-08-30 | 重庆大学 | A kind of repeated defects report detection method based on binary channels convolutional neural networks |
CN111178037A (en) * | 2019-12-24 | 2020-05-19 | 中信银行股份有限公司 | Repeated defect report identification method and device and electronic equipment |
CN111258905A (en) * | 2020-01-19 | 2020-06-09 | 中信银行股份有限公司 | Defect positioning method and device, electronic equipment and computer readable storage medium |
CN111737107A (en) * | 2020-05-15 | 2020-10-02 | 南京航空航天大学 | Repeated defect report detection method based on heterogeneous information network |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103246603A (en) * | 2013-03-21 | 2013-08-14 | 中国科学院软件研究所 | Automatic distribution method for software bug reports of bug tracking system |
CN103970666A (en) * | 2014-05-29 | 2014-08-06 | 重庆大学 | Method for detecting repeated software defect reports |
CN104021180A (en) * | 2014-06-09 | 2014-09-03 | 南京航空航天大学 | Combined software defect report classification method |
Non-Patent Citations (1)
Title |
---|
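Jiang Xinzhi: "Research on duplicate defect report detection based on the LDA model", China Master's Theses Full-text Database, Information Science and Technology * |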
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106777237A (en) * | 2016-12-27 | 2017-05-31 | 武汉延锋时代检测技术服务有限公司 | A kind of analysis method of surface defect |
CN106777237B (en) * | 2016-12-27 | 2019-10-01 | 武汉延锋时代检测技术股份有限公司 | A kind of analysis method of surface defect |
CN106874203A (en) * | 2017-02-15 | 2017-06-20 | 南京大学 | A kind of defect positioning method based on the analysis of defect report text subject |
CN106874203B (en) * | 2017-02-15 | 2019-11-12 | 南京大学 | A kind of defect positioning method based on the analysis of defect report text subject |
CN108804558A (en) * | 2018-05-22 | 2018-11-13 | 北京航空航天大学 | A kind of defect report automatic classification method based on semantic model |
CN110188047A (en) * | 2019-06-20 | 2019-08-30 | 重庆大学 | A kind of repeated defects report detection method based on binary channels convolutional neural networks |
CN110188047B (en) * | 2019-06-20 | 2023-04-18 | 重庆大学 | Double-channel convolutional neural network-based repeated defect report detection method |
CN111178037A (en) * | 2019-12-24 | 2020-05-19 | 中信银行股份有限公司 | Repeated defect report identification method and device and electronic equipment |
CN111258905A (en) * | 2020-01-19 | 2020-06-09 | 中信银行股份有限公司 | Defect positioning method and device, electronic equipment and computer readable storage medium |
CN111258905B (en) * | 2020-01-19 | 2023-05-23 | 中信银行股份有限公司 | Defect positioning method and device, electronic equipment and computer readable storage medium |
CN111737107A (en) * | 2020-05-15 | 2020-10-02 | 南京航空航天大学 | Repeated defect report detection method based on heterogeneous information network |
CN111737107B (en) * | 2020-05-15 | 2021-10-26 | 南京航空航天大学 | Repeated defect report detection method based on heterogeneous information network |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN106250311A (en) | Duplicate defect report detection method based on an LDA model | |
Rahman et al. | A bidirectional LSTM language model for code evaluation and repair | |
US20200134263A1 (en) | Non-factoid question-answering device | |
CN105930368B (en) | A kind of sensibility classification method and system | |
CN108595602A (en) | The question sentence file classification method combined with depth model based on shallow Model | |
Bekkerman et al. | High-precision phrase-based document classification on a modern scale | |
CN106611375A (en) | Text analysis-based credit risk assessment method and apparatus | |
CN105843897A (en) | Vertical domain-oriented intelligent question and answer system | |
CN102332012A (en) | Chinese text sorting method based on correlation study between sorts | |
CN116982037A (en) | Semantic coverage in managing and measuring knowledge discovery processes | |
CN104750833A (en) | Text classification method and device | |
CN105354595A (en) | Robust visual image classification method and system | |
CN111710428B (en) | Biomedical text representation method for modeling global and local context interaction | |
CN104951987B (en) | Crop Breeding evaluation method based on decision tree | |
Juhász et al. | The who, what, when, and how of industrial policy: A text-based approach | |
Xue et al. | An evolutionary algorithm based ontology matching system. | |
CN108717459A (en) | A kind of mobile application defect positioning method of user oriented comment information | |
CN109615242A (en) | A kind of software bug allocating method based on Recognition with Recurrent Neural Network and cost-sensitive | |
CN106446117A (en) | Text analysis method based on poisson-gamma belief network | |
Zhao | Research and design of automatic scoring algorithm for English composition based on machine learning | |
Hussain et al. | A tutorial on open-source large language models for behavioral science | |
Wu et al. | Using apriori algorithm on students’ performance data for Association Rules Mining | |
CN109189848A (en) | Abstracting method, system, computer equipment and the storage medium of knowledge data | |
Hulsebos et al. | Models and Practice of Neural Table Representations | |
Zhong et al. | A Projection Pursuit Dynamic Cluster Model for Tourism Safety Early Warning and Its Implications for Sustainable Tourism |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161221 |