CN106250311A

CN106250311A - Method for detecting duplicate defect reports based on an LDA model

Info

Publication number: CN106250311A
Application number: CN201610599276.5A
Authority: CN
Inventors: 蒋欣志; 盛泳潘
Original assignee: Chengdu Qili Juxiantang Technology Co Ltd
Current assignee: Chengdu Qili Juxiantang Technology Co Ltd
Priority date: 2016-07-27
Filing date: 2016-07-27
Publication date: 2016-12-21

Abstract

The invention discloses a method for detecting duplicate defect reports based on an LDA model. The method comprises building a topic model, calculating the similarity of the execution information of the detection samples, calculating the similarity of the classification information of the detection samples, and comparing the weighted sum of the two similarities with a preset detection threshold to complete duplicate defect report detection. Based on the LDA model and targeting duplicate defect reports that carry execution information, the invention builds a topic model, performs topic inference on the detection samples, and calculates the similarity of their execution information and of their classification information separately, thereby detecting duplicate defect reports. This effectively addresses the high dimensionality and high noise of traditional methods, and introducing execution information and classification information greatly improves the accuracy of the detection results.

Description

Method for detecting duplicate defect reports based on an LDA model

Technical field

The invention belongs to the technical field of defect report detection, and in particular relates to a method for detecting duplicate defect reports based on an LDA model.

Background art

Defect reports are the data that describe defects arising during the software maintenance cycle, and they play a vital role throughout software defect maintenance. A software defect report is a document, written by software testers or users during software maintenance, that describes a possible defect or error in the software. Applying data mining techniques to defect reports can effectively address problems such as defect assignment, error repair, and defect classification. Defect assignment is the process of dispatching a user-submitted defect report to a designated developer, who analyzes the report, infers the module in which the defect lies, and finally fixes the defect.

Because these reports are usually written hastily by users who know little about the software itself, they are often vague, unprofessional, incomplete, and hard to understand. In addition, the same defect is frequently submitted more than once, so the submitted defect reports contain a large number of redundant duplicates. If such duplicate defect reports are assigned to developers again and again, human resources are seriously wasted, and the problem is especially pronounced for large open source projects. For open source software such as Eclipse, Firefox, and Open Office, as the scale grows and versions are updated, different users submit defect reports to the bug tracking system every day, and the system accumulates a large number of duplicate defect reports. For example, in the software defect database of the open source software Firefox, the proportion of duplicate defect reports is as high as 30%. In 2005, a Mozilla programmer reported: "Everyday, almost 300 bugs appear that need triaging. This is far too much for only the Mozilla programmers to handle."

To ease the burden of manually detecting duplicate defect reports, many experts and scholars at home and abroad have worked on duplicate defect report detection and proposed a series of detection methods. Early research did not consider duplicate defect report detection and concentrated instead on defect assignment; the problem of duplicate defect report detection was essentially raised in order to solve the defect assignment problem. Traditional automatic duplicate report detection methods generally take the vector space model as their theoretical basis: a vector space model is first built with TF-IDF, each defect report is represented as a vector in the word space, and the similarity between reports is then calculated as the cosine of the angle between their vectors. However, because the corpus to be processed in practice is very large, typically more than 10,000 documents, the resulting vector space is high-dimensional, sparse, and noisy, which lowers detection efficiency and yields relatively low recall and precision.

Summary of the invention

The object of the invention is as follows: to solve the above problems in the prior art, the invention proposes a method for detecting duplicate defect reports based on an LDA model.

The technical solution of the invention is a method for detecting duplicate defect reports based on an LDA model, comprising the following steps:

A. Selecting a number of reports from a duplicate defect report data set to build a training sample space, and building a topic model on the training sample space using the LDA method;

B. Randomly selecting a number of reports from the duplicate defect report data set to build a detection sample space, performing topic inference on the detection sample space according to the topic model built in step A, and calculating the similarity of the execution information of the detection samples;

C. Extracting the classification information of each defect from the detection sample space of step B and comparing it level by level, to obtain the similarity of the classification information of the detection samples;

D. Calculating the weighted sum of the execution-information similarity from step B and the classification-information similarity from step C, and comparing it with a preset detection threshold to complete duplicate defect report detection.

Further, the topic model built in step A is specifically:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d

where D is the corpus, α is the parameter of the prior distribution over topics, β is the parameter of the distribution of words within topics, M is the number of documents, d is the document index, θ_d is the topic distribution vector of document d, N_d is the number of words in document d, z_dn is the topic of the n-th word of document d, and w_dn is the n-th word of document d.

Further, in step B, performing topic inference on the detection sample space according to the topic model built in step A and calculating the similarity of the execution information of the detection samples specifically comprises the following sub-steps:

B1. Performing topic inference on the detection sample space to obtain the conditional probability distribution of the test samples over the topic space;

B2. Converting the conditional probability distribution of the test samples into a vector representation and calculating the similarity of the execution information of the detection samples.

Further, the formula for calculating the similarity of the execution information of the detection samples is specifically:

Sim(d, q) = \frac{\sum_{i} d_i \, q_i}{\sqrt{\sum_{i} d_i^{2} \cdot \sum_{i} q_i^{2}}}

where d_i is the value of test sample d in the i-th dimension of the topic space, and q_i is the value of test sample q in the i-th dimension of the topic space.

Further, the comparison with the preset detection threshold in step D, which completes duplicate defect report detection, is specifically: if the weighted sum is greater than the detection threshold, the detection sample is a duplicate defect report; if it is less than or equal to the detection threshold, the detection sample is not a duplicate defect report.

The beneficial effects of the invention are as follows: based on the LDA model and targeting duplicate defect reports that carry execution information, the invention builds a topic model, performs topic inference on the detection samples, and calculates the similarity of their execution information and of their classification information separately, thereby detecting duplicate defect reports. This effectively addresses the high dimensionality and high noise of traditional methods, and adding execution information and classification information greatly improves the accuracy of the detection results.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the method for detecting duplicate defect reports based on an LDA model according to the invention.

Detailed description of the invention

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

Fig. 1 shows a schematic flowchart of the method for detecting duplicate defect reports based on an LDA model according to the invention. The method comprises the following steps:

In step A, the invention selects a number of reports from the duplicate defect report data set to build a training sample space, and builds a topic model on the training sample space using the LDA method, specifically as follows:

Assume that the K-dimensional vector α is the parameter of the prior distribution over topics and that the K × V matrix β is the parameter of the distribution of words within topics (V is the total number of words), i.e. β_ij = p(w_j | z_i) is the probability that word w_j appears in the i-th topic. Then the probability of generating the topic distribution of a document, then generating N topics, and from them obtaining the N words of the document can be expressed as:

p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)
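The generative process behind this joint distribution can be sketched in a few lines of Python. This is only an illustrative sketch, not the patent's implementation: the function names, the toy values of `alpha` and `beta`, and the fixed random seed are all assumptions introduced here.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, beta, n_words, rng):
    """Sample (theta, z, w) for one document: theta, then N topics, then N words."""
    theta = sample_dirichlet(alpha, rng)                          # p(theta | alpha)
    z = [sample_categorical(theta, rng) for _ in range(n_words)]  # p(z_n | theta)
    w = [sample_categorical(beta[topic], rng) for topic in z]     # p(w_n | z_n, beta)
    return theta, z, w


# Hypothetical toy model: K = 2 topics over a vocabulary of V = 4 word ids.
rng = random.Random(0)
alpha = [0.5, 0.5]
beta = [[0.5, 0.5, 0.0, 0.0],   # topic 0 only emits words 0 and 1
        [0.0, 0.0, 0.5, 0.5]]   # topic 1 only emits words 2 and 3
theta, z, w = generate_document(alpha, beta, n_words=8, rng=rng)
```

Each row of `beta` plays the role of one row of the K × V matrix β, so `beta[i][j]` corresponds to β_ij.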

Here θ is the topic distribution vector of the document, z is the N-dimensional topic vector, and w is the vector of the N words. Because θ and z are latent variables that are not observed in the training data, they are eliminated from the left-hand side by marginalization, which gives:

p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta
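To make the marginalization concrete, the sketch below evaluates the bracketed term for a fixed θ and then approximates the integral over θ by simple Monte Carlo averaging. Monte Carlo averaging is a technique swapped in here purely for illustration; the patent does not say how the integral is evaluated, and the function names, sample count, and seed are assumptions.

```python
import random


def likelihood_given_theta(words, theta, beta):
    """Bracketed term: product over words of sum_k theta[k] * beta[k][w_n]."""
    result = 1.0
    for w in words:
        result *= sum(theta[k] * beta[k][w] for k in range(len(theta)))
    return result


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def estimate_marginal(words, alpha, beta, n_samples=2000, seed=0):
    """Monte Carlo estimate of p(w | alpha, beta): average the bracketed term
    over draws of theta from the Dirichlet prior p(theta | alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = sample_dirichlet(alpha, rng)
        total += likelihood_given_theta(words, theta, beta)
    return total / n_samples
```

When every topic has the same word distribution, the bracketed term no longer depends on θ, so the estimate reduces to the plain product of word probabilities; this is a convenient sanity check for the sketch.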

For a corpus D containing M documents, p(D \mid \alpha, \beta) = \prod_{d=1}^{M} p(w_d \mid \alpha, \beta), which gives:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d

where D is the corpus, α is the parameter of the prior distribution over topics, β is the parameter of the distribution of words within topics, M is the number of documents, d is the document index, θ_d is the topic distribution vector of document d, N_d is the number of words in document d, z_dn is the topic of the n-th word of document d, and w_dn is the n-th word of document d. Building the topic model is precisely the process of finding the values of the parameters α and β that maximize p(D | α, β).

In step B, because the duplicate defect report data set is too large to be used directly for duplicate defect report detection, a number of reports are randomly selected from it to build the detection sample space. Here the invention randomly selects 200 defect reports from 9,600 defect reports to build the detection sample space; according to statistics, 22 of the randomly selected reports are duplicate defect reports, accounting for 11%. Because some of the duplicate defect reports in the detection sample space are pure natural-language descriptions while others contain execution information, the following processing is performed to improve the accuracy of the experimental results:

S1. Delete the defect reports that are pure natural-language descriptions and keep the defect reports that contain execution information;

S2. Randomly select further duplicate defects from the training sample space to bring the number of defect reports back up to 22;

S3. Manually judge the content attributes of the newly selected defect reports, and repeat step S1;

S4. End the loop when all 22 defect reports in the experimental base data contain execution information;

S5. Obtain the final detection sample space, comprising 22 defect reports with execution information.
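The filter-and-refill loop of steps S1 to S5 might look roughly like the following sketch. The patent performs the content check manually, so the `has_execution_info` predicate, the dictionary layout of a report, and the function name are hypothetical stand-ins introduced for illustration only.

```python
import random


def build_detection_sample(initial, duplicate_pool, has_execution_info, target, rng):
    """Keep only reports with execution information, refilling from
    duplicate_pool until `target` reports are selected or the pool runs out."""
    # S1: drop pure natural-language reports, keep those with execution info.
    selected = [r for r in initial if has_execution_info(r)]
    remaining = [r for r in duplicate_pool if r not in selected]
    rng.shuffle(remaining)
    # S2-S4: draw further duplicates at random and re-check each one.
    while len(selected) < target and remaining:
        candidate = remaining.pop()
        if has_execution_info(candidate):   # S3: stands in for the manual check
            selected.append(candidate)
    # S5: the final list of reports that all carry execution information.
    return selected


# Hypothetical toy data: `has_trace` marks reports that contain execution info.
initial = [{"id": 1, "has_trace": True}, {"id": 2, "has_trace": False},
           {"id": 3, "has_trace": True}, {"id": 4, "has_trace": False}]
pool = [{"id": 10, "has_trace": False}, {"id": 11, "has_trace": True},
        {"id": 12, "has_trace": True}, {"id": 13, "has_trace": True}]
sample = build_detection_sample(initial, pool, lambda r: r["has_trace"],
                                target=4, rng=random.Random(0))
```

In the embodiment described above, `target` would be 22; the sketch draws one candidate at a time, which reaches the same end state as refilling in batches.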

Based on the statistics on the Eclipse official website, the actual duplicate reports corresponding to these 22 defect reports are obtained manually and also added to the training sample space, and they are named with an AddOut field in the filename tag to indicate that they are externally added defect reports. According to statistics, 44 duplicate reports are added externally. To further ensure the accuracy of the experimental results, the classification information of each defect needs to be brought into the experiment in the form of a filename tag: based on the original data set, the classification information of each report is added to its filename, and a classification data set corresponding to the original data set is built separately according to the filenames, where the data in each classification data set is the classification information. This finally yields a small test sample space containing 244 defect reports.

According to the topic model built in step A, topic inference is performed on the detection sample space obtained above and the similarity of the execution information of the detection samples is calculated, specifically comprising the following sub-steps:

In step B1, the topic model built in step A is used to perform topic inference on the detection sample space and predict the corresponding topic distribution, giving the conditional probability distribution of the test samples over the topic space, expressed as:

p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}

where θ is the topic distribution vector of the document, z is the topic vector, and w is the vector of words. The concrete form of the result is a document-topic matrix.
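One simple way to obtain a row of such a document-topic matrix is sketched below. It is a deliberately simplified stand-in: with β held fixed, it computes a maximum-likelihood point estimate of the topic proportions by EM, rather than the full posterior p(θ, z | w, α, β) described in the text, and it ignores α. The function name and iteration count are assumptions.

```python
def infer_topic_mixture(words, beta, n_iter=50):
    """EM point estimate of a document's topic proportions with beta held fixed."""
    k = len(beta)
    theta = [1.0 / k] * k                      # start from a uniform mixture
    for _ in range(n_iter):
        counts = [0.0] * k
        for w in words:
            # E-step: responsibility of each topic for this word.
            weights = [theta[t] * beta[t][w] for t in range(k)]
            norm = sum(weights)
            if norm == 0.0:
                continue                       # word impossible under every topic
            for t in range(k):
                counts[t] += weights[t] / norm
        total = sum(counts)
        if total == 0.0:
            break                              # nothing to update from
        # M-step: renormalise expected topic counts into proportions.
        theta = [c / total for c in counts]
    return theta


# Hypothetical toy model: topic 0 emits words 0 and 1, topic 1 emits words 2 and 3.
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
doc_topic_matrix = [infer_topic_mixture(doc, beta) for doc in ([0, 1, 0], [0, 2])]
```

Each row of `doc_topic_matrix` is one document expressed as a vector in the topic space, which is the form the similarity calculation in step B2 operates on.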

In step B2, the conditional probability distribution of the test samples is converted into a vector representation; that is, the document-topic matrix obtained in step B1 is converted into a vector representation of each document in the topic space. The semantic similarity between documents is expressed by their similarity in the topic space, which gives the similarity of the execution information of the detection samples. The formula is:

Sim(d, q) = \frac{\sum_{i} d_i \, q_i}{\sqrt{\sum_{i} d_i^{2} \cdot \sum_{i} q_i^{2}}}

where d_i is the value of test sample d in the i-th dimension of the topic space, and q_i is the value of test sample q in the i-th dimension of the topic space. The similarity of the execution information of the detection samples is given a maximum weight of 1. Because the quantity calculated is the cosine of the angle between the two vectors, and topic-space components are non-negative, its value lies between 0 and 1: the larger the angle, the smaller the cosine, and the smaller the angle, the larger the cosine. Translating this spatial relationship into similarity, a larger cosine, i.e. a smaller angle, means the two documents are more similar in the topic space and therefore more similar semantically.
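The similarity formula translates directly into code. The sketch below is a minimal illustration; the function name and the choice to return 0.0 for an all-zero vector are assumptions made here.

```python
import math


def cosine_similarity(d, q):
    """Sim(d, q) = sum_i d_i * q_i / sqrt(sum_i d_i^2 * sum_i q_i^2)."""
    if len(d) != len(q):
        raise ValueError("topic vectors must have the same dimension")
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = math.sqrt(sum(di * di for di in d) * sum(qi * qi for qi in q))
    # Assumption: an all-zero vector is treated as having no similarity.
    return dot / norm if norm > 0.0 else 0.0


# Two documents dominated by the same topic score close to 1,
# two documents with disjoint topics score 0.
same_topic = cosine_similarity([0.9, 0.1], [0.8, 0.2])
disjoint = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```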

In step C, the classification information of each defect is extracted from the detection sample space of step B and tagged onto each detection sample. Because the classification information is structured and organized in the order classification, product, component, version, it is compared level by level, and the comparison stops at the first difference. The similarity of the classification information of the detection samples is given a maximum weight of 1; depending on the number of identical levels found by the comparison, the similarity is 0.4, 0.55, 0.7, 0.85, or 1, respectively.
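The level-by-level comparison can be sketched as follows. The text lists five scores for four levels, so this sketch assumes that the scores correspond to 0, 1, 2, 3, and 4 matching leading levels; that mapping, the tuple layout, and the example field values are assumptions introduced here.

```python
# Assumed mapping: index = number of identical leading levels (0 to 4).
LEVEL_SCORES = (0.4, 0.55, 0.7, 0.85, 1.0)


def classification_similarity(a, b):
    """Compare two (classification, product, component, version) tuples level
    by level, stopping at the first difference, and map the count to a score."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("expected (classification, product, component, version)")
    matched = 0
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            break                      # stop at the first differing level
        matched += 1
    return LEVEL_SCORES[matched]


# Hypothetical field values for two reports that differ only in version.
report_a = ("Eclipse", "Platform", "UI", "3.1")
report_b = ("Eclipse", "Platform", "UI", "3.2")
score = classification_similarity(report_a, report_b)
```

Because the comparison stops at the first mismatch, two reports with the same version but different products still score at the product level, which reflects the hierarchical ordering described above.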

In step D, the weighted sum of the topic-space similarity of the detection samples from step B and the classification-information similarity of the detection samples from step C is calculated to obtain the final similarity of the defect reports, which is compared with the preset detection threshold. If the final similarity is greater than the detection threshold, the detection sample is a duplicate defect report; if the final similarity is less than or equal to the detection threshold, the detection sample is not a duplicate defect report. This completes duplicate defect report detection.
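The final decision of step D can be sketched as below. The patent does not state the individual weights or the threshold value, so the default weights of 0.5 each and the threshold of 0.8 used in the example are assumptions for illustration only.

```python
def final_similarity(exec_sim, class_sim, w_exec=0.5, w_class=0.5):
    """Weighted sum of execution-information and classification similarities.
    The default weights are illustrative assumptions, not values from the text."""
    return w_exec * exec_sim + w_class * class_sim


def is_duplicate(exec_sim, class_sim, threshold, w_exec=0.5, w_class=0.5):
    """A sample is a duplicate only if the final similarity is strictly
    greater than the detection threshold."""
    return final_similarity(exec_sim, class_sim, w_exec, w_class) > threshold


# Hypothetical threshold of 0.8.
clear_duplicate = is_duplicate(0.9, 1.0, threshold=0.8)
at_threshold = is_duplicate(0.8, 0.8, threshold=0.8)
```

The strict comparison matters at the boundary: a final similarity exactly equal to the threshold is classified as not a duplicate, matching the rule stated above.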

Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of protection of the invention is not limited to these particular statements and embodiments. Based on the technical teachings disclosed by the invention, those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the invention, and these variations and combinations remain within the scope of protection of the invention.

Claims

1. A method for detecting duplicate defect reports based on an LDA model, characterized by comprising the following steps:

A. Selecting a number of reports from a duplicate defect report data set to build a training sample space, and building a topic model on the training sample space using the LDA method;

B. Randomly selecting a number of reports from the duplicate defect report data set to build a detection sample space, performing topic inference on the detection sample space according to the topic model built in step A, and calculating the similarity of the execution information of the detection samples;

C. Extracting the classification information of each defect from the detection sample space of step B and comparing it level by level, to obtain the similarity of the classification information of the detection samples;

D. Calculating the weighted sum of the execution-information similarity from step B and the classification-information similarity from step C, and comparing it with a preset detection threshold to complete duplicate defect report detection.

2. The method for detecting duplicate defect reports based on an LDA model according to claim 1, characterized in that the topic model built in step A is specifically:

p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d

where D is the corpus, α is the parameter of the prior distribution over topics, β is the parameter of the distribution of words within topics, M is the number of documents, d is the document index, θ_d is the topic distribution vector of document d, N_d is the number of words in document d, z_dn is the topic of the n-th word of document d, and w_dn is the n-th word of document d.

3. The method for detecting duplicate defect reports based on an LDA model according to claim 2, characterized in that, in step B, performing topic inference on the detection sample space according to the topic model built in step A and calculating the similarity of the execution information of the detection samples specifically comprises the following sub-steps:

B1. Performing topic inference on the detection sample space to obtain the conditional probability distribution of the test samples over the topic space;

B2. Converting the conditional probability distribution of the test samples into a vector representation and calculating the similarity of the execution information of the detection samples.

4. The method for detecting duplicate defect reports based on an LDA model according to claim 3, characterized in that the formula for calculating the similarity of the execution information of the detection samples is specifically:

Sim(d, q) = \frac{\sum_{i} d_i \, q_i}{\sqrt{\sum_{i} d_i^{2} \cdot \sum_{i} q_i^{2}}}

where d_i is the value of test sample d in the i-th dimension of the topic space, and q_i is the value of test sample q in the i-th dimension of the topic space.

5. The method for detecting duplicate defect reports based on an LDA model according to claim 4, characterized in that the comparison with the preset detection threshold in step D, which completes duplicate defect report detection, is specifically: if the weighted sum is greater than the detection threshold, the detection sample is a duplicate defect report; if it is less than or equal to the detection threshold, the detection sample is not a duplicate defect report.