CN109976993A

CN109976993A - Method and system for determining defect modes based on text mining

Info

Publication number: CN109976993A
Application number: CN201711450639.XA
Authority: CN
Inventors: 吴旭; 许航
Original assignee: Aisino Corp
Current assignee: Aisino Corp
Priority date: 2017-12-27
Filing date: 2017-12-27
Publication date: 2019-07-05

Abstract

The invention discloses a method and system for determining defect modes based on text mining. The method comprises: obtaining multiple pieces of defect description text data; extracting multiple feature words from each piece of defect description text data; vectorizing the feature words of each piece of defect description text data using the term frequency-inverse document frequency (TF-IDF) algorithm to obtain a vectorized data set; performing cluster calculation on the vectorized data set to obtain multiple cluster sets; and determining the defect mode corresponding to the defect description text data according to the feature words corresponding to the vectorized data in each cluster set. The invention vectorizes the defect feature words with the TF-IDF algorithm and then clusters them with K-Means, so that potential correlations between defects can be found from the cluster analysis result and the defect mode corresponding to the defect description text data can be determined, providing an objective data basis for generating defect modes.

Description

Method and system for determining defect modes based on text mining

Technical field

The present invention relates to the field of software testing technology, and more particularly to a method and system for determining defect modes based on text mining.

Background

Software defects are an important indicator of software quality, and repairing defects promptly is important for improving user satisfaction. To manage software defects effectively, a defect tracking system is usually used to manage defects and requirements. After a piece of software is released, the corresponding defect tracking system collects the software defect description reports submitted by testers or end users, and the developers maintain and improve the software by processing these reports.

Defects manifest not only as functional failures but also in other ways. The main types are: the software does not implement a function required by the product requirement specification; the software exhibits an error that the product requirement specification states should not occur; the software implements a function that the product requirement specification does not mention; the software fails to achieve a goal that the product requirement specification does not state explicitly but that should be achieved; and the software is hard to understand, hard to use, or slow, or, from the tester's point of view, would be considered poor by the end user.

What most needs maintenance, however, are corrective defects: these defects involve modifying code and are critical to the operation of the software. Defects therefore need to be classified so that solutions can be determined more accurately.

Traditionally, defect modes are determined mostly from the structured attributes of defect reports, such as defect severity and priority, together with human experience. However, the unstructured attributes of defect reports also contain a large amount of information, especially the defect description text. Analyzing defect descriptions manually is labor-intensive and makes it hard to discover relationships between defects.

Summary of the invention

The present invention provides a method and system for determining defect modes based on text mining, to solve the problem of how to quickly determine the defect mode corresponding to a defect.

To solve the above problem, according to one aspect of the present invention, a method for determining defect modes based on text mining is provided, the method comprising:

obtaining multiple pieces of defect description text data;

extracting, from each piece of defect description text data, multiple feature words of that piece of defect description text data;

vectorizing the feature words of each piece of defect description text data using the term frequency-inverse document frequency (TF-IDF) algorithm to obtain a vectorized data set;

performing cluster calculation on the vectorized data set to obtain multiple cluster sets; and

determining the defect mode corresponding to the defect description text data according to the feature words corresponding to the vectorized data in each of the cluster sets.

Preferably, obtaining the multiple pieces of defect description text data comprises:

obtaining multiple defect reports from a defect library, and extracting the text of the defect description section from each defect report as the defect description text data.

Preferably, before the feature words are extracted from each piece of defect description text data, the method further comprises:

processing each piece of defect description text data to remove non-text content and text unrelated to the defect description.

Preferably, extracting the feature words from each piece of defect description text data comprises:

decomposing the sentences in each piece of defect description text data into multiple words using a word segmentation tool; and

removing the interference words from those words to obtain the feature words.

Preferably, vectorizing the feature words of each piece of defect description text data using the TF-IDF algorithm to obtain the vectorized data set comprises:

calculating the TF-IDF value of each feature word in each piece of text data, vectorizing each feature word in each piece of text data by converting it into a corresponding word vector, and thereby determining the vectorized data set.

Preferably, the TF-IDF value of each feature word is calculated using the following formulas:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}, \qquad \mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i,$$

where $\mathrm{TF}_{i,j}$ is the frequency with which feature word $i$ occurs in the corresponding text $j$; $\mathrm{IDF}_i$ is the inverse document frequency of word $i$; $\mathrm{TFIDF}_{i,j}$ is the TF-IDF value of feature word $i$ in text $j$; $n_{i,j}$ is the number of occurrences of feature word $i$ in text $j$; the sum over $k$ runs over all feature words in text $j$; $|D|$ is the total number of documents; and $|\{j : t_i \in d_j\}|$ is the number of documents containing feature word $t_i$.

Preferably, performing cluster calculation on the vectorized data set to obtain multiple cluster sets comprises:

dividing the word vectors in the vectorized data set into k preset clusters using the K-Means clustering algorithm, calculating the distance from each word vector to the corresponding cluster center, determining new cluster centers according to the distances, and iterating until the final cluster sets are determined.

According to another aspect of the present invention, a system for determining defect modes based on text mining is provided, the system comprising:

a defect data acquisition unit, configured to obtain multiple pieces of defect description text data;

a feature word extraction unit, configured to extract, from each piece of defect description text data, multiple feature words of that piece of defect description text data;

a vectorization processing unit, configured to vectorize the feature words of each piece of defect description text data using the TF-IDF algorithm to obtain a vectorized data set;

a clustering processing unit, configured to perform cluster calculation on the vectorized data set to obtain multiple cluster sets; and

a defect mode determination unit, configured to determine the defect mode corresponding to the defect description text data according to the feature words corresponding to the vectorized data in each of the cluster sets.

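Preferably, the defect data acquisition unit, in obtaining the multiple pieces of defect description text data, is specifically configured to obtain multiple defect reports from a defect library and to extract the text of the defect description section from each defect report as the defect description text data.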

Preferably, the system further comprises:

a data cleaning unit, configured to process each piece of defect description text data, before the feature words are extracted from it, to remove non-text content and text unrelated to the defect description.

Preferably, the feature word extraction unit, for extracting the feature words from each piece of defect description text data, comprises:

a word segmentation module, configured to decompose the sentences in each piece of defect description text data into multiple words using a word segmentation tool; and

a feature word obtaining module, configured to remove the interference words from those words to obtain the feature words.

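Preferably, the vectorization processing unit, in vectorizing the feature words of each piece of defect description text data using the TF-IDF algorithm to obtain the vectorized data set, calculates the TF-IDF value of each feature word in each piece of text data and converts each feature word into a corresponding word vector, using the following formulas:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}, \qquad \mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i,$$

where $n_{i,j}$ is the number of occurrences of feature word $i$ in text $j$, the sum over $k$ runs over all feature words in text $j$, $|D|$ is the total number of documents, and $|\{j : t_i \in d_j\}|$ is the number of documents containing feature word $t_i$.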

Preferably, the clustering processing unit, in performing cluster calculation on the vectorized data set to obtain multiple cluster sets, is specifically configured to divide the word vectors in the vectorized data set into k preset clusters using the K-Means clustering algorithm, calculate the distance from each word vector to the corresponding cluster center, determine new cluster centers according to the distances, and iterate until the final cluster sets are determined.

The beneficial effects of the invention are as follows:

The present invention provides a method and system for determining defect modes based on text mining. Compared with traditional defect classification methods, it processes the text of the selected defect description reports, extracts defect feature words, vectorizes the defect feature words with the TF-IDF algorithm, clusters them with the K-Means clustering algorithm, and quickly determines the defect mode corresponding to the defect description text data from the cluster analysis result. The technical solution can find potential correlations between defects and determine the defect mode corresponding to the defect description text data, providing an objective data basis for generating defect modes.

Brief description of the drawings

Exemplary embodiments of the present invention can be understood more fully by reference to the following drawings:

Fig. 1 is a flow chart of a method 100 for determining defect modes based on text mining according to an embodiment of the present invention;

Fig. 2 is a flow chart of cluster analysis using the K-Means clustering algorithm according to an embodiment of the present invention; and

Fig. 3 is a schematic structural diagram of a system 300 for determining defect modes based on text mining according to an embodiment of the present invention.

Detailed description of embodiments

Exemplary embodiments of the present invention are now described with reference to the drawings. The present invention can, however, be implemented in many different forms and is not limited to the embodiments described here; these embodiments are provided so that the disclosure is thorough and complete and fully conveys the scope of the invention to those of ordinary skill in the art. The terms used in the exemplary embodiments illustrated in the drawings do not limit the invention. In the drawings, identical units or elements use identical reference numerals.

Unless otherwise indicated, the terms used here (including scientific and technical terms) have the meanings commonly understood by those of ordinary skill in the art. In addition, terms defined in commonly used dictionaries should be understood to have meanings consistent with the context of the related field, and should not be interpreted in an idealized or overly formal sense.

Fig. 1 is a flow chart of a method 100 for determining defect modes based on text mining according to an embodiment of the present invention. As shown in Fig. 1, the method addresses the problem of how to quickly determine the defect mode corresponding to a defect. Compared with traditional defect classification methods, the embodiment processes the text of the selected defect description reports, extracts defect feature words, vectorizes the defect feature words with the term frequency-inverse document frequency (TF-IDF) algorithm, clusters them with the K-Means clustering algorithm, and quickly determines the defect mode corresponding to the defect description text data from the cluster analysis result. The technical solution can find potential correlations between defects and determine the defect mode corresponding to the defect description text data, providing an objective data basis for generating defect modes. The method 100 starts at step 101, in which multiple pieces of defect description text data are obtained.

In embodiments of the present invention, defect reports can be obtained from a defect library according to query conditions, and the text of the defect description section is extracted from each report as the defect description text data. When querying the defect library, all defect report records can be obtained, or only part of them as needed. For example, only the defect reports of a particular tester can be obtained by tester name, or all defect reports of a particular project can be obtained by project name.
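As an illustration of step 101, the following is a minimal sketch of selecting defect descriptions by tester or project. The record layout (`DefectReport` and its fields) and the function `select_descriptions` are hypothetical; the patent does not specify the schema of the defect library or its query interface.

```python
from dataclasses import dataclass


@dataclass
class DefectReport:
    # Hypothetical record layout; the source does not define the schema.
    project: str
    tester: str
    severity: str
    description: str


def select_descriptions(reports, tester=None, project=None):
    """Return the description text of every report matching the optional filters."""
    selected = []
    for report in reports:
        if tester is not None and report.tester != tester:
            continue
        if project is not None and report.project != project:
            continue
        selected.append(report.description)
    return selected


reports = [
    DefectReport("billing", "alice", "high", "Invoice total is wrong after refund"),
    DefectReport("billing", "bob", "low", "Login page loads slowly"),
    DefectReport("portal", "alice", "medium", "Export button does nothing"),
]

by_tester = select_descriptions(reports, tester="alice")      # one tester only
by_project = select_descriptions(reports, project="billing")  # one project only
everything = select_descriptions(reports)                     # all records
```

Passing no filter returns every record, which corresponds to obtaining all defect report records.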

In step 102, multiple feature words are extracted from each piece of defect description text data.

To determine the defect mode more accurately, in embodiments of the present invention each piece of obtained defect description text data is cleaned to remove the parts unrelated to the defect description, for example the title and line numbers of each defect description text. A word segmentation tool then decomposes the sentences of each cleaned piece of defect description text data into multiple words, and the interference words among them are removed to obtain the feature words. Interference words include proper nouns related to the system under test and stop words. Stop words are very frequent words that do not help the search results and must be filtered out, such as "is" and "and".
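The cleaning, segmentation, and interference-word removal of step 102 can be sketched as follows. This is only an illustration under stated assumptions: it uses English text and whitespace tokenization in place of the word segmentation tool (which the patent does not name), and the stop-word list, the system term `taxsys`, and the `Title:` line convention are all invented for the example.

```python
import re

# Illustrative lists; a real deployment would supply its own.
STOP_WORDS = {"the", "is", "and", "a", "of", "in", "to", "when"}
SYSTEM_TERMS = {"taxsys"}  # hypothetical proper noun of the system under test


def clean(text):
    """Drop title lines, leading line numbers, and non-letter characters."""
    kept = []
    for line in text.splitlines():
        if line.lower().startswith("title:"):
            continue  # the title is not part of the defect description
        kept.append(re.sub(r"^\s*\d+[.)]?\s*", "", line))  # strip a leading line number
    return re.sub(r"[^A-Za-z\s]", " ", " ".join(kept))


def tokenize(text):
    """Stand-in for a word segmentation tool: lowercase and split on whitespace."""
    return text.lower().split()


def feature_words(text):
    """Clean, segment, and remove interference words."""
    interference = STOP_WORDS | SYSTEM_TERMS
    return [word for word in tokenize(clean(text)) if word not in interference]


raw = (
    "Title: Bug 17\n"
    "1. The TaxSys invoice page crashes when saving\n"
    "2. Error code 500 is shown"
)
words = feature_words(raw)
```

For Chinese defect descriptions, `tokenize` would be replaced by a call to a dedicated segmentation tool; the rest of the flow stays the same.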

In step 103, the feature words of each piece of defect description text data are vectorized using the TF-IDF algorithm to obtain a vectorized data set. The TF-IDF value of each feature word is calculated as:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}, \qquad \mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i,$$

where $\mathrm{TF}_{i,j}$ is the frequency with which feature word $i$ occurs in the corresponding text $j$; $\mathrm{IDF}_i$ is the inverse document frequency of word $i$; $\mathrm{TFIDF}_{i,j}$ is the TF-IDF value of feature word $i$ in text $j$; $n_{i,j}$ is the number of occurrences of feature word $i$ in text $j$; the sum over $k$ runs over all feature words in text $j$; $|D|$ is the total number of documents; and $|\{j : t_i \in d_j\}|$ is the number of documents containing feature word $t_i$.

TF-IDF assesses how important a word is to one document in a document set or corpus. The larger the TF-IDF value, the greater the weight of the word in the corresponding document. Once the TF-IDF values of all feature words in a document have been calculated, the vectorization of that document is complete.
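A minimal sketch of this vectorization in plain Python follows. The function name `tfidf_vectors` and the sample documents are hypothetical, and the base-10 logarithm is an assumption chosen to match the numeric example given in the description; the patent does not prescribe an implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of feature-word lists. Returns (vocabulary, one vector per document)."""
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log10(n_docs / df[word]) for word in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # TF is the count of the word divided by the number of words in the document.
        vectors.append(
            [(counts[word] / total) * idf[word] if total else 0.0 for word in vocab]
        )
    return vocab, vectors


docs = [
    ["login", "fails", "login"],
    ["export", "fails"],
    ["export", "slow"],
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one row whose columns follow the sorted vocabulary, so rows from different documents are directly comparable by distance.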

The main idea of TF-IDF is that if a word or phrase occurs with a high frequency (TF) in one article but rarely in other articles, it is considered to discriminate well between categories and to be suitable for classification.

Term frequency (TF) is the frequency with which a given word occurs in a given document: the number of occurrences of the word divided by the total number of words in the document or sentence. For example, suppose a file entitled "software testing technology" contains 2000 words in total. If the word "software" occurs 100 times, its term frequency in this file is 100/2000 = 0.05. If the word "test" occurs 60 times, its term frequency is 60/2000 = 0.03, and if the word "technology" occurs 40 times, its term frequency is 40/2000 = 0.02. That is, n("software testing technology", software) = 100, n("software testing technology", test) = 60, n("software testing technology", technology) = 40, TF("software testing technology", software) = 0.05, TF("software testing technology", test) = 0.03, and TF("software testing technology", technology) = 0.02. The file also contains other words; for simplicity, only these three are used for illustration.

Inverse document frequency (IDF) measures the general importance of a word. The IDF of a given word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient. For example, if the word "software" occurs in 1,000 documents and the total number of documents is 10,000,000, its inverse document frequency is log(10,000,000/1,000) = 4, taking the logarithm to base 10. Since TF("software testing technology", software) = 0.05, the TF-IDF score of the word "software" in the document "software testing technology" is 0.05 × 4 = 0.2.
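The arithmetic of this worked example can be checked with a few lines of Python. The variable names are illustrative, and the base-10 logarithm is inferred from the value 4 given in the example.

```python
import math

# Counts taken from the worked example in the text.
total_words = 2000
counts = {"software": 100, "test": 60, "technology": 40}
tf = {word: n / total_words for word, n in counts.items()}

total_documents = 10_000_000
documents_with_software = 1_000
idf_software = math.log10(total_documents / documents_with_software)

tfidf_software = tf["software"] * idf_software
print(tf, idf_software, tfidf_software)
```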

In step 104, cluster calculation is performed on the vectorized data set to obtain multiple cluster sets.

Fig. 2 is a flow chart of cluster analysis using the K-Means clustering algorithm according to an embodiment of the present invention. As shown in Fig. 2, the method comprises: randomly selecting k data points as the initial cluster centers; calculating the distance from each point to each cluster center; and assigning each point to a cluster according to the distances. It is then judged whether this is the first cluster calculation. If so, the center of each cluster is recalculated as the new cluster center, and the flow returns to the step of calculating the distance from each point to the cluster centers. Otherwise, it is judged whether the current clustering result is consistent with the previous one: if it is, the flow ends; if not, the flow returns to the step of recalculating the center of each cluster as the new cluster center. Iteration continues until the final cluster sets are determined. The distance can be calculated with the Euclidean distance formula, that is, the square root of the sum of the squared differences of the two points in each dimension:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},$$

where $x$ and $y$ are the two points and $n$ is the number of dimensions.
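The K-Means flow of Fig. 2 with Euclidean distance (the square root of the sum of squared per-dimension differences) can be sketched as below. This is a minimal illustration, not the patented implementation: the function names, the fixed random seed, the iteration cap `max_iter`, and the sample points are all assumptions made for the example.

```python
import math
import random


def euclidean(p, q):
    """Square root of the sum of squared differences in each dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly pick k data points as initial centers
    assignment = None                # None marks the first clustering pass
    for _ in range(max_iter):
        # Assign every point to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in points
        ]
        # Stop when the division matches the previous one.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centers


points = [
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
]
labels, centers = k_means(points, k=2)
```

In practice the rows of the TF-IDF vectorized data set would be passed as `points`. The cluster numbering depends on the random initial centers, so only the grouping, not the label values, is meaningful.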

In step 105, the defect mode corresponding to the defect description text data is determined according to the feature words corresponding to the vectorized data in each of the cluster sets.

In the final clustering result, each cluster is a set containing multiple feature words, and the feature word groups within the same cluster are highly similar; the corresponding typical defect mode is determined by analyzing these similar feature words. In embodiments of the present invention, multiple k values are set and tried repeatedly to obtain different clustering results, and the reasonableness of each clustering result is judged. If the result is reasonable, the defect mode corresponding to the defect description text data is determined directly from the characteristics of the feature word groups in the clustering result. If it is not, the k value is changed and clustering is performed again until the result is reasonable, and the defect mode corresponding to the defect description text data is then determined from the characteristics of the feature word groups in the clustering result.
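One way to support this analysis is to list, for each cluster, the feature words that carry the most weight, so that a reviewer can judge whether the grouping is reasonable and name the defect mode. The sketch below ranks words by mean TF-IDF weight within a cluster; that ranking criterion, the function name `top_feature_words`, and the sample data are assumptions, since the source leaves the judgment of reasonableness to the analyst.

```python
def top_feature_words(vocab, vectors, labels, top_n=3):
    """For each cluster, rank feature words by their mean TF-IDF weight."""
    summary = {}
    for cluster in sorted(set(labels)):
        members = [v for v, label in zip(vectors, labels) if label == cluster]
        mean = [sum(column) / len(members) for column in zip(*members)]
        ranked = sorted(zip(vocab, mean), key=lambda pair: pair[1], reverse=True)
        # Keep only words that actually occur in the cluster.
        summary[cluster] = [word for word, weight in ranked[:top_n] if weight > 0]
    return summary


# Illustrative vectorized data set and cluster labels.
vocab = ["crash", "login", "slow", "timeout"]
vectors = [
    [0.0, 0.4, 0.0, 0.3],
    [0.0, 0.5, 0.0, 0.2],
    [0.6, 0.0, 0.1, 0.0],
    [0.5, 0.0, 0.2, 0.0],
]
labels = [0, 0, 1, 1]
summary = top_feature_words(vocab, vectors, labels, top_n=2)
```

Running this for several k values and comparing the resulting word groups mirrors the repeated trial described above.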

Fig. 3 is a schematic structural diagram of a system 300 for determining defect modes based on text mining according to an embodiment of the present invention. As shown in Fig. 3, the system 300 comprises a defect data acquisition unit 301, a feature word extraction unit 302, a vectorization processing unit 303, a clustering processing unit 304, and a defect mode determination unit 305. The defect data acquisition unit 301 obtains multiple pieces of defect description text data.

Preferably, the system 300 further comprises a data cleaning unit, configured to process each piece of defect description text data, before the feature words are extracted from it, to remove non-text content and text unrelated to the defect description.

The feature word extraction unit 302 extracts multiple feature words from each piece of defect description text data.

The vectorization processing unit 303 vectorizes the feature words of each piece of defect description text data using the TF-IDF algorithm to obtain a vectorized data set, where the TF-IDF value of each feature word is calculated as:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}, \qquad \mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i,$$

where $n_{i,j}$ is the number of occurrences of feature word $i$ in text $j$, the sum over $k$ runs over all feature words in text $j$, $|D|$ is the total number of documents, and $|\{j : t_i \in d_j\}|$ is the number of documents containing feature word $t_i$.

The clustering processing unit 304 performs cluster calculation on the vectorized data set to obtain multiple cluster sets.

The defect mode determination unit 305 determines the defect mode corresponding to the defect description text data according to the feature words corresponding to the vectorized data in each of the cluster sets.
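The division of system 300 into units can be pictured as a pipeline of interchangeable components. The following sketch is purely illustrative: the class `DefectPatternSystem`, its field names, and the toy stand-in functions are hypothetical and only show how units 301 to 305 hand data to one another.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DefectPatternSystem:
    acquire: Callable[[], List[str]]                               # unit 301
    extract: Callable[[str], List[str]]                            # unit 302
    vectorize: Callable[[List[List[str]]], List[List[float]]]      # unit 303
    cluster: Callable[[List[List[float]]], List[int]]              # unit 304
    determine: Callable[[List[List[str]], List[int]], Dict[int, List[str]]]  # unit 305

    def run(self):
        texts = self.acquire()
        words = [self.extract(text) for text in texts]
        vectors = self.vectorize(words)
        labels = self.cluster(vectors)
        return self.determine(words, labels)


def group_words(words, labels):
    """Toy stand-in for unit 305: collect the feature words of each cluster."""
    groups = {}
    for doc_words, label in zip(words, labels):
        groups.setdefault(label, set()).update(doc_words)
    return {label: sorted(ws) for label, ws in groups.items()}


# Toy stand-ins so the pipeline runs end to end; real units would use
# word segmentation, TF-IDF, and K-Means as described in the text.
system = DefectPatternSystem(
    acquire=lambda: ["login fails", "login timeout", "export slow"],
    extract=str.split,
    vectorize=lambda docs: [[float("login" in d), float("export" in d)] for d in docs],
    cluster=lambda vecs: [0 if v[0] else 1 for v in vecs],
    determine=group_words,
)
result = system.run()
```

An optional data cleaning step could be folded into `extract` or added as a further callable before it.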

The system 300 for determining defect modes based on text mining according to this embodiment corresponds to the method 100 for determining defect modes based on text mining according to another embodiment of the present invention, and the details are not repeated here.

The present invention has been described with reference to a small number of embodiments. However, as is well known to those skilled in the art, and as defined by the appended patent claims, embodiments other than those disclosed above equally fall within the scope of the invention.

Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/the [device, component, etc.]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

Claims

1. A method for determining defect modes based on text mining, characterized in that the method comprises:

obtaining multiple pieces of defect description text data;

extracting, from each piece of defect description text data, multiple feature words of that piece of defect description text data;

vectorizing the feature words of each piece of defect description text data using the term frequency-inverse document frequency (TF-IDF) algorithm to obtain a vectorized data set;

performing cluster calculation on the vectorized data set to obtain multiple cluster sets; and

determining the defect mode corresponding to the defect description text data according to the feature words corresponding to the vectorized data in each of the cluster sets.

2. The method according to claim 1, characterized in that obtaining the multiple pieces of defect description text data comprises:

obtaining multiple defect reports from a defect library, and extracting the text of the defect description section from each defect report as the defect description text data.

3. The method according to claim 1, characterized in that, before the feature words are extracted from each piece of defect description text data, the method further comprises:

processing each piece of defect description text data to remove non-text content and text unrelated to the defect description.

4. The method according to claim 1, characterized in that extracting the feature words from each piece of defect description text data comprises:

decomposing the sentences in each piece of defect description text data into multiple words using a word segmentation tool; and

removing the interference words from those words to obtain the feature words.

5. The method according to claim 1, characterized in that vectorizing the feature words of each piece of defect description text data using the TF-IDF algorithm to obtain the vectorized data set comprises:

calculating the TF-IDF value of each feature word in each piece of text data, vectorizing each feature word in each piece of text data by converting it into a corresponding word vector, and thereby determining the vectorized data set.

6. The method according to claim 5, characterized in that the TF-IDF value of each feature word is calculated using the following formulas:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}, \qquad \mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i,$$

where $\mathrm{TF}_{i,j}$ is the frequency with which feature word $i$ occurs in the corresponding text $j$; $\mathrm{IDF}_i$ is the inverse document frequency of word $i$; $\mathrm{TFIDF}_{i,j}$ is the TF-IDF value of feature word $i$ in text $j$; $n_{i,j}$ is the number of occurrences of feature word $i$ in text $j$; the sum over $k$ runs over all feature words in text $j$; $|D|$ is the total number of documents; and $|\{j : t_i \in d_j\}|$ is the number of documents containing feature word $t_i$.

7. The method according to claim 5, characterized in that performing cluster calculation on the vectorized data set to obtain multiple cluster sets comprises:

dividing the word vectors in the vectorized data set into k preset clusters using the K-Means clustering algorithm, calculating the distance from each word vector to the corresponding cluster center, determining new cluster centers according to the distances, and iterating until the final cluster sets are determined.

8. A system for determining defect modes based on text mining, characterized in that the system comprises:

a defect data acquisition unit, configured to obtain multiple pieces of defect description text data;

a feature word extraction unit, configured to extract, from each piece of defect description text data, multiple feature words of that piece of defect description text data;

a vectorization processing unit, configured to vectorize the feature words of each piece of defect description text data using the term frequency-inverse document frequency (TF-IDF) algorithm to obtain a vectorized data set;

a clustering processing unit, configured to perform cluster calculation on the vectorized data set to obtain multiple cluster sets; and

a defect mode determination unit, configured to determine the defect mode corresponding to the defect description text data according to the feature words corresponding to the vectorized data in each of the cluster sets.

9. The system according to claim 8, characterized in that the defect data acquisition unit, in obtaining the multiple pieces of defect description text data, is specifically configured to obtain multiple defect reports from a defect library and to extract the text of the defect description section from each defect report as the defect description text data.

10. The system according to claim 8, characterized in that the system further comprises:

a data cleaning unit, configured to process each piece of defect description text data, before the feature words are extracted from it, to remove non-text content and text unrelated to the defect description.

11. The system according to claim 8, characterized in that the feature word extraction unit, for extracting the feature words from each piece of defect description text data, comprises:

a word segmentation module, configured to decompose the sentences in each piece of defect description text data into multiple words using a word segmentation tool; and

a feature word obtaining module, configured to remove the interference words from those words to obtain the feature words.

12. The system according to claim 8, characterized in that the vectorization processing unit, in vectorizing the feature words of each piece of defect description text data using the TF-IDF algorithm to obtain the vectorized data set, is specifically configured to calculate the TF-IDF value of each feature word in each piece of text data, vectorize each feature word in each piece of text data by converting it into a corresponding word vector, and thereby determine the vectorized data set.

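13. The system according to claim 12, characterized in that the TF-IDF value of each feature word is calculated using the following formulas:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad \mathrm{IDF}_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}, \qquad \mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i,$$

where $n_{i,j}$ is the number of occurrences of feature word $i$ in text $j$, the sum over $k$ runs over all feature words in text $j$, $|D|$ is the total number of documents, and $|\{j : t_i \in d_j\}|$ is the number of documents containing feature word $t_i$.

14. The system according to claim 12, characterized in that the clustering processing unit, in performing cluster calculation on the vectorized data set to obtain multiple cluster sets, is specifically configured to divide the word vectors in the vectorized data set into k preset clusters using the K-Means clustering algorithm, calculate the distance from each word vector to the corresponding cluster center, determine new cluster centers according to the distances, and iterate until the final cluster sets are determined.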