CN110502361A

CN110502361A - Fine-grained defect localization method for bug reports

Info

Publication number: CN110502361A
Application number: CN201910806392.3A
Authority: CN
Inventors: 张婧玉; 孙小兵; 杨硕; 陈天浩; 曹琛
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2019-08-29
Filing date: 2019-08-29
Publication date: 2019-11-26
Anticipated expiration: 2039-08-29
Also published as: CN110502361B

Abstract

The invention, in the field of software maintenance, provides a fine-grained defect localization method for bug reports, comprising the following steps: first, defect description feature vectors are obtained from historical bug reports and a high-frequency word corpus is constructed; the code blocks containing the defects are extracted and code semantic matrices are constructed; the feature vectors and code semantic matrices are fed into a neural network to obtain a prediction model; a second neural network model is trained on functionally similar programs; one round of semantically similar word replacement is performed on a new bug report, and the replaced feature vector is fed into the code-construction prediction model, which outputs a code semantic matrix; the source code corresponding to the new bug report is divided into code basic blocks, each of which is encoded as a code semantic matrix, forming a list of code semantic matrices; finally, the similarity between code semantic matrices is compared to obtain the most similar code block. The invention narrows the defect localization range to the level of atomic code blocks, making software defect repair more efficient.

Description

Fine-grained defect localization method for bug reports

Technical field

The present invention relates to a software maintenance method, in particular to a defect localization method for bug reports, and belongs to the field of software maintenance.

Background art

With the rapid development of software technology, software has grown in scale and complexity, and the number and variety of defective software products have multiplied accordingly. The traditional approach of manually analyzing the cause of a defect, manually locating it, and manually repairing it can no longer meet modern demands on software maintenance. How to cope with the ever-growing number of bug reports and size of code files, and how to improve the efficiency of software defect localization and repair, has become a focus of research.

Many advanced techniques currently exist for localizing software defects from bug reports. The first class is based on information retrieval: it improves the vector space model and combines it with historical bug reports to localize defects. The second class improves localization performance by using code segmentation and stack trace analysis. The third class improves the tf-idf computation and uses structural information in the source code to retrieve and compute similarity with the summary and description sections of a bug report. All of these approaches treat bug reports and source code files as plain text. However, bug reports are written in natural language, whereas source code files are written in a programming language; considering only the textual similarity between the two while ignoring semantic information degrades localization and makes fine-grained defect localization impossible. Defect localization is a key step in defect repair; localizing as precisely as possible to a smaller code region effectively improves repair efficiency and reduces software maintenance cost.

Summary of the invention

The object of the present invention is to provide a fine-grained defect localization method for bug reports that represents defective code at fine granularity, narrows the localization range to the level of atomic code blocks, and uses a deep neural network to automatically learn the association between bug reports and defective code, thereby localizing software defects automatically.

The object of the present invention is achieved as follows: a fine-grained defect localization method for bug reports, comprising the following steps:

Step 1: process the natural language of the problem description section of historical bug reports into defect description feature vectors, which form the total description-word corpus; at the same time, select the k most frequent word vectors to construct a high-frequency word corpus;

Step 2: extract the program structure using abstract syntax tree techniques to obtain the syntax tree of the code block containing the defect in each historical source code file; based on the syntax tree, construct feature vectors of variables and basic blocks from the perspectives of control flow and data flow, then encode the code block as a code semantic matrix;

Step 3: construct a deep neural network from the defect description feature vectors of step 1 and the code semantic matrices of step 2, and train it to obtain a code-construction prediction model;

Step 4: construct a neural network from functionally similar programs, and train it to obtain a code similarity comparison model;

Step 5: based on the high-frequency word corpus of step 1, perform one round of semantically similar word replacement on the defect description feature vector of a new bug report, and feed the replaced feature vector into the code-construction prediction model of step 3, which outputs a code semantic matrix;

Step 6: using abstract syntax tree techniques, divide all source code files of the new bug report into code basic blocks and encode each as a code semantic matrix, forming a list of code semantic matrices;

Step 7: using the model obtained in step 4, rank the code semantic matrix list of step 6 by the similarity between each of its matrices and the code semantic matrix of step 5, and output the code basic blocks corresponding to the n most similar code semantic matrices.
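
For illustration only, the ranking of step 7 could be sketched in Python as follows. The trained comparison model of step 4 is not reproduced here; a plain cosine similarity over flattened matrices stands in for it, and the matrices are assumed to hold plain numbers. The function names and block names are hypothetical.

```python
import math


def flatten(matrix):
    """Flatten a 2-D matrix (list of rows) into one list of numbers."""
    return [value for row in matrix for value in row]


def cosine_similarity(a, b):
    """Stand-in similarity; the method itself uses the trained model of step 4."""
    u, v = flatten(a), flatten(b)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_blocks(predicted, candidates, n):
    """Return the names of the n candidate blocks most similar to `predicted`.

    `candidates` maps a (hypothetical) block name to its code semantic matrix.
    """
    ranked = sorted(
        candidates,
        key=lambda name: cosine_similarity(predicted, candidates[name]),
        reverse=True,
    )
    return ranked[:n]
```

Swapping `cosine_similarity` for a learned scoring function would leave `rank_blocks` unchanged, which is the only point of the sketch.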

As a further limitation of the invention, the historical bug reports and historical source code files of steps 1 and 2 are extracted from a library of repaired defects, and a correspondence exists between the historical bug reports and the historical source code files.

As a further limitation of the invention, the specific steps for constructing the high-frequency word corpus in step 1 are as follows:

Step 1-1: apply natural language processing operations such as word segmentation, stop-word removal, and stemming to the historical bug reports to obtain word lists; word segmentation includes removing punctuation marks (for example, "!" and ";") and converting a complete sentence into a word list; stop-word removal deletes words that are used frequently but contribute little to the semantics of the sentence (such as "the", "and", "a"); stemming reduces each derived word in the list to its base form (for example, "opening" becomes "open");

Step 1-2: convert each word to a vector representation based on its context, and build all vectors into the total description-word corpus;

Step 1-3: select the k most frequent word vectors to form the high-frequency word corpus.
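
As a rough sketch only, steps 1-1 and 1-3 might look like the following in Python. The stop-word list, the suffix-stripping stemmer, and all function names are assumptions made for the example and are not taken from the specification; step 1-2 (conversion to word vectors) is left out, so the sketch ranks words rather than vectors.

```python
import re
from collections import Counter

# Hypothetical, very small stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "and", "a", "an", "is", "in", "of", "to"}


def stem(word):
    """Naive suffix stripping, used here only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Step 1-1: tokenize (dropping punctuation), remove stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


def high_frequency_words(reports, k):
    """Step 1-3: return the k most frequent preprocessed words."""
    counts = Counter(word for report in reports for word in preprocess(report))
    return [word for word, _ in counts.most_common(k)]
```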

As a further limitation of the invention, the code block containing the defect in step 2 is delimited at the granularity of a code basic block;

A code basic block is defined as a sequence of statements that the program executes sequentially, with exactly one entry and one exit; if the first statement of the basic block is executed, every statement in the block is executed exactly once, in order.
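
The definition above matches the usual compiler notion of a basic block. A minimal sketch of partitioning a linear statement list under that definition is given below; the statement representation (text, whether it branches, whether it is a jump target) is a hypothetical simplification, and the specification does not say how its own partitioning is implemented.

```python
def split_basic_blocks(statements):
    """Partition statements into basic blocks.

    Each statement is a tuple (text, is_branch, is_jump_target).
    A new block starts at the first statement, at any jump target,
    and immediately after any branch, so each block has a single
    entry and a single exit.
    """
    blocks, current = [], []
    for text, is_branch, is_jump_target in statements:
        if is_jump_target and current:
            # A jump target is a second entry point, so close the open block.
            blocks.append(current)
            current = []
        current.append(text)
        if is_branch:
            # A branch is an exit point, so the block ends here.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks
```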

As a further limitation of the invention, the specific steps for extracting the program structure using the abstract syntax tree in step 2 include:

Step 2-1: represent the source code before and after the modification as abstract syntax trees;

Step 2-2: compare the abstract syntax trees of the code before and after the modification, and locate the basic block containing the defect in the source code before the modification;

Step 2-3: obtain the code basic block containing the defect, and collect all defective basic blocks into a defective basic block set.
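
The comparison in steps 2-1 and 2-2 can be illustrated with a small sketch. Python's standard `ast` module is used here purely as a stand-in parser (the projects in the embodiment are Java), and top-level statements are treated as a rough approximation of basic blocks; both choices are assumptions of the example.

```python
import ast


def changed_statements(before_src, after_src):
    """Return the top-level statements of `before_src` that have no
    structurally identical counterpart in `after_src`.

    ast.dump() omits line numbers by default, so statements are compared
    by structure rather than by position in the file.
    """
    before_nodes = ast.parse(before_src).body
    after_dumps = {ast.dump(node) for node in ast.parse(after_src).body}
    return [
        ast.unparse(node)
        for node in before_nodes
        if ast.dump(node) not in after_dumps
    ]
```

Statements reported by `changed_statements` are candidates for the defective basic block set of step 2-3. `ast.unparse` requires Python 3.9 or later.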

As a further limitation of the invention, the specific steps for encoding a code block as a code semantic matrix in step 2 are:

Step 2-4: from the extracted code block, extract its control flow and data flow information, including variable types, basic block types, operation types, and so on;

Step 2-5: construct feature vectors from this information; the feature vectors represent the features of the variables and basic blocks in the code block;

Step 2-6: build the semantic matrix with the feature vectors as matrix elements and the variables or basic blocks as rows and columns.
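
The specification does not give the concrete encoding, so the following Python sketch only shows one possible reading of steps 2-5 and 2-6. The numeric codes, the `relations` set, and the rule that unrelated pairs receive a zero vector are all assumptions of the example.

```python
# Hypothetical numeric codes; the specification does not define an encoding.
KIND_CODES = {"variable": 1, "block": 2}
TYPE_CODES = {"int": 1, "if": 2}


def feature_vector(entity):
    """Step 2-5: encode one variable or basic block as [kind, type]."""
    return [KIND_CODES[entity["kind"]], TYPE_CODES[entity["type"]]]


def semantic_matrix(entities, relations):
    """Step 2-6: rows and columns are entities; each cell is a feature vector.

    A cell holds the concatenated features of (row, column) when that pair
    appears in `relations`, and a zero vector otherwise.
    """
    names = [entity["name"] for entity in entities]
    feats = {entity["name"]: feature_vector(entity) for entity in entities}
    zero = [0, 0, 0, 0]
    return [
        [feats[r] + feats[c] if (r, c) in relations else list(zero) for c in names]
        for r in names
    ]
```

For example, with an `if` block B1 that reads two `int` variables V2 and V3, `relations` would contain `("B1", "V2")` and `("B1", "V3")`.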

As a further limitation of the invention, the specific steps in step 3 for constructing a deep neural network from the defect description feature vectors and the code semantic matrices and training it to obtain the code-construction prediction model include:

Step 3-1: take as input the bug description feature vectors $M_i$ ($i = 1, 2, \dots$) obtained from one historical bug report and the semantic matrix $T$ of the historical source code, construct the first hidden layer, and establish a preliminary connection function $a_i^{(1)}(T, M_i)$ ($i = 1, 2, \dots$) between the two kinds of input data;

Step 3-2: perform $k$ training iterations on the hidden layer of step 3-1 to refine the connection function between the feature vector and its corresponding semantic matrix, i.e. $a_i^{(n+1)}(a_1^{(n)}, a_2^{(n)}, a_3^{(n)}, \dots)$ ($i = 1, 2, \dots$; $n = 1, 2, \dots, k$);

Step 3-3: retain the connection function finally obtained from training on this historical bug report, $A(T, M_i) = a^{(k+2)}(a_1^{(k+1)}, a_2^{(k+1)}, a_3^{(k+1)}, \dots)$ ($i = 1, 2, \dots$), then select another historical bug report and train on it; repeat steps 3-1 and 3-2 over the selected historical bug report library;

Step 3-4: from the training results of the above steps, express the relation function between a defect description feature vector and its corresponding code semantic matrix, and construct a model that computes the code semantic matrix from the defect description feature vector; this model, called the code-construction prediction model, is the output of the neural network.
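
The network architecture is not specified beyond the layer notation above. Purely to make the input-output relationship concrete, the sketch below fits a single linear layer (not a deep network) that maps a flattened feature vector to a flattened semantic matrix by gradient descent. The flattening, the learning rate, and every name in the code are assumptions of the example.

```python
import random


def predict(weights, x):
    """Apply the linear map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def train_linear_map(pairs, in_dim, out_dim, epochs=200, lr=0.05, seed=0):
    """Fit weights so that predict(weights, x) approximates y for each (x, y).

    `x` plays the role of a flattened defect description feature vector and
    `y` the role of a flattened code semantic matrix.
    """
    rng = random.Random(seed)
    weights = [
        [rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)
    ]
    for _ in range(epochs):
        for x, y in pairs:
            pred = predict(weights, x)
            for o in range(out_dim):
                error = pred[o] - y[o]
                for i in range(in_dim):
                    # Gradient step on half the squared error of output o.
                    weights[o][i] -= lr * error * x[i]
    return weights
```

A multi-layer model, as the specification describes, would replace `predict` and the update rule but keep the same training loop over (feature vector, semantic matrix) pairs.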

As a further limitation of the invention, the specific steps in step 4 for constructing a neural network from functionally similar programs and training it to obtain the code similarity comparison model include:

Step 4-1: take multiple programs $M_i$ ($i = 1, 2, \dots$) that implement a given function as the input of the neural network, and construct similarity comparison functions $a_i^{(1)}(M_i)$ ($i = 1, 2, \dots$);

Step 4-2: through $k$ training iterations, refine the function that computes the similarity of programs with the same function, i.e. $a_i^{(n+1)}(a_1^{(n)}, a_2^{(n)}, a_3^{(n)}, \dots)$ ($i = 1, 2, \dots$; $n = 1, 2, \dots, k$);

Step 4-3: repeat steps 4-1 and 4-2, retain the final function $A(M_p, M_q) = \sum_i a_i^{(k+1)} / n$ ($i = 1, 2, \dots, n$), and iteratively train on these functions to obtain and output a function that computes code similarity in general.

As a further limitation of the invention, performing one round of semantically similar word replacement on the defect description feature vector of the new bug report based on the high-frequency word corpus in step 5 comprises:

Step 5-1: apply word segmentation, stop-word removal, and stemming to process the new bug report into a word list;

Step 5-2: based on the total description-word corpus, replace each word in the new bug report with its word vector in the total description-word corpus; for words that do not occur in the total description-word corpus, perform one semantically similar word replacement so that their semantic information is not lost;

Step 5-3: for each word of step 5-2 that does not occur in the total description-word corpus, search the high-frequency word corpus by cosine similarity and return the word vector most similar to it; if the similarity to that most similar word vector is below the threshold M, the high-frequency word corpus is considered to contain no word semantically close to this word, and the replacement is refused; otherwise, the replacement is made.
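
The lookup of step 5-3 can be sketched as below. The sketch assumes that a vector is already available for the unseen word (the specification does not say where it comes from) and that the high-frequency word corpus is a mapping from word to vector; the function names are hypothetical, and `threshold` plays the role of M.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def replace_unknown(word_vector, high_freq, threshold):
    """Return (word, vector) of the closest high-frequency word, or None
    when the best similarity is below `threshold` (replacement refused)."""
    best_word, best_sim = None, -1.0
    for word, vector in high_freq.items():
        sim = cosine(word_vector, vector)
        if sim > best_sim:
            best_word, best_sim = word, sim
    if best_word is None or best_sim < threshold:
        return None
    return best_word, high_freq[best_word]
```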

Compared with the prior art, the beneficial effects of the present invention are:

1. The invention proposes a new method for bug-report-oriented defect localization: it extracts the control flow and data flow information of a code block, constructs vectors of variables and basic blocks, and then represents the code basic block as a code semantic matrix, so that the syntactic and semantic information in the code are represented together rather than only one or the other;

2. The invention operates on code blocks and can localize defects to the atomic block level, greatly improving the efficiency of software defect repair;

3. The invention constructs two deep learning models, one to learn the association between bug reports and source code and one to learn the similarity between two pieces of code;

4. The invention proposes constructing a high-frequency word corpus for semantically similar word replacement in new bug reports, which reduces the loss of semantic information in the problem description section and improves localization.

Brief description of the drawings

Fig. 1 is a flow chart of the fine-grained defect localization method for bug reports of the present invention.

Fig. 2 is a partial schematic diagram of the high-frequency word corpus in the embodiment of the present invention.

Fig. 3 is a schematic diagram of constructing the code semantic matrix in the embodiment of the present invention.

Fig. 4 shows the code-construction prediction model of the present invention.

Fig. 5 shows the code similarity comparison model of the present invention.

Fig. 6 is the bug report description extracted from the AspectJ project in this embodiment.

Specific embodiment

To make the purpose, content, and advantages of the present invention clearer, specific embodiments of the invention are described in further detail below with reference to the drawings and examples.

A fine-grained defect localization method for bug reports, as shown in Fig. 1, comprises the following specific steps:

Step 1: this example is tested on data sets from the open-source projects AspectJ, Lucene, OpenJpa, and Solr. Bug reports whose status is fix (repaired) and verified (confirmed) are selected, and natural language processing steps (word segmentation, stop-word removal, and stemming) turn the natural language of their problem description sections into word lists.

Then, based on the context of each word, each word is converted into a 100-dimensional vector, forming an m×100 feature vector that describes the semantics of the bug report (m is the number of words after preprocessing; each element of the feature vector is a 100-dimensional vector).

Because projects differ from one another, their high-frequency words also differ considerably, so the invention builds a separate high-frequency word corpus for each project; in the later steps, a new bug report is likewise modified using the high-frequency word corpus of the project it belongs to.

This embodiment describes only the construction of the high-frequency word corpus for the AspectJ project; the construction for the remaining projects is similar.

The word vectors obtained in this step form the defect data set, which is divided into a training set, a validation set, and a test set in a 5:3:2 ratio.
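
A 5:3:2 split of this kind might be written as follows. Shuffling before the split and the fixed seed are assumptions of the example; the specification does not say how items are assigned to the three sets.

```python
import random


def split_dataset(items, ratios=(5, 3, 2), seed=0):
    """Split items into training, validation, and test sets by `ratios`.

    Any remainder from integer division goes to the test set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```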

According to the frequency of occurrence of the word vectors, the k most frequent word vectors are selected to build the high-frequency word corpus. Bug reports of the AspectJ project are about 200 words long after natural language preprocessing, so this embodiment retains 256 word vectors in the AspectJ corpus; Fig. 2 shows the high-frequency word corpus constructed in the present invention (first 100 entries).

Step 2: for the bug reports selected in step 1, the program structure is extracted using abstract syntax tree techniques to obtain the code block containing the defect in the historical source code; the code blocks are delimited at the granularity of a code basic block. One basic block used in this embodiment is as follows:

if (0 != (modifiers & InputEvent.BUTTON3_MASK)) {
    System.out.println("Button3");
}

Based on the code block, feature vectors of variables and basic blocks are constructed from the perspectives of control flow and data flow, and the code block is then encoded as a code semantic matrix; Table 1 gives the control flow and data flow information extracted in this embodiment:

Table 1

Variable/basic block	Value	Type
V₁	0	int
V₂	modifiers	int
V₃	BUTTON3_MASK	int
V₄	BUTTON3	int
B₁	if(){}	if

Fig. 3 shows the semantic matrix built in this embodiment from the information in Table 1, where V₀ and I₀ are zero vectors.

Step 3: a deep neural network is constructed from the feature vectors of step 1 and the code semantic matrices of step 2 and trained to obtain the code-construction prediction model; Fig. 4 is a schematic diagram of the neural network used for code prediction.

Step 4: a neural network is constructed from functionally similar programs and trained to obtain the code similarity comparison model; Fig. 5 is a schematic diagram of the neural network used in the present invention to compare code similarity.

Step 5: based on the high-frequency word corpus of step 1, one round of semantically similar word replacement is performed on the defect description feature vector of the new bug report; Fig. 6 shows the description section of the bug report with BugID 39436 from the AspectJ project used in this embodiment.

Step 5-1: natural language preprocessing is first applied to the problem description section, processing the new bug report into the following word list:

{"old", "task", "view", "show", "summary", "number", "task", "error", "warn", "info", "status", "line", "miss", "rework", "view"}

Step 5-2: based on the total description-word corpus, the 12 words that occur in the corpus are each replaced with the 100-dimensional vector obtained in step 1. The word "rework" does not occur in the corpus, so the high-frequency word corpus is searched by cosine similarity for the word vector most similar to "rework", which is that of "behavior"; the similarity between the two words is 0.9703289270401001, so the word vector of "rework" is represented by the 100-dimensional word vector of "behavior".

Finally, the replaced feature vector (13×100) is fed into the code-construction prediction model, which outputs a semantic matrix. Table 3 lists the number of times each word of the new bug report in this embodiment occurs in the entire corpus:

Step 6: using abstract syntax tree techniques, all source code files of the new bug report are divided into code basic blocks, each of which is encoded as a code semantic matrix, forming a list of code semantic matrices, as shown in the following table:

Step 7: using the model obtained in step 4, the code semantic matrix list of step 6 is ranked by similarity to the code semantic matrix of step 5, and the code basic blocks corresponding to the n most similar code semantic matrices are output as the localization result, as shown in the following table:

In this embodiment, by representing defective code at fine granularity, the invention successfully localizes the code block containing the defect, namely code block 1, greatly improving the efficiency of localization and of software defect repair.

The present invention is not limited to the above embodiment. On the basis of the disclosed technical solution, those skilled in the art can, without creative effort, make substitutions and variations to some of the technical features according to the disclosed technical content, and such substitutions and variations fall within the scope of protection of the invention.

Claims

1. A fine-grained defect localization method for bug reports, characterized by comprising the following steps:

Step 1: process the natural language of the problem description section of historical bug reports into defect description feature vectors, which form the total description-word corpus; at the same time, select the k most frequent word vectors to construct a high-frequency word corpus;

Step 2: extract the program structure using abstract syntax tree techniques to obtain the syntax tree of the code block containing the defect in each historical source code file; based on the syntax tree, construct feature vectors of variables and basic blocks from the perspectives of control flow and data flow, then encode the code block as a code semantic matrix;

Step 3: construct a deep neural network from the defect description feature vectors of step 1 and the code semantic matrices of step 2, and train it to obtain a code-construction prediction model;

Step 4: construct a neural network from functionally similar programs, and train it to obtain a code similarity comparison model;

Step 5: based on the high-frequency word corpus of step 1, perform one round of semantically similar word replacement on the defect description feature vector of a new bug report, and feed the replaced feature vector into the code-construction prediction model of step 3, which outputs a code semantic matrix;

Step 6: using abstract syntax tree techniques, divide all source code files of the new bug report into code basic blocks and encode each as a code semantic matrix, forming a list of code semantic matrices;

Step 7: using the model obtained in step 4, rank the code semantic matrix list of step 6 by the similarity between each of its matrices and the code semantic matrix of step 5, and output the code basic blocks corresponding to the n most similar code semantic matrices.

2. The fine-grained defect localization method for bug reports according to claim 1, characterized in that the historical bug reports and historical source code files of steps 1 and 2 are extracted from a library of repaired defects, and a correspondence exists between the historical bug reports and the historical source code files.

3. The fine-grained defect localization method for bug reports according to claim 1, characterized in that the specific steps for constructing the high-frequency word corpus in step 1 are as follows:

Step 1-1: apply natural language processing operations such as word segmentation, stop-word removal, and stemming to the historical bug reports to obtain word lists; word segmentation includes removing punctuation marks and converting a complete sentence into a word list; stop-word removal deletes words that are used frequently but contribute little to the semantics of the sentence; stemming reduces each derived word in the list to its base form;

Step 1-2: convert each word to a vector representation based on its context, and build all vectors into the total description-word corpus;

Step 1-3: select the k most frequent word vectors to form the high-frequency word corpus.

4. The fine-grained defect localization method for bug reports according to claim 1, characterized in that the code block containing the defect in step 2 is delimited at the granularity of a code basic block;

A code basic block is defined as a sequence of statements that the program executes sequentially, with exactly one entry and one exit; if the first statement of the basic block is executed, every statement in the block is executed exactly once, in order.

5. The fine-grained defect localization method for bug reports according to claim 1, characterized in that the specific steps for extracting the program structure using the abstract syntax tree in step 2 include:

Step 2-1: represent the source code before and after the modification as abstract syntax trees;

Step 2-2: compare the abstract syntax trees of the code before and after the modification, and locate the basic block containing the defect in the source code before the modification;

Step 2-3: obtain the code basic block containing the defect, and collect all defective basic blocks into a defective basic block set.

6. The fine-grained defect localization method for bug reports according to claim 1, characterized in that the specific steps for encoding a code block as a code semantic matrix in step 2 are:

Step 2-4: from the extracted code block, extract its control flow and data flow information, including variable types, basic block types, operation types, and so on;

Step 2-5: construct feature vectors from this information; the feature vectors represent the features of the variables and basic blocks in the code block;

Step 2-6: build the semantic matrix with the feature vectors as matrix elements and the variables or basic blocks as rows and columns.

7. The fine-grained defect localization method for bug reports according to claim 1, characterized in that the specific steps in step 3 for constructing a deep neural network from the defect description feature vectors and the code semantic matrices and training it to obtain the code-construction prediction model include:

Step 3-1: take as input the bug description feature vectors $M_i$ ($i = 1, 2, \dots$) obtained from one historical bug report and the semantic matrix $T$ of the historical source code, construct the first hidden layer, and establish a preliminary connection function $a_i^{(1)}(T, M_i)$ ($i = 1, 2, \dots$) between the two kinds of input data;

Step 3-2: perform $k$ training iterations on the hidden layer of step 3-1 to refine the connection function between the feature vector and its corresponding semantic matrix, i.e. $a_i^{(n+1)}(a_1^{(n)}, a_2^{(n)}, a_3^{(n)}, \dots)$ ($i = 1, 2, \dots$; $n = 1, 2, \dots, k$);

Step 3-3: retain the connection function finally obtained from training on this historical bug report, $A(T, M_i) = a^{(k+2)}(a_1^{(k+1)}, a_2^{(k+1)}, a_3^{(k+1)}, \dots)$ ($i = 1, 2, \dots$), then select another historical bug report and train on it; repeat steps 3-1 and 3-2 over the selected historical bug report library;

Step 3-4: from the training results of the above steps, express the relation function between a defect description feature vector and its corresponding code semantic matrix, and construct a model that computes the code semantic matrix from the defect description feature vector; this model, called the code-construction prediction model, is the output of the neural network.

8. The fine-grained defect localization method for bug reports according to claim 1, characterized in that the specific steps in step 4 for constructing a neural network from functionally similar programs and training it to obtain the code similarity comparison model include:

Step 4-1: take multiple programs $M_i$ ($i = 1, 2, \dots$) that implement a given function as the input of the neural network, and construct similarity comparison functions $a_i^{(1)}(M_i)$ ($i = 1, 2, \dots$);

Step 4-2: through $k$ training iterations, refine the function that computes the similarity of programs with the same function, i.e. $a_i^{(n+1)}(a_1^{(n)}, a_2^{(n)}, a_3^{(n)}, \dots)$ ($i = 1, 2, \dots$; $n = 1, 2, \dots, k$);

Step 4-3: repeat steps 4-1 and 4-2, retain the final function $A(M_p, M_q) = \sum_i a_i^{(k+1)} / n$ ($i = 1, 2, \dots, n$), and iteratively train on these functions to obtain and output a function that computes code similarity in general.

9. The fine-grained defect localization method for bug reports according to claim 1, characterized in that performing one round of semantically similar word replacement on the defect description feature vector of the new bug report based on the high-frequency word corpus in step 5 comprises:

Step 5-1: apply word segmentation, stop-word removal, and stemming to process the new bug report into a word list;

Step 5-2: based on the total description-word corpus, replace each word in the new bug report with its word vector in the total description-word corpus; for words that do not occur in the total description-word corpus, perform one semantically similar word replacement so that their semantic information is not lost;

Step 5-3: for each word of step 5-2 that does not occur in the total description-word corpus, search the high-frequency word corpus by cosine similarity and return the word vector most similar to it; if the similarity to that most similar word vector is below the threshold M, the high-frequency word corpus is considered to contain no word semantically close to this word, and the replacement is refused; otherwise, the replacement is made.