CN107844414A - A cross-project, parallelized defect localization method based on defect report analysis

Info

Publication number: CN107844414A
Application number: CN201610854844.1A
Authority: CN
Inventors: 许蕾; 樊田田; 周博聪; 陈林; 徐宝文
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2016-09-21
Filing date: 2016-09-21
Publication date: 2018-03-27

Abstract

The present invention is a cross-project, parallelized defect localization method based on defect report analysis. First, related projects are collected together with their source code and defect reports, and both the source code and the defect reports undergo text processing such as tokenization, stop-word removal, and stemming. Next, the lexical similarity and semantic similarity between each defect report and the source code files in the projects are computed with a vector space model method and a topic model method, to improve localization precision. A parallelized method is used to speed up the computation and improve efficiency. Finally, execution time and speed-up ratio are used to assess the time performance of the parallelized method relative to the conventional serial method, and Top-k accuracy (Accuracy@k) and mean reciprocal rank (MRR) are used to assess the localization precision achieved with the vector space model and the topic model.

Description

A cross-project, parallelized defect localization method based on defect report analysis

Technical field

The invention belongs to the field of computer technology, in particular software technology. Starting from defect reports, it provides a parallelized method for cross-project localization of software defects at file-level granularity, addressing the low precision and low efficiency of cross-project defect localization.

Background art

A program defect (bug) is a problem or error in a computer program that disrupts its normal operation, or a failure to deliver a function that the system is required to provide. During software development and maintenance, various factors make defects hard to avoid, and they occur frequently. As software grows in size, defects keep emerging. When a user encounters a defect, the user typically writes a defect report describing the problem and related information and sends it to the developers. For example, in 2015 the Eclipse development team received more than 100 defect reports per day. For each reported defect, developers often spend a great deal of time and effort locating and repairing it.

Defect localization is the process of finding where a defect resides in a software project. Research shows that software maintenance accounts for more than 75% of total cost over a project's whole life cycle, and that finding a defect is often harder and more time-consuming than repairing it. In traditional debugging, people typically locate defects with print statements, breakpoints, and similar techniques, which cost considerable time and effort. To help developers locate defects efficiently and accurately, researchers have proposed a number of automated defect localization methods since the last century. These fall into two broad classes: methods based on program analysis and methods based on defect report analysis.

Localization methods based on program analysis divide into those based on static program slicing and those based on dynamic program slicing. Static slicing starts from the erroneous statement and reasons backward along the program's control flow and data flow to locate the defect, but it can produce false positives. Dynamic slicing uses the program's execution information to decide which statements belong to the slice; this reduces the size of the slice but increases overhead.

To balance the accuracy of defect localization against its overhead, researchers have in recent years proposed a series of information-retrieval-based localization methods for single-project defect localization, which make full use of the information contained in defect reports and software source code. The goal of defect localization is to find, from the code base, the source code related to a given defect report. This is modeled as an information retrieval problem: given a defect report and the source code of a software project, compute the similarity between each source code file and the defect report with a specific algorithm, sort the source code files in descending order of similarity, and return the resulting recommendation list to the developer, who inspects the files according to the list to locate the defect.

At present, defect localization targets single projects only, which restricts it to scenarios where the user already knows which project caused the defect. According to statistics, the source code of a project chosen for single-project defect localization runs to tens of thousands of lines. Localizing one defect requires computing the similarity between the defect report and the source files of the corresponding project version, which is very time-consuming. If a user submits 500 defect reports for processing (which is common in practice), the time required exceeds 24 hours.

In practice, multiple projects within a software ecosystem are related. The Eclipse software ecosystem, for example, contains many projects, and projects may call the same software packages during development. As a result, when a user finds a defect, the user often does not know which project the defect report should be submitted to, and the difficulty and time cost of defect localization increase significantly. The present invention therefore focuses on cross-project defect localization.

Existing research on cross-project defect localization typically generates recommendation lists for many defect reports from a few code snapshots, or even a single one. A code snapshot is the state of a project's code base at a given point in time, including which source code files the code base contains and the contents of those files. However, defect reports correspond to different versions of the code base, and the source code changes continually between software versions: files in the code base may be added or removed. Using only a few code snapshots for cross-project defect localization can therefore make precision unstable in practice. We instead obtain, for every defect report, the code snapshots of the multiple projects at the time the defect was repaired, and localize each defect report against all the snapshots obtained, in order to improve precision.

Improving precision in this way multiplies the amount of source code. If the source code of each project were still examined in turn, as in single-project defect localization, the workload would multiply as well. To reduce the developers' workload, we use parallel computing to improve performance.

Parallel programming approaches include shared-memory programming, message passing, MapReduce, and others. We use the MapReduce approach. The MapReduce parallel programming model was proposed by Google and is currently a mainstream parallel programming method for big-data processing; it is widely applied in many fields, especially large-scale data processing in Internet industries such as search engines. Its basic design has three aspects. First, divide and conquer for parallel processing of big data: a large data set can be split into data blocks that undergo the same computation and have no data dependencies among them, so data with no or few mutual dependencies is partitioned into shards by some partitioning method, each shard is handed to a node for processing, and the results are finally aggregated. Second, abstraction into the Map and Reduce models: the Map operation repeatedly applies a specified processing step to a group of data records, and the Reduce operation further organizes Map's intermediate results and outputs them. Third, elevation to a framework: a unified framework hides system-level details from the programmer, and the main goal of the unified computing framework that MapReduce provides is automatic parallelization.

Summary of the invention

Building on existing work, the main problem the present invention solves is cross-project defect localization. The approach is to process the defect reports and project source files into term files, and then to compute the similarity between the defect report term files and the project source code term files with a vector space model (VSM) and a topic model (LDA), in order to localize each defect report. In addition, because defect reports are numerous and the source files of multiple projects are large in scale, the MapReduce programming method is used to improve performance.

The technical solution adopted by the present invention is to compute the similarity between defect reports and the source code files of multiple projects, using the vector space model (VSM) and the topic model (LDA). Because the source code of multiple projects is very large, we use the MapReduce parallel programming method for the similarity computation, thereby reducing computation time.

The present invention comprises the following steps:

1) parse the defect reports and the source code files in the projects to generate a term-document matrix and build the corpus;

2) localize the defect reports against the source code files of multiple projects together, i.e., use the vector space model method and the topic model method to compute the similarity between each defect report and the source code in the multiple projects;

3) use the MapReduce parallel method to localize the defect reports against the source code files of the multiple projects, and determine which project's source code files are most similar to each defect report.

Step 1) processes the defect reports and the source code files in the projects and generates the corpus. The emphasis in this step is on collecting the data set and building the corpus, as detailed below:

(1) The collected data set consists of the projects and, for each project, the repaired defect reports gathered from its defect tracking system. The projects should be open source, related to one another as far as possible, and have a complete change history.

(2) To improve the accuracy of defect localization, an independent corpus is built for each defect report from a particular code snapshot of the source code files. Because the commit that repairs a defect may also contain repair code for other defects, we use the most recent code snapshot before the repair commit as a substitute.

In practice we found that even a small project can contain hundreds of defect reports. Building an independent corpus for each defect report requires processing at least tens of thousands of source code files, which is a very large workload.

To alleviate this problem, we update the corpus incrementally. For the first defect report, all source code files in the code snapshot are processed. For each subsequent defect report, the project's version control tool extracts the corresponding code snapshot and identifies the files that changed relative to the code snapshot of the previous defect report's repair; the changed source code files are then added on top of the previous defect report's corpus to form the corpus for the current defect report. This substantially reduces time complexity and improves experimental accuracy.

(3) The corpus is built from the data set collected as described in (2). The source code files in the selected code snapshot are parsed into words and then tokenized, stripped of stop words, and stemmed to form the source code corpus; the defect reports are processed in the same way.
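The text-processing step in (3) can be sketched as follows. This is a minimal illustration only, not the patent's implementation: the stop-word list, the identifier-splitting rule, and the suffix-stripping `naive_stem` function are all assumptions standing in for whatever tokenizer, stop-word list, and stemmer (for example a Porter stemmer) a real system would use.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller list,
# typically including programming-language keywords.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "when",
              "public", "void", "int", "return"}


def split_identifier(token):
    """Split a camelCase identifier into lowercase parts (XMLFile -> xml, file)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", token)
    return [p.lower() for p in parts]


def naive_stem(word):
    """Crude suffix stripping; an assumption standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, split identifiers, drop stop words, and stem."""
    terms = []
    for raw in re.findall(r"[A-Za-z]+", text):
        for part in split_identifier(raw):
            if part not in STOP_WORDS and len(part) > 1:
                terms.append(naive_stem(part))
    return terms
```

The same `preprocess` function would be applied to both source code files and defect reports, so that the two share one vocabulary.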

Step 2) uses the vector space model method and the topic model method to compute the similarity between the defect reports and the source code files in the projects. This step uses the corpus generated in step 1); the similarity is computed with the vector space model (VSM) and the topic model (LDA). The computed similarities are then sorted and the sorted list is returned to the developer. The list labels the project each file belongs to, so that the developer can go to the corresponding project and inspect the corresponding source code file.

The vector space model is an algebraic model built on the corpus's term-document matrix. Each row of the matrix represents a distinct term in the corpus and each column represents a file; the entry at a given row and column is the weight of that row's term in that column's file, and is zero if the term does not occur in the file. Every file thus has a corresponding column vector in the term-document matrix. The similarity between two files is obtained by comparing their weight vectors, which makes the model suitable for measuring the lexical similarity of two text documents.
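As a small illustration of the term-document matrix described above, the sketch below builds one from already-preprocessed files. The weighting scheme is an assumption: the text does not say how weights are computed, so raw term counts are used here, and the function name and input layout are hypothetical.

```python
from collections import Counter


def term_document_matrix(docs):
    """Build a term-document matrix from preprocessed files.

    docs maps a file name to its list of terms. Returns (terms, files, matrix)
    where matrix[i][j] is the weight of terms[i] in files[j]. Raw term counts
    are used as weights (an assumption); absent terms get weight zero.
    """
    files = sorted(docs)
    terms = sorted({t for words in docs.values() for t in words})
    counts = {f: Counter(docs[f]) for f in files}
    matrix = [[counts[f][t] for f in files] for t in terms]
    return terms, files, matrix
```

Each column of the returned matrix is the weight vector of one file, which is what the similarity computation compares.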

The topic model LDA is a generative document-topic model with a three-layer structure of words, topics, and documents, and it is a bag-of-words model. It can be used to compute the semantic similarity between defect reports and the source code files in the projects, which helps compensate for the vector space model's weakness in handling synonyms and near-synonyms.

Step 3) uses the MapReduce parallel method to address the heavy computation encountered in step 2). Step 2) computes the similarity between every defect report and the source code files of multiple projects; this task can be executed in parallel to improve efficiency.

We use distributed storage and the MapReduce parallel method to compute the similarity between defect reports and the source code of multiple projects. Because MapReduce runs many tasks in parallel on many nodes, the computation assigned to each node is far smaller than that of a single node computing serially, which greatly shortens run time. The computation comprises the following steps:

(1) The Map process first scans a large number of sequential data elements, or records. Each record contains the terms obtained after processing the code (stop-word removal and so on) together with its label (an ID). After scanning each record, the Map process parses the string to extract the ID and the corresponding term content. The system then uses the ID as the key and automatically sends the corresponding content to the corresponding Reduce node;

(2) In the Reduce process, each node first reads the shared file and stores its content in an array for the later similarity computation. It then collects the intermediate results sent by the Map process, which are sorted and organized to ease subsequent handling, and uses them for the similarity computation, finally writing the results to the corresponding file. Concretely, the node reads the defect report file into an array, iterates over each defect report element in the array, computes its similarity with each source file assigned to the node, and writes the final results to a file.

With the above scheme, the invention has the following advantages:

(1) The invention enables cross-project defect localization. In practice, related projects often run together, so defects must be localized across projects. The invention proposes a cross-project defect localization method that updates the corpus incrementally and computes the similarity between defect reports and source code files with the vector space model (VSM) and the topic model (LDA). This addresses the very large number of source code files in cross-project defect localization and improves localization precision;

(2) The invention enables parallelized defect localization. When similarity is computed between defect reports and the source code files of multiple projects, both the source code files and the defect reports are very numerous, so serial computation is heavy and time-consuming. Because the computing tasks share essentially the same logic flow and are largely independent, they are well suited to being split into many relatively independent small tasks; we therefore use the MapReduce parallel method to improve efficiency.

Brief description of the drawings

Fig. 1 illustrates cross-project defect localization using the MapReduce parallel method. The detailed process is as follows:

(1) Initialization: the defect reports and project source code files are tokenized, stripped of stop words, and stemmed into individual terms; the defect report file is then sent to every node as a shared file;

(2) Map process: read the source code files; for each code record, use its code sequence number as the key and the term file corresponding to that sequence number as the value, and send the key-value pair to the corresponding Reduce node;

(3) Reduce process: read the defect report file into an array, iterate over each defect report element in the array, compute its similarity with each source file assigned to the node, and write the final results to a file.

Detailed description of the embodiments

The present invention comprises the following steps:

1) choose several related open-source software projects as experimental subjects, and retrieve their defect reports from the corresponding defect tracking systems to form the data set;

2) build the corresponding corpora from the data set of step 1) by incremental update;

3) compute the similarity between defect reports and source code files in the corpora with the vector space model and the topic model;

4) implement the cross-project similarity computation between defect reports and source code files with the MapReduce parallel method;

5) sort the similarities between defect reports and source code files obtained in step 4) by magnitude, and evaluate the ranking results.

When choosing open-source projects in step 1), related projects must be selected (that is, during development some functions of one project are completed by calling other projects, or the projects have similar functionality). They can be cloned from the GitHub website (https://github.com/github); as far as possible, the chosen projects should use git to control version changes. Defect reports marked as resolved fixed, verified fixed, or closed fixed are extracted from the defect tracking system. From each extracted defect report one can obtain the defect identifier, summary, description, submission time, and repair time. Because the defect tracking system does not record where in the source code a defect occurred, the source code files related to each defect must be found. The project's git commit log can be used for this: the commit identifier of a code snapshot associates a defect report with the corresponding code snapshot. The experimental subjects must meet the following four conditions:

(1) the chosen projects are developed in Java as far as possible;

(2) each project has a complete and well-kept record of its development process, because every project changes continually during development;

(3) to ensure the reliability and credibility of the experimental results, the selected projects are well known, relatively mature, and of a certain scale;

(4) the selected projects are open source and their complete source code is available.

Step 2) builds the corresponding corpora from the data set collected in step 1). The invention builds an independent corpus for each defect report, using incremental update when generating the corpora. The concrete steps are as follows:

(1) sort the defect reports of a project in ascending order of submission time;

(2) for each defect report, obtain the code snapshot identified by its corresponding commit identifier. The selected projects use git for version control, so the "git checkout" command is used to extract the code snapshot for that commit identifier;

(3) when processing the first defect report, obtain every source code file in the code snapshot and build the corresponding corpus;

(4) for every subsequent defect report, use "git diff" to obtain the files that changed between the code snapshot preceding the current defect report's repair and the code snapshot of the previous defect report, then process those files to obtain the corresponding corpus.
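The incremental update in steps (3) and (4) can be sketched as below. This is an illustration under stated assumptions rather than the patent's code: the function name and arguments are hypothetical, the corpus is modeled as a dictionary from file path to term list, and the handling of deleted files is an assumption, since the text only mentions changed files. In practice the changed and deleted paths could be taken from the output of `git diff --name-status` between the two commits.

```python
def incremental_corpus(previous_corpus, changed_files, deleted_files, preprocess):
    """Derive the corpus for the current defect report from the previous one.

    previous_corpus: dict mapping file path -> list of terms.
    changed_files:   dict mapping file path -> new file text (added or modified).
    deleted_files:   iterable of file paths removed since the previous snapshot
                     (handling deletions is an assumption of this sketch).
    preprocess:      function turning file text into a list of terms.
    """
    corpus = dict(previous_corpus)  # copy, so the earlier corpus stays intact
    for path in deleted_files:
        corpus.pop(path, None)
    for path, text in changed_files.items():
        corpus[path] = preprocess(text)  # only changed files are reprocessed
    return corpus
```

Only the files reported as changed are reprocessed, which is where the saving over rebuilding every corpus from scratch comes from.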

Step 3) computes the similarity between the defect reports and the source code files in the corpora generated in step 2), using the vector space model and the topic model.

In the vector space model method, the similarity between a defect report r and a source code file d is obtained as the cosine similarity of their corresponding weight vectors. The formula is:

$$\mathrm{Sim}(r, d) = \cos(\vec{V}_r, \vec{V}_d) = \frac{\vec{V}_r \cdot \vec{V}_d}{|\vec{V}_r|\,|\vec{V}_d|}$$

where $\vec{V}_r$ and $\vec{V}_d$ are the weight vectors of defect report r and source code file d, respectively, and $\vec{V}_r \cdot \vec{V}_d$ is the inner product of the two weight vectors.
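A minimal sketch of the cosine computation and the project-labeled ranking follows. It assumes raw term counts as weights (the weighting scheme is not specified in the text) and uses hypothetical function names and a hypothetical corpus layout keyed by (project, file).

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine of the angle between two term-count vectors."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def rank_files(report_terms, corpus):
    """Rank source files of several projects against one defect report.

    corpus maps (project, file path) -> list of terms. Each entry of the
    result carries its project name, so the developer knows where to look.
    """
    scored = [
        (cosine_similarity(report_terms, terms), project, path)
        for (project, path), terms in corpus.items()
    ]
    return sorted(scored, key=lambda entry: entry[0], reverse=True)
```

Calling `rank_files` with the terms of one defect report and a corpus spanning several projects yields a single recommendation list across all of them.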

The topic model (LDA) is a probabilistic generative topic model. It associates each document in the corpus with a probability distribution over a set of latent topics, and each latent topic with a probability distribution over the words of the vocabulary. The LDA topic model with Gibbs sampling is used to obtain the document-topic distribution of each source code document, and the similarity between defect report r and source code file d_i is then computed with the following formula:

where r_k is the k-th word in defect report r.
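The formula itself does not appear in this text. One form commonly used in LDA-based retrieval, given here purely as an assumption and not as the patent's own equation, scores a source code file by the likelihood that its topic mixture generates the words of the defect report. The symbols T (number of topics) and z (a topic) are introduced for illustration only:

$$\mathrm{Sim}(r, d_i) = P(r \mid d_i) = \prod_{r_k \in r} P(r_k \mid d_i), \qquad P(r_k \mid d_i) = \sum_{z=1}^{T} P(r_k \mid z)\, P(z \mid d_i)$$

Here $P(z \mid d_i)$ would come from the document-topic distribution of d_i and $P(r_k \mid z)$ from the topic-word distribution, both estimated by Gibbs sampling.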

Step 4) addresses the cost of the similarity computation. When the similarity between defect reports and source code files is computed with the vector space model and the topic model, each defect report has its own corpus. The conventional approach processes the defect reports serially, each against its own corpus, which takes a very long time. In the present invention, we observe that the similarity computation for each defect report and its corpus is independent of the others, so we compute them in parallel. The implementation steps are as follows:

(1) Initialization: send the defect reports to every node as a shared file;

(2) Map process: read the project source code files; for each code record, use its code sequence number as the key and the term file corresponding to that sequence number as the value, and send the key-value pair to the corresponding Reduce node. The Map process is implemented as follows:

(3) Reduce process: read the defect report file into an array, iterate over each defect report element in the array, compute its similarity with each source file assigned to the node, and write the final results to a file. The Reduce class is implemented as follows:
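The original code listings for the Map and Reduce steps do not appear in this text. The single-process Python sketch below only imitates the data flow of steps (1) to (3) and is not the patent's implementation. The record layout (ID, tab, space-separated terms), the CRC-based partitioning, and all function names are assumptions; a real job would run on a MapReduce framework, which performs the shuffle itself.

```python
import math
import zlib
from collections import Counter, defaultdict


def cosine(terms_a, terms_b):
    """Cosine similarity over raw term counts (weighting is an assumption)."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def map_record(line):
    """Map: parse 'id<TAB>term term ...' into (key=id, value=terms)."""
    code_id, _, terms = line.partition("\t")
    return code_id, terms.split()


def shuffle(pairs, num_reducers):
    """Stand-in for the framework's partitioner: route each key to a reducer."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[zlib.crc32(key.encode()) % num_reducers].append((key, value))
    return buckets


def reduce_partition(records, reports):
    """Reduce: compare every shared defect report with this node's source files."""
    return [
        (report_id, code_id, cosine(report_terms, code_terms))
        for report_id, report_terms in reports
        for code_id, code_terms in records
    ]


def run_job(lines, reports, num_reducers=2):
    """Run map, shuffle, and reduce in one process; reports is the shared file."""
    buckets = shuffle((map_record(line) for line in lines), num_reducers)
    results = []
    for records in buckets.values():
        results.extend(reduce_partition(records, reports))
    return results
```

Because each reducer only needs the shared defect reports and its own source records, the partitions can be processed independently, which is the property the parallelization relies on.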

Step 5) evaluates how well our method localizes defect reports across multiple projects, using the following evaluation metrics.

(1) Execution time and speed-up ratio. The execution time is the time taken to localize all defect reports across the projects. It comprises TimeS, the execution time of serial cross-project localization, and TimeP, the execution time of cross-project localization with the MapReduce parallel method. The speed-up ratio (SpeedRatio) measures the improvement of the parallel execution time over the serial execution time. The formula is:

$$\mathrm{SpeedRatio} = \frac{TimeS}{TimeP}$$

The larger the value of SpeedRatio, the greater the time-performance gain from parallelized defect localization.

(2) Top-k accuracy (Accuracy@k). Top-k accuracy is the probability that localization succeeds when the first k files of the recommendation list generated by the defect localization method are inspected. In our evaluation, localization of a defect report is considered successful if at least one of the first k source code files in the recommendation list generated by the method is related to that defect report. The formula is:

$$\mathrm{Accuracy@}k = \frac{|R_k|}{n}$$

where n is the total number of defect reports used in the evaluation, and |R_k| is the number of those n defect reports that are successfully localized when the method makes a top-k recommendation. The larger the value of Accuracy@k, the better the localization performance of the method.

(3) Mean reciprocal rank (MRR). MRR judges the performance of a defect localization method by the reciprocal of the position of the first source code file related to the defect report in the recommendation list that the method generates. The formula is:

$$\mathrm{MRR} = \frac{1}{n}\sum_{j=1}^{n}\frac{1}{first_j}$$

where n is the total number of defect reports and first_j is the position, in the recommendation list for the j-th defect report, of the first source code file related to that report. The larger the value of MRR, the better the performance of the defect localization method.
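The three metrics can be computed with a few lines of code. The sketch below is illustrative: the function names are hypothetical, ranks are 1-based, and treating a report with no related file in its list (rank `None`) as contributing zero to both Accuracy@k and MRR is an assumption of this example.

```python
def speed_ratio(time_serial, time_parallel):
    """SpeedRatio = TimeS / TimeP."""
    return time_serial / time_parallel


def accuracy_at_k(first_ranks, k):
    """Fraction of reports whose first related file is within the top k.

    first_ranks holds, per defect report, the 1-based rank of the first
    related source file, or None if no related file was recommended.
    """
    hits = sum(1 for rank in first_ranks if rank is not None and rank <= k)
    return hits / len(first_ranks)


def mean_reciprocal_rank(first_ranks):
    """Average of 1 / first_j over all reports (None counts as 0)."""
    total = sum(1.0 / rank for rank in first_ranks if rank is not None)
    return total / len(first_ranks)
```

For example, with first-related ranks of 1, 3, none, and 2 for four reports, one of the four is localized at k = 1 and three of the four at k = 3.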

In summary, the invention localizes defect reports across projects using a parallelized method. It addresses cross-project defect report localization in a software ecosystem, handles the large number of source code files that cross-project localization involves, improves localization efficiency, and greatly reduces time. Compared with previous methods, the proposed method covers a wider range of problems, reduces the time developers spend localizing defect reports across projects, and improves localization results.

Claims

1. A cross-project, parallelized defect localization method based on defect report analysis, characterized in that defects are localized at file-level granularity across multiple projects according to defect reports; the defect reports and the source code files in the projects undergo text processing such as tokenization, stop-word removal, and stemming; the lexical similarity and semantic similarity between the defect reports and the source code files in the projects are then computed with a vector space model method and a topic model method, to improve localization precision; and a parallelized method is used to speed up computation and improve efficiency.

2. The cross-project, parallelized defect localization method based on defect report analysis according to claim 1, characterized by comprising the following steps:

1) collect related projects and their repaired defect reports, perform text processing on the project source code and the defect reports, and generate the corpora by incremental update;

2) use the vector space model method and the topic model method to compute the similarity between defect reports and source code files across multiple projects, then sort by similarity;

3) use the MapReduce parallel method to compute the similarity between defect reports and source code files across multiple projects, sort the results, and return the sorted list to the developer.

3. The cross-project, parallelized defect localization method based on defect report analysis according to claim 2, characterized in that in step 1) the selected projects have a complete change history and are related to one another, i.e., the projects call one another or have similar functionality; the selected defect reports are those marked as resolved fixed, verified fixed, or closed fixed, and the defect identifier, summary, description, submission time, and repair time are extracted from each; corpora are built for the defect reports and the project source code, and because the number of source code files in the corpora multiplies, the corpora are built incrementally: for the first defect report, all source code files in the code snapshot are processed to build the corpus; for each later defect report, the version control tool is used to obtain the files that changed between the code snapshot of that defect report and the code snapshot of the previous defect report, and those files are then processed; a project's code snapshot is obtained by choosing, from the project's version history, the version whose commit time is closest to the defect report's repair time and taking the corresponding code snapshot.

4. The cross-project, parallelized defect localization method based on defect report analysis according to claim 2, characterized in that in step 2) the vector space model method and the topic model method are used to compute the similarity between defect reports and source code files across multiple projects; in a software ecosystem the projects are related, because projects may call the same code libraries during development, and it is not known which project a defect report should be localized in; therefore the similarity between the defect report and the source code of the multiple projects is computed and sorted by magnitude, the sorted list is returned, and the developer localizes the defect in the corresponding project according to the returned list.

5. The cross-project, parallelized defect localization method based on defect report analysis according to claim 2, characterized in that in step 3) the MapReduce parallel method is used to compute the similarity between defect reports and source code files across multiple projects; localizing each defect report with the vector space model and topic model methods is an independent task, and the logic flow of the computing tasks is essentially the same, so the work is split into many relatively independent small tasks and processed in parallel: in the Map process, the project source code files are read, and for each code record the code sequence number is used as the key and the term file corresponding to that sequence number as the value, and the key-value pair is sent to the corresponding Reduce node; in the Reduce process, the defect report file is read into an array, each defect report element in the array is iterated over and its similarity with each source file assigned to the node is computed, and the final results are written to a file.