CN107844414A - A cross-project, parallelized defect localization method based on defect report analysis - Google Patents
- Publication number
- CN107844414A (application number CN201610854844.1A)
- Authority
- CN
- China
- Prior art keywords
- defect
- defect report
- project
- report
- source code
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3692—Test management for test results analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3696—Methods or tools to render software testable
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention is a cross-project, parallelized defect localization method based on defect report analysis. First, the related projects, their source code, and their defect reports are collected, and the source code and defect reports undergo text processing: tokenization, stop-word removal, and stemming. Next, the lexical similarity and semantic similarity between each defect report and the source code files in the projects are computed with a vector space model and a topic model, to improve localization precision. A parallel method is used to speed up the computation and improve efficiency. Finally, execution time and speed-up ratio are used to assess the time performance gained by the parallel method over the conventional serial method, and Top-k accuracy (Accuracy@k) and mean reciprocal rank (MRR) are used to assess the localization precision achieved with the vector space model and the topic model.
Description
Technical field
The invention belongs to the field of computer technology, in particular software technology. Starting from defect reports, it provides a parallel method for cross-project localization of software defects at file-level granularity, addressing the low precision and low efficiency of cross-project defect localization.
Background
A software defect (bug) is a problem or error in a computer program that disrupts its normal operation, or a failure of, or deviation from, some function the system is required to implement. During software development and maintenance, various factors make defects hard to avoid, and they occur frequently. As software grows in scale, defects keep emerging. When a user runs into a defect, they usually write a defect report describing the problem and related information and send it to the developers. For example, in 2015 the Eclipse development team received more than 100 defect reports per day. For each defect found, developers often spend a great deal of time and effort locating and repairing it.
Defect localization is the process of finding where a defect lies in a software project. Research shows that over the whole life cycle of a software project, maintenance accounts for more than 75% of total cost, and finding a defect is often harder and more time-consuming than repairing it. In traditional debugging, people typically locate defects with printing, assertions, and breakpoints, which costs a great deal of time and effort. To help developers locate defects efficiently and accurately, researchers have, since the last century, proposed a number of automated defect localization methods. These fall into two main classes: methods based on program analysis and methods based on defect report analysis.
Program-analysis-based defect localization methods can be divided into those based on static program slicing and those based on dynamic program slicing. Static slicing starts from the faulty statement and reasons backward along the program's control flow and data flow to locate the defect, but it can produce false positives. Dynamic slicing uses the program's execution information to decide which statements belong to the slice; this reduces the size of the slice but increases overhead.
To balance the accuracy and the overhead of defect localization, researchers have in recent years proposed a series of information-retrieval-based defect localization methods for single projects, which make full use of the information contained in defect reports and software source code. The goal is to find, from the code base, the source code related to a given defect report. This is modeled as an information retrieval problem: given a defect report and the source code of a software project, compute the similarity between each source code file and the defect report with a specific algorithm, sort the source code files in descending order of similarity, and return the resulting recommendation list to the developer, who inspects the files in that order to locate the defect.
At present, defect localization targets only a single project, which restricts it to the scenario where the user already knows which project caused the defect. According to statistics, the project source code selected for single-project defect localization reaches tens of thousands of lines. Localizing one defect requires computing the similarity between the defect report and the source files of the corresponding project version, which is very time-consuming. If users submit 500 defect reports to be processed (which is common in practice), the time required exceeds 24 hours.
In practice, the projects in a software ecosystem are related. The Eclipse ecosystem, for example, contains many projects, and projects may call the same software packages during development. As a result, when a user finds a defect, they often do not know which project the defect report should be submitted to, and the difficulty and time of defect localization increase significantly. The present invention therefore focuses on cross-project defect localization.
Existing research on cross-project defect localization typically generates recommendation lists for many defect reports on just a few code snapshots, or even a single one. A code snapshot is the state of a project's code base at a point in time, including which source code files the code base contains and the content of those files. However, defect reports correspond to different versions of the code base, and the source code changes continually between versions: source files may be added or removed. Cross-project defect localization on a small number of code snapshots can therefore yield unstable precision in practice. We therefore plan to obtain, for each defect report, the code snapshots of the multiple projects at the time the report was fixed, and to localize the report across all of those snapshots to improve precision.
While precision improves, the amount of source code multiplies. If the source code in each project were examined in turn with the single-project approach, the workload would multiply as well. To reduce developers' workload, we plan to use parallel computing to improve performance.
Parallel programming approaches include shared-memory programming, message passing, MapReduce, and others. We use MapReduce. MapReduce was proposed by Google and is currently a mainstream parallel programming method for big-data processing; it is widely applied to large-scale data processing in many fields, especially in the Internet industry, for example in search engines. Its basic design idea has three parts. First, for parallel processing of big data it uses divide and conquer: a large data set can be divided into data blocks that undergo the same computation and have no data dependencies between them. For big data with no or only slight data dependencies, the data is therefore split into shards by some partitioning method, each shard is handed to a node for processing, and the results are finally aggregated. Second, it raises the computation to the abstract model of Map and Reduce: the Map operation is mainly responsible for applying a specific, repeated processing step to a set of data records, and the Reduce operation is mainly responsible for further organizing the intermediate results of Map and producing the output. Third, it raises this to a framework: a unified framework hides system-level details from the programmer, and the main goal of the unified computing framework that MapReduce provides is automatic parallelization.
Summary of the invention
Building on existing work, the main problem the present invention solves is cross-project defect localization. The approach is as follows: based on the defect reports and the projects' source files, compute the similarity between the processed defect report term files and the project source code term files with a vector space model (VSM) and a topic model (LDA), and use it to localize the defect reports. In addition, because the defect reports are numerous and the source files of multiple projects are large in scale, the MapReduce programming method is used to improve performance.
The technical solution adopted by the invention is to compute the similarity between a defect report and the source code files of multiple projects. The similarity is computed with a vector space model (VSM) and a topic model (LDA). Because the source code of multiple projects is very large, we use the MapReduce parallel programming method to reduce the computation time.
The invention comprises the following steps:
1) Parse the defect reports and the source code files in the projects to generate a term-document matrix and build the corpus;
2) Localize the defect reports against the source code files of multiple projects together, i.e., use the vector space model method and the topic model method to compute the similarity between each defect report and the source code in multiple projects;
3) Use the MapReduce parallel method to localize the defect reports against the source code files of multiple projects, and determine which project's source code files are most similar to each defect report.
In step 1), the defect reports and the source code files in the projects are processed to generate the corpus. The emphasis of this step is collecting the data set and building the corpus, as detailed below:
(1) The collected data set comprises the projects and, for each project, the fixed defect reports collected from its defect tracking system. As far as possible, the projects should be open source, related to one another, and have complete change-history records.
(2) To improve the accuracy of defect localization, an independent corpus is built for each defect report from one code snapshot of the source code files. Because the commit that fixes a defect may also contain fix code for other defects, we use the most recent code snapshot before the fixing commit as a substitute.
In practice we found that even a small project can contain hundreds of defect reports, and building an independent corpus for each defect report requires processing at least tens of thousands of source code files, which is a very heavy workload.
To alleviate this problem, we update the corpus incrementally. When the first defect report is handled, all source code files in its code snapshot are processed. For each subsequent defect report, the corresponding code snapshot is extracted with the project's version control tool and compared with the code snapshot of the previous defect report's fix to find the files that changed; the changed source code files are then applied on top of the previous defect report's corpus to obtain the corpus for the current defect report. This substantially reduces the time cost and improves experimental accuracy.
(3) The corpus is built from the data set collected as described in (2): the source code files in the selected code snapshot are parsed into words, tokenized, stripped of stop words, and stemmed to form the source code corpus. The same processing is applied to the defect reports.
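The text-processing step described in (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the stop-word list, the identifier splitting rule, and the `naive_stem` suffix stripper are hypothetical stand-ins (a real pipeline would use a full stop-word list and a proper stemmer such as Porter's).

```python
import re

# Hypothetical stop-word list; a real corpus would use a fuller list plus
# programming-language keywords.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "when",
              "public", "void", "return"}

def split_identifier(token):
    """Split camelCase and snake_case identifiers into lowercase words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", token).replace("_", " ")
    return [part.lower() for part in spaced.split()]

def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, split identifiers, drop stop words, and stem."""
    terms = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text):
        for word in split_identifier(token):
            if word not in STOP_WORDS and len(word) > 1:
                terms.append(naive_stem(word))
    return terms
```

The same `preprocess` function would be applied to both defect report text and source code text, so that the two share one vocabulary.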
In step 2), the similarity between the defect reports and the source code files in the projects is computed with the vector space model method and the topic model method. This uses the corpus generated in step 1); the similarity measures are the vector space model (VSM) and the topic model (LDA). The computed similarities are then sorted and the ranked list is returned to the developer. The ranked list marks which project each file belongs to, so that the developer can go to the corresponding project and inspect the corresponding source code file.
The vector space model is an algebraic model based on the corpus's term-document matrix. Each row of the matrix represents a distinct term in the corpus and each column represents a distinct file; the value at a given row and column is the weight of that row's term in that column's file. If the term does not occur in the file, the weight is zero. Each file thus has a corresponding column vector in the term-document matrix. The similarity between two files can be computed by comparing their weight vectors, which makes the model suitable for determining the lexical similarity of two text documents.
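As an illustration of the term-document matrix just described, the sketch below builds one from already-preprocessed files. It assumes raw term counts as the weights; the text here does not state the weighting scheme, and TF-IDF is a common alternative. The function and variable names are hypothetical.

```python
from collections import Counter

def term_document_matrix(docs):
    """docs: dict mapping file name -> list of terms.

    Returns (terms, files, matrix), where matrix[i][j] is the raw count of
    terms[i] in files[j]; zero means the term does not occur in that file.
    """
    files = sorted(docs)
    counts = {name: Counter(docs[name]) for name in files}
    terms = sorted({term for name in files for term in docs[name]})
    matrix = [[counts[name][term] for name in files] for term in terms]
    return terms, files, matrix

def column(matrix, j):
    """Weight vector of the j-th file (one column of the matrix)."""
    return [row[j] for row in matrix]
```

For example, with files `A.java` containing `parse file parse` and `B.java` containing `file close`, the row for `parse` is `[2, 0]` and the column vector for `A.java` is `[0, 1, 2]` over the sorted terms `close`, `file`, `parse`.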
The topic model LDA is a document-topic generative model with a three-layer structure of words, topics, and documents, and is a bag-of-words model. It can be used to compute the semantic similarity between defect reports and the source code files in the projects, and helps compensate for the vector space model's weakness in handling synonyms and near-synonyms.
Step 3) uses the MapReduce parallel method to address the heavy computation encountered in step 2). Step 2) computes the similarity between every defect report and the source code files in multiple projects; this task can be executed in parallel to improve efficiency.
We compute the similarity between the defect reports and the source code of multiple projects using distributed storage and the MapReduce parallel method. Because MapReduce computes with multiple tasks on multiple nodes in parallel, the computation assigned to each node is far smaller than that of single-node serial computation, which greatly shortens the running time. The computation comprises the following steps:
(1) The Map process first scans a large number of sequential data elements or records. Each record contains the terms obtained after stop-word removal and similar processing of the code, together with a corresponding label (an id). After scanning each record, the process parses the string to extract the id and the corresponding term content. The system then automatically sends the content to the corresponding reduce node, using the id as the key;
(2) In the Reduce process, each node first reads the shared file content and stores it in an array for the subsequent similarity computation. It then sorts and organizes the intermediate results sent by the map process, computes similarities from those intermediate results, and writes the final results to the corresponding file. Concretely, the node reads the defect report file into an array, iterates over each defect report element in the array, computes its similarity with the source file code assigned to the node, and writes the final result to a file.
With the above scheme, the invention has the following advantages:
(1) The invention achieves cross-project defect localization. In practice, some related projects run together, so defects must be localized across projects. The invention proposes a cross-project defect localization method that, by updating the corpus incrementally and computing the similarity between defect reports and source code files with the vector space model (VSM) and the topic model (LDA), addresses the huge number of source code files in cross-project defect localization and improves localization precision;
(2) The invention achieves parallelized defect localization. When computing the similarity between defect reports and the source code files of multiple projects, the source code files are extremely numerous and the defect reports are also numerous, so serial computation is very heavy and takes a great deal of time. Because the computing tasks share essentially the same logic and are highly independent, they are well suited to being split into many relatively independent small tasks, and we therefore use the MapReduce parallel method to improve efficiency.
Brief description of the drawings
Fig. 1 shows cross-project defect localization using the MapReduce parallel method. The detailed process is as follows:
(1) Initialization: the defect reports and project source code files are tokenized, stripped of stop words, and stemmed, so that they are reduced to individual terms; the defect report file is then sent to every node as a shared file;
(2) Map process: read the source code files, use the code sequence number of each code record as the key and the term file corresponding to that sequence number as the value, and send the key-value pairs to the corresponding reduce nodes;
(3) Reduce process: read the defect report file into an array, iterate over each defect report element in the array, compute its similarity with the source file code assigned to the node, and write the final result to a file.
Detailed description
The invention comprises the following steps:
1) Select several related open-source software projects as experimental subjects, and obtain their defect reports from the corresponding defect tracking systems as the data set;
2) From the data set of step 1), build the corresponding corpora by incremental update;
3) Compute the similarity between the defect reports and the source code files in the corpora with the vector space model and the topic model;
4) Use the MapReduce parallel method to compute the cross-project similarity between defect reports and source code files;
5) Sort the defect report and source code file similarities obtained in step 4) by magnitude, and evaluate the ranking results.
In step 1), the open-source projects selected must be related (that is, during development some functions of one project are completed by calling other projects, or the projects have similar functions). They can be cloned from GitHub (https://github.com/github); as far as possible, the selected projects should use git to control version changes. From the defect tracking system, extract the defect reports marked as resolved and fixed, verified and fixed, or closed and fixed. From the extracted defect reports one can obtain the defect identifier, summary, description, submission time, and fix time. Because the defect tracking system does not record where in the source code a defect occurs, the source code files related to each defect must be identified. The commits of a project can be obtained from the git commit log, and the commit label corresponding to a code snapshot is used to associate each defect report with its code snapshot. The experimental subjects must satisfy the following four conditions:
(1) The selected projects should be developed in Java as far as possible;
(2) Each project must have a complete and well-kept development record, because each project changes continually during development;
(3) To ensure the reliability and credibility of the experimental results, the selected projects should be well known, relatively mature, and of a certain scale;
(4) The selected projects are open source and their complete source code is available.
From the data set collected in step 1), step 2) builds the corresponding corpora. The invention builds an independent corpus for each defect report and generates the corpora by incremental update, as follows:
(1) Sort the project's defect reports in ascending order of submission time;
(2) For each defect report, obtain the code snapshot identified by its corresponding commit identifier. The selected projects use git for version control, so the "git checkout" command is used to extract the code snapshot for that commit identifier;
(3) When the first defect report is handled, obtain every source code file in the code snapshot and build the corresponding corpus;
(4) For each later defect report, use "git diff" to obtain the files that changed between the code snapshot before the current defect report's fix and the code snapshot of the previous defect report, then process those files to obtain the corresponding corpus.
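The incremental update in steps (3) and (4) can be sketched as below. This is an illustrative sketch, not the patent's code: snapshots are modeled as in-memory dictionaries, the `changed` and `deleted` inputs are assumed to have been derived beforehand (for instance from the output of `git diff --name-status` between the two commits), and `preprocess` is any caller-supplied text-processing function.

```python
def build_initial_corpus(snapshot, preprocess):
    """snapshot: dict path -> source text for the first report's code snapshot."""
    return {path: preprocess(text) for path, text in snapshot.items()}

def update_corpus(previous_corpus, changed, deleted, preprocess):
    """Derive the next report's corpus from the previous report's corpus.

    changed: dict path -> new text for files added or modified since the
             previous snapshot
    deleted: iterable of paths removed since the previous snapshot
    """
    corpus = dict(previous_corpus)  # copy, so the earlier corpus stays intact
    for path in deleted:
        corpus.pop(path, None)
    for path, text in changed.items():
        corpus[path] = preprocess(text)  # only changed files are reprocessed
    return corpus
```

Only the files reported as changed are reprocessed, which is where the time saving over rebuilding every corpus from scratch comes from.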
In step 3), the similarity between the defect report corpus and the source code files generated in step 2) is computed with the vector space model and the topic model.
In the vector space model method, the similarity between a defect report r and a source code file d is the cosine of the angle between their weight vectors:
$$\mathrm{sim}(r, d) = \cos(\vec{V_r}, \vec{V_d}) = \frac{\vec{V_r} \cdot \vec{V_d}}{|\vec{V_r}|\,|\vec{V_d}|}$$
where $\vec{V_r}$ and $\vec{V_d}$ are the weight vectors of defect report r and source code file d, respectively, and $\vec{V_r} \cdot \vec{V_d}$ is the inner product of the two weight vectors.
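A minimal sketch of this cosine computation, together with the project-labelled ranking described earlier, might look as follows. It assumes raw term counts as weights (the weighting scheme is not stated in this text), and the `(project, path)` key layout is a hypothetical choice for illustration.

```python
import math
from collections import Counter

def cosine(terms_a, terms_b):
    """Cosine similarity of two term lists, using raw term counts as weights."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def rank_files(report_terms, corpus):
    """corpus: dict mapping (project, path) -> term list.

    Returns [(score, project, path)] sorted by descending similarity, so each
    recommended file carries the project it belongs to.
    """
    scored = [(cosine(report_terms, terms), project, path)
              for (project, path), terms in corpus.items()]
    return sorted(scored, key=lambda item: -item[0])
```

Keeping the project name in each ranked entry lets a developer go straight to the right project when inspecting a recommended file.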
The topic model (LDA) is a probabilistic generative topic model. It associates each document in the corpus with a probability distribution over a set of latent topics, and each latent topic with a probability distribution over the vocabulary. The document-topic distributions of the source code documents are obtained with the LDA topic model method and Gibbs sampling, and the similarity between defect report r and source code file d_i is then computed from these distributions over the words of r, where r_k is the k-th word in defect report r.
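The formula itself is not reproduced in this text. One common form used in LDA-based defect localization, given here as an assumption rather than as the patent's exact formula, scores d_i by the probability that its topic mixture generates the words of r. The symbols z_t (the t-th latent topic) and T (the number of topics) are introduced only for this illustration:

```latex
\mathrm{Sim}(r, d_i) = P(r \mid d_i)
  = \prod_{k} P(r_k \mid d_i)
  = \prod_{k} \sum_{t=1}^{T} P(r_k \mid z_t)\, P(z_t \mid d_i)
```

Under this reading, P(z_t | d_i) is the document-topic distribution obtained by Gibbs sampling and P(r_k | z_t) is the topic-word distribution; in practice the product is usually evaluated as a sum of logarithms to avoid numerical underflow.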
In step 4), when the vector space model and topic model similarities between defect reports and source code files are computed, each defect report has its own corpus. The conventional approach computes the similarity between each defect report and its corpus serially, one after another, which takes a very long time. In the present invention, we observe that the similarity computations for different defect reports and their corpora are independent of one another, so we compute them in parallel. The concrete steps are as follows:
(1) Initialization: send the defect reports to every node as a shared file;
(2) Map process: read the project's source code files, use the code sequence number of each code record as the key and the term file corresponding to that sequence number as the value, and send the key-value pairs to the corresponding reduce nodes. The Map process is implemented as follows:
(3) Reduce process: read the defect report file into an array, iterate over each defect report element in the array, compute its similarity with the source file code assigned to the node, and write the final result to a file. The Reduce class is implemented as follows:
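The original Map and Reduce code is not reproduced in this text. The sketch below is a single-process Python simulation of the three steps above, offered only as an illustration and not as the patent's Hadoop implementation. The record layout (`id<TAB>terms`), the `partition` rule, and the `overlap` similarity are all hypothetical stand-ins; a real job would plug in the VSM or LDA similarity and let the framework handle partitioning.

```python
from collections import defaultdict

def map_phase(source_records):
    """Each record is 'id<TAB>space-separated terms'; emit (id, terms) pairs."""
    for record in source_records:
        file_id, _, content = record.partition("\t")
        yield file_id, content.split()

def partition(pairs, num_reducers):
    """Group pairs by key and assign each key to one reducer bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, terms in pairs:
        # Deterministic stand-in for the framework's partitioner.
        index = sum(map(ord, key)) % num_reducers
        buckets[index][key].extend(terms)
    return buckets

def overlap(a, b):
    """Placeholder similarity: Jaccard overlap of the two term sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def reduce_phase(bucket, shared_reports, similarity=overlap):
    """Compare every shared defect report with the files given to this reducer."""
    results = []
    for report_id, report_terms in shared_reports:  # shared file held as a list
        for file_id, terms in bucket.items():
            results.append((report_id, file_id, similarity(report_terms, terms)))
    return results

def run(source_records, shared_reports, num_reducers=2):
    """Run map, partition, and reduce in sequence and rank results per report."""
    output = []
    for bucket in partition(map_phase(source_records), num_reducers):
        output.extend(reduce_phase(bucket, shared_reports))
    return sorted(output, key=lambda row: (row[0], -row[2]))
```

Because each reducer only needs the shared defect reports and its own bucket of source files, the buckets can be processed on separate nodes without coordination; the simulation simply runs them one after another.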
Step 5) evaluates how well our method localizes defect reports across multiple projects, using the following evaluation metrics.
(1) Execution time and speed-up ratio. The execution time is the time taken to localize all defect reports across projects. It includes TimeS, the execution time of serial cross-project localization, and TimeP, the execution time of cross-project localization with the MapReduce parallel method. The speed-up ratio (SpeedRatio) measures the improvement of the MapReduce parallel execution time over the serial execution time:
$$\mathrm{SpeedRatio} = \frac{\mathrm{TimeS}}{\mathrm{TimeP}}$$
The larger the value of SpeedRatio, the greater the time-performance gain from parallelized defect localization.
(2) Top-k accuracy (Accuracy@k). Top-k accuracy is the probability that the defect localization method succeeds when the first k files of its recommendation list are inspected. In the evaluation of the invention, localization of a defect report is considered successful if at least one of the first k source code files in the recommendation list generated by the defect localization method is related to that defect report. The formula is:
$$\mathrm{Accuracy@}k = \frac{|R_k|}{n}$$
where n is the total number of defect reports used in the evaluation, and |R_k| is the number of those n defect reports that are successfully localized when the method makes top-k recommendations. The larger the value of Accuracy@k, the better the localization performance of the method.
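As a small worked example of Accuracy@k, the sketch below counts a report as successfully localized when any of its top k recommended files is among the files actually related to it. The dictionary layout for `recommendations` and `relevant` is an assumption made for illustration.

```python
def accuracy_at_k(recommendations, relevant, k):
    """recommendations: dict report id -> ranked list of files
    relevant: dict report id -> set of files actually related to that report
    """
    if not recommendations:
        return 0.0
    hits = sum(
        1
        for report_id, ranked in recommendations.items()
        if any(f in relevant.get(report_id, set()) for f in ranked[:k])
    )
    return hits / len(recommendations)
```

With two reports whose first relevant files sit at positions 2 and 3, this gives 0.0 for k = 1, 0.5 for k = 2, and 1.0 for k = 3.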
(3) Mean reciprocal rank (MRR). MRR judges the performance of the defect localization method by the reciprocal of the position of the first source code file related to the defect report in the generated recommendation list:
$$\mathrm{MRR} = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{\mathrm{first}_j}$$
where n is the number of defect reports and first_j is the position of the first source code file related to the defect report in the recommendation list for the j-th defect report. The larger the value of MRR, the better the performance of the defect localization method.
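MRR can be computed with a few lines of Python, as sketched below. One convention is assumed here that the text does not state: a report whose recommendation list contains no related file contributes 0 to the sum. The input layout is the same hypothetical one used for illustration above.

```python
def mean_reciprocal_rank(recommendations, relevant):
    """recommendations: dict report id -> ranked list of files
    relevant: dict report id -> set of files actually related to that report
    """
    if not recommendations:
        return 0.0
    total = 0.0
    for report_id, ranked in recommendations.items():
        for position, f in enumerate(ranked, start=1):
            if f in relevant.get(report_id, set()):
                total += 1.0 / position  # reciprocal of the first hit's position
                break
    return total / len(recommendations)
```

For two reports whose first related files appear at positions 2 and 3, the result is (1/2 + 1/3) / 2 = 5/12.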
In summary, the invention localizes defect reports across projects with a parallel method. It addresses the localization of cross-project defect reports in a software ecosystem; the parallel method handles the large number of source code files that cross-project localization involves, improves localization efficiency, and greatly reduces the time required. Compared with earlier traditional methods, the proposed method addresses a broader problem, reduces the time developers spend localizing defect reports across projects, and improves localization results.
Claims (5)
1. A cross-project, parallelized defect localization method based on defect report analysis, characterized in that: localization at file-level granularity is performed across multiple projects according to a defect report; text processing, including word segmentation, stop-word removal, and stem extraction, is applied to the defect report and to the source code files in the projects; the lexical similarity and semantic similarity between the defect report and the source code files in the projects are then calculated using a vector space model method and a topic model method, so as to improve localization precision; and a parallelization method is used to accelerate the calculation and improve efficiency.
2. The cross-project, parallelized defect localization method based on defect report analysis according to claim 1, characterized by comprising the following steps:
1) collecting related projects and repaired defect reports, performing text processing on the source code of the projects and on the defect reports, and generating a corpus by incremental updating;
2) calculating, using the vector space model method and the topic model method, the similarity between a defect report and the source code files across multiple projects, and ranking the source code files by similarity;
3) calculating the similarity between the defect reports and the source code files across multiple projects using the MapReduce parallel method, ranking the results, and returning the ranked list to the developer.
3. The cross-project, parallelized defect localization method based on defect report analysis according to claim 2, characterized in that, in step 1): the selected projects have complete change-history records and are related to one another, that is, the projects call one another or have similar functions; the selected defect reports are for defects that have been resolved and fixed, verified as fixed, or closed as fixed, and the identifier, summary, description, submission time, and repair time are extracted from each defect report; a corpus is established for the defect reports and the project source code, and because the number of source code files in the corpus multiplies, the corpus is built incrementally, that is, establishing the corpus for the first defect report requires processing all source code files in the code snapshot, whereas for each subsequent defect report a version change tool is used to obtain the files that changed between the code snapshot corresponding to that defect report and the code snapshot corresponding to the previous defect report, and only those files are processed; and the code snapshot of a project is obtained by selecting, from the project's version history, the version whose commit time is closest to the repair time of the defect report.
4. The cross-project, parallelized defect localization method based on defect report analysis according to claim 2, characterized in that, in step 2), the vector space model method and the topic model method are used to calculate the similarity between a defect report and the source code files across multiple projects; in a software ecosystem the projects are related, because projects may call the same code library during development, so it is not known in advance which project a defect report should be localized to; the similarity between the defect report and the source code of multiple projects is therefore calculated, the source code files are ranked by similarity, and a ranked list is returned, from which the developer localizes the defect in the corresponding project.
5. The cross-project, parallelized defect localization method based on defect report analysis according to claim 2, characterized in that, in step 3), the MapReduce parallel method is used to calculate the similarity between the defect reports and the source code files across multiple projects; localizing each defect report with the vector space model and topic model methods is an independent task, and the logic flow of these computing tasks is essentially the same, so the work is divided into many relatively independent small tasks that are processed in parallel: in the Map phase, the source code files of the projects are read, and, with the code sequence number of each code record as the key and the word file corresponding to that sequence number as the value, the key-value pairs are sent to the corresponding Reduce nodes; in the Reduce phase, the defect report file is read and stored in an array, each defect report element in the array is traversed and its similarity with the source file code assigned to that node is calculated in turn, and the final results are written to a file.
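The Map and Reduce phases described in the claims can be sketched in plain Python on a single machine. This is only an illustration under stated assumptions, not the patented implementation: it does not use Hadoop, it scores files with raw term-frequency cosine similarity as a stand-in for the vector space model, it omits the topic model entirely, and its stop-word list and suffix-stripping "stemmer" are deliberately crude. All function names, file names, and texts are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (assumption, not from the patent).
STOP_WORDS = {"the", "a", "an", "in", "is", "of", "to", "when", "and"}


def preprocess(text):
    """Tokenize, lowercase, drop stop words, strip a few common suffixes."""
    stems = []
    for tok in re.findall(r"[a-z]+", text.lower()):
        if tok in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "s"):  # crude stand-in for a real stemmer
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        stems.append(tok)
    return stems


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def map_phase(source_files):
    """Map: emit one (file id, word list) key-value pair per source file."""
    for file_id, text in source_files.items():
        yield file_id, preprocess(text)


def reduce_phase(pairs, reports):
    """Reduce: score every defect report against each file given to this node."""
    report_vecs = {rid: Counter(preprocess(t)) for rid, t in reports.items()}
    scores = defaultdict(list)
    for file_id, words in pairs:
        file_vec = Counter(words)
        for rid, report_vec in report_vecs.items():
            scores[rid].append((file_id, cosine(report_vec, file_vec)))
    # Rank files for each report, most similar first.
    return {rid: sorted(s, key=lambda x: x[1], reverse=True)
            for rid, s in scores.items()}


# Hypothetical files from two related projects and one defect report.
source_files = {
    "projA/Parser.java": "class Parser parses tokens and reports parsing errors",
    "projB/Cache.java": "class Cache stores cached entries in memory",
}
reports = {"bug-1": "Parser crashes when parsing empty tokens"}

ranking = reduce_phase(map_phase(source_files), reports)
print(ranking["bug-1"][0][0])
```

In a real MapReduce deployment the key-value pairs would be partitioned across Reduce nodes, each of which would run the reduce step on its own share of the files; here a single call plays the role of one node receiving every pair.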
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610854844.1A CN107844414A (en) | 2016-09-21 | 2016-09-21 | A kind of spanned item mesh based on defect report analysis, parallelization defect positioning method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107844414A true CN107844414A (en) | 2018-03-27 |
Family
ID=61661039
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610854844.1A Pending CN107844414A (en) | 2016-09-21 | 2016-09-21 | A kind of spanned item mesh based on defect report analysis, parallelization defect positioning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107844414A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140365990A1 (en) * | 2013-06-11 | 2014-12-11 | Hitachi, Ltd. | Software evaluation device and method |
CN105095091A (en) * | 2015-09-23 | 2015-11-25 | 南京大学 | Software defect code file locating method based on reverse index technology |
CN105159822A (en) * | 2015-08-12 | 2015-12-16 | 南京航空航天大学 | Software defect positioning method based on text part of speech and program call relation |
CN105786704A (en) * | 2016-02-22 | 2016-07-20 | 南京大学 | Work amount sensitive bug positioning technology effectiveness evaluation method |
-
2016
- 2016-09-21 CN CN201610854844.1A patent/CN107844414A/en active Pending
Non-Patent Citations (2)
Title |
---|
刘海林: ""基于LDA的软件缺陷组件预测研究与实现"", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
宁可为,王炜等: ""基于Hadoop的句群相似度计算"", 《计算机系统应用》 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108717459B (en) * | 2018-05-24 | 2019-05-21 | 哈尔滨工程大学 | A kind of mobile application defect positioning method of user oriented comment information |
CN108717459A (en) * | 2018-05-24 | 2018-10-30 | 哈尔滨工程大学 | A kind of mobile application defect positioning method of user oriented comment information |
CN110175128A (en) * | 2019-05-29 | 2019-08-27 | 北京百度网讯科技有限公司 | A kind of similar codes case acquisition methods, device, equipment and storage medium |
CN112306845A (en) * | 2019-07-31 | 2021-02-02 | 腾讯科技(深圳)有限公司 | Concurrent testing method, device, equipment and medium |
CN112306845B (en) * | 2019-07-31 | 2024-06-11 | 腾讯科技(深圳)有限公司 | Concurrency test method, concurrency test device, concurrency test equipment and concurrency test medium |
CN111309607A (en) * | 2020-02-12 | 2020-06-19 | 扬州大学 | Software defect positioning method of code method level |
CN111309607B (en) * | 2020-02-12 | 2023-06-02 | 扬州大学 | Software defect positioning method of code method level |
CN111427775B (en) * | 2020-03-12 | 2023-05-02 | 扬州大学 | Method level defect positioning method based on Bert model |
CN111427775A (en) * | 2020-03-12 | 2020-07-17 | 扬州大学 | Method level defect positioning method based on Bert model |
CN111651164A (en) * | 2020-04-29 | 2020-09-11 | 南京航空航天大学 | Code identifier normalization method and device |
CN112001484A (en) * | 2020-08-22 | 2020-11-27 | 哈尔滨工业大学 | Safety defect report prediction method based on multitask deep learning |
CN112380111A (en) * | 2020-10-09 | 2021-02-19 | 华南理工大学 | Real-time defect positioning method and system based on new project |
CN112597063B (en) * | 2021-02-26 | 2021-06-08 | 北京北大软件工程股份有限公司 | Method, device and storage medium for positioning defect code |
CN112597063A (en) * | 2021-02-26 | 2021-04-02 | 北京北大软件工程股份有限公司 | Method, device and storage medium for positioning defect code |
CN114416524B (en) * | 2021-12-15 | 2023-03-24 | 北京邮电大学 | File error positioning method and device |
CN114416524A (en) * | 2021-12-15 | 2022-04-29 | 北京邮电大学 | File error positioning method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180327 |