CN113051156B

CN113051156B - Software defect positioning method based on block chain traceability and information retrieval

Info

Publication number: CN113051156B
Application number: CN202110280035.5A
Authority: CN
Inventors: 吴晓鸰; 曾志兵; 凌捷
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2021-03-16
Filing date: 2021-03-16
Publication date: 2022-03-11
Anticipated expiration: 2041-03-16
Also published as: CN113051156A

Abstract

The invention provides a software defect positioning method based on blockchain traceability and information retrieval, comprising the following steps: storing all historical defect reports and their corresponding modified source code files on blocks of a blockchain system; preprocessing the current new defect report and constructing a query statement from it; if the blockchain system contains unsearched blocks, using the query statement to trace back through the blockchain system and retrieve n blocks; if no unsearched blocks remain, using the query statement to retrieve all source code files; calculating final relevance scores during retrieval and sorting them in descending order; checking the corresponding modified source code files one by one according to the sorted result to locate the defect; and, once the defect has been located, packaging the current new defect report and the corresponding modified source code file into a new block and appending that block to the blockchain system, which completes the defect positioning. The invention locates defects with high efficiency and high precision.

Description

Software defect positioning method based on block chain traceability and information retrieval

Technical Field

The invention relates to the technical field of information science, in particular to a software defect positioning method based on block chain traceability and information retrieval.

Background

To repair a software defect, a developer must determine the position of the defect in the source code from the description in the defect report; this task is called software defect positioning. Software engineering offers defect positioning techniques based on dynamic program analysis and on static program analysis. Dynamic techniques infer the position of a defect by analyzing the execution behavior of the running program. They can identify candidate defective statements in the program under test at fine granularity, but they incur high running-time and resource costs, and their performance depends heavily on the quantity and quality of the test cases. Static techniques do not execute test cases; they analyze the source code statically and compare it against a set of programming rules to infer the defect position, which makes them cheap to run and simple to use. Because the programming rules depend on the programming language, however, these methods are language-specific, and their granularity is coarse, usually at the file or method level. One mainstream class of static methods is information retrieval-based bug localization (IRBL), which uses the content of the defect report to identify the relevant source code files or functions semi-automatically or fully automatically, reducing development cost and improving development efficiency. As a static method, IRBL has its own drawbacks: at present it can locate defects only down to method granularity, not to a specific line of code, and its accuracy is strongly affected by the quality of the defect report.

Chinese patent CN103176905A, published on June 26, 2013, discloses a defect association method comprising: extracting the code blocks corresponding to each defect from the defect report, and generating a defect-related code block sequence information base from the extracted code blocks; obtaining the basic frequent subsequences of that information base, and eliminating the frequent subsequences that do not satisfy the constraint conditions; grouping the defects in the defect report according to the defects corresponding to the current frequent subsequences; and refining the grouped defects according to a preset defect association mode. This method can group defects accurately, reduces the work of identifying some defects, and improves working efficiency, but it cannot locate defects with high precision.

Disclosure of Invention

The invention addresses the inability of the prior art to locate software defects both efficiently and precisely, and provides a software defect positioning method based on blockchain traceability and information retrieval that achieves high efficiency and high precision.

In order to solve the technical problems, the technical scheme of the invention is as follows:

The invention provides a software defect positioning method based on blockchain traceability and information retrieval, comprising the following steps:

S1: acquiring historical defect reports and the modified source code files corresponding to them, and storing them on blocks of a blockchain system;

S2: submitting a current new defect report to the blockchain system;

S3: preprocessing the current new defect report;

S4: constructing a query statement from the description information of the preprocessed current new defect report, and retrieving blocks with the query statement;

S5: judging whether an unsearched block exists in the blockchain system; if an unsearched block exists, going to step S6; if no unsearched block exists, going to step S7;

S6: using the query statement to trace back through the blockchain system and retrieve n blocks;

S7: using the query statement to retrieve all source code files;

S8: during retrieval, calculating a first relevance score between the query statement and a historical defect report and a second relevance score between the query statement and the modified source code file corresponding to that historical defect report, and combining the first and second relevance scores into a final relevance score; presetting a threshold for the final relevance score, and sorting the final relevance scores that exceed the threshold in descending order;

S9: checking the corresponding source code files one by one in the order of the sorted final relevance scores to locate the defect;

S10: judging whether the defect positioning has succeeded; if not, constructing a new query statement and repeating steps S5 to S9 until it succeeds; if so, packaging the current new defect report and the corresponding modified source code file into a new block and appending that block to the blockchain system, which completes the defect positioning.
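The control flow of steps S5 to S8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `chain` is assumed to be a list of blocks ordered from oldest to newest, each block is a hypothetical dictionary with a `files` list, and `score_block` and `score_file` are hypothetical callables standing in for the relevance scoring of S8. For simplicity the sketch reuses one query across rounds, whereas S10 constructs a new query statement after an unsuccessful round.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Ranked = List[Tuple[float, str]]


def trace_back(chain: Sequence[dict], n: int) -> Iterator[List[dict]]:
    """Yield batches of up to n blocks, newest block first (S5 and S6)."""
    end = len(chain)
    while end > 0:
        start = max(0, end - n)
        # Reverse the slice so the most recent block is examined first.
        yield list(reversed(chain[start:end]))
        end = start


def ranked_batches(
    chain: Sequence[dict],
    all_files: Sequence[str],
    n: int,
    threshold: float,
    score_block: Callable[[dict], float],
    score_file: Callable[[str], float],
) -> Iterator[Ranked]:
    """Yield one ranked candidate list per retrieval round (S5 to S8)."""
    for batch in trace_back(chain, n):
        # S6: unsearched blocks remain, so score the files stored in them.
        scored = [(score_block(b), f) for b in batch for f in b["files"]]
        yield sorted((s for s in scored if s[0] > threshold), reverse=True)
    # S7: every block has been searched, so fall back to all source files.
    scored = [(score_file(f), f) for f in all_files]
    yield sorted((s for s in scored if s[0] > threshold), reverse=True)
```

A caller would consume the generator one round at a time, inspect the ranked files (S9), and stop iterating as soon as the defect is located (S10).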

Preferably, in S1, each historical defect report is packaged together with its corresponding modified source code file and saved on one block of the blockchain system.

The historical defect reports and their corresponding modified source code files are stored in blocks, and the blocks are appended to the blockchain system. For a current new defect report, searching first among similar defects that have already been repaired, and among the source code files modified to repair them, shortens retrieval time and improves the efficiency and accuracy of defect positioning. Moreover, a source code file that has been modified several times to fix the same defect or to implement the same function is more likely to contain defects.
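Packaging a report with its modified files and appending the result to the chain can be sketched as below. The patent does not specify a block layout, hash function, or consensus mechanism, so everything here is an assumption made for illustration: the `Block` fields, the use of SHA-256 over a JSON payload, and the helper names `append_block` and `is_valid` are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List

GENESIS_HASH = "0" * 64  # assumed placeholder for the first block's predecessor


@dataclass
class Block:
    index: int
    prev_hash: str
    report: str              # text of the defect report
    files: Dict[str, str]    # path -> modified source text
    digest: str = field(init=False)

    def __post_init__(self) -> None:
        self.digest = self.compute_digest()

    def compute_digest(self) -> str:
        """Hash the block contents together with the previous block's hash."""
        payload = json.dumps(
            {"index": self.index, "prev_hash": self.prev_hash,
             "report": self.report, "files": self.files},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain: List[Block], report: str, files: Dict[str, str]) -> Block:
    """Package one report with its modified files and link it to the chain."""
    prev_hash = chain[-1].digest if chain else GENESIS_HASH
    block = Block(len(chain), prev_hash, report, files)
    chain.append(block)
    return block


def is_valid(chain: List[Block]) -> bool:
    """Check that every block still matches its digest and its predecessor."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1].digest if i else GENESIS_HASH
        if block.prev_hash != expected_prev or block.digest != block.compute_digest():
            return False
    return True
```

Because each digest covers the previous digest, altering a stored report invalidates the chain from that block onward, which is what makes the stored defect history traceable.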

Preferably, in S3, the preprocessing specifically comprises performing text normalization, stop-word removal, and stemming on the current new defect report.

Text normalization removes special characters and punctuation marks from the current new defect report and splits the remaining text into individual words. Stop-word removal then filters the normalized report against a list of common English stop words, that is, words that appear frequently in every document but do little to distinguish one document from another. Finally, stemming converts each remaining word to its root form.
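The three preprocessing stages can be sketched with the Python standard library alone. The stop-word list below is a tiny illustrative subset, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer such as Porter's; neither is taken from the patent.

```python
import re
from typing import List

# Illustrative subset only; a real system would use a full English stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to", "and", "when", "it"}
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters (naive stand-in)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(report: str) -> List[str]:
    """Normalize text, drop stop words, then reduce each word to a root form."""
    # Text normalization: replace anything that is not a letter, then split into words.
    tokens = re.sub(r"[^A-Za-z]+", " ", report).lower().split()
    # Stop-word removal followed by stemming.
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

For instance, `preprocess("The parser crashes when parsing nested lists!")` keeps five terms and maps `crashes` to `crash` and `lists` to `list`.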

Preferably, in S4, the preprocessed description information of the current new defect report includes stack trace information, reporter information, and a description of an API library.

The stack trace information shows the order of the instructions executed before the software crashed, and can also be regarded as an ordered list of source code files that may be associated with the defect. Stack trace information is used because source code files that rank higher in the stack trace are more likely to contain the error; it also narrows the search space and improves retrieval efficiency. Reporter information is used because the same reporter tends to report problems in the same or similar software. Descriptions of the API libraries used help bridge the semantic gap between the defect report and the source code files, which improves positioning performance.
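Turning a stack trace into an ordered list of candidate files can be sketched as follows. The sketch assumes Java-style frames of the form `at package.Class.method(File.java:line)`; the patent does not name a language or trace format, so the regular expression and the function name `files_in_stack_trace` are illustrative assumptions.

```python
import re
from typing import List

# Matches frames such as "    at org.example.Parser.parse(Parser.java:42)".
FRAME = re.compile(r"^\s*at\s+[\w.$<>]+\(([\w$]+\.java):\d+\)", re.MULTILINE)


def files_in_stack_trace(report: str) -> List[str]:
    """Return the source files named in the trace, topmost frame first, without repeats."""
    seen = set()
    ordered = []
    for name in FRAME.findall(report):
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered
```

The position of a file in the returned list can then serve as a prior: files nearer the top of the trace are inspected or weighted first.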

Preferably, in S4, before the query statement is constructed, the preprocessed description information of the current new defect report is represented as a vector by using the revised vector space model.

The classical vector space model does not treat texts of different lengths equally when ranking: shorter texts are more likely to rank high, whereas longer texts generally receive lower similarity and are therefore hard to rank high. In defect positioning, however, larger source code files are more likely to contain defects. The revised vector space model (rVSM) therefore adjusts for the effect of text length on similarity. It is a similarity calculation method commonly used in information retrieval-based software defect positioning.

Preferably, in S8, the specific step of calculating the final relevance score includes:

S8.1: using the TF-IDF algorithm to calculate, for each term t of the query statement, its weight weight(t, q) in the current new defect report, its weight weight(t, r) in the historical defect report, and its weight weight(t, d) in the corresponding modified source code file;

S8.2: calculating a first cosine similarity cos(q, r) between the current new defect report and the historical defect report from weight(t, q) and weight(t, r), and a second cosine similarity cos(q, d) between the current new defect report and the corresponding modified source code file from weight(t, q) and weight(t, d);

S8.3: calculating a first relevance score Score(q, r) from the first cosine similarity cos(q, r), and a second relevance score Score(q, d) from the second cosine similarity cos(q, d);

S8.4: calculating the final relevance score S from the first relevance score Score(q, r) and the second relevance score Score(q, d).

Preferably, in S8.1, the weights weight(t, q), weight(t, r), and weight(t, d) are calculated as follows:

weight(t, q) = tf(t, q) × idf(t, Q)

weight(t, r) = tf(t, r) × idf(t, R)

weight(t, d) = tf(t, d) × idf(t, D)

where tf(t, q) is the frequency of the term t in the current new defect report q, and idf(t, Q) is the inverse document frequency, equal to the logarithm of the reciprocal of tf(t, Q); tf(t, r) is the frequency of the term t in the historical defect report r, and idf(t, R) is the inverse document frequency, equal to the logarithm of the reciprocal of tf(t, R); tf(t, d) is the frequency of the term t in the source code file d, and idf(t, D) is the inverse document frequency, equal to the logarithm of the reciprocal of tf(t, D).
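The weighting of S8.1 can be sketched as follows. The sketch assumes the common textbook form of TF-IDF, with raw term counts for tf and idf = log(N / df), where N is the number of documents in the collection and df is the number of documents containing the term; the exact variant used by the invention is not recoverable from the text, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_table(corpus: List[List[str]]) -> Dict[str, float]:
    """Inverse document frequency per term, assuming idf = log(N / df)."""
    n_docs = len(corpus)
    # Count each term once per document to get its document frequency.
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf(doc: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """weight(t, doc) = tf(t, doc) * idf(t) for every term in the document."""
    tf = Counter(doc)
    # Terms unseen in the collection get weight 0 in this sketch.
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}
```

Under this definition a term that occurs in every document, such as a project-wide class name, receives weight 0 and does not influence the ranking.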

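Preferably, in S8.2, the first cosine similarity cos(q, r) and the second cosine similarity cos(q, d) are calculated as follows:

$$\cos(q, r) = \frac{\sum_{t} \mathrm{weight}(t, q) \times \mathrm{weight}(t, r)}{\sqrt{\sum_{t} \mathrm{weight}^{2}(t, q)} \times \sqrt{\sum_{t} \mathrm{weight}^{2}(t, r)}}$$

$$\cos(q, d) = \frac{\sum_{t} \mathrm{weight}(t, q) \times \mathrm{weight}(t, d)}{\sqrt{\sum_{t} \mathrm{weight}^{2}(t, q)} \times \sqrt{\sum_{t} \mathrm{weight}^{2}(t, d)}}$$

where each sum runs over the terms t, weight²(t, q) denotes the square of weight(t, q), weight²(t, r) denotes the square of weight(t, r), and weight²(t, d) denotes the square of weight(t, d).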
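Under the usual definition of cosine similarity, S8.2 can be sketched over sparse term-weight dictionaries such as those produced by a TF-IDF step. The dictionary representation and the function name `cosine` are illustrative choices, not details from the patent.

```python
import math
from typing import Dict


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse term-weight vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)
```

The same function serves both comparisons: `cosine(q, r)` for the new report against a historical report, and `cosine(q, d)` for the new report against a modified source code file.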

Preferably, in S8.3, the first relevance score Score(q, r) and the second relevance score Score(q, d) are calculated as follows:

Score(q, r) = g(#term r) × cos(q, r)

Score(q, d) = g(#term d) × cos(q, d)

where #term r and #term d are the numbers of terms in the historical defect report and in the source code file, so that g(#term r) reflects the length of the historical defect report and g(#term d) reflects the length of the source code file.
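The patent does not give the form of g. One form commonly used with the revised vector space model in the bug localization literature is a logistic function of the min-max normalized term count, and the sketch below assumes it; the function names and the parameters `n_min` and `n_max` (the smallest and largest term counts in the collection) are hypothetical.

```python
import math


def length_factor(n_terms: int, n_min: int, n_max: int) -> float:
    """Assumed form of g: logistic function of the min-max normalized term count."""
    if n_max == n_min:
        normalized = 0.0  # all documents have the same length
    else:
        normalized = (n_terms - n_min) / (n_max - n_min)
    return 1.0 / (1.0 + math.exp(-normalized))


def relevance_score(cos_sim: float, n_terms: int, n_min: int, n_max: int) -> float:
    """Score = g(#terms) * cos, so longer documents are boosted."""
    return length_factor(n_terms, n_min, n_max) * cos_sim
```

With this choice g ranges from 0.5 for the shortest document to about 0.73 for the longest, so two files with equal cosine similarity are ordered with the longer one first.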

Preferably, in S8.4, the final relevance score S is calculated as follows:

S = α × Score(q, r) + (1 - α) × Score(q, d)

where α is a weight coefficient.
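Combining the two scores, applying the preset threshold, and sorting in descending order can be sketched as follows. The candidate names, score values, and the function name `final_ranking` are made up for illustration; the patent does not state values for α or the threshold.

```python
from typing import Dict, List, Tuple


def final_ranking(
    candidates: Dict[str, Tuple[float, float]],
    alpha: float,
    threshold: float,
) -> List[Tuple[str, float]]:
    """Combine Score(q, r) and Score(q, d), keep scores above the threshold, sort descending."""
    combined = {
        name: alpha * score_r + (1 - alpha) * score_d
        for name, (score_r, score_d) in candidates.items()
    }
    kept = [(name, s) for name, s in combined.items() if s > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: file -> (Score(q, r), Score(q, d)).
example = {
    "Parser.java": (0.8, 0.4),
    "Loader.java": (0.2, 0.6),
    "Util.java": (0.1, 0.1),
}
ranking = final_ranking(example, alpha=0.5, threshold=0.2)
```

With α = 0.5 the combined scores are about 0.6, 0.4, and 0.1, so `Util.java` falls below the threshold of 0.2 and the developer inspects `Parser.java` first, then `Loader.java`.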

Preferably, in S8, a coordinate ascent algorithm without normalization is used to obtain the ranking of the relevance scores in descending order.
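The patent does not describe how coordinate ascent is applied. A generic sketch of the technique is given below: it improves one parameter at a time while holding the others fixed, and halves the step when no coordinate improves. The objective function, the step schedule, and the reading of "without normalization" as leaving the parameters unscaled after each update are all assumptions; in a ranking setting the objective would typically be a ranking-quality measure evaluated on past defect reports, with parameters such as α.

```python
from typing import Callable, List, Tuple


def coordinate_ascent(
    objective: Callable[[List[float]], float],
    start: List[float],
    step: float = 0.1,
    min_step: float = 1e-3,
    max_rounds: int = 100,
) -> Tuple[List[float], float]:
    """Maximize objective by adjusting one coordinate at a time."""
    params = list(start)
    best = objective(params)
    for _ in range(max_rounds):
        if step < min_step:
            break
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = params.copy()
                trial[i] += delta
                value = objective(trial)
                if value > best:
                    # Keep the move; parameters are not rescaled afterwards.
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2  # refine the search once no coordinate helps
    return params, best
```

The procedure needs only objective evaluations, not gradients, which suits ranking measures that are not differentiable in the weights.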

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

The method stores the historical defect reports and their corresponding modified source code files on blocks of a blockchain system. A query statement is constructed for the current new defect report; while unsearched blocks remain in the blockchain system, the query statement traces back through the blockchain system to retrieve them, and once no unsearched blocks remain, it retrieves all source code files. Final relevance scores are calculated during retrieval, those exceeding a preset threshold are sorted in descending order, and the corresponding source code files are checked one by one in that order to locate the defect. After the defect has been located, the current new defect report and the corresponding modified source code file are packaged into a new block and appended to the blockchain system, which completes the defect positioning and enriches the historical defect reports held in the blockchain system. Retrieving repaired historical defects and their modified source code files in the blockchain system shortens retrieval time and improves the efficiency and accuracy of defect positioning. Because inspection starts from the source code file with the highest final relevance score among those exceeding the preset threshold, the defect is located more quickly. Together, these steps give the method high efficiency and high precision in software defect positioning.

Drawings

Fig. 1 is a flowchart of a software defect locating method based on block chain tracing and information retrieval according to embodiment 1.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent.

To better illustrate the embodiments, certain features of the drawings may be omitted, enlarged, or reduced, and do not represent the size of an actual product.

Those skilled in the art will understand that certain well-known structures, and descriptions thereof, may be omitted from the drawings.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Example 1

The present embodiment provides a software defect positioning method based on block chain tracing and information retrieval, as shown in fig. 1, the method includes the following steps:

S1: acquiring all historical defect reports and the modified source code files corresponding to them, and storing them on blocks of a blockchain system;

S2: submitting a current new defect report to the blockchain system;

S3: preprocessing the current new defect report;

S7: using the query statement to retrieve all source code files;

In S1, each historical defect report is packaged together with its corresponding modified source code file and saved on one block of the blockchain system.

In S3, the preprocessing specifically comprises performing text normalization, stop-word removal, and stemming on the current new defect report.

In S4, the preprocessed description information of the current new defect report includes stack trace information, reporter information, and a description of an API library.

In S4, before the query statement is constructed, the revised vector space model is used to represent the description information of the preprocessed current new defect report as a vector.

The classical vector space model does not treat texts of different lengths equally when ranking: shorter texts are more likely to rank high, whereas longer texts generally receive lower similarity and are therefore hard to rank high. In defect positioning, however, larger source code files are more likely to contain defects. The revised vector space model therefore adjusts for the effect of text length on similarity.

In S8, the specific step of calculating the final relevance score is:

In S8.1, the weight weight(t, q) of a query term in the current new defect report, its weight weight(t, r) in the historical defect report, and its weight weight(t, d) in the corresponding modified source code file are calculated as follows:

weight(t, q) = tf(t, q) × idf(t, Q)

weight(t, r) = tf(t, r) × idf(t, R)

weight(t, d) = tf(t, d) × idf(t, D)

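In S8.2, the first cosine similarity cos(q, r) and the second cosine similarity cos(q, d) are calculated as follows:

$$\cos(q, r) = \frac{\sum_{t} \mathrm{weight}(t, q) \times \mathrm{weight}(t, r)}{\sqrt{\sum_{t} \mathrm{weight}^{2}(t, q)} \times \sqrt{\sum_{t} \mathrm{weight}^{2}(t, r)}}$$

$$\cos(q, d) = \frac{\sum_{t} \mathrm{weight}(t, q) \times \mathrm{weight}(t, d)}{\sqrt{\sum_{t} \mathrm{weight}^{2}(t, q)} \times \sqrt{\sum_{t} \mathrm{weight}^{2}(t, d)}}$$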

In S8.3, the first relevance score Score(q, r) and the second relevance score Score(q, d) are calculated as follows:

Score(q, r) = g(#term r) × cos(q, r)

Score(q, d) = g(#term d) × cos(q, d)

In S8.4, the final relevance score S is calculated as follows:

S = α × Score(q, r) + (1 - α) × Score(q, d)

where α is a weight coefficient.

In S8, a coordinate ascent algorithm without normalization is used to obtain the ranking of the relevance scores in descending order.

This embodiment combines information retrieval-based software defect positioning with blockchain tracing, and makes full use of the advantages of the former. Compared with defect positioning based on dynamic program analysis, information retrieval-based positioning has a lower computational cost. Compared with traditional defect positioning based on static program analysis, it is more general. It is also more targeted: given the current new defect report and the historical defect reports, it makes full use of the information they provide to identify the source code file in which the defect lies. In addition, blockchain technology gives information retrieval-based positioning more effective management and implementation in areas such as data preprocessing and engineering application, improves the effectiveness of the algorithm, reduces the cost and difficulty of implementing it, and provides unified data storage and management with higher security and reliability.

It should be understood that the above embodiments are merely examples given to illustrate the present invention clearly, and do not limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A software defect positioning method based on block chain tracing and information retrieval is characterized by comprising the following steps:

S2: submitting a current new defect report to a blockchain system;

S3: preprocessing the current new defect report;

S7: using the query statement to retrieve all source code files;

the specific steps of calculating the final relevance score are as follows:

S8.1: using the TF-IDF algorithm to calculate, for each term t of the query statement, its weight weight(t, q) in the current new defect report, its weight weight(t, r) in the historical defect report, and its weight weight(t, d) in the corresponding modified source code file, specifically as follows:

weight(t, q) = tf(t, q) × idf(t, Q)

weight(t, r) = tf(t, r) × idf(t, R)

weight(t, d) = tf(t, d) × idf(t, D)

where tf(t, q) is the frequency of the term t in the current new defect report q, and idf(t, Q) is the inverse document frequency, equal to the logarithm of the reciprocal of tf(t, Q); tf(t, r) is the frequency of the term t in the historical defect report r, and idf(t, R) is the inverse document frequency, equal to the logarithm of the reciprocal of tf(t, R); tf(t, d) is the frequency of the term t in the source code file d, and idf(t, D) is the inverse document frequency, equal to the logarithm of the reciprocal of tf(t, D);

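S8.2: calculating a first cosine similarity cos(q, r) between the current new defect report and the historical defect report from weight(t, q) and weight(t, r), and a second cosine similarity cos(q, d) between the current new defect report and the corresponding modified source code file from weight(t, q) and weight(t, d), specifically as follows:

$$\cos(q, r) = \frac{\sum_{t} \mathrm{weight}(t, q) \times \mathrm{weight}(t, r)}{\sqrt{\sum_{t} \mathrm{weight}^{2}(t, q)} \times \sqrt{\sum_{t} \mathrm{weight}^{2}(t, r)}}$$

$$\cos(q, d) = \frac{\sum_{t} \mathrm{weight}(t, q) \times \mathrm{weight}(t, d)}{\sqrt{\sum_{t} \mathrm{weight}^{2}(t, q)} \times \sqrt{\sum_{t} \mathrm{weight}^{2}(t, d)}}$$

where each sum runs over the terms t, weight²(t, q) denotes the square of weight(t, q), weight²(t, r) denotes the square of weight(t, r), and weight²(t, d) denotes the square of weight(t, d);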

S8.4: calculating the final relevance score S from the first relevance score Score(q, r) and the second relevance score Score(q, d);

S9: checking the corresponding modified source code files one by one in the order of the sorted final relevance scores to locate the defect;

2. The method for software defect location based on blockchain tracing and information retrieval of claim 1, wherein in step S1, when saving the historical defect reports and the corresponding modified source code files, each historical defect report and the corresponding modified source code file are packed and saved on a block of the blockchain system.

3. The method for locating software defects based on blockchain tracing and information retrieval of claim 2, wherein in S3, the preprocessing specifically comprises performing text normalization, stop-word removal, and stemming on the current new defect report.

4. The method according to claim 3, wherein in step S4, before the query statement is constructed, the modified vector space model is used to represent the description information of the preprocessed current new defect report as a vector; the preprocessed current new defect report description information includes stack trace information, reporter information, and descriptions of API libraries.

5. The method for locating software defects based on blockchain traceability and information retrieval as claimed in claim 1, wherein in S8.3, the first relevance score Score(q, r) and the second relevance score Score(q, d) are calculated as follows:

Score(q, r) = g(#term r) × cos(q, r)

Score(q, d) = g(#term d) × cos(q, d)

where g(#term r) represents the length of the historical defect report and g(#term d) represents the length of the source code file.

6. The method for locating software defects based on blockchain traceability and information retrieval as claimed in claim 5, wherein in S8.4, the specific method for calculating the final relevance score S is as follows:

S = α × Score(q, r) + (1 - α) × Score(q, d)

where α is a weight coefficient.

7. The method for locating software defects based on blockchain traceability and information retrieval as claimed in claim 6, wherein in step S8, a coordinate ascent algorithm without normalization is adopted to obtain the ranking of the relevance scores in descending order.