CN117421226A - Defect report reconstruction method and system based on large language model - Google Patents

Defect report reconstruction method and system based on large language model

Info

Publication number
CN117421226A
Authority
CN
China
Prior art keywords
defect report
report
defect
statement
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311420321.2A
Other languages
Chinese (zh)
Inventor
薄莉莉
纪王杰
孙小兵
吕涛
周运生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN202311420321.2A priority Critical patent/CN117421226A/en
Publication of CN117421226A publication Critical patent/CN117421226A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3668Software testing
    • G06F11/3672Test management
    • G06F11/3692Test management for test results analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a defect report reconstruction method and system based on a generative large language model. The method comprises the following steps: collecting and preprocessing a sentence set and a report set, building a classification decision model from the sentence set, evaluating the quality of the defect reports in the report set with the classification decision model, reconstructing the low-quality defect reports with the generative large language model guided by an established prompt template, and outputting new defect reports. The system comprises a data acquisition module, a pre-evaluation data processing module, a defect report evaluation module, an input text construction module and a defect report reconstruction module. Compared with the prior art, the invention has strong practicability and high accuracy.

Description

Defect report reconstruction method and system based on large language model
Technical Field
The invention relates to the field of software maintenance, and in particular to a defect report reconstruction method and system based on a generative large language model.
Background
A software defect, commonly called a bug, is a problem in a software artifact such as a document or a program that affects the normal operation of the software. A defect report is essential information in the software maintenance process: it describes the defect phenomenon and a set of reproduction steps, is one of the work products of software testers, and accurately describes and locates defects in the software so that developers can correct them. It also reflects the current quality state of the project or product, which facilitates overall progress and quality control.
However, existing defect reports are mainly written and analyzed manually. Because defect reporters differ in experience, the software and its usage scenarios are complex, and defect tracking systems have imperfect functionality, the quality of the defect reports that defect fixers rely on is often uneven, which seriously hinders software maintenance and results in low accuracy and efficiency. In addition, relying on manual analysis increases cost and overhead.
Existing measures for improving defect report quality include improvements to the defect reporting mechanism and system, and improvements to the content of the defect report itself; the present invention improves the content. Existing automatic improvements to defect report content mainly supplement the missing information in a defect report with similar defect reports. For example, some work applies hybrid analysis methods combining dynamic and static techniques to Android or crowdsourced testing applications; the document "CTRAS: Crowdsourced test report aggregation and summarization", for instance, uses redundant test reports to generate richer, enhanced defect reports. Redundant defect reports are grouped according to the descriptive text and pictures in the reports, the most informative report in a group is selected as the master report, and information in the remaining redundant reports that helps understand or repair the defect is merged into the master report to obtain a higher-quality consolidated report. Other work, such as "Bug report enrichment with application of automated fixer recommendation", proposes adding extra text sentences to augment the content: the sentences to be added are taken from the problem descriptions of historical reports, the authors first compute six similarity features between candidate sentences and the text, topic, components/products, etc. of the current report to be expanded, obtain a final similarity value by a weighted sum of the six feature values, and use the top-K sentences with the largest values to expand the short report. However, because the same functionality differs across project versions, and different minor problems can occur within the same software functionality, the content of defect reports may differ substantially even when they appear similar, and the reported defects may be quite different. Moreover, relying on similar defect reports depends on the project's existing data; for a brand-new project or problem, the effect of such improvement measures is greatly reduced, and the resulting defect reports are of low quality and limited practical value.
Disclosure of Invention
The invention aims to: provide a defect report reconstruction method and system based on a generative large language model with high accuracy and strong practicability.
The technical scheme is as follows: the invention discloses a defect report reconstruction method based on a generative large language model, which comprises the following steps:
S1, collecting and preprocessing an initial sentence set to form a processed sentence set, wherein the initial sentence set comprises a plurality of sentences and a plurality of labels, and each sentence corresponds to at least one label; collecting a report set, wherein the report set comprises a plurality of defect reports, and each defect report comprises an ID, a title and a description text;
S2, performing data enhancement on the processed sentence set and merging the result with the initial sentence set to obtain an extended sentence set, wherein the extended sentence set comprises a label set; preprocessing each defect report to form a sentence list, wherein sentence lists correspond to defect reports one to one;
S3, building a classification decision model based on an attention-mechanism pre-trained model and the extended sentence set, inputting the sentence list into the classification decision model, which assigns labels from the label set to the sentences in the list and counts them, evaluating the quality of the corresponding defect report, and classifying it as a high-quality or low-quality defect report according to the evaluation result;
S4, obtaining a generative large language model and constructing its input text, wherein the input text comprises a prompt template and a report text; the prompt template is obtained by manual writing followed by optimization with the generative large language model, and the report text is extracted from the description text of the low-quality defect report;
S5, inputting the input text into the generative large language model, reconstructing the low-quality defect report, and outputting a new defect report.
Further, in step S1, the initial sentence set is preprocessed into the processed sentence set as follows: 1) delete duplicate sentences and their corresponding labels; 2) remove illegal characters and emphasis characters in the sentences, keeping the corresponding labels; 3) delete sentences with fewer than 5 or more than 1000 characters, together with their corresponding labels.
Further, in step S2, each defect report is preprocessed into a sentence list as follows: 1) delete non-text data in the defect report; 2) remove illegal characters and emphasis characters in the defect report; 3) call the sent_tokenize function of NLTK to split the title and description text of the defect report into sentences of at most 1000 characters, and collect them into a sentence list.
Further, in step S2, the data enhancement comprises a random replacement operation, a random insertion operation, and a random position-swap operation.
Further, in step S3, building the classification decision model comprises the following steps:
S31, data preparation: one part of the extended sentence set is used as a training set and the other part as a validation set;
S32, building an initial model: obtain a pre-trained model based on the attention mechanism, set the task as a multi-label classification task, set the loss function of the multi-label classification task, learn knowledge from the extended sentence set, and assign labels to the sentences in the sentence list corresponding to each defect report according to that knowledge;
S33, training the initial model to obtain the classification decision model: train the initial model with the training set, validate it with the validation set, and stop training once the metric constraints are satisfied, obtaining the classification decision model.
Further, the metrics include accuracy, precision, recall and F1 score.
Further, in step S1, the labels include an empty label; in step S2, the label set includes the empty label, OB, EB and S2R.
Further, in step S3, the process of evaluating the quality of a defect report comprises:
1) receiving the sentence list corresponding to the defect report;
2) assigning labels to each sentence in the sentence list;
3) counting the labels and judging whether the sentence list contains OB, EB and S2R simultaneously; if so, the defect report is evaluated as a high-quality defect report; if not, it is evaluated as a low-quality defect report.
Further, the method also comprises a step of verifying the reconstructed defect report:
preprocess the new defect report to form a corresponding sentence list, input it into the classification decision model, and evaluate the quality of the new defect report; if the new defect report is evaluated as a low-quality defect report, extract its description text as the report text and repeat steps S4-S5 until the new defect report is evaluated as a high-quality defect report.
The invention also discloses a defect report reconstruction system based on a generative large language model, comprising:
a data acquisition module for collecting and preprocessing an initial sentence set to form a processed sentence set, wherein the initial sentence set comprises a plurality of sentences and a plurality of labels and each sentence corresponds to at least one label, and for collecting a report set comprising a plurality of defect reports, each defect report comprising an ID, a title and a description text;
a pre-evaluation data processing module for performing data enhancement on the processed sentence set and merging the result with the initial sentence set to obtain an extended sentence set comprising a label set, and for preprocessing each defect report into a sentence list, where sentence lists correspond to defect reports one to one;
a defect report evaluation module for building a classification decision model based on an attention-mechanism pre-trained model and the extended sentence set, inputting the sentence list into the classification decision model, which assigns and counts labels from the label set, evaluating the quality of the corresponding defect report, and classifying it as a high-quality or low-quality defect report according to the evaluation result;
an input text construction module for obtaining a generative large language model and constructing its input text, comprising a prompt template and a report text, wherein the prompt template is obtained by manual writing followed by optimization with the generative large language model and the report text is extracted from the description text of the low-quality defect report; and
a defect report reconstruction module for inputting the input text into the generative large language model, reconstructing the low-quality defect report, and outputting a new defect report.
The beneficial effects are that: the invention has the following notable advantages. 1. High accuracy: a highly accurate classification decision model is trained from an attention-mechanism pre-trained model and the data-enhanced extended sentence set; meanwhile, low-quality defect reports are reconstructed with the generative large language model, whose extensive knowledge, guided by the prompt template, produces high-quality new defect reports. 2. Strong practicability: unlike traditional manual analysis and expansion of the original low-quality defect report with similar content, the invention uses the broad domain knowledge embedded in the generative large language model to automatically write high-quality defect reports suitable for software maintenance, reduces the cost of manually interpreting low-quality defect reports, provides a high-quality data basis for downstream automated software engineering tasks, and has a wider range of practical applications and higher efficiency.
Drawings
FIG. 1 is a general flow chart of the method of the present invention.
FIG. 2 is a specific flow chart for generating a new defect report.
Detailed Description
The invention is further elucidated below in connection with the drawings and the detailed description.
Referring to FIG. 1, the invention discloses a defect report reconstruction method based on a generative large language model, which comprises the following steps:
S1, collecting and preprocessing an initial sentence set to form a processed sentence set, wherein the initial sentence set comprises a plurality of sentences and a plurality of labels, and each sentence corresponds to at least one label; collecting a report set, wherein the report set comprises a plurality of defect reports, and each defect report comprises an ID, a title and a description text.
S2, performing data enhancement on the processed sentence set and merging the result with the initial sentence set to obtain an extended sentence set, wherein the extended sentence set comprises a label set; preprocessing each defect report to form a sentence list, wherein sentence lists correspond to defect reports one to one.
S3, building a classification decision model based on an attention-mechanism pre-trained model and the extended sentence set, inputting the sentence list into the classification decision model, which assigns labels from the label set to the sentences in the list and counts them, evaluating the quality of the corresponding defect report, and classifying it as a high-quality or low-quality defect report according to the evaluation result.
S4, obtaining a generative large language model and constructing its input text, wherein the input text comprises a prompt template and a report text; the prompt template is obtained by manual writing followed by optimization with the generative large language model, and the report text is extracted from the description text of the low-quality defect report.
S5, inputting the input text into the generative large language model, reconstructing the low-quality defect report, and outputting a new defect report.
The method is described in detail below. First, the initial sentence set and the report set are collected from open-source communities and defect tracking platforms (e.g., JIRA, Bugzilla, Mozilla) and from a public, manually labeled dataset of defect report sentences and sentence labels.
In step S1, the initial sentence set is preprocessed into the processed sentence set as follows: 1) delete duplicate sentences and their corresponding labels; 2) remove illegal characters and emphasis characters (characters that cannot be encoded in UTF-8, emoji, and the like) in the sentences, keeping the corresponding labels; 3) delete sentences with fewer than 5 or more than 1000 characters, together with their corresponding labels.
In step S2, each defect report is preprocessed into a sentence list as follows: 1) delete non-text data (e.g., hyperlinks, pictures) in the defect report; 2) remove illegal characters and emphasis characters in the defect report; 3) call the sent_tokenize function of NLTK to split the title and description text of the defect report into sentences of at most 1000 characters, and collect them into a sentence list. NLTK (Natural Language Toolkit) is a widely used Python natural language processing library that provides a general, easy-to-use set of tools as well as extensible and reusable modules and algorithms for different needs; its sent_tokenize function performs sentence boundary detection and sentence segmentation by combining statistics with heuristic rules, with a training algorithm at its core that learns the sentence characteristics of a given language.
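For illustration, a minimal sketch of this report preprocessing, assuming NLTK's sent_tokenize; the helper name report_to_sentence_list, the hyperlink and emoji patterns, and the handling of over-long sentences are assumptions, not details prescribed by the method:

```python
import re
from nltk.tokenize import sent_tokenize  # requires the NLTK 'punkt' data package

def report_to_sentence_list(title: str, description: str, max_len: int = 1000) -> list[str]:
    """Turn a defect report's title and description text into a sentence list (step S2)."""
    text = f"{title}. {description}"
    # 1) delete non-text data such as hyperlinks (pictures would be stripped upstream)
    text = re.sub(r"https?://\S+", " ", text)
    # 2) remove characters that cannot be encoded in UTF-8 and emoji-like symbols
    text = text.encode("utf-8", errors="ignore").decode("utf-8")
    text = re.sub(r"[\U0001F300-\U0001FAFF]", " ", text)
    # 3) sentence segmentation; keep only sentences within the 1000-character cap
    sentences = [s.strip() for s in sent_tokenize(text)]
    return [s for s in sentences if 0 < len(s) <= max_len]
```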
In step S2, the data enhancement comprises a random replacement operation, a random insertion operation, and a random position-swap operation, which alleviate class imbalance in the data and yield better training data. The specific process is as follows:
S21, tokenize all words in a sentence with the word_tokenize function of NLTK, turning each word of the sentence into a token. word_tokenize performs word boundary detection and word segmentation by combining statistics with heuristic rules. Tokenization is also one of the de-identification techniques: plaintext data is replaced one-to-one by corresponding pseudonym tokens, tokens circulate efficiently in downstream applications, and because tokens and plaintext correspond one to one, tokens can be transmitted, exchanged, stored and used in place of plaintext in most scenarios.
S22, the random replacement operation: randomly select n tokens from a sentence, obtain the synonym set of each selected token from the WordNet database, and randomly choose a synonym from the set to replace the corresponding token, where n = 0.01 × token_num and token_num is the number of tokens in the sentence. WordNet is a large-scale English lexical database that organizes English words and phrases by meaning and establishes various relations between meanings; synonyms are organized into synsets.
The random insertion operation: randomly select n tokens from a sentence, obtain the synonym set of each selected token from the WordNet database, randomly choose a synonym from the set, and insert it at a randomly chosen position in the sentence.
The random position-swap operation: randomly select two tokens in a sentence and swap their positions.
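A minimal sketch of the three augmentation operations, assuming NLTK's word_tokenize and WordNet corpus interface; the rounding of n and the handling of tokens without synonyms are assumptions:

```python
import random
from nltk.tokenize import word_tokenize   # requires the NLTK 'punkt' data package
from nltk.corpus import wordnet           # requires the NLTK 'wordnet' data package

def _synonym(token: str) -> str | None:
    """Return a random WordNet synonym of token, or None if it has no synonyms."""
    lemmas = {l.name().replace("_", " ") for s in wordnet.synsets(token) for l in s.lemmas()}
    lemmas.discard(token)
    return random.choice(sorted(lemmas)) if lemmas else None

def random_replace(sentence: str, ratio: float = 0.01) -> str:
    """Replace n = ratio * token_num randomly chosen tokens with WordNet synonyms."""
    tokens = word_tokenize(sentence)
    if not tokens:
        return sentence
    n = max(1, int(ratio * len(tokens)))              # rounding of n is an assumption
    for i in random.sample(range(len(tokens)), min(n, len(tokens))):
        syn = _synonym(tokens[i])
        if syn:
            tokens[i] = syn
    return " ".join(tokens)

def random_insert(sentence: str, ratio: float = 0.01) -> str:
    """Insert synonyms of n randomly chosen tokens at random positions."""
    tokens = word_tokenize(sentence)
    if not tokens:
        return sentence
    for _ in range(max(1, int(ratio * len(tokens)))):
        syn = _synonym(random.choice(tokens))
        if syn:
            tokens.insert(random.randrange(len(tokens) + 1), syn)
    return " ".join(tokens)

def random_swap(sentence: str) -> str:
    """Swap the positions of two randomly chosen tokens."""
    tokens = word_tokenize(sentence)
    if len(tokens) >= 2:
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)
```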
Notably, during data enhancement the labels of each new sentence remain consistent with the labels of the original sentence sample. In this embodiment, in step S1 the labels include an empty label, and in step S2 the label set includes the empty label, OB, EB and S2R.
In step S3, building the classification decision model means modeling with the attention-based pre-trained model BERT and the extended sentence set so as to assign zero or more labels from the label set {OB, EB, S2R} to the sentences in a sentence list, training the classification decision model, and verifying its validity. The steps are as follows:
S31, data preparation: to prevent the trained classification decision model from overfitting, one part of the extended sentence set is used as a training set and the other part as a validation set. The extended sentence set is obtained by merging the augmented data with the initial sentence set, which enlarges the dataset and helps the classification decision model learn more useful and comprehensive knowledge.
S32, building an initial model: obtain a pre-trained model based on the attention mechanism, set the task as a multi-label classification task, set the loss function of the multi-label classification task to BCEWithLogitsLoss, learn knowledge from the extended sentence set, and assign labels to the sentences in the sentence list corresponding to a defect report according to that knowledge. The formula of BCEWithLogitsLoss is:
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}\Big[\,y_{i,c}\log\sigma(x_{i,c}) + (1-y_{i,c})\log\big(1-\sigma(x_{i,c})\big)\Big]$$
where $\sigma$ denotes the Sigmoid function, $N$ is the number of samples, $C$ is the number of labels, $x_{i,c}$ is the model's raw output (logit) for label $c$ of the $i$-th sample, and $y_{i,c}\in\{0,1\}$ is the true value of label $c$ for the $i$-th sample.
An existing way to complete the multi-label classification task is to train three binary classifiers, one yes/no classifier for each of OB, EB and S2R, and concatenate their outputs. The result is that multiple models must be trained, and feeding multi-label samples into binary classifiers that assign single labels does not match the characteristics of the extended sentence set, which degrades model performance. In contrast, completing the multi-label sentence classification task with the pre-trained model BERT requires training only one model, makes full use of all labels in the label set during training, and yields knowledge that better fits the extended sentence set. Loss functions commonly used for multi-label classification are Binary Cross Entropy and BCEWithLogitsLoss, which compute the probability of each label simultaneously and allow a sentence to belong to multiple labels. BCEWithLogitsLoss, used in this embodiment, computes the loss directly on the raw output of the model without first applying the Sigmoid function, which improves computational efficiency and avoids the numerical instability that the Sigmoid function can introduce.
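For illustration, a minimal sketch of such a multi-label setup using the Hugging Face transformers implementation of BERT; the model checkpoint, label order, threshold and hyperparameters are assumptions, not values fixed by the patent:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

LABELS = ["OB", "EB", "S2R"]   # the empty label corresponds to none of the three being assigned

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",   # the library then uses BCEWithLogitsLoss
)

def classify_sentence(sentence: str, threshold: float = 0.5) -> list[str]:
    """Assign zero or more of OB/EB/S2R to one sentence (inference sketch)."""
    inputs = tokenizer(sentence, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # raw outputs; the loss works on these directly
    probs = torch.sigmoid(logits).squeeze(0)       # Sigmoid is only applied at inference time
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

# During training, labels are float multi-hot vectors; e.g. a sentence tagged OB and S2R:
#   labels = torch.tensor([[1.0, 0.0, 1.0]])
#   loss = model(**inputs, labels=labels).loss     # BCEWithLogitsLoss on the raw logits
```

With problem_type set to multi_label_classification, the library applies BCEWithLogitsLoss to the raw logits during training, which matches the loss described above.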
S33, training the initial model to obtain the classification decision model: train the initial model with the training set, validate it with the validation set, and stop training once the metric constraints are satisfied, obtaining the classification decision model. The metrics include accuracy, precision, recall and F1 score. Accuracy is the ratio of correctly predicted labels to the total number of labels. Precision is the ratio of the number of samples correctly predicted as positive to the number of all samples predicted as positive. Recall is the ratio of the number of samples correctly predicted as positive to the number of samples that are actually positive; in many scenarios it is not enough for predictions to be accurate, the model must also identify as many positives as possible, which is why recall, i.e., the ability to recognize positive samples, is evaluated. The F1 score is the harmonic mean of precision and recall; it takes both into account and evaluates the model more comprehensively.
These evaluation metrics measure the performance of the model on each label as well as its overall performance across multiple labels. The formulas are as follows:
$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},\qquad \text{Precision}=\frac{TP}{TP+FP},$$
$$\text{Recall}=\frac{TP}{TP+FN},\qquad F1=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}},$$
where TP (true positive) is the number of samples predicted positive and actually positive; TN (true negative) is the number predicted negative and actually negative; FP (false positive) is the number predicted positive but actually negative; and FN (false negative) is the number predicted negative but actually positive.
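As a non-authoritative illustration, multi-label versions of these metrics can be computed with scikit-learn; the micro averaging scheme is an assumption, since the patent does not state how scores are aggregated across labels:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Multi-hot ground truth and predictions for four sentences over [OB, EB, S2R]
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1], [0, 1, 0]])

# Accuracy as defined above: correctly predicted labels over the total number of labels
accuracy = (y_true == y_pred).mean()

# Micro-averaged precision, recall and F1 over all labels (averaging scheme assumed)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```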
In step S3, the process of evaluating the quality of a defect report is as follows:
1) receive the sentence list corresponding to the defect report;
2) assign labels to each sentence in the sentence list;
3) count the labels and judge whether the sentence list contains OB, EB and S2R simultaneously; if so, the defect report is evaluated as a high-quality defect report; if not, it is evaluated as a low-quality defect report.
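A minimal sketch of this evaluation rule, reusing the hypothetical classify_sentence helper from the BERT sketch above:

```python
def evaluate_report(sentence_list: list[str]) -> str:
    """A defect report is high quality only if OB, EB and S2R all occur among its sentences."""
    assigned: set[str] = set()
    for sentence in sentence_list:
        assigned.update(classify_sentence(sentence))  # hypothetical helper from the BERT sketch
    return "high-quality" if {"OB", "EB", "S2R"} <= assigned else "low-quality"
```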
In step S4, the process of creating the prompt template is as follows:
S41, manually write an initial prompt template.
The prompt template includes role-assignment text, the definitions and requirements of the labels (OB, EB, S2R), and a task description. The role-assignment text assigns a role to the generative large language model and guides it to output knowledge that better fits that role. The definitions and requirements of the labels help the generative large language model understand the key information in the task. The task description lets the generative large language model understand the task and constrains the range of its output. In this embodiment, the role-assignment text is: "Your role is a senior software engineer with rich experience in the software testing and software maintenance areas." The definitions and requirements of the labels are: OB (observed behavior): relevant software behaviors, actions, outputs or results; uninformative sentences such as "the system is not active" are not considered OB. EB (expected behavior): if a sentence contains phrases about what the software should do or what is expected to happen, such as "should ..." or "expected ...", it should be considered EB; proposals or suggestions for fixing the bug are not considered EB. S2R (steps to reproduce): if a sentence potentially contains user actions or operations, it should be considered S2R; phrases such as "to reproduce", "steps to reproduce" or "follow these steps" on their own are not considered S2R. In this embodiment, the task description is: "You should infer appropriate details from the context and complete the bug report with clear OB/EB/S2R. Where possible, you should improve the existing wording of the OB, EB and S2R statements to make them clearer. For S2R you should give steps that are as clear, accurate and complete as possible."
S42, input the initial prompt template into the generative large language model for summarization and optimization, and output an optimized prompt template. Summarizing and optimizing the manually written initial prompt template with the generative large language model helps the model understand the task and the related definitions and requirements described in the prompt template more comprehensively and accurately.
S43, verify the optimized prompt template and, once it meets the task requirements, output it as the final prompt template. The verification is performed manually.
In addition, the invention further comprises a step of verifying the reconstructed defect report:
preprocess the new defect report output in step S5 to form a corresponding sentence list, input it into the classification decision model, and evaluate the quality of the new defect report; if the new defect report is evaluated as a low-quality defect report, extract its description text as the report text and repeat steps S4-S5 until the new defect report is evaluated as a high-quality defect report.
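Putting the pieces together, a minimal sketch of this verify-and-reconstruct loop, reusing the hypothetical helpers from the earlier sketches (report_to_sentence_list, evaluate_report, build_input_text) and taking the generative model call as a caller-supplied function:

```python
from typing import Callable

def reconstruct_until_high_quality(
    title: str,
    description: str,
    llm_generate: Callable[[str], str],   # caller supplies the generative LLM interface
    max_rounds: int = 3,                  # iteration bound is an assumption, not from the patent
) -> str:
    """Reconstruct a low-quality defect report until the classifier rates it high quality."""
    report_text = description
    for _ in range(max_rounds):
        sentences = report_to_sentence_list(title, report_text)   # hypothetical helpers from
        if evaluate_report(sentences) == "high-quality":          # the earlier sketches
            return report_text                                    # contains OB, EB and S2R
        prompt = build_input_text(report_text)
        report_text = llm_generate(prompt)
    return report_text
```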
The invention also discloses a defect report reconstruction system based on the generative large language model, comprising:
a data acquisition module: collects and preprocesses an initial sentence set to form a processed sentence set, wherein the initial sentence set comprises a plurality of sentences and a plurality of labels and each sentence corresponds to at least one label; collects a report set, wherein the report set comprises a plurality of defect reports and each defect report comprises an ID, a title and a description text;
a pre-evaluation data processing module: performs data enhancement on the processed sentence set and merges the result with the initial sentence set to obtain an extended sentence set comprising a label set; preprocesses each defect report into a sentence list, where sentence lists correspond to defect reports one to one;
a defect report evaluation module: builds a classification decision model based on an attention-mechanism pre-trained model and the extended sentence set, inputs the sentence list into the classification decision model, which assigns and counts labels from the label set, evaluates the quality of the corresponding defect report, and classifies it as a high-quality or low-quality defect report according to the evaluation result;
an input text construction module: obtains a generative large language model and constructs its input text, comprising a prompt template and a report text, wherein the prompt template is obtained by manual writing followed by optimization with the generative large language model and the report text is extracted from the description text of the low-quality defect report; and
a defect report reconstruction module: inputs the input text into the generative large language model, reconstructs the low-quality defect report, and outputs a new defect report.

Claims (10)

1. A method of defect report reconstruction based on a generative large language model, the method comprising the steps of:
S1, collecting and preprocessing an initial sentence set to form a processed sentence set, wherein the initial sentence set comprises a plurality of sentences and a plurality of labels, and each sentence corresponds to at least one label; collecting a report set, wherein the report set comprises a plurality of defect reports, and each defect report comprises an ID, a title and a description text;
S2, performing data enhancement on the processed sentence set and merging the result with the initial sentence set to obtain an extended sentence set, wherein the extended sentence set comprises a label set; preprocessing each defect report to form a sentence list, wherein sentence lists correspond to defect reports one to one;
S3, building a classification decision model based on an attention-mechanism pre-trained model and the extended sentence set, inputting the sentence list into the classification decision model, which assigns labels from the label set to the sentences in the list and counts them, evaluating the quality of the corresponding defect report, and classifying it as a high-quality or low-quality defect report according to the evaluation result;
S4, obtaining a generative large language model and constructing its input text, wherein the input text comprises a prompt template and a report text; the prompt template is obtained by manual writing followed by optimization with the generative large language model, and the report text is extracted from the description text of the low-quality defect report;
S5, inputting the input text into the generative large language model, reconstructing the low-quality defect report, and outputting a new defect report.
2. The method of defect report reconstruction based on the generative large language model of claim 1, wherein in step S1, the initial sentence set is preprocessed into the processed sentence set as follows: 1) delete duplicate sentences and their corresponding labels; 2) remove illegal characters and emphasis characters in the sentences, keeping the corresponding labels; 3) delete sentences with fewer than 5 or more than 1000 characters, together with their corresponding labels.
3. The method of defect report reconstruction based on the generative large language model of claim 1, wherein in step S2, each defect report is preprocessed into a sentence list as follows: 1) delete non-text data in the defect report; 2) remove illegal characters and emphasis characters in the defect report; 3) call the sent_tokenize function of NLTK to split the title and description text of the defect report into sentences of at most 1000 characters, and collect them into a sentence list.
4. The method of defect report reconstruction based on the generative large language model of claim 2, wherein in step S2, the data enhancement comprises a random replacement operation, a random insertion operation, and a random position-swap operation.
5. The method of defect report reconstruction based on the generative large language model of claim 1, wherein in step S3, building the classification decision model comprises the following steps:
S31, data preparation: one part of the extended sentence set is used as a training set and the other part as a validation set;
S32, building an initial model: obtain a pre-trained model based on the attention mechanism, set the task as a multi-label classification task, set the loss function of the multi-label classification task, learn knowledge from the extended sentence set, and assign labels to the sentences in the sentence list corresponding to each defect report according to that knowledge;
S33, training the initial model to obtain the classification decision model: train the initial model with the training set, validate it with the validation set, and stop training once the metric constraints are satisfied, obtaining the classification decision model.
6. The method of defect report reconstruction based on a generative large language model of claim 5, wherein the metrics comprise accuracy, precision, recall, and F1 score.
7. The method of defect report reconstruction based on the generative large language model of claim 1, wherein in step S1, the labels include an empty label, and in step S2, the label set includes the empty label, OB, EB and S2R.
8. The method of defect report reconstruction based on the generative large language model of claim 7, wherein in step S3, the process of evaluating the quality of the defect report comprises:
1) receiving the sentence list corresponding to the defect report;
2) assigning labels to each sentence in the sentence list;
3) counting the labels and judging whether the sentence list contains OB, EB and S2R simultaneously; if so, the defect report is evaluated as a high-quality defect report; if not, it is evaluated as a low-quality defect report.
9. The method of defect report reconstruction based on the generative large language model of claim 1, further comprising a step of verifying the reconstructed defect report:
preprocessing the new defect report to form a corresponding sentence list, inputting it into the classification decision model, and evaluating the quality of the new defect report; if the new defect report is evaluated as a low-quality defect report, extracting its description text as the report text and repeating steps S4-S5 until the new defect report is evaluated as a high-quality defect report.
10. A system for defect report reconstruction based on a generative large language model, the system comprising:
a data acquisition module for collecting and preprocessing an initial sentence set to form a processed sentence set, wherein the initial sentence set comprises a plurality of sentences and a plurality of labels, and each sentence corresponds to at least one label, and for collecting a report set, wherein the report set comprises a plurality of defect reports and each defect report comprises an ID, a title and a description text;
a pre-evaluation data processing module for performing data enhancement on the processed sentence set and merging the result with the initial sentence set to obtain an extended sentence set comprising a label set, and for preprocessing each defect report into a sentence list, wherein sentence lists correspond to defect reports one to one;
a defect report evaluation module for building a classification decision model based on an attention-mechanism pre-trained model and the extended sentence set, inputting the sentence list into the classification decision model, which assigns and counts labels from the label set, evaluating the quality of the corresponding defect report, and classifying it as a high-quality or low-quality defect report according to the evaluation result;
an input text construction module for obtaining a generative large language model and constructing its input text, wherein the input text comprises a prompt template and a report text, the prompt template being obtained by manual writing followed by optimization with the generative large language model and the report text being extracted from the description text of the low-quality defect report; and
a defect report reconstruction module for inputting the input text into the generative large language model, reconstructing the low-quality defect report, and outputting a new defect report.
CN202311420321.2A 2023-10-27 2023-10-27 Defect report reconstruction method and system based on large language model Pending CN117421226A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311420321.2A CN117421226A (en) 2023-10-27 2023-10-27 Defect report reconstruction method and system based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311420321.2A CN117421226A (en) 2023-10-27 2023-10-27 Defect report reconstruction method and system based on large language model

Publications (1)

Publication Number Publication Date
CN117421226A true CN117421226A (en) 2024-01-19

Family

ID=89528001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311420321.2A Pending CN117421226A (en) 2023-10-27 2023-10-27 Defect report reconstruction method and system based on large language model

Country Status (1)

Country Link
CN (1) CN117421226A (en)

Similar Documents

Publication Publication Date Title
US7685082B1 (en) System and method for identifying, prioritizing and encapsulating errors in accounting data
CN112070138B (en) Construction method of multi-label mixed classification model, news classification method and system
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
US20220300820A1 (en) Ann-based program testing method, testing system and application
CN113138920B (en) Software defect report allocation method and device based on knowledge graph and semantic role labeling
Fong et al. What did they do? deriving high-level edit histories in wikis
CN114492460B (en) Event causal relationship extraction method based on derivative prompt learning
CN110888989A (en) Intelligent learning platform and construction method thereof
CN115757695A (en) Log language model training method and system
CN117648931A (en) Code examination method, device, electronic equipment and medium
CN114969334B (en) Abnormal log detection method and device, electronic equipment and readable storage medium
CN115936003A (en) Software function point duplicate checking method, device, equipment and medium based on neural network
CN116451646A (en) Standard draft detection method, system, electronic equipment and storage medium
CN117421226A (en) Defect report reconstruction method and system based on large language model
CN115292167A (en) Life cycle prediction model construction method, device, equipment and readable storage medium
CN115422349A (en) Hierarchical text classification method based on pre-training generation model
CN110442862B (en) Data processing method and device based on recruitment information
CN114154637A (en) Knowledge point automatic labeling modeling method and system
CN113722421A (en) Contract auditing method and system and computer readable storage medium
CN116485597B (en) Standardized training method based on post capability model
CN117540727B (en) Subjective question scoring method and system based on ALBERT model and RPA technology
Shaxzodovna et al. Prospects For Using Artificial Intelligence Technologies In Document Automation Systems
CN118095299A (en) Translation test identification evaluation analysis method and system based on big data
Mehta et al. Semantic Resume Evaluation: Leveraging XLnet for Efficient Candidate Screening
Shibata et al. Evaluation of Automatic Collaborative Learning Process Coding Using Deep Learning Methods Based on Multi-Dimensional Coding Scheme

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination