CN113378907B

CN113378907B - Automated software traceability recovery method for enhancing data preprocessing process

Info

Publication number: CN113378907B
Application number: CN202110626138.2A
Authority: CN
Inventors: 陈静; 张贺; 董黎明; 匡宏宇; 荣国平; 邵栋
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-06-04
Filing date: 2021-06-04
Publication date: 2024-01-09
Anticipated expiration: 2041-06-04
Also published as: CN113378907A

Abstract

The invention discloses an automated software traceability recovery method that enhances the data preprocessing process, comprising the following steps: selecting the products whose tracking relationship is to be recovered, extracting the relevant fields of the products, performing data cleaning and feature engineering, and obtaining a sample data set; dividing the sample data set into a marked data set and a missing tracking data set by a label marking method; dividing the marked data set into a marked training set and a test set by a four-fold time sequence verification method; combining the marked training set and the missing tracking data set under a semi-supervised unbalanced learning framework to generate a new training set; balancing the training set with a plurality of resampling methods, training a classification model, evaluating the performance of the classification model, and recovering the tracking relationships among products. By enhancing the data preprocessing process, the method addresses problems such as the large number of project products, poor data quality and unbalanced sample data, and greatly improves the F1 value, precision and recall.

Description

Automated software traceability recovery method for enhancing data preprocessing process

Technical Field

The invention relates to the field of computer technology, and in particular to an automated software traceability recovery method that enhances the data preprocessing process.

Background

Software traceability is the ability to associate any uniquely identifiable software product (artifact) with other products, and to maintain and use the resulting network of links to answer questions about the software product and its development process. Software traceability techniques aim to create or maintain tracking relationships between different products, which helps to improve the quality of process-oriented data. However, software traceability is a difficult and error-prone task. The main difficulty lies in bridging the gap in logical abstraction between requirements written in natural language and code written in a programming language. At the same time, given the huge number of products and of potential tracking relationships between them, the cost and effort required to recover these relationships manually are very high. In addition, quality problems in traceability data in practice, such as missing, redundant or ambiguous trace paths and invalid tracking relationships, make software traceability even harder. Therefore, how to automatically recover the tracking relationships among software products has become a research hotspot. Academia has produced many results in the field of automated software traceability recovery, but these methods concentrate on open source projects and have not been widely applied in industrial scenarios.
Enterprise projects involve a huge number of products and of potential tracking relationships among them, and the positive/negative label marking strategy used for binary classification models causes severe imbalance in the data samples. Moreover, enterprises tend to adopt ad hoc measures rather than a long-term, stable strategy for maintaining the tracking relationships among products. For these reasons, automated traceability recovery methods developed for open source projects cannot be applied directly to enterprise projects.

In view of the above problems, an automated software traceability recovery method that enhances the data preprocessing process is provided. The method applies four enhanced data preprocessing measures, namely a label marking method, a four-fold time sequence verification method, a semi-supervised unbalanced learning framework and a plurality of resampling methods. These measures reduce the noise of the sample data set, effectively alleviate the severe sample imbalance found in enterprise projects, and improve the robustness and generalization capability of the classification model. Compared with a model built by an automated software traceability recovery method designed for open source projects, the method markedly improves precision, recall and F1 value.

Disclosure of Invention

The present invention is directed to an automated software traceability recovery method for enhancing a data preprocessing process, so as to solve the problems set forth in the background art.

In order to solve the technical problems, the invention provides the following technical scheme: an automated software traceability recovery method for enhancing a data preprocessing process, the method comprising the steps of:

S100: determining a software product A and a software product B whose tracking relationship is to be recovered, extracting the relevant fields of software product A and software product B from a software warehouse, and cleaning the data of the relevant fields, wherein the data cleaning comprises outlier processing, missing value processing and consistency processing;

S200: performing candidate link pair matching on the cleaned data, then performing feature engineering and label marking to obtain a sample data set;

S300: dividing the sample data set into a marked data set and a missing tracking data set based on the label marking result of step S200;

S400: dividing the marked data set obtained in step S300 into a marked data training set and a marked data test set based on a four-fold time sequence verification method;

S500: combining the marked data training set obtained in step S400 and the missing tracking data set obtained in step S300 to generate a new training set;

S600: obtaining a final binary classification model Cfinal from the new training set generated in step S500, based on the semi-supervised unbalanced learning framework;

S700: resampling the new training set of step S500 based on a resampling method to obtain a resampled training set;

S800: training the binary classification model Cfinal obtained in step S600 on the resampled training set obtained in step S700;

S900: evaluating the performance of the binary classification model Cfinal obtained in step S600 using the marked data test set of step S400;

S1000: recovering the tracking relationships among the software products using the binary classification model Cfinal.

Further, S200 includes the following steps:

S201: matching candidate link pairs between software product A and software product B based on the isCandidate function, whose formula is as follows:

isCandidate(ai, bi) = created(ai) ≤ committed(bi) ≤ resolved(ai) + η

where ai refers to the i-th specific software product within software product A; created(ai) refers to the creation time of ai; resolved(ai) refers to the closing time of ai;

bi refers to the i-th specific software product within software product B; committed(bi) refers to the commit time of bi; η is the average of the absolute values of the differences between the closing time of software product A and the commit time of software product B;

S202: for each candidate link pair generated in step S201, if a product number association between ai and bi is found on the platform, a valid tracking relationship is considered to exist between ai and bi;

S203: marking the label value of a candidate link pair with a valid tracking relationship as 1, and otherwise as 0;

S204: forming a sample data set from the marked software product A and software product B;

cleaning the data of the relevant fields helps to check and correct errors in the initially collected fields and improves the accuracy of candidate link pair matching in step S200, so that the candidate link pairs produced by the isCandidate function are more representative of the data; combining the candidate link pairs with the product number association to judge tracking relationships strengthens the data processing and benefits the subsequent processing steps.
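Steps S201-S204 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary keys and the toy data are hypothetical, and η is assumed to be averaged over pairs that are already known to be linked, which the text does not state explicitly.

```python
from datetime import datetime, timedelta


def mean_abs_gap(known_links):
    """eta: mean of |resolved(a) - committed(b)| over (resolved, committed) pairs.

    Assumption: the average is taken over pairs already known to be linked.
    """
    gaps = [abs(resolved - committed) for resolved, committed in known_links]
    return sum(gaps, timedelta()) / len(gaps)


def is_candidate(created_a, resolved_a, committed_b, eta):
    """isCandidate(a, b) = created(a) <= committed(b) <= resolved(a) + eta."""
    return created_a <= committed_b <= resolved_a + eta


def label_pairs(items_a, items_b, eta, linked_ids):
    """Return (a_id, b_id, label) for every candidate pair (steps S201-S204)."""
    samples = []
    for a in items_a:
        for b in items_b:
            if is_candidate(a["created"], a["resolved"], b["committed"], eta):
                # Label 1 when a product number association is recorded, else 0.
                label = 1 if (a["id"], b["id"]) in linked_ids else 0
                samples.append((a["id"], b["id"], label))
    return samples


# Hypothetical toy data: one requirement and three commits.
reqs = [{"id": "R1", "created": datetime(2021, 1, 1), "resolved": datetime(2021, 1, 10)}]
commits = [
    {"id": "C1", "committed": datetime(2021, 1, 5)},
    {"id": "C2", "committed": datetime(2021, 1, 11)},
    {"id": "C3", "committed": datetime(2021, 2, 1)},
]
eta = mean_abs_gap([
    (datetime(2021, 1, 10), datetime(2021, 1, 8)),
    (datetime(2021, 1, 10), datetime(2021, 1, 12)),
])
samples = label_pairs(reqs, commits, eta, linked_ids={("R1", "C1")})
```

In this toy run, C3 falls outside the time window and is not paired at all, while C2 is a candidate without a recorded association and therefore receives label 0.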

Further, S300 includes the following steps:

S301: if a tracking relationship exists between a specific product ai of the software product A to be recovered and any specific product bi of software product B, all candidate link pairs in the sample data set that contain ai are assigned to the marked data set;

S302: if no tracking relationship exists between a specific product ai of the software product A to be recovered and any of the specific products bi of software product B, all candidate link pairs in the sample data set that contain ai are assigned to the missing tracking data set;

dividing the sample data set in this way allows the subsets, which carry different meanings, to be processed separately, so that more targeted processing and utilization methods can be chosen in subsequent steps.
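The split in S301-S302 can be sketched as below, assuming each sample is a hypothetical `(a_id, b_id, label)` tuple. Every pair belonging to a product with at least one positive label goes to the marked set; the rest go to the missing tracking set.

```python
def split_by_trace(samples):
    """Split (a_id, b_id, label) samples into marked and missing-trace sets."""
    # Products of type A that have at least one valid tracking relationship.
    traced = {a_id for a_id, _, label in samples if label == 1}
    marked = [s for s in samples if s[0] in traced]
    missing = [s for s in samples if s[0] not in traced]
    return marked, missing


# Hypothetical toy data: R1 has one recorded link, R2 has none.
samples = [
    ("R1", "C1", 1),
    ("R1", "C2", 0),
    ("R2", "C2", 0),
    ("R2", "C3", 0),
]
marked, missing = split_by_trace(samples)
```

Note that the label-0 pair of R1 stays in the marked set as a negative example, whereas the label-0 pairs of R2 are treated as unlabeled.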

Further, S400 includes the following steps:

S401: sorting the marked data set and the missing tracking data set obtained in step S300 from earliest to latest by the commit time of the specific product bi of software product B in the candidate link pairs generated in step S200, and dividing the sorted marked data set equally into five parts;

S402: in the k-th cycle, the first k parts are used as the training set and the (k+1)-th part is used as the test set, where k ∈ {1, 2, 3, 4};

this segmentation makes full, repeated use of the acquired data and helps to alleviate the sample imbalance problem when the sample data is used.
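The four-fold time sequence split of S401-S402 can be sketched as follows. The function name and the `committed` key are hypothetical, and assigning any remainder to the last part is an assumption, since the text only says the data is divided equally.

```python
def four_fold_time_series(samples, time_key, parts=5):
    """Yield (train, test) pairs: the first k parts train, part k+1 tests."""
    ordered = sorted(samples, key=time_key)
    size = len(ordered) // parts
    # Assumption: any remainder is appended to the last part.
    chunks = [ordered[i * size:(i + 1) * size] for i in range(parts - 1)]
    chunks.append(ordered[(parts - 1) * size:])
    for k in range(1, parts):
        train = [s for chunk in chunks[:k] for s in chunk]
        yield train, chunks[k]


# Hypothetical toy data: ten samples with integer commit timestamps.
samples = [{"id": i, "committed": t} for i, t in enumerate([7, 3, 9, 1, 5, 0, 8, 2, 6, 4])]
folds = list(four_fold_time_series(samples, time_key=lambda s: s["committed"]))
```

Because the data is sorted before splitting, every test part is strictly later in time than its training parts, so the model is never evaluated on data older than what it was trained on.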

Further, S600 includes the following steps:

S601: training an initial binary classification model Cint with a classification algorithm, using the marked data training set (Xlabel, ylabel) in the new training set;

S602: using the initial classification model Cint to predict labels for the missing tracking data set Xunlabel in the new training set, calculating the label prediction probability with the predict_proba() function of a Python tool library, and marking each predicted label whose prediction probability is greater than a threshold as a pseudo label yunlabel, to form an intermediate training set (Xunlabel, yunlabel);

S603: combining the intermediate training set with the marked data training set in the new training set, and retraining the initial classification model of step S601 with the classification algorithm selected in step S601;

S604: repeating steps S601-S603 until no more labels with a prediction probability greater than the threshold appear in the missing tracking data set in the new training set, thereby obtaining the final classification model Cfinal;

through the above steps, the robustness and generalization capability of the classification model can be effectively improved.
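The pseudo-labeling loop of S601-S604 can be sketched as below. It works with any classifier that offers scikit-learn-style `fit`, `predict_proba` and `classes_`. The `CentroidClassifier` is a toy stand-in written for this illustration only, and removing pseudo-labeled samples from the unlabeled pool after each round is an assumption made so that the loop terminates.

```python
class CentroidClassifier:
    """Toy stand-in for a real classifier; uses the first feature only."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.centroids_ = [
            sum(x[0] for x, t in zip(X, y) if t == c) / sum(1 for t in y if t == c)
            for c in self.classes_
        ]
        return self

    def predict_proba(self, X):
        rows = []
        for x in X:
            # Inverse-distance weights, normalized to sum to 1.
            w = [1.0 / (1e-9 + abs(x[0] - m)) for m in self.centroids_]
            rows.append([v / sum(w) for v in w])
        return rows


def self_train(make_model, X_label, y_label, X_unlabel, threshold=0.99):
    """Return (final model, samples left unlabeled) after pseudo-labeling."""
    X_train, y_train = list(X_label), list(y_label)
    pool = list(X_unlabel)
    model = make_model().fit(X_train, y_train)          # S601
    while pool:
        confident, remaining = [], []
        for x, p in zip(pool, model.predict_proba(pool)):  # S602
            best = max(range(len(p)), key=lambda c: p[c])
            if p[best] > threshold:
                confident.append((x, model.classes_[best]))
            else:
                remaining.append(x)
        if not confident:                                # S604 stop condition
            break
        X_train += [x for x, _ in confident]             # S603: merge and retrain
        y_train += [y for _, y in confident]
        pool = remaining
        model = make_model().fit(X_train, y_train)
    return model, pool


model, leftover = self_train(
    CentroidClassifier,
    X_label=[[0.0], [10.0]], y_label=[0, 1],
    X_unlabel=[[0.01], [9.99], [5.0]],
)
```

In this toy run the two samples close to a class centroid receive pseudo labels in the first round, while the ambiguous midpoint never exceeds the threshold and remains unlabeled.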

Further, S700 includes the following steps:

S701: balancing the training set by using an undersampling method;

S702: balancing the training set by using an oversampling method;

S703: balancing the training set by using a comprehensive sampling method;

the above process reduces the noise of the training set and effectively alleviates the severe sample imbalance problem.
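The idea behind S701 and S702 can be illustrated with plain random undersampling and oversampling. These are deliberately simpler techniques than the specific resampling methods named in the embodiment, and the function names are hypothetical; the sketch only shows how the class counts are equalized.

```python
import random
from collections import Counter


def _by_class(X, y):
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    return groups


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _by_class(X, y)
    target = min(len(g) for g in groups.values())
    out = [(x, label) for label, g in groups.items() for x in rng.sample(g, target)]
    rng.shuffle(out)
    return [x for x, _ in out], [label for _, label in out]


def random_oversample(X, y, seed=0):
    """Randomly duplicate samples so every class matches the largest class."""
    rng = random.Random(seed)
    groups = _by_class(X, y)
    target = max(len(g) for g in groups.values())
    out = []
    for label, g in groups.items():
        extra = [rng.choice(g) for _ in range(target - len(g))]
        out += [(x, label) for x in g + extra]
    rng.shuffle(out)
    return [x for x, _ in out], [label for _, label in out]


# Hypothetical imbalanced toy data: 2 positive and 8 negative samples.
X = [[i] for i in range(10)]
y = [1, 1] + [0] * 8
X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
```

A comprehensive (combined) sampling method in the sense of S703 would apply an oversampling step followed by a cleaning or undersampling step; that combination is not shown here.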

Compared with the prior art, the invention has the following beneficial effects: by enhancing the data preprocessing process, the method addresses problems such as the large number of project products, poor data quality and unbalanced sample data through a variety of enhancement measures, and is particularly suitable for enterprise projects; compared with automated traceability recovery methods based on open source projects, the F1 value, precision and recall are greatly improved.

Drawings

The accompanying drawings provide a further understanding of the invention and constitute a part of this specification; together with the embodiments of the invention, they serve to explain the invention. In the drawings:

FIG. 1 is a flow chart of the automated software traceability recovery method that enhances the data preprocessing process;

FIG. 2 shows the extracted fields related to the requirement (demand) product in an embodiment of the method;

FIG. 3 shows the extracted fields related to the defect product in an embodiment of the method;

FIG. 4 shows the extracted fields related to the code submission product in an embodiment of the method;

FIG. 5 shows the features of a demand-code submission candidate pair in an embodiment of the method;

FIG. 6 shows the performance evaluation results of the demand-code submission binary classification model in an embodiment of the method;

FIG. 7 shows the performance evaluation results of the defect-code submission binary classification model in an embodiment of the method;

FIG. 8 is a schematic diagram of the four-fold time sequence verification method in an embodiment of the method;

FIG. 9 is a schematic diagram of the semi-supervised unbalanced learning framework in an embodiment of the method.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

This embodiment mainly describes recovering the tracking relationships between requirements (demands) and code submissions, and between defects and code submissions, in order to support the successful adoption of software process simulation modeling in enterprises. Software process simulation modeling has attracted wide attention as a means of helping software enterprises achieve quantitative management and improve software process maturity. However, when research results in software process simulation modeling are applied in enterprise practice, they face significant process data problems. The data recorded in the software warehouse is mainly attribute information of products rather than behavior information of the process, so real process data is difficult to obtain. In addition, enterprise data suffers from quality problems such as low accuracy and poor integrity, of which incomplete tracking relationships among products are particularly prominent. How to effectively recover the tracking relationships among products, so as to improve data quality and support the acquisition of real process data, is a problem that urgently needs to be solved. The automated software traceability recovery method based on an enhanced data preprocessing process mines the data in enterprise projects and constructs binary classification models to recover the tracking relationships among products, thereby supporting the acquisition of real process data.

Referring to fig. 1-9, the present invention provides the following technical solutions: an automated software traceability recovery method for enhancing a data preprocessing process comprises the following steps:

the steps S100-S200 are as follows:

determining a software product A and a software product B whose tracking relationship is to be recovered, extracting the relevant fields of software product A and software product B from a software warehouse, and cleaning the data of the relevant fields, wherein the data cleaning comprises outlier processing, missing value processing and consistency processing; performing candidate link pair matching on the cleaned data, then performing feature engineering and label marking to obtain a sample data set;

the product-related fields of requirements (demands), defects and code are extracted from an enterprise software warehouse; the extracted fields of the demand product are shown in fig. 2, those of the defect product in fig. 3, and those of the code product in fig. 4. Data cleaning is then carried out on the extracted fields, including case normalization of developer and tester numbers, unification of all time formats, removal of stop words from text, and aggregation of file name data from recent years;

for the demand-code submission candidate pairs, 18 features are defined, RC = {RC1, RC2, …, RC18}. For the defect-code submission candidate pairs, 17 features are defined, DC = {DC1, DC2, …, DC17}. Meanwhile, the label value of a candidate link pair with a valid tracking relationship is recorded as 1, and otherwise as 0.

As shown in fig. 5, the characteristics of candidate pairs for demand-code submission are described in detail as follows:

1) RC1-RC3 mainly relate to the developers in the software development process. RC1 represents the developer ID, RC2 represents the code submitter ID, and RC3 represents whether the developers include the code submitter, taking 1 if so and 0 otherwise.

2) RC4-RC7 mainly describe time-related features of the demand. RC4 represents the time difference between demand creation and code submission, RC5 represents the time difference between code submission and demand closure, RC6 represents whether the code submission time is later than the demand closure time, and RC7 = |RC5| < 2 5 days.

3) RC8-RC13 describe the other code submissions related to the demand in the current demand-code submission pair. The set of code submissions earlier than the current code submission is defined as Cpre, and the set of code submissions later than the current code submission is defined as Cnext. An overlap function is also defined to calculate the degree of overlap between the files involved in two code submissions. RC8 represents the difference between the submission time of Cp, the latest code submission in the Cpre set, and the submission time of the current code submission C; RC9 represents the file overlap overlap(C, Cp) between Cp and C; RC10 represents the number of the person who submitted Cp; RC11 represents the difference between the submission time of Cn, the earliest code submission in the Cnext set, and the submission time of the current code submission C; RC12 represents the file overlap overlap(C, Cn) between Cn and C; RC13 represents the number of the person who submitted Cn.

The file overlap ratio is calculated as follows:

overlap(a, b) = |file_names(a) ∩ file_names(b)| / |file_names(a) ∪ file_names(b)|

where file_names(x) denotes the set of file names involved in code submission x; the denominator is the size of the union of the file name sets of a and b, and the numerator is the size of their intersection; the value ranges from 0 to 1.

4) RC14-RC16 describe the context of the current demand-code submission link pair. RC14 represents the number of demands still under development when the current demand is closed; RC15 represents, among the demands still under development, the number that have the same developer as the current demand; RC16 represents the number of code submission records associated with the current demand.

5) RC17-RC18 are selected mainly on the basis of the text information in the products. Based on the VSM model, the cosine similarity between a pair of vectors is calculated to estimate the similarity between the demand document text and the code submission message. RC17 represents the text similarity between the demand document title and the code submission information; RC18 represents the text similarity between the demand document content and the code submission information.

The cosine similarity formula is as follows:

cos(u, v) = (u · v) / (‖u‖ ‖v‖)

where u and v are the VSM vectors of the two texts being compared.
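A minimal sketch of the text similarity behind RC17 and RC18 is given below. It assumes raw term-frequency vectors and whitespace tokenization; the source only states that a VSM model and cosine similarity are used, so the weighting scheme and the function name are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# Hypothetical demand title and commit message.
similarity = cosine_similarity("fix login bug", "fix login page")
```

Two of the three terms are shared in this example, so the similarity is 2/3; texts with no terms in common score 0.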

in the feature set of the defect-code submission candidate pairs, DC1-DC16 are the same as the corresponding features of the demand-code submission candidate pairs, and DC17 represents the text similarity between the defect document content and the code submission information;
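Several of the features described above, namely the time features RC4-RC6 and the file overlap used by RC9 and RC12, can be sketched as follows. The function names are hypothetical, and the choice of days as the unit and of the sign convention for the differences is an assumption, since the text does not fix either.

```python
from datetime import datetime

SECONDS_PER_DAY = 86400


def time_features(created, closed, committed):
    """Return (RC4, RC5, RC6) for one candidate pair; units and signs are assumed."""
    rc4 = (committed - created).total_seconds() / SECONDS_PER_DAY  # creation -> submission
    rc5 = (closed - committed).total_seconds() / SECONDS_PER_DAY   # submission -> closure
    rc6 = 1 if committed > closed else 0                           # submitted after closure?
    return rc4, rc5, rc6


def overlap(files_a, files_b):
    """Intersection size over union size of two file name sets; 0.0 if both are empty."""
    a, b = set(files_a), set(files_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical demand closed on 10 January with a commit two days later.
rc4, rc5, rc6 = time_features(
    created=datetime(2021, 1, 1),
    closed=datetime(2021, 1, 10),
    committed=datetime(2021, 1, 12),
)
rc9 = overlap(["src/a.py", "src/b.py"], ["src/b.py", "src/c.py"])
```

The same functions would apply unchanged to defect-code submission pairs, since DC1-DC16 mirror the corresponding demand features.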

as can be seen by combining fig. 2, fig. 3 and fig. 4, after the related fields are cleaned, the problems of poor data quality and unbalanced sample data can be effectively alleviated;

the procedure of step S300 is as follows:

for a demand-code submission candidate link pair, if a tracking relationship exists between the demand and any code submission, the candidate link pairs containing the demand are assigned to the marked data set; otherwise they are assigned to the missing tracking data set. Likewise, for a defect-code submission candidate link pair, if a tracking relationship exists between the defect and any code submission, the candidate link pairs containing the defect are assigned to the marked data set; otherwise they are assigned to the missing tracking data set;

the procedure of step S400 is as follows:

dividing the marked data set obtained in the step S300 into a marked data training set and a marked data testing set based on a four-fold time sequence verification method;

as shown in fig. 8, the demand-code submission marked data set is sorted from earliest to latest by the creation time of the code submission in each candidate link pair and split equally into five parts. In the first cycle, the first part is used as the training set for model training and the second part is used as the test set for model performance evaluation. In the k-th cycle (k = 2, 3, 4), the first k parts are used as the training set for model training, and the (k+1)-th part is used as the test set for model performance evaluation.

The segmentation mode for the defect-code submission marked data set is the same as that for the demand-code submission;

the process of step S500-step S600 is as follows:

combining the marked data training set obtained in the step S400 and the missing tracking data set obtained in the step S300 to generate a new training set; based on the semi-supervised unbalanced learning framework, a final classification model Cfinal is obtained by utilizing the new training set generated in the step S500;

as shown in fig. 9, for the demand-code submission marked training set, a random forest algorithm is used to train an initial classification model Cint. Cint is then used to predict labels for the demand-code submission missing tracking data set, and a function is called to obtain the probability of each predicted label; in this method, only labels with a probability greater than 99% are selected and marked as pseudo labels, yielding the demand-code submission intermediate training set;

the demand-code submission marked training set is combined with the intermediate training set to generate a new training set; a new classification model is trained on the new training set, again with the random forest algorithm, and pseudo labels are predicted for the missing tracking data set again; these steps are repeated until no more labels with a prediction probability greater than 99% appear, yielding the final classification model. The same method is used to generate a classification model for the defect-code submission marked training set, and the description is not repeated here.

The process of step S700-step S1000 is as follows:

resampling the new training set of step S500 based on a resampling method to obtain a resampled training set; training the binary classification model Cfinal on the resampled training set; evaluating the performance of the classification model Cfinal using the marked data test set of step S400; and recovering the tracking relationships among the software products using the binary classification model Cfinal.

The training set is balanced using a plurality of resampling methods; this example uses OSS undersampling, SMOTE oversampling, SMOTE-Tomek comprehensive sampling and SMOTE-ENN comprehensive sampling to balance the training set for model training.
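The core idea of SMOTE oversampling, generating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration with a hypothetical function name and toy data, not the implementation used in the embodiment; the OSS, SMOTE-Tomek and SMOTE-ENN variants are not shown.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like(minority, n_new=6)
```

Each synthetic point lies on a line segment between two existing minority samples, so the new samples stay inside the region already occupied by the minority class instead of being exact duplicates.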

Model performance is evaluated on the marked test set. In this example, 10 industrial projects are selected, and performance is measured by the precision P, the recall R and the F1 value; FIG. 6 shows the performance evaluation results of the demand-code submission binary classification model; FIG. 7 shows the performance evaluation results of the defect-code submission binary classification model;

as can be seen from fig. 6 and fig. 7, the classification models achieve a substantial improvement in F1 value, precision and recall.
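The three evaluation measures named above can be computed from predicted and true labels as in the following sketch. The function name and the toy labels are hypothetical; label 1 denotes a valid tracking relationship.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

On heavily imbalanced data such as candidate link pairs, these measures are more informative than plain accuracy, because a model that predicts label 0 for every pair would still be correct most of the time.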

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Finally, it should be noted that the foregoing description is only a preferred embodiment of the present invention, and the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for automated software traceability restoration of an enhanced data preprocessing procedure, said method comprising the steps of:

S100: determining a software product A and a software product B whose tracking relationship is to be recovered, extracting the relevant fields of software product A and software product B from a software warehouse, and cleaning the data of the relevant fields, wherein the data cleaning comprises outlier processing, missing value processing and consistency processing;

S900: evaluating the performance of the binary classification model Cfinal obtained in step S600 using the marked data test set of step S400;

S1000: recovering the tracking relationships among the software products using the binary classification model Cfinal;

the step S300 includes the steps of:

the step S400 includes the steps of:

S402: in the k-th cycle, the first k parts are used as the training set and the (k+1)-th part is used as the test set, where k ∈ {1, 2, 3, 4};

The step S600 includes the steps of:

S601: training an initial binary classification model Cint with a classification algorithm, using the marked data training set (Xlabel, ylabel) in the new training set;

S602: using the initial binary classification model Cint to predict labels for the missing tracking data set Xunlabel in the new training set, calculating the label prediction probability with the predict_proba() function of a Python tool library, and marking each predicted label whose prediction probability is greater than a threshold as a pseudo label yunlabel, to form an intermediate training set (Xunlabel, yunlabel);

S604: repeating steps S601-S603 until no more labels with a prediction probability greater than the threshold appear in the missing tracking data set in the new training set, thereby obtaining the final classification model Cfinal.

2. The method for automated software traceability restoration of an enhanced data preprocessing procedure according to claim 1, wherein said S200 comprises the steps of:

S201: matching candidate link pairs between the cleaned software product A and software product B based on the isCandidate function, whose formula is as follows:

isCandidate(ai, bi) = created(ai) ≤ committed(bi) ≤ resolved(ai) + η;

S204: forming a sample data set from the marked software product A and software product B.

3. The method for automated software traceability restoration of an enhanced data preprocessing procedure according to claim 1, wherein said S700 comprises the steps of:

S701: balancing the training set by using an undersampling method;

S702: balancing the training set by using an oversampling method;

S703: balancing the training set by using a comprehensive sampling method.