CN113378907B - Automated software traceability recovery method for enhancing data preprocessing process - Google Patents

Info

Publication number
CN113378907B
Authority
CN
China
Prior art keywords
data
training set
software product
marked
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110626138.2A
Other languages
Chinese (zh)
Other versions
CN113378907A (en)
Inventor
陈静
张贺
董黎明
匡宏宇
荣国平
邵栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202110626138.2A
Publication of CN113378907A
Application granted
Publication of CN113378907B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses an automated software traceability recovery method that enhances the data preprocessing process. The method comprises the following steps: selecting the products whose tracking relationship is to be recovered, extracting the relevant fields of the products, performing data cleaning, and carrying out feature engineering to obtain a sample data set; dividing the sample data set into a marked data set and a missing tracking data set by a label marking method; dividing the marked data set into a marked training set and a test set by a four-fold time sequence verification method; combining the marked training set and the missing tracking data set under a semi-supervised unbalanced learning framework to generate a new training set; and balancing the training set with a plurality of resampling methods, training a classification model, evaluating the performance of the classification model, and recovering the tracking relationships among products. By enhancing the data preprocessing process, the method addresses problems such as the large number of project products, poor data quality and unbalanced sample data through several enhancement measures, and greatly improves the F1 value, precision and recall.

Description

Automated software traceability recovery method for enhancing data preprocessing process
Technical Field
The invention relates to the technical field of computers, in particular to an automatic software traceability recovery method for enhancing a data preprocessing process.
Background
Software traceability is the ability to associate any uniquely identifiable software product with other products, and to maintain and use the resulting network to answer questions about the software product and its development process. Software traceability techniques aim to create or maintain tracking relationships between different artifacts, which helps to improve the quality of process-oriented data. However, software traceability is a difficult and error-prone task. The main difficulty lies in bridging the gap in logical abstraction between requirements written in natural language and code written in a programming language. At the same time, given the huge number of products and of potential tracking relationships between them, the cost and effort of manually recovering those relationships are very high. In addition, quality problems in real traceability data, such as missing, redundant or ambiguous trace paths and invalid tracking relationships, make software traceability even harder. Automatically recovering the tracking relationships among software products has therefore become a research hot spot. Academia has produced many results on automated software traceability recovery, but these methods concentrate on open-source projects and are not widely applied in industrial settings.
Enterprise projects involve a huge number of products and of potential tracking relationships among them, and the positive/negative label marking strategy used for binary classification models causes severe imbalance in the sample data. Meanwhile, enterprises tend to adopt temporary measures, rather than long-term stable strategies, for maintaining the tracking relationships among products. As a result, automated traceability recovery methods designed for open-source projects cannot be applied directly to enterprise projects.
To address these problems, an automated software traceability recovery method that enhances the data preprocessing process is provided. The method reduces noise in the sample data set through four enhanced data preprocessing measures: a label marking method, a four-fold time sequence verification method, a semi-supervised unbalanced learning framework, and a plurality of resampling methods. It effectively alleviates the severe sample imbalance found in enterprise projects and improves the robustness and generalization capability of the classification model. Compared with a model built by an automated software traceability recovery method for open-source projects, the method achieves marked improvements in precision, recall and F1 value.
Disclosure of Invention
The present invention is directed to an automated software traceability recovery method for enhancing a data preprocessing process, so as to solve the problems set forth in the background art.
In order to solve the technical problems, the invention provides the following technical scheme: an automated software traceability recovery method for enhancing a data preprocessing process, the method comprising the steps of:
S100: determining a software product A and a software product B whose tracking relationship is to be recovered, extracting the relevant fields of software product A and software product B from a software repository, and cleaning the data of the relevant fields, wherein the data cleaning comprises outlier processing, missing value processing and consistency processing;
S200: performing candidate link pair matching on the cleaned data, carrying out feature engineering, and marking labels to obtain a sample data set;
S300: dividing the sample data set into a marked data set and a missing tracking data set based on the label marking in step S200;
S400: dividing the marked data set obtained in step S300 into a marked data training set and a marked data test set based on a four-fold time sequence verification method;
S500: combining the marked data training set obtained in step S400 and the missing tracking data set obtained in step S300 to generate a new training set;
S600: based on the semi-supervised unbalanced learning framework, obtaining a final classification model Cfinal by using the new training set generated in step S500;
S700: resampling the new training set of step S500 with a resampling method to obtain a resampled training set;
S800: training the binary classification model Cfinal obtained in step S600 on the resampled training set obtained in step S700;
S900: evaluating the performance of the binary classification model Cfinal obtained in step S600 by using the marked data test set of step S400;
S1000: recovering the tracking relationships among the software products by using the binary classification model Cfinal.
Further, S200 includes the following steps:
S201: matching candidate link pairs between software product A and software product B based on the isCandidate function, defined as:
isCandidate(ai, bi) = created(ai) ≤ committed(bi) ≤ resolved(ai) + η
where ai denotes the i-th specific product within software product A; created(ai) denotes the creation time of ai; resolved(ai) denotes the closing time of ai;
bi denotes the i-th specific product within software product B; committed(bi) denotes the commit time of bi; η is the average of the absolute values of the differences between the closing times of software product A and the commit times of software product B;
S202: for each candidate link pair generated in step S201, if an association between the product numbers of ai and bi is found on the platform, a valid tracking relationship is considered to exist between ai and bi;
S203: the label value of a candidate link pair with a valid tracking relationship is marked as 1; otherwise, the label value of the candidate link pair is marked as 0;
S204: forming the sample data set from the marked software product A and the marked software product B;
Cleaning the related fields helps to detect and correct errors in the preliminarily collected fields and improves the accuracy of candidate link pair matching in step S200, so that the candidate link pairs produced by the isCandidate function are more representative of the data. Judging the tracking relationship by combining the candidate link pairs with the product number association strengthens the data processing and benefits the subsequent processing steps.
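The candidate matching of step S201 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `mean_abs_gap` and `is_candidate` are hypothetical, and the way η is computed (over a list of already paired closing and commit times) is an assumption, since the text does not say which pairs enter the average.

```python
from datetime import datetime, timedelta

def mean_abs_gap(resolved_times, commit_times):
    """eta: mean of |closing time - commit time|.
    Assumption: the two lists are aligned pairwise."""
    gaps = [abs(r - c) for r, c in zip(resolved_times, commit_times)]
    return sum(gaps, timedelta()) / len(gaps)

def is_candidate(created, resolved, committed, eta):
    """created(ai) <= committed(bi) <= resolved(ai) + eta"""
    return created <= committed <= resolved + eta

# Example: eta is the mean of a 1-day and a 3-day gap, i.e. 2 days.
eta = mean_abs_gap(
    [datetime(2021, 1, 10), datetime(2021, 1, 20)],
    [datetime(2021, 1, 11), datetime(2021, 1, 17)],
)
```

A commit made up to η after the closing time of ai still counts as a candidate, which tolerates commits that are recorded slightly after the product is closed.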
Further, S300 includes the following steps:
S301: if a tracking relationship exists between a specific product ai in the software product A to be recovered and any specific product bi in software product B, all samples in the sample data set that contain ai are assigned to the marked data set;
S302: if no tracking relationship exists between the specific product ai in the software product A to be recovered and any of the specific products bi in software product B, all samples in the sample data set that contain ai are assigned to the missing tracking data set;
Dividing the sample data set in this way allows the subsets, which carry different meanings, to be processed separately, so that more targeted processing and utilization methods can be chosen in the subsequent steps.
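The division of step S300 can be sketched as follows, assuming each sample is a dictionary with the hypothetical keys `a_id`, `b_id` and `label` (none of these names come from the original text).

```python
def split_by_trace(samples):
    """Split candidate link pairs into (marked, missing) sets.
    A product ai belongs to the marked set if at least one of its
    candidate pairs carries label 1."""
    traced = {s["a_id"] for s in samples if s["label"] == 1}
    marked = [s for s in samples if s["a_id"] in traced]
    missing = [s for s in samples if s["a_id"] not in traced]
    return marked, missing

samples = [
    {"a_id": "REQ-1", "b_id": "c1", "label": 1},
    {"a_id": "REQ-1", "b_id": "c2", "label": 0},
    {"a_id": "REQ-2", "b_id": "c1", "label": 0},
]
marked, missing = split_by_trace(samples)
```

Note that the negative pair of REQ-1 stays in the marked set: its label 0 is trustworthy because REQ-1 has a known link, whereas the label 0 of REQ-2 may simply reflect a missing link.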
Further, S400 includes the following steps:
S401: sorting the marked data set and the missing tracking data set obtained in step S300 in ascending order of the commit time of the specific product bi of software product B in the candidate link pairs generated in step S200, and dividing the sorted marked data set equally into five parts;
S402: in the k-th cycle, the first k parts are used as the training set and the (k+1)-th part is used as the test set, where k ∈ {1, 2, 3, 4};
This splitting method makes full, repeated use of the collected data and alleviates the sample imbalance problem when the sample data is used.
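A minimal sketch of the four-fold time sequence split described in steps S401 and S402. The function name `four_fold_time_series` and the `key` argument (which extracts the commit time from a sample) are illustrative assumptions.

```python
def four_fold_time_series(samples, key):
    """Sort by time, cut into five equal parts, and return four
    (train, test) folds: fold k trains on parts 1..k, tests on part k+1."""
    ordered = sorted(samples, key=key)
    n = len(ordered)
    bounds = [round(i * n / 5) for i in range(6)]
    parts = [ordered[bounds[i]:bounds[i + 1]] for i in range(5)]
    folds = []
    for k in range(1, 5):
        train = [s for part in parts[:k] for s in part]
        folds.append((train, parts[k]))
    return folds

# Ten samples identified by their (integer) commit time, given out of order.
folds = four_fold_time_series(list(range(10, 0, -1)), key=lambda t: t)
```

Because every test part is strictly later than its training parts, the model is never evaluated on data older than what it was trained on.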
Further, S600 includes the following steps:
S601: training an initial binary classification model Cint with a classification algorithm, using the marked data training set (Xlabel, ylabel) in the new training set;
S602: using the initial classification model Cint to predict labels for the missing tracking data set Xunlabel in the new training set, calculating the label prediction probabilities with the predict_proba() function of a Python tool library, and marking the predicted labels whose prediction probability is larger than a threshold as pseudo labels yunlabel, forming an intermediate training set (Xunlabel, yunlabel);
S603: combining the intermediate training set with the marked data training set in the new training set, and retraining the initial classification model of step S601 with the classification algorithm selected in step S601;
S604: repeating steps S601-S603 until no more labels with a prediction probability larger than the threshold appear in the missing tracking data set of the new training set, thereby obtaining the final classification model Cfinal;
These steps can effectively improve the robustness and generalization capability of the classification model.
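The pseudo-labelling loop of steps S601 to S604 can be sketched as below. To keep the example free of third-party dependencies, a tiny nearest-centroid classifier stands in for the real model; this stand-in is an assumption made purely for illustration. Any object exposing `fit` and `predict_proba` (for instance scikit-learn's `RandomForestClassifier`, whose probability columns follow `classes_`) could be passed as `make_model` instead. The names `CentroidClassifier` and `self_train` are hypothetical.

```python
import math

class CentroidClassifier:
    """Illustrative stand-in classifier with fit / predict_proba."""
    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):  # assumes both classes are present
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(c) / len(rows) for c in zip(*rows)]
        return self

    def predict_proba(self, X):
        out = []
        for x in X:
            w0 = math.exp(-math.dist(x, self.centroids[0]))
            w1 = math.exp(-math.dist(x, self.centroids[1]))
            out.append([w0 / (w0 + w1), w1 / (w0 + w1)])
        return out

def self_train(make_model, X_label, y_label, X_unlabel, threshold=0.99):
    """Retrain until no unlabelled sample exceeds the threshold."""
    X_train, y_train, pool = list(X_label), list(y_label), list(X_unlabel)
    model = make_model().fit(X_train, y_train)
    while pool:
        confident, rest = [], []
        for x, p in zip(pool, model.predict_proba(pool)):
            best = max(range(len(p)), key=p.__getitem__)
            (confident if p[best] > threshold else rest).append((x, best))
        if not confident:
            break
        X_train += [x for x, _ in confident]
        y_train += [label for _, label in confident]
        pool = [x for x, _ in rest]
        model = make_model().fit(X_train, y_train)
    return model, X_train, y_train

model, X_train, y_train = self_train(
    CentroidClassifier, [[0.0], [10.0]], [0, 1], [[0.5], [9.5], [5.0]]
)
```

In this toy run the two samples near the class centres receive pseudo labels, while the ambiguous sample at 5.0 never reaches the threshold and stays unlabelled.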
Further, S700 includes the following steps:
S701: balancing the training set by using an undersampling method;
S702: balancing the training set by using an oversampling method;
S703: balancing the training set by using a comprehensive sampling method;
This process can reduce noise in the training set and effectively alleviate the severe sample imbalance problem.
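As a rough illustration of what undersampling and oversampling do to the class balance, the sketch below uses the simplest random variants. These are assumptions chosen for brevity and are not the specific methods of the embodiment; the function names are hypothetical.

```python
import random

def _split_indices(y):
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)

def random_undersample(X, y, seed=0):
    """Drop majority samples at random until both classes are equal."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]

def random_oversample(X, y, seed=0):
    """Duplicate minority samples at random until both classes are equal."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

X = [[i] for i in range(6)]
y = [1, 0, 0, 0, 0, 0]
X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
```

Undersampling discards information from the majority class, while oversampling risks overfitting to duplicated minority samples; comprehensive sampling combines both ideas to limit each drawback.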
Compared with the prior art, the invention has the following beneficial effects: by enhancing the data preprocessing process, the method addresses problems such as the large number of project products, poor data quality and unbalanced sample data through several enhancement measures, and is particularly suitable for enterprise projects; compared with automated traceability recovery methods based on open-source projects, the F1 value, precision and recall are greatly improved.
Drawings
The accompanying drawings provide a further understanding of the invention and constitute a part of this specification; together with the embodiments of the invention, they serve to explain the invention. In the drawings:
FIG. 1 is a flow chart of an automated software traceability recovery method that enhances the data preprocessing process;
FIG. 2 shows the extracted related fields of the demand artifact in an embodiment of an automated software traceability recovery method that enhances the data preprocessing process;
FIG. 3 shows the extracted related fields of the defect artifact in an embodiment of an automated software traceability recovery method that enhances the data preprocessing process;
FIG. 4 shows the extracted related fields of the code submission artifact in an embodiment of an automated software traceability recovery method that enhances the data preprocessing process;
FIG. 5 shows the features of a demand-code submission candidate pair in an embodiment of an automated software traceability recovery method that enhances the data preprocessing process;
FIG. 6 shows the performance evaluation results of the demand-code submission binary classification model in an embodiment of an automated software traceability recovery method that enhances the data preprocessing process;
FIG. 7 shows the performance evaluation results of the defect-code submission binary classification model in an embodiment of an automated software traceability recovery method that enhances the data preprocessing process;
FIG. 8 is a schematic diagram of the four-fold time sequence verification method in an embodiment of an automated software traceability recovery method that enhances the data preprocessing process;
FIG. 9 is a schematic diagram of the semi-supervised unbalanced learning framework in an embodiment of an automated software traceability recovery method that enhances the data preprocessing process.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
This embodiment mainly describes recovering the tracking relationships between requirements and code submissions and between defects and code submissions, in order to support the successful adoption of software process simulation modeling in enterprises. Software process simulation modeling has attracted wide attention because it helps software enterprises achieve quantitative management and improve software process maturity. However, when research results on software process simulation modeling are put into enterprise practice, they face significant process data problems. The data recorded in a software repository is mainly attribute information of products rather than behavior information of processes, so real process data is difficult to obtain. In addition, enterprise data suffers from quality problems such as low accuracy and poor integrity; incomplete tracking relationships among products are especially prominent. How to effectively recover the tracking relationships among products, so as to improve data quality and secure access to real process data, is a problem to be solved urgently. The automated software traceability recovery method based on the enhanced data preprocessing process mines the data in enterprise projects and constructs binary classification models to recover the tracking relationships among products, thereby securing access to real process data.
Referring to fig. 1-9, the present invention provides the following technical solutions: an automated software traceability recovery method for enhancing a data preprocessing process comprises the following steps:
the steps S100-S200 are as follows:
Determining a software product A and a software product B whose tracking relationship is to be recovered, extracting the relevant fields of software product A and software product B from a software repository, and cleaning the data of the relevant fields, wherein the data cleaning comprises outlier processing, missing value processing and consistency processing; performing candidate link pair matching on the cleaned data, carrying out feature engineering, and marking labels to obtain a sample data set;
Product related fields for requirements, defects and code are extracted from the enterprise software repository; the extracted fields of the requirement products are shown in fig. 2, those of the defect products in fig. 3, and those of the code products in fig. 4. Data cleaning is then performed on the extracted fields, including case normalization of developer and tester IDs, unification of all time formats, removal of text stop words, and data aggregation of file names in recent years;
For the demand-code submission candidate pair, 18 features are defined, RC = {RC1, RC2, …, RC18}. For the defect-code submission candidate pair, 17 features are defined, DC = {DC1, DC2, …, DC17}. Meanwhile, the label value of a candidate link pair with a "valid tracking relationship" is recorded as 1; otherwise it is recorded as 0.
As shown in fig. 5, the characteristics of candidate pairs for demand-code submission are described in detail as follows:
1) RC1-RC3 mainly relate to the developers in the software development process. RC1 represents the developer ID, RC2 represents the code submitter ID, and RC3 represents whether the developers include the code submitter (1 if yes, 0 otherwise).
2) RC4-RC7 mainly describe time-series features of the demand. RC4 represents the time difference between demand creation and code submission, RC5 represents the time difference between code submission and demand closure, RC6 represents whether the code submission time is later than the demand closure time, and RC7 = |RC5| < 2 5 days.
3) RC8-RC13 describe the other code submissions related to the demand in the current demand-code submission pair. The set of code submissions committed earlier than the current code submission C is defined as Cpre, and the set committed later than C is defined as Cnext. An overlap function is defined to calculate the degree of overlap between the files involved in two code submissions. RC8 represents the difference between the commit time of Cp, the most recent code submission in Cpre, and the commit time of C; RC9 represents overlap(C, Cp), the overlap of the files involved in Cp and C; RC10 represents the ID of the person who submitted Cp; RC11 represents the difference between the commit time of Cn, the earliest code submission in Cnext, and the commit time of C; RC12 represents overlap(C, Cn), the overlap of the files involved in Cn and C; RC13 represents the ID of the person who submitted Cn.
The file overlap is calculated as follows:
overlap(a, b) = |file_names(a) ∩ file_names(b)| / |file_names(a) ∪ file_names(b)|
where file_names(x) denotes the set of file names involved in code submission x; the numerator is the size of the intersection of the file name sets of a and b, and the denominator is the size of their union; the value ranges from 0 to 1.
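The file overlap described above, intersection over union of the two file name sets, can be sketched as follows. Returning 0.0 when both submissions touch no files is an assumption; the text does not cover that case.

```python
def overlap(files_a, files_b):
    """Intersection-over-union of the file names touched by two commits."""
    a, b = set(files_a), set(files_b)
    union = a | b
    # Assumption: two empty commits are treated as having no overlap.
    return len(a & b) / len(union) if union else 0.0

shared = overlap(["a.py", "b.py"], ["b.py", "c.py"])  # 1 shared file of 3
```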
4) RC14-RC16 relate to the current demand-code submission link pair. RC14 represents the number of demands still under development when the current demand is closed; RC15 represents the number of developers, among those working on the demands still under development, who are also developers of the current demand; RC16 represents the number of code submissions associated with the current demand.
5) RC17-RC18 are selected mainly on the basis of the text information in the artifacts. The cosine similarity between a pair of vectors is calculated with the VSM model to estimate the similarity between the demand document text and the code submission message. RC17 represents the text similarity between the demand document title and the code submission information; RC18 represents the text similarity between the demand document content and the code submission information.
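The cosine similarity is calculated as follows:
cos(u, v) = (u · v) / (‖u‖ ‖v‖)
where u and v are the VSM term vectors of the two texts being compared.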
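A minimal sketch of the text similarity used for RC17 and RC18, assuming raw term-frequency vectors and whitespace tokenization. Both choices are assumptions: the text only names the VSM model, and a TF-IDF weighting would be an equally plausible reading.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two term-frequency vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

score = cosine_similarity("fix login", "fix logout")  # one shared term of two
```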
In the feature set of defect-code submission candidate pairs, DC17 represents the text similarity between the content of the defect document and the code submission information, and the features DC1-DC16 are the same as those of the demand-code submission candidate pairs;
As can be seen from fig. 2, fig. 3 and fig. 4, cleaning the related fields effectively alleviates the problems of poor data quality and unbalanced sample data;
the procedure of step S300 is as follows:
for a demand-code submission candidate link pair, if a tracking relationship exists between the demand and any code submission information, the candidate link pair containing the demand is divided into a marked data set, otherwise, the candidate link pair is divided into a missing tracking data set. For the candidate link pairs submitted by the defect-code, if tracking relation exists between the defect and any code submitted information, dividing the candidate link pair containing the defect into a marked data set, otherwise, dividing the candidate link pair into a missing tracking data set;
the S400 process is as follows:
dividing the marked data set obtained in the step S300 into a marked data training set and a marked data testing set based on a four-fold time sequence verification method;
As shown in fig. 8, the demand-code submission marked data set is sorted in ascending order of the creation time of the code submission information in the candidate link pairs and divided equally into five parts. In the first cycle, the first part is used as the training set for model training and the second part as the test set for model performance evaluation. In the k-th cycle (k = 2, 3, 4), the first k parts are used as the training set for model training, and the (k+1)-th part is used as the test set for model performance evaluation.
The segmentation mode for the defect-code submission marked data set is the same as that for the demand-code submission;
the process of step S500-step S600 is as follows:
combining the marked data training set obtained in the step S400 and the missing tracking data set obtained in the step S300 to generate a new training set; based on the semi-supervised unbalanced learning framework, a final classification model Cfinal is obtained by utilizing the new training set generated in the step S500;
As shown in fig. 9, for the demand-code submission marked training set, a random forest algorithm is used to train an initial classification model Cint. Cint is used to predict labels for the demand-code submission missing tracking data set, and a function is called to obtain the probability of each label prediction; in this method, only labels with a probability greater than 99% are selected and marked as pseudo labels, yielding the demand-code submission intermediate training set;
The demand-code submission marked training set is combined with the intermediate training set to generate a new training set; a new classification model is trained on the new training set, again with the random forest algorithm, and pseudo labels are predicted for the missing tracking data set again; these steps are repeated until no more labels with a prediction probability greater than 99% appear, yielding the final classification model; the same method is used to generate a classification model for the defect-code submission marked training set and is not described again here.
The process of step S700-step S1000 is as follows:
Resampling the new training set of step S500 with a resampling method to obtain a resampled training set; training the binary classification model Cfinal on the resampled training set; evaluating the performance of Cfinal with the marked data test set of step S400; and recovering the tracking relationships among software products with the binary classification model Cfinal.
The training set is balanced with several resampling methods; this example uses OSS undersampling, SMOTE oversampling, SMOTE-Tomek comprehensive sampling and SMOTE-ENN comprehensive sampling to balance the training set for model training.
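To illustrate the idea behind SMOTE oversampling, the sketch below generates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified, dependency-free approximation written for this illustration, assuming label 1 is the minority class and at least two minority samples exist; a production setup would more likely rely on a dedicated resampling library.

```python
import math
import random

def smote(X, y, k=5, seed=0):
    """Add synthetic minority (label 1) samples until classes are balanced."""
    rng = random.Random(seed)
    minority = [x for x, t in zip(X, y) if t == 1]
    n_new = (len(y) - len(minority)) - len(minority)
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base sample
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: math.dist(base, m),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment base -> other
        X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
        y_out.append(1)
    return X_out, y_out

X = [[0.0, 0.0], [1.0, 1.0]] + [[5.0 + i, 5.0] for i in range(6)]
y = [1, 1, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = smote(X, y)
```

Unlike plain duplication, the synthetic points lie between existing minority samples, which gives the classifier a smoother minority region to learn from. The Tomek and ENN variants additionally remove borderline or noisy samples after oversampling.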
Model performance is evaluated on the marked test set. In this example, 10 industrial projects are selected, and performance is evaluated by precision P, recall R and F1 value; FIG. 6 shows the performance evaluation results of the demand-code submission binary classification model, and FIG. 7 shows those of the defect-code submission binary classification model;
As can be seen from fig. 6 and fig. 7, the classification models achieve a large improvement in F1 value, precision and recall.
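For reference, the three evaluation metrics can be computed from true and predicted labels as sketched below. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision P, recall R and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 2 true positives, 1 false positive, 1 false negative
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

On heavily imbalanced link data these metrics are more informative than plain accuracy, because a model that predicts "no link" everywhere would still score a high accuracy.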
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and does not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein or make equivalent replacements of some of their technical features. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims (3)

1. An automated software traceability recovery method for enhancing a data preprocessing process, said method comprising the steps of:
S100: determining a software product A and a software product B whose tracking relationship is to be recovered, extracting the relevant fields of software product A and software product B from a software repository, and cleaning the data of the relevant fields, wherein the data cleaning comprises outlier processing, missing value processing and consistency processing;
S200: performing candidate link pair matching on the cleaned data, carrying out feature engineering, and marking labels to obtain a sample data set;
S300: dividing the sample data set into a marked data set and a missing tracking data set based on the label marking in step S200;
S400: dividing the marked data set obtained in step S300 into a marked data training set and a marked data test set based on a four-fold time sequence verification method;
S500: combining the marked data training set obtained in step S400 and the missing tracking data set obtained in step S300 to generate a new training set;
S600: based on the semi-supervised unbalanced learning framework, obtaining a final classification model Cfinal by using the new training set generated in step S500;
S700: resampling the new training set of step S500 with a resampling method to obtain a resampled training set;
S800: training the binary classification model Cfinal obtained in step S600 on the resampled training set obtained in step S700;
S900: evaluating the performance of the binary classification model Cfinal obtained in step S600 by using the marked data test set of step S400;
S1000: recovering the tracking relationships among the software products by using the binary classification model Cfinal;
the step S300 comprises the following steps:
S301: if a tracking relationship exists between a specific product ai in the software product A to be recovered and any specific product bi in the software product B, assigning all samples in the sample data set that contain ai to the marked data set;
S302: if no tracking relationship exists between the specific product ai in the software product A to be recovered and any of the specific products bi in the software product B, assigning all samples in the sample data set that contain ai to the missing tracking data set;
the step S400 comprises the following steps:
S401: sorting the marked data set and the missing tracking data set obtained in step S300 in ascending order of the commit time of the specific product bi of software product B in the candidate link pairs generated in step S200, and dividing the sorted marked data set into five equal parts;
S402: in the k-th cycle, using the first k parts as the training set and the (k+1)-th part as the test set, wherein k = 1, 2, 3, 4;
the step S600 comprises the following steps:
S601: training an initial binary classification model Cint based on a classification algorithm by using the marked data training set (Xlabel, ylabel) in the new training set;
S602: using the initial binary classification model Cint to predict labels for the missing tracking data set Xunlabel in the new training set, calculating the label prediction probabilities with the predict_proba() function of a Python tool library, and marking the predicted labels whose prediction probability is greater than a threshold as pseudo labels yunlabel to form an intermediate training set (Xunlabel, yunlabel);
S603: combining the intermediate training set with the marked data training set in the new training set, and retraining the initial binary classification model of step S601 based on the classification algorithm selected in step S601;
S604: repeating steps S601 to S603 until no labels with a prediction probability greater than the threshold remain in the missing tracking data set of the new training set, thereby obtaining the final binary classification model Cfinal.
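Two of the steps in claim 1 are procedural enough to sketch in code: the time-ordered split of S401 and S402, and the pseudo-labelling loop of S601 to S604. The following is a minimal Python sketch under stated assumptions, not the patented implementation. The names `time_ordered_folds`, `self_train`, and `CentroidClassifier` are hypothetical; the toy classifier only stands in for any model that exposes `fit` and `predict_proba`; the sketch assumes that "prediction probability" means the larger of the two class probabilities and that pseudo-labelled samples are removed from the unlabelled pool once accepted.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def time_ordered_folds(samples: Sequence, commit_time: Callable,
                       parts: int = 5) -> Iterator[Tuple[list, list]]:
    """S401-S402: sort by commit time, cut into `parts` equal chunks, and in
    cycle k yield (first k chunks, chunk k+1) as (train, test)."""
    ordered = sorted(samples, key=commit_time)
    size = len(ordered) // parts  # assumption: leftover samples are dropped
    chunks = [ordered[i * size:(i + 1) * size] for i in range(parts)]
    for k in range(1, parts):
        train = [s for chunk in chunks[:k] for s in chunk]
        yield train, chunks[k]


class CentroidClassifier:
    """Toy 1-D binary classifier (hypothetical stand-in, not from the patent)."""

    def fit(self, X: List[float], y: List[int]) -> "CentroidClassifier":
        pos = [x for x, label in zip(X, y) if label == 1]
        neg = [x for x, label in zip(X, y) if label == 0]
        self.c1 = sum(pos) / len(pos)
        self.c0 = sum(neg) / len(neg)
        return self

    def predict_proba(self, X: List[float]) -> List[Tuple[float, float]]:
        out = []
        for x in X:
            d0, d1 = abs(x - self.c0), abs(x - self.c1)
            p1 = 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)  # closer to c1 -> higher p1
            out.append((1.0 - p1, p1))
        return out


def self_train(make_model: Callable, X_label: list, y_label: list,
               X_unlabel: list, threshold: float = 0.9, max_rounds: int = 50):
    """S601-S604: grow the training set with confident pseudo labels."""
    X_train, y_train = list(X_label), list(y_label)
    pool = list(X_unlabel)
    model = make_model().fit(X_train, y_train)            # S601: initial model
    for _ in range(max_rounds):
        probs = model.predict_proba(pool) if pool else []  # S602: score the pool
        remaining, added = [], False
        for x, (p0, p1) in zip(pool, probs):
            if max(p0, p1) > threshold:                    # confident pseudo label
                X_train.append(x)
                y_train.append(1 if p1 > p0 else 0)
                added = True
            else:
                remaining.append(x)
        pool = remaining
        if not added:                                      # S604: stop condition
            break
        model = make_model().fit(X_train, y_train)         # S603: retrain
    return model, pool
```

The loop takes a factory (`make_model`) rather than a model instance so that each retraining round starts from a fresh estimator; a scikit-learn-style classifier, which also provides `fit` and `predict_proba`, should be usable in place of the toy class. `max_rounds` is a safety cap added for the sketch and is not part of the claim.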
2. The method for automated software traceability recovery with an enhanced data preprocessing process according to claim 1, wherein the step S200 comprises the following steps:
S201: matching candidate link pairs between the cleaned software product A and software product B based on the isCandidate function; the formula is as follows:
where ai refers to the i-th specific software product within software product A; created(ai) refers to the creation time of ai; reserved(ai) refers to the closing time of ai;
bi refers to the i-th specific software product within software product B; committed(bi) refers to the commit time of bi; η is the average of the absolute values of the differences between the closing times of software product A and the commit times of software product B;
S202: for each candidate link pair generated in step S201, if an association between ai and bi by product number is found on the platform, a valid tracking relationship is considered to exist between ai and bi;
S203: marking the label value of each candidate link pair that has a valid tracking relationship as 1, and otherwise marking the label value of the candidate link pair as 0;
S204: forming the sample data set from the marked candidate link pairs of software product A and software product B.
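The matching and labelling of S201 to S203 can be illustrated with a short Python sketch. The isCandidate formula itself is not reproduced in this text, so the time window used below, created(ai) ≤ committed(bi) ≤ closing time of ai + η, is an assumption made purely for illustration. The class and field names (`ArtifactA`, `ArtifactB`, `closed`, `linked_key`) and the helper `mean_abs_gap` are likewise hypothetical, as is the choice of which pairs η is averaged over.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class ArtifactA:
    """A specific product ai of software product A (for example, an issue)."""
    key: str
    created: datetime
    closed: datetime


@dataclass(frozen=True)
class ArtifactB:
    """A specific product bi of software product B (for example, a commit)."""
    key: str
    committed: datetime
    linked_key: Optional[str] = None  # product number referenced on the platform


def mean_abs_gap(pairs: Iterable[Tuple[ArtifactA, ArtifactB]]) -> timedelta:
    """η: mean absolute difference between closing time and commit time."""
    gaps = [abs((b.committed - a.closed).total_seconds()) for a, b in pairs]
    return timedelta(seconds=sum(gaps) / len(gaps))


def is_candidate(a: ArtifactA, b: ArtifactB, eta: timedelta) -> bool:
    # ASSUMED window: the commit falls between creation and closing time plus η.
    return a.created <= b.committed <= a.closed + eta


def label_candidates(products_a: Iterable[ArtifactA],
                     products_b: Iterable[ArtifactB],
                     eta: timedelta) -> List[Tuple[str, str, int]]:
    """S201-S203: keep candidate pairs and mark them 1 (linked) or 0."""
    products_b = list(products_b)
    rows = []
    for a in products_a:
        for b in products_b:
            if is_candidate(a, b, eta):
                rows.append((a.key, b.key, 1 if b.linked_key == a.key else 0))
    return rows
```

Filtering by time window before labelling keeps the number of negative pairs manageable: without it, every ai would be paired with every bi, and almost all of those pairs would carry label 0.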
3. The method for automated software traceability recovery with an enhanced data preprocessing process according to claim 1, wherein the step S700 comprises the following steps:
S701: balancing the training set by using an undersampling method;
S702: balancing the training set by using an oversampling method;
S703: balancing the training set by using a combined sampling method.
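The three balancing strategies of S701 to S703 can be sketched as follows. The claim does not name specific algorithms, so the sketch assumes the simplest variants: random undersampling, random oversampling by duplication, and a combined scheme that brings every class to the midpoint between the smallest and largest class sizes. All function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def _by_class(X: list, y: list) -> Dict[int, list]:
    """Group samples by their label."""
    groups: Dict[int, list] = defaultdict(list)
    for x, label in zip(X, y):
        groups[label].append(x)
    return groups


def _resample_to(X: list, y: list, target: int, seed: int) -> Tuple[list, list]:
    """Bring every class to exactly `target` samples."""
    rng = random.Random(seed)
    X_out: List = []
    y_out: List[int] = []
    for label, items in sorted(_by_class(X, y).items()):
        if len(items) > target:
            chosen = rng.sample(items, target)  # drop surplus samples at random
        else:
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra              # duplicate samples at random
        X_out.extend(chosen)
        y_out.extend([label] * target)
    return X_out, y_out


def undersample(X: list, y: list, seed: int = 0) -> Tuple[list, list]:
    """S701: shrink every class to the size of the smallest class."""
    target = min(len(v) for v in _by_class(X, y).values())
    return _resample_to(X, y, target, seed)


def oversample(X: list, y: list, seed: int = 0) -> Tuple[list, list]:
    """S702: grow every class to the size of the largest class."""
    target = max(len(v) for v in _by_class(X, y).values())
    return _resample_to(X, y, target, seed)


def combined_sample(X: list, y: list, seed: int = 0) -> Tuple[list, list]:
    """S703 (assumed form): meet halfway between smallest and largest class."""
    sizes = [len(v) for v in _by_class(X, y).values()]
    return _resample_to(X, y, (min(sizes) + max(sizes)) // 2, seed)
```

Resampling is applied to the training set only; the marked data test set of step S400 keeps its original class distribution so that the evaluation in step S900 reflects the real imbalance between linked and unlinked pairs.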

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110626138.2A CN113378907B (en) 2021-06-04 2021-06-04 Automated software traceability recovery method for enhancing data preprocessing process

Publications (2)

Publication Number Publication Date
CN113378907A (en) 2021-09-10
CN113378907B (en) 2024-01-09

Family

ID=77575904

Country Status (1)

Country Link
CN (1) CN113378907B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant