CN116627845A - Capture-Recapture-based software defect prediction method and system

Info

Publication number: CN116627845A
Application number: CN202310889238.3A
Authority: CN
Inventors: 张岩; 王玉洁; 许龙豹; 李伟; 吴玉忠; 张帆; 刘凯旋
Original assignee: Shandong Lushangtong Technology Co ltd
Current assignee: Shandong Lushangtong Technology Co ltd
Priority date: 2023-07-20
Filing date: 2023-07-20
Publication date: 2023-08-22

Abstract

The invention relates to the technical field of software prediction, and in particular to a Capture-Recapture-based software defect prediction method and system. The method comprises the following steps: acquiring related data in the software development process; preprocessing the acquired related data; extracting features from the preprocessed related data; constructing a prediction model based on a Capture-Recapture algorithm, and training and verifying the prediction model using the acquired related data; evaluating the trained prediction model, and tuning the model according to the evaluation result; and obtaining a software defect prediction result using the tuned prediction model. By collecting and processing related data from the software development process, the invention can predict and tally defects and present them to the user in an intuitive visual form.

Description

Capture-Recapture-based software defect prediction method and system

Technical Field

The invention relates to the technical field of software prediction, and in particular to a Capture-Recapture-based software defect prediction method and system.

Background

Static code analysis is a technique that identifies potential defects by parsing source code and analyzing its structure. It can check for coding errors, style violations, inconsistencies, potential security vulnerabilities and so on. Its advantages are that it can find potential defects early in the development process, which helps reduce later repair costs; static analysis tools can run automatically, which improves efficiency and reduces human error; and it can detect common defect patterns and violations of best practice, which helps improve code quality. Its disadvantages are that static analysis tools may produce a large number of false positives, that is, flag defect-free code as defective, which then requires manual verification and exclusion; static analysis cannot fully understand the semantics and context of the code, so implicit defects are sometimes missed; and for large, complex software systems, the accuracy of static analysis tools may be limited by the complexity and scale of the code.

Dynamic testing discovers defects and errors by running the software in a simulated execution environment. It takes various forms, including unit testing, integration testing and system testing. Its advantages are that it can simulate the execution of the software in its actual running environment, which helps find defects that occur in real scenarios; it can catch defects related to runtime errors and dynamic behavior, which helps improve the robustness of the code; and a particular function or execution path can be tested by designing and executing specific test cases. Its disadvantages are that it needs a large number of test cases and high coverage to discover all possible defects, which can be expensive and time-consuming; it is usually performed at a later stage of software development, so defects may not be found early and repair costs increase; and it may not cover all possible execution paths and boundary conditions, so it may not find all defects.

Machine learning techniques predict software defects by learning patterns and features from historical data and constructing prediction models. Common machine learning methods include decision trees, support vector machines and neural networks. Their advantages are that a machine learning model can predict defects in new code by learning defect patterns from historical data; it can process large amounts of historical data and improve prediction accuracy by mining hidden patterns and correlations; and it can run automatically, which suits large-scale software systems. Their disadvantages are that the accuracy of a machine learning model depends on the quality of the training data and the correctness of the labels, so a lack of high-quality labeled data may degrade performance; for complex software systems, the model may not capture all complex defect patterns, which limits performance; and selecting appropriate features is critical to the accuracy and reliability of the model, so improper feature selection may distort the prediction results.

As can be seen from the above, there are certain drawbacks in the above conventional machine learning prediction models, so a new software defect prediction method and system are needed.

Capture-Recapture: a statistical method for estimating the number of unobserved individuals in a population. It is commonly used in fields such as ecology, wildlife conservation and demography to infer the overall size of a population.

Disclosure of Invention

In order to solve the above problems, the invention provides a Capture-Recapture-based software defect prediction method and system.

In a first aspect, the present invention provides a Capture-Recapture-based software defect prediction method, which adopts the following technical scheme:

A Capture-Recapture-based software defect prediction method comprises the following steps:

acquiring related data in the software development process;

preprocessing the acquired related data;

extracting characteristics from the preprocessed related data;

constructing a prediction model based on a Capture-Recapture algorithm, and training and verifying the prediction model using the acquired related data in the software development process;

evaluating the trained prediction model, and performing model tuning according to an evaluation result;

and obtaining a software defect prediction result by using the tuned prediction model.

Further, the related data acquired in the software development process includes defect reports, system logs and code audit records.

Further, preprocessing the acquired related data includes removing duplicate data and handling missing values and outliers in the acquired related data.

Further, extracting features from the preprocessed related data includes extracting code quality metrics, developer information and project attributes from the related data.

Further, constructing the prediction model based on the Capture-Recapture algorithm includes sampling according to the capture-recapture principle using the Capture-Recapture algorithm, and estimating the population size from the capture counts.

Further, training and verifying the prediction model using the acquired related data includes dividing the acquired related data into a training set and a validation set, training the prediction model with the training set, and verifying the prediction performance of the prediction model with the validation set.

Further, evaluating the trained prediction model and tuning the model according to the evaluation result includes evaluating the accuracy, precision, recall and F1 score of the prediction model, and adjusting the parameters of the prediction model according to the evaluation result.

In a second aspect, a Capture-Recapture-based software defect prediction system includes:

the data acquisition module is configured to acquire related data in the software development process;

the preprocessing module is configured to preprocess the acquired related data;

the feature extraction module is configured to extract features from the preprocessed related data;

the model construction module is configured to construct a prediction model based on a Capture-Recapture algorithm, and train and verify the prediction model using the acquired related data in the software development process;

the tuning module is configured to evaluate the trained prediction model and perform model tuning according to an evaluation result;

and the prediction module is configured to obtain a software defect prediction result by using the tuned prediction model.

In a third aspect, the present invention provides a computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor of a terminal device to perform the Capture-Recapture-based software defect prediction method.

In a fourth aspect, the present invention provides a terminal device, including a processor and a computer readable storage medium, where the processor is configured to implement instructions, and the computer readable storage medium stores a plurality of instructions adapted to be loaded by the processor to perform the Capture-Recapture-based software defect prediction method.

In summary, the invention has the following beneficial technical effects:

1. By collecting and processing related data from the software development process, such as defect reports and code review records, the invention can predict and tally defects and present them to the user in an intuitive visual form. This gives software development teams valuable data support, helping them better understand and manage the defects in a project and thereby improve software quality and development efficiency.

2. The system of the present invention uses a variety of technologies to achieve its functionality. The front end adopts the Vue.js framework and, through its component-based and reactive features, implements a flexible and highly customizable data visualization module. The back end uses the Java language and the Spring Boot framework to provide a stable and efficient background service. Data storage and management use a relational database, which exchanges data with the front end through an API. In addition, the R language is used for algorithm computation, implementing the software defect prediction function of the Capture-Recapture algorithm.

3. The technical scheme applies the Capture-Recapture algorithm from biology to software defect prediction. Introducing this algorithm makes the prediction result more accurate and reliable and gives the software development team more practical data analysis. In addition, by combining data visualization with defect statistics, the invention provides intuitive, easy-to-understand charts, so that a user can quickly grasp and analyze the defect status of a project, which supports data-driven decision making and improvement.

4. The efficiency of the present invention is reflected in several aspects. First, automatic data collection and processing reduce manual work and time cost and improve the accuracy and reliability of the data. Second, defect statistics are displayed visually, so that a user can see the defect status of a project at a glance, which improves the understandability of the data and the efficiency of analysis. In addition, by introducing algorithms and models to predict and analyze defects, the invention helps the user discover and resolve potential problems earlier, improving the efficiency and quality of software development.

In summary, from a business perspective the invention provides software defect prediction and data analysis functions; from a technical perspective it adopts front-end and back-end frameworks, data storage and other technical means; from an innovation perspective it introduces the Capture-Recapture algorithm; and from an efficiency perspective it improves the efficiency of data processing and analysis. These characteristics give the platform significant application value in industry and software development.

Drawings

FIG. 1 is a schematic diagram of a Capture-Recapture-based software defect prediction method according to embodiment 1 of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

Example 1

Referring to FIG. 1, the Capture-Recapture-based software defect prediction method of the present embodiment includes:

acquiring related data in the software development process;

preprocessing the acquired related data;

extracting characteristics from the preprocessed related data;

constructing a prediction model based on a Capture-Recapture algorithm, and training and verifying the prediction model using the acquired related data in the software development process;

evaluating the trained prediction model, and tuning the model according to the evaluation result;

and obtaining a software defect prediction result using the tuned prediction model.

The related data acquired in the software development process includes defect reports, system logs and code audit records. Preprocessing the acquired related data includes removing duplicate data and handling missing values and outliers. Extracting features from the preprocessed related data includes extracting code quality metrics, developer information and project attributes. Constructing the prediction model based on the Capture-Recapture algorithm includes sampling according to the capture-recapture principle and estimating the population size from the capture counts. Training and verifying the prediction model includes dividing the acquired related data into a training set and a validation set, training the prediction model with the training set, and verifying its prediction performance with the validation set. Evaluating the trained prediction model and tuning it according to the evaluation result includes evaluating the accuracy, precision, recall and F1 score of the prediction model and adjusting its parameters according to the evaluation result.
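The four evaluation metrics named above can be computed from the confusion counts of a binary defect prediction. The sketch below is a minimal illustration, assuming labels where 1 means defective and 0 means not defective. The `Metrics` class, the `evaluate` function and the sample labels are hypothetical names and data introduced here, not part of the described system.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = defective)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defects correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean modules left alone
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed defects

    accuracy = (tp + tn) / len(pairs)
    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return Metrics(accuracy, precision, recall, f1)


# Hypothetical validation-set labels and predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
result = evaluate(actual, predicted)
print(result)
```

In this sample there are three true positives, three true negatives, one false positive and one false negative, so all four metrics come out at 0.75.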

Specifically, the method comprises the following steps:

First, the Capture-Recapture algorithm is based on the following assumptions:

Assume a population made up of individuals (for example animals, fish or people). The goal is to estimate the size of this population, but it is not possible to observe and count every individual directly. Instead, the estimate can be made from two rounds of observation.

First round of observation (capture): in this round, a portion of the individuals are captured and then marked, numbered or otherwise made identifiable so that they can be distinguished. They are then released back into the population.

Second round of observation (recapture): some time after the first round, a second round of observation is performed. In this round, some individuals are captured again. This round records how many of the captured individuals were already marked in the first round (recaptured individuals) and how many are newly captured (unmarked).

Based on the capture and recapture data, statistical models and methods can be applied to estimate the size of the population. These methods take into account the probability that a marked individual is captured again in the second round, as well as the probability that an unmarked individual is captured in the first and second rounds. In this way, the number of unobserved individuals can be estimated from the known numbers of marked and recaptured individuals, and the overall population size can be inferred.
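The reasoning above can be made concrete with the simplest two-round estimator, usually called the Lincoln-Petersen estimator. The symbols are chosen here to match those used later in this description: N individuals are marked in the first round, M individuals are captured in the second round, C of those M already carry a mark, and S is the unknown population size. Assuming marked individuals mix evenly back into the population, the marked fraction of the second sample should approximate the marked fraction of the whole population:

```latex
\frac{C}{M} \approx \frac{N}{S}
\quad\Longrightarrow\quad
\hat{S} = \frac{N \, M}{C}
```

As an illustrative example with made-up numbers, marking N = 50 individuals and then finding C = 10 marked individuals among M = 40 captured in the second round gives an estimate of 50 × 40 / 10 = 200 individuals.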

S1, acquiring related data in the software development process;

Related data from the software development process is collected, such as defect reports, version control system logs and code audit records.

S2, preprocessing the acquired related data;

The data is cleaned and preprocessed: duplicate data is removed, and missing values, outliers and the like are handled.

The processing of software defect data is described further below using an example data set. Suppose there is a software defect database containing the following fields:

1. Defect ID: a unique identifier for each defect report.

2. Reporter: the name or ID of the person reporting the defect.

3. Priority: the priority of the defect, which may be high, medium or low.

4. Status: the current state of the defect, such as repaired, pending or verified.

5. Description: a detailed description of the defect.

In this example dataset, the following data cleansing and preprocessing tasks were performed:

1. Removing duplicate data:

Assume that there are two records in the data set:

defect ID 001, reporter Alice, priority high, status pending, description 1;

defect ID 001, reporter Bob, priority, status pending, description 1;

Handling: based on the defect ID, the two records can be recognized as duplicates. One of the records may be kept and the other deleted, ensuring that there is only one record per defect.

2. Handling missing values:

Assume that the data set contains the following records, in which the reporter and description fields have missing values:

defect ID 001, reporter, priority high, status pending, description:

defect ID 002, reporter John, priority, status pending, description defect description 2;

Handling: the missing values may be imputed. For the reporter field, missing values may be filled in as "unknown", or the affected records may be deleted. For the description field, if the proportion of missing values is small, the records containing missing values may be deleted. If there are many missing values, text mining techniques or natural language processing models, such as text classifiers, can be used to fill them in.

3. Handling outliers:

Assume that the data set contains the following record, in which the priority field contains an outlier:

defect ID 002, reporter Bob, priority 123, status pending, description defect description 2

Handling: in this case, the outlier in the priority field ("123") may be treated as a data entry error and replaced with a reasonable value, for example the medium priority ("medium"). If another reasonable replacement value exists, it may be used instead.

These are the data cleaning and preprocessing tasks and their corresponding processing methods that are common when the present system processes software defect data. Appropriate techniques and methods may be applied to clean and pre-process the data to ensure data quality and reliability of subsequent analysis, depending on the particular data set and analysis purpose.
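The three cleaning tasks above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system's actual implementation: records are assumed to be dictionaries, and the field names (`defect_id`, `reporter`, `priority`, `status`, `description`), the `clean_defects` function and the sample records are hypothetical. Of the options the text offers, the sketch fills a missing reporter with "unknown", deletes records without a description, and replaces an unrecognized priority with "medium".

```python
VALID_PRIORITIES = {"high", "medium", "low"}


def clean_defects(records):
    """Return a cleaned copy of a list of defect-report dictionaries."""
    seen_ids = set()
    cleaned = []
    for record in records:
        # 1. Duplicates: skip a record whose defect ID has already been kept.
        if record["defect_id"] in seen_ids:
            continue
        # 2. Missing values: drop records that have no description.
        if not record.get("description"):
            continue
        row = dict(record)  # copy so the caller's data is not modified
        # 2. Missing values: fill an empty reporter with a placeholder.
        if not row.get("reporter"):
            row["reporter"] = "unknown"
        # 3. Outliers: treat an unrecognized priority as an entry error.
        if row.get("priority") not in VALID_PRIORITIES:
            row["priority"] = "medium"
        seen_ids.add(row["defect_id"])
        cleaned.append(row)
    return cleaned


# Hypothetical sample records covering each case.
raw = [
    {"defect_id": "001", "reporter": "Alice", "priority": "high",
     "status": "pending", "description": "description 1"},
    {"defect_id": "001", "reporter": "Bob", "priority": "medium",
     "status": "pending", "description": "description 1"},
    {"defect_id": "002", "reporter": "Bob", "priority": "123",
     "status": "pending", "description": "defect description 2"},
    {"defect_id": "003", "reporter": "", "priority": "low",
     "status": "pending", "description": "defect description 3"},
    {"defect_id": "004", "reporter": "John", "priority": "low",
     "status": "pending", "description": ""},
]
cleaned = clean_defects(raw)
for row in cleaned:
    print(row["defect_id"], row["reporter"], row["priority"])
```

Keeping the first record for each defect ID is a design choice of this sketch; a real data set might instead keep the most recently updated record.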

S3, extracting features from the preprocessed related data;

Features are extracted from the collected data, including code quality metrics, developer information and project attributes. Feature selection and dimensionality reduction are then performed to reduce the number of feature dimensions and improve model performance.

Specifically:

For feature selection and dimensionality reduction, the system uses correlation analysis and Principal Component Analysis (PCA) to interpret and reduce the features of the data set. These methods are described in detail below:

Correlation analysis methods:

1. Pearson correlation coefficient: the Pearson correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. Its value ranges from -1 to +1. A value near +1 indicates a positive correlation, a value near -1 indicates a negative correlation, and a value near 0 indicates no linear correlation.

2. Spearman correlation coefficient: the Spearman correlation coefficient measures the monotonic relationship between two variables and does not require the variables to be continuous. It computes the correlation after converting the raw data into ranks, and is therefore suitable for nonlinear relationships.

These correlation analysis methods reveal the strength of the relationship between each feature and the target variable, so that the most relevant features can be selected for further analysis and modeling.
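The difference between the two coefficients can be seen on a small example. The sketch below is a plain-Python illustration using only the standard library; the function names and the sample data (a hypothetical complexity score per module and a defect count that grows quadratically with it) are assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical feature and target: the relationship is monotonic but not linear.
complexity = [1, 2, 3, 4, 5]
defect_count = [1, 4, 9, 16, 25]
r_pearson = pearson(complexity, defect_count)
r_spearman = spearman(complexity, defect_count)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because the defect count always increases with complexity, the Spearman coefficient is 1, while the Pearson coefficient is slightly lower (about 0.98) because the relationship is not a straight line.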

The principal component analysis used in the present system is a common dimensionality reduction technique. It converts the original features into a new set of principal components, each of which is a linear combination of the original features. Its goal is to reduce the dimensionality of the data set while retaining the most important information.

The principal component analysis steps are as follows:

(1) Standardize the data: for each feature, subtract the mean from each value and divide by the standard deviation, so that all features are on a similar scale.

(2) Calculate the covariance matrix: compute the covariance matrix of the standardized features. The covariance matrix reflects the correlation between features.

(3) Calculate eigenvalues and eigenvectors: perform an eigendecomposition of the covariance matrix to obtain the eigenvalues and the corresponding eigenvectors. Each eigenvector gives the direction of a principal component, and its eigenvalue gives the amount of variation in the data along that direction.

(4) Select principal components: select the most important eigenvectors (principal components) in descending order of eigenvalue. The number of principal components may be chosen according to the amount of information to be retained or the proportion of variance explained.

(5) Transform the data: project the original data set onto the selected principal components to obtain the dimension-reduced data set.

Through principal component analysis, the original high-dimensional data can be converted into lower-dimensional data while retaining the most important information. This reduces the complexity of the feature space and improves the interpretability and computational efficiency of the model.
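The five steps above can be sketched without external libraries. To stay short, the sketch extracts only the first principal component and uses power iteration in place of a full eigendecomposition, which is a substitution made for this illustration. The function names and the toy data are hypothetical, and the code assumes that no feature is constant and that the start vector is not orthogonal to the dominant eigenvector.

```python
import math


def standardize(rows):
    """Step 1: give every feature zero mean and unit sample standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
        for col, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def covariance_matrix(rows):
    """Step 2: sample covariance matrix of data that is already centered."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def first_component(cov, iterations=200):
    """Steps 3-4: power iteration for the largest eigenvalue and its eigenvector."""
    vec = [1.0 / (i + 1) for i in range(len(cov))]  # arbitrary non-uniform start
    for _ in range(iterations):
        nxt = mat_vec(cov, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(v * w for v, w in zip(vec, mat_vec(cov, vec)))
    return eigenvalue, vec


def project(rows, component):
    """Step 5: project each standardized row onto the chosen component."""
    return [sum(v * c for v, c in zip(row, component)) for row in rows]


# Toy data (hypothetical): [lines of code, complexity score] per module.
modules = [[100, 2], [200, 4], [300, 6], [400, 8]]
scaled = standardize(modules)
eigenvalue, component = first_component(covariance_matrix(scaled))
explained = eigenvalue / len(component)  # for standardized data the trace equals the feature count
reduced = project(scaled, component)     # one value per module instead of two
print(round(eigenvalue, 6), round(explained, 6))
```

In this toy data the two features are perfectly correlated, so a single component retains essentially all of the variance. Real feature sets would normally need several components and a full eigendecomposition.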

S4, constructing a prediction model based on a Capture-Recapture algorithm, and training and verifying the prediction model using the acquired related data in the software development process;

A defect prediction model is constructed based on the Capture-Recapture algorithm. Using the R language for feature extraction and model training, the prediction model can estimate the number of defects likely to appear in the future. The user can trigger the prediction process to obtain a prediction result. The Capture-Recapture-based software defect prediction system helps development teams predict and analyze defects in software projects by providing functions such as data import, defect prediction, data analysis and visualization. It optimizes the software development process and provides features such as security and user-defined settings.

The Capture-Recapture algorithm is calculated as follows:

Assume two samplings. The first sampling captures N individuals, denoted A. The second sampling captures M individuals, denoted B, of which C individuals were also captured in the first sampling. Then, according to the Capture-Recapture algorithm, the size of the whole population, denoted S, can be estimated.

According to the capture-recapture principle, the following formula can be derived:

S = (M + 1) × (N + 1) / (C + 1) - 1

where:

N is the number of individuals captured in the first sampling.

M is the number of individuals captured in the second sampling.

C is the number of individuals captured in both samplings.

The derivation of this formula is based on the assumption that in two samples, the capture of individuals is random and independent, and the probability of each individual being captured is equal. From this assumption, the size of the entire population can be estimated using the known number of captures.
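The formula S = (M + 1) × (N + 1) / (C + 1) - 1 can be evaluated directly. The sketch below is a minimal illustration that assumes each "sampling" is the set of defect IDs found by one independent inspection of the same code, so that N, M and C are the sizes of the two sets and of their overlap. That mapping, the function names and the sample IDs are assumptions made for this example.

```python
def estimate_population(n_first, m_second, c_both):
    """Evaluate S = (M + 1) * (N + 1) / (C + 1) - 1 for the given counts."""
    if min(n_first, m_second, c_both) < 0:
        raise ValueError("counts must be non-negative")
    if c_both > min(n_first, m_second):
        raise ValueError("the overlap cannot exceed either sample size")
    return (m_second + 1) * (n_first + 1) / (c_both + 1) - 1


def estimate_from_defect_ids(found_first, found_second):
    """Derive N, M and C from two collections of defect IDs, then estimate S."""
    first, second = set(found_first), set(found_second)
    return estimate_population(len(first), len(second), len(first & second))


# Hypothetical findings of two independent inspections.
inspection_a = ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9"]
inspection_b = ["D7", "D8", "D9", "D10", "D11", "D12", "D13"]

estimated_total = estimate_from_defect_ids(inspection_a, inspection_b)
distinct_found = len(set(inspection_a) | set(inspection_b))
print(estimated_total, estimated_total - distinct_found)
```

Here N = 9, M = 7 and C = 3, so the estimate is 8 × 10 / 4 - 1 = 19 defects in total. Since 13 distinct defects were actually found, roughly 6 are estimated to remain undetected.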

To avoid introducing estimation errors due to sampling bias, the Capture-Recapture algorithm also considers a Chapman correction factor. The modified formula is as follows:

S = ((M + 1) × (N + 1) / (C + 1) - 1) / (1 - (M + 1) / (N + 1))

This correction factor allows for the fact that the probability of an individual being captured in the two samplings may not be exactly equal. With the correction, the size of the entire population can be estimated more accurately.

As a further embodiment of the method of the present invention, the invention implements the following prediction models for the user to choose from:

1. Lincoln-Petersen model:

The model is based on two independent samples. It counts the individuals that appear in both samples and then estimates the total population from the proportion they represent. In practice, the identity of each individual in each sample must be recorded so that repeated individuals can be counted. Its advantages are that it is simple, easy to understand and easy to implement, and that for the two-sample case estimating the total by proportion is feasible. Its disadvantages are that it ignores differences in individual capture probability, which may bias the estimate, and that it assumes the total population is constant, so it does not apply when the population changes.

2. Chapman model:

The model takes differences in individual capture probability into account. It estimates the total population from the number of individuals that appear in both samples and the number that appear in only one, combined with the ratio of individual capture probabilities. In practice, the identity and capture status of each individual in each sample must be recorded, and the numbers of repeated and unique individuals counted. Its advantages are that it accounts for differences in individual capture probability, which better matches real situations, and that it gives a more accurate estimate for the two-sample case. Its disadvantages are that it assumes the total population is constant, so it does not apply when the population changes, and that it considers only the repeated and unique individuals in the two samples, so other information may be ignored.

3. Jolly-Seber model:

The model is suitable for multiple samplings and takes dynamic changes in the total population into account. It estimates the total population by counting the newly added and repeated individuals in each sampling while accounting for differences in individual capture probability. In practice, the identity and capture status of each individual in each sampling must be recorded, and the numbers of repeated and newly added individuals counted. Its advantages are that it applies to multiple samplings and allows for changes in the total population; it accounts for differences in individual capture probability, giving a more accurate estimate; and it considers both repeated and newly added individuals in each sampling, giving more comprehensive information. Its disadvantage is that multiple samplings are required, which increases sampling cost and effort.

4. Schnabel model:

The model is an extension of the Jolly-Seber model that takes time into account. By building a time model, it predicts the increasing or decreasing trend of the total population, and estimates the total by combining this with the numbers of repeated and newly added individuals in each sampling. In practice, a time model must be built in addition to the records and calculations of the Jolly-Seber model. Its advantages are that, by considering time, it can handle dynamic changes in the total population, and that it accounts for differences in individual capture probability, giving a more accurate estimate. Its disadvantages are that multiple samplings are required, which increases sampling cost and effort, and that time must be modeled, which may require more data and more complex calculations.

5. Cormack-Jolly-Seber model:

The model is an extension of the Jolly-Seber model that introduces the mortality or removal rate of individuals. By taking the death or removal of individuals into account, the total population can be estimated more accurately. The model records the identity, capture status and time information of each individual, counts the repeated and newly added individuals in each sampling, and infers the total population in combination with the mortality or removal rate. Its advantages are that it accounts for the death or removal of individuals, which in this project corresponds to a defect being repaired, giving a more accurate estimate, and that it can handle both changes in the total population and differences in individual capture probability. Its disadvantages are that the mortality or removal rate must be estimated accurately, and the estimate of this parameter may carry some uncertainty, and that multiple samplings are required, which increases sampling cost and effort.

6. Barker model:

The model is a variation of the Jolly-Seber model that takes individual migration into account. By introducing an individual migration rate, it can estimate both the total population and the migration rate. The model records the identity, capture status and time information of each individual, counts the repeated and newly added individuals in each sampling, tracks the migration of individuals, and from these infers the total population and the migration rate. Its advantages are that it considers individual migration and provides estimates of both the total population and the migration rate, and that it is suitable for studying individual migration and population structure. Its disadvantages are that the migration rate must be estimated accurately, and the estimate of this parameter may carry some uncertainty, and that multiple samplings are required, which increases sampling cost and effort.

7. Pradel model:

The model is applicable to populations with an age structure. It divides the population into age groups and considers the survival and migration of individuals in order to estimate the total number and survival rate of each age group. The model records the identity, age, capture status and time information of each individual, counts the newly added individuals in each sampling, tracks the survival and migration of individuals, and from these infers the total number and survival rate of each age group. Its advantages are that it suits populations with an age structure and provides survival estimates for different age groups, and that it considers the survival and migration of individuals, giving a more accurate estimate. Its disadvantages are that the survival rate of each age group must be estimated accurately, and the estimate of this parameter may carry some uncertainty, and that multiple samplings are required, which increases sampling cost and effort.

8. Huggins model:

The model is a continuous-time Capture-Recapture model that can handle the continuous appearance and departure of individuals. It combines the appearance time and departure time of the individuals to estimate their overall number and activity pattern. The model records the identity, capture status, time and activity pattern information of the individuals, and infers their overall number and activity pattern from the appearance and departure times. The advantage is that it can handle the continuous appearance and departure of individuals and is suitable for studying their activity patterns; because the appearance and departure times are taken into account, it provides a more accurate estimation result. The disadvantage is that the activity pattern of the individuals must be estimated accurately, and the estimate of this parameter may carry some uncertainty. Multiple samplings are also required, which increases the sampling cost and workload.

The specific procedure for predicting software defects using the Jolly-Seber model is as follows:

(1) Data preparation:

A data set comprising a plurality of software projects is prepared. It includes features associated with software defects, such as code size, development time and developer experience. The data set also includes a defect label for each smallest code module, with 1 representing defective and 0 representing non-defective.

The user uploads the data to the system. The system first cleans the data, performing data supplementation, data deletion and data transformation according to preset rules.
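The cleaning step can be illustrated with a minimal Python sketch. The preset rules are not specified in this description, so the rules used here are assumptions made only for illustration: deleting duplicates and unlabeled rows, filling missing values with the median, and min-max scaling. The field names `loc`, `dev_years` and `defect` and the function `clean_records` are likewise hypothetical.

```python
from statistics import median

def clean_records(records, numeric_fields, label_field="defect"):
    """Apply three illustrative (assumed) cleaning rules to a list of dicts."""
    # Data deletion: drop exact duplicates and rows without a defect label.
    seen, kept = set(), []
    for row in records:
        key = tuple(sorted(row.items()))
        if key in seen or row.get(label_field) is None:
            continue
        seen.add(key)
        kept.append(dict(row))  # copy so the caller's data is not modified
    if not kept:
        return kept
    # Data supplementation: fill missing numeric values with the field median.
    for field in numeric_fields:
        present = [r[field] for r in kept if r.get(field) is not None]
        fill = median(present) if present else 0
        for r in kept:
            if r.get(field) is None:
                r[field] = fill
    # Data transformation: min-max scale each numeric field to [0, 1].
    for field in numeric_fields:
        values = [r[field] for r in kept]
        lo, hi = min(values), max(values)
        for r in kept:
            r[field] = 0.0 if hi == lo else (r[field] - lo) / (hi - lo)
    return kept

rows = [
    {"module": "a.py", "loc": 120, "dev_years": 3, "defect": 1},
    {"module": "a.py", "loc": 120, "dev_years": 3, "defect": 1},     # duplicate
    {"module": "b.py", "loc": None, "dev_years": 5, "defect": 0},    # missing value
    {"module": "c.py", "loc": 480, "dev_years": 1, "defect": None},  # no label
    {"module": "d.py", "loc": 300, "dev_years": 1, "defect": 0},
]
cleaned = clean_records(rows, ["loc", "dev_years"])
```

In this sketch the duplicate and the unlabeled row are removed, and the missing `loc` value is filled before scaling. A real deployment would substitute whatever rules the system is configured with.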

(2) Dividing data:

After the user data has been cleaned, the data set division parameters are selected. The data is divided into K equal subsets according to the equal-division principle of K-fold cross-validation, and the resulting division is fixed for the K rounds of validation.

(3) Model structure and formula:

The Jolly-Seber model is a classical Capture-Recapture model for estimating the survival and death probabilities of individuals in a dynamic population. The formulas used with the Jolly-Seber model are as follows:

Survival probability: p(t) = S(t) / N(t-1)

Capture probability: c(t) = C(t) / N(t)

Probability of defect discovery: f(t) = F(t) / C(t)

Where S(t) represents the number of individuals that survived before time t, N(t) represents the total number of individuals at time t, C(t) represents the number of individuals that were captured during time t, and F(t) represents the number of individuals that were captured and defective during time t.
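As a worked illustration of the three ratios above, the following Python sketch computes them from per-period counts. The function name `jolly_seber_ratios` and the sample counts are hypothetical and are not taken from the described system; the sketch assumes every N(t) is positive.

```python
def jolly_seber_ratios(survived, total, captured, defective):
    """Compute p(t), c(t) and f(t) for t = 1 .. T from count lists indexed by t.

    survived[t]  -> S(t), total[t] -> N(t),
    captured[t]  -> C(t), defective[t] -> F(t).
    Assumes total[t] > 0 for every t; f(t) is reported as 0.0 when C(t) is 0.
    """
    rates = []
    for t in range(1, len(total)):
        p = survived[t] / total[t - 1]   # survival probability
        c = captured[t] / total[t]       # capture probability
        f = defective[t] / captured[t] if captured[t] else 0.0  # defect discovery
        rates.append({"t": t, "p": p, "c": c, "f": f})
    return rates

# Hypothetical counts for three sampling periods (t = 0, 1, 2).
total = [100, 90, 80]
survived = [0, 80, 72]
captured = [20, 30, 20]
defective = [5, 6, 5]
rates = jolly_seber_ratios(survived, total, captured, defective)
```

With these sample counts, the first period gives p(1) = 80/100, c(1) = 30/90 and f(1) = 6/30.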

(4) Parameter estimation and model training:

The data of the training set is used to perform parameter estimation and model training K times. In the system, the values of the survival probability p(t), the capture probability c(t) and the defect discovery probability f(t) are estimated from the acquired data by methods such as maximum likelihood estimation.
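Maximum likelihood estimation of a single probability can be sketched as follows. This is only an illustration under a simplifying assumption that is not stated in this description, namely that each count is treated as a binomial outcome (for example, C(t) captures out of N(t) individuals). The function names and the grid-search approach are hypothetical choices.

```python
import math

def binomial_log_likelihood(prob, successes, trials):
    """Log-likelihood of observing `successes` in `trials` (constant term dropped)."""
    return successes * math.log(prob) + (trials - successes) * math.log(1.0 - prob)

def mle_probability(successes, trials, steps=1000):
    """Grid-search the probability in (0, 1) that maximizes the log-likelihood."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda q: binomial_log_likelihood(q, successes, trials))

# Hypothetical counts: 30 captured out of 90 individuals, 6 defective out of 30 captured.
c_hat = mle_probability(30, 90)
f_hat = mle_probability(6, 30)
```

Under the binomial assumption the maximizer coincides, up to the grid resolution, with the simple ratio successes / trials, so the grid search mainly shows what "maximizing the likelihood" means. A full Jolly-Seber likelihood involves more parameters and would normally be fitted with a numerical optimizer.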

(5) Model verification and performance evaluation:

The performance of the trained model is verified and evaluated using the data of the verification set. For each verification, the survival probability p(t), the capture probability c(t) and the defect discovery probability f(t) are calculated according to the trained model and formulas, and the resulting predictions are compared with the actual defect labels. The accuracy, precision, recall, F1 value and other indexes of each verification are also displayed to evaluate and compare the performance of the model.

(6) Model tuning:

if the model parameters are not ideal, the user can perform parameter tuning and model improvement. The best combination of parameters is selected using cross-validation techniques to improve the performance of the model.

S5, evaluating the trained prediction model, and performing model tuning according to an evaluation result;

Specifically:

after the model is trained, tuning can be performed to further improve the performance of the model. The following are the evaluation and tuning steps for the model in the present system:

(1) Data preparation:

The data set is divided into a training set and a test/verification set. The system divides the data set into K folds, and K-fold cross-validation is implemented using the functions provided by the machine learning library scikit-learn.

K-fold cross validation is a commonly used model evaluation method for partitioning and validating data sets in model tuning. The principle and the implementation mode are as follows:

(1) The original data set is divided into K equal subsets called folds.

(2) One of the folds is selected as a verification set in turn, and the remaining K-1 folds are selected as a training set.

(3) The model is trained using a training set and then evaluated using a validation set.

(4) Steps (2) and (3) are repeated K times, with a different fold selected as the verification set each time, to obtain K model performance evaluation results.

(5) Finally, the average of the K evaluation results is taken as the performance evaluation index of the model.

K-fold cross-validation can help us evaluate the performance of the model more fully, reduce the impact of the selection of training and validation sets on the results, and provide an estimate of the stability and generalization ability of the model. By repeating the K-fold cross-validation multiple times, the performance of the model can be more reliably evaluated and the optimal parameter configuration or model structure can be selected.
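The fold rotation described in steps (1) to (5) can be sketched in plain Python as follows. The function names are hypothetical and the sketch only illustrates the principle; in practice the `KFold` class in scikit-learn's `sklearn.model_selection` module provides the same kind of split.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        validation = indices[start:stop]             # this fold is held out
        train = indices[:start] + indices[stop:]     # the remaining K-1 folds
        yield train, validation
        start = stop

def cross_validate(n_samples, k, fit_and_score):
    """Average the scores returned by fit_and_score(train, validation)."""
    scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))
```

Here `fit_and_score` stands for any caller-supplied function that trains on the training indices and returns an evaluation score on the validation indices. Each sample appears in exactly one verification fold.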

(2) Model evaluation:

The Jolly-Seber model is trained using the training set to obtain the parameter estimates. Model evaluation is then performed using the data of the test/verification set, and the performance indexes of the model are calculated.

(3) Calculating performance indexes:

Accuracy: the ratio of the number of samples the model predicts correctly to the total number of samples. Accuracy = (true positive + true negative) / (true positive + false positive + true negative + false negative).

Precision: the proportion of samples that are actually positive among the samples the model predicts as positive. Precision = true positive / (true positive + false positive). Precision focuses on how reliable the positive predictions of the model are.

Recall: the proportion of samples that the model predicts as positive among the samples that are actually positive; it focuses on how well the model covers the positive examples. Recall = true positive / (true positive + false negative). Recall is further assessed with the ROC curve, which plots recall (the true positive rate) against the false positive rate at different classification thresholds and is used to evaluate the performance of the model under different thresholds. AUC (Area Under the Curve) is the area under the ROC curve and is typically used to measure the overall performance of the model. The closer the AUC value is to 1, the better the model performance.

F1 value: the harmonic mean of precision and recall, used to evaluate the comprehensive performance of the model. F1 value = 2 × (precision × recall) / (precision + recall). A higher F1 value indicates that the model achieves a better balance between precision and coverage.

Confusion matrix: the confusion matrix provides more detailed classification result information, including the numbers of true positives, true negatives, false positives and false negatives. By analyzing the confusion matrix, a more comprehensive model performance evaluation index can be obtained.
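The indexes above can be derived directly from the confusion matrix counts, as in the following minimal Python sketch. The function names and the sample labels are hypothetical; labels follow the convention of 1 for defective and 0 for non-defective.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = defective, 0 = non-defective)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1; ratios with a zero denominator are 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical labels for eight code modules.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

For these sample labels the counts are 3 true positives, 1 false positive, 3 true negatives and 1 false negative, so all four indexes equal 0.75.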

(4) Model tuning:

the best combination of parameters is selected by a K-fold cross-validation technique.

(5) Repeating the steps (2) - (4):

repeating steps (2) - (4) until satisfactory model performance is achieved. Different evaluation indexes and graphs may be used to track the performance changes of the model.
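The evaluate-and-tune loop can be sketched as a search over candidate parameter values scored by their mean F1 value across folds. The tuned parameter here, a decision threshold on an estimated defect probability, is an assumption chosen only for illustration; the description does not name the parameters that are tuned. The function names and sample data are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 value from binary labels; 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_threshold(scores, labels, candidates, k=3):
    """Return (best_threshold, mean_f1) over k contiguous verification folds."""
    n = len(scores)
    fold_size = n // k
    best = (None, -1.0)
    for threshold in candidates:
        fold_scores = []
        for fold in range(k):
            start = fold * fold_size
            stop = n if fold == k - 1 else start + fold_size
            # A threshold needs no fitting, so only the held-out fold is used;
            # a model with trainable parameters would be refit on the other folds.
            preds = [1 if s >= threshold else 0 for s in scores[start:stop]]
            fold_scores.append(f1_score(labels[start:stop], preds))
        mean_f1 = sum(fold_scores) / k
        if mean_f1 > best[1]:
            best = (threshold, mean_f1)
    return best

# Hypothetical defect probabilities and labels for nine code modules.
scores = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1, 0.6, 0.3, 0.55]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1]
best_threshold, best_f1 = tune_threshold(scores, labels, [0.3, 0.5, 0.7])
```

The loop keeps the candidate with the highest mean F1 value; other indexes, such as AUC, could be substituted as the selection criterion.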

S6, obtaining a software defect prediction result by using the tuned prediction model.

Example 2

This embodiment provides a Capture-Recapture-based software defect prediction system, which comprises:

the preprocessing module is configured to preprocess the acquired related data;

A computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to perform the described Capture-Recapture-based software defect prediction method.

A terminal device comprising a processor and a computer readable storage medium, the processor being configured to execute instructions; the computer readable storage medium is for storing a plurality of instructions adapted to be loaded by the processor and to perform the described Capture-Recapture-based software defect prediction method.

The above embodiments are not intended to limit the scope of the present invention, so: all equivalent changes in structure, shape and principle of the invention should be covered in the scope of protection of the invention.

Claims

1. A Capture-Recapture-based software defect prediction method, characterized by comprising the following steps:

acquiring related data in the software development process;

preprocessing the acquired related data;

extracting characteristics from the preprocessed related data;

2. The Capture-Recapture-based software defect prediction method according to claim 1, wherein the related data acquired in the software development process includes defect reports, system logs and code audit records.

3. The Capture-Recapture-based software defect prediction method according to claim 2, wherein the preprocessing of the acquired related data comprises removing duplicate data and processing missing values and outliers in the related data acquired in the software development process.

4. The Capture-Recapture-based software defect prediction method according to claim 3, wherein the extracting of features from the preprocessed related data includes extracting code quality indexes, developer information and project attributes from the related data.

5. The Capture-Recapture-based software defect prediction method according to claim 4, wherein the constructing of the prediction model based on the Capture-Recapture algorithm comprises sampling using the Capture-Recapture algorithm based on the Capture-Recapture principle, and estimating the population size from the capture quantity.

6. The Capture-Recapture-based software defect prediction method according to claim 5, wherein the training and verifying of the prediction model using the acquired related data in the software development process comprises dividing the acquired related data into a training set and a verification set, training the prediction model using the training set, and verifying the prediction performance of the prediction model using the verification set.

7. The Capture-Recapture-based software defect prediction method according to claim 6, wherein the evaluating of the trained prediction model and the performing of model tuning according to the evaluation result comprises evaluating the accuracy, precision, recall and F1 value of the prediction model, and adjusting the parameters of the prediction model according to the evaluation result.

8. A Capture-Recapture-based software defect prediction system, comprising:

the preprocessing module is configured to preprocess the acquired related data;