CN116401143A - Software testing method and system based on unbalanced data set

Info

Publication number: CN116401143A
Application number: CN202211630770.5A
Authority: CN
Inventors: 程良伦; 陈武兴; 张伟文
Original assignee: Guangdong Nengge Knowledge Technology Co ltd; Guangdong University of Technology
Current assignee: Guangdong Nengge Knowledge Technology Co ltd; Guangdong University of Technology
Priority date: 2022-12-19
Filing date: 2022-12-19
Publication date: 2023-07-07

Abstract

The invention provides a software testing method and a system based on an unbalanced data set, wherein the method comprises the following steps: acquiring sample data of unbalanced software defects, and preprocessing the sample data; setting a cost sensitivity coefficient for the data category in the sample data according to the unbalance characteristic of the sample data after preprocessing; training a cost sensitive kernel width learning system classification model according to sample data comprising cost sensitive coefficients; determining an optimal penalty coefficient through grid search, and determining a target parameter of the model according to the optimal penalty coefficient; constructing a target classification model according to the target parameters, carrying out classification prediction on the test instance through the target classification model, and outputting a prediction result; the scheme can effectively eliminate the influence of the extremely unbalanced characteristic of the software defect data set on the classification performance, improves the classification accuracy, solves the problem of complicated training in the feature engineering work, improves the training efficiency of the model, and can be widely applied to the technical field of computers.

Description

Software testing method and system based on unbalanced data set

Technical Field

The invention relates to the technical field of computers, in particular to a software testing method and system based on an unbalanced data set.

Background

With the continuous expansion of the software industry, software quality has become a major concern, and software defects are among the most important threats to software quality. Defects often result from errors made by software engineers during coding, and repairing them often requires significant human resources and time. Detecting faulty modules early in development is therefore a problem that needs to be solved. Software defect prediction (SDP) is an important tool for improving software testing efficiency. SDP is an imbalanced problem because defect data are unevenly distributed: defective modules are far fewer than non-defective modules. Traditional machine learning methods assume that the data follow a balanced distribution and therefore tend to overlook the small number of defective modules.

Although many machine learning and deep learning algorithms achieve high classification accuracy, the performance of most of them degrades rapidly on imbalanced data: they pay excessive attention to the majority class and ignore the minority class. This is a serious flaw in applications such as software defect prediction, where normal modules usually outnumber abnormal ones but testing is chiefly concerned with the abnormal ones. A classifier that classifies every normal module correctly but fails to identify the abnormal ones can cause huge economic losses to the software development industry. In addition, although deep learning can extract deeper features, it does not fit well when the amount of data is small, and when the data set is large it needs a long time to search for optimal parameters through back propagation.

Disclosure of Invention

In view of the technical problems or defects mentioned in the background art, an object of the embodiments of the present invention is to provide a software testing method based on an unbalanced data set, which can overcome the inaccurate and inefficient prediction of existing approaches on unbalanced software defect data sets. Meanwhile, the technical scheme of the application also provides a software testing system capable of realizing the method.

In one aspect, the present application provides a software testing method based on an unbalanced data set, including the following steps:

acquiring sample data of unbalanced software defects, and preprocessing the sample data;

setting a cost sensitivity coefficient for a data category in the sample data according to the unbalance characteristic of the sample data after preprocessing;

training a cost-sensitive kernel width learning system classification model according to sample data comprising the cost sensitivity coefficient;

determining an optimal penalty coefficient through grid search, and determining target parameters of the trained cost sensitive kernel width learning system classification model according to the optimal penalty coefficient;

and constructing a target classification model according to the target parameters, carrying out classification prediction on the test instance through the target classification model, and outputting a prediction result.

In a possible embodiment of the present application, the acquiring sample data of the unbalanced software defect, preprocessing the sample data includes:

filling the missing value through the average value of the sample data, or deleting the missing value;

and determining a label of the sample data after the missing value processing, and adding the one-hot encoded label to the sample data.

In a possible embodiment of the present application, the setting a cost sensitivity coefficient for a data class in the sample data according to an imbalance characteristic of the sample data after preprocessing includes:

determining a first weight matrix according to the sample number of the sample data, and determining a penalty parameter of the data category according to the first weight matrix;

and determining the cost sensitivity coefficient corresponding to the data category according to the penalty parameter.

In a possible embodiment of the present application, the training a cost-sensitive kernel width learning system classification model according to sample data including the cost-sensitive coefficient includes:

determining an inner product matrix between the sample data through a radial basis function, constructing a mapping layer according to the inner product matrix, and determining a first connection weight between the mapping layer and an input layer;

taking the node of the mapping layer as input, mapping through the radial basis function to obtain an enhancement layer, and determining a second connection weight between the enhancement layer and the mapping layer;

and constructing and obtaining the cost sensitive kernel width learning system classification model according to the mapping layer, the enhancement layer, the first connection weight and the second connection weight.

In a possible embodiment of the present application, the calculation formula of the inner product matrix is:

wherein $\Omega_{i,j}$ characterizes the inner product matrix, $x_a$ characterizes the feature vector of the a-th sample data, $x_b$ characterizes the feature vector of the b-th sample data, $\sigma$ represents the parameter of the radial basis function, and $N$ is a positive integer.
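As a non-limiting illustration, the inner product matrix may be computed as in the following Python sketch. The kernel is assumed here to take the common Gaussian form $\Omega_{a,b} = \exp(-\lVert x_a - x_b \rVert^2 / (2\sigma^2))$; the factor 2 in the denominator, the function name `rbf_kernel_matrix` and the sample values are assumptions made for illustration and are not taken from the embodiment.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    """Return the N x N matrix with entries exp(-||x_a - x_b||^2 / (2 * sigma^2)).

    The 2 * sigma^2 normalization is an assumed convention.
    """
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 * a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Hypothetical data: three samples with two features each
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Omega = rbf_kernel_matrix(X, sigma=1.0)
```

The resulting matrix is symmetric with ones on its diagonal, and entries shrink toward zero as two samples move apart.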

In a possible embodiment of the present application, the constructing the cost-sensitive kernel width learning system classification model according to the mapping layer, the enhancement layer, the first connection weight, and the second connection weight includes:

determining a second weight matrix of a hidden layer in the cost-sensitive kernel width learning system classification model according to the first connection weight and the second connection weight; the hidden layer is obtained by splicing the mapping layer and the enhancement layer;

and constructing an objective function in the cost-sensitive kernel width learning system classification model according to the second weight matrix.

In a possible embodiment of the present application, in the step of determining an inner product matrix between the sample data by a radial basis function, constructing a mapping layer according to the inner product matrix, and determining a first connection weight between the mapping layer and an input layer, a calculation formula of the first connection weight is as follows:

wherein $\lambda$ represents a regularization term parameter, $I$ represents an identity matrix, $\Omega_m$ characterizes a distance matrix of the mapping layer, and $X$ represents the input sample data.

In a possible embodiment of the present application, in the step of mapping the node of the mapping layer by the radial basis function to obtain an enhancement layer and determining a second connection weight between the enhancement layer and the mapping layer, a calculation formula of the second connection weight is as follows:

wherein $\lambda$ represents a regularization term parameter, $I$ represents an identity matrix, $\Omega_m$ characterizes a distance matrix of the mapping layer, and $\Omega_e$ characterizes a distance matrix of the enhancement layer.

In a possible embodiment of the present application, the formula of the output layer weight in the cost sensitive kernel width learning system classification model is as follows:

wherein $W$ represents the weight of the output layer, $\lambda$ represents a regularization term parameter, $Y$ represents the label matrix corresponding to the sample data, $C^+$ represents the cost sensitivity coefficient for misclassification of minority class samples, $C^-$ represents the penalty coefficient for misclassification of majority class samples, $K = [\Omega_m, \Omega_e]$, $\Omega_m$ characterizes a distance matrix of the mapping layer, and $\Omega_e$ characterizes a distance matrix of the enhancement layer.
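As a non-limiting illustration, a cost-weighted, regularized least-squares solution consistent with the symbols above may be sketched in Python as follows. The closed form $W = (\lambda I + K^{T} C K)^{-1} K^{T} C Y$, with $C$ a diagonal matrix holding $C^+$ for minority class samples and $C^-$ for majority class samples, is an assumption borrowed from weighted regularized regression and may differ from the formula of the embodiment; the function and argument names are hypothetical.

```python
import numpy as np

def output_weights(K, Y, is_minority, c_plus, c_minus, lam):
    """Solve (lam * I + K^T C K) W = K^T C Y for W (assumed weighted ridge form)."""
    K = np.asarray(K, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Per-sample cost: c_plus for minority class rows, c_minus for majority class rows
    costs = np.where(np.asarray(is_minority, dtype=bool), c_plus, c_minus)
    KtC = K.T * costs  # equals K.T @ diag(costs) without building the N x N matrix
    A = lam * np.eye(K.shape[1]) + KtC @ K
    return np.linalg.solve(A, KtC @ Y)

# Hypothetical toy problem: two samples, the first belongs to the minority class
K = np.eye(2)
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
W = output_weights(K, Y, is_minority=[True, False], c_plus=4.0, c_minus=1.0, lam=1.0)
```

In this toy case the minority sample is fitted more closely (0.8 against a target of 1) than the majority sample (0.5), which is the intended effect of the larger coefficient.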

On the other hand, the technical scheme of the application also provides a software testing system based on the unbalanced data set, which comprises:

the data acquisition unit is used for acquiring sample data of unbalanced software defects and preprocessing the sample data;

the data acquisition unit is further used for setting a cost sensitivity coefficient for the data category in the sample data according to the unbalance characteristic of the sample data after preprocessing;

the model training unit is used for training a cost-sensitive kernel width learning system classification model according to sample data comprising the cost sensitivity coefficient;

the parameter optimization unit is used for determining an optimal penalty coefficient through grid search, and determining target parameters of the trained cost sensitive kernel width learning system classification model according to the optimal penalty coefficient;

and the prediction output unit is used for constructing a target classification model according to the target parameters, classifying and predicting the test instance through the target classification model and outputting a prediction result.

Advantages and benefits of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:

according to the technical scheme, a cost-sensitive kernel width learning system (CCR-KBLS) is provided; the problem of unbalanced data classification is solved by combining a kernel learning technology with the cost-sensitive width learning system, and the influence of the extremely unbalanced characteristics of a software defect data set on classification performance can be effectively eliminated; through cost-sensitive width learning, the objective function of the model is modified so that the model focuses more on the minority class, which improves the classification accuracy, avoids complicated training in feature engineering work, and improves the training efficiency of the model.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of steps of a software testing method based on an unbalanced data set provided in the technical scheme of the present application.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.

It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.

In the related art, width learning (the broad learning system, BLS) has been applied to various scenarios. Width learning is a novel machine learning technique with a simple structure and competitive performance. Its structure differs greatly from that of deep learning: BLS builds the network in the "wide" rather than the "deep" direction, forming a single-layer, laterally scalable network whose design is inspired by random vector functional-link neural networks (RVFLNNs). Width learning aims at faster and more accurate data classification and has also been applied to software defect prediction. Related art has disclosed that, compared with deep neural networks (DNNs), it achieves good classification performance and faster training on the MNIST data set. Although width learning classifies efficiently, its performance degrades when the data set is unbalanced. During classification, the imbalance of the data causes the objective function to focus on the majority class, whereas software defect prediction is mainly concerned with the defective modules, that is, the minority class; ordinary width learning therefore performs worse when classifying unbalanced data.

In addition, software defect number prediction mainly predicts the number of defects in a software module from previous software data. In the related art, the number of defects in files has been predicted with a regression classifier, and the influence of developers on the defect prediction model has been analyzed. In recent years, researchers have also studied defect severity prediction extensively, for example by combining multiple logistic regression with gradient descent, and a ranking-based method for predicting the severity of software defects has been proposed.

However, although the above methods for unbalanced learning can alleviate the influence of unbalanced data distribution on classification performance to some extent, they do not take into account that software defect data sets contain many noisy features. With the lateral expansion of width learning, these features may further amplify outliers and noise and reduce prediction performance. A technique for handling noise is therefore needed to obtain stable features and a clearer decision boundary.

In view of the above-noted shortcomings of the prior art, in a first aspect, as shown in fig. 1, the technical solution of the present application provides a software testing method based on an unbalanced data set, the method comprising steps S01-S05:

S01, acquiring sample data of unbalanced software defects, and preprocessing the sample data.

In particular, in an embodiment, for the acquired sample data of the unbalanced software defect, the sample data may be divided into a software defect training sample and a test sample, and the data in the sample is preprocessed. In some possible embodiments, the process of preprocessing the sample data may include steps S011-S012:

S011, filling a missing value through the average value of the sample data, or deleting the missing value;

S012, determining a label of the sample data after the missing value processing, and adding the one-hot encoded label to the sample data.

In particular, in an embodiment, after the unbalanced software defect training samples and test samples are obtained, the data in the samples need to be preprocessed. In the embodiment, the software defect data are split, each missing value is replaced by the average of the preceding and following samples, samples with many missing features are deleted directly, the data are finally divided into samples and labels, and the labels of the data set are converted into one-hot encoding.
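As a non-limiting illustration, the preprocessing of steps S011-S012 may be sketched in Python as follows. The sketch fills each missing value with the mean of its feature column, which is one possible reading of the averaging described above; the threshold `max_missing_ratio`, the function name and the sample data are hypothetical.

```python
import numpy as np

def preprocess(features, labels, max_missing_ratio=0.5):
    """Drop heavily incomplete samples, mean-fill the rest, one-hot encode labels."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    # Delete samples in which too many features are missing (threshold is assumed)
    keep = np.isnan(X).mean(axis=1) <= max_missing_ratio
    X, y = X[keep], y[keep]
    # Fill each remaining missing value with the mean of its feature column
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    # Convert labels to one-hot encoding, one column per class
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)
    return X, Y, classes

# Hypothetical data: NaN marks a missing value; the last sample has no features at all
features = [[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan], [np.nan, np.nan]]
labels = [0, 1, 0, 1]
X_clean, Y_onehot, classes = preprocess(features, labels)
```

Here the fourth sample is deleted because all of its features are missing, and the two remaining gaps are filled with the column means 2.0 and 3.0.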

S02, setting a cost sensitivity coefficient for the data category in the sample data according to the unbalance characteristic of the sample data after preprocessing.

In particular, in the embodiment, different cost sensitivity coefficients, $C^+$ and $C^-$ respectively, are set for different classes according to the unbalanced characteristic of the samples, so that different penalty parameters can be set for different classes of data. In some possible embodiments, the process of setting different penalty parameters for different types of data may include steps S021-S022:

S021, determining a first weight matrix according to the sample number of the sample data, and determining penalty parameters of the data category according to the first weight matrix;

S022, determining the cost sensitivity coefficient corresponding to the data category according to the penalty parameter.

In an embodiment, a first weight matrix is set according to the number of samples of the original data set, and different penalty parameters are set for different types of data. First, a cost-sensitive parameter is added on the basis of the basic width learning system: minority class data are assigned $C^+$ and majority class data are assigned $C^-$. Majority class data are assigned the smaller penalty parameter and minority class data the larger one, and the penalty parameters are used as the cost sensitivity coefficients corresponding to the data classes. Further, the optimal regularization parameters $C^+$ and $C^-$ are found by grid search in the range $C_{search} = \{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$.
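As a non-limiting illustration, the first weight matrix may be sketched in Python as a diagonal matrix holding one cost per sample. The sketch follows the definition in which $C^+$ is the coefficient of the minority class and $C^-$ that of the majority class, assumes a two-class problem, and identifies the minority class from the class counts; the diagonal layout and the function name `cost_weight_matrix` are assumptions.

```python
import numpy as np

def cost_weight_matrix(y, c_plus, c_minus):
    """Return an N x N diagonal cost matrix and the label of the minority class."""
    y = np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]  # class with the fewest samples
    # Larger penalty c_plus for minority samples, smaller c_minus for the rest
    costs = np.where(y == minority, float(c_plus), float(c_minus))
    return np.diag(costs), minority

# Hypothetical labels: three non-defective modules (0) and one defective module (1)
C, minority = cost_weight_matrix([0, 0, 0, 1], c_plus=8.0, c_minus=1.0)
```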

S03, training a cost-sensitive kernel width learning system classification model according to sample data comprising the cost sensitivity coefficient.

In an embodiment, a cost-sensitive kernel width learning system classification model (CCR-KBLS) for unbalanced data is trained. The model is a two-layer neural network: the first layer includes feature nodes and enhancement nodes, and the second layer is the output layer; the feature nodes and enhancement nodes of the first layer are spliced together and connected to a fully connected layer. Further, in an embodiment, the training samples divided in step S01 may be used to train a weighted kernel width learning system, wherein $X \in \mathbb{R}^{N \times l}$ represents the input training samples and $Y \in \mathbb{R}^{N \times c}$ represents the labels of the corresponding samples; $N$ represents the total number of samples, $l$ represents the feature dimension of the samples, and $c$ represents the dimension of the output.

In some possible implementations, the training process for CCR-KBLS in the examples may include steps S031-S033:

S031, determining an inner product matrix between the sample data through a radial basis function, constructing a mapping layer according to the inner product matrix, and determining a first connection weight between the mapping layer and an input layer;

In an embodiment, first, an inner product matrix between training samples is calculated through a radial basis function, and the original input data are converted into a mapping node layer, wherein the radial basis function is as follows:

wherein $x_a$ and $x_b$ respectively represent the feature vectors of the a-th and b-th samples, and $\sigma$ represents the parameter of the radial basis function; an $N \times N$ distance matrix $\Omega_m$ is calculated by the radial basis function provided in the embodiment and serves as the mapping layer. Further, the connection weight between the input layer and the mapping layer, namely the first connection weight, is obtained by pseudo-inversion:

S032, mapping the nodes of the mapping layer through the radial basis function to obtain an enhancement layer, and determining a second connection weight between the enhancement layer and the mapping layer;

In an embodiment, after the mapping layer of the model is determined, all nodes of the mapping layer are used as inputs, and radial basis function mapping is performed once more to obtain the enhancement layer $\Omega_e$. Similarly to step S031, the connection weight from the mapping layer to the enhancement layer, i.e. the second connection weight, is calculated:

S033, constructing and obtaining the cost sensitive kernel width learning system classification model according to the mapping layer, the enhancement layer, the first connection weight and the second connection weight. In some possible embodiments, step S033 in the method may further comprise steps S0331-S0332:

S0331, determining a second weight matrix of a hidden layer in the cost-sensitive kernel width learning system classification model according to the first connection weight and the second connection weight, wherein the hidden layer is obtained by splicing the mapping layer and the enhancement layer.

In an embodiment, the feature layer and the enhancement layer are spliced to form the hidden layer of the width learning system model, giving $K = [\Omega_m, \Omega_e]$.
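As a non-limiting illustration, the construction of the hidden layer from the mapping layer and the enhancement layer may be sketched in Python as follows. The Gaussian form of the radial basis function with a $2\sigma^2$ denominator and the use of a single $\sigma$ for both layers are assumptions; the first and second connection weights are omitted from this sketch, and the function names are hypothetical.

```python
import numpy as np

def rbf(A, B, sigma):
    """Gaussian RBF between the rows of A and the rows of B (assumed form)."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def hidden_layer(X, sigma):
    """Build K = [Omega_m, Omega_e] by two successive RBF mappings."""
    X = np.asarray(X, dtype=float)
    omega_m = rbf(X, X, sigma)              # mapping layer, N x N
    omega_e = rbf(omega_m, omega_m, sigma)  # enhancement layer: RBF applied to mapping nodes
    return np.hstack([omega_m, omega_e])    # spliced hidden layer, N x 2N

# Hypothetical data: three samples with two features each
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = hidden_layer(X, sigma=1.0)
```

Because both layers are computed from the training samples themselves, the number of hidden nodes follows from $N$ and does not have to be set by hand.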

S0332, constructing an objective function in the cost-sensitive kernel width learning system classification model according to the second weight matrix.

Specifically, in an embodiment, the second weight matrix calculated in step S0331 is added to an objective function, where the objective function is:

$\text{s.t.}\quad K(x_i)\,w_i = y_i - \zeta_i,\quad i = 1, \ldots, N$

wherein $C^+$ represents the penalty coefficient for misclassification of minority class samples, $C^-$ represents the penalty coefficient for misclassification of majority class samples, $\lambda$ represents the regularization parameter, and $K$ represents the hidden nodes formed by splicing the feature nodes and the enhancement nodes. Further, in the embodiment, the output layer weight is obtained by a pseudo-inverse method:

wherein $Y$ is the label matrix. Once the output weight of CCR-KBLS is obtained, no further adjustment is required to obtain the final result. Finally, the embodiment sets the predictive label to

The final model output results can be further described as:

S04, determining an optimal penalty coefficient through grid search, and determining target parameters of the trained cost sensitive kernel width learning system classification model according to the optimal penalty coefficient.

In particular, in an embodiment, a grid search strategy may be used to find the optimal penalty coefficients, where the search range is $C_{search} = \{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$, and the optimal parameters are taken as the final parameters of the final model.
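As a non-limiting illustration, the grid search over the penalty coefficients may be sketched in Python as follows. The callable `score_fn` is hypothetical: in practice it would train the model with the given pair of coefficients and return a validation score, whereas the toy scorer shown here is only a stand-in so that the sketch runs on its own.

```python
import math
from itertools import product

def grid_search(score_fn, exponents=range(-24, 26)):
    """Try every (C+, C-) pair in {2^-24, ..., 2^25} and keep the best score."""
    grid = [2.0 ** e for e in exponents]
    best = None
    for c_plus, c_minus in product(grid, repeat=2):
        score = score_fn(c_plus, c_minus)
        if best is None or score > best[0]:
            best = (score, c_plus, c_minus)
    return best  # (best score, best C+, best C-)

def toy_score(c_plus, c_minus):
    """Stand-in scorer that peaks at C+ = 2^3 and C- = 2^-1 (illustration only)."""
    return -abs(math.log2(c_plus) - 3) - abs(math.log2(c_minus) + 1)

best_score, best_c_plus, best_c_minus = grid_search(toy_score)
```

The full grid contains 50 values per coefficient, so 2500 candidate pairs are evaluated.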

S05, constructing a target classification model according to the target parameters, carrying out classification prediction on the test instance through the target classification model, and outputting a prediction result.

In particular, in an embodiment, the test instances are classified to obtain the final prediction result. In the embodiment, the weighted kernel width learning system is used to classify the test samples: the optimal parameters obtained by grid search are put into the model, the model is applied to the software defect test set, and after the labels are obtained the classification results are evaluated with the G-mean index. As shown in Table 1, the results of the CCR-KBLS model constructed in this technical scheme are better than those of other models such as SMOTE-BLS and RUS-BLS on most of the data sets.

TABLE 1

Data set	SMOTE-BLS	RUS-BLS	WELM	DNC	CCR-KBLS
MC2	0.735	0.746	0.697	0.699	0.834
PC3	0.745	0.647	0.795	0.816	0.745
MW1	0.478	0.592	0.667	0.673	0.845
PC1	0.567	0.744	0.787	0.698	0.792
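As a non-limiting illustration, the G-mean index used for the results in Table 1 may be computed for a two-class problem as the geometric mean of the recall on the defective class and the recall on the non-defective class. The following Python sketch uses hypothetical labels, with 1 standing for a defective module.

```python
import math

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on defective modules
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on non-defective modules
    return math.sqrt(sensitivity * specificity)

# Hypothetical labels: two defective modules among six
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
score = g_mean(y_true, y_pred)
```

A classifier that predicts every module as non-defective has a sensitivity of zero and therefore a G-mean of zero, which is why this index suits unbalanced data better than plain accuracy.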

From the above specific implementation process, it can be summarized that, compared with the prior art, the technical solution provided by the present invention has the following advantages or benefits:

(1) According to the technical scheme, the cost-sensitive kernel width learning system (CCR-KBLS) is provided, the problem of unbalanced data classification is solved by combining a kernel learning technology with the cost-sensitive kernel width learning system, and the influence of extremely unbalanced characteristics of a software defect data set on classification performance can be effectively eliminated;

(2) The technical scheme designs a specific kernel mapping scheme in width learning. A dual-kernel mapping strategy adaptively adjusts the number of feature nodes, reduces the uncertainty of random mapping, and avoids manual setting. At the same time, the influence of data noise and outliers on the classification accuracy of the model can be eliminated.

Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features may be integrated in a single physical device and/or software module or may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.

Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims

1. A software testing method based on an unbalanced data set, comprising the steps of:

2. The method for testing software based on an unbalanced data set of claim 1, wherein the step of obtaining sample data of an unbalanced software defect, and preprocessing the sample data comprises:

3. A software testing method based on unbalanced data sets according to claim 1, wherein said setting a cost sensitivity coefficient for a data class in said sample data based on an unbalance characteristic of the sample data after preprocessing comprises:

4. The method of claim 1, wherein training a cost sensitive kernel width learning system classification model based on sample data including the cost sensitive coefficients comprises:

5. The method for testing software based on an unbalanced data set of claim 4 wherein the inner product matrix is calculated by the formula:

wherein $\Omega_{i,j}$ characterizes the inner product matrix, $x_a$ characterizes the feature vector of the a-th sample data, $x_b$ characterizes the feature vector of the b-th sample data, $\sigma$ represents the parameter of the radial basis function, and $N$ is a positive integer.

6. The method for testing software based on an unbalanced data set of claim 4, wherein the constructing the cost sensitive kernel width learning system classification model according to the mapping layer, the enhancement layer, the first connection weight and the second connection weight comprises:

7. The method according to claim 4, wherein in the step of determining an inner product matrix between the sample data by a radial basis function, constructing a mapping layer based on the inner product matrix, and determining a first connection weight between the mapping layer and an input layer, a calculation formula of the first connection weight is as follows:

8. The method according to claim 4, wherein in the step of mapping the node of the mapping layer by the radial basis function to obtain the enhancement layer and determining the second connection weight between the enhancement layer and the mapping layer, the calculation formula of the second connection weight is as follows:

9. The method for testing software based on an unbalanced data set of claim 4 wherein the formula for the output layer weights in the cost sensitive kernel width learning system classification model is as follows:

wherein $W$ represents the weight of the output layer, $\lambda$ represents a regularization term parameter, $Y$ represents the label matrix corresponding to the sample data, $C^+$ represents the cost sensitivity coefficient for misclassification of minority class samples, $C^-$ represents the penalty coefficient for misclassification of majority class samples, $K = [\Omega_m, \Omega_e]$, $\Omega_m$ characterizes a distance matrix of the mapping layer, and $\Omega_e$ characterizes a distance matrix of the enhancement layer.

10. A software testing system based on an unbalanced data set, comprising:

the data acquisition unit is used for acquiring sample data of unbalanced software defects and preprocessing the sample data; setting a cost sensitivity coefficient for the data category in the sample data according to the unbalance characteristic of the sample data after preprocessing;