CN116401143A - Software testing method and system based on unbalanced data set - Google Patents

Publication number
CN116401143A
Authority
CN
China
Prior art keywords
sample data
data
layer
determining
classification model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211630770.5A
Other languages
Chinese (zh)
Inventor
程良伦
陈武兴
张伟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Nengge Knowledge Technology Co ltd
Guangdong University of Technology
Original Assignee
Guangdong Nengge Knowledge Technology Co ltd
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Nengge Knowledge Technology Co ltd, Guangdong University of Technology
Priority to CN202211630770.5A
Publication of CN116401143A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G06F11/3684 Test management for test design, e.g. generating new test cases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a software testing method and system based on an unbalanced data set. The method comprises the following steps: acquiring sample data of unbalanced software defects and preprocessing the sample data; setting a cost sensitivity coefficient for each data category in the sample data according to the imbalance characteristic of the preprocessed sample data; training a cost-sensitive kernel width learning system classification model on the sample data comprising the cost sensitivity coefficients; determining an optimal penalty coefficient through grid search, and determining the target parameters of the model according to the optimal penalty coefficient; and constructing a target classification model according to the target parameters, performing classification prediction on test instances with the target classification model, and outputting the prediction result. The scheme can effectively eliminate the influence of the extremely unbalanced nature of software defect data sets on classification performance, improves classification accuracy, avoids the cumbersome training involved in feature engineering, improves the training efficiency of the model, and can be widely applied in the technical field of computers.

Description

Software testing method and system based on unbalanced data set
Technical Field
The invention relates to the technical field of computers, in particular to a software testing method and system based on an unbalanced data set.
Background
With the continuous expansion of the software industry, software quality has become a major concern, and software defects are among the most important threats to software quality. Defects often result from errors that software engineers make while coding, and repairing them often requires significant human resources and time. Finding faulty modules early in software development has therefore become a pressing problem. Software defect prediction (SDP) is an important tool for improving software testing efficiency. SDP is an imbalanced problem because defect data are unevenly distributed: defective modules are far fewer than non-defective ones. Traditional machine learning methods assume that the data follow a balanced distribution and therefore tend to overlook the minority of defective modules.
Although many machine learning and deep learning algorithms achieve high classification accuracy, most of them degrade rapidly on imbalanced data: they pay excessive attention to the majority class and neglect the minority class. In software defect prediction this is fatal. Normal modules usually outnumber abnormal ones, yet testing cares more about the abnormal modules; a classifier that labels every normal module correctly but fails to identify abnormal ones can cause huge economic losses for the software development industry. Deep learning, for its part, can extract deeper features, but it fits poorly when the data volume is small, and when the data set is large, searching for the optimal parameters through back-propagation takes a long time.
Disclosure of Invention
In view of the above, and based on the technical problems or defects mentioned in the background, an object of the embodiments of the present invention is to provide a software testing method based on an unbalanced data set that addresses the inaccurate and inefficient prediction of existing approaches on unbalanced software defect data sets. The technical scheme of the application also provides a software testing system capable of implementing the method.
In one aspect, the present application provides a software testing method based on an unbalanced data set, including the following steps:
acquiring sample data of unbalanced software defects, and preprocessing the sample data;
setting a cost sensitivity coefficient for a data category in the sample data according to the unbalance characteristic of the sample data after preprocessing;
training a cost-sensitive kernel width learning system classification model according to sample data comprising the cost sensitivity coefficient;
determining an optimal penalty coefficient through grid search, and determining target parameters of the trained cost sensitive kernel width learning system classification model according to the optimal penalty coefficient;
and constructing a target classification model according to the target parameters, carrying out classification prediction on the test instance through the target classification model, and outputting a prediction result.
In a possible embodiment of the present application, the acquiring sample data of the unbalanced software defect, preprocessing the sample data includes:
filling the missing value through the average value of the sample data, or deleting the missing value;
and determining a label of the sample data after the missing value processing, and adding the one-hot-encoded label to the sample data.
In a possible embodiment of the present application, the setting a cost sensitivity coefficient for a data class in the sample data according to an imbalance characteristic of the sample data after preprocessing includes:
determining a first weight matrix according to the sample number of the sample data, and determining a penalty parameter of the data category according to the first weight matrix;
and determining the cost sensitivity coefficient corresponding to the data category according to the penalty parameter.
In a possible embodiment of the present application, the training a cost-sensitive kernel width learning system classification model according to sample data including the cost-sensitive coefficient includes:
determining an inner product matrix between the sample data through a radial basis function, constructing a mapping layer according to the inner product matrix, and determining a first connection weight between the mapping layer and an input layer;
taking the node of the mapping layer as input, mapping through the radial basis function to obtain an enhancement layer, and determining a second connection weight between the enhancement layer and the mapping layer;
and constructing and obtaining the cost sensitive kernel width learning system classification model according to the mapping layer, the enhancement layer, the first connection weight and the second connection weight.
In a possible embodiment of the present application, the calculation formula of the inner product matrix is:
Figure SMS_1
where $\Omega_{i,j}$ denotes the inner product matrix, $x_a$ denotes the feature vector of the a-th sample data, $x_b$ denotes the feature vector of the b-th sample data, $\sigma$ denotes the parameter of the radial basis function, and $N$ is a positive integer.
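The formula itself survives only as an image placeholder in this text. For orientation, a common form of the Gaussian radial basis function that matches the symbols defined above is sketched here, with the entries indexed by the sample indices a and b. The $2\sigma^2$ scaling in the denominator is an assumption; the patent figure may use a different scaling of $\sigma$.

```latex
\Omega_{a,b} = \exp\!\left(-\frac{\lVert x_a - x_b \rVert^{2}}{2\sigma^{2}}\right),
\qquad a, b = 1, \dots, N
```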
In a possible embodiment of the present application, the constructing the cost-sensitive kernel width learning system classification model according to the mapping layer, the enhancement layer, the first connection weight, and the second connection weight includes:
determining a second weight matrix of a hidden layer in the cost-sensitive kernel width learning system classification model according to the first connection weight and the second connection weight; the hidden layer is obtained by splicing the mapping layer and the enhancement layer;
and constructing an objective function in the cost-sensitive kernel width learning system classification model according to the second weight matrix.
In a possible embodiment of the present application, in the step of determining an inner product matrix between the sample data by a radial basis function, constructing a mapping layer according to the inner product matrix, and determining a first connection weight between the mapping layer and an input layer, a calculation formula of the first connection weight is as follows:
Figure SMS_2
where $\lambda$ denotes the regularization term parameter, $I$ denotes the identity matrix, $\Omega_m$ denotes the distance matrix of the mapping layer, and $X$ denotes the input sample data.
In a possible embodiment of the present application, in the step of mapping the node of the mapping layer by the radial basis function to obtain an enhancement layer and determining a second connection weight between the enhancement layer and the mapping layer, a calculation formula of the second connection weight is as follows:
Figure SMS_3
where $\lambda$ denotes the regularization term parameter, $I$ denotes the identity matrix, $\Omega_m$ denotes the distance matrix of the mapping layer, and $\Omega_e$ denotes the distance matrix of the enhancement layer.
In a possible embodiment of the present application, the formula of the output layer weight in the cost sensitive kernel width learning system classification model is as follows:
Figure SMS_4
where $W$ denotes the output layer weight, $\lambda$ denotes the regularization term parameter, $Y$ denotes the label matrix corresponding to the sample data, $C^+$ denotes the cost sensitivity coefficient for misclassifying minority-class samples, $C^-$ denotes the penalty coefficient for misclassifying majority-class samples, $K = [\Omega_m \mid \Omega_e]$, $\Omega_m$ denotes the distance matrix of the mapping layer, and $\Omega_e$ denotes the distance matrix of the enhancement layer.
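The patent figure for the output layer weight is not reproduced in this text. As a rough guide only, suppose the objective is a cost-weighted ridge regression (an assumption, not the patent's stated formula), and let $D = \mathrm{diag}(d_1, \dots, d_N)$ be a hypothetical diagonal matrix whose entry $d_i$ equals $C^+$ for a minority-class sample and $C^-$ for a majority-class sample. Setting the gradient with respect to $W$ to zero then gives a closed form of the following kind:

```latex
\min_{W}\; \frac{\lambda}{2}\,\lVert W \rVert_F^{2}
  + \frac{1}{2}\sum_{i=1}^{N} d_i\,\lVert y_i - K(x_i)\,W \rVert^{2}
\quad\Longrightarrow\quad
W = \left(\lambda I + K^{\top} D K\right)^{-1} K^{\top} D\, Y
```

Under this reading, a larger $C^+$ makes errors on minority-class rows of $K$ cost more, which pulls the solution toward fitting the defective modules.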
On the other hand, the technical scheme of the application also provides a software testing system based on the unbalanced data set, which comprises:
the data acquisition unit is used for acquiring sample data of unbalanced software defects and preprocessing the sample data;
setting a cost sensitivity coefficient for the data category in the sample data according to the unbalance characteristic of the sample data after preprocessing;
the model training unit is used for training a cost-sensitive kernel width learning system classification model according to sample data comprising the cost sensitivity coefficient;
the parameter optimization unit is used for determining an optimal penalty coefficient through grid search, and determining target parameters of the trained cost sensitive kernel width learning system classification model according to the optimal penalty coefficient;
and the prediction output unit is used for constructing a target classification model according to the target parameters, classifying and predicting the test instance through the target classification model and outputting a prediction result.
Advantages and benefits of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:
according to the technical scheme, the cost-sensitive kernel width learning system (CCR-KBLS) is provided, the problem of unbalanced data classification is solved by combining a kernel learning technology with the cost-sensitive kernel width learning system, and the influence of extremely unbalanced characteristics of a software defect data set on classification performance can be effectively eliminated; according to the scheme, through cost-sensitive width learning, the objective function of the model is modified, so that the model is more focused on few classes, the classification accuracy is improved, the problem of complex training in the feature engineering work is solved, and the training efficiency of the model is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of steps of a software testing method based on an unbalanced data set provided in the technical scheme of the present application.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
In the related art, width learning, more commonly known as the broad learning system (BLS), has been applied in various scenarios. Width learning is a relatively new machine learning technique with a simple structure and competitive performance. Its structure differs markedly from that of deep learning: BLS builds the network in the "wide" rather than the "deep" direction, as a single-layer, laterally scalable network whose design is inspired by the random vector functional-link neural network (RVFLNN). Width learning aims at faster and more accurate data classification and has also been applied to software defect prediction. The related art reports that, compared with deep neural networks (DNNs), it achieves good classification performance and faster training on the MNIST data set. Although width learning classifies efficiently, its performance degrades when the data set is unbalanced. During classification, the imbalance of the data makes the objective function focus on the majority class, whereas software defect prediction cares more about the defective modules, that is, the minority class. Ordinary width learning therefore loses classification performance on unbalanced data.
In addition, software defect number prediction mainly predicts the number of defects in a software module from earlier software data. The related art has analyzed the influence of developers on defect prediction models by predicting the number of file defects with a regression classifier. In recent years researchers have also studied defect severity prediction extensively, for example by combining multiple logistic regression with gradient descent to predict defect severity, and a ranking-based method for predicting the severity of software defects has also been proposed.
However, although the above methods for unbalanced learning alleviate the influence of unbalanced data distribution on classification performance to some extent, they do not take into account that software defect data sets themselves contain many noisy features. With the lateral expansion of width learning, these features may further amplify outliers and noise and thereby reduce prediction performance. A noise-handling technique is therefore needed to obtain stable features and a clearer decision boundary.
In view of the above-noted shortcomings of the prior art, in a first aspect, as shown in fig. 1, the technical solution of the present application provides a software testing method based on an unbalanced data set, the method comprising steps S01-S05:
S01, acquiring sample data of unbalanced software defects, and preprocessing the sample data.
Specifically, in an embodiment, the acquired sample data of unbalanced software defects may be divided into software defect training samples and test samples, and the data in the samples are preprocessed. In some possible embodiments, the process of preprocessing the sample data may include steps S011-S012:
S011, filling a missing value through the average value of the sample data, or deleting the missing value;
S012, determining a label of the sample data after the missing value processing, and adding the one-hot-encoded label to the sample data.
Specifically, in an embodiment, after the unbalanced software defect training and test samples are obtained, the data in the samples are preprocessed. The software defect data are split, each missing value is replaced by the average of the preceding and following samples, samples with many missing features are deleted outright, and the data are finally separated into samples and labels, with the labels of the data set converted to one-hot encoding.
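The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `preprocess` and the `max_missing_ratio` threshold are hypothetical, missing values are encoded as NaN, and the column mean is used for filling (step S011) rather than the average of neighbouring samples.

```python
import numpy as np

def preprocess(X, y, max_missing_ratio=0.5):
    """Drop rows with many missing features, mean-fill the rest, one-hot encode labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Delete samples whose fraction of missing features exceeds the (assumed) threshold.
    keep = np.isnan(X).mean(axis=1) <= max_missing_ratio
    X, y = X[keep], y[keep]
    # Fill the remaining missing values with the mean of their feature column.
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]
    # Convert the labels to one-hot encoding, one column per distinct class.
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)
    return X, Y, classes
```

The function returns the cleaned feature matrix, the one-hot label matrix, and the class values in column order, so a predicted column index can be mapped back to a label.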
S02, setting a cost sensitivity coefficient for the data category in the sample data according to the unbalance characteristic of the sample data after preprocessing.
Specifically, in the embodiment, different cost sensitivity coefficients, $C^+$ and $C^-$, are set for the different classes according to the imbalance characteristic of the samples, so that different penalty parameters apply to different classes of data. In some possible embodiments, the process of setting different penalty parameters for different classes of data may include steps S021-S022:
S021, determining a first weight matrix according to the sample number of the sample data, and determining penalty parameters of the data category according to the first weight matrix;
S022, determining the cost sensitivity coefficient corresponding to the data category according to the penalty parameter.
In an embodiment, a first weight matrix is set according to the number of samples in the original data set, and different penalty parameters are set for different classes of data. Cost-sensitive parameters are first added to the basic width learning system: minority-class data are assigned $C^+$ and majority-class data are assigned $C^-$. Majority-class data receive the smaller penalty parameter and minority-class data the larger one, and these penalty parameters serve as the cost sensitivity coefficients of the corresponding data classes. Further, $C^+$ and $C^-$ are selected by grid search over the range $C_{search} = \{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$ to find the optimal regularization parameters.
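One way to turn the two coefficients into per-sample costs is sketched below. The helper name `sample_costs` is hypothetical, and the sketch assumes a binary problem in which the rarer label is the defective (minority) class; how the patent derives the coefficients from its first weight matrix is not spelled out in this text.

```python
import numpy as np

def sample_costs(y, c_plus, c_minus):
    """Return one cost per sample: c_plus for the minority class, c_minus otherwise."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]  # assumed: the rarest label is the minority class
    return np.where(y == minority, c_plus, c_minus).astype(float)
```

The resulting vector can serve as the diagonal of a weight matrix, so that an error on a minority-class sample costs more than an error on a majority-class sample.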
S03, training a cost-sensitive kernel width learning system classification model according to sample data comprising the cost sensitivity coefficient.
In an embodiment, a cost-sensitive kernel width learning system classification model (CCR-KBLS) for unbalanced data is trained. The model is a two-layer neural network: the first layer comprises feature nodes and enhancement nodes, and the second layer is the output layer. The feature nodes and enhancement nodes of the first layer are concatenated and connected to a fully connected layer. Further, in an embodiment, the training samples divided in step S01 may be
Figure SMS_5
Figure SMS_6
which are used to train the weighted kernel width learning system, where X denotes the input training samples and Y denotes the corresponding labels; N denotes the total number of samples, l denotes the feature dimension of the samples, and c denotes the dimension of the output.
In some possible implementations, the training process for CCR-KBLS in the examples may include steps S031-S033:
S031, determining an inner product matrix between the sample data through a radial basis function, constructing a mapping layer according to the inner product matrix, and determining a first connection weight between the mapping layer and an input layer;
In an embodiment, an inner product matrix between the training samples is first calculated through a radial basis function, which converts the original input data into a mapping node layer. The radial basis function is:
Figure SMS_7
where $x_a$ and $x_b$ denote the feature vectors of the a-th and b-th samples, respectively, and $\sigma$ denotes the parameter of the radial basis function. With this radial basis function, an N × N distance matrix $\Omega_m$ is calculated and used as the mapping layer. Further, the connection weight between the input layer and the mapping layer, namely the first connection weight, is obtained by pseudo-inversion:
Figure SMS_8
where $\lambda$ denotes the regularization term parameter, $I$ denotes the identity matrix, $\Omega_m$ denotes the distance matrix of the mapping layer, and $X$ denotes the input sample data.
S032, mapping the nodes of the mapping layer through the radial basis function to obtain an enhancement layer, and determining a second connection weight between the enhancement layer and the mapping layer;
In an embodiment, after the mapping layer of the model is determined, all nodes of the mapping layer are taken as input and mapped once more through the radial basis function to obtain the enhancement layer $\Omega_e$. As in step S031, the connection weight from the mapping layer to the enhancement layer, namely the second connection weight, is calculated:
Figure SMS_9
S033, constructing and obtaining the cost sensitive kernel width learning system classification model according to the mapping layer, the enhancement layer, the first connection weight and the second connection weight. In some possible embodiments, step S033 in the method may further comprise steps S0331-S0332:
S0331, determining a second weight matrix of a hidden layer in the cost-sensitive kernel width learning system classification model according to the first connection weight and the second connection weight, wherein the hidden layer is obtained by splicing the mapping layer and the enhancement layer.
In an embodiment, the feature layer and the enhancement layer are concatenated to form the hidden layer of the width learning system model, giving $K = [\Omega_m \mid \Omega_e]$.
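The construction of the hidden layer in steps S031 to S0331 can be sketched as below. This is an illustration under assumptions: the Gaussian kernel with a $2\sigma^2$ denominator stands in for the radial basis function whose figure is missing, the names `rbf_kernel` and `hidden_layer` are hypothetical, and the first and second connection weights are omitted because their formulas are not recoverable from this text.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    # Squared Euclidean distances via |a|^2 + |b|^2 - 2 a.b, clipped at zero.
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def hidden_layer(X, sigma_m=1.0, sigma_e=1.0):
    """Build K = [Omega_m | Omega_e] from the training samples X (N x l)."""
    omega_m = rbf_kernel(X, X, sigma_m)              # mapping layer, N x N
    omega_e = rbf_kernel(omega_m, omega_m, sigma_e)  # enhancement layer, N x N
    return np.hstack([omega_m, omega_e])             # hidden layer, N x 2N
```

Because both blocks are N × N kernel matrices, the number of hidden nodes follows from the number of training samples instead of being chosen by hand.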
S0332, constructing an objective function in the cost-sensitive kernel width learning system classification model according to the second weight matrix.
Specifically, in an embodiment, the second weight matrix calculated in step S0331 is added to an objective function, where the objective function is:
Figure SMS_10
s.t. $K(x_i)\,w_i = y_i - \xi_i,\quad i = 1, \ldots, N$
where $C^+$ denotes the penalty coefficient for misclassifying minority-class samples, $C^-$ denotes the penalty coefficient for misclassifying majority-class samples, $\lambda$ denotes the regularization parameter, and $K$ denotes the hidden nodes formed by concatenating the feature nodes and the enhancement nodes. Further, in the embodiment, the output layer weight is obtained by the pseudo-inverse method:
Figure SMS_11
where Y is the label matrix. Once the output weights of CCR-KBLS are obtained, no further adjustment is needed to produce the final result. Finally, the embodiment sets the predicted label to
Figure SMS_12
The final model output results can be further described as:
Figure SMS_13
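Since the figures for the output weight and the prediction are missing, the following sketch shows one plausible closed-form solution, assuming a cost-weighted ridge regression with a diagonal cost matrix D (an assumption, not the patent's stated formula). The names `output_weights` and `predict` are hypothetical, and the predicted label is taken as the index of the largest output, which is the usual choice with one-hot labels.

```python
import numpy as np

def output_weights(K, Y, costs, lam=1e-3):
    """Solve (lam*I + K^T D K) W = K^T D Y, where D = diag(costs)."""
    KtD = K.T * costs  # scales column i of K^T by costs[i], i.e. K^T @ diag(costs)
    A = lam * np.eye(K.shape[1]) + KtD @ K
    return np.linalg.solve(A, KtD @ Y)

def predict(K, W):
    """Return, for each row of K, the index of the largest output."""
    return np.argmax(K @ W, axis=1)
```

The weights come from a single linear solve, with no iterative back-propagation, which is where the training-speed advantage claimed for this family of models comes from.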
S04, determining an optimal penalty coefficient through grid search, and determining target parameters of the trained cost sensitive kernel width learning system classification model according to the optimal penalty coefficient.
Specifically, in an embodiment, a grid search strategy may be used to find the optimal penalty coefficients over the search range $C_{search} = \{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$, and the optimal parameters are taken as the final parameters of the final model.
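The search can be sketched as an exhaustive loop over candidate pairs. Searching $C^+$ and $C^-$ jointly over the same grid is one reading of the text, not something it states outright; `grid_search` and `score_fn` are hypothetical names, and `score_fn` is assumed to train a model with the given pair and return a validation score where higher is better.

```python
import itertools
import numpy as np

# Candidate values 2^-24, 2^-23, ..., 2^24, 2^25 as given in the text.
C_SEARCH = [2.0 ** k for k in range(-24, 26)]

def grid_search(score_fn, grid=C_SEARCH):
    """Return the (c_plus, c_minus) pair in grid x grid with the highest score."""
    best_pair, best_score = None, -np.inf
    for c_plus, c_minus in itertools.product(grid, repeat=2):
        score = score_fn(c_plus, c_minus)
        if score > best_score:  # strict comparison keeps the first best pair on ties
            best_pair, best_score = (c_plus, c_minus), score
    return best_pair, best_score
```

With 50 candidate values the joint search evaluates 2,500 pairs, which stays affordable only because each model is fitted in closed form.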
S05, constructing a target classification model according to the target parameters, carrying out classification prediction on the test instance through the target classification model, and outputting a prediction result.
Specifically, in an embodiment, the test instances are classified to obtain the final prediction result. The weighted kernel width learning system classifies the test samples: the optimal parameters obtained by grid search are put into the model, the model is run on the software defect test set, and once the predicted labels are obtained, the classification results are evaluated with the G-mean index. As shown in Table 1, the CCR-KBLS model constructed in this technical scheme outperforms the other models, such as SMOTE-BLS and RUS-BLS, on most of the data sets (MC2, MW1 and PC1), while DNC and WELM score higher on PC3.
TABLE 1
Data set   SMOTE-BLS   RUS-BLS   WELM    DNC     CCR-KBLS
MC2        0.735       0.746     0.697   0.699   0.834
PC3        0.745       0.647     0.795   0.816   0.745
MW1        0.478       0.592     0.667   0.673   0.845
PC1        0.567       0.744     0.787   0.698   0.792
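The text does not define the G-mean index used for the scores in Table 1. The sketch below uses the standard definition, the geometric mean of the per-class recalls, which for two classes is the square root of sensitivity times specificity; the function name `g_mean` is hypothetical.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of the recall obtained on each class present in y_true."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))
```

A classifier that assigns every module to the majority class has zero recall on the minority class and therefore a G-mean of zero, which is why this index suits unbalanced data better than plain accuracy.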
On the other hand, the technical scheme of the application also provides a software testing system based on the unbalanced data set, which comprises:
the data acquisition unit is used for acquiring sample data of unbalanced software defects and preprocessing the sample data;
setting a cost sensitivity coefficient for the data category in the sample data according to the unbalance characteristic of the sample data after preprocessing;
the model training unit is used for training a cost-sensitive kernel width learning system classification model according to sample data comprising the cost sensitivity coefficient;
the parameter optimization unit is used for determining an optimal penalty coefficient through grid search, and determining target parameters of the trained cost sensitive kernel width learning system classification model according to the optimal penalty coefficient;
and the prediction output unit is used for constructing a target classification model according to the target parameters, classifying and predicting the test instance through the target classification model and outputting a prediction result.
From the above specific implementation process, it can be summarized that, compared with the prior art, the technical solution provided by the present invention has the following advantages:
(1) According to the technical scheme, the cost-sensitive kernel width learning system (CCR-KBLS) is provided, the problem of unbalanced data classification is solved by combining a kernel learning technology with the cost-sensitive kernel width learning system, and the influence of extremely unbalanced characteristics of a software defect data set on classification performance can be effectively eliminated;
(2) The technical scheme designs a specific kernel mapping scheme for width learning. A dual-kernel mapping strategy adaptively adjusts the number of feature nodes, reducing the uncertainty of random mapping and avoiding manual setting. It can also eliminate the influence of data noise and outliers on the classification accuracy of the model.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features may be integrated in a single physical device and/or software module or may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. A software testing method based on an unbalanced data set, comprising the steps of:
acquiring sample data of unbalanced software defects, and preprocessing the sample data;
setting a cost sensitivity coefficient for a data category in the sample data according to the unbalance characteristic of the sample data after preprocessing;
training a cost-sensitive kernel width learning system classification model according to sample data comprising the cost sensitivity coefficient;
determining an optimal penalty coefficient through grid search, and determining target parameters of the trained cost sensitive kernel width learning system classification model according to the optimal penalty coefficient;
and constructing a target classification model according to the target parameters, carrying out classification prediction on the test instance through the target classification model, and outputting a prediction result.
2. The method for testing software based on an unbalanced data set of claim 1, wherein the step of obtaining sample data of an unbalanced software defect, and preprocessing the sample data comprises:
filling the missing value through the average value of the sample data, or deleting the missing value;
and determining a label of the sample data after the missing value processing, and adding the label after the side-by-side encoding processing to the sample data.
3. A software testing method based on unbalanced data sets according to claim 1, wherein said setting a cost sensitivity coefficient for a data class in said sample data based on an unbalance characteristic of the sample data after preprocessing comprises:
determining a first weight matrix according to the sample number of the sample data, and determining a penalty parameter of the data category according to the first weight matrix;
and determining the cost sensitivity coefficient corresponding to the data category according to the penalty parameter.
4. The method of claim 1, wherein training a cost sensitive kernel width learning system classification model based on sample data including the cost sensitive coefficients comprises:
determining an inner product matrix between the sample data through a radial basis function, constructing a mapping layer according to the inner product matrix, and determining a first connection weight between the mapping layer and an input layer;
taking the node of the mapping layer as input, mapping through the radial basis function to obtain an enhancement layer, and determining a second connection weight between the enhancement layer and the mapping layer;
and constructing and obtaining the cost sensitive kernel width learning system classification model according to the mapping layer, the enhancement layer, the first connection weight and the second connection weight.
5. The method for testing software based on an unbalanced data set of claim 4 wherein the inner product matrix is calculated by the formula:
Figure FDA0004005666740000021
wherein Ω_{i,j} characterizes the inner product matrix, x_a characterizes the feature vector of the a-th sample data, x_b characterizes the feature vector of the b-th sample data, σ denotes the parameter of the radial basis function, and N is a positive integer.
6. The method for testing software based on an unbalanced data set of claim 4, wherein the constructing the cost sensitive kernel width learning system classification model according to the mapping layer, the enhancement layer, the first connection weight and the second connection weight comprises:
determining a second weight matrix of a hidden layer in the cost-sensitive kernel width learning system classification model according to the first connection weight and the second connection weight; the hidden layer is obtained by splicing the mapping layer and the enhancement layer;
and constructing an objective function in the cost-sensitive kernel width learning system classification model according to the second weight matrix.
7. The method according to claim 4, wherein in the step of determining an inner product matrix between the sample data by a radial basis function, constructing a mapping layer based on the inner product matrix, and determining a first connection weight between the mapping layer and an input layer, a calculation formula of the first connection weight is as follows:
Figure FDA0004005666740000023
wherein λ denotes the regularization term parameter, I denotes the identity matrix, Ω_m characterizes the distance matrix of the mapping layer, and X denotes the input sample data.
8. The method according to claim 4, wherein in the step of mapping the node of the mapping layer by the radial basis function to obtain the enhancement layer and determining the second connection weight between the enhancement layer and the mapping layer, the calculation formula of the second connection weight is as follows:
Figure FDA0004005666740000024
wherein λ denotes the regularization term parameter, I denotes the identity matrix, Ω_m characterizes the distance matrix of the mapping layer, and Ω_e characterizes the distance matrix of the enhancement layer.
9. The method for testing software based on an unbalanced data set of claim 4 wherein the formula for the output layer weights in the cost sensitive kernel width learning system classification model is as follows:
Figure FDA0004005666740000022
wherein W denotes the output layer weights, λ denotes the regularization term parameter, Y denotes the label matrix corresponding to the sample data, C^+ denotes the cost sensitivity coefficient for misclassified minority-class samples, C^- denotes the penalty coefficient for misclassified majority-class samples, K = [Ω_m, Ω_e], Ω_m characterizes the distance matrix of the mapping layer, and Ω_e characterizes the distance matrix of the enhancement layer.
10. A software testing system based on an unbalanced data set, comprising:
the data acquisition unit is used for acquiring sample data of unbalanced software defects and preprocessing the sample data; setting a cost sensitivity coefficient for the data category in the sample data according to the unbalance characteristic of the sample data after preprocessing;
the model training unit is used for training a cost-sensitive kernel width learning system classification model according to sample data comprising the cost sensitivity coefficient;
the parameter optimization unit is used for determining an optimal penalty coefficient through grid search, and determining target parameters of the trained cost sensitive kernel width learning system classification model according to the optimal penalty coefficient;
and the prediction output unit is used for constructing a target classification model according to the target parameters, classifying and predicting the test instance through the target classification model and outputting a prediction result.
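For illustration, the preprocessing, cost-coefficient and grid-search steps recited in the claims above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the weighting rule N / (number of classes × class count) is an assumed stand-in for the weight matrix of claim 3, the scoring function passed to `grid_search` is hypothetical, and all function names are introduced here for illustration only.

```python
from collections import Counter


def fill_missing_with_mean(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 for a column with no observed value (an assumption).
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in rows]


def cost_coefficients(labels):
    """Assumed weighting: N / (number_of_classes * class_count) per class,
    so that the minority class receives the larger coefficient."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * count) for cls, count in counts.items()}


def grid_search(candidates, score):
    """Return the candidate penalty coefficient with the highest score.

    `score` is a hypothetical callable, e.g. a validation metric of a model
    trained with the given penalty coefficient.
    """
    return max(candidates, key=score)
```

In such a sketch, a data set with three non-defective samples and one defective sample would assign the defective class the larger coefficient, and `grid_search` would be called with a list of candidate penalty coefficients and a validation metric of the user's choosing.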
CN202211630770.5A 2022-12-19 2022-12-19 Software testing method and system based on unbalanced data set Pending CN116401143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211630770.5A CN116401143A (en) 2022-12-19 2022-12-19 Software testing method and system based on unbalanced data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211630770.5A CN116401143A (en) 2022-12-19 2022-12-19 Software testing method and system based on unbalanced data set

Publications (1)

Publication Number Publication Date
CN116401143A true CN116401143A (en) 2023-07-07

Family

ID=87011132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211630770.5A Pending CN116401143A (en) 2022-12-19 2022-12-19 Software testing method and system based on unbalanced data set

Country Status (1)

Country Link
CN (1) CN116401143A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598187A (en) * 2019-08-27 2020-08-28 华南理工大学 Progressive integrated classification method based on kernel width learning system
US20210342642A1 (en) * 2020-05-03 2021-11-04 Dataloop Ltd. Machine learning training dataset optimization

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WUXING CHEN,ET AL: "Double-kernel based class-specific broad learning system for multiclass imbalance learning", 《KNOWLEDGE-BASED SYSTEMS》, vol. 253, pages 1 - 12 *
WUXING CHEN,ET AL: "Kernel-based Class-specific Broad Learning System for software defect prediction", 《2021 8TH INTERNATIONAL CONFERENCE ON INFORMATION, CYBERNETICS, AND COMPUTATIONAL SOCIAL SYSTEMS (ICCSS)》, pages 109 - 114 *
谷宇等: "人工智能基础", 机械工业出版社, pages: 31 - 32 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination