CN114254698A - Unbalanced data and image processing method and system and computer equipment

Info

Publication number: CN114254698A
Application number: CN202111485510.9A
Authority: CN
Inventors: 戴亚康; 钱旭升; 周志勇; 胡冀苏; 姜宇
Original assignee: Suzhou Guoke Medical Technology Development Group Co ltd; Suzhou Institute of Biomedical Engineering and Technology of CAS
Current assignee: Suzhou Guoke Medical Technology Development Group Co ltd; Suzhou Institute of Biomedical Engineering and Technology of CAS
Priority date: 2021-12-07
Filing date: 2021-12-07
Publication date: 2022-03-29
Anticipated expiration: 2041-12-07
Also published as: CN114254698B

Abstract

The invention discloses a method, a system and computer equipment for processing unbalanced data and images. The method comprises the following steps: 1) preprocessing the unbalanced data set O; 2) determining the parameters of the RBF neural network data generation model by using a maximum distribution algorithm based on the Hausdorff distance; 3) constructing the RBF neural network data generation model; 4) generating a sample set S by combining the constructed RBF neural network data generation model with the mvnrnd function; 5) filling the generated sample set S into the original unbalanced data set O to obtain a processed balanced data set O_s, where O_s = O ∪ S. The method can handle missing values and attributes of different types, adaptively learns the intra-class and inter-class distribution of the original unbalanced data, and automatically generates data class by class to expand the minority classes in the original data, thereby effectively reducing the imbalance of the data and improving the accuracy of data analysis.

Description

Unbalanced data and image processing method and system and computer equipment

Technical Field

The invention relates to the field of data analysis and processing, in particular to an unbalanced data and image processing method, system and computer equipment.

Background

In a dataset, one class or a few classes may contain few samples (the positive, or minority, classes) while the remaining class or classes contain comparatively many (the negative, or majority, classes). When the sample counts of the two groups differ greatly, the dataset is called an unbalanced dataset. In such a dataset the minority classes provide too little information for the classifier during training, whereas the majority classes provide plenty. As a result the classifier recognizes the majority classes more easily, and its recognition rate for the minority classes is low.

Many real-world fields require modeling and analysis under data imbalance, for example medical information assisted diagnosis, bulk advertising spam handling, multimedia information retrieval, credit card fraud detection, and text classification. In many of these fields identifying the minority classes is what matters: correctly recognizing a minority-class sample is worth far more to the overall learning task than correctly recognizing a majority-class sample. For example, in medical information assisted diagnosis, a diagnosis can fall into four cases: a healthy person is correctly diagnosed as healthy, a patient is correctly diagnosed as ill, a healthy person is misdiagnosed as ill, and a patient is misdiagnosed as healthy. If a healthy person is misdiagnosed as ill, that person suffers serious psychological and financial pressure. If, however, an assisted medical diagnosis system misdiagnoses a patient as healthy, the patient will very likely not be treated in time. Of the four cases, misdiagnosing a patient as healthy is the least common in practice and can be regarded as the minority class, while the other three cases occur frequently and can be regarded as the majority classes. Most existing classification methods, however, achieve a high recognition rate on the majority classes but a low one on the minority classes, so the classifier does not fulfil its real purpose.

Unbalanced data is mainly handled by resampling, that is, undersampling or oversampling the samples to adjust the degree of imbalance of the sample set. Common methods that adjust unbalanced data from the minority-class side include random oversampling, SMOTE, and Borderline-SMOTE. These methods do not take sufficient account of the distribution characteristics of the actual data set and involve a degree of randomness and blindness, which degrades the classification result.

Therefore, there is a need to provide a more reliable solution.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide an unbalanced data and image processing method, system and computer device for overcoming the above-mentioned shortcomings in the prior art.

In order to solve the technical problems, the invention adopts the technical scheme that: an unbalanced data and image processing method is provided, which comprises the following steps:

1) preprocessing the unbalanced data set O;

2) processing the preprocessed unbalanced data set O by using a maximum distribution algorithm based on the Hausdorff distance, and determining parameters of an RBF neural network data generation model to be constructed; the parameters comprise hidden layer neurons of an RBF neural network data generation model, a category, an output weight and a diagonal distribution matrix corresponding to each hidden layer neuron, and a connection weight between each hidden layer neuron and a corresponding output neuron;

3) constructing an RBF neural network data generation model based on the result of the step 2);

4) generating data by combining the constructed RBF neural network data generation model with the mvnrnd function to obtain a generated sample set S;

5) filling the generated sample set S into the original unbalanced data set O to obtain a processed balanced data set O_s, where O_s = O ∪ S.

Preferably, the step 1) is specifically:

filling missing values of numerical attributes in the unbalanced data set O with the mean of that attribute over samples of the same class; filling missing values of ordinal attributes and nominal attributes with the most frequent value of that attribute among samples of the same class;

after the missing values have been filled, applying sequential coding to the ordinal attributes and the nominal attributes;

converting image data in the unbalanced data set O into numerical data by using a PyRadiomics-based tool kit, adding the numerical data to the data set O, and standardizing all attributes with the z-score method to obtain a preprocessed data set D;

using the vectors L_mean and L_std to store the mean and the standard deviation of each attribute, respectively, and storing the sequential coding scheme of the ordinal attributes and the nominal attributes.
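The tabular part of step 1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`impute_by_class`, `sequential_encode`, `zscore`) are hypothetical, missing values are represented by `None`, category codes are assigned in sorted order, and the population standard deviation is used. The PyRadiomics image-to-feature conversion is left out.

```python
from collections import Counter
from statistics import mean, pstdev

def impute_by_class(rows, labels, numeric_cols, categorical_cols):
    """Fill None entries from statistics of samples that share the same class."""
    filled = [list(r) for r in rows]
    for c in set(labels):
        idx = [i for i, y in enumerate(labels) if y == c]
        for j in list(numeric_cols) + list(categorical_cols):
            known = [rows[i][j] for i in idx if rows[i][j] is not None]
            if not known:
                continue  # nothing to learn from in this class; leave the gaps
            if j in numeric_cols:
                fill = mean(known)                          # class-wise mean
            else:
                fill = Counter(known).most_common(1)[0][0]  # class-wise mode
            for i in idx:
                if filled[i][j] is None:
                    filled[i][j] = fill
    return filled

def sequential_encode(rows, categorical_cols):
    """Replace category values by integer codes; also return the code maps."""
    out = [list(r) for r in rows]
    codebooks = {}
    for j in categorical_cols:
        # Assumption: codes follow sorted order; a real ordinal attribute
        # would supply its own order.
        values = sorted({r[j] for r in rows})
        codebooks[j] = {v: code for code, v in enumerate(values)}
        for r in out:
            r[j] = codebooks[j][r[j]]
    return out, codebooks

def zscore(rows):
    """Standardize every column; return the data plus L_mean and L_std."""
    cols = list(zip(*rows))
    l_mean = [mean(col) for col in cols]
    l_std = [pstdev(col) or 1.0 for col in cols]  # guard against constant columns
    data = [[(x - m) / s for x, m, s in zip(r, l_mean, l_std)] for r in rows]
    return data, l_mean, l_std

rows = [[1.0, "low"], [None, "low"], [3.0, None], [10.0, "high"], [14.0, "high"]]
labels = [0, 0, 0, 1, 1]
filled = impute_by_class(rows, labels, numeric_cols=[0], categorical_cols=[1])
encoded, codebooks = sequential_encode(filled, categorical_cols=[1])
D, L_mean, L_std = zscore(encoded)
```

Keeping `L_mean`, `L_std` and `codebooks` alongside `D` is what later allows generated samples to be mapped back to the original attribute scales and category values.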

Preferably, the step 2) specifically includes:

2-1) assume that the data set D contains N input samples {x_n}, n = 1, 2, …, N, that each sample has M attributes and belongs to one of C classes, and that the number of samples in the c-th class is N_c, c = 1, 2, …, C;

2-2) dividing the samples in the data set by class to obtain the data subsets D_c, c = 1, 2, …, C, each consisting of the samples belonging to class c; initializing: let the current class index c = 0 and the current number of hidden layer neurons P = 0;

2-3) let c = c + 1;

2-4) let P = P + 1; calculating the Hausdorff distance h_P between D_c and the other samples, and taking the corresponding sample as the newly added hidden layer neuron center k_P of class c; calculating the distance from every sample in D_c to k_P, recording the subset d_c formed by all samples whose distance is less than h_P, and deleting d_c from D_c; taking the number of samples in d_c as the connection weight w_P between k_P and the output neuron of the corresponding class, the connection weights between k_P and the other output neurons being 0; calculating the variance v_m of each attribute dimension in d_c, m = 1, 2, …, M, these variances forming the diagonal distribution matrix V_P = diag(v₁, v₂, …, v_M) corresponding to k_P;

2-5) if the number of samples remaining in D_c is not 0, returning to step 2-4); otherwise, checking whether c is equal to C: if c < C, returning to step 2-3); if c = C, the algorithm terminates.
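The loop in steps 2-1) to 2-5) can be sketched as below. The text does not spell out how the Hausdorff distance between D_c and "the other samples" is evaluated, so this sketch assumes the directed Hausdorff distance from D_c to the samples of all other classes (the largest nearest-neighbour distance) and takes the sample attaining it as k_P. The function names, the dictionary layout, and the guard that always keeps k_P inside d_c are assumptions added for illustration.

```python
import math
from statistics import pvariance

def dist(a, b):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def max_distribution(samples, labels):
    """Return one dict per hidden neuron: center k, class c, weight w, variances V.

    Needs at least two classes, otherwise there are no 'other samples'.
    """
    neurons = []
    for c in sorted(set(labels)):                                   # step 2-3)
        remaining = [x for x, y in zip(samples, labels) if y == c]  # D_c
        others = [x for x, y in zip(samples, labels) if y != c]
        while remaining:                                            # step 2-5)
            # Assumed reading of step 2-4): h_P is the directed Hausdorff
            # distance from D_c to the other samples; k_P attains it.
            nearest = [min(dist(x, o) for o in others) for x in remaining]
            h = max(nearest)
            k = remaining[nearest.index(h)]
            covered, rest = [], []                                  # d_c and new D_c
            for x in remaining:
                # k itself is always covered so the loop ends even if h == 0.
                (covered if x is k or dist(x, k) < h else rest).append(x)
            remaining = rest
            variances = [pvariance(col) for col in zip(*covered)]   # diagonal of V_P
            neurons.append({"k": k, "c": c, "w": len(covered), "V": variances})
    return neurons

samples = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5), (6, 5)]
labels = [0, 0, 0, 0, 1, 1]
neurons = max_distribution(samples, labels)
```

In this toy data set each class is covered by a single neuron, and the weights of the neurons of a class always add up to the class size N_c because every sample is removed from D_c exactly once.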

Preferably, the step 3) specifically includes:

3-1) determining that an input layer of the RBF neural network data generation model has M input neurons according to M attributes of each sample in the data set D, wherein each neuron corresponds to one attribute;

3-2) determining that an output layer of the RBF neural network data generation model has C output neurons according to C categories of the data set D, wherein each neuron corresponds to one category;

3-3) from the result of the step 2), obtaining the P hidden layer neurons {k₁, k₂, …, k_{P-1}, k_P}, their corresponding classes and output weights {w₁, w₂, …, w_{P-1}, w_P}, and the corresponding P diagonal distribution matrices {V₁, V₂, …, V_{P-1}, V_P}, thereby determining the parameters of the P hidden layer neurons {(k₁, V₁), (k₂, V₂), …, (k_{P-1}, V_{P-1}), (k_P, V_P)} and the connection weights {w₁, w₂, …, w_{P-1}, w_P} between each hidden layer neuron and its corresponding output neuron.

Preferably, the step 4) specifically includes:

4-1) setting the number S_c of samples to be generated for each class, c = 1, 2, …, C; initializing: let the current hidden layer neuron center index p = 0 and the generated sample set S = ∅, where ∅ denotes the empty set;

4-2) let p = p + 1; assuming that the current hidden layer neuron center k_p belongs to class c, the number of samples to be generated for k_p is [formula not reproduced];

4-3) generating the sample matrix [formula not reproduced], in which each sample belongs to class c, and merging it into the generated sample set S; checking whether p is equal to P: if p < P, returning to step 4-2); if p = P, the complete generated sample set S is obtained and the next step is executed;

4-4) de-standardizing S by using the mean vector L_mean and the standard deviation vector L_std of all attributes saved during preprocessing; and converting the corresponding values in S back to the original values of the ordinal attributes and the nominal attributes according to the saved sequential coding scheme.
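The generation loop of steps 4-1) to 4-3) might look like the sketch below. Two points are assumptions rather than statements of the method: the per-neuron sample count is taken to be the class target S_c split in proportion to the weights w_p, since the formula is not reproduced in this text, and MATLAB's mvnrnd is replaced by independent per-attribute normal draws, which is equivalent only because the distribution matrix is diagonal. The names `generate_samples` and `per_class_target` are hypothetical.

```python
import random

def generate_samples(neurons, per_class_target, seed=0):
    """Draw samples around each hidden-neuron center; returns (sample, class) pairs."""
    rng = random.Random(seed)
    generated = []                                   # S starts as the empty set
    for n in neurons:                                # p = 1, 2, ..., P
        class_weight = sum(m["w"] for m in neurons if m["c"] == n["c"])
        # Assumed allocation: share of S_c proportional to w_p. Rounding can
        # make the class total differ slightly from S_c.
        count = round(per_class_target[n["c"]] * n["w"] / class_weight)
        for _ in range(count):
            # Diagonal covariance: each attribute is an independent normal
            # draw with mean k_p[m] and variance v_m.
            x = [rng.gauss(mu, var ** 0.5) for mu, var in zip(n["k"], n["V"])]
            generated.append((x, n["c"]))
    return generated

neurons = [
    {"k": (0.0, 0.0), "c": 0, "w": 3, "V": [1.0, 1.0]},
    {"k": (4.0, 4.0), "c": 0, "w": 1, "V": [0.25, 0.0]},
    {"k": (9.0, 9.0), "c": 1, "w": 2, "V": [1.0, 1.0]},
]
S = generate_samples(neurons, per_class_target={0: 8, 1: 0})
```

Here class 0 receives eight new samples, six around the first center and two around the second, and class 1 receives none. An attribute with zero variance is reproduced exactly at the center value.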

The present invention also provides an unbalanced data and image processing system, which uses the method as described above to process unbalanced data, the system comprising:

the data preprocessing module is used for preprocessing the unbalanced data set O according to the method in the step 1) to obtain a data set D;

the maximum distribution algorithm module is used for determining parameters of the RBF neural network data generation model to be constructed according to the method in the step 2);

the network model building module is used for building an RBF neural network data generation model according to the method in the step 3);

the RBF neural network data generation model, which is combined with the mvnrnd function and adaptively generates a new data set S according to the distribution of the original unbalanced data set by the method in the step 4);

and a data post-processing module for filling the generated sample set S into the original unbalanced data set O to obtain a processed balanced data set O_s.

The invention also provides a storage medium having stored thereon a computer program which, when executed, is adapted to carry out the method as described above.

The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the computer program.

The invention has the following beneficial effects: the unbalanced data and image processing method provided by the invention can handle missing values and attributes of different types, adaptively learns the intra-class and inter-class distribution of the original unbalanced data, and automatically generates data class by class to expand the minority classes in the original data, thereby effectively reducing the imbalance of the data and improving the accuracy of data analysis.

Drawings

FIG. 1 is a flow chart of an unbalanced data and image processing method of the present invention;

FIG. 2 is a schematic diagram of the structure of the RBF neural network data generation model of the present invention.

Detailed Description

The present invention is further described in detail below with reference to examples so that those skilled in the art can practice the invention with reference to the description.

It will be understood that terms such as "having," "including," and "comprising," as used herein, do not preclude the presence or addition of one or more other elements or groups thereof.

Example 1

Referring to fig. 1, the unbalanced data and image processing method of the present embodiment includes the following steps:

S1, preprocessing the unbalanced data set O:

converting image data in the unbalanced data set O into numerical data by adopting a PyRadiomics-based tool kit, adding the numerical data into the data set O, and standardizing all attributes by using a z-score method to obtain a preprocessed data set D; wherein, the data types in the unbalanced data set O comprise numerical data, image data and the like;

S2, processing the preprocessed unbalanced data set O by using a maximum distribution algorithm based on the Hausdorff distance, and determining parameters of an RBF neural network data generation model to be constructed; the parameters comprise hidden layer neurons of an RBF neural network data generation model, a category, an output weight and a diagonal distribution matrix corresponding to each hidden layer neuron, and a connection weight between each hidden layer neuron and the corresponding output neuron; the method specifically comprises the following steps:

S2-1) assume that the data set D contains N input samples {x_n}, n = 1, 2, …, N, that each sample has M attributes and belongs to one of C classes, and that the number of samples in the c-th class is N_c, c = 1, 2, …, C;

S2-2) dividing the samples in the data set by class to obtain the data subsets D_c, c = 1, 2, …, C, each consisting of the samples belonging to class c; initializing: let the current class index c = 0 and the current number of hidden layer neurons P = 0;

S2-3) let c = c + 1;

S2-4) let P = P + 1; calculating the Hausdorff distance h_P between D_c and the other samples, and taking the corresponding sample as the newly added hidden layer neuron center k_P of class c; calculating the distance from every sample in D_c to k_P, recording the subset d_c formed by all samples whose distance is less than h_P, and deleting d_c from D_c; taking the number of samples in d_c as the connection weight w_P between k_P and the output neuron of the corresponding class, the connection weights between k_P and the other output neurons being 0; calculating the variance v_m of each attribute dimension in d_c, m = 1, 2, …, M, these variances forming the diagonal distribution matrix V_P = diag(v₁, v₂, …, v_M) corresponding to k_P;

S2-5) if the number of samples remaining in D_c is not 0, returning to step S2-4); otherwise, checking whether c is equal to C: if c < C, returning to step S2-3); if c = C, the algorithm terminates.
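For reference, one possible formalization of the quantities computed in step S2-4) is given below. It is an interpretation, not text from the specification: it assumes that "the other samples" are the samples of all classes other than c, written here as \bar{D}_c, that the Hausdorff distance is the directed one from D_c to \bar{D}_c, and that distances are Euclidean.

```latex
\begin{aligned}
h_P &= \max_{x \in D_c} \; \min_{y \in \bar{D}_c} \lVert x - y \rVert ,
&\quad
k_P &= \arg\max_{x \in D_c} \; \min_{y \in \bar{D}_c} \lVert x - y \rVert , \\
d_c &= \{\, x \in D_c : \lVert x - k_P \rVert < h_P \,\} ,
&\quad
w_P &= \lvert d_c \rvert , \\
v_m &= \operatorname{Var}\{\, x_m : x \in d_c \,\}, \; m = 1, \dots, M ,
&\quad
V_P &= \operatorname{diag}(v_1, v_2, \dots, v_M) .
\end{aligned}
```

Under this reading k_P is the class-c sample lying farthest from every other class, and each pass removes the samples lying within distance h_P of it, so D_c shrinks on every pass as long as h_P > 0.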

S3, constructing an RBF neural network data generation model based on the result of the step S2), specifically comprising the following steps:

S3-1) determining that an input layer of the RBF neural network data generation model has M input neurons according to M attributes of each sample in the data set D, wherein each neuron corresponds to one attribute;

S3-2) determining that an output layer of the RBF neural network data generation model has C output neurons according to C categories of the data set D, wherein each neuron corresponds to one category;

S3-3) from the result of the step S2), obtaining the P hidden layer neurons {k₁, k₂, …, k_{P-1}, k_P}, their corresponding classes and output weights {w₁, w₂, …, w_{P-1}, w_P}, and the corresponding P diagonal distribution matrices {V₁, V₂, …, V_{P-1}, V_P}, thereby determining the parameters of the P hidden layer neurons {(k₁, V₁), (k₂, V₂), …, (k_{P-1}, V_{P-1}), (k_P, V_P)} and the connection weights {w₁, w₂, …, w_{P-1}, w_P} between each hidden layer neuron and its corresponding output neuron.

Here it is assumed that the 1st and 2nd hidden layer neurons belong to class 1 and that the (P-1)-th and P-th hidden layer neurons belong to class C.

The schematic structure of the constructed RBF neural network data generation model is shown in FIG. 2.
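The three-layer structure described in steps S3-1) to S3-3) can be held in a small container such as the one below. The class and field names are hypothetical and chosen only for illustration. The sketch shows that each hidden neuron connects with a non-zero weight to exactly one output neuron, the one for its own class.

```python
from dataclasses import dataclass

@dataclass
class HiddenNeuron:
    center: list      # k_p, one value per attribute (length M)
    variances: list   # diagonal of V_p (length M)
    cls: int          # zero-based index of the output neuron (class) it feeds
    weight: float     # w_p

@dataclass
class RBFGenerator:
    n_inputs: int     # M input neurons, one per attribute
    n_outputs: int    # C output neurons, one per class
    hidden: list      # P HiddenNeuron objects

    def weight_matrix(self):
        """P x C hidden-to-output weights; one non-zero entry per row."""
        rows = []
        for h in self.hidden:
            row = [0.0] * self.n_outputs
            row[h.cls] = float(h.weight)
            rows.append(row)
        return rows

model = RBFGenerator(
    n_inputs=2,
    n_outputs=2,
    hidden=[
        HiddenNeuron([0.0, 0.0], [1.0, 1.0], cls=0, weight=4),
        HiddenNeuron([3.0, 3.0], [1.0, 1.0], cls=0, weight=1),
        HiddenNeuron([9.0, 9.0], [0.5, 0.5], cls=1, weight=2),
    ],
)
W = model.weight_matrix()
```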

S4, generating data by combining the constructed RBF neural network data generation model with the mvnrnd function to obtain a generated sample set S, which specifically comprises the following steps:

S4-1) setting the number S_c of samples to be generated for each class, c = 1, 2, …, C; initializing: let the current hidden layer neuron center index p = 0 and the generated sample set S = ∅, where ∅ denotes the empty set;

S4-2) let p = p + 1; assuming that the current hidden layer neuron center k_p belongs to class c, the number of samples to be generated for k_p is [formula not reproduced];

S4-3) generating the sample matrix [formula not reproduced], in which each sample belongs to class c, and merging it into the generated sample set S; checking whether p is equal to P: if p < P, returning to step S4-2); if p = P, the complete generated sample set S is obtained and the next step is executed;

S4-4) de-standardizing S by using the mean vector L_mean and the standard deviation vector L_std of all attributes saved during preprocessing; and converting the corresponding values in S back to the original values of the ordinal attributes and the nominal attributes according to the saved sequential coding scheme.
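Step S4-4) reverses the preprocessing. A minimal sketch follows, with hypothetical function names and two assumptions: the saved coding scheme is a mapping from original value to consecutive integer code starting at 0, and a generated continuous value is snapped to the nearest valid code before decoding. The text does not state how non-integer generated values are mapped back.

```python
def inverse_zscore(rows, l_mean, l_std):
    """Undo z-score standardization: x = z * std + mean, column by column."""
    return [[z * s + m for z, m, s in zip(r, l_mean, l_std)] for r in rows]

def decode_categories(rows, codebooks):
    """codebooks maps column index -> {original value: integer code}."""
    out = [list(r) for r in rows]
    for j, book in codebooks.items():
        inverse = {code: value for value, code in book.items()}
        lo, hi = min(inverse), max(inverse)
        for r in out:
            # Assumption: round to the nearest code and clamp to the valid range.
            code = min(max(round(r[j]), lo), hi)
            r[j] = inverse[code]
    return out

L_mean, L_std = [6.0, 0.6], [5.0, 0.5]
codebooks = {1: {"high": 0, "low": 1}}
S_std = [[0.2, 0.9], [-1.0, -1.4]]          # generated samples, standardized
restored = inverse_zscore(S_std, L_mean, L_std)
decoded = decode_categories(restored, codebooks)
```

In this example the first generated sample maps back to roughly 7.0 on the numerical attribute with category "low", and the second to roughly 1.0 with category "high".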

S5, filling the generated sample set S into the original unbalanced data set O to obtain a processed balanced data set O_s, where O_s = O ∪ S.
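The description leaves the choice of the per-class counts S_c open. One plausible rule, shown here purely as an assumption, is to top every class up to the size of the largest class and then take the union in step S5. The helper names are hypothetical, and the generated samples in the example are placeholders rather than model output.

```python
from collections import Counter

def samples_needed(labels):
    """Assumed rule: raise every class to the size of the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {c: target - n for c, n in counts.items()}

def merge(original, generated):
    """O_s = O ∪ S, as concatenation of (sample, label) pairs."""
    return list(original) + list(generated)

O = [([0.0], 0), ([1.0], 0), ([2.0], 0), ([9.0], 1)]
needed = samples_needed([label for _, label in O])
S = [([9.5], 1), ([8.7], 1)]                 # placeholder generated samples
O_s = merge(O, S)
balance = Counter(label for _, label in O_s)
```

With three samples of class 0 and one of class 1, the rule asks for two additional class-1 samples, after which both classes hold three samples each.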

Example 2

The present embodiment provides an unbalanced data and image processing system, which performs unbalanced data processing by using the method of embodiment 1, and the system includes:

The present embodiment also provides a storage medium having stored thereon a computer program for implementing the method of embodiment 1 when executed.

The present embodiment also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of embodiment 1 when executing the computer program.

Although embodiments of the invention have been disclosed above, the invention is not limited to the applications listed in the description and the embodiments; it is fully applicable to any field suited to it, and further modifications can readily be made by those skilled in the art. The invention is therefore not limited to the specific details described, provided that the general concept defined by the claims and their equivalents is not departed from.

Claims

1. An unbalanced data and image processing method, comprising the steps of:

1) preprocessing the unbalanced data set O;

2. The unbalanced data and image processing method according to claim 1, wherein the step 1) is specifically:

3. The unbalanced data and image processing method of claim 2, wherein the step 2) specifically comprises:

2-3) let c = c + 1;

2-4) let P = P + 1; calculating the Hausdorff distance h_P between D_c and the other samples, and taking the corresponding sample as the newly added hidden layer neuron center k_P of class c; calculating the distance from every sample in D_c to k_P, recording the subset d_c formed by all samples whose distance is less than h_P, and deleting d_c from D_c; taking the number of samples in d_c as the connection weight w_P between k_P and the output neuron of the corresponding class, the connection weights between k_P and the other output neurons being 0; calculating the variance v_m of each attribute dimension in d_c, m = 1, 2, …, M, these variances forming the diagonal distribution matrix V_P = diag(v₁, v₂, …, v_M) corresponding to k_P;

2-5) if the number of samples remaining in D_c is not 0, returning to step 2-4); otherwise, checking whether c is equal to C: if c < C, returning to step 2-3); if c = C, the algorithm terminates.

4. The unbalanced data and image processing method of claim 3, wherein the step 3) specifically comprises:

5. The unbalanced data and image processing method of claim 4, wherein the step 4) specifically comprises:

where ∅ denotes the empty set;

4-3) generating the sample matrix [formula not reproduced], in which each sample belongs to class c, and merging it into the generated sample set S,

6. An unbalanced data and image processing system for processing unbalanced data using a method as claimed in any one of claims 1 to 5, the system comprising:

7. A storage medium on which a computer program is stored, characterized in that the program is adapted to carry out the method of any one of claims 1-5 when executed.

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-5 when executing the computer program.