CN114937288A - Atypical class data set balancing method, device and medium - Google Patents

Publication number
CN114937288A
Authority
CN
China
Prior art keywords
data set
atypical
training
network model
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210704826.0A
Other languages
Chinese (zh)
Other versions
CN114937288B (en
Inventor
林江莉
韩霖
彭建伟
林江宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haihong Zhixiang Medical Science And Technology Tianjin Co ltd
Sichuan University
Original Assignee
Haihong Zhixiang Medical Science And Technology Tianjin Co ltd
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haihong Zhixiang Medical Science And Technology Tianjin Co ltd, Sichuan University filed Critical Haihong Zhixiang Medical Science And Technology Tianjin Co ltd
Priority to CN202210704826.0A priority Critical patent/CN114937288B/en
Publication of CN114937288A publication Critical patent/CN114937288A/en
Application granted granted Critical
Publication of CN114937288B publication Critical patent/CN114937288B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03 Recognition of patterns in medical or anatomical images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Radar Systems Or Details Thereof (AREA)

Abstract

The invention relates to an atypical class data set balancing method, device and medium. The method comprises: preprocessing a data set; amplifying the unbalanced data sets of the different categories by traditional methods to obtain a balanced data set; feeding the balanced data set into a set network model for training to obtain an atypical class data set; amplifying the atypical class data set to obtain an atypical class balanced data set; and finally inputting the typical class data set and the atypical class balanced data set into the network model for training to obtain a trained network model and a trained classification result. Amplifying the atypical class data set offers a new approach to classification tasks on unbalanced data and greatly improves the specificity and accuracy of the classification model. In addition, transfer learning from weights trained on natural images mitigates vanishing and exploding gradients, helps the network converge quickly, and improves the performance of the model.

Description

Atypical class data set balancing method, device and medium
Technical Field
The invention relates to a deep learning technology, in particular to the field of deep learning of unbalanced data sets, and specifically relates to a processing technology for classifying disease images by using an atypical class data balance method.
Background
Deep learning has greatly eased disease diagnosis for doctors in the field of medical image classification. Before deep learning emerged, research relied mainly on traditional machine learning methods, which extract features using texture features, LBP operators and similar techniques, and then classify images with the K-nearest neighbor algorithm, SVM or other traditional classifiers. Many scholars have obtained good classification results with these methods on diseases such as breast cancer and thyroid nodules, and have developed corresponding auxiliary diagnosis systems. Despite these achievements, feature selection in traditional machine learning is complex, accurate feature extraction is difficult, and the demands on researchers are high. Moreover, existing medical image data sets often contain more than ten thousand images, and traditional machine learning methods give unsatisfactory classification results on such large data sets.
On skin cancer classification tasks, deep learning has exceeded human experts in sensitivity, specificity and accuracy, and many researchers have introduced a range of methods into dermoscopic image classification in pursuit of higher accuracy and classification efficiency. Skin cancer data sets are typically unbalanced, meaning that the number of samples differs widely between classes. In machine learning, many scholars address data imbalance with sampling methods; common ones include oversampling such as SMOTE, undersampling such as NCL, and mixed sampling such as SMOTE + Tomek Links.
In the prior art, deep learning approaches to data imbalance balance the data during preprocessing, optimize the loss function, or modify part of the network structure so that the model adapts to the unbalanced data set during training. For example, in natural language processing, adding a dynamic K-means clustering step to data preprocessing improves the classification accuracy on unbalanced text data sets; in myocardial infarction signal processing, a CNN model trained with a Focal Loss function addresses the imbalance of myocardial infarction signals; for skin diseases, an auxiliary decoder processes the original images to increase the number of samples and balance the data set: the original image passes through a CNN and then a decoder network to produce a new image, the new image and the original image are fed to their respective CNN classifiers, and the training losses of the two images and the decoder loss are combined with different weights into the total loss of the classification model. However, skin cancer data sets are usually unbalanced, and existing loss balancing and model ensemble methods handle the imbalance poorly, so the accuracy of the resulting network or model is not high enough. A method for processing unbalanced skin disease data sets that improves the classification performance and accuracy of the network or model is therefore very important.
Disclosure of Invention
In view of the disadvantages of the prior art, the present invention provides an atypical class data set balancing method, apparatus and medium, which combine a traditional unbalanced data processing method with an atypical class data set balancing method and use skin disease images to classify and identify different types of diseases. The method first preprocesses the data set and applies traditional amplification to the unbalanced skin disease data sets of the different categories to obtain a training set in which the categories are balanced overall. The training set is then fed into the selected CNN network for training. Once the trained weights are obtained, they are used to classify the training set into incorrectly predicted and correctly predicted samples; the incorrectly predicted samples are called the atypical class. The atypical class is then expanded heavily to obtain a new training set, namely the atypical class balanced data set. Finally, the new training set is fed into the network model for training to obtain the final trained network model, on which the prediction and classification of different skin diseases is based.
In a first aspect, the present invention provides an atypical class data set balancing method, the method comprising the steps of:
preprocessing the data set;
amplifying different types of unbalanced data sets to obtain balanced data sets;
sending the balanced data set into a set network model for training to obtain an atypical class data set;
performing atypical class data amplification on the atypical class data set to obtain an atypical class balanced data set;
inputting the typical class data set and the atypical class balanced data set into a network model for training to obtain a trained network model and a trained classification result.
Further, preprocessing the data set specifically includes cropping the images in the data set and applying random image enhancement.
Preferably, the specific setting of the set network model includes:
the network model is EfficientNet B0, and specifically comprises MBConvBlock, MBConv, SepConv, depth separable convolution (DWConv) and an SE module;
the initial setting of the network model is to use the weight trained on the natural image to carry out transfer learning training to obtain the initial model weight of the EfficientNetB0 network;
preferably, the network model adopts Focal Loss in the training process, which is derived from the binary cross-entropy loss function of equation 1:
$$CE(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases} \qquad (1)$$
wherein y takes the value 1 or -1, with 1 denoting a positive sample and -1 a negative sample, and the probability p ranges from 0 to 1;
defining $p_t$ in place of p yields equation 2:
$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \qquad (2)$$
equation 1 can be rewritten as equation 3:
$$CE(p, y) = CE(p_t) = -\log(p_t) \qquad (3)$$
to control the weights of positive and negative samples and of hard-to-classify and easy-to-classify samples, Focal Loss adds a modulation coefficient in front of equation 3 to obtain equation 4:
$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \qquad (4)$$
wherein $\alpha_t$ is used to reduce the weight of the negative samples: when the label equals 1, $\alpha_t$ equals $\alpha$; otherwise $\alpha_t$ equals $1 - \alpha$, with $\alpha$ ranging from 0 to 1.
Specifically, data amplification of the data set uses one or more of histogram equalization, horizontal flipping, rotation by 30, 90, 150 or 180 degrees, and random erasure.
Specifically, the atypical class data set balancing method of the present invention further comprises the step of acquiring a target image and performing prediction and classification on the target image based on the trained network model.
In a second aspect, the present invention provides an atypical class data set balancing apparatus, which specifically includes:
the preprocessing module is used for preprocessing the data set;
the first amplification module is used for amplifying unbalanced data sets of different categories to obtain balanced data sets;
the first training module is used for sending the balanced data set into a set network model for training to obtain an atypical class data set;
the second amplification module is used for performing atypical class data amplification on the atypical class data set to obtain an atypical class balanced data set;
and the second training module is used for inputting the typical class data set and the atypical class balanced data set into a network model for training to obtain a trained network model and a trained classification result.
In a third aspect, the present invention provides an electronic device comprising a memory, one or more processors, and one or more programs, wherein the one or more programs are stored in the memory, are configured to be executed by the one or more processors, and include instructions for executing the steps of the atypical class data set balancing method.
In a fourth aspect, the present invention provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of any one of the atypical class data set balancing methods.
The invention discloses an atypical class data set balancing method, a device and a medium, which realize the following beneficial technical effects:
(1) In the embodiment of the invention, for unbalanced data sets, the data set is first preprocessed; the unbalanced data sets of the different categories are amplified by traditional methods to obtain a balanced data set; the balanced data set is then sent to a set network model for training to obtain an atypical class data set; the atypical class data set is amplified to obtain an atypical class balanced data set; and the typical class data set and the atypical class balanced data set are input into a network model for training to obtain a trained network model and a trained classification result. Amplifying the atypical class data set offers a new approach to classification tasks on unbalanced data, and specificity and accuracy are greatly improved compared with existing classification methods.
(2) In the embodiment of the invention, the network model (EfficientNet B0) is initialized with EfficientNet B0 weights trained on natural images and then trained by transfer learning, which keeps the network from becoming trapped in a poor local optimum and mitigates vanishing and exploding gradients to a certain extent; transfer learning also helps the network converge quickly and improves the performance of the model.
(3) In the embodiment of the invention, Focal Loss, an improved binary cross-entropy loss function, is used to verify the loss balancing strategy. Focal Loss was introduced to address sample imbalance and hard-to-classify samples in target detection networks; it controls the weights of hard and easy samples well and yields a better training result. Applied to skin cancer, the method yields an excellent intelligent recognition and classification model for skin cancer.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present invention, and it is also possible for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
FIG. 1 is a diagram illustrating the general technical steps of an atypical class data set balancing method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the ISIC2019 training set labels provided in an embodiment of the present invention.
FIG. 3 is a schematic diagram of a data set before and after cropping according to an embodiment of the present invention.
Fig. 4 is a graph of the data set distribution after traditional balancing provided by an embodiment of the present invention.
Fig. 5 is a schematic diagram of an EfficientNet network structure according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of the distribution of the initial data set, the traditionally balanced data set, and the atypical class balanced data set provided by the embodiment of the present invention.
Detailed Description
The following embodiments of the present invention are provided by way of specific examples, and other advantages and effects of the present invention will be readily apparent to those skilled in the art from the disclosure herein. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the features in the following embodiments and examples may be combined with each other without conflict.
In view of the shortcomings of the prior art, the present invention provides a method, an apparatus and a medium for balancing atypical class data sets. Its main aim is to classify and identify various diseases efficiently and accurately when the data sets used for automatic disease classification are unbalanced, by combining a traditional unbalanced data processing method with an atypical class data set balancing method. Compared with the prior art, the atypical class data set balancing method disclosed by the invention achieves accurate prediction and classification of skin diseases and improves classification efficiency.
Specifically, as shown in figure 1, the atypical class data set balancing method comprises: preprocessing the data set; applying traditional amplification to the unbalanced skin disease data sets of the different classes to obtain a training set in which the classes are balanced overall; feeding the training set into the selected CNN network for training; after obtaining the trained weights, using them to classify the training set into an incorrectly predicted class and a correctly predicted class, the incorrectly predicted class being called the atypical class; and then expanding the atypical class to obtain a new training set, the atypical class balanced data set. The new training set is fed into the network model for training to obtain the final trained network model and a trained classification result, and the prediction and classification of different skin diseases is carried out based on this network model.
Illustratively, the data set used is the ISIC2019 dermoscopic image classification data set, whose data sources mainly include HAM10000. It is divided into training data and test data. The training data comprise 25331 dermoscopic images with disease labels; the labels are stored in an Excel file in a format similar to One-Hot coding. As shown in fig. 2, the first row gives the disease categories, the first column gives the picture names, and the column containing a 1 indicates the label of the picture. The training data were collected by professional dermatologists. The data set information used in the experiments is shown in table 1; the counts and proportions in the table show the data imbalance, with melanocytic nevus the most numerous class and dermatofibroma the least. To build the test set, some pictures are first removed, and the challenge training set is then split in two: one fifth of the pictures are set aside and stored separately as the test set to avoid data contamination, and the remaining four fifths are used as the training set. Grouped into benign and malignant diseases, the whole training data set contains 8473 malignant skin cancer pictures, only about one half the number of benign pictures. Melanoma is the most numerous malignant class with 4522 pictures, and melanocytic nevus is the most numerous benign class with 12875 pictures.
Table 1 detailed example of ISIC2019 dataset
Figure BDA0003705827410000041
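As a minimal sketch of the label format and hold-out split described above, the following Python decodes One-Hot style label rows and sets aside one fifth of the pictures as a test set. The column names, picture names, function names and the fixed random seed are illustrative assumptions and are not taken from the ISIC2019 files.

```python
import random

def decode_one_hot(header, row):
    """Return (picture_name, label) for one row of a One-Hot style label table.

    header: column names; the first is the picture-name column, the rest are classes.
    row: the picture name followed by 0/1 flags, exactly one of which is 1.
    """
    name, flags = row[0], [int(float(v)) for v in row[1:]]
    if sum(flags) != 1:
        raise ValueError(f"{name}: expected exactly one label, got {sum(flags)}")
    return name, header[1:][flags.index(1)]

def hold_out_split(names, test_fraction=0.2, seed=0):
    """Shuffle picture names and set aside test_fraction of them as the test set."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Hypothetical column and picture names, for illustration only.
header = ["image", "MEL", "NV", "BCC"]
rows = [["img_0001", "0.0", "1.0", "0.0"], ["img_0002", "1.0", "0.0", "0.0"]]
labels = dict(decode_one_hot(header, r) for r in rows)
train, test = hold_out_split([f"img_{i:04d}" for i in range(10)])
```

This split is purely random; whether the split in the experiments was stratified by class is not stated in the text.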
In the following embodiments of the present invention, specific exemplary embodiments are given by way of example, and the embodiments of the technical solutions are not limited thereto.
First embodiment
In one embodiment, an atypical class data set balancing method of the present invention comprises the steps of:
and S100, preprocessing the data set.
Specifically, when preprocessing the ISIC2019 data, it was found that hair removal and segmentation before classification bring only a very small marginal improvement, not more than 1%. To demonstrate the effect of the atypical class data balancing proposed by the invention, these preprocessing methods are therefore omitted and only a simple cropping step is applied. A large number of pictures in the data set have irrelevant black borders, as shown in a and d of fig. 3. Taking skin cancer as an example, classification only needs to focus on the skin lesion in the middle of the picture, so these pictures are cropped. The procedure is as follows: first, all training images are binarized (the binarized pictures are shown in b and e of fig. 3, where the black borders to be removed are visible); a threshold is then set to screen out the pictures that need cropping; finally, those pictures are cropped so that the aspect ratio of each cropped picture is the same as that of the initial image. Comparing the pictures before and after cropping (a with c, and d with f, in fig. 3) shows that the cropped pictures retain the skin lesion well while the black borders are removed. In addition, before the images are input into the network, the preprocessing step of the invention applies random image enhancement to avoid overfitting. This data preprocessing improves the accuracy of the classification result.
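The cropping procedure can be sketched roughly as follows with NumPy. This is one possible reading of the description, not the patented implementation: the binarization threshold, the screening fraction, the function name and the way the original aspect ratio is preserved are all assumptions.

```python
import numpy as np

def crop_black_border(gray, threshold=10, min_border_fraction=0.05):
    """Crop a dark border from a grayscale image while keeping its aspect ratio.

    gray: 2-D uint8 array. threshold and min_border_fraction are assumed values.
    """
    h, w = gray.shape
    mask = gray > threshold                      # binarize: True where not black
    if not mask.any():
        return gray                              # fully black image: leave as is
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    top, bottom = int(rows[0]), int(rows[-1]) + 1
    left, right = int(cols[0]), int(cols[-1]) + 1
    # Screening step: only crop images whose black border is large enough.
    border_fraction = 1.0 - ((bottom - top) * (right - left)) / (h * w)
    if border_fraction < min_border_fraction:
        return gray
    # Use the larger relative extent so the crop keeps the original aspect ratio.
    scale = max((bottom - top) / h, (right - left) / w)
    new_h = min(h, int(round(scale * h)))
    new_w = min(w, int(round(scale * w)))
    # Center the crop window on the content and clamp it to the image bounds.
    cy, cx = (top + bottom) // 2, (left + right) // 2
    y0 = min(max(cy - new_h // 2, 0), h - new_h)
    x0 = min(max(cx - new_w // 2, 0), w - new_w)
    return gray[y0:y0 + new_h, x0:x0 + new_w]
```

Because the window is sized by the larger of the two relative extents, a thin residual border can remain along the other axis; that is the cost of keeping the aspect ratio fixed in this sketch.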
And S200, amplifying the unbalanced data sets of different categories to obtain balanced data sets.
Specifically, taking skin diseases as an example, some categories have extremely few samples; according to the experimental data, the largest and smallest categories differ by a factor of several tens. To balance this difference, an amplification balancing strategy is adopted, starting with a traditional one. Illustratively, the images of the categories to be amplified are augmented by histogram equalization, horizontal flipping, rotation by 30, 90, 150 and 180 degrees, and random erasure (Cutout). Depending on the initial number of images in each category, the number of images per category after amplification is approximately one half or one third of that of the largest category, melanocytic nevus. The data distribution after traditional balancing is shown in fig. 4; it is more balanced than the initial distribution, and the increased amount of data in the small-sample categories helps avoid overfitting during training.
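To make the balancing target concrete, the sketch below computes how many augmented copies per image each category would need to reach a chosen fraction of the largest category. The function name and the class counts are hypothetical; the text gives only the approximate targets of one half or one third.

```python
import math

def augmentation_plan(class_counts, target_fraction=0.5):
    """Return, per class, the number of augmented copies to create per image.

    A class already at or above target_fraction of the largest class gets 0.
    """
    largest = max(class_counts.values())
    target = math.ceil(largest * target_fraction)
    plan = {}
    for name, count in class_counts.items():
        if count >= target:
            plan[name] = 0
        else:
            # ceil(target / count) total images per original, minus the original.
            plan[name] = math.ceil(target / count) - 1
    return plan

# Hypothetical class sizes, for illustration only.
counts = {"a": 1000, "b": 100, "c": 600, "d": 30}
plan = augmentation_plan(counts, target_fraction=0.5)
```

With these made-up counts, class "b" needs 4 extra copies per image and class "d" needs 16, while "a" and "c" are left unchanged.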
And step S300, sending the balanced data set into a set network model for training to obtain an atypical class data set.
Specifically, after the balanced data set is sent to the set network model for training and classification, the output includes a typical class data set and an atypical class data set; that is, the preliminary network model separates the correctly classified samples (the typical class) from the incorrectly classified samples (the atypical class).
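The separation into typical and atypical samples amounts to comparing the preliminary model's predictions with the ground-truth labels. A minimal sketch, with hypothetical function and sample names:

```python
def split_typical_atypical(samples, labels, predictions):
    """Split training samples by whether the preliminary model classified them correctly.

    Returns (typical, atypical), each a list of (sample, label) pairs.
    """
    typical, atypical = [], []
    for sample, label, predicted in zip(samples, labels, predictions):
        target = typical if predicted == label else atypical
        target.append((sample, label))
    return typical, atypical

# Hypothetical picture names, labels and model predictions.
typical, atypical = split_typical_atypical(
    ["img_a", "img_b", "img_c"],
    ["MEL", "NV", "MEL"],
    ["MEL", "NV", "NV"],
)
```

Here "img_c", a melanoma predicted as a nevus, ends up in the atypical class and keeps its true label for the later expansion step.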
Specifically, in one embodiment, the set network model may be EfficientNetB0, VGG, ResNet or GoogLeNet. Preferably, the network model adopted by the invention is EfficientNet B0. All input pictures in the balanced data set are resized to 224 × 224, and random horizontal flipping and random rotation are applied as data enhancement to avoid overfitting during training; finally, the data are normalized, converted into vectors and input into the network model. During deep learning network training, the network weights are updated through back propagation. In the early days of deep learning, weights were initialized randomly before training, which can lead to exploding and vanishing gradients, so a good weight initialization should be provided at the start of training. In one embodiment, the initial weights of the network model can be set from prior knowledge or generated randomly. Preferably, the network model EfficientNetB0 of the invention performs transfer learning with weights trained on natural images to obtain the initial model weights of the EfficientNetB0 network. The pre-trained weights keep the network from becoming trapped in a poor local optimum and mitigate vanishing and exploding gradients to a certain extent; transfer learning also helps the network converge quickly and improves the performance of the model.
Specifically, the main modules of EfficientNetB0 are MBConvBlock, MBConv and SepConv, and it uses structures such as depthwise separable convolution (DWConv) and the SE module; the specific structure is shown in fig. 5. Network model training is divided into two stages, transfer learning and weight training after the transfer learning, with the following settings: the number of epochs is set to 100 for both stages, the batch size is 64, the learning rate is 0.01 for transfer learning and 0.001 afterwards, and both stages use the SGD optimizer with momentum set to 0.9. The number of disease categories can be set according to the actual identification requirements. In one embodiment, eight categories are used, and the final fully connected layer of the network pre-trained on ImageNet is modified so that the output matches the number of categories in the current experiment. Finally, the method monitors the accuracy on the validation set during training, stops training early when the accuracy has not improved for 10 epochs, and saves the model weights with the highest validation accuracy.
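The hyperparameters and the early-stopping rule above can be captured in a framework-independent sketch. The dictionary keys and the class name are assumptions made for illustration; a real training loop would save the model weights at the point marked in the comment.

```python
# Settings taken from the text; the key names are illustrative.
TRAINING_CONFIG = {
    "epochs": 100,
    "batch_size": 64,
    "lr_transfer_learning": 0.01,
    "lr_after_transfer_learning": 0.001,
    "optimizer": "SGD",
    "momentum": 0.9,
    "num_classes": 8,
    "early_stopping_patience": 10,
}

class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_accuracy = float("-inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def update(self, epoch, val_accuracy):
        """Record one epoch's validation accuracy; return True if training should stop."""
        if val_accuracy > self.best_accuracy:
            self.best_accuracy = val_accuracy
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
            # A real loop would save the model weights here (best so far).
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

def stopping_epoch(accuracies, patience):
    """Return the epoch index at which training stops, or None if it never does."""
    stopper = EarlyStopping(patience)
    for epoch, accuracy in enumerate(accuracies):
        if stopper.update(epoch, accuracy):
            return epoch
    return None
```

An accuracy that merely ties the best value counts as no improvement in this sketch; the text does not say how ties are treated.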
And step S400, performing atypical data amplification on the atypical data set to obtain an atypical balanced data set.
Specifically, the data expansion of the atypical class data set uses histogram equalization, horizontal flipping, rotation by 30, 90, 150 and 180 degrees, random erasure (Cutout), and the like, and the expanded data set is used as the atypical class balanced data set. In one embodiment, taking skin diseases as an example, some categories have no atypical samples, but this does not affect the atypical class balancing of the overall data, and every malignant skin cancer category usually has atypical samples. After the atypical class is obtained, it is amplified twenty times or more by data amplification, increasing the number of atypical class samples in the data set so that the model pays more attention to them and the false negative rate of the model is reduced. The distributions of the initial data set, the traditionally balanced data set and the data set after atypical class balancing are shown in fig. 6; the data distribution gradually tends toward balance. In the invention, EfficientNet is used for testing first; the proposed atypical class data balancing method is then applied to classical networks in the classification field to verify its ability to improve the training result.
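A rough NumPy sketch of expanding one atypical image into many variants is given below. It covers only horizontal flipping, rotation by 90 and 180 degrees, and random erasure; histogram equalization and the 30 and 150 degree rotations are omitted because they need interpolation or extra code. The function names, the erased-patch size limit and the random composition of operations are assumptions.

```python
import numpy as np

def random_erase(image, rng, max_fraction=0.3):
    """Return a copy of image with one random rectangle set to zero."""
    h, w = image.shape[:2]
    erase_h = int(rng.integers(1, max(2, int(h * max_fraction) + 1)))
    erase_w = int(rng.integers(1, max(2, int(w * max_fraction) + 1)))
    y = int(rng.integers(0, h - erase_h + 1))
    x = int(rng.integers(0, w - erase_w + 1))
    out = image.copy()
    out[y:y + erase_h, x:x + erase_w] = 0
    return out

def expand_sample(image, factor=20, seed=0):
    """Create `factor` augmented variants of a square image."""
    rng = np.random.default_rng(seed)
    variants = []
    for _ in range(factor):
        out = image
        if rng.random() < 0.5:
            out = np.fliplr(out)                       # horizontal flip
        out = np.rot90(out, k=int(rng.choice([0, 1, 2])))  # 0, 90 or 180 degrees
        variants.append(random_erase(out, rng))        # copy, so input is untouched
    return variants
```

With so few operations, some of the twenty variants can coincide apart from the erased patch; a fuller implementation would draw on the complete list of augmentations named in the text.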
Step S500, inputting the typical class data set and the atypical class balanced data set into the network model for training to obtain a trained network model and a trained classification result.
Specifically, a typical class data set and an atypical class data set are obtained through the initial training of the network model. After the atypical class balanced data set is obtained through amplification, the data distribution gradually tends toward balance. A new training set formed from the typical class data set and the atypical class balanced data set is input into the network model for training to obtain the final trained network model, together with the classification result of the training data set.
In step S500, it is understood that the loss function in the network model may be a conventional loss function. In one embodiment, the network model of the present invention preferably uses Focal Loss to address sample imbalance and hard-to-classify samples during training, so that the network improves considerably in both training speed and training accuracy. Focal Loss is a modification of the basic binary cross-entropy loss function, which is shown in equation 1 below:
$$\mathrm{CE}(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases} \tag{1}$$
where y takes the value 1 or -1, with 1 denoting a positive sample and -1 a negative sample, and the probability p ranges from 0 to 1;
replacing p with $p_t$ gives equation 2:
$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \tag{2}$$
equation 1 can be rewritten as equation 3:
$$\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t) \tag{3}$$
in order to control the weights of positive and negative samples and of hard-to-classify and easy-to-classify samples, Focal Loss adds modulation coefficients in front of equation 3 to obtain equation 4:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{4}$$
where $\alpha_t$ is used to reduce the weight of negative samples: when the label equals 1, $\alpha_t$ equals $\alpha$; otherwise, $\alpha_t$ equals $1-\alpha$, with $\alpha$ ranging from 0 to 1. The contribution of positive and negative samples to the loss can thus be controlled by setting the value of $\alpha$. The factor $(1-p_t)^{\gamma}$ controls the weight according to classification difficulty: the larger $p_t$ is, i.e. the higher the probability assigned to the true class, the smaller $1-p_t$ and hence the weight, and vice versa, so $\gamma$ can be set to control the contribution of hard-to-classify and easy-to-classify samples to the loss. $(1-p_t)^{\gamma}$ is called the modulation factor of Focal Loss, and $\alpha_t$ is the coefficient commonly used to control the positive and negative sample weights. When $\gamma$ is 0, Focal Loss reduces to the ordinary ($\alpha_t$-weighted) binary cross-entropy loss function, and as $\gamma$ increases, the relative weight of hard-to-classify samples gradually increases, so suitable values of $\alpha$ and $\gamma$ can be selected in use. In one embodiment, the invention uses the value 0.25 chosen empirically, which controls the weight of hard-to-classify samples well and yields a better training result.
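Equations 1 to 4 can be checked numerically with a short sketch for a single sample. The function names are hypothetical. The defaults are assumptions: the text cites the value 0.25 without naming the parameter, so assigning it to `alpha` is a guess, and `gamma=2.0` is a commonly used value that is not stated in this document. The `eps` clamp is an implementation detail added to avoid evaluating log(0).

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy of equation 3; y is 1 (positive) or -1 (negative)."""
    p_t = p if y == 1 else 1.0 - p      # equation 2
    p_t = min(max(p_t, eps), 1.0)       # clamp to avoid log(0)
    return -math.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal Loss of equation 4 for one sample (assumed default parameters)."""
    p_t = p if y == 1 else 1.0 - p              # equation 2
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class weighting coefficient
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

For a positive sample predicted with p = 0.9, the modulation factor is (1 - 0.9)^2 = 0.01, so with these assumed parameters the loss is 0.25 × 0.01 times the cross-entropy; for a positive sample predicted with p = 0.1 the factor is 0.81, which is how hard samples come to dominate the total loss.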
Further, the atypical class data set balancing method of the present invention further comprises the following step:
acquiring a target image, and performing prediction classification on the target image based on the trained network model.
Specifically, a target image from the test set is obtained and input into the trained network model to perform prediction classification on it. The method can be applied to various data sets that require prediction classification, in particular to the prediction classification of images of different disease types, which helps to apply the prediction network model to the classification of various target images. When applied to skin cancer, the method yields an intelligent skin cancer prediction model with good predictive recognition and classification performance.
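The final prediction step can be sketched as converting the network's eight output scores into probabilities and selecting the most probable category. This is an illustration only: the function names are hypothetical, the use of a softmax output is an assumption, and the class abbreviations are those of the ISIC 2019 training categories, which this document does not list explicitly.

```python
import math

# Assumed label order; the document only states that there are eight categories.
CLASS_NAMES = ["MEL", "NV", "BCC", "AK", "BKL", "DF", "VASC", "SCC"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, class_names=CLASS_NAMES):
    """Return the most probable class name and its probability."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best], probabilities[best]
```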
According to the invention, different classifier models are used to verify the effect of atypical class data balancing on the more than 20,000 dermoscopic images of the ISIC 2019 data set. Compared with models trained without atypical class data balancing, the sensitivity, F1 score, precision, specificity and accuracy are greatly improved; the F1 score of GoogLeNet is improved by 12.7%, and the average accuracy is improved by about 5%. In the multi-class task covering eight types of skin lesions such as melanoma and squamous cell carcinoma, the accuracy of the atypical class data balancing method combined with the EfficientNet model reaches 82.4%, about 20% higher than the accuracy of the champion model of the latest ISIC 2019 competition. This demonstrates the effectiveness of the intelligent skin disease identification and classification strategy provided by the method, and the atypical class data balancing strategy also offers a new idea for classification tasks on imbalanced data.
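The evaluation metrics named above have standard definitions in terms of confusion-matrix counts, sketched below for a single class treated as positive against all others. The function name is hypothetical, and how the per-class values are averaged across the eight categories is not specified in this document, so it is left out. The false negative rate mentioned earlier is 1 minus the sensitivity.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard metrics from true/false positive/negative counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall, 1 - FNR
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denominator = precision + sensitivity
    f1 = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }
```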
Example two
The present invention provides another embodiment, and provides an atypical class data set balancing apparatus, including:
the preprocessing module is used for preprocessing the data set;
the first amplification module is used for amplifying unbalanced data sets of different categories to obtain balanced data sets;
the first training module is used for sending the balanced data set into a set network model for training to obtain an atypical class data set;
the second amplification module is used for performing atypical data amplification on the atypical data set to obtain an atypical balanced data set;
and the second training module is used for inputting the typical class data set and the atypical class equilibrium data set into a network model for training to obtain a trained network model and a trained classification result.
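The data flow between the five modules can be sketched as a chain of callables. This is a structural illustration only: the class name `BalancingPipeline` and its field names are hypothetical, and each stage is supplied by the caller, so the sketch makes no claim about how any individual module is implemented. Following the method embodiment, the first training stage is assumed to return both the typical class data set and the atypical class data set.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class BalancingPipeline:
    """Chains the five modules; every stage is a caller-supplied callable."""

    preprocess: Callable[[Any], Any]
    balance_classes: Callable[[Any], Any]
    first_training: Callable[[Any], Tuple[Any, Any]]        # -> (typical, atypical)
    amplify_atypical: Callable[[Any], Any]
    second_training: Callable[[Any, Any], Tuple[Any, Any]]  # -> (model, results)

    def run(self, dataset):
        cleaned = self.preprocess(dataset)
        balanced = self.balance_classes(cleaned)
        typical, atypical = self.first_training(balanced)
        atypical_balanced = self.amplify_atypical(atypical)
        return self.second_training(typical, atypical_balanced)
```

Keeping the stages as separate callables mirrors the statement that the module division is only logical: stages can be merged or replaced without changing the order of the data flow.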
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Example three
The present invention provides another embodiment, namely an electronic device including a processor 1 and a memory 2.
The memory 2 is used for storing a computer program.
The memory 2 includes various media that can store program code, such as ROM, RAM, a magnetic disk, a USB flash drive, a memory card, or an optical disk.
The processor 1 is connected to the memory 2 and is configured to execute the computer program stored in the memory 2, so as to enable the electronic device to perform the above-mentioned atypical class data set balancing method.
Preferably, the processor 1 may be a Central Processing Unit (CPU); it may also be an Application Specific Integrated Circuit (ASIC).
Example four
The present invention also provides another embodiment, namely a computer-readable storage medium storing a computer program executable by at least one processor to cause the at least one processor to perform the steps of the atypical class data set balancing method described above.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art would appreciate that the modules, elements, and/or method steps of the various embodiments described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be another division, for example, multiple modules or components may be combined or integrated into another device or system, or some features may be omitted, or not executed.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A method of atypical class data set balancing, the method comprising the steps of:
S1, preprocessing the data set;
S2, amplifying unbalanced data sets of different categories to obtain balanced data sets;
S3, sending the balanced data set into a set network model for training to obtain an atypical data set;
S4, performing atypical class data amplification on the atypical class data set;
and S5, feeding back the typical class data set and the atypical class equilibrium data set to the network model for training to obtain the trained network model and the trained classification result.
2. The atypical class data set balancing method as claimed in claim 1, wherein the preprocessing of the data set specifically includes:
the data set is subjected to a cropping process and a random image enhancement process.
3. The atypical class data set balancing method according to claim 1, wherein the set network model specifically includes:
the network model is EfficientNet B0, and specifically comprises MBConvBlock, MBConv, SepConv, depth separable convolution (DWConv) and an SE module;
the initial setting of the network model is to use the weights trained on natural images to carry out transfer learning training to obtain the initial model weight of the EfficientNetB0 network.
4. The atypical class data set balancing method of claim 3,
the network model adopts Focal Loss in the training process, which is built on the binary cross-entropy loss function:
$$\mathrm{CE}(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases} \tag{1}$$
where y takes the value 1 or -1, with 1 denoting a positive sample and -1 a negative sample, and the probability p ranges from 0 to 1;
substituting pt for p yields equation 2:
$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \tag{2}$$
equation 1 can be rewritten as equation 3:
$$\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t) \tag{3}$$
in order to control the weights of positive and negative samples and of hard-to-classify and easy-to-classify samples, Focal Loss adds modulation coefficients in front of equation 3 to obtain equation 4:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{4}$$
where $\alpha_t$ is used to reduce the weight of negative samples: when the label equals 1, $\alpha_t$ equals $\alpha$; otherwise, $\alpha_t$ equals $1-\alpha$, with $\alpha$ ranging from 0 to 1.
5. The atypical class data set balancing method according to claim 1, wherein the data set augmentation specifically includes:
data amplification of the data set includes using one or more of histogram equalization, horizontal flipping, rotation by 30 degrees, 90 degrees, 150 degrees, 180 degrees, and random erasure.
6. The atypical class data set balancing method of claim 1 wherein the method steps further comprise:
and acquiring a target image, and performing prediction classification on the target image based on the trained network model.
7. An atypical class data set balancing apparatus, comprising:
the preprocessing module is used for preprocessing the data set;
the first amplification module is used for amplifying unbalanced data sets of different categories to obtain balanced data sets;
the first training module is used for sending the balance data set into a set network model for training to obtain an atypical data set;
the second amplification module is used for performing atypical data amplification on the atypical data set to obtain an atypical balanced data set;
and the second training module is used for feeding the typical class data set and the atypical class equilibrium data set back to the network model for training to obtain a trained network model and a trained classification result.
8. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs comprise instructions for performing the steps of the atypical class data set balancing method of any one of claims 1-6.
9. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the atypical class data set balancing method according to one of claims 1 to 6.
CN202210704826.0A 2022-06-21 2022-06-21 Atypical data set balancing method, atypical data set balancing device and atypical data set balancing medium Active CN114937288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210704826.0A CN114937288B (en) 2022-06-21 2022-06-21 Atypical data set balancing method, atypical data set balancing device and atypical data set balancing medium

Publications (2)

Publication Number Publication Date
CN114937288A true CN114937288A (en) 2022-08-23
CN114937288B CN114937288B (en) 2023-05-26

Family

ID=82867601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210704826.0A Active CN114937288B (en) 2022-06-21 2022-06-21 Atypical data set balancing method, atypical data set balancing device and atypical data set balancing medium

Country Status (1)

Country Link
CN (1) CN114937288B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612311A (en) * 2023-03-13 2023-08-18 浙江大学 Sample imbalance-oriented unqualified immunohistochemical image recognition system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108231201A (en) * 2018-01-25 2018-06-29 华中科技大学 A kind of construction method, system and the application of disease data analyzing and processing model
CN111476266A (en) * 2020-02-27 2020-07-31 武汉大学 Non-equilibrium type leukocyte classification method based on transfer learning
US20210042580A1 (en) * 2018-10-10 2021-02-11 Tencent Technology (Shenzhen) Company Limited Model training method and apparatus for image recognition, network device, and storage medium
CN112766379A (en) * 2021-01-21 2021-05-07 中国科学技术大学 Data equalization method based on deep learning multi-weight loss function
CN112966767A (en) * 2021-03-19 2021-06-15 焦点科技股份有限公司 Data unbalanced processing method for separating feature extraction and classification tasks
US11087883B1 (en) * 2020-04-02 2021-08-10 Blue Eye Soft, Inc. Systems and methods for transfer-to-transfer learning-based training of a machine learning model for detecting medical conditions
CN113743332A (en) * 2021-09-08 2021-12-03 中国科学院自动化研究所 Image quality evaluation method and system based on universal vision pre-training model
CN114266717A (en) * 2020-09-25 2022-04-01 天津科技大学 Parallel capsule network cervical cancer cell detection method based on Inception module
CN114529767A (en) * 2022-02-18 2022-05-24 厦门大学 Small sample SAR target identification method based on two-stage comparison learning framework
WO2022109295A1 (en) * 2020-11-19 2022-05-27 Carnegie Mellon University System and method for detecting and classifying abnormal cells

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
KWABENA EBO BENNIN et al.: "Mahakil: Diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction" *
王乐 et al.: "不平衡数据集分类方法综述" *
赵越 et al.: "卷积神经网络的宫颈细胞图像分类", 计算机辅助设计与图形学学报 *
郭磊: "图像识别中的不平衡学习和增量学习方法研究" *
顾苏杭 et al.: "基于数据点本身及其位置关系辅助信息挖掘的分类方法" *

Also Published As

Publication number Publication date
CN114937288B (en) 2023-05-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant