CN114937288B - Atypical data set balancing method, atypical data set balancing device and atypical data set balancing medium - Google Patents

Publication number
CN114937288B
Authority
CN
China
Prior art keywords
data set
atypical
network model
training
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210704826.0A
Other languages
Chinese (zh)
Other versions
CN114937288A (en
Inventor
林江莉
韩霖
彭建伟
林江宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haihong Zhixiang Medical Science And Technology Tianjin Co ltd
Sichuan University
Original Assignee
Haihong Zhixiang Medical Science And Technology Tianjin Co ltd
Sichuan University
Application filed by Haihong Zhixiang Medical Science And Technology Tianjin Co ltd, Sichuan University
Priority to CN202210704826.0A
Publication of CN114937288A
Application granted
Publication of CN114937288B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03 Recognition of patterns in medical or anatomical images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Radar Systems Or Details Thereof (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an atypical data set balancing method, device and medium. The method comprises: preprocessing a data set; amplifying the unbalanced data sets of the different categories with traditional methods to obtain a balanced data set; sending the balanced data set into a set network model for training to obtain an atypical class data set; amplifying the atypical class data set to obtain an atypical class balanced data set; and finally inputting the typical class data set and the atypical class balanced data set into the network model for training to obtain a trained network model and a classification result. This method of amplifying the atypical class data set offers a new approach to classification tasks on unbalanced data and greatly improves the specificity and accuracy of the classification model. In addition, weights pretrained on natural images are introduced for transfer learning, which mitigates the problems of vanishing and exploding gradients and helps the network converge quickly, improving the performance of the model.

Description

Atypical data set balancing method, atypical data set balancing device and atypical data set balancing medium
Technical Field
The invention relates to deep learning technology, in particular to deep learning on unbalanced data sets, and more particularly to a technique for classifying disease images using an atypical data balancing method.
Background
The application of deep learning to medical image classification has made disease diagnosis much more convenient for doctors. Before deep learning, research in this field was mainly based on traditional machine learning methods, which extract features using texture features, the LBP operator and similar techniques, and then classify the images with traditional methods such as the K-nearest neighbor algorithm or SVM. Many scholars have obtained good classification results with these methods on diseases such as breast cancer and thyroid nodules, and have developed corresponding auxiliary diagnosis systems. Despite these achievements, feature selection in traditional machine learning is complex, the features are difficult to extract accurately, and the demands on researchers are high. Moreover, current medical image data sets often contain more than ten thousand images, and the classification results of traditional machine learning methods on such large data sets are unsatisfactory.
The sensitivity, specificity and accuracy of deep learning on skin cancer classification tasks exceed those of human experts, and many researchers have introduced a series of methods into the field of dermoscopic image classification in pursuit of higher accuracy and classification efficiency. Skin cancer data sets mostly exhibit an unbalanced data distribution and are typical unbalanced data sets. In machine learning, many scholars adopt different sampling methods to address data imbalance; common sampling methods include oversampling, such as SMOTE; undersampling, such as NCL; and mixed sampling, such as SMOTE + Tomek Links.
In the prior art, the traditional ways of handling data imbalance in deep learning are to balance the data during preprocessing, to optimize the loss function, or to modify part of the network structure so that the model adapts to the unbalanced data set during training. For example, for unbalanced natural-language text data, a dynamic K-means clustering method has been added to the data preprocessing, which improves the classification accuracy on unbalanced text data sets. In myocardial infarction signal processing, a CNN model trained with the Focal Loss function has been used to address the imbalance of myocardial infarction signals. For skin diseases, an auxiliary decoder has been used to process the original images and increase the number of samples to balance the data set: the original image passes through a CNN and is sent to a decoder network to obtain a new image, the new image and the original image are each sent to a corresponding CNN classifier, and the training losses of the two images and the loss of the decoder are combined with different weights to form the total loss of the classification model. However, skin cancer data sets are usually unbalanced, and existing loss-balancing or model-ensemble approaches to this imbalance perform poorly, so the accuracy of the resulting network or model is not high enough. A method for processing unbalanced skin disease data sets that improves the classification performance and accuracy of a network or model is therefore very important.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides an atypical data set balancing method, device and medium. The invention combines a traditional unbalanced-data processing mode with an atypical data set balancing mode and uses skin disease images to classify and identify different types of diseases. Specifically, the data set is preprocessed, and the unbalanced data sets of the different skin disease categories are amplified in the traditional way to obtain a training set in which the categories are balanced overall. On this basis, the training set is sent into a selected CNN for training; after the weights are obtained, they are used to classify the training set, which yields two groups of samples, correctly predicted and mispredicted. The mispredicted group is called atypical. The atypical classes are then expanded substantially to obtain a new training set, the atypical class data balanced data set. The new training set is sent into the network model for training to obtain the final trained network model, and different skin diseases are predicted and classified based on this network model.
In a first aspect, the present invention proposes a method for balancing atypical class data sets, said method comprising the steps of:
preprocessing a data set;
amplifying unbalanced data sets of different categories to obtain balanced data sets;
sending the balanced data set into a set network model for training to obtain an atypical class data set;
performing atypical class data amplification on the atypical class data set to obtain an atypical class balanced data set;
inputting the typical class data set and the atypical class balanced data set into a network model for training to obtain a trained network model and a trained classification result.
Specifically, preprocessing the data set includes cropping the images and applying random image enhancement.
Preferably, the configuration of the set network model includes:
the network model is EfficientNetB0, which comprises MBConvBlock, MBConv and SepConv modules, depthwise separable convolution (DWConv), and SE modules;
the network model is initialized with weights pretrained on natural images and trained by transfer learning to obtain the initial model weights of the EfficientNetB0 network;
preferably, the network model adopts the Focal Loss function during training, which is derived from the binary cross-entropy loss function of equation 1:
$$\mathrm{CE}(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases} \tag{1}$$
where y takes the value 1 or -1, with 1 representing a positive sample and -1 a negative sample, and the predicted probability p of the positive class ranges from 0 to 1;
replacing p with $p_t$ gives equation 2:
$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \tag{2}$$
equation 1 can then be rewritten as equation 3:
$$\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t) \tag{3}$$
to control the weights of positive and negative samples and of hard-to-classify and easy-to-classify samples, Focal Loss adds a modulation factor in front of equation 3 to obtain equation 4:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{4}$$
where $\alpha_t$ is used to reduce the weight of negative samples: $\alpha_t$ equals $\alpha$ when the label equals 1 and equals $1 - \alpha$ otherwise, and $\alpha$ ranges from 0 to 1.
Specifically, data amplification of the data set uses one or more of histogram equalization, horizontal flipping, rotation by 30, 90, 150 or 180 degrees, and random erasure.
Specifically, the atypical class data set balancing method of the invention further comprises the step of obtaining a target image and performing prediction classification on the target image based on the trained network model.
In a second aspect, the present invention proposes an atypical data set balancing apparatus, which specifically includes:
the preprocessing module is used for preprocessing the data set;
the first amplification module is used for amplifying unbalanced data sets of different categories to obtain balanced data sets;
the first training module is used for sending the balanced data set into the set network model for training to obtain an atypical class data set;
the second amplification module is used for carrying out atypical class data amplification on the atypical class data set to obtain an atypical class balanced data set;
and the second training module is used for inputting the typical class data set and the atypical class balanced data set into the network model for training to obtain a trained network model and a trained classification result.
In a third aspect, the invention proposes an electronic device characterized in that it comprises a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, said one or more programs comprising steps for performing said atypical-type dataset balancing method.
In a fourth aspect, the present invention proposes a computer readable storage medium, on which a computer program is stored, the computer program implementing the steps of the atypical class data set balancing method according to any one of the above-mentioned methods when being executed by a processor.
The atypical data set balancing method, device and medium based on the invention realize the following beneficial technical effects:
(1) In the embodiment of the invention, for an unbalanced data set, the data set is first preprocessed; the unbalanced data sets of the different categories are amplified with traditional methods to obtain a balanced data set; the balanced data set is then sent into a set network model for training to obtain an atypical class data set; the atypical class data set is amplified to obtain an atypical class balanced data set; finally, the typical class data set and the atypical class balanced data set are input into the network model for training to obtain a trained network model and a trained classification result. This method of amplifying the atypical class data set offers a new approach to classification tasks on unbalanced data and, compared with existing classification methods, greatly improves specificity and accuracy.
(2) In the embodiment of the invention, the network model (EfficientNetB0) is initialized with EfficientNetB0 weights pretrained on natural images and trained by transfer learning. This alleviates, to a certain extent, the problems of the network becoming trapped in a poor local optimum and of vanishing and exploding gradients; transfer learning also helps the network converge quickly and improves the performance of the model.
(3) In the embodiment of the invention, an improved binary cross-entropy loss function (Focal Loss) is used as the loss-balancing strategy. This loss addresses the problems of sample imbalance and hard-to-classify samples encountered in object detection networks; it controls the weight of hard samples well and yields a better training result. When the method is applied to skin cancer, a good intelligent skin cancer identification and classification model is obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings may be obtained according to the drawings without inventive effort to those skilled in the art.
FIG. 1 is a schematic diagram of general technical steps of an atypical class data set balancing method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an ISIC2019 training set label according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the data set before and after clipping according to an embodiment of the present invention.
FIG. 4 is a diagram of the data set distribution after conventional balancing, provided by an embodiment of the present invention.
Fig. 5 is a schematic diagram of the architecture of the EfficientNet network provided by an embodiment of the present invention.
FIG. 6 is a schematic diagram of the data set distributions for the initial data set, after conventional balancing, and after atypical class data balancing, provided by an embodiment of the present invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other embodiments, and the details of the present description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other where there is no conflict.
In view of the shortcomings of the prior art, the invention provides an atypical data set balancing method, device and medium. The main purpose of the invention is to classify and identify various diseases efficiently and accurately when the data set for automatic disease classification is unbalanced, by combining a traditional unbalanced-data processing mode with an atypical data set balancing mode. The embodiments of the invention take skin diseases as an example and use images of different types of skin disease to classify and identify the different diseases. Compared with the prior art, the atypical data set balancing method achieves accurate prediction and classification of skin diseases and improves classification efficiency.
Specifically, as shown in Fig. 1, the atypical data set balancing method comprises preprocessing a data set and amplifying the unbalanced data sets of the different skin disease categories in the traditional way to obtain a training set in which the categories are balanced overall. On this basis, the training set is sent into a selected CNN for training; after the trained weights are obtained, they are used to classify the training set, which yields two groups of samples, correctly predicted and mispredicted. The mispredicted group is called atypical. The atypical classes are then expanded to obtain a new training set, namely the atypical class data balanced data set. The new training set is sent into the network model for training to obtain the final trained network model and a trained classification result, and different skin diseases are predicted and classified based on this network model.
Illustratively, the invention uses the ISIC2019 dermoscopic image classification data set, whose data sources mainly include HAM10000. The data set is divided into training data and test data. The training data comprise 25331 dermoscopic images with disease labels, which are stored in an Excel file in a manner similar to One-Hot encoding, as shown in Fig. 2: the first row gives the disease categories, the first column gives the picture names, and the column containing the number 1 indicates the label of the picture. The training data were collected by professional dermatologists. The data set information used in the experiment is shown in Table 1. The counts and proportions in the table show that the data are unbalanced: melanocytic nevi are the most numerous class and dermatofibromas the least. The atypical data balancing means provided by the invention aims to solve this problem. Before use, some of the pictures are reduced, and the challenge training set is then divided into two parts: one fifth of the pictures are set aside and stored separately as a test set to avoid data contamination, and the remaining four fifths are used as the training set. In the whole training data set there are 8473 malignant skin cancer pictures and 16858 benign pictures, so the malignant pictures number only about half of the benign ones. Among the malignant pictures, melanoma is the most numerous class with 4522 pictures; among the benign pictures, nevi are the most numerous with 12875 pictures.
Table 1. Details of the ISIC2019 data set
Figure BDA0003705827410000041
In the following embodiments of the present invention, the specific technical solutions are given only as examples, and the embodiments of the technical solutions are not limited thereto.
First embodiment
In one embodiment, an atypical class data set balancing method of the present invention comprises the steps of:
Step S100, preprocessing the data set.
Specifically, when preprocessing the ISIC2019 data, it was found that hair removal and classification after segmentation bring only a marginal improvement of no more than 1%. To isolate the effect of atypical data balancing, these preprocessing methods are abandoned and only simple cropping is performed. The data set contains a large number of pictures with irrelevant black borders, as shown in panels a and d of Fig. 3. Taking skin cancer as an example, classification only needs to attend to the skin lesion in the middle of the picture, so part of the pictures are cropped. The procedure is as follows: first, all training data are binarized (binarized pictures are shown in panels b and e of Fig. 3), which reveals the black borders of the pictures to be processed; a threshold is then set to screen out the pictures that need cropping; these pictures are cropped so that the aspect ratio of the cropped picture is the same as that of the initial image. Comparing the pictures before and after cropping (panels a and c, and panels d and f, of Fig. 3) shows that the cropped pictures retain the skin lesion well and remove the black borders. The preprocessing step of the invention also includes random image enhancement before the data are input to the network, to avoid overfitting. This preprocessing improves the accuracy of the classification result.
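The cropping procedure of step S100 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the binarization threshold `thresh`, the screening ratio `min_border_frac`, and all function names are hypothetical, and the image is a grayscale grid held in plain Python lists to stay dependency-free. Unlike the procedure described above, the sketch does not restore the aspect ratio of the initial image after cropping.

```python
def binarize(img, thresh=20):
    """Return a 0/1 mask: 1 where the pixel is brighter than thresh (assumed value)."""
    return [[1 if px > thresh else 0 for px in row] for row in img]


def content_box(mask):
    """Bounding box (top, bottom, left, right) of non-black pixels, or None if all black."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for j in range(len(mask[0])) if any(row[j] for row in mask)]
    if not rows or not cols:
        return None
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop_black_border(img, thresh=20, min_border_frac=0.05):
    """Crop only when the black border covers at least min_border_frac of the image."""
    box = content_box(binarize(img, thresh))
    if box is None:
        return img
    top, bottom, left, right = box
    total = len(img) * len(img[0])
    kept = (bottom - top) * (right - left)
    if (total - kept) / total < min_border_frac:
        return img  # border too small: picture is screened out and left untouched
    return [row[left:right] for row in img[top:bottom]]


# 6x6 toy image: a bright 2x2 "lesion" region surrounded by a black border.
image = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        image[r][c] = 200
cropped = crop_black_border(image)
```

Screening on the fraction of border pixels is one possible reading of "a threshold is set to screen out the pictures that need cropping"; the source does not state which quantity is thresholded.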
Step S200, amplifying unbalanced data sets of different categories to obtain balanced data sets.
Specifically, taking skin diseases as an example, the experimental data show that some skin disease categories contain very few pictures, and the largest and smallest categories differ in size by a factor of several tens. To reduce this difference, a conventional expansion balancing policy is applied first. Illustratively, the pictures of the categories to be expanded are amplified using histogram equalization, horizontal flipping, rotation by 30, 90, 150 or 180 degrees, random erasure (cutout) and similar methods. After expansion, the number of pictures in each category is approximately one half or one third of the number in the largest category (melanocytic nevi), depending on the initial number of pictures in that category. The data distribution after traditional balancing is shown in Fig. 4. Compared with the initial distribution it is more balanced, and the increased amount of data in the small-sample categories helps avoid overfitting during training.
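Several of the amplification operations listed above can be sketched in a few lines. The following is an illustrative example only: the function names are hypothetical, images are integer grayscale grids in plain Python lists, and the 30 and 150 degree rotations are omitted because they require interpolation, which an image library would normally provide.

```python
import random


def hflip(img):
    """Horizontal flip: reverse every row."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def rot180(img):
    """Rotate 180 degrees."""
    return [row[::-1] for row in img[::-1]]


def equalize(img, levels=256):
    """Histogram equalization for integer pixel values in [0, levels)."""
    flat = [px for row in img for px in row]
    hist = [0] * levels
    for px in flat:
        hist[px] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    denom = len(flat) - cdf_min
    if denom == 0:  # constant image: nothing to equalize
        return [row[:] for row in img]
    lut = [round((c - cdf_min) / denom * (levels - 1)) if c > 0 else 0 for c in cdf]
    return [[lut[px] for px in row] for row in img]


def random_erase(img, size, rng):
    """Cutout: zero a size x size square at a random position."""
    h, w = len(img), len(img[0])
    top, left = rng.randrange(h - size + 1), rng.randrange(w - size + 1)
    out = [row[:] for row in img]
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = 0
    return out


def amplify(img, rng):
    """Return one augmented variant per operation."""
    return [hflip(img), rot90(img), rot180(img), equalize(img), random_erase(img, 1, rng)]


sample = [[10, 20], [30, 40]]
variants = amplify(sample, random.Random(0))
```

Each operation returns a new grid and leaves the input untouched, so the original picture can be kept in the training set alongside its variants.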
Step S300, sending the balanced data set into a set network model for training to obtain an atypical class data set.
Specifically, after the balanced data set is sent to the set network model for training and classification, the output comprises the typical class data set and the atypical class data set; that is, the preliminary network model yields the correctly classified samples (typical) and the incorrectly classified samples (atypical).
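The split into typical and atypical samples amounts to partitioning the training set by whether the preliminary model's prediction matches the label. A minimal sketch follows; `split_typical_atypical` and `toy_predict` are hypothetical names, and the threshold rule merely stands in for the trained network.

```python
def split_typical_atypical(samples, labels, predict):
    """Partition (sample, label) pairs by whether `predict` matches the label."""
    typical, atypical = [], []
    for sample, label in zip(samples, labels):
        target = typical if predict(sample) == label else atypical
        target.append((sample, label))
    return typical, atypical


# Toy stand-in for the trained network: classify a number by a fixed threshold.
def toy_predict(x):
    return "malignant" if x >= 0.5 else "benign"


samples = [0.1, 0.4, 0.6, 0.9, 0.45]
labels = ["benign", "malignant", "malignant", "malignant", "benign"]
typical, atypical = split_typical_atypical(samples, labels, toy_predict)
```

Here the second sample is labeled malignant but predicted benign, so it is the only atypical sample.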
Specifically, in one embodiment, the set network model may be EfficientNetB0, VGG, ResNet, or GoogLeNet. Preferably, the network model adopted by the invention is EfficientNetB0. All input pictures in the balanced data set are resized to 224 × 224, and random horizontal flipping and random rotation are applied as data enhancement to avoid overfitting during training; finally, the data are normalized and converted into vectors for input to the network model. When a deep learning network is trained, its weights are updated through back-propagation. If the weights are randomly initialized at the start of training, gradient explosion and gradient vanishing can occur, so the weights of the network model need to be well initialized at the start of training. In one embodiment, the initial weights of the network model can be set from prior knowledge or generated randomly. Preferably, the network model EfficientNetB0 of the invention uses weights pretrained on natural images and performs transfer learning training to obtain the initial model weights of the EfficientNetB0 network. The pretrained weights help keep the network from becoming trapped in a poor local optimum and alleviate gradient vanishing and gradient explosion to a certain extent; transfer learning also helps the network converge quickly and improves the performance of the model.
Specifically, the main modules of EfficientNetB0 include MBConvBlock, MBConv and SepConv, and the network uses structures such as depthwise separable convolution (DWConv) and SE modules; the specific structure is shown in Fig. 5. Network model training is divided into two parts, transfer learning and weight training after transfer learning, with the following settings: the number of epochs is uniformly set to 100, the batch size is 64, the learning rate is 0.01 for transfer learning and 0.001 after transfer learning, and both training phases use an SGD optimizer with momentum set to 0.9. The number of classes can be set according to the actual disease identification requirement; in one embodiment, eight classes are used. Because the network model is transferred from ImageNet, the final fully connected layer of the network is modified so that the number of outputs matches the number of classes in the current experiment. Finally, the accuracy on the validation set is monitored during training: when it has not improved for 10 epochs, training is stopped early, and the weights with the highest validation accuracy are saved.
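The training settings and the early-stopping rule described above can be sketched as follows. The hyperparameter values are those quoted in the text; the `CONFIG` keys, the `EarlyStopping` class, and the validation accuracies are hypothetical, and the actual network training is replaced by a precomputed accuracy history.

```python
from dataclasses import dataclass

# Hyperparameters quoted in the text; the key names are illustrative.
CONFIG = {
    "max_epochs": 100,
    "batch_size": 64,
    "lr_transfer": 0.01,   # learning rate during transfer learning
    "lr_finetune": 0.001,  # learning rate after transfer learning
    "momentum": 0.9,       # SGD momentum
    "patience": 10,        # epochs without validation improvement before stopping
}


@dataclass
class EarlyStopping:
    patience: int
    best_acc: float = float("-inf")
    best_epoch: int = -1
    stale_epochs: int = 0

    def update(self, epoch: int, val_acc: float) -> bool:
        """Record one validation result; return True when training should stop."""
        if val_acc > self.best_acc:
            self.best_acc, self.best_epoch, self.stale_epochs = val_acc, epoch, 0
            # A real training loop would save the model weights here.
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Hypothetical validation accuracies: three improving epochs, then a plateau.
val_history = [0.50, 0.60, 0.70] + [0.65] * 30
stopper = EarlyStopping(patience=CONFIG["patience"])
stopped_at = None
for epoch, acc in enumerate(val_history[: CONFIG["max_epochs"]]):
    if stopper.update(epoch, acc):
        stopped_at = epoch
        break
```

With this history the best accuracy is reached at epoch 2 and the loop stops ten epochs later, well before the 100-epoch limit.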
Step S400, carrying out atypical class data amplification on the atypical class data set to obtain an atypical class balanced data set.
Specifically, data amplification of the atypical class data set uses methods such as histogram equalization, horizontal flipping, rotation by 30, 90, 150 or 180 degrees, and random erasure (cutout), and the expanded data set is used as the atypical class balanced data set. In one embodiment, taking skin diseases as an example, some categories have no atypical samples, but this does not affect the atypical class balance of the overall data, and each malignant skin cancer category usually has atypical samples. After the atypical classes are obtained, the data amplification methods are used to amplify them by twenty times, fifty times or more, increasing the number of atypical samples in the data set so that the model pays more attention to them and the false negative rate of the model is reduced. The distributions of the initial data set, the traditionally balanced data set, and the data set after atypical class data balancing are shown in Fig. 6; the data distribution gradually becomes more balanced. The invention first tests the method with EfficientNet, and then applies the proposed atypical data balancing method to classical networks in the classification field to verify its ability to improve training results.
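The atypical amplification step can be sketched as producing a fixed number of augmented variants for every atypical sample and merging them with the typical samples. This is an illustration under assumptions: `amplify_atypical` is a hypothetical name, only two deterministic operations are used, the label strings are placeholders, and whether the unaugmented atypical originals are also retained is not stated in the source.

```python
from itertools import cycle, islice


def amplify_atypical(atypical, ops, factor):
    """Return `factor` augmented variants of every atypical (image, label) pair."""
    amplified = []
    for image, label in atypical:
        for op in islice(cycle(ops), factor):
            amplified.append((op(image), label))
    return amplified


def hflip(image):
    """Horizontal flip: reverse every row."""
    return [row[::-1] for row in image]


def rot180(image):
    """Rotate 180 degrees."""
    return [row[::-1] for row in image[::-1]]


typical = [([[1, 2], [3, 4]], "NV")] * 3   # correctly classified samples
atypical = [([[5, 6], [7, 8]], "MEL")]     # misclassified sample
amplified = amplify_atypical(atypical, [hflip, rot180], factor=20)
new_training_set = typical + amplified
```

Cycling through deterministic operations yields repeated variants once `factor` exceeds the number of operations; randomized operations such as random erasure or random rotation would give distinct variants.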
Step S500, inputting the typical class data set and the atypical class balanced data set into the network model for training to obtain a trained network model and a classification result.
Specifically, the typical class data set and the atypical class data set are obtained from the initial training of the network model. After the atypical class balanced data set is obtained through amplification, the data distribution gradually tends toward balance. The new training data set formed by the typical class data set and the atypical class balanced data set is input into the network model for training to obtain the final trained network model, together with the classification result for the training data set.
In step S500, it can be appreciated that the loss function used in the network model may be a conventional loss function. In one embodiment, the network model of the present invention preferably adopts Focal Loss to address sample imbalance and hard-to-classify samples during training, which markedly improves both training speed and training accuracy. Focal Loss is a modification of the basic binary cross entropy loss function; the binary cross entropy loss function is shown in equation 1 below:
$$\mathrm{CE}(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases} \qquad (1)$$
wherein y takes the value 1 or -1, where 1 represents a positive sample and -1 represents a negative sample, and the probability p ranges from 0 to 1;
p is replaced by $p_t$, resulting in equation 2,
$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \qquad (2)$$
equation 1 can be rewritten as equation 3:
$$\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t) \qquad (3)$$
in order to control the weights of positive and negative samples and of hard-to-classify and easy-to-classify samples, Focal Loss adds a modulation factor in front of equation 3 to obtain equation 4:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \qquad (4)$$
wherein $\alpha_t$ is used to reduce the weight of the negative samples: $\alpha_t$ equals $\alpha$ when the label is equal to 1 and equals $1-\alpha$ otherwise, and $\alpha$ ranges from 0 to 1. Thus, the contribution of positive and negative samples to the loss can be controlled by setting the value of $\alpha$. The factor $(1-p_t)^{\gamma}$ controls the weight of hard-to-classify samples: if $p_t$ is large, that is, the probability of belonging to the sample's labeled class is large, then $1-p_t$ is small, and vice versa, so $\gamma$ can be set to control the contribution of hard-to-classify and easy-to-classify samples to the loss. $(1-p_t)^{\gamma}$ is called the modulation factor of Focal Loss, and $\alpha_t$ is a coefficient commonly used to control the weight of positive and negative samples. When $\gamma$ is 0, Focal Loss reduces to the ordinary ($\alpha_t$-weighted) binary cross entropy loss function, and as $\gamma$ increases, the relative weight of hard-to-classify samples gradually increases. Suitable $\alpha$ and $\gamma$ values can therefore be selected in use; in one embodiment, the empirical value 0.25 is used, which controls the weight of hard-to-classify samples well and yields a better training result.
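As a minimal sketch of equation 4 for a single sample, the following Python function evaluates the loss directly from its definition. The function name `focal_loss` is hypothetical, and the value gamma = 2.0 used in the usage lines is purely an illustrative assumption; the description names 0.25 as an empirical value but does not state a gamma value.

```python
import math


def focal_loss(p, y, alpha, gamma):
    """Binary focal loss of equation 4 for one sample.

    p:     predicted probability of the positive class, 0 < p < 1
    y:     label, 1 for a positive sample and -1 for a negative sample
    alpha: class-balancing weight in [0, 1]
    gamma: focusing parameter, gamma >= 0
    """
    p_t = p if y == 1 else 1.0 - p                 # equation 2
    alpha_t = alpha if y == 1 else 1.0 - alpha     # weight by label
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A well-classified positive (p = 0.9) contributes far less than a
# poorly classified one (p = 0.1); gamma = 2.0 is an assumed value.
easy = focal_loss(0.9, 1, alpha=0.25, gamma=2.0)
hard = focal_loss(0.1, 1, alpha=0.25, gamma=2.0)
```

With `gamma=0.0` the modulation factor equals 1 and the function returns the weighted cross entropy, which matches the limiting case described above.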
Further, the atypical class data set balancing method of the present invention also comprises:
obtaining a target image, and performing prediction classification on the target image based on the trained network model.
Specifically, a target image of the test set is obtained and input into the trained network model, which performs prediction classification on it. The method can be applied to various data sets requiring prediction classification, in particular the prediction classification of images of different disease types, and facilitates applying the prediction network model to the classification of various target images. When applied to skin cancer, it helps to obtain an excellent intelligent skin cancer prediction model with intelligent prediction, recognition and classification.
Using different classifier models, the invention verifies the effect of atypical class data balancing on the 20000 dermoscopic images of the ISIC2019 data set. Compared with models without atypical class data balancing, the sensitivity, F1 score, precision, specificity and accuracy of models with atypical class data balancing are greatly improved; the F1 score of GoogLeNet is improved by 12.7%, and the average accuracy is improved by about 5%. In the multi-classification task covering eight skin lesions such as melanoma and squamous cell carcinoma, the accuracy obtained with the atypical class data balancing method and the EfficientNet model reaches 82.4%, about 20% higher than that of the champion model of the latest ISIC2019 competition. This demonstrates the effectiveness of the intelligent skin disease recognition and classification strategy provided by the invention, and the atypical class data balancing strategy also offers a new approach for classification tasks on imbalanced data.
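The evaluation metrics named above follow standard definitions, which can be sketched as below for one class treated in a one-vs-rest fashion. The function names are hypothetical, and the description does not state how per-class values are averaged over the eight classes, so any averaging scheme built on top of this sketch would be an assumption.

```python
def one_vs_rest_counts(y_true, y_pred, cls):
    """Count TP, FP, TN, FN for class `cls` against all other classes."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == cls and t == cls:
            tp += 1
        elif p == cls and t != cls:
            fp += 1
        elif p != cls and t == cls:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(num, den):
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics computed from confusion-matrix counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),          # recall, true positive rate
        "specificity": _ratio(tn, tn + fp),          # true negative rate
        "precision": _ratio(tp, tp + fp),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }
```

A lower false negative count `fn` raises the sensitivity directly, which is the quantity the atypical class amplification is intended to improve.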
Example II
The invention also provides another embodiment, namely an atypical class data set balancing device, which comprises:
the preprocessing module is used for preprocessing the data set;
the first amplification module is used for amplifying unbalanced data sets of different categories to obtain a balanced data set;
the first training module is used for sending the balanced data set into the set network model for training to obtain an atypical class data set;
the second amplification module is used for carrying out atypical class data amplification on the atypical class data set to obtain an atypical class balanced data set;
and the second training module is used for inputting the typical class data set and the atypical class balanced data set into the network model for training to obtain a trained network model and a classification result.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
Example III
The invention also provides another embodiment, namely an electronic device comprising a processor 1 and a memory 2.
The memory 2 is used for storing a computer program.
The memory 2 includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, a USB flash drive, a memory card, or an optical disk.
The processor 1 is connected to the memory 2 and is configured to execute the computer program stored in the memory 2, so that the electronic device performs the above-mentioned atypical class data set balancing method.
Preferably, the processor 1 may be a central processing unit (CPU); it may also be an application specific integrated circuit (ASIC).
Example IV
The present invention also provides another embodiment, namely, a computer-readable storage medium storing a computer program executable by at least one processor to cause the at least one processor to perform the steps of the atypical class data set balancing method as described above.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the modules, units, and/or method steps of the various embodiments described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or components may be combined or integrated into another device or system, or some features may be omitted or not performed.
In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated units may be implemented in hardware or in software functional units.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A method of atypical class data set balancing, said method comprising the steps of:
S1, preprocessing a data set; wherein the data set comprises an acquired medical image set;
the preprocessing comprises: cropping the images of the image set according to the initial aspect ratio while retaining the lesion parts in the images;
S2, amplifying unbalanced data sets of different categories to obtain a balanced data set;
S3, sending the balanced data set into a set network model for training to obtain an atypical class data set; the set network model specifically comprises: MBConvBlock, MBConv, sepConv, depthwise separable convolution (DWConv), and an SE module;
S4, carrying out atypical class data amplification on the atypical class data set;
S5, feeding the typical class data set and the atypical class balanced data set obtained in step S4 back into the network model for training, and obtaining a trained network model and a classification result.
2. The atypical class data set balancing method of claim 1, wherein preprocessing the data set further comprises: random image enhancement processing.
3. The atypical class data set balancing method of claim 1, wherein the set network model specifically comprises:
the network model is EfficientNetB0;
and in the initial setting of the network model, the initial model weights of the EfficientNetB0 network are obtained by transfer learning from weights trained on natural images.
4. The method of atypical class data set balancing of claim 3,
the network model adopts Focal Loss and the binary cross entropy loss function in the training process, the binary cross entropy loss function being given by equation 1:
$$\mathrm{CE}(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases} \qquad (1)$$
wherein y takes the value 1 or -1, where 1 represents a positive sample and -1 represents a negative sample, and the probability p ranges from 0 to 1;
substituting $p_t$ for p gives equation 2:
$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \qquad (2)$$
equation 1 can be rewritten as equation 3:
$$\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t) \qquad (3)$$
in order to control the weights of positive and negative samples and of hard-to-classify and easy-to-classify samples, Focal Loss adds a modulation factor in front of equation 3 to obtain equation 4:
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \qquad (4)$$
wherein $\alpha_t$ is used to reduce the weight of the negative samples: $\alpha_t$ equals $\alpha$ when the label is equal to 1 and equals $1-\alpha$ otherwise, and $\alpha$ ranges from 0 to 1.
5. The atypical class data set balancing method of claim 1, wherein the data set amplification specifically comprises:
data amplification of the data set includes using one or more of histogram equalization, horizontal flipping, rotation by 30 degrees, 90 degrees, 150 degrees, 180 degrees, and random erasure.
6. The atypical class data set balancing method of claim 1, further comprising the steps of: and obtaining a target image, and carrying out prediction classification on the target image based on the trained network model.
7. An atypical class data set balancing apparatus, comprising in particular:
the preprocessing module is used for preprocessing the data set; wherein the dataset comprises an acquired medical image set;
the preprocessing comprises: cropping the images of the image set according to the initial aspect ratio while retaining the lesion parts in the images;
the first amplification module is used for amplifying unbalanced data sets of different categories to obtain balanced data sets;
the first training module is used for sending the balanced data set into the set network model for training to obtain an atypical class data set; the set network model specifically comprises: MBConvBlock, MBConv, sepConv, depthwise separable convolution (DWConv), and an SE module;
the second amplification module is used for carrying out atypical data amplification on the atypical data set to obtain an atypical balanced data set;
and the second training module is used for feeding the typical class data set and the atypical class balance data set back to the network model for training, and obtaining a trained network model and a trained classification result.
8. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising steps for performing the atypical class data set balancing method of any one of claims 1-6.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the atypical class data set balancing method according to any one of claims 1-6.
CN202210704826.0A 2022-06-21 2022-06-21 Atypical data set balancing method, atypical data set balancing device and atypical data set balancing medium Active CN114937288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210704826.0A CN114937288B (en) 2022-06-21 2022-06-21 Atypical data set balancing method, atypical data set balancing device and atypical data set balancing medium

Publications (2)

Publication Number Publication Date
CN114937288A CN114937288A (en) 2022-08-23
CN114937288B true CN114937288B (en) 2023-05-26

Family

ID=82867601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210704826.0A Active CN114937288B (en) 2022-06-21 2022-06-21 Atypical data set balancing method, atypical data set balancing device and atypical data set balancing medium

Country Status (1)

Country Link
CN (1) CN114937288B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612311A (en) * 2023-03-13 2023-08-18 浙江大学 Sample imbalance-oriented unqualified immunohistochemical image recognition system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108231201A (en) * 2018-01-25 2018-06-29 华中科技大学 A kind of construction method, system and the application of disease data analyzing and processing model
CN111476266A (en) * 2020-02-27 2020-07-31 武汉大学 Non-equilibrium type leukocyte classification method based on transfer learning
CN112766379A (en) * 2021-01-21 2021-05-07 中国科学技术大学 Data equalization method based on deep learning multi-weight loss function
CN112966767A (en) * 2021-03-19 2021-06-15 焦点科技股份有限公司 Data unbalanced processing method for separating feature extraction and classification tasks
US11087883B1 (en) * 2020-04-02 2021-08-10 Blue Eye Soft, Inc. Systems and methods for transfer-to-transfer learning-based training of a machine learning model for detecting medical conditions
CN113743332A (en) * 2021-09-08 2021-12-03 中国科学院自动化研究所 Image quality evaluation method and system based on universal vision pre-training model
CN114266717A (en) * 2020-09-25 2022-04-01 天津科技大学 Parallel capsule network cervical cancer cell detection method based on Inception module
WO2022109295A1 (en) * 2020-11-19 2022-05-27 Carnegie Mellon University System and method for detecting and classifying abnormal cells

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163234B (en) * 2018-10-10 2023-04-18 腾讯科技(深圳)有限公司 Model training method and device and storage medium
CN114529767A (en) * 2022-02-18 2022-05-24 厦门大学 Small sample SAR target identification method based on two-stage comparison learning framework

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
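Cervical cell image classification with convolutional neural networks; Zhao Yue et al.; Journal of Computer-Aided Design & Computer Graphics (No. 11); 2049-2054 *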

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant