CN113673575A - Data synthesis method, training method of image processing model and related device


Info

Publication number: CN113673575A
Authority: CN (China)
Prior art keywords: feature data, subset, feature, data, subsets
Legal status: Pending (assumed; not a legal conclusion)
Application number: CN202110845215.3A
Other languages: Chinese (zh)
Inventors: 杨永涛, 唐邦杰, 苏慧, 潘华东, 殷俊
Current Assignee: Zhejiang Dahua Technology Co Ltd
Original Assignee: Zhejiang Dahua Technology Co Ltd
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202110845215.3A
Publication of CN113673575A


Classifications

    • G06F 18/25: Pattern recognition; Analysing; Fusion techniques
    • G06F 18/211: Selection of the most significant subset of features
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures


Abstract

The invention provides a data synthesis method, a training method of an image processing model, and a related device. The data synthesis method includes: obtaining a first feature data set including a first number of feature data and a second feature data set including a second number of feature data, the first number being less than the second number; dividing each feature data of the first feature data set based on feature similarity to obtain a plurality of feature data subsets, where different feature data subsets have different similarities to reference feature data determined based on the first feature data set; and synthesizing, based on the second number, the feature data of each of the plurality of feature data subsets to obtain the extended feature data of each feature data subset. In this way, the distribution of minority class data and majority class data can be balanced.

Description

Data synthesis method, training method of image processing model and related device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data synthesis method, a training method for an image processing model, and a related apparatus.
Background
In image recognition, imbalance in the actual distribution of data leads to imbalance across the dimensions of the acquired data. Data imbalance is a frequently encountered problem. Imbalance between different attributes biases attribute recognition results toward the majority classes and away from the minority classes, while imbalance within a class can cause the loss of data dimensions, so that an algorithm performs poorly in certain dimensions; both effects degrade the generalization ability of the algorithm. However, the recognition performance of an algorithm is not determined by data volume alone: some easily classified samples achieve good results with only a small amount of data. In general, samples near the feature center are easier to distinguish, while samples near the feature boundary are more easily misclassified.
The inventors of the present application have found through long-term research and development that a common method of alleviating the data imbalance problem is resampling. However, randomly undersampling the majority class data tends to change the distribution of the original data and lose useful information in the samples, while randomly oversampling the minority class data tends to duplicate sample information and cause overfitting, which in turn reduces the generalization ability of the algorithm. Several methods exist for alleviating the data imbalance problem. The Synthetic Minority Oversampling Technique (SMOTE) expands the number of minority class samples by synthesizing new minority class samples, and is an oversampling algorithm improved on the basis of random sampling. Borderline SMOTE is an improved oversampling algorithm based on SMOTE that uses only the minority class samples on the boundary to synthesize new samples, thereby improving the class distribution of the samples. Adaptive synthetic sampling (ADASYN) assigns different weights to different minority class samples, thereby generating different numbers of samples for them.
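To make these prior-art techniques concrete, the interpolation step shared by SMOTE and its variants can be sketched as follows. This is only an illustrative sketch in Python, not code from the patent; the function name and the use of NumPy are assumptions.

```python
import numpy as np

def smote_interpolate(x, neighbors, n_new, rng=None):
    """Synthesize n_new new samples between a minority-class sample x
    and randomly chosen members of its minority-class neighbor set."""
    rng = rng or np.random.default_rng()
    new_samples = []
    for _ in range(n_new):
        xz = neighbors[rng.integers(len(neighbors))]  # random neighbor of x
        alpha = rng.random()                          # alpha drawn from [0, 1]
        new_samples.append(x + alpha * (xz - x))      # linear interpolation
    return np.array(new_samples)
```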
However, both the Borderline SMOTE method and the ADASYN method synthesize new samples using only boundary samples, ignoring the effect of the samples inside the boundary. In general, the closer a sample is to the feature center, the greater the sample density, while samples far from the center are sparser. Therefore, neither the Borderline SMOTE method nor the ADASYN method can balance the minority class data and the majority class data in every dimension, and the prior art needs to be improved.
Disclosure of Invention
The invention provides a data synthesis method, a training method of an image processing model and a related device.
In order to solve the above technical problems, a first technical solution provided by the present invention is: provided is a data synthesis method including: obtaining a first feature data set comprising a first amount of feature data and obtaining a second feature data set comprising a second amount of feature data, the first amount being less than the second amount; dividing each feature data of the first feature data set based on the feature similarity to obtain a plurality of feature data subsets; wherein the similarity of the different feature data subsets and the reference feature data is different, the reference feature data being determined based on the first feature data set; and synthesizing the feature data of each feature data subset in the plurality of feature data subsets based on the second quantity to obtain the expanded feature data of each feature data subset.
The synthesizing, based on the second number, of the feature data of each of the plurality of feature data subsets to obtain the extended feature data of each feature data subset includes: determining, based on the second number, a base expansion number corresponding to each feature data subset, where the base expansion number characterizes a reference value for the number of extended feature data of the corresponding feature data subset; and synthesizing the feature data in each feature data subset based on the base expansion number corresponding to that subset to obtain the extended feature data of each feature data subset.

The determining, based on the second number, of the base expansion number corresponding to each feature data subset includes: determining a feature density corresponding to each feature data subset, and determining, based on the second number, an expansion number corresponding to the first feature data set, where the expansion number characterizes a reference value for the number of extended feature data of the first feature data set; and determining the base expansion number corresponding to each feature data subset based on the feature density and the expansion number.

The step of determining the feature density corresponding to each feature data subset includes: determining a third number of feature data in each feature data subset, and determining the density of the feature data in each feature data subset, the density being determined based on the neighborhood features of the feature data in each subset; and determining the feature density corresponding to each feature data subset based on the third number and the density.

The step of synthesizing, based on the second number, the feature data of each of the plurality of feature data subsets to obtain the extended feature data of each feature data subset includes: weighting the base expansion number corresponding to each feature data subset to determine a final expansion number corresponding to each subset, where the final expansion number is the determined value of the number of extended feature data of the corresponding subset; and synthesizing the feature data of each feature data subset based on the final expansion number and the neighborhood features of the feature data of each subset to obtain the extended feature data of each subset.

The step of dividing each feature data of the first feature data set based on feature similarity to obtain the plurality of feature data subsets includes: traversing all feature data in the first feature data set to obtain the feature data whose similarity to the reference feature data is less than a first threshold, obtaining a first feature data subset; traversing the remaining feature data other than the first feature data subset to obtain the feature data whose similarity to the reference feature data is greater than the first threshold and less than a second threshold, obtaining a second feature data subset; and traversing the remaining feature data other than the first and second feature data subsets to obtain the feature data whose similarity to the reference feature data is greater than the second threshold, obtaining a third feature data subset. The similarity of the third feature data subset to the reference feature data is greater than that of the second feature data subset, and the similarity of the second feature data subset to the reference feature data is greater than that of the first feature data subset.
The steps of obtaining a first feature data set including a first number of feature data and obtaining a second feature data set including a second number of feature data include: acquiring a feature data set; and classifying the feature data set according to the dimension of each feature data in the set to obtain the first feature data set and the second feature data set.
In order to solve the above technical problems, a second technical solution provided by the present invention is a data synthesis apparatus, including: an acquisition module for acquiring a first feature data set including a first number of feature data and a second feature data set including a second number of feature data, the first number being less than the second number; a sampling module for dividing each feature data of the first feature data set based on feature similarity to obtain a plurality of feature data subsets, where different feature data subsets have different similarities to reference feature data determined based on the first feature data set; and a synthesis module for synthesizing, based on the second number, the feature data of each of the plurality of feature data subsets to obtain the extended feature data of each feature data subset.

In order to solve the above technical problems, a third technical solution provided by the present invention is a training method for an image processing model, including: acquiring a training sample set, where the training sample set includes extended feature data obtained by any one of the methods above; and training an image processing model to be trained using the training sample set to obtain a trained image processing model.

In order to solve the above technical problems, a fourth technical solution provided by the present invention is a training apparatus for an image processing model, including: a sample set acquisition module for acquiring a training sample set, where the training sample set includes extended feature data obtained by any one of the methods above; and a training module for training an image processing model to be trained using the training sample set to obtain a trained image processing model.

In order to solve the above technical problems, a fifth technical solution provided by the present invention is an electronic device, including a memory storing program instructions and a processor that retrieves the program instructions from the memory to perform any one of the methods above.

In order to solve the above technical problems, a sixth technical solution provided by the present invention is a computer-readable storage medium storing a program file that can be executed to implement any one of the methods above.
The data synthesis method of the present invention has beneficial effects that distinguish it from the prior art: each feature data of the first feature data set is divided based on feature similarity to obtain a plurality of feature data subsets, where different subsets have different similarities to reference feature data determined based on the first feature data set, and the feature data of each of the plurality of feature data subsets are synthesized based on the second number to obtain the extended feature data of each subset, the first number of feature data in the first feature data set being less than the second number. The method can thus balance the distribution of minority class data and majority class data.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort:
FIG. 1 is a schematic flow chart of a first embodiment of a data synthesis method according to the present invention;
FIG. 2 is a schematic flow chart of hierarchical sampling according to the present invention;
FIG. 3 is a flowchart illustrating an embodiment of step S13 in FIG. 1;
FIG. 4 is a schematic structural diagram of a data synthesis apparatus according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating an embodiment of a method for training an image processing model according to the present invention;
FIG. 6 is a schematic structural diagram of an embodiment of an apparatus for training an image processing model according to the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the invention;
FIG. 8 is a structural diagram of an embodiment of a computer-readable storage medium according to the invention.
Detailed Description
The prior art provides a study of unbalanced-data resampling based on data screening, which proposes a mixed sampling method based on safe-sample screening to address the information loss that undersampling easily causes on unbalanced data. The method combines an undersampling method and an oversampling method built on safe-sample screening: it screens the unbalanced data for safe samples, deletes part of the data in both the majority class and the minority class samples while retaining the important samples that are valuable for determining the classification boundary, and then performs oversampling to generate new data. The main disadvantages of this method are that the safe-sample screening is computationally complex, so data processing efficiency is low, and that down-sampling and up-sampling the majority classes easily introduce noise, distorting the information of the original sample set and harming the authenticity and accuracy of the data.

The prior art also proposes a density-based SMOTE study, which puts forward a new oversampling method, DS-SMOTE. The DS-SMOTE algorithm identifies sparse samples based on sample density and uses them as seed samples during sampling; it then applies the idea of the SMOTE algorithm to generate synthetic samples between each seed sample and its K neighboring samples. The main disadvantage of this method is that every sample in the sparse set generates the same number of new samples, without considering how much the sparsity of individual samples differs; this processing leaves the relative sparsity among the sparse samples unchanged, whereas generating more new samples for the sparser samples would balance the amount of sample data in each feature region.

The prior art further provides an unbalanced data set processing method and system based on an improved SMOTE algorithm. Its key design points are: first, the center of gravity of the minority class samples is calculated; second, the center-of-gravity points of small minority regions are constructed; then random linear interpolation is performed between the minority class samples and each sample of the set M, and the newly synthesized minority class samples are added to the data set; finally, the imbalance rate of the new data set is checked, and if it is too small the above steps are repeated, otherwise the procedure stops. The main disadvantage of this method is that random linear interpolation is performed between the minority class samples and every sample in the set, so all samples participate in synthesizing new samples and no operation targets specific data features; subsequently training a model with the synthesized new samples leads to a model with weak generalization ability.

The prior art also includes a sample-weighting method that not only assigns different class weights to samples of different classes, but also applies weights at the sample level to account for differences among data samples within the same class, so that samples of the same class receive different weights. On one hand this amplifies the effect of classes with few samples; on the other hand it reduces the gap between classes with few samples and classes with many samples, effectively alleviating, as a whole, the model imbalance caused by unbalanced data distribution. In addition, this design dynamically sets the weight of each sample according to its output during network training rather than assigning every sample the same weight, thereby better adjusting each sample's contribution to the network and mitigating the influence of unbalanced sample distribution on model performance. The main disadvantage of this method is that the sample weights are adjusted too frequently during training, which easily prevents the algorithm from converging.

In addition, in the prior art, both the Borderline SMOTE method and the ADASYN method synthesize new samples using only boundary samples, ignoring the effect of the samples inside the boundary. In general, the closer a sample is to the feature center, the greater the sample density, while samples far from the center are sparser. Therefore, neither the Borderline SMOTE method nor the ADASYN method can balance the minority class data and the majority class data in every dimension.
The present invention provides a data synthesis method that performs hierarchical upsampling when the minority class data are upsampled, so that many new samples are generated at positions far from the feature center and few new samples are generated at positions close to it. In this way the information of the minority class samples can be fully utilized to generate new samples, achieving the purpose of balancing the data of all dimensions within a class.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to FIG. 1, which is a schematic flowchart of a first embodiment of the data synthesis method of the present invention, the method specifically includes:
step S11: a first feature data set comprising a first amount of feature data is acquired, and a second feature data set comprising a second amount of feature data is acquired.
Specifically, a feature data set is obtained, and the feature data set is classified according to the dimension of each feature data in the set to obtain a first feature data set and a second feature data set. The first feature data set includes a first number of feature data and the second feature data set includes a second number of feature data, the first number being less than the second number. That is, the first feature data set is the minority class data set relative to the second feature data set.
Specifically, a multi-label recognition task for person attributes involves multiple attributes, that is, each sample carries labels for several attributes. Besides the imbalance between majority and minority classes, there is also imbalance in each dimension within a class, and each dimension has its own data distribution and feature boundary. Suppose the training set contains far more male samples than female samples for the gender attribute, but the distributions of male and female samples differ across age groups, as follows:
TABLE 1. Data distribution of the gender attribute across age groups
(Table 1 is provided as an image in the original publication; the per-age-group sample counts are not reproduced here.)
In Table 1, for the gender attribute as a whole, males are the majority class and females the minority class, but the distribution of the gender attribute differs across age groups. In the child age group, males are the minority class and females the majority class, the opposite of the overall distribution; if the minority class were upsampled according to the overall distribution rule, the amount of girl data would be increased, aggravating the male-female imbalance within the child data. The elderly age group is balanced between male and female data and needs no upsampling. Therefore, the attribute data must be separated by dimension, and new samples then synthesized for the minority class of each dimension.
Specifically, if the attribute of the data to be processed is age, the data may be classified into children, adults, and the elderly according to age; after classification, the majority class data (the second feature data set) and the minority class data (the first feature data set) of each age group are obtained. For another example, if the attribute of the data to be processed is gender (an attribute being a dimension), the data may be classified into male and female according to gender; after classification, the majority class data (the second feature data set) and the minority class data (the first feature data set) of each gender are obtained. In this way, the data of all dimensions can be balanced; a sketch of such a per-dimension split follows.
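As an illustration of this per-dimension separation (not part of the patent text), a labeled sample set might be split into per-group minority and majority subsets as follows; the dictionary layout and helper name are assumptions.

```python
from collections import defaultdict

def split_by_dimension(samples, dim_key, cls_key):
    """Group samples by a dimension (e.g. age group), then split each
    group into minority / majority sets by a class label (e.g. gender)."""
    groups = defaultdict(lambda: defaultdict(list))
    for s in samples:
        groups[s[dim_key]][s[cls_key]].append(s)
    result = {}
    for dim, by_cls in groups.items():
        # the smallest class is the minority (first feature data set)
        ordered = sorted(by_cls.values(), key=len)
        result[dim] = (ordered[0], ordered[-1])  # (minority, majority)
    return result

# Usage: per age group, get (minority, majority) gender subsets.
samples = [{"age": "child", "gender": "male", "feat": [0.1, 0.2]},
           {"age": "child", "gender": "female", "feat": [0.3, 0.1]},
           {"age": "child", "gender": "female", "feat": [0.2, 0.4]}]
splits = split_by_dimension(samples, "age", "gender")
```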
Step S12: and dividing each feature data of the first feature data set based on the feature similarity to obtain a plurality of feature data subsets.
Specifically, each feature data of the first feature data set is divided based on the feature similarity of the feature data in the first feature data set to obtain a plurality of feature data subsets. Different feature data subsets have different similarities to the reference feature data, which is determined based on the first feature data set. In a specific embodiment, the reference feature data is the central feature of the first feature data set; specifically, the central feature is determined by calculation based on the distances between the feature data of the first feature data set, as in the sketch below.
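For instance, one way to obtain such a central feature from pairwise distances is to take the medoid, i.e., the feature vector whose total distance to all other features is smallest. The patent does not fix a particular formula, so the following is only a hedged sketch.

```python
import numpy as np

def central_feature(features):
    """Return the medoid of an (n, d) feature array: the feature vector
    minimizing the sum of Euclidean distances to all other features."""
    diffs = features[:, None, :] - features[None, :, :]   # (n, n, d)
    dists = np.linalg.norm(diffs, axis=-1)                # pairwise distances
    return features[np.argmin(dists.sum(axis=1))]
```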
The data synthesis method of this embodiment samples each feature data of the first feature data set by a hierarchical sampling method to obtain a plurality of feature data subsets, each subset including at least one feature data. In one embodiment, upsampling the first feature data set with the hierarchical sampling method makes full use of the minority class feature information to generate new feature data.

In general, the closer to the feature center of the first feature data set, the denser the features; the farther from the center, the sparser. In this embodiment, the first feature data set is sampled layer by layer in an order that gradually approaches its feature center, obtaining the plurality of feature data subsets.
Specifically, in an embodiment, all the feature data in the first feature data set are traversed to obtain the feature data whose similarity to the reference feature data is less than a first threshold, obtaining a first feature data subset. The remaining feature data in the first feature data set, other than the first feature data subset, are then traversed to obtain the feature data whose similarity to the reference feature data is greater than the first threshold and less than a second threshold, obtaining a second feature data subset. The remaining feature data, other than the first and second feature data subsets, are traversed to obtain the feature data whose similarity to the reference feature data is greater than the second threshold, obtaining a third feature data subset. The similarity of the third feature data subset to the reference feature data is greater than that of the second feature data subset, and the similarity of the second feature data subset to the reference feature data is greater than that of the first feature data subset.
After traversing the first feature data set to obtain a first feature data subset, traversing the remaining feature data in the first feature data set except the feature data in the first feature data subset to obtain feature data adjacent to the feature data in the first feature data subset, so as to obtain a second feature data subset. Specifically, the feature data adjacent to the feature data is feature data having a high similarity to the feature data. For example, feature distances between two feature data in the first feature data set may be calculated, and feature data of K neighbors of one feature data may be obtained based on the feature distances. That is, the remaining feature data in the first feature data set except the feature data in the first feature data subset are traversed to obtain K neighboring feature data of each feature data in the first feature data subset, and the K neighboring feature data of each feature data form a second feature data subset.
Furthermore, the remaining feature data in the first feature data set, other than those in the first and second feature data subsets, can be traversed to obtain the feature data adjacent to the feature data in the second feature data subset, yielding a third feature data subset. For example, the remaining feature data are traversed to obtain the K neighboring feature data of each feature data in the second feature data subset, and these K neighbors of each feature data form the third feature data subset. This process is repeated until the traversal reaches the center position of the first feature data set, that is, the reference feature data, finally yielding a first feature data subset, a second feature data subset, a third feature data subset, and so on.
In a practical embodiment, a neighborhood centered on the feature center M with radius P is set in the first feature data set; specifically, as shown in FIG. 2, the circle centered on M with radius P is taken as the boundary of the neighborhood. If the distance from a feature data to the feature center M is less than P, the feature data is inside the boundary; if the distance is greater than P, it is outside the boundary. Note that P should be taken as small as possible.
In the first sampling, the feature data that differ from the preset category of the first feature data set are found in the first feature data set, yielding a first feature data subset I. In the second sampling, the feature data adjacent to those in the first feature data subset I are found among the remaining feature data of the first feature data set outside subset I, yielding a second feature data subset II. In the third sampling, the feature data adjacent to those in the second feature data subset are found among the remaining feature data outside subsets I and II, yielding a third feature data subset III. In the fourth sampling, the feature data adjacent to those in the third feature data subset are found among the remaining feature data outside subsets I, II and III, yielding a fourth feature data subset IV. In the fifth sampling, the feature data adjacent to those in the fourth feature data subset are found among the remaining feature data outside subsets I, II, III and IV, yielding a fifth feature data subset V; sampling continues in this way until the boundary of the circle centered on M with radius P is reached. If the last feature data sampled form the Nth feature data subset, the first feature data set is divided into N layers by the above process. As shown in FIG. 2, the first feature data set is divided into 5 layers, and the feature data subset corresponding to each layer is obtained; a sketch of this layered sampling follows.
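A minimal sketch of this layer-by-layer partition, assuming Euclidean K-nearest-neighbor adjacency within the minority set; seeding the outermost layer with the samples farthest from the center M is a simplification of the boundary test described above.

```python
import numpy as np

def layer_partition(features, center, k=5, seed_count=10):
    """Partition minority features into layers, outermost layer first.

    The outermost layer is seeded with the samples farthest from the
    feature center M (a simplification); each subsequent layer consists
    of the not-yet-assigned K nearest neighbors of the previous layer's
    samples, moving inward toward the center.
    """
    dists = np.linalg.norm(features - center, axis=1)
    remaining = set(range(len(features)))
    layer = set(np.argsort(dists)[-seed_count:])           # outermost seeds
    layers = []
    while layer:
        layers.append(sorted(layer))
        remaining -= layer
        nxt = set()
        rem = sorted(remaining)
        for i in layer:
            if not rem:
                break
            d = np.linalg.norm(features[rem] - features[i], axis=1)
            nxt.update(rem[j] for j in np.argsort(d)[:k])  # K unassigned neighbors
        layer = nxt
    return layers  # list of index lists, one per layer
```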
Step S13: and synthesizing the feature data of each feature data subset in the plurality of feature data subsets based on the second quantity to obtain the expanded feature data of each feature data subset.
Specifically, after obtaining the feature data subsets corresponding to each layer by sampling, the feature data of each feature data subset in the plurality of feature data subsets needs to be synthesized based on the second quantity, so as to obtain the extended feature data of each feature data subset.
In one embodiment, as shown in fig. 3, step S13 includes:
step S31: and determining the number of basic extensions corresponding to each characteristic data subset based on the second number.
Wherein the base extension number characterizes a number reference value of the extended feature data for the corresponding feature data subset.
The final objective of the present application is to balance the number of feature data in the first feature data set with the number of feature data in the second feature data set, so that the number of base extensions corresponding to each feature data subset is determined based on the second number of feature data in the second feature data set.
Specifically, in one embodiment, the expansion number corresponding to the first feature data set is determined based on the second number. The expansion number characterizes a reference value for the number of extended feature data of the first feature data set. Specifically, the expansion number corresponding to the first feature data set is determined by the following formula (1):

$$G = (m_l - m_s) \times \beta \qquad (1)$$

where $m_l$ is the second number, $m_s$ is the first number, $\beta \in [0, 1]$, and $G$ is the expansion number corresponding to the first feature data set.
Further, the feature density corresponding to each feature data subset is determined. In one embodiment, a third number of feature data in each feature data subset is determined, and the density of the feature data in each subset is determined; the density is determined based on the neighborhood features of the feature data in each subset. The feature density corresponding to each feature data subset is then determined based on the third number and the density.

Specifically, the feature density corresponding to a feature data subset is determined by the following formula (2):

$$d_{x_k} = \frac{\Delta_{x_k}}{K}, \qquad D = \frac{1}{m} \sum_{k=1}^{m} d_{x_k} \qquad (2)$$

where $\Delta_{x_k}$ is the number of feature data of the same class among the $K$ neighborhood features of the feature data $x_k$, $d_{x_k}$ is the density of $x_k$, $D$ is the feature density corresponding to the feature data subset, and $m$ is the third number.
Based on the feature density $D_i$ and the expansion number $G$, the base expansion number corresponding to each feature data subset is determined. Specifically, the base expansion number is determined by the following formula (3), which allocates the expansion number $G$ in inverse proportion to the feature density, so that sparser feature data subsets receive more extended feature data:

$$G_i = G \times \frac{1 / D_i}{\sum_{j=1}^{N} 1 / D_j} \qquad (3)$$

where $D_i$ is the feature density of the $i$th feature data subset, $G_i$ is the base expansion number corresponding to the $i$th feature data subset, and $N$ is the number of feature data subsets.
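A hedged sketch of these two steps follows. The exact functional forms of formulas (2) and (3) are rendered as images in the source, so the inverse-density normalization below is one plausible reading, consistent with the stated goal that sparser subsets generate more new samples; the function names, and the convention that `labels` is a NumPy array, are assumptions.

```python
import numpy as np

def subset_density(subset, all_feats, labels, minority_label, k=5):
    """Formula (2) sketch: for each feature x in the subset, the density
    is the fraction of same-class (minority) points among its K nearest
    neighbors in the full data set; the subset's feature density is the
    average over its m members (m being the third number)."""
    densities = []
    for x in subset:
        d = np.linalg.norm(all_feats - x, axis=1)
        nn = np.argsort(d)[1:k + 1]                  # skip x itself
        densities.append(np.sum(labels[nn] == minority_label) / k)
    return float(np.mean(densities))

def allocate_base_expansion(G, subset_densities):
    """Formula (3) sketch: split the total expansion number G across the
    N subsets in inverse proportion to feature density, so that sparser
    subsets (far from the feature center) receive more extended data."""
    inv = 1.0 / np.maximum(np.asarray(subset_densities, dtype=float), 1e-6)
    return np.rint(G * inv / inv.sum()).astype(int)
```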
Step S32: and synthesizing the feature data in each feature data subset based on the basic expansion number corresponding to each feature data subset to obtain the expanded feature data of each feature data subset.
In one embodiment, after the base expansion number corresponding to each feature data subset is obtained, the feature data in each of the plurality of feature data subsets are synthesized based on the base expansion number $G_i$ corresponding to that subset, obtaining the extended feature data of each feature data subset.
In a specific embodiment of the present application, a weighting process is further performed on the basic expansion number corresponding to each feature data subset, and a final expansion number corresponding to each feature data subset is determined, where the final expansion number is a value determined for the number of expanded feature data of the corresponding feature data subset; and synthesizing the feature data of each feature data subset based on the final expansion quantity and the neighborhood features of the feature data of each feature data subset to obtain the expansion feature data of each feature data subset.
Specifically, the proportion of the second feature data set among the neighborhood features of the $i$th feature data of the first feature data set is calculated as follows:

$$r_i = \frac{\Delta_i}{K}$$

where $\Delta_i$ is the number of feature data of the second feature data set among the neighborhood features of the $i$th feature data of the first feature data set, and $K$ is the number of feature data among those neighborhood features.
The proportion $r_i$ of the second feature data set among the neighborhood features of the $i$th feature data of the first feature data set is then normalized. Specifically, the ratio of this proportion to the sum of the proportions over all the feature data is calculated as follows:

$$\hat{r}_i = \frac{r_i}{\sum_{j=1}^{m_s} r_j}$$
weighting the basic expansion quantity Gi corresponding to the ith characteristic data subset by using the ratio result to obtain the final expansion quantity Gi corresponding to the ith characteristic data subset:
Figure BDA0003180591690000133
after weighting, synthesizing the feature data of each feature data subset based on the final expansion number and the neighborhood features of the feature data of each feature data subset to obtain the expansion feature data of each feature data subset. Specifically, the extended feature data of each feature data subset is calculated by using the following formula (4):
Figure BDA0003180591690000134
wherein x isziIs the characteristic data xiOne of gi feature data randomly selected from neighborhood features, alpha ∈ [0,1 ]]。
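Putting the weighting and synthesis steps together, here is a hedged end-to-end sketch for one feature data subset; formula (4) is the standard SMOTE-style linear interpolation, while the argument layout and the neighbor bookkeeping are assumptions.

```python
import numpy as np

def synthesize_subset(subset, minority, all_feats, labels, minority_label,
                      G_i, k=5, rng=None):
    """Weight the base expansion number G_i by the normalized majority
    ratio, then synthesize extended feature data via formula (4)."""
    rng = rng or np.random.default_rng()
    # r_i: fraction of majority-class points among each sample's K neighbors
    r = []
    for x in subset:
        d = np.linalg.norm(all_feats - x, axis=1)
        nn = np.argsort(d)[1:k + 1]
        r.append(np.sum(labels[nn] != minority_label) / k)
    r = np.asarray(r)
    r_hat = r / r.sum() if r.sum() > 0 else np.full(len(subset), 1.0 / len(subset))

    new_samples = []
    for x, w in zip(subset, r_hat):
        g_i = int(round(w * G_i))                     # final expansion number g_i
        d = np.linalg.norm(minority - x, axis=1)
        nbrs = minority[np.argsort(d)[1:k + 1]]       # K minority neighbors of x
        for _ in range(g_i):
            xz = nbrs[rng.integers(len(nbrs))]        # randomly chosen x_zi
            alpha = rng.random()                      # alpha in [0, 1]
            new_samples.append(x + alpha * (xz - x))  # formula (4)
    return np.asarray(new_samples)
```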
Because the feature data close to the boundary are sparse, the method of this embodiment synthesizes more new data there, so that the feature data near the boundary increase substantially; the distribution of the original data set is thereby reshaped, and the overall data distribution becomes more balanced.
In summary, the method of the present application classifies the feature data set by dimension, then divides the minority class data, i.e., the first feature data set, into a plurality of feature data subsets, and determines the number of extended feature data to be synthesized for each subset based on its density. Among the resulting extended feature data, the subsets far from the feature center generate more extended feature data and the subsets near the feature center generate fewer, so the feature data of all dimensions can be balanced. More extended feature data are synthesized near the boundary, so the feature data close to the boundary increase substantially, reshaping the distribution of the original data set and making the data distribution more balanced.
Fig. 4 is a schematic structural diagram of a data synthesis apparatus according to an embodiment of the present invention, which specifically includes: an acquisition module 41, a sampling module 42 and a synthesis module 43.
The obtaining module 41 is configured to obtain a first feature data set including a first amount of feature data, and obtain a second feature data set including a second amount of feature data.
Specifically, the obtaining module 41 obtains the feature data sets, and classifies the feature data sets according to the dimension of each feature data in the feature data sets to obtain a first feature data set and a second feature data set. Wherein the first feature data set comprises a first amount of feature data and the second feature data set comprises a second amount of feature data, the first amount being less than the second amount. I.e. the first characteristic data set is a minority class data set with respect to the second characteristic data set.
The sampling module 42 is configured to divide each feature data of the first feature data set based on the feature similarity, so as to obtain a plurality of feature data subsets.
Specifically, the sampling module 42 divides each feature data of the first feature data set based on the feature similarity of the feature data in the first feature data set to obtain a plurality of feature data subsets. Different feature data subsets have different similarities to the reference feature data, which is determined based on the first feature data set. In a specific embodiment, the reference feature data is the central feature of the first feature data set, determined by calculation based on the distances between the feature data of the first feature data set.

In this embodiment, the sampling module 42 samples each feature data of the first feature data set by a hierarchical sampling method to obtain a plurality of feature data subsets, each subset including at least one feature data. In one embodiment, upsampling the first feature data set with the hierarchical sampling method makes full use of the minority class feature information to generate new feature data.

In general, the closer to the feature center of the first feature data set, the denser the features; the farther from the center, the sparser. In this embodiment, the first feature data set is sampled layer by layer in an order that gradually approaches its feature center, obtaining the plurality of feature data subsets.

Specifically, in an embodiment, all the feature data in the first feature data set are traversed to obtain the feature data whose similarity to the reference feature data is less than a first threshold, obtaining a first feature data subset. The remaining feature data, other than the first feature data subset, are traversed to obtain the feature data whose similarity to the reference feature data is greater than the first threshold and less than a second threshold, obtaining a second feature data subset. The remaining feature data, other than the first and second feature data subsets, are traversed to obtain the feature data whose similarity to the reference feature data is greater than the second threshold, obtaining a third feature data subset. The similarity of the third feature data subset to the reference feature data is greater than that of the second feature data subset, and the similarity of the second feature data subset to the reference feature data is greater than that of the first feature data subset.
The synthesis module 43 is configured to synthesize, based on the second number, the feature data of each of the feature data subsets to obtain the extended feature data of each feature data subset.

In one embodiment, the synthesis module 43 determines, based on the second number, the base expansion number corresponding to each feature data subset; the base expansion number characterizes a reference value for the number of extended feature data of the corresponding feature data subset. The feature data in each feature data subset are then synthesized based on the base expansion number corresponding to that subset to obtain the extended feature data of each subset.

In one embodiment, the synthesis module 43 is configured to determine a feature density corresponding to each feature data subset, and to determine, based on the second number, an expansion number corresponding to the first feature data set; the expansion number characterizes a reference value for the number of extended feature data of the first feature data set. The base expansion number corresponding to each feature data subset is then determined based on the feature density and the expansion number.

In one embodiment, the synthesis module 43 is configured to determine a third number of feature data in each feature data subset, and to determine the density of the feature data in each subset, the density being determined based on the neighborhood features of the feature data in each subset. The feature density corresponding to each feature data subset is then determined based on the third number and the density.

In an embodiment, the synthesis module 43 is configured to weight the base expansion number corresponding to each feature data subset and determine the final expansion number corresponding to each subset, where the final expansion number is the determined value of the number of extended feature data of the corresponding subset; and to synthesize the feature data of each feature data subset based on the final expansion number and the neighborhood features of the feature data of each subset to obtain the extended feature data of each subset.
Referring to FIG. 5, which is a schematic flowchart of an embodiment of the training method of the image processing model of the present invention, the method specifically includes:
step S51: a training sample set is obtained.
Step S52: and training the image processing model to be trained by utilizing the training sample set to obtain the trained image processing model.
Specifically, in this embodiment, the image processing model to be trained is trained using the extended feature data obtained by the data synthesis method shown in FIGS. 1-3. The synthesized sample set obtained by that method can be used to fine-tune the image processing model to be trained, so that the feature distribution of the new feature data again presents a dense center and a sparse boundary: the boundary features move toward the feature center, and the feature data close to the feature center show no tendency to diffuse toward the boundary. This achieves the purpose of shrinking the feature boundary, and the trained image processing model classifies attributes better. A minimal fine-tuning sketch follows.
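As an illustration only, fine-tuning a classifier on the original features augmented with the extended feature data might look like the following sketch; PyTorch is an assumed choice here, and the function and parameter names are hypothetical.

```python
import torch
from torch import nn

def finetune(model, feats, labels, extended_feats, extended_labels,
             epochs=5, lr=1e-4):
    """Fine-tune a feature-classification model on the original features
    plus the synthesized extended feature data."""
    x = torch.cat([torch.as_tensor(feats, dtype=torch.float32),
                   torch.as_tensor(extended_feats, dtype=torch.float32)])
    y = torch.cat([torch.as_tensor(labels), torch.as_tensor(extended_labels)])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)  # full-batch pass, for brevity
        loss.backward()
        opt.step()
    return model
```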
Referring to FIG. 6, which is a schematic structural diagram of an embodiment of the training apparatus for an image processing model of the present invention, the apparatus specifically includes: a sample set acquisition module 61 and a training module 62.
Wherein the sample set acquiring module 61 is used for acquiring a training sample set. The training sample set is the extended feature data obtained by the data synthesis method shown in fig. 1 to 3.
The training module 62 is configured to train the image processing model to be trained using the training sample set to obtain the trained image processing model. Specifically, the training module 62 is configured to fine-tune the image processing model to be trained using the synthesized sample set obtained by the data synthesis method shown in FIGS. 1-3, so that the feature distribution of the new feature data again presents a dense center and a sparse boundary: the boundary features move toward the feature center, and the feature data close to the feature center show no tendency to diffuse toward the boundary. This achieves the purpose of shrinking the feature boundary, and the trained image processing model classifies attributes better.
Referring to fig. 7, a schematic structural diagram of an electronic device according to an embodiment of the present invention is shown, where the electronic device includes a memory 202 and a processor 201 that are connected to each other.
The memory 202 is used to store program instructions implementing the method of any of the above.
The processor 201 is used to execute program instructions stored by the memory 202.
The processor 201 may also be referred to as a Central Processing Unit (CPU). The processor 201 may be an integrated circuit chip having signal processing capabilities. The processor 201 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 202 may be a memory bank, a TF card, or the like, and may store all information in the electronic device, including the input raw data, the computer program, intermediate operation results, and final operation results. It stores and retrieves information at the locations specified by the controller; with the memory, the electronic device can retain information and operate normally. The memories of electronic devices are classified by purpose into main memory (internal memory) and auxiliary memory (external memory). The external memory is usually a magnetic medium, an optical disk, or the like, and can store information for a long time. The internal memory refers to the storage components on the main board, which hold the data and the programs currently being executed; it only stores programs and data temporarily, and the contents are lost when the power is turned off.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a system server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method of the embodiments of the present application.
Please refer to FIG. 8, which is a schematic structural diagram of a computer-readable storage medium according to the present invention. The storage medium of the present application stores a program file 203 capable of implementing all the methods described above. The program file 203 may be stored in the storage medium in the form of a software product and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, or terminal devices such as computers, servers, mobile phones, and tablets.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (12)

1. A method of synthesizing data, comprising:
obtaining a first feature data set comprising a first amount of feature data, and obtaining a second feature data set comprising a second amount of feature data, the first amount being less than the second amount;
dividing each feature data of the first feature data set based on the feature similarity to obtain a plurality of feature data subsets; wherein the similarity of different subsets of feature data and reference feature data is different, the reference feature data being determined based on the first feature data set;
and synthesizing the feature data of each feature data subset in the plurality of feature data subsets based on the second quantity to obtain the extended feature data of each feature data subset.
2. The method according to claim 1, wherein synthesizing the feature data of each of the plurality of feature data subsets based on the second number to obtain the extended feature data of each feature data subset comprises:
determining a base expansion number corresponding to each feature data subset based on the second number, the base expansion number characterizing a reference value for the number of extended feature data of the corresponding feature data subset;
and synthesizing the feature data in each feature data subset based on the base expansion number corresponding to that subset to obtain the extended feature data of each feature data subset.
3. The method of claim 2, wherein determining the base expansion number corresponding to each feature data subset based on the second number comprises:
determining a feature density corresponding to each feature data subset, and determining an expansion number corresponding to the first feature data set based on the second number, the expansion number characterizing a reference value for the number of extended feature data of the first feature data set;
and determining the base expansion number corresponding to each feature data subset based on the feature density and the expansion number.
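Claims 2 and 3 size each subset's share of the overall budget from a per-subset feature density. How the density enters the allocation is not fixed by the claims; the sketch below assumes the common density-aware convention that sparser subsets receive a larger share, and takes the density function as a parameter (one possible density is sketched under claim 4). Names are illustrative.

```python
import numpy as np

def base_expansion_numbers(subsets, second_number, density_fn):
    """Sketch of claims 2-3: the expansion number for the whole first
    set is its gap to the second (majority) count; each subset's base
    expansion number is a density-weighted share of that gap."""
    first_number = sum(len(s) for s in subsets)
    expansion_number = max(second_number - first_number, 0)
    densities = np.array([density_fn(s) for s in subsets], dtype=float)
    weights = 1.0 / (densities + 1e-9)     # assumption: sparser subsets get a larger share
    weights /= weights.sum()
    return np.rint(weights * expansion_number).astype(int)
```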
4. The method of claim 3, wherein determining the feature density corresponding to each feature data subset comprises:
determining a third number of feature data in the respective feature data subset, and determining a density of the feature data in the respective subset, the density being determined based on neighborhood features of the feature data in that subset;
and determining the feature density corresponding to each feature data subset based on the third number and the density.
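Claim 4 builds the feature density from the subset's size (the "third number") and a neighborhood-based density of its points. Using the mean k-nearest-neighbor distance as the density proxy is one plausible reading, assumed in this sketch.

```python
import numpy as np

def feature_density(subset: np.ndarray, k: int = 5) -> float:
    """Sketch of claim 4: subset size combined with a neighborhood
    density; mean k-NN distance is an assumed density proxy."""
    n = len(subset)
    k = min(k, n - 1)
    if k < 1:
        return float(n)
    dists = np.linalg.norm(subset[:, None, :] - subset[None, :, :], axis=-1)
    dists.sort(axis=1)                     # column 0 is the self-distance (0)
    mean_knn = dists[:, 1:k + 1].mean()
    return n / (mean_knn + 1e-9)           # tighter neighborhoods -> denser
```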
5. The data synthesis method according to claim 2, wherein synthesizing the feature data of each of the plurality of feature data subsets based on the second number to obtain the extended feature data of each feature data subset comprises:
weighting the base expansion number corresponding to each feature data subset to determine a final expansion number corresponding to each feature data subset, the final expansion number being the determined number of extended feature data of the corresponding feature data subset;
and synthesizing the feature data of each feature data subset based on the final expansion number and the neighborhood features of the feature data of that subset to obtain the extended feature data of each feature data subset.
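Claim 5 first weights the base expansion numbers into final counts (the weighting scheme is left open; e.g. final = round(base * weight) per subset), then synthesizes within each subset using neighborhood features. A SMOTE-style seed-to-neighbor interpolation is assumed for the second step.

```python
import numpy as np

def expand_subset(subset: np.ndarray, final_number: int,
                  k: int = 5, seed: int = 0) -> np.ndarray:
    """Sketch of claim 5: synthesize `final_number` vectors for one
    subset by interpolating each sampled seed with one of its k nearest
    neighbors (an assumed reading of 'neighborhood features')."""
    rng = np.random.default_rng(seed)
    n = len(subset)
    if n < 2 or final_number <= 0:
        return np.empty((0, subset.shape[1]))
    k = min(k, n - 1)
    dists = np.linalg.norm(subset[:, None, :] - subset[None, :, :], axis=-1)
    nn = np.argsort(dists, axis=1)[:, 1:k + 1]   # k nearest neighbors per point
    seeds = rng.integers(0, n, size=final_number)
    mates = nn[seeds, rng.integers(0, k, size=final_number)]
    lam = rng.random((final_number, 1))
    return subset[seeds] + lam * (subset[mates] - subset[seeds])
```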
6. The data synthesis method according to claim 1, wherein dividing the feature data of the first feature data set based on feature similarity to obtain a plurality of feature data subsets comprises:
traversing all feature data in the first feature data set and collecting the feature data whose similarity to the reference feature data is smaller than a first threshold, to obtain a first feature data subset;
traversing the feature data remaining in the first feature data set outside the first feature data subset and collecting the feature data whose similarity to the reference feature data is greater than the first threshold and smaller than a second threshold, to obtain a second feature data subset;
and traversing the feature data remaining in the first feature data set outside the first and second feature data subsets and collecting the feature data whose similarity to the reference feature data is greater than the second threshold, to obtain a third feature data subset;
wherein the third feature data subset has a greater similarity to the reference feature data than the second feature data subset, and the second feature data subset has a greater similarity to the reference feature data than the first feature data subset.
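Claim 6 peels the first feature data set into three subsets with two similarity thresholds. Vectorized, the three traversals collapse into masked selections. Cosine similarity and the threshold values are assumptions here, and points falling exactly on a threshold (not covered by the claim wording) are placed in the lower band.

```python
import numpy as np

def partition_by_similarity(features: np.ndarray, reference: np.ndarray,
                            t1: float = 0.5, t2: float = 0.8):
    """Sketch of claim 6: first/second/third subsets in increasing
    order of similarity to the reference feature data."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    sim = (features / np.maximum(norms, 1e-12)) @ (reference / np.linalg.norm(reference))
    first = features[sim < t1]                      # least similar
    second = features[(sim >= t1) & (sim < t2)]
    third = features[sim >= t2]                     # most similar
    return first, second, third
```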
7. The data synthesis method of claim 1, wherein obtaining the first feature data set containing the first number of feature data and obtaining the second feature data set containing the second number of feature data comprise:
acquiring a feature data set;
and classifying the feature data set according to the dimension of each feature data in the feature data set, thereby obtaining the first feature data set and the second feature data set.
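Claim 7 derives both sets from one pool. Reading the per-datum "dimension" as a class label attached to each feature vector is an assumption of this sketch; the smallest class becomes the first set and the largest the second.

```python
import numpy as np

def split_minority_majority(features: np.ndarray, labels: np.ndarray):
    """Sketch of claim 7 (assumed reading): the least-populated class
    yields the first feature data set, the most-populated the second."""
    classes, counts = np.unique(labels, return_counts=True)
    first = features[labels == classes[counts.argmin()]]
    second = features[labels == classes[counts.argmax()]]
    return first, second
```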
8. A data synthesis apparatus, comprising:
an obtaining module configured to obtain a first feature data set comprising a first number of feature data and a second feature data set comprising a second number of feature data, the first number being smaller than the second number;
a sampling module configured to divide the feature data of the first feature data set based on feature similarity to obtain a plurality of feature data subsets, wherein different feature data subsets have different similarities to reference feature data, the reference feature data being determined based on the first feature data set;
and a synthesis module configured to synthesize the feature data of each of the plurality of feature data subsets based on the second number to obtain extended feature data of each feature data subset.
9. A method for training an image processing model, comprising:
obtaining a training sample set comprising extended feature data obtained by the method of any one of claims 1 to 7;
and training the image processing model to be trained using the training sample set to obtain a trained image processing model.
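Claim 9 only requires that the training sample set contain the extended feature data. A minimal sketch of assembling such a set, with every name a placeholder for whatever model and training routine are used:

```python
import numpy as np

def build_training_set(real_x, real_y, extended_x, minority_label):
    """Sketch of claim 9: fold the synthesized minority-class features
    into the training sample set before ordinary training."""
    x = np.vstack([real_x, extended_x])
    y = np.concatenate([real_y, np.full(len(extended_x), minority_label)])
    return x, y
```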
10. An apparatus for training an image processing model, comprising:
a sample set acquisition module configured to acquire a training sample set, the training sample set comprising extended feature data obtained by the method of any one of claims 1 to 7;
and a training module configured to train the image processing model to be trained using the training sample set to obtain a trained image processing model.
11. An electronic device, comprising a memory storing program instructions and a processor that retrieves the program instructions from the memory to perform the method of any one of claims 1-7 or 9.
12. A computer-readable storage medium, storing a program file that can be executed to implement the method of any one of claims 1-7 or 9.
CN202110845215.3A 2021-07-26 2021-07-26 Data synthesis method, training method of image processing model and related device Pending CN113673575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110845215.3A CN113673575A (en) 2021-07-26 2021-07-26 Data synthesis method, training method of image processing model and related device

Publications (1)

Publication Number Publication Date
CN113673575A (en) 2021-11-19

Family

ID=78540150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110845215.3A Pending CN113673575A (en) 2021-07-26 2021-07-26 Data synthesis method, training method of image processing model and related device

Country Status (1)

Country Link
CN (1) CN113673575A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108470187A * 2018-02-26 2018-08-31 South China University of Technology Class-imbalance question classification method based on an expanded training data set
CN108628971A * 2018-04-24 2018-10-09 Shenzhen Qianhai WeBank Co., Ltd. Text classification method, text classifier and storage medium for imbalanced data sets
CN109635839A * 2018-11-12 2019-04-16 State Grid Co., Ltd. Method and apparatus for processing unbalanced data sets based on machine learning
WO2019169704A1 * 2018-03-08 2019-09-12 Ping An Technology (Shenzhen) Co., Ltd. Data classification method, apparatus, device and computer-readable storage medium
CN110443281A * 2019-07-05 2019-11-12 Chongqing Xinke Design Co., Ltd. Adaptive oversampling method based on HDBSCAN clustering
CN111382897A * 2019-10-25 2020-07-07 Guangzhou Power Supply Bureau Co., Ltd. Transformer area low-voltage trip prediction method and device, computer equipment and storage medium
CN111488520A * 2020-03-19 2020-08-04 Wuhan Institute of Technology Crop planting species recommendation information processing device and method and storage medium
CN112990318A * 2021-03-18 2021-06-18 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences Continuous learning method, device, terminal and storage medium

Similar Documents

Publication Publication Date Title
CN107301225B (en) Short text classification method and device
CN107305637B (en) Data clustering method and device based on K-Means algorithm
CN104246765B (en) Image retrieving apparatus, image search method, program and computer-readable storage medium
CN103020067B (en) A kind of method and apparatus determining type of webpage
CN110618082B (en) Reservoir micro-pore structure evaluation method and device based on neural network
WO2015004434A1 (en) Compact and robust signature for large scale visual search, retrieval and classification
CN106156163B (en) Text classification method and device
EP2833275A1 (en) Image search device, image search method, program, and computer-readable storage medium
WO2008073820A1 (en) Identifying relationships among database records
CN110399483A (en) A kind of subject classification method, apparatus, electronic equipment and readable storage medium storing program for executing
CN111753870A (en) Training method and device of target detection model and storage medium
Wang et al. AGNES‐SMOTE: An Oversampling Algorithm Based on Hierarchical Clustering and Improved SMOTE
CN113673575A (en) Data synthesis method, training method of image processing model and related device
CN107133643A (en) Note signal sorting technique based on multiple features fusion and feature selecting
CN116628600A (en) Unbalanced data sampling method and device based on random forest
CN109635839B (en) Unbalanced data set processing method and device based on machine learning
JP2008052707A (en) Method and device to determine descriptor for signal representing multimedia item, device for retrieving item in database, and device for classification of multimedia item in database
US7356797B2 (en) Logic transformation and gate placement to avoid routing congestion
Groves et al. Craniometry of slow lorises (genus Nycticebus) of insular Southeast Asia
Knight et al. Hypergen-a distributed genetic algorithm on a hypercube
CN108319682A (en) Method, apparatus, equipment and the medium of grader amendment and taxonomy library structure
CN108921207A (en) A kind of hyper parameter determines method, device and equipment
Viswanath et al. A fast and efficient ensemble clustering method
CN116244426A (en) Geographic function area identification method, device, equipment and storage medium
CN114333840A (en) Voice identification method and related device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination