CN116304932B - Sample generation method, device, terminal equipment and medium - Google Patents
- Publication number: CN116304932B
- Application number: CN202310566164.XA
- Authority
- CN
- China
- Prior art keywords
- sample set
- evolution
- population
- disease
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application belongs to the technical field of data processing and provides a sample generation method, a device, terminal equipment and a medium. A diagnosis sample set is divided to obtain a disease sample set and a normal sample set; covariance and entropy are added into a feature matrix of the disease sample set, and a gradient lifting tree model is constructed according to the feature matrix; according to the gradient lifting tree model, combined with the entropy of the disease sample set, the split contribution degree and entropy difference of each feature are calculated respectively to obtain the feature weight of each feature; an initial population corresponding to the disease sample set is constructed; a first evolution probability corresponding to the initial population is obtained according to the covariance, the entropy and the feature weights; the initial population is evolved according to the first evolution probability to obtain an intermediate population, and a second evolution probability corresponding to the intermediate population is calculated; and the intermediate population meeting the evolution termination condition is taken as a new disease sample set based on the first evolution probability and the second evolution probability. The application can improve the quality of the generated samples.
Description
Technical Field
The present application belongs to the technical field of data processing, and in particular, relates to a sample generation method, a device, a terminal device, and a medium.
Background
An unbalanced data set is one in which the number of samples varies greatly between categories. Taking binary classification as an example, if the number of negative-class samples far exceeds the number of positive-class samples, the classification result is biased towards the negative class and the misclassification rate of the positive class is high. In practice, once the imbalance ratio of a data set exceeds 4:1, the classifier will be biased towards the majority class; in super-unbalanced data sets, the proportion of the positive class is typically less than one percent, and such problems are particularly pronounced in the medical field.
In the medical field, the proportion of disease samples (positive samples) is often far lower than that of normal samples (negative samples), which greatly impairs subsequent identification of the disease and thus endangers the health of patients.
To address such problems, those skilled in the art have employed the Synthetic Minority Oversampling Technique (SMOTE) to increase the number of minority-class (positive) samples and balance the data set, thereby avoiding the above-described problems. However, the quality of the samples generated by current sample generation methods is low, which introduces uncertainty and randomness into subsequent classification results and degrades the effect of the classifier.
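As background, the core SMOTE idea — interpolating between a minority sample and one of its nearest minority neighbors — can be sketched as follows (a minimal illustration; the function name and parameters are chosen here for exposition and are not from the patent):

```python
import numpy as np

def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by interpolating each chosen
    minority sample with one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    m = len(minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(m)
        # Euclidean distances from sample i to all minority samples
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbors)
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because each synthetic point is a convex combination of two existing minority samples, it always lies on the line segment between them — which is precisely the source of the randomness the application criticizes.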
Disclosure of Invention
The embodiment of the application provides a sample generation method, a sample generation device, terminal equipment and a sample generation medium, which can solve the problem of low quality of samples generated by the current sample generation method.
In a first aspect, an embodiment of the present application provides a sample generating method, including:
step 1, dividing an unbalanced diagnosis sample set to obtain a disease sample set and a normal sample set;
step 2, respectively calculating covariance and entropy of the disease sample set, adding the covariance and entropy into a feature matrix of the disease sample set, and constructing a gradient lifting tree model according to the feature matrix; the gradient lifting tree model comprises a plurality of decision trees, and leaf nodes of the decision trees are in one-to-one correspondence with features in the feature matrix;
step 3, according to the gradient lifting tree model, combining entropy of the disease sample set, calculating split contribution degree and entropy difference of each feature in the plurality of features respectively, and obtaining feature weights of the features; the split contribution degree and the entropy difference are used for representing the importance of the feature;
step 4, constructing an initial population corresponding to the disease sample set; wherein, population individuals of the initial population are in one-to-one correspondence with disease samples in the disease sample set;
Step 5, obtaining a first evolution probability corresponding to the initial population according to the covariance, the entropy and the characteristic weight; the first evolution probability is used for representing the importance of population individuals in the initial population;
step 6, evolving the initial population according to the first evolution probability to obtain an intermediate population, and calculating a second evolution probability corresponding to the intermediate population;
step 7, judging whether the middle population meets a preset evolution termination condition according to the first evolution probability and the second evolution probability;
step 8, if the intermediate population meets the preset evolution termination condition, taking the intermediate population as a new disease sample set; otherwise, taking the intermediate population as the initial population in step 6 and returning to execute step 6.
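Taken together, steps 1-8 form a loop that can be outlined in pseudocode (every helper name below is a placeholder for the operation the corresponding step describes, not an implementation disclosed by the patent):

```python
def generate_samples(diagnosis_set, max_generations, loss_threshold):
    """High-level outline of steps 1-8; each helper stands for one step."""
    disease_set, normal_set = split_by_label(diagnosis_set)        # step 1
    features = augment_with_cov_and_entropy(disease_set)           # step 2
    model = build_gradient_lifting_trees(features)
    weights = feature_weights(model, entropy(disease_set))         # step 3
    population = init_population(disease_set)                      # step 4
    p_first = evolution_probability(population, weights)           # step 5
    for generation in range(max_generations):
        population = evolve(population, p_first)                   # step 6
        p_second = evolution_probability(population, weights)
        if terminated(p_first, p_second, generation,               # step 7
                      max_generations, loss_threshold):
            break                                                  # step 8
    return population    # the new disease sample set
```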
Optionally, step 3 includes:
obtaining the split contribution $C_{ij}$ of each feature on each decision tree by the split-contribution calculation formula; wherein $C_{ij}$ denotes the split contribution of the $i$-th feature on the $j$-th decision tree, $i=1,2,\dots,n$, $n$ represents the total number of features in the disease sample set, $T_j$ denotes the leaf nodes on the $j$-th decision tree, $x_{k,i}$ denotes the $i$-th feature of the $k$-th disease sample, $k=1,2,\dots,m$, $m$ represents the total number of disease samples in the disease sample set, $s_{ij}$ denotes the optimal split point of the $i$-th feature on the $j$-th decision tree, $x_k$ denotes the $k$-th disease sample, and $x_{k'}$, $k'\ne k$, denotes the other disease samples;
obtaining the entropy difference by the calculation formula $\Delta H_i = H(D) - H(D_{\setminus i})$, with $H(D) = -\sum_{k=1}^{m} p_k \log p_k$; wherein $\Delta H_i$ denotes the entropy difference of the $i$-th feature, $H(D)$ represents the entropy of the disease sample set $D$, $H(D_{\setminus i})$ represents the entropy after the $i$-th feature is removed from the disease sample set, and $p_k$ denotes the probability of the $k$-th disease sample;
for each feature, the following steps are performed:
obtaining the variance contribution degree of the feature according to the split contribution degree and the covariance; the variance contribution is used to characterize the importance of the feature;
and obtaining the feature weight of the feature according to the entropy difference and the variance contribution degree.
Optionally, obtaining the variance contribution of the feature according to the split contribution and the covariance includes:
obtaining the variance contribution $V_i$ by the variance-contribution calculation formula; wherein $V_i$ denotes the variance contribution of the $i$-th feature and $\Sigma$ represents the covariance.
Optionally, obtaining the feature weight of the feature according to the entropy difference and the variance contribution includes:
obtaining the feature weight by the calculation formula $w_i = \alpha\,\Delta H_i + (1-\alpha)\,V_i$; wherein $w_i$ denotes the feature weight of the $i$-th feature, and $\alpha$ represents a hyper-parameter used to control the weight of said entropy difference and the weight of said variance contribution, $0 < \alpha < 1$.
Optionally, step 5 includes:
obtaining the first evolution probability $P_1$ by the evolution-probability calculation formula; wherein $Z$ represents the normalization coefficient, $W=\{w_1,w_2,\dots,w_n\}$ represents the set of feature weights, $f$ represents the probability density, and $\beta$ represents an exponential term used to adjust the shape of the exponential function so that the trend of the probability density function across different dimensions is smoother, $\beta>0$.
Optionally, step 6 includes:
obtaining new population individuals $x'_k$ by the individual-update calculation formula;
obtaining the fitness $F(x'_k)$ of each new population individual by the fitness calculation formula;
sorting all new population individuals in descending order of fitness, and adding the top $q$ new individuals to the disease sample set $D$ to obtain a new disease sample set $D'$;
obtaining an intermediate population from the new disease sample set $D'$ according to a genetic algorithm;
and calculating the second evolution probability $P_2$ corresponding to the intermediate population.
Optionally, step 7 includes:
obtaining the evolution loss $L_t$ of the intermediate population by the evolution-loss calculation formula; wherein $L_t$ represents the evolution loss of the $t$-th evolution of the intermediate population and is used to characterize the superiority of the intermediate population, $t=1,2,\dots,T$, and $T$ represents a preset maximum number of population evolutions;
counting the execution times of the step 6, and taking the times as evolution times of the intermediate population;
If the evolution loss is smaller than a preset threshold value and the evolution frequency is larger than or equal to the maximum population evolution frequency, determining that the intermediate population meets a preset evolution termination condition; otherwise, determining that the intermediate population does not meet the preset evolution termination condition.
In a second aspect, an embodiment of the present application provides a sample generating device, including:
the sample dividing module is used for dividing an unbalanced diagnosis sample set to obtain a disease sample set and a normal sample set;
the gradient lifting tree module is used for respectively calculating covariance and entropy of the disease sample set, adding the covariance and entropy into a feature matrix of the disease sample set, and constructing a gradient lifting tree model according to the feature matrix; the gradient lifting tree model comprises a plurality of decision trees, and leaf nodes of the decision trees are in one-to-one correspondence with features in the feature matrix;
the feature weight module is used for respectively calculating the splitting contribution degree and entropy difference of each feature in the plurality of features according to the gradient lifting tree model and the entropy of the disease sample set to obtain feature weights of the features; the split contribution degree and the entropy difference are used for representing the importance of the feature;
the genetic algorithm module is used for constructing an initial population corresponding to the disease sample set; wherein, population individuals of the initial population are in one-to-one correspondence with disease samples in the disease sample set;
The first evolution probability module is used for obtaining a first evolution probability corresponding to the initial population according to the covariance, the entropy and the feature weight; the first evolution probability is used for representing the importance of population individuals in the initial population;
the second evolution probability module is used for evolving the initial population according to the first evolution probability to obtain an intermediate population, and calculating a second evolution probability corresponding to the intermediate population;
the judging module is used for judging whether the middle population meets the preset evolution termination condition according to the first evolution probability and the second evolution probability;
the sample generation module is used for taking the intermediate population as a new disease sample set if the intermediate population meets the preset evolution termination condition; otherwise, taking the intermediate population as the initial population in the second evolution probability module and returning to execute the second evolution probability module.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the sample generation method described above when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium storing a computer program, which when executed by a processor, implements the above-described sample generation method.
The scheme of the application has the following beneficial effects:
in some embodiments of the application, the covariance and entropy of the disease sample set are calculated and added into its feature matrix, a gradient lifting tree model is constructed according to the feature matrix, and feature weights are then obtained according to the model; this expands the features of the disease sample set and gives larger weights to important features, improving the diversity and robustness of the samples and thus their quality. The evolution probability of the population is obtained according to the feature weights, which prevents excellent individuals in the population from being evolved away, so that excellent samples are retained and sample quality is improved.
Other advantageous effects of the present application will be described in detail in the detailed description section which follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a sample generation method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a sample generating device according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining" or "in response to determining" or "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
Aiming at the problem of low sample quality in current sample generation methods, the application provides a sample generation method, a device, a terminal device and a medium. By calculating the covariance and entropy of the disease sample set, adding them into its feature matrix, constructing a gradient lifting tree model according to the feature matrix, and obtaining feature weights according to the model, the features of the disease sample set are expanded and important features are given larger weights, which ensures that excellent samples are subsequently generated and helps improve the quality of the generated samples; the evolution probability of the population is obtained according to the feature weights, which prevents excellent individuals in the population from being evolved away, retains excellent samples, and improves sample quality.
As shown in fig. 1, the sample generating method provided by the application comprises the following steps:
step 1, dividing an unbalanced diagnosis sample set to obtain a disease sample set and a normal sample set.
In some embodiments of the present application, the diagnosis sample set is further preprocessed before performing step 1 to ensure the validity of the diagnosis sample set. Illustratively, the preprocessing includes data denoising, missing-value filling, and the like.
The above-mentioned division of the unbalanced diagnosis sample set is based on the class labels of the diagnosis samples in the set. Illustratively, let the diagnosis sample set be $S=\{s_1,s_2,\dots,s_N\}$, where $N$ represents the total number of diagnosis samples. The diagnosis samples labeled as diseased are divided into a disease sample set $D=\{x_1,x_2,\dots,x_m\}$, where $x_k$ represents the $k$-th disease sample and $m$ represents the total number of disease samples. Based on the same division principle, a normal sample set $H=\{h_1,h_2,\dots,h_l\}$ can be obtained accordingly, where $h_r$ represents the $r$-th normal sample and $l$ represents the total number of normal samples.
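Under the notation above, the label-based split of step 1 can be sketched as follows (the array and function names are illustrative, not from the patent):

```python
import numpy as np

def split_by_label(samples, labels, disease_label=1):
    """Divide an unbalanced diagnosis set into a disease sample set and
    a normal sample set according to each sample's class label."""
    samples = np.asarray(samples)
    labels = np.asarray(labels)
    disease_set = samples[labels == disease_label]
    normal_set = samples[labels != disease_label]
    return disease_set, normal_set
```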
And 2, respectively calculating covariance and entropy of the disease sample set, adding the covariance and entropy into a feature matrix of the disease sample set, and constructing a gradient lifting tree model according to the feature matrix.
The gradient lifting tree model comprises a plurality of decision trees, and leaf nodes of the decision trees are in one-to-one correspondence with features in the feature matrix.
It should be noted that the feature matrix of the disease sample set includes a plurality of features, each feature representing one dimension of the disease sample set. Illustratively, in one embodiment of the present application, any disease sample $x_k$ corresponds to the features $x_k=(x_{k,1},x_{k,2},\dots,x_{k,n})$, where $n$ represents the total number of features of the disease sample.
The gradient lifting tree model is common knowledge to those skilled in the art; any common construction method can be adopted to build it, and the construction manner is not limited herein.
It is worth mentioning that the covariance and entropy are added into the feature matrix of the disease sample set, so that the features of the disease sample set are expanded, and the robustness of the disease sample is enhanced.
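As a sketch of the feature-matrix extension in step 2, the snippet below appends one covariance-derived column and one entropy-derived column to the disease feature matrix. The particular derivations shown (sample-wise covariance against the mean profile, entropy over normalized feature magnitudes) are one plausible reading for illustration only; the patent does not preserve its exact definitions:

```python
import numpy as np

def augment_features(disease, eps=1e-12):
    """Append covariance- and entropy-derived columns to the disease
    feature matrix (an assumed reading of step 2's augmentation)."""
    disease = np.asarray(disease, dtype=float)
    # covariance-like column: co-deviation of each sample from the
    # per-feature mean profile and from its own row mean
    mean_profile = disease.mean(axis=0)
    cov_col = ((disease - mean_profile) *
               (disease - disease.mean(axis=1, keepdims=True))).mean(axis=1)
    # entropy column: Shannon entropy of each sample's normalized
    # absolute feature magnitudes
    p = np.abs(disease) + eps
    p = p / p.sum(axis=1, keepdims=True)
    ent_col = -(p * np.log(p)).sum(axis=1)
    return np.hstack([disease, cov_col[:, None], ent_col[:, None]])
```

The gradient lifting tree model is then fit on this widened matrix, so the two extra dimensions participate in splitting like any other feature.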
And 3, respectively calculating the split contribution degree and entropy difference of each feature in the plurality of features according to the gradient lifting tree model and by combining the entropy of the disease sample set, so as to obtain the feature weight of the feature.
Wherein both the split contribution and entropy difference are used to characterize the importance of the feature.
And 4, constructing an initial population corresponding to the disease sample set.
Wherein, the population individuals of the initial population are in one-to-one correspondence with the disease samples in the disease sample set.
In some embodiments of the application, genetic algorithms may be used to construct an initial population for a disease sample set.
And step 5, obtaining a first evolution probability corresponding to the initial population according to the covariance, the entropy and the characteristic weight.
The first evolution probability is used for representing the importance of population individuals in the initial population.
Specifically, the first evolution probability $P_1$ is obtained by the evolution-probability calculation formula; wherein $Z$ represents the normalization coefficient, $W=\{w_1,w_2,\dots,w_n\}$ represents the set of feature weights, $w_i$ represents the $i$-th feature weight of the disease sample set, $f$ represents the probability density, and $\beta$ represents an exponential term used to adjust the shape of the exponential function so that the trend of the probability density function across different dimensions is smoother, $\beta>0$.
And 6, evolving the initial population according to the first evolution probability to obtain an intermediate population, and calculating a second evolution probability corresponding to the intermediate population.
The second evolution probability corresponding to the intermediate population is calculated to facilitate the subsequent calculation of the similarity between the intermediate population and the initial population, so as to avoid the newly generated samples differing too much from the original samples, which would affect the effect of the classifier.
The second evolution probability is used for representing the importance of population individuals in the middle population.
And 7, judging whether the middle population meets a preset evolution termination condition according to the first evolution probability and the second evolution probability.
Step 8, if the intermediate population meets the preset evolution termination condition, taking the intermediate population as a new disease sample set; otherwise, the middle population is used as the initial population in the step 6, and the step 6 is executed again.
In some embodiments of the present application, after obtaining the intermediate population that meets the evolution termination condition, a corresponding new disease sample set is generated according to the number of population individuals in the intermediate population and the disease sample type corresponding to each population individual; the new disease sample set and the normal sample set then form a balanced sample set, with which the classifier is trained to achieve accurate identification and classification of the disease.
In embodiments of the present application, the number of new disease samples to be generated, $q$, may be obtained by the calculation formula $q=\lambda\,(l-m)$; wherein $l$ represents the total number of normal samples in the normal sample set obtained in step 1, $m$ represents the total number of disease samples in the disease sample set obtained in step 1, and $\lambda$ represents a manually set control parameter, $0<\lambda\le 1$.
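Under the reconstructed balancing form $q=\lambda\,(l-m)$ (an assumption, since the original formula image is not preserved here), a quick numeric check:

```python
def new_sample_count(l, m, lam=1.0):
    """q = lam * (l - m): number of new disease samples needed to move
    the disease class toward the normal-class count; lam in (0, 1]."""
    assert 0 < lam <= 1 and l >= m
    return int(lam * (l - m))

# e.g. 950 normal vs 50 disease samples with lam = 1 -> 900 new samples
```

With $\lambda=1$ the resulting set is fully balanced; smaller $\lambda$ values only partially close the gap.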
The following describes an exemplary procedure of step 3 (calculating the split contribution degree and entropy difference of each of the plurality of features according to the gradient lifting tree model, in combination with the entropy of the disease sample set, to obtain the feature weight of the feature).
Step 3.1, obtain the split contribution $C_{ij}$ by the split-contribution calculation formula.
wherein $C_{ij}$ denotes the split contribution of the $i$-th feature on the $j$-th decision tree, $i=1,2,\dots,n$, $n$ represents the total number of features in the disease sample set, $T_j$ denotes the leaf nodes on the $j$-th decision tree, $x_{k,i}$ denotes the $i$-th feature of the $k$-th disease sample, $k=1,2,\dots,m$, $m$ represents the total number of disease samples in the disease sample set, $s_{ij}$ denotes the optimal split point of the $i$-th feature on the $j$-th decision tree, $x_k$ denotes the $k$-th disease sample, and $x_{k'}$, $k'\ne k$, denotes the other disease samples.
Step 3.2, obtain the entropy difference by the calculation formula $\Delta H_i = H(D) - H(D_{\setminus i})$, with $H(D) = -\sum_{k=1}^{m} p_k \log p_k$.
wherein $\Delta H_i$ denotes the entropy difference of the $i$-th feature, $H(D)$ represents the entropy of the disease sample set $D$, $H(D_{\setminus i})$ represents the entropy of the disease sample set after the $i$-th feature is removed, and $p_k$ denotes the probability of the $k$-th disease sample.
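Step 3.2 can be sketched under the reconstructed form $\Delta H_i = H(D) - H(D_{\setminus i})$. The choice of probability source below — normalized per-sample feature mass — is an illustrative assumption, since the patent's exact probability definition is not preserved:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H = -sum(p * log p) of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + eps)).sum())

def entropy_difference(disease, i):
    """Delta H_i = H(D) - H(D without feature i); probabilities are
    taken from normalized per-sample feature mass (assumed definition)."""
    disease = np.abs(np.asarray(disease, dtype=float)) + 1e-12
    full = entropy(disease.sum(axis=1))
    reduced = entropy(np.delete(disease, i, axis=1).sum(axis=1))
    return full - reduced
```

A large positive $\Delta H_i$ means removing feature $i$ noticeably flattens the sample distribution, marking the feature as important.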
For each feature, the following steps are performed:
step 3.3, obtaining the variance contribution degree of the features according to the division contribution degree and the covariance; the variance contribution is used to characterize the importance of the feature.
Specifically, the variance contribution $V_i$ is obtained by the variance-contribution calculation formula; wherein $V_i$ denotes the variance contribution of the $i$-th feature and $\Sigma$ represents the covariance.
Step 3.4, obtaining the feature weight of the feature according to the entropy difference and the variance contribution degree.
Specifically, the feature weight is obtained by the calculation formula $w_i = \alpha\,\Delta H_i + (1-\alpha)\,V_i$; wherein $w_i$ denotes the feature weight of the $i$-th feature, and $\alpha$ represents a hyper-parameter used to control the weight of the entropy difference and the weight of the variance contribution, $0<\alpha<1$.
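Step 3.4 under the reconstructed weighted-sum form (the $\alpha$-combination is inferred from the hyper-parameter's stated role of balancing the two importance signals, not reproduced from the patent's formula image); both signals are min-max normalized first so their scales match:

```python
import numpy as np

def feature_weights(entropy_diff, variance_contrib, alpha=0.5):
    """w_i = alpha * dH_i + (1 - alpha) * V_i, with each importance
    signal min-max normalized to [0, 1] before combining."""
    def norm(v):
        v = np.asarray(v, dtype=float)
        span = v.max() - v.min()
        return (v - v.min()) / span if span > 0 else np.zeros_like(v)
    return alpha * norm(entropy_diff) + (1 - alpha) * norm(variance_contrib)
```

Setting $\alpha$ near 1 lets the entropy difference dominate; near 0, the variance contribution dominates.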
The following describes an exemplary procedure of step 6 (evolve the initial population according to the first evolution probability, obtain the intermediate population, and calculate the second evolution probability corresponding to the intermediate population).
Step 6.1, obtain new population individuals $x'_k$ by the individual-update calculation formula.
It should be noted that the greater the evolution probability, the more important the population individual is, and the more it should be retained, so as to avoid evolving away excellent population individuals.
Step 6.2, obtain the fitness $F(x'_k)$ of each new population individual by the fitness calculation formula.
Step 6.3, sort all new population individuals in descending order of fitness, and add the top $q$ new individuals to the disease sample set $D$ to obtain a new disease sample set $D'$.
Step 6.4, obtain an intermediate population from the new disease sample set $D'$ according to a genetic algorithm.
Step 6.5, calculate the second evolution probability $P_2$ corresponding to the intermediate population.
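The fitness-ranked selection at the heart of steps 6.1-6.5 can be sketched as follows (the array names are illustrative; the fitness values would come from the patent's unspecified fitness formula):

```python
import numpy as np

def select_top_q(new_individuals, fitness, q):
    """Sort candidate individuals by fitness (descending) and keep the
    top q, as in step 6.3."""
    new_individuals = np.asarray(new_individuals)
    fitness = np.asarray(fitness, dtype=float)
    order = np.argsort(-fitness)   # indices in descending fitness order
    return new_individuals[order[:q]]
```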
The following describes an exemplary procedure of step 7 (determining whether the intermediate population satisfies the preset evolution termination condition according to the first evolution probability and the second evolution probability).
Step 7.1, obtain the evolution loss $L_t$ of the intermediate population by the evolution-loss calculation formula.
wherein $L_t$ represents the evolution loss of the $t$-th evolution of the intermediate population and is used to characterize the superiority of the intermediate population, $t=1,2,\dots,T$, and $T$ represents the preset maximum number of population evolutions.
Step 7.2, the number of times step 6 has been executed is counted and taken as the number of evolutions of the intermediate population.
Step 7.3, if the evolution loss is smaller than a preset threshold and the number of evolutions is greater than or equal to the maximum number of population evolutions, the intermediate population is determined to satisfy the preset evolution termination condition; otherwise, it is determined not to satisfy the condition.
It should be noted that when the evolution loss is greater than the preset threshold, the intermediate population differs too much from the initial population to be suitable for subsequent classifier training. Requiring the number of evolutions to reach the maximum number of population evolutions improves the precision of the intermediate population and prevents accidental factors from skewing the result.
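The two-part termination condition of step 7.3 is small enough to state directly in code. This is a minimal sketch; the names are illustrative, and how the evolution loss itself is computed is left to the patent's (unreproduced) formula:

```python
def should_terminate(evolution_loss, n_evolutions, loss_threshold, max_evolutions):
    """Preset evolution termination condition from step 7.3: stop only when the
    intermediate population is close enough to the initial one (small loss)
    AND has evolved at least the preset maximum number of times."""
    return evolution_loss < loss_threshold and n_evolutions >= max_evolutions
```

Note that both conditions must hold: a low loss alone does not terminate the loop, which is what guards against accidental early agreement between the populations.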
As the above steps show, the sample generation method provided by this application calculates the covariance and entropy of the disease sample set, adds them to its feature matrix, builds a gradient lifting tree model from the feature matrix, and derives feature weights from that model. This expands the features of the disease sample set and assigns larger weights to important features, which improves the diversity and robustness of the samples and thus their quality. Deriving the evolution probability of the population from the feature weights prevents excellent individuals in the population from being evolved away, so that excellent samples are retained and sample quality improves further.
The sample generating device provided by the present application is exemplified below.
As shown in fig. 2, the sample generation apparatus 200 includes:
the sample dividing module 201 is configured to divide an unbalanced diagnostic sample set to obtain a disease sample set and a normal sample set;
the gradient lifting tree module 202 is configured to calculate covariance and entropy of the disease sample set, add the covariance and entropy to a feature matrix of the disease sample set, and construct a gradient lifting tree model according to the feature matrix; the gradient lifting tree model comprises a plurality of decision trees, and leaf nodes of the decision trees are in one-to-one correspondence with features in the feature matrix;
the feature weight module 203 is configured to calculate a split contribution degree and an entropy difference of each feature in the plurality of features according to the gradient lifting tree model and in combination with entropy of the disease sample set, so as to obtain feature weights of the features; the split contribution degree and the entropy difference are used for representing the importance of the feature;
the genetic algorithm module 204 is configured to construct an initial population corresponding to the disease sample set; wherein, population individuals of the initial population are in one-to-one correspondence with disease samples in the disease sample set;
the first evolution probability module 205 is configured to obtain a first evolution probability corresponding to the initial population according to the covariance, the entropy and the feature weight; the first evolution probability is used for representing the importance of population individuals in the initial population;
The second evolution probability module 206 is configured to evolve the initial population according to the first evolution probability, obtain an intermediate population, and calculate a second evolution probability corresponding to the intermediate population;
a judging module 207, configured to judge whether the intermediate population meets a preset evolution termination condition according to the first evolution probability and the second evolution probability;
the sample generation module 208 is configured to take the intermediate population as a new disease sample set if the intermediate population meets the preset evolution termination condition; otherwise, the intermediate population is used as the initial population of the second evolution probability module, and execution returns to the second evolution probability module.
It should be noted that, because the information interaction and execution processes between the above devices/units are based on the same concept as the method embodiments of the present application, their specific functions and technical effects can be found in the method embodiment section and are not repeated here.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
As shown in fig. 3, an embodiment of the present application provides a terminal device. The terminal device 300 of this embodiment includes: at least one processor 301 (only one is shown in fig. 3), a memory 302, and a computer program 303 stored in the memory 302 and executable on the at least one processor 301; the processor 301 implements the steps of any of the method embodiments described above when executing the computer program 303.
Specifically, when the processor 301 executes the computer program 303, the diagnostic sample set is divided into a disease sample set and a normal sample set; the covariance and entropy of the disease sample set are added to its feature matrix and a gradient lifting tree model is built; combining the entropy of the disease sample set, the split contribution degree and entropy difference of each feature are calculated to obtain the feature weights of the features; an initial population corresponding to the disease sample set is constructed; a first evolution probability of the initial population is obtained from the covariance, the entropy, and the feature weights; the initial population is evolved according to this probability to obtain an intermediate population, and a second evolution probability of the intermediate population is calculated; based on the first and second evolution probabilities, an intermediate population satisfying the evolution termination condition is taken as the new disease sample set. By calculating the covariance and entropy of the disease sample set, adding them to its feature matrix, building a gradient lifting tree model from the feature matrix, and then deriving feature weights from the model, the features of the disease sample set are expanded and important features receive larger weights, improving the diversity, robustness, and quality of the samples. Deriving the evolution probability of the population from the feature weights prevents excellent individuals from being evolved away, so that excellent samples are retained and sample quality improves.
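The outer evolve-until-termination loop that the processor executes can be sketched as a small driver. The `evolve` and `evolution_loss` callables stand in for the patent's calculation formulas and are assumptions, not APIs defined by the patent:

```python
def generate_disease_samples(initial_population, evolve, evolution_loss,
                             loss_threshold, max_evolutions):
    """Outer loop of steps 6-8: keep evolving until the termination condition
    (loss below threshold AND at least max_evolutions rounds) holds.

    `evolve(population)` returns the next intermediate population and
    `evolution_loss(population, rounds)` returns its loss; both are
    caller-supplied placeholders for the patent's formulas.
    """
    population = initial_population
    rounds = 0
    while True:
        population = evolve(population)
        rounds += 1
        if evolution_loss(population, rounds) < loss_threshold and rounds >= max_evolutions:
            # intermediate population accepted as the new disease sample set
            return population
```

If the condition fails, the intermediate population simply becomes the next round's input, which mirrors step 8's "return to step 6" behavior.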
The processor 301 may be a central processing unit (CPU); it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 302 may in some embodiments be an internal storage unit of the terminal device 300, such as a hard disk or a memory of the terminal device 300. The memory 302 may also be an external storage device of the terminal device 300 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device 300. Further, the memory 302 may also include both an internal storage unit and an external storage device of the terminal device 300. The memory 302 is used to store an operating system, application programs, boot loader (BootLoader), data, and other programs, such as program code for the computer program. The memory 302 may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements steps for implementing the various method embodiments described above.
Embodiments of the present application provide a computer program product enabling a terminal device to carry out the steps of the method embodiments described above when the computer program product is run on the terminal device.
The integrated units, if implemented as software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the sample generation apparatus/terminal device, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, such as a USB flash drive, removable hard disk, magnetic disk, or optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
Each of the foregoing embodiments is described with its own emphasis; for parts not detailed or illustrated in a particular embodiment, reference may be made to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The sample generation method provided by the application has the following advantages:
the quality of the data can be improved, and the shortcomings of the traditional SMOTE method can be overcome. When generating synthetic samples, feature weights must be calculated to ensure similarity between new samples and original samples and thereby reduce noise; a gradient lifting tree method is therefore proposed to solve for the feature weights of the disease sample set. This method combines the entropy and covariance of the data set, offers good interpretability and robustness, and is well suited to large-scale data sets and high-dimensional feature spaces. Sample information is added through a probability generation model that takes the entropy and covariance of the original data set into account, and the feature-weight evolution probability of the population is calculated from the variation between the generated synthetic samples and the covariance and entropy of the original data set, in order to judge whether the generated synthetic samples are representative and reliable. If the feature-weight evolution probability of a generated sample is too large, the sample can be considered harmful to the quality of the data set and must be regenerated. This method reduces noise interference, improves the diversity and robustness of the generated data, and has important application value in fields such as disease diagnosis and credit card fraud detection.
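For contrast with the weighted approach above, the traditional SMOTE step that the paragraph criticizes interpolates between a minority-class sample and one of its nearest neighbors with a uniform random factor, treating every feature equally. A minimal sketch (neighbor search omitted; names illustrative):

```python
import random

def smote_sample(x, neighbors):
    """Classic SMOTE interpolation: pick a random minority-class neighbor and
    form x_new = x + u * (neighbor - x) with u ~ U(0, 1).

    `neighbors` is assumed to be a precomputed list of nearest minority-class
    neighbors of x; all features are weighted equally, which is the weakness
    the feature-weighted method described above addresses.
    """
    nn = random.choice(neighbors)
    u = random.random()
    return [xi + u * (ni - xi) for xi, ni in zip(x, nn)]
```

Because the same interpolation factor is applied to every dimension regardless of feature importance, noisy features distort the synthetic sample as much as informative ones.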
While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.
Claims (4)
1. A method of generating a sample, comprising:
step 1, dividing an unbalanced diagnosis sample set to obtain a disease sample set and a normal sample set; the diagnosis sample set comprises a plurality of diagnosis samples aiming at a disease, and a label corresponding to each diagnosis sample in the plurality of diagnosis samples, wherein the label comprises the disease and the normal;
step 2, respectively calculating covariance and entropy of the disease sample set, adding the covariance and the entropy into a feature matrix of the disease sample set, and constructing a gradient lifting tree model according to the feature matrix; the gradient lifting tree model comprises a plurality of decision trees, and leaf nodes of the decision trees are in one-to-one correspondence with features in the feature matrix; the separately calculating covariance and entropy of the disease sample set comprises:
obtaining the covariance by a calculation formula defined over the disease samples in the disease sample set, the formula ranging over the total number of disease samples in the disease sample set;
obtaining the entropy by a calculation formula defined over the probability of each disease sample in the disease sample set;
the feature matrix of the disease sample set includes a plurality of features, each feature representing a dimension of the disease sample set; each disease sample has a corresponding feature vector whose length is the total number of features of the disease sample;
step 3, according to the gradient lifting tree model, combining the entropy of the disease sample set, respectively calculating the split contribution degree and entropy difference of each feature in the plurality of features to obtain the feature weight of the feature; wherein the split contribution and the entropy difference are both used to characterize the importance of the feature; the step 3 comprises the following steps:
obtaining the split contribution degree by a calculation formula; wherein the split contribution degree of a feature on a decision tree is computed from the leaf nodes of the decision tree, the values of that feature over the disease samples in the disease sample set, and the optimal split point of the feature on the decision tree;
obtaining the entropy difference by a calculation formula; wherein the entropy difference of a feature is the difference between the entropy of the disease sample set and the entropy of the disease sample set after the feature is removed;
for each feature, the following steps are performed:
obtaining the variance contribution of the feature according to the split contribution degree and the covariance; the variance contribution is used to characterize the importance of the feature;
obtaining the feature weight of the feature according to the entropy difference and the variance contribution degree; wherein the feature weight is obtained by a calculation formula in which a hyper-parameter controls the weight of the entropy difference and the weight of the variance contribution degree;
Step 4, constructing an initial population corresponding to the disease sample set; wherein, the population individuals of the initial population are in one-to-one correspondence with the disease samples in the disease sample set;
step 5, obtaining a first evolution probability corresponding to the initial population according to the covariance, the entropy and the feature weight; the first evolution probability is used for representing the importance of population individuals in the initial population; the step 5 comprises the following steps:
obtaining the first evolution probability by a calculation formula; wherein the formula involves a normalization coefficient, the set of feature weights, a probability density, and an exponential term for adjusting the shape of the exponential function so that the probability density function varies more smoothly across different dimensions;
step 6, evolving the initial population according to the first evolution probability to obtain an intermediate population, and calculating a second evolution probability corresponding to the intermediate population; the step 6 comprises the following steps:
obtaining new population individuals by a calculation formula;
obtaining the fitness of the new population individuals by a calculation formula;
sorting all new population individuals in descending order of fitness, and adding the top-ranked new individuals to the disease sample set to obtain a new disease sample set;
obtaining the intermediate population from the new disease sample set according to a genetic algorithm;
calculating the second evolution probability corresponding to the intermediate population;
Step 7, judging whether the intermediate population meets a preset evolution termination condition according to the first evolution probability and the second evolution probability; the step 7 comprises the following steps:
obtaining the evolution loss of the intermediate population by a calculation formula; wherein the evolution loss of the intermediate population at a given evolution is used to characterize the superiority of the intermediate population, and a preset maximum number of population evolutions is defined;
counting the number of times step 6 has been executed, and taking this count as the number of evolutions of the intermediate population;
if the evolution loss is smaller than a preset threshold and the number of evolutions is greater than or equal to the maximum number of population evolutions, determining that the intermediate population satisfies the preset evolution termination condition; otherwise, determining that the intermediate population does not satisfy the preset evolution termination condition;
step 8, if the intermediate population meets the preset evolution termination condition, taking the intermediate population as a new disease sample set; otherwise, the intermediate population is used as the initial population in the step 6, and the step 6 is executed again.
2. A sample generation apparatus, comprising:
the sample dividing module is used for dividing an unbalanced diagnosis sample set to obtain a disease sample set and a normal sample set; the diagnosis sample set comprises a plurality of diagnosis samples aiming at a disease, and a label corresponding to each diagnosis sample in the plurality of diagnosis samples, wherein the label comprises the disease and the normal;
the gradient lifting tree module is used for respectively calculating covariance and entropy of the disease sample set, adding the covariance and the entropy into a feature matrix of the disease sample set, and constructing a gradient lifting tree model according to the feature matrix; the gradient lifting tree model comprises a plurality of decision trees, and leaf nodes of the decision trees are in one-to-one correspondence with features in the feature matrix; the separately calculating covariance and entropy of the disease sample set comprises:
obtaining the covariance by a calculation formula defined over the disease samples in the disease sample set, the formula ranging over the total number of disease samples in the disease sample set;
obtaining the entropy by a calculation formula defined over the probability of each disease sample in the disease sample set;
the feature matrix of the disease sample set includes a plurality of features, each feature representing a dimension of the disease sample set; each disease sample has a corresponding feature vector whose length is the total number of features of the disease sample;
the feature weight module is configured to calculate the split contribution degree and entropy difference of each of the plurality of features according to the gradient lifting tree model in combination with the entropy of the disease sample set, to obtain the feature weight of the feature; wherein the split contribution degree and the entropy difference are both used to characterize the importance of the feature; the feature weight module obtains the split contribution degree by a calculation formula, wherein the split contribution degree of a feature on a decision tree is computed from the leaf nodes of the decision tree, the values of that feature over the disease samples in the disease sample set, and the optimal split point of the feature on the decision tree; and obtains the entropy difference by a calculation formula, wherein the entropy difference of a feature is the difference between the entropy of the disease sample set and the entropy of the disease sample set after the feature is removed;
for each feature, the following steps are performed:
obtaining the variance contribution of the feature according to the split contribution degree and the covariance; the variance contribution is used to characterize the importance of the feature;
obtaining the feature weight of the feature according to the entropy difference and the variance contribution degree; wherein the feature weight is obtained by a calculation formula in which a hyper-parameter controls the weight of the entropy difference and the weight of the variance contribution degree;
The genetic algorithm module is used for constructing an initial population corresponding to the disease sample set; wherein, the population individuals of the initial population are in one-to-one correspondence with the disease samples in the disease sample set;
the first evolution probability module is configured to obtain a first evolution probability corresponding to the initial population according to the covariance, the entropy, and the feature weights; the first evolution probability is used to characterize the importance of population individuals in the initial population; the first evolution probability module obtains the first evolution probability by a calculation formula involving a normalization coefficient, the set of feature weights, a probability density, and an exponential term for adjusting the shape of the exponential function so that the probability density function varies more smoothly across different dimensions;
the second evolution probability module is configured to evolve the initial population according to the first evolution probability to obtain an intermediate population, and to calculate a second evolution probability corresponding to the intermediate population; the second evolution probability module obtains new population individuals by a calculation formula; obtains the fitness of the new population individuals by a calculation formula; sorts all new population individuals in descending order of fitness and adds the top-ranked new individuals to the disease sample set to obtain a new disease sample set; obtains the intermediate population from the new disease sample set according to a genetic algorithm; and calculates the second evolution probability corresponding to the intermediate population;
the judging module is configured to judge whether the intermediate population satisfies a preset evolution termination condition according to the first evolution probability and the second evolution probability; the judging module obtains the evolution loss of the intermediate population by a calculation formula, wherein the evolution loss of the intermediate population at a given evolution is used to characterize the superiority of the intermediate population and a preset maximum number of population evolutions is defined; counts the number of times the second evolution probability module has been executed and takes this count as the number of evolutions of the intermediate population; and if the evolution loss is smaller than a preset threshold and the number of evolutions is greater than or equal to the maximum number of population evolutions, determines that the intermediate population satisfies the preset evolution termination condition, otherwise determines that it does not;
the sample generation module is configured to take the intermediate population as a new disease sample set if the intermediate population satisfies the preset evolution termination condition; otherwise, the intermediate population is used as the initial population of the second evolution probability module, and execution returns to the second evolution probability module.
3. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the sample generation method of claim 1 when executing the computer program.
4. A computer readable storage medium storing a computer program, which when executed by a processor implements the sample generation method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310566164.XA CN116304932B (en) | 2023-05-19 | 2023-05-19 | Sample generation method, device, terminal equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310566164.XA CN116304932B (en) | 2023-05-19 | 2023-05-19 | Sample generation method, device, terminal equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116304932A CN116304932A (en) | 2023-06-23 |
CN116304932B true CN116304932B (en) | 2023-09-05 |
Family
ID=86796372
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310566164.XA Active CN116304932B (en) | 2023-05-19 | 2023-05-19 | Sample generation method, device, terminal equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116304932B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3767392A1 (en) * | 2019-07-17 | 2021-01-20 | ASML Netherlands B.V. | Method and apparatus for determining feature contribution to performance |
CN112307472A (en) * | 2020-11-03 | 2021-02-02 | 平安科技(深圳)有限公司 | Abnormal user identification method and device based on intelligent decision and computer equipment |
US11080617B1 (en) * | 2017-11-03 | 2021-08-03 | Paypal, Inc. | Preservation of causal information for machine learning |
CN113470816A (en) * | 2021-06-30 | 2021-10-01 | 中国人民解放军总医院第一医学中心 | Machine learning-based diabetic nephropathy prediction method, system and prediction device |
CN114334036A (en) * | 2021-11-25 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Model training method, related device, equipment and storage medium |
CN114613497A (en) * | 2022-03-24 | 2022-06-10 | 北京交通大学 | Intelligent medical auxiliary diagnosis method of patient sample based on GBDT sample level |
WO2022126448A1 (en) * | 2020-12-16 | 2022-06-23 | 华为技术有限公司 | Neural architecture search method and system based on evolutionary learning |
CN115510981A (en) * | 2022-09-29 | 2022-12-23 | 中银金融科技(苏州)有限公司 | Decision tree model feature importance calculation method and device and storage medium |
CN115577357A (en) * | 2022-10-08 | 2023-01-06 | 重庆邮电大学 | Android malicious software detection method based on stacking integration technology |
CN115937698A (en) * | 2022-09-29 | 2023-04-07 | 华中师范大学 | Self-adaptive tailing pond remote sensing deep learning detection method |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1590469A4 (en) * | 2002-11-12 | 2005-12-28 | Becton Dickinson Co | Diagnosis of sepsis or sirs using biomarker profiles |
CN107622801A (en) * | 2017-02-20 | 2018-01-23 | 平安科技(深圳)有限公司 | The detection method and device of disease probability |
US20210241140A1 (en) * | 2020-02-05 | 2021-08-05 | The Florida International University Board Of Trustees | Hybrid methods and systems for feature selection |
CN112052875A (en) * | 2020-07-30 | 2020-12-08 | 华控清交信息科技(北京)有限公司 | Method and device for training tree model |
US20220374783A1 (en) * | 2021-05-11 | 2022-11-24 | Paypal, Inc. | Feature selection using multivariate effect optimization models |
2023-05-19: CN application CN202310566164.XA filed; granted as CN116304932B (status: Active)
Non-Patent Citations (1)
Title |
---|
Application of multi-layer gradient boosting trees in drug identification; Du Shishuai et al.; Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》); Vol. 14, No. 02; pp. 260-273 *
Also Published As
Publication number | Publication date |
---|---|
CN116304932A (en) | 2023-06-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111181939B (en) | Network intrusion detection method and device based on ensemble learning | |
CN102024455B (en) | Speaker recognition system and method | |
CN106653001A (en) | Baby crying identifying method and system | |
WO2019200782A1 (en) | Sample data classification method, model training method, electronic device and storage medium | |
CN109918498B (en) | Problem warehousing method and device | |
CN111967392A (en) | Face recognition neural network training method, system, equipment and storage medium | |
JP7024515B2 (en) | Learning programs, learning methods and learning devices | |
CN111785384A (en) | Abnormal data identification method based on artificial intelligence and related equipment | |
WO2022121163A1 (en) | User behavior tendency identification method, apparatus, and device, and storage medium | |
CN110276621A (en) | Data card is counter to cheat recognition methods, electronic device and readable storage medium storing program for executing | |
CN109656878B (en) | Health record data generation method and device | |
CN111259924A (en) | Boundary synthesis, mixed sampling, anomaly detection algorithm and data classification method | |
CN109858518A (en) | A kind of large data clustering method based on MapReduce | |
CN105224954B (en) | It is a kind of to remove the topic discovery method that small topic influences based on Single-pass | |
CN116304932B (en) | Sample generation method, device, terminal equipment and medium | |
CN114494263B (en) | Medical image lesion detection method, system and equipment integrating clinical information | |
Parmar et al. | A review on data balancing techniques and machine learning methods | |
CN113569957A (en) | Object type identification method and device of business object and storage medium | |
CN112989284A (en) | SAMME algorithm-based data noise detection method, system and equipment | |
CN111723700A (en) | Face recognition method and device and electronic equipment | |
Adeniyi et al. | Automatic Classification of Breast Cancer Histopathological Images Based on a Discriminatively Fine-Tuned Deep Learning Model | |
CN112328787B (en) | Text classification model training method and device, terminal equipment and storage medium | |
CN115358157B (en) | Prediction analysis method and device for litter size of individual litters and electronic equipment | |
CN117708569B (en) | Identification method, device, terminal and storage medium for pathogenic microorganism information | |
CN113361497B (en) | Intelligent tail box application method and device based on training sample fingerprint identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||