CN114360653A

CN114360653A - Sample generation and survival evaluation method and device based on data genetic variation

Info

Publication number: CN114360653A
Application number: CN202111551408.4A
Authority: CN
Inventors: 郑乐
Original assignee: Sichuan XW Bank Co Ltd
Current assignee: Sichuan XW Bank Co Ltd
Priority date: 2021-12-17
Filing date: 2021-12-17
Publication date: 2022-04-15

Abstract

The invention discloses a sample generation and survival evaluation method and device based on data genetic variation, belonging to the technical field of computers. The technical scheme comprises the following steps: preparing parent samples, crossing the parent samples, performing feature inheritance, performing feature variation, performing one-stage static survival evaluation, performing two-stage static survival evaluation, performing dynamic survival evaluation, and performing business modeling; and repeating these steps to generate a child sample set S2, on whose child samples static and dynamic survival evaluation are performed to determine elimination or retention. The method aims to obtain enough samples for modeling in the early stage of a business, so that the resulting model performs better, business risk is reduced, and the profitability of a financial institution is improved.

Description

Sample generation and survival evaluation method and device based on data genetic variation

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a sample generation and survival evaluation method and device based on data genetic variation.

Background

Machine learning is now widely applied in finance, communications, healthcare, transportation, e-commerce and other fields. Many new businesses, however, go through a cold-start stage at launch, during which no samples or only a small number of samples are available, so it is difficult to build a model with machine learning methods.

The most typical case is how a financial institution, at the start of a target business scenario, builds a scoring model that distinguishes good clients from bad ones using only a small amount of client information. The methods generally adopted for this situation at present are as follows:

1. Expanding with samples from similar businesses. For example, the goal is to build a scoring model for a car loan business, but there are not enough samples, so samples from another similar business, such as consumer credit, are selected as a supplement to expand the original car loan samples. This can relieve the sample shortage, but which businesses count as similar and which samples to select are usually decided by manual rules, which introduces uncertainty; the actual feature space distribution of the 'similar customer group' may differ greatly from that of the target customer group, and the effect cannot be guaranteed.

2. Sample resampling, such as random oversampling and random undersampling. This addresses sample imbalance; it does not add new samples, but copies or removes samples through sampling to balance the proportions of positive and negative training samples.

3. The SMOTE method. Its processing flow is as follows: a point in the feature space of the minority class is selected and its k nearest neighbors are found; one of these neighbors is selected, and a synthetic point is placed anywhere on the line segment connecting the point under consideration and the selected neighbor. The synthesized points essentially never fall outside the feature space range of the minority class samples, so if the feature space of the minority class changes greatly over time, the effect of SMOTE is usually reduced.

4. Rejection inference, a sample expansion approach that is common in the financial credit field. When a client is rejected after applying for a loan, the financial institution has the client's feature information but cannot know the client's future label performance, so the label must be reasonably inferred through rejection inference. There are many rejection inference methods, their effect varies by scenario, and no generally effective scheme is known.
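For reference, the SMOTE interpolation step described in item 3 above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the reference implementation of any package; the function name `smote_point` and the toy data are assumptions introduced here.

```python
import math
import random


def smote_point(point, minority, k=3, rng=random):
    """Return one synthetic point on the segment between `point`
    and one of its k nearest minority-class neighbours."""
    # Rank the other minority samples by Euclidean distance to `point`.
    others = [p for p in minority if p is not point]
    others.sort(key=lambda p: math.dist(point, p))
    neighbour = rng.choice(others[:k])
    # Choose a random position t in [0, 1) along the connecting segment.
    t = rng.random()
    return [a + t * (b - a) for a, b in zip(point, neighbour)]


# Toy minority class: the four corners of the unit square (hypothetical data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_point(minority[0], minority, k=2, rng=random.Random(0))
```

Because every synthetic point lies between two existing minority points, it cannot leave the region already covered by the minority class, which is the limitation noted in item 3.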

Disclosure of Invention

The invention provides a sample generation and survival evaluation method and device based on data genetic variation, aiming to solve the problem in the prior art that, when samples are lacking or only a small number of samples are available, it is difficult to build a model with machine learning methods.

The technical scheme adopted by the invention is as follows:

a sample generation and survival assessment method based on data genetic variation specifically comprises the following steps:

parent sample preparation: acquiring an initial parent sample according to a service scene;

parent sample crossover: combining various initial parent samples, and crossing to obtain labels of the child samples;

characteristic inheritance: determining the characteristics of the filial generation samples, analogizing the characteristic vectors Xi to a DNA chain, setting a genetic coefficient H, randomly selecting H x i characteristic genes from the characteristic space Xi of the two parental generation samples for value exchange, generating two new filial generation characteristic vectors Zi, and keeping the length of the characteristic vectors of the filial generation samples consistent with that of the parental generation samples;

characteristic variation: acquiring the feature vectors Zi of the child samples, respectively setting variation coefficients V according to different features, and performing value transformation on the child feature vectors Zi exchanged in the step of feature inheritance according to the variation coefficients V to obtain final feature vectors Zi' of the child samples and a child sample set S1;

static survival assessment: a large child sample set S1 is obtained through different crossover and variation combinations, and static survival evaluation is performed on the child samples by multi-classifier voting: the classifier pass rate is calculated from the voting results of a plurality of classifiers, child samples with a low pass rate are eliminated, and the retained child sample set becomes S1';

dynamic survival assessment: for the child samples retained after the static survival evaluation, dynamic survival evaluation over a plurality of time windows is performed by multi-classifier voting: the pass rate in each time window is calculated from the voting results of a plurality of classifiers, the pass rate of the child samples is evaluated window by window, child samples that do not reach the required number of survival rounds are eliminated, and the retained child sample set becomes S1'';

business modeling: in the business modeling stage, for the child samples remaining after the two rounds of survival evaluation, a survival round count and a pass rate threshold are set according to the specific scenario, and the samples that satisfy them are added to the modeling samples for business modeling;

the steps from parent sample preparation to business modeling are repeated to generate a child sample set S2, and the static survival evaluation and the dynamic survival evaluation are performed on the child samples in S2 to determine elimination or retention.

With this technical scheme, setting the genetic coefficient greatly expands the randomness of feature crossover in the child samples and thus enriches the ways in which child samples are generated; setting feature variation and the variation coefficient further expands the randomness and diversity of the child sample features. Through the idea of genetic variation, samples similar to real ones are simulated, and a business model can be built by combining them with the real samples already obtained, so that the model learns a more comprehensive underlying distribution, which greatly helps future business expansion. Samples are generated from the data of the business itself and do not depend on other similar businesses, which solves the problem that a similar business does not exist or cannot be accurately defined. The generated child samples are more consistent in distribution with the original parent customer group, so the subsequent business model performs better. The method works not only at the level of the data algorithm but can also incorporate prior business experience; the generated samples are both diverse and reasonable, and the randomness of sample generation is constrained by the static and dynamic survival evaluations, so the method is more reasonable and effective than a purely algorithmic construction.

Optionally, the coefficient of variation setting is to satisfy the following two basic principles:

1) range rationality principle: for a ratio-type feature whose original value lies between 0 and 1, the value after variation must also lie between 0 and 1;

2) distribution retention principle: for a feature that follows a normal distribution, the standard deviation sigma and the mean mu of the feature are calculated from the parent samples; since the probability that a value falls within the interval mu +/- 3 sigma is known to be 99.73%, the value of the feature in the child samples after variation should lie within mu +/- 3 sigma, which ensures that the feature values of the whole sample population do not deviate from the normal distribution.

Based on the above, through setting the characteristic variation and the variation coefficient, the randomness and the diversity of the child samples on the characteristics can be further expanded, and the generation modes of the child samples are further enriched.

Optionally, the static survival assessment is divided into two stages, specifically:

one-stage static survival assessment: namely, a combination rule is set according to the prior experience of the feature vector to eliminate part of samples with obvious defects;

two-stage static survival assessment: after one stage of survival evaluation, the remaining child samples need to be evaluated through a model, the classifier passing rate is calculated by combining voting results of a plurality of classifiers, a removal threshold value is set, child sample groups with low passing rates are removed, the passing sample groups and the corresponding classifier passing rates are reserved, the classifier passing rates can be used as T1 stage survival weights of the sample groups, and the remaining child sample sets are S1' after T1 stage survival evaluation.

Optionally, the model evaluation method is as follows: the D1 parent samples are divided into a modeling data set train1 and a test data set test1; a plurality of classifiers are trained on train1 and evaluated on test1, using the common binary classification metric AUC as the evaluation index, and the AUC results of the plurality of classifiers serve as the baseline. The S1 child samples are then divided equally into M groups; each group is combined with the train1 samples, the plurality of classifiers are retrained, and AUC is again calculated on test1. Finally, the AUC of each classifier before and after adding the child samples is compared: if the AUC improves after the child samples are added, the classifier casts a vote of 1, otherwise 0. The classifier pass rate is calculated from the voting results of the plurality of classifiers, an elimination threshold is set, child sample groups with a low pass rate are eliminated, and the passing sample groups and their classifier pass rates are retained; the child sample set remaining after the T1-stage survival evaluation is S1'.

Optionally, the classifier pass rate is used as the T2-stage survival weight of the child samples that pass, and the survival weight is used as the sample weight in modeling.

In addition, to achieve the above object, the present invention provides a sample generation and survival evaluation device based on genetic variation of data, the device comprising:

a service data storage module: used for receiving and storing business data and providing the initial parent samples;

a mass sample generation module: used for receiving the initial parent samples and generating a mass of child samples according to the parameters and rules provided by the rule configuration module;

a rule configuration module: used for visually configuring the genetic coefficient, the variation coefficient and the static survival rules;

a survival model operation module: used for training and deploying a plurality of survival model classifiers, performing static and dynamic survival evaluation on the child samples, and outputting the sample survival results and the corresponding survival weights;

a sample retention module: used for storing the samples that pass the survival evaluation and the corresponding survival weights;

a business model operation module: used for fusing the child samples with the parent samples, modeling and deploying the model, and applying the model results to the business.

Based on the device, the service channel acquires business-related data and stores it in the service data storage module, which provides the parent sample data; the rule configuration module configures the genetic coefficient, the variation coefficient and the static survival rules; and the survival model operation module deploys a plurality of classification models for static and dynamic survival evaluation. The service data storage module, the rule configuration module and the survival model operation module work together, and the retained samples enter the sample retention module. The sample retention module and the service data storage module jointly provide the samples used to build the business model, which is deployed in the business model operation module and finally applied to the service channel to influence the business.

In addition, in order to achieve the above object, the present invention further provides a computer device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete communication with each other through the communication bus;

a memory for storing a computer program;

the processor is used for realizing the sample generation and survival evaluation method based on the data genetic variation when executing the program stored in the memory.

In addition, to achieve the above object, the present invention further provides a storage medium, which stores instructions that, when executed on a computer, cause the computer to execute the above method for generating a sample based on genetic variation of data and evaluating survival.

In addition, to achieve the above object, the present invention also provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the above method for generating a sample based on genetic variation of data and evaluating survival.

Due to the adoption of the technical scheme, the invention has the beneficial effects that:

1) different from 'similar service sample expansion' in the prior art, the technical scheme completely uses the data of the service to generate the sample, does not depend on other similar services, and solves the problem that the similar service does not exist or cannot be accurately defined. The generated offspring samples are higher in distribution consistency with the original parent customer groups, and the effect of establishing a subsequent business model is better.

2) Unlike the 'sample sampling' of the prior art, the original parent sample is not simply copied and reduced, but a completely new sample is formed through the idea of genetic variation, and a progeny sample which is not identical to the original parent sample is generated. Due to the limitation of business expansion, all real samples on the market cannot be obtained, and the scheme simulates some similar real samples through the idea of genetic variation, can establish a business model by combining the obtained real samples, learns more comprehensive potential distribution and is greatly helpful for future business expansion.

3) Different from the 'SMOTE method' in the prior art, the sample generation method in the technical scheme not only processes from a data algorithm, but also can integrate the prior experience of business, and the generated sample has diversity and rationality. And through static and dynamic survival assessment, the randomness of sample generation is limited. Through the combination of the algorithm, the expert experience and the survival assessment, the whole scheme forms a closed loop, and the constructed offspring sample is more reasonable and more effective than a pure algorithm construction mode.

4) Compared with 'rejection inference' in the prior art, the present scheme takes the rejected samples as parent samples and crosses them with positive and negative samples whose labels are known, generating new child samples with definite labels, whereas rejection inference infers the labels of the unlabeled parent samples themselves. The two approaches are entirely different in thinking, and the present scheme can serve as a novel rejection inference method.

Drawings

The invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 is a flowchart illustrating steps of a method for generating a sample based on genetic variation of data and evaluating survival according to the present invention;

FIG. 2 is a flowchart of a method for generating a sample based on genetic variation of data and evaluating survival according to example 1;

FIG. 3 is a block diagram of an apparatus for generating a sample based on genetic variation and evaluating survival according to example 2;

FIG. 4 is a block diagram of a computer device provided in embodiment 3.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

Example 1:

as shown in fig. 1, in one aspect, the present invention provides a method for generating a sample based on genetic variation of data and evaluating survival, which may specifically include the following steps:

S101, parent sample preparation: according to the business scenario, the initial parent samples are obtained, where the time window of the initial parent samples is T1 and the feature space is Xi (i is the dimension of the feature vector, which is usually high-dimensional). The labels of the initial parent samples fall into three categories: positive, negative, and indeterminate. The total number of initial parent samples is D1, of which Dp are positive, Dn are negative and Du are indeterminate, so that Dp + Dn + Du = D1.
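As a non-limiting illustration, the parent sample set of step S101 could be represented as below. The `Sample` class, its field names and the toy feature values are assumptions made for this sketch and do not appear in the original disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    features: List[float]  # feature vector Xi
    label: str             # "positive", "negative" or "indeterminate"
    weight: float = 1.0    # real (parent) samples carry survival weight 1


def split_by_label(samples):
    """Group the parent samples into the three label categories."""
    groups = {"positive": [], "negative": [], "indeterminate": []}
    for sample in samples:
        groups[sample.label].append(sample)
    return groups


# Hypothetical parent samples for time window T1.
parents = [
    Sample([0.2, 35.0], "positive"),
    Sample([0.9, 52.0], "negative"),
    Sample([0.5, 41.0], "indeterminate"),
    Sample([0.1, 29.0], "positive"),
]
groups = split_by_label(parents)
d_p = len(groups["positive"])
d_n = len(groups["negative"])
d_u = len(groups["indeterminate"])
```

Here `d_p + d_n + d_u` equals `len(parents)`, mirroring Dp + Dn + Du = D1.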

S102, parent crossing: combining various types of initial parent samples, and crossing to obtain labels of the child samples. The specific crossing scheme can be various, such as: positive-positive crosses result in positive tags, positive-indeterminate crosses result in positive tags, negative-indeterminate crosses result in negative tags, and negative-negative crosses result in negative tags.
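The example crossing scheme of step S102 can be sketched as a lookup table, as a minimal illustration. The names `CROSS_LABEL`, `child_label` and `cross_pairs` are hypothetical. Pairings that the text does not list (for instance positive with negative) are simply skipped here, which is an assumption of this sketch and not a rule stated in the disclosure.

```python
import itertools

# Child label for each unordered pair of parent labels listed in the text.
CROSS_LABEL = {
    frozenset(["positive"]): "positive",                   # positive x positive
    frozenset(["positive", "indeterminate"]): "positive",  # positive x indeterminate
    frozenset(["negative", "indeterminate"]): "negative",  # negative x indeterminate
    frozenset(["negative"]): "negative",                   # negative x negative
}


def child_label(label_a, label_b):
    """Return the child label, or None if the pairing is not in the scheme."""
    return CROSS_LABEL.get(frozenset([label_a, label_b]))


def cross_pairs(parent_labels):
    """Yield (index_a, index_b, child_label) for every allowed parent pair."""
    indexed = enumerate(parent_labels)
    for (i, a), (j, b) in itertools.combinations(indexed, 2):
        label = child_label(a, b)
        if label is not None:
            yield i, j, label


parent_labels = ["positive", "negative", "indeterminate", "positive"]
pairs = list(cross_pairs(parent_labels))
```

Using a `frozenset` as the key makes the table symmetric, so the order in which the two parents are given does not matter.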

S103, characteristic inheritance: after the combination and label generation in step S102, the next step is to determine the features of the child samples. Following the idea of DNA gene crossover, the feature vector Xi is treated as a DNA chain and a genetic coefficient H (with a value between 0 and 1) is set; H × i feature genes are randomly selected in the feature space Xi of the two parent samples and their values are exchanged, generating two new child sample feature vectors Zi whose length is the same as that of the parents. By treating the feature vector as a DNA chain and setting the genetic coefficient, the randomness of feature crossover in the child samples can be greatly expanded, which enriches the ways in which child samples are generated.
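The value exchange of step S103 can be sketched as follows, as one possible reading of the step. The function name `inherit` is hypothetical, and rounding H × i to the nearest integer is an assumption of this sketch, since the text does not say how a fractional count is handled.

```python
import random


def inherit(parent_a, parent_b, h, rng=random):
    """Exchange the values at round(h * i) randomly chosen feature
    positions between two parents and return the two children."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same feature length")
    i = len(parent_a)
    n_swap = round(h * i)  # assumption: nearest-integer rounding
    positions = rng.sample(range(i), n_swap)
    child_a, child_b = list(parent_a), list(parent_b)
    for pos in positions:
        child_a[pos], child_b[pos] = parent_b[pos], parent_a[pos]
    return child_a, child_b


# Hypothetical parents with six features each; H = 0.5 exchanges three genes.
a = [1, 2, 3, 4, 5, 6]
b = [10, 20, 30, 40, 50, 60]
z1, z2 = inherit(a, b, h=0.5, rng=random.Random(42))
```

Each child keeps the parent's feature length, and at every position the two children together still hold exactly the two parental values.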

S104, characteristic variation: step S103 yields the feature vectors Zi of the child samples. A variation coefficient V is set separately for each feature, and the values of the child feature vectors Zi exchanged in step S103 are transformed according to V to obtain the final feature vectors Zi' of the child samples; the resulting child sample set is S1. Setting the variation coefficient requires the support of prior experience, and to keep the values reasonable after variation the following two basic principles must be met. 1. The range rationality principle: for example, for a ratio-type feature whose original value lies between 0 and 1, the value after variation must also lie between 0 and 1. 2. The distribution retention principle: for features with a definite meaning, such as a person's height, prior knowledge indicates a normal distribution. According to the theory of the normal distribution, the standard deviation sigma and the mean mu of height are calculated from the parent samples; the probability that a height value falls within the interval mu +/- 3 sigma is known to be 99.73%, so the height of a child sample after variation should lie within mu +/- 3 sigma, which ensures that the height feature of the whole sample population does not deviate from the normal distribution. Setting feature variation and the variation coefficient further expands the randomness and diversity of the child sample features and further enriches the ways in which child samples are generated.
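The two principles of step S104 can be sketched as below. The text does not specify how the variation coefficient V transforms a value; this sketch assumes a random relative perturbation of at most ±V followed by clamping to the permitted range, and uses the sample standard deviation. The function names and numbers are hypothetical.

```python
import random
import statistics


def mutate(value, v, low, high, rng=random):
    """Perturb `value` by a random relative amount in [-v, v]
    (an assumed form of variation), then clamp it to [low, high]."""
    mutated = value * (1.0 + rng.uniform(-v, v))
    return min(max(mutated, low), high)


def three_sigma_bounds(parent_values):
    """Return (mu - 3*sigma, mu + 3*sigma) estimated from the parents."""
    mu = statistics.mean(parent_values)
    sigma = statistics.stdev(parent_values)  # sample standard deviation
    return mu - 3 * sigma, mu + 3 * sigma


rng = random.Random(7)

# Range rationality: a ratio feature must stay within [0, 1].
ratio_child = mutate(0.95, v=0.2, low=0.0, high=1.0, rng=rng)

# Distribution retention: height must stay within mu +/- 3 sigma.
parent_heights = [160.0, 165.0, 170.0, 175.0, 180.0]
low_h, high_h = three_sigma_bounds(parent_heights)
height_child = mutate(178.0, v=0.1, low=low_h, high=high_h, rng=rng)
```

Clamping is only one way to enforce the bounds; resampling until the value falls inside the interval would be an alternative that avoids piling values up at the edges.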

S105, one-stage static survival assessment: through steps S102 to S104, a large child sample set S1 can be obtained from different crossover and variation combinations, and survival evaluation is performed on the child samples to determine elimination or retention. Static survival evaluation is divided into two stages. The first stage is rule evaluation: combination rules are set according to prior experience with the feature vector to eliminate samples with obvious defects. For example, if the features include age and education level, a child sample with an age of 3 and a doctoral degree must be eliminated directly.
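A rule of the kind described in step S105 can be sketched as follows. The minimum ages in `MIN_AGE_FOR_EDUCATION` are hypothetical thresholds chosen for illustration; in practice such rules would come from business experience.

```python
# Hypothetical prior-experience rule: minimum plausible age per education level.
MIN_AGE_FOR_EDUCATION = {"none": 0, "bachelor": 18, "doctorate": 22}


def passes_rules(child):
    """Return True if the age is plausible for the stated education level."""
    return child["age"] >= MIN_AGE_FOR_EDUCATION.get(child["education"], 0)


children = [
    {"age": 3, "education": "doctorate"},   # obviously defective combination
    {"age": 30, "education": "doctorate"},
    {"age": 20, "education": "bachelor"},
]
survivors = [child for child in children if passes_rules(child)]
```

The first child sample is eliminated, and the other two proceed to the model-based second stage.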

S106, two-stage static survival assessment: after the one-stage static survival evaluation, the remaining child samples need to be evaluated by models. The model evaluation method is as follows: the D1 parent samples are divided into a modeling data set train1 and a test data set test1, and a plurality of classifiers are trained on train1; the classifiers may be linear models such as logistic regression, or complex models such as lightgbm. The trained classifiers are evaluated on the test set test1; the commonly used binary classification metric AUC can be chosen as the evaluation index, and the AUC results of the plurality of classifiers serve as the baseline. Then the child samples in the child sample set S1 are divided equally into M groups; each group is combined with the train1 samples, the plurality of classifiers are retrained, and AUC is calculated on test1. Finally, the AUC of each classifier before and after adding the child samples is compared: if the AUC improves after the child samples are added, the classifier casts a vote of 1, otherwise 0. The classifier pass rate, namely the mean of the classifier votes, is calculated from the voting results of the plurality of classifiers. An elimination threshold is set, child sample groups with a low pass rate are eliminated, and the passing sample groups and their classifier pass rates are retained; the pass rate can be used as the T1-stage survival weight of the sample group. The child sample set remaining after the T1-stage survival evaluation is S1'.
The two-stage static survival evaluation uses multi-classifier voting, which ensures that the child samples are suited to each type of model algorithm in the subsequent business modeling. The survival weight describes how similar the child samples are to the parent samples and serves as the weight of the child samples in model training during the subsequent business modeling.
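The voting logic of step S106 can be sketched as below, assuming the per-classifier AUC values have already been computed elsewhere (classifier training is not shown). All AUC numbers are made-up placeholders for illustration only, not results; the function names and the use of ">=" against the threshold are assumptions of this sketch.

```python
def split_into_groups(children, m):
    """Divide the child samples into m groups of near-equal size."""
    return [children[g::m] for g in range(m)]


def pass_rate(baseline_auc, augmented_auc):
    """Mean of the classifier votes: a classifier votes 1 if its AUC on the
    test set improved after the child group was added, otherwise 0."""
    votes = [
        1 if augmented_auc[name] > baseline_auc[name] else 0
        for name in baseline_auc
    ]
    return sum(votes) / len(votes)


def static_survival(group_aucs, baseline_auc, threshold):
    """Return {group_id: pass_rate} for the groups that are retained."""
    kept = {}
    for group_id, aucs in group_aucs.items():
        rate = pass_rate(baseline_auc, aucs)
        if rate >= threshold:  # assumption: ties are retained
            kept[group_id] = rate
    return kept


# Placeholder AUCs on test1 (illustrative values, not measured results).
baseline = {"logistic": 0.70, "linear": 0.68, "lightgbm": 0.74}
group_aucs = {
    0: {"logistic": 0.72, "linear": 0.69, "lightgbm": 0.73},  # 2 of 3 improve
    1: {"logistic": 0.69, "linear": 0.67, "lightgbm": 0.75},  # 1 of 3 improves
}
kept = static_survival(group_aucs, baseline, threshold=0.5)
groups = split_into_groups(list(range(6)), 3)
```

With these placeholder values only group 0 is retained, and its pass rate of 2/3 would become its T1-stage survival weight.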

S107, dynamic survival assessment: after the two-stage static survival evaluation, the adaptability of the child samples in the remaining child sample set S1' to the T1 time window has been effectively evaluated. As the business continues to develop, time rolls forward to the T2 window and new parent samples D2 appear. Referring to step S106, the D2 samples are likewise divided into a training set train2 and a test set test2; the plurality of classifiers are trained on train2 in the same way, and the AUC of each classifier is obtained on test2. The child samples in S1' are then combined, in their previous groups, with the train2 samples, the same classifiers are retrained, and AUC is calculated on test2. Finally, the AUC of each classifier before and after adding the child samples is compared: if the AUC improves after the child samples are added, the classifier casts a vote of 1, otherwise 0. The classifier pass rate is calculated from the voting results of the plurality of classifiers, samples with a low pass rate are eliminated, and the passing samples and their classifier pass rates are retained; the classifier pass rate can be used as the T2-stage survival weight of the passing samples. The child sample set generated at T1 that remains after the T2-stage survival evaluation becomes S1''. Because the business changes over time, dynamic survival evaluation solves the problem that child samples may fit the business environment of only a certain period; with the child samples that pass this test added, the business model generalizes better.
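The window-by-window elimination of step S107 can be sketched as follows, assuming the pass rate of each child sample group in each time window is already known. The names `survival_rounds` and `dynamic_survival`, and the choice to stop counting at the first failed window, are assumptions of this sketch; the pass rates are placeholders.

```python
def survival_rounds(window_rates, threshold):
    """Count how many consecutive time windows, starting from the first,
    the group's pass rate meets the threshold."""
    rounds = 0
    for rate in window_rates:
        if rate < threshold:
            break  # assumption: a group is eliminated at its first failure
        rounds += 1
    return rounds


def dynamic_survival(history, threshold, min_rounds):
    """Keep the groups that survive at least `min_rounds` windows."""
    return {
        group_id: rates
        for group_id, rates in history.items()
        if survival_rounds(rates, threshold) >= min_rounds
    }


# Placeholder pass rates per group for windows T1, T2, T3.
history = {
    0: [2 / 3, 1.0, 2 / 3],  # survives all three windows
    1: [1.0, 1 / 3, 1.0],    # fails at T2
    2: [2 / 3, 2 / 3, 0.0],  # fails at T3
}
kept = dynamic_survival(history, threshold=0.5, min_rounds=2)
```

Groups 0 and 2 are retained because each passes at least two consecutive windows; group 1 is eliminated at T2.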

S108, business modeling: step S107 is repeated so that the child samples in the child sample set S1 undergo dynamic evaluation in a plurality of time windows; the more windows a child sample passes, the stronger its survival ability. In the business modeling stage, the required number of survival rounds and the pass rate threshold can be set flexibly, and the samples that pass the survival evaluation are added to the modeling samples for business modeling. In addition, the survival weight can be used as the sample weight in modeling, with the survival weight of real samples naturally being 1; different sample weights also influence the training effect of the model.
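Assembling the weighted modeling set of step S108 can be sketched as below. Which window's pass rate serves as the survival weight is not fixed by the text; this sketch assumes one weight per group is supplied in `pass_rates`. The function name and the toy data are hypothetical.

```python
def build_training_set(real_samples, child_groups, pass_rates, rounds,
                       min_rounds, min_rate):
    """Return (features, label, weight) rows: real samples with weight 1,
    qualifying child samples with their survival weight."""
    rows = [(x, y, 1.0) for x, y in real_samples]
    for group_id, samples in child_groups.items():
        survived = rounds[group_id] >= min_rounds
        strong = pass_rates[group_id] >= min_rate
        if survived and strong:
            weight = pass_rates[group_id]
            rows.extend((x, y, weight) for x, y in samples)
    return rows


# Toy data: (features, label) pairs; all values are placeholders.
real_samples = [([0.2], 1), ([0.8], 0)]
child_groups = {0: [([0.3], 1)], 1: [([0.7], 0)]}
pass_rates = {0: 2 / 3, 1: 1.0}  # assumed survival weight per group
rounds = {0: 3, 1: 1}            # survival rounds per group

rows = build_training_set(real_samples, child_groups, pass_rates, rounds,
                          min_rounds=2, min_rate=0.5)
weights = [w for _, _, w in rows]
```

Group 1 has a high pass rate but too few survival rounds, so only the child sample from group 0 joins the two real samples. The `weights` list could then be supplied to a learner that accepts per-sample weights.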

S109: in addition to the dynamic survival evaluation of the child samples in the child sample set S1, the parent samples D2 of the T2 stage can also generate a child sample set S2 according to steps S101 to S107, which undergoes static evaluation at T2 and dynamic evaluation at the subsequent T3, T4, … to determine elimination or retention.

The sample generation and survival evaluation method provided by the embodiment can be applied to the following two scenes:

Scene one: a business cold-start scenario. Taking a credit scenario in the financial field as an example, when a credit business has just started, risk control can generally only derive some rules from expert experience or from analysis of a small amount of data. A model can be built with machine learning methods only once enough samples have accumulated. Moreover, a performance period is needed after a borrower takes a loan before it can be determined whether the sample is overdue, which further lengthens sample accumulation; it usually takes 1-2 years to accumulate enough samples for modeling. If approximate samples can be generated through the idea of genetic variation, enough samples can be obtained early in the business, which helps reduce business risk and improve the profitability of the financial institution.

Scene two: anomaly detection scenarios. In most enterprises, abnormal cases are rare, such as malicious fake orders, scalper orders, credit card fraud, electricity theft and equipment failure; such data usually make up only a small part of the whole sample. In the case of credit card fraud, for example, the fraud rate of physical card transactions is generally within 0.1%. If low-proportion samples can be generated through the idea of genetic variation, the sample imbalance problem is relieved and the model performs better. The scheme thus addresses the difficulty of modeling in both of the above scenarios.

Example 2:

In another aspect, the present invention further provides an apparatus for implementing the sample generation and survival evaluation method based on data genetic variation. Referring to fig. 2, the apparatus includes:

a business data storage module: used for receiving and storing business data and providing the initial parent samples;

a mass sample generation module: used for receiving the initial parent samples and generating a large number of child samples according to the parameters and rules provided by the rule configuration module;

a survival model operation module: used for training and deploying a plurality of survival model classifiers, performing static and dynamic survival evaluation on the child samples, and outputting the sample survival results and the corresponding survival weights.

Example 3:

In another aspect, the present invention further provides a computer device, as shown in fig. 3. The computer device includes a memory, a processor, a communication interface and a communication bus; a computer program executable on the processor is stored in the memory, and the processor executes the computer program to implement the steps of the method of the above embodiments.

The processor may be a Central Processing Unit (CPU).

The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and units, such as the program units corresponding to the above-described method embodiments of the present invention. By executing the non-transitory software programs, instructions and modules stored in the memory, the processor performs its various functional applications and data processing, thereby implementing the method of the above method embodiments.

The memory may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be coupled to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The one or more units are stored in the memory and when executed by the processor perform the method of the above embodiments.

The specific details of the computer device may be understood by referring to the corresponding related descriptions and effects in the above embodiments, and are not described herein again.

Example 4:

In another aspect, the present invention further provides a computer storage medium storing instructions that, when executed on a computer, cause the computer to perform the sample generation and survival evaluation method according to any of the above embodiments. It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid state drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.

Example 5:

In yet another embodiment, a computer program product comprising instructions is provided which, when run on a computer, causes the computer to perform the sample generation and survival evaluation method of any of the above embodiments.

The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored on a storage medium or transmitted from one storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state drive (SSD)), among others.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A sample generation and survival assessment method based on data genetic variation is characterized by comprising the following steps:

dynamic survival evaluation: for the child samples retained after the static survival evaluation, performing dynamic survival evaluation over a plurality of time windows by multi-classifier voting, calculating the pass rate of each time window by combining the votes of the plurality of classifiers, evaluating the pass rate of the child samples in each time window in turn, eliminating the child samples that do not meet the required survival rounds, and denoting the retained set of passing child samples as S1'';

2. The method as claimed in claim 1, wherein the coefficient of variation V is set to satisfy the following two basic principles:

2) distribution retention principle: for features that follow a normal distribution, the standard deviation σ and the mean μ of the feature in the parent samples are calculated; since the probability that a feature value falls within the interval μ ± 3σ is known to be 99.73%, the values of the mutated child features are required to lie within μ ± 3σ, ensuring that the feature values of the whole sample population do not deviate from the normal distribution.
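The μ ± 3σ constraint can be sketched as follows. This is only an illustration under stated assumptions: the function names and the example feature values are hypothetical, the population standard deviation is used, and out-of-range mutated values are clipped to the nearest bound, whereas the claim only requires that the values lie within the interval (rejecting and re-drawing would satisfy it equally).

```python
import statistics
from typing import List, Tuple


def three_sigma_bounds(parent_values: List[float]) -> Tuple[float, float]:
    """Return (mu - 3*sigma, mu + 3*sigma) for one parent feature."""
    mu = statistics.mean(parent_values)
    sigma = statistics.pstdev(parent_values)  # assumption: population std dev
    return mu - 3 * sigma, mu + 3 * sigma


def constrain_mutation(mutated: float, parent_values: List[float]) -> float:
    """Clip a mutated feature value into the mu +/- 3*sigma interval."""
    low, high = three_sigma_bounds(parent_values)
    return min(max(mutated, low), high)


# Illustrative parent feature values (mean 5, population std dev 2).
parent_feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
low, high = three_sigma_bounds(parent_feature)

inside = constrain_mutation(6.0, parent_feature)    # already within bounds
clipped = constrain_mutation(12.5, parent_feature)  # pulled back to the bound
```

For these example values the interval is [-1, 11], so a mutated value of 6.0 is kept unchanged and 12.5 is pulled back to 11.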

3. The method as claimed in claim 1, wherein the static survival assessment is divided into two stages, specifically:

4. The method of claim 3, wherein the model evaluation method comprises: dividing the D1 parent samples into a modeling data set train1 and a test data set test1; training a plurality of classifiers with train1 and evaluating the trained classifiers on test1, with the common binary-classification metric AUC selected as the evaluation index and the resulting AUC of each classifier taken as the baseline; equally dividing the S1 child samples into M groups, combining each group with the train1 samples, training the plurality of classifiers again, and calculating the AUC of each on test1; comparing each AUC with the baseline AUC of the corresponding classifier, wherein if the AUC improves after the child samples are added the classifier casts a vote of 1, and otherwise it does not; calculating the classifier pass rate by combining the votes of the plurality of classifiers; setting an elimination threshold, eliminating the child sample groups with a low pass rate, and retaining the passing sample groups and their corresponding classifier pass rates; and denoting the set of child samples remaining after the survival evaluation at stage T1 as S1'.
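The voting step of this model evaluation can be sketched as follows. The sketch assumes the AUC values have already been computed by training the classifiers elsewhere; the function names, the AUC numbers and the 0.6 threshold are hypothetical placeholders for illustration, not results from the invention.

```python
from typing import Dict, List


def classifier_pass_rate(baseline_auc: List[float],
                         augmented_auc: List[float]) -> float:
    """Share of classifiers whose AUC on test1 improved after adding a group."""
    votes = [1 if new > base else 0
             for base, new in zip(baseline_auc, augmented_auc)]
    return sum(votes) / len(votes)


def static_model_evaluation(baseline_auc: List[float],
                            group_auc: Dict[int, List[float]],
                            threshold: float) -> Dict[int, float]:
    """Keep the child groups whose pass rate reaches the elimination threshold.

    Returns a mapping from retained group id to its classifier pass rate.
    """
    retained: Dict[int, float] = {}
    for group_id, aucs in group_auc.items():
        rate = classifier_pass_rate(baseline_auc, aucs)
        if rate >= threshold:
            retained[group_id] = rate
    return retained


# Placeholder AUCs for three classifiers trained on train1 only (baseline)
# and on train1 plus each of M = 3 child groups.
baseline = [0.70, 0.72, 0.68]
groups = {
    0: [0.73, 0.74, 0.67],  # two of three classifiers improve
    1: [0.69, 0.71, 0.70],  # one of three improves
    2: [0.75, 0.73, 0.71],  # all three improve
}
s1_prime = static_model_evaluation(baseline, groups, threshold=0.6)
```

With these placeholder numbers groups 0 and 2 are retained and group 1 is eliminated. The retained pass rates are the values that claim 5 reuses as survival weights.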

5. The method as claimed in claim 4, wherein the classifier pass rate is used as the survival weight of the passing child samples at stage T2, and the survival weight is used as the sample weight in modeling.

6. An apparatus for implementing the method for generating a sample based on genetic variation of data and assessing survival as claimed in any one of claims 1 to 5, the apparatus comprising:

7. A computer device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any one of claims 1 to 5 when executing the computer program stored on the memory.

8. A storage medium on which a computer program is stored, which computer program, when being executed by a processor, carries out the method according to any one of claims 1 to 5.