CN116306543B - Form data generation method and system based on a generative adversarial network - Google Patents


Info

Publication number
CN116306543B
CN116306543B (application CN202310595962.5A)
Authority
CN
China
Prior art keywords
data
generator
adversarial network
regressor
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310595962.5A
Other languages
Chinese (zh)
Other versions
CN116306543A (en)
Inventor
李长林
陈燎
未伟
贾宁
崔润邦
孙洪贵
Current Assignee
Beijing Fantike Technology Co ltd
Tianjin University
Original Assignee
Beijing Fantike Technology Co ltd
Tianjin University
Priority date
Filing date
Publication date
Application filed by Beijing Fantike Technology Co ltd and Tianjin University
Priority to CN202310595962.5A
Publication of CN116306543A
Application granted
Publication of CN116306543B
Status: Active

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting
    • G06F40/177: Editing of tables; using ruled lines
    • G06F40/18: Editing of spreadsheets
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/088: Non-supervised learning, e.g. competitive learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a form data generation method and system based on a generative adversarial network. According to the technical scheme of the application, the method comprises the following steps: step S1) cleaning the data of the form to be generated; step S2) normalizing the cleaned data; step S3) inputting the normalized data into a pre-established, trained form data generation model to obtain the generated form data. The form data generation model is implemented on an improved generative adversarial network and comprises a generator and a regressor. The invention improves on the generative adversarial network by introducing a new component, a regressor, which converts the generator's output into the final generated data; in model training, a gradient penalty and random linear interpolation terms are introduced, which improve the model's learning speed and stability and avoid problems such as exploding gradients.

Description

Form data generation method and system based on a generative adversarial network
Technical Field
The present invention relates to the field of form data generation and, in particular, to a form data generation method and system based on a generative adversarial network.
Background
Form data is the most basic and common form of data; in engineering and in daily production a large amount of data exists as tables (for example, a bank's user information table or a table of the products each user holds). With the spread of informatization, more and more enterprises, researchers and managers choose to base management activities such as planning, organization, coordination, decision-making and control on data analysis. However, administrators who use machine learning on such data to make decisions run into problems of data quantity, quality, imbalance and privacy, which motivates the need to generate data.
Data quantity problem: in some fields data are not abundant, while the precondition for machine learning, and especially deep learning, to work well is a large amount of labeled data. If the original data can be expanded and enhanced, a better application effect can therefore be achieved with less original data.
Data quality problem: data quality problems are common today, such as outliers caused by erroneous entries in manually collected data. If the data distribution can be learned and the data then enhanced by sampling from that distribution, such quality problems can be better addressed.
Data imbalance problem: an imbalance of positive and negative samples, i.e. too little sample data of some class, causes many problems in downstream applications. Many practitioners respond by shrinking the data set; we consider data generation and enhancement the more fundamental way to solve the imbalance problem.
Data privacy problem: much data is sensitive information and, for privacy reasons, often difficult for researchers to access (or accessible only in small part). If "fake" data with the same statistical characteristics can be generated, this sensitivity problem is avoided.
Currently, form data generation mainly uses a statistical model or a deep learning model to learn the distribution of real data. Statistical methods fit a new tabular data set with a series of predefined probability distributions. For example, a Gaussian mixture model can model the joint distribution of several continuous columns, while Bayesian networks can model the joint distribution of discrete columns. However, this approach is severely limited by the data distribution and is not universally applicable when a data set mixes continuous and discrete columns; in that case the usual workaround is to discretize the continuous columns and then model them with a Bayesian network or a decision tree. Furthermore, statistical methods are computationally expensive, so such models are hard to apply to large data sets with thousands of columns and millions of rows. The other category is deep learning methods. The success of deep learning in computer vision and natural language processing has motivated many researchers to try it for form data generation. Deep models such as Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) can learn complex high-dimensional distributions and generate high-quality samples, and have been widely applied to images and text. It is therefore quite plausible that a model built on these ideas can learn the implicit distribution of form data and then sample rows from it to obtain high-quality tables.
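As an illustration of the statistical-model approach described above, the sketch below fits a Gaussian mixture to two continuous columns with scikit-learn and samples synthetic rows from it. The column values and parameters are invented for the example.

```python
# Illustrative only: fit a Gaussian mixture model to two continuous
# columns and sample synthetic rows, as in the statistical-model
# approach to table generation described in the text.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# A fake "real" table: two continuous columns drawn from two modes.
real = np.vstack([
    rng.normal([0.0, 10.0], 1.0, size=(500, 2)),
    rng.normal([5.0, -3.0], 1.0, size=(500, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(real)
synthetic, _ = gmm.sample(200)  # 200 generated rows
```

As the text notes, this works only while the predefined distribution family matches the data; mixed continuous/discrete tables defeat it.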
However, while GANs can in principle model arbitrary distributions, current GAN-based models often perform worse than simple statistical models on columns with special properties, such as non-Gaussian continuous columns or non-uniformly distributed discrete columns.
How to implement form data generation simply and universally is therefore a technical problem to be solved in the art.
Disclosure of Invention
In view of this, the present application proposes a form data generation method based on a generative adversarial network, so as to achieve a better generation effect for different types of columns. By adding a regressor targeted at the normalization of the different data types, the model generates all column types well and realizes form data generation simply and universally.
According to one aspect of the present application, there is provided a form data generation method based on a generative adversarial network, the method comprising: step S1) cleaning the data of the form to be generated; step S2) normalizing the cleaned data; step S3) inputting the normalized data into a pre-established, trained form data generation model to obtain the generated form data; the form data generation model is implemented on an improved generative adversarial network and comprises a generator and a regressor.
Preferably, step S1) specifically comprises: checking that every record has all of its attributes, and deleting records containing null values.
Preferably, the cleaned data in step S2) comprise discrete columns and continuous columns, and the normalization specifically comprises:
encoding the discrete columns with one-hot encoding; normalizing the continuous columns with a variational Gaussian mixture; and then splicing the results together.
Preferably, the one-hot normalization of the discrete columns specifically comprises:
for the $i$-th element $T_{i,c_j}$ of the $c_j$-th discrete column, applying one-hot encoding to obtain a vector $d_{i,c_j}\in\{0,1\}^{d}$, where $d$ is the total number of categories in that column.
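The one-hot step above can be sketched minimally in numpy; the category list here is hypothetical.

```python
# Minimal sketch of the one-hot step: the i-th element of a discrete
# column with d categories becomes a length-d 0/1 vector.
import numpy as np

def one_hot(value, categories):
    """Encode one cell of a discrete column; `categories` fixes d and the order."""
    vec = np.zeros(len(categories), dtype=np.float32)
    vec[categories.index(value)] = 1.0
    return vec

cert_types = ["id_card", "passport", "other"]  # hypothetical category list
v = one_hot("passport", cert_types)            # d = 3 here
```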
Preferably, the variational-Gaussian-mixture normalization of the continuous columns specifically comprises:
setting, from the cleaned data, the number $K$ of Gaussian components and the prior variance of the Gaussian distributions;
obtaining with the CAVI algorithm the weight $\eta_k$ with which the $k$-th Gaussian is selected, its mean $\mu_k$ and variance $\sigma_k^2$, and the variational probability density $\rho_k$;
building from these, for the element $T_{i,c_j}$ of the $c_j$-th continuous column, the Gaussian mixture model
$$P(T_{i,c_j}) = \sum_{k=1}^{K} \eta_k\, \mathcal N\!\left(T_{i,c_j};\, \mu_k,\, \sigma_k^2\right),$$
where $\mathcal N(\cdot;\mu_k,\sigma_k^2)$ denotes the Gaussian distribution with mean $\mu_k$ and variance $\sigma_k^2$;
sampling one Gaussian from the $K$ variational probability densities and using it to normalize the element, the result being denoted $v_{i,c_j}$.
Preferably, the splicing is specifically:
the spliced representation of the $i$-th row is the concatenation, over all columns, of the one-hot vectors $d_{i,c_j}$ of the discrete columns and the normalized values $v_{i,c_j}$ of the continuous columns, where the symbol $\oplus$ denotes the vector concatenation operation.
Preferably, the generator G comprises, connected in sequence, a convolution layer, a LeakyReLU activation, a fully connected layer and a Tanh activation;
the regressor R comprises, connected in sequence, a fully connected layer, a Tanh activation, a batch normalization layer, a fully connected layer and a Sigmoid activation.
Preferably, during training the form data generation model further comprises, besides the generator G and the regressor R, a discriminator D, which comprises, connected in sequence: a convolution layer, a LeakyReLU activation, a pooling layer, a Flatten layer, a fully connected layer and a Sigmoid activation.
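To make the regressor's layer ordering concrete, here is a numpy sketch of R's forward pass (fully connected, Tanh, batch normalization in its batch-statistics form, fully connected, Sigmoid). Weights are random placeholders and all dimensions are hypothetical; a real implementation would use a deep learning framework.

```python
# Numpy sketch of the regressor R's forward pass as described in the text.
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hidden, d_out = 16, 32, 16   # hypothetical layer sizes
W1, b1 = rng.normal(0, 0.1, (d_in, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(0, 0.1, (d_hidden, d_out)), np.zeros(d_out)
gamma, beta = np.ones(d_hidden), np.zeros(d_hidden)  # batch-norm scale/shift

def regressor_forward(x, eps=1e-5):
    h = np.tanh(x @ W1 + b1)                           # fully connected + Tanh
    mu, var = h.mean(axis=0), h.var(axis=0)            # batch statistics
    h = gamma * (h - mu) / np.sqrt(var + eps) + beta   # batch normalization
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))        # fully connected + Sigmoid

out = regressor_forward(rng.normal(size=(8, d_in)))  # a batch of 8 generator outputs
```

The Sigmoid output keeps every generated component in (0, 1), matching the normalized encoding of the table rows.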
Preferably, the method further comprises a step of training the form data generation model, specifically:
step T1) establishing a training set;
step T2) setting the gradient-penalty coefficient $\lambda$, the number $n_d$ of discriminator iterations per generator iteration, the batch size $b$, the Adam hyperparameters $\alpha,\beta_1,\beta_2$ and the number of training epochs $E$;
step T3) traversing the $n_d$ iterations, updating the parameters of the discriminator D in each iteration;
step T4) updating the generator G once the $n_d$ iterations are done; if fewer than $E$ epochs have been trained, returning to step T3) to continue training; otherwise taking the trained generator and going to step T5);
step T5) traversing $E$ epochs while updating the parameters of the regressor R to obtain the trained regressor R; the trained generator G and regressor R then form the trained form data generation model.
Preferably, step T1) specifically comprises:
selecting data comprising basic user information, product holding information, asset information and/or transaction-flow information, and establishing the training set after cleaning and normalization.
Preferably, step T3) specifically comprises:
traversing the $n_d$ iterations and, in each iteration:
establishing the standard normal distribution $N(0,1)$, with mathematical expectation 0 and standard deviation 1, and drawing $b$ latent-vector samples $z_1,\dots,z_b$ from it;
taking $b$ data samples $x_1,\dots,x_b$ from the training set;
taking random numbers $\varepsilon$ in the range $[0,1]$;
traversing the $b$ latent vectors, taking one sample $z_i$ at a time, $i=1,\dots,b$, and obtaining the corresponding embedded vector $\tilde x_i = G(z_i)$ through the generator G;
obtaining the interpolation data by random linear interpolation, $\hat x_i = \varepsilon x_i + (1-\varepsilon)\tilde x_i$;
inputting the training samples $x_i$, embedded vectors $\tilde x_i$ and interpolation data $\hat x_i$ into the discriminator D;
updating the current discriminator parameters $w$ to new parameters $w'$ according to
$$w' = \mathrm{Adam}\!\left(\nabla_w \frac{1}{b}\sum_{i=1}^{b}\Big[D(\tilde x_i) - D(x_i) + \lambda\big(\lVert\nabla_{\hat x_i} D(\hat x_i)\rVert_2 - 1\big)^2\Big],\; w,\; \alpha,\beta_1,\beta_2\right),$$
where $\mathrm{Adam}$ denotes the Adam optimization algorithm, $D$ the discriminator, $\nabla_w$ the gradient with respect to the discriminator parameters $w$, $\lVert\cdot\rVert_2$ the two-norm and $\lambda$ the gradient-penalty coefficient.
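The gradient-penalty term in the discriminator update can be illustrated numerically. For a linear critic $D(x) = w\cdot x$ the gradient with respect to $x$ is exactly $w$, so the penalty $\lambda(\lVert\nabla_{\hat x}D(\hat x)\rVert_2-1)^2$ has a closed form; a real (nonlinear) discriminator would need automatic differentiation. All values below are toy data.

```python
# Numpy illustration of random linear interpolation plus the WGAN-style
# gradient penalty, for a linear critic D(x) = w.x whose input-gradient
# is w everywhere.
import numpy as np

rng = np.random.default_rng(2)
dim, b, lam = 8, 4, 10.0
w = rng.normal(size=dim)                   # linear critic parameters

x_real = rng.normal(size=(b, dim))         # training-set samples
x_fake = rng.normal(size=(b, dim))         # generator outputs ("embedded vectors")
eps = rng.uniform(0, 1, size=(b, 1))       # random interpolation weights in [0, 1]
x_hat = eps * x_real + (1 - eps) * x_fake  # random linear interpolation

grad_norm = np.linalg.norm(w)              # ||grad_x D(x_hat)||_2, same for every row here
penalty = lam * (grad_norm - 1.0) ** 2     # the gradient-penalty term
```

Driving this penalty toward zero pushes the critic's gradient norm toward 1 along the interpolation line, which is what stabilizes training and avoids exploding gradients.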
Preferably, step T4), updating the generator G once the $n_d$ iterations are done, specifically comprises:
once the $n_d$ iterations are done, drawing $b$ latent-vector samples $z_1,\dots,z_b$ from $N(0,1)$;
taking one sample $z_i$ at a time, $i=1,\dots,b$;
updating the current generator parameters $\theta$ to new parameters $\theta'$ according to
$$\theta' = \mathrm{Adam}\!\left(\nabla_\theta \frac{1}{b}\sum_{i=1}^{b} -D\big(G(z_i)\big),\; \theta,\; \alpha,\beta_1,\beta_2\right),$$
where $\nabla_\theta$ is the gradient with respect to the generator parameters $\theta$.
Preferably, step T5) specifically comprises:
step T5-1) traversing the training epochs $E$, repeating step T5-2) until epoch $E$ is reached, then going to step T5-3);
step T5-2) drawing $b$ latent-vector samples $z_1,\dots,z_b$ from the normal distribution $N(0,1)$ and obtaining $\tilde x_i = G(z_i)$;
traversing the $b$ latent vectors, taking one sample at a time, $i=1,\dots,b$, and obtaining the generated data $r_i = R(\tilde x_i)$ through the regressor R;
updating the regressor parameters;
step T5-3) obtaining the trained regressor R; the trained generator G and regressor R then form the trained form data generation model.
According to another aspect of the present application, there is provided a form data generation system based on a generative adversarial network, implemented according to the above form data generation method, the system comprising a cleaning module, a normalization module, a generation module and a form data generation model, wherein:
the cleaning module is used to clean the data of the form to be generated;
the normalization module is used to normalize the cleaned data;
the generation module is used to input the normalized data into the pre-established, trained form data generation model to obtain the generated form data;
the form data generation model is implemented on an improved generative adversarial network and comprises a generator and a regressor.
The application further provides a computer device comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the above form data generation method.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the above form data generation method of the present application.
According to the technical scheme above, in the data-processing part continuous data are processed with a Bayesian Gaussian mixture and discrete data with one-hot encoding, which characterizes the data better and eases the subsequent model's work. The method uses and improves a leading unsupervised-learning technique of recent years, the generative adversarial network, introducing a new component, a regressor, alongside the generator and discriminator; the regressor converts the generator's output into the final generated data. In model training, a gradient penalty and random linear interpolation terms are introduced, which improve the model's learning speed and stability and avoid problems such as exploding gradients.
Additional features and advantages of the present application will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a flow chart of a method for generating table data according to the present invention;
FIG. 2 is a diagram of a tabular data generation model;
FIG. 3 is a comparison of the certificate-type column distributions, where FIG. 3 (a) is the generated data and FIG. 3 (b) the real data;
FIG. 4 is a comparison of the held-product-count column distributions, where FIG. 4 (a) is the generated data and FIG. 4 (b) the real data;
FIG. 5 is a contour comparison of the joint distribution of gender and demand-deposit holding, where FIG. 5 (a) is the generated data and FIG. 5 (b) the real data;
FIG. 6 is a contour comparison of the joint distribution of industry and time-deposit holding, where FIG. 6 (a) is the generated data and FIG. 6 (b) the real data;
FIG. 7 is a comparison of the joint distribution of the highest-education, held-product-count and marital-status columns, where FIG. 7 (a) is the generated data and FIG. 7 (b) the real data.
Detailed Description
As shown in fig. 1, embodiment 1 of the present invention provides a form data generation method based on a generative adversarial network, comprising the following steps:
step one, obtaining a sample of form data to be generated.
Step two, data cleaning: remove records that do not meet the conditions (the form data under study must contain all the complete attributes).
Step three, on the data cleaned in step two, apply one-hot encoding to the discrete columns and normalize the continuous columns with a variational Gaussian mixture (Variational Gaussian Mixture).
Step four, take the data samples processed in step three, the latent variable $z$ (obeying the distribution $N(0,1)$) and the random number $\varepsilon$, and input them into the proposed generative adversarial network model (ReTGAN) for training.
Step five, after training, draw N vectors from the standard normal distribution and feed them to the network; the output of the regressor in ReTGAN is the required generated form data.
The technical solutions of the present application will be described in detail below with reference to the accompanying drawings in combination with embodiments.
Example 1
Embodiment 1 of the present invention provides a table data generation method based on a generation type countermeasure network.
The data set selected for study in this embodiment has a complex distribution with both discrete and continuous columns. It comprises 1980 attribute features such as basic user information, product holding information, asset information and transaction-flow information, where:
the basic information includes: customer number, birth date, gender, ethnicity, marital status, date of customer opening, highest school, work unit, industry, address information, certificate type, etc.
The product holding information includes: customer number (for association with other tables), data time, number of savings cards held, number of products held, demand deposits held, time deposits held, wealth-management products held, national debt held, funds held, insurance held, loans held, effective mobile banking held, effective online banking held, effective WeChat banking held, contracted payment accepted, foreign-currency savings held, SMS subscription held, effective savings card held, effective credit card held, effective social security card held, effective medical insurance account held, mobile banking held, online banking held, WeChat banking held, credit card held, social security card held, medical insurance account held, etc.
The asset information includes: customer numbering, data date, management asset balance, running period balance, periodic balance, financial balance, insurance balance, national debt balance, foundation balance, management asset average daily balance, running period average daily balance, periodic average daily balance, financial month average daily balance, insurance average daily balance, national debt average daily balance, foundation average daily balance, and the like.
The transaction flows include demand-deposit flows and time-deposit flows. The demand-deposit flow includes: serial number, customer number (for association with other tables), deposit account number, transaction time, transaction amount and balance.
The time-deposit flow includes: serial number, customer number (for association with other tables), deposit account number, purchase date, expiration date, transaction amount, product number, interest rate, term, etc.
It should be noted that the above data is merely an example, and is not limited thereto.
The data cover the period from October 1, 2019 to October 31, 2019, 9090 records in total. The specific implementation is as follows:
step one, acquiring a data set, which is expressed as:i,jrespectively row and column numbers. For example, the ith row of the table is denoted +.>
Step two, clean the data, deleting records containing null values so that every record contains all attributes.
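Step two amounts to dropping incomplete records; a minimal pandas sketch (column names invented for the example):

```python
# Minimal cleaning sketch: drop any record containing a null value so
# that every kept row has all of its attributes.
import pandas as pd

df = pd.DataFrame({
    "customer_no": [1, 2, 3, 4],
    "gender":      ["F", None, "M", "F"],
    "balance":     [120.5, 88.0, None, 30.2],
})
clean = df.dropna().reset_index(drop=True)  # keep only complete records
```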
Step three, for each column $c_j$ of the cleaned data and every element $T_{i,c_j}$: if the column is discrete, use one-hot encoding, so that an element of a discrete column is represented by the vector $d_{i,c_j}\in\{0,1\}^d$, where $d$ is the total number of categories of that column, i.e. the one-hot code is $d$-dimensional. If the column is continuous, normalize it with a variational Gaussian mixture (Variational Gaussian Mixture). The Gaussian mixture model is
$$P(T_{i,c_j}) = \sum_{k=1}^{K}\eta_k\,\mathcal N\!\left(T_{i,c_j};\,\mu_k,\,\sigma_k^2\right),$$
where $\eta_k$ is a weight giving the probability that the $k$-th Gaussian is selected, and $\mu_k$ and $\sigma_k^2$ are that Gaussian's mean and variance, with variational probability densities $\rho_k$. The model was trained with the Coordinate Ascent Variational Inference (CAVI) algorithm, the learned parameters being shown in Table 1.
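The variational-mixture step can be approximated with scikit-learn's `BayesianGaussianMixture`, which performs variational inference over a Gaussian mixture. This is not the exact CAVI routine of Table 1, but it is the same model family and yields the weights, means and variances needed for normalizing a continuous column; the data below are synthetic.

```python
# Hedged sketch of fitting a variational Gaussian mixture to one
# continuous column and reading off the learned mixture parameters.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(3)
# A synthetic bimodal continuous column, reshaped to (n_samples, 1).
col = np.concatenate([rng.normal(0, 1, 400), rng.normal(20, 2, 400)]).reshape(-1, 1)

vgm = BayesianGaussianMixture(n_components=5, random_state=0, max_iter=500).fit(col)
weights = vgm.weights_                 # eta_k: probability each Gaussian is selected
means = vgm.means_.ravel()             # mu_k
variances = vgm.covariances_.ravel()   # sigma_k^2
```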
TABLE 1 CAVI algorithm of variational Gaussian mixture model
Having obtained the variational probability densities $\rho_k$, weights $\eta_k$, means $\mu_k$ and variances $\sigma_k^2$, one Gaussian distribution is sampled from the $K$ variational probability densities and used for normalization, the result being denoted $v_{i,c_j}$. The discrete and continuous columns are then spliced: with the symbol $\oplus$ denoting the vector concatenation operation, a row is represented as the concatenation of its one-hot vectors $d_{i,c_j}$ and its normalized continuous values $v_{i,c_j}$.
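The splicing step reduces to vector concatenation; a numpy sketch with illustrative values:

```python
# The splicing (concatenation) step: a row's encoded representation is
# its one-hot discrete vectors joined with its normalized continuous
# scalars. Values here are illustrative only.
import numpy as np

onehot_cert = np.array([0.0, 1.0, 0.0])  # a one-hot encoded discrete cell
onehot_gender = np.array([1.0, 0.0])     # another discrete cell
norm_balance = np.array([0.37])          # a mixture-normalized continuous cell

row = np.concatenate([onehot_cert, onehot_gender, norm_balance])  # the "circled plus" operation
```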
step four, designing a generated countermeasure network model (regsan) for generating the table data, as shown in fig. 2. The generator G comprises a convolution layer, a LeakyReLu activation function, a full connection layer and a Tanh activation function which are connected in sequence; the regressor R comprises a full connection layer, a Tanh activation function, a batch standardization layer, a full connection layer and a sigmoid function which are connected in sequence. The discriminator D comprises a convolution layer, a LeakyReLu activation function, a pooling layer, a Flatten layer, a full connection layer and a Sigmoid activation function which are connected in sequence. In the training, the generator G, the discriminator D and the regressor R are trained, and when the training is actually used, the generator G and the regressor R are adopted to form a table data generation model. The specific training process is as follows:
TABLE 2 ReTGAN training procedure
In the table:
the standard normal distribution $N(0,1)$ has mathematical expectation 0 and standard deviation 1;
$\nabla_\theta$ is the gradient with respect to the generator parameters $\theta$, and $\nabla_w$ the gradient with respect to the discriminator parameters $w$.
Step five, after training, draw N vectors from the standard normal distribution, input them to the network, and obtain the generated data from the regressor R in ReTGAN.
the structural settings and super parameters of the generator G, the arbiter D, and the regressor R network are shown in table 3.
TABLE 3 network parameters of the RETGAN model
Given the trained form data generation model, the form data generation method comprises:
step one, cleaning the data of the form to be generated;
step two, normalizing the cleaned data;
step three, inputting the normalized data into the pre-established, trained form data generation model to obtain the generated form data.
Other common data normalization methods include, but are not limited to, Min-Max normalization, Z-score normalization and the like. Furthermore, different parameter settings within this framework include, but are not limited to, different numbers of training epochs, different numbers of neurons, different numbers of network layers, etc.
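The two alternative normalizations mentioned above, sketched in numpy: Min-Max rescales a column to [0, 1], and Z-score centers it to mean 0 and standard deviation 1.

```python
# Min-Max and Z-score normalization of one continuous column.
import numpy as np

col = np.array([10.0, 20.0, 30.0, 40.0])

min_max = (col - col.min()) / (col.max() - col.min())  # maps to [0, 1]
z_score = (col - col.mean()) / col.std()               # mean 0, std 1
```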
Example 2
Embodiment 2 of the present invention provides a form data generation system based on a generative adversarial network, implemented according to the method of embodiment 1, the system comprising a cleaning module, a normalization module, a generation module and a form data generation model, wherein:
the cleaning module is used to clean the data of the form to be generated;
the normalization module is used to normalize the cleaned data;
the generation module is used to input the normalized data into the pre-established, trained form data generation model to obtain the generated form data;
the form data generation model is implemented on an improved generative adversarial network and comprises a generator and a regressor.
Example 3
Embodiment 3 of the present invention may also provide a computer apparatus, including: at least one processor, memory, at least one network interface, and a user interface. The various components in the device are coupled together by a bus system. It will be appreciated that a bus system is used to enable connected communications between these components. The bus system includes a power bus, a control bus, and a status signal bus in addition to the data bus.
The user interface may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, track ball, touch pad, or touch screen, etc.).
It is to be understood that the memory in the embodiments disclosed herein may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM) or a flash memory. The volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM) and Direct Rambus RAM (DRRAM). The memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof: an operating system and application programs.
The operating system includes various system programs, such as a framework layer, a core library layer, a driving layer, and the like, and is used for realizing various basic services and processing hardware-based tasks. Applications, including various applications such as Media Player (Media Player), browser (Browser), etc., are used to implement various application services. The program implementing the method of the embodiment of the present disclosure may be contained in an application program.
In the above embodiment, the processor may be further configured to call a program or an instruction stored in the memory, specifically, may be a program or an instruction stored in an application program:
the steps of the method of example 1 are performed.
The method of embodiment 1 may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip having signal-processing capabilities. In implementation, the steps of the above method may be performed by integrated hardware logic circuits in a processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps and logic blocks disclosed in embodiment 1 may be implemented or performed thereby. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with embodiment 1 may be embodied directly in hardware, or in a combination of hardware and software modules in a processor. The software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processors (Digital Signal Processor, DSP), digital signal processing devices (DSP Device, DSPD), programmable logic devices (Programmable Logic Device, PLD), field programmable gate arrays (Field-Programmable Gate Array, FPGA), general purpose processors, controllers, microcontrollers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Example 4
Embodiment 4 of the present invention further provides a nonvolatile storage medium for storing a computer program. When the computer program is executed by a processor, the steps of the above method embodiments are implemented.
Verification effect
The technical effect is evaluated from two angles. First, because the real data has a known probability distribution, the distribution of the generated data can be compared with that of the real data to assess how well the model learns the data distribution. Second, the generated data is used in a real machine learning task to evaluate its performance in a realistic scenario.
(1) Single-column distribution: 300 samples are generated, and the distributions of the generated and real data are plotted for the certificate type (CERTI_TYPE) column and the held-product count (PROD_CNT) column, as shown in fig. 3 and fig. 4. Fig. 3 compares the distribution of the certificate type column, where fig. 3 (a) is the generated data and fig. 3 (b) is the real data; fig. 4 compares the distribution of the held-product count (PROD_CNT) column, where fig. 4 (a) is the generated data and fig. 4 (b) is the real data. Taking fig. 4 as an example, the held-product count in the real data approximately follows a normal distribution, and the corresponding column in the generated data is very close to it, showing that the method learns the distribution of a single column well.
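The single-column comparison above can also be made quantitative with a two-sample Kolmogorov-Smirnov statistic. The sketch below is illustrative only: the "real" and "generated" columns are synthetic stand-ins drawn from nearby normal distributions, not the patent's data.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
real = rng.normal(loc=3.0, scale=1.0, size=300)        # stand-in for the real PROD_CNT column
generated = rng.normal(loc=3.1, scale=1.05, size=300)  # stand-in for generator output
print(f"KS statistic: {ks_statistic(real, generated):.3f}")  # small value -> similar distributions
```

A value near 0 indicates the generated column closely matches the real one; a value near 1 indicates the distributions are disjoint.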
(2) Multi-column joint distribution: 300 samples are generated, and joint-distribution contour maps are plotted for two column pairs: gender (SEX) versus whether a demand deposit is held, and industry (CORP_INDUX) versus whether a term deposit (TERM_FLAG) is held, as shown in fig. 5 and fig. 6. Fig. 5 compares the contours of the joint distribution of gender and demand-deposit holding, where fig. 5 (a) is the generated data and fig. 5 (b) is the real data; fig. 6 compares the contours of the joint distribution of the industry (CORP_INDUX) and term-deposit (TERM_FLAG) columns, where fig. 6 (a) is the generated data and fig. 6 (b) is the real data. A violin plot for three columns, education level (EDU_LEV), held-product count (PROD_CNT), and marital status (MARRIAGE), is shown in fig. 7, where fig. 7 (a) is the generated data and fig. 7 (b) is the real data.
(3) Real machine learning task: taking a financial risk-control scenario as an example, the label is a risk level and the features are 1980 columns of collected marketing-related data. With different random seeds, the data is split into training and test sets at a ratio of 7:3, and the AUC is computed as shown in table 4. The data generated by the method achieves good accuracy in this real machine learning scenario.
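The evaluation protocol above (7:3 split, score the test set, compute AUC) can be sketched as follows. The rank-sum AUC formula is standard; the toy feature table and the correlation-based scorer are purely illustrative assumptions, not the patent's 1980-column data or model.

```python
import numpy as np

def auc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation; assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = int(pos.sum()), int((~pos).sum())
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(7)
n = 1000
X = rng.normal(size=(n, 5))   # toy feature table (the patent uses 1980 columns)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

split = int(0.7 * n)          # the 7:3 train/test split described in the text
train_X, test_X = X[:split], X[split:]
train_y, test_y = y[:split], y[split:]

# toy scorer: weight each feature by its correlation with the training label
w = np.array([np.corrcoef(train_X[:, j], train_y)[0, 1] for j in range(X.shape[1])])
print(f"test AUC: {auc(test_y, test_X @ w):.3f}")
```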
Table 4 AUC results
The preferred embodiments of the present application have been described in detail above, but the present application is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application.
In addition, the specific features described in the foregoing embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described in detail.
Moreover, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be considered as the disclosure of the present invention.

Claims (15)

1. A method of generating tabular data based on a generative adversarial network, the method comprising:
step S1) cleaning the data of the table to be generated;
step S2) normalizing the cleaned data;
step S3) inputting the normalized data into a pre-established and trained tabular data generation model to obtain generated table data;
wherein the tabular data generation model is implemented based on an improved generative adversarial network and comprises a generator and a regressor;
the generator G comprises, connected in sequence, a convolution layer, a LeakyReLU activation function, a fully connected layer, and a Tanh activation function;
the regressor R comprises, connected in sequence, a fully connected layer, a Tanh activation function, a batch normalization layer, a fully connected layer, and a sigmoid function.
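Claim 1 names only the layer types and their order. A minimal numpy sketch of the regressor R's forward pass (fully connected → Tanh → batch normalization → fully connected → sigmoid) is given below; all layer widths are hypothetical, and the batch-norm layer is simplified to normalization over the batch axis without learned scale and shift.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(h, eps=1e-5):
    # normalise each feature over the batch axis (no learned gamma/beta, for brevity)
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

# hypothetical layer widths: 16 input features, 32 hidden units, 1 output
W1, b1 = rng.normal(scale=0.1, size=(16, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.1, size=(32, 1)), np.zeros(1)

def regressor(x):
    """Regressor R from claim 1: FC -> Tanh -> BatchNorm -> FC -> sigmoid."""
    h = np.tanh(x @ W1 + b1)
    h = batch_norm(h)
    return sigmoid(h @ W2 + b2)

batch = rng.normal(size=(8, 16))   # a batch of 8 generated rows
out = regressor(batch)             # one sigmoid output per row, in (0, 1)
```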
2. The method for generating tabular data based on a generative adversarial network according to claim 1, wherein step S1) specifically comprises: checking the completeness of the data attributes and deleting records containing null values.
3. The method for generating tabular data based on a generative adversarial network according to claim 1, wherein the data cleaned in step S2) comprises discrete columns and continuous columns, and the normalization specifically comprises:
for discrete columns, applying one-hot encoding; for continuous columns, applying variational Gaussian mixture normalization; and then concatenating the results.
4. The method for generating tabular data based on a generative adversarial network according to claim 3, wherein the normalization of discrete columns by one-hot encoding specifically comprises:
for the i-th element c_{i,j} of the j-th discrete column C_j, applying one-hot encoding to obtain the vector d_{i,j}, where d represents the total number of categories of that discrete column.
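The one-hot step of claim 4 can be sketched directly; the category values below are hypothetical stand-ins for a certificate-type column.

```python
import numpy as np

def one_hot(column, categories):
    """Encode each element of a discrete column as a length-d indicator vector,
    where d is the total number of categories of that column."""
    index = {c: k for k, c in enumerate(categories)}
    out = np.zeros((len(column), len(categories)))
    for i, value in enumerate(column):
        out[i, index[value]] = 1.0
    return out

cert_type = ["ID", "PASSPORT", "ID", "OTHER"]              # toy stand-in for a CERTI_TYPE column
encoded = one_hot(cert_type, ["ID", "PASSPORT", "OTHER"])  # d = 3 categories
```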
5. The method for generating tabular data based on a generative adversarial network according to claim 4, wherein the normalization of continuous columns by variational Gaussian mixture specifically comprises:
setting, according to the cleaned data, the number of Gaussian distributions K and the prior mean and variance of the Gaussian distributions;
obtaining, by the CAVI algorithm, the weight mu_k, mean eta_k, and variance phi_k^2 of the k-th Gaussian distribution, together with the variational probability density rho_k;
establishing, for the i-th element c_{i,j} of the j-th continuous column C_j, the Gaussian mixture model
P(c_{i,j}) = sum_{k=1}^{K} mu_k N(c_{i,j}; eta_k, phi_k^2),
where N(c_{i,j}; eta_k, phi_k^2) is the Gaussian distribution with mean eta_k and variance phi_k^2;
sampling one of the K components from the variational probability density, and using the selected component k for normalization, expressed as
c'_{i,j} = (c_{i,j} - eta_k) / (4 phi_k).
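Assuming the CAVI step has already produced the component parameters, the per-value normalization of claim 5 can be sketched as below. The (c - eta_k) / (4 phi_k) scaling follows the mode-specific normalization common in CTGAN-style models and is our reading of the claim; the component values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical output of the CAVI step for one continuous column (K = 2 components)
eta = np.array([1.0, 5.0])   # component means eta_k
phi = np.array([0.5, 1.0])   # component standard deviations phi_k

def normalise(c, rho):
    """Sample a component k from the variational density rho, then express the
    value relative to that component: c' = (c - eta_k) / (4 * phi_k)."""
    k = int(rng.choice(len(eta), p=rho))
    return (c - eta[k]) / (4.0 * phi[k]), k

value = 4.2
rho = np.array([0.1, 0.9])   # variational probability of each component for this value
c_norm, k = normalise(value, rho)
```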
6. The method for generating tabular data based on a generative adversarial network according to claim 5, wherein the concatenation specifically comprises:
the i-th row of data after concatenation, r_i, is obtained by concatenating the normalized continuous elements c'_{i,j} and the one-hot vectors d_{i,j} of row i:
r_i = c'_{i,1} (+) d_{i,1} (+) ... (+) c'_{i,j} (+) d_{i,j},
where the symbol (+) represents the vector concatenation operation.
7. The method for generating tabular data based on a generative adversarial network according to claim 1, wherein during training the tabular data generation model further comprises a discriminator D, the discriminator D comprising, connected in sequence, a convolution layer, a LeakyReLU activation function, a pooling layer, a Flatten layer, a fully connected layer, and a Sigmoid activation function.
8. The method of generating tabular data based on a generative adversarial network according to claim 7, wherein the method further comprises a training step for the tabular data generation model, specifically comprising:
step T1) establishing a training set;
step T2) setting the gradient penalty coefficient lambda, the number of discriminator iterations per generator iteration n_d, the batch size b, the Adam hyperparameters alpha, beta_1, beta_2, and the number of training epochs E;
step T3) traversing the n_d iterations, updating the parameters of discriminator D in each iteration;
step T4) when the number of iterations n_d is reached, updating generator G; when the number of training epochs E has not been reached, returning to step T3) to continue training; otherwise obtaining a trained generator and proceeding to step T5);
step T5) traversing the training epochs E, updating the parameters of regressor R to obtain a trained regressor R; the trained generator G and regressor R then form the trained tabular data generation model.
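The control flow of steps T2)-T5) can be sketched as a skeleton; the update functions are stubs that only count calls, and the epoch counts are hypothetical small values.

```python
# stub updates that only count how often each network is trained
calls = {"D": 0, "G": 0, "R": 0}

def update_discriminator():  # step T3): one discriminator update
    calls["D"] += 1

def update_generator():      # step T4): one generator update
    calls["G"] += 1

def update_regressor():      # step T5): one regressor update
    calls["R"] += 1

n_d, E_gan, E_reg = 5, 3, 2  # discriminator iterations per generator step, epochs (toy values)

for _ in range(E_gan):       # train the GAN for E epochs
    for _ in range(n_d):     # n_d discriminator updates per generator update
        update_discriminator()
    update_generator()

for _ in range(E_reg):       # then train the regressor against the frozen generator
    update_regressor()
```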
9. The method for generating tabular data based on a generative adversarial network according to claim 8, wherein step T1) specifically comprises:
selecting data comprising user basic information, product holding information, asset information and/or transaction flow information, and establishing the training set after cleaning and normalization.
10. The method for generating tabular data based on a generative adversarial network according to claim 8, wherein step T3) specifically comprises:
traversing the n_d iterations and performing the following in each iteration:
establishing a standard normal distribution N(0, 1) with mathematical expectation 0 and standard deviation 1, and drawing b latent vector samples z_1, ..., z_b from it;
drawing b data samples x_1, ..., x_b from the training set;
drawing a random number epsilon in the range [0, 1];
traversing the b latent vector samples, taking one sample z_i at a time, i = 1, ..., b, and obtaining the corresponding embedded vector x~_i = G(z_i) through generator G;
obtaining interpolated data x^_i by random linear interpolation of the embedded vectors: x^_i = epsilon * x_i + (1 - epsilon) * x~_i;
inputting the training-set samples x_i, the embedded vectors x~_i, and the interpolated data x^_i into discriminator D;
updating the current discriminator parameters w according to
L_i = D(x~_i) - D(x_i) + lambda * (||grad_{x^_i} D(x^_i)||_2 - 1)^2,
w <- Adam(grad_w (1/b) sum_{i=1}^{b} L_i, w, alpha, beta_1, beta_2),
obtaining the new discriminator parameters, where Adam(.) denotes the Adam optimization algorithm, D is the discriminator, grad_w is the gradient with respect to the discriminator parameters w, ||.||_2 denotes the two-norm, and lambda is the gradient penalty coefficient.
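The loss in claim 10 is the standard WGAN-GP critic objective. With a toy linear discriminator the input gradient is available in closed form, so the penalty term can be computed exactly; every numeric value here is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 10.0                    # gradient penalty coefficient (lambda)
w = rng.normal(size=4)        # toy linear discriminator D(x) = x @ w

def d(x):
    return x @ w

real = rng.normal(size=(8, 4))           # batch of real rows x_i
fake = rng.normal(size=(8, 4))           # batch of generator outputs x~_i
eps = rng.uniform(size=(8, 1))           # interpolation coefficients in [0, 1]
interp = eps * real + (1 - eps) * fake   # x^_i = eps * x_i + (1 - eps) * x~_i

# For a linear D, the input gradient at every interpolation point is w itself,
# so ||grad D(x^)||_2 is the same for all samples:
grad_norm = np.linalg.norm(w)
penalty = lam * (grad_norm - 1.0) ** 2

# per-batch critic loss: D(fake) - D(real) + gradient penalty
loss = float(np.mean(d(fake)) - np.mean(d(real)) + penalty)
```

The penalty pushes the discriminator's input gradient toward unit norm at the interpolation points, which is exactly the role of lambda in the claim.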
11. The method for generating tabular data based on a generative adversarial network according to claim 10, wherein updating generator G in step T4) when the number of iterations n_d is reached specifically comprises:
when the number of iterations n_d is reached, drawing b latent vector samples z_1, ..., z_b from the standard normal distribution N(0, 1);
taking one sample z_i at a time, i = 1, ..., b;
updating the current generator parameters theta according to
theta <- Adam(grad_theta (1/b) sum_{i=1}^{b} -D(G(z_i)), theta, alpha, beta_1, beta_2),
obtaining the new generator parameters, where grad_theta is the gradient with respect to the generator parameters theta.
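The generator update of claim 11 minimizes -mean(D(G(z))). With toy linear G and D the gradient has a closed form, so a single descent step can be verified to reduce the loss. The claim uses Adam; plain gradient descent is substituted here to keep the sketch short, and all shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
w_d = rng.normal(size=4)                    # frozen toy linear discriminator D(x) = x @ w_d
theta = rng.normal(scale=0.1, size=(4, 4))  # toy linear generator G(z) = z @ theta

def gen_loss(theta, z):
    """Generator objective: minimise -mean(D(G(z)))."""
    return float(-np.mean((z @ theta) @ w_d))

z = rng.normal(size=(16, 4))                # b = 16 latent samples from N(0, 1)

# closed-form gradient of the loss w.r.t. theta for the linear G above
grad = -np.outer(z.mean(axis=0), w_d)
theta_new = theta - 0.01 * grad             # one plain gradient-descent step
```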
12. The method for generating tabular data based on a generative adversarial network according to claim 11, wherein step T5) specifically comprises:
step T5-1) traversing the training epochs E, repeating step T5-2) until the number of training epochs E is reached, then proceeding to step T5-3);
step T5-2) drawing b latent vector samples z_1, ..., z_b from the standard normal distribution N(0, 1);
traversing the b latent vector samples, taking one sample z_i at a time, i = 1, ..., b, obtaining the generated data x~_i = G(z_i) and feeding it to regressor R;
updating the regressor parameters accordingly;
step T5-3) obtaining the trained regressor R; the trained generator G and regressor R then form the trained tabular data generation model.
13. A tabular data generation system based on a generative adversarial network, the system being implemented according to the method of any one of claims 1-12, the system comprising: a cleaning module, a normalization module, a generation module, and a tabular data generation model; wherein
the cleaning module is configured to clean the data of the table to be generated;
the normalization module is configured to normalize the cleaned data;
the generation module is configured to input the normalized data into a pre-established and trained tabular data generation model to obtain generated table data;
the tabular data generation model is implemented based on an improved generative adversarial network and comprises a generator and a regressor;
the generator G comprises, connected in sequence, a convolution layer, a LeakyReLU activation function, a fully connected layer, and a Tanh activation function;
the regressor R comprises, connected in sequence, a fully connected layer, a Tanh activation function, a batch normalization layer, a fully connected layer, and a sigmoid function.
14. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 12 when executing the computer program.
15. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the method of any one of claims 1 to 12.
CN202310595962.5A 2023-05-25 2023-05-25 Form data generation method and system based on generation type countermeasure network Active CN116306543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310595962.5A CN116306543B (en) 2023-05-25 2023-05-25 Form data generation method and system based on generation type countermeasure network


Publications (2)

Publication Number Publication Date
CN116306543A CN116306543A (en) 2023-06-23
CN116306543B true CN116306543B (en) 2023-07-28

Family

ID=86820759


Country Status (1)

Country Link
CN (1) CN116306543B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775622B (en) * 2023-08-24 2023-11-07 中建五局第三建设有限公司 Method, device, equipment and storage medium for generating structural data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019056975A (en) * 2017-09-19 2019-04-11 株式会社Preferred Networks Improved generative adversarial network achievement program, improved generative adversarial network achievement device, and learned model generation method
CN110197514A (en) * 2019-06-13 2019-09-03 南京农业大学 A kind of mushroom phenotype image generating method based on production confrontation network
CN115357941B (en) * 2022-10-20 2023-01-13 北京宽客进化科技有限公司 Privacy removing method and system based on generating artificial intelligence



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant