WO2021068563A1 - Sample data processing method, device, computer equipment, and storage medium - Google Patents

Sample data processing method, device, computer equipment, and storage medium

Info

Publication number
WO2021068563A1
Authority
WO
WIPO (PCT)
Prior art keywords
network model
feature data
data
sample
minority
Prior art date
Application number
PCT/CN2020/098820
Other languages
English (en)
French (fr)
Inventor
秦文力
张密
韩丙卫
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021068563A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a sample data processing method, device, computer equipment, and storage medium.
  • This application provides a sample data processing method, device, computer equipment, and storage medium to solve the problem of sample data imbalance.
  • a sample data processing method, including:
  • acquiring sample feature data, where the sample feature data includes annotation data;
  • the target generative adversarial network model is constructed using the Deep&CrossNet network model;
  • the constructed feature data is added to the sample feature data to obtain standard feature data.
  • a sample data processing device includes:
  • a sample feature data acquisition module configured to acquire sample feature data, where the sample feature data includes annotation data
  • the classification module is used to classify the sample feature data based on the annotation data to obtain different types of basic feature data
  • a statistics module configured to count the data volume of each type of basic feature data, and calculate the proportion of each type of basic feature data in the sample feature data according to the data volume;
  • a minority feature data set determining module, configured to determine, when the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, the basic feature data whose proportion is less than the proportion threshold as a minority feature data set;
  • a data construction module, configured to perform data construction on the minority feature data set based on a target generative adversarial network model to generate constructed feature data, wherein the target generative adversarial network model is constructed using the Deep&CrossNet network model;
  • an adding module, configured to add the constructed feature data to the sample feature data to obtain standard feature data.
  • a computer device includes a memory, a processor, and a computer program that is stored in the memory and can run on the processor; when the processor executes the computer program, a sample data processing method is implemented:
  • acquiring sample feature data, where the sample feature data includes annotation data;
  • the target generative adversarial network model is constructed using the Deep&CrossNet network model;
  • the constructed feature data is added to the sample feature data to obtain standard feature data.
  • a computer-readable storage medium stores a computer program; when the computer program is executed by a processor, a sample data processing method is implemented:
  • acquiring sample feature data, where the sample feature data includes annotation data;
  • the target generative adversarial network model is constructed using the Deep&CrossNet network model;
  • the constructed feature data is added to the sample feature data to obtain standard feature data.
  • The above sample data processing method, device, computer equipment, and storage medium first classify the acquired sample feature data, then extract the minority feature data whose proportion is less than the proportion threshold, and then use the target generative adversarial network model built with the Deep&CrossNet network model to perform data construction on the minority feature data and generate a set of constructed feature data, thereby effectively solving the problem of sample data imbalance.
  • FIG. 1 is a schematic diagram of an application environment of a sample data processing method in an embodiment of the present application.
  • FIG. 2 is an example diagram of a sample data processing method in an embodiment of the present application.
  • FIG. 3 is another example diagram of a sample data processing method in an embodiment of the present application.
  • FIG. 4 is another example diagram of a sample data processing method in an embodiment of the present application.
  • FIG. 5 is another example diagram of a sample data processing method in an embodiment of the present application.
  • FIG. 6 is another example diagram of a sample data processing method in an embodiment of the present application.
  • FIG. 7 is a functional block diagram of a sample data processing device in an embodiment of the present application.
  • FIG. 8 is another functional block diagram of the sample data processing device in an embodiment of the present application.
  • FIG. 9 is another functional block diagram of the sample data processing device in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a computer device in an embodiment of the present application.
  • the sample data processing method can be applied to the application environment as shown in FIG. 1.
  • the sample data processing method is applied in a sample data processing system.
  • the sample data processing system includes a client and a server as shown in FIG. 1.
  • The client and the server communicate through a network to solve the problem of imbalanced sample data categories.
  • The client, also called the user side, is the program that corresponds to the server and provides local services to the user.
  • the client can be installed on, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented with an independent server or a server cluster composed of multiple servers.
  • a sample data processing method is provided.
  • Taking the method as applied to the server in FIG. 1 as an example, it includes the following steps:
  • Sample feature data refers to the data to be processed.
  • Sample feature data can be, but is not limited to, user information (such as gender, age, and occupation), website or web page click behavior (such as click time, click count, and click frequency), and user transaction data and behavior (such as payment product information, payment amount, and payment method).
  • The sample feature data includes annotation data.
  • The annotation data is identification information used to distinguish different types of sample feature data.
  • Each piece of sample feature data needs to be labeled in advance to obtain its annotation data.
  • For example, if the acquired sample feature data is website or web page click behavior (such as the number of clicks), and it includes feature data whose number of clicks is below 100 and feature data whose number of clicks equals or exceeds 100, then the annotation data included in the sample feature data are 1 and 0 respectively.
  • S20: Classify the sample feature data based on the annotation data to obtain different types of basic feature data.
  • Since each piece of sample feature data includes corresponding annotation data, the classification can be based directly on that annotation data: sample feature data with the same annotation data are classified as the same type of basic feature data, and sample feature data with different annotation data are classified as different types of basic feature data.
  • In this embodiment, the sample feature data includes at least two types of basic feature data.
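Step S20 amounts to grouping records by their annotation value. The following is a minimal sketch of that grouping, assuming each record is a `(features, label)` pair; the function name `classify_by_annotation` and the sample records are hypothetical and do not come from the application.

```python
from collections import defaultdict


def classify_by_annotation(samples):
    """Group sample feature data by annotation value.

    `samples` is an iterable of (features, label) pairs. Records that share
    a label end up in the same type of basic feature data.
    """
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append(features)
    return dict(groups)


# Hypothetical click-count records: label 1 for fewer than 100 clicks, else 0.
samples = [({"clicks": 12}, 1), ({"clicks": 250}, 0), ({"clicks": 99}, 1)]
basic_feature_data = classify_by_annotation(samples)
```

Here `basic_feature_data` maps each annotation value to the list of records carrying it, which is the input the counting step S30 needs.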
  • S30: Count the data volume of each type of basic feature data, and calculate the proportion of each type of basic feature data in the sample feature data according to the data volume.
  • Calculating the proportion of each type of basic feature data in the sample feature data includes: first, forming the ratio of the data volume of each type of basic feature data to the total data volume of the sample feature data, and then reducing each resulting ratio to obtain the proportion of that type of basic feature data in the sample feature data.
  • For example, suppose the acquired sample feature data includes three types of basic feature data, A, B, and C; the total data volume of the sample feature data is 20000, and the counted data volumes are 1000 for A, 9000 for B, and 10000 for C. The ratio of the data volume of A (1000) to the total data volume (20000) reduces to 1/20; the ratio for B (9000 to 20000) reduces to 9/20; and the ratio for C (10000 to 20000) reduces to 1/2.
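The arithmetic of step S30 can be reproduced with exact fractions. This is an illustrative sketch only; the function name `proportions` is an assumption, and the data volumes are the ones used in the example above.

```python
from fractions import Fraction


def proportions(volumes):
    """Return the reduced proportion of each type in the whole sample set.

    `volumes` maps a type name to its data volume. Fraction reduces each
    ratio automatically, e.g. 1000/20000 becomes 1/20.
    """
    total = sum(volumes.values())
    return {name: Fraction(count, total) for name, count in volumes.items()}


ratios = proportions({"A": 1000, "B": 9000, "C": 10000})
for name, ratio in ratios.items():
    print(name, ratio)  # prints "A 1/20", "B 9/20", "C 1/2" on separate lines
```

Using `Fraction` rather than floating point keeps the later comparison against a threshold such as 1/10 exact.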
  • The proportion threshold is a preset threshold used to evaluate whether the data volume of a type of basic feature data meets the requirements.
  • The proportion threshold can be 1/10, 1/12, 1/20, and so on; the user can set it according to the actual data volume of the sample feature data.
  • A minority feature data set is a data set composed of basic feature data whose data volume does not meet the set requirements. Understandably, the minority feature data set contains several pieces of minority feature data.
  • After the proportion of each type of basic feature data in the sample feature data is determined, the proportion of each type is compared with the preset proportion threshold one by one to determine whether the sample feature data contains basic feature data whose proportion is less than the proportion threshold. If it does, the basic feature data whose proportion is less than the proportion threshold is determined as a minority feature data set. Understandably, if no basic feature data has a proportion less than the proportion threshold, the sample feature data does not have the problem of imbalanced sample data categories.
  • For example, if the preset proportion threshold is 1/10, and the proportions of basic feature data A, B, and C in the sample feature data are 1/20, 9/20, and 1/2 respectively, then basic feature data A, whose proportion of 1/20 is less than 1/10, is determined as the minority feature data set.
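The threshold comparison can be sketched as follows. The function name `find_minority_types` is hypothetical; the volumes and the 1/10 threshold are taken from the example in the text.

```python
from fractions import Fraction


def find_minority_types(volumes, threshold):
    """Return the types whose proportion is strictly below the threshold."""
    total = sum(volumes.values())
    return [
        name
        for name, count in volumes.items()
        if Fraction(count, total) < threshold
    ]


minority = find_minority_types(
    {"A": 1000, "B": 9000, "C": 10000}, Fraction(1, 10)
)
# minority == ["A"]; an empty list would mean the categories are balanced.
```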
  • S50: Based on the target generative adversarial network model, perform data construction on the minority feature data set to generate constructed feature data, where the target generative adversarial network model is constructed using the Deep&CrossNet network model.
  • The target generative adversarial network model is a network model obtained by pre-training.
  • The target generative adversarial network model is used to perform data construction on the minority feature data set and output constructed feature data of the same type as the corresponding minority feature data set.
  • The number of constructed feature data items to generate can be set according to actual conditions.
  • The constructed feature data and the minority feature data belong to the same type of feature data, that is, the generated constructed feature data contains the same features as the minority feature data. For example, if the minority feature data is user transaction data and behavior (such as payment product information, payment amount, and payment method), the generated constructed feature data is also user transaction data and behavior (such as payment product information, payment amount, and payment method).
  • Performing data construction on the minority feature data set based on the target generative adversarial network model to generate constructed feature data includes: inputting a set of random data and the minority feature data into a preset generative adversarial network model for training to generate the target generative adversarial network model, where the preset generative adversarial network model is built with the Deep&CrossNet network; then inputting random data into the trained target generative adversarial network model to generate the corresponding constructed feature data.
  • If the acquired minority feature data is continuous data, it must be discretized before data construction is performed on it based on the target generative adversarial network model. The discretization generates a set of discrete data composed of vectors. If the acquired minority feature data is already discrete data, data construction can be performed on it directly based on the target generative adversarial network model to generate constructed feature data.
  • Since the target generative adversarial network model is a kind of generative adversarial network model, it is mainly composed of a generative network model and a discriminant network model. Therefore, using the Deep&CrossNet network model to build a generative adversarial network model mainly means using the Deep&CrossNet network model to build both the generative network model and the discriminant network model in the generative adversarial network model.
  • The Deep&CrossNet (DCN) network model is a cross network model.
  • The DCN network model consists of three parts: first, an embedding and stacking layer; second, a cross network and a deep network running in parallel; and third, a combination layer.
  • The DCN network combines the outputs of the cross network and the deep network.
  • The DCN network model can efficiently extract interaction information among a limited set of important features, requires no manual feature engineering or exhaustive search, and is easier to train than ordinary neural networks.
  • DCN can further abstract information while retaining the original feature information, and is better suited to structured data.
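The application does not spell out how the cross network computes feature interactions. As a point of reference, the published Deep & Cross Network design defines one cross layer as x_{l+1} = x_0 (x_l · w_l) + b_l + x_l, where x_0 is the stacked input vector. The sketch below implements only that single layer in plain Python; the function name and the numeric values are illustrative assumptions, not part of the application.

```python
def cross_layer(x0, xl, weights, bias):
    """One cross layer: x_{l+1} = x0 * (xl . weights) + bias + xl.

    The dot product collapses xl to a scalar, which rescales the original
    input x0; adding xl back keeps the previous layer's information.
    """
    scalar = sum(x_i * w_i for x_i, w_i in zip(xl, weights))
    return [
        x0_i * scalar + b_i + xl_i
        for x0_i, b_i, xl_i in zip(x0, bias, xl)
    ]


# Hypothetical two-dimensional input; the first layer takes x0 as both inputs.
x0 = [1.0, 2.0]
x1 = cross_layer(x0, x0, weights=[0.5, -1.0], bias=[0.0, 0.1])
# x1 is approximately [-0.5, -0.9]
```

The residual term `+ xl` is what lets the layer add interaction terms while retaining the original feature information, which matches the property the text attributes to DCN.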
  • The standard feature data refers to feature data that meets the requirements. Understandably, the standard feature data is a set of data with balanced data categories.
  • The generated constructed feature data is added to the sample feature data to obtain the standard feature data.
  • The generated constructed feature data is discrete data composed of a set of feature vectors with values of 0 or 1.
  • The constructed feature data is added to the sample feature data.
  • The encoding method may be One-Hot encoding or integer encoding.
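The text leaves the number of constructed records to the user. One hypothetical way to choose it is to generate just enough records for the minority type to reach the proportion threshold after they are added. Solving (m + n) / (T + n) >= t for n gives n >= (t·T - m) / (1 - t), where m is the minority volume, T the total volume, and t the threshold (assumed to be below 1). The function name is an assumption.

```python
import math
from fractions import Fraction


def records_needed(minority_count, total_count, threshold):
    """Smallest n with (minority_count + n) / (total_count + n) >= threshold.

    Assumes 0 < threshold < 1. Returns 0 if the type already meets it.
    """
    if Fraction(minority_count, total_count) >= threshold:
        return 0
    return math.ceil(
        (threshold * total_count - minority_count) / (1 - threshold)
    )


# Example values from the text: type A has 1000 of 20000 records, threshold 1/10.
n = records_needed(1000, 20000, Fraction(1, 10))
# n == 1112
```

With 1112 constructed records added, type A holds 2112 of 21112 records, which is just above 1/10; with 1111 it would still fall short.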
  • In this embodiment, sample feature data that includes annotation data is acquired; the sample feature data is classified based on the annotation data to obtain different types of basic feature data; the data volume of each type of basic feature data is counted, and the proportion of each type in the sample feature data is calculated according to the data volume; if the sample feature data contains basic feature data whose proportion is less than the preset proportion threshold, that basic feature data is determined as a minority feature data set; data construction is performed on the minority feature data set based on the target generative adversarial network model to generate constructed feature data, where the target generative adversarial network model is constructed using the Deep&CrossNet network model; and the constructed feature data is added to the sample feature data to obtain the standard feature data. By first classifying the acquired sample feature data, then extracting the minority feature data set whose proportion is less than the proportion threshold, and then using the target generative adversarial network model built with the Deep&CrossNet network model to construct data from the minority feature data set and generate a set of constructed feature data, the problem of sample data imbalance is effectively solved.
  • Performing data construction on the minority feature data set based on the target generative adversarial network model to generate constructed feature data specifically includes the following steps:
  • S501: Obtain minority feature data, and use the minority feature data to train a preset initial generative adversarial network model to generate the target generative adversarial network model, where the initial generative adversarial network model is constructed using the Deep&CrossNet network model.
  • The minority feature data is a portion of the data obtained from the minority feature data set. Because all the minority feature data in the minority feature data set belong to the same type, only part of the data in the set needs to be taken as the minority feature data when constructing data from the set based on the target generative adversarial network model. Training the preset initial generative adversarial network model with the minority feature data mainly includes: building the initial generative adversarial network model with the Deep&CrossNet network, that is, setting both the generative network model and the discriminant network model in the initial generative adversarial network model to the Deep&CrossNet network model; inputting a set of random noise data into the generative network model of the initial generative adversarial network model for training, so that the generative network model outputs a set of random feature data; then using the obtained minority feature data and the random feature data respectively as input vectors of the discriminant network model of the initial generative adversarial network model, training the discriminant network model, and repeating this in a loop.
  • S502: Obtain random noise data, and input the random noise data into the generative network model of the target generative adversarial network model to generate constructed feature data.
  • Random noise data refers to randomly generated data that conforms to a normal distribution. Specifically, after the target generative adversarial network model is obtained in step S501, a set of random noise data is randomly generated and input into the generative network model of the target generative adversarial network model to generate the corresponding constructed feature data.
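Step S502 can be sketched as sampling normally distributed noise and passing it through the trained generator. In the sketch below the trained generative network model is replaced by a trivial stand-in function, since the application does not give its weights or layer sizes; every name here (`sample_noise`, `generate`, `toy_generator`) is hypothetical.

```python
import random


def sample_noise(n_samples, dim, seed=None):
    """Draw n_samples noise vectors with independent N(0, 1) components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]


def generate(generator, n_samples, noise_dim, seed=None):
    """Feed each noise vector to the generator and collect its outputs."""
    return [generator(z) for z in sample_noise(n_samples, noise_dim, seed)]


def toy_generator(z):
    """Stand-in for a trained generator: thresholds each component to 0 or 1."""
    return [1 if value > 0 else 0 for value in z]


constructed = generate(toy_generator, n_samples=5, noise_dim=4, seed=0)
```

In a real pipeline `toy_generator` would be replaced by the generative network model of the target generative adversarial network model; only the noise sampling and the call pattern are meant to carry over.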
  • Using the minority feature data to train the preset initial generative adversarial network model to generate the target generative adversarial network model specifically includes the following steps:
  • S5011: Build the initial generative adversarial network model based on the Deep&CrossNet network.
  • The Deep&CrossNet (DCN) network is a cross network.
  • The DCN network consists of three parts: first, an embedding and stacking layer; second, a cross network and a deep network running in parallel; and third, a combination layer.
  • The DCN network combines the outputs of the cross network and the deep network.
  • The DCN network can efficiently extract interaction information among a limited set of important features, requires no manual feature engineering or exhaustive search, and is easier to train than general neural networks.
  • DCN can further abstract information while retaining the original feature information, and is better suited to structured data.
  • A generative adversarial network model is mainly composed of a generative network model and a discriminant network model. Therefore, building the initial generative adversarial network model based on the Deep&CrossNet network mainly means using the Deep&CrossNet network to construct both the generative network model and the discriminant network model in the initial generative adversarial network model.
  • S5012: Input a set of random noise data into the generative network model of the initial generative adversarial network model for training, and generate random feature data.
  • Random noise data refers to randomly generated data that conforms to a normal distribution. Specifically, inputting a set of random noise data into the generative network model of the initial generative adversarial network model for training generates a set of random feature data.
  • S5013: Discretize the random feature data to obtain discrete feature data.
  • Since the random feature data generated in step S5012 may be a set of continuous data, the generated random feature data needs to be discretized to produce discrete feature data.
  • Discrete feature data refers to data composed of a set of feature vectors whose values are 0 or 1.
  • The random feature data can be discretized using a preset encoding method to obtain the discrete feature data.
  • The encoding method may be One-Hot encoding or integer encoding.
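As an illustration of One-Hot discretization, the sketch below maps each continuous value to one of several equal-width bins and emits a 0/1 vector marking the bin. The equal-width binning, the value range, and all function names are assumptions; the application only states that a preset encoding such as One-Hot or integer encoding is used.

```python
def to_bin_index(value, low, high, n_bins):
    """Map a continuous value to an integer bin index (integer encoding)."""
    if value <= low:
        return 0
    if value >= high:
        return n_bins - 1
    return int((value - low) / (high - low) * n_bins)


def one_hot(index, n_bins):
    """Return a 0/1 vector with a single 1 at the given index."""
    return [1 if i == index else 0 for i in range(n_bins)]


def discretize(values, low, high, n_bins):
    """One-Hot encode each continuous value by its bin."""
    return [one_hot(to_bin_index(v, low, high, n_bins), n_bins) for v in values]


encoded = discretize([0.05, 0.5, 0.95], low=0.0, high=1.0, n_bins=4)
# encoded == [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

`to_bin_index` on its own corresponds to integer encoding; wrapping it in `one_hot` yields the 0/1 feature vectors described above.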
  • S5014: Use the discrete feature data and the minority feature data as input vectors of the discriminant network model of the initial generative adversarial network model, and iteratively train the initial generative adversarial network model to generate the target generative adversarial network model.
  • The discrete feature data and the minority feature data are respectively used as input vectors of the discriminant network model of the initial generative adversarial network model, and the initial generative adversarial network model is iteratively trained until convergence to obtain the target generative adversarial network model. Understandably, the iterative training of the initial generative adversarial network model mainly refers to alternately training the generative network model and the discriminant network model in it. It should be noted that, before the discrete feature data and the minority feature data are used as input vectors of the discriminant network model, the feature conditions (feature values) of the minority feature data and those of the discrete feature data must be made to correspond to each other.
  • The generative network model and the discriminant network model in the initial generative adversarial network model are each trained iteratively, by maximizing the discriminating ability of the discriminant network model and minimizing the distribution loss function of the generative network model, until the probability that the discriminant network model outputs for the random feature data produced by the generative network model is close to 0.5, at which point the target generative adversarial network model is obtained.
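The application does not write the training objective as a formula. For reference, the standard objective from the generative adversarial network literature, which matches the description of maximizing the discriminant network's ability while minimizing the generative network's loss, can be written as follows. The symbols are assumptions introduced here: D is the discriminant network model, G the generative network model, p_data the distribution of the minority feature data, and p_z the normal distribution of the random noise data.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

Under this formulation, when the generated distribution matches p_data the optimal discriminant network outputs D(x) = 1/2 for every input, which corresponds to the stopping condition above that the discriminant output probability is close to 0.5.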
  • In this embodiment, the initial generative adversarial network model is built based on the Deep&CrossNet network; a set of random noise data is input into the generative network model of the initial generative adversarial network model for training to generate random feature data; the random feature data is discretized to obtain discrete feature data; and the discrete feature data and the minority feature data are used as input vectors of the discriminant network model of the initial generative adversarial network model to iteratively train the initial generative adversarial network model and generate the target generative adversarial network model. Building the initial generative adversarial network model with the Deep&CrossNet network makes the generative network model and the discriminant network model in the resulting target generative adversarial network model more stable and efficient, thereby ensuring the accuracy of the subsequent data construction on the minority feature data using the target generative adversarial network model.
  • Using the discrete feature data and the minority feature data as input vectors of the discriminant network model of the initial generative adversarial network model and iteratively training the initial generative adversarial network model specifically includes the following steps:
  • S50141: Set the discrete feature data as a false sample set, and set the minority feature data as a true sample set.
  • The false sample set and the true sample set may also be labeled: all class labels of the false sample set are set to 0, and all class labels of the true sample set are set to 1.
  • S50142: Input the false sample set and the true sample set into the initial generative adversarial network model respectively, and obtain the output value of the discriminant network model in the initial generative adversarial network model.
  • The output value of the discriminant network model in the initial generative adversarial network model can be obtained directly.
  • Since the input samples are a true sample set labeled 1 and a false sample set labeled 0, the output value generated from the true sample set and the false sample set is a value between 0 and 1.
  • The parameter values of the generative network model are kept fixed at this stage, in order to avoid introducing unnecessary unknown conditions through changes in the model parameters while the generative network model produces the discrete feature data, which would cause errors and in turn skew the iterative training results.
  • S50143: According to the output value of the discriminant network model, adjust the parameter values of the discriminant network model so that its output value approaches the preset output value.
  • The parameter values of the discriminant network model are adjusted by comparing its output value with the preset output value, so that the output value approaches the preset output value. Since the input samples are the true sample set labeled 1 and the false sample set labeled 0, the preset output value is preferably 1, that is, the output value of the discriminant network model is made to approach that of the true sample set.
  • The parameter values of the discriminant network model are adjusted by calculating the difference between its output value and the preset output value. When the difference between the output value of the discriminant network model and the preset output value is close to 0, the training of the discriminant network model is complete.
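The idea of adjusting the discriminant network's parameters from the difference between its output and the target label can be shown on a deliberately tiny stand-in. The sketch below trains a single-feature logistic discriminator with gradient descent on binary cross-entropy, whose gradient with respect to the pre-activation is exactly that difference. It is not the Deep&CrossNet-based discriminant network of the application; the one-dimensional data, learning rate, epoch count, and function names are all assumptions.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def train_discriminator(true_samples, fake_samples, lr=0.1, epochs=500):
    """Fit D(x) = sigmoid(w * x + b) with true samples labeled 1, fake 0."""
    w, b = 0.0, 0.0
    data = [(x, 1.0) for x in true_samples] + [(x, 0.0) for x in fake_samples]
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, target in data:
            # Difference between the discriminator output and the target label.
            diff = sigmoid(w * x + b) - target
            grad_w += diff * x
            grad_b += diff
        # Move the parameters so that the average difference shrinks.
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b


# Hypothetical 1-D data: true samples near +2, fake samples near -2.
w, b = train_discriminator([1.5, 2.0, 2.5], [-1.5, -2.0, -2.5])
score_true = sigmoid(w * 2.0 + b)
score_fake = sigmoid(w * -2.0 + b)
```

After training, `score_true` is close to 1 and `score_fake` close to 0, meaning the differences from the labels 1 and 0 have both shrunk toward 0.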
  • In this embodiment, the discrete feature data is set as a false sample set and the minority feature data is set as a true sample set; the false sample set and the true sample set are respectively input into the initial generative adversarial network model to obtain the output value of the discriminant network model in it; and according to that output value, the parameter values of the discriminant network model are adjusted so that its output value approaches the preset output value, thereby further improving the stability and accuracy of the discriminant network model.
  • Using the discrete feature data and the minority feature data as input vectors of the discriminant network model of the initial generative adversarial network model and iteratively training the initial generative adversarial network model further specifically includes the following steps:
  • S50144: Preset the parameter values of the discriminant network model.
  • The parameter values of the discriminant network model are set in advance so that the discriminant network model serves as a fixed condition. It should be noted that the parameter values generally include, but are not limited to, the discrimination weights of the discriminant network model.
  • S50145: Input random noise data into the generative network model of the initial generative adversarial network model, and perform an initial calculation to obtain random feature data.
  • The label of the random feature data needs to be set to 1. Setting the label of the random feature data to 1 means that, during discrimination, the random feature data is treated as minority feature data, so that its authenticity is judged by the discriminant network model.
  • The difference between the random feature data output by the generative network model and the minority feature data is calculated, and the parameter values of the generative network model are adjusted according to this difference so that the random feature data output by the generative network model approaches the minority feature data, thereby improving the generative network model.
  • In this embodiment, the parameter values of the discriminant network model are preset; random noise data is input into the generative network model of the initial generative adversarial network model, and an initial calculation is performed to obtain random feature data; and the parameter values of the generative network model are adjusted according to the difference between the random feature data it outputs and the minority feature data, so that the random feature data output by the generative network model approaches the minority feature data, thereby further improving the stability and accuracy of the generative network model.
  • In one embodiment, a sample data processing device is provided, which corresponds one-to-one to the sample data processing method in the above embodiments.
  • As shown in FIG. 7, the sample data processing device includes a sample feature data acquisition module 10, a classification module 20, a statistics module 30, a minority-class feature data set determination module 40, a data construction module 50, and an adding module 60.
  • The functional modules are described in detail as follows:
  • the sample feature data acquisition module 10 is used to acquire sample feature data, where the sample feature data includes annotation data;
  • the classification module 20 is used to classify the sample feature data based on the annotation data to obtain different types of basic feature data;
  • the statistics module 30 is used to count the data volume of each type of basic feature data, and to calculate the proportion of each type of basic feature data in the sample feature data from the data volume;
  • the minority-class feature data set determination module 40 is used to determine, when the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
  • the data construction module 50 is used to perform data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, where the target generative adversarial network model is built using the Deep&CrossNet network model;
  • the adding module 60 is used to add the constructed feature data to the sample feature data to obtain standard feature data.
  • Preferably, as shown in FIG. 8, the data construction module 50 includes:
  • the training sub-module 501, used to acquire minority-class feature data and train a preset initial generative adversarial network model with it to produce the target generative adversarial network model, where the initial generative adversarial network model is built using the Deep&CrossNet network model;
  • the constructed feature data generation sub-module 502, used to acquire random noise data and input it into the generator network model of the target generative adversarial network model to generate constructed feature data.
  • Preferably, as shown in FIG. 9, the training sub-module 501 includes:
  • the building unit 5011, used to build the initial generative adversarial network model based on the Deep&CrossNet network;
  • the training unit 5012, used to input a set of random noise data into the generator network model of the initial generative adversarial network model for training to generate random feature data;
  • the discretization unit 5013, used to discretize the random feature data to obtain discrete feature data;
  • the iterative training unit 5014, used to take the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model, and to iteratively train the initial generative adversarial network model to produce the target generative adversarial network model.
  • Preferably, the iterative training unit 5014 includes:
  • the setting subunit, used to set the discrete feature data as a fake sample set and the minority-class feature data as a true sample set;
  • the input subunit, used to input the fake sample set and the true sample set separately into the initial generative adversarial network model and obtain the output value of the discriminator network model in the initial generative adversarial network model;
  • the first adjustment subunit, used to adjust the parameter values of the discriminator network model according to its output value, so that the output value approaches a preset output value.
  • Preferably, the iterative training unit 5014 further includes:
  • the preset subunit, used to preset the parameter values of the discriminator network model;
  • the calculation subunit, used to input random noise data into the generator network model of the initial generative adversarial network model and perform an initial calculation to obtain random feature data;
  • the second adjustment subunit, used to adjust the parameter values of the generator network model by comparing the random feature data output by the generator network model with the minority-class feature data, so that the random feature data output by the generator network model approaches the minority-class feature data.
  • Each module in the above sample data processing device can be implemented in whole or in part by software, hardware, or a combination of the two.
  • The modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
  • In one embodiment, a computer device is provided.
  • The computer device may be a server, and its internal structure may be as shown in FIG. 10.
  • The computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • The processor of the computer device provides computing and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, a computer program, and a database.
  • The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium.
  • The database of the computer device stores the data used in the sample data processing method of the foregoing embodiments.
  • The network interface of the computer device is used to communicate with an external terminal over a network connection.
  • When executed by the processor, the computer program implements a sample data processing method.
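  • In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the above sample data processing method:
  • acquiring sample feature data, where the sample feature data includes annotation data;
  • classifying the sample feature data based on the annotation data to obtain different types of basic feature data;
  • counting the data volume of each type of basic feature data, and calculating the proportion of each type of basic feature data in the sample feature data from the data volume;
  • if the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, determining the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
  • performing data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, where the target generative adversarial network model is built using the Deep&CrossNet network model;
  • adding the constructed feature data to the sample feature data to obtain standard feature data.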
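  • In one embodiment, a computer-readable storage medium is provided.
  • The storage medium may be a non-volatile storage medium or a volatile storage medium.
  • A computer program is stored on it, and when the computer program is executed by a processor, the above sample data processing method is implemented:
  • acquiring sample feature data, where the sample feature data includes annotation data;
  • classifying the sample feature data based on the annotation data to obtain different types of basic feature data;
  • counting the data volume of each type of basic feature data, and calculating the proportion of each type of basic feature data in the sample feature data from the data volume;
  • if the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, determining the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
  • performing data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, where the target generative adversarial network model is built using the Deep&CrossNet network model;
  • adding the constructed feature data to the sample feature data to obtain standard feature data.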

Abstract

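This application relates to the field of artificial intelligence and discloses a sample data processing method, device, computer device, and storage medium. Sample feature data is acquired; the sample feature data is classified based on its annotation data to obtain different types of basic feature data; the data volume of each type of basic feature data is counted, and the proportion of each type in the sample feature data is calculated from the data volume; if any basic feature data has a proportion below a preset proportion threshold, that basic feature data is determined to be a minority-class feature data set; data construction is performed on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, where the target generative adversarial network model is built using the Deep&CrossNet network model; and the constructed feature data is added to the sample feature data to obtain standard feature data. This effectively addresses the problem of imbalanced sample data.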

Description

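Sample data processing method, device, computer device, and storage medium
This application claims priority to Chinese patent application No. 201910965007.X, entitled "Sample data processing method, device, computer device, and storage medium", filed with the China Patent Office on October 11, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a sample data processing method, device, computer device, and storage medium.
Background
With advances in technology and the arrival of the big data era, the data and information resources that people can access have grown explosively. Applications that use data for prediction, evaluation feedback, and similar tasks are increasingly common, for example prediction or evaluation feedback using machine learning or clustering methods. However, the inventors realized that sample data imbalance often arises when machine learning or clustering methods are used for prediction or evaluation feedback. At present, most approaches to sample data imbalance directly increase the number of minority-class samples through artificial synthesis techniques. The samples generated this way are relatively uniform, however, and tend to cause sample overlap. Effectively solving the problem of imbalanced sample data is therefore an important and pressing problem in the data processing field.
Technical Problem
This application provides a sample data processing method, device, computer device, and storage medium to solve the problem of imbalanced sample data.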
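Technical Solution
A sample data processing method, comprising:
acquiring sample feature data, the sample feature data including annotation data;
classifying the sample feature data based on the annotation data to obtain different types of basic feature data;
counting the data volume of each type of basic feature data, and calculating, from the data volume, the proportion of each type of basic feature data in the sample feature data;
if the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, determining the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
performing data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, wherein the target generative adversarial network model is built using the Deep&CrossNet network model;
adding the constructed feature data to the sample feature data to obtain standard feature data.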
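A sample data processing device, comprising:
a sample feature data acquisition module, used to acquire sample feature data, the sample feature data including annotation data;
a classification module, used to classify the sample feature data based on the annotation data to obtain different types of basic feature data;
a statistics module, used to count the data volume of each type of basic feature data and to calculate, from the data volume, the proportion of each type of basic feature data in the sample feature data;
a minority-class feature data set determination module, used to determine, when the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
a data construction module, used to perform data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, wherein the target generative adversarial network model is built using the Deep&CrossNet network model;
an adding module, used to add the constructed feature data to the sample feature data to obtain standard feature data.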
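A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements a sample data processing method:
acquiring sample feature data, the sample feature data including annotation data;
classifying the sample feature data based on the annotation data to obtain different types of basic feature data;
counting the data volume of each type of basic feature data, and calculating, from the data volume, the proportion of each type of basic feature data in the sample feature data;
if the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, determining the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
performing data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, wherein the target generative adversarial network model is built using the Deep&CrossNet network model;
adding the constructed feature data to the sample feature data to obtain standard feature data.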
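A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a sample data processing method:
acquiring sample feature data, the sample feature data including annotation data;
classifying the sample feature data based on the annotation data to obtain different types of basic feature data;
counting the data volume of each type of basic feature data, and calculating, from the data volume, the proportion of each type of basic feature data in the sample feature data;
if the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, determining the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
performing data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, wherein the target generative adversarial network model is built using the Deep&CrossNet network model;
adding the constructed feature data to the sample feature data to obtain standard feature data.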
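Beneficial Effects
The above sample data processing method, device, computer device, and storage medium first classify the acquired sample feature data, then extract the minority-class feature data whose proportion is below the proportion threshold, and then use a target generative adversarial network model built from the Deep&CrossNet network model to perform data construction on the minority-class feature data and generate a set of constructed feature data, thereby effectively addressing the problem of imbalanced sample data.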
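Brief Description of the Drawings
FIG. 1 is a schematic diagram of an application environment of the sample data processing method in an embodiment of this application;
FIG. 2 is an example diagram of the sample data processing method in an embodiment of this application;
FIG. 3 is another example diagram of the sample data processing method in an embodiment of this application;
FIG. 4 is another example diagram of the sample data processing method in an embodiment of this application;
FIG. 5 is another example diagram of the sample data processing method in an embodiment of this application;
FIG. 6 is another example diagram of the sample data processing method in an embodiment of this application;
FIG. 7 is a schematic block diagram of the sample data processing device in an embodiment of this application;
FIG. 8 is another schematic block diagram of the sample data processing device in an embodiment of this application;
FIG. 9 is another schematic block diagram of the sample data processing device in an embodiment of this application;
FIG. 10 is a schematic diagram of the computer device in an embodiment of this application.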
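Best Mode for Carrying Out the Invention
The sample data processing method provided in the embodiments of this application can be applied in the application environment shown in FIG. 1. Specifically, the method is applied in a sample data processing system that includes a client and a server as shown in FIG. 1; the client and the server communicate over a network, and the system is used to solve the problem of imbalanced sample data classes. The client, also called the user side, is a program that corresponds to the server and provides local services to the user. The client can be installed on, but is not limited to, personal computers, notebook computers, smartphones, tablets, and portable wearable devices. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a sample data processing method is provided. Taking its application to the server in FIG. 1 as an example, the method includes the following steps: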
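S10: Acquire sample feature data, where the sample feature data includes annotation data.
Here, sample feature data is the data to be processed. It may be, but is not limited to, user information (such as gender, age, and occupation), website or web page click behavior (such as click time, count, and frequency), and user transaction data and behavior (such as purchased product information, payment amount, and payment method). The sample feature data includes annotation data, which is identification information used to distinguish different types of sample feature data.
In a specific embodiment, to make different types of sample feature data easy to distinguish, each item of sample feature data is annotated in advance to obtain the annotation data. For example, if the acquired sample feature data is website or web page click behavior (such as click count), consisting of feature data with fewer than 100 clicks and feature data with 100 or more clicks, then the feature data with fewer than 100 clicks can be annotated as 1 in advance, and the feature data with 100 or more clicks annotated as 0. The annotation data contained in this sample feature data is then 1 and 0.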
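S20: Classify the sample feature data based on the annotation data to obtain different types of basic feature data.
Specifically, because each item of sample feature data includes corresponding annotation data, the data can be classified directly by annotation: sample feature data with the same annotation data is grouped into the same type of basic feature data, and sample feature data with different annotation data is grouped into different types of basic feature data. In this embodiment, the sample feature data includes at least two types of sample feature data.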
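The annotation in S10 and the grouping in S20 can be sketched as follows. This is only an illustration under assumptions: the click-count cutoff of 100 comes from the example above, while the function names and the dictionary-based sample layout are hypothetical.

```python
from collections import defaultdict

def annotate(click_count, cutoff=100):
    """Label 1 for fewer than `cutoff` clicks, 0 otherwise (example rule)."""
    return 1 if click_count < cutoff else 0

def group_by_label(samples):
    """Group (features, label) pairs into one list of features per label."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append(features)
    return dict(groups)

clicks = [12, 250, 99, 100, 3]
samples = [({"clicks": c}, annotate(c)) for c in clicks]
basic = group_by_label(samples)  # one "basic feature data" list per label
```

Each key of `basic` is one annotation value, and the list under it plays the role of one type of basic feature data.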
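S30: Count the data volume of each type of basic feature data, and calculate the proportion of each type of basic feature data in the sample feature data from the data volume.
After the sample feature data has been classified into different types of basic feature data, a counting function can be used to count the data volume of each type. The proportion of each type in the sample feature data is then calculated from its data volume. Specifically, the data volume of each type of basic feature data is first expressed as a ratio to the total data volume of the sample feature data, and each resulting ratio is then reduced to lowest terms to give the proportion of that type in the sample feature data.
For example, suppose the acquired sample feature data includes three types of basic feature data, A, B, and C, with a total data volume of 20000, and counting gives a data volume of 1000 for A, 9000 for B, and 10000 for C. Then the proportion of A is 1000/20000, which reduces to 1/20; the proportion of B is 9000/20000, which reduces to 9/20; and the proportion of C is 10000/20000, which reduces to 1/2.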
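The ratio-and-reduce calculation in S30 maps directly onto exact rational arithmetic. The sketch below uses Python's standard `fractions.Fraction`, which reduces to lowest terms automatically; the function name `proportions` and the dictionary layout are assumptions made for illustration.

```python
from fractions import Fraction

def proportions(counts):
    """Return each type's share of the total as a reduced fraction."""
    total = sum(counts.values())
    return {name: Fraction(n, total) for name, n in counts.items()}

# Data volumes from the worked example above.
counts = {"A": 1000, "B": 9000, "C": 10000}
ratios = proportions(counts)
```

Keeping the proportions as exact fractions avoids floating-point rounding when they are later compared with a threshold such as 1/10.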
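S40: If the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, determine the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set.
Here, the proportion threshold is a preset threshold used to assess whether the data volume of basic feature data meets requirements. For example, the proportion threshold may be 1/10, 1/12, or 1/20; the user can set it according to the actual data volume of the sample feature data. A minority-class feature data set is a data set made up of basic feature data whose data volume does not meet the set requirement; it contains a number of items of minority-class feature data. Specifically, once the proportion of each type of basic feature data in the sample feature data has been determined, each proportion is compared with the preset proportion threshold in turn to judge whether the sample feature data contains basic feature data whose proportion is below the threshold. If it does, that basic feature data is determined to be a minority-class feature data set. If it does not, the sample feature data has no class imbalance problem.
For example, suppose the preset proportion threshold is 1/10, and step S30 gives proportions of 1/20 for basic feature data A, 9/20 for B, and 1/2 for C. Comparing each with the threshold shows that 1/20 (A) is less than 1/10, while 9/20 (B) and 1/2 (C) are greater than 1/10; basic feature data A is therefore determined to be the minority-class feature data set.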
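The threshold comparison in S40 can be sketched in a few lines. The function name `minority_classes` is hypothetical; the strict "less than" comparison follows the wording of the step.

```python
from fractions import Fraction

def minority_classes(ratios, threshold):
    """Return the names of types whose proportion is strictly below threshold."""
    return [name for name, r in ratios.items() if r < threshold]

# Proportions and threshold from the worked example above.
ratios = {"A": Fraction(1, 20), "B": Fraction(9, 20), "C": Fraction(1, 2)}
minority = minority_classes(ratios, Fraction(1, 10))
```

An empty result would correspond to the case in which the sample feature data has no class imbalance problem.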
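S50: Perform data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, where the target generative adversarial network model is built using the Deep&CrossNet network model.
Here, the target generative adversarial network model is a network model obtained by training in advance. It is used to perform data construction on the minority-class feature data set and to output constructed feature data of the same type as the corresponding minority-class feature data set. In this embodiment, the amount of constructed feature data generated by the target generative adversarial network model can be set as needed. Note that the constructed feature data and the minority-class feature data are of the same type, that is, they contain the same features. For example, if the minority-class feature data is user transaction data and behavior (such as purchased product information, payment amount, and payment method), the generated constructed feature data is also user transaction data and behavior (such as purchased product information, payment amount, and payment method).
Specifically, performing data construction on the minority-class feature data set based on the target generative adversarial network model to generate constructed feature data includes: inputting a set of random data and the minority-class feature data into a preset generative adversarial network model for training to produce the target generative adversarial network model, where the preset generative adversarial network model is built from the Deep&CrossNet network. The random data is then input into the trained target generative adversarial network model to generate the corresponding constructed feature data.
Note that in this embodiment, if the acquired minority-class feature data is continuous data, it must first be discretized into discrete data composed of vectors before data construction is performed on it based on the target generative adversarial network model. If the acquired minority-class feature data is already discrete, data construction can be performed on it directly based on the target generative adversarial network model to generate constructed feature data.
A generative adversarial network model consists mainly of a generator network model and a discriminator network model. Building the generative adversarial network model with the Deep&CrossNet network model therefore mainly means using the Deep&CrossNet network model to build both the generator network model and the discriminator network model. The Deep&CrossNet (DCN) network model is a cross network model. A DCN consists of an embedding and stacking layer as the first layer, a cross network and a parallel deep network as the second layer, and a combination layer as the third layer. The DCN combines the outputs of the cross network and the deep network. It can further abstract information while retaining the original feature information, efficiently extract interactions among a limited set of important features, requires no manual feature engineering or exhaustive search, and is easier to train than a typical neural network. It also adapts better to structured data.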
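To make the cross network concrete, the sketch below implements a single cross layer using the update rule from the published Deep & Cross Network design, x_{l+1} = x_0 (x_l^T w_l) + b_l + x_l. That formula is not stated in this text and is included here as background; the function name and the sample values are assumptions.

```python
def cross_layer(x0, xl, w, b):
    """One DCN-style cross layer: x_{l+1} = x0 * (xl . w) + b + xl."""
    s = sum(xi * wi for xi, wi in zip(xl, w))  # scalar x_l^T w_l
    # The "+ xli" term is the residual connection that keeps the layer input.
    return [x0i * s + bi + xli for x0i, bi, xli in zip(x0, b, xl)]

x0 = [1.0, 2.0]                      # stacked input features
x1 = cross_layer(x0, x0, w=[0.5, 0.25], b=[0.0, 0.0])
```

The residual term is one way to read the statement above that the network abstracts further while retaining the original feature information, and the product with `x0` supplies explicit feature interactions without manual feature engineering.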
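S60: Add the constructed feature data to the sample feature data to obtain standard feature data.
Here, standard feature data is feature data that meets requirements; that is, it is a set of class-balanced data.
Specifically, after the constructed feature data is generated in step S50, it is added to the sample feature data to obtain the standard feature data. Preferably, because the generated constructed feature data is discrete data composed of feature vectors whose values are 0 or 1, if the acquired sample feature data is continuous data, the sample feature data must first be converted into discrete data using a preset encoding before the constructed feature data is added. The encoding may be One-Hot encoding, integer encoding, or similar. The constructed feature data is then added to the sample feature data to obtain the standard feature data.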
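A minimal sketch of S60 with One-Hot encoding follows. The category list, the sample values, and the stand-in for generator output are all hypothetical; only the idea of encoding the sample data and then appending the 0/1 constructed vectors comes from the text.

```python
def one_hot(value, categories):
    """Encode one categorical value as a 0/1 vector over `categories`."""
    return [1 if value == c else 0 for c in categories]

payment_methods = ["card", "wallet", "transfer"]  # hypothetical categories
sample = [one_hot(m, payment_methods) for m in ["card", "transfer"]]
constructed = [[0, 1, 0]]         # stand-in for generator output from S50
standard = sample + constructed   # S60: merged, same encoding throughout
```

Encoding the sample data first ensures that the original rows and the constructed rows share one representation before they are merged.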
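In this embodiment, sample feature data that includes annotation data is acquired; the sample feature data is classified based on the annotation data to obtain different types of basic feature data; the data volume of each type is counted and its proportion in the sample feature data is calculated; if any basic feature data has a proportion below the preset proportion threshold, it is determined to be a minority-class feature data set; data construction is performed on the minority-class feature data set based on the target generative adversarial network model, which is built using the Deep&CrossNet network model, to generate constructed feature data; and the constructed feature data is added to the sample feature data to obtain standard feature data. By first classifying the acquired sample feature data, then extracting the minority-class feature data set whose proportion is below the proportion threshold, and then using the target generative adversarial network model built from the Deep&CrossNet network model to generate a set of constructed feature data from it, the problem of imbalanced sample data is effectively addressed.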
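In one embodiment, as shown in FIG. 3, performing data construction on the minority-class feature data set based on the target generative adversarial network model to generate constructed feature data specifically includes the following steps:
S501: Acquire minority-class feature data and use it to train a preset initial generative adversarial network model to produce the target generative adversarial network model, where the initial generative adversarial network model is built using the Deep&CrossNet network model.
Here, the minority-class feature data is a portion of the data taken from the minority-class feature data set. Because all the minority-class feature data in the set is of the same type, data construction based on the target generative adversarial network model only requires taking part of the set as the minority-class feature data. Training the preset initial generative adversarial network model with the minority-class feature data to produce the target generative adversarial network model mainly includes: building the initial generative adversarial network model with the Deep&CrossNet network, that is, setting both the generator network model and the discriminator network model of the initial generative adversarial network model to be Deep&CrossNet network models; inputting a set of random noise data into the generator network model of the initial generative adversarial network model for training, so that the generator network model outputs a set of random feature data; then using the acquired minority-class feature data and the random feature data as input vectors of the discriminator network model of the initial generative adversarial network model and training the discriminator network model; and repeating this cycle to iteratively train the generator and discriminator network models until convergence, yielding the target generative adversarial network model.
S502: Acquire random noise data and input it into the generator network model of the target generative adversarial network model to generate constructed feature data.
Here, random noise data is randomly generated data that follows a normal distribution. Specifically, after the target generative adversarial network model is obtained in step S501, a set of random noise data is randomly generated and input into the generator network model of the target generative adversarial network model to generate the corresponding constructed feature data.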
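Step S502 can be sketched as sampling normally distributed noise and passing it through the trained generator. In the sketch below the trained generator is replaced by a hypothetical stand-in function, since the real one would be the generator network model of the target generative adversarial network model; the helper names and the fixed seed are likewise assumptions.

```python
import random

def sample_noise(n, dim, rng):
    """Draw n noise vectors of length dim from a standard normal distribution."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

def construct(generator, n, dim, seed=0):
    """Feed n noise vectors through `generator` to get n constructed rows."""
    rng = random.Random(seed)
    return [generator(z) for z in sample_noise(n, dim, rng)]

def toy_generator(z):
    # Hypothetical stand-in for a trained generator: threshold each value.
    return [1 if v > 0 else 0 for v in z]

rows = construct(toy_generator, n=5, dim=3)
```

The argument `n` corresponds to the statement in S50 that the amount of constructed feature data can be set as needed.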
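In this embodiment, minority-class feature data is acquired and used to train a preset generative adversarial network model to produce the target generative adversarial network model, where the generative adversarial network model is built using the Deep&CrossNet network model; random noise data is then acquired and input into the generator network model of the target generative adversarial network model to generate constructed feature data. Using a generative adversarial network model built from the Deep&CrossNet network model to perform data construction on the minority-class feature data improves the efficiency of data construction.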
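In one embodiment, as shown in FIG. 4, training the preset initial generative adversarial network model with the minority-class feature data to produce the target generative adversarial network model specifically includes the following steps:
S5011: Build the initial generative adversarial network model based on the Deep&CrossNet network.
Here, the Deep&CrossNet network is a cross network. A DCN consists of an embedding and stacking layer as the first layer, a cross network and a parallel deep network as the second layer, and a combination layer as the third layer. The DCN combines the outputs of the cross network and the deep network. It can further abstract information while retaining the original feature information, efficiently extract interactions among a limited set of important features, requires no manual feature engineering or exhaustive search, and is easier to train than a typical neural network. It also adapts better to structured data.
Specifically, because a generative adversarial network model consists mainly of a generator network model and a discriminator network model, building the initial generative adversarial network model based on the Deep&CrossNet network mainly means using the Deep&CrossNet network to construct the generator network model and the discriminator network model of the initial generative adversarial network model. That is, both the generator network model and the discriminator network model of the initial generative adversarial network model are composed of Deep&CrossNet networks.
S5012: Input a set of random noise data into the generator network model of the initial generative adversarial network model for training to generate random feature data.
Here, random noise data is randomly generated data that follows a normal distribution. Specifically, inputting a set of random noise data into the generator network model of the initial generative adversarial network model for training generates a set of random feature data.
S5013: Discretize the random feature data to obtain discrete feature data.
Specifically, because the random feature data generated in step S5012 may be continuous data, it must be discretized to produce discrete feature data in order to improve the accuracy of subsequent model training. Discrete feature data is data composed of feature vectors whose values are 0 or 1. The random feature data can be discretized using a preset encoding to obtain the discrete feature data. The encoding may be One-Hot encoding, integer encoding, or similar.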
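One possible reading of the discretization in S5013 is sketched below: a vector of continuous generator scores for one categorical field is turned into a One-Hot vector by keeping the largest score. The argmax rule is an assumption chosen for illustration; the text names only One-Hot or integer encoding and does not say how continuous values are mapped to them.

```python
def discretize(scores):
    """Map continuous scores to a one-hot 0/1 vector at the largest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == best else 0 for i in range(len(scores))]

discrete = discretize([0.12, 0.71, 0.17])
```

Whatever rule is used, the result has to match the 0/1 representation of the minority-class feature data, because both are fed to the same discriminator network model.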
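S5014: Use the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model, and iteratively train the initial generative adversarial network model to produce the target generative adversarial network model.
The discrete feature data and the minority-class feature data are each used as input vectors of the discriminator network model of the initial generative adversarial network model, and the initial generative adversarial network model is trained iteratively until convergence to obtain the target generative adversarial network model. The iterative training mainly refers to alternately training the generator network model and the discriminator network model of the initial generative adversarial network model. Note that before the discrete feature data and the minority-class feature data are used as input vectors of the discriminator network model, the feature conditions (feature values) of the minority-class feature data must correspond to those of the discrete feature data.
Specifically, the generator network model and the discriminator network model of the initial generative adversarial network model are each trained iteratively by maximizing the discriminating ability of the discriminator network model and minimizing the distribution loss function of the generator network model, until the discriminator network model's output probability for the random feature data generated by the generator network model approaches 0.5, at which point the target generative adversarial network model is obtained.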
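The stopping condition of 0.5 is consistent with the standard generative adversarial network objective, given here as background rather than as a formula stated in this text. With $G$ the generator network model, $D$ the discriminator network model, $x$ a minority-class sample, and $z$ random noise (symbols introduced for this note), the usual minimax formulation is:

```latex
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right],
\qquad
D^{*}(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} .
```

For a fixed generator the best discriminator is $D^{*}$, so once the generated distribution $p_g$ matches $p_{\text{data}}$ the discriminator can do no better than output $1/2$ everywhere, which is the behavior the step above uses as its convergence signal.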
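In this embodiment, the initial generative adversarial network model is built based on the Deep&CrossNet network; a set of random noise data is input into the generator network model of the initial generative adversarial network model for training to generate random feature data; the random feature data is discretized to obtain discrete feature data; and the discrete feature data and the minority-class feature data are used as input vectors of the discriminator network model of the initial generative adversarial network model, which is then trained iteratively to produce the target generative adversarial network model. Building the initial generative adversarial network model with the Deep&CrossNet network makes the generator and discriminator network models of the resulting target generative adversarial network model more stable and efficient, which in turn supports the accuracy of subsequent data construction on the minority-class feature data.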
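In one embodiment, as shown in FIG. 5, using the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model and iteratively training the initial generative adversarial network model specifically includes the following steps:
S50141: Set the discrete feature data as the fake sample set and the minority-class feature data as the true sample set.
Specifically, the discrete feature data is set as the fake sample set and the minority-class feature data as the true sample set. In a specific embodiment, to make the fake and true sample sets easy to distinguish, labels can also be set for them. Preferably, all class labels of the fake sample set are set to 0 and all class labels of the true sample set are set to 1.
S50142: Input the fake sample set and the true sample set separately into the initial generative adversarial network model, and obtain the output value of the discriminator network model in the initial generative adversarial network model.
Specifically, inputting the fake sample set and the true sample set separately into the initial generative adversarial network model directly yields the output value of its discriminator network model. Because the input samples are the true sample set labeled 1 and the fake sample set labeled 0, the output value generated from them is a value between 0 and 1.
Note that, because this step iteratively trains the discriminator network model of the initial generative adversarial network model, the parameter values of the generator network model must be fixed before the fake and true sample sets are input. This prevents changes in the generator network model's parameters while it generates discrete feature data from introducing unnecessary uncertainty, which would cause errors and skew the results of the iterative training.
S50143: Adjust the parameter values of the discriminator network model according to its output value, so that the output value approaches a preset output value.
Specifically, the parameter values of the discriminator network model are adjusted by comparing the difference between its output value and the preset output value, so that the output value approaches the preset output value. Because the input samples are the true sample set labeled 1 and the fake sample set labeled 0, the preset output value is preferably 1, that is, the output value of the discriminator network model is made to approach that of the true sample set. In this step, the parameter values of the discriminator network model are adjusted by calculating the difference between its output value and the preset output value, and training of the discriminator network model is complete when that difference approaches 0.
In this embodiment, the discrete feature data is set as the fake sample set and the minority-class feature data as the true sample set; the fake and true sample sets are input separately into the initial generative adversarial network model to obtain the output value of its discriminator network model; and the parameter values of the discriminator network model are adjusted according to its output value so that the output value approaches the preset output value. This further improves the stability and accuracy of the discriminator network model.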
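Steps S50141 to S50143 can be illustrated with a minimal sketch. It rests on assumptions that are not in the text: a logistic discriminator stands in for the Deep&CrossNet discriminator network model, the loss is binary cross-entropy, and the per-sample targets are the labels from S50141 (1 for true samples, 0 for fake samples). The generator is not involved at all, which mirrors the requirement that its parameters stay fixed. All names and sample values are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def d_out(w, b, x):
    """Discriminator output in (0, 1) for one 0/1 feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def discriminator_step(w, b, real, fake, lr=0.5):
    """One gradient step on binary cross-entropy; real -> 1, fake -> 0."""
    batch = [(x, 1.0) for x in real] + [(x, 0.0) for x in fake]
    gw, gb = [0.0] * len(w), 0.0
    for x, y in batch:
        err = d_out(w, b, x) - y      # gradient of the loss w.r.t. the logit
        gw = [g + err * xi for g, xi in zip(gw, x)]
        gb += err
    n = len(batch)
    return [wi - lr * g / n for wi, g in zip(w, gw)], b - lr * gb / n

real = [[1, 0, 0], [1, 0, 0]]   # true sample set (minority-class data)
fake = [[0, 1, 0], [0, 0, 1]]   # fake sample set (discrete generator output)
w, b = [0.0, 0.0, 0.0], 0.0
for _ in range(200):
    w, b = discriminator_step(w, b, real, fake)
```

After training, the discriminator's output is close to the label of whichever set a sample came from, which is the sense in which its output value approaches the preset output value.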
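In one embodiment, as shown in FIG. 6, using the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model and iteratively training the initial generative adversarial network model further includes the following steps:
S50144: Preset the parameter values of the discriminator network model.
Specifically, the parameter values of the discriminator network model are set in advance so that the discriminator network model is held fixed. Note that the parameter values generally include, but are not limited to, the discrimination weights of the discriminator network model.
S50145: Input random noise data into the generator network model of the initial generative adversarial network model, and perform an initial calculation to obtain random feature data.
Specifically, random noise data is input into the generator network model of the initial generative adversarial network model, and an initial calculation yields random feature data. Preferably, in a specific embodiment, after the random feature data is obtained, its label is set to 1. Setting the label to 1 means that, during discrimination, the random feature data is treated as minority-class feature data, so that the discriminator network model can judge whether it is real or fake.
S50146: Adjust the parameter values of the generator network model by comparing the difference between the random feature data output by the generator network model and the minority-class feature data, so that the random feature data output by the generator network model approaches the minority-class feature data.
Specifically, the difference between the random feature data output by the generator network model and the minority-class feature data is calculated, and the parameter values of the generator network model are adjusted according to that difference, so that the random feature data output by the generator network model approaches the minority-class feature data, thereby refining the generator network model.
In this embodiment, the parameter values of the discriminator network model are preset; random noise data is input into the generator network model of the initial generative adversarial network model, and an initial calculation yields random feature data; and the parameter values of the generator network model are adjusted by comparing the random feature data it outputs with the minority-class feature data, so that the generated data approaches the minority-class feature data. This further improves the stability and accuracy of the generator network model.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.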
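In one embodiment, a sample data processing device is provided, which corresponds one-to-one to the sample data processing method in the above embodiments. As shown in FIG. 7, the sample data processing device includes a sample feature data acquisition module 10, a classification module 20, a statistics module 30, a minority-class feature data set determination module 40, a data construction module 50, and an adding module 60. The functional modules are described in detail as follows:
the sample feature data acquisition module 10 is used to acquire sample feature data, where the sample feature data includes annotation data;
the classification module 20 is used to classify the sample feature data based on the annotation data to obtain different types of basic feature data;
the statistics module 30 is used to count the data volume of each type of basic feature data, and to calculate the proportion of each type of basic feature data in the sample feature data from the data volume;
the minority-class feature data set determination module 40 is used to determine, when the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
the data construction module 50 is used to perform data construction on the minority-class feature data based on a target generative adversarial network model to generate constructed feature data, where the target generative adversarial network model is built using the Deep&CrossNet network model;
the adding module 60 is used to add the constructed feature data to the sample feature data to obtain standard feature data.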
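Preferably, as shown in FIG. 8, the data construction module 50 includes:
the training sub-module 501, used to acquire minority-class feature data and train a preset initial generative adversarial network model with it to produce the target generative adversarial network model, where the initial generative adversarial network model is built using the Deep&CrossNet network model;
the constructed feature data generation sub-module 502, used to acquire random noise data and input it into the generator network model of the target generative adversarial network model to generate constructed feature data.
Preferably, as shown in FIG. 9, the training sub-module 501 includes:
the building unit 5011, used to build the initial generative adversarial network model based on the Deep&CrossNet network;
the training unit 5012, used to input a set of random noise data into the generator network model of the initial generative adversarial network model for training to generate random feature data;
the discretization unit 5013, used to discretize the random feature data to obtain discrete feature data;
the iterative training unit 5014, used to take the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model, and to iteratively train the initial generative adversarial network model to produce the target generative adversarial network model.
Preferably, the iterative training unit 5014 includes:
the setting subunit, used to set the discrete feature data as a fake sample set and the minority-class feature data as a true sample set;
the input subunit, used to input the fake sample set and the true sample set separately into the initial generative adversarial network model and obtain the output value of the discriminator network model in the initial generative adversarial network model;
the first adjustment subunit, used to adjust the parameter values of the discriminator network model according to its output value, so that the output value approaches a preset output value.
Preferably, the iterative training unit 5014 further includes:
the preset subunit, used to preset the parameter values of the discriminator network model;
the calculation subunit, used to input random noise data into the generator network model of the initial generative adversarial network model and perform an initial calculation to obtain random feature data;
the second adjustment subunit, used to adjust the parameter values of the generator network model by comparing the random feature data output by the generator network model with the minority-class feature data, so that the random feature data output by the generator network model approaches the minority-class feature data.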
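For the specific limitations of the sample data processing device, refer to the limitations of the sample data processing method above, which are not repeated here. Each module in the above sample data processing device can be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores the data used in the sample data processing method of the foregoing embodiments. The network interface of the computer device is used to communicate with an external terminal over a network connection. When executed by the processor, the computer program implements a sample data processing method.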
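In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the above sample data processing method:
acquiring sample feature data, the sample feature data including annotation data;
classifying the sample feature data based on the annotation data to obtain different types of basic feature data;
counting the data volume of each type of basic feature data, and calculating, from the data volume, the proportion of each type of basic feature data in the sample feature data;
if the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, determining the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
performing data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, wherein the target generative adversarial network model is built using the Deep&CrossNet network model;
adding the constructed feature data to the sample feature data to obtain standard feature data.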
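In one embodiment, a computer-readable storage medium is provided. The storage medium may be a non-volatile storage medium or a volatile storage medium. A computer program is stored on it, and when the computer program is executed by a processor, the above sample data processing method is implemented:
acquiring sample feature data, the sample feature data including annotation data;
classifying the sample feature data based on the annotation data to obtain different types of basic feature data;
counting the data volume of each type of basic feature data, and calculating, from the data volume, the proportion of each type of basic feature data in the sample feature data;
if the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, determining the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
performing data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, wherein the target generative adversarial network model is built using the Deep&CrossNet network model;
adding the constructed feature data to the sample feature data to obtain standard feature data.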

Claims (20)

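  1. A sample data processing method, comprising:
    acquiring sample feature data, the sample feature data including annotation data;
    classifying the sample feature data based on the annotation data to obtain different types of basic feature data;
    counting the data volume of each type of basic feature data, and calculating, from the data volume, the proportion of each type of basic feature data in the sample feature data;
    if the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, determining the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
    performing data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, wherein the target generative adversarial network model is built using the Deep&CrossNet network model;
    adding the constructed feature data to the sample feature data to obtain standard feature data.
  2. The sample data processing method according to claim 1, wherein performing data construction on the minority-class feature data set based on the target generative adversarial network model to generate constructed feature data comprises:
    acquiring minority-class feature data, and training a preset initial generative adversarial network model with the minority-class feature data to produce the target generative adversarial network model, wherein the initial generative adversarial network model is built using the Deep&CrossNet network model;
    acquiring random noise data, and inputting the random noise data into the generator network model of the target generative adversarial network model to generate the constructed feature data.
  3. The sample data processing method according to claim 2, wherein training the preset initial generative adversarial network model with the minority-class feature data to produce the target generative adversarial network model comprises:
    building the initial generative adversarial network model based on the Deep&CrossNet network;
    inputting a set of random noise data into the generator network model of the initial generative adversarial network model for training to generate random feature data;
    discretizing the random feature data to obtain discrete feature data;
    using the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model, and iteratively training the initial generative adversarial network model to produce the target generative adversarial network model.
  4. The sample data processing method according to claim 3, wherein using the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model and iteratively training the initial generative adversarial network model comprises:
    setting the discrete feature data as a fake sample set and the minority-class feature data as a true sample set;
    inputting the fake sample set and the true sample set separately into the initial generative adversarial network model, and obtaining the output value of the discriminator network model in the initial generative adversarial network model;
    adjusting the parameter values of the discriminator network model according to the output value of the discriminator network model, so that the output value of the discriminator network model approaches a preset output value.
  5. The sample data processing method according to claim 3, wherein using the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model and iteratively training the initial generative adversarial network model further comprises:
    presetting the parameter values of the discriminator network model;
    inputting random noise data into the generator network model of the initial generative adversarial network model, and performing an initial calculation to obtain random feature data;
    adjusting the parameter values of the generator network model by comparing the difference between the random feature data output by the generator network model and the minority-class feature data, so that the random feature data output by the generator network model approaches the minority-class feature data.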
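  6. A sample data processing device, comprising:
    a sample feature data acquisition module, used to acquire sample feature data, the sample feature data including annotation data;
    a classification module, used to classify the sample feature data based on the annotation data to obtain different types of basic feature data;
    a statistics module, used to count the data volume of each type of basic feature data and to calculate, from the data volume, the proportion of each type of basic feature data in the sample feature data;
    a minority-class feature data set determination module, used to determine, when the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
    a data construction module, used to perform data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, wherein the target generative adversarial network model is built using the Deep&CrossNet network model;
    an adding module, used to add the constructed feature data to the sample feature data to obtain standard feature data.
  7. The sample data processing device according to claim 6, wherein the data construction module comprises:
    a training sub-module, used to acquire minority-class feature data and train a preset initial generative adversarial network model with the minority-class feature data to produce the target generative adversarial network model, wherein the initial generative adversarial network model is built using the Deep&CrossNet network model;
    a constructed feature data generation sub-module, used to acquire random noise data and input the random noise data into the generator network model of the target generative adversarial network model to generate the constructed feature data.
  8. The sample data processing device according to claim 7, wherein the training sub-module comprises:
    a building unit, used to build the initial generative adversarial network model based on the Deep&CrossNet network;
    a training unit, used to input a set of random noise data into the generator network model of the initial generative adversarial network model for training to generate random feature data;
    a discretization unit, used to discretize the random feature data to obtain discrete feature data;
    an iterative training unit, used to take the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model, and to iteratively train the initial generative adversarial network model to produce the target generative adversarial network model.
  9. The sample data processing device according to claim 8, wherein the iterative training unit comprises:
    a setting subunit, used to set the discrete feature data as a fake sample set and the minority-class feature data as a true sample set;
    an input subunit, used to input the fake sample set and the true sample set separately into the initial generative adversarial network model and obtain the output value of the discriminator network model in the initial generative adversarial network model;
    a first adjustment subunit, used to adjust the parameter values of the discriminator network model according to the output value of the discriminator network model, so that the output value of the discriminator network model approaches a preset output value.
  10. The sample data processing device according to claim 8, wherein the iterative training unit further comprises:
    a preset subunit, used to preset the parameter values of the discriminator network model;
    a calculation subunit, used to input random noise data into the generator network model of the initial generative adversarial network model and perform an initial calculation to obtain random feature data;
    a second adjustment subunit, used to adjust the parameter values of the generator network model by comparing the difference between the random feature data output by the generator network model and the minority-class feature data, so that the random feature data output by the generator network model approaches the minority-class feature data.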
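  11. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements a sample data processing method:
    acquiring sample feature data, the sample feature data including annotation data;
    classifying the sample feature data based on the annotation data to obtain different types of basic feature data;
    counting the data volume of each type of basic feature data, and calculating, from the data volume, the proportion of each type of basic feature data in the sample feature data;
    if the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, determining the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
    performing data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, wherein the target generative adversarial network model is built using the Deep&CrossNet network model;
    adding the constructed feature data to the sample feature data to obtain standard feature data.
  12. The computer device according to claim 11, wherein performing data construction on the minority-class feature data set based on the target generative adversarial network model to generate constructed feature data comprises:
    acquiring minority-class feature data, and training a preset initial generative adversarial network model with the minority-class feature data to produce the target generative adversarial network model, wherein the initial generative adversarial network model is built using the Deep&CrossNet network model;
    acquiring random noise data, and inputting the random noise data into the generator network model of the target generative adversarial network model to generate the constructed feature data.
  13. The computer device according to claim 12, wherein training the preset initial generative adversarial network model with the minority-class feature data to produce the target generative adversarial network model comprises:
    building the initial generative adversarial network model based on the Deep&CrossNet network;
    inputting a set of random noise data into the generator network model of the initial generative adversarial network model for training to generate random feature data;
    discretizing the random feature data to obtain discrete feature data;
    using the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model, and iteratively training the initial generative adversarial network model to produce the target generative adversarial network model.
  14. The computer device according to claim 13, wherein using the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model and iteratively training the initial generative adversarial network model comprises:
    setting the discrete feature data as a fake sample set and the minority-class feature data as a true sample set;
    inputting the fake sample set and the true sample set separately into the initial generative adversarial network model, and obtaining the output value of the discriminator network model in the initial generative adversarial network model;
    adjusting the parameter values of the discriminator network model according to the output value of the discriminator network model, so that the output value of the discriminator network model approaches a preset output value.
  15. The computer device according to claim 13, wherein using the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model and iteratively training the initial generative adversarial network model further comprises:
    presetting the parameter values of the discriminator network model;
    inputting random noise data into the generator network model of the initial generative adversarial network model, and performing an initial calculation to obtain random feature data;
    adjusting the parameter values of the generator network model by comparing the difference between the random feature data output by the generator network model and the minority-class feature data, so that the random feature data output by the generator network model approaches the minority-class feature data.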
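  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a sample data processing method:
    acquiring sample feature data, the sample feature data including annotation data;
    classifying the sample feature data based on the annotation data to obtain different types of basic feature data;
    counting the data volume of each type of basic feature data, and calculating, from the data volume, the proportion of each type of basic feature data in the sample feature data;
    if the sample feature data contains basic feature data whose proportion is less than a preset proportion threshold, determining the basic feature data whose proportion is less than the proportion threshold to be a minority-class feature data set;
    performing data construction on the minority-class feature data set based on a target generative adversarial network model to generate constructed feature data, wherein the target generative adversarial network model is built using the Deep&CrossNet network model;
    adding the constructed feature data to the sample feature data to obtain standard feature data.
  17. The computer-readable storage medium according to claim 16, wherein performing data construction on the minority-class feature data set based on the target generative adversarial network model to generate constructed feature data comprises:
    acquiring minority-class feature data, and training a preset initial generative adversarial network model with the minority-class feature data to produce the target generative adversarial network model, wherein the initial generative adversarial network model is built using the Deep&CrossNet network model;
    acquiring random noise data, and inputting the random noise data into the generator network model of the target generative adversarial network model to generate the constructed feature data.
  18. The computer-readable storage medium according to claim 17, wherein training the preset initial generative adversarial network model with the minority-class feature data to produce the target generative adversarial network model comprises:
    building the initial generative adversarial network model based on the Deep&CrossNet network;
    inputting a set of random noise data into the generator network model of the initial generative adversarial network model for training to generate random feature data;
    discretizing the random feature data to obtain discrete feature data;
    using the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model, and iteratively training the initial generative adversarial network model to produce the target generative adversarial network model.
  19. The computer-readable storage medium according to claim 18, wherein using the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model and iteratively training the initial generative adversarial network model comprises:
    setting the discrete feature data as a fake sample set and the minority-class feature data as a true sample set;
    inputting the fake sample set and the true sample set separately into the initial generative adversarial network model, and obtaining the output value of the discriminator network model in the initial generative adversarial network model;
    adjusting the parameter values of the discriminator network model according to the output value of the discriminator network model, so that the output value of the discriminator network model approaches a preset output value.
  20. The computer-readable storage medium according to claim 18, wherein using the discrete feature data and the minority-class feature data as input vectors of the discriminator network model of the initial generative adversarial network model and iteratively training the initial generative adversarial network model further comprises:
    presetting the parameter values of the discriminator network model;
    inputting random noise data into the generator network model of the initial generative adversarial network model, and performing an initial calculation to obtain random feature data;
    adjusting the parameter values of the generator network model by comparing the difference between the random feature data output by the generator network model and the minority-class feature data, so that the random feature data output by the generator network model approaches the minority-class feature data.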
PCT/CN2020/098820 2019-10-11 2020-06-29 Sample data processing method, device, computer device, and storage medium WO2021068563A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910965007.XA CN110888911A (zh) 2019-10-11 Sample data processing method, device, computer device, and storage medium
CN201910965007.X 2019-10-11

Publications (1)

Publication Number Publication Date
WO2021068563A1 WO2021068563A1 (zh) 2021-04-15

Family

ID=69746107

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/098820 WO2021068563A1 (zh) 2019-10-11 2020-06-29 Sample data processing method, device, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110888911A (zh)
WO (1) WO2021068563A1 (zh)

Also Published As

Publication number Publication date
CN110888911A (zh) 2020-03-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20874790

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20874790

Country of ref document: EP

Kind code of ref document: A1