WO2019242627A1 - Data processing method and device thereof - Google Patents
- Publication number
- WO2019242627A1 (application PCT/CN2019/091819)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- form data
- data
- feature
- output
- object description
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
- G06Q30/0203—Market surveys; Market polls
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/40—Business processes related to the transportation industry
Definitions
- The embodiments of the present application relate to the technical field of data processing, and in particular to a data processing method and a device thereof.
- The technical problem to be solved by the embodiments of the present application is to provide a data processing method and a device thereof that can construct output data very close to the input data, so that the data can be analyzed even away from the data office.
- A first aspect of the embodiments of the present application provides a data processing method, including:
- standardized encoding is performed on input form data to obtain first form data, where the object description features of the first form data are numerical object description features;
- based on the first form data, second form data is generated using a generative adversarial network model, where the similarity between the second form data and the first form data reaches a first threshold;
- the second form data is subjected to inverse standardized encoding to obtain output form data, where the output form data has the same object description features as the input form data.
- In the first aspect, the input form data is first given standardized encoding so that the resulting first form data can be fed to a generative adversarial network model to generate the second form data, and the second form data is then given inverse standardized encoding.
- In this way the input form data is simulated, and the output form data can be analyzed directly. The input form data is thereby analyzed indirectly, so the analysis can be carried out quickly and away from the data office.
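The inverse step can be illustrated with a small sketch. It assumes that category features were one-hot encoded and numeric features were min-max scaled into [0, 1]; the source does not fix a particular scaling formula, and the function names, feature names, and values below are hypothetical.

```python
def inverse_one_hot(vector, categories):
    """Map a (possibly soft) one-hot vector back to its category label."""
    # A generator may emit values such as 0.9 / 0.1 rather than exact 1 / 0,
    # so take the position of the largest entry.
    index = max(range(len(vector)), key=lambda i: vector[i])
    return categories[index]


def inverse_min_max(scaled, lo, hi):
    """Map a value from [0, 1] back to the original range [lo, hi]."""
    return lo + scaled * (hi - lo)


# Hypothetical generated row: two one-hot slots for "package", one scaled fee.
generated_row = [0.1, 0.9, 0.25]
package = inverse_one_hot(generated_row[:2], ["Package 1", "Package 2"])
monthly_fee = inverse_min_max(generated_row[2], lo=20.0, hi=100.0)
decoded = {"package": package, "monthly_fee": monthly_fee}
```

The decoded row again carries a text-valued package type and a fee in its original units, which is what lets the output form data share the object description features of the input form data.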
- The input form data is the original form data input to the data processing device, that is, data provided by a data office, for example telecommunication domain data provided by a telecommunication office.
- The input form data has one or more object description features.
- An object description feature describes a corresponding feature of an object.
- One field in the form data is one object description feature.
- One column in the form data corresponds to one field, and one row corresponds to one object.
- An object can also be called a sample.
- An object description feature has semantic meaning, that is, the field is given a meaning.
- Tabular data is different from multimedia data.
- In tabular data, different features also have different importance, and the features have no time or space distribution requirements.
- The embodiments of the present application therefore need to perform standardized encoding on the input form data.
- The input form data has at least one of a category-type object description feature and a numeric-type object description feature.
- The feature value corresponding to a category-type object description feature is a non-numeric value, and the feature value corresponding to a numeric-type object description feature is numeric.
- Any one of the object description features included in the input form data may be a category-type object description feature or a numeric-type object description feature, and different types of object description features may be encoded differently.
- For category-type object description features, the standardized encoding of the input form data is performed as follows: the feature values corresponding to the category-type object description features are obtained from the input form data and are one-hot encoded.
- One-hot encoding converts the feature values corresponding to the category-type object description features from non-numeric values to numeric values, so that they can be applied to a generative adversarial network model.
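As a minimal sketch of this step, the following pure-Python function one-hot encodes a single category column. The column name and values are hypothetical, and ordering the categories alphabetically is an assumption made only so that the result is reproducible.

```python
def one_hot_encode(values):
    """One-hot encode a column of category values.

    Returns the sorted category list and one 0/1 vector per input value.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1  # exactly one position is "hot"
        vectors.append(vector)
    return categories, vectors


# Hypothetical "home location" column with three distinct values.
home_location = ["A", "B", "A", "C"]
categories, encoded = one_hot_encode(home_location)
```

Keeping the returned `categories` list alongside the vectors is what later allows the encoding to be inverted.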
- For numeric-type object description features, the standardized encoding of the input form data is performed as follows: the feature values corresponding to the numeric-type object description features are obtained from the input form data and are given normalized encoding.
- Normalized encoding maps the feature values corresponding to the numeric-type object description features into the same numerical interval, so that they can be better applied to the generative adversarial network model.
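One common way to map different numeric features into the same interval is min-max scaling into [0, 1]. The source does not say which normalization formula is used, so the sketch below is an assumption, and the fee values are hypothetical.

```python
def min_max_normalize(values):
    """Scale numeric values linearly into [0, 1]; returns (scaled, lo, hi)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values], lo, hi
    return [(v - lo) / (hi - lo) for v in values], lo, hi


monthly_fee = [20.0, 60.0, 100.0]      # hypothetical monthly call fees
annual_fee = [240.0, 720.0, 1200.0]    # hypothetical annual call fees
scaled_monthly, m_lo, m_hi = min_max_normalize(monthly_fee)
scaled_annual, a_lo, a_hi = min_max_normalize(annual_fee)
```

Although the two fee columns differ in magnitude by a factor of twelve, both end up in the same interval, and the returned `lo` and `hi` values are what an inverse encoding would need.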
- The purpose of one-hot encoding is to express category-type object description features with specific numeric values.
- The purpose of normalized encoding is to map the value ranges of numeric-type object description features to the same value interval.
- The first form data obtained by standardized encoding can be applied to a generative adversarial network model, so that the second form data can be generated based on the first form data.
- When the output form data is obtained, it needs to be checked for similarity with the input form data. If the similarity between the input form data and the output form data reaches a second threshold, the output form data and the generative adversarial network model are output. If the similarity does not reach the second threshold, the initialization parameters of the generative adversarial network model are adjusted so that the similarity between the adjusted output form data and the input form data reaches the second threshold.
- The initialization parameters of the generative adversarial network model may include the types of encoder and decoder, the number of neurons in each layer of the generative network and the discriminative network, the depth of the generative network and the discriminative network, the learning rate of gradient descent, and the like.
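The check-and-adjust loop can be sketched as follows. This is only an illustration under assumptions: `train_and_generate` and `similarity` are hypothetical callbacks standing in for the real model training and similarity check, the candidate parameter sets are made up, and trying a fixed list of candidates in order is just one possible adjustment strategy.

```python
def tune_until_similar(candidate_params, train_and_generate, similarity, second_threshold):
    """Try parameter sets in order; return the first whose output is similar enough.

    train_and_generate(params) -> output table (hypothetical callback)
    similarity(output)         -> score, higher means closer to the input table
    """
    for params in candidate_params:
        output = train_and_generate(params)
        if similarity(output) >= second_threshold:
            return params, output
    return None, None  # no candidate reached the threshold


# Stand-in callbacks so the sketch runs without a real GAN.
candidates = [
    {"layers": 2, "neurons": 32, "learning_rate": 0.01},
    {"layers": 3, "neurons": 64, "learning_rate": 0.001},
]
fake_scores = {2: 0.70, 3: 0.95}
chosen, output = tune_until_similar(
    candidates,
    train_and_generate=lambda p: {"trained_with": p["layers"]},
    similarity=lambda out: fake_scores[out["trained_with"]],
    second_threshold=0.9,
)
```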
- A similarity test condition can be obtained, and the similarity between the input form data and the output form data can be tested against it to determine whether the output form data meets the similarity test condition.
- The above-mentioned similarity test condition may include a positive and negative object data ratio test condition.
- In this case, checking the similarity between the input form data and the output form data may specifically include: counting the ratio of positive to negative object data in the input form data and in the output form data; judging whether the difference between the two ratios is within a first range; and, if the difference is within the first range, determining that the ratio in the output form data is consistent with the ratio in the input form data and that the output form data satisfies the positive and negative object data ratio test condition.
- The positive and negative object data ratio test is a statistical index test: it counts the positive and negative object data ratios of the input form data and the output form data, and then determines whether the ratio of the output form data satisfies the test condition. It is simple and convenient to implement.
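A minimal sketch of this test follows. It assumes each object carries a binary label (1 for positive, 0 for negative), measures the ratio as the share of positive objects, and uses a hypothetical tolerance of 0.05 for the first range; none of these specifics come from the source.

```python
def positive_ratio(labels):
    """Fraction of objects whose label is positive (1)."""
    return sum(1 for label in labels if label == 1) / len(labels)


def ratio_test(input_labels, output_labels, first_range=0.05):
    """Pass when the two positive ratios differ by at most first_range."""
    diff = abs(positive_ratio(input_labels) - positive_ratio(output_labels))
    return diff <= first_range


input_labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]    # 20% positive
output_labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]   # 20% positive
skewed_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 50% positive
```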
- The above-mentioned similarity test condition may include a feature distribution test condition.
- In this case, checking the similarity between the input form data and the output form data may specifically include: calculating the relative entropy of object description feature i in the output form data relative to object description feature i in the input form data, where object description feature i is any one of the one or more object description features; and judging whether the relative entropy is within a second range.
- If the relative entropy is within the second range, it is determined that the feature distribution of object description feature i in the output form data follows the feature distribution of object description feature i in the input form data, and that the feature distribution of object description feature i in the output form data satisfies the feature distribution test condition.
- The feature distribution test is an information index test: by calculating the relative entropy of an object description feature, it determines whether the feature distribution of that feature in the output form data satisfies the feature distribution test condition. It is simple and convenient to implement.
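Relative entropy (Kullback-Leibler divergence) for one category feature can be sketched as below. The tiny smoothing constant, the tolerance of 0.1 for the second range, and the feature values are assumptions chosen for illustration; a numeric feature would first have to be binned, which this sketch does not cover.

```python
import math
from collections import Counter


def distribution(values, support, smoothing=1e-9):
    """Empirical distribution over `support`, lightly smoothed to avoid log(0)."""
    counts = Counter(values)
    total = len(values) + smoothing * len(support)
    return [(counts[s] + smoothing) / total for s in support]


def relative_entropy(output_values, input_values):
    """KL(output || input) for one feature, in nats."""
    support = sorted(set(output_values) | set(input_values))
    p = distribution(output_values, support)
    q = distribution(input_values, support)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def feature_distribution_test(output_values, input_values, second_range=0.1):
    """Pass when the relative entropy is at most second_range."""
    return relative_entropy(output_values, input_values) <= second_range


input_packages = ["P1", "P1", "P2", "P2"]
same_packages = ["P2", "P1", "P2", "P1"]      # same distribution, different order
skewed_packages = ["P1", "P1", "P1", "P1"]    # only one value ever generated
```

A relative entropy of zero means the two distributions coincide, so a small upper bound is a natural form for the second range.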
- The above-mentioned similarity test condition may include a feature-label correlation test condition.
- In this case, checking the similarity between the input form data and the output form data may specifically include: calculating first mutual information between object description feature j and the object label in the input form data, and calculating second mutual information between object description feature j and the object label in the output form data.
- Object description feature j is any one of the one or more object description features.
- The feature-label correlation test is an information index test: by calculating the mutual information between an object description feature and the object label, it determines whether the correlation between the feature and the object label in the output form data satisfies the feature-label correlation test condition. It is simple and convenient to implement.
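The two mutual information values can be computed and compared as in the sketch below. The text above does not state how the first and second mutual information are compared, so treating the test as passed when they differ by at most a tolerance of 0.05 is an assumption, as are the feature values and labels.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Empirical mutual information I(feature; label) in nats."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        p_joint = count / n
        # p(f, y) / (p(f) * p(y)) written with raw counts.
        mi += p_joint * math.log(p_joint * n * n / (feature_counts[f] * label_counts[y]))
    return mi


def correlation_test(in_feature, in_labels, out_feature, out_labels, tolerance=0.05):
    first = mutual_information(in_feature, in_labels)      # input form data
    second = mutual_information(out_feature, out_labels)   # output form data
    return abs(first - second) <= tolerance


in_feature = ["P1", "P1", "P2", "P2"]
in_labels = [1, 1, 0, 0]             # the feature fully determines the label
out_feature = ["P2", "P1", "P2", "P1"]
out_labels = [0, 1, 0, 1]            # same dependence, rows in another order
broken_labels = [1, 0, 0, 1]         # label independent of the feature
```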
- When the similarity between the input form data and the output form data reaches the second threshold, the similarity between the output form data and first output form data is checked. The first output form data is output form data previously obtained from the input form data using a first generative adversarial network model.
- In other words, the current output form data is used to verify the previous first output form data, and thereby to test the previous first generative adversarial network model.
- The initialization parameters of the first generative adversarial network model may be adjusted to improve the accuracy of the first generative adversarial network model.
- If the similarity between the output form data and the first output form data reaches the second threshold, it can be determined that the first output form data and the first generative adversarial network model are usable, that is, the first output form data can be analyzed in order to analyze the input form data.
- A second aspect of the embodiments of the present application provides a data processing apparatus that has the function of implementing the method provided in the first aspect.
- The functions can be implemented by hardware, or by hardware executing corresponding software.
- The hardware or software includes one or more units corresponding to the above functions.
- In one implementation, the data processing apparatus includes an encoding unit and a generating unit. The encoding unit is configured to perform standardized encoding on the input form data to obtain the first form data, where the object description features of the first form data are numerical object description features. The generating unit is configured to generate the second form data based on the first form data using a generative adversarial network model, where the similarity between the second form data and the first form data reaches a first threshold. The encoding unit is further configured to perform inverse standardized encoding on the second form data to obtain output form data, where the output form data has the same object description features as the input form data.
- In another implementation, the data processing apparatus includes a processor, a transceiver, and a memory, where the transceiver is used to receive and send information, the memory stores computer-executable instructions, and the processor is connected to the memory and the transceiver through a bus.
- The processor executes the computer-executable instructions stored in the memory, so that the data processing apparatus performs the following operations: performing standardized encoding on the input form data to obtain the first form data, where the object description features of the first form data are numerical object description features; generating the second form data based on the first form data using a generative adversarial network model, where the similarity between the second form data and the first form data reaches a first threshold; and performing inverse standardized encoding on the second form data to obtain the output form data, where the output form data has the same object description features as the input form data.
- For the implementation of the data processing apparatus, refer to the implementation of the method; repeated details are not described again.
- A third aspect of the embodiments of the present application provides a computer-readable storage medium that stores instructions which, when run on a computer, cause the computer to execute the method described in the first aspect above.
- A fourth aspect of the embodiments of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the method described in the first aspect above.
- FIG. 1 is a schematic diagram of a network architecture to which an embodiment of the present application is applied;
- FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
- FIG. 3 is an example diagram of one-hot encoding;
- FIG. 4 is an example diagram of normalized encoding;
- FIG. 5 is an example diagram of inverse one-hot encoding;
- FIG. 6 is a schematic flowchart of another data processing method according to an embodiment of the present application;
- FIG. 7 is a schematic diagram of a logical structure of a data processing apparatus according to an embodiment of the present application;
- FIG. 8 is a simplified schematic diagram of a physical structure of a data processing apparatus according to an embodiment of the present application.
- Tabular data can be data presented as a wide table or data presented as a narrow table.
- A wide table is, literally, a database table with many fields.
- A wide table is usually a database table that brings together the indicators, dimensions, and attributes related to a business topic. For example, Table 1 below is tabular data in the telecommunications domain.
- Tabular data is all the sample data displayed in the table.
- One row is one sample and one column is one feature.
- For example, the data in the row of Table 1 where the user name is Zhang San is one sample.
- The sample includes Zhang San's mobile phone number, the home location of the mobile phone number, and the package type of the mobile phone number.
- Column elements such as the home location and the package type are features, and A, B, Package 1, Package 2, and so on are the feature values corresponding to those features.
- A sample may be referred to as an object; a user name, a mobile phone number, or the like as an object identifier; and the home location, the package type, or the like as an object description feature.
- Tabular data can have one or more object description features, and each object description feature has semantics, that is, a specific meaning. For example, the object description feature "home location" indicates the country or city to which the mobile phone number belongs.
- Tabular data is a type of data that is different from multimedia data such as images and speech.
- The features of multimedia data have no semantics, are of equal importance, have time or space distribution requirements, and are all numerical values.
- For example, the pixels in image data have no semantic meaning, that is, a single pixel cannot convey specific information, and the value of a pixel can only be represented numerically.
- Every pixel is equally important, and different pixels occupy different positions in space.
- In speech data, any two frames of speech differ in their distribution in time.
- By contrast, the object description features of tabular data have semantics, differ in importance, and have no time or space distribution requirements.
- For example, the home location and the package type shown in Table 1 may differ in importance, have no time or space distribution requirements, and take values that are expressed in words rather than numbers.
- The input form data is the original form data input to the data processing device, that is, form data that has not been processed by standardized encoding, the generative adversarial network model, or inverse standardized encoding.
- It is data provided by the data office. Taking telecommunications domain data as an example, the input form data is the form data collected, sorted, or stored by the server at the telecommunications domain office.
- The output form data is the form data produced by the data processing device, that is, form data that has been processed by standardized encoding, the generative adversarial network model, and inverse standardized encoding. It is constructed form data used to simulate the input form data, and is not real form data.
- The object description features in the embodiments of the present application can be classified into category-type object description features and numeric-type object description features.
- For example, the home location and the package type in Table 1 are category-type object description features, that is, they are described by text.
- A user's monthly call fee or annual call fee is a numeric-type object description feature, that is, it is described by a number.
- The feature value corresponding to a category-type object description feature is text, that is, a non-numeric value.
- The feature value corresponding to a numeric-type object description feature is numeric, that is, a specific number.
- Tabular data has at least one of a category-type object description feature and a numeric-type object description feature. That is, all of its object description features may be category-type, all of them may be numeric-type, or some may be category-type and the others numeric-type.
- A generative adversarial network (GAN) model, which is mainly used in multimedia data scenarios, consists of two main parts: a generator and a discriminator.
- The generator is mainly used to learn the distribution of real images and to generate images that look increasingly real, so as to fool the discriminator.
- The discriminator determines the authenticity of each image it receives. Throughout the process, the generator strives to make the images it generates closer to real images, while the discriminator strives to tell real images from generated ones. This is similar to a two-player game.
- The generator and the discriminator keep competing with each other and finally reach a dynamic balance, in which the images produced by the generator are close to the real image distribution and the discriminator cannot recognize whether an image was generated.
- For a real image, the discriminator assigns the label 1; for a generated image, the discriminator assigns the label 0.
- The generator hopes that the discriminator will label its generated image 1, but at the beginning the discriminator, judging against real images, will not do so. If the discriminator labels a generated image 0, the generator adjusts the generated image and passes it to the discriminator again, until the discriminator labels it 1. At that point the discriminator can no longer tell whether the generated image is real or fake; the generated image is then very close to a real image and can be used in place of it.
- The GAN model can be described mathematically as follows. Let the generative model be G(z), where z is random noise that G converts into an image x. Let D be the discriminative model. For an input image x, D(x) outputs a real number in the range 0 to 1 that represents the probability that the image is a real image.
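With G and D defined this way, the two-player game is commonly written as the minimax objective below. This is the standard formulation from the GAN literature and is given here for orientation only; the source text does not state it, and the symbols p_data (distribution of real images) and p_z (noise distribution) are notation introduced for this formula.

```latex
\min_{G} \max_{D} V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \bigl[ \log D(x) \bigr]
  + \mathbb{E}_{z \sim p_{z}(z)} \bigl[ \log \bigl( 1 - D(G(z)) \bigr) \bigr]
```

The discriminator maximizes V by pushing D(x) toward 1 on real images and D(G(z)) toward 0 on generated ones, which matches the labels 1 and 0 described above, while the generator minimizes V by pushing D(G(z)) toward 1.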
- In the embodiments of the present application, the GAN model is applied to a tabular data scenario, such as a telecommunication domain data scenario, where it is used to simulate the input form data, for example to generate simulated telecommunication domain data. This addresses the problem that telecommunication domain data cannot be taken away from the data office.
- Standardized encoding lets the data corresponding to different object description features influence a model, such as the generative adversarial network model, within the same range.
- Standardized encoding can be divided into one-hot encoding and normalized encoding.
- One-hot encoding encodes the data corresponding to category-type object description features so that the distances between the different encoded values of such a feature are equal. In other words, one-hot encoding converts information described by text into information described by numbers.
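The equal-distance property can be checked with a few lines of Python. The comparison with plain integer codes is added here only for contrast and is not part of the source; the package names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical category values for a "package type" feature.
one_hot = {
    "Package 1": [1, 0, 0],
    "Package 2": [0, 1, 0],
    "Package 3": [0, 0, 1],
}
integer_code = {"Package 1": [1], "Package 2": [2], "Package 3": [3]}

# Collect the set of pairwise distances under each encoding.
one_hot_distances = {
    euclidean(one_hot[a], one_hot[b]) for a, b in combinations(one_hot, 2)
}
integer_distances = {
    euclidean(integer_code[a], integer_code[b]) for a, b in combinations(integer_code, 2)
}
```

Under one-hot encoding every pair of categories is the same distance apart, whereas integer codes would make "Package 1" look twice as far from "Package 3" as from "Package 2", an ordering the categories do not actually have.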
- Normalized encoding encodes the data corresponding to numeric-type object description features. It standardizes inputs and outputs by mapping the values of different object description features to the same interval, which helps deep learning converge to the optimal solution faster.
- The embodiments of the present application provide a data processing method and device that can construct output form data very close to the input form data, so that the data can be analyzed even away from the data office.
- FIG. 1 is a schematic diagram of a network architecture to which an embodiment of the present application is applied.
- The network architecture shown in FIG. 1 includes a server 101 and a data processing device 102.
- The server 101 is a server of a data office and is responsible for collecting, summarizing, organizing, and storing data at the data office to form tabular data.
- The form data stored in the server 101 may be private, as is the case for telecommunications domain data.
- Leakage of telecommunications domain data not only affects users' daily lives but can also cause them financial losses, and may even violate laws and regulations. Because this form data is private, it cannot be taken away from the data office, for example from the telecommunications domain office.
- The inability to take the form data away from the data office has a great impact on R&D personnel.
- R&D personnel can only analyze the form data at the data office, so they spend time and money travelling between the data office and their company and cannot analyze the form data conveniently and quickly.
- The server 101 may communicate with the data processing device 102, for example by receiving a request for form data sent by the data processing device 102 and sending the form data to the data processing device 102.
- The data processing device 102 is a device provided in an embodiment of the present application and is configured to execute the data processing method provided in the embodiment of the present application.
- The data processing device 102 may be deployed at a data office and may communicate with the server 101. For example, the data processing device 102 sends a request for form data to the server 101 and receives the form data that the server 101 sends back.
- Data transmission by the data processing device 102 is subject to permission restrictions. For example, the form data obtained from the server 101 is prohibited from being sent to other networks, or from being copied out through a data interface.
- The data processing device 102 may be provided with at least one generative adversarial network model, and these generative adversarial network models may be implemented in software.
- The server 101 and the data processing device 102 may be independent devices, or the data processing device 102 may be integrated in the server 101; the specific implementation form is not limited.
- FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application.
- the method may include, but is not limited to:
- Step S201: Perform standardized encoding on the input form data to obtain the first form data.
- The object description features of the first form data are numerical object description features.
- Before executing step S201, the data processing device 102 obtains the input form data from the server 101.
- The data processing device 102 may send a request to the server 101 to obtain the input form data.
- The server 101 may authenticate the data processing device 102, that is, verify the login account or device identifier of the data processing device 102 to determine whether that login account or user has permission to obtain the input form data.
- If the data processing device 102 passes authentication, the server 101 may send the input form data to the data processing device 102.
- The data processing device 102 may carry in the request the quantity of object data to be obtained. For example, if 1000 pieces of object data are requested, the input form data includes 1000 pieces of object data. One piece of object data may include an object identifier and one or more object description features corresponding to that object identifier.
- For example, in telecommunication domain data, one piece of object data may include a mobile phone number and the object description features corresponding to that number, such as the home location, package type, monthly call fee, and annual call fee.
- The data processing device 102 may also specify in the request which form data to obtain, for example the form data for mobile phone numbers whose last four digits are in the range 0000 to 5000, or the form data for users who have been on the network for more than 10 years.
- the input form data is data collected, sorted, and stored by the server 101.
- the input form data may be a data set, including multiple pieces of data, and may be all or part of the data stored by the server 101.
- the number of specific data is not limited in the embodiments of the present application, and it depends on specific situations.
- Each piece of data in the input form data has one or more object description features, and each object description feature has semantics.
- The object description features of the input form data differ in importance, and no feature is subject to a spatio-temporal distribution requirement.
- the features of the input form data can be categorical features or numerical features. The specific feature depends on the specific situation.
- the data processing device 102 may traverse the input form data and perform a filtering process, and perform standardized encoding on the filtered input form data.
- The data processing device 102 checks whether the category-type object description feature corresponding to each object has a specific value, that is, whether a text description exists, for example, whether the home location of an object is a specific city. If there is no specific value, for example, the field is empty, the data corresponding to the object is deleted from the input form data.
- The data processing device 102 also checks whether the numerical object description feature corresponding to each object has a specific numerical value. If it does not, the data corresponding to the object is deleted from the input form data.
- The data processing device 102 may also check whether the numerical value of a numerical object description feature of each object is an abnormal value. For example, if the online age of an object is 150 years, which exceeds any plausible online age, the value can be regarded as abnormal, and the data processing device 102 can delete the data corresponding to the object from the input form data.
- When the data processing device 102 obtains the input form data, it filters the input form data in this way, so as to avoid the influence of missing values and abnormal values on the output form data.
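- The filtering described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the record layout, the city names, and the bound `MAX_ONLINE_AGE` are assumptions introduced only for the example.

```python
# Hypothetical records: one dict per object. None marks a missing value.
records = [
    {"id": "A", "home": "Shenzhen", "online_age": 12},
    {"id": "B", "home": None, "online_age": 3},          # missing category value
    {"id": "C", "home": "Beijing", "online_age": None},  # missing numerical value
    {"id": "D", "home": "Shanghai", "online_age": 150},  # abnormal value
]

MAX_ONLINE_AGE = 100  # assumed plausibility bound, not taken from the source


def keep(record):
    """Return True if the record has no missing or abnormal feature value."""
    if record["home"] is None or record["online_age"] is None:
        return False
    return 0 <= record["online_age"] <= MAX_ONLINE_AGE


filtered = [r for r in records if keep(r)]
print([r["id"] for r in filtered])
```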
- The data processing device 102 obtains, from the input form data, the feature values corresponding to the category-type object description features and the feature values corresponding to the numerical object description features, that is, it separates the two kinds of feature values. If the input form data has only category-type object description features or only numerical object description features, no separation is required. It can be understood that when the input form data has both kinds of features, the input form data is split into two pieces of form data for standardized encoding: one includes the feature values corresponding to the category-type object description features, and the other includes the feature values corresponding to the numerical object description features.
- The data processing device 102 applies different standardized encoding methods to different types of object description features.
- For category-type object description features, the data processing device 102 uses one-hot encoding as the standardized encoding.
- the one-hot code is intuitively a code system in which there are as many bits as there are states, and only one bit is 1, and the others are all 0.
- Each object identifier can be a user name or a mobile phone number, and the brand can also be called a package type. Global Connect, China Travel, Dynamic Zone, and so on are the data corresponding to the brand object description feature. It should be noted that each brand shown in FIG. 3 is for example only and does not constitute a limitation on the embodiment of the present application.
- After the one-hot encoding, each object takes the value 1 in exactly one field. For example, if the brand corresponding to an object is Global Connect, then after the one-hot encoding the Global Connect field takes the value 1 and the other fields take the value 0.
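- A minimal sketch of one-hot encoding for the brand feature follows. The function name `one_hot` and the fixed category order are assumptions made for illustration; the brand names are the examples given above.

```python
from typing import List

BRANDS = ["Global Connect", "China Travel", "Dynamic Zone"]  # example categories


def one_hot(value: str, categories: List[str]) -> List[int]:
    """One field per category; exactly one field is 1 and the rest are 0."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


print(one_hot("Global Connect", BRANDS))  # [1, 0, 0]
```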
- For numerical object description features, the data processing device 102 uses normalization encoding as the standardized encoding.
- The purpose of normalization encoding is to map the feature values of different object description features to the same numerical interval, such as [0,1] or [0,99]. If the interval is [0,1], the encoded value is a decimal in the range 0 to 1; if it is [0,99], the encoded value is an integer in the range 0 to 99.
- Normalization encoding may include, but is not limited to, min-max encoding, standard score (z-score) encoding, inverse tangent function (atan) encoding, and the like.
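- As an illustration of min-max encoding, the sketch below maps two numerical fields with different ranges onto the same interval. The sample values and the handling of a constant column are assumptions; rounding to integers for the [0,99] case is one possible choice.

```python
def min_max(values, lo=0.0, hi=1.0):
    """Linearly map values so that min(values) -> lo and max(values) -> hi."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:  # constant column: map everything to lo (assumed convention)
        return [lo for _ in values]
    return [lo + (v - v_min) / span * (hi - lo) for v in values]


ages = [20, 35, 60, 100]     # hypothetical ages, original range [0,100]
arpu = [0.5, 1.0, 2.5, 5.0]  # hypothetical ARPU values, original range [0,5]

print(min_max(ages))                             # decimals in [0, 1]
print([round(v) for v in min_max(arpu, 0, 99)])  # integers in [0, 99]
```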
- For example, the age can represent the actual age of the user or the online age of the user, and its value range is [0,100]; the average revenue per user (ARPU) value can represent the profit that the operator receives from each user over a period of time, and its value range is [0,5].
- The z-score standardized encoding can be realized by the following formula:
- $$x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$
- where $x_{ij}$ represents the value of object i on field j, $\mu_j$ represents the mean of all objects on field j, and $\sigma_j$ represents the standard deviation of all objects on field j. Object i is any object in the input form data, and field j is any numerical object description feature corresponding to the object.
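- A short sketch of z-score encoding for one field is given below. It assumes the population standard deviation over all objects and returns zeros for a constant column; both choices, and the sample values, are assumptions for illustration.

```python
from statistics import mean, pstdev


def z_score(column):
    """Standardize one numerical field: subtract the mean, divide by the std."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation over all objects
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


monthly_fee = [30.0, 50.0, 70.0]  # hypothetical values of one field
encoded = z_score(monthly_fee)
print(encoded)
```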
- After normalization encoding, the feature values corresponding to the numerical object description features are mapped to a numerical interval, which may be [0,1], [0,99], or the like. No matter how many numerical object description features there are, their feature values are mapped to the same numerical interval, which helps deep learning reach the optimal solution faster.
- The object description features of the first form data are numerical object description features, that is, the first form data has no category-type object description feature. It can be understood that: when the input form data includes only category-type object description features, the feature value corresponding to each category-type object description feature is encoded as a numerical value, in other words, each category-type object description feature is converted into numerical object description features, so that each feature value in the first form data is 0 or 1; when the input form data has only numerical object description features, the value range of each feature value in the first form data is a specific interval, such as [0,1] or [0,99]; when the input form data has both category-type and numerical object description features, the feature values in the first form data that correspond to category-type object description features are 0 or 1, and the value range of the feature values that correspond to numerical object description features is a specific interval, such as [0,1] or [0,99].
- step S202 based on the first form data, a second form data is generated using a generative adversarial network model, and the similarity between the second form data and the first form data reaches a first threshold.
- the data processing device 102 uses the generative adversarial network model to generate the second table data on the basis of obtaining the first table data.
- the similarity between the second table data and the first table data reaches a first threshold, and the specific value of the first threshold is not limited in the embodiment of the present application.
- That the similarity between the second form data and the first form data reaches the first threshold means that the decision label of the second form data is the true label; in other words, the second form data is so close to the first form data that it cannot be determined that the second form data is constructed data.
- the generative adversarial network model finally generates the second table data, which is the result of the confrontation between the generator and the discriminator.
- the generator included in the generative adversarial network model can be implemented by a generation network, and the discriminator can be implemented by a discriminative network.
- The input to the generation network is a group of random noise, usually Gaussian noise, for example (0.2, 0.7, 0.6, -0.5, 0.1).
- After passing through the generation network, the group of random noise is transformed into a new vector whose dimension equals the number of features, for example (0.32, 0.63, 0.89, 0.65, 0.21, 0.69, 0.85, 0.01, 0.36).
- The vector generated by the generation network enters the discrimination network, and the discrimination network determines, based on the real samples (that is, the input form data), whether the vector generated by the generation network is real.
- For the generation network, the number of neurons in the input layer is the random noise dimension plus one, and the number of neurons in the output layer is the number of object description features.
- For the discrimination network, the number of neurons in the input layer is the number of object description features plus one, and the number of neurons in the output layer is 1.
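- The layer sizes above can be illustrated with an untrained, single-layer forward pass. This sketch makes several assumptions that are not stated in the source: the extra input neuron is treated as a constant bias input, each network is reduced to one sigmoid layer, and the weights are random. It shows shapes only, not training.

```python
import math
import random

random.seed(0)

NOISE_DIM = 5     # matches the 5-dimensional noise example
NUM_FEATURES = 9  # matches the 9-dimensional generated vector example


def dense(n_in, n_out):
    """Random weights for one layer; the extra input row acts as a bias (assumed)."""
    return [[random.gauss(0, 0.1) for _ in range(n_out)] for _ in range(n_in + 1)]


def forward(weights, x):
    """Apply one sigmoid layer to x, appending a constant 1 as the extra input."""
    x = list(x) + [1.0]
    n_out = len(weights[0])
    sums = [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n_out)]
    return [1.0 / (1.0 + math.exp(-s)) for s in sums]


generator = dense(NOISE_DIM, NUM_FEATURES)  # (noise dim + 1) inputs -> features
discriminator = dense(NUM_FEATURES, 1)      # (features + 1) inputs -> 1 output

noise = [random.gauss(0, 1) for _ in range(NOISE_DIM)]
fake = forward(generator, noise)         # simulated sample, values in (0, 1)
score = forward(discriminator, fake)[0]  # discriminator's "real" score
print(len(fake), score)
```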
- the type of network and the number of network layers can be set by the user according to the actual situation, for example, it can be set according to the number of input form data.
- Ada-Grad is an improved stochastic gradient descent (SGD) algorithm for solving a generative adversarial network model.
- Dropout is a regularization technique mainly used to prevent overfitting of generative adversarial network models.
- The first form data is a real sample, and the second form data is a simulated sample.
- The simulated sample is so close to the real sample that the discriminator cannot determine its authenticity, and the decision label it assigns is the true label.
- Step S203 Perform inverse normalization encoding on the second form data to obtain output form data.
- the output form data has the same object description characteristics as the input form data.
- After the second form data is obtained, the data processing device 102 performs inverse normalization encoding on the second form data to obtain output form data.
- The output form data has the same object description features as the input form data. Since the second form data is very close to the first form data, the output form data obtained by the inverse normalization encoding is very close to the input form data.
- The inverse encoding applied in step S203 may include inverse one-hot encoding and the inverse of the normalization encoding.
- For an example of inverse one-hot encoding, see FIG. 5.
- The table shown on the left in FIG. 5 may be part or all of the second form data, and the table shown on the right may be part or all of the output form data. It can be seen from FIG. 5 that, for each object, the data processing device 102 takes the field with the highest probability in the second form data as the final feature value of the category-type object description feature. For example, if in the data corresponding to an object the probability of Global Connect is 0.7, the probability of a second brand is 0.1, and the probability of Dynamic Zone is 0.2, the data processing device 102 takes Global Connect as the final brand of the object.
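- Inverse one-hot encoding as described here amounts to picking the field with the largest generated value. In the sketch below the function name `inverse_one_hot` is hypothetical, and assigning the 0.1 probability to China Travel is an assumption, since the text does not name that brand.

```python
BRANDS = ["Global Connect", "China Travel", "Dynamic Zone"]


def inverse_one_hot(probabilities, categories):
    """Return the category whose field holds the highest generated value."""
    best = max(range(len(categories)), key=lambda i: probabilities[i])
    return categories[best]


row = [0.7, 0.1, 0.2]  # generated values; which brand holds 0.1 is assumed
print(inverse_one_hot(row, BRANDS))  # Global Connect
```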
- When empty or abnormal feature values are not taken into account, the number of objects in the output form data is the same as the number of objects in the input form data; when empty or abnormal feature values are taken into account, the number of objects in the output form data is less than or equal to the number of objects in the input form data.
- The output form data is very close to the input form data but is not real data, so the output form data and the generative adversarial network model can be taken away from the data site without leaking the input form data.
- R&D personnel can indirectly analyze the input form data by analyzing the output form data, so that the data can be analyzed even away from the data site, which can reduce the analysis time.
- The embodiment of the present application performs standardized encoding on the input form data, so that the generative adversarial network model can be applied to the first form data obtained by the standardized encoding.
- FIG. 6 is a schematic flowchart of another data processing method according to an embodiment of the present application.
- the method may include, but is not limited to, the following steps:
- step S601 the input form data is standardized and encoded to obtain the first form data.
- the object description feature of the first form data is a numerical object description feature.
- Step S602 Based on the first form data, a second form data is generated by using a generative adversarial network model, and the similarity between the second form data and the first form data reaches a first threshold.
- Step S603 Perform inverse normalization encoding on the second form data to obtain output form data.
- the output form data has the same object description characteristics as the input form data.
- steps S601 to S603 refer to the detailed description of steps S201 to S203 in the embodiment shown in FIG. 2, and details are not described herein again.
- Step S604 Check the similarity between the input form data and the output form data.
- the data processing device 102 may obtain similarity check conditions and check the similarity between the input form data and the output form data according to the similarity check conditions to determine whether the output form data meets the similarity check conditions.
- Step S605 If the similarity between the input form data and the output form data reaches a second threshold, the generative adversarial network model and the output form data are output.
- Step S606 If the similarity between the input form data and the output form data does not reach the second threshold, the initialization parameters of the generative adversarial network model are adjusted so that the similarity between the adjusted output form data and the input form data reaches the second threshold.
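- The control flow of steps S601 to S606 can be sketched as a generate, check, adjust loop. All names below (`run_until_similar`, the toy callables, the `round` parameter) are hypothetical stand-ins, and the `max_rounds` cap is an added assumption; a real system would train and sample a generative adversarial network model inside `generate`.

```python
def run_until_similar(generate, similarity, adjust, params, threshold, max_rounds=20):
    """Repeat generate -> check -> adjust until the similarity reaches the threshold."""
    for _ in range(max_rounds):
        output = generate(params)            # S601 to S603, collapsed into one call
        if similarity(output) >= threshold:  # S604 and S605
            return params, output
        params = adjust(params)              # S606
    raise RuntimeError("similarity threshold not reached")


# Toy stand-ins so the sketch runs.
def toy_generate(params):
    return {"quality": min(1.0, 0.2 * params["round"])}


def toy_similarity(output):
    return output["quality"]


def toy_adjust(params):
    return {"round": params["round"] + 1}


final_params, final_output = run_until_similar(
    toy_generate, toy_similarity, toy_adjust, {"round": 1}, threshold=0.9
)
print(final_params)
```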
- A best classifier is usually used to verify the correctness of simulated samples, but different classifiers suit different scenarios, and selecting the best classifier for each scenario takes a long time. In view of this, the embodiments of the present application check the correctness of the output form data directly according to statistical indexes and information measurement indexes of the input form data and the output form data, without the participation of a classifier.
- the statistical index may be the ratio of the positive and negative object data
- the information measurement index may be a feature distribution or a feature correlation.
- the data processing apparatus 102 may set a similarity test condition according to a statistical index and an information measurement index, and the similarity test condition may be preset in the data processing apparatus 102.
- the similarity test condition may include at least one of a positive and negative object data ratio condition, a feature distribution test condition, and a feature-tag correlation test condition.
- the above-mentioned positive and negative object data ratio checking condition may be that a difference between the positive and negative object data ratio of the input form data and the positive and negative object data ratio of the output form data is within a first range.
- the specific range value of the first range is not limited in the embodiments of the present application, and may depend on specific situations.
- The data processing device 102 counts the ratio of positive to negative object data of the input form data and of the output form data.
- For example, if the data corresponding to an object indicates that object identifier A is an online user, the data corresponding to the object can be used as positive object data; if the data corresponding to an object indicates that object identifier B is an off-net user, the data corresponding to the object can be used as negative object data.
- If the difference between the two ratios is within the first range, the data processing device 102 may determine that the positive and negative object data ratio of the input form data and that of the output form data are the same, and may also determine that the positive and negative object data ratio of the output form data satisfies the positive and negative object data ratio test condition.
- For example, the positive and negative object data ratio of the input form data is 4:1, the positive and negative object data ratio of the output form data is 16:5, and the difference between the two is within the first range.
- If the difference is not within the first range, the data processing device 102 adjusts the initialization parameters of the generative adversarial network model so that the positive and negative object data ratio of the adjusted output form data satisfies the positive and negative object data ratio test condition. That is, adjusted second form data is generated by the adjusted generative adversarial network model, inverse normalization encoding is performed on the adjusted second form data to obtain adjusted output form data, and the positive and negative object data ratio of the adjusted output form data satisfies the positive and negative object data ratio test condition.
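- A minimal sketch of the positive and negative object data ratio test follows, using the 4:1 and 16:5 example. The label encoding (1 for positive, 0 for negative), the function names, and the width of the first range are assumptions, since the source leaves the range unspecified.

```python
def pos_neg_ratio(labels):
    """Ratio of positive (1) to negative (0) object data."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if negatives == 0:
        raise ValueError("no negative object data")
    return positives / negatives


def ratio_check(input_labels, output_labels, first_range):
    """True if the two ratios differ by at most first_range."""
    diff = abs(pos_neg_ratio(input_labels) - pos_neg_ratio(output_labels))
    return diff <= first_range


input_labels = [1] * 400 + [0] * 100   # ratio 4:1
output_labels = [1] * 320 + [0] * 100  # ratio 16:5
print(ratio_check(input_labels, output_labels, first_range=1.0))  # True
```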
- the initialization parameters of the generative adversarial network model may include the types of encoders and decoders, the number of neurons in each layer of the generative network and discriminative network, the depth of the generative network and discriminant network, the learning rate of gradient descent, and the like.
- batch normalization is an adaptive re-parameterization method, which can speed up the training convergence speed.
- The above feature distribution test condition may be that the feature distribution of the object description feature i in the output form data obeys the feature distribution of the object description feature i in the input form data, where the object description feature i is any one of the one or more object description features of the input form data.
- The data processing device 102 calculates the relative entropy of the object description feature i in the output form data relative to the object description feature i in the input form data. If the relative entropy is within the second range, it is determined that the feature distribution of the object description feature i in the output form data obeys the feature distribution of the object description feature i in the input form data, that is, the feature distribution test condition is satisfied.
- the specific range value of the second range is not limited in the embodiments of the present application, and may depend on specific situations.
- If the relative entropy is not within the second range, the initialization parameters of the generative adversarial network model are adjusted so that the feature distribution of the object description feature i in the adjusted output form data satisfies the feature distribution test condition.
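- The relative entropy test can be sketched for a category-type feature as below. The sample columns, the add-one smoothing (used here only to avoid division by zero), and the value of `SECOND_RANGE` are assumptions; the source does not specify how the distributions are estimated.

```python
import math
from collections import Counter


def distribution(values, support):
    """Empirical probabilities over a fixed support, with add-one smoothing."""
    counts = Counter(values)
    total = len(values) + len(support)
    return [(counts[v] + 1) / total for v in support]


def relative_entropy(p, q):
    """KL divergence D(p || q) in nats; zero when the distributions coincide."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


support = ["Global Connect", "China Travel", "Dynamic Zone"]
input_col = ["Global Connect"] * 50 + ["China Travel"] * 30 + ["Dynamic Zone"] * 20
output_col = ["Global Connect"] * 48 + ["China Travel"] * 31 + ["Dynamic Zone"] * 21

p_in = distribution(input_col, support)
p_out = distribution(output_col, support)
kl = relative_entropy(p_out, p_in)  # output relative to input
SECOND_RANGE = 0.01                 # assumed tolerance
print(kl <= SECOND_RANGE)
```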
- The above-mentioned feature-label correlation test condition may be that there is a strong correlation between the object description feature j and the object label in the output form data, where the object description feature j is any one of the one or more object description features of the input form data.
- the object label is used to indicate the status of the object data. Taking telecommunication domain data as an example, the object label can indicate two states: online or offline.
- The data processing device 102 calculates first mutual information between the object description feature j and the object label in the input form data, and second mutual information between the object description feature j and the object label in the output form data. If the difference between the first mutual information and the second mutual information is within the third range, it is determined that the correlation between the object description feature j and the object label in the output form data satisfies the feature-label correlation test condition.
- the data processing device 102 may calculate mutual information according to the following formula:
- If the difference is not within the third range, the initialization parameters of the generative adversarial network model are adjusted so that the correlation between the object description feature j and the object label in the adjusted output form data satisfies the feature-label correlation test condition.
- the data processing device 102 may be configured with multiple generative adversarial network models, for example, two generative adversarial network models.
- Output form data 1 may be obtained through generative adversarial network model 1, and output form data 2 may be obtained through generative adversarial network model 2.
- the data processing device 102 may verify the output form data 1 and the output form data 2 and select the output form data closest to the input form data.
- For example, the positive and negative object data ratio of the input form data is 4:1, that of output form data 1 is 16:5, and that of output form data 2 is 16:7. The difference between 4:1 and 16:5 is smaller than the difference between 4:1 and 16:7, so the positive and negative object data ratio of output form data 1 is closer to that of the input form data, and output form data 1 can be selected for analysis.
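- Selecting among several candidates can be sketched as picking the one whose ratio is closest to that of the input form data. The dictionary keys are illustrative, and the example takes 16:7 to belong to output form data 2.

```python
def ratio(positives, negatives):
    """Positive-to-negative ratio as a single number."""
    return positives / negatives


input_ratio = ratio(4, 1)
candidates = {
    "output form data 1": ratio(16, 5),
    "output form data 2": ratio(16, 7),  # assumed to belong to candidate 2
}

# Pick the candidate with the smallest absolute difference from the input ratio.
closest = min(candidates, key=lambda name: abs(candidates[name] - input_ratio))
print(closest)  # output form data 1
```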
- As another example, suppose the object description feature j takes the value 0 or 1 and the object label takes one of two values, online or off-net. The numbers of times the combinations (0,0), (0,1), (1,0), and (1,1) appear in the input form data are (100, 200, 50, 100), and the numbers of times they appear in output form data 1 and output form data 2 are (90, 180, 60, 120) and (80, 170, 70, 130), respectively.
- The mutual information corresponding to the input form data and the two pieces of output form data is -2749.16, -2749.16, and -2748.94, respectively. It can be seen that the correlation between (90, 180, 60, 120) and the label is closer to the correlation between the real data (100, 200, 50, 100) and the label.
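- The three values quoted above can be reproduced with a count-based expression, the sum over all combinations of n(x,y) * ln(n(x,y) / (n(x) * n(y))). Using this expression is an assumption: it was chosen because it matches the quoted figures, and it differs from the normalized textbook definition of mutual information, which is never negative. The function name `count_score` is hypothetical.

```python
import math


def count_score(counts):
    """Sum of n(x,y) * ln(n(x,y) / (n(x) * n(y))) over a 2x2 count table.

    counts = (n00, n01, n10, n11), indexed as (feature value, label value).
    """
    n00, n01, n10, n11 = counts
    table = [[n00, n01], [n10, n11]]
    row = [sum(r) for r in table]                      # feature marginals n(x)
    col = [table[0][y] + table[1][y] for y in (0, 1)]  # label marginals n(y)
    return sum(
        table[x][y] * math.log(table[x][y] / (row[x] * col[y]))
        for x in (0, 1)
        for y in (0, 1)
        if table[x][y] > 0
    )


for counts in [(100, 200, 50, 100), (90, 180, 60, 120), (80, 170, 70, 130)]:
    print(counts, round(count_score(counts), 2))
```

Under this expression, output form data 1 scores the same as the input form data, which agrees with the conclusion that it is the closer of the two candidates.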
- the similarity between the input form data and the output form data reaches a second threshold
- the specific value of the second threshold is not limited in the embodiments of the present application.
- the second threshold value and the first threshold value may be different values or the same value.
- The data processing device 102 can output the output form data and the generative adversarial network model, that is, both can be taken away from the data site. With the output form data taken away from the data site, the output form data can be analyzed off site, which indirectly realizes the analysis of the input form data. With the generative adversarial network model taken away from the data site, the generative adversarial network model can be studied.
- The data processing device 102 may test the first generative adversarial network model according to the output form data to determine whether the first output form data meets the test condition, that is, check the similarity between the first output form data and the output form data.
- The first output form data is output form data obtained by using the first generative adversarial network model. For the specific test method, refer to the test of the output form data. If the first output form data does not satisfy the similarity test condition, the initialization parameters of the first generative adversarial network model are adjusted according to the output form data.
- The first output form data may be output form data that the data processing device 102 obtained earlier from the input form data by using the first generative adversarial network model.
- That is, the data processing device 102 adjusts the initialization parameters of a previous generative adversarial network model according to the currently obtained output form data.
- the data processing apparatus 70 may include an encoding unit 701 and a generating unit 702.
- An encoding unit 701 configured to perform standardized encoding on the input form data to obtain the first form data, where the object description feature of the first form data is a numerical object description feature;
- a generating unit 702 configured to generate a second form data using a generative adversarial network model based on the first form data, and the similarity between the second form data and the first form data reaches a first threshold;
- The encoding unit 701 is further configured to perform inverse normalization encoding on the second form data to obtain output form data, where the output form data has the same object description features as the input form data.
- The above-mentioned encoding unit 701 is configured to perform steps S201 and S203 in the embodiment shown in FIG. 2, and the above-mentioned generating unit 702 is configured to perform step S202 in the embodiment shown in FIG. 2. For details, see the specific description of the embodiment shown in FIG. 2, which is not repeated here.
- the input form data has one or more object description characteristics.
- Object description features have semantics.
- the input form data has at least one of a category-type object description feature and a numeric-type object description feature.
- the feature value corresponding to the category-type object description feature is non-numeric, and the feature value corresponding to the numeric-type object description feature is numeric.
- The input form data has category-type object description features; when performing standardized encoding on the input form data, the encoding unit 701 is specifically configured to obtain the feature values corresponding to the category-type object description features from the input form data, and to perform one-hot encoding on those feature values.
- When performing one-hot encoding on the feature values corresponding to the category-type object description features, the encoding unit 701 is specifically configured to encode those feature values from non-numeric values into numeric values.
- The input form data has numerical object description features; when performing standardized encoding on the input form data, the encoding unit 701 is specifically configured to obtain the feature values corresponding to the numerical object description features from the input form data, and to perform normalization encoding on those feature values.
- When performing normalization encoding on the feature values corresponding to the numerical object description features, the encoding unit 701 is specifically configured to map those feature values to the same numerical interval.
- the data processing apparatus 70 further includes a checking unit 703, an output unit 704, and an adjusting unit 705.
- a checking unit 703, configured to check the similarity between the input form data and the output form data
- An output unit 704 is configured to output a generative adversarial network model and output form data if the similarity between the input form data and the output form data reaches a second threshold;
- An adjusting unit 705 is configured to adjust the initialization parameters of the generative adversarial network model if the similarity between the input form data and the output form data does not reach the second threshold, so that the similarity between the adjusted output form data and the input form data reaches the second threshold.
- When checking the similarity between the input form data and the output form data, the checking unit 703 is specifically configured to obtain a similarity test condition, and to test the similarity between the input form data and the output form data according to the similarity test condition to determine whether the output form data meets the similarity test condition.
- The similarity test condition includes a positive and negative object data ratio test condition; when testing the similarity between the input form data and the output form data according to the similarity test condition, the checking unit 703 is specifically configured to: count the positive and negative object data ratio of the input form data and that of the output form data; determine whether the difference between the two ratios is within the first range; and, if the difference is within the first range, determine that the two ratios are the same and that the positive and negative object data ratio of the output form data satisfies the positive and negative object data ratio test condition.
- The similarity test condition includes a feature distribution test condition; when testing the similarity between the input form data and the output form data according to the similarity test condition, the checking unit 703 is specifically configured to: calculate the relative entropy of the object description feature i in the output form data relative to the object description feature i in the input form data, where the object description feature i is any one of the one or more object description features; and determine whether the relative entropy is within the second range.
- The similarity test condition includes a feature-label correlation test condition; when testing the similarity between the input form data and the output form data according to the similarity test condition, the checking unit 703 is specifically configured to: calculate first mutual information between the object description feature j and the object label in the input form data, and second mutual information between the object description feature j and the object label in the output form data, where the object description feature j is any one of the one or more object description features; determine whether the difference between the first mutual information and the second mutual information is within the third range; and, if the difference is within the third range, determine that the correlation between the object description feature j and the object label in the output form data satisfies the feature-label correlation test condition.
- the checking unit 703 is further configured to check the similarity between the output table data and first output table data when the similarity between the input table data and the output table data reaches a second threshold, where the first output table data is output table data obtained by using a first generative adversarial network model that has already been output;
- the adjusting unit 705 is further configured to adjust the initialization parameters of the first generative adversarial network model if the similarity between the output table data and the first output table data does not reach the second threshold.
- the data processing apparatus 70 may implement the functions of the data processing apparatus in the foregoing method embodiments. For detailed processes performed by each unit in the data processing apparatus 70, refer to the execution steps of the data processing apparatus in the foregoing method embodiments, which are not described herein again.
- FIG. 8 is a simplified schematic diagram of a physical structure of a data processing apparatus according to an embodiment of the present application.
- the data processing apparatus 80 includes a transceiver 801, a processor 802, and a memory 803.
- the transceiver 801, the processor 802, and the memory 803 may be connected to each other through a bus 804, or may be connected in other ways.
- Relevant functions implemented by the encoding unit 701, the generation unit 702, the checking unit 703, the output unit 704, and the adjustment unit 705 shown in FIG. 7 may be implemented by the processor 802.
- the transceiver 801 is configured to send data and/or signaling and to receive data and/or signaling. In the embodiments of this application, the transceiver 801 is configured to communicate with a server, for example to obtain input table data from the server.
- the processor 802 may include one or more processors, for example one or more central processing units (CPUs).
- when the processor 802 is one CPU, the CPU may be a single-core CPU or a multi-core CPU.
- the processor 802 is configured to execute steps S201 to S203 in the embodiment shown in FIG. 2, and is further configured to execute steps S601 to S606 in the embodiment shown in FIG. 6.
- the memory 803 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM).
- the memory 803 is used to store related instructions and data.
- the memory 803 is configured to store program codes and data of the data processing apparatus 80.
- FIG. 8 only shows a simplified design of the data processing device.
- in practice, the data processing device may also include other necessary components, including but not limited to any number of transceivers, processors, controllers, memories, communication units, etc., and all devices that can implement this application fall within the protection scope of this application.
- a person of ordinary skill in the art may understand that all or part of the processes in the methods of the foregoing embodiments may be completed by a computer program instructing related hardware.
- the program may be stored in a computer-readable storage medium, and when executed, may include the processes of the foregoing method embodiments.
- the foregoing storage media include various media that can store program code, such as ROM, random access memory (RAM), magnetic disks, or optical discs. Therefore, another embodiment of the present application provides a computer-readable storage medium that stores instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
- Yet another embodiment of the present application further provides a computer program product containing instructions, which when executed on a computer, causes the computer to execute the methods described in the above aspects.
- the disclosed systems, devices, and methods may be implemented in other ways.
- the device embodiments described above are only schematic.
- the division of the unit is only a logical function division.
- in an actual implementation, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, which may be electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing module, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium, or transmitted through the computer-readable storage medium.
- the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave).
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
- the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
Landscapes
- Business, Economics & Management (AREA)
- Engineering & Computer Science (AREA)
- Strategic Management (AREA)
- Finance (AREA)
- Accounting & Taxation (AREA)
- Development Economics (AREA)
- Marketing (AREA)
- Theoretical Computer Science (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Game Theory and Decision Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Human Resources & Organizations (AREA)
- Primary Health Care (AREA)
- Tourism & Hospitality (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
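Embodiments of this application provide a data processing method and an apparatus thereof. The method may include the following steps: performing standardized encoding on input table data to obtain first table data, where the object description features of the first table data are numerical object description features; generating second table data from the first table data by using a generative adversarial network model, where the similarity between the second table data and the first table data reaches a first threshold; and performing inverse standardized encoding on the second table data to obtain output table data, where the output table data has the same object description features as the input table data. With the embodiments of this application, output data very close to the input data can be constructed, so that the data can be analyzed even away from the data site.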
Description
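This application claims priority to Chinese Patent Application No. 201810630422.5, filed with the China Patent Office on June 19, 2018 and entitled "Data processing method and apparatus thereof" (一种数据处理方法及其装置), which is incorporated herein by reference in its entirety.
Embodiments of this application relate to the field of data processing technologies, and in particular to a data processing method and an apparatus thereof.
With the rapid development of big data technologies, telecom operators are paying increasing attention to how to turn massive, disorganized telecom-domain data into valuable information for applications such as package recommendation, customer retention, and base-station traffic prediction. However, telecom-domain data has the following particularities, which make it difficult to analyze.
Particularity 1: telecom-domain data cannot be taken away from the telecom site. Away from the site, no model can be built on the data and the data cannot be analyzed. Particularity 2: sample data of specific types is missing. Missing sample data of a specific type significantly affects model building and therefore data analysis. For example, in off-network (churn) user prediction, the number of off-network users is very small, so the numbers of positive and negative samples are highly imbalanced, which in turn affects the analysis of off-network users.
In view of these particularities, how to analyze telecom-domain data away from the telecom site is a technical problem that urgently needs to be solved.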
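Summary
The technical problem to be solved by the embodiments of this application is to provide a data processing method and an apparatus thereof that can construct output data very close to the input data, so that the data can be analyzed even away from the data site.
A first aspect of the embodiments of this application provides a data processing method, including:
performing standardized encoding on input table data to obtain first table data, where the object description features of the first table data are numerical object description features;
generating second table data from the first table data by using a generative adversarial network model, where the similarity between the second table data and the first table data reaches a first threshold; and
performing inverse standardized encoding on the second table data to obtain output table data, where the output table data has the same object description features as the input table data.
In the first aspect, standardized encoding is performed on the input table data so that the resulting first table data can be used with a generative adversarial network model to generate second table data. Inverse standardized encoding is then performed on the second table data to obtain output table data that is very close to the input table data, thereby simulating the input table data. The output table data can be analyzed directly, which indirectly achieves analysis of the input table data, so that the input table data can be analyzed quickly even away from the data site.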
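In a possible implementation, the input table data is the original table data input to the data processing apparatus, that is, data provided by a data site, for example telecom-domain data provided by a telecom site. The input table data has one or more object description features. An object description feature describes a characteristic of an object: one field in the table data is one object description feature, one column corresponds to one field, and one row corresponds to one object. An object may also be called a sample.
In a possible implementation, the object description features have semantics, that is, each field is given a meaning. Unlike multimedia data, the features of table data not only have semantics but also differ in importance, and they are not subject to any temporal or spatial distribution requirement. Because of these characteristics, table data cannot be applied directly to a generative adversarial network model, so the embodiments of this application perform standardized encoding on the input table data.
In a possible implementation, the input table data has at least one of categorical object description features and numerical object description features. The feature values of a categorical object description feature are non-numeric, and the feature values of a numerical object description feature are numeric. In other words, any object description feature in the input table data may be categorical or numerical, and different types of object description features are encoded differently.
In a possible implementation, when the input table data has categorical object description features, the standardized encoding of the input table data is as follows: obtain the feature values of the categorical object description features from the input table data, and perform one-hot encoding on those feature values.
One-hot encoding of the feature values of a categorical object description feature encodes them from non-numeric values into numeric values so that they can be applied to the generative adversarial network model.
In a possible implementation, when the input table data has numerical object description features, the standardized encoding of the input table data is as follows: obtain the feature values of the numerical object description features from the input table data, and perform normalization encoding on those feature values.
Normalization encoding of the feature values of numerical object description features maps those feature values to the same numerical interval so that they can be better applied to the generative adversarial network model.
It can be understood that the purpose of one-hot encoding is to represent categorical object description features with specific numeric values, and the purpose of normalization encoding is to map the value ranges of numerical object description features to the same numerical interval. The first table data obtained by standardized encoding can then be applied to the generative adversarial network model, so that second table data can be generated from the first table data.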
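In a possible implementation, once the output table data is obtained, it needs to be checked, that is, the similarity between the input table data and the output table data is checked. If the similarity between the input table data and the output table data reaches a second threshold, the output table data and the generative adversarial network model are output. If the similarity does not reach the second threshold, the initialization parameters of the generative adversarial network model are adjusted so that the similarity between the adjusted output table data and the input table data reaches the second threshold.
The initialization parameters of the generative adversarial network model may include the types of encoder and decoder, the number of neurons in each layer of the generator network and the discriminator network, the depth of the generator network and the discriminator network, the learning rate of gradient descent, and so on.
In a possible implementation, a similarity check condition may be obtained, and the similarity between the input table data and the output table data is checked according to the similarity check condition to determine whether the output table data satisfies it, thereby checking the similarity between the input table data and the output table data.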
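In a possible implementation, the similarity check condition may include a positive-to-negative object data ratio check condition. Checking the similarity between the input table data and the output table data according to the similarity check condition may specifically include: counting the positive-to-negative object data ratio of the input table data and that of the output table data; determining whether the difference between the two ratios is within a first range; and, if the difference is within the first range, determining that the two ratios are consistent and that the positive-to-negative object data ratio of the output table data satisfies the positive-to-negative object data ratio check condition.
It can be understood that the positive-to-negative object data ratio check is a statistical-indicator check: the ratios of the input table data and the output table data are counted to determine whether the ratio of the output table data satisfies the check condition. It is simple and convenient to implement.
In a possible implementation, the similarity check condition may include a feature distribution check condition. Checking the similarity between the input table data and the output table data according to the similarity check condition may specifically include: calculating the relative entropy of object description feature i in the output table data relative to object description feature i in the input table data, where object description feature i is any one of the one or more object description features; determining whether the relative entropy is within a second range; and, if the relative entropy is within the second range, determining that the feature distribution of object description feature i in the output table data follows that of object description feature i in the input table data, and that it satisfies the feature distribution check condition.
It can be understood that the feature distribution check is an information-measure check: the relative entropy of an object description feature is calculated to determine whether the distribution of that feature in the output table data satisfies the feature distribution check condition. It is simple and convenient to implement.
In a possible implementation, the similarity check condition includes a feature-label correlation check condition. Checking the similarity between the input table data and the output table data according to the similarity check condition may specifically include: calculating first mutual information between object description feature j and the object label in the input table data, and second mutual information between object description feature j and the object label in the output table data, where object description feature j is any one of the one or more object description features; determining whether the difference between the first mutual information and the second mutual information is within a third range; and, if the difference is within the third range, determining that the correlation between object description feature j and the object label in the output table data satisfies the feature-label correlation check condition.
It can be understood that the feature-label correlation check is an information-measure check: the mutual information between an object description feature and the object label is calculated to determine whether the correlation between that feature and the object label in the output table data satisfies the feature-label correlation check condition. It is simple and convenient to implement.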
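In a possible implementation, when the similarity between the input table data and the output table data reaches the second threshold, the similarity between the output table data and first output table data is checked. The first output table data is output table data obtained by using a first generative adversarial network model that has already been output; it may be output table data previously obtained from the input table data by using the first generative adversarial network model. The current output table data is used to check the earlier first output table data, and thereby the earlier first generative adversarial network model.
If the similarity between the output table data and the first output table data does not reach the second threshold, the initialization parameters of the first generative adversarial network model are adjusted to improve the accuracy of the first generative adversarial network model.
If the similarity between the output table data and the first output table data reaches the second threshold, it can be determined that the first output table data and the first generative adversarial network model are usable, that is, the first output table data can be analyzed in order to analyze the input table data.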
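A second aspect of the embodiments of this application provides a data processing apparatus that has the function of implementing the method provided in the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more units corresponding to the function.
In a possible implementation, the data processing apparatus includes an encoding unit and a generation unit. The encoding unit is configured to perform standardized encoding on input table data to obtain first table data, where the object description features of the first table data are numerical object description features. The generation unit is configured to generate second table data from the first table data by using a generative adversarial network model, where the similarity between the second table data and the first table data reaches a first threshold. The encoding unit is further configured to perform inverse standardized encoding on the second table data to obtain output table data, where the output table data has the same object description features as the input table data.
In a possible implementation, the data processing apparatus includes a processor, a transceiver, and a memory. The transceiver is configured to receive and send information, the memory stores computer-executable instructions, and the processor is connected to the memory and the transceiver through a bus. The processor executes the computer-executable instructions stored in the memory so that the data processing apparatus performs the following operations: performing standardized encoding on input table data to obtain first table data, where the object description features of the first table data are numerical object description features; generating second table data from the first table data by using a generative adversarial network model, where the similarity between the second table data and the first table data reaches a first threshold; and performing inverse standardized encoding on the second table data to obtain output table data, where the output table data has the same object description features as the input table data.
Based on the same inventive concept, for the problem-solving principle and beneficial effects of the data processing apparatus, refer to the method of the first aspect and its beneficial effects. For the implementation of the apparatus, refer to the implementation of the method; repeated details are not described again.
A third aspect of the embodiments of this application provides a computer-readable storage medium that stores instructions which, when run on a computer, cause the computer to execute the method of the first aspect.
A fourth aspect of the embodiments of this application provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the method of the first aspect.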
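To describe the technical solutions in the embodiments of this application or in the background more clearly, the accompanying drawings used in the embodiments or the background are described below.
FIG. 1 is a schematic diagram of a network architecture to which the embodiments of this application are applied;
FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of this application;
FIG. 3 is an example diagram of one-hot encoding;
FIG. 4 is an example diagram of normalization encoding;
FIG. 5 is an example diagram of inverse one-hot encoding;
FIG. 6 is a schematic flowchart of another data processing method according to an embodiment of this application;
FIG. 7 is a schematic diagram of a logical structure of a data processing apparatus according to an embodiment of this application;
FIG. 8 is a simplified schematic diagram of a physical structure of a data processing apparatus according to an embodiment of this application.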
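The technical terms used in the embodiments of this application are introduced below.
(1) Table data: data presented in table form, either as a wide table or as a narrow table. Literally, a wide table is a database table with many fields. It usually refers to a single database table in which the indicators, dimensions, and attributes related to a business topic are joined together. For example, Table 1 below shows table data in the telecom domain.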
Table 1
User name | Mobile number | Home location | Package type | … |
Zhang San | XXXXXXXXXXX | Location A | Package 1 | … |
Li Si | XXXXXXXXXXX | Location B | Package 2 | … |
Wang Wu | XXXXXXXXXXX | Location C | Package 3 | … |
… | … | … | … | … |
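Table data is all the sample data presented in the table. One row is one sample, and one column is one feature. For example, in Table 1 the row for user name Zhang San is one sample, which includes Zhang San's mobile number, the home location of that number, the package type of that number, and so on. Columns such as home location and package type are features, and Location A, Location B, Package 1, Package 2, and so on are the feature values of those features. In the embodiments of this application, a sample may be called an object, the user name or mobile number may be called an object identifier, and the home location, package type, and so on may be called object description features.
As Table 1 shows, table data may have one or more object description features. Each object description feature has semantics, that is, it is given a specific meaning. For example, the object description feature "home location" indicates the country or city to which the mobile number belongs.
Table data differs from multimedia data such as images and speech. Multimedia data has no semantics, image features are equally important, image features are subject to temporal or spatial distribution requirements, and image features are all numeric. For example, a pixel in image data has no semantics: it cannot convey specific information by itself and can only be represented by its numeric value. All pixels are equally important, and different pixels occupy different spatial positions. As another example, in speech data any two frames differ in their position in time. By contrast, the object description features of table data have semantics, differ in importance, and are not subject to temporal or spatial distribution requirements. For example, the home location and package type in Table 1 may differ in importance, have no temporal or spatial distribution requirement, and take values expressed in text rather than numbers.
In the embodiments of this application, the input table data is the original table data input to the data processing apparatus. It has not been processed by standardized encoding, the generative adversarial network model, or inverse standardized encoding. It is real table data, that is, data provided by the data site. Taking telecom-domain data as an example, the input table data is table data collected, organized, or stored by a server at the telecom site. The output table data is table data produced by the data processing apparatus through standardized encoding, the generative adversarial network model, and inverse standardized encoding. It is constructed table data that simulates the input table data and is not real table data.
The object description features in the embodiments of this application may be divided into categorical object description features and numerical object description features. For example, in Table 1 the home location and package type are categorical object description features, which describe information in text. A user's monthly or annual charges are numerical object description features, which describe information with numbers. In other words, the feature values of a categorical object description feature are text, that is, non-numeric, and the feature values of a numerical object description feature are specific numbers.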
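Table data has at least one of categorical object description features and numerical object description features. That is, all of its object description features may be categorical, all of them may be numerical, or some may be categorical and the others numerical.
(2) Generative adversarial network (GAN) model: mainly used in multimedia data scenarios and composed of two parts, a generator and a discriminator. The generator learns the distribution of real images and generates images, making them as realistic as possible in order to fool the discriminator. The discriminator judges whether the images it receives are real or fake. Throughout the process, the generator does its utmost to make its generated images closer to real images, while the discriminator tries hard to tell real images from fake ones. This resembles a two-player game: as training proceeds, the generator and the discriminator keep competing until they reach a dynamic equilibrium in which the images produced by the generator are close to the real image distribution and the discriminator cannot tell whether they are real or fake.
For a real image, the discriminator assigns the label 1; for a generated (fake) image, the discriminator assigns the label 0. The generator wants the discriminator to label the generated images it passes on as 1, but the discriminator does not do so from the start: it judges them against real images. If the discriminator labels a generated image as 0, the generator adjusts the generated image and passes it to the discriminator again, until the discriminator labels the generated image as 1. At that point the discriminator can no longer tell whether the generated image is real or fake. It can be understood that the generated image is then very close to the real image and can be used in place of it.
The GAN model can be described mathematically as follows. Let the generative model be G(z), where z is random noise and G transforms this noise into an image x. D is a discriminative model: for an input image x, D(x) outputs a real number between 0 and 1 that represents the probability that the image is real.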
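The source describes G and D only in words. In the GAN literature, the game between them is commonly written as the minimax objective below; the symbols p_data (distribution of real samples) and p_z (noise distribution) are notation assumed here and do not appear in the source.

```latex
\min_{G}\max_{D} V(D,G)
  = \mathbb{E}_{x\sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z\sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator D maximizes V by pushing D(x) toward 1 on real samples and D(G(z)) toward 0 on generated ones, while the generator G minimizes V by making D(G(z)) approach 1, which matches the label-1 versus label-0 description above.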
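The embodiments of this application apply the GAN model to table data scenarios, for example telecom-domain data, to simulate input table data, for example to generate simulated telecom-domain data, and thereby overcome the drawback that telecom-domain data cannot be taken away from the site.
(3) Standardized encoding: makes the data of different object description features influence a model, for example a generative adversarial network model, within the same range. In the embodiments of this application, standardized encoding is divided into one-hot encoding and normalization encoding.
One-hot encoding encodes the data of categorical object description features so that, after encoding, the distances between the different values of a categorical object description feature are equal. It can be understood that one-hot encoding converts information described in text into information described in numbers.
Normalization encoding encodes the data of numerical object description features. It regularizes inputs and outputs by mapping the different values of different object description features to the same interval, which helps deep learning reach an optimal solution faster. It can be understood that normalization encoding places the encoded values within the same interval.
In view of the differences between multimedia data and table data, and the particularities of table data such as telecom-domain data, the embodiments of this application provide a data processing method and an apparatus thereof that can construct output table data very close to the input table data, so that the data can be analyzed even away from the data site.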
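The embodiments of this application are described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of a network architecture to which the embodiments of this application are applied. The network architecture shown in FIG. 1 includes a server 101 and a data processing apparatus 102.
The server 101 is a server at the data site and is responsible for collecting, summarizing, organizing, and storing the data of the data site to form table data. The table data stored by the server 101 may be private. For telecom-domain data, for example, a leak not only affects users' daily lives but may also cause them financial loss and may even violate laws and regulations. Because this table data is private, it cannot be taken away from the data site, for example from the telecom site.
Being unable to take table data away from the data site greatly affects research and development personnel. For example, they can analyze the table data only at the data site, so travelling between the data site and their company costs time and money, and the table data cannot be analyzed conveniently and quickly.
The server 101 can communicate with the data processing apparatus 102, for example to receive a request for table data sent by the data processing apparatus 102 and to send table data to the data processing apparatus 102.
The data processing apparatus 102 is the apparatus provided in the embodiments of this application and is configured to execute the data processing method provided in the embodiments of this application.
The data processing apparatus 102 may be deployed at the data site and may communicate with the server 101. For example, the data processing apparatus 102 sends a request for table data to the server 101 and receives the table data that the server 101 sends to it.
It should be noted that, given the privacy of the table data, data transmission by the data processing apparatus 102 is subject to permissions. For example, sending table data obtained from the server 101 to other networks is prohibited, or copying table data obtained from the server 101 through a data interface is prohibited.
The data processing apparatus 102 may be provided with at least one generative adversarial network model, and these models may be implemented in software.
The server 101 and the data processing apparatus 102 may be independent devices, or the data processing apparatus 102 may be integrated into the server 101. The specific implementation form is not limited.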
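FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of this application. The method may include, but is not limited to, the following steps:
Step S201: perform standardized encoding on input table data to obtain first table data, where the object description features of the first table data are numerical object description features.
In a possible implementation, before performing step S201, the data processing apparatus 102 obtains the input table data from the server 101. The data processing apparatus 102 may send a request for the input table data to the server. On receiving the request, the server 101 may authenticate the data processing apparatus 102, that is, verify its login account or apparatus identifier, to determine whether the login account or user of the data processing apparatus 102 has permission to obtain the input table data. The server 101 may send the input table data to the data processing apparatus 102 once the apparatus has passed authentication.
The data processing apparatus 102 may carry in the request the number of object data records it needs. For example, if it requests 1000 object data records, the input table data includes 1000 object data records. One object data record may include one object identifier and the one or more object description features corresponding to that identifier. In telecom-domain data, for example, one object data record may include a mobile number and the corresponding home location, package type, monthly charge, annual charge, and other object description features.
The data processing apparatus 102 may specify in the request which table data to obtain. For example, the request may specify table data whose mobile numbers end in the four digits 0000-5000, or table data with a network tenure of more than 10 years.
It can be understood that the input table data is data collected, organized, and stored by the server 101. The input table data may be a data set containing multiple records and may be all or part of the data stored by the server 101. The number of records in the input table data is not limited in the embodiments of this application and depends on the specific situation.
Each record in the input table data has one or more object description features, and each object description feature has semantics. In addition, different features of the input table data differ in importance, and no feature is subject to a temporal or spatial distribution requirement. A feature of the input table data may be categorical or numerical, depending on the specific situation.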
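In a possible implementation, after obtaining the input table data, the data processing apparatus 102 may traverse and filter the input table data and then perform standardized encoding on the filtered input table data.
For categorical object description features, the data processing apparatus 102 checks whether each object has a specific value, that is, a text description, for the feature, for example whether the home location of an object is a specific city. If there is no specific value, for example the home location of the object is empty, the data of that object is deleted from the input table data.
For numerical object description features, the data processing apparatus 102 likewise checks whether each object has a specific numeric value for the feature. If not, the data of that object is deleted from the input table data. The data processing apparatus 102 may also check whether the numeric value is abnormal. For example, if the network tenure of an object is 150 years, which exceeds the upper limit of network tenure, the value can be treated as abnormal, and the data processing apparatus 102 may delete the data of that object from the input table data.
Filtering the input table data after obtaining it prevents missing values and abnormal values from affecting the output table data.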
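The filtering described above can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the field names `home_location` and `tenure_years` and the upper bound `MAX_TENURE_YEARS` are assumptions made for the example (the source only says that 150 years exceeds the limit).

```python
# Hypothetical upper bound on network tenure; the source gives no exact limit.
MAX_TENURE_YEARS = 100


def filter_rows(rows):
    """Drop objects with a missing categorical value, a missing numeric value,
    or an abnormal numeric value."""
    kept = []
    for row in rows:
        if not row.get("home_location"):      # empty or missing text value
            continue
        tenure = row.get("tenure_years")
        if tenure is None:                     # missing numeric value
            continue
        if tenure < 0 or tenure > MAX_TENURE_YEARS:   # abnormal value
            continue
        kept.append(row)
    return kept


rows = [
    {"id": "u1", "home_location": "A", "tenure_years": 12},
    {"id": "u2", "home_location": "", "tenure_years": 3},
    {"id": "u3", "home_location": "B", "tenure_years": None},
    {"id": "u4", "home_location": "C", "tenure_years": 150},
]
kept_ids = [r["id"] for r in filter_rows(rows)]
print(kept_ids)
```

Only the first object survives: the others have an empty home location, a missing tenure, and a tenure of 150 years respectively.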
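In a possible implementation, the data processing apparatus 102 obtains from the input table data the feature values of the categorical object description features and the feature values of the numerical object description features, that is, it separates the two kinds of feature values. If the input table data has only categorical or only numerical object description features, no separation is needed. It can be understood that when the input table data has both kinds of features, it is split into two tables for standardized encoding: one containing the feature values of the categorical object description features and the other containing the feature values of the numerical object description features.
The data processing apparatus 102 applies a different form of standardized encoding to each type of object description feature.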
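One way to separate the two kinds of features is to inspect the value types column by column. The sketch below assumes each object is a dictionary, that the key `id` holds the object identifier, and that a feature counts as numerical only if every value is an `int` or `float`; all of these are illustrative assumptions.

```python
def split_features(rows, id_key="id"):
    """Return (categorical_names, numerical_names) for a list of row dicts."""
    names = [k for k in rows[0] if k != id_key]
    numerical = [
        k for k in names
        if all(isinstance(r[k], (int, float)) and not isinstance(r[k], bool)
               for r in rows)
    ]
    categorical = [k for k in names if k not in numerical]
    return categorical, numerical


rows = [
    {"id": "u1", "home_location": "A", "plan": "plan1", "monthly_fee": 58.0},
    {"id": "u2", "home_location": "B", "plan": "plan2", "monthly_fee": 128},
]
categorical, numerical = split_features(rows)
print(categorical, numerical)
```

The categorical columns would then go to one-hot encoding and the numerical columns to normalization encoding.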
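For the feature values of categorical object description features, the data processing apparatus 102 uses one-hot encoding as the standardized encoding. Intuitively, a one-hot code has as many bits as there are states, with exactly one bit set to 1 and all the others set to 0.
See the example of one-hot encoding in FIG. 3, which uses telecom-domain data. The object identifier may be a user name, a mobile number, or a similar identifier. The brand may also be called the package type, and GoTone (全球通), Easyown (神州行), and M-Zone (动感地带) are the data of the object description feature "brand". The brands shown in FIG. 3 are examples only and do not limit the embodiments of this application. After one-hot encoding, each object takes the value 1 in exactly one field. For example, if the brand of an object is GoTone, then after one-hot encoding the GoTone field is 1 and the other fields are 0.
It can be understood that, after one-hot encoding, the feature value of a categorical object description feature is 1 in one field and 0 in the remaining fields, that is, 1 and 0 are used to describe the feature value.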
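A minimal sketch of one-hot encoding for a single categorical feature is shown below. The function name and the choice to order categories alphabetically are assumptions for illustration; the brand strings follow the example in the text.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with a single 1."""
    categories = sorted(set(values))              # one field per distinct value
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1                         # exactly one field takes the value 1
        encoded.append(vec)
    return categories, encoded


brands = ["GoTone", "M-Zone", "Easyown", "GoTone"]
categories, encoded = one_hot_encode(brands)
print(categories)
print(encoded)
```

Each row of `encoded` sums to 1, and any two distinct brands are the same distance apart, which is the equal-distance property attributed to one-hot encoding in the terminology section.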
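For the feature values of numerical object description features, the data processing apparatus 102 uses normalization encoding as the standardized encoding. The purpose of normalization encoding is to map the different feature values of different object description features to the same numerical interval, for example [0,1] or [0,99]. For [0,1], the encoded values are decimals between 0 and 1; for [0,99], the encoded values are integers between 0 and 99.
Normalization encoding may include, but is not limited to, min-max standardization, standard score (z-score) standardization, and arctangent (atan) standardization.
See the example of normalization encoding in FIG. 4, which uses telecom-domain data and z-score standardization. Age may denote the user's actual age or the user's network tenure, with a value range of [0,100]. The average revenue per user (ARPU) value may denote what the operator earns from each user over a period of time, with a value range of [0,5]. The age and ARPU values shown in FIG. 4 are examples only and do not limit the embodiments of this application.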
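z-score standardization can be implemented with the following formula, where x is a feature value, μ is the mean of the feature's values, and σ is their standard deviation:
$$x^{*} = \frac{x - \mu}{\sigma}$$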
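As FIG. 4 shows, after normalization encoding the feature values of numerical object description features are mapped to a numerical interval, which may be [0,1], [0,99], or similar. However many numerical object description features there are, their feature values are all mapped to the same interval, which helps deep learning reach an optimal solution faster.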
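The two most common normalization encodings named above can be sketched as follows. The sample ages and ARPU values are invented for illustration, and the use of the population standard deviation in `z_score` is an assumption, since the source does not say which variant is meant.

```python
from statistics import mean, pstdev


def min_max(values, lo=0.0, hi=1.0):
    """Min-max standardization: map values linearly onto [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:                      # constant column: avoid division by zero
        return [lo for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


def z_score(values):
    """z-score standardization: subtract the mean, divide by the std deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [20, 40, 60, 100]          # hypothetical ages in [0, 100]
arpu = [1.0, 2.0, 3.0, 5.0]       # hypothetical ARPU values in [0, 5]
print(min_max(ages), min_max(arpu))
print(z_score(ages))
```

With min-max encoding both columns land on the same [0,1] interval even though their original ranges differ. z-score output is centered on 0 and not bounded, so a further rescaling step would be needed if a fixed interval such as [0,1] is required.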
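After standardized encoding of the input table data, the first table data is obtained. The object description features of the first table data are numerical, that is, the first table data has no categorical object description features. It can be understood that: when the input table data has only categorical object description features, their feature values are encoded as numbers, or equivalently each categorical feature is converted into numerical features, so that every feature value in the first table data is 0 or 1; when the input table data has only numerical object description features, the feature values in the first table data lie in a specific interval such as [0,1] or [0,99]; and when the input table data has both kinds of features, the feature values derived from categorical features are 0 or 1 and those of numerical features lie in a specific interval. To make it easier for the generative adversarial network model to generate the second table data, the value range of the numerical object description features may be set to [0,1].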
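Step S202: generate second table data from the first table data by using a generative adversarial network model, where the similarity between the second table data and the first table data reaches a first threshold.
Having obtained the first table data, the data processing apparatus 102 uses the generative adversarial network model to generate the second table data. The similarity between the second table data and the first table data reaches the first threshold, whose specific value is not limited in the embodiments of this application. That the similarity reaches the first threshold means that the decision label of the second table data is the "real" label. In other words, the second table data is so close to the first table data that it cannot be identified as constructed data. The second table data finally generated by the model is the result of the contest between the generator and the discriminator.
The generator of the generative adversarial network model may be implemented by a generator network, and the discriminator by a discriminator network.
In a possible implementation, a group of random noise, usually Gaussian noise, for example (0.2, 0.7, 0.6, -0.5, 0.1), is transformed by the generator network into a new vector whose dimension equals the number of features, for example (0.32, 0.63, 0.89, 0.65, 0.21, 0.69, 0.85, 0.01, 0.36). The vector produced by the generator network enters the discriminator network, which judges against the real samples, that is, the input table data, whether the vector is real.
In the generator network, the number of input-layer neurons is the random noise dimension plus 1, and the number of output-layer neurons is the number of object description features. In the discriminator network, the number of input-layer neurons is the number of object description features plus 1, and the number of output-layer neurons is 1. The type of network and the number of layers in the generator and discriminator networks may be set by the user according to the actual situation, for example according to the number of records in the input table data.
While the generative adversarial network model generates the second table data, techniques such as the Ada-Grad algorithm, Dropout, and regularization may be used. Ada-Grad is an improved stochastic gradient descent (SGD) algorithm used to solve the generative adversarial network model. Dropout is a regularization technique mainly used to prevent the model from overfitting.
It can be understood that the first table data is the real sample and the second table data is the simulated sample. The simulated sample is so close to the real sample that the discriminator cannot tell whether it is real or fake, so the decision label the discriminator assigns to the second table data is the "real" label.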
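Step S203: perform inverse standardized encoding on the second table data to obtain output table data, where the output table data has the same object description features as the input table data.
After obtaining the second table data, the data processing apparatus 102 performs inverse standardized encoding on it to obtain the output table data, which has the same object description features as the input table data. Because the second table data is very close to the first table data, the output table data obtained by inverse standardized encoding is very close to the input table data.
Correspondingly, inverse standardized encoding may include inverse one-hot encoding and inverse normalization encoding.
For inverse one-hot encoding, see the example in FIG. 5. The table on the left of FIG. 5 may be part or all of the second table data, and the table on the right may be part or all of the output table data. As FIG. 5 shows, among the fields of a categorical feature the data processing apparatus 102 takes the one with the highest probability as the final feature value of that categorical object description feature for the object. For example, if in the data of an object the probability of GoTone is 0.7, that of Easyown is 0.1, and that of M-Zone is 0.2, the data processing apparatus 102 takes GoTone as the final brand of the object.
Inverse normalization encoding is the reverse of the process shown in FIG. 4.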
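Both inverse encodings can be sketched in a few lines. The sketch assumes min-max encoding was used in the forward direction and that the original minimum and maximum were kept; the function names are illustrative and not taken from the source.

```python
def inverse_one_hot(categories, scores):
    """Pick the category whose field has the highest generated score."""
    best = max(range(len(categories)), key=lambda i: scores[i])
    return categories[best]


def inverse_min_max(encoded, vmin, vmax):
    """Undo min-max encoding onto [0, 1], given the original value range."""
    return vmin + encoded * (vmax - vmin)


categories = ["GoTone", "Easyown", "M-Zone"]
generated_row = [0.7, 0.1, 0.2]          # scores from the example in the text
brand = inverse_one_hot(categories, generated_row)
age = inverse_min_max(0.25, vmin=20, vmax=100)
print(brand, age)
```

For the scores (0.7, 0.1, 0.2) the decoded brand is GoTone, as in the example above, and an encoded value of 0.25 over the range 20 to 100 decodes back to 40.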
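It can be understood that, if empty or abnormal feature values are not considered, the output table data has the same number of objects as the input table data. If empty or abnormal feature values are considered, the number of objects in the output table data is less than or equal to that in the input table data.
It can be understood that, although the output table data is very close to the input table data, it is not real data. The output table data and the generative adversarial network model can therefore be taken away from the data site without any risk of leaking the input table data. Research and development personnel can analyze the input table data indirectly by analyzing the output table data, so the data can be analyzed even away from the data site, which reduces the time spent on analysis.
Because of the characteristics and particularities of table data, a generative adversarial network model cannot be applied to it directly. The embodiments of this application therefore perform standardized encoding on the input table data so that the resulting first table data can be used with a generative adversarial network model to generate second table data, and then perform inverse standardized encoding on the second table data to obtain output table data very close to the input table data. This simulates the input table data: the output table data can be analyzed directly, which indirectly achieves analysis of the input table data, so that the input table data can be analyzed quickly even away from the data site.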
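FIG. 6 is a schematic flowchart of another data processing method according to an embodiment of this application. The method may include, but is not limited to, the following steps:
Step S601: perform standardized encoding on input table data to obtain first table data, where the object description features of the first table data are numerical object description features.
Step S602: generate second table data from the first table data by using a generative adversarial network model, where the similarity between the second table data and the first table data reaches a first threshold.
Step S603: perform inverse standardized encoding on the second table data to obtain output table data, where the output table data has the same object description features as the input table data.
For the specific implementation of steps S601 to S603, refer to the description of steps S201 to S203 in the embodiment shown in FIG. 2. Details are not repeated here.
Step S604: check the similarity between the input table data and the output table data.
The data processing apparatus 102 may obtain a similarity check condition and check the similarity between the input table data and the output table data according to it, to determine whether the output table data satisfies the similarity check condition.
Step S605: if the similarity between the input table data and the output table data reaches a second threshold, output the generative adversarial network model and the output table data.
Step S606: if the similarity between the input table data and the output table data does not reach the second threshold, adjust the initialization parameters of the generative adversarial network model so that the similarity between the adjusted output table data and the input table data reaches the second threshold.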
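Steps S604 to S606 amount to a check-and-adjust loop. The sketch below shows only that control flow: `train_and_generate` and `similarity` are hypothetical callables standing in for GAN training plus inverse encoding and for the similarity check, and the toy stand-ins exist solely so the example runs.

```python
def tune_until_similar(candidate_params, train_and_generate, similarity,
                       second_threshold):
    """Try candidate initialization parameters until the check passes."""
    for params in candidate_params:
        model, output_table = train_and_generate(params)
        score = similarity(output_table)            # step S604
        if score >= second_threshold:               # step S605: output both
            return model, output_table
        # step S606: threshold not reached, continue with adjusted parameters
    return None


# Toy stand-ins (assumptions, not part of the source).
def toy_train_and_generate(params):
    model = ("model", params["hidden_units"])
    return model, {"score": params["hidden_units"] / 100}


def toy_similarity(output_table):
    return output_table["score"]


candidates = [{"hidden_units": 32}, {"hidden_units": 64}, {"hidden_units": 96}]
result = tune_until_similar(candidates, toy_train_and_generate,
                            toy_similarity, second_threshold=0.6)
print(result)
```

In this toy run the first candidate scores 0.32 and is rejected, and the second scores 0.64 and is returned together with its output table.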
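At present, a best classifier is usually used to check the correctness of simulated samples. However, different classifiers suit different scenarios, and selecting the best classifier for each scenario takes a long time. In view of this, the embodiments of this application check the correctness of the output table data directly from statistical indicators and information-measure indicators of the input table data and the output table data, without involving a classifier.
The statistical indicator may be the positive-to-negative object data ratio, and the information-measure indicator may be the feature distribution or the feature correlation.
In a possible implementation, the data processing apparatus 102 may set the similarity check condition according to the statistical indicator and the information-measure indicators, and the condition may be preset in the data processing apparatus 102. The similarity check condition may include at least one of a positive-to-negative object data ratio check condition, a feature distribution check condition, and a feature-label correlation check condition.
The positive-to-negative object data ratio check condition may be that the difference between the positive-to-negative object data ratio of the input table data and that of the output table data is within a first range. The specific values of the first range are not limited in the embodiments of this application and depend on the specific situation.
The data processing apparatus 102 counts the positive-to-negative object data ratio of the input table data and that of the output table data. Taking telecom-domain data as an example, if the data of an object indicates that object identifier A is an online user, that data may be treated as positive object data; if the data of an object indicates that object identifier B is an offline user, that data may be treated as negative object data.
If the difference between the positive-to-negative object data ratio of the input table data and that of the output table data is within the first range, the data processing apparatus 102 may determine that the ratio of the input table data is consistent with that of the output table data, and that the ratio of the output table data satisfies the positive-to-negative object data ratio check condition. For example, the ratio of the input table data is 4:1 and that of the output table data is 16:5, and the difference between the two is within the first range.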
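The ratio check can be sketched as below, using the 4:1 and 16:5 figures from the example. The label encoding (1 for positive, 0 for negative) and the width of the first range, `max_diff=1.0`, are assumptions; the source leaves the first range unspecified.

```python
def pos_neg_ratio(labels):
    """Ratio of positive (1) to negative (0) objects."""
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y == 0)
    if neg == 0:
        raise ValueError("no negative objects: ratio is undefined")
    return pos / neg


def ratio_check(input_labels, output_labels, max_diff):
    """True if the two ratios differ by at most max_diff (the 'first range')."""
    diff = abs(pos_neg_ratio(input_labels) - pos_neg_ratio(output_labels))
    return diff <= max_diff


input_labels = [1] * 4 + [0] * 1        # ratio 4:1
output_1 = [1] * 16 + [0] * 5           # ratio 16:5
output_2 = [1] * 16 + [0] * 7           # ratio 16:7
passes_1 = ratio_check(input_labels, output_1, max_diff=1.0)
passes_2 = ratio_check(input_labels, output_2, max_diff=1.0)
print(passes_1, passes_2)
```

With this hypothetical range, 16:5 (3.2) is within 1.0 of 4:1 (4.0) and passes, while 16:7 (about 2.29) does not.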
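If the positive-to-negative object data ratio of the output table data does not satisfy the check condition, the data processing apparatus 102 adjusts the initialization parameters of the generative adversarial network model so that the ratio of the adjusted output table data satisfies the condition. That is, the adjusted model generates adjusted second table data, inverse standardized encoding of the adjusted second table data yields adjusted output table data, and the ratio of the adjusted output table data satisfies the positive-to-negative object data ratio check condition.
The initialization parameters of the generative adversarial network model may include the types of encoder and decoder, the number of neurons in each layer of the generator network and the discriminator network, the depth of the generator network and the discriminator network, the learning rate of gradient descent, and so on. Besides adjusting the initialization parameters, batch normalization, residual networks, and the like may be added to the generative adversarial network model so that the ratio of the adjusted output table data satisfies the check condition. Batch normalization is an adaptive reparameterization method that can speed up training convergence.
The feature distribution check condition may be that the feature distribution of object description feature i in the output table data follows that of object description feature i in the input table data, where object description feature i is any one of the one or more object description features of the input table data.
The data processing apparatus 102 calculates the relative entropy of object description feature i in the output table data relative to object description feature i in the input table data. If the relative entropy is within a second range, it determines that the feature distribution of object description feature i in the output table data follows that in the input table data and satisfies the feature distribution check condition. The specific values of the second range are not limited in the embodiments of this application and depend on the specific situation.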
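A relative entropy (Kullback-Leibler divergence) check over the value proportions of one feature can be sketched as follows. The natural logarithm, the example proportions, and the second-range bound of 0.2 are assumptions made for illustration; the source does not reproduce its own formula or example vectors in this text.

```python
import math


def relative_entropy(p_out, p_in):
    """D(P_out || P_in) = sum_k p_out[k] * ln(p_out[k] / p_in[k])."""
    total = 0.0
    for po, pi in zip(p_out, p_in):
        if po == 0:
            continue                      # 0 * ln(0 / q) is taken as 0
        if pi == 0:
            return math.inf               # output has mass where input has none
        total += po * math.log(po / pi)
    return total


p_input = [0.25, 0.75]     # hypothetical proportions in the input table data
p_output = [0.5, 0.5]      # hypothetical proportions in the output table data
kl = relative_entropy(p_output, p_input)
within_second_range = kl <= 0.2     # 0.2 is an assumed bound
print(round(kl, 4), within_second_range)
```

A relative entropy of 0 means the two distributions coincide, and larger values indicate that the generated feature distribution drifts further from the real one.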
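If the feature distribution of object description feature i in the output table data does not satisfy the feature distribution check condition, the initialization parameters of the generative adversarial network model are adjusted so that the feature distribution of object description feature i in the adjusted output table data satisfies the condition.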
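The feature-label correlation check condition may be that object description feature j in the output table data is strongly correlated with the object label, where object description feature j is any one of the one or more object description features of the input table data and the object label indicates the state of the object data. Taking telecom-domain data as an example, the object label may indicate one of two states, online or off-network.
The data processing apparatus 102 calculates first mutual information between object description feature j and the object label in the input table data, and second mutual information between object description feature j and the object label in the output table data. If the difference between the first mutual information and the second mutual information is within a third range, it determines that the correlation between object description feature j and the object label in the output table data satisfies the feature-label correlation check condition.
The data processing apparatus 102 may calculate the mutual information according to the following formula: [formula not reproduced in this text]
If the correlation between object description feature j and the object label in the output table data does not satisfy the feature-label correlation check condition, the initialization parameters of the generative adversarial network model are adjusted so that the correlation in the adjusted output table data satisfies the condition.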
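In a possible implementation, the data processing apparatus 102 may be configured with multiple generative adversarial network models, for example two: generative adversarial network model 1 yields output table data 1, and generative adversarial network model 2 yields output table data 2. The data processing apparatus 102 may verify output table data 1 and output table data 2 and select the one closest to the input table data.
For the positive-to-negative object data ratio check, assume that the ratio of the input table data is 4:1, that of output table data 1 is 16:5, and that of output table data 2 is 16:7. The difference between 4:1 and 16:5 is smaller than that between 4:1 and 16:7, so the ratio of output table data 1 is closer to that of the input table data, and output table data 1 may be selected for analysis.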
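For the feature distribution check, assume that object description feature i in the input table data takes 5 different values (0, 1, 2, 3, 4), each with a corresponding object data proportion, and that object description feature i in output table data 1 and in output table data 2 has its own corresponding proportions [proportion values not reproduced in this text]. According to the relative entropy formula above, the relative entropy of the output table data 1 distribution relative to the input table data distribution is 0.139, and that of the output table data 2 distribution relative to the input table data distribution is 0.246. The relative entropy for output table data 1 is smaller, so the feature distribution of object description feature i in output table data 1 follows that in the input table data more closely, and output table data 1 may be selected for analysis.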
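For the feature-label correlation check, assume that object description feature j in the input table data takes two different values and that the object label also takes two different values. Taking telecom-domain data as an example, the object label may be online or off-network. Suppose the numbers of occurrences of the (feature value, label value) combinations (0,0), (0,1), (1,0), (1,1) in the input table data are (100, 200, 50, 100), and the numbers of occurrences of the same four combinations in output table data 1 and output table data 2 are (90, 180, 60, 120) and (80, 170, 70, 130) respectively. The mutual information values for the input table data and the two sets of output table data are -2749.16, -2749.16, and -2748.94 respectively. The correlation between (90, 180, 60, 120) and the label is therefore closer to that between the real objects (100, 200, 50, 100) and the label, that is, |-2749.16 - (-2749.16)| < |-2749.16 - (-2748.94)|, so output table data 1 may be selected for analysis.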
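The formula behind these three values is not reproduced in this text. As an observation rather than a statement of the source's method, the figures are matched by a count-based expression, the sum over the four combinations of n(x,y) * ln(n(x,y) / (n(x) * n(y))), where n(x) and n(y) are the marginal counts. This differs from the textbook mutual information, which uses probabilities and is never negative. The sketch below evaluates that expression; the function name is hypothetical.

```python
import math


def count_based_score(n00, n01, n10, n11):
    """Sum over cells of n(x,y) * ln(n(x,y) / (n(x) * n(y))) using raw counts."""
    counts = {(0, 0): n00, (0, 1): n01, (1, 0): n10, (1, 1): n11}
    row = {x: counts[(x, 0)] + counts[(x, 1)] for x in (0, 1)}   # feature marginals
    col = {y: counts[(0, y)] + counts[(1, y)] for y in (0, 1)}   # label marginals
    return sum(
        n * math.log(n / (row[x] * col[y]))
        for (x, y), n in counts.items()
        if n > 0
    )


score_input = count_based_score(100, 200, 50, 100)
score_out_1 = count_based_score(90, 180, 60, 120)
score_out_2 = count_based_score(80, 170, 70, 130)
print(round(score_input, 2), round(score_out_1, 2), round(score_out_2, 2))
```

In the input table data and in output table data 1 the feature and the label are exactly independent, so both scores equal -450 * ln 450, while output table data 2 deviates slightly. That is why output table data 1 is judged closer to the input table data.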
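When the output table data satisfies the similarity check condition, the similarity between the input table data and the output table data can be considered to reach the second threshold. The specific value of the second threshold is not limited in the embodiments of this application. The second threshold and the first threshold may be different values or the same value.
When the output table data satisfies the similarity check condition, the data processing apparatus 102 may output the output table data and the generative adversarial network model, that is, both can be taken away from the data site. Taking the output table data away from the data site allows it to be analyzed off site, which indirectly achieves analysis of the input table data. Taking the generative adversarial network model away from the data site allows the model to be studied.
In a possible implementation, when the similarity between the input table data and the output table data reaches the second threshold, the data processing apparatus 102 may check a first generative adversarial network model against the output table data to determine whether first output table data satisfies the check condition, that is, check the similarity between the first output table data and the output table data. The first output table data is output table data obtained by using the first generative adversarial network model. For the specific check method, refer to the check of the output table data. If the first output table data does not satisfy the similarity check condition, the initialization parameters of the first generative adversarial network model are adjusted according to the output table data.
It can be understood that the first output table data may be output table data that the data processing apparatus 102 previously obtained from the input table data by using the first generative adversarial network model. The data processing apparatus 102 adjusts the initialization parameters of that earlier model according to the output table data currently obtained.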
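The method provided in the embodiments of this application has been described in detail above. The apparatus provided in the embodiments of this application is described below.
FIG. 7 is a schematic diagram of a logical structure of a data processing apparatus according to an embodiment of this application. The data processing apparatus 70 may include an encoding unit 701 and a generation unit 702.
The encoding unit 701 is configured to perform standardized encoding on input table data to obtain first table data, where the object description features of the first table data are numerical object description features;
the generation unit 702 is configured to generate second table data from the first table data by using a generative adversarial network model, where the similarity between the second table data and the first table data reaches a first threshold;
the encoding unit 701 is further configured to perform inverse standardized encoding on the second table data to obtain output table data, where the output table data has the same object description features as the input table data.
It should be noted that the encoding unit 701 is configured to perform steps S201 and S203 in the embodiment shown in FIG. 2, and the generation unit 702 is configured to perform step S202 in that embodiment. For details, refer to the description of the embodiment shown in FIG. 2, which is not repeated here.
The input table data has one or more object description features. The object description features have semantics.
The input table data has at least one of categorical object description features and numerical object description features. The feature values of a categorical object description feature are non-numeric, and the feature values of a numerical object description feature are numeric.
In a possible implementation, the input table data has categorical object description features. When performing standardized encoding on the input table data, the encoding unit 701 is specifically configured to obtain the feature values of the categorical object description features from the input table data and perform one-hot encoding on those feature values.
When performing one-hot encoding on the feature values of the categorical object description features, the encoding unit 701 is specifically configured to encode those feature values from non-numeric values into numeric values.
In a possible implementation, the input table data has numerical object description features. When performing standardized encoding on the input table data, the encoding unit 701 is specifically configured to obtain the feature values of the numerical object description features from the input table data and perform normalization encoding on those feature values.
When performing normalization encoding on the feature values of the numerical object description features, the encoding unit 701 is specifically configured to map those feature values to the same numerical interval.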
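In a possible implementation, the data processing apparatus 70 further includes a checking unit 703, an output unit 704, and an adjustment unit 705.
The checking unit 703 is configured to check the similarity between the input table data and the output table data;
the output unit 704 is configured to output the generative adversarial network model and the output table data if the similarity between the input table data and the output table data reaches a second threshold;
the adjustment unit 705 is configured to adjust the initialization parameters of the generative adversarial network model if the similarity between the input table data and the output table data does not reach the second threshold, so that the similarity between the adjusted output table data and the input table data reaches the second threshold.
In a possible implementation, when checking the similarity between the input table data and the output table data, the checking unit 703 is specifically configured to obtain a similarity check condition and to check the similarity between the input table data and the output table data according to it, to determine whether the output table data satisfies the similarity check condition.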
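In a possible implementation, the similarity check condition includes a positive-to-negative object data ratio check condition. When checking the similarity between the input table data and the output table data according to the similarity check condition, the checking unit 703 is specifically configured to: count the positive-to-negative object data ratio of the input table data and that of the output table data; determine whether the difference between the two ratios is within a first range; and, if the difference is within the first range, determine that the two ratios are consistent and that the ratio of the output table data satisfies the positive-to-negative object data ratio check condition.
In a possible implementation, the similarity check condition includes a feature distribution check condition. When checking the similarity between the input table data and the output table data according to the similarity check condition, the checking unit 703 is specifically configured to: calculate the relative entropy of object description feature i in the output table data relative to object description feature i in the input table data, where object description feature i is any one of the one or more object description features; determine whether the relative entropy is within a second range; and, if the relative entropy is within the second range, determine that the feature distribution of object description feature i in the output table data follows that in the input table data and satisfies the feature distribution check condition.
In a possible implementation, the similarity check condition includes a feature-label correlation check condition. When checking the similarity between the input table data and the output table data according to the similarity check condition, the checking unit 703 is specifically configured to: calculate first mutual information between object description feature j and the object label in the input table data, and second mutual information between object description feature j and the object label in the output table data, where object description feature j is any one of the one or more object description features; determine whether the difference between the first mutual information and the second mutual information is within a third range; and, if the difference is within the third range, determine that the correlation between object description feature j and the object label in the output table data satisfies the feature-label correlation check condition.
In a possible implementation, the checking unit 703 is further configured to check the similarity between the output table data and first output table data when the similarity between the input table data and the output table data reaches the second threshold, where the first output table data is output table data obtained by using a first generative adversarial network model that has already been output;
the adjustment unit 705 is further configured to adjust the initialization parameters of the first generative adversarial network model if the similarity between the output table data and the first output table data does not reach the second threshold.
The data processing apparatus 70 can implement the functions of the data processing apparatus in the foregoing method embodiments. For the detailed processes performed by the units of the data processing apparatus 70, refer to the steps performed by the data processing apparatus in the foregoing method embodiments, which are not repeated here.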
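FIG. 8 is a simplified schematic diagram of a physical structure of a data processing apparatus according to an embodiment of this application. The data processing apparatus 80 includes a transceiver 801, a processor 802, and a memory 803. The transceiver 801, the processor 802, and the memory 803 may be connected to each other through a bus 804 or in other ways. The functions implemented by the encoding unit 701, the generation unit 702, the checking unit 703, the output unit 704, and the adjustment unit 705 shown in FIG. 7 may be implemented by the processor 802.
The transceiver 801 is configured to send data and/or signaling and to receive data and/or signaling. In the embodiments of this application, the transceiver 801 is configured to communicate with a server, for example to obtain input table data from the server.
The processor 802 may include one or more processors, for example one or more central processing units (CPUs). When the processor 802 is one CPU, the CPU may be a single-core CPU or a multi-core CPU. In the embodiments of this application, the processor 802 is configured to perform steps S201 to S203 in the embodiment shown in FIG. 2, and is further configured to perform steps S601 to S606 in the embodiment shown in FIG. 6.
The memory 803 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM). The memory 803 is used to store related instructions and data, including the program code and data of the data processing apparatus 80.
It can be understood that FIG. 8 shows only a simplified design of the data processing apparatus. In practice, the data processing apparatus may also include other necessary components, including but not limited to any number of transceivers, processors, controllers, memories, communication units, and so on, and all apparatuses that can implement this application fall within the protection scope of this application.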
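A person of ordinary skill in the art can understand that all or part of the processes in the methods of the foregoing embodiments may be completed by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the foregoing method embodiments. The storage medium includes various media that can store program code, such as ROM, random access memory (RAM), magnetic disks, or optical discs. Therefore, another embodiment of this application provides a computer-readable storage medium that stores instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
Yet another embodiment of this application provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described in the embodiments disclosed in this application can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.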
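In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of an embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one unit.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through the computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).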
Claims (30)
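- A data processing method, comprising: performing standardized encoding on input table data to obtain first table data, wherein the object description features of the first table data are numerical object description features; generating second table data from the first table data by using a generative adversarial network model, wherein the similarity between the second table data and the first table data reaches a first threshold; and performing inverse standardized encoding on the second table data to obtain output table data, wherein the output table data has the same object description features as the input table data.
- The method according to claim 1, wherein the input table data has one or more object description features.
- The method according to claim 2, wherein the object description features have semantics.
- The method according to claim 1, wherein the input table data has at least one of categorical object description features and the numerical object description features, the feature values of the categorical object description features are non-numeric, and the feature values of the numerical object description features are numeric.
- The method according to claim 4, wherein the input table data has the categorical object description features; and the performing standardized encoding on input table data comprises: obtaining the feature values of the categorical object description features from the input table data; and performing one-hot encoding on the feature values of the categorical object description features.
- The method according to claim 5, wherein the performing one-hot encoding on the feature values of the categorical object description features comprises: encoding the feature values of the categorical object description features from non-numeric values into numeric values.
- The method according to claim 4, wherein the input table data has the numerical object description features; and the performing standardized encoding on input table data comprises: obtaining the feature values of the numerical object description features from the input table data; and performing normalization encoding on the feature values of the numerical object description features.
- The method according to claim 7, wherein the performing normalization encoding on the feature values of the numerical object description features comprises: mapping the feature values of the numerical object description features to the same numerical interval.
- The method according to any one of claims 1 to 8, wherein the method further comprises: checking the similarity between the input table data and the output table data; if the similarity between the input table data and the output table data reaches a second threshold, outputting the generative adversarial network model and the output table data; and if the similarity between the input table data and the output table data does not reach the second threshold, adjusting the initialization parameters of the generative adversarial network model so that the similarity between the adjusted output table data and the input table data reaches the second threshold.
- The method according to claim 9, wherein the checking the similarity between the input table data and the output table data comprises: obtaining a similarity check condition; and checking the similarity between the input table data and the output table data according to the similarity check condition, to determine whether the output table data satisfies the similarity check condition.
- The method according to claim 10, wherein the similarity check condition comprises a positive-to-negative object data ratio check condition; and the checking the similarity between the input table data and the output table data according to the similarity check condition comprises: counting the positive-to-negative object data ratio of the input table data and the positive-to-negative object data ratio of the output table data; determining whether the difference between the positive-to-negative object data ratio of the input table data and that of the output table data is within a first range; and if the difference is within the first range, determining that the positive-to-negative object data ratio of the input table data is consistent with that of the output table data, and that the positive-to-negative object data ratio of the output table data satisfies the positive-to-negative object data ratio check condition.
- The method according to claim 10, wherein the similarity check condition comprises a feature distribution check condition; and the checking the similarity between the input table data and the output table data according to the similarity check condition comprises: calculating the relative entropy of object description feature i in the output table data relative to object description feature i in the input table data, wherein the object description feature i is any one of the one or more object description features; determining whether the relative entropy is within a second range; and if the relative entropy is within the second range, determining that the feature distribution of the object description feature i in the output table data follows the feature distribution of the object description feature i in the input table data, and that the feature distribution of the object description feature i in the output table data satisfies the feature distribution check condition.
- The method according to claim 10, wherein the similarity check condition comprises a feature-label correlation check condition; and the checking the similarity between the input table data and the output table data according to the similarity check condition comprises: calculating first mutual information between object description feature j and an object label in the input table data, and second mutual information between object description feature j and the object label in the output table data, wherein the object description feature j is any one of the one or more object description features; determining whether the difference between the first mutual information and the second mutual information is within a third range; and if the difference is within the third range, determining that the correlation between the object description feature j and the object label in the output table data satisfies the feature-label correlation check condition.
- The method according to claim 9, wherein the method further comprises: when the similarity between the input table data and the output table data reaches the second threshold, checking the similarity between the output table data and the first output table data, wherein the first output table data is output table data obtained by using a first generative adversarial network model that has been output; and if the similarity between the output table data and the first output table data does not reach the second threshold, adjusting the initialization parameters of the first generative adversarial network model.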
- 一种数据处理装置,其特征在于,包括:编码单元,用于对输入表格数据进行标准编码,得到第一表格数据,所述第一表格数据的对象描述特征为数值型描述特征;生成单元,基于所述第一表格数据,采用生成式对抗网络模型生成第二表格数据,所述第二表格数据与所述第一表格数据的相似度达到第一阈值;所述编码单元,还用于对所述第二表格数据进行逆标准化编码,得到输出表格数据,所述输出表格数据与所述输入表格数据具有相同的对象描述特征。
- 根据权利要求15所述的数据处理装置,其特征在于,所述输入表格数据具有一个或多个对象描述特征。
- 根据权利要求16所述的数据处理装置,其特征在于,所述对象描述特征具有语义。
- 根据权利要求15所述的数据处理装置,其特征在于,所述输入表格数据具有类别型对象描述特征和所述数值型对象描述特征中的至少一种,所述类别型对象描述特征对应的特征值为非数值,所述数值型对象描述特征对应的特征值为数值。
- 根据权利要求18所述的数据处理装置,其特征在于,所述输入表格数据具有所述类别型对象描述特征;所述编码单元用于对输入表格数据进行标准编码时,具体用于从所述输入表格数据中获取所述类别型对象描述特征对应的特征值;对所述类别型对象描述特征对应的特征值进行独热编码。
- 根据权利要求19所述的数据处理装置,其特征在于,所述编码单元用于对所述类别型对象描述特征对应的特征值进行独热编码时,具体用于将所述类别型对象描述特征对应的特征值由非数值编码为数值。
- 根据权利要求18所述的数据处理装置,其特征在于,所述输入表格数据具有所述数值型对象描述特征;所述编码单元用于对输入表格数据进行标准编码时,具体用于从所述输入表格数据中获取所述数值型对象描述特征对应的特征值;对所述数值型对象描述特征对应的特征值进行归一化编码。
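- The data processing apparatus according to claim 21, wherein when performing normalization encoding on the feature value corresponding to the numerical object description feature, the encoding unit is specifically configured to map the feature values corresponding to the numerical object description feature into a same numerical interval.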
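- The data processing apparatus according to any one of claims 15 to 22, wherein the data processing apparatus further comprises: a testing unit, configured to test the similarity between the input tabular data and the output tabular data; an output unit, configured to output the generative adversarial network model and the output tabular data if the similarity between the input tabular data and the output tabular data reaches a second threshold; and an adjusting unit, configured to adjust initialization parameters of the generative adversarial network model if the similarity between the input tabular data and the output tabular data does not reach the second threshold, so that the similarity between the adjusted output tabular data and the input tabular data reaches the second threshold.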
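- The data processing apparatus according to claim 23, wherein when testing the similarity between the input tabular data and the output tabular data, the testing unit is specifically configured to: obtain a similarity test condition; and test, according to the similarity test condition, the similarity between the input tabular data and the output tabular data, to determine whether the output tabular data satisfies the similarity test condition.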
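- The data processing apparatus according to claim 24, wherein the similarity test condition comprises a positive-to-negative object data ratio test condition; and when testing, according to the similarity test condition, the similarity between the input tabular data and the output tabular data, the testing unit is specifically configured to: compute the ratio of positive to negative object data in the input tabular data, and compute the ratio of positive to negative object data in the output tabular data; determine whether the difference between the ratio of positive to negative object data in the input tabular data and the ratio of positive to negative object data in the output tabular data is within a first range; and if the difference is within the first range, determine that the ratio of positive to negative object data in the input tabular data is consistent with the ratio of positive to negative object data in the output tabular data, and that the ratio of positive to negative object data in the output tabular data satisfies the positive-to-negative object data ratio test condition.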
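- The data processing apparatus according to claim 24, wherein the similarity test condition comprises a feature distribution test condition; and when testing, according to the similarity test condition, the similarity between the input tabular data and the output tabular data, the testing unit is specifically configured to: calculate the relative entropy of object description feature i in the output tabular data with respect to object description feature i in the input tabular data, wherein object description feature i is any one of the one or more object description features; determine whether the relative entropy is within a second range; and if the relative entropy is within the second range, determine that the feature distribution of object description feature i in the output tabular data follows the feature distribution of object description feature i in the input tabular data, and that the feature distribution of object description feature i in the output tabular data satisfies the feature distribution test condition.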
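- The data processing apparatus according to claim 24, wherein the similarity test condition comprises a feature-label correlation test condition; and when testing, according to the similarity test condition, the similarity between the input tabular data and the output tabular data, the testing unit is specifically configured to: calculate first mutual information between object description feature j and an object label in the input tabular data, and calculate second mutual information between object description feature j and the object label in the output tabular data, wherein object description feature j is any one of the one or more object description features; determine whether the difference between the first mutual information and the second mutual information is within a third range; and if the difference between the first mutual information and the second mutual information is within the third range, determine that the correlation between object description feature j and the object label in the output tabular data satisfies the feature-label correlation test condition.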
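- The data processing apparatus according to claim 27, wherein the testing unit is further configured to, when the similarity between the input tabular data and the output tabular data reaches the second threshold, test the similarity between the output tabular data and the first output tabular data, wherein the first output tabular data is output tabular data obtained by using an outputted first generative adversarial network model; and the adjusting unit is further configured to adjust initialization parameters of the first generative adversarial network model if the similarity between the output tabular data and the first output tabular data does not reach the second threshold.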
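- A data processing apparatus, comprising a processor, a transceiver, and a memory, wherein the transceiver is configured to receive and send information, the memory stores computer-executable instructions, the processor is connected to the memory and the transceiver through a bus, and the processor executes the computer-executable instructions stored in the memory, so that the data processing apparatus performs the method according to any one of claims 1 to 14.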
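- A computer-readable storage medium, comprising instructions that, when executed on a computer, cause the computer to perform the method according to any one of claims 1 to 14.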
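The three similarity test conditions recited in the claims above (positive-to-negative ratio, feature distribution via relative entropy, and feature-label correlation via mutual information) can be illustrated with a minimal Python sketch. It is not the applicant's implementation, and several points are assumptions made only for illustration: each row is reduced to a hypothetical `(feature_value, label)` pair with a single discrete feature, the label value `1` is treated as "positive", the ratio is measured as the fraction of positive rows, relative entropy is taken in the direction KL(output || input) with a small smoothing constant, and the function names and the default ranges `ratio_range`, `entropy_range`, and `mi_range` are placeholders for the first, second, and third ranges.

```python
import math
from collections import Counter


def positive_fraction(labels):
    """Share of rows whose label equals 1 (the assumed 'positive' class)."""
    return sum(1 for y in labels if y == 1) / len(labels)


def _distribution(values, support, eps=1e-9):
    """Empirical distribution over `support`, smoothed so no probability is zero."""
    counts = Counter(values)
    total = len(values) + eps * len(support)
    return {v: (counts[v] + eps) / total for v in support}


def relative_entropy(output_values, input_values):
    """KL(output || input) in nats for one discrete feature."""
    support = set(output_values) | set(input_values)
    p = _distribution(output_values, support)
    q = _distribution(input_values, support)
    return sum(p[v] * math.log(p[v] / q[v]) for v in support)


def mutual_information(feature_values, labels):
    """I(feature; label) in nats, estimated from co-occurrence counts."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    fx = Counter(feature_values)
    fy = Counter(labels)
    return sum((c / n) * math.log(c * n / (fx[x] * fy[y]))
               for (x, y), c in joint.items())


def check_similarity(input_rows, output_rows,
                     ratio_range=0.05, entropy_range=0.1, mi_range=0.05):
    """Rows are (feature_value, label) pairs; returns one boolean per test condition."""
    in_x, in_y = zip(*input_rows)
    out_x, out_y = zip(*output_rows)
    ratio_diff = abs(positive_fraction(in_y) - positive_fraction(out_y))
    mi_diff = abs(mutual_information(in_x, in_y) - mutual_information(out_x, out_y))
    return {
        "label_ratio": ratio_diff <= ratio_range,
        "feature_distribution": relative_entropy(out_x, in_x) <= entropy_range,
        "feature_label_correlation": mi_diff <= mi_range,
    }


if __name__ == "__main__":
    # Toy data: one categorical feature and a binary label per row.
    real = [("web", 1), ("web", 1), ("web", 0), ("app", 0), ("app", 0), ("app", 1)]
    generated = [("web", 1)] * 6  # a degenerate generator output
    print(check_similarity(real, list(real)))
    print(check_similarity(real, generated))
```

In a loop of the kind the claims describe, a result containing any `False` entry would correspond to the branch in which the initialization parameters of the generative adversarial network model are adjusted and the data is regenerated; how the individual conditions combine into the "second threshold" is not specified here and is left open in the sketch.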
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810630422.5A CN110619535B (zh) | 2018-06-19 | 2018-06-19 | Data processing method and apparatus |
CN201810630422.5 | 2018-06-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019242627A1 (zh) | 2019-12-26 |
Family
ID=68920539
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/091819 WO2019242627A1 (zh) | 2018-06-19 | 2019-06-19 | Data processing method and apparatus |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110619535B (zh) |
WO (1) | WO2019242627A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401448A (zh) * | 2020-03-16 | 2020-07-10 | 中科天玑数据科技股份有限公司 | Trading platform classification method and apparatus |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111507849A (zh) * | 2020-03-25 | 2020-08-07 | 上海商汤智能科技有限公司 | Underwriting method and related apparatus and device |
CN111507850A (zh) * | 2020-03-25 | 2020-08-07 | 上海商汤智能科技有限公司 | Underwriting method and related apparatus and device |
CN111625538B (zh) * | 2020-04-27 | 2024-06-28 | 平安银行股份有限公司 | Data processing method and apparatus based on virtual data table technology, and electronic device |
US11308077B2 (en) * | 2020-07-21 | 2022-04-19 | International Business Machines Corporation | Identifying source datasets that fit a transfer learning process for a target domain |
CN114818516B (zh) * | 2022-06-27 | 2022-09-20 | 中国石油大学(华东) | Intelligent prediction method for wellbore corrosion morphology profiles |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140122505A1 (en) * | 2012-10-30 | 2014-05-01 | Canon Kabushiki Kaisha | Information processing apparatus, control method for the same, and computer-readable recording medium |
CN104603779A (zh) * | 2012-08-31 | 2015-05-06 | 日本电气株式会社 | Text mining device, text mining method, and computer-readable recording medium |
CN107563417A (zh) * | 2017-08-18 | 2018-01-09 | 北京天元创新科技有限公司 | Method and system for building a deep learning artificial intelligence model |
US20180101770A1 (en) * | 2016-10-12 | 2018-04-12 | Ricoh Company, Ltd. | Method and system of generative model learning, and program product |
CN107943784A (zh) * | 2017-11-02 | 2018-04-20 | 南华大学 | Relation extraction method based on generative adversarial networks |
CN108021931A (zh) * | 2017-11-20 | 2018-05-11 | 阿里巴巴集团控股有限公司 | Data sample label processing method and apparatus |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018053340A1 (en) * | 2016-09-15 | 2018-03-22 | Twitter, Inc. | Super resolution using a generative adversarial network |
CN107590531A (zh) * | 2017-08-14 | 2018-01-16 | 华南理工大学 | WGAN method based on text generation |
- 2018-06-19: CN application CN201810630422.5A, patent CN110619535B (zh), status: Active
- 2019-06-19: WO application PCT/CN2019/091819, publication WO2019242627A1 (zh), status: Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104603779A (zh) * | 2012-08-31 | 2015-05-06 | 日本电气株式会社 | Text mining device, text mining method, and computer-readable recording medium |
US20140122505A1 (en) * | 2012-10-30 | 2014-05-01 | Canon Kabushiki Kaisha | Information processing apparatus, control method for the same, and computer-readable recording medium |
US20180101770A1 (en) * | 2016-10-12 | 2018-04-12 | Ricoh Company, Ltd. | Method and system of generative model learning, and program product |
CN107563417A (zh) * | 2017-08-18 | 2018-01-09 | 北京天元创新科技有限公司 | Method and system for building a deep learning artificial intelligence model |
CN107943784A (zh) * | 2017-11-02 | 2018-04-20 | 南华大学 | Relation extraction method based on generative adversarial networks |
CN108021931A (zh) * | 2017-11-20 | 2018-05-11 | 阿里巴巴集团控股有限公司 | Data sample label processing method and apparatus |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401448A (zh) * | 2020-03-16 | 2020-07-10 | 中科天玑数据科技股份有限公司 | Trading platform classification method and apparatus |
CN111401448B (zh) * | 2020-03-16 | 2024-05-24 | 中科天玑数据科技股份有限公司 | Trading platform classification method and apparatus |
Also Published As
Publication number | Publication date |
---|---|
CN110619535B (zh) | 2023-07-14 |
CN110619535A (zh) | 2019-12-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019242627A1 (zh) | 一种数据处理方法及其装置 | |
US11687384B2 (en) | Real-time synthetically generated video from still frames | |
CN110705301B (zh) | 实体关系抽取方法及装置、存储介质、电子设备 | |
CN112508044A (zh) | 人工智能ai模型的评估方法、系统及设备 | |
WO2021051917A1 (zh) | 人工智能ai模型的评估方法、系统及设备 | |
US11270375B1 (en) | Method and system for aggregating personal financial data to predict consumer financial health | |
WO2020082734A1 (zh) | 文本情感识别方法、装置、电子设备及计算机非易失性可读存储介质 | |
CN112231592B (zh) | 基于图的网络社团发现方法、装置、设备以及存储介质 | |
CN110348471B (zh) | 异常对象识别方法、装置、介质及电子设备 | |
CN113298121A (zh) | 基于多数据源建模的消息发送方法、装置和电子设备 | |
CN112131322A (zh) | 时间序列分类方法及装置 | |
JP2023536773A (ja) | テキスト品質評価モデルのトレーニング方法及びテキスト品質の決定方法、装置、電子機器、記憶媒体およびコンピュータプログラム | |
CN111209403B (zh) | 数据处理方法、装置、介质及电子设备 | |
CN116627781A (zh) | 目标模型验证方法以及装置 | |
CN111126860A (zh) | 任务分配方法、任务分配装置和电子设备 | |
CN114579876A (zh) | 虚假信息检测方法、装置、设备及介质 | |
CN113886547A (zh) | 基于人工智能的客户实时对话转接方法、装置和电子设备 | |
CN113761145A (zh) | 语言模型训练方法、语言处理方法和电子设备 | |
CN112632229A (zh) | 文本聚类方法及装置 | |
CN112115981A (zh) | 一种社交网络博主的embedding评估方法及系统 | |
CN111428767A (zh) | 数据处理方法及装置、处理器、电子设备及存储介质 | |
CN114548765B (zh) | 用于风险识别的方法和装置 | |
US20240184812A1 (en) | Distributed active learning in natural language processing for determining resource metrics | |
CN110941714A (zh) | 分类规则库构建方法、应用分类方法及装置 | |
CN118394945B (zh) | 一种基于人工智能的短信内容分析方法和系统 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19821779; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 19821779; Country of ref document: EP; Kind code of ref document: A1 |