CN111797078A - Data cleaning method, model training method, device, storage medium and equipment - Google Patents

Publication number
CN111797078A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910282171.0A
Other languages
Chinese (zh)
Inventor
陈仲铭
何明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The embodiments of the application disclose a data cleaning method, a model training method, a device, a storage medium and equipment. Data to be cleaned is first acquired, together with the cleaning requirement for that data. A target cleaning rule for cleaning the data is then determined according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model. Finally, the data to be cleaned is cleaned according to the determined target cleaning rule, so that the cleaning effect on the data meets the cleaning requirement. Once the cleaning rule classification model has been obtained through pre-training, it can be used to clean data automatically without extensive manual involvement, which reduces the labor cost of data cleaning and improves its efficiency.

Description

Data cleaning method, model training method, device, storage medium and equipment
Technical Field
The application relates to the technical field of data processing, in particular to a data cleaning method, a model training method, a device, a storage medium and equipment.
Background
At present, processing massive amounts of data has become a challenge for electronic devices. The first task in processing data is data cleaning, namely identifying and filtering out dirty data while keeping clean data. In the related art, however, data cleaning often relies on human domain knowledge and experience, which consumes a large amount of human resources and makes the labor cost of data cleaning high.
Disclosure of Invention
The embodiment of the application provides a data cleaning method, a model training method, a device, a storage medium and equipment, which can reduce the labor cost of data cleaning.
In a first aspect, an embodiment of the present application provides a data cleaning method, which is applied to an electronic device, and the data cleaning method includes:
acquiring data to be cleaned, which needs to be cleaned;
acquiring the cleaning requirement of the data to be cleaned;
determining a target cleaning rule for cleaning the data to be cleaned according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model;
performing data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect on the data to be cleaned meets the cleaning requirement;
the cleaning rule classification model is obtained through model training in which a cleaning rule feature representing a cleaning rule serves as the target output, and a joint feature representing both the sample data to be cleaned corresponding to that cleaning rule and the cleaning effect on that sample data serves as the training input.
In a second aspect, an embodiment of the present application provides a model training method, which is applied to an electronic device, and the model training method includes:
acquiring a plurality of cleaning rules and acquiring sample data to be cleaned corresponding to each cleaning rule;
acquiring a cleaning effect of each cleaning rule for cleaning data of the corresponding sample data to be cleaned;
acquiring the joint feature of each sample data to be cleaned and its corresponding cleaning effect, and acquiring the cleaning rule feature of each cleaning rule;
and performing model training by taking each joint feature as training input and taking the cleaning rule feature corresponding to each joint feature as target output to obtain a cleaning rule classification model.
In a third aspect, an embodiment of the present application provides a data cleaning apparatus, which is applied to an electronic device, and includes:
the data acquisition module is used for acquiring data to be cleaned, which needs to be cleaned;
the requirement acquisition module is used for acquiring the cleaning requirement of the data to be cleaned;
the rule determining module is used for determining a target cleaning rule for cleaning the data to be cleaned according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model;
the data cleaning module is used for cleaning the data to be cleaned according to the target cleaning rule, so that the cleaning effect of the data to be cleaned meets the cleaning requirement;
the cleaning rule classification model is obtained through model training in which a cleaning rule feature representing a cleaning rule serves as the target output, and a joint feature representing both the sample data to be cleaned corresponding to that cleaning rule and the cleaning effect on that sample data serves as the training input.
In a fourth aspect, an embodiment of the present application provides a model training apparatus applied to an electronic device, where the model training apparatus includes:
the first acquisition module is used for acquiring a plurality of cleaning rules and acquiring sample data to be cleaned corresponding to each cleaning rule;
the second acquisition module is used for acquiring the cleaning effect of each cleaning rule for cleaning the data of the corresponding sample data to be cleaned;
the third acquisition module is used for acquiring the joint feature of each sample data to be cleaned and its corresponding cleaning effect, and acquiring the cleaning rule feature of each cleaning rule;
and the model training module is used for performing model training by taking each joint feature as training input and taking the cleaning rule feature corresponding to each joint feature as target output to obtain a cleaning rule classification model.
In a fifth aspect, the present application provides a storage medium, on which a computer program is stored, which, when running on a computer, causes the computer to perform the steps in the data cleansing method as provided in the present application, or causes the computer to perform the steps in the model training method as provided in the present application.
In a sixth aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores a computer program, and the processor is configured to execute, by calling the computer program, the steps in the data cleaning method provided in the embodiment of the present application, or the steps in the model training method provided in the embodiment of the present application.
In the embodiment of the application, the electronic equipment can firstly acquire the data to be cleaned, which needs to be subjected to data cleaning, and acquire the cleaning requirement of the data to be cleaned, then the target cleaning rule for performing data cleaning on the data to be cleaned is determined according to the acquired data to be cleaned, the cleaning requirement and the pre-trained cleaning rule classification model, and finally the data to be cleaned is subjected to data cleaning according to the determined target cleaning rule, so that the cleaning effect of the data to be cleaned meets the cleaning requirement. Therefore, as long as the cleaning rule classification model is obtained through pre-training, the cleaning rule classification model can be subsequently utilized to automatically clean data without excessive manual participation, so that the labor cost of data cleaning is reduced, and the efficiency of data cleaning is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a schematic structural diagram of a panoramic sensing architecture provided in an embodiment of the present application.
Fig. 2 is a schematic flow chart of a data cleaning method according to an embodiment of the present disclosure.
Fig. 3 is another schematic flow chart of a data cleansing method according to an embodiment of the present application.
Fig. 4 is a schematic diagram of the electronic device obtaining the target cleaning rule according to the cleaning rule classification model in the embodiment of the application.
Fig. 5 is a schematic flow chart of a model training method according to an embodiment of the present disclosure.
Fig. 6 is another schematic flow chart of a model training method according to an embodiment of the present disclosure.
Fig. 7 is a schematic view of an application scenario for model training in the embodiment of the present application.
Fig. 8 is a schematic structural diagram of a data cleaning apparatus according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Fig. 11 is another schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Referring to the drawings, wherein like reference numbers refer to like elements, the principles of the present application are illustrated as being implemented in a suitable computing environment. The following description is based on illustrated embodiments of the application and should not be taken as limiting the application with respect to other embodiments that are not detailed herein.
With the miniaturization and intellectualization of sensors, electronic devices such as mobile phones and tablet computers integrate more and more sensors, such as light sensors, distance sensors, position sensors, acceleration sensors, gravity sensors, and the like. The electronic device can acquire more data with less power consumption through the configured sensor. Meanwhile, the electronic device can acquire data related to the state of the electronic device and data related to the state of the user during operation. In general, the electronic device can acquire data related to an external environment (such as temperature, light, place, sound, weather, and the like), data related to a user state (such as posture, speed, usage habits, personal basic information, and the like), and data related to a state of the electronic device (such as power consumption, resource usage, network conditions, and the like). In the embodiment of the application, the data which can be acquired by the electronic device is recorded as panoramic data.
In the embodiment of the application, a panoramic sensing architecture is provided in order to process the data acquired by the electronic device. Referring to fig. 1, fig. 1 is a schematic structural diagram of a panoramic sensing architecture provided in an embodiment of the present application. The panoramic sensing architecture is applied to an electronic device and includes, from bottom to top, an information perception layer, a data processing layer, a feature extraction layer, a scene modeling layer, and an intelligent service layer.
As the bottom layer of the panoramic sensing architecture, the information perception layer is used for acquiring original data, namely panoramic data, capable of describing various types of scenes of a user. The information perception layer is composed of a plurality of sensors for data acquisition, including, but not limited to, a distance sensor for detecting a distance between the electronic device and an external object, a magnetic field sensor for detecting magnetic field information of an environment in which the electronic device is located, a light sensor for detecting light information of an environment in which the electronic device is located, an acceleration sensor for detecting acceleration data of the electronic device, a fingerprint sensor for collecting fingerprint information of a user, a hall sensor for sensing magnetic field information, a position sensor for detecting a geographical position in which the electronic device is currently located, a gyroscope for detecting an angular velocity of the electronic device in various directions, an inertial sensor for detecting motion data of the electronic device, a posture sensor for sensing posture information of the electronic device, a barometer for detecting an air pressure of an environment in which the electronic device is located, a heart rate sensor for detecting heart rate information of a user, and the like.
As the second-lowest layer of the panoramic sensing architecture, the data processing layer is used for processing the original data acquired by the information perception layer and eliminating problems such as noise and inconsistency in the original data. The data processing layer can perform data cleaning, data integration, data transformation, data reduction and other processing on the data acquired by the information perception layer.
As the intermediate layer of the panoramic sensing architecture, the feature extraction layer is used for extracting features from the data processed by the data processing layer. The feature extraction layer may extract features, or process the extracted features, by methods such as a filter method, a wrapper method, or an ensemble method.
The filter method filters the extracted features to remove redundant feature data. The wrapper method screens the extracted features. The ensemble method integrates a plurality of feature extraction methods to construct a more efficient and more accurate feature extraction method.
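As a rough illustration of the filtering idea described above, the following sketch drops features that carry no information (near-zero variance) or that duplicate a feature already kept (high absolute Pearson correlation). The function names, the thresholds, and the choice of variance and correlation as filter criteria are assumptions made for illustration; the application does not specify how the filtering is implemented.

```python
from math import sqrt
from statistics import pvariance


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def filter_features(columns, min_variance=1e-9, max_abs_corr=0.95):
    """Keep features that vary and are not near-duplicates of a kept feature.

    `columns` maps a hypothetical feature name to its list of values.
    """
    kept = {}
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # constant feature: no information
        if any(abs(pearson(values, other)) >= max_abs_corr for other in kept.values()):
            continue  # redundant with a feature that was already kept
        kept[name] = values
    return kept


columns = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],    # perfectly correlated with "a"
    "c": [5, 5, 5, 5],    # constant
    "d": [1, -1, 1, -1],
}
selected = filter_features(columns)
```

With the sample columns above, only "a" and "d" survive: "b" is redundant with "a" and "c" is constant.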
As the second-highest layer of the panoramic sensing architecture, the scene modeling layer is used for constructing a model according to the features extracted by the feature extraction layer, and the obtained model can be used for representing the state of the electronic device, the user state, the environment state and the like. For example, the scene modeling layer may construct a key-value model, a pattern identification model, a graph model, an entity-relationship model, an object-oriented model, and the like according to the features extracted by the feature extraction layer.
As the highest layer of the panoramic sensing architecture, the intelligent service layer is used for providing intelligent services according to the model constructed by the scene modeling layer. For example, the intelligent service layer may provide basic application services for the user, may perform system intelligent optimization services for the electronic device, and may also provide personalized intelligent services for the user.
In addition, the panoramic sensing architecture further comprises an algorithm library, which includes, but is not limited to, algorithms such as Markov algorithms, latent Dirichlet allocation, Bayesian classification, support vector machines, K-means clustering, K-nearest neighbors, conditional random fields, residual networks, long short-term memory networks, convolutional neural networks, recurrent neural networks and the like.
The embodiment of the present application first provides a data cleaning method, where an execution subject of the data cleaning method may be the data cleaning apparatus provided in the embodiment of the present application, or an electronic device integrated with the data cleaning apparatus, where the data cleaning apparatus may be implemented in a hardware or software manner. The electronic device may be a device with processing capability configured with a processor, such as a smart phone, a tablet computer, a palm computer, a notebook computer, or a desktop computer.
Based on the data cleaning method provided by the embodiment of the application, the information perception layer provides the collected panoramic data to the data processing layer; the data processing layer takes the panoramic data from the information perception layer as the data to be cleaned, cleans it, and provides the cleaned data to the feature extraction layer; the feature extraction layer performs feature extraction on the data from the data processing layer to obtain features capable of representing the data, and provides the extracted features to the scene modeling layer; the scene modeling layer performs modeling based on the features from the feature extraction layer, and the resulting model is used for representing the state of the electronic device, the user state, the environment state and the like; finally, the intelligent service layer provides corresponding intelligent services, such as basic application services, system optimization services and personalized services, according to the model constructed by the scene modeling layer.
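The bottom-to-top data flow described above can be pictured as a chain of functions, each consuming the output of the layer below it. The sketch below is only a toy illustration of that chaining: every layer body, the threshold, and the returned service strings are hypothetical placeholders, not behavior taken from the application.

```python
from statistics import mean


def data_processing_layer(raw):
    # Placeholder cleaning step: drop missing readings (assumes some remain).
    return [v for v in raw if v is not None]


def feature_extraction_layer(cleaned):
    # Placeholder features summarizing the cleaned readings.
    return {"mean": mean(cleaned), "max": max(cleaned)}


def scene_modeling_layer(features):
    # Placeholder "model": a single threshold on the mean reading.
    return "moving" if features["mean"] > 1.0 else "still"


def intelligent_service_layer(scene):
    # Placeholder mapping from the modeled scene to a service decision.
    services = {"moving": "enable step counting", "still": "keep default settings"}
    return services[scene]


LAYERS = [
    data_processing_layer,
    feature_extraction_layer,
    scene_modeling_layer,
    intelligent_service_layer,
]


def run_layers(raw, layers=LAYERS):
    """Pass data from the information perception layer upward through each layer."""
    value = raw
    for layer in layers:
        value = layer(value)
    return value


service = run_layers([0.1, None, 0.3])
```

Keeping each layer as an independent function mirrors the layered description: the data processing layer can change its cleaning rule without the layers above it needing to know.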
Referring to fig. 2, fig. 2 is a schematic flow chart of a data cleaning method according to an embodiment of the present application, where the data cleaning method is implemented in a data processing layer of a panoramic sensing architecture, and as shown in fig. 2, the flow of the data cleaning method according to the embodiment of the present application may be as follows:
In 101, data to be cleaned is acquired.
For example, the electronic device may obtain the data to be cleaned locally, from other electronic devices, from a network, and so on.
In 102, a cleaning requirement for the data to be cleaned is obtained.
As will be appreciated by those of ordinary skill in the art, real-world data tends to be multidimensional, incomplete, noisy, and inconsistent, with the goal of data cleansing being to fill in missing values, smooth noise and identify outliers, correct inconsistencies in the data, and the like.
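The cleaning goals listed above (filling in missing values, smoothing noise, identifying outliers) can each be pictured as a small cleaning rule. The following is a minimal sketch using only the standard library; the specific techniques (median fill, centered moving average, median-absolute-deviation test) and all function names are assumptions chosen for illustration, not rules named in the application.

```python
from statistics import mean, median


def fill_missing(values):
    """Replace None entries with the median of the present values."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to estimate from
    fill = median(present)
    return [fill if v is None else v for v in values]


def moving_average(values, window=3):
    """Smooth noise with a centered moving average (shorter window at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(mean(values[lo:hi]))
    return smoothed


def find_outliers(values, k=3.0):
    """Return indices whose distance from the median exceeds k times the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - med) / mad > k]


filled = fill_missing([1.0, None, 3.0])
smoothed = moving_average([1.0, 2.0, 3.0])
outliers = find_outliers([10, 11, 9, 10, 50])
```

Here `filled` is `[1.0, 2.0, 3.0]`, `smoothed` is `[1.5, 2.0, 2.5]`, and `outliers` is `[4]`, the index of the reading 50.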
In the embodiment of the application, after the electronic device acquires the data to be cleaned, it further acquires the cleaning requirement of the data to be cleaned. Put simply, the cleaning requirement describes the cleaning effect that data cleaning is expected to achieve on the data to be cleaned. For example, the original data to be cleaned may contain data of multiple dimensions that are often not independent of one another, that is, some dimensions may be related while others are not; in that case the cleaning requirement may be to reduce the data to be cleaned to a specified number of dimensions.
It will be understood by those skilled in the art that the cleaning requirement depends on the actual requirement of the electronic device for data processing, and the embodiment of the present application is not limited thereto.
In 103, a target cleaning rule for data cleaning of the data to be cleaned is determined according to the data to be cleaned, the cleaning requirement and the pre-trained cleaning rule classification model.
It should be noted that, in the embodiment of the present application, the electronic device is configured with a cleaning rule classification model for selecting which cleaning rule is used to clean the data to be cleaned. The cleaning rule classification model is obtained through model training in which a cleaning rule feature representing a cleaning rule serves as the target output, and a joint feature representing both the sample data to be cleaned corresponding to that cleaning rule and its cleaning effect serves as the training input.
For example, all available cleaning rules can be collected in advance, together with the sample data to be cleaned and the cleaning effect corresponding to each cleaning rule; then, a cleaning rule feature capable of representing each cleaning rule and a joint feature capable of representing each sample data to be cleaned and its cleaning effect are acquired; then, each joint feature is used as a training input and the cleaning rule feature corresponding to that joint feature is used as the target output, model training is carried out according to a preset training algorithm, and the cleaning rule classification model for selecting which cleaning rule is used to clean the data to be cleaned is obtained.
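To make the shape of this training data concrete, the sketch below pairs each joint feature (sample data features concatenated with cleaning effect features) with the identifier of the cleaning rule that produced that effect, and fits a nearest-centroid classifier. The nearest-centroid algorithm is a stand-in chosen for brevity, since the application only refers to "a preset training algorithm"; the feature layout, rule names, and numbers are likewise hypothetical.

```python
from collections import defaultdict


def joint_feature(data_features, effect_features):
    """Concatenate features of the sample data with features of its cleaning effect."""
    return list(data_features) + list(effect_features)


def train_nearest_centroid(samples):
    """Fit one centroid per cleaning rule from (joint_feature, rule_id) pairs."""
    grouped = defaultdict(list)
    for feature, rule_id in samples:
        grouped[rule_id].append(feature)
    return {
        rule_id: [sum(column) / len(column) for column in zip(*features)]
        for rule_id, features in grouped.items()
    }


def predict_rule(model, feature):
    """Return the rule whose centroid is closest to the given joint feature."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(feature, centroid))
    return min(model, key=lambda rule_id: squared_distance(model[rule_id]))


# Hypothetical layout: [missing_ratio, noise_level] + [missing_ratio_after_cleaning]
samples = [
    (joint_feature([0.4, 0.1], [0.0]), "fill_missing"),
    (joint_feature([0.6, 0.1], [0.0]), "fill_missing"),
    (joint_feature([0.0, 0.9], [0.1]), "smooth_noise"),
    (joint_feature([0.0, 0.7], [0.1]), "smooth_noise"),
]
model = train_nearest_centroid(samples)
chosen = predict_rule(model, joint_feature([0.5, 0.2], [0.0]))
```

At inference time the cleaning effect slot of the joint feature would be filled with the desired effect, that is, the cleaning requirement, which is why the same feature layout can be reused.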
Therefore, after the electronic device acquires the data to be cleaned and its cleaning requirement, it can input the data to be cleaned and the cleaning requirement into the cleaning rule classification model, so that the model outputs a cleaning rule that can clean the data to be cleaned with a cleaning effect meeting the cleaning requirement; this cleaning rule is used as the target cleaning rule for cleaning the data to be cleaned.
In 104, data cleaning is performed on the data to be cleaned according to the target cleaning rule, so that the cleaning effect on the data to be cleaned meets the cleaning requirement.
In the embodiment of the application, after the target cleaning rule used for performing data cleaning on the data to be cleaned is determined, the electronic equipment can perform data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect of the data to be cleaned meets the cleaning requirement, and finally the required data is obtained.
Therefore, in the embodiment of the application, the electronic device can firstly acquire the data to be cleaned, which needs to be subjected to data cleaning, and acquire the cleaning requirement of the data to be cleaned, then determine the target cleaning rule for performing data cleaning on the data to be cleaned according to the acquired data to be cleaned, the cleaning requirement and the pre-trained cleaning rule classification model, and finally perform data cleaning on the data to be cleaned according to the determined target cleaning rule, so that the cleaning effect of the data to be cleaned meets the cleaning requirement. Therefore, as long as the cleaning rule classification model is obtained through pre-training, the cleaning rule classification model can be subsequently utilized to automatically clean data without excessive manual participation, so that the labor cost of data cleaning is reduced, and the efficiency of data cleaning is improved.
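The four steps 101 to 104 can be sketched end to end as follows. The pre-trained cleaning rule classification model is replaced here by a hand-written stand-in function, and the requirement is expressed as a hypothetical `max_missing_ratio` threshold, so this shows only the control flow (select a rule, apply it, check the effect against the requirement) and not the method of the application itself.

```python
def missing_ratio(data):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in data) / len(data) if data else 0.0


# Hypothetical registry of cleaning rules, keyed by name.
RULES = {
    "drop_missing": lambda data: [v for v in data if v is not None],
    "keep_as_is": lambda data: list(data),
}


def stand_in_classifier(data, requirement):
    # Stand-in for the pre-trained cleaning rule classification model.
    if missing_ratio(data) > requirement["max_missing_ratio"]:
        return "drop_missing"
    return "keep_as_is"


def clean(data, requirement, classifier=stand_in_classifier, rules=RULES):
    """Steps 103 and 104: pick the target rule, apply it, and check the effect."""
    rule_name = classifier(data, requirement)          # 103: determine target rule
    cleaned = rules[rule_name](data)                   # 104: clean the data
    meets = missing_ratio(cleaned) <= requirement["max_missing_ratio"]
    return rule_name, cleaned, meets


data = [1.0, None, 2.0]                      # 101: data to be cleaned
requirement = {"max_missing_ratio": 0.0}     # 102: cleaning requirement
result = clean(data, requirement)
```

For the sample input, `result` is `("drop_missing", [1.0, 2.0], True)`.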
Referring to fig. 3, fig. 3 is another schematic flow chart of a data cleansing method according to an embodiment of the present disclosure. The data cleaning method can be applied to the electronic equipment, and the flow of the data cleaning method can comprise the following steps:
In 201, sensor data collected by a sensor is acquired, and the acquired sensor data is used as data to be cleaned.
It should be noted that electronic devices are often configured with a variety of sensors through which the environment in which the device is located, the motion of the device, and the like, can be sensed. The electronic device is configured with sensors including, but not limited to, a gravity sensor, an acceleration sensor, a positioning sensor (such as a satellite positioning sensor, a base station positioning sensor, etc.), a sound sensor, a light sensor, and the like.
However, the sensor data collected by these sensors is not all required by the electronic device, which requires the electronic device to clean the sensor data to obtain the actually required data.
Therefore, in the embodiment of the application, the electronic device can acquire the sensor data acquired by the sensor configured by the electronic device, and the acquired sensor data is used as the data to be cleaned.
In 202, a cleaning requirement for the data to be cleaned is obtained.
As will be appreciated by those of ordinary skill in the art, real-world data tends to be multidimensional, incomplete, noisy, and inconsistent, with the goal of data cleansing being to fill in missing values, smooth noise and identify outliers, correct inconsistencies in the data, and the like.
In the embodiment of the application, after the electronic device acquires the data to be cleaned, it further acquires the cleaning requirement of the data to be cleaned. Put simply, the cleaning requirement describes the cleaning effect that data cleaning is expected to achieve on the data to be cleaned. For example, the original data to be cleaned may contain data of multiple dimensions that are often not independent of one another, that is, some dimensions may be related while others are not; in that case the cleaning requirement may be to reduce the data to be cleaned to a specified number of dimensions.
It should be noted that, as can be understood by those skilled in the art, the requirement for cleaning depends on the actual requirement of data processing performed by the electronic device, and the embodiment of the present application is not particularly limited thereto.
In 203, a joint feature of the data to be cleaned and the cleaning requirement is obtained.
In 204, the obtained joint feature is input into the cleaning rule classification model to obtain the cleaning rule feature output by the cleaning rule classification model.
In 205, a cleaning rule matching with the cleaning rule feature output by the cleaning rule classification model is determined as a target cleaning rule for data cleaning of the data to be cleaned.
It should be noted that, in the embodiment of the present application, the electronic device is configured with a cleaning rule classification model for selecting which cleaning rule is used to clean the data to be cleaned. The cleaning rule classification model is obtained through model training in which a cleaning rule feature representing a cleaning rule serves as the target output, and a joint feature representing both the sample data to be cleaned corresponding to that cleaning rule and its cleaning effect serves as the training input.
For example, all available cleaning rules can be collected in advance, together with the sample data to be cleaned and the cleaning effect corresponding to each cleaning rule; then, a cleaning rule feature capable of representing each cleaning rule and a joint feature capable of representing each sample data to be cleaned and its cleaning effect are acquired; then, each joint feature is used as a training input and the cleaning rule feature corresponding to that joint feature is used as the target output, model training is carried out according to a preset training algorithm, and the cleaning rule classification model for selecting which cleaning rule is used to clean the data to be cleaned is obtained.
Therefore, after the electronic device acquires the data to be cleaned and its cleaning requirement, it can input the data to be cleaned and the cleaning requirement into the cleaning rule classification model, so that the model outputs a cleaning rule that can clean the data to be cleaned with a cleaning effect meeting the cleaning requirement; this cleaning rule is used as the target cleaning rule for cleaning the data to be cleaned.
It should be noted that, in the embodiment of the present application, the data to be cleaned and the cleaning requirement are input to the cleaning rule classification model, and the data to be cleaned and the cleaning requirement are not input to the cleaning rule classification model, but the characteristics capable of characterizing the data to be cleaned and the cleaning requirement are input to the cleaning rule classification model.
Therefore, in the embodiment of the application, after the electronic device acquires the data to be cleaned and the cleaning requirement of the data to be cleaned, the electronic device further acquires the joint feature of the data to be cleaned and the cleaning requirement, and performs joint depth characterization on the data to be cleaned and the cleaning requirement thereof by using the joint feature.
After acquiring the combined feature of the data to be cleaned and its cleaning requirement, the electronic device inputs the combined feature into the pre-trained cleaning rule classification model for processing. Correspondingly, the cleaning rule classification model processes the input combined feature and outputs a corresponding cleaning rule feature, which characterizes a cleaning rule that can be used to clean the data to be cleaned and whose cleaning effect meets the cleaning requirement.
After the electronic equipment obtains the cleaning rule features output by the cleaning rule classification model, further determining the cleaning rules matched with the cleaning rule features, and taking the cleaning rules as target cleaning rules for data cleaning of the data to be cleaned.
For example, referring to Fig. 4, the electronic device obtains the combined feature A of the data to be cleaned and its cleaning requirement, inputs the combined feature A into the cleaning rule classification model for processing, obtains the cleaning rule feature A output by the cleaning rule classification model, and determines the matching cleaning rule A as the target cleaning rule.
In 206, data cleaning is performed on the data to be cleaned according to the target cleaning rule, so that the cleaning effect of the data to be cleaned meets the cleaning requirement.
In the embodiment of the application, after the target cleaning rule used for performing data cleaning on the data to be cleaned is determined, the electronic equipment can perform data cleaning on the data to be cleaned according to the target cleaning rule, and the cleaning effect of the data to be cleaned meets the corresponding cleaning requirement.
In one embodiment, the "acquiring the combined characteristics of the data to be cleaned and the cleaning requirement" may include:
acquiring the combined feature of the data to be cleaned and the cleaning requirement according to a generative adversarial network (GAN).
In the embodiment of the application, considering that a generative adversarial network can generate more sample data based on existing data and has a strong feature learning capability, the electronic device can acquire the combined feature of the data to be cleaned and the cleaning requirement according to the generative adversarial network.
When the electronic device acquires the combined feature of the data to be cleaned and the cleaning requirement, the data to be cleaned and the cleaning requirement first form a data pair, expressed as <data to be cleaned, cleaning requirement>, and then the combined feature of <data to be cleaned, cleaning requirement> is constructed according to the generative adversarial network.
It should be noted that, in other embodiments, those skilled in the art may select an appropriate feature construction method according to actual needs to obtain the combined feature of the data to be cleaned and the cleaning requirement.
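As an illustration of one such alternative feature construction method, the following minimal sketch builds a combined feature by concatenating simple statistics of the data to be cleaned with a numeric encoding of the cleaning requirement. It is not the generative adversarial network approach described above; the function name `joint_feature`, the requirement keys, and the chosen statistics are all assumptions made for illustration.

```python
from statistics import mean, pstdev

# Hypothetical requirement keys; the source does not define how a
# cleaning requirement is encoded.
REQUIREMENT_KEYS = ("max_missing_ratio", "max_outlier_ratio")


def joint_feature(data, requirement):
    """Build a feature vector for the pair <data to be cleaned, cleaning requirement>."""
    present = [x for x in data if x is not None]
    missing_ratio = 1 - len(present) / len(data) if data else 0.0
    mu = mean(present) if present else 0.0
    sigma = pstdev(present) if len(present) > 1 else 0.0
    data_part = [missing_ratio, mu, sigma]
    # Requirement thresholds that are not given default to 0.0.
    req_part = [float(requirement.get(k, 0.0)) for k in REQUIREMENT_KEYS]
    return data_part + req_part


feature = joint_feature([1.0, None, 3.0, 5.0], {"max_missing_ratio": 0.0})
```

Here the first three entries describe the data and the remaining entries describe the requirement, so a downstream model sees both halves of the pair in one vector.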
In one embodiment, determining a cleaning rule matching the cleaning rule feature output by the cleaning rule classification model includes:
(1) acquiring the similarity between the cleaning rule features output by the cleaning rule classification model and the cleaning rule features of a plurality of pre-stored cleaning rules;
(2) and taking the cleaning rule with the similarity reaching the preset similarity as the cleaning rule matched with the cleaning rule characteristic output by the cleaning rule classification model.
It should be noted that, in the embodiment of the present application, the cleaning rule matched with the cleaning rule feature output by the cleaning rule classification model means that the similarity between the cleaning rule feature of the cleaning rule and the cleaning rule feature output by the cleaning rule classification model reaches the preset similarity.
Therefore, when the electronic device determines the cleaning rule matched with the cleaning rule feature output by the cleaning rule classification model, the electronic device may first obtain the similarity between the cleaning rule feature output by the cleaning rule classification model and the cleaning rule features of the pre-stored multiple cleaning rules, and then use the cleaning rule with the similarity reaching the preset similarity as the cleaning rule matched with the cleaning rule feature output by the cleaning rule classification model, that is, the target cleaning rule subsequently used for data cleaning of the data to be cleaned.
For example, it is assumed that the electronic device has pre-stored a cleaning rule feature A of a cleaning rule A, a cleaning rule feature B of a cleaning rule B, and a cleaning rule feature C of a cleaning rule C, and that the preset similarity is configured to be 85%. If the electronic device obtains that the similarity between the cleaning rule feature A and the cleaning rule feature output by the cleaning rule classification model is 40%, the similarity between the cleaning rule feature B and the output cleaning rule feature is 45%, and the similarity between the cleaning rule feature C and the output cleaning rule feature is 86%, then only the cleaning rule feature C of the cleaning rule C reaches the preset similarity (85%). At this time, the electronic device determines the cleaning rule C as the cleaning rule matched with the cleaning rule feature output by the cleaning rule classification model.
When calculating the similarity between two cleaning rule features, the electronic device may use the feature distance between them to measure the similarity, that is, calculate the feature distance between the two cleaning rule features and take it as their similarity. Any feature distance, such as the Euclidean distance, the Manhattan distance, the Chebyshev distance, or the cosine distance, can be selected by a person of ordinary skill in the art according to actual needs.
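The matching step can be sketched as follows, assuming cosine similarity as the measure and a preset similarity of 0.85. The feature vectors, the rule names, and the choice to return the single most similar rule among those reaching the threshold are illustrative assumptions, not details fixed by the embodiment.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_rule(output_feature, stored_features, preset_similarity=0.85):
    """Return the most similar rule whose similarity reaches the threshold, or None."""
    best_name, best_sim = None, preset_similarity
    for name, feat in stored_features.items():
        sim = cosine_similarity(output_feature, feat)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name


# Hypothetical pre-stored cleaning rule features.
stored = {
    "rule_A": [1.0, 0.0, 0.0],
    "rule_B": [0.0, 1.0, 0.0],
    "rule_C": [0.0, 0.1, 1.0],
}
matched = match_rule([0.0, 0.2, 0.9], stored)
```

With these vectors only `rule_C` reaches the threshold, mirroring the example in which only cleaning rule C reaches 85%. If no stored rule reaches the preset similarity, the sketch returns `None`, a case the embodiment does not discuss.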
In one embodiment, the "data cleansing the data to be cleansed according to the target cleansing rule" includes:
and calling one or more cleaning functions corresponding to the target cleaning rule, and cleaning data to be cleaned.
It should be noted that in the embodiment of the present application, each cleaning rule is composed of one or more cleaning functions, and the cleaning functions are used for actually implementing cleaning operations, including but not limited to missing value processing, normalization processing, noise elimination processing, and the like. The cleaning function itself can be written by a related technician using a computer programming language (e.g., C language, Java language, Python language, etc.), such as a regular expression, a filter function, an SQL expression, etc.
Therefore, when the electronic equipment performs data cleaning on the data to be cleaned according to the target cleaning rule, one or more cleaning functions corresponding to the target cleaning rule can be called to perform data cleaning on the data to be cleaned, so that the cleaning effect of the data to be cleaned meets the cleaning requirement, and finally the required data is obtained.
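A cleaning rule composed of one or more cleaning functions can be sketched as an ordered list of functions applied in sequence. The function names, the rule names, and the registry `CLEANING_RULES` below are hypothetical and serve only to illustrate missing value processing, normalization processing, and a regular-expression-based cleaning function.

```python
import re


def fill_missing(values, default=0.0):
    """Missing value processing: replace None with a default."""
    return [default if v is None else v for v in values]


def min_max_normalize(values):
    """Normalization processing: scale values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def strip_non_digits(text):
    """Noise elimination with a regular expression: keep digits only."""
    return re.sub(r"\D", "", text)


# Each cleaning rule maps to the cleaning functions it is composed of.
CLEANING_RULES = {
    "numeric_basic": [fill_missing, min_max_normalize],
    "phone_digits": [strip_non_digits],
}


def apply_rule(rule_name, data):
    """Call the cleaning functions of the target cleaning rule in order."""
    for fn in CLEANING_RULES[rule_name]:
        data = fn(data)
    return data


cleaned = apply_rule("numeric_basic", [2.0, None, 4.0])
```

Keeping the rule as data (a list of functions) means the target cleaning rule selected by the model only has to be resolved to a name before it can be executed.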
Referring to Fig. 5, Fig. 5 is a schematic flow chart of a model training method provided in an embodiment of the present application. The model training method is used to train the cleaning rule classification model required by the data cleaning method provided in the embodiments of the present application. The execution subject of the model training method may be the model training apparatus provided in the embodiments of the present application, or an electronic device integrated with the model training apparatus, where the model training apparatus may be implemented in hardware or software. As shown in Fig. 5, the flow of the model training method provided in the embodiment of the present application may be as follows:
in 301, a plurality of cleaning rules are obtained, and sample data to be cleaned corresponding to each cleaning rule is obtained.
In the embodiment of the application, a database facing the cleaning rule can be created in the electronic device in advance, wherein the database facing the cleaning rule comprises a cleaning rule sub-database, a sample data sub-database to be cleaned and a cleaning effect sub-database.
In performing model training, the electronic device may integrate all possible cleansing rules and store the cleansing rules in a cleansing rules sub-database. For example, the electronic device stores the acquired plurality of cleansing rules in a character string form in the cleansing rule sub-database.
In addition, for the acquired cleaning rules stored in the cleaning rule sub-database, the electronic device further acquires the sample data to be cleaned corresponding to each cleaning rule and stores it in the sample data sub-database to be cleaned; for example, sample data to be cleaned of a numeric type is stored in the sample data sub-database to be cleaned.
It should be noted that the electronic device may obtain sample data to be cleaned locally, may obtain sample data to be cleaned from other electronic devices, and may also obtain sample data to be cleaned from the internet.
In 302, a cleaning effect of each cleaning rule for cleaning data of the corresponding sample data to be cleaned is obtained.
In the embodiment of the application, after the electronic device obtains the plurality of cleaning rules and the sample data to be cleaned corresponding to each cleaning rule, the electronic device further obtains the cleaning effect of each cleaning rule for cleaning the corresponding sample data to be cleaned, and stores the cleaning effect into the cleaning effect sub-database. For example, the cleaning effect may be stored in a table in the cleaning effect sub-database.
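The cleaning-rule-oriented database with its three sub-databases can be sketched with an in-memory SQLite database, where each sub-database is modeled as one table. The table names, column names, and the `missing_ratio_after` metric are assumptions; the embodiment only states that rules are stored as character strings and cleaning effects in table form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cleaning_rule (
    id INTEGER PRIMARY KEY,
    rule_text TEXT NOT NULL              -- rule stored as a character string
);
CREATE TABLE sample_data (
    id INTEGER PRIMARY KEY,
    rule_id INTEGER REFERENCES cleaning_rule(id),
    value REAL                           -- numeric sample data, may be NULL
);
CREATE TABLE cleaning_effect (
    rule_id INTEGER REFERENCES cleaning_rule(id),
    metric TEXT,
    score REAL
);
""")

conn.execute(
    "INSERT INTO cleaning_rule (id, rule_text) VALUES (1, ?)",
    ("fill_missing; min_max_normalize",),
)
conn.executemany(
    "INSERT INTO sample_data (rule_id, value) VALUES (1, ?)",
    [(2.0,), (None,), (4.0,)],
)
conn.execute("INSERT INTO cleaning_effect VALUES (1, 'missing_ratio_after', 0.0)")
conn.commit()

# Join the three tables to recover <rule, sample count, effect> for training.
rows = conn.execute("""
SELECT r.rule_text, COUNT(s.id), e.score
FROM cleaning_rule r
JOIN sample_data s ON s.rule_id = r.id
JOIN cleaning_effect e ON e.rule_id = r.id
GROUP BY r.id, r.rule_text, e.score
""").fetchall()
```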
In 303, the joint features of each sample data to be cleaned and the corresponding cleaning effect are obtained, and the cleaning rule features of each cleaning rule are obtained.
In the embodiment of the application, for each obtained sample data to be cleaned and the corresponding cleaning effect thereof, the electronic device further obtains a joint feature of each sample data to be cleaned and the corresponding cleaning effect thereof, and performs joint depth characterization on the sample data to be cleaned and the corresponding cleaning effect thereof by using the joint feature.
In addition, the electronic device also acquires the cleaning rule feature of each cleaning rule, and the cleaning rule feature is used to characterize the cleaning rule. For example, the electronic device may obtain the vocabulary features of the one or more cleaning functions corresponding to each cleaning rule as the cleaning rule feature of that cleaning rule.
At 304, model training is performed with each joint feature as a training input and the cleaning rule feature corresponding to each joint feature as a target output to obtain a cleaning rule classification model.
In the embodiment of the application, after the obtained combined features of the sample data to be cleaned and the corresponding cleaning effect of the sample data and the cleaning rule features of the cleaning rules are obtained, the electronic device can use the combined features as training input and the cleaning rule features corresponding to the combined features as target output, and perform model training according to a preset training algorithm to train and obtain a cleaning rule classification model for automatically selecting the cleaning rules.
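The training setup (combined features as input, cleaning rule features as target output) can be sketched with a small linear model fitted by stochastic gradient descent. The linear model is only a stand-in for the unspecified preset training algorithm, and the feature values, learning rate, and epoch count are illustrative assumptions.

```python
def train_linear(inputs, targets, epochs=2000, lr=0.1):
    """Fit y = W x + b by per-sample gradient descent on squared error."""
    n_in, n_out = len(inputs[0]), len(targets[0])
    W = [[0.0] * n_in for _ in range(n_out)]
    b = [0.0] * n_out
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            pred = [sum(w * xi for w, xi in zip(W[j], x)) + b[j] for j in range(n_out)]
            for j in range(n_out):
                err = pred[j] - y[j]
                for i in range(n_in):
                    W[j][i] -= lr * err * x[i]
                b[j] -= lr * err
    return W, b


def predict(model, x):
    W, b = model
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]


# Hypothetical combined features (training input) and the cleaning rule
# features of the corresponding rules (target output).
joint_features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
rule_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

model = train_linear(joint_features, rule_features)
output = predict(model, [0.0, 1.0])  # approximates the rule feature [0.0, 1.0]
```

The predicted vector is a cleaning rule feature rather than a rule identifier, which is why a separate matching step against pre-stored cleaning rule features is needed at inference time.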
The training algorithm is a machine learning algorithm, and the machine learning algorithm can realize various functions through continuous feature learning, for example, data to be cleaned and a cleaning requirement corresponding to the data to be cleaned can be given, and a cleaning rule which can clean the data to be cleaned and has a cleaning effect reaching the cleaning requirement is automatically selected. The machine learning algorithm may include: decision tree models, logistic regression models, bayesian models, neural network models, clustering models, and the like.
In addition, the algorithm type of the machine learning algorithm may be divided according to various situations, for example, the machine learning algorithm may be divided into: supervised learning algorithms, unsupervised learning algorithms, semi-supervised learning algorithms, reinforcement learning algorithms, and the like.
Under supervised learning, input data is called "training data", and each set of training data has a definite label or result, such as "spam" and "non-spam" in an anti-spam system, or "1", "2", "3", "4" in handwritten digit recognition. When the model is established, supervised learning sets up a learning process in which the prediction of the model is compared with the actual result of the training data, and the model is continuously adjusted until its predictions reach an expected accuracy. Common application scenarios for supervised learning are classification problems and regression problems. Common algorithms are Logistic Regression and the Back Propagation Neural Network.
In unsupervised learning, data is not specifically labeled and the recognition model is to infer some of the intrinsic structure of the data. Common application scenarios include learning and clustering of association rules. Common algorithms include Apriori algorithm and k-Means algorithm, among others.
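As a small illustration of how an unsupervised algorithm infers structure from unlabeled data, the following sketch runs k-Means on one-dimensional points. The data, the initial centers, and the fixed iteration count are assumptions chosen for illustration.

```python
def k_means_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


centers, clusters = k_means_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], centers=[0.0, 5.0])
```

No labels are supplied; the two groups emerge only from the distances between the points.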
In semi-supervised learning, part of the input data is labeled and part is not. Such a model can be used for prediction, but it first needs to learn the intrinsic structure of the data in order to organize the data reasonably for prediction. Application scenarios include classification and regression, and the algorithms include extensions of common supervised learning algorithms that first attempt to model the unlabeled data and then make predictions on that basis, such as Graph Inference algorithms or the Laplacian support vector machine (Laplacian SVM).
In reinforcement learning, input data serves as feedback to the model. Unlike in supervised models, where input data serves only as a way to check whether the model is right or wrong, in reinforcement learning the input data is fed back directly to the model, and the model must adjust to it immediately. Common application scenarios include dynamic systems and robot control. Common algorithms include Q-Learning and Temporal Difference Learning.
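The way feedback immediately adjusts a reinforcement learning model can be illustrated with the standard tabular Q-Learning update, Q(s, a) ← Q(s, a) + α(r + γ max Q(s', ·) − Q(s, a)). The states, actions, reward, and the values of α and γ below are illustrative assumptions.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Q-Learning update in place and return the new Q(state, action)."""
    best_next = max(q[next_state].values()) if q.get(next_state) else 0.0
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]


q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}
new_value = q_update(q_table, "s0", "right", reward=1.0, next_state="s1")
```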
Further, machine learning algorithms can also be grouped by similarity of function and form:
Regression algorithms, common ones include: Ordinary Least Squares, Logistic Regression, Stepwise Regression, Multivariate Adaptive Regression Splines (MARS), and Locally Estimated Scatterplot Smoothing (LOESS).
Instance-based algorithms, including k-Nearest Neighbor (KNN), Learning Vector Quantization (LVQ), and Self-Organizing Map (SOM).
Regularization methods, common ones include: Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), and Elastic Net.
Decision tree algorithms, common ones include: Classification And Regression Trees (CART), ID3 (Iterative Dichotomiser 3), C4.5, Chi-squared Automatic Interaction Detection (CHAID), Decision Stump, Random Forest, Multivariate Adaptive Regression Splines (MARS), and Gradient Boosting Machine (GBM).
Bayesian algorithms, including: Naive Bayes, Averaged One-Dependence Estimators (AODE), and Bayesian Belief Network (BBN).
Referring to fig. 6, fig. 6 is another schematic flow chart of a model training method according to an embodiment of the present disclosure. The model training method can be applied to the electronic equipment, and the flow of the model training method can comprise the following steps:
in 401, a plurality of cleaning rules and sample data to be cleaned corresponding to each cleaning rule are obtained.
In the embodiment of the application, a database facing the cleaning rule can be created in the electronic device in advance, wherein the database facing the cleaning rule comprises a cleaning rule sub-database, a sample data sub-database to be cleaned and a cleaning effect sub-database.
In performing model training, the electronic device may integrate all possible cleansing rules and store the cleansing rules in a cleansing rules sub-database. For example, the electronic device stores the acquired plurality of cleansing rules in a character string form in the cleansing rule sub-database.
In addition, for the acquired cleaning rules stored in the cleaning rule sub-database, the electronic device further acquires the sample data to be cleaned corresponding to each cleaning rule and stores it in the sample data sub-database to be cleaned; for example, sample data to be cleaned of a numeric type is stored in the sample data sub-database to be cleaned.
It should be noted that the electronic device may obtain sample data to be cleaned locally, may obtain sample data to be cleaned from other electronic devices, and may also obtain sample data to be cleaned from the internet.
In 402, a cleaning effect of each cleaning rule for cleaning data of the corresponding sample data to be cleaned is obtained.
In the embodiment of the application, after the electronic device obtains the plurality of cleaning rules and the sample data to be cleaned corresponding to each cleaning rule, the electronic device further obtains the cleaning effect of each cleaning rule for cleaning the corresponding sample data to be cleaned, and stores the cleaning effect into the cleaning effect sub-database. For example, the cleaning effect may be stored in a table in the cleaning effect sub-database.
At 403, joint features of each sample data to be cleaned and its corresponding cleaning effect are obtained according to a generative adversarial network (GAN).
In the embodiment of the application, for each obtained sample data to be cleaned and the corresponding cleaning effect thereof, the electronic device further obtains a joint feature of each sample data to be cleaned and the corresponding cleaning effect thereof, and performs joint depth characterization on the sample data to be cleaned and the corresponding cleaning effect thereof by using the joint feature.
Considering that a generative adversarial network can generate more sample data based on existing data and has a strong feature learning capability, the electronic device can acquire the joint features of the sample data to be cleaned and the cleaning effect according to the generative adversarial network.
When the electronic device acquires the joint features of the sample data to be cleaned and the corresponding cleaning effect, the sample data to be cleaned and the corresponding cleaning effect first form a data pair, expressed as <sample data to be cleaned, cleaning effect>, and then the joint features of <sample data to be cleaned, cleaning effect> are constructed according to the generative adversarial network.
At 404, lexical features of one or more cleaning functions corresponding to each cleaning rule are obtained according to the encoder neural network as cleaning rule features of each cleaning rule.
In the embodiment of the application, the electronic device further obtains the cleaning rule features of each cleaning rule, and the cleaning rule features are used for representing the cleaning rules.
It should be noted that in the embodiment of the present application, each cleaning rule is composed of one or more cleaning functions, and the cleaning functions are used for actually implementing cleaning operations, including but not limited to missing value processing, normalization processing, noise elimination processing, and the like. The cleaning function itself can be written by a related technician using a computer programming language (e.g., C language, Java language, Python language, etc.), such as a regular expression, a filter function, an SQL expression, etc.
When the electronic device acquires the cleaning rule feature of each cleaning rule, for any cleaning rule, the electronic device performs a word segmentation operation on the one or more cleaning functions corresponding to the cleaning rule to obtain the vocabulary sequence of the cleaning rule, and then inputs the vocabulary sequence into the encoder neural network for encoding to obtain a vocabulary feature vector with representation capability as the cleaning rule feature of the cleaning rule.
For example, for a cleaning rule, the electronic device performs a word segmentation operation on the cleaning rule to obtain a vocabulary sequence $C = (c_1, c_2, \ldots, c_n)$; the vocabulary sequence $C$ is then input into the encoder neural network and encoded to obtain a vocabulary feature vector $V = (v_1, v_2, \ldots, v_n)$, and the vocabulary feature vector $V$ is taken as the cleaning rule feature of the cleaning rule.
It should be noted that, in the embodiment of the present application, specific models and topology structures of the encoder neural network are not limited, a single-layer recurrent neural network may be used for training to obtain the encoder neural network, a multi-layer recurrent neural network may also be used for training to obtain the encoder neural network, and a convolutional neural network, or a variant thereof, or a neural network with other network structures may also be used for training to obtain the encoder neural network. For example, in the embodiment of the present application, a recurrent neural network may be used to construct the encoder neural network.
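The word segmentation and encoding steps can be sketched as follows. The tokenizer pattern, the hash-based `token_id`, and the fixed, untrained weights of the scalar recurrent encoder are all assumptions for illustration; a real encoder neural network would be trained and would typically use learned embeddings.

```python
import math
import re


def segment(cleaning_function_source):
    """Word segmentation: split cleaning-function text into a vocabulary sequence C."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", cleaning_function_source)


def token_id(token, vocab_size=97):
    # Deterministic toy hash standing in for a learned vocabulary index.
    return sum(ord(ch) for ch in token) % vocab_size


def encode(tokens, w_h=0.5, w_x=1.0):
    """Toy scalar recurrent encoder: one hidden state v_t per token c_t."""
    h, states = 0.0, []
    for tok in tokens:
        x = token_id(tok) / 97.0
        h = math.tanh(w_h * h + w_x * x)  # recurrent update from previous state
        states.append(h)
    return states


C = segment("fill_missing(values, default=0)")
V = encode(C)
```

Each entry of `V` depends on the current token and on all preceding tokens through the recurrent state, which is the property that makes a recurrent network a natural choice for encoding a token sequence.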
In 405, the combined features are used as training input, the cleaning rule features corresponding to the combined features are used as target output, and model training is performed by using a conditional recurrent neural network to obtain a cleaning rule classification model.
In the embodiment of the application, after acquiring the combined features of the sample data to be cleaned and its corresponding cleaning effect, as well as the cleaning rule features of the cleaning rules, the electronic device can use the combined features as training input and the cleaning rule features corresponding to the combined features as target output, and perform model training by using a conditional recurrent neural network to obtain a cleaning rule classification model for automatically selecting cleaning rules.
For a clearer understanding of the embodiment of the present application, please refer to fig. 7, and fig. 7 is a schematic view of an application scenario of the embodiment of the present application for model training.
Firstly, a database facing a cleaning rule is constructed, and the database comprises three sub-databases, namely a cleaning rule sub-database, a sample data sub-database to be cleaned and a cleaning effect sub-database. Integrating all possible cleaning rules, simultaneously collecting sample data to be cleaned and cleaning effects thereof corresponding to each cleaning rule, storing the cleaning rules into the cleaning rule sub-database in a character string mode, storing the sample data to be cleaned into the sample data sub-database to be cleaned, and storing the cleaning effects into the cleaning effect sub-database in a table mode.
Secondly, all cleaning rules are encoded by using an encoder neural network constructed from a recurrent neural network, and the resulting vocabulary feature vectors are taken as the cleaning rule features of the cleaning rules. Meanwhile, the sample data to be cleaned corresponding to each cleaning rule and its cleaning effect form a data pair, expressed as <sample data to be cleaned, cleaning effect>, and the <sample data to be cleaned, cleaning effect> pair of each cleaning rule is learned by using a generative adversarial network to obtain the combined feature of <sample data to be cleaned, cleaning effect>.
Finally, the combined feature of the sample data to be cleaned and the cleaning effect corresponding to each cleaning rule is taken as training input, the vocabulary feature vector is taken as target output, and model training is performed by using a conditional recurrent neural network to obtain a cleaning rule classification model.
Therefore, as long as the data to be cleaned and its cleaning requirement are input into the trained cleaning rule classification model, the cleaning rule output by the cleaning rule classification model can be obtained, and the cleaning effect of cleaning the data to be cleaned with this cleaning rule can meet the cleaning requirement.
The embodiment of the application also provides a data cleaning device. Referring to Fig. 8, Fig. 8 is a schematic structural diagram of a data cleaning device according to an embodiment of the present application. The data cleaning device is applied to electronic equipment and comprises a data acquisition module 501, a requirement acquisition module 502, a rule determination module 503 and a data cleaning module 504, as follows:
a data obtaining module 501, configured to obtain data to be cleaned, where the data is required to be cleaned;
a requirement obtaining module 502, configured to obtain a cleaning requirement of data to be cleaned;
the rule determining module 503 is configured to determine a target cleaning rule for data cleaning of the data to be cleaned according to the data to be cleaned, the cleaning requirement and the pre-trained cleaning rule classification model;
the data cleaning module 504 is configured to perform data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect of the data to be cleaned meets the cleaning requirement;
wherein the cleaning rule classification model is obtained by performing model training with the cleaning rule features characterizing the cleaning rules as target output and with the combined features characterizing the sample data to be cleaned corresponding to the cleaning rules and its cleaning effect as training input.
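The cooperation of the four modules can be sketched as a small class in which each module is supplied as a callable. The class name `DataCleaningDevice`, the field names, and the stub callables are hypothetical; in particular, the stub for the rule determination module replaces the pre-trained cleaning rule classification model with a hard-coded choice.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DataCleaningDevice:
    acquire_data: Callable[[], Any]                              # module 501
    acquire_requirement: Callable[[Any], Any]                    # module 502
    determine_rule: Callable[[Any, Any], Callable[[Any], Any]]   # module 503

    def run(self):
        data = self.acquire_data()
        requirement = self.acquire_requirement(data)
        rule = self.determine_rule(data, requirement)
        # Module 504: clean the data according to the target cleaning rule.
        return rule(data)


def drop_missing(values):
    return [v for v in values if v is not None]


device = DataCleaningDevice(
    acquire_data=lambda: [3.0, None, 5.0],
    acquire_requirement=lambda data: {"no_missing": True},
    # Stub: a real module 503 would query the classification model.
    determine_rule=lambda data, req: drop_missing if req["no_missing"] else (lambda d: d),
)
result = device.run()
```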
In an embodiment, when determining a target cleaning rule for data cleaning of data to be cleaned according to the data to be cleaned, the cleaning requirement, and the pre-trained cleaning rule classification model, the rule determining module 503 may be configured to:
acquiring the data to be cleaned and the combined characteristics of the cleaning requirements;
inputting the obtained combined features into a cleaning rule classification model to obtain cleaning rule features output by the cleaning rule classification model;
and determining a cleaning rule matched with the cleaning rule characteristic output by the cleaning rule classification model, and taking the cleaning rule as a target cleaning rule for cleaning data to be cleaned.
In one embodiment, when determining a cleansing rule matching the cleansing rule features output by the cleansing rule classification model, the rule determination module 503 may be configured to:
acquiring the similarity between the cleaning rule features output by the cleaning rule classification model and the cleaning rule features of a plurality of pre-stored cleaning rules;
and taking the cleaning rule with the similarity reaching the preset similarity as the cleaning rule matched with the cleaning rule characteristic output by the cleaning rule classification model.
In an embodiment, when performing data cleansing on data to be cleansed according to the target cleansing rule, the data cleansing module 504 may be configured to:
and calling one or more cleaning functions corresponding to the target cleaning rule, and cleaning data to be cleaned.
In an embodiment, when acquiring data to be cleaned, which needs to be cleaned, the data acquiring module 501 may be configured to:
and acquiring sensor data acquired by a sensor, and taking the acquired sensor data as data to be cleaned.
The embodiment of the application also provides a model training device. Referring to Fig. 9, Fig. 9 is a schematic structural diagram of a model training device according to an embodiment of the present application. The model training device is applied to electronic equipment and comprises a first obtaining module 601, a second obtaining module 602, a third obtaining module 603 and a model training module 604, as follows:
a first obtaining module 601, configured to obtain a plurality of cleaning rules and obtain sample data to be cleaned corresponding to each cleaning rule;
a second obtaining module 602, configured to obtain a cleaning effect of each cleaning rule for performing data cleaning on sample data to be cleaned corresponding to the cleaning rule;
a third obtaining module 603, configured to obtain joint features of each sample data to be cleaned and a corresponding cleaning effect thereof, and obtain cleaning rule features of each cleaning rule;
and the model training module 604 is configured to perform model training by using each joint feature as a training input and using the cleaning rule feature corresponding to each joint feature as a target output, so as to obtain a cleaning rule classification model.
In one embodiment, when acquiring the cleansing rule features of each cleansing rule, the third acquiring module 603 may be configured to:
and acquiring the vocabulary characteristics of one or more cleaning functions corresponding to each cleaning rule as the cleaning rule characteristics of each cleaning rule.
In one embodiment, when obtaining the vocabulary features of the one or more cleaning functions corresponding to each cleaning rule, the third obtaining module 603 may be configured to:
and acquiring the vocabulary characteristics of one or more cleaning functions corresponding to each cleaning rule according to the encoder neural network.
In an embodiment, when acquiring the combined features of each sample data to be cleaned and the corresponding cleaning effect thereof, the third acquiring module 603 may be configured to:
acquiring the joint features of each sample data to be cleaned and its corresponding cleaning effect according to a generative adversarial network.
In an embodiment, when performing model training by using each joint feature as a training input and using the cleaning rule feature corresponding to each joint feature as a target output to obtain a cleaning rule classification model, the model training module 604 may be configured to:
and taking each joint feature as training input, taking the cleaning rule feature corresponding to each joint feature as target output, and performing model training by using a conditional recurrent neural network to obtain a cleaning rule classification model.
Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored. When the stored computer program is executed on a computer, the computer is caused to execute the steps in the data cleaning method provided in this embodiment, or the steps in the model training method provided in this embodiment. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
An embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the processor executes steps in the data cleaning method provided in this embodiment by calling a computer program stored in the memory, or executes steps in the model training method provided in this embodiment.
In an embodiment, an electronic device is also provided. Referring to fig. 10, the electronic device includes a processor 701 and a memory 702. The processor 701 is electrically connected to the memory 702.
The processor 701 is the control center of the electronic device. It connects the various parts of the entire electronic device using various interfaces and lines, and it performs the various functions of the electronic device and processes data by running or loading a computer program stored in the memory 702 and calling data stored in the memory 702.
The memory 702 may be used to store software programs and modules, and the processor 701 executes various functional applications and data processing by running the computer programs and modules stored in the memory 702. The memory 702 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, a computer program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the electronic device, and the like. Further, the memory 702 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 702 may also include a memory controller to provide the processor 701 with access to the memory 702.
In this embodiment of the application, the processor 701 in the electronic device loads instructions corresponding to one or more processes of the computer program into the memory 702, and the processor 701 executes the computer program stored in the memory 702, so as to implement various functions as follows:
acquiring data to be cleaned, which needs to be cleaned;
acquiring a cleaning requirement of data to be cleaned;
determining a target cleaning rule for cleaning data to be cleaned according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model;
performing data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect of the data to be cleaned meets the cleaning requirement;
the cleaning rule classification model is obtained by performing model training with the cleaning rule features representing the cleaning rules as the target output, and with the combined features representing the sample data to be cleaned corresponding to the cleaning rules and the cleaning effect of that sample data as the training input.
Alternatively, the processor 701 in the electronic device may load instructions corresponding to one or more processes of the computer program into the memory 702, and the processor 701 executes the computer program stored in the memory 702, so as to implement various functions, as follows:
acquiring a plurality of cleaning rules and acquiring sample data to be cleaned corresponding to each cleaning rule;
acquiring the cleaning effect achieved by each cleaning rule when it performs data cleaning on its corresponding sample data to be cleaned;
acquiring combined characteristics of each sample data to be cleaned and a cleaning effect corresponding to the sample data, and acquiring cleaning rule characteristics of each cleaning rule;
and taking each joint feature as training input, and taking the cleaning rule feature corresponding to each joint feature as target output to carry out model training, thereby obtaining a cleaning rule classification model.
Referring to fig. 11, fig. 11 is another schematic structural diagram of the electronic device according to the embodiment of the present disclosure, and the difference from the electronic device shown in fig. 10 is that the electronic device further includes components such as an input unit 703 and an output unit 704.
The input unit 703 may be used to receive input numbers, character information, or user characteristic information (such as a fingerprint), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The output unit 704, such as a screen, may be used to display information input by the user or information provided to the user.
In this embodiment of the application, the processor 701 in the electronic device loads instructions corresponding to one or more processes of the computer program into the memory 702, and the processor 701 executes the computer program stored in the memory 702, so as to implement various functions as follows:
acquiring data to be cleaned, which needs to be cleaned;
acquiring a cleaning requirement of data to be cleaned;
determining a target cleaning rule for cleaning data to be cleaned according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model;
and performing data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect of the data to be cleaned meets the cleaning requirement.
In an embodiment, when determining a target cleaning rule for data cleaning of data to be cleaned according to the data to be cleaned, the cleaning requirement and the pre-trained cleaning rule classification model, the processor 701 may perform:
acquiring the data to be cleaned and the combined characteristics of the cleaning requirements;
inputting the obtained combined features into a cleaning rule classification model to obtain cleaning rule features output by the cleaning rule classification model;
and determining a cleaning rule matched with the cleaning rule characteristic output by the cleaning rule classification model, and taking the cleaning rule as a target cleaning rule for cleaning data to be cleaned.
In one embodiment, when determining a cleansing rule matching the cleansing rule features output by the cleansing rule classification model, the processor 701 may perform:
acquiring the similarity between the cleaning rule features output by the cleaning rule classification model and the cleaning rule features of a plurality of pre-stored cleaning rules;
and taking the cleaning rule with the similarity reaching the preset similarity as the cleaning rule matched with the cleaning rule characteristic output by the cleaning rule classification model.
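The similarity-based matching described above can be sketched as follows. This is a minimal illustration only: the embodiment does not specify the similarity measure, so cosine similarity, the threshold value 0.9, the rule names, and the three-dimensional feature vectors are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def match_rule(predicted, stored_rules, threshold=0.9):
    """Return the most similar stored rule whose similarity reaches
    the preset threshold, or None if no stored rule reaches it."""
    best_name, best_score = None, threshold
    for name, feature in stored_rules.items():
        score = cosine_similarity(predicted, feature)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical pre-stored cleaning rule features (assumed for illustration).
STORED_RULES = {
    "drop_missing": [1.0, 0.0, 0.0],
    "clip_outliers": [0.0, 1.0, 0.0],
}

matched = match_rule([0.95, 0.05, 0.0], STORED_RULES)   # close to "drop_missing"
unmatched = match_rule([0.5, 0.5, 0.7], STORED_RULES)   # below the threshold
```

In this sketch, an output feature that does not reach the preset similarity for any stored rule yields no match, and the caller would have to decide how to handle that case.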
In an embodiment, when performing data cleansing on data to be cleansed according to a target cleansing rule, the processor 701 may perform:
and calling one or more cleaning functions corresponding to the target cleaning rule, and cleaning data to be cleaned.
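Calling the cleaning functions that correspond to a target cleaning rule might look like the sketch below. The rule name `sensor_basic`, the two cleaning functions, and the mapping `RULE_FUNCTIONS` are hypothetical and are not taken from the embodiment; the sketch only shows one rule being resolved to an ordered list of functions that are applied in turn.

```python
def drop_missing(values):
    """Remove entries that are None (missing readings)."""
    return [v for v in values if v is not None]

def clip_range(values, low=0.0, high=100.0):
    """Clamp each value into the closed interval [low, high]."""
    return [min(max(v, low), high) for v in values]

# Hypothetical mapping from a cleaning rule to its ordered cleaning functions.
RULE_FUNCTIONS = {
    "sensor_basic": [drop_missing, clip_range],
}

def apply_rule(rule_name, values):
    """Apply every cleaning function of the named rule, in order."""
    for cleaning_function in RULE_FUNCTIONS[rule_name]:
        values = cleaning_function(values)
    return values

cleaned = apply_rule("sensor_basic", [12.5, None, 250.0, -3.0])
```

Keeping the rule-to-function mapping in a table means a newly determined target cleaning rule only needs a lookup, not new control flow.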
In one embodiment, when acquiring data to be cleaned requiring data cleaning, the processor 701 may perform:
and acquiring sensor data acquired by a sensor, and taking the acquired sensor data as data to be cleaned.
Alternatively, the processor 701 in the electronic device may load instructions corresponding to one or more processes of the computer program into the memory 702, and the processor 701 executes the computer program stored in the memory 702, so as to implement various functions, as follows:
acquiring a plurality of cleaning rules and acquiring sample data to be cleaned corresponding to each cleaning rule;
acquiring the cleaning effect achieved by each cleaning rule when it performs data cleaning on its corresponding sample data to be cleaned;
acquiring combined characteristics of each sample data to be cleaned and a cleaning effect corresponding to the sample data, and acquiring cleaning rule characteristics of each cleaning rule;
and taking each joint feature as training input, and taking the cleaning rule feature corresponding to each joint feature as target output to carry out model training, thereby obtaining a cleaning rule classification model.
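The four training steps above can be sketched end to end as follows. Everything concrete here is an assumption for illustration: the effect measure (gain in the share of valid entries), the joint feature (a plain concatenation of sample statistics and the effect), the two toy cleaning rules, and the model itself. In particular, a nearest-neighbour lookup is used as a simple stand-in for the neural network of the embodiments, so that the example runs with the standard library only.

```python
import math

def valid_ratio(values, low=0.0, high=10.0):
    """Share of entries that are present and inside [low, high]."""
    if not values:
        return 0.0
    ok = [v for v in values if v is not None and low <= v <= high]
    return len(ok) / len(values)

def cleaning_effect(before, after):
    """Hypothetical effect measure: gain in valid ratio after cleaning."""
    return valid_ratio(after) - valid_ratio(before)

def joint_feature(sample, effect):
    """Concatenate simple sample statistics with the cleaning effect."""
    if not sample:
        return [0.0, 0.0, effect]
    present = [v for v in sample if v is not None]
    missing_ratio = 1.0 - len(present) / len(sample)
    mean = sum(present) / len(present) if present else 0.0
    return [missing_ratio, mean, effect]

class NearestNeighbourModel:
    """Stand-in model: memorises (joint feature, rule feature) pairs."""
    def fit(self, inputs, targets):
        self.pairs = list(zip(inputs, targets))
        return self

    def predict(self, x):
        # Return the rule feature paired with the closest joint feature.
        return min(self.pairs, key=lambda p: math.dist(p[0], x))[1]

def drop_missing(values):
    return [v for v in values if v is not None]

def clip_high(values, high=10.0):
    return [min(v, high) for v in values]

# (cleaning rule, sample data to be cleaned, cleaning rule feature)
TRAINING_SET = [
    (drop_missing, [1.0, None, 3.0, None], [1.0, 0.0]),
    (clip_high, [2.0, 50.0, 4.0, 6.0], [0.0, 1.0]),
]

inputs, targets = [], []
for rule_fn, sample, rule_feature in TRAINING_SET:
    effect = cleaning_effect(sample, rule_fn(sample))
    inputs.append(joint_feature(sample, effect))   # training input
    targets.append(rule_feature)                   # target output

model = NearestNeighbourModel().fit(inputs, targets)
predicted = model.predict([0.4, 2.5, 0.5])
```

At inference time the same joint-feature layout would be built from the data to be cleaned and the cleaning requirement, with the requirement taking the place of the measured cleaning effect.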
In one embodiment, in obtaining the cleansing rule features of each cleansing rule, the processor 701 may perform:
and acquiring the vocabulary characteristics of one or more cleaning functions corresponding to each cleaning rule as the cleaning rule characteristics of each cleaning rule.
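One simple way to picture vocabulary features of cleaning functions is a bag-of-words count over the tokens in the function names. The sketch below is an assumption for illustration: the rule names, the function names, and the underscore-based tokenisation are hypothetical, and the count vector is a stand-in rather than the feature extraction actually used.

```python
def tokenize(function_name):
    """Split a function name such as 'drop_missing' into word tokens."""
    return function_name.lower().split("_")

def build_vocabulary(rules):
    """Sorted list of every token used by any cleaning function."""
    tokens = {tok for fns in rules.values() for fn in fns for tok in tokenize(fn)}
    return sorted(tokens)

def rule_feature(function_names, vocabulary):
    """Count vector of vocabulary tokens over a rule's cleaning functions."""
    counts = {tok: 0 for tok in vocabulary}
    for fn in function_names:
        for tok in tokenize(fn):
            if tok in counts:
                counts[tok] += 1
    return [counts[tok] for tok in vocabulary]

# Hypothetical cleaning rules and the cleaning functions they call.
RULES = {
    "rule_a": ["drop_missing", "clip_outliers"],
    "rule_b": ["drop_duplicates"],
}

VOCAB = build_vocabulary(RULES)
FEATURES = {name: rule_feature(fns, VOCAB) for name, fns in RULES.items()}
```

Rules that share cleaning functions end up with overlapping count vectors, which is what makes a similarity comparison between rule features meaningful.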
In one embodiment, in obtaining the vocabulary features of the one or more cleaning functions corresponding to each cleaning rule, the processor 701 may perform:
and acquiring the vocabulary characteristics of one or more cleaning functions corresponding to each cleaning rule according to the encoder neural network.
In an embodiment, when acquiring the joint feature of each sample data to be cleaned and the corresponding cleaning effect thereof, the processor 701 may perform:
and acquiring the joint features of each sample data to be cleaned and its corresponding cleaning effect by using a generative adversarial network.
In one embodiment, when performing model training using each joint feature as a training input and using the cleaning rule feature corresponding to each joint feature as a target output to obtain a cleaning rule classification model, the processor 701 may perform:
and taking each joint feature as training input, taking the cleaning rule feature corresponding to each joint feature as target output, and performing model training by using a conditional recurrent neural network to obtain a cleaning rule classification model.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It should be noted that a person skilled in the art will understand that all or part of the process of implementing the data cleaning method/model training method of the embodiments of the present application can be completed by a computer program controlling related hardware. The computer program can be stored in a computer-readable storage medium, such as a memory of an electronic device, and executed by at least one processor in the electronic device, and its execution can include the processes of the embodiments of the data cleaning method/model training method. The storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, etc.
For the data cleaning device/model training device of the embodiment of the present application, each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, or the like.
The data cleaning method, the model training method, the device, the storage medium and the equipment provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the embodiments is intended only to help in understanding the method and core idea of the application. Meanwhile, those skilled in the art may, following the idea of the present application, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as a limitation of the present application.

Claims (15)

1. A data cleaning method, applied to an electronic device, characterized by comprising:
acquiring data to be cleaned, which needs to be cleaned;
acquiring the cleaning requirement of the data to be cleaned;
determining a target cleaning rule for cleaning the data to be cleaned according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model;
performing data cleaning on the data to be cleaned according to the target cleaning rule, so that the cleaning effect on the data to be cleaned meets the cleaning requirement;
the cleaning rule classification model is obtained by performing model training by using a cleaning rule characteristic representing a cleaning rule as a target output and a combined characteristic representing sample data to be cleaned corresponding to the cleaning rule and a cleaning effect of the sample data to be cleaned as a training input.
2. The data cleaning method of claim 1, wherein the determining a target cleaning rule for data cleaning of the data to be cleaned according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model comprises:
acquiring the data to be cleaned and the combined characteristics of the cleaning requirements;
inputting the combined features into the cleaning rule classification model to obtain cleaning rule features output by the cleaning rule classification model;
and determining the cleaning rule matched with the cleaning rule characteristic as the target cleaning rule.
3. The data cleansing method of claim 2, wherein the determining the cleansing rule matching the cleansing rule feature comprises:
acquiring the similarity between the cleaning rule features and the cleaning rule features of a plurality of pre-stored cleaning rules;
and determining the cleaning rule with the similarity reaching the preset similarity as the cleaning rule matched with the cleaning rule characteristic.
4. The data cleansing method of claim 1, wherein the data cleansing of the data to be cleansed according to the target cleansing rule comprises:
and calling one or more cleaning functions corresponding to the target cleaning rule to perform data cleaning on the data to be cleaned.
5. The data cleaning method of claim 1, wherein obtaining data to be cleaned for which data cleaning is required comprises:
and acquiring sensor data acquired by a sensor, and taking the sensor data as data to be cleaned.
6. A model training method, applied to an electronic device, characterized by comprising:
acquiring a plurality of cleaning rules and acquiring sample data to be cleaned corresponding to each cleaning rule;
acquiring a cleaning effect of each cleaning rule for cleaning data of the corresponding sample data to be cleaned;
acquiring the combined characteristics of the sample data to be cleaned and the corresponding cleaning effect thereof, and acquiring the cleaning rule characteristics of the cleaning rules;
and performing model training by taking each joint feature as training input and taking the cleaning rule feature corresponding to each joint feature as target output to obtain a cleaning rule classification model.
7. The model training method of claim 6, wherein the obtaining cleansing rule features of each of the cleansing rules comprises:
and acquiring the vocabulary characteristics of one or more cleaning functions corresponding to each cleaning rule as the cleaning rule characteristics of each cleaning rule.
8. The method of claim 7, wherein the obtaining the vocabulary characteristics of the one or more cleansing functions corresponding to each of the cleansing rules comprises:
and acquiring the vocabulary characteristics of one or more cleaning functions corresponding to each cleaning rule by using an encoder neural network.
9. The method according to claim 8, wherein the obtaining of the joint feature of each sample data to be cleaned and the corresponding cleaning effect thereof comprises:
and acquiring the combined characteristics of the sample data to be cleaned and the corresponding cleaning effect thereof by using a generative adversarial network.
10. The model training method of claim 6, wherein performing model training using each of the joint features as a training input and using the cleaning rule feature corresponding to each of the joint features as a target output to obtain a cleaning rule classification model comprises:
and taking each joint feature as training input, taking the cleaning rule feature corresponding to each joint feature as target output, and performing model training by using a conditional recurrent neural network to obtain the cleaning rule classification model.
11. A data cleaning device, applied to an electronic device, characterized by comprising:
the data acquisition module is used for acquiring data to be cleaned, which needs to be cleaned;
the requirement acquisition module is used for acquiring the cleaning requirement of the data to be cleaned;
the rule determining module is used for determining a target cleaning rule for cleaning the data to be cleaned according to the data to be cleaned, the cleaning requirement and a pre-trained cleaning rule classification model;
the data cleaning module is used for cleaning the data to be cleaned according to the target cleaning rule, so that the cleaning effect of the data to be cleaned meets the cleaning requirement;
the cleaning rule classification model is obtained by performing model training by using a cleaning rule characteristic representing a cleaning rule as a target output and a combined characteristic representing sample data to be cleaned corresponding to the cleaning rule and a cleaning effect of the sample data to be cleaned as a training input.
12. A model training device, applied to an electronic device, characterized by comprising:
the first acquisition module is used for acquiring a plurality of cleaning rules and acquiring sample data to be cleaned corresponding to each cleaning rule;
the second acquisition module is used for acquiring the cleaning effect of each cleaning rule for cleaning the data of the corresponding sample data to be cleaned;
the third acquisition module is used for acquiring the combined characteristics of the sample data to be cleaned and the cleaning effect corresponding to the sample data to be cleaned and acquiring the cleaning rule characteristics of the cleaning rules;
and the model training module is used for performing model training by taking each joint feature as training input and taking the cleaning rule feature corresponding to each joint feature as target output to obtain a cleaning rule classification model.
13. A storage medium having stored thereon a computer program for causing a computer to perform a data cleansing method according to any one of claims 1 to 5 or a model training method according to any one of claims 6 to 10 when the computer program is run on the computer.
14. An electronic device comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured to perform the data cleansing method according to any one of claims 1 to 5 by invoking the computer program.
15. An electronic device comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured to perform the model training method of any one of claims 6 to 10 by invoking the computer program.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910282171.0A CN111797078A (en) 2019-04-09 2019-04-09 Data cleaning method, model training method, device, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN111797078A true CN111797078A (en) 2020-10-20

Family

ID=72805340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910282171.0A Pending CN111797078A (en) 2019-04-09 2019-04-09 Data cleaning method, model training method, device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN111797078A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016165378A1 (en) * 2015-04-16 2016-10-20 国网新源张家口风光储示范电站有限公司 Energy storage power station mass data cleaning method and system
CN108734330A (en) * 2017-04-24 2018-11-02 北京京东尚科信息技术有限公司 Data processing method and device
CN108764372A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Construction method and device, mobile terminal, the readable storage medium storing program for executing of data set
CN108875821A (en) * 2018-06-08 2018-11-23 Oppo广东移动通信有限公司 The training method and device of disaggregated model, mobile terminal, readable storage medium storing program for executing
CN109299233A (en) * 2018-09-19 2019-02-01 平安科技(深圳)有限公司 Text data processing method, device, computer equipment and storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021189960A1 (en) * 2020-10-22 2021-09-30 平安科技(深圳)有限公司 Method and apparatus for training adversarial network, method and apparatus for supplementing medical data, and device and medium
CN112632051A (en) * 2020-12-25 2021-04-09 中国工商银行股份有限公司 Neural network-based database cleaning method and system
CN112632051B (en) * 2020-12-25 2024-06-14 中国工商银行股份有限公司 Database cleaning method and system based on neural network
CN112860676A (en) * 2021-02-06 2021-05-28 高云 Data cleaning method applied to big data mining and business analysis and cloud server
CN113190542A (en) * 2021-05-19 2021-07-30 西安图迹信息科技有限公司 Big data cleaning and denoising method and system for power grid and computer storage medium
CN113190542B (en) * 2021-05-19 2023-02-24 西安图迹信息科技有限公司 Big data cleaning and denoising method and system for power grid and computer storage medium
CN113420623A (en) * 2021-06-09 2021-09-21 山东师范大学 5G base station detection method and system based on self-organizing mapping neural network
CN115438183A (en) * 2022-08-31 2022-12-06 广州宝立科技有限公司 Business website monitoring system based on natural language processing
CN115438183B (en) * 2022-08-31 2023-07-04 广州宝立科技有限公司 Business website monitoring system based on natural language processing
CN116061189A (en) * 2023-03-08 2023-05-05 国网瑞嘉(天津)智能机器人有限公司 Robot operation data processing system, method, device, equipment and medium
CN116775639A (en) * 2023-08-08 2023-09-19 阿里巴巴(中国)有限公司 Data processing method, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination