CN109977110A - Data cleaning method, device and equipment - Google Patents

Data cleaning method, device and equipment

Info

Publication number
CN109977110A
Authority
CN
China
Prior art keywords
target
data
cleaning
attribute
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910348187.7A
Other languages
Chinese (zh)
Other versions
CN109977110B (en)
Inventor
张俊鹏
甘长华
方薇
汪发佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dt Dream Technology Co Ltd
Priority to CN202011183142.8A (CN112199366A)
Priority to CN201910348187.7A (CN109977110B)
Publication of CN109977110A
Application granted
Publication of CN109977110B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

Embodiments of the present invention provide a data cleaning method, device and equipment. Target metadata information of a target data table is extracted from a specified database, and the content of the target data table is copied to an intermediate table in a data warehouse. A target cleaning task for the target data table is determined according to the target metadata information, and a cleaning operation is performed on the intermediate table according to each cleaning rule in the target cleaning task to obtain the target standard table corresponding to the target data table. The target standard table is sent to the device where the specified database is located, so that the device stores the target standard table in the specified database. The cleaning task for a data table can thus be generated automatically from the extracted metadata information, and the data table is cleaned automatically according to that task. The entire cleaning process runs automatically without manual intervention and can clean all data tables in a database or system in batches, which shortens cleaning time and improves the processing efficiency of data cleaning.

Description

Data cleaning method, device and equipment
Technical field
The present invention relates to the technical field of data processing, and in particular to a data cleaning method, device and equipment.
Background
Big data currently receives wide attention. Analyzing big data can yield much valuable information. However, because of various influences during data recording and management, the data in big data inevitably has quality problems, such as missing data, duplicate data, data that does not conform to standards, incomplete data, and expired data. The value of big data is positively correlated with its data quality. Therefore, to mine high-value information from big data, the data must be processed as necessary to improve its quality.
Data cleaning is one method of improving the data quality of big data. In the related art, the data cleaning process is as follows: for each data table in the raw database, a cleaning task is configured manually; the cleaning task is written as Structured Query Language (SQL) statements into a hand-written SQL script corresponding to the raw data table; and the corresponding SQL script is then run separately for each table to generate a standard table, which is stored in the raw database. In this technique, the cleaning task for every data table must be configured manually, which is time-consuming and therefore inefficient.
Summary of the invention
To overcome the problems in the related art, the present invention provides a data cleaning method, device and equipment.
According to a first aspect of the embodiments of the present invention, a data cleaning method is provided, the method comprising:
extracting target metadata information of a target data table from a specified database, and copying the content of the target data table to an intermediate table in a data warehouse;
determining a target cleaning task for the target data table according to the target metadata information, the target cleaning task including at least one cleaning rule;
performing a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task to obtain a target standard table;
sending the target standard table to the device where the specified database is located, so that the device stores the target standard table in the specified database.
According to a second aspect of the embodiments of the present invention, a data cleaning device is provided, the device comprising:
an extraction and copying module, configured to extract target metadata information of a target data table from a specified database and copy the content of the target data table to an intermediate table in a data warehouse;
a task determining module, configured to determine a target cleaning task for the target data table according to the target metadata information, the target cleaning task including at least one cleaning rule;
a cleaning module, configured to perform a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task to obtain a target standard table;
a sending module, configured to send the target standard table to the device where the specified database is located, so that the device stores the target standard table in the specified database.
According to a third aspect of the embodiments of the present invention, data cleaning equipment is provided, comprising a processor and a memory for storing instructions executable by the processor;
the processor being configured to:
extract target metadata information of a target data table from a specified database, and copy the content of the target data table to an intermediate table in a data warehouse;
determine a target cleaning task for the target data table according to the target metadata information, the target cleaning task including at least one cleaning rule;
perform a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task to obtain the target standard table corresponding to the target data table;
send the target standard table to the device where the specified database is located, so that the device stores the target standard table in the specified database.
The technical solutions provided by the embodiments of the present invention can have the following beneficial effects:
The data cleaning method provided by the embodiments of the present invention can automatically generate the cleaning task for a data table from the extracted metadata information and automatically clean the data table according to that task. The entire cleaning process runs automatically without manual intervention and can clean all data tables in a database or system in batches, which effectively shortens the time the cleaning process takes and improves the processing efficiency of data cleaning.
It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit this specification.
Brief description of the drawings
The drawings herein are incorporated into and form part of this specification, show embodiments consistent with this specification, and together with the specification serve to explain its principles.
Fig. 1 is an example application scenario of the data cleaning method provided by an embodiment of the present invention.
Fig. 2 is an example flowchart of the data cleaning method provided by an embodiment of the present invention.
Fig. 3 is a functional block diagram of the data cleaning device provided by an embodiment of the present invention.
Fig. 4 is a hardware structure diagram of the data cleaning equipment provided by an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of devices and methods consistent with some aspects of the embodiments of the present invention as detailed in the appended claims.
The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the embodiments of the present invention. The singular forms "a", "said" and "the" used in the embodiments of the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present invention to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the embodiments of the present invention, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein can be interpreted as "when", "upon" or "in response to determining".
Big data has important application value in many areas. Enterprise customers can use big data analysis results to make decisions, which can help them adjust marketing strategies. Ordinary consumers' demand for big data lies mainly in on-demand information search and friendly, trustworthy information recommendations, and secondarily in higher-order services. For society, big data can bring a new way of life, the smart city; building a smart city is a complex systems engineering effort, for which big data will provide data support.
To obtain high-value information from big data, big data must be governed and its data quality improved. Data cleaning is one form of big data governance.
Data cleaning refers to discovering and correcting identifiable errors in data tables. It may include checking data consistency and handling invalid and missing values. The data in a data warehouse is a collection of subject-oriented data extracted from multiple business systems and includes historical data, so it is unavoidable that some data is wrong and some data conflicts with other data. Such erroneous or conflicting data does not meet requirements and is called "dirty data". The task of data cleaning is to filter out data that does not meet requirements and to place data that meets requirements in a standard table, which is returned to the raw database for storage. Data that does not meet requirements falls mainly into three categories: incomplete data, erroneous data, and duplicate data.
Much of the data of governments, institutions and enterprises is still stored in traditional transactional databases, such as MySQL, Oracle and Pgsql databases. In the related art, when data cleaning is performed, a data warehouse is built directly on the raw database by means of SQL scripts. This reduces the performance of the raw database on the one hand and hinders systematic maintenance and tracking of the data on the other. Moreover, developers must manually configure a cleaning task for each data table, and in most cases must experiment repeatedly to know exactly which cleaning operations a table needs, so processing efficiency is very low. In addition, since data tables differ and their cleaning tasks differ accordingly, a cleaning task can only be configured separately for each table and cannot be configured for multiple tables in a batch, which further reduces processing efficiency.
In practical applications, the volume of data to be processed is enormous: the objects to be cleaned are often tens of thousands of tables, hundreds of thousands of fields, and hundreds of TB of data. Processing that much data in the manner of the related art requires a considerable number of developers, so the labor cost is very high.
The embodiments of the present invention provide a data cleaning method that can automatically generate cleaning tasks from the metadata information of data tables. Cleaning tasks do not need to be configured manually and the entire cleaning process runs automatically, which greatly reduces the time the cleaning process takes and improves processing efficiency. Because the cleaning process requires no manual participation, the demand for developers is reduced, which saves labor cost.
Fig. 1 is an example application scenario of the data cleaning method provided by an embodiment of the present invention. Referring to Fig. 1, raw database 1 is deployed on server A. Server B is used to create a data warehouse, for example with Hive (a data warehouse tool based on Hadoop). The data tables in raw database 1 are copied into the data warehouse, where cleaning operations are performed on the table data to generate a standard table and/or a problem table. The standard table stores data that meets requirements and/or data that does not (also called problem data herein); the problem table stores problem data.
Server B can perform data cleaning using the data cleaning method provided by the embodiments of the present invention. Before cleaning, the following content is configured on server B in advance:
1. Data source connection information (the address of the device where the data source is located, the user name, the password, etc.; for example, if the data source is raw database 1 shown in Fig. 1, the data source connection information is the address of server A and the user name and password for accessing raw database 1).
2. The correspondence between table fields and standard data elements; the correspondence between standard data elements and cleaning rules; the correspondence between table fields and specified attributes; and the correspondence between attributes and cleaning rules.
With the first item configured, server B can access raw database 1, extract the metadata information of data tables from it, and copy the content of the data tables into the data warehouse local to server B.
With the second item configured, server B can automatically generate the cleaning task for a data table from the extracted metadata information, and then automatically perform cleaning operations on the data table in the local data warehouse according to the generated cleaning task.
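The two configuration items above could be held in a structure like the following. This is only an illustrative sketch: the class names, field names, address and credentials are hypothetical placeholders and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataSourceConnection:
    """Item 1: how to reach the data source (all values used below are placeholders)."""
    address: str
    username: str
    password: str


@dataclass
class CleaningConfig:
    """Items 1 and 2: everything configured on the cleaning server in advance."""
    source: DataSourceConnection
    # Item 2: the four correspondences, each kept as a simple lookup table.
    field_to_element: Dict[str, str] = field(default_factory=dict)
    element_to_rules: Dict[str, List[str]] = field(default_factory=dict)
    field_to_attribute: Dict[str, str] = field(default_factory=dict)
    attribute_to_rules: Dict[str, List[str]] = field(default_factory=dict)


config = CleaningConfig(
    source=DataSourceConnection("server-a.example:3306", "cleaner", "change-me"),
    field_to_element={"user_age": "numeric_min_max"},
    element_to_rules={"numeric_min_max": ["value_range_filter"]},
    field_to_attribute={"user_id_card_number": "primary_key"},
    attribute_to_rules={"primary_key": ["deduplication"]},
)
```

Keeping the four correspondences as separate lookup tables mirrors the text: each one can be maintained independently, and a cleaning task is derived by chaining two lookups.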
The data cleaning method provided by the present invention is described in detail below through embodiments.
Fig. 2 is an example flowchart of the data cleaning method provided by an embodiment of the present invention. As shown in Fig. 2, the data cleaning method may include:
S201: extract target metadata information of a target data table from a specified database, and copy the content of the target data table to an intermediate table in a data warehouse.
S202: determine a target cleaning task for the target data table according to the target metadata information, the target cleaning task including at least one cleaning rule.
S203: perform a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task to obtain a target standard table, which includes at least the data in the target data table that meets requirements.
S204: send the target standard table to the device where the specified database is located, so that the device stores the target standard table in the specified database.
The data cleaning method of the embodiments of the present invention can be applied on the aforementioned server B.
In step S201, the metadata information may include general metadata information, such as the table name, field names and field descriptions of the data table, and may also include field specified attributes, such as whether a field is a primary key and whether a field may be null.
The metadata information of every data table includes general metadata information but does not necessarily include field specified attributes. If one or more fields in a data table have specified attributes, the metadata information of that table includes field specified attributes; if no field in a data table has a specified attribute, the metadata information of that table does not include field specified attributes.
An example of metadata information follows. Table 1 is a user table, with the following content:
Table 1: User table
The metadata information of Table 1 is then as follows:
Table name: user table;
Field name: user ID card number; Type: String; Length: 0-18;
Field name: user name; Type: String; Length: 0-20;
Field name: user age; Type: String; Length: 0-20.
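One way to represent this metadata in code is sketched below. The class and attribute names are hypothetical and chosen only for illustration; the field values are the ones listed for Table 1, and the optional mapping models the case where a table has no field specified attributes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FieldMetadata:
    """General metadata for one table field."""
    name: str
    type: str
    min_length: int
    max_length: int


@dataclass
class TableMetadata:
    """General metadata plus optional field specified attributes."""
    table_name: str
    fields: List[FieldMetadata]
    # Field name -> specified attribute; stays empty when no field has one.
    specified_attributes: Dict[str, str] = field(default_factory=dict)

    def has_specified_attributes(self) -> bool:
        return bool(self.specified_attributes)


user_table = TableMetadata(
    table_name="user table",
    fields=[
        FieldMetadata("user ID card number", "String", 0, 18),
        FieldMetadata("user name", "String", 0, 20),
        FieldMetadata("user age", "String", 0, 20),
    ],
)
```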
In an exemplary implementation, extracting the target metadata information of the target data table from the specified database in step S201 may include: calling the Java Database Connectivity (JDBC) interface according to the user-configured connection information of the specified database to establish a connection with the specified database; and calling the DatabaseMetaData interface provided by the JDBC driver package to query the columns of the target data table and obtain its table name, field names, field descriptions and field specified attributes.
For example, a field specified attribute may be whether the field is a primary key, whether the field may be null, and so on.
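The extraction step can be illustrated with a small runnable sketch. Note the substitution: the text uses JDBC and DatabaseMetaData, whereas this sketch uses Python's built-in sqlite3 module and SQLite's `PRAGMA table_info` as a stand-in, so that it runs without a database server. The function name and the example table are hypothetical.

```python
import sqlite3


def extract_metadata(conn: sqlite3.Connection, table_name: str) -> dict:
    """Read table name, field names, types and two specified attributes."""
    # PRAGMA table_info yields (cid, name, type, notnull, default, pk) per column.
    # table_name is assumed to be a trusted identifier here.
    rows = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
    return {
        "table_name": table_name,
        "fields": [
            {
                "name": name,
                "type": col_type,
                "is_primary_key": pk > 0,
                "nullable": notnull == 0,
            }
            for _cid, name, col_type, notnull, _default, pk in rows
        ],
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_table (id_number TEXT PRIMARY KEY, name TEXT NOT NULL, age TEXT)"
)
meta = extract_metadata(conn, "user_table")
```

With JDBC, the analogous information would come from the driver's metadata interface rather than from a PRAGMA; the shape of the result (general metadata plus per-field attributes) is the point of the sketch.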
In an exemplary implementation, copying the content of the target data table to the intermediate table in the data warehouse in step S201 may include:
according to the target metadata information, using a data warehouse tool to create locally an intermediate database with the same name as the specified database where the target data table is located, and creating in the intermediate database an intermediate table with the same table name and the same table structure as the target data table;
reading the content data stored in the target data table and writing the content data read into the intermediate table.
The data warehouse tool may be Hive.
It should be noted that only one intermediate database needs to be created for all the data tables in one raw database.
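The copy step can be sketched as follows. This is a simplified stand-in, not the Hive-based implementation: two in-memory SQLite databases play the roles of the specified database and the data warehouse, and the function name is hypothetical.

```python
import sqlite3


def copy_to_intermediate(source: sqlite3.Connection,
                         warehouse: sqlite3.Connection,
                         table_name: str) -> int:
    """Create a same-named, same-structure table in the warehouse and copy all rows."""
    # Reuse the source table's own CREATE statement to reproduce its structure.
    (create_sql,) = source.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    warehouse.execute(create_sql)

    # table_name is assumed to be a trusted identifier here.
    rows = source.execute(f"SELECT * FROM {table_name}").fetchall()
    if rows:
        placeholders = ", ".join("?" for _ in rows[0])
        warehouse.executemany(
            f"INSERT INTO {table_name} VALUES ({placeholders})", rows
        )
    warehouse.commit()
    return len(rows)


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE user_table (id_number TEXT, name TEXT, age TEXT)")
source.executemany(
    "INSERT INTO user_table VALUES (?, ?, ?)",
    [("A001", "Li", "30"), ("A002", "Wang", "41")],
)
warehouse = sqlite3.connect(":memory:")
copied = copy_to_intermediate(source, warehouse, "user_table")
```

Because cleaning then runs against the copy, the source database is only read once and is not loaded by the cleaning operations themselves.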
After all the content of the target data table has been copied to the intermediate table, the metadata information of the intermediate table is also stored locally (for example, on the aforementioned server B). When the metadata information does not include field specified attributes, the metadata information of the intermediate table is identical to that of the target data table. When the metadata information includes field specified attributes, there are two cases. If the user has not modified the field specified attributes of the intermediate table, the metadata information of the intermediate table is still identical to that of the target data table. If the user has modified the field specified attributes of the intermediate table, the general metadata information of the intermediate table is the same as that of the target data table, but the field specified attributes of the two differ.
In the data cleaning equipment (for example, the aforementioned server B), the general metadata information and the field specified attributes of a table can be stored separately. The data cleaning equipment stores the general metadata information of the target data table, the general metadata information of the intermediate table, the field specified attributes of the target data table, and the field specified attributes of the intermediate table. The field specified attributes are stored in the correspondence between table fields and specified attributes.
In an exemplary implementation, after the content of the target data table is copied to the intermediate table in the data warehouse, the metadata information may be confirmed according to a metadata confirmation instruction from the user. In this case, the user has not modified the field specified attributes of the table.
In an exemplary implementation, after the content of the target data table is copied to the intermediate table in the data warehouse, the field specified attributes in the metadata information of the intermediate table may be modified according to a metadata modification instruction from the user; after the modification, the metadata is confirmed according to a metadata confirmation instruction from the user.
In many traditional transactional databases, many data tables use an auto-increment column as the primary key. Such primary keys carry no business meaning or value of their own, whereas non-primary-key fields in the table, such as certificate number, taxpayer number or invoice number, are the master data of the industry. In this case, the user can modify the primary key column of the intermediate table in the data warehouse according to this example. In special cases, when a raw data table is referenced by different model operations, the field used as the primary key differs between them. A scene identifier (GroupId) can therefore be added to the primary key configuration to distinguish different application scenarios.
Besides the primary key, the user can also use a metadata modification instruction to set specified columns of the intermediate table as non-null.
In an exemplary implementation, reading the content data stored in the target data table and writing it into the intermediate table may include: extracting the data in the target data table through the JDBC data source interface (JDBC Data Source), and importing the extracted data into the intermediate table with a HiveContext INSERT INTO SQL statement. This data copying function can be implemented as a Spark application (App).
In step S202, each data table corresponds to one cleaning task, and different data tables correspond to different cleaning tasks.
A cleaning task includes at least one cleaning rule. A cleaning rule indicates how the data table is to be cleaned.
For example, a cleaning rule may be a non-numeric value filter rule, a date standardization rule, a value range filter rule, and so on.
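The three example rules might look like the following in code. The patent does not define their exact behavior, so the function names, the accepted date formats and the target format `YYYY-MM-DD` are assumptions made for this sketch.

```python
from datetime import datetime
from typing import Optional


def non_numeric_filter(value: str) -> bool:
    """Filter rule: True if the value is legal, i.e. consists only of ASCII digits."""
    return value.isascii() and value.isdigit()


def value_range_filter(value: str, lowest: int, highest: int) -> bool:
    """Filter rule: True if the value is an integer within [lowest, highest]."""
    return non_numeric_filter(value) and lowest <= int(value) <= highest


def standardize_date(value: str) -> Optional[str]:
    """Standardization rule: rewrite a few assumed input formats as YYYY-MM-DD.

    Returns None when the value matches none of the assumed formats.
    """
    for fmt in ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```

A filter rule answers a yes/no question about a value, while a standardization rule returns a rewritten value; keeping the two shapes distinct makes it easy to dispatch on rule type later.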
In an exemplary implementation, step S202 may include:
finding, in the established correspondence between table fields and standard data elements, the target data element that matches each field in the target metadata information;
finding, in the established correspondence between standard data elements and cleaning rules, the first cleaning rule that matches the target data element;
generating the cleaning task according to the first cleaning rule.
A data element may be an industry-standard data element defined at the national level, or a proprietary data element that an enterprise or government body defines from its historical data and business scenarios. A data element defines requirements that a field should meet, such as data type, data format and value range. Through the correspondence between standard data elements and cleaning rules, these requirements are ultimately resolved into various standardized cleaning rules.
An example follows. The metadata information of Table 1 above includes three fields: user ID card number, user name, and user age. Assume the correspondence between table fields and standard data elements is as shown in Table 2:
Table 2: Correspondence between table fields and standard data elements
Table field | Standard data element
User ID card number | ID card number
User name | String length
User age | Numeric min/max
…… | ……
Table 2 then shows that the standard data element for the field "user ID card number" is "ID card number", the standard data element for the field "user name" is "string length", and the standard data element for the field "user age" is "numeric min/max".
Assume the correspondence between standard data elements and cleaning rules is as shown in Table 3.
Table 3: Correspondence between standard data elements and cleaning rules
Table 3 shows that the cleaning rules for the standard data element "ID card number" are the "ID card 15-digit to 18-digit conversion rule" and the "ID card validity filter rule", the cleaning rule for the standard data element "string length" is "string length filter", and the cleaning rule for the standard data element "numeric min/max" is the "value range filter rule".
The cleaning task for Table 1 therefore includes four first cleaning rules: the ID card 15-digit to 18-digit conversion rule, the ID card validity filter rule, string length filter, and the value range filter rule.
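The two-step lookup in this example can be sketched with plain dictionaries. The dictionary contents restate the correspondences described for Tables 2 and 3; the variable and function names are hypothetical.

```python
from typing import Dict, List

# Correspondence between table fields and standard data elements (Table 2).
FIELD_TO_ELEMENT: Dict[str, str] = {
    "user ID card number": "ID card number",
    "user name": "string length",
    "user age": "numeric min/max",
}

# Correspondence between standard data elements and cleaning rules (Table 3).
ELEMENT_TO_RULES: Dict[str, List[str]] = {
    "ID card number": [
        "ID card 15-digit to 18-digit conversion rule",
        "ID card validity filter rule",
    ],
    "string length": ["string length filter"],
    "numeric min/max": ["value range filter rule"],
}


def first_cleaning_rules(field_names: List[str]) -> List[str]:
    """Map each field to its data element, then collect that element's rules."""
    rules: List[str] = []
    for name in field_names:
        element = FIELD_TO_ELEMENT.get(name)
        for rule in ELEMENT_TO_RULES.get(element, []):
            if rule not in rules:  # keep each rule once, in first-seen order
                rules.append(rule)
    return rules


table_1_fields = ["user ID card number", "user name", "user age"]
table_1_first_rules = first_cleaning_rules(table_1_fields)
```

Fields with no matching data element simply contribute no rules, so the same function can be run over every table in a database without per-table configuration.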
Cleaning rules may be of two types: filter rules and standardization rules. For example, standardization rules include the date standardization rule, the synonym standardization rule, the ID card standardization rule, and so on.
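As a concrete instance of each type, the sketch below pairs an ID card standardization rule (15-digit to 18-digit conversion) with an ID card validity filter rule. The patent does not spell out either algorithm; this sketch assumes the check-character scheme of the GB 11643-1999 citizen identity number standard and assumes that every 15-digit number encodes a birth year in the 1900s. The function names are hypothetical, and the validity check covers only length, digits and the check character.

```python
# Position weights and check characters of the assumed GB 11643-1999 scheme.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def check_char(first17: str) -> str:
    """Compute the 18th character from the first 17 digits."""
    total = sum(int(digit) * weight for digit, weight in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]


def id15_to_id18(id15: str) -> str:
    """Standardization rule: expand a 15-digit number to 18 characters."""
    if len(id15) != 15 or not (id15.isascii() and id15.isdigit()):
        raise ValueError("expected exactly 15 digits")
    first17 = id15[:6] + "19" + id15[6:]  # assumption: birth year is 19xx
    return first17 + check_char(first17)


def is_valid_id18(id18: str) -> bool:
    """Filter rule: True if length, digits and check character are consistent."""
    if len(id18) != 18:
        return False
    body = id18[:17]
    if not (body.isascii() and body.isdigit()):
        return False
    return check_char(body) == id18[17].upper()
```

Running the standardization rule before the filter rule lets old 15-digit values pass the validity check instead of being discarded as dirty data.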
In an exemplary implementation, where the target metadata information includes field specified attributes, step S202 may further include:
finding, in the established correspondence between table fields and specified attributes, the target attribute that matches each field in the target metadata information;
finding, in the established correspondence between attributes and cleaning rules, the second cleaning rule that matches the target attribute;
in which case generating the target cleaning task of the target data table according to the first cleaning rule comprises: generating the target cleaning task of the target data table according to the first cleaning rule and the second cleaning rule.
For example, assume the correspondence between table fields and specified attributes is as shown in Table 4.
Table 4: Correspondence between table fields and specified attributes
Table field | Specified attribute
User ID card number | Primary key field
User name | Value nullability
User age | None
…… | ……
Table 4 then shows that the specified attribute for the field "user ID card number" of Table 1 is "primary key field" and the specified attribute for the field "user name" of Table 1 is "value nullability". That is, Table 1 has two target attributes: primary key field and value nullability.
Assume the correspondence between attributes and cleaning rules is as shown in Table 5.
Table 5: Correspondence between attributes and cleaning rules
Table 5 shows that the cleaning rule for the attribute "primary key field" is the "deduplication rule" and the cleaning rule for the attribute "value nullability" is the "null value filter rule".
The cleaning task for Table 1 therefore also includes two second cleaning rules: the deduplication rule and the null value filter rule.
In total, the final cleaning task for Table 1 includes six cleaning rules: the ID card 15-digit to 18-digit conversion rule, the ID card validity filter rule, string length filter, the value range filter rule, the deduplication rule, and the null value filter rule.
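The attribute-based lookup and the merge into one cleaning task can be sketched as below. The dictionaries restate the correspondences described for Tables 4 and 5, and the list of first cleaning rules is taken as given from the earlier example; all names are hypothetical.

```python
from typing import Dict, List, Optional

# Correspondence between table fields and specified attributes (Table 4).
FIELD_TO_ATTRIBUTE: Dict[str, Optional[str]] = {
    "user ID card number": "primary key field",
    "user name": "value nullability",
    "user age": None,  # no specified attribute
}

# Correspondence between attributes and cleaning rules (Table 5).
ATTRIBUTE_TO_RULES: Dict[str, List[str]] = {
    "primary key field": ["deduplication rule"],
    "value nullability": ["null value filter rule"],
}

# First cleaning rules, as derived from the standard data elements.
FIRST_RULES: List[str] = [
    "ID card 15-digit to 18-digit conversion rule",
    "ID card validity filter rule",
    "string length filter",
    "value range filter rule",
]


def second_cleaning_rules(field_names: List[str]) -> List[str]:
    """Collect the rules matching each field's specified attribute, if any."""
    rules: List[str] = []
    for name in field_names:
        attribute = FIELD_TO_ATTRIBUTE.get(name)
        if attribute is not None:
            rules.extend(ATTRIBUTE_TO_RULES.get(attribute, []))
    return rules


def build_cleaning_task(first: List[str], second: List[str]) -> List[str]:
    """Merge both rule lists, dropping duplicates while keeping order."""
    return list(dict.fromkeys(first + second))


fields = ["user ID card number", "user name", "user age"]
task = build_cleaning_task(FIRST_RULES, second_cleaning_rules(fields))
```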
As can be seen, in step S202 the cleaning rules in the cleaning task are obtained automatically from the extracted metadata information. The user does not need to drag cleaning rules around in a human-computer interaction interface or configure the cleaning task through SQL commands, so the entire cleaning process can run automatically and multiple data tables can be cleaned in batches, which is very flexible.
It should be noted that the established correspondence between table fields and specified attributes mentioned above refers to the correspondence between the table fields and specified attributes of the intermediate table. Where the user has modified the field specified attributes of the intermediate table, it refers to the correspondence between table fields and specified attributes after the modification.
Therefore, in an exemplary implementation, before finding, in the established correspondence between table fields and specified attributes, the target attribute that matches each field in the target metadata information, the method further includes:
receiving attribute modification information for a specific field in the target metadata information, the attribute modification information indicating that the specified attribute of the specific field is to be changed to a designated attribute;
modifying the specified attribute of the specific field in the intermediate table according to the attribute modification information;
updating the correspondence between table fields and specified attributes for the intermediate table according to the modification of the specified attribute of the specific field;
in which case finding, in the established correspondence between table fields and specified attributes, the target attribute that matches each field in the target metadata information comprises: finding, in the updated correspondence between table fields and specified attributes, the target attribute that matches each field in the target metadata information.
In step S203, cleaning is performed on the intermediate table in the data warehouse. The content of the intermediate table is identical to that of the target data table, so executing the cleaning operation on the intermediate table is equivalent to executing it on the target data table, and the cleaning result obtained for the intermediate table is the cleaning result of the target data table.
Executing the cleaning operation on the intermediate table according to each cleaning rule in the cleaning task means that each row of data in the intermediate table is cleaned with every cleaning rule in the cleaning task. If any cleaning rule judges the row to be dirty data, the row is written to the problem table, or is written to the standard table together with the problem identification fields; if none of the cleaning rules judges the row to be dirty data, the row is written to the standard table.
In one example, the cleaning process for a target row of the intermediate table corresponding to the target data table may be as follows:
for each cleaning rule in the target cleaning task, read the data of the field corresponding to that cleaning rule from the target row of the intermediate table;
if the cleaning rule is a filtering rule, determine according to the rule whether the data read is valid; if the cleaning rule is a normalization rule, determine according to the rule whether the data read needs to be normalized;
if all filtering rules in the target cleaning task determine that the data is valid, and the data of the target row has either been normalized by, or been judged not to need normalization by, all normalization rules in the target cleaning task, determine that the data of the target row is not dirty data and write it to the standard table;
if any filtering rule in the target cleaning task determines that the data is invalid, determine that the data of the target row is dirty data and write it to the problem table, or write it to the standard table together with the problem identification fields.
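The per-row process above can be sketched as a loop over the rules of the cleaning task. This is a minimal sketch under assumptions: the classes `FilterRule` and `NormalizeRule`, the function names, and the sample fields `name` and `age` are hypothetical, and only the variant that routes dirty rows to a separate problem table is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


@dataclass
class FilterRule:
    field: str
    is_valid: Callable[[str], bool]


@dataclass
class NormalizeRule:
    field: str
    needs_normalizing: Callable[[str], bool]
    normalize: Callable[[str], str]


def clean_row(row: Row, rules: list) -> Tuple[Row, bool]:
    """Apply every rule in task order; return (row, is_dirty)."""
    cleaned = dict(row)
    for rule in rules:
        value = cleaned[rule.field]
        if isinstance(rule, FilterRule):
            if not rule.is_valid(value):
                return dict(row), True  # any failed filter marks the row dirty
        elif rule.needs_normalizing(value):
            cleaned[rule.field] = rule.normalize(value)
    return cleaned, False


def clean_table(rows: List[Row], rules: list) -> Tuple[List[Row], List[Row]]:
    """Split rows into a standard table and a problem table."""
    standard_table, problem_table = [], []
    for row in rows:
        result, is_dirty = clean_row(row, rules)
        (problem_table if is_dirty else standard_table).append(result)
    return standard_table, problem_table


# Hypothetical rules: "age" must be numeric, "name" is trimmed when padded.
rules = [
    FilterRule("age", str.isdigit),
    NormalizeRule("name", lambda v: v != v.strip(), str.strip),
]
standard, problem = clean_table(
    [{"name": " Li ", "age": "30"}, {"name": "Wang", "age": "n/a"}], rules
)
```

A dirty row is returned unmodified so that the problem table preserves the original values for later inspection; that choice is a design assumption of this sketch rather than something the text prescribes.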
For example, the cleaning task corresponding to Table 1 above includes six cleaning rules. The cleaning process is illustrated below using the first row of Table 1 together with the ID card validity filtering rule and the ID card 15-digit to 18-digit conversion rule:
Read the data "xxxxxxxxxxxxxxxxx1" of the user ID card field in the first row of Table 1, and judge according to the ID card validity filtering rule whether "xxxxxxxxxxxxxxxxx1" is valid. If it is valid, proceed to the next operation; if it is not valid, write the entire first row of Table 1 to the problem table and end the data cleaning process for the first row;
Read the data "xxxxxxxxxxxxxxxxx1" of the user ID card field in the first row of Table 1, and judge according to the ID card 15-digit to 18-digit conversion rule whether "xxxxxxxxxxxxxxxxx1" needs to be converted. If it does, convert "xxxxxxxxxxxxxxxxx1" to an 18-digit ID card number and proceed to the next operation after the conversion; if it does not, proceed directly to the next operation;
……
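The two ID card rules used in this walk-through can be sketched as follows. The patent does not spell out how either rule is computed; this sketch assumes the check-digit scheme of the national standard GB 11643-1999 (weighted sum modulo 11) and assumes that every 15-digit number has a birth year in the 1900s. The function names are hypothetical.

```python
# Weights and check codes of the GB 11643-1999 scheme (assumed to apply here).
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CODES = "10X98765432"


def check_digit(first17: str) -> str:
    """Compute the 18th character from the first 17 digits."""
    total = sum(int(d) * w for d, w in zip(first17, WEIGHTS))
    return CHECK_CODES[total % 11]


def is_valid_18(id_number: str) -> bool:
    """Validity filter: 17 digits followed by the matching check character."""
    body, last = id_number[:17], id_number[17:]
    return len(id_number) == 18 and body.isdigit() and check_digit(body) == last.upper()


def needs_conversion(id_number: str) -> bool:
    """Normalization check: only 15-digit numbers need converting."""
    return len(id_number) == 15 and id_number.isdigit()


def convert_15_to_18(id15: str) -> str:
    """Insert the century "19" after the 6-digit region code, then append the check digit."""
    if not needs_conversion(id15):
        raise ValueError("expected a 15-digit ID card number")
    body = id15[:6] + "19" + id15[6:]
    return body + check_digit(body)


converted = convert_15_to_18("110105491231002")
```

The sample number is a widely used documentation example, not real personal data. A production rule would also validate the region code and birth date, which this sketch leaves out.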
In step S204, the target standard table may be sent by another Spark App to the equipment where the specified database resides. For example, step S204 may include:
extracting the data in the standard table of the data warehouse through a Hive SELECT SQL statement (HiveContext Select), and writing the extracted data into the specified database through an INSERT INTO SQL statement issued over a JDBC data source (JDBC Data Source).
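The select-then-insert pattern of step S204 can be illustrated without a Spark cluster. The sketch below is a stand-in under loud assumptions: two in-memory SQLite databases play the roles of the data warehouse and the specified database, and the function `copy_table`, the table name `standard_table`, and its columns are hypothetical. In the setup the text describes, the read would go through Hive and the write through a JDBC data source instead.

```python
import sqlite3
from typing import List


def copy_table(src: sqlite3.Connection, dst: sqlite3.Connection,
               table: str, columns: List[str]) -> int:
    """SELECT all rows of `table` from src and INSERT them INTO dst."""
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = src.execute(f"SELECT {col_list} FROM {table}").fetchall()
    dst.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_list})")
    dst.executemany(
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})", rows
    )
    dst.commit()
    return len(rows)


# Stand-in for the data warehouse holding the cleaned standard table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE standard_table (id, name)")
warehouse.executemany(
    "INSERT INTO standard_table VALUES (?, ?)", [(1, "a"), (2, "b")]
)

# Stand-in for the specified database that receives the table.
target_db = sqlite3.connect(":memory:")
copied = copy_table(warehouse, target_db, "standard_table", ["id", "name"])
```

Table and column names are interpolated directly into the SQL text here, which is acceptable only because they are fixed by the sketch; names taken from user input would need validation.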
After step S204, at least one table is added to the specified database: the target standard table. If problem data is written to a problem table, the problem table is also sent to the specified database, in which case two tables are added to the specified database: the target standard table and the problem table.
In an exemplary implementation, the number of target data tables is at least one.
This example allows multiple data tables to be cleaned in batches, which effectively shortens the overall cleaning time for the entire set of target data and further improves the processing efficiency of the cleaning process.
In an exemplary implementation, the data cleaning method may further include:
during execution of the cleaning operation on the intermediate table, writing the problem data in the intermediate table, together with the problem identification fields corresponding to the problem data, to the standard table; or
during execution of the cleaning operation on the intermediate table, writing the problem data in the intermediate table to the problem table.
In this example, two ways of handling problem data are provided: first, the problem data is discarded into the problem table; second, the problem data is written to the standard table together with the problem identification fields.
In one example, there may be two problem identification fields: a dirty flag (dirty_flag) and a dirty type (dirty_type).
The value of dirty_flag is 0 or 1. For data judged to be problem data by a filtering rule, and for data that a normalization rule failed to convert, dirty_flag may be set to 1.
The value of dirty_type is a 128-character binary string that defaults to all 0s. Each character is 0 or 1, and the characters are indexed 0 to 127 from left to right. Each character of dirty_type corresponds to a rule: a character set to 1 indicates that the corresponding rule judged the value invalid, and a character set to 0 indicates that the corresponding rule judged the value valid.
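The two problem identification fields can be sketched with plain string operations. This is a minimal sketch under assumptions: the function names are hypothetical, the text does not say which position belongs to which rule (positions 3 and 17 below are arbitrary), and deriving dirty_flag from dirty_type is a simplification of this sketch rather than something the text states.

```python
from typing import List

DIRTY_TYPE_LENGTH = 128


def new_dirty_type() -> str:
    """Default value: 128 characters, all '0'."""
    return "0" * DIRTY_TYPE_LENGTH


def mark_rule_failed(dirty_type: str, rule_index: int) -> str:
    """Set the character at rule_index (0-127, left to right) to '1'."""
    if not 0 <= rule_index < DIRTY_TYPE_LENGTH:
        raise IndexError("rule_index must be between 0 and 127")
    return dirty_type[:rule_index] + "1" + dirty_type[rule_index + 1:]


def failed_rules(dirty_type: str) -> List[int]:
    """List the positions whose rule judged the value invalid."""
    return [i for i, ch in enumerate(dirty_type) if ch == "1"]


def dirty_flag(dirty_type: str) -> int:
    """Assumed derivation: 1 if any rule failed, otherwise 0."""
    return 1 if "1" in dirty_type else 0


row_dirty_type = mark_rule_failed(mark_rule_failed(new_dirty_type(), 3), 17)
```

Keeping one position per rule lets a single field record every rule a row failed, instead of only the first one.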
The processing of problem data is not carried out in the data warehouse.
When the user needs to process problem data, the problem data can be queried in the following ways:
Method one: when the problem data is stored in the problem table, the data in the problem table can be queried with the command "Select * from [problem table]";
Method two: when the problem data is stored in the standard table, the problem data in the standard table can be queried with the command "Select * from [standard table] where dirty_flag = '1'".
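Both query methods can be tried against an in-memory SQLite database. This is only an illustrative stand-in for the specified database: the table names `problem_table` and `standard_table`, the column `id_card`, and the sample values are hypothetical, and dirty_flag is stored as text so that the comparison with '1' matches the query form given above.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Method one: problem data lives in its own problem table.
db.execute("CREATE TABLE problem_table (id_card TEXT)")
db.execute("INSERT INTO problem_table VALUES ('bad-value')")
from_problem_table = db.execute("SELECT * FROM problem_table").fetchall()

# Method two: problem data stays in the standard table, marked by dirty_flag.
db.execute("CREATE TABLE standard_table (id_card TEXT, dirty_flag TEXT)")
db.executemany(
    "INSERT INTO standard_table VALUES (?, ?)",
    [("11010519491231002X", "0"), ("bad-value", "1")],
)
from_standard_table = db.execute(
    "SELECT * FROM standard_table WHERE dirty_flag = '1'"
).fetchall()
```

With method two, clean rows remain available through the complementary filter on dirty_flag, so no second table has to be maintained.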
In an exemplary implementation, the data cleaning method may further include:
during execution of the cleaning operation on the intermediate table, sending the problem data in the intermediate table to a problem data backtracking device, so that the problem data backtracking device modifies the problem data to obtain corrected data;
writing the corrected data to the target standard table.
In this example, corrected data is the compliant data obtained by modifying problem data. For example, if the ID card number in a record has 15 digits while a compliant standard record requires 18 digits, the record is modified by changing the ID card number to 18 digits, which yields the corrected data.
In the data cleaning method provided by the embodiments of the present invention, the target metadata information of a target data table is extracted from a specified database, and the content of the target data table is copied to an intermediate table in a data warehouse; the target cleaning task of the target data table, which includes at least one cleaning rule, is determined according to the target metadata information; in the data warehouse, the cleaning operation is executed on the intermediate table according to each cleaning rule in the target cleaning task to obtain the target standard table corresponding to the target data table; and the target standard table is sent to the equipment where the specified database resides, so that the equipment stores the target standard table in the specified database. The cleaning task of a data table can thus be generated automatically from the extracted metadata information, and the data table can be cleaned automatically according to that task. The entire cleaning process runs automatically without manual intervention, and all data tables in a database or system can be cleaned in batches, which effectively shortens the time required for cleaning and improves the processing efficiency of data cleaning.
Based on the above method embodiments, the embodiments of the present invention also provide corresponding device, equipment, and storage medium embodiments. For the detailed implementation of the device, equipment, and storage medium embodiments, refer to the corresponding description of the foregoing method embodiments.
Fig. 3 is a functional block diagram of the data cleaning device provided by an embodiment of the present invention. As shown in Fig. 3, in this embodiment the data cleaning device may include:
an extraction and replication module 310, configured to extract the target metadata information of a target data table from a specified database, and copy the content of the target data table to an intermediate table in a data warehouse;
a task determining module 320, configured to determine the target cleaning task of the target data table according to the target metadata information, the target cleaning task including at least one cleaning rule;
a cleaning module 330, configured to execute a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task to obtain a target standard table;
a sending module 340, configured to send the target standard table to the equipment where the specified database resides, so that the equipment stores the target standard table in the specified database.
In an exemplary implementation, the task determining module 320 is specifically configured to:
find, from the established correspondence between table fields and standard data elements, the target data element matching each field in the target metadata information;
find, from the established correspondence between standard data elements and cleaning rules, the first cleaning rule matching the target data element;
generate the target cleaning task of the target data table according to the first cleaning rule and the second cleaning rule.
In an exemplary implementation, the target metadata information includes the specified attributes of table fields, and the task determining module 320 is further configured to:
find, from the established correspondence between table fields and specified attributes, the target attribute matching each field in the target metadata information;
find, from the established correspondence between attributes and cleaning rules, the second cleaning rule matching the target attribute;
when generating the target cleaning task of the target data table according to the first cleaning rule, the task determining module 320 is specifically configured to: generate the target cleaning task of the target data table according to the first cleaning rule and the second cleaning rule.
In an exemplary implementation, the device further includes:
a modification information receiving module, configured to receive attribute modification information for a specific field in the target metadata information, where the attribute modification information indicates that the specified attribute of the specific field is to be changed to a designated attribute;
an attribute modification module, configured to modify the specified attribute of the specific field in the intermediate table according to the attribute modification information;
a relationship update module, configured to update, according to the modification of the specified attribute of the specific field, the correspondence between the table fields of the intermediate table and the specified attributes;
when finding the target attribute matching each field in the target metadata information from the established correspondence between table fields and specified attributes, the task determining module 320 is specifically configured to: find the target attribute matching each field in the target metadata information from the updated correspondence between table fields and specified attributes.
In an exemplary implementation, the number of target data tables is at least one.
In an exemplary implementation, the device further includes:
a data writing module, configured to, during execution of the cleaning operation on the intermediate table, write the problem data in the intermediate table, together with the problem identification fields corresponding to the problem data, to the standard table; or configured to, during execution of the cleaning operation on the intermediate table, write the problem data in the intermediate table to the problem table.
In an exemplary implementation, the device further includes:
a problem table sending module, configured to send the problem table to the equipment where the specified database resides, so that the equipment stores the problem table in the specified database.
In an exemplary implementation, the device is further configured to:
during execution of the cleaning operation on the intermediate table, send the problem data in the intermediate table to a problem data backtracking device, so that the problem data backtracking device modifies the problem data to obtain corrected data;
write the corrected data to the target standard table.
An embodiment of the present invention also provides data cleaning equipment. Fig. 4 is a hardware structure diagram of the data cleaning equipment provided by an embodiment of the present invention. As shown in Fig. 4, the data cleaning equipment includes an internal bus 401, and a memory 402, a processor 403, and an external interface 404 connected through the internal bus.
The processor 403 is configured to read machine-readable instructions from the memory 402 and execute the instructions to implement the following operations:
extracting the target metadata information of a target data table from a specified database, and copying the content of the target data table to an intermediate table in a data warehouse;
determining the target cleaning task of the target data table according to the target metadata information, the target cleaning task including at least one cleaning rule;
in the data warehouse, executing a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task to obtain a target standard table;
sending the target standard table to the equipment where the specified database resides, so that the equipment stores the target standard table in the specified database.
In an exemplary implementation, the processor 403 executes the instructions to further implement the following operations:
finding, from the established correspondence between table fields and standard data elements, the target data element matching each field in the target metadata information;
finding, from the established correspondence between standard data elements and cleaning rules, the first cleaning rule matching the target data element;
generating the target cleaning task of the target data table according to the first cleaning rule.
In an exemplary implementation, the target metadata information includes the specified attributes of table fields, and the processor 403 executes the instructions to further implement the following operations:
finding, from the established correspondence between table fields and specified attributes, the target attribute matching each field in the target metadata information;
finding, from the established correspondence between attributes and cleaning rules, the second cleaning rule matching the target attribute;
generating the target cleaning task of the target data table according to the first cleaning rule is specifically: generating the target cleaning task of the target data table according to the first cleaning rule and the second cleaning rule.
In an exemplary implementation, the processor 403 executes the instructions to further implement the following operations:
receiving attribute modification information for a specific field in the target metadata information, where the attribute modification information indicates that the specified attribute of the specific field is to be changed to a designated attribute;
modifying the specified attribute of the specific field in the intermediate table according to the attribute modification information;
updating, according to the modification of the specified attribute of the specific field, the correspondence between the table fields of the intermediate table and the specified attributes;
finding the target attribute matching each field in the target metadata information from the established correspondence between table fields and specified attributes is specifically: finding the target attribute matching each field in the target metadata information from the updated correspondence between table fields and specified attributes.
In an exemplary implementation, the number of target data tables is at least one.
In an exemplary implementation, the processor 403 executes the instructions to further implement the following operations:
during execution of the cleaning operation on the intermediate table, writing the problem data in the intermediate table, together with the problem identification fields corresponding to the problem data, to the standard table; or
during execution of the cleaning operation on the intermediate table, writing the problem data in the intermediate table to the problem table.
In an exemplary implementation, the processor 403 executes the instructions to further implement the following operation:
sending the problem table to the equipment where the specified database resides, so that the equipment stores the problem table in the specified database.
In an exemplary implementation, the processor 403 executes the instructions to further implement the following operations:
during execution of the cleaning operation on the intermediate table, sending the problem data in the intermediate table to a problem data backtracking device, so that the problem data backtracking device modifies the problem data to obtain corrected data;
writing the corrected data to the target standard table.
An embodiment of the present invention also provides a computer-readable storage medium on which several computer instructions are stored. When executed, the computer instructions perform the following processing:
extracting the target metadata information of a target data table from a specified database, and copying the content of the target data table to an intermediate table in a data warehouse;
determining the target cleaning task of the target data table according to the target metadata information, the target cleaning task including at least one cleaning rule;
in the data warehouse, executing a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task to obtain a target standard table;
sending the target standard table to the equipment where the specified database resides, so that the equipment stores the target standard table in the specified database.
In an exemplary implementation, the computer instructions, when executed, further perform the following processing:
finding, from the established correspondence between table fields and standard data elements, the target data element matching each field in the target metadata information;
finding, from the established correspondence between standard data elements and cleaning rules, the first cleaning rule matching the target data element;
generating the target cleaning task of the target data table according to the first cleaning rule.
In an exemplary implementation, the target metadata information includes the specified attributes of table fields, and the computer instructions, when executed, further perform the following processing:
finding, from the established correspondence between table fields and specified attributes, the target attribute matching each field in the target metadata information;
finding, from the established correspondence between attributes and cleaning rules, the second cleaning rule matching the target attribute;
generating the target cleaning task of the target data table according to the first cleaning rule is specifically: generating the target cleaning task of the target data table according to the first cleaning rule and the second cleaning rule.
In an exemplary implementation, the computer instructions, when executed, further perform the following processing:
receiving attribute modification information for a specific field in the target metadata information, where the attribute modification information indicates that the specified attribute of the specific field is to be changed to a designated attribute;
modifying the specified attribute of the specific field in the intermediate table according to the attribute modification information;
updating, according to the modification of the specified attribute of the specific field, the correspondence between the table fields of the intermediate table and the specified attributes;
finding the target attribute matching each field in the target metadata information from the established correspondence between table fields and specified attributes is specifically: finding the target attribute matching each field in the target metadata information from the updated correspondence between table fields and specified attributes.
In an exemplary implementation, the number of target data tables is at least one.
In an exemplary implementation, the computer instructions, when executed, further perform the following processing:
during execution of the cleaning operation on the intermediate table, writing the problem data in the intermediate table, together with the problem identification fields corresponding to the problem data, to the standard table; or
during execution of the cleaning operation on the intermediate table, writing the problem data in the intermediate table to the problem table.
In an exemplary implementation, the computer instructions, when executed, further perform the following processing:
sending the problem table to the equipment where the specified database resides, so that the equipment stores the problem table in the specified database.
In an exemplary implementation, the computer instructions, when executed, further perform the following processing:
during execution of the cleaning operation on the intermediate table, sending the problem data in the intermediate table to a problem data backtracking device, so that the problem data backtracking device modifies the problem data to obtain corrected data;
writing the corrected data to the target standard table.
Since the device and equipment embodiments basically correspond to the method embodiments, refer to the description of the method embodiments for the relevant details. The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this specification. Those of ordinary skill in the art can understand and implement it without creative effort.
Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Other embodiments of this specification will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. This specification is intended to cover any variations, uses, or adaptations of this specification that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in this specification. The specification and examples are to be regarded as illustrative only, with the true scope and spirit of this specification indicated by the following claims.
It should be understood that this specification is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of this specification is limited only by the appended claims.
The above are merely preferred embodiments of this specification and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of this specification shall fall within its scope of protection.

Claims (11)

1. A data cleaning method, characterized in that the method comprises:
extracting target metadata information of a target data table from a specified database, and copying the content of the target data table to an intermediate table in a data warehouse;
determining a target cleaning task of the target data table according to the target metadata information, the target cleaning task including at least one cleaning rule;
executing a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task to obtain a target standard table;
sending the target standard table to the equipment where the specified database resides, so that the equipment stores the target standard table in the specified database.
2. The method according to claim 1, characterized in that determining the target cleaning task of the target data table according to the target metadata information comprises:
finding, from the established correspondence between table fields and standard data elements, the target data element matching each field in the target metadata information;
finding, from the established correspondence between standard data elements and cleaning rules, the first cleaning rule matching the target data element;
generating the target cleaning task of the target data table according to the first cleaning rule.
3. The method according to claim 2, characterized in that the target metadata information includes specified attributes of table fields, and after the first cleaning rule matching the target data element is found from the established correspondence between standard data elements and cleaning rules, the method further comprises:
finding, from the established correspondence between table fields and specified attributes, the target attribute matching each field in the target metadata information;
finding, from the established correspondence between attributes and cleaning rules, the second cleaning rule matching the target attribute;
generating the target cleaning task of the target data table according to the first cleaning rule comprises: generating the target cleaning task of the target data table according to the first cleaning rule and the second cleaning rule.
4. The method according to claim 3, characterized in that, before the target attribute matching each field in the target metadata information is found from the established correspondence between table fields and specified attributes, the method further comprises:
receiving attribute modification information for a specific field in the target metadata information, where the attribute modification information indicates that the specified attribute of the specific field is to be changed to a designated attribute;
modifying the specified attribute of the specific field in the intermediate table according to the attribute modification information;
updating, according to the modification of the specified attribute of the specific field, the correspondence between the table fields of the intermediate table and the specified attributes;
finding the target attribute matching each field in the target metadata information from the established correspondence between table fields and specified attributes comprises: finding the target attribute matching each field in the target metadata information from the updated correspondence between table fields and specified attributes.
5. The method according to claim 1, characterized in that the number of target data tables is at least one.
6. The method according to claim 1, characterized in that the method further comprises:
during execution of the cleaning operation on the intermediate table, writing the problem data in the intermediate table, together with the problem identification fields corresponding to the problem data, to the target standard table; or
during execution of the cleaning operation on the intermediate table, writing the problem data in the intermediate table to a problem table.
7. The method according to claim 6, characterized in that the method further comprises:
sending the problem table to the equipment where the specified database resides, so that the equipment stores the problem table in the specified database.
8. The method according to claim 1, characterized in that the method further comprises:
during execution of the cleaning operation on the intermediate table, sending the problem data in the intermediate table to a problem data backtracking device, so that the problem data backtracking device modifies the problem data to obtain corrected data;
writing the corrected data to the target standard table.
9. A data cleaning device, characterized in that the device comprises:
An extraction and replication module, configured to extract the target metadata information of a target data table from a specified database, and to copy the content of the target data table to an intermediate table in a data warehouse;
A task determining module, configured to determine the target cleaning task of the target data table according to the target metadata information, the target cleaning task comprising at least one cleaning rule;
A cleaning module, configured to execute a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task, to obtain a target standard table;
A sending module, configured to send the target standard table to the device where the specified database is located, so that the device stores the target standard table in the specified database.
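The four modules can be traced end to end in a small sketch. It is a simplification under several assumptions: two in-memory SQLite connections stand in for the specified database and the data warehouse, the table names `person`, `person_mid`, and `person_std` and the rules in `RULES` are hypothetical, and the sending step writes straight into the source connection instead of going through a separate device.

```python
import sqlite3

# Specified database holding one target data table (hypothetical sample data).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE person (name TEXT, id_card TEXT)")
source.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [(" Alice ", "110101199001011234"), ("Bob", "bad-id")],
)

# Extraction and replication: read field metadata (name, type) and copy the
# table content to an intermediate table in the warehouse.
metadata = [(col[1], col[2]) for col in source.execute("PRAGMA table_info(person)")]
columns = ", ".join(f"{name} {ctype}" for name, ctype in metadata)
placeholders = ", ".join("?" for _ in metadata)
warehouse = sqlite3.connect(":memory:")
warehouse.execute(f"CREATE TABLE person_mid ({columns})")
warehouse.executemany(
    f"INSERT INTO person_mid VALUES ({placeholders})",
    source.execute("SELECT * FROM person").fetchall(),
)

# Task determination: pick the rules that apply to the fields in the metadata.
RULES = {
    "name": lambda v: v.strip(),
    "id_card": lambda v: v if len(v) == 18 and v[:17].isdigit() else None,
}
task = [(name, RULES[name]) for name, _ in metadata if name in RULES]

# Cleaning: apply every rule of the task to each intermediate row.
field_names = [name for name, _ in metadata]
standard_rows = []
for row in warehouse.execute("SELECT * FROM person_mid").fetchall():
    record = dict(zip(field_names, row))
    for field, rule in task:
        record[field] = rule(record[field])
    standard_rows.append(tuple(record.values()))

# Sending: store the standard table back in the specified database.
source.execute(f"CREATE TABLE person_std ({columns})")
source.executemany(f"INSERT INTO person_std VALUES ({placeholders})", standard_rows)
stored = source.execute("SELECT * FROM person_std ORDER BY rowid").fetchall()
```

The cleaning runs only against the warehouse copy, so the original `person` table in the specified database is never modified.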
10. The device according to claim 9, characterized in that the task determining module is specifically configured to:
Find, from the established correspondence between table fields and standard data elements, the target data element matching each field in the target metadata information;
Find, from the established correspondence between standard data elements and cleaning rules, the first cleaning rule matching the target data element;
Generate the target cleaning task of the target data table according to the first cleaning rule and the second cleaning rule.
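The two-level lookup performed by the task determining module can be sketched as below. All mapping contents are hypothetical. The second cleaning rule is not defined in this section, so the sketch assumes it is a rule supplied independently of the field lookup and attaches it to the task as a table-level rule.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical correspondence: table field -> standard data element.
FIELD_TO_ELEMENT: Dict[str, str] = {
    "sfzh": "id_card_number",
    "id_card": "id_card_number",
    "xm": "person_name",
}

# Hypothetical correspondence: standard data element -> first cleaning rules.
ELEMENT_TO_RULES: Dict[str, List[str]] = {
    "id_card_number": ["length_is_18"],
    "person_name": ["trim_whitespace"],
}

# Assumption: second cleaning rules are configured separately from the lookup.
SECOND_RULES: List[str] = ["drop_duplicate_rows"]


def build_cleaning_task(fields: List[str]) -> List[Tuple[Optional[str], str]]:
    """Return (field, rule) pairs; field is None for table-level rules."""
    task: List[Tuple[Optional[str], str]] = []
    for field in fields:
        element = FIELD_TO_ELEMENT.get(field)
        if element is None:
            continue  # no standard data element matches this field
        for rule in ELEMENT_TO_RULES.get(element, []):
            task.append((field, rule))
    task.extend((None, rule) for rule in SECOND_RULES)
    return task


task = build_cleaning_task(["xm", "sfzh", "remark"])
```

Because rules hang off the standard data element and not the field name, differently named fields such as `sfzh` and `id_card` resolve to the same rule set without per-table configuration.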
11. Data cleaning equipment, characterized by comprising a processor and a memory for storing instructions executable by the processor;
The processor is configured to:
Extract the target metadata information of a target data table from a specified database, and copy the content of the target data table to an intermediate table in a data warehouse;
Determine the target cleaning task of the target data table according to the target metadata information, the target cleaning task comprising at least one cleaning rule;
Execute a cleaning operation on the intermediate table according to each cleaning rule in the target cleaning task, to obtain the target standard table corresponding to the target data table;
Send the target standard table to the device where the specified database is located, so that the device stores the target standard table in the specified database.
CN201910348187.7A 2019-04-28 2019-04-28 Data cleaning method, device and equipment Active CN109977110B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011183142.8A CN112199366A (en) 2019-04-28 2019-04-28 Data table processing method, device and equipment
CN201910348187.7A CN109977110B (en) 2019-04-28 2019-04-28 Data cleaning method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910348187.7A CN109977110B (en) 2019-04-28 2019-04-28 Data cleaning method, device and equipment

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202011183142.8A Division CN112199366A (en) 2019-04-28 2019-04-28 Data table processing method, device and equipment

Publications (2)

Publication Number Publication Date
CN109977110A CN109977110A (en) 2019-07-05
CN109977110B CN109977110B (en) 2020-12-04

Family

ID=67086750

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202011183142.8A Pending CN112199366A (en) 2019-04-28 2019-04-28 Data table processing method, device and equipment
CN201910348187.7A Active CN109977110B (en) 2019-04-28 2019-04-28 Data cleaning method, device and equipment

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202011183142.8A Pending CN112199366A (en) 2019-04-28 2019-04-28 Data table processing method, device and equipment

Country Status (1)

Country Link
CN (2) CN112199366A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569298A (en) * 2019-09-12 2019-12-13 成都中科大旗软件股份有限公司 data docking and visualization method and system
CN110569236A (en) * 2019-09-03 2019-12-13 北京明略软件系统有限公司 Data management method and device
CN110727668A (en) * 2019-09-30 2020-01-24 北京百度网讯科技有限公司 Data cleaning method and device
CN111159177A (en) * 2019-12-10 2020-05-15 大唐软件技术股份有限公司 Data fusion method, device, equipment and medium based on heterogeneous data
CN111258993A (en) * 2020-01-09 2020-06-09 佛山科学技术学院 Method and device for filtering abnormal data of industrial big data
CN111427873A (en) * 2020-03-12 2020-07-17 无码科技(杭州)有限公司 Data cleaning method and system
CN111639066A (en) * 2020-05-14 2020-09-08 杭州数梦工场科技有限公司 Data cleaning method and device
CN111651466A (en) * 2020-05-09 2020-09-11 杭州数梦工场科技有限公司 Data sampling method and device
CN111694824A (en) * 2020-05-25 2020-09-22 智强通达科技(北京)有限公司 Method for mapping and cleaning oil data chain
CN111767267A (en) * 2020-06-18 2020-10-13 杭州数梦工场科技有限公司 Metadata processing method and device and electronic equipment
CN111858566A (en) * 2020-06-15 2020-10-30 邯郸钢铁集团有限责任公司 Real-time data extraction application method
CN112000656A (en) * 2020-09-01 2020-11-27 北京天源迪科信息技术有限公司 Intelligent data cleaning method and device based on metadata
CN112131239A (en) * 2020-09-30 2020-12-25 腾讯科技(深圳)有限公司 Data processing method, computer equipment and readable storage medium
CN112256688A (en) * 2020-11-26 2021-01-22 杭州数梦工场科技有限公司 Service data cleaning method and device and electronic equipment
CN112650754A (en) * 2020-12-24 2021-04-13 浪潮云信息技术股份公司 Method for importing total data of relational database into Hive
CN112800049A (en) * 2021-04-06 2021-05-14 航天神舟智慧系统技术有限公司 EXCEL data source cleaning method and system based on big data, electronic device and storage medium
CN112988804A (en) * 2019-12-12 2021-06-18 陕西西部资信股份有限公司 Data transmission method and system
CN113094415A (en) * 2019-12-23 2021-07-09 北京懿医云科技有限公司 Data extraction method and device, computer readable medium and electronic equipment
CN113392099A (en) * 2021-07-01 2021-09-14 苏州维众数据技术有限公司 Automatic data cleaning method
CN113468155A (en) * 2021-07-05 2021-10-01 杭州数梦工场科技有限公司 Problem data processing method and device
CN113568903A (en) * 2021-06-25 2021-10-29 邯郸钢铁集团有限责任公司 Real-time PLC variable extraction application method
CN113722404A (en) * 2021-07-27 2021-11-30 张博 High-efficiency analysis method for multi-dimensional data organization

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN113360491B (en) * 2021-06-30 2024-03-29 杭州数梦工场科技有限公司 Data quality inspection method, device, electronic equipment and storage medium
CN113590593A (en) * 2021-08-04 2021-11-02 浙江大华技术股份有限公司 Method and device for generating data table information, storage medium and electronic device
CN116415199B (en) * 2023-04-13 2023-10-20 广东铭太信息科技有限公司 Business data outlier analysis method based on audit intermediate table

Citations (4)

Publication number Priority date Publication date Assignee Title
US20060195488A1 (en) * 1999-09-21 2006-08-31 International Business Machines Corporation Method, system, program and data structure for cleaning a database table
CN107229662A (en) * 2016-03-25 2017-10-03 阿里巴巴集团控股有限公司 Data cleaning method and device
CN108062387A (en) * 2017-12-14 2018-05-22 国网陕西省电力公司电力科学研究院 A kind of real time data cleaning and conversion method towards TAS systems
CN108959620A (en) * 2018-07-18 2018-12-07 上海汉得信息技术股份有限公司 A kind of data cleaning method and equipment

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN101290622A (en) * 2007-04-20 2008-10-22 鸿富锦精密工业(深圳)有限公司 Database cleaning system and method
CN101676900A (en) * 2008-09-18 2010-03-24 阿里巴巴集团控股有限公司 Data cleaning method for improving accuracy of target data and cleaning system thereof
CN106709269B (en) * 2017-03-13 2018-08-07 山东众阳软件有限公司 A kind of creation method and system in medical treatment big data warehouse
CN107239581A (en) * 2017-07-07 2017-10-10 小草数语(北京)科技有限公司 Data cleaning method and device

Cited By (29)

Publication number Priority date Publication date Assignee Title
CN110569236A (en) * 2019-09-03 2019-12-13 北京明略软件系统有限公司 Data management method and device
CN110569298A (en) * 2019-09-12 2019-12-13 成都中科大旗软件股份有限公司 data docking and visualization method and system
CN110569298B (en) * 2019-09-12 2023-03-24 成都中科大旗软件股份有限公司 Data docking and visualization method and system
CN110727668A (en) * 2019-09-30 2020-01-24 北京百度网讯科技有限公司 Data cleaning method and device
CN110727668B (en) * 2019-09-30 2022-03-01 北京百度网讯科技有限公司 Data cleaning method and device
CN111159177A (en) * 2019-12-10 2020-05-15 大唐软件技术股份有限公司 Data fusion method, device, equipment and medium based on heterogeneous data
CN111159177B (en) * 2019-12-10 2023-11-07 大唐软件技术股份有限公司 Heterogeneous data-based data fusion method, device, equipment and medium
CN112988804A (en) * 2019-12-12 2021-06-18 陕西西部资信股份有限公司 Data transmission method and system
CN113094415A (en) * 2019-12-23 2021-07-09 北京懿医云科技有限公司 Data extraction method and device, computer readable medium and electronic equipment
CN113094415B (en) * 2019-12-23 2024-03-29 北京懿医云科技有限公司 Data extraction method, data extraction device, computer readable medium and electronic equipment
CN111258993A (en) * 2020-01-09 2020-06-09 佛山科学技术学院 Method and device for filtering abnormal data of industrial big data
CN111427873A (en) * 2020-03-12 2020-07-17 无码科技(杭州)有限公司 Data cleaning method and system
CN111427873B (en) * 2020-03-12 2023-03-14 无码科技(杭州)有限公司 Data cleaning method and system
CN111651466A (en) * 2020-05-09 2020-09-11 杭州数梦工场科技有限公司 Data sampling method and device
CN111651466B (en) * 2020-05-09 2023-07-25 杭州数梦工场科技有限公司 Data sampling method and device
CN111639066A (en) * 2020-05-14 2020-09-08 杭州数梦工场科技有限公司 Data cleaning method and device
CN111694824A (en) * 2020-05-25 2020-09-22 智强通达科技(北京)有限公司 Method for mapping and cleaning oil data chain
CN111858566A (en) * 2020-06-15 2020-10-30 邯郸钢铁集团有限责任公司 Real-time data extraction application method
CN111767267A (en) * 2020-06-18 2020-10-13 杭州数梦工场科技有限公司 Metadata processing method and device and electronic equipment
CN112000656A (en) * 2020-09-01 2020-11-27 北京天源迪科信息技术有限公司 Intelligent data cleaning method and device based on metadata
CN112131239A (en) * 2020-09-30 2020-12-25 腾讯科技(深圳)有限公司 Data processing method, computer equipment and readable storage medium
CN112256688A (en) * 2020-11-26 2021-01-22 杭州数梦工场科技有限公司 Service data cleaning method and device and electronic equipment
CN112650754A (en) * 2020-12-24 2021-04-13 浪潮云信息技术股份公司 Method for importing total data of relational database into Hive
CN112800049A (en) * 2021-04-06 2021-05-14 航天神舟智慧系统技术有限公司 EXCEL data source cleaning method and system based on big data, electronic device and storage medium
CN113568903A (en) * 2021-06-25 2021-10-29 邯郸钢铁集团有限责任公司 Real-time PLC variable extraction application method
CN113392099A (en) * 2021-07-01 2021-09-14 苏州维众数据技术有限公司 Automatic data cleaning method
CN113468155A (en) * 2021-07-05 2021-10-01 杭州数梦工场科技有限公司 Problem data processing method and device
CN113468155B (en) * 2021-07-05 2024-03-29 杭州数梦工场科技有限公司 Question data processing method and device
CN113722404A (en) * 2021-07-27 2021-11-30 张博 High-efficiency analysis method for multi-dimensional data organization

Also Published As

Publication number Publication date
CN109977110B (en) 2020-12-04
CN112199366A (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN109977110A (en) Data cleaning method, device and equipment
US11567997B2 (en) Query language interoperabtility in a graph database
CN107958057B (en) Code generation method and device for data migration in heterogeneous database
CN111459985B (en) Identification information processing method and device
US20210174006A1 (en) System and method for facilitating complex document drafting and management
AU2019204976A1 (en) Intelligent data ingestion system and method for governance and security
CN103810224B (en) information persistence and query method and device
WO2010045331A2 (en) Method and apparatus for gathering and organizing information pertaining to an entity
CN106649503A (en) Query method and system based on sql
CN107741903A (en) Application compatibility method of testing, device, computer equipment and storage medium
CN108319661A (en) A kind of structured storage method and device of spare part information
CN109657803B (en) Construction of machine learning models
CN109308258A (en) Building method, device, computer equipment and the storage medium of test data
CN106802928B (en) Power grid historical data management method and system
CN112948473A (en) Data processing method, device and system of data warehouse and storage medium
EP3594822A1 (en) Intelligent data ingestion system and method for governance and security
US20160125026A1 (en) Proactive query migration to prevent failures
US11797705B1 (en) Generative adversarial network for named entity recognition
CN111061733B (en) Data processing method, device, electronic equipment and computer readable storage medium
CN116452123A (en) Method and device for generating characteristic value of inventory item and computer equipment
CN110263104A (en) JSON character string processing method and device
CN114860727A (en) Zipper watch updating method and device
Vorndran et al. Metadata Sharing–How to Transfer Metadata Information among Work Cluster Members
CN109636303B (en) Storage method and system for semi-automatically extracting and structuring document information
Monaco Methods for in-sourcing authority control with MarcEdit, SQL, and regular expressions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant