CN106874290A - Data cleaning method and device - Google Patents


Info

Publication number
CN106874290A
Authority
CN
China
Prior art keywords
indicator
examination
specific statistics
maintenance task
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510920989.2A
Other languages
Chinese (zh)
Other versions
CN106874290B (en)
Inventor
王立伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201510920989.2A
Publication of CN106874290A
Application granted
Publication of CN106874290B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data cleaning method and device. An indicator statistics maintenance task table and an indicator cleaning code template are set up in advance. When a synchronization time is reached, a data cleaning task is configured according to the indicator statistics maintenance task table whose current state is valid and the indicator cleaning code template, and a run test is performed on the data cleaning task. Only when the run test succeeds is scheduling configured according to the indicator statistics maintenance task table and the indicator cleaning code template and the data cleaning task published to the production environment, so that the data warehouse performs data cleaning. Data cleaning tasks are thereby carried out automatically, which reduces the workload of data warehouse developers and improves data development efficiency.

Description

Data cleaning method and device
Technical field
The present invention relates to the field of communication technology, and in particular to a data cleaning method. The present invention also relates to a data cleaning device.
Background
With the arrival of the DT (Data Technology) era, the value of data is becoming increasingly prominent. For Internet platform operators and service providers, the demand for data across every line of business is reaching an unprecedented level. How to analyze existing data in depth and mine potential value from it has become a primary technical problem for those skilled in the art.
At present, business teams and data processing technical staff are gradually building a closer working relationship, and one important area of cooperation is model deployment. Taking the data processing of a trusted system as an example, the system deploys a set of offline models to identify whether an operation by a given account in a given environment is trusted, and reduces interruptions to users by letting only whitelisted operations through, thereby improving the user experience. The trust model assigns a trust level based on fixed indicators computed for the account under various kinds of environment information (MAC (Media Access Control), UMID (Unique Material Identifier), TID (Thread Identifier), etc.); for example, indicator A > 1 and indicator B > 2 is marked as level one, and indicator A > 3 and indicator B > 4 is marked as level two. The model builders on the business team are responsible for determining the model's indicators and thresholds, while the data processing technical staff are responsible for cleaning the base indicators, deploying the model, and pushing data to the application system, which closes the loop of the whole data pipeline.
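The threshold rule quoted in the example above can be sketched as a small function. This is only an illustration: the function name, the choice to test the stricter rule first, and the use of 0 for "no level assigned" are assumptions, not something the application specifies.

```python
def trust_level(indicator_a: float, indicator_b: float) -> int:
    """Return 2, 1, or 0 (no level assigned) for a pair of indicator values."""
    # Assumption: the stricter rule is checked first, because any pair with
    # A > 3 and B > 4 also satisfies A > 1 and B > 2.
    if indicator_a > 3 and indicator_b > 4:
        return 2
    if indicator_a > 1 and indicator_b > 2:
        return 1
    return 0
```

In this sketch the business team would own the thresholds, and the data processing staff would own the computation of the two indicator values.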
After a model builder submits a model deployment requirement to the data processing technical staff, the requirement has to wait until the technical staff take it up, after which they carry out a series of operations such as development, indicator cleaning, and model deployment to complete the flow. When the data processing technical staff are short of resources, model deployment can be seriously delayed.
It can be seen that how to perform data cleaning automatically and efficiently, so as to solve the problem of limited resources and improve the working efficiency of technical staff, has become a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
The invention provides a data cleaning method in which an indicator statistics maintenance task table and an indicator cleaning code template are set up in advance. The method includes:
when a synchronization time is reached, configuring a data cleaning task according to the indicator statistics maintenance task table whose current state is valid and the indicator cleaning code template, where the indicator statistics maintenance task table contains the elements currently used for indicator cleaning and their corresponding data;
performing a run test on the data cleaning task;
if the run test of the data cleaning task succeeds, performing scheduling configuration according to the indicator statistics maintenance task table and the indicator cleaning code template, and publishing the data cleaning task to the production environment, so that the data warehouse performs data cleaning.
Preferably, performing the run test on the data cleaning task specifically includes:
executing a trial run process according to the data cleaning task, and determining whether the trial run process succeeds;
if the trial run process succeeds, verifying the result data obtained by the trial run process;
if the data passes verification, confirming that the run test of the data cleaning task has succeeded;
if the trial run process fails or the data does not pass verification, confirming that the run test of the data cleaning task has failed.
Preferably, executing the trial run process according to the data cleaning task specifically includes:
running the indicator cleaning code template;
reading the indicator statistics maintenance task table according to the indicator cleaning code template, and parsing the data corresponding to each element in the indicator statistics maintenance task table;
splicing the parsing result with the indicator cleaning code template to generate a SQL statement, and running the SQL statement.
Preferably, before the synchronization time is reached, the method further includes:
obtaining the current state of each indicator statistics maintenance task table;
if there is an indicator statistics maintenance task table whose state is pending business approval, performing business approval on that task table, and after the business approval passes, updating the state of the task table to pending technical review;
if there is an indicator statistics maintenance task table whose state is pending technical review, performing technical review on that task table, and after the technical review passes, updating the state of the task table to valid.
Preferably, the method further includes:
if the business approval of an indicator statistics maintenance task table whose state is pending business approval does not pass, updating the state of the task table to business approval pending modification, and after a task table in that state has been modified, updating its state to pending business approval;
if the technical review of an indicator statistics maintenance task table whose state is pending technical review does not pass, updating the state of the task table to technical review pending modification, and after a task table in that state has been modified, updating its state to pending technical review.
Preferably, before the current state of each indicator statistics maintenance task table is obtained, the method further includes:
when a request to add a data cleaning task is received, generating a new indicator statistics maintenance task table from the data corresponding to each element carried in the request, and setting the state of the new task table to pending business approval;
when a request to modify a data cleaning task is received, generating a new indicator statistics maintenance task table from the data corresponding to the elements to be modified carried in the request and from the original indicator statistics maintenance task table to which the request corresponds, and setting the state of the new task table to pending business approval.
Preferably, the method further includes:
if an indicator statistics maintenance task table whose state is technical review pending modification or business approval pending modification is not modified within a preset time threshold, updating the state of the task table to invalid.
Correspondingly, the application also proposes a data cleaning device in which an indicator statistics maintenance task table and an indicator cleaning code template are set up in advance. The device further includes:
a configuration module, which, when a synchronization time is reached, configures a data cleaning task according to the indicator statistics maintenance task table whose current state is valid and the indicator cleaning code template, where the indicator statistics maintenance task table contains the elements currently used for indicator cleaning and their corresponding data;
a test module, which performs a run test on the data cleaning task;
a release module, which, when the run test of the data cleaning task succeeds, performs scheduling configuration according to the indicator statistics maintenance task table and the indicator cleaning code template, and publishes the data cleaning task to the production environment, so that the data warehouse performs data cleaning.
Preferably, the test module specifically includes:
a trial run submodule, which executes a trial run process according to the data cleaning task and determines whether the trial run process succeeds;
if the trial run process succeeds, the trial run submodule verifies the result data obtained by the trial run process;
if the data passes verification, the trial run submodule confirms that the run test of the data cleaning task has succeeded;
if the trial run process fails or the data does not pass verification, the trial run submodule confirms that the run test of the data cleaning task has failed.
Preferably, the trial run submodule executes the trial run process according to the data cleaning task specifically by:
running the indicator cleaning code template;
reading the indicator statistics maintenance task table according to the indicator cleaning code template, and parsing the data corresponding to each element in the indicator statistics maintenance task table;
splicing the parsing result with the indicator cleaning code template to generate a SQL statement, and running the SQL statement.
Preferably, the device further includes:
an acquisition module, which obtains the current state of each indicator statistics maintenance task table;
if there is an indicator statistics maintenance task table whose state is pending business approval, the acquisition module performs business approval on that task table, and after the business approval passes, updates the state of the task table to pending technical review;
if there is an indicator statistics maintenance task table whose state is pending technical review, the acquisition module performs technical review on that task table, and after the technical review passes, updates the state of the task table to valid.
Preferably, the device further includes the following:
if the business approval of an indicator statistics maintenance task table whose state is pending business approval does not pass, the acquisition module updates the state of the task table to business approval pending modification, and after a task table in that state has been modified, updates its state to pending business approval;
if the technical review of an indicator statistics maintenance task table whose state is pending technical review does not pass, the acquisition module updates the state of the task table to technical review pending modification, and after a task table in that state has been modified, updates its state to pending technical review.
Preferably, the device further includes:
a generation module, which, when a request to add a data cleaning task is received, generates a new indicator statistics maintenance task table from the data corresponding to each element carried in the request, and sets the state of the new task table to pending business approval;
a modification module, which, when a request to modify a data cleaning task is received, generates a new indicator statistics maintenance task table from the data corresponding to the elements to be modified carried in the request and from the original indicator statistics maintenance task table to which the request corresponds, and sets the state of the new task table to pending business approval.
Preferably, the device further includes:
a removal module, which updates the state of an indicator statistics maintenance task table to invalid when that task table is in the technical review pending modification state or the business approval pending modification state and has not been modified within a preset time threshold.
It can be seen that, with the technical solution of the application, an indicator statistics maintenance task table and an indicator cleaning code template are set up in advance; when a synchronization time is reached, a data cleaning task is configured according to the indicator statistics maintenance task table whose current state is valid and the indicator cleaning code template, and a run test is performed on the data cleaning task. Only when the run test succeeds is scheduling configured according to the indicator statistics maintenance task table and the indicator cleaning code template and the data cleaning task published to the production environment, so that the data warehouse performs data cleaning. Data cleaning tasks are thereby carried out automatically, which reduces the workload of data warehouse developers and improves data development efficiency.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a data cleaning method proposed by the application;
Fig. 2 is a schematic flowchart of the data warehouse trial run, confirmation, and release in a specific embodiment of the application;
Fig. 3 is a schematic structural diagram of a data cleaning device proposed by the application.
Detailed description of the embodiments
As stated in the background, for the data processing technical staff the deployment method is essentially the same for the model of each new environment, but because those staff are short of resources, requirements often wait a very long time. At the same time, for these homogeneous requirements the technical staff have to develop new code every time, even though the model builders' indicator requirements are all similar, which is a great waste of development resources. Taking the data processing of the trusted system as an example, the indicators are the same fixed few; only the dimensions of the indicators (the environment information associated with the account) differ.
Based on the above, the application proposes a data cleaning method that reduces the workload of data developers while lowering code maintenance cost, and thereby improves development efficiency. The method abstracts the homogeneous indicator cleaning logic into a code template, so that different indicators are cleaned by passing parameters, while the variable parts of the homogeneous indicator cleaning logic are maintained separately. Before the method is carried out, an indicator statistics maintenance task table and an indicator cleaning code template are therefore set up in advance. The indicator cleaning code template is a piece of code that can execute a complete cleaning task once its variable values have been filled in; the places where variable values are to be filled in can be left as null values or empty slots. The indicator statistics maintenance task table contains the variable value for each element needed to generate a complete piece of data cleaning code.
As shown in Fig. 1, the data cleaning method proposed by the application includes the following steps:
S101: when a synchronization time is reached, configure a data cleaning task according to the indicator statistics maintenance task table whose current state is valid and the indicator cleaning code template.
Because the application aims to carry out data cleaning tasks automatically, technical staff can set a synchronization period or manually specify a synchronization time according to the actual situation. When the synchronization time arrives, the data cleaning task is configured from the preset indicator statistics maintenance task table whose current state is valid and the indicator cleaning code template. Because the indicator statistics maintenance task table contains the elements currently used for indicator cleaning and their corresponding data, during configuration the elements in the task table are filled into the indicator cleaning code template to generate the execution code for data cleaning.
In a preferred embodiment of the application, indicator cleaning mainly involves the elements shown in Table 1 below:
Element                      Explanation
Source table                 e.g. the ctu event table
Dimension field              e.g. account USER_ID, UMID
Metric field                 e.g. amount AMOUNT
Metric form                  e.g. COUNT DISTINCT, SUM
Time field                   e.g. event occurrence time gmt_occur
Aggregation start/end time   e.g. statistics over 20120101~20130101
Target table                 the table in which the computed indicators are stored
Table 1
Correspondingly, based on the elements to be filled in as shown above, the indicator cleaning code template can be set up as the following pseudocode:
INSERT OVERWRITE TABLE <target table> PARTITION (dt=${yyyymmdd})
SELECT <dimension field 1>,
       <dimension field 2>,
       <metric form 1>(<metric field 1>),
       <metric form 2>(<metric field 2>)
FROM <source table>
WHERE <time field> BETWEEN <aggregation start time> AND <aggregation end time>
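A minimal sketch of how such a template might be filled from one task-table entry is given here, assuming Python's standard `string.Template` as the templating mechanism. The placeholder names, the `render_cleaning_sql` and `metric_expr` helpers, the `EVENT_ID` field, and the `trust_indicator` target table are all hypothetical. The sketch also adds a GROUP BY over the dimension fields, which most SQL dialects require for an aggregate query of this shape.

```python
from string import Template

# Illustrative stand-in for the indicator cleaning code template.
# Placeholder names are assumptions made for this sketch.
CLEANING_TEMPLATE = Template(
    "INSERT OVERWRITE TABLE $target_table PARTITION (dt='$partition_date')\n"
    "SELECT $dimensions,\n"
    "       $metrics\n"
    "FROM $source_table\n"
    "WHERE $time_field BETWEEN '$start' AND '$end'\n"
    "GROUP BY $dimensions"
)


def metric_expr(form: str, field: str) -> str:
    """Turn a metric form and a metric field into a SQL aggregate expression."""
    if form.upper() == "COUNT DISTINCT":
        return f"COUNT(DISTINCT {field})"
    return f"{form.upper()}({field})"


def render_cleaning_sql(task: dict, partition_date: str) -> str:
    """Fill the template with the element values of one task-table entry."""
    return CLEANING_TEMPLATE.substitute(
        target_table=task["target_table"],
        partition_date=partition_date,
        dimensions=", ".join(task["dimensions"]),
        metrics=", ".join(metric_expr(f, c) for f, c in task["metrics"]),
        source_table=task["source_table"],
        time_field=task["time_field"],
        start=task["start"],
        end=task["end"],
    )


# Hypothetical entry: two dimensions combined with two indicators.
EXAMPLE_TASK = {
    "source_table": "ctu_event",
    "dimensions": ["USER_ID", "UMID"],
    "metrics": [("COUNT DISTINCT", "EVENT_ID"), ("SUM", "AMOUNT")],
    "time_field": "gmt_occur",
    "start": "20120101",
    "end": "20130101",
    "target_table": "trust_indicator",
}

if __name__ == "__main__":
    print(render_cleaning_sql(EXAMPLE_TASK, "20130102"))
```

Because the element values are interpolated directly into SQL text, a real system would presumably restrict them to known table and column names before rendering.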
Taking as an example an indicator statistics maintenance task table that computes two indicators over a combination of two dimensions, the front-end page in this specific embodiment can use a maintenance page similar to the one shown in Table 2 below:
Table 2
It should be noted that, although this specific embodiment illustrates how the execution code for data cleaning is generated using the above code template and form template, the application is not limited to them; other form layouts or code templates built on this basis also fall within the protection scope of the application.
In addition, in order to maintain and manage the indicator statistics maintenance task tables effectively, a preferred embodiment of the application defines different task states for them, including the two states pending business approval and pending technical review. A task table in the pending business approval state indicates that the data cleaning task has not yet been approved for implementation; a task table in the pending technical review state indicates that the task has been approved for implementation but has not yet been confirmed as technically feasible. Reviewing both business reasonableness and technical reasonableness ensures that data warehouse resources are used sensibly.
Specifically, in a preferred embodiment of the application, after the current state of each indicator statistics maintenance task table is obtained, the processing is as follows:
(1) If there is an indicator statistics maintenance task table whose state is pending business approval, business approval is performed on that task table, and after the business approval passes, the state of the task table is updated to pending technical review;
(2) If there is an indicator statistics maintenance task table whose state is pending technical review, technical review is performed on that task table, and after the technical review passes, the state of the task table is updated to valid.
The above describes only the ideal case. In practice, technical staff keep adding new data cleaning tasks according to actual needs, and these new tasks often fail approval or review for various reasons and have to be modified. In a preferred embodiment of the application, the measures taken for the different situations are as follows:
(1) The business approval of a task table whose state is pending business approval does not pass
In this case, the state of the task table is updated to business approval pending modification, and after a task table in that state has been modified, its state is updated to pending business approval.
(2) The technical review of a task table whose state is pending technical review does not pass
In this case, the state of the task table is updated to technical review pending modification, and after a task table in that state has been modified, its state is updated to pending technical review.
(3) A request to add a data cleaning task is received
A new indicator statistics maintenance task table is generated from the data corresponding to each element carried in the request, and the state of the new task table is set to pending business approval.
(4) A request to modify a data cleaning task is received
A new indicator statistics maintenance task table is generated from the data corresponding to the elements to be modified carried in the request and from the original indicator statistics maintenance task table to which the request corresponds, and the state of the new task table is set to pending business approval.
While task states are maintained in the above manner, data cleaning tasks that are no longer needed should also be cleaned up promptly. In a preferred embodiment of the application, if an indicator statistics maintenance task table whose state is technical review pending modification or business approval pending modification is not modified within a preset time threshold, the state of the task table is updated to invalid.
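The state handling described above can be read as a small state machine. The following sketch is one possible rendering under stated assumptions: the `State` enum, the `TaskTable` class, and its method names are hypothetical, and timestamps are passed in explicitly so that the expiry rule is easy to follow.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class State(Enum):
    PENDING_BUSINESS_APPROVAL = "pending business approval"
    BUSINESS_PENDING_MODIFICATION = "business approval pending modification"
    PENDING_TECHNICAL_REVIEW = "pending technical review"
    TECHNICAL_PENDING_MODIFICATION = "technical review pending modification"
    VALID = "valid"
    INVALID = "invalid"


# After modification, a table returns to the review step that rejected it.
RESUBMIT = {
    State.BUSINESS_PENDING_MODIFICATION: State.PENDING_BUSINESS_APPROVAL,
    State.TECHNICAL_PENDING_MODIFICATION: State.PENDING_TECHNICAL_REVIEW,
}


@dataclass
class TaskTable:
    elements: dict
    changed_at: datetime
    # New and modified task tables start at pending business approval.
    state: State = State.PENDING_BUSINESS_APPROVAL

    def review(self, passed: bool, now: datetime) -> None:
        """Apply the outcome of whichever review is currently pending."""
        if self.state is State.PENDING_BUSINESS_APPROVAL:
            nxt = (State.PENDING_TECHNICAL_REVIEW if passed
                   else State.BUSINESS_PENDING_MODIFICATION)
        elif self.state is State.PENDING_TECHNICAL_REVIEW:
            nxt = State.VALID if passed else State.TECHNICAL_PENDING_MODIFICATION
        else:
            raise ValueError(f"no review pending in state {self.state.value!r}")
        self.state, self.changed_at = nxt, now

    def modify(self, changes: dict, now: datetime) -> None:
        """Record a modification made after a rejected approval or review."""
        if self.state not in RESUBMIT:
            raise ValueError(f"not awaiting modification: {self.state.value!r}")
        self.elements.update(changes)
        self.state, self.changed_at = RESUBMIT[self.state], now

    def expire_if_stale(self, threshold: timedelta, now: datetime) -> bool:
        """Mark the table invalid if it was left unmodified past the threshold."""
        if self.state in RESUBMIT and now - self.changed_at > threshold:
            self.state, self.changed_at = State.INVALID, now
            return True
        return False
```

Only tables that reach `State.VALID` would be picked up at the next synchronization time; the length of the time threshold is left to the deployment.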
S102: perform a run test on the data cleaning task.
After the execution code for the data cleaning task has been obtained in S101, this step performs a run test on the task. In a preferred embodiment of the application, a trial run process is first executed according to the data cleaning task and it is determined whether the trial run process succeeds; processing then proceeds according to the following cases:
(1) If the trial run process succeeds, the result data obtained by the trial run process is verified;
(2) If the data passes verification, it is confirmed that the run test of the data cleaning task has succeeded;
(3) If the trial run process fails or the data does not pass verification, it is confirmed that the run test of the data cleaning task has failed.
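The three cases above amount to a two-stage check. A minimal sketch follows, assuming the trial run is a callable that raises an exception on failure and that verification is a caller-supplied predicate; the name `run_test` and both parameters are hypothetical.

```python
from typing import Any, Callable


def run_test(trial_run: Callable[[], Any],
             verify: Callable[[Any], bool]) -> bool:
    """Return True only if the trial run succeeds and its result data verifies."""
    try:
        result = trial_run()
    except Exception:
        # Assumption: a failed trial run surfaces as an exception.
        return False
    # The trial run succeeded, so the outcome depends on data verification.
    return bool(verify(result))
```

For example, `run_test(lambda: [("u1", 2)], lambda rows: len(rows) > 0)` would report success, while a trial run that raises, or result data that fails the predicate, would report failure.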
It should be noted that, depending on the application environment and device type, technical staff may use trial run processes with different steps. In a preferred embodiment of the application, the trial run process is completed through the following steps:
Step a) Run the indicator cleaning code template;
Step b) According to the indicator cleaning code template, read the indicator statistics maintenance task table and parse the data corresponding to each element in the task table;
Step c) Splice the parsing result with the indicator cleaning code template to generate a SQL statement, and run the SQL statement.
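Steps a) to c) can be illustrated end to end with an in-memory SQLite database standing in for the data warehouse. This is a sketch under several assumptions: SQLite has no INSERT OVERWRITE or partitions, so a plain INSERT is used; the row layout of the task table, the helper names, and the `indicator_result` table are hypothetical.

```python
import sqlite3


def parse_task(row):
    """Step b: parse the element values stored in one task-table row."""
    start, end = row["period"].split("~")
    return {
        "source": row["source_table"],
        "target": row["target_table"],
        "dims": [d.strip() for d in row["dimension_fields"].split(",")],
        "metrics": [m.strip() for m in row["metrics"].split(",")],
        "time_field": row["time_field"],
        "period": (start, end),
    }


def splice_sql(parsed):
    """Step c: splice the parsed values into a SQL statement."""
    dims = ", ".join(parsed["dims"])
    metrics = ", ".join(parsed["metrics"])
    return (
        f"INSERT INTO {parsed['target']} "
        f"SELECT {dims}, {metrics} FROM {parsed['source']} "
        f"WHERE {parsed['time_field']} BETWEEN ? AND ? GROUP BY {dims}"
    )


def trial_run(conn, row):
    """Steps a to c: parse the row, build the SQL, run it, return the result data."""
    parsed = parse_task(row)
    conn.execute(splice_sql(parsed), parsed["period"])
    return conn.execute(
        f"SELECT * FROM {parsed['target']} ORDER BY 1"
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ctu_event "
        "(user_id TEXT, umid TEXT, amount REAL, gmt_occur TEXT)"
    )
    conn.execute(
        "CREATE TABLE indicator_result "
        "(user_id TEXT, umid_count INTEGER, amount_sum REAL)"
    )
    conn.executemany(
        "INSERT INTO ctu_event VALUES (?, ?, ?, ?)",
        [
            ("u1", "m1", 10.0, "20120301"),
            ("u1", "m2", 5.0, "20120615"),
            ("u2", "m1", 7.0, "20120701"),
            ("u2", "m1", 1.0, "20140101"),  # outside the aggregation period
        ],
    )
    row = {
        "source_table": "ctu_event",
        "target_table": "indicator_result",
        "dimension_fields": "user_id",
        "metrics": "COUNT(DISTINCT umid), SUM(amount)",
        "time_field": "gmt_occur",
        "period": "20120101~20130101",
    }
    return trial_run(conn, row)
```

The rows returned by `trial_run` are the result data that a later verification step would inspect.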
In the specific embodiment shown in Fig. 2, after configuration is completed online, the warehouse starts the trial run process: the code run by the node reads the synchronized configuration table, parses the information in it, and splices it into a complete SQL statement to run. Once the trial run confirms that the program is free of errors, the data is accurate, performance is stable, and resource consumption is reasonable, the task is formally published to the production environment.
S103: if the run test of the data cleaning task succeeds, perform scheduling configuration according to the indicator statistics maintenance task table and the indicator cleaning code template, and publish the data cleaning task to the production environment, so that the data warehouse performs data cleaning.
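Taken together, S101 to S103 form one synchronization cycle. The sketch below shows that control flow only; the function name, the dictionary-based task tables, and the four injected callables are assumptions, and how scheduling and publishing are actually carried out is left to the environment.

```python
def sync_and_publish(task_tables, configure, run_test, schedule, publish):
    """Run one synchronization cycle and return the tasks that were published."""
    published = []
    for table in task_tables:
        if table["state"] != "valid":
            continue                 # S101 only uses task tables in the valid state
        task = configure(table)      # S101: fill the template from the task table
        if not run_test(task):       # S102: run test; failures are not published
            continue
        schedule(task)               # S103: scheduling configuration
        publish(task)                # S103: release to the production environment
        published.append(task)
    return published
```

Passing the steps in as callables keeps the sketch independent of any particular scheduler or warehouse, which the application does not name.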
By the technical scheme using above-described embodiment, in the case where inline system and data warehouse is combined, Using inline system editor and advantage easy to maintenance, the change that above-mentioned homogeneous index SD is washed in logic Amount part inline system safeguards that data warehouse is used after data syn-chronization on line is returned into warehouse, Improve development efficiency.
To reach above technical purpose, the application also proposed a kind of data cleansing equipment, as shown in figure 3, The equipment pre-sets indicator-specific statistics maintenance task table and index cleaning Code Template, and the equipment is also wrapped Include:
Configuration module 310, safeguards for effective indicator-specific statistics according to current state when synchronization point is reached and appoints Business table and index cleaning Code Template configuration data cleaning task, the indicator-specific statistics maintenance task Table includes the element and its corresponding data for being currently used in index cleaning;
Test module 320, testing results are carried out to the data cleansing task;
Release module 330, ties up when the data cleansing task run is successfully tested according to the indicator-specific statistics Shield task list and index cleaning Code Template are scheduled configuration, and by the data cleansing task Production environment is distributed to, so that data warehouse carries out data cleansing.
In a specific application scenario, the test module specifically includes:
A trial-run submodule, which executes the trial-run flow according to the data cleaning task and judges whether the trial-run flow succeeds;
If the trial-run flow succeeds, the trial-run submodule verifies the result data obtained from the trial-run flow;
If the data pass verification, the trial-run submodule confirms that the run test of the data cleaning task has succeeded;
If the trial-run flow fails or the data do not pass verification, the trial-run submodule confirms that the run test of the data cleaning task has failed.
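The decision rule of the trial-run submodule can be sketched as below. The function name `run_test` and the two callables are hypothetical; treating a raised exception or a `None` result as a failed trial run is an assumption made here, since the text does not say how failure is signalled.

```python
from typing import Callable, Optional, Sequence

Rows = Sequence[tuple]


def run_test(
    trial_run: Callable[[], Optional[Rows]],
    verify: Callable[[Rows], bool],
) -> bool:
    """Return True only if the trial run succeeds and its result data pass verification."""
    try:
        rows = trial_run()
    except Exception:
        return False          # the trial-run flow failed
    if rows is None:
        return False          # the trial run signalled failure without raising
    return verify(rows)       # success requires the data to pass verification
```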
In a specific application scenario, the trial-run submodule executes the trial-run flow according to the data cleaning task, specifically by:
Running the indicator cleaning code template;
Reading the indicator statistics maintenance task table according to the indicator cleaning code template, and parsing the data corresponding to each element in the indicator statistics maintenance task table;
Splicing a SQL statement from the parsing result and the indicator cleaning code template, and running the SQL statement.
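The read, parse, splice, and run steps above can be sketched with the Python standard library. Everything specific here is an assumption for illustration: the template text, the element names (`group_col`, `agg_func`, `value_col`, `source_table`), and the use of SQLite in place of a data warehouse. Because the values are spliced into SQL text, the sketch accepts only plain identifiers.

```python
import sqlite3
from string import Template

# Hypothetical cleaning template; the $-placeholders are the variable parts
# that would be maintained in the task table.
CLEANING_TEMPLATE = Template(
    "SELECT $group_col AS k, $agg_func($value_col) AS v "
    "FROM $source_table GROUP BY $group_col ORDER BY $group_col"
)

ELEMENT_KEYS = ("group_col", "agg_func", "value_col", "source_table")


def parse_elements(row: dict) -> dict:
    """Parse one task-table row, allowing only plain identifiers to be spliced."""
    parsed = {}
    for key in ELEMENT_KEYS:
        value = str(row[key]).strip()
        if not value.isidentifier():
            raise ValueError(f"unsafe value for {key}: {value!r}")
        parsed[key] = value
    return parsed


def build_sql(row: dict) -> str:
    """Splice the parsed elements into the template to get a complete statement."""
    return CLEANING_TEMPLATE.substitute(parse_elements(row))


def run_sql(conn: sqlite3.Connection, sql: str) -> list:
    """Run the spliced statement and return all result rows."""
    return conn.execute(sql).fetchall()
```

Rejecting anything that is not an identifier is a deliberately strict choice for the sketch; a real template would need a validation rule suited to its own placeholders.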
In a specific application scenario, the device further includes:
An acquisition module, which obtains the current state of each indicator statistics maintenance task table;
If there is an indicator statistics maintenance task table whose state is pending business approval, the acquisition module performs business approval on that table and, after the business approval is passed, updates the state of the table to pending technical review;
If there is an indicator statistics maintenance task table whose state is pending technical review, the acquisition module performs technical review on that table and, after the technical review is passed, updates the state of the table to valid.
In a specific application scenario, the device further provides that:
If the business approval of an indicator statistics maintenance task table whose state is pending business approval is not passed, the acquisition module updates the state of the table to business approval to be modified, and, once a table in the business approval to be modified state has been modified, updates its state to pending business approval;
If the technical review of an indicator statistics maintenance task table whose state is pending technical review is not passed, the acquisition module updates the state of the table to technical review to be modified, and, once a table in the technical review to be modified state has been modified, updates its state to pending technical review.
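The approval and rejection rules described for the acquisition module amount to a small state machine, which can be sketched as a transition table. The English state names, the event names (`approve`, `reject`, `modify`), and the function `next_state` are paraphrases and assumptions introduced here, not identifiers from the application.

```python
from enum import Enum


class State(Enum):
    PENDING_BUSINESS_APPROVAL = "pending business approval"
    BUSINESS_TO_BE_MODIFIED = "business approval, to be modified"
    PENDING_TECHNICAL_REVIEW = "pending technical review"
    TECHNICAL_TO_BE_MODIFIED = "technical review, to be modified"
    VALID = "valid"


# (current state, event) -> next state, following the rules in the text above.
TRANSITIONS = {
    (State.PENDING_BUSINESS_APPROVAL, "approve"): State.PENDING_TECHNICAL_REVIEW,
    (State.PENDING_BUSINESS_APPROVAL, "reject"): State.BUSINESS_TO_BE_MODIFIED,
    (State.BUSINESS_TO_BE_MODIFIED, "modify"): State.PENDING_BUSINESS_APPROVAL,
    (State.PENDING_TECHNICAL_REVIEW, "approve"): State.VALID,
    (State.PENDING_TECHNICAL_REVIEW, "reject"): State.TECHNICAL_TO_BE_MODIFIED,
    (State.TECHNICAL_TO_BE_MODIFIED, "modify"): State.PENDING_TECHNICAL_REVIEW,
}


def next_state(state: State, event: str) -> State:
    """Return the state that follows `event`, or raise if the move is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not allowed in state {state.value!r}") from None
```

Note that a table rejected at technical review returns to technical review after modification rather than restarting business approval, as the text specifies.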
In a specific application scenario, the device further includes:
A generation module, which, on receiving a request to add a data cleaning task, generates a new indicator statistics maintenance task table from the data corresponding to each element carried in the request, and sets the state of the new table to pending business approval;
A modification module, which, on receiving a request to modify a data cleaning task, generates a new indicator statistics maintenance task table from the data corresponding to the elements to be modified carried in the request and the original indicator statistics maintenance task table to which the request corresponds, and sets the state of the new table to pending business approval.
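A minimal sketch of the generation and modification modules follows. The `TaskTable` shape and both handler names are hypothetical. The sketch assumes that a modification produces a new table and leaves the original untouched, which is one reading of "generates a new table from the original".

```python
from dataclasses import dataclass
from typing import Dict

PENDING_BUSINESS_APPROVAL = "pending business approval"


@dataclass(frozen=True)
class TaskTable:
    """An indicator statistics maintenance task table (hypothetical shape)."""
    elements: Dict[str, str]
    state: str = PENDING_BUSINESS_APPROVAL


def handle_add_request(elements: Dict[str, str]) -> TaskTable:
    """Generation module: build a new table from the data carried in the request."""
    return TaskTable(elements=dict(elements))


def handle_modify_request(original: TaskTable, changes: Dict[str, str]) -> TaskTable:
    """Modification module: overlay the changed elements on a copy of the original."""
    merged = {**original.elements, **changes}
    return TaskTable(elements=merged)
```

In both cases the new table starts in the pending business approval state, so a modified task goes through approval again before it can become valid.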
In a specific application scenario, the device further includes:
A removal module, which updates the state of an indicator statistics maintenance task table to invalid when that table, being in the technical review to be modified or business approval to be modified state, has not been modified within a preset time threshold.
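The timeout rule of the removal module can be sketched as a periodic sweep. The `TaskTable` fields, the state strings, and the function `expire_stale` are hypothetical, and the text does not fix the threshold value, so it is a parameter here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List

# English paraphrases of the two "to be modified" states.
TO_BE_MODIFIED = {
    "business approval, to be modified",
    "technical review, to be modified",
}


@dataclass
class TaskTable:
    name: str
    state: str
    last_modified: datetime


def expire_stale(tables: Iterable[TaskTable], now: datetime, threshold: timedelta) -> List[str]:
    """Mark as invalid every to-be-modified table left unmodified longer than threshold."""
    expired = []
    for table in tables:
        if table.state in TO_BE_MODIFIED and now - table.last_modified > threshold:
            table.state = "invalid"
            expired.append(table.name)
    return expired
```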
With the technical solution of the present application, an indicator statistics maintenance task table and an indicator cleaning code template are preset. When the synchronization time is reached, a data cleaning task is configured according to the indicator statistics maintenance task table whose current state is valid and the indicator cleaning code template, and a run test is performed on the task. Only when the run test succeeds is scheduling configured according to the task table and the code template and the task released to the production environment, so that the data warehouse performs data cleaning. Data cleaning tasks are thus carried out automatically, which reduces the workload of data warehouse developers and improves data mining efficiency.
From the description of the above embodiments, those skilled in the art can clearly understand that the present invention may be implemented by hardware, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions that cause a computer device (such as a personal computer, a server, or a network device) to perform the method described in each implementation scenario of the present invention.
Those skilled in the art will appreciate that the accompanying drawings are schematic diagrams of a preferred implementation scenario, and that the modules or flows in the drawings are not necessarily required to implement the present invention.
Those skilled in the art will appreciate that the modules of the device in an implementation scenario may be distributed in the device as described for that scenario, or may be changed accordingly and located in one or more devices different from that scenario. The modules of the above implementation scenarios may be merged into one module or further split into multiple submodules.
The above sequence numbers of the present invention are for description only and do not represent the merits of the implementation scenarios.
The above discloses only several specific implementation scenarios of the present invention. The present invention is not limited to them, however, and any change that those skilled in the art can conceive of shall fall within the protection scope of the present invention.

Claims (14)

1. A data cleaning method, characterized in that an indicator statistics maintenance task table and an indicator cleaning code template are preset, the method comprising:
When the synchronization time is reached, configuring a data cleaning task according to the indicator statistics maintenance task table whose current state is valid and the indicator cleaning code template, the indicator statistics maintenance task table containing the elements currently used for indicator cleaning and their corresponding data;
Performing a run test on the data cleaning task;
If the run test of the data cleaning task succeeds, performing scheduling configuration according to the indicator statistics maintenance task table and the indicator cleaning code template, and releasing the data cleaning task to the production environment so that the data warehouse performs data cleaning.
2. The method of claim 1, characterized in that performing a run test on the data cleaning task specifically comprises:
Executing a trial-run flow according to the data cleaning task, and judging whether the trial-run flow succeeds;
If the trial-run flow succeeds, verifying the result data obtained from the trial-run flow;
If the data pass verification, confirming that the run test of the data cleaning task has succeeded;
If the trial-run flow fails or the data do not pass verification, confirming that the run test of the data cleaning task has failed.
3. The method of claim 2, characterized in that executing the trial-run flow according to the data cleaning task specifically comprises:
Running the indicator cleaning code template;
Reading the indicator statistics maintenance task table according to the indicator cleaning code template, and parsing the data corresponding to each element in the indicator statistics maintenance task table;
Splicing a SQL statement from the parsing result and the indicator cleaning code template, and running the SQL statement.
4. The method of claim 1, characterized in that, before the synchronization time is reached, the method further comprises:
Obtaining the current state of each indicator statistics maintenance task table;
If there is an indicator statistics maintenance task table whose state is pending business approval, performing business approval on that table and, after the business approval is passed, updating the state of the table to pending technical review;
If there is an indicator statistics maintenance task table whose state is pending technical review, performing technical review on that table and, after the technical review is passed, updating the state of the table to valid.
5. The method of claim 4, characterized by further comprising:
If the business approval of an indicator statistics maintenance task table whose state is pending business approval is not passed, updating the state of the table to business approval to be modified, and, once a table in the business approval to be modified state has been modified, updating its state to pending business approval;
If the technical review of an indicator statistics maintenance task table whose state is pending technical review is not passed, updating the state of the table to technical review to be modified, and, once a table in the technical review to be modified state has been modified, updating its state to pending technical review.
6. The method of claim 4, characterized in that, before obtaining the current state of each indicator statistics maintenance task table, the method further comprises:
When a request to add a data cleaning task is received, generating a new indicator statistics maintenance task table from the data corresponding to each element carried in the request, and setting the state of the new table to pending business approval;
When a request to modify a data cleaning task is received, generating a new indicator statistics maintenance task table from the data corresponding to the elements to be modified carried in the request and the original indicator statistics maintenance task table to which the request corresponds, and setting the state of the new table to pending business approval.
7. The method of claim 6, characterized by further comprising:
If an indicator statistics maintenance task table whose state is technical review to be modified or business approval to be modified has not been modified within a preset time threshold, updating the state of the table to invalid.
8. A data cleaning device, characterized in that the device is preset with an indicator statistics maintenance task table and an indicator cleaning code template, the device further comprising:
A configuration module, which, when the synchronization time is reached, configures a data cleaning task according to the indicator statistics maintenance task table whose current state is valid and the indicator cleaning code template, the indicator statistics maintenance task table containing the elements currently used for indicator cleaning and their corresponding data;
A test module, which performs a run test on the data cleaning task;
A release module, which, when the run test of the data cleaning task succeeds, performs scheduling configuration according to the indicator statistics maintenance task table and the indicator cleaning code template, and releases the data cleaning task to the production environment so that the data warehouse performs data cleaning.
9. The device of claim 8, characterized in that the test module specifically comprises:
A trial-run submodule, which executes a trial-run flow according to the data cleaning task and judges whether the trial-run flow succeeds;
If the trial-run flow succeeds, the trial-run submodule verifies the result data obtained from the trial-run flow;
If the data pass verification, the trial-run submodule confirms that the run test of the data cleaning task has succeeded;
If the trial-run flow fails or the data do not pass verification, the trial-run submodule confirms that the run test of the data cleaning task has failed.
10. The device of claim 9, characterized in that the trial-run submodule executes the trial-run flow according to the data cleaning task, specifically by:
Running the indicator cleaning code template;
Reading the indicator statistics maintenance task table according to the indicator cleaning code template, and parsing the data corresponding to each element in the indicator statistics maintenance task table;
Splicing a SQL statement from the parsing result and the indicator cleaning code template, and running the SQL statement.
11. The device of claim 8, characterized by further comprising:
An acquisition module, which obtains the current state of each indicator statistics maintenance task table;
If there is an indicator statistics maintenance task table whose state is pending business approval, the acquisition module performs business approval on that table and, after the business approval is passed, updates the state of the table to pending technical review;
If there is an indicator statistics maintenance task table whose state is pending technical review, the acquisition module performs technical review on that table and, after the technical review is passed, updates the state of the table to valid.
12. The device of claim 11, characterized in that:
If the business approval of an indicator statistics maintenance task table whose state is pending business approval is not passed, the acquisition module updates the state of the table to business approval to be modified, and, once a table in the business approval to be modified state has been modified, updates its state to pending business approval;
If the technical review of an indicator statistics maintenance task table whose state is pending technical review is not passed, the acquisition module updates the state of the table to technical review to be modified, and, once a table in the technical review to be modified state has been modified, updates its state to pending technical review.
13. The device of claim 12, characterized by further comprising:
A generation module, which, on receiving a request to add a data cleaning task, generates a new indicator statistics maintenance task table from the data corresponding to each element carried in the request, and sets the state of the new table to pending business approval;
A modification module, which, on receiving a request to modify a data cleaning task, generates a new indicator statistics maintenance task table from the data corresponding to the elements to be modified carried in the request and the original indicator statistics maintenance task table to which the request corresponds, and sets the state of the new table to pending business approval.
14. The device of claim 6, characterized by further comprising:
A removal module, which updates the state of an indicator statistics maintenance task table to invalid when that table, being in the technical review to be modified or business approval to be modified state, has not been modified within a preset time threshold.
CN201510920989.2A 2015-12-11 2015-12-11 Data cleaning method and equipment Active CN106874290B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510920989.2A CN106874290B (en) 2015-12-11 2015-12-11 Data cleaning method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510920989.2A CN106874290B (en) 2015-12-11 2015-12-11 Data cleaning method and equipment

Publications (2)

Publication Number Publication Date
CN106874290A true CN106874290A (en) 2017-06-20
CN106874290B CN106874290B (en) 2020-08-04

Family

ID=59177501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510920989.2A Active CN106874290B (en) 2015-12-11 2015-12-11 Data cleaning method and equipment

Country Status (1)

Country Link
CN (1) CN106874290B (en)

Also Published As

Publication number Publication date
CN106874290B (en) 2020-08-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20201013

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201013

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Patentee after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Patentee before: Alibaba Group Holding Ltd.