CN109815224A - Data quality checking and the method and apparatus of cleaning - Google Patents

Info

Publication number
CN109815224A
Authority
CN
China
Prior art keywords
data
dirty
rule
quality
dirty data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910089853.XA
Other languages
Chinese (zh)
Inventor
程宏亮
李晓燕
吴垌沅
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Merrill Lynch Data Technology Ltd By Share Ltd
Original Assignee
Merrill Lynch Data Technology Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Merrill Lynch Data Technology Ltd By Share Ltd
Priority to CN201910089853.XA
Publication of CN109815224A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure provides a method and apparatus for data quality checking and cleaning, relating to the field of information technology, which can improve the accuracy and efficiency of data quality assessment and complete the cleaning and rectification of data. The technical solution is: obtain data to be processed; set K data quality rules for checking the quality of the data to be processed, K ≥ 1; select M of the K data quality rules for execution and screen out dirty data from the data to be processed, 1 ≤ M ≤ K; update the data in the dirty data list. The disclosure is used for data quality checking and cleaning.

Description

Method and apparatus for data quality checking and cleaning
Technical field
The present disclosure relates to the field of information technology, and in particular to a method and apparatus for data quality checking and cleaning.
Background
The business systems of an enterprise contain dirty data of various kinds, which undermines the consistency and accuracy of the business. Data quality problems often have a significant impact on the accuracy and reliability of data analysis results, and low-quality data cannot effectively support users' data analysis and decision-making. Every industry therefore needs data governance to address poor data quality, and data cleaning is one of the important steps in data governance. Usually, business personnel must compare the data item by item and clean it by manual correction to ensure data quality. This process suffers from misjudgments and omissions in manual assessment and from low efficiency in correcting the data.
Summary of the invention
Embodiments of the present disclosure provide a method and apparatus for data quality checking and cleaning that can improve the accuracy and efficiency of data quality assessment. The technical solution is as follows:
According to a first aspect of the embodiments of the present disclosure, a method for data quality checking and cleaning is provided, the method comprising:
obtaining data to be processed;
setting K data quality rules for checking the quality of the data to be processed, K ≥ 1;
selecting M of the K data quality rules for execution, and screening out dirty data from the data to be processed, 1 ≤ M ≤ K;
updating the data in a dirty data list.
In the technical solution provided by the present disclosure, K data quality rules for screening dirty data are determined for a set of data to be processed. Any one or more of these K rules can be selected and executed to screen for dirty data, so that different data quality rules are applied according to the specific business requirements of the application and dirty data is screened out efficiently. Furthermore, the data in the dirty data list can be updated, which achieves the purpose of data cleaning.
In one embodiment, the method comprises:
receiving a range-specifying instruction, which specifies the target quality rule for the current check, where the target rule is one of the M data quality rules;
checking the data to be processed according to the target rule, and screening out the entries that contain dirty data.
With the range-specifying instruction, dirty data can be screened by the execution result of one particular rule or of several particular rules, which refines the granularity of dirty data screening.
In one embodiment, the method further comprises:
outputting a dirty data list, which contains the entries of the data to be processed that contain dirty data;
adding a color marker to the dirty data.
In one embodiment, the method further comprises:
while the dirty data is being rectified, saving the dirty data list as it was before and after each rectification;
receiving a historical-data viewing instruction, which specifies any historical version of the dirty data list;
outputting the historical version specified by the historical-data viewing instruction.
Modifications to the dirty data are saved, so a user who discovers a mistaken modification can restore the data to a given historical version.
In one embodiment, the method further comprises:
rectifying the dirty data by means of at least one of the following editing functions: grouping, duplicate filtering, column assignment, column clearing, case conversion, and date format conversion.
According to a second aspect of the embodiments of the present disclosure, a data quality checking and cleaning device is provided, comprising:
an interface module, configured to obtain data to be processed;
a control module, configured to set K data quality rules for checking the quality of the data to be processed, K ≥ 1;
a processing module, configured to select M of the K data quality rules for execution, screen out dirty data from the data to be processed, and update the data in a dirty data list, 1 ≤ M ≤ K.
In one embodiment, the control module is configured to receive a range-specifying instruction, which specifies the target quality rule for the current check, where the target rule is one of the M data quality rules;
the processing module is configured to check the data to be processed according to the target rule and screen out the entries that contain dirty data.
In one embodiment, the processing module is configured to output a dirty data list, which contains the entries of the data to be processed that contain dirty data, and to add a color marker to the dirty data.
In one embodiment, the processing module is configured to rectify the dirty data and, while doing so, to save the dirty data list as it was before and after each update;
the control module is configured to receive a historical-data viewing instruction, which specifies any historical version of the dirty data list;
the processing module is configured to output the historical version specified by the historical-data viewing instruction.
In one embodiment, the processing module is configured to rectify the dirty data by means of at least one of the following editing functions: grouping, duplicate filtering, column assignment, column clearing, case conversion, and date format conversion.
It should be understood that the foregoing general description and the following detailed description are merely exemplary and explanatory and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of a method for data quality checking and cleaning provided by an embodiment of the present disclosure.
Fig. 2 is a flowchart of a method for data quality checking and cleaning provided by an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a display interface provided by an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a display interface provided by an embodiment of the present disclosure.
Fig. 5 is a structural diagram of a data quality checking and cleaning device provided by an embodiment of the present disclosure.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the disclosure as detailed in the appended claims.
When data quality problems are identified and cleaned through manual review, misjudgments and omissions occur easily and corrections are inefficient. Embodiments of the present disclosure provide a method for data quality checking and cleaning that can search for data quality problems in various kinds of data and rectify dirty data according to different business scenarios.
An embodiment of the present disclosure provides a method for data quality checking and cleaning. As shown in Fig. 1, the method comprises the following steps:
101. Obtain data to be processed.
The data to be processed may be data imported from Excel, data passed in through a web service interface, data read from a database, and so on. Embodiments of the present disclosure do not limit the specific source of the data to be processed.
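As an illustration of step 101, the following minimal sketch shows one way records from different sources might be brought into a common row format before checking. The function names `load_csv_text` and `load_sqlite`, the sample table, and the choice of a list of dicts as the common format are assumptions made for this example and are not taken from the disclosure; CSV text stands in for a spreadsheet export.

```python
import csv
import io
import sqlite3


def load_csv_text(text):
    """Parse CSV text (e.g. a spreadsheet export) into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def load_sqlite(conn, query):
    """Run a query and return each row as a dict keyed by column name."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Hypothetical sample data; both sources yield the same row structure.
csv_rows = load_csv_text("code,name\nA001,bolt\nA002,nut\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE material (code TEXT, name TEXT)")
conn.execute("INSERT INTO material VALUES ('A001', 'bolt')")
db_rows = load_sqlite(conn, "SELECT code, name FROM material")
```

Once every source produces the same structure, the later checking steps do not need to know where the data came from.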
102. Set K data quality rules for checking the quality of the data to be processed.
K data quality rules for screening dirty data can be set according to the specific business requirements of the application, K ≥ 1.
For example, the production data of an enterprise includes fields such as COM code and material name, and a quality rule can be set for each field separately.
103. Select M of the K data quality rules for execution, screen out dirty data from the data to be processed, and update the data in the dirty data list.
Any one or more of the K data quality rules can be selected and executed to screen for dirty data, so that different rules are applied according to the user's needs and the corresponding dirty data is screened out in a more targeted way.
The dirty data that is screened out can form a dirty data list, which is displayed on a web page. The user can modify the data in the list, which achieves the purpose of data cleaning.
In the method for data quality checking and cleaning provided by the embodiments of the present disclosure, K data quality rules for screening dirty data are determined for a set of data to be processed. Any one or more of these K rules can be selected and executed to screen for dirty data; that is, different data quality rules are applied according to the specific business requirements of the application, and dirty data is screened out efficiently. Furthermore, the data in the dirty data list can be updated, which achieves the purpose of data cleaning.
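Steps 102 and 103 can be sketched as follows. This is an illustrative sketch only: the representation of a rule as a function that returns True for a compliant row, the name `screen_dirty`, and the two sample rules are assumptions for this example, not the implementation described in the disclosure.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Rule = Callable[[Row], bool]  # returns True when the row complies


def screen_dirty(rows: List[Row], rules: Dict[str, Rule], selected: List[str]) -> List[Row]:
    """Run the selected M rules (out of the K configured) and return failing rows."""
    if not 1 <= len(selected) <= len(rules):
        raise ValueError("need 1 <= M <= K selected rules")
    active = [rules[name] for name in selected]
    return [row for row in rows if not all(rule(row) for rule in active)]


# K = 2 hypothetical rules for a material table.
rules = {
    "code_not_empty": lambda row: row["code"].strip() != "",
    "name_max_10": lambda row: len(row["name"]) <= 10,
}
rows = [
    {"code": "A001", "name": "bolt"},
    {"code": "", "name": "nut"},
    {"code": "A003", "name": "extra long washer"},
]

dirty_all = screen_dirty(rows, rules, ["code_not_empty", "name_max_10"])  # M = 2
dirty_one = screen_dirty(rows, rules, ["code_not_empty"])                 # M = 1
```

Running both rules flags the second and third rows, while running only the non-empty rule flags the second row alone, which is the effect of choosing M rules out of K.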
Based on the method for data quality checking and cleaning provided by the embodiment corresponding to Fig. 1 above, another embodiment of the present disclosure further elaborates on the method. Some of the steps are the same as or similar to the steps in the embodiment corresponding to Fig. 1, and only the differences are described in detail below.
Referring to Fig. 2, the method for data quality checking and cleaning provided by this embodiment comprises the following steps:
201. Obtain data to be processed.
202. Determine, from S preset data quality rules, K data quality rules for checking the quality of the data to be processed.
In one embodiment, S data quality rules are preset. For a given set of data to be processed, K of them can be invoked as the data quality rules for screening dirty data, where 1 ≤ K ≤ S.
Referring to the display interface shown in Fig. 3, the user can select K of the S preset data quality rules with an input device such as a mouse or touch screen. The K rules chosen by the user are displayed in the right-hand area.
The types of the K data quality rules include, but are not limited to, character length check rules, numeric format check rules, uniqueness rules, consistency rules, non-null rules, and threshold rules. Each rule can specify the field it checks; for example, "measurement unit<consistent rule>" means that the measurement unit field is checked against the consistency rule.
The rules above are given only as examples; the present disclosure does not limit the specific data quality rules or the value of K.
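The rule types listed above might be expressed as small rule factories, each bound to the field it checks. The sketch below is an assumption-laden illustration: the function names, the return convention (a set of indices of non-compliant rows), and in particular the reading of a "consistency rule" as membership in an agreed set of values are choices made for this example, since the disclosure does not define the rules in detail.

```python
import re
from collections import Counter


def length_rule(field, max_len):
    """Character length check: value must not exceed max_len characters."""
    return lambda rows: {i for i, r in enumerate(rows) if len(r[field]) > max_len}


def numeric_rule(field):
    """Numeric format check: value must look like an integer or decimal."""
    pattern = re.compile(r"-?\d+(\.\d+)?")
    return lambda rows: {i for i, r in enumerate(rows) if not pattern.fullmatch(r[field])}


def not_empty_rule(field):
    """Non-null check: value must not be empty or whitespace."""
    return lambda rows: {i for i, r in enumerate(rows) if r[field].strip() == ""}


def unique_rule(field):
    """Uniqueness check: every row sharing a repeated value is flagged."""
    def check(rows):
        counts = Counter(r[field] for r in rows)
        return {i for i, r in enumerate(rows) if counts[r[field]] > 1}
    return check


def threshold_rule(field, low, high):
    """Threshold check: numeric value must lie in [low, high]."""
    def check(rows):
        bad = set()
        for i, r in enumerate(rows):
            try:
                ok = low <= float(r[field]) <= high
            except ValueError:
                ok = False  # non-numeric values count as non-compliant
            if not ok:
                bad.add(i)
        return bad
    return check


def consistency_rule(field, allowed):
    """Consistency check (assumed meaning): value must come from an agreed set."""
    return lambda rows: {i for i, r in enumerate(rows) if r[field] not in allowed}


# Hypothetical sample rows.
rows = [
    {"code": "A001", "unit": "kg", "qty": "5"},
    {"code": "A001", "unit": "KG", "qty": "x"},
    {"code": "A003", "unit": "kg", "qty": "500"},
]
results = {
    "code unique": unique_rule("code")(rows),
    "unit consistent": consistency_rule("unit", {"kg", "pcs"})(rows),
    "qty numeric": numeric_rule("qty")(rows),
    "qty in 0..100": threshold_rule("qty", 0, 100)(rows),
    "code length": length_rule("code", 4)(rows),
    "unit not empty": not_empty_rule("unit")(rows),
}
```

Returning row indices rather than a single pass/fail flag lets column-level rules such as uniqueness share one interface with per-value rules such as the length check.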
203. Receive a range-specifying instruction.
The range-specifying instruction specifies the target rule. The target rule is one of the M data quality rules.
The user can trigger the range-specifying instruction with the mouse. Referring to the right-hand area of Fig. 3, the user clicks the "execute" button that follows a given rule to specify which data quality rule or rules are used to check the data to be processed for dirty data.
Taking the character length check as an example, when the user clicks the "execute" button for that rule, the data quality checking and cleaning device checks the data to be processed according to the character length check rule and outputs, in the left-hand area of Fig. 3, how many entries comply and how many do not.
Taking the case where the target rule is a consistency rule as an example, "measurement unit<consistent rule>" displayed in the right-hand area means that the measurement unit field is to be checked according to the consistency rule.
When the user clicks the "execute" button for measurement unit<consistent rule>, the data quality checking and cleaning device checks the measurement unit field of the data to be processed according to the consistency rule and outputs that m entries comply with the consistency rule and n entries do not.
With the range-specifying instruction, the granularity of dirty data screening can be refined according to the results of the quality rules that are executed.
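The effect of executing a single target rule, as in step 203, can be sketched as a function that checks one field and reports the counts m and n. The name `run_rule`, the sample rows, and the allowed-unit set are hypothetical and serve only to illustrate the idea.

```python
def run_rule(rows, field, is_valid):
    """Check one field with one rule; return (compliant_count, dirty_entries)."""
    dirty = [row for row in rows if not is_valid(row[field])]
    return len(rows) - len(dirty), dirty


# Hypothetical consistency check: the unit must be written in one agreed form.
ALLOWED_UNITS = {"kg", "pcs"}
rows = [
    {"code": "A001", "unit": "kg"},
    {"code": "A002", "unit": "Kilogram"},
    {"code": "A003", "unit": "pcs"},
]

m, dirty = run_rule(rows, "unit", lambda value: value in ALLOWED_UNITS)
n = len(dirty)
print(f"compliant: {m}, non-compliant: {n}")
```

Keeping the non-compliant entries alongside the counts means the same result can feed both the statistics display and the dirty data list.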
204. Output the dirty data list and add a color marker to the dirty data.
Referring to Fig. 3, the left-hand area can display the statistics generated after a data quality check is executed.
For example, when the user clicks the "execute all" button in the right-hand area, the data quality checking and cleaning device screens out dirty data from the data to be processed according to all of the data quality rules shown in the right-hand area. The statistics are displayed in the left-hand area: 57 entries comply with the rules and 42 do not.
When the user clicks the "execute" button for measurement unit<consistent rule>, the statistics displayed in the left-hand area show that 95 entries comply with the rule and 4 do not.
Referring to Fig. 3, the middle area can display the data to be processed; the content of the middle area is omitted in Fig. 3.
Referring to Fig. 4, the middle area can also output the dirty data list, which contains the entries of the data to be processed that contain dirty data. The dirty data in the list is marked with colored shading, so that the fields containing dirty data can be seen at a glance.
For example, when the user clicks the part of the left-hand area that displays "non-compliant: 42", the middle area shows those 42 entries that contain dirty data, and within each entry the fields that contain dirty data are marked with shading. In Fig. 4 the shading is represented by hatching.
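Field-level marking, as described for step 204, might look like the sketch below: it records which fields of an entry fail their rule and shades those cells when rendering a table row. The HTML output, the color value, and the names `find_dirty_fields` and `render_row_html` are assumptions for this example; the disclosure only states that dirty data is marked with colored shading on a web page.

```python
import html


def find_dirty_fields(row, field_rules):
    """Return the names of fields in this row that fail their rule."""
    return {field for field, is_valid in field_rules.items() if not is_valid(row[field])}


def render_row_html(row, dirty_fields, color="#ffd6d6"):
    """Render a table row, shading the cells that contain dirty data."""
    cells = []
    for field, value in row.items():
        style = f' style="background:{color}"' if field in dirty_fields else ""
        cells.append(f"<td{style}>{html.escape(value)}</td>")
    return "<tr>" + "".join(cells) + "</tr>"


# Hypothetical per-field rules and a sample entry.
field_rules = {
    "code": lambda v: v.strip() != "",
    "unit": lambda v: v in {"kg", "pcs"},
}
row = {"code": "A002", "unit": "Kilogram"}
dirty_fields = find_dirty_fields(row, field_rules)
markup = render_row_html(row, dirty_fields)
```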
205. Update the data in the dirty data list, and save the dirty data list as it was before and after the update.
The user can modify the dirty data. For example, editing functions such as grouping, duplicate filtering, column assignment, column clearing, case conversion, and date format conversion can be provided so that the user can modify the data that does not comply with the rules.
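Several of the editing functions named above can be sketched as small functions that return a modified copy of the rows. The function names, the date formats, and the sample data are hypothetical; the disclosure names the functions but does not specify their behavior in detail.

```python
from datetime import datetime


def assign_column(rows, field, value):
    """Column assignment: set the field to the same value in every row."""
    return [{**row, field: value} for row in rows]


def clear_column(rows, field):
    """Column clearing: empty the field in every row."""
    return assign_column(rows, field, "")


def convert_case(rows, field, upper=True):
    """Case conversion: upper-case (or lower-case) the field in every row."""
    return [
        {**row, field: row[field].upper() if upper else row[field].lower()}
        for row in rows
    ]


def convert_date(rows, field, src_fmt, dst_fmt):
    """Date format conversion: re-write the field from src_fmt to dst_fmt."""
    return [
        {**row, field: datetime.strptime(row[field], src_fmt).strftime(dst_fmt)}
        for row in rows
    ]


def drop_duplicates(rows, field):
    """Duplicate filtering: keep the first row for each value of the field."""
    seen, kept = set(), []
    for row in rows:
        if row[field] not in seen:
            seen.add(row[field])
            kept.append(row)
    return kept


# Hypothetical sample rows.
rows = [
    {"model": "bb", "date": "30/01/2019"},
    {"model": "BB", "date": "28/05/2019"},
]
upper = convert_case(rows, "model")
dates = convert_date(rows, "date", "%d/%m/%Y", "%Y-%m-%d")
deduped = drop_duplicates(upper, "model")
cleared = clear_column(rows, "date")
```

Because each function returns new rows instead of changing its input, the state before and after an edit remains available, which suits the version saving described in this step.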
Referring to Fig. 4, the left-hand area shows the result of a grouping edit on the model number field. Entries with the same model number content are placed in one group: among the entries containing dirty data, 2 have the model number "bb" and 14 have the model number "BBB". By editing a group, the user modifies the model number field of all entries in that group at once.
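The grouping edit can be illustrated with a short sketch that groups entries by a field value and then rewrites one group in a single operation. The names `group_by_field` and `edit_group` and the four sample entries are assumptions for this example; the counts are smaller than those shown in Fig. 4.

```python
from collections import defaultdict


def group_by_field(rows, field):
    """Map each distinct field value to the indices of rows that hold it."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[field]].append(index)
    return dict(groups)


def edit_group(rows, field, old_value, new_value):
    """Return a copy of rows with one group's field value replaced."""
    return [
        {**row, field: new_value} if row[field] == old_value else dict(row)
        for row in rows
    ]


# Hypothetical dirty entries grouped by model number.
rows = [{"model": "bb"}, {"model": "BBB"}, {"model": "bb"}, {"model": "BBB"}]
groups = group_by_field(rows, "model")
fixed = edit_group(rows, "model", "bb", "BBB")
```

The group sizes correspond to the per-value counts shown in the left-hand area, and one call to `edit_group` corresponds to the user editing a whole group at once.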
For each modification the user makes, the dirty data before and after the modification can be saved, so a user who discovers a mistaken modification can restore the data to a given historical version.
206. Receive a historical-data viewing instruction and output the historical version that it specifies.
The historical-data viewing instruction specifies any historical version of the dirty data list. Referring to Fig. 4, after 3 modifications have been made to the dirty data, 3 historical data versions are listed under the "historical version" menu in the middle area. The user can select any of them from this menu, and the data quality checking and cleaning device displays the content of the selected historical version.
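One simple way to support the version history of steps 205 and 206 is to keep a snapshot of the dirty data list after every edit. The class below is a minimal sketch under that assumption; the class name, its methods, and the decision to restore by appending a copy of the old version are illustrative choices, not details given in the disclosure.

```python
import copy


class DirtyListHistory:
    """Keeps a snapshot of the dirty data list before and after every edit."""

    def __init__(self, rows):
        self.versions = [copy.deepcopy(rows)]  # version 0 is the original list

    @property
    def current(self):
        return copy.deepcopy(self.versions[-1])

    def apply(self, edit):
        """Apply an edit function to the latest version and store the result."""
        self.versions.append(copy.deepcopy(edit(self.current)))
        return len(self.versions) - 1  # number of the new version

    def view(self, version):
        """Return a copy of the requested historical version."""
        return copy.deepcopy(self.versions[version])

    def restore(self, version):
        """Make an earlier version current again by appending a copy of it."""
        self.versions.append(self.view(version))


history = DirtyListHistory([{"model": "bb"}])
history.apply(lambda rows: [{**r, "model": r["model"].upper()} for r in rows])
history.apply(lambda rows: [{**r, "model": "WRONG"} for r in rows])
history.restore(1)  # undo the mistaken second edit
```

Restoring by appending rather than truncating keeps the mistaken version in the history as well, so no state is lost.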
In the method for data quality checking and cleaning provided by the embodiments of the present disclosure, K data quality rules for screening dirty data are determined for a set of data to be processed. Any one or more of these K rules can be selected and executed to screen for dirty data; that is, different data quality rules are applied according to the specific business requirements of the application, and dirty data is screened out efficiently. Furthermore, the data in the dirty data list can be updated, which achieves the purpose of data cleaning.
Based on the method for data quality checking and cleaning described in the embodiments corresponding to Figs. 1 to 4 above, the following are device embodiments of the present disclosure, which can be used to carry out the method embodiments of the present disclosure.
An embodiment of the present disclosure provides a data quality checking and cleaning device. As shown in Fig. 5, the device comprises:
an interface module 51, configured to obtain data to be processed;
a control module 52, configured to set K data quality rules for checking the quality of the data to be processed, K ≥ 1;
a processing module 53, configured to select M of the K data quality rules for execution, screen out dirty data from the data to be processed, and update the data in the dirty data list, 1 ≤ M ≤ K.
In one embodiment, the control module 52 is configured to receive a range-specifying instruction, which specifies the target quality rule for the current check, where the target rule is one of the M data quality rules.
The processing module 53 is configured to check the data to be processed according to the target rule and screen out the entries that contain dirty data.
In one embodiment, the processing module 53 is configured to output a dirty data list, which contains the entries of the data to be processed that contain dirty data, and to add a color marker to the dirty data.
In one embodiment, the processing module 53 is configured to rectify the marked dirty data so that it complies with the quality rules.
In one embodiment, the processing module 53 is configured to save, while the dirty data is being rectified, the dirty data list as it was before and after each update.
The control module 52 is configured to receive a historical-data viewing instruction, which specifies any historical version of the dirty data list.
The processing module 53 is configured to output the historical version specified by the historical-data viewing instruction.
In one embodiment, the processing module 53 is configured to rectify the dirty data by means of at least one of the following editing functions: grouping, duplicate filtering, column assignment, column clearing, case conversion, and date format conversion.
With the data quality checking and cleaning device provided by the embodiments of the present disclosure, K data quality rules for screening dirty data are determined for a set of data to be processed. Any one or more of these K rules can be selected and executed to screen for dirty data; that is, different data quality rules are applied according to the specific business requirements of the application, and dirty data is screened out efficiently. Furthermore, the data in the dirty data list can be updated, which achieves the purpose of data cleaning.
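The division of work among the three modules of Fig. 5 could be mirrored in code roughly as follows. This is a structural sketch only: the class and method names, the in-memory data source, and the way the modules are wired together are assumptions for illustration and do not describe the actual device.

```python
class InterfaceModule:
    """Obtains the data to be processed (here: from an in-memory list)."""

    def __init__(self, source):
        self._source = source

    def fetch(self):
        return [dict(row) for row in self._source]


class ControlModule:
    """Holds the K configured rules and the user's selection of M of them."""

    def __init__(self, rules):
        self.rules = dict(rules)
        self.selected = list(self.rules)  # default: all K rules

    def select(self, names):
        unknown = [name for name in names if name not in self.rules]
        if unknown or not names:
            raise ValueError("selection must name at least one configured rule")
        self.selected = list(names)

    def active_rules(self):
        return [self.rules[name] for name in self.selected]


class ProcessingModule:
    """Screens out dirty data and applies edits to the dirty data list."""

    def screen(self, rows, rules):
        return [row for row in rows if not all(rule(row) for rule in rules)]

    def update(self, dirty_rows, edit):
        return [edit(dict(row)) for row in dirty_rows]


# Hypothetical wiring of the three modules.
interface = InterfaceModule([{"code": "A001"}, {"code": ""}])
control = ControlModule({"code_not_empty": lambda r: r["code"] != ""})
control.select(["code_not_empty"])
processing = ProcessingModule()

dirty = processing.screen(interface.fetch(), control.active_rules())
cleaned = processing.update(dirty, lambda r: {**r, "code": "UNKNOWN"})
```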
Based on the method for data quality checking and cleaning described in the embodiments corresponding to Figs. 1 to 4 above, an embodiment of the present disclosure also provides a computer-readable storage medium. For example, the non-transitory computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like. Computer instructions are stored on the storage medium for executing the method for data quality checking and cleaning described in the embodiments corresponding to Figs. 1 to 4 above, which is not repeated here.
Those skilled in the art will readily conceive of other embodiments of the disclosure after considering the specification and practicing the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A method for data quality checking and cleaning, characterized in that the method comprises:
obtaining data to be processed;
setting K data quality rules for checking the quality of the data to be processed, K ≥ 1;
selecting M of the K data quality rules for execution, and screening out dirty data from the data to be processed, 1 ≤ M ≤ K;
updating the data in a dirty data list.
2. The method according to claim 1, characterized in that it comprises:
receiving a range-specifying instruction, which specifies the target quality rule for the current check, where the target rule is one of the M data quality rules;
checking the data to be processed according to the target rule, and screening out the entries that contain dirty data.
3. The method according to claim 1, characterized in that it further comprises:
outputting a dirty data list, which contains the entries of the data to be processed that contain dirty data;
adding a color marker to the dirty data.
4. The method according to claim 1, characterized in that it further comprises:
while the dirty data is being rectified, saving the dirty data list as it was before and after each rectification;
receiving a historical-data viewing instruction, which specifies any historical version of the dirty data list;
outputting the historical version specified by the historical-data viewing instruction.
5. The method according to claim 4, characterized in that it further comprises:
rectifying the dirty data by means of at least one of the following editing functions: grouping, duplicate filtering, column assignment, column clearing, case conversion, and date format conversion.
6. A data quality checking and cleaning device, characterized in that it comprises:
an interface module, configured to obtain data to be processed;
a control module, configured to set K data quality rules for checking the quality of the data to be processed, K ≥ 1;
a processing module, configured to select M of the K data quality rules for execution, screen out dirty data from the data to be processed, and update the data in a dirty data list, 1 ≤ M ≤ K.
7. The device according to claim 6, characterized in that
the control module is configured to receive a range-specifying instruction, which specifies the target quality rule for the current check, where the target rule is one of the M data quality rules;
the processing module is configured to check the data to be processed according to the target rule and screen out the entries that contain dirty data.
8. The device according to claim 6, characterized in that
the processing module is configured to output a dirty data list, which contains the entries of the data to be processed that contain dirty data, and to add a color marker to the dirty data.
9. The device according to claim 6, characterized in that
the processing module is configured to rectify the dirty data and, while doing so, to save the dirty data list as it was before and after each update;
the control module is configured to receive a historical-data viewing instruction, which specifies any historical version of the dirty data list;
the processing module is configured to output the historical version specified by the historical-data viewing instruction.
10. The device according to claim 9, characterized in that
the processing module is configured to rectify the dirty data by means of at least one of the following editing functions: grouping, duplicate filtering, column assignment, column clearing, case conversion, and date format conversion.
CN201910089853.XA 2019-01-30 2019-01-30 Data quality checking and the method and apparatus of cleaning Pending CN109815224A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910089853.XA CN109815224A (en) 2019-01-30 2019-01-30 Data quality checking and the method and apparatus of cleaning

Publications (1)

Publication Number Publication Date
CN109815224A true CN109815224A (en) 2019-05-28

Family

ID=66605913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910089853.XA Pending CN109815224A (en) 2019-01-30 2019-01-30 Data quality checking and the method and apparatus of cleaning

Country Status (1)

Country Link
CN (1) CN109815224A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101990208A (en) * 2009-07-31 2011-03-23 中国移动通信集团公司 Automatic data checking method, system and equipment
US20130086010A1 (en) * 2011-09-30 2013-04-04 Johnson Controls Technology Company Systems and methods for data quality control and cleansing
CN102708149A (en) * 2012-04-01 2012-10-03 河海大学 Data quality management method and system
CN107895013A (en) * 2017-11-13 2018-04-10 医渡云(北京)技术有限公司 Quality of data rule control method and device, storage medium, electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张玉峰: "《企业竞争情报智能挖掘》", 31 July 2013 *
蔡立志 等: "《大数据测评》", 31 January 2015 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110675048A (en) * 2019-09-19 2020-01-10 国网福建省电力有限公司 Power data quality detection method and system
CN110826851A (en) * 2019-09-25 2020-02-21 云知声智能科技股份有限公司 Quality control method and device
CN110826851B (en) * 2019-09-25 2022-04-01 云知声智能科技股份有限公司 Quality control method and device
CN112199364A (en) * 2020-10-16 2021-01-08 平安国际智慧城市科技股份有限公司 Data cleaning method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190528