CN109815224A - Method and apparatus for data quality checking and cleaning
- Publication number
- CN109815224A (application number CN201910089853.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- dirty
- rule
- quality
- dirty data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The disclosure provides a method and apparatus for data quality checking and cleaning, relating to the field of information technology, which can improve the accuracy and efficiency of data quality judgment and complete the cleaning and correction of data. The specific technical solution is: obtain data to be processed; set K data quality rules for checking the quality of the data to be processed, K ≥ 1; select M of the K data quality rules for execution and screen out dirty data from the data to be processed, 1 ≤ M ≤ K; update the data in the dirty data list. The disclosure is used for data quality checking and cleaning.
Description
Technical field
The present disclosure relates to the field of information technology, and in particular to a method and apparatus for data quality checking and cleaning.
Background
The business systems of an enterprise contain all kinds of dirty data, which affects the consistency and accuracy of the business. Data quality problems often have a significant impact on the accuracy and reliability of data analysis, and low-quality data cannot effectively support users' data analysis and decision-making. Every industry therefore needs data governance to solve the problem of low data quality, and data cleaning is one of the important steps in data governance. Typically, business personnel must compare the data item by item and clean it by manual correction to ensure data quality. This process suffers from misjudgments and omissions in manual judgment and from inefficient data correction.
Summary of the invention
Embodiments of the present disclosure provide a method and apparatus for data quality checking and cleaning, which can improve the accuracy and efficiency of data quality judgment. The technical solution is as follows:
According to a first aspect of the embodiments of the present disclosure, a method for data quality checking and cleaning is provided, the method comprising:
obtaining data to be processed;
setting K data quality rules for checking the quality of the data to be processed, K ≥ 1;
selecting M of the K data quality rules for execution, and screening out dirty data from the data to be processed, 1 ≤ M ≤ K;
updating the data in a dirty data list.
In the technical solution provided by the present disclosure, K data quality rules for screening dirty data are determined for a set of data to be processed. Any one or more of these K rules may be selected and executed to screen for dirty data, so that different data quality rules can be applied according to the specific business requirements of the actual application and dirty data can be screened out efficiently. Further, the data in the dirty data list can be updated, achieving the purpose of data cleaning.
In one embodiment, the method comprises:
receiving a range designation instruction, the range designation instruction being used to specify the target quality rule for the current check, wherein the target rule is one of the M data quality rules;
checking the data to be processed according to the target rule, and screening out the entries that contain dirty data.
With the range designation instruction, dirty data can be screened by the execution result of one or several rules, which refines the granularity of dirty data screening.
In one embodiment, the method further comprises:
outputting a dirty data list, the dirty data list including the entries in the data to be processed that contain dirty data;
adding a color identifier to the dirty data.
In one embodiment, the method further comprises:
during correction of the dirty data, saving the dirty data list before and after the correction;
receiving a historical data view instruction, the historical data view instruction being used to specify any historical version of the dirty data list;
outputting the historical version specified by the historical data view instruction.
Modifications to the dirty data are saved, so that if the user finds that a modification was wrong, the data can be restored to a historical version.
In one embodiment, the method further comprises:
correcting the dirty data through at least one of the following editing functions: grouping, duplicate screening, column assignment, column clearing, case conversion, and date format conversion.
According to a second aspect of the embodiments of the present disclosure, an apparatus for data quality checking and cleaning is provided, comprising:
an interface module, configured to obtain data to be processed;
a control module, configured to set K data quality rules for checking the quality of the data to be processed, K ≥ 1;
a processing module, configured to select M of the K data quality rules for execution, screen out dirty data from the data to be processed, and update the data in a dirty data list, 1 ≤ M ≤ K.
In one embodiment, the control module is configured to receive a range designation instruction, the range designation instruction being used to specify the target quality rule for the current check, wherein the target rule is one of the M data quality rules;
the processing module is configured to check the data to be processed according to the target rule and screen out the entries that contain dirty data.
In one embodiment, the processing module is configured to output a dirty data list, the dirty data list including the entries in the data to be processed that contain dirty data, and to add a color identifier to the dirty data.
In one embodiment, the processing module is configured to correct the dirty data and, during the correction, to save the dirty data list before and after the update;
the control module is configured to receive a historical data view instruction, the historical data view instruction being used to specify any historical version of the dirty data list;
the processing module is configured to output the historical version specified by the historical data view instruction.
In one embodiment, the processing module is configured to correct the dirty data through at least one of the following editing functions: grouping, duplicate screening, column assignment, column clearing, case conversion, and date format conversion.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of a method for data quality checking and cleaning provided by an embodiment of the present disclosure.
Fig. 2 is a flowchart of a method for data quality checking and cleaning provided by an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a display interface provided by an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a display interface provided by an embodiment of the present disclosure.
Fig. 5 is a structural diagram of an apparatus for data quality checking and cleaning provided by an embodiment of the present disclosure.
Detailed description
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings indicate the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. On the contrary, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure, as detailed in the appended claims.
Determining and cleaning data quality problems by manual review is prone to misjudgments and omissions and to inefficient modification. Embodiments of the present disclosure provide a method for data quality checking and cleaning that can find data quality problems in various kinds of data and correct dirty data according to different business scenarios.
An embodiment of the present disclosure provides a method for data quality checking and cleaning. As shown in Fig. 1, the method comprises the following steps:
101. Obtain data to be processed.
The data to be processed may be data imported from Excel, data passed in through a web service interface, data read from a database, and so on. The embodiments of the present disclosure do not limit the specific source of the data to be processed.
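As a minimal sketch of step 101, the snippet below loads records from two of the kinds of sources mentioned above. It assumes CSV text standing in for a sheet exported from Excel and an in-memory SQLite table standing in for the database; the function names, table name, and field names are hypothetical and not part of the disclosure.

```python
import csv
import io
import sqlite3


def load_from_csv(text):
    """Parse CSV text (e.g. a sheet exported from Excel) into dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def load_from_database(conn, table):
    """Read every row of a table into dict records.

    The table name is interpolated directly, so it must come from trusted code.
    """
    conn.row_factory = sqlite3.Row
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    return [dict(row) for row in rows]


# Hypothetical sample data: one CSV import and one database table.
csv_text = "code,name,unit\nA001,bolt,pcs\nA002,,kg\n"
pending = load_from_csv(csv_text)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE materials (code TEXT, name TEXT, unit TEXT)")
conn.execute("INSERT INTO materials VALUES ('A003', 'nut', 'pcs')")
pending += load_from_database(conn, "materials")
```

Whatever the source, the records end up in one uniform shape (a list of field-to-value mappings), which is what lets the later rule checks stay independent of where the data came from.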
102. Set K data quality rules for checking the quality of the data to be processed.
According to the specific business requirements of the actual application, K data quality rules for screening dirty data can be set, K ≥ 1.
For example, the production data of an enterprise includes fields such as COM code and material name, and a quality rule can be set for each field separately.
103. Select M of the K data quality rules for execution, screen out dirty data from the data to be processed, and update the data in the dirty data list.
Any one or more of the K data quality rules may be selected and executed to screen for dirty data, so that the corresponding dirty data can be screened out in a more targeted way, using different data quality rules for different user needs.
The dirty data screened out can form a dirty data list that is displayed on a web page, where the user can modify the data, achieving the purpose of data cleaning.
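Steps 102 and 103 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: rules are modeled as named predicates over a record, and the rule names, field names, and the `screen` helper are all hypothetical.

```python
def not_empty(field):
    """Build a rule that passes when the field has a non-blank value."""
    return lambda rec: bool(str(rec.get(field, "")).strip())


def max_length(field, limit):
    """Build a rule that passes when the field is at most `limit` characters."""
    return lambda rec: len(str(rec.get(field, ""))) <= limit


# Step 102: set K = 3 data quality rules (names are illustrative only).
rules = {
    "name<not empty>": not_empty("name"),
    "code<max length 4>": max_length("code", 4),
    "unit<not empty>": not_empty("unit"),
}


def screen(records, rules, selected):
    """Step 103: run the M selected rules and collect the dirty entries."""
    dirty = []
    for index, rec in enumerate(records):
        failed = [name for name in selected if not rules[name](rec)]
        if failed:
            dirty.append({"index": index, "record": rec, "failed": failed})
    return dirty


pending = [
    {"code": "A001", "name": "bolt", "unit": "pcs"},
    {"code": "A0002", "name": "", "unit": "kg"},
    {"code": "A003", "name": "nut", "unit": ""},
]

# Select M = 2 of the K = 3 rules; the unit rule is deliberately not executed.
dirty_list = screen(pending, rules, ["name<not empty>", "code<max length 4>"])
```

Because the unit rule is not among the selected M rules, the third record is not reported even though its unit is blank, which illustrates how the choice of M rules controls what counts as dirty for a given check.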
With the method for data quality checking and cleaning provided by the embodiments of the present disclosure, K data quality rules for screening dirty data are determined for a set of data to be processed. Any one or more of these K rules may be selected and executed to screen for dirty data; that is, different data quality rules are applied according to the specific business requirements of the actual application, and dirty data is screened out efficiently. Further, the data in the dirty data list can be updated, achieving the purpose of data cleaning.
Based on the method for data quality checking and cleaning provided by the embodiment corresponding to Fig. 1 above, another embodiment of the present disclosure further describes the method. Some steps are the same as or similar to the steps in the embodiment corresponding to Fig. 1, and only the differences are described in detail below.
Referring to Fig. 2, the method for data quality checking and cleaning provided by this embodiment comprises the following steps:
201. Obtain data to be processed.
202. Determine, from S preset data quality rules, K data quality rules for checking the quality of the data to be processed.
In one embodiment, S data quality rules are preset. For a given set of data to be processed, K of them can be invoked as the data quality rules for screening dirty data, where 1 ≤ K ≤ S.
Referring to the display interface shown in Fig. 3, the user can select K of the S preset data quality rules using an input device such as a mouse or touch screen. The K data quality rules chosen by the user are displayed in the right area.
The types of the K data quality rules include, but are not limited to, character length check rules, numeric format check rules, uniqueness rules, consistency rules, non-empty rules, and threshold rules. Each rule can specify the field it checks; for example, "measurement unit<consistent rule>" means that the measurement unit field is checked according to the consistency rule.
The above rules are only examples; the present disclosure does not limit the specific data quality rules or the value of K.
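The rule types listed above could be sketched as follows. The source names the rule types but does not define their semantics, so every definition here is an assumption made for illustration (for instance, consistency is treated as membership in an agreed set of values). Each function returns the indexes of the nonconforming records, and all function and field names are hypothetical.

```python
import re
from collections import Counter


def char_length(records, field, max_len):
    """Character length check: value must not exceed max_len characters."""
    return {i for i, r in enumerate(records) if len(str(r[field])) > max_len}


def numeric_format(records, field):
    """Numeric format check: value must look like an integer or a decimal."""
    pattern = re.compile(r"-?\d+(\.\d+)?")
    return {i for i, r in enumerate(records) if not pattern.fullmatch(str(r[field]))}


def uniqueness(records, field):
    """Uniqueness rule: value must not occur in more than one record."""
    counts = Counter(r[field] for r in records)
    return {i for i, r in enumerate(records) if counts[r[field]] > 1}


def consistency(records, field, allowed):
    """Consistency rule (assumed meaning): value must be one of the agreed values."""
    return {i for i, r in enumerate(records) if r[field] not in allowed}


def non_empty(records, field):
    """Non-empty rule: value must not be blank."""
    return {i for i, r in enumerate(records) if str(r[field]).strip() == ""}


def threshold(records, field, low, high):
    """Threshold rule: numeric value must lie within [low, high]."""
    bad = set()
    for i, r in enumerate(records):
        try:
            ok = low <= float(r[field]) <= high
        except (TypeError, ValueError):
            ok = False  # a value that is not a number cannot satisfy the threshold
        if not ok:
            bad.add(i)
    return bad


records = [
    {"code": "A001", "unit": "pcs", "qty": "10"},
    {"code": "A001", "unit": "PCS", "qty": "ten"},
    {"code": "A0003", "unit": "", "qty": "5000"},
]
```

Returning record indexes rather than booleans keeps record-level rules (length, format, non-empty, threshold) and dataset-level rules (uniqueness) behind the same interface.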
203. Receive a range designation instruction.
The range designation instruction is used to specify the target rule. The target rule is one of the M data quality rules.
The user can trigger the range designation instruction with the mouse. Referring to the right area of Fig. 3, the user clicks the "Execute" button next to a rule to specify which data quality rule or rules are used to check the data to be processed for dirty data.
Taking the character length check as an example, when the user clicks the "Execute" button for that rule, the data quality checking and cleaning apparatus checks the data to be processed according to the character length check rule, and outputs in the left area of Fig. 3 how many entries conform and how many do not.
Taking the case where the target rule is a consistency rule as an example, "measurement unit<consistent rule>" displayed in the right area means that the measurement unit field is checked according to the consistency rule.
When the user clicks the "Execute" button for measurement unit<consistent rule>, the data quality checking and cleaning apparatus checks the measurement unit field of the data to be processed according to the consistency rule, and outputs that m entries conform to the consistency rule and n entries do not.
With the range designation instruction, the granularity of dirty data screening can be refined by the results of the quality rules that are executed.
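A minimal sketch of step 203 follows, assuming the range designation instruction carries the name of one target rule and that the result is reported as conforming and nonconforming counts (the m and n above). The rule table, the handler name, and the sample values are hypothetical.

```python
def execute_rule(records, field, check):
    """Split records into those whose field passes the check and those that fail."""
    conforming, nonconforming = [], []
    for rec in records:
        (conforming if check(rec[field]) else nonconforming).append(rec)
    return conforming, nonconforming


# Hypothetical rule table: rule name -> (field to check, predicate on the value).
rules = {
    "unit<consistency>": ("unit", lambda value: value in {"pcs", "kg"}),
    "code<character length>": ("code", lambda value: len(value) <= 4),
}


def handle_range_instruction(records, rules, target):
    """Run only the target rule named by the range designation instruction."""
    field, check = rules[target]
    ok, bad = execute_rule(records, field, check)
    return {
        "rule": target,
        "conforming": len(ok),      # m in the description
        "nonconforming": len(bad),  # n in the description
        "dirty": bad,
    }


records = [
    {"code": "A001", "unit": "pcs"},
    {"code": "A002", "unit": "kg"},
    {"code": "A003", "unit": "PCS"},
    {"code": "A004", "unit": "pcs"},
]

report = handle_range_instruction(records, rules, "unit<consistency>")
```

Running a single named rule gives per-rule statistics, so a user can see which rule is responsible for which nonconforming entries instead of only an aggregate count.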
204. Output the dirty data list and add a color identifier to the dirty data.
Referring to Fig. 3, the left area can display the statistics generated after the data quality check is executed.
For example, when the user clicks the "Execute all" button in the right area, the data quality checking and cleaning apparatus screens out dirty data from the data to be processed according to all the data quality rules shown in the right area. The statistics are shown in the left area: 57 entries conform to the rules and 42 do not.
When the user clicks the "Execute" button for measurement unit<consistent rule>, the left area shows the statistics: 95 entries conform to the rule and 4 do not.
Referring to Fig. 3, the middle area can display the data to be processed; the content of the middle area is omitted in Fig. 3.
Referring to Fig. 4, the middle area can also output the dirty data list, which includes the entries in the data to be processed that contain dirty data. The dirty data in the list is marked with colored shading, so that the fields containing dirty data stand out at a glance.
For example, when the user clicks the region of the left area that displays "42 nonconforming", the middle area shows the 42 entries that contain dirty data, and within each entry the fields that contain dirty data are marked with shading. In Fig. 4, the shading is represented by hatching.
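The color identification in step 204 could be sketched as below. The source only says that dirty fields are shown with colored shading on a web page; rendering an HTML table with an inline background style, the particular color, and all function and field names are assumptions made for this example.

```python
from html import escape


def find_dirty_fields(record, field_checks):
    """Return the names of the fields whose value fails its check."""
    return {field for field, check in field_checks.items() if not check(record[field])}


def render_dirty_list(records, field_checks, color="#ffd6d6"):
    """Render only the entries that contain dirty data, shading the dirty cells."""
    rows = []
    for rec in records:
        dirty = find_dirty_fields(rec, field_checks)
        if not dirty:
            continue  # clean entries are left out of the dirty data list
        cells = []
        for field, value in rec.items():
            style = f' style="background:{color}"' if field in dirty else ""
            cells.append(f"<td{style}>{escape(str(value))}</td>")
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


records = [
    {"code": "A001", "unit": "pcs"},
    {"code": "A0002", "unit": "pcs"},
]
field_checks = {"code": lambda value: len(value) <= 4}

page = render_dirty_list(records, field_checks)
```

Shading is applied per field rather than per row, matching the description that the dirty fields within an entry are the ones marked.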
205. Update the data in the dirty data list, and save the dirty data list before and after the update.
The user can modify the dirty data. For example, editing functions such as grouping, duplicate screening, column assignment, column clearing, case conversion, and date format conversion can be provided for the user to modify nonconforming data.
Referring to Fig. 4, the left area shows the result of a grouping edit on the model field. Entries with the same model field content are placed in one group; among the entries that contain dirty data, 2 have the model field content "bb" and 14 have "BBB". By editing a group, the user modifies the model field of all the entries in that group at once.
For each modification by the user, the dirty data before and after the modification can be saved, so that if the user finds that a modification was wrong, the data can be restored to a historical version.
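The editing functions named in step 205 might look like the following sketch. The source lists the functions without specifying their behavior, so the signatures, the in-place update style, and the field names are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def group_by(records, field):
    """Grouping: collect entries that share the same value in a field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[field]].append(rec)
    return dict(groups)


def find_duplicates(records, field):
    """Duplicate screening: return the groups that hold more than one entry."""
    return [group for group in group_by(records, field).values() if len(group) > 1]


def assign_column(records, field, value):
    """Column assignment: set a field to the same value in every given entry."""
    for rec in records:
        rec[field] = value


def clear_column(records, field):
    """Column clearing: blank out a field in every given entry."""
    assign_column(records, field, "")


def convert_case(records, field, upper=True):
    """Case conversion: convert a text field to upper or lower case."""
    for rec in records:
        rec[field] = rec[field].upper() if upper else rec[field].lower()


def convert_date(records, field, src_fmt, dst_fmt):
    """Date format conversion: reparse a date string and write it in a new format."""
    for rec in records:
        rec[field] = datetime.strptime(rec[field], src_fmt).strftime(dst_fmt)


dirty = [
    {"model": "bb", "date": "2019/01/30"},
    {"model": "BBB", "date": "2019/01/31"},
    {"model": "bb", "date": "2019/02/01"},
]

# Edit one group to change the model field of every entry in that group at once.
groups = group_by(dirty, "model")
assign_column(groups["bb"], "model", "BBB")
convert_date(dirty, "date", "%Y/%m/%d", "%Y-%m-%d")
```

The groups hold references to the original entries, so editing a group updates the dirty data list itself, which mirrors the unified modification of a grouping described above.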
206. Receive a historical data view instruction, and output the historical version specified by the instruction.
The historical data view instruction is used to specify any historical version of the dirty data list. Referring to Fig. 4, after 3 modifications have been made to the dirty data, the middle area shows 3 historical data versions under the "Historical versions" menu. The user can select any historical data version under this menu, and the data quality checking and cleaning apparatus displays the content of that version according to the user's operation.
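One way to keep and view historical versions is a snapshot taken before each modification, as in the sketch below. The source does not say how versions are stored, so the deep-copy snapshots and the `DirtyDataHistory` class with its method names are assumptions.

```python
import copy


class DirtyDataHistory:
    """Keeps a snapshot of the dirty data list before every modification."""

    def __init__(self, dirty_list):
        self.current = dirty_list
        self.versions = []

    def modify(self, edit):
        """Save the list as it is now, then apply the edit in place."""
        self.versions.append(copy.deepcopy(self.current))
        edit(self.current)

    def view(self, version):
        """Return a copy of the requested historical version."""
        return copy.deepcopy(self.versions[version])

    def restore(self, version):
        """Replace the current list with the requested historical version."""
        self.current = self.view(version)


def fix_first(rows):
    rows[0]["model"] = "BBB"


def lower_all(rows):
    for row in rows:
        row["model"] = row["model"].lower()


def clear_all(rows):
    for row in rows:
        row["model"] = ""


history = DirtyDataHistory([{"model": "bb"}, {"model": "BBB"}])
for edit in (fix_first, lower_all, clear_all):
    history.modify(edit)

# After 3 modifications there are 3 historical versions to choose from.
before_any_edit = history.view(0)
history.restore(1)
```

Full snapshots are simple but cost memory proportional to the list size per edit; storing only the changed fields would be an alternative if the dirty data list is large.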
With the method for data quality checking and cleaning provided by the embodiments of the present disclosure, K data quality rules for screening dirty data are determined for a set of data to be processed. Any one or more of these K rules may be selected and executed to screen for dirty data; that is, different data quality rules are applied according to the specific business requirements of the actual application, and dirty data is screened out efficiently. Further, the data in the dirty data list can be updated, achieving the purpose of data cleaning.
Based on the method for data quality checking and cleaning described in the embodiments corresponding to Figs. 1 to 4 above, the following are apparatus embodiments of the present disclosure, which can be used to execute the method embodiments of the present disclosure.
An embodiment of the present disclosure provides an apparatus for data quality checking and cleaning. As shown in Fig. 5, the apparatus includes:
an interface module 51, configured to obtain data to be processed;
a control module 52, configured to set K data quality rules for checking the quality of the data to be processed, K ≥ 1;
a processing module 53, configured to select M of the K data quality rules for execution, screen out dirty data from the data to be processed, and update the data in the dirty data list, 1 ≤ M ≤ K.
In one embodiment, the control module 52 is configured to receive a range designation instruction, the range designation instruction being used to specify the target quality rule for the current check, wherein the target rule is one of the M data quality rules.
The processing module 53 is configured to check the data to be processed according to the target rule and screen out the entries that contain dirty data.
In one embodiment, the processing module 53 is configured to output a dirty data list, the dirty data list including the entries in the data to be processed that contain dirty data, and to add a color identifier to the dirty data.
In one embodiment, the processing module 53 is configured to correct the marked dirty data so that it conforms to the quality rules.
In one embodiment, the processing module 53 is configured to save the dirty data list before and after the update during correction of the dirty data.
The control module 52 is configured to receive a historical data view instruction, the historical data view instruction being used to specify any historical version of the dirty data list.
The processing module 53 is configured to output the historical version specified by the historical data view instruction.
In one embodiment, the processing module 53 is configured to correct the dirty data through at least one of the following editing functions: grouping, duplicate screening, column assignment, column clearing, case conversion, and date format conversion.
With the apparatus for data quality checking and cleaning provided by the embodiments of the present disclosure, K data quality rules for screening dirty data are determined for a set of data to be processed. Any one or more of these K rules may be selected and executed to screen for dirty data; that is, different data quality rules are applied according to the specific business requirements of the actual application, and dirty data is screened out efficiently. Further, the data in the dirty data list can be updated, achieving the purpose of data cleaning.
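The division of work among the three modules can be sketched as three small classes. This is only one possible decomposition under the assumption that rules are predicates over a record; the class names mirror the modules in Fig. 5, while the method names and sample rules are hypothetical.

```python
class InterfaceModule:
    """Module 51: obtains the data to be processed."""

    def obtain(self, source):
        return [dict(rec) for rec in source]


class ControlModule:
    """Module 52: holds the K rules and receives range designation instructions."""

    def __init__(self):
        self.rules = {}
        self.selected = []

    def set_rules(self, rules):
        if len(rules) < 1:
            raise ValueError("at least one data quality rule is required (K >= 1)")
        self.rules = dict(rules)
        self.selected = list(rules)  # by default every rule is executed

    def receive_range_instruction(self, names):
        if not names or any(name not in self.rules for name in names):
            raise ValueError("target rules must be chosen from the K rules that were set")
        self.selected = list(names)


class ProcessingModule:
    """Module 53: executes the selected rules and maintains the dirty data list."""

    def __init__(self):
        self.dirty_list = []

    def screen(self, records, control):
        self.dirty_list = [
            rec for rec in records
            if not all(control.rules[name](rec) for name in control.selected)
        ]
        return self.dirty_list


interface, control, processing = InterfaceModule(), ControlModule(), ProcessingModule()
records = interface.obtain([
    {"code": "A001", "name": "bolt"},
    {"code": "A0002", "name": "nut"},
    {"code": "A003", "name": ""},
])
control.set_rules({
    "name<non-empty>": lambda rec: rec["name"] != "",
    "code<length>": lambda rec: len(rec["code"]) <= 4,
})
all_rules_result = list(processing.screen(records, control))

control.receive_range_instruction(["code<length>"])
target_rule_result = list(processing.screen(records, control))
```

Keeping rule selection in the control module and rule execution in the processing module means a new range designation instruction changes what is screened without touching the screening code.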
Based on the method for data quality checking and cleaning described in the embodiments corresponding to Figs. 1 to 4 above, an embodiment of the present disclosure also provides a computer-readable storage medium. For example, the non-transitory computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like. The storage medium stores computer instructions for executing the method for data quality checking and cleaning described in the embodiments corresponding to Figs. 1 to 4 above, which is not repeated here.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the disclosure indicated by the following claims.
Claims (10)
1. A method for data quality checking and cleaning, characterized in that the method comprises:
obtaining data to be processed;
setting K data quality rules for checking the quality of the data to be processed, K ≥ 1;
selecting M of the K data quality rules for execution, and screening out dirty data from the data to be processed, 1 ≤ M ≤ K;
updating the data in a dirty data list.
2. The method according to claim 1, characterized by comprising:
receiving a range designation instruction, the range designation instruction being used to specify the target quality rule for the current check, wherein the target rule is one of the M data quality rules;
checking the data to be processed according to the target rule, and screening out the entries that contain dirty data.
3. The method according to claim 1, characterized by further comprising:
outputting a dirty data list, the dirty data list including the entries in the data to be processed that contain dirty data;
adding a color identifier to the dirty data.
4. The method according to claim 1, characterized by further comprising:
during correction of the dirty data, saving the dirty data list before and after the correction;
receiving a historical data view instruction, the historical data view instruction being used to specify any historical version of the dirty data list;
outputting the historical version specified by the historical data view instruction.
5. The method according to claim 4, characterized by further comprising:
correcting the dirty data through at least one of the following editing functions: grouping, duplicate screening, column assignment, column clearing, case conversion, and date format conversion.
6. An apparatus for data quality checking and cleaning, characterized by comprising:
an interface module, configured to obtain data to be processed;
a control module, configured to set K data quality rules for checking the quality of the data to be processed, K ≥ 1;
a processing module, configured to select M of the K data quality rules for execution, screen out dirty data from the data to be processed, and update the data in a dirty data list, 1 ≤ M ≤ K.
7. The apparatus according to claim 6, characterized in that
the control module is configured to receive a range designation instruction, the range designation instruction being used to specify the target quality rule for the current check, wherein the target rule is one of the M data quality rules;
the processing module is configured to check the data to be processed according to the target rule and screen out the entries that contain dirty data.
8. The apparatus according to claim 6, characterized in that
the processing module is configured to output a dirty data list, the dirty data list including the entries in the data to be processed that contain dirty data, and to add a color identifier to the dirty data.
9. The apparatus according to claim 6, characterized in that
the processing module is configured to correct the dirty data and, during the correction, to save the dirty data list before and after the update;
the control module is configured to receive a historical data view instruction, the historical data view instruction being used to specify any historical version of the dirty data list;
the processing module is configured to output the historical version specified by the historical data view instruction.
10. The apparatus according to claim 9, characterized in that
the processing module is configured to correct the dirty data through at least one of the following editing functions: grouping, duplicate screening, column assignment, column clearing, case conversion, and date format conversion.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910089853.XA CN109815224A (en) | 2019-01-30 | 2019-01-30 | Method and apparatus for data quality checking and cleaning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109815224A (en) | 2019-05-28 |
Family
ID=66605913
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910089853.XA Pending CN109815224A (en) | Method and apparatus for data quality checking and cleaning | 2019-01-30 | 2019-01-30 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109815224A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110675048A (en) * | 2019-09-19 | 2020-01-10 | 国网福建省电力有限公司 | Power data quality detection method and system |
CN110826851A (en) * | 2019-09-25 | 2020-02-21 | 云知声智能科技股份有限公司 | Quality control method and device |
CN112199364A (en) * | 2020-10-16 | 2021-01-08 | 平安国际智慧城市科技股份有限公司 | Data cleaning method and device, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101990208A (en) * | 2009-07-31 | 2011-03-23 | 中国移动通信集团公司 | Automatic data checking method, system and equipment |
CN102708149A (en) * | 2012-04-01 | 2012-10-03 | 河海大学 | Data quality management method and system |
US20130086010A1 (en) * | 2011-09-30 | 2013-04-04 | Johnson Controls Technology Company | Systems and methods for data quality control and cleansing |
CN107895013A (en) * | 2017-11-13 | 2018-04-10 | 医渡云(北京)技术有限公司 | Quality of data rule control method and device, storage medium, electronic equipment |
Non-Patent Citations (2)
Title |
---|
张玉峰: "《企业竞争情报智能挖掘》", 31 July 2013 * |
蔡立志 等: "《大数据测评》", 31 January 2015 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110675048A (en) * | 2019-09-19 | 2020-01-10 | 国网福建省电力有限公司 | Power data quality detection method and system |
CN110826851A (en) * | 2019-09-25 | 2020-02-21 | 云知声智能科技股份有限公司 | Quality control method and device |
CN110826851B (en) * | 2019-09-25 | 2022-04-01 | 云知声智能科技股份有限公司 | Quality control method and device |
CN112199364A (en) * | 2020-10-16 | 2021-01-08 | 平安国际智慧城市科技股份有限公司 | Data cleaning method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190528 |