CN109344145B - Data standard specification-based data cleaning method, device and system - Google Patents
- Publication number
- CN109344145B (application number CN201811040620.2A; also published as CN109344145A)
- Authority
- CN
- China
- Prior art keywords
- data
- work order
- problem report
- report work
- standard specification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a data cleaning method, device and system based on data standard specifications. The method comprises the following steps: acquiring data standard specification information and a data source; performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to a first processing account; and, after the problem report work order has been processed, storing the processed problem report work order in a knowledge base. Because quality detection of the data source to be cleaned is driven by the data standard specification information, and each problem report work order is stored in the knowledge base once the handler has finished processing it, handlers can refer to the solutions recorded in previously processed work orders during subsequent data cleaning, which improves the efficiency of data cleaning. The invention can be widely applied in the field of data processing.
Description
Technical Field
The invention relates to the field of data processing, in particular to a data cleaning method, a device and a system based on data standard specifications.
Background
With the rapid progress of society, the data generated by mobile phones and computers grows by hundreds of millions every day. Effectively extracting useful information from such massive data has therefore become important, and data cleaning technology is applied more and more widely.
Data cleaning literally means washing away "dirty" data. It is the last procedure for finding and correcting recognizable errors in a data file. Dirty data falls mainly into four categories: missing data, duplicate data, erroneous data and unusable data. Different types of data call for different cleaning methods, so different data standard specifications need to be adopted.
Existing data cleaning methods do not integrate the problem report work order, so the problem symptoms and solutions recorded in problem report work orders cannot be reused in subsequent cleaning. The efficiency of the prior art therefore still has room for improvement.
Disclosure of Invention
To solve the above technical problem, the present invention aims to provide a data cleaning method, device and system based on data standard specifications that can improve efficiency.
The first technical scheme adopted by the invention is as follows:
a data cleaning method based on data standard specifications comprises the following steps:
acquiring data standard specification information and a data source;
performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to a first processing account;
and after the problem report work order is processed, storing the processed problem report work order into the knowledge base.
Further, the step of performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to the first processing account specifically includes:
configuring the data standard specification of each field in the data source according to the data standard specification information;
adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in a data source;
and generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to the first processing account.
Further, the method also comprises the following steps:
querying the knowledge base, according to the data standard specification information, for processed problem report work orders that adopt the same data standard specification.
Further, the method also comprises the following steps:
acquiring first information input by a user, and searching the knowledge base for processed problem report work orders that contain the first information.
The second technical scheme adopted by the invention is as follows:
a data cleansing apparatus based on data standard specifications, comprising:
a memory for storing a program;
and the processor is used for loading the program to execute a data cleaning method based on the data standard specification.
The third technical scheme adopted by the invention is as follows:
a data cleansing system based on data standard specifications, comprising:
the acquisition module is used for acquiring a data source;
the data standard specification information management module is used for adding, modifying and deleting data standard specification information;
the quality detection module is used for performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to the first processing account;
the problem report work order processing module is used for processing the problem report work order;
and the knowledge base is used for inquiring and storing the processed problem report work order.
Further, the quality detection module includes:
the mapping configuration unit is used for configuring the data standard specification of each field in the data source according to the data standard specification information;
the task execution scheduling unit is used for adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in the data source;
and the work order management unit is used for generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to the first processing account.
Further, the system also includes:
and the query module is used for querying the problem report work order which adopts the same data standard specification and is processed from the knowledge base according to the data standard specification information.
Further, the system also includes:
and the searching module is used for acquiring first information input by a user and searching a problem report work order which contains the first information and is processed in the knowledge base according to the first information.
Further, the work order management unit is further configured to:
acquiring second information input by a user, and distributing a problem report work order from a first processing account to a second processing account;
or
And acquiring third information input by the user, and sending the problem report work order to a set external system.
The beneficial effects of the invention are as follows: quality detection is performed on the data source to be cleaned based on the data standard specification information, a problem report work order is generated and sent to the relevant processing account, and after the handler has finished processing the problem report work order it is stored in a knowledge base. Handlers can therefore refer to the solutions recorded in previously processed problem report work orders during subsequent data cleaning, which improves the efficiency of data cleaning.
Drawings
FIG. 1 is a flowchart of a data cleansing method based on data standard specifications according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the drawings and the specific embodiments.
Referring to FIG. 1, this embodiment discloses a data cleaning method based on data standard specifications, which may be implemented by a computer.
The method comprises the following steps:
S1, acquiring data standard specification information and a data source. The data standard specification information can contain a plurality of rules, and a handler can add, delete and modify the rules in the data standard specification information according to actual needs.
And S2, performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to the first processing account. In the process of quality detection of the data source, a problem existing in the data source is found, that is, the data source is found not to meet the condition of the rule in the data standard specification information, and the problem report work order records the problem existing in the data source, for example, the mth data of the nth field has a problem. The problem report work order, in which the data problems of the data source are recorded, is then transmitted to the account number of the handler, i.e., the first processing account number, which may be fixed or set during each data cleaning process.
And S3, storing the processed problem report work order into a knowledge base after the problem report work order is processed. Wherein the solution of the handler is recorded in the processed problem report work order. For example, the mth data of the nth field has a problem, and a solution to the problem is to delete, merge, replace, or otherwise manipulate the data. Therefore, if a processor encounters similar problems in the subsequent data cleaning process, the previous solution can be found, and the efficiency of data cleaning is improved.
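Steps S1 to S3 can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the `Rule` and `WorkOrder` structures, the regular-expression form of a rule, the in-memory list used as the knowledge base, and all sample values are assumptions introduced here.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """One rule of a data standard specification (hypothetical structure)."""
    name: str
    pattern: str  # regular expression that a valid value must fully match


@dataclass
class WorkOrder:
    """Problem report work order: detected problems plus the handler's solution."""
    spec_name: str
    account: str
    problems: list  # tuples of (field name, row index, violated rule name)
    solution: str = ""
    processed: bool = False


def detect(rows, field_rules):
    """S2: return every (field, row index, rule name) that violates a rule."""
    problems = []
    for i, row in enumerate(rows):
        for field_name, rule in field_rules.items():
            value = row.get(field_name)
            if value is None or re.fullmatch(rule.pattern, str(value)) is None:
                problems.append((field_name, i, rule.name))
    return problems


# S1: acquire the specification information and the data source (sample values).
field_rules = {"phone": Rule("11-digit phone", r"\d{11}")}
rows = [{"phone": "13800000000"}, {"phone": "1380"}, {"phone": None}]

# S2: quality detection, then issue the work order to the first processing account.
order = WorkOrder("customer-spec", "handler_a", detect(rows, field_rules))

# S3: once the handler has processed it, store the work order in the knowledge base.
knowledge_base = []
order.solution = "Delete rows whose phone number is missing or truncated."
order.processed = True
knowledge_base.append(order)
```

Recording the row index together with the field name mirrors the "mth data of the nth field" wording used in step S2.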
As a preferred embodiment, the step S2 specifically includes:
S21, configuring the data standard specification of each field in the data source according to the data standard specification information; each field in the data source is associated, by means of a mapping, with the data standard specification that applies to it.
S22, adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in a data source; the method in this embodiment can execute multiple data cleaning tasks simultaneously, so that a task scheduling function needs to be added.
And S23, generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to the first processing account. In this embodiment, the problem report work order includes the data problem in each field.
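The sub-steps S21 to S23 might look like the sketch below. The specification names, the field-to-specification dictionary, the dictionary form of the work order and the use of a thread pool as the scheduler are all assumptions made for illustration; the patent does not prescribe a scheduling mechanism.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# S21: map each field of the data source to the specification that governs it.
SPECS = {"cn_mobile": r"1\d{10}", "iso_date": r"\d{4}-\d{2}-\d{2}"}
FIELD_TO_SPEC = {"phone": "cn_mobile", "signup": "iso_date"}


def check_field(rows, field_name):
    """Return the indices of rows whose value for field_name violates its spec."""
    pattern = SPECS[FIELD_TO_SPEC[field_name]]
    return [i for i, row in enumerate(rows)
            if re.fullmatch(pattern, str(row.get(field_name, ""))) is None]


def run_detection_task(rows, account):
    """S22 and S23: check every mapped field and build the work order."""
    result = {name: check_field(rows, name) for name in FIELD_TO_SPEC}
    problems = {name: bad for name, bad in result.items() if bad}
    return {"account": account, "problems": problems}


def schedule(tasks, max_workers=4):
    """S22: run several detection tasks concurrently, keeping submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_detection_task, rows, acct) for rows, acct in tasks]
        return [f.result() for f in futures]


source_a = [{"phone": "13800000000", "signup": "2018-09-07"},
            {"phone": "12345", "signup": "07/09/2018"}]
source_b = [{"phone": "13900000000", "signup": "2018-01-01"}]
orders = schedule([(source_a, "handler_a"), (source_b, "handler_b")])
```

Keeping the mapping separate from the rules means a specification can be changed once and every field mapped to it picks up the change.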
As a preferred embodiment, to make it easier for a handler to refer to the solutions in past problem report work orders, the present embodiment further comprises the following steps:
S4, querying the knowledge base, according to the data standard specification information, for processed problem report work orders that adopt the same data standard specification. Based on the data standard specification information selected by the handler, this embodiment can automatically match cases in the knowledge base that adopt the same data standard specification and present them to the user. The user can thus conveniently find the solutions of relevant cases, which improves the efficiency of data cleaning.
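A query of the kind described in step S4 could be as simple as the filter below. The dictionary layout of a stored work order and the sample entries are hypothetical.

```python
def query_by_spec(knowledge_base, spec_name):
    """Return processed work orders that used the given data standard specification."""
    return [o for o in knowledge_base if o["processed"] and o["spec"] == spec_name]


# Hypothetical knowledge base contents.
knowledge_base = [
    {"id": 1, "spec": "cn_mobile", "processed": True, "solution": "Strip the +86 prefix."},
    {"id": 2, "spec": "iso_date", "processed": True, "solution": "Reparse as DD/MM/YYYY."},
    {"id": 3, "spec": "cn_mobile", "processed": False, "solution": ""},
]
matches = query_by_spec(knowledge_base, "cn_mobile")
```

Unprocessed work orders are excluded because they carry no solution that a handler could reuse.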
As a preferred embodiment, the method further comprises the following steps:
S5, acquiring first information input by a user, and searching the knowledge base for processed problem report work orders that contain the first information. In this embodiment, a user may search by entering first information, which may be, for example, the name of a relevant field or the format of the data being processed. When no past data cleaning case used the same data standard specification, a similar data cleaning scheme can still be found by keyword search over the processed problem report work orders, so that the handler can refer to the solution of a past case and the efficiency of data cleaning is improved.
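The keyword search of step S5 can be sketched as a case-insensitive substring match. The searched attributes (`field`, `data_format`, `solution`) and the sample entries are assumptions; the patent only states that the first information may be a field name or a data format.

```python
def search(knowledge_base, first_info):
    """Return processed work orders whose text contains first_info (case-insensitive)."""
    needle = first_info.lower()
    hits = []
    for order in knowledge_base:
        if not order["processed"]:
            continue
        text = " ".join([order["field"], order["data_format"], order["solution"]]).lower()
        if needle in text:
            hits.append(order)
    return hits


# Hypothetical knowledge base contents.
knowledge_base = [
    {"field": "phone", "data_format": "11 digits",
     "solution": "Strip the +86 prefix.", "processed": True},
    {"field": "signup_date", "data_format": "YYYY-MM-DD",
     "solution": "Reparse DD/MM/YYYY values.", "processed": True},
    {"field": "phone", "data_format": "11 digits",
     "solution": "", "processed": False},
]
by_field = search(knowledge_base, "Phone")
by_format = search(knowledge_base, "yyyy-mm-dd")
```

A real system would probably use an indexed full-text search instead of a linear scan; the linear scan is used here only to keep the example self-contained.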
A data cleansing apparatus based on data standard specifications, comprising:
a memory for storing a program; the memory can be a storage device such as a USB flash drive, a hard disk or an optical disk.
And the processor is used for loading the program to execute the data cleaning method based on the data standard specification of any one of the embodiments.
The embodiment discloses a data cleaning system based on data standard specification, including:
the acquisition module is used for acquiring a data source; the data source may originate from a data interface of an external system, a local database or a storage medium.
The data standard specification information management module is used for adding, modifying and deleting data standard specification information; the data standard specification information can contain a plurality of rules, and a handler can add, delete and modify the rules in the data standard specification information according to actual needs.
And the quality detection module is used for carrying out quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to the first processing account. In the process of quality detection of the data source, a problem existing in the data source is found, that is, the data source is found not to meet the condition of the rule in the data standard specification information, and the problem report work order records the problem existing in the data source, for example, the mth data of the nth field has a problem. Then, the problem report work order recording the data problems of the data source is transmitted to an account of the processing person, that is, a first processing account, which may be fixed or set in each data cleansing process.
The problem report work order processing module is used for processing the problem report work order; in this module, the handler may log in to his or her account and process the problem report work order, for example by deleting, adding or modifying the data indicated in the work order. The final solution may be stored in the knowledge base together with the problem report work order.
And the knowledge base is used for querying and storing the processed problem report work order. The handler can search the knowledge base for the solutions of past problem report work orders that dealt with similar situations, so as to improve the efficiency of data cleaning.
The system makes it convenient for handlers to manage the data standard specification information, which improves the flexibility of data cleaning, and it makes full use of existing problem report work orders as reference cases, which improves the efficiency of data cleaning.
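The add, modify and delete operations of the data standard specification information management module might be organized as in the following sketch. The class name `SpecManager`, its method names and the choice of exceptions are hypothetical and serve only to illustrate the three operations.

```python
class SpecManager:
    """Holds the rules of one data standard specification, keyed by rule name."""

    def __init__(self):
        self._rules = {}

    def add(self, name, pattern):
        """Add a new rule; refuse to overwrite an existing one silently."""
        if name in self._rules:
            raise ValueError(f"rule {name!r} already exists")
        self._rules[name] = pattern

    def modify(self, name, pattern):
        """Change an existing rule; raise KeyError if it does not exist."""
        if name not in self._rules:
            raise KeyError(name)
        self._rules[name] = pattern

    def delete(self, name):
        """Remove a rule; raises KeyError if the rule is unknown."""
        del self._rules[name]

    def rules(self):
        """Return a copy, so callers cannot edit the stored rules directly."""
        return dict(self._rules)


manager = SpecManager()
manager.add("phone", r"\d{11}")
manager.add("postcode", r"\d{5}")
manager.modify("postcode", r"\d{6}")
manager.delete("phone")
```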
As a preferred embodiment, the quality detection module comprises:
and the mapping configuration unit is used for configuring the data standard specification of each field in the data source according to the data standard specification information. And the mapping configuration unit establishes association between each field in the data source and the data standard specification corresponding to each field in a mapping mode.
The task execution scheduling unit is used for adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in the data source; the system in this embodiment can execute multiple data cleaning tasks at the same time, so that a task scheduling function needs to be added.
And the work order management unit is used for generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to the first processing account. In this embodiment, the problem report work order includes the data problem in each field.
As a preferred embodiment, to make it easier for a handler to refer to the solutions in past problem report work orders, the embodiment further includes:
the query module, which is used for querying the knowledge base, according to the data standard specification information, for processed problem report work orders that adopt the same data standard specification. Based on the data standard specification information selected by the handler, this embodiment can automatically match cases in the knowledge base that adopt the same data standard specification and present them to the user. The user can thus conveniently find the solutions of relevant cases, which improves the efficiency of data cleaning.
As a preferred embodiment, further comprising:
the search module, which is used for acquiring first information input by a user and searching the knowledge base for processed problem report work orders that contain the first information. In this embodiment, a user may search by entering first information, which may be, for example, the name of a relevant field or the format of the data being processed. When no past data cleaning case used the same data standard specification, a similar data cleaning scheme can still be found by keyword search over the processed problem report work orders, so that the handler can refer to the solution of a past case and the efficiency of data cleaning is improved.
As a preferred embodiment, to facilitate transferring problem report work orders between handlers and systems, the work order management unit is further configured to:
acquiring second information input by a user, and distributing a problem report work order from a first processing account to a second processing account;
or
And acquiring third information input by the user, and sending the problem report work order to a set external system.
The problem report work order can be flexibly distributed to different handlers to be processed, and can also be sent to an external system.
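The two routing options of the work order management unit can be sketched as follows. The dictionary form of the work order, the `history` list and the caller-supplied `send` callable standing in for the external system are assumptions; the patent does not describe the interface of the external system.

```python
import json


def reassign(order, second_account):
    """Move the work order from its current account to second_account."""
    order["history"].append(order["account"])  # remember who held it before
    order["account"] = second_account
    return order


def forward(order, send):
    """Serialize the work order and hand it to `send`, a stand-in for the external system."""
    payload = json.dumps(order, sort_keys=True)
    send(payload)
    return payload


# Hypothetical work order held by the first processing account.
order = {"id": 7, "account": "handler_a", "history": [], "problems": {"phone": [1]}}
reassign(order, "handler_b")

outbox = []  # plays the role of the external system in this sketch
forward(order, outbox.append)
```

Passing the transport in as a callable keeps the sketch independent of any particular external system.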
The step numbers in the above method embodiments are set for convenience of illustration only, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (3)
1. A data cleaning method based on data standard specifications, characterized by comprising the following steps:
acquiring data standard specification information and a data source;
configuring the data standard specification of each field in the data source according to the data standard specification information, wherein the association between each field in the data source and the data standard specification corresponding to each field is established in a mapping mode;
adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in a data source;
generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to a first processing account, wherein the problem report work order comprises data problems existing in each field in the data source;
acquiring second information input by a user, and distributing a problem report work order from a first processing account to a second processing account; or acquiring third information input by a user, and sending the problem report work order to a set external system;
after the problem report work order is processed, storing the processed problem report work order into a knowledge base, wherein the processed problem report work order records a solution of a processor of a first processing account;
inquiring a problem report work order which adopts the same data standard specification and is processed from a knowledge base according to the data standard specification information;
acquiring first information input by a user, and searching the knowledge base for processed problem report work orders that contain the first information, wherein the first information is a field name or a data format.
2. A data cleaning device based on data standard specifications, characterized by comprising:
a memory for storing a program;
a processor for loading the program to execute a data cleansing method based on data standard specification as claimed in claim 1.
3. A data cleaning system based on data standard specifications, characterized by comprising:
the acquisition module is used for acquiring a data source;
the data standard specification information management module is used for adding, modifying and deleting data standard specification information, wherein each field in the data source is associated with the data standard specification corresponding to each field in a mapping mode;
the mapping configuration unit is used for configuring the data standard specification of each field in the data source according to the data standard specification information;
the task execution scheduling unit is used for adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in the data source;
the work order management unit is used for generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to the first processing account, and is further used for acquiring second information input by a user and distributing the problem report work order from the first processing account to a second processing account, or acquiring third information input by the user and sending the problem report work order to a set external system, wherein the problem report work order comprises data problems existing in each field in the data source;
the problem report work order processing module is used for processing the problem report work order, wherein the solution of a processor of the first processing account number is recorded in the processed problem report work order;
the knowledge base is used for inquiring and storing the processed problem report work order;
the query module is used for querying the problem report work order which adopts the same data standard specification and is processed from the knowledge base according to the data standard specification information;
the search module is used for acquiring first information input by a user, and searching a problem report work order which contains the first information and is processed in a knowledge base according to the first information, wherein the first information is a field name or a data format.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811040620.2A CN109344145B (en) | 2018-09-07 | 2018-09-07 | Data standard specification-based data cleaning method, device and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109344145A CN109344145A (en) | 2019-02-15 |
CN109344145B true CN109344145B (en) | 2022-12-27 |
Family
ID=65304922
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811040620.2A Active CN109344145B (en) | 2018-09-07 | 2018-09-07 | Data standard specification-based data cleaning method, device and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109344145B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113032669A (en) * | 2021-03-09 | 2021-06-25 | 国轩高科美国研究院 | Product problem processing method, device and equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101739618A (en) * | 2009-12-21 | 2010-06-16 | 北京世纪互联宽带数据中心有限公司 | Integrated service processing system |
CN101853277A (en) * | 2010-05-14 | 2010-10-06 | 南京信息工程大学 | Vulnerability data mining method based on classification and association analysis |
CN102394885A (en) * | 2011-11-09 | 2012-03-28 | 中国人民解放军信息工程大学 | Information classification protection automatic verification method based on data stream |
CN103678665A (en) * | 2013-12-24 | 2014-03-26 | 焦点科技股份有限公司 | Heterogeneous large data integration method and system based on data warehouses |
CN103902731A (en) * | 2014-04-16 | 2014-07-02 | 国家电网公司 | Intelligent information maintenance method based on knowledge base inquiry |
CN105808939A (en) * | 2016-03-04 | 2016-07-27 | 新博卓畅技术(北京)有限公司 | Data rule engine system and method |
CN106777227A (en) * | 2016-12-26 | 2017-05-31 | 河南信安通信技术股份有限公司 | Multidimensional data convergence analysis system and method based on cloud platform |
CN107239581A (en) * | 2017-07-07 | 2017-10-10 | 小草数语(北京)科技有限公司 | Data cleaning method and device |
CN108169621A (en) * | 2017-12-05 | 2018-06-15 | 国电南瑞科技股份有限公司 | Taiwan area power-off event complementing method based on support vector machines |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080288889A1 (en) * | 2004-02-20 | 2008-11-20 | Herbert Dennis Hunt | Data visualization application |
US7590619B2 (en) * | 2004-03-22 | 2009-09-15 | Microsoft Corporation | Search system using user behavior data |
US20120179564A1 (en) * | 2005-09-14 | 2012-07-12 | Adam Soroca | System for retrieving mobile communication facility user data from a plurality of providers |
WO2008054037A1 (en) * | 2006-11-03 | 2008-05-08 | Yeong-Ae Kim | A system of management, information providing and information acquisition for vending machine based upon wire and wireless communication and a method of management, information providing and information acquisition for vending machine using the system |
CN106294492A (en) * | 2015-06-08 | 2017-01-04 | 深圳中兴网信科技有限公司 | Data cleaning method and cleaning engine |
CN106815338A (en) * | 2016-12-25 | 2017-06-09 | 北京中海投资管理有限公司 | A kind of real-time storage of big data, treatment and inquiry system |
CN106611053B (en) * | 2016-12-26 | 2020-05-01 | 河南信安通信技术股份有限公司 | Data cleaning and indexing method |
CN106951315B (en) * | 2017-03-17 | 2020-05-22 | 北京搜狐新媒体信息技术有限公司 | ETL-based data task scheduling method and system |
Non-Patent Citations (2)
Title |
---|
"HADCLEAN: A hybrid approach to data cleaning in data warehouses";Arindam Paul;《2012 International Conference on Information Retrieval & Knowledge Management》;20120528;第136-142页 * |
"数据清洗研究综述";王曰芬 等;《现代图书情报技术》;20071225;第50-56页 * |
Also Published As
Publication number | Publication date |
---|---|
CN109344145A (en) | 2019-02-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||