CN109344145B - Data standard specification-based data cleaning method, device and system - Google Patents

Info

Publication number
CN109344145B
Authority
CN
China
Prior art keywords
data
work order
problem report
report work
standard specification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811040620.2A
Other languages
Chinese (zh)
Other versions
CN109344145A (en)
Inventor
刘汉亮
邓强
宋勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beiming Software Co ltd
Original Assignee
Beiming Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beiming Software Co ltd filed Critical Beiming Software Co ltd
Priority to CN201811040620.2A priority Critical patent/CN109344145B/en
Publication of CN109344145A publication Critical patent/CN109344145A/en
Application granted granted Critical
Publication of CN109344145B publication Critical patent/CN109344145B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data cleaning method, device and system based on data standard specifications. The method comprises the following steps: acquiring data standard specification information and a data source; performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending it to a first processing account; and, after the problem report work order has been processed, storing the processed problem report work order in a knowledge base. Because quality detection is driven by the data standard specification information and each processed problem report work order is stored in the knowledge base, a handler can refer to the solutions recorded in earlier work orders during subsequent data cleaning, which improves the efficiency of data cleaning. The invention can be widely applied in the field of data processing.

Description

Data standard specification-based data cleaning method, device and system
Technical Field
The invention relates to the field of data processing, in particular to a data cleaning method, a device and a system based on data standard specifications.
Background
With the rapid progress of society, the data generated by mobile phones and computers increases by hundreds of millions every day, and data cleaning technology is applied ever more widely; effectively extracting useful information from massive data has therefore become important.
Data cleaning, literally the washing of "dirty" data, is the final procedure for finding and correcting identifiable errors in a data file. The errors fall mainly into four categories: missing data, duplicate data, erroneous data and unusable data. Different types of data call for different cleaning methods, so different data standard specifications need to be adopted.
Existing data cleaning methods do not integrate the problem report work order, so the problem symptoms and solutions recorded in problem report work orders cannot be reused in subsequent cleaning; the efficiency of the prior art therefore still has room for improvement.
Disclosure of Invention
To solve the above technical problem, the present invention aims to provide a data cleaning method, device and system based on data standard specifications that can improve efficiency.
The first technical scheme adopted by the invention is as follows:
a data cleaning method based on data standard specifications comprises the following steps:
acquiring data standard specification information and a data source;
performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to a first processing account;
and after the problem report work order is processed, storing the processed problem report work order into the knowledge base.
Further, the step of performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to the first processing account specifically includes:
configuring the data standard specification of each field in the data source according to the data standard specification information;
adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in a data source;
and generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to the first processing account.
Further, the method also comprises the following steps:
and inquiring the problem report work order which adopts the same data standard specification and is processed from the knowledge base according to the data standard specification information.
Further, the method also comprises the following steps:
obtaining first information input by a user, and searching the knowledge base for processed problem report work orders that contain the first information.
The second technical scheme adopted by the invention is as follows:
a data cleansing apparatus based on data standard specifications, comprising:
a memory for storing a program;
and the processor is used for loading the program to execute a data cleaning method based on the data standard specification.
The third technical scheme adopted by the invention is as follows:
a data cleansing system based on data standard specifications, comprising:
the acquisition module is used for acquiring a data source;
the data standard specification information management module is used for adding, modifying and deleting data standard specification information;
the quality detection module is used for performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to the first processing account;
the problem report work order processing module is used for processing the problem report work order;
and the knowledge base is used for inquiring and storing the processed problem report work order.
Further, the quality detection module includes:
the mapping configuration unit is used for configuring the data standard specification of each field in the data source according to the data standard specification information;
the task execution scheduling unit is used for adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in the data source;
and the work order management unit is used for generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to the first processing account.
Further, the system also comprises:
the query module is used for querying the knowledge base, according to the data standard specification information, for processed problem report work orders that adopt the same data standard specification.
Further, the system also comprises:
the search module is used for acquiring first information input by a user and searching the knowledge base for processed problem report work orders that contain the first information.
Further, the work order management unit is further configured to:
acquiring second information input by a user, and distributing a problem report work order from a first processing account to a second processing account;
or
And acquiring third information input by the user, and sending the problem report work order to a set external system.
The invention has the following beneficial effects: quality detection is performed on the data source to be cleaned based on the data standard specification information, a problem report work order is generated and sent to the relevant processing account, and, after a handler finishes processing the problem report work order, it is stored in a knowledge base. In subsequent data cleaning, handlers can refer to the solutions recorded in processed problem report work orders, which improves the efficiency of data cleaning.
Drawings
FIG. 1 is a flowchart of a data cleansing method based on data standard specifications according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the drawings and the specific embodiments.
Referring to FIG. 1, this embodiment discloses a data cleaning method based on data standard specifications, which may be implemented by a computer.
The method comprises the following steps:
S1, obtaining data standard specification information and a data source. The data standard specification information can contain a plurality of rules, and a handler can add, delete and modify these rules according to actual needs.
S2, performing quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to the first processing account. Quality detection finds the problems in the data source, that is, the places where the data source does not satisfy a rule in the data standard specification information, and the problem report work order records these problems, for example, that the mth data item of the nth field has a problem. The problem report work order is then sent to the handler's account, i.e., the first processing account, which may be fixed or set anew for each data cleaning process.
S3, storing the processed problem report work order into a knowledge base after the problem report work order is processed. The processed problem report work order records the handler's solution. For example, if the mth data item of the nth field has a problem, the solution may be to delete, merge, replace, or otherwise manipulate the data. If a handler later encounters a similar problem during data cleaning, the earlier solution can be looked up, which improves the efficiency of data cleaning.
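Steps S1 to S3 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the names `Rule`, `WorkOrder`, `detect` and `resolve`, the dictionary-based rows, and the plain list standing in for the knowledge base are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Rule:
    """One rule of the data standard specification (hypothetical shape)."""
    field_name: str
    description: str
    check: Callable[[object], bool]  # returns True when the value conforms


@dataclass
class WorkOrder:
    """Problem report work order: detected problems plus, later, the solution."""
    assignee: str                    # stands for the "first processing account"
    problems: list = field(default_factory=list)
    solution: Optional[str] = None


def detect(rows, rules, assignee):
    """S2: check every row against every rule and record each violation."""
    order = WorkOrder(assignee=assignee)
    for row_index, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row.get(rule.field_name)):
                order.problems.append((rule.field_name, row_index, rule.description))
    return order


def resolve(order, solution, knowledge_base):
    """S3: record the handler's solution and archive the processed work order."""
    order.solution = solution
    knowledge_base.append(order)


# S1: obtain the specification (rules) and the data source (illustrative data).
rules = [Rule("age", "age must be an integer in 0..150",
              lambda v: isinstance(v, int) and 0 <= v <= 150)]
rows = [{"age": 34}, {"age": -5}, {"age": None}]

knowledge_base = []  # in-memory stand-in for the knowledge base
order = detect(rows, rules, assignee="handler_a")
resolve(order, "deleted rows with invalid age", knowledge_base)
```

Each recorded problem is a `(field, row index, rule description)` tuple, which mirrors the "mth data item of the nth field" wording of step S2.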
As a preferred embodiment, the step S2 specifically includes:
S21, configuring the data standard specification of each field in the data source according to the data standard specification information; the association between each field in the data source and its corresponding data standard specification is established by mapping.
S22, adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in a data source; the method in this embodiment can execute multiple data cleaning tasks simultaneously, so that a task scheduling function needs to be added.
And S23, generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to the first processing account. In this embodiment, the problem report work order includes the data problem in each field.
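The sub-steps S21 to S23 might look like the sketch below. The specification names (`cn_mobile`, `iso_date`), the regular expressions, the function names and the dictionary layout of the work order are assumptions made for illustration; the patent does not prescribe how a specification or a work order is represented.

```python
import re

# Hypothetical named specifications, each a predicate over one value.
SPECIFICATIONS = {
    "cn_mobile": lambda v: isinstance(v, str) and re.fullmatch(r"1\d{10}", v) is not None,
    "iso_date": lambda v: isinstance(v, str) and re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
}

# S21: map each field of the data source to its specification.
FIELD_MAPPING = {"phone": "cn_mobile", "signup": "iso_date"}


def check_fields(rows, field_mapping, specifications):
    """S22: return, per field, the indexes of the rows that violate its specification."""
    result = {name: [] for name in field_mapping}
    for index, row in enumerate(rows):
        for name, spec_name in field_mapping.items():
            if not specifications[spec_name](row.get(name)):
                result[name].append(index)
    return result


def build_work_order(result, field_mapping, account):
    """S23: keep only the fields with problems and address the order to an account."""
    problems = {
        name: {"specification": field_mapping[name], "bad_rows": bad_rows}
        for name, bad_rows in result.items() if bad_rows
    }
    return {"account": account, "problems": problems, "status": "open"}


rows = [
    {"phone": "13800138000", "signup": "2018-09-07"},
    {"phone": "1380013", "signup": "2018/09/07"},
    {"phone": None, "signup": "2018-09-08"},
]
result = check_fields(rows, FIELD_MAPPING, SPECIFICATIONS)
work_order = build_work_order(result, FIELD_MAPPING, account="handler_a")
```

Keeping the field-to-specification mapping separate from the specifications themselves means one specification can be reused for several fields, which is one plausible reading of "established in a mapping mode".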
As a preferred embodiment, to make it easier for a handler to refer to the solutions in past problem report work orders, this embodiment further comprises the following step:
S4, querying the knowledge base, according to the data standard specification information, for processed problem report work orders that adopt the same data standard specification. This embodiment can automatically match cases that adopt the same data standard specification, based on the data standard specification information selected by the handler, and present them to the user. The user can then conveniently find the solutions of the relevant cases, which improves the efficiency of data cleaning.
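A minimal sketch of the S4 lookup follows, assuming each stored work order carries a `specification_id` and a `status` attribute. Both attribute names, the identifiers such as `customer_v1`, and the list-based knowledge base are hypothetical.

```python
def query_by_specification(knowledge_base, specification_id):
    """Return processed work orders that used the given specification."""
    return [
        order for order in knowledge_base
        if order["status"] == "processed"
        and order["specification_id"] == specification_id
    ]


# Illustrative knowledge base contents.
knowledge_base = [
    {"id": 1, "specification_id": "customer_v1", "status": "processed",
     "solution": "merged duplicate customer rows"},
    {"id": 2, "specification_id": "orders_v2", "status": "processed",
     "solution": "replaced malformed dates"},
    {"id": 3, "specification_id": "customer_v1", "status": "open",
     "solution": None},
]

matches = query_by_specification(knowledge_base, "customer_v1")
```

Only work order 1 is returned here: work order 3 uses the same specification but has not been processed yet, so it has no solution to offer as a reference case.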
As a preferred embodiment, the method further comprises the following steps:
S5, acquiring first information input by a user, and searching the knowledge base for processed problem report work orders that contain the first information. In this embodiment, a user may search by inputting first information, which may be, for example, the name of a related field or the format of the processed data. When no past data cleaning case adopts the same data standard specification, an approximate data cleaning scheme can still be found by keyword search over the processed problem report work orders, so that a handler can refer to the solutions of past data cleaning cases and the efficiency of data cleaning is improved.
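The keyword search of S5 could be approximated as below. Case-insensitive substring matching over the field name, data format and solution text is an assumption of this sketch, as are the attribute names; the source only says that the work order "contains" the first information.

```python
def search_processed_orders(knowledge_base, first_information):
    """Return processed work orders whose text contains the keyword (case-insensitive)."""
    keyword = first_information.strip().lower()
    if not keyword:
        return []
    hits = []
    for order in knowledge_base:
        if order["status"] != "processed":
            continue
        searchable = " ".join([order["field_name"], order["data_format"], order["solution"]])
        if keyword in searchable.lower():
            hits.append(order)
    return hits


# Illustrative knowledge base contents.
knowledge_base = [
    {"id": 1, "status": "processed", "field_name": "birth_date",
     "data_format": "YYYY-MM-DD", "solution": "reparsed slash-separated dates"},
    {"id": 2, "status": "processed", "field_name": "phone",
     "data_format": "11 digits", "solution": "stripped spaces and country code"},
    {"id": 3, "status": "open", "field_name": "birth_date",
     "data_format": "YYYY-MM-DD", "solution": ""},
]

by_field = search_processed_orders(knowledge_base, "birth_date")
by_format = search_processed_orders(knowledge_base, "yyyy-mm-dd")
```

Both searches reach the same processed work order, once through the field name and once through the data format, which are the two kinds of first information the text mentions.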
A data cleansing apparatus based on data standard specifications, comprising:
a memory for storing a program; the memory can be a storage device such as a U disk, a hard disk or an optical disk.
And the processor is used for loading the program to execute the data cleaning method based on the data standard specification of any one of the embodiments.
The embodiment discloses a data cleaning system based on data standard specification, including:
the acquisition module is used for acquiring a data source; the data source may originate from a data interface of an external system, a local database or a storage medium.
The data standard specification information management module is used for adding, modifying and deleting data standard specification information; the data standard specification information can contain a plurality of rules, and a handler can add, delete and modify these rules according to actual needs.
And the quality detection module is used for carrying out quality detection on the data source according to the data standard specification information, generating a problem report work order and sending the problem report work order to the first processing account. In the process of quality detection of the data source, a problem existing in the data source is found, that is, the data source is found not to meet the condition of the rule in the data standard specification information, and the problem report work order records the problem existing in the data source, for example, the mth data of the nth field has a problem. Then, the problem report work order recording the data problems of the data source is transmitted to an account of the processing person, that is, a first processing account, which may be fixed or set in each data cleansing process.
The problem report work order processing module is used for processing the problem report work order. In this module, the handler may log in to his or her account and process the problem report work order; for example, the problem indicated in the work order may be handled by deleting, adding or modifying data. The final solution may be stored in the knowledge base along with the problem report work order.
And the knowledge base is used for inquiring and storing the processed problem report work order. The handler can search the knowledge base for the solutions of past problem report work orders with similar situations, so as to improve the efficiency of data cleaning.
The system makes it convenient for handlers to manage data standard specification information, which improves the flexibility of data cleaning, and it makes full use of existing problem report work orders as reference cases, which improves the efficiency of data cleaning.
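The add, modify and delete operations of the data standard specification information management module described above can be illustrated with a small in-memory store. The class name `SpecificationStore`, the rule attributes `field` and `pattern`, and the rule identifiers are hypothetical; a real system would presumably persist the rules.

```python
class SpecificationStore:
    """In-memory store of rules keyed by rule id (hypothetical design)."""

    def __init__(self):
        self._rules = {}

    def add(self, rule_id, field_name, pattern):
        if rule_id in self._rules:
            raise KeyError(f"rule {rule_id!r} already exists")
        self._rules[rule_id] = {"field": field_name, "pattern": pattern}

    def modify(self, rule_id, **changes):
        rule = self._rules[rule_id]          # raises KeyError if the rule is missing
        unknown = set(changes) - set(rule)
        if unknown:
            raise ValueError(f"unknown attributes: {sorted(unknown)}")
        rule.update(changes)

    def delete(self, rule_id):
        del self._rules[rule_id]             # raises KeyError if the rule is missing

    def rules_for(self, field_name):
        """Return copies of the rules attached to one field, with their ids."""
        return [dict(rule, id=rule_id)
                for rule_id, rule in self._rules.items()
                if rule["field"] == field_name]


store = SpecificationStore()
store.add("r1", "phone", r"1\d{10}")
store.add("r2", "email", r"[^@]+@[^@]+")
store.modify("r1", pattern=r"\d{11}")   # a handler relaxes the phone rule
store.delete("r2")                       # a handler removes the email rule
```

Rejecting unknown attributes in `modify` is a design choice of the sketch: it keeps a typo from silently creating a rule attribute that no detection task reads.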
As a preferred embodiment, the quality detection module comprises:
and the mapping configuration unit is used for configuring the data standard specification of each field in the data source according to the data standard specification information. And the mapping configuration unit establishes association between each field in the data source and the data standard specification corresponding to each field in a mapping mode.
The task execution scheduling unit is used for adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in the data source; the system in this embodiment can execute multiple data cleaning tasks at the same time, so that a task scheduling function needs to be added.
And the work order management unit is used for generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to the first processing account. In this embodiment, the problem report work order includes the data problem in each field.
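Because several data cleaning tasks may run at the same time, the task execution scheduling unit needs some form of concurrent execution. The sketch below uses a thread pool from Python's standard library; that choice, the task tuple layout and the function names are assumptions, since the source does not say how scheduling is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def run_detection_task(task):
    """Run one quality detection task and return (task name, failing row indexes)."""
    name, rows, field_name, is_valid = task
    bad_rows = [i for i, row in enumerate(rows) if not is_valid(row.get(field_name))]
    return name, bad_rows


def schedule(tasks, max_workers=4):
    """Execute all tasks on a thread pool and collect the results by task name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_detection_task, tasks))


# Two illustrative detection tasks over different data sources.
tasks = [
    ("customers.age", [{"age": 30}, {"age": -1}], "age",
     lambda v: isinstance(v, int) and v >= 0),
    ("orders.total", [{"total": 9.5}, {"total": None}, {"total": 3}], "total",
     lambda v: isinstance(v, (int, float))),
]
results = schedule(tasks)
```

The per-task results can then be handed to the work order management unit, which turns each non-empty list of failing rows into a problem report work order.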
As a preferred embodiment, to make it easier for a handler to refer to the solutions in past problem report work orders, this embodiment further includes:
the query module is used for querying the knowledge base, according to the data standard specification information, for processed problem report work orders that adopt the same data standard specification. This embodiment can automatically match cases that adopt the same data standard specification, based on the data standard specification information selected by the handler, and present them to the user. The user can then conveniently find the solutions of the relevant cases, which improves the efficiency of data cleaning.
As a preferred embodiment, further comprising:
the search module is used for acquiring first information input by a user and searching the knowledge base for processed problem report work orders that contain the first information. In this embodiment, a user may search by inputting first information, which may be, for example, the name of a related field or the format of the processed data. When no past data cleaning case adopts the same data standard specification, an approximate data cleaning scheme can still be found by keyword search over the processed problem report work orders, so that a handler can refer to the solutions of past data cleaning cases and the efficiency of data cleaning is improved.
As a preferred embodiment, to make it easier to transfer a problem report work order during processing, the work order management unit is further configured to:
acquiring second information input by a user, and distributing a problem report work order from a first processing account to a second processing account;
or
And acquiring third information input by the user, and sending the problem report work order to a set external system.
The problem report work order can be flexibly distributed to different handlers to be processed, and can also be sent to an external system.
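The two transfer options can be sketched as follows. The dictionary layout of the work order, the `history` list, the rule that a processed work order cannot be reassigned, and the use of a caller-supplied `send` function in place of a real external system are all assumptions of this example.

```python
def reassign(order, second_account):
    """Move the work order from its current account to another handler."""
    if order["status"] == "processed":
        # Assumed policy for this sketch; the source does not state one.
        raise ValueError("a processed work order cannot be reassigned")
    order["history"].append(order["account"])
    order["account"] = second_account
    return order


def forward_to_external(order, send):
    """Hand the work order to an external system through a caller-supplied function."""
    payload = {"id": order["id"], "problems": order["problems"]}
    send(payload)
    order["status"] = "forwarded"
    return order


outbox = []  # stands in for the external system in this sketch
order_a = {"id": 7, "account": "handler_a", "status": "open",
           "problems": {"phone": [1, 2]}, "history": []}
order_b = {"id": 8, "account": "handler_a", "status": "open",
           "problems": {"email": [4]}, "history": []}

reassign(order_a, "handler_b")                 # triggered by the "second information"
forward_to_external(order_b, outbox.append)    # triggered by the "third information"
```

Passing `send` as a parameter keeps the work order logic independent of whatever interface the external system actually exposes.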
The step numbers in the above method embodiments are set for convenience of illustration only, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. A data cleaning method based on data standard specification is characterized in that: the method comprises the following steps:
acquiring data standard specification information and a data source;
configuring the data standard specification of each field in the data source according to the data standard specification information, wherein the association between each field in the data source and the data standard specification corresponding to each field is established in a mapping mode;
adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in a data source;
generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to a first processing account, wherein the problem report work order comprises data problems existing in each field in the data source;
acquiring second information input by a user, and distributing a problem report work order from a first processing account to a second processing account; or acquiring third information input by a user, and sending the problem report work order to a set external system;
after the problem report work order is processed, storing the processed problem report work order into a knowledge base, wherein the processed problem report work order records a solution of a processor of a first processing account;
inquiring a problem report work order which adopts the same data standard specification and is processed from a knowledge base according to the data standard specification information;
acquiring first information input by a user, and searching the knowledge base for processed problem report work orders that contain the first information, wherein the first information is a field name or a data format.
2. A data cleaning device based on data standard specifications, characterized by comprising:
a memory for storing a program;
a processor for loading the program to execute a data cleansing method based on data standard specification as claimed in claim 1.
3. A data cleaning system based on data standard specifications, characterized by comprising:
the acquisition module is used for acquiring a data source;
the data standard specification information management module is used for adding, modifying and deleting data standard specification information, wherein each field in the data source is associated with the data standard specification corresponding to each field in a mapping mode;
the mapping configuration unit is used for configuring the data standard specification of each field in the data source according to the data standard specification information;
the task execution scheduling unit is used for adding a data quality detection task, configuring a first processing account and executing task scheduling to obtain a quality detection result of each field in the data source;
the work order management unit is used for generating a problem report work order according to the quality detection result of each field in the data source and sending the problem report work order to the first processing account, and is further used for acquiring second information input by a user and distributing the problem report work order from the first processing account to a second processing account, or acquiring third information input by the user and sending the problem report work order to a set external system, wherein the problem report work order comprises the data problems existing in each field in the data source;
the problem report work order processing module is used for processing the problem report work order, wherein the solution of a processor of the first processing account number is recorded in the processed problem report work order;
the knowledge base is used for inquiring and storing the processed problem report work order;
the query module is used for querying the problem report work order which adopts the same data standard specification and is processed from the knowledge base according to the data standard specification information;
the search module is used for acquiring first information input by a user, and searching a problem report work order which contains the first information and is processed in a knowledge base according to the first information, wherein the first information is a field name or a data format.
CN201811040620.2A 2018-09-07 2018-09-07 Data standard specification-based data cleaning method, device and system Active CN109344145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811040620.2A CN109344145B (en) 2018-09-07 2018-09-07 Data standard specification-based data cleaning method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811040620.2A CN109344145B (en) 2018-09-07 2018-09-07 Data standard specification-based data cleaning method, device and system

Publications (2)

Publication Number Publication Date
CN109344145A CN109344145A (en) 2019-02-15
CN109344145B true CN109344145B (en) 2022-12-27

Family

ID=65304922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811040620.2A Active CN109344145B (en) 2018-09-07 2018-09-07 Data standard specification-based data cleaning method, device and system

Country Status (1)

Country Link
CN (1) CN109344145B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113032669A (en) * 2021-03-09 2021-06-25 国轩高科美国研究院 Product problem processing method, device and equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739618A (en) * 2009-12-21 2010-06-16 北京世纪互联宽带数据中心有限公司 Integrated service processing system
CN101853277A (en) * 2010-05-14 2010-10-06 南京信息工程大学 Vulnerability data mining method based on classification and association analysis
CN102394885A (en) * 2011-11-09 2012-03-28 中国人民解放军信息工程大学 Information classification protection automatic verification method based on data stream
CN103678665A (en) * 2013-12-24 2014-03-26 焦点科技股份有限公司 Heterogeneous large data integration method and system based on data warehouses
CN103902731A (en) * 2014-04-16 2014-07-02 国家电网公司 Intelligent information maintenance method based on knowledge base inquiry
CN105808939A (en) * 2016-03-04 2016-07-27 新博卓畅技术(北京)有限公司 Data rule engine system and method
CN106777227A (en) * 2016-12-26 2017-05-31 河南信安通信技术股份有限公司 Multidimensional data convergence analysis system and method based on cloud platform
CN107239581A (en) * 2017-07-07 2017-10-10 小草数语(北京)科技有限公司 Data cleaning method and device
CN108169621A (en) * 2017-12-05 2018-06-15 国电南瑞科技股份有限公司 Taiwan area power-off event complementing method based on support vector machines

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080288889A1 (en) * 2004-02-20 2008-11-20 Herbert Dennis Hunt Data visualization application
US7590619B2 (en) * 2004-03-22 2009-09-15 Microsoft Corporation Search system using user behavior data
US20120179564A1 (en) * 2005-09-14 2012-07-12 Adam Soroca System for retrieving mobile communication facility user data from a plurality of providers
WO2008054037A1 (en) * 2006-11-03 2008-05-08 Yeong-Ae Kim A system of management, information providing and information acquisition for vending machine based upon wire and wireless communication and a method of management, information providing and information acquisition for vending machine using the system
CN106294492A (en) * 2015-06-08 2017-01-04 深圳中兴网信科技有限公司 Data cleaning method and cleaning engine
CN106815338A (en) * 2016-12-25 2017-06-09 北京中海投资管理有限公司 A kind of real-time storage of big data, treatment and inquiry system
CN106611053B (en) * 2016-12-26 2020-05-01 河南信安通信技术股份有限公司 Data cleaning and indexing method
CN106951315B (en) * 2017-03-17 2020-05-22 北京搜狐新媒体信息技术有限公司 ETL-based data task scheduling method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"HADCLEAN: A hybrid approach to data cleaning in data warehouses";Arindam Paul;《2012 International Conference on Information Retrieval & Knowledge Management》;20120528;第136-142页 *
"数据清洗研究综述";王曰芬 等;《现代图书情报技术》;20071225;第50-56页 *

Also Published As

Publication number Publication date
CN109344145A (en) 2019-02-15

Similar Documents

Publication Publication Date Title
US7971231B2 (en) Configuration management database (CMDB) which establishes policy artifacts and automatic tagging of the same
US7406477B2 (en) Database system with methodology for automated determination and selection of optimal indexes
US8463811B2 (en) Automated correlation discovery for semi-structured processes
US20200007588A1 (en) Method and System for Automated Cybersecurity Incident and Artifact Visualization and Correlation for Security Operation Centers and Computer Emergency Response Teams
JP2010524060A (en) Data merging in distributed computing
US20090083221A1 (en) System and Method for Estimating and Storing Skills for Reuse
US20150113008A1 (en) Providing automatable units for infrastructure support
CN109344145B (en) Data standard specification-based data cleaning method, device and system
CN110704417A (en) Metadata management method, equipment and storage medium
US20110082839A1 (en) Generating intellectual property intelligence using a patent search engine
CN104391844A (en) Data management system and tool
US20220114516A1 (en) Systems and methods for discovery of automation opportunities
US20150006578A1 (en) Dynamic search system
CN111178028B (en) Financial data cleaning method, equipment and storage medium
CN115577078B (en) Engineering cost audit information retrieval method, system, equipment and storage medium
KR101113690B1 (en) Apparatus and method for anslyzing activity information
US11663542B2 (en) Electronic knowledge creation and management visual transformation tool
US20230141506A1 (en) Pre-constructed query recommendations for data analytics
Gupta et al. Provenance in context of Hadoop as a Service (HaaS)-State of the Art and Research Directions
US20230100289A1 (en) Searchable data processing operation documentation associated with data processing of raw data
CN108363617B (en) Asynchronous importing method for offline list on SSR (simple sequence repeat)
Naamane A systematic literature review: benefits and challenges of cloud-based big data analytics.
CN106709005B (en) Method, device and system for processing redundant index in database system
US20140089911A1 (en) Rationalizing functions to identify re-usable services
CN111914059A (en) Employee welfare complaint processing method, system, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant