WO2021143463A1 - Data cleaning method and apparatus - Google Patents

Data cleaning method and apparatus

Info

Publication number
WO2021143463A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
information item
cleaned
cleaning
type
Prior art date
Application number
PCT/CN2020/138010
Other languages
French (fr)
Chinese (zh)
Inventor
胡云
龚健
李邱林
唐明辉
贾西贝
Original Assignee
深圳市华傲数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司
Publication of WO2021143463A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Educational Administration (AREA)
  • General Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data cleaning method and apparatus, the method comprising: receiving service data from multiple objects, the service data comprising multiple information items (101); and performing data cleaning on each information item in turn (102), the data cleaning comprising: determining whether an information item belongs to a preset type that is cleaned based on a confirmation result (102a); if the information item belongs to the preset type that is cleaned based on a confirmation result, invoking the confirmation result corresponding to the information item and using the confirmation result as the cleaned data of the information item (102b); and, if the information item does not belong to the preset type that is cleaned based on a confirmation result, cleaning the information item by applying a preset plurality of data cleaning rules in turn to obtain the cleaned data of the information item (102c). The present invention produces unified data output when cleaning service data from multiple objects, solving the problem that conflicts among data from multiple objects make data fusion difficult to achieve.

Description

Data cleaning method and apparatus
Technical field
The invention relates to the field of data processing, and in particular to a data cleaning method and apparatus.
Background
Government data collection currently faces the following difficulties.
First, data is hard to collect. Government business is extremely complex. There are dozens of directly affiliated departments, such as the Public Security Bureau, the Health and Family Planning Commission, the Human Resources and Social Security Bureau, the Civil Affairs Bureau, the Market Supervision Commission, the Transportation Commission, and the Provident Fund Center, along with a number of corresponding district- and county-level units. These commissions, offices, and bureaus are associated with dozens of lists of powers and responsibilities and dozens of core systems, which generate large volumes of electronic data every day. The government can also access large amounts of external data, such as data on water, electricity, and gas usage, telecommunications, and banking. Besides structured data, government departments hold large amounts of unstructured data, including electronic copies of licenses and certificates, pictures, office documents, videos, and compressed files. Smart city construction also requires Internet of Things data to be collected thoroughly. For these files, both storage and use must be addressed. To improve the social management and urban governance capabilities of government departments, it is necessary to improve the capacity to store, analyze, and compute over unstructured data, and at the same time to share and fuse the business data of the various commissions, offices, and bureaus so that data can support management and decision-making. Integrating so many complex departments and so much business data into a unified fused resource database is extremely difficult, and government departments urgently need industry solutions to improve their overall management and control of government data.
Second, data quality is hard to control: data standards differ and data quality is poor. Government departments oversee many commissions, offices, and bureaus, whose business systems were for the most part built, operated, and maintained separately, without unified planning at the government level. Although government information resource catalogs and data element specifications exist at the national level, the development of standards lags, and their promotion and enforcement are also problematic. As a result, the business systems of the various commissions, offices, and bureaus reference government data standards inconsistently and define data elements inconsistently. Together with irregular data collection and entry, this leaves their data quality poor and makes unifying data standards and regulating data quality very difficult. In building a smart city and improving the fusion and sharing of government data among commissions, offices, and bureaus, establishing a unified data standard and data quality monitoring system is the top priority. Without standardized quality monitoring and data standards, the data collected by government departments will be disorganized and cannot deliver the value that government data should have. Establishing a city-level data center requires sound data standard management and quality management for government data.
Third, data fusion is difficult because government data comes from many sources. Government business is vast and varied, and the commissions, offices, and bureaus overlap considerably in the government information resources they manage. For basic information about citizens, legal persons, housing, and spatial geography, for example, different commissions, offices, and bureaus each hold all or part of the relevant data, and their data standards and data definitions differ considerably. Even different systems within the same commission, office, or bureau may hold different data for the same object. Government information resources thus suffer from the problem of multiple sources and multiple values. Choosing the most accurate and most suitable data from so many sources is a severe test of how government departments understand and handle government business and government data.
Fourth, real-time data collection is difficult. Government data governance projects are currently in full swing, but the vast majority address only the migration and storage of historical data. Relevant business processing information is hard to obtain in real time, and the lack of real-time data greatly affects administrative efficiency. As government efficiency improves, the required speed of response to data rises as well. For example, an event collected by a grid inspector flows quickly to the fusion database, undergoes simple cleaning and fusion, is associated with further information (such as enterprise information), and is distributed to grid handling personnel; once the handling status is updated, it flows back to the fusion platform. This whole data processing chain is usually kept within 1 minute.
Fifth, data is hard to apply. Previous government information systems and government data warehouse projects focused on collecting and integrating the data of a single department and on statistical analysis of internal data, so citizens could not directly perceive any improvement in the efficiency or service quality of government administration. Citizens handling government business still have to make multiple trips and prepare extra materials, and may even find government departments passing the buck to one another, which costs them a great deal of time and energy. The public is eager for data to be shared across government departments so that government services become easier to use. Government departments likewise hope to strengthen their control over government data, extract more application value from it, promote its open sharing, and improve governance capability and service levels.
Therefore, a data cleaning method and apparatus are urgently needed to solve the problem that conflicts among data from multiple objects make data fusion difficult to achieve.
Summary of the invention
In view of this, the present invention provides a data cleaning method and apparatus that produce unified data output when cleaning business data from multiple objects, solving the problem that conflicts among data from multiple objects make data fusion difficult to achieve.
In a first aspect, the present invention provides a data cleaning method, the method comprising: receiving business data from multiple objects, the business data comprising multiple information items; and performing data cleaning on each information item in turn, the data cleaning comprising: determining whether the information item belongs to a preset type that is cleaned based on a confirmation result; if the information item belongs to the preset type that is cleaned based on a confirmation result, invoking the confirmation result corresponding to the information item and using the confirmation result as the cleaned data of the information item; and if the information item does not belong to the preset type that is cleaned based on a confirmation result, cleaning the information item by applying a preset plurality of data cleaning rules in turn to obtain the cleaned data of the information item. The preset plurality of data cleaning rules comprises: a first rule that cleans according to the data generation time of the information item, a second rule that cleans according to the maximum or minimum value in the data of the information item, a third rule that cleans according to the majority principle (the minority yields to the majority) applied to the data of the information item, and a fourth rule that cleans according to the priority of the object to which the data of the information item belongs.
In a second aspect, the present invention provides a data cleaning apparatus, comprising: a data receiving unit, configured to receive business data from multiple objects, the business data comprising multiple information items; a data judging unit, configured to determine whether the information item belongs to a preset type that is cleaned based on a confirmation result; and a data cleaning unit, configured to: if the information item belongs to the preset type that is cleaned based on a confirmation result, invoke the confirmation result corresponding to the information item and use the confirmation result as the cleaned data of the information item; and if the information item does not belong to the preset type that is cleaned based on a confirmation result, clean the information item by applying a preset plurality of data cleaning rules in turn to obtain the cleaned data of the information item. The preset plurality of data cleaning rules comprises: a first rule that cleans according to the data generation time of the information item, a second rule that cleans according to the maximum or minimum value in the data of the information item, a third rule that cleans according to the majority principle (the minority yields to the majority) applied to the data of the information item, and a fourth rule that cleans according to the priority of the object to which the data of the information item belongs.
In a third aspect, the present invention provides a computer-readable storage medium storing a program, the program comprising instructions for executing the data cleaning method described above.
In a fourth aspect, the present invention provides a computer comprising a readable medium storing a computer program, the program comprising instructions for executing the data cleaning method described above.
The data cleaning method and apparatus of the present invention determine, for business data from multiple objects, the data cleaning rule corresponding to each information item in that business data, and then clean the data according to that rule, so that cleaning business data from multiple objects yields unified data output, solving the problem that conflicts among data from multiple objects make data fusion difficult to achieve.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of the data cleaning method provided by the first embodiment of the present invention;
FIG. 2 is a flowchart of the data cleaning method provided by the second embodiment of the present invention;
FIG. 3 is a structural block diagram of the data cleaning apparatus provided by the third embodiment of the present invention.
Detailed description
The embodiments of the present invention are described in detail below with reference to the accompanying drawings.
It should be noted that, where no conflict arises, the following embodiments and the features in them may be combined with one another; and all other embodiments obtained by a person of ordinary skill in the art from the embodiments in this disclosure without creative effort fall within the scope of protection of this disclosure.
It should also be noted that the following describes various aspects of embodiments within the scope of the appended claims. The aspects described herein may be embodied in a wide variety of forms, and any specific structure and/or function described herein is merely illustrative. Based on this disclosure, those skilled in the art should understand that an aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such an apparatus may be implemented and/or such a method practiced using structures and/or functionality other than one or more of the aspects set forth herein.
As shown in FIG. 1, the data cleaning method provided by the first embodiment of the present invention includes:
Step 101: receive business data from multiple objects, the business data comprising multiple information items;
Step 102: perform data cleaning on each information item in turn, the data cleaning specifically comprising:
Step 102a: determine whether the information item belongs to a preset type that is cleaned based on a confirmation result;
The confirmation result may specifically be a result based on authoritative confirmation. That is, for certain information items, such as gender, data research identifies the "one data item, one source" source unit that is authoritative for the information item, together with its confirmation result; based on this "one data item, one source" determination, data from multiple objects (multiple departments, also called multiple sources) is fused.
Step 102b: if the information item belongs to the preset type that is cleaned based on a confirmation result, invoke the confirmation result corresponding to the information item and use the confirmation result as the cleaned data of the information item;
Step 102c: if the information item does not belong to the preset type that is cleaned based on a confirmation result, clean the information item by applying a preset plurality of data cleaning rules in turn to obtain the cleaned data of the information item. The preset plurality of data cleaning rules comprises: a first rule that cleans according to the data generation time of the information item, a second rule that cleans according to the maximum or minimum value in the data of the information item, a third rule that cleans according to the majority principle (the minority yields to the majority) applied to the data of the information item, and a fourth rule that cleans according to the priority of the object to which the data of the information item belongs.
For business data from multiple objects, this embodiment determines the data cleaning rule corresponding to each information item in the business data and cleans the data accordingly, so that cleaning business data from multiple objects yields unified data output, solving the problem that conflicts among data from multiple objects make data fusion difficult to achieve.
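The dispatch in steps 102a to 102c can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the names `CONFIRMED`, `CONFIRMED_ITEM_TYPES`, `clean_item`, and `first_non_empty` are hypothetical, and the sketch assumes that a rule returns `None` when it cannot decide, so that the next rule in the list is tried.

```python
from typing import Callable, Dict, List, Optional

# One candidate value per source object (department); the keys are hypothetical.
Candidate = Dict[str, str]
Rule = Callable[[List[Candidate]], Optional[str]]

# Hypothetical store of confirmation results, keyed by (entity id, item name).
CONFIRMED: Dict[tuple, str] = {("person-1", "gender"): "female"}

# Hypothetical configuration: items cleaned from a confirmation result.
CONFIRMED_ITEM_TYPES = {"gender"}


def clean_item(entity_id: str, item: str, candidates: List[Candidate],
               rules: List[Rule]) -> Optional[str]:
    """Return the cleaned value of one information item (step 102)."""
    if item in CONFIRMED_ITEM_TYPES:                 # step 102a
        return CONFIRMED.get((entity_id, item))      # step 102b
    for rule in rules:                               # step 102c
        value = rule(candidates)
        if value is not None:
            return value
    return None


def first_non_empty(candidates: List[Candidate]) -> Optional[str]:
    """Placeholder rule used only to exercise the dispatch."""
    return next((c["value"] for c in candidates if c.get("value")), None)


# Gender is taken from the confirmation result, ignoring the conflicting source.
gender = clean_item("person-1", "gender",
                    [{"source": "bureau-a", "value": "male"}], [first_non_empty])
# Residence has no confirmation result, so the rule list decides.
city = clean_item("person-1", "residence",
                  [{"source": "bureau-a", "value": "Shenzhen"}], [first_non_empty])
```

Keeping the confirmation lookup ahead of the rule list means an authoritative value always wins over any conflicting source value, which matches the order of steps 102a to 102c.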
As shown in FIG. 2, the data cleaning method provided by the second embodiment of the present invention is a preferred implementation of the method shown in FIG. 1, and specifically includes:
Step 201: receive business data from multiple objects;
Step 202: determine whether the information item belongs to a preset type that is cleaned based on a confirmation result;
Step 203: if the information item belongs to the preset type that is cleaned based on a confirmation result, invoke the confirmation result corresponding to the information item and use the confirmation result as the cleaned data of the information item;
Step 204: if the information item does not belong to the preset type that is cleaned based on a confirmation result, continue the determination according to the preset plurality of data cleaning rules;
Step 205: determine whether the information item belongs to a preset type that is cleaned according to the first rule. In practice, the first rule represents a fusion strategy based on data freshness: the business processing times of the information item from multiple sources are compared, and the data with the latest or the earliest business processing time is used as the fused data.
Step 206: if the information item belongs to the preset type that is cleaned according to the first rule, further determine whether the information item belongs to a first type, which is cleaned in order of data generation time from earliest to latest, or to a second type, which is cleaned in order of data generation time from latest to earliest.
The first type cleans data according to the oldest value. Specifically, the business processing times and database entry times of the same basic data are compared, and the data with the earliest business processing time is used as the basic data of the fused data, completing the "one data item, one source" process. The second type cleans data according to the newest value. Specifically, the business processing times and database entry times of the same basic data are compared, and the data with the latest business processing time is used as the basic data of the fused data, completing the "one data item, one source" process. For example, for the registration of a person's marital status, if last year's data from the Social Security Bureau shows unmarried and this year's data from the Ministry of Civil Affairs shows married, the person's marital status field follows the Ministry of Civil Affairs value, married.
Step 207: if the information item belongs to the first type, use the data with the earliest data generation time as the cleaned data of the information item; if the information item belongs to the second type, use the data with the latest data generation time as the cleaned data of the information item.
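The first rule (steps 205 to 207) can be illustrated with a short sketch. The `Record` type, the `clean_by_freshness` function, the `keep` parameter, and the concrete dates are assumptions made for illustration; the description itself only says "last year" and "this year".

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Record(NamedTuple):
    source: str        # object (department) that supplied the value
    value: str
    handled_on: date   # business processing time


def clean_by_freshness(records: List[Record], keep: str = "latest") -> Optional[str]:
    """First rule: keep the value with the latest or earliest processing time."""
    if keep not in ("latest", "earliest"):
        raise ValueError("keep must be 'latest' or 'earliest'")
    if not records:
        return None
    pick = max if keep == "latest" else min
    return pick(records, key=lambda r: r.handled_on).value


# Marital status example; the dates are hypothetical stand-ins.
marital = [
    Record("social_security_bureau", "unmarried", date(2019, 5, 1)),
    Record("ministry_of_civil_affairs", "married", date(2020, 5, 1)),
]
newest = clean_by_freshness(marital, keep="latest")    # second type: "married"
oldest = clean_by_freshness(marital, keep="earliest")  # first type: "unmarried"
```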
Step 208: if the information item does not belong to the type that is cleaned according to the first rule, determine whether the information item belongs to a type that is cleaned according to the second rule;
In practice, the second rule represents a fusion strategy based on extreme values: the data for the same information item from multiple sources is compared, and the data from the commission, office, or bureau with the maximum or minimum field value is used as the fused data. For example, suppose a person's salary is registered with 3 departments: 10,000 at the talent service center, 11,000 at the tax bureau, and 12,000 at the Social Security Bureau. In a tax calculation scenario that must not miss any taxable income, the maximum value (the Social Security Bureau's salary data) should be used as the fused data.
As another example, consider regional data on women's age at first childbirth. Suppose a woman's age at first childbirth is registered with 3 departments: 26 at the Public Security Bureau, 23 at the street office, and 20 at the Health and Family Planning Commission. In a regional health survey of infants born to young mothers that must not miss any case, the minimum registered age (the Health and Family Planning Commission's data) is used as the fused data.
Step 209: if the information item belongs to the preset type that is cleaned according to the second rule, further determine whether the information item belongs to a third type, which is cleaned according to the maximum value in the data of the information item, or to a fourth type, which is cleaned according to the minimum value in the data of the information item.
Specifically, for the maximum value, the specific data for the same basic data is compared, and the data from the commission, office, or bureau with the largest field value is used as the fused data, completing the "one data item, one source" process. For example, if a person's salary is recorded as 10,000 at the Public Security Bureau and 12,000 at the Social Security Bureau, the person's salary data follows the Social Security Bureau. For the minimum value, the specific data for the same basic data is compared, and the data from the commission, office, or bureau with the smallest field value is used as the fused data, completing the "one data item, one source" process. For example, for statistics on age at marriage, the minimum value across the commissions, offices, and bureaus is used as the fused data.
Step 210: if the information item belongs to the third type, use the maximum value in the data of the information item as the cleaned data of the information item; if the information item belongs to the fourth type, use the minimum value in the data of the information item as the cleaned data of the information item.
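The second rule (steps 208 to 210) reduces to taking a maximum or minimum over the values reported by each source. The sketch below replays the salary and first-childbirth examples from the text; the function name `clean_by_extreme`, the `keep` parameter, and the dictionary keys are hypothetical.

```python
from typing import Dict, Tuple


def clean_by_extreme(values_by_source: Dict[str, float], keep: str) -> Tuple[str, float]:
    """Second rule: return (source, value) for the maximum or minimum value."""
    if keep not in ("max", "min"):
        raise ValueError("keep must be 'max' or 'min'")
    if not values_by_source:
        raise ValueError("no candidate values to compare")
    pick = max if keep == "max" else min
    source = pick(values_by_source, key=lambda s: values_by_source[s])
    return source, values_by_source[source]


# Salary example: the tax scenario keeps the maximum (third type).
salaries = {
    "talent_service_center": 10000,
    "tax_bureau": 11000,
    "social_security_bureau": 12000,
}
salary = clean_by_extreme(salaries, keep="max")

# Age at first childbirth: the health survey keeps the minimum (fourth type).
ages = {
    "public_security_bureau": 26,
    "street_office": 23,
    "health_and_family_planning_commission": 20,
}
age = clean_by_extreme(ages, keep="min")
```

Returning the source alongside the value is a design choice of this sketch; it keeps track of which department supplied the fused value.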
Step 211: If the information item is determined not to belong to a type cleaned according to the second rule, go on to determine whether the information item belongs to a type cleaned according to the third rule.
The third rule represents a fusion strategy based on the majority principle: the values of the same information item across the multi-source data are compared, the minority yields to the majority, and the majority value is taken as the fused data. For example, suppose a person's place of residence is registered by 10 source departments, 9 of which record "Shenzhen" and 1 of which records "Guangzhou". Under the majority-principle ("the minority yields to the majority") fusion strategy, "Shenzhen" is finally determined to be the person's place of residence.
Step 212: If the information item belongs to a preset type cleaned according to the third rule, compile statistics on the data of the information item.
Specifically, the values reported for the same information item are compared, the minority yields to the majority, and the majority value is taken as the fused data. This resolves errors confined to a single department's data, for example in place-of-residence information.
Step 213: Take the value that accounts for the largest proportion of the data of the information item as the cleaned data of the information item.
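The counting and selection in steps 212 and 213 can be sketched in Python as follows. The function name `clean_by_majority` is hypothetical. The source does not say how ties are broken; this sketch simply keeps whichever tied value was seen first, which is an assumption.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def clean_by_majority(values: Iterable[Optional[Hashable]]) -> Hashable:
    """Return the value reported by the largest share of sources."""
    # Step 212: count how often each non-empty value occurs.
    counts = Counter(v for v in values if v is not None)
    if not counts:
        raise ValueError("no non-empty values to clean")
    # Step 213: keep the most frequent value (ties: first seen wins).
    value, _ = counts.most_common(1)[0]
    return value
```

With the residence example from the text, nine "Shenzhen" entries and one "Guangzhou" entry yield "Shenzhen".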
Step 214: If the information item is determined not to belong to a type cleaned according to the third rule, clean the data according to the fourth rule; specifically, among the data of the information item, take the data whose source object has the highest priority as the cleaned data of the information item.
Specifically, the fourth rule represents a fusion strategy based on designated source priorities: source priorities are assigned per information item for the multi-source data, and the system fuses the data in priority order. The priorities are set by assigning, for each data item, priority levels to the source data of the different commissions, offices, and bureaus, which determines the final government-affairs data. If a higher-priority source has data, that data prevails; if the higher-priority source's data is empty, the remaining sources are polled in descending priority order until valid data is found, and that data serves as the basis of the fusion.
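The priority polling described for the fourth rule can be sketched in Python as follows. The function name `clean_by_priority`, the mapping layout, and the department names in the usage note are hypothetical; treating `None` and the empty string as "empty" is also an assumption.

```python
from typing import Mapping, Optional, Sequence


def clean_by_priority(
    values_by_source: Mapping[str, Optional[str]],
    priority: Sequence[str],
) -> Optional[str]:
    """Return the value from the highest-priority source that has data."""
    # Poll sources from highest to lowest priority.
    for source in priority:
        value = values_by_source.get(source)
        # Assumption: None and "" both count as empty data.
        if value not in (None, ""):
            return value
    # No listed source supplied usable data.
    return None
```

For instance, with `priority=["police", "civil_affairs", "tax"]`, an empty "police" entry is skipped and the "civil_affairs" value is used.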
For data that expresses the same real-world thing or describes the same metadata, even when the data has a single producer, integrity and local-redundancy problems need to be detected once the data has been aggregated, and the data needs to be graded by quality. Entities are delimited as objectively existing, mutually distinguishable things; entity recognition identifies the records in the aggregated data that refer to the same entity and stores them together. The data within one entity is analyzed in terms of form, semantics, and quantity, and is divided into single-source (isolated-evidence) data and suspicious data. Single-source data published by an authoritative institution is classified as trusted data, while single-source data published by a non-authoritative institution is classified as data awaiting confirmation; suspicious data is data that contradicts natural laws and therefore cannot be confirmed against the entity. Data can be moved between these credibility levels through a data verification mechanism.
For the same information item arriving from multiple sources (such as a person's sex), this embodiment analyzes the data's attributes and features to select automatically the fusion strategy for that information item across sources, and adapts different fusion strategies to different data application scenarios, thereby resolving multi-source data conflicts and achieving data fusion. The data fusion rules include fusion based on a determination result, fusion based on data freshness, fusion based on an extreme value (maximum or minimum), fusion based on the majority principle, and fusion based on designated source priorities. This addresses the massive, multi-source, and heterogeneous nature of government information data and keeps shared and application data usable, so that data sharing and data applications can be realized.
As shown in FIG. 3, a third embodiment of the present invention provides a data cleaning apparatus. It is the apparatus embodiment corresponding to the methods shown in FIG. 1 and FIG. 2, and the explanations given for FIG. 1 and FIG. 2 apply to this embodiment. The apparatus specifically includes:
a data receiving unit 301, configured to receive business data from multiple objects, the business data including multiple information items;
a data judgment unit 302, configured to determine whether the information item belongs to a preset type cleaned based on a determination result;
a data cleaning unit 303, configured to: if the information item belongs to the preset type cleaned based on a determination result, retrieve the determination result corresponding to the information item and use it as the cleaned data of the information item; and if the information item does not belong to the preset type cleaned based on a determination result, clean the information item by applying multiple preset data cleaning rules in sequence to obtain the cleaned data of the information item. The multiple preset data cleaning rules include: a first rule for cleaning according to the data generation time of the information item, a second rule for cleaning according to the maximum or minimum value among the data of the information item, a third rule for cleaning the data of the information item according to the majority principle, and a fourth rule for cleaning according to the priority of the object to which the data of the information item belongs.
In a specific implementation, the data cleaning unit 303 includes:
a first data judgment module (not shown in the figure), configured to determine whether the information item belongs to a preset type cleaned according to the first rule and, if so, to further determine whether the information item belongs to a first type, cleaned in order of data generation time from earliest to latest, or to a second type, cleaned in order of data generation time from latest to earliest;
a first data cleaning module (not shown in the figure), configured to: if the information item belongs to the first type, take the earliest of the data generation times of the information item as the cleaned data of the information item; and if the information item belongs to the second type, take the latest of the data generation times of the information item as the cleaned data of the information item;
a second data judgment module (not shown in the figure), configured to: if the information item is determined not to belong to a type cleaned according to the first rule, go on to determine whether the information item belongs to a type cleaned according to the second rule; and if the information item belongs to a preset type cleaned according to the second rule, further determine whether the information item belongs to a third type, cleaned according to the maximum value among the data of the information item, or to a fourth type, cleaned according to the minimum value among the data of the information item;
a second data cleaning module (not shown in the figure), configured to: if the information item belongs to the third type, take the maximum value among the data of the information item as the cleaned data of the information item; and if the information item belongs to the fourth type, take the minimum value among the data of the information item as the cleaned data of the information item.
Further, the data cleaning unit 303 also includes:
a third data judgment module (not shown in the figure), configured to: if the information item is determined not to belong to a type cleaned according to the second rule, go on to determine whether the information item belongs to a type cleaned according to the third rule;
a third data cleaning module (not shown in the figure), configured to: if the information item belongs to a preset type cleaned according to the third rule, take the value that accounts for the largest proportion of the data of the information item as the cleaned data of the information item;
a fourth data judgment module (not shown in the figure), configured to: if the information item is determined not to belong to a type cleaned according to the third rule, go on to determine whether the information item belongs to a type cleaned according to the fourth rule;
a fourth data cleaning module (not shown in the figure), configured to: if the information item belongs to a preset type cleaned according to the fourth rule, take, among the data of the information item, the data whose source object has the highest priority as the cleaned data of the information item.
The data cleaning apparatus of this embodiment works as follows. The data cleaning unit 303 first tries the "fusion strategy based on authoritatively designated sources" (that is, the "one data item, one source" fusion strategy): the authoritative source for each information item is determined through data research, compiled into a per-information-item list of single-source departments, and consulted during data fusion. If the information item does not match the "fusion strategy based on authoritatively designated sources", the data cleaning unit 303 fuses the data according to the result of attribute and feature analysis; that is, a fusion strategy matching the information item is generated automatically from an analysis of the data's attributes and features.
Under the fusion strategy determined by that analysis, the data cleaning unit 303 first checks, for the information item to be fused, "whether to fuse data by business time" (the first rule); if so, it performs business-time analysis and fuses the data using the "fusion strategy based on data freshness". If the information item does not match the "fusion strategy based on data freshness", the unit checks "whether to fuse data by extreme value" (the second rule); if so, it performs extreme-value analysis and fuses the data using the "fusion strategy based on data extreme values". If the information item does not match the "fusion strategy based on data extreme values", the unit checks "whether to fuse data by the majority principle" (the third rule); if so, it compiles data distribution statistics and fuses the data using the "fusion strategy based on the majority principle". If the information item does not match the "fusion strategy based on the majority principle", the unit fuses the data using the "fusion strategy based on designated priorities" (the fourth rule). Through this analysis and processing, a fusion strategy is matched to the data's attribute features, and multi-source data fusion (with data organized by subject/entity) is carried out automatically.
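The order of checks in this working principle can be summarized in a short Python sketch. Everything named here (`Record`, `ItemConfig`, `fuse`, the field names, and the department names in the usage note) is hypothetical and only illustrates the cascade; the patent does not disclose a concrete data model. The sketch assumes that the authoritative determination result is the value supplied by the designated source, that business times are ISO 8601 strings, and that the time-based rule keeps the value of the earliest or latest record.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Record:
    source: str   # department that supplied the value
    value: Any    # reported value of the information item
    time: str     # business time, ISO 8601 (sorts lexicographically)


@dataclass
class ItemConfig:
    authoritative_source: Optional[str] = None  # "one data item, one source"
    time_order: Optional[str] = None            # "earliest" or "latest" (first rule)
    extreme: Optional[str] = None               # "max" or "min" (second rule)
    majority: bool = False                      # third rule
    priority: Tuple[str, ...] = ()              # fourth rule, highest priority first


def fuse(records: List[Record], cfg: ItemConfig) -> Any:
    """Apply the strategies in the order described above; None if no data."""
    records = [r for r in records if r.value is not None]
    if not records:
        return None
    # Authoritative source takes precedence over all other rules.
    if cfg.authoritative_source is not None:
        for r in records:
            if r.source == cfg.authoritative_source:
                return r.value
        return None  # assumption: no fallback when that source is silent
    # First rule: data freshness (business time).
    if cfg.time_order in ("earliest", "latest"):
        pick = min if cfg.time_order == "earliest" else max
        return pick(records, key=lambda r: r.time).value
    # Second rule: extreme value.
    if cfg.extreme in ("max", "min"):
        pick = max if cfg.extreme == "max" else min
        return pick(r.value for r in records)
    # Third rule: majority principle.
    if cfg.majority:
        return Counter(r.value for r in records).most_common(1)[0][0]
    # Fourth rule (default): designated source priority.
    by_source = {r.source: r.value for r in records}
    for source in cfg.priority:
        if source in by_source:
            return by_source[source]
    return None
```

As a usage example, given records from "police" ("Shenzhen"), "civil_affairs" ("Guangzhou"), and "tax" ("Shenzhen"), `fuse(records, ItemConfig(majority=True))` selects "Shenzhen", whereas `ItemConfig(authoritative_source="civil_affairs")` selects "Guangzhou" regardless of the other sources.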
This embodiment combines business knowledge (data research that determines the single authoritative source department for each data item) with intelligent data analysis to achieve scenario-based multi-source data fusion. Based on the data cleaning rules preset for each information item, the best of several data fusion strategies is selected intelligently, which safeguards the quality of the fused multi-source data. Data attribute and feature analysis and data fusion are automated end to end, which improves the efficiency of data integration development and effectively addresses the completeness, consistency, accuracy, and relatedness problems of business data from multiple objects, thereby improving the quality of government-affairs data.
The present invention also provides a computer-readable storage medium storing a program, the program including instructions for executing the above method.
The present invention also provides a computer including a readable medium storing a computer program, the program including instructions for executing the above method. The computer-readable storage medium and the computer have the same technical effects as the data cleaning method described above, which are not repeated here.
The above are only specific embodiments of the present invention, and the scope of protection of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be that of the claims.

Claims (10)

  1. A data cleaning method, characterized in that the method comprises:
    receiving business data from multiple objects, the business data including multiple information items;
    performing data cleaning on each information item in turn, the data cleaning comprising:
    determining whether the information item belongs to a preset type cleaned based on a determination result;
    if the information item belongs to the preset type cleaned based on a determination result, retrieving the determination result corresponding to the information item and using the determination result as the cleaned data of the information item;
    if the information item does not belong to the preset type cleaned based on a determination result, cleaning the information item by applying multiple preset data cleaning rules in sequence to obtain the cleaned data of the information item; the multiple preset data cleaning rules comprising: a first rule for cleaning according to the data generation time of the information item, a second rule for cleaning according to the maximum or minimum value among the data of the information item, a third rule for cleaning the data of the information item according to the majority principle, and a fourth rule for cleaning according to the priority of the object to which the data of the information item belongs.
  2. The data cleaning method according to claim 1, characterized in that the step of cleaning the information item by applying the multiple preset data cleaning rules in sequence to obtain the cleaned data of the information item comprises:
    determining whether the information item belongs to a preset type cleaned according to the first rule;
    if the information item belongs to the preset type cleaned according to the first rule, further determining whether the information item belongs to a first type, cleaned in order of data generation time from earliest to latest, or to a second type, cleaned in order of data generation time from latest to earliest;
    if the information item belongs to the first type, taking the earliest of the data generation times of the information item as the cleaned data of the information item;
    if the information item belongs to the second type, taking the latest of the data generation times of the information item as the cleaned data of the information item.
  3. The data cleaning method according to claim 2, characterized in that the step of cleaning the information item by applying the multiple preset data cleaning rules in sequence to obtain the cleaned data of the information item comprises:
    if the information item is determined not to belong to a type cleaned according to the first rule, going on to determine whether the information item belongs to a type cleaned according to the second rule;
    if the information item belongs to a preset type cleaned according to the second rule, further determining whether the information item belongs to a third type, cleaned according to the maximum value among the data of the information item, or to a fourth type, cleaned according to the minimum value among the data of the information item;
    if the information item belongs to the third type, taking the maximum value among the data of the information item as the cleaned data of the information item;
    if the information item belongs to the fourth type, taking the minimum value among the data of the information item as the cleaned data of the information item.
  4. The data cleaning method according to claim 3, characterized in that the step of cleaning the information item by applying the multiple preset data cleaning rules in sequence to obtain the cleaned data of the information item comprises: if the information item is determined not to belong to a type cleaned according to the second rule, going on to determine whether the information item belongs to a type cleaned according to the third rule;
    if the information item belongs to a preset type cleaned according to the third rule, taking the value that accounts for the largest proportion of the data of the information item as the cleaned data of the information item.
  5. The data cleaning method according to claim 4, characterized in that the step of cleaning the information item by applying the multiple preset data cleaning rules in sequence to obtain the cleaned data of the information item comprises:
    if the information item is determined not to belong to a type cleaned according to the third rule, going on to determine whether the information item belongs to a type cleaned according to the fourth rule;
    if the information item belongs to a preset type cleaned according to the fourth rule, taking, among the data of the information item, the data whose source object has the highest priority as the cleaned data of the information item.
  6. A data cleaning apparatus, characterized in that it comprises:
    a data receiving unit, configured to receive business data from multiple objects, the business data including multiple information items;
    a data judgment unit, configured to determine whether the information item belongs to a preset type cleaned based on a determination result;
    a data cleaning unit, configured to: if the information item belongs to the preset type cleaned based on a determination result, retrieve the determination result corresponding to the information item and use the determination result as the cleaned data of the information item; and if the information item does not belong to the preset type cleaned based on a determination result, clean the information item by applying multiple preset data cleaning rules in sequence to obtain the cleaned data of the information item; the multiple preset data cleaning rules comprising: a first rule for cleaning according to the data generation time of the information item, a second rule for cleaning according to the maximum or minimum value among the data of the information item, a third rule for cleaning the data of the information item according to the majority principle, and a fourth rule for cleaning according to the priority of the object to which the data of the information item belongs.
  7. The data cleaning apparatus according to claim 6, characterized in that the data cleaning unit comprises:
    a first data judgment module, configured to determine whether the information item belongs to a preset type cleaned according to the first rule and, if so, to further determine whether the information item belongs to a first type, cleaned in order of data generation time from earliest to latest, or to a second type, cleaned in order of data generation time from latest to earliest;
    a first data cleaning module, configured to: if the information item belongs to the first type, take the earliest of the data generation times of the information item as the cleaned data of the information item; and if the information item belongs to the second type, take the latest of the data generation times of the information item as the cleaned data of the information item;
    a second data judgment module, configured to: if the information item is determined not to belong to a type cleaned according to the first rule, go on to determine whether the information item belongs to a type cleaned according to the second rule; and if the information item belongs to a preset type cleaned according to the second rule, further determine whether the information item belongs to a third type, cleaned according to the maximum value among the data of the information item, or to a fourth type, cleaned according to the minimum value among the data of the information item;
    a second data cleaning module, configured to: if the information item belongs to the third type, take the maximum value among the data of the information item as the cleaned data of the information item; and if the information item belongs to the fourth type, take the minimum value among the data of the information item as the cleaned data of the information item.
  8. The data cleaning apparatus according to claim 7, characterized in that the data cleaning unit further comprises:
    a third data judgment module, configured to: if the information item is determined not to belong to a type cleaned according to the second rule, go on to determine whether the information item belongs to a type cleaned according to the third rule;
    a third data cleaning module, configured to: if the information item belongs to a preset type cleaned according to the third rule, take the value that accounts for the largest proportion of the data of the information item as the cleaned data of the information item;
    a fourth data judgment module, configured to: if the information item is determined not to belong to a type cleaned according to the third rule, go on to determine whether the information item belongs to a type cleaned according to the fourth rule;
    a fourth data cleaning module, configured to: if the information item belongs to a preset type cleaned according to the fourth rule, take, among the data of the information item, the data whose source object has the highest priority as the cleaned data of the information item.
  9. A computer-readable storage medium storing a program, characterized in that the program includes instructions for executing the method according to any one of claims 1 to 5.
  10. A computer comprising a readable medium storing a computer program, characterized in that the program includes instructions for executing the method according to any one of claims 1 to 5.
PCT/CN2020/138010 2020-01-17 2020-12-21 Data cleaning method and apparatus WO2021143463A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010051037.2A CN111291029B (en) 2020-01-17 2020-01-17 Data cleaning method and device
CN202010051037.2 2020-01-17

Publications (1)

Publication Number Publication Date
WO2021143463A1 true WO2021143463A1 (en) 2021-07-22

Family

ID=71023404

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/138010 WO2021143463A1 (en) 2020-01-17 2020-12-21 Data cleaning method and apparatus

Country Status (2)

Country Link
CN (1) CN111291029B (en)
WO (1) WO2021143463A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291029B (en) * 2020-01-17 2024-03-08 深圳市华傲数据技术有限公司 Data cleaning method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160147798A1 (en) * 2014-11-25 2016-05-26 International Business Machines Corporation Data cleansing and governance using prioritization schema
CN107657049A (en) * 2017-09-30 2018-02-02 深圳市华傲数据技术有限公司 A kind of data processing method based on data warehouse
CN109634949A (en) * 2018-12-28 2019-04-16 浙江大学 A kind of blended data cleaning method based on more versions of data
CN110196912A (en) * 2019-04-15 2019-09-03 贵州电网有限责任公司 A kind of power grid archives parallel model construction method based on trust regular network
CN110597793A (en) * 2019-07-30 2019-12-20 深圳市华傲数据技术有限公司 Data management method and device, electronic equipment and computer readable storage medium
CN111291029A (en) * 2020-01-17 2020-06-16 深圳市华傲数据技术有限公司 Data cleaning method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150019303A1 (en) * 2013-07-11 2015-01-15 Bank Of America Corporation Data quality integration
CN107193858B (en) * 2017-03-28 2018-09-11 福州金瑞迪软件技术有限公司 Intelligent Service application platform and method towards multi-source heterogeneous data fusion
CN109711685A (en) * 2018-12-14 2019-05-03 杨冰之 A kind of government affairs big data processing platform

Also Published As

Publication number Publication date
CN111291029B (en) 2024-03-08
CN111291029A (en) 2020-06-16

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20913977

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN EP: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 071222)

122 EP: PCT application non-entry in European phase

Ref document number: 20913977

Country of ref document: EP

Kind code of ref document: A1