WO2021143463A1

WO2021143463A1 - Data cleaning method and apparatus

Info

Publication number: WO2021143463A1
Application number: PCT/CN2020/138010
Authority: WO
Inventors: 胡云; 龚健; 李邱林; 唐明辉; 贾西贝
Original assignee: 深圳市华傲数据技术有限公司
Priority date: 2020-01-17
Filing date: 2020-12-21
Publication date: 2021-07-22
Also published as: CN111291029B; CN111291029A

Abstract

A data cleaning method and apparatus, the method comprising: receiving service data from multiple objects, the service data comprising multiple information items (101); and performing data cleaning on each information item in turn, the data cleaning comprising (102): determining whether an information item belongs to a preset type for cleaning based on a confirmation result (102a); if the information item belongs to the preset type for cleaning based on a confirmation result, then invoking the confirmation result corresponding to the information item, and using the confirmation result as the cleaned data of the information item (102b); and, if the information item does not belong to the preset type for cleaning based on a confirmation result, then cleaning the information item in turn on the basis of a preset plurality of data cleaning rules to obtain cleaned data of the information item (102c). The present invention implements unified data output for multi-object service data cleaning, solving the problem that conflicting data from multiple objects are difficult to fuse.

Description

Data cleaning method and device

Technical field

The invention relates to the field of data processing, in particular to a data cleaning method and device.

Background art

Government data collection currently has the following characteristics.

First, data collection is difficult. Government business is extremely complex. There are dozens of directly affiliated departments, such as the Public Security Bureau, the Health and Family Planning Commission, the Human Resources and Social Security Bureau, the Civil Affairs Bureau, the Market Supervision Commission, the Transportation Commission, and the Provident Fund Center, as well as several district-level and county-level units. These commissions, offices, and bureaus correspond to dozens of lists of rights and responsibilities and dozens of core systems, which generate a large amount of electronic data every day. The government can also access a large amount of external data, such as data related to water, electricity, gas, telecommunications, and banking. Besides structured data, government departments hold a large amount of unstructured data, including electronic files of various licenses, pictures, office documents, videos, and compressed files. Smart city construction additionally requires sufficient data collected from the Internet of Things. For all of these files, both the storage problem and the usage problem must be solved. To improve the social management and urban governance capabilities of government departments, it is necessary to improve the storage, analysis, and computation capabilities for unstructured data, to share and integrate the business data of the various commissions, offices, and bureaus, and to use data to assist management and decision-making. Integrating so many complex departments and so much business data into a unified resource database is extremely difficult. Government departments urgently need industry solutions to improve their comprehensive management and control of government affairs data.

Second, data quality control is difficult: data standards differ and data quality is poor. Government departments have many subordinate commissions, offices, and bureaus, whose business systems are basically built, operated, and maintained in a decentralized way, without unified planning at the government level. Although government information resource catalogs and data element specifications exist at the national level, the construction of standards lags behind, and there are major problems in promoting and implementing them. As a result, the business departments of the various commissions, offices, and bureaus do not reference unified government data standards, data elements are defined inconsistently, and data collection and entry are irregular, so the data quality of the various commissions, offices, and bureaus is poor and is difficult to bring up to a unified data standard. To build a smart city and improve the integration and sharing of government affairs data among commissions, offices, and bureaus, establishing a unified data standard and a data quality monitoring system is the top priority. Without standardized quality monitoring and data standards, the data collected by government departments will only be disorganized and cannot deliver the value of government data. Establishing a city-level data center requires sound standards management and quality management of government data.

Third, data fusion is difficult because government data sources are diverse. The business of government departments is complex, and the government information resources managed by different commissions, offices, and bureaus overlap considerably, for example in basic information about citizens, legal persons, houses, and spatial geography. Different commissions, offices, and bureaus hold all or part of the relevant data, and their data standards and data definitions differ considerably. Even different systems within the same commission, office, or bureau hold different data for the same object. Government information resources therefore have a multi-source problem. Choosing the most accurate and suitable data from the many data sources is a serious test of how government departments understand and process government affairs and government affairs data.

Fourth, it is difficult to collect data in real time. Government data governance projects are currently in full swing, but most of them address the migration and storage of historical data. Relevant business management information is difficult to obtain in real time, and the lack of real-time data acquisition greatly affects government administrative efficiency. As government efficiency improves, the required speed of response to data also increases. For example, grid inspectors collect events, which are quickly transferred to the fusion library, put through simple cleaning and fusion, associated with further information (such as enterprise information), and then distributed to grid processing personnel; after the grid processing personnel have handled them, the dynamically updated data flows back to the integration platform. This entire data processing process is usually kept within 1 minute.

Fifth, data application is difficult. Past government affairs information systems and government data warehouse projects focused on collecting and integrating the data of individual departments and on statistical analysis of internal data. They could not make citizens directly feel any improvement in the efficiency and service quality of government departments' administrative affairs. Citizens still need to run errands and prepare many materials when dealing with government affairs, and even encounter situations where government departments shift responsibility to one another, which greatly consumes citizens' time and energy. The public eagerly hopes that data can be shared and integrated among government departments so that the public can have a better government service experience. Government departments also hope to improve their ability to control government affairs data, tap more of its application value, promote its open sharing, and improve government governance capabilities and service levels.

Therefore, there is an urgent need for a data cleaning method and device that solve the problem that conflicting data from multiple objects are difficult to fuse.

Summary of the invention

In view of this, the present invention provides a data cleaning method and device that clean business data from multiple objects into a unified data output and solve the problem that conflicting data from multiple objects are difficult to fuse.

In a first aspect, the present invention provides a data cleaning method, the method including: receiving business data from multiple objects, the business data including multiple information items; and performing data cleaning on each information item in turn, the data cleaning including: judging whether the information item belongs to a preset type that is cleaned based on an identification result; if the information item belongs to the preset type that is cleaned based on the identification result, calling the identification result corresponding to the information item and using the identification result as the cleaned data of the information item; and if the information item does not belong to the preset type that is cleaned based on the identification result, cleaning the information item in sequence according to multiple preset data cleaning rules to obtain the cleaned data of the information item. The multiple preset data cleaning rules include: a first rule for cleaning according to the data generation time of the information item, a second rule for cleaning according to the maximum or minimum value in the data of the information item, a third rule for cleaning according to the principle that the minority obeys the majority in the data of the information item, and a fourth rule for cleaning according to the priority of the object to which the data of the information item belongs.

In a second aspect, the present invention provides a data cleaning device, including: a data receiving unit for receiving business data from multiple objects, the business data including multiple information items; a data judging unit for judging whether the information item belongs to a preset type that is cleaned based on an identification result; and a data cleaning unit for, if the information item belongs to the preset type that is cleaned based on the identification result, calling the identification result corresponding to the information item and using the identification result as the cleaned data of the information item, and, if the information item does not belong to the preset type that is cleaned based on the identification result, cleaning the information item in sequence according to multiple preset data cleaning rules to obtain the cleaned data of the information item. The multiple preset data cleaning rules include: a first rule for cleaning according to the data generation time of the information item, a second rule for cleaning according to the maximum or minimum value in the data of the information item, a third rule for cleaning according to the principle that the minority obeys the majority in the data of the information item, and a fourth rule for cleaning according to the priority of the object to which the data of the information item belongs.

In a third aspect, the present invention provides a computer-readable storage medium that stores a program, and the program includes instructions for executing the above-mentioned data cleaning method.

In a fourth aspect, the present invention provides a computer including a readable medium storing a computer program, the program including instructions for executing the above data cleaning method.

The data cleaning method and device of the present invention determine, for business data from multiple objects, the data cleaning rule corresponding to each information item in the business data and then clean the data according to that rule. This yields a unified data output from cleaning the business data of multiple objects and solves the problem that conflicting data from multiple objects are difficult to fuse.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention. Those of ordinary skill in the art can obtain other drawings from these drawings without creative work.

FIG. 1 is a flowchart of a data cleaning method provided by the first embodiment of the present invention;

FIG. 2 is a flowchart of a data cleaning method provided by a second embodiment of the present invention;

FIG. 3 is a structural block diagram of a data cleaning device provided by a third embodiment of the present invention.

Detailed description

The embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

It should be noted that the following embodiments and the features in the embodiments can be combined with each other if there is no conflict; and all other embodiments that those of ordinary skill in the art obtain from the embodiments in the present disclosure without creative work fall within the protection scope of the present disclosure.

It should be noted that various aspects of the embodiments within the scope of the appended claims are described below. It should be obvious that the aspects described herein can be embodied in a wide variety of forms, and any specific structure and/or function described herein are only illustrative. Based on the present disclosure, those skilled in the art should understand that one aspect described herein can be implemented independently of any other aspects, and two or more of these aspects can be combined in various ways. For example, any number of aspects set forth herein can be used to implement devices and/or methods of practice. In addition, other structures and/or functionalities other than one or more of the aspects set forth herein may be used to implement this device and/or practice this method.

As shown in FIG. 1, a data cleaning method provided by the first embodiment of the present invention includes:

Step 101: Receive business data from multiple objects, where the business data includes multiple information items;

Step 102: Perform data cleaning on each information item in sequence, and the data cleaning specifically includes:

Step 102a: Determine whether the information item belongs to a preset type that is cleaned based on the identification result;

The identification result may specifically be a result of authoritative identification: for certain information items, such as gender, the authoritative source unit of the information item is identified through data research under the "one number one source" principle, and the result determined on that basis is used to fuse the data of multiple objects (multiple departments, also called multiple sources).

Step 102b: If the information item belongs to the preset type that is cleaned based on the identification result, call the identification result corresponding to the information item, and use the identification result as the cleaned data of the information item;

Step 102c: If the information item does not belong to the preset type that is cleaned based on the identification result, the information item is cleaned in sequence according to multiple preset data cleaning rules to obtain the cleaned data of the information item. The multiple preset data cleaning rules include: a first rule for cleaning according to the data generation time of the information item, a second rule for cleaning according to the maximum or minimum value in the data of the information item, a third rule for cleaning according to the principle that the minority obeys the majority in the data of the information item, and a fourth rule for cleaning according to the priority of the object to which the data of the information item belongs.

In this embodiment, the data cleaning rule corresponding to each information item in the business data from multiple objects is determined and used to clean the data. This yields a unified data output from cleaning the business data of multiple objects and solves the problem that conflicting data from multiple objects are difficult to fuse.
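The branch between steps 102b and 102c can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `clean_item`, the mapping `authoritative_source` (information item to its identified source department), and the department names are all hypothetical, and the sketch assumes the identification result is simply the value registered by the identified authoritative department.

```python
from typing import Callable, Dict, Hashable

# Hypothetical shape: source department -> value registered for one information item.
Values = Dict[str, Hashable]


def clean_item(
    item: str,
    values_by_source: Values,
    authoritative_source: Dict[str, str],
    clean_by_rules: Callable[[Values], Hashable],
) -> Hashable:
    """Steps 102a to 102c for a single information item."""
    # Step 102a: is the item preset as "cleaned based on the identification result"?
    source = authoritative_source.get(item)
    if source is not None:
        # Step 102b: use the value of the identified authoritative source.
        # Assumption: that source always supplies a value (KeyError otherwise).
        return values_by_source[source]
    # Step 102c: otherwise fall back to the preset cleaning rules.
    return clean_by_rules(values_by_source)


# Hypothetical configuration: gender follows the public security bureau.
authoritative_source = {"gender": "public_security_bureau"}
gender = clean_item(
    "gender",
    {"public_security_bureau": "female", "social_security_bureau": "male"},
    authoritative_source,
    clean_by_rules=lambda values: None,  # not reached for "gender"
)
```

Passing the rule chain in as a callable keeps the sketch independent of how the four rules themselves are implemented.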

As shown in FIG. 2, a data cleaning method provided by the second embodiment of the present invention is a preferred implementation of the method shown in FIG. 1, and specifically includes:

Step 201: Receive business data from multiple objects;

Step 202: Determine whether the information item belongs to a preset type that is cleaned based on the identification result;

Step 203: If the information item belongs to the preset type that is cleaned based on the identification result, call the identification result corresponding to the information item, and use the identification result as the cleaned data of the information item;

Step 204: If the information item does not belong to the preset type to be cleaned based on the identification result, continue to judge according to multiple preset data cleaning rules;

Step 205: Determine whether the information item belongs to a preset type that is cleaned according to the first rule. In specific operations, the first rule characterizes the fusion strategy based on data freshness: the business processing times of the information item from multiple sources are compared, and the data with the latest or the earliest business processing time is taken as the fusion data.

Step 206: If the information item belongs to the preset type that is cleaned according to the first rule, continue to determine whether the information item belongs to a first type, which is cleaned according to the data generation time of the information item in order from earliest to latest, or to a second type, which is cleaned according to the data generation time of the information item in order from latest to earliest.

The first type cleans data based on the oldest value. Specifically, the business processing time and warehousing time of the same basic data are compared, and the data with the earliest business processing time is used as the basic data of the fusion data, completing the "one number one source" process. The second type cleans data based on the latest value. Specifically, the business processing time and warehousing time of the same basic data are compared, and the data with the latest business processing time is used as the basic data of the fusion data, completing the "one number one source" process. For example, for a person's registered marital status, if last year's data from the Social Security Bureau shows unmarried and this year's data from the Ministry of Civil Affairs shows married, the marital status field for that person follows the Ministry of Civil Affairs data, that is, married.

Step 207: If the information item belongs to the first type, the data with the earliest data generation time among the data of the information item is taken as the cleaned data of the information item; if the information item belongs to the second type, the data with the latest data generation time among the data of the information item is taken as the cleaned data of the information item.
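As a hedged illustration of steps 205 to 207, the freshness rule can be sketched in Python as follows. The `Record` class, the function name `clean_by_freshness`, and the concrete dates are assumptions made for the example; the text only states that one record is from last year and the other from this year.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Record:
    source: str            # department that registered the value
    value: str             # registered value of the information item
    business_time: date    # business processing time of this record


def clean_by_freshness(records: List[Record], keep: str = "latest") -> Record:
    """First rule: keep the record with the earliest or latest business time."""
    if not records:
        raise ValueError("no records to clean")
    if keep not in ("earliest", "latest"):
        raise ValueError("keep must be 'earliest' or 'latest'")
    pick = max if keep == "latest" else min
    return pick(records, key=lambda r: r.business_time)


# Marital status example from the text; the dates are hypothetical.
records = [
    Record("social_security_bureau", "unmarried", date(2019, 6, 1)),
    Record("ministry_of_civil_affairs", "married", date(2020, 3, 1)),
]
fused = clean_by_freshness(records, keep="latest")  # second type: latest value
```

With `keep="latest"` the record from the Ministry of Civil Affairs is selected, which matches the outcome described above; `keep="earliest"` corresponds to the first type.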

Step 208: If it is judged that the information item does not belong to the type cleaned according to the first rule, continue to judge whether the information item belongs to the type cleaned according to the second rule;

In specific operations, the second rule characterizes the fusion strategy based on the extreme value of the data: the data of the same information item from multiple sources are compared, and the data of the commission, office, or bureau with the maximum or minimum field value is used as the fusion data. For example, a person's salary is registered with three departments: 10,000 at the talent service center, 11,000 at the tax bureau, and 12,000 at the social security bureau. A tax calculation analysis scenario requires that no tax be evaded, so the maximum value (that is, the salary data of the Social Security Bureau) should be taken as the fusion data for that person's salary.

Similarly, suppose a woman's age at first childbirth is registered with three departments: 26 at the public security bureau, 23 at the street office, and 20 at the Health and Family Planning Commission. A regional health survey of early childbearing requires that no case be omitted, so the minimum registered age at first childbirth (that is, the data from the Health and Family Planning Commission) is used as the fusion data.

Step 209: If the information item belongs to the preset type that is cleaned according to the second rule, continue to determine whether the information item belongs to a third type, which is cleaned according to the maximum value in the data of the information item, or to a fourth type, which is cleaned according to the minimum value in the data of the information item.

Specifically, for the maximum value, the specific data of the same basic data are compared, and the data of the commission, office, or bureau with the largest field value is used as the fusion data, completing the "one number one source" process. For example, if a person's salary is recorded as 10,000 by the Public Security Bureau and 12,000 by the Social Security Bureau, the salary data for that person follows the Social Security Bureau. For the minimum value, the specific data of the same basic data are compared, and the data of the commission, office, or bureau with the smallest field value is used as the fusion data, completing the "one number one source" process. For example, for marriage age statistics, the minimum value among the data of the commissions, offices, and bureaus is used as the fusion data.

Step 210: If the information item belongs to the third type, the maximum value among the data of the information item is taken as the cleaned data of the information item; if the information item belongs to the fourth type, the minimum value among the data of the information item is taken as the cleaned data of the information item.
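The extreme-value rule of steps 208 to 210 can be sketched in Python using the salary and age figures given above. The function name `clean_by_extreme`, the `mode` parameter, and the dictionary keys are hypothetical names chosen for this example, not part of the patent.

```python
from typing import Dict, Tuple


def clean_by_extreme(values_by_source: Dict[str, float], mode: str) -> Tuple[str, float]:
    """Second rule: return (source, value) holding the maximum or minimum value."""
    if not values_by_source:
        raise ValueError("no values to clean")
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    # Compare sources by their registered value.
    source = pick(values_by_source, key=values_by_source.__getitem__)
    return source, values_by_source[source]


# Third type (maximum): salary registered with three departments.
salary = clean_by_extreme(
    {"talent_service_center": 10000, "tax_bureau": 11000, "social_security_bureau": 12000},
    mode="max",
)

# Fourth type (minimum): age at first childbirth registered with three departments.
age = clean_by_extreme(
    {"public_security_bureau": 26, "street_office": 23, "health_and_family_planning_commission": 20},
    mode="min",
)
```

Returning the source together with the value keeps track of which department's data was adopted, which the text treats as part of the fusion result.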

Step 211: If it is judged that the information item does not belong to the type cleaned according to the second rule, continue to judge whether the information item belongs to the type cleaned according to the third rule;

The third rule characterizes the fusion strategy based on the majority principle: the data of the same information item from multiple sources are compared, the minority obeys the majority, and the majority value is used as the fusion data. For example, a person's place of residence is registered with 10 source departments, of which 9 register "Shenzhen" and one registers "Guangzhou"; under the fusion strategy based on the majority principle ("the minority obeys the majority"), "Shenzhen" is finally determined as the residence information.

Step 212: If the information item belongs to a preset type that is cleaned according to the third rule, compile statistics on the data of the information item;

Specifically, identical values are compared, the minority obeys the majority, and the majority value is used as the fusion data, which addresses data errors in a single department, for example in residence information.

Step 213: Use the data with the largest proportion among the data of the information item as the data after the information item is cleaned.
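Steps 212 and 213 amount to a frequency count followed by selecting the most common value. A minimal Python sketch, with the hypothetical function name `clean_by_majority`, could look like this; the patent does not say how ties are resolved, so the sketch simply inherits the behavior of `Counter.most_common`, which is an assumption.

```python
from collections import Counter
from typing import Hashable, Iterable


def clean_by_majority(values: Iterable[Hashable]) -> Hashable:
    """Third rule: return the value registered by the largest share of sources."""
    counts = Counter(values)          # step 212: statistics on the data
    if not counts:
        raise ValueError("no values to clean")
    # Step 213: the value with the largest proportion wins.
    # Tie handling is not specified in the text; most_common keeps
    # first-encountered order among equal counts.
    return counts.most_common(1)[0][0]


# Residence example: 9 source departments say "Shenzhen", 1 says "Guangzhou".
residence = clean_by_majority(["Shenzhen"] * 9 + ["Guangzhou"])
```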

Step 214: If it is determined that the information item does not belong to the type that is cleaned according to the third rule, perform data cleaning according to the fourth rule; specifically, the data whose source object has the highest priority among the data of the information item is taken as the cleaned data of the information item.

Specifically, the fourth rule characterizes the fusion strategy based on a designated priority source: a source priority is specified for each information item of the multi-source data, that is, priority levels are assigned to the source data of different commissions, offices, and bureaus for different data items, and the system fuses the data in order of that priority. If the highest-priority source has data, that data prevails. If its data is empty, the remaining sources are polled in priority order, and the first valid data obtained is used as the basic data of the fusion.
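The priority polling described for the fourth rule can be sketched as a simple loop over sources in priority order. The function name `clean_by_priority`, the department names, and the choice to treat `None` and the empty string as "empty" are assumptions made for this example.

```python
from typing import Dict, List, Optional


def clean_by_priority(
    values_by_source: Dict[str, Optional[str]],
    priority: List[str],
) -> Optional[str]:
    """Fourth rule: take the first non-empty value in source-priority order."""
    for source in priority:
        value = values_by_source.get(source)
        # Assumption: None and "" both count as "empty" data.
        if value is not None and value != "":
            return value
    return None  # no source supplied valid data


# Hypothetical priority list for one information item, highest priority first.
priority = ["public_security_bureau", "civil_affairs_bureau", "street_office"]
fused = clean_by_priority(
    {"public_security_bureau": None, "civil_affairs_bureau": "Shenzhen", "street_office": "Guangzhou"},
    priority,
)
```

Here the highest-priority source is empty, so the sketch polls the next source and adopts its value.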

When the same thing is expressed, or metadata is described at the level of the natural world, the data has only one producer; after data aggregation, however, data quality must be assessed in terms of data integrity and local redundancy. Entities are decomposed according to things that exist objectively and can be distinguished from one another. Entity recognition identifies identical entities in the aggregated data and stores them uniformly. The form, semantics, and quantity of the data within the same entity are analyzed, and the data is decomposed into isolated-evidence data and suspicious data. Isolated-evidence data is classified by source: data released by an authoritative institution is credible data, and data released by a non-authoritative institution is data to be confirmed. Suspicious data is data that violates the laws of nature and cannot be confirmed by the entity. Through the data verification mechanism, these data can be converted between data credibility levels.

This embodiment addresses the same information item from multiple sources (such as a person's gender). For multi-source data fusion, the fusion strategy for an information item from different sources is selected automatically by analyzing the attributes and features of the data, and different fusion strategies are adapted to different data application scenarios, so that conflicts in multi-source data are resolved and data fusion is achieved. The data fusion rules include fusion based on the identification result, fusion based on data freshness, fusion based on the extreme value (maximum or minimum), fusion based on the majority principle, and fusion based on a designated priority source. This addresses the massive, multi-source, and heterogeneous nature of government information data and ensures the availability of data for sharing and application, thereby enabling data sharing and data application.

As shown in FIG. 3, a data cleaning device provided by the third embodiment of the present invention is a device embodiment corresponding to the methods shown in FIG. 1 and FIG. 2, and the explanations of FIG. 1 and FIG. 2 apply to this embodiment. Specifically:

The data receiving unit 301 is configured to receive business data from multiple objects, where the business data includes multiple information items;

The data judging unit 302 is used to judge whether the information item belongs to a preset type that is cleaned based on the identification result;

The data cleaning unit 303 is configured to, if the information item belongs to the preset type that is cleaned based on the identification result, call the identification result corresponding to the information item and use the identification result as the cleaned data of the information item, and, if the information item does not belong to the preset type that is cleaned based on the identification result, clean the information item in sequence according to multiple preset data cleaning rules to obtain the cleaned data of the information item. The multiple preset data cleaning rules include: a first rule for cleaning according to the data generation time of the information item, a second rule for cleaning according to the maximum or minimum value in the data of the information item, a third rule for cleaning according to the principle that the minority obeys the majority in the data of the information item, and a fourth rule for cleaning according to the priority of the object to which the data of the information item belongs.

During specific operations, the data cleaning unit 303 includes:

The first data judgment module (not shown in the figure) is used to judge whether the information item belongs to the preset type that is cleaned according to the first rule and, if so, to continue to determine whether the information item belongs to the first type, which is cleaned according to the data generation time of the information item in order from earliest to latest, or to the second type, which is cleaned according to the data generation time of the information item in order from latest to earliest;

The first data cleaning module (not shown in the figure) is configured to, if the information item belongs to the first type, take the data with the earliest data generation time among the data of the information item as the cleaned data of the information item, and, if the information item belongs to the second type, take the data with the latest data generation time among the data of the information item as the cleaned data of the information item;

The second data judgment module (not shown in the figure) is used to, if it is judged that the information item does not belong to the type that is cleaned according to the first rule, continue to judge whether the information item belongs to the type that is cleaned according to the second rule and, if so, to continue to determine whether the information item belongs to the third type, which is cleaned according to the maximum value in the data of the information item, or to the fourth type, which is cleaned according to the minimum value in the data of the information item;

The second data cleaning module (not shown in the figure) is configured to, if the information item belongs to the third type, take the maximum value among the data of the information item as the cleaned data of the information item, and, if the information item belongs to the fourth type, take the minimum value among the data of the information item as the cleaned data of the information item.

Further, the data cleaning unit 303 further includes:

The third data judgment module (not shown in the figure) is used to judge that the information item does not belong to the type that is cleaned according to the second rule, and then continues to judge whether the information item belongs to the cleaned according to the third rule type;

The third data cleaning module (not shown in the figure) is configured to, if the information item belongs to a preset type that is cleaned according to the third rule, set the largest proportion of the data in the information item The data of is used as the cleaned data of the information item;

The third data judging module (not shown in the figure) is used to determine whether the information item belongs to the type that is cleaned according to the fourth rule if it is judged that the information item does not belong to the type that is cleaned according to the fourth rule. Type of cleaning;

The fourth data cleaning module (not shown in the figure) is configured to, if the information item belongs to the preset type cleaned according to the fourth rule, use the data whose object has the highest priority among the data of the information item as the cleaned data of the information item.
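As a non-limiting illustration, the third rule (majority principle) and the fourth rule (object priority) can be sketched in Python as follows. The names `clean_by_majority`, `clean_by_priority`, and `rank`, the representation of each datum as an (object, value) pair, and the convention that a smaller rank number means a higher priority are all assumptions of this sketch rather than features stated in the embodiment.

```python
from collections import Counter
from typing import Hashable, Mapping, Sequence, Tuple

# One datum of an information item: (name of the supplying object, value).
Record = Tuple[str, Hashable]


def clean_by_majority(records: Sequence[Record]) -> Hashable:
    """Third rule (sketch): keep the value reported by the most objects."""
    if not records:
        raise ValueError("no records to clean")
    counts = Counter(value for _, value in records)
    value, _ = counts.most_common(1)[0]
    return value


def clean_by_priority(records: Sequence[Record], rank: Mapping[str, int]) -> Hashable:
    """Fourth rule (sketch): keep the value from the highest-priority object.

    rank maps an object name to a number; a smaller number means a higher
    priority. An object missing from rank raises KeyError.
    """
    if not records:
        raise ValueError("no records to clean")
    _, value = min(records, key=lambda rec: rank[rec[0]])
    return value
```

With the hypothetical records `[("civil_affairs", "A"), ("public_security", "B"), ("social_security", "A")]`, the majority rule keeps `"A"`, while the priority rule keeps `"B"` when `public_security` is ranked first.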

The working principle of the data cleaning device of this embodiment is as follows. The data cleaning unit 303 preferentially fuses data with the "integration strategy based on the authoritative identification source" (that is, the "one number, one source" integration strategy; the authoritative identification source is determined through data research, which produces a list of source departments for the information items, and this list is called during data fusion). If the information item does not match the "integration strategy based on the authoritative identification source", the data cleaning unit 303 fuses the data according to the attribute feature analysis result; that is, a fusion strategy matching the information item is generated automatically from an analysis of the attributes and features of the data. Under the fusion strategy determined by the attribute feature analysis result, the data cleaning unit 303 first checks, for the information item data to be fused, "whether the data is fused according to business time" (the first rule); if so, it performs business time analysis and fuses the data with the "fusion strategy based on data freshness". If the information item does not match the "fusion strategy based on data freshness", it checks "whether the data is fused according to the maximum or minimum value of the data" (the second rule); if so, it performs extreme value analysis and fuses the data with the "fusion strategy based on the extreme value of the data". If the information item does not match the "fusion strategy based on the extreme value of the data", it checks "whether the data is fused according to the majority principle" (the third rule); if so, it performs data distribution statistics and fuses the data with the "fusion strategy based on the majority principle". If the information item does not match the "fusion strategy based on the majority principle", it fuses the data with the "fusion strategy based on the designated priority" (the fourth rule).
Through this analysis and processing, the matching fusion strategy is derived from the attribute characteristics of the data, and multi-source data fusion (organizing data by subject/entity) is realized automatically.
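The order in which the data cleaning unit 303 checks the fusion strategies can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `Strategy` enumeration, the `ItemProfile` record (standing in for the outcome of data research and attribute feature analysis), and the function `select_strategy` are hypothetical names introduced for illustration and are not part of the embodiment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strategy(Enum):
    AUTHORITATIVE_SOURCE = "authoritative identification source"
    FRESHNESS = "data freshness (first rule)"
    EXTREME_VALUE = "maximum or minimum value (second rule)"
    MAJORITY = "majority principle (third rule)"
    PRIORITY = "designated priority (fourth rule)"


@dataclass(frozen=True)
class ItemProfile:
    """Hypothetical per-item result of data research and feature analysis."""
    authoritative_source: Optional[str] = None  # source department, if one exists
    fuse_by_business_time: bool = False
    fuse_by_extreme_value: bool = False
    fuse_by_majority: bool = False


def select_strategy(profile: ItemProfile) -> Strategy:
    """Return the first matching strategy, checked in the order described."""
    if profile.authoritative_source is not None:
        return Strategy.AUTHORITATIVE_SOURCE
    if profile.fuse_by_business_time:
        return Strategy.FRESHNESS
    if profile.fuse_by_extreme_value:
        return Strategy.EXTREME_VALUE
    if profile.fuse_by_majority:
        return Strategy.MAJORITY
    # No earlier rule matched: fall back to the designated priority.
    return Strategy.PRIORITY
```

Because the checks are ordered, an information item that has an authoritative source is never passed to the four rules, and the priority strategy acts as the fallback when no other rule matches.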

This embodiment combines business knowledge (data research to determine the "one number, one source" department of the data) with intelligent data analysis methods to realize scenario-based multi-source data fusion. Multiple data fusion strategies are selected intelligently according to the data cleaning rules preset for each information item, which ensures the quality of multi-source data fusion. The entire process is automated, covering data attribute and feature analysis and data fusion; this comprehensively improves the efficiency of data integration development, effectively resolves the integrity, consistency, accuracy, and association problems of multi-object business data, and improves the quality of government data.

The present invention also provides a computer-readable storage medium that stores a program, and the program includes instructions for executing the above-mentioned method.

The present invention also provides a computer including a readable medium storing a computer program, the program including instructions for executing the above method. The above-mentioned computer-readable storage medium and the computer have the corresponding technical effects of the above-mentioned data cleaning method, and will not be repeated here.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data cleaning method, characterized in that the method includes:

Receiving business data from multiple objects, the business data including multiple information items;

Performing data cleaning on each information item in turn, the data cleaning including:

Judging whether the information item belongs to a preset type that is cleaned based on the identification result;

If the information item belongs to the preset type to be cleaned based on the identification result, call the identification result corresponding to the information item, and use the identification result as the cleaned data of the information item;

If the information item does not belong to the preset type to be cleaned based on the identification result, the information item is cleaned in sequence according to multiple preset data cleaning rules to obtain the cleaned data of the information item; the multiple preset data cleaning rules include: a first rule for cleaning according to the data generation time of the information item, a second rule for cleaning according to the maximum or minimum value in the data of the information item, a third rule for cleaning according to the principle that the minority yields to the majority in the data of the information item, and a fourth rule for cleaning according to the priority of the object to which the data of the information item belongs.
2. The data cleaning method according to claim 1, wherein the step of sequentially cleaning the information items according to a plurality of preset data cleaning rules to obtain the cleaned data of the information items comprises:

Judging whether the information item belongs to a preset type that is cleaned according to the first rule;

If the information item belongs to the preset type to be cleaned according to the first rule, continue to determine whether the information item belongs to the first type, which is cleaned in earliest-to-latest order of the data generation time of the information item, or to the second type, which is cleaned in latest-to-earliest order of the data generation time of the information item;

If the information item belongs to the first type, the earliest time in the data generation time of the information item is taken as the cleaned data of the information item;

If the information item belongs to the second type, the latest time in the data generation time of the information item is taken as the cleaned data of the information item.
3. The data cleaning method according to claim 2, wherein the step of sequentially cleaning the information items according to a plurality of preset data cleaning rules to obtain the cleaned data of the information items comprises:

If it is determined that the information item does not belong to the type cleaned according to the first rule, continue to determine whether the information item belongs to the type cleaned according to the second rule;

If the information item belongs to the preset type to be cleaned according to the second rule, continue to determine whether the information item belongs to the third type, which is cleaned according to the maximum value in the data of the information item, or to the fourth type, which is cleaned according to the minimum value in the data of the information item;

If the information item belongs to the third type, the maximum value among the data of the information item is taken as the cleaned data of the information item;

If the information item belongs to the fourth type, the minimum value among the data of the information item is taken as the cleaned data of the information item.
4. The data cleaning method according to claim 3, wherein the step of sequentially cleaning the information items according to a plurality of preset data cleaning rules to obtain the cleaned data of the information items comprises:

If it is determined that the information item does not belong to the type cleaned according to the second rule, continue to determine whether the information item belongs to the type cleaned according to the third rule;

If the information item belongs to a preset type that is cleaned according to the third rule, the data with the largest proportion among the data of the information item is taken as the cleaned data of the information item.
5. The data cleaning method according to claim 4, wherein the step of sequentially cleaning the information items according to a plurality of preset data cleaning rules to obtain the cleaned data of the information items comprises:

If it is determined that the information item does not belong to the type cleaned according to the third rule, continue to determine whether the information item belongs to the type cleaned according to the fourth rule;

If the information item belongs to a preset type that is cleaned according to the fourth rule, the data whose object has the highest priority among the data of the information item is taken as the cleaned data of the information item.
6. A data cleaning device, characterized in that it comprises:

A data receiving unit, configured to receive business data from multiple objects, the business data including multiple information items;

A data judging unit, configured to judge whether the information item belongs to a preset type that is cleaned based on the identification result;

A data cleaning unit, configured to, if the information item belongs to the preset type to be cleaned based on the identification result, call the identification result corresponding to the information item, and use the identification result as the cleaned data of the information item; and, if the information item does not belong to the preset type to be cleaned based on the identification result, clean the information item in sequence according to multiple preset data cleaning rules to obtain the cleaned data of the information item; the multiple preset data cleaning rules include: a first rule for cleaning according to the data generation time of the information item, a second rule for cleaning according to the maximum or minimum value in the data of the information item, a third rule for cleaning according to the principle that the minority yields to the majority in the data of the information item, and a fourth rule for cleaning according to the priority of the object to which the data of the information item belongs.
7. The data cleaning device according to claim 6, wherein the data cleaning unit comprises:

The first data judgment module is configured to judge whether the information item belongs to the preset type that is cleaned according to the first rule; and, if the information item belongs to the preset type that is cleaned according to the first rule, continue to determine whether the information item belongs to the first type, which is cleaned in earliest-to-latest order of the data generation time of the information item, or to the second type, which is cleaned in latest-to-earliest order of the data generation time of the information item;

The first data cleaning module is configured to, if the information item belongs to the first type, use the earliest time among the data generation times of the information item as the cleaned data of the information item; and, if the information item belongs to the second type, use the latest time among the data generation times of the information item as the cleaned data of the information item;

The second data judgment module is configured to, if it is judged that the information item does not belong to the type that is cleaned according to the first rule, continue to judge whether the information item belongs to the type that is cleaned according to the second rule; and, if the information item belongs to the preset type to be cleaned according to the second rule, continue to determine whether the information item belongs to the third type, which is cleaned according to the maximum value in the data of the information item, or to the fourth type, which is cleaned according to the minimum value in the data of the information item;

The second data cleaning module is configured to, if the information item belongs to the third type, use the maximum value among the data of the information item as the cleaned data of the information item; and, if the information item belongs to the fourth type, use the minimum value among the data of the information item as the cleaned data of the information item.
8. The data cleaning device according to claim 7, wherein the data cleaning unit further comprises:

The third data judgment module is configured to, if it is judged that the information item does not belong to the type cleaned according to the second rule, continue to judge whether the information item belongs to the type cleaned according to the third rule;

The third data cleaning module is configured to, if the information item belongs to a preset type that is cleaned according to the third rule, use the data with the largest proportion among the data of the information item as the cleaned data of the information item;

The third data judgment module is configured to, if it is judged that the information item does not belong to the type cleaned according to the third rule, continue to judge whether the information item belongs to the type cleaned according to the fourth rule;

The fourth data cleaning module is configured to, if the information item belongs to a preset type that is cleaned according to the fourth rule, use the data whose object has the highest priority among the data of the information item as the cleaned data of the information item.
9. A computer-readable storage medium storing a program, wherein the program includes instructions for executing the method according to any one of claims 1-5.
10. A computer including a readable medium storing a computer program, characterized in that the program includes instructions for performing the method according to any one of claims 1-5.