WO2020211299A1 - Data cleansing method - Google Patents

Data cleansing method

Info

Publication number
WO2020211299A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
field
cleaning
missing
preset
Prior art date
Application number
PCT/CN2019/109121
Other languages
French (fr)
Chinese (zh)
Inventor
张礼成
Original Assignee
苏宁云计算有限公司
苏宁易购集团股份有限公司
Priority date
Filing date
Publication date
Application filed by 苏宁云计算有限公司 and 苏宁易购集团股份有限公司
Priority to CA3177209A priority Critical patent/CA3177209A1/en
Publication of WO2020211299A1 publication Critical patent/WO2020211299A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 — Information retrieval of structured data, e.g. relational data
    • G06F 16/21 — Design, administration or maintenance of databases
    • G06F 16/215 — Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 — Information retrieval of structured data, e.g. relational data
    • G06F 16/24 — Querying
    • G06F 16/245 — Query processing
    • G06F 16/2455 — Query execution
    • G06F 16/24568 — Data stream processing; Continuous queries

Definitions

  • This application relates to the field of big data processing technology, and in particular to a data cleaning method.
  • Data cleaning is an indispensable link in the entire data analysis process, and the quality of the results is directly related to the model effect and the final data analysis conclusion.
  • Data cleaning refers to the process of re-auditing and verifying data. The purpose is to delete duplicate data, correct existing errors, and ensure data consistency. In actual operation, data cleaning usually takes up 50%-80% of the time of the data analysis process.
  • Data cleaning includes offline data cleaning and real-time data cleaning.
  • Offline data cleaning can, at the cost of performance, use complex processing to clean data at a finer granularity, including missing value processing, outlier processing, duplicate processing, and null value filling.
  • the existing data cleaning process is usually integrated with the data analysis process, and the two are highly coupled: the cleaning process is strongly affected by the rest of the analysis code, data loss is prone to occur, and data security is poor.
  • a data cleaning method includes: obtaining data from a first data source and using the obtained data to establish an independent data stream; filtering the data in the data stream to obtain data to be cleaned; deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminary cleaned data; detecting whether the preliminary cleaned data conforms to preset judgment rules and deleting the data that does not, to obtain final cleaned data; and outputting the final cleaned data to a second data source.
  • the deleting or filling of the fields containing missing values in the data to be cleaned includes: calculating the missing rate of a field as the ratio of the number of its missing values to the total number of entries; determining the attribute importance of the field according to the indicators to be analyzed; and deleting or filling the fields containing missing values according to the missing rate and attribute importance.
  • the deleting or filling of a field according to its missing rate and attribute importance includes: filling the field when its missing rate is below a preset missing-rate threshold and its attribute importance is below a preset importance rating threshold; deleting the field when its missing rate is not below the threshold and its importance is below the rating threshold; and completing the missing values of the field when its missing rate is not below the threshold and its importance is above the rating threshold.
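The three threshold branches above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function names, the default 90% threshold, and the choice to also fill the fourth (unspecified) case of a low missing rate on an important field are all assumptions.

```python
def missing_rate(values: list) -> float:
    """Fraction of missing (None) entries in a field's column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def decide_action(rate: float, important: bool,
                  rate_threshold: float = 0.9) -> str:
    """Map a field's missing rate and attribute importance to an action,
    mirroring the three branches described above."""
    if rate < rate_threshold and not important:
        return "fill"        # low missing rate, low importance
    if rate >= rate_threshold and not important:
        return "delete"      # high missing rate, low importance
    if rate >= rate_threshold and important:
        return "complete"    # high missing rate, high importance
    # low missing rate on an important field: not specified by the text;
    # filling is assumed here
    return "fill"
```

For example, a field with missing rate 0.95 and low importance would be deleted outright, while the same rate on an important field triggers value completion instead.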
  • the method further includes: exploring the metadata that describes the data attributes of the data in the first data source, analyzing the metadata to obtain the quality problems present in the data, and setting filtering rules for those quality problems; the filtering of the data in the data stream then proceeds according to the filtering rules to obtain the data to be cleaned.
  • the filtering processing on the data in the data stream includes: row-level filtering, which removes unnecessary rows from the data; and column-level filtering, which, when a row has multiple columns, selects and retains only the fields corresponding to the required columns.
  • the preset judgment rules include a legality rule and a logic rule, and the detecting whether the preliminary cleaned data conforms to the preset judgment rules includes: if the preliminary cleaned data does not conform to the legality rule, setting it to the maximum value that conforms to the legality rule, or deleting it; if it does not conform to the logic rule, deleting it and generating a warning instruction.
  • the first data source and the second data source are different data categories of the same distributed messaging system; further, the distributed messaging system is Kafka, the first and second data sources are two different Kafka Topics, and the data stream is based on Spark Streaming.
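The patent realizes this pipeline with two Kafka topics consumed and produced via Spark Streaming. As a dependency-free sketch of the same source-to-source architecture, the following simulates the two topics with in-memory deques; the record layout and the toy `clean` step are illustrative assumptions, not the patent's actual job.

```python
from collections import deque
from typing import Optional

# Hypothetical stand-ins for the two Kafka topics; in the patent's setup
# these would be separate topics bridged by a Spark Streaming job.
first_source = deque([{"ip": "1.2.3.4", "channel": "app"},
                      {"ip": None, "channel": None}])
second_source = deque()


def clean(record: dict) -> Optional[dict]:
    """Toy cleaning step: drop records carrying no channel field."""
    return record if record.get("channel") else None


# The independent stream consumes from the first source, cleans, and
# publishes to the second source, leaving analysis code untouched.
while first_source:
    cleaned = clean(first_source.popleft())
    if cleaned is not None:
        second_source.append(cleaned)
```

The point of the two-source layout is isolation: downstream analysis only ever reads the second source, so a change in cleaning logic cannot corrupt data the analysis code has already consumed.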
  • a data cleaning device includes:
  • the data acquisition module is used to acquire data from the first data source, and use the acquired data to establish an independent data stream;
  • the data filtering module is used to filter the data in the data stream to obtain the data to be cleaned;
  • the preliminary cleaning module is used to delete or fill the fields containing missing values in the data to be cleaned, to obtain preliminary cleaned data;
  • the final cleaning module is used to detect whether the preliminary cleaning data meets the preset judgment rules, delete the data that does not meet the judgment rules, and obtain the final cleaning data;
  • the data output module is used to output the final cleaning data to the second data source.
  • a computer device includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, it implements the steps of the data cleaning method described above.
  • a data cleaning method, device, computer equipment, and storage medium perform data cleaning by establishing an independent data stream: the data obtained from a first data source is cleaned and then placed into another data source for subsequent business processing, so that the cleaning process is independent of the data analysis code, coupling between codes is reduced, and data security is effectively improved;
  • in addition, data filtering is placed in the first step of data cleaning, which reduces the amount of data that needs to be cleaned subsequently and greatly improves the efficiency of data cleaning.
  • FIG. 1 is a schematic flowchart of a data cleaning method in an embodiment
  • FIG. 2 is a structural block diagram of a data cleaning device in an embodiment.
  • the present application provides a data cleaning method, including the following steps:
  • Step 101 Obtain data from a first data source, and use the acquired data to establish an independent data stream.
  • the first data source is the source from which the data is obtained;
  • the data stream is an ordered data sequence of bytes with a starting point and an ending point.
  • the present invention performs data cleaning by establishing an independent data stream, separates the data cleaning process from the data analysis code, and reduces the coupling between codes.
  • Step 102 Perform filtering processing on the data in the data stream to obtain data to be cleaned.
  • data filtering is placed in the first step of data cleaning, which can effectively reduce the amount of data that needs to be cleaned later and greatly improve the efficiency of data cleaning.
  • Step 103 Delete or fill fields in the data to be cleaned that contain missing values, to obtain preliminary cleaned data.
  • Missing value refers to the lack of information in the data, that is, the value of one or some attributes of the data is incomplete.
  • Step 104 Detect whether the preliminary cleaning data meets the preset judgment rule, delete the data that does not meet the judgment rule, and obtain the final cleaning data;
  • Step 105 Output the final cleaning data to the second data source.
  • the second data source is another data source different from the first data source, and is used to store data for subsequent business use or processing.
  • the data cleaning process of the present invention is independent of other processing processes of data analysis, is not affected by other codes, and has higher data security.
  • data cleaning is performed by establishing an independent data stream, and the data obtained from the first data source is cleaned and then placed into another data source for subsequent business processing, so that the data cleaning process is made independent of the data analysis code, which reduces the coupling between codes and effectively improves data security.
  • the first data source and the second data source are different data categories of the same distributed messaging system; further, the distributed messaging system is Kafka, the first and second data sources are two different Kafka Topics, and the data stream is based on Spark Streaming.
  • the deleting or filling of the fields containing missing values in the data to be cleaned includes: calculating the missing rate of a field; determining its attribute importance; and deleting or filling the fields containing missing values according to the missing rate and attribute importance.
  • the missing rate of a field is the ratio of the number of missing values in the field to the total number of entries;
  • the criteria for judging the attribute importance of a field are determined by the indicators to be analyzed. For example, if a user portrait or label is needed to provide data for subsequent precision marketing, user attribute information must be collected, and attributes such as the user's age and gender are important fields.
  • when a field is to be filled according to its missing rate and attribute importance, it can be filled according to the data distribution; more specifically, if the data is evenly distributed, the mean is used as the fill value, and if the distribution is skewed, the median is used.
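The mean-versus-median choice above can be sketched as follows. The patent does not specify how skewness is judged; the rule used here (gap between mean and median relative to the standard deviation) and its 0.2 cutoff are illustrative assumptions.

```python
import statistics


def fill_value(values):
    """Pick a fill value from the observed (non-missing) entries:
    mean when the distribution looks symmetric, median when skewed."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    median = statistics.median(observed)
    stdev = statistics.pstdev(observed)
    # Illustrative skew test: a large mean/median gap signals skew.
    if stdev == 0 or abs(mean - median) / stdev < 0.2:
        return mean
    return median


def fill_missing(values):
    """Replace every missing entry with the chosen fill value."""
    fv = fill_value(values)
    return [fv if v is None else v for v in values]
```

On the symmetric column `[1, 2, 3, None]` the mean (2) is used; on the heavily skewed `[1, 1, 1, 100, None]` the median (1) wins, keeping the single outlier from distorting the filled value.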
  • the completion of the missing values of a field includes: when there are few missing values, using the average of the values before and after as the completion value; when there are many missing values, using a value obtained by smoothing as the completion value.
  • the missing rate threshold may be any value from 90% to 95%.
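The neighbour-average completion for the few-missing-values case can be sketched as below; it is a minimal illustration, and scanning outward for the nearest non-missing neighbours is an assumption about how "the values before and after" are chosen.

```python
def complete_by_neighbors(values):
    """Fill each missing entry with the average of its nearest
    non-missing neighbours before and after it in the sequence."""
    out = list(values)
    for i, v in enumerate(out):
        if v is None:
            # nearest non-missing value before position i (if any)
            before = next((out[j] for j in range(i - 1, -1, -1)
                           if out[j] is not None), None)
            # nearest non-missing value after position i (if any)
            after = next((out[j] for j in range(i + 1, len(out))
                          if out[j] is not None), None)
            neighbours = [x for x in (before, after) if x is not None]
            if neighbours:
                out[i] = sum(neighbours) / len(neighbours)
    return out
```

For longer gaps, the text suggests a smoothed value instead; a moving average over a wider window would be the natural extension of the same idea.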
  • before filtering, the metadata describing the data attributes of the data in the first data source is first explored, and the quality problems present in the data are then obtained from the metadata analysis; a filtering rule is set for each quality problem, and step 102 filters the data in the data stream according to the filtering rules to obtain the data to be cleaned.
  • Metadata, also known as intermediary data or relay data, is data that describes data, mainly information describing data attributes, used to support functions such as indicating storage locations, recording history, searching resources, and recording files.
  • encapsulating the data attributes that need to be processed as metadata can make the program more scalable.
  • formulating corresponding filtering rules for data quality issues is conducive to improving the efficiency of data filtering.
  • the filtering processing on the data in the data stream includes: row-level filtering, which removes unnecessary rows from the data; and column-level filtering, which, when a row has multiple columns, selects and retains only the fields corresponding to the required columns.
  • the combination of row-level filtering and column-level filtering can effectively speed up data filtering.
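The combined row-then-column filter can be sketched over an iterable of dict records. This is an illustrative sketch: the record fields, the predicate, and the kept columns are all assumed examples, not the patent's rules.

```python
def filter_stream(records, keep_row, keep_cols):
    """Apply row-level filtering (drop whole records failing the
    predicate), then column-level filtering (keep only needed fields)."""
    for rec in records:
        if keep_row(rec):                       # row-level filter
            yield {k: rec[k] for k in keep_cols if k in rec}  # column-level


rows = [{"ip": "a", "channel": "web", "ua": "x"},
        {"ip": "b", "channel": None, "ua": "y"}]
# Keep rows that carry a channel; keep only the ip and channel columns.
cleaned = list(filter_stream(rows, lambda r: r["channel"], ["ip", "channel"]))
```

Doing the row filter first means the column projection only touches rows that survive, which is exactly why combining the two speeds up filtering.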
  • the log data includes nearly 200 fields, such as the visitor's IP address, browser information, client device information, specific access time, specific pages accessed, referring pages, and access duration.
  • the requirement of this embodiment is to count the traffic volume of each channel and the traffic of independent IPs.
  • row-level filtering chooses to keep only the log data related to a channel, thereby filtering out the log data that does not contain a channel;
  • pv is the abbreviation of page view, i.e. the number of page views: each visit by a user to each page of the website is recorded once, and repeated visits by the same user to the same page are accumulated into the total pv count;
  • uv is the abbreviation of unique visitor, referring to a natural person who visits and browses the webpage through the Internet.
  • subsequent data processing may require statistics on user retention rates, and data such as the access time of each IP address may be further recorded.
  • the user retention rate is the ratio of old users to total users.
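The pv/uv statistics in this embodiment reduce to counting records and distinct IPs per channel. A minimal sketch, with hypothetical field names (`channel`, `ip`) standing in for the actual log schema:

```python
from collections import defaultdict


def channel_stats(logs):
    """Count pv (page views) and uv (distinct visitor IPs) per channel."""
    pv = defaultdict(int)
    visitors = defaultdict(set)
    for entry in logs:
        pv[entry["channel"]] += 1            # every view counts toward pv
        visitors[entry["channel"]].add(entry["ip"])  # sets deduplicate IPs
    return {c: {"pv": pv[c], "uv": len(visitors[c])} for c in pv}


logs = [{"ip": "1.1.1.1", "channel": "app"},
        {"ip": "1.1.1.1", "channel": "app"},
        {"ip": "2.2.2.2", "channel": "app"},
        {"ip": "3.3.3.3", "channel": "web"}]
stats = channel_stats(logs)
```

The same accumulation pattern extends naturally to retention: recording each IP's access times, as the text suggests, would let old versus new users be distinguished per period.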
  • the preset judgment rule includes a legality rule and a logic rule
  • the detecting whether the preliminary cleaning data meets the preset judgment rule includes:
  • the legality rules are format requirements for values, dates, and field contents; for example, the legality rule for a date field may be the format "YYYY-MM-DD", gender must be male, female, or unknown, and the date of birth must be earlier than or equal to today;
  • the preliminary cleaning data does not conform to the logic rule, the preliminary cleaning data is deleted, and a warning instruction is generated.
  • Logic rules are common-sense rules used to determine whether data is logically consistent; for example, a person's age is generally between 0 and 120, so a record with an age of 200 is judged abnormal.
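The legality and logic checks above can be sketched as one record-level function. The field names, the specific rules, and the decision to delete (rather than clamp) legality violations here are illustrative assumptions drawn from the examples in the text.

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # "YYYY-MM-DD" legality rule


def check_record(rec):
    """Apply legality rules, then the common-sense logic rule.

    Returns (record, warnings); a dropped record comes back as None,
    and a logic violation additionally produces a warning instruction.
    """
    warnings = []
    # legality: gender must be one of a fixed set of values
    if rec.get("gender") not in {"male", "female", "unknown"}:
        return None, warnings
    # legality: birth date must match the YYYY-MM-DD format
    if not DATE_RE.match(rec.get("birth", "")):
        return None, warnings
    # logic: ages outside 0-120 are abnormal; delete and warn
    if not (0 <= rec.get("age", 0) <= 120):
        warnings.append(f"illogical age {rec['age']}: record dropped")
        return None, warnings
    return rec, warnings
```

A record with age 200 passes every format check yet still fails, which is the point of keeping the two rule families separate: legality catches malformed values, logic catches well-formed nonsense.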
  • although the steps in the flowchart of FIG. 1 are displayed in the order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated herein, their execution order is not strictly limited, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 1 may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times, and whose execution order is not necessarily sequential: they may be performed in turn or alternately with at least part of the other steps, or of the sub-steps or stages of other steps.
  • a data cleaning device which includes: a data acquisition module, a data filtering module, a preliminary cleaning module, a final cleaning module, and a data output module, wherein:
  • the data acquisition module is used to acquire data from the first data source, and use the acquired data to establish an independent data stream;
  • the data filtering module is used to filter the data in the data stream to obtain the data to be cleaned;
  • the preliminary cleaning module is used to delete or fill the fields containing missing values in the data to be cleaned, to obtain preliminary cleaned data;
  • the final cleaning module is used to detect whether the preliminary cleaning data meets the preset judgment rules, delete the data that does not meet the judgment rules, and obtain the final cleaning data.
  • the data output module is used to output the final cleaning data to the second data source.
  • the first data source and the second data source are different data types of the same distributed messaging system.
  • the preliminary cleaning module includes a missing rate submodule, an importance degree submodule, and a missing value processing submodule, wherein:
  • the missing rate sub-module is used to calculate the missing rate of the field according to the ratio of the number of missing values in the field to the total number;
  • the importance degree sub-module is used to determine the attribute importance degree of the field according to the index to be analyzed
  • the missing value processing sub-module is used to delete or fill fields containing missing values according to the field’s missing rate and attribute importance.
  • the missing value processing sub-module includes a comparison unit and a primary processing unit, wherein:
  • the comparison unit is used to compare the missing rate and attribute importance of the field with the preset missing-rate threshold and importance rating threshold, respectively; the primary processing unit is used to fill or delete the field, or to complete its missing values, according to the comparison results.
  • the data cleaning device further includes a data exploration module.
  • the data exploration module is used to explore the metadata describing the data attributes of the data in the first data source before the data in the data stream is filtered, to obtain the quality problems present in the data from the metadata analysis, and to set the filtering rules according to those quality problems.
  • the data filtering module includes a row-level filtering unit and a column-level filtering unit, wherein: the row-level filtering unit is used to remove unnecessary rows from the data, and the column-level filtering unit is used, when a row has multiple columns, to select and retain only the fields corresponding to the required columns.
  • the final cleaning module includes a legality detection unit, a logic detection unit, and a final processing unit, wherein:
  • the legality detection unit is used to detect whether the preliminary cleaning data conforms to a preset legality rule
  • the logic detection unit is used to detect whether the preliminary cleaning data meets a preset logic rule
  • the final processing unit is configured to set preliminary cleaned data that does not meet the legality rule to the maximum value that meets the legality rule, or to delete it; to delete preliminary cleaned data that does not meet the logic rule; and to generate a warning instruction.
  • Each module in the above-mentioned data cleaning device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • the foregoing modules may be embedded in, or independent of, the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to them.
  • a computer device is provided, and the computer device may be a terminal.
  • the computer equipment includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and a computer program.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a data cleaning method.
  • the display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen
  • the input device of the computer equipment can be a touch layer covering the display screen, a button, a trackball, or a touchpad set on the housing of the computer equipment, or an external keyboard, touchpad, or mouse.
  • a computer device including a memory, a processor, and a computer program stored in the memory and running on the processor.
  • when the processor executes the computer program, the following steps are implemented: obtaining data from a first data source and using the obtained data to establish an independent data stream; filtering the data in the data stream to obtain the data to be cleaned; deleting or filling the fields containing missing values in the data to be cleaned to obtain preliminary cleaned data; detecting whether the preliminary cleaned data meets the preset judgment rules and deleting the data that does not, to obtain the final cleaned data; and outputting the final cleaned data to the second data source.
  • the processor further implements the following steps when executing the computer program: calculating the missing rate of a field as the ratio of the number of its missing values to the total number of entries; determining the attribute importance of the field according to the indicators to be analyzed; and deleting or filling the fields containing missing values according to the missing rate and attribute importance.
  • the processor further implements the following steps when executing the computer program: when the missing rate of the field is below the preset missing-rate threshold and its attribute importance is below the preset importance rating threshold, filling the field; when the missing rate is not below the threshold and the importance is below the rating threshold, deleting the field; and when the missing rate is not below the threshold and the importance is above the rating threshold, completing the missing values of the field.
  • the processor further implements the following steps when executing the computer program: exploring the metadata describing the data attributes of the data in the first data source, obtaining the quality problems present in the data from the metadata analysis, setting the filtering rules according to the quality problems, and filtering the data in the data stream according to the filtering rules to obtain the data to be cleaned.
  • the processor also implements the following steps when executing the computer program: row-level filtering, which removes unnecessary rows from the data; and column-level filtering, which, when a row has multiple columns, selects and retains only the fields corresponding to the required columns.
  • the preset judgment rules include legality rules and logic rules, and the processor further implements the following steps when executing the computer program: if the preliminary cleaned data does not meet the legality rules, setting it to the maximum value that meets the legality rules, or deleting it; if the preliminary cleaned data does not meet the logic rules, deleting it and generating a warning instruction.
  • a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, the following steps are implemented: obtaining data from a first data source and using the obtained data to establish an independent data stream; filtering the data in the data stream to obtain the data to be cleaned; deleting or filling the fields containing missing values in the data to be cleaned to obtain preliminary cleaned data; detecting whether the preliminary cleaned data meets the preset judgment rules and deleting the data that does not, to obtain the final cleaned data; and outputting the final cleaned data to the second data source.
  • the following steps are also implemented: calculating the missing rate of a field as the ratio of the number of its missing values to the total number of entries; determining the attribute importance of the field according to the indicators to be analyzed; and deleting or filling the fields containing missing values according to the missing rate and attribute importance.
  • the following steps are further implemented: when the missing rate of the field is below the preset missing-rate threshold and its attribute importance is below the preset importance rating threshold, filling the field; when the missing rate is not below the threshold and the importance is below the rating threshold, deleting the field; and when the missing rate is not below the threshold and the importance is above the rating threshold, completing the missing values of the field.
  • the following steps are further implemented: exploring the metadata describing the data attributes of the data in the first data source, obtaining the quality problems of the data from the metadata analysis, setting the filtering rules according to the quality problems, and filtering the data in the data stream according to the filtering rules to obtain the data to be cleaned.
  • the following steps are also implemented: row-level filtering, which removes unnecessary rows from the data; and column-level filtering, which, when a row has multiple columns, selects and retains only the fields corresponding to the required columns.
  • the preset judgment rules include legality rules and logic rules, and when the computer program is executed by the processor, the following steps are also implemented: if the preliminary cleaned data does not meet the legality rules, setting it to the maximum value that meets the legality rules, or deleting it; if the preliminary cleaned data does not meet the logic rules, deleting it and generating a warning instruction.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

A data cleansing method. The method comprises: acquiring data from a first data source, and establishing an independent data stream by using the acquired data (101); filtering the data in the data stream to obtain data to be cleansed (102); deleting or filling a field comprising a missing value in the data to be cleansed, to obtain preliminary cleansed data (103); detecting whether the preliminary cleansed data conforms to a preset determination rule, and deleting the data not conforming to the determination rule to obtain final cleansed data (104); and outputting the final cleansed data to a second data source (105). By using the above-mentioned method, data security can be improved.

Description

Data cleaning method

Technical field
This application relates to the field of big data processing technology, and in particular to a data cleaning method.
Background
With the advent of the Internet age, a large amount of information data continues to flood into the Internet, and the amount of data is increasing at a rate of 50% every year. With the support of huge data sources, corporate decisions are increasingly based on data analysis rather than relying solely on experience and intuition. Data cleaning is an indispensable link in the entire data analysis process, and the quality of its results directly affects the model effect and the final data analysis conclusion. Data cleaning refers to the process of re-auditing and verifying data, with the purpose of deleting duplicate data, correcting existing errors, and ensuring data consistency. In actual operation, data cleaning usually takes up 50%-80% of the time of the data analysis process.
Data cleaning includes offline data cleaning and real-time data cleaning. Offline data cleaning can, at the cost of performance, use complex processing to clean data at a finer granularity, including missing value processing, outlier processing, duplicate processing, null value filling, unit unification, optional standardization, optional deletion of unnecessary variables, and optional sorting. Compared with offline cleaning, real-time data cleaning, due to its real-time requirements, leans toward missing value filling, filtering, and data legality checks. However, the existing data cleaning process is usually integrated with the data analysis process, and the two are highly coupled: the cleaning process is strongly affected by the rest of the analysis code, data loss is prone to occur, and data security is poor.
Summary of the invention
Based on this, in view of the above technical problems, it is necessary to provide a data cleaning method that can improve data security.
A data cleaning method, the method including:

obtaining data from a first data source, and using the obtained data to establish an independent data stream;

filtering the data in the data stream to obtain data to be cleaned;

deleting or filling the fields containing missing values in the data to be cleaned, to obtain preliminary cleaned data;

detecting whether the preliminary cleaned data meets preset judgment rules, and deleting the data that does not, to obtain final cleaned data;

outputting the final cleaned data to a second data source.
In one of the embodiments, the deleting or filling of the fields containing missing values in the data to be cleaned includes:

calculating the missing rate of a field as the ratio of the number of its missing values to the total number of entries;

determining the attribute importance of the field according to the indicators to be analyzed;

deleting or filling the fields containing missing values according to the missing rate and attribute importance.
In one of the embodiments, the deleting or filling of the fields containing missing values according to the missing rate and attribute importance includes:

when the missing rate of the field is below a preset missing-rate threshold and its attribute importance is below a preset importance rating threshold, filling the field;

when the missing rate of the field is not below the preset missing-rate threshold and its attribute importance is below the preset importance rating threshold, deleting the field;

when the missing rate of the field is not below the preset missing-rate threshold and its attribute importance is above the preset importance rating threshold, completing the missing values of the field.
In one of the embodiments, the method further includes:

exploring the metadata describing the data attributes of the data in the first data source, obtaining the quality problems present in the data from the metadata analysis, and setting filtering rules according to the quality problems;

the filtering of the data in the data stream to obtain the data to be cleaned includes: filtering the data in the data stream according to the filtering rules to obtain the data to be cleaned.
In one of the embodiments, filtering the data in the data stream includes:
row-level filtering, which removes unneeded rows from the data;
column-level filtering, which, when a row has multiple columns, selects and retains only the fields corresponding to the required columns.
In one of the embodiments, the preset judgment rules include legality rules and logic rules, and detecting whether the preliminarily cleaned data conforms to the preset judgment rules includes:
if the preliminarily cleaned data does not conform to the legality rules, setting the data to the maximum value that conforms to the legality rules, or deleting it;
if the preliminarily cleaned data does not conform to the logic rules, deleting the data and generating a warning instruction.
In one of the embodiments, the first data source and the second data source are different data categories of the same distributed messaging system. Further, the distributed messaging system is Kafka, the first data source and the second data source are two different Kafka Topics, and the data stream is a Spark Streaming-based data stream.
A data cleaning device, the device including:
a data acquisition module, configured to acquire data from a first data source and establish an independent data stream from the acquired data;
a data filtering module, configured to filter the data in the data stream to obtain the data to be cleaned;
a preliminary cleaning module, configured to delete or fill the fields in the data to be cleaned that contain missing values, obtaining preliminarily cleaned data;
a final cleaning module, configured to detect whether the preliminarily cleaned data conforms to preset judgment rules and delete the data that does not, obtaining the final cleaned data;
a data output module, configured to output the final cleaned data to a second data source.
A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer program:
acquiring data from a first data source and establishing an independent data stream from the acquired data;
filtering the data in the data stream to obtain the data to be cleaned;
deleting or filling the fields in the data to be cleaned that contain missing values, obtaining preliminarily cleaned data;
detecting whether the preliminarily cleaned data conforms to preset judgment rules and deleting the data that does not, obtaining the final cleaned data;
outputting the final cleaned data to a second data source.
A computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the following steps:
acquiring data from a first data source and establishing an independent data stream from the acquired data;
filtering the data in the data stream to obtain the data to be cleaned;
deleting or filling the fields in the data to be cleaned that contain missing values, obtaining preliminarily cleaned data;
detecting whether the preliminarily cleaned data conforms to preset judgment rules and deleting the data that does not, obtaining the final cleaned data;
outputting the final cleaned data to a second data source.
Compared with the prior art, the beneficial effects of the present invention are:
a data cleaning method, device, computer equipment, and storage medium that perform data cleaning through an independent data stream: the data acquired from the first data source is cleaned and then placed into another data source for subsequent business processing, so that the data cleaning process is separated from the data analysis code, reducing the coupling between code modules and effectively improving data security;
further, the present invention places data filtering as the first step of data cleaning, reducing the amount of data to be cleaned in subsequent steps and greatly improving cleaning efficiency.
Description of the drawings
Fig. 1 is a schematic flowchart of a data cleaning method in an embodiment;
Fig. 2 is a structural block diagram of a data cleaning device in an embodiment.
Detailed description
To make the purpose, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the application, not to limit it.
In an embodiment, as shown in Fig. 1, this application provides a data cleaning method including the following steps:
Step 101: acquire data from a first data source, and establish an independent data stream from the acquired data.
Here, the first data source is the source from which the data is obtained, and a data stream is an ordered sequence of bytes with a starting point and an ending point.
Specifically, the present invention performs data cleaning through an independent data stream, separating the cleaning process from the data analysis code and reducing the coupling between code modules.
Step 102: filter the data in the data stream to obtain the data to be cleaned.
Specifically, placing data filtering as the first step of data cleaning effectively reduces the amount of data to be cleaned in later steps and greatly improves cleaning efficiency.
Step 103: delete or fill the fields in the data to be cleaned that contain missing values, obtaining preliminarily cleaned data.
A missing value means that information is absent from the data, that is, the value of one or more attributes of the data is incomplete.
Step 104: detect whether the preliminarily cleaned data conforms to preset judgment rules, and delete the data that does not, obtaining the final cleaned data.
Step 105: output the final cleaned data to a second data source.
The second data source is a data source different from the first, used to store data for subsequent business use or processing.
Specifically, the data cleaning process of the present invention is independent of the other stages of data analysis and unaffected by other code, so data security is higher.
In the above data cleaning method, data cleaning is performed through an independent data stream: the data acquired from the first data source is cleaned and then placed into another data source for subsequent business processing, so that the cleaning process is separated from the data analysis code, reducing the coupling between code modules and effectively improving data security.
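The overall flow can be sketched in plain Python; the function and parameter names (`clean_stream`, `keep`, `wanted`, `handle_missing`, `is_valid`) are illustrative assumptions, since the text leaves the concrete interfaces open:

```python
def clean_stream(records, keep, wanted, handle_missing, is_valid):
    """Sketch of the cleaning flow: filter -> preliminary clean -> final clean."""
    # Step 102: row-level and column-level filtering of the incoming stream.
    filtered = [{k: r[k] for k in wanted if k in r} for r in records if keep(r)]
    # Step 103: delete or fill fields that contain missing values.
    preliminary = [handle_missing(r) for r in filtered]
    # Step 104: drop records that fail the preset judgment rules.
    return [r for r in preliminary if is_valid(r)]
```

In the Kafka deployment described below, `records` would be consumed from one Topic and the returned records written to the other.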
As a specific implementation, the first data source and the second data source are different data categories of the same distributed messaging system; for example, the distributed messaging system is Kafka, the first data source and the second data source are two different Kafka Topics, and the data stream is a Spark Streaming-based data stream.
In one of the embodiments, deleting or filling the fields in the data to be cleaned that contain missing values includes:
calculating the missing rate of a field as the ratio of the number of missing values in the field to the total number of records;
determining the attribute importance of the field according to the indicators to be analyzed;
deleting or filling the fields containing missing values according to the missing rate and attribute importance of the field.
The missing rate of a field is the ratio of the number of missing values in the field to the total number of records.
For example, if the salary field has 100 records in total and 20 of them are missing values, the missing rate is 20%.
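As a minimal sketch, the missing rate of one field can be computed as follows (treating `None` and the empty string as missing is an assumption; what counts as missing depends on the data source):

```python
def missing_rate(values, missing=(None, "")):
    """Ratio of missing entries to total entries for one field."""
    if not values:
        return 0.0
    return sum(1 for v in values if v in missing) / len(values)
```

For the salary example above, 20 missing entries out of 100 records give a rate of 0.2.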
The criterion for judging the attribute importance of a field is determined by the indicators to be analyzed. For example, to build user portraits or labels that provide data for subsequent precision marketing, user attribute information must be collected; attributes such as the user's age and gender are then important fields.
In one of the embodiments, deleting or filling the fields containing missing values according to the missing rate and attribute importance of the field includes:
when the missing rate of the field is lower than the preset missing-rate threshold and the attribute importance is lower than the preset importance-rating threshold, filling the field;
specifically, if the field holds numeric data, it is filled according to the data distribution: if the data is evenly distributed, the field is filled with the mean; if the distribution is skewed, it is filled with the median.
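A minimal numeric-fill sketch. Using the distance between mean and median (in standard-deviation units, Pearson's second skewness coefficient) as the symmetry test is an assumption the text does not specify, as is the `0.5` cutoff:

```python
import statistics

def fill_numeric(values, skew_cutoff=0.5):
    """Fill missing numeric values: mean if roughly symmetric, median if skewed."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    median = statistics.median(present)
    stdev = statistics.pstdev(present)
    # Pearson's second skewness coefficient as a simple symmetry test.
    skewed = stdev > 0 and abs(3 * (mean - median) / stdev) > skew_cutoff
    fill = median if skewed else mean
    return [fill if v is None else v for v in values]
```

With a heavy outlier the mean is dragged away from the median, so the median is chosen as the fill value, matching the rule above.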
when the missing rate of the field is not lower than the preset missing-rate threshold and the attribute importance is lower than the preset importance-rating threshold, deleting the field;
when the missing rate of the field is not lower than the preset missing-rate threshold and the attribute importance is higher than the preset importance-rating threshold, completing the missing values of the field.
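The three rules can be sketched as a decision function. The default thresholds and the numeric importance scale are assumptions, and the fourth combination (low missing rate on an important field) is not specified in the text, so filling is used as a default here:

```python
def missing_value_action(rate, importance, rate_threshold=0.9, importance_threshold=3):
    """Map a field's missing rate and attribute importance to an action."""
    if rate < rate_threshold and importance < importance_threshold:
        return "fill"
    if rate >= rate_threshold and importance < importance_threshold:
        return "delete_field"
    if rate >= rate_threshold and importance > importance_threshold:
        return "complete"
    # Case not specified in the text: low missing rate, high importance.
    return "fill"
```

A `rate_threshold` of 0.9 matches the 90%-95% range given for the missing-rate threshold.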
Specifically, completing the missing values of a field includes:
completion from other information, for example deriving gender, native place, date of birth, and age from an ID card number;
completion from neighboring data, for example when a time series is missing a point, the mean of the preceding and following values can be used as the completion value; when many values are missing, values obtained by smoothing can be used;
values that cannot be completed must be excluded from analysis, but should not be deleted, as they may still be used later.
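The neighbor-based completion for time series can be sketched as follows; averaging the nearest non-missing values on either side is one simple reading of "use the mean of the preceding and following values":

```python
def complete_series(values):
    """Complete missing time-series points with the mean of the nearest neighbors."""
    out = list(values)
    for i, v in enumerate(out):
        if v is not None:
            continue
        prev = next((out[j] for j in range(i - 1, -1, -1) if out[j] is not None), None)
        nxt = next((values[j] for j in range(i + 1, len(values)) if values[j] is not None), None)
        if prev is not None and nxt is not None:
            out[i] = (prev + nxt) / 2
        elif prev is not None or nxt is not None:
            out[i] = prev if prev is not None else nxt
        # Otherwise no neighbor exists on either side: leave the point missing
        # so it can be excluded from analysis without being deleted.
    return out
```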
As a specific implementation, the missing-rate threshold can be any value between 90% and 95%.
In one of the embodiments, before the data in the data stream is filtered, the metadata describing the data attributes of the data in the first data source is explored, quality problems in the data are identified from the metadata analysis, and filtering rules are set according to those quality problems; step 102 then filters the data in the data stream according to the filtering rules to obtain the data to be cleaned.
Metadata, also known as intermediary data or relay data, is data that describes data, chiefly information describing data attributes, used to support functions such as indicating storage locations, historical data, resource lookup, and file recording.
Specifically, encapsulating the attributes of the data to be processed as metadata makes the program more extensible; at the same time, formulating filtering rules targeted at the data's quality problems helps improve filtering efficiency.
In one of the embodiments, filtering the data in the data stream includes:
row-level filtering, which removes unneeded rows from the data;
column-level filtering, which, when a row has multiple columns, selects and retains only the fields corresponding to the required columns.
Specifically, combining row-level and column-level filtering effectively speeds up data filtering.
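A minimal sketch of the two filters, representing each log record as a dict (the record format is an assumption):

```python
def filter_rows(rows, keep):
    """Row-level filtering: drop rows that are not needed."""
    return [r for r in rows if keep(r)]

def filter_columns(rows, wanted):
    """Column-level filtering: keep only the fields of the required columns."""
    return [{k: r[k] for k in wanted if k in r} for r in rows]
```

The pv/uv example below applies exactly this pair: keep only the channel-related rows, then project onto the cid, uid, and ip fields.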
For example, the flow of calculating pv/uv per channel:
the log data includes nearly 200 fields, such as the visitor's IP address, browser information, client device information, access time, the specific page visited, the referring page, and visit duration; the requirement in this embodiment is to count the page views and unique-IP visits for each channel.
Row-level filtering keeps only the log data related to channels, filtering out log records that contain no channel;
column-level filtering selects cid (channel name), uid (device identifier), and ip address from the nearly 200 fields of the channel-related log data, discarding the unneeded fields, after which the pv/uv of each channel can be computed;
pv is short for Page View: each visit by a user to each page of the website is recorded once, and repeated visits by a user to the same page accumulate into the pv total;
uv is short for unique visitor, the natural persons who visit and browse the page over the Internet.
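Given log records already reduced to the cid/uid/ip fields, the per-channel statistics can be sketched as follows (field names follow the example above; counting uv by distinct ip matches the unique-IP requirement):

```python
def pv_uv_by_channel(logs):
    """Count page views (pv) and distinct-IP visitors (uv) for each channel."""
    stats = {}
    for rec in logs:
        entry = stats.setdefault(rec["cid"], {"pv": 0, "ips": set()})
        entry["pv"] += 1             # every access counts toward pv
        entry["ips"].add(rec["ip"])  # repeated IPs collapse into one uv
    return {cid: {"pv": e["pv"], "uv": len(e["ips"])} for cid, e in stats.items()}
```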
In this embodiment, with extensibility in mind, for example because subsequent processing may need to compute the user retention rate, data such as the access time of each ip address can additionally be recorded.
The user retention rate is the proportion of returning users among all users.
In one of the embodiments, the preset judgment rules include legality rules and logic rules, and detecting whether the preliminarily cleaned data conforms to the preset judgment rules includes:
if the preliminarily cleaned data does not conform to the legality rules, setting the data to the maximum value that conforms to the legality rules, or deleting it;
legality rules are format requirements on values, dates, field contents, and the like.
Specifically, a field-type legality rule: the date field format is "YYYY-MM-DD";
field-content legality rules: gender is male, female, or unknown; date of birth is on or before today;
if the preliminarily cleaned data does not conform to the logic rules, deleting the data and generating a warning instruction.
Logic rules are common-sense rules used to judge whether data is logically plausible; for example, a person's age is generally between 0 and 120, so a record with an age of 200 is judged abnormal.
After the data has been cleaned against the legality rules and logic rules, the data failing the format requirements or logic rules has been removed, yielding valid final cleaned data.
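The example rules can be sketched as two checks; the field names (`birth_date`, `gender`, `age`) and the English gender labels are illustrative assumptions:

```python
import datetime

def check_legality(record):
    """Legality rules: "YYYY-MM-DD" date format, allowed genders, birth date not in the future."""
    try:
        birth = datetime.date.fromisoformat(record["birth_date"])
    except (KeyError, ValueError):
        return False
    if record.get("gender") not in ("male", "female", "unknown"):
        return False
    return birth <= datetime.date.today()

def check_logic(record):
    """Logic rule: a person's age is generally between 0 and 120."""
    age = record.get("age")
    return isinstance(age, int) and 0 <= age <= 120
```

A record failing `check_legality` would be clamped or dropped, while a record failing `check_logic` would be dropped with a warning, as described above.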
It should be understood that although the steps in the flowchart of Fig. 1 are displayed in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on their execution, and they may be executed in other orders. Moreover, at least some of the steps in Fig. 1 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; their execution order is likewise not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In an embodiment, as shown in Fig. 2, a data cleaning device is provided, including a data acquisition module, a data filtering module, a preliminary cleaning module, a final cleaning module, and a data output module, where:
the data acquisition module is configured to acquire data from a first data source and establish an independent data stream from the acquired data;
the data filtering module is configured to filter the data in the data stream to obtain the data to be cleaned;
the preliminary cleaning module is configured to delete or fill the fields in the data to be cleaned that contain missing values, obtaining preliminarily cleaned data;
the final cleaning module is configured to detect whether the preliminarily cleaned data conforms to preset judgment rules and delete the data that does not, obtaining the final cleaned data;
the data output module is configured to output the final cleaned data to a second data source.
In a specific implementation, the first data source and the second data source are different data categories of the same distributed messaging system.
In an embodiment, the preliminary cleaning module includes a missing-rate sub-module, an importance sub-module, and a missing-value processing sub-module, where:
the missing-rate sub-module is configured to calculate the missing rate of a field as the ratio of the number of missing values in the field to the total number of records;
the importance sub-module is configured to determine the attribute importance of the field according to the indicators to be analyzed;
the missing-value processing sub-module is configured to delete or fill the fields containing missing values according to the missing rate and attribute importance of the field.
Further, the missing-value processing sub-module includes a comparison unit and a primary processing unit, where:
the comparison unit is configured to compare the missing rate and attribute importance of a field against the preset missing-rate threshold and importance-rating threshold, respectively; the primary processing unit is configured to fill, delete, or complete fields.
When the missing rate of the field is lower than the preset missing-rate threshold and the attribute importance is lower than the preset importance-rating threshold, the field is filled;
when the missing rate of the field is not lower than the preset missing-rate threshold and the attribute importance is lower than the preset importance-rating threshold, the field is deleted;
when the missing rate of the field is not lower than the preset missing-rate threshold and the attribute importance is higher than the preset importance-rating threshold, the missing values of the field are completed.
In an embodiment, the data cleaning device further includes a data exploration module, configured to explore, before the data in the data stream is filtered, the metadata describing the data attributes of the data in the first data source, identify quality problems in the data from the metadata analysis, and set filtering rules according to those quality problems.
In an embodiment, the data filtering module includes a row-level filtering unit and a column-level filtering unit, where the row-level filtering unit is configured to remove unneeded rows from the data, and the column-level filtering unit is configured to, when a row has multiple columns, select and retain only the fields corresponding to the required columns.
In an embodiment, the final cleaning module includes a legality detection unit, a logic detection unit, and a final processing unit, where:
the legality detection unit is configured to detect whether the preliminarily cleaned data conforms to preset legality rules;
the logic detection unit is configured to detect whether the preliminarily cleaned data conforms to preset logic rules;
the final processing unit is configured to set the preliminarily cleaned data that does not conform to the legality rules to the maximum value conforming to those rules, or delete it, and to delete the preliminarily cleaned data that does not conform to the logic rules and generate a warning instruction.
For the specific limitations of the data cleaning device, refer to the limitations of the data cleaning method above, which are not repeated here. Each module in the above data cleaning device can be implemented wholly or partly in software, hardware, or a combination of the two. The modules can be embedded in or independent of a processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In an embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The network interface of the computer device is used to communicate with external terminals through a network connection. The computer program, when executed by the processor, implements a data cleaning method. The display screen of the computer device may be a liquid crystal display or an electronic-ink display, and the input device may be a touch layer covering the display screen, a button, trackball, or touchpad on the housing of the computer device, or an external keyboard, touchpad, or mouse.
In an embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer program: acquiring data from a first data source and establishing an independent data stream from the acquired data; filtering the data in the data stream to obtain the data to be cleaned; deleting or filling the fields in the data to be cleaned that contain missing values, obtaining preliminarily cleaned data; detecting whether the preliminarily cleaned data conforms to preset judgment rules and deleting the data that does not, obtaining the final cleaned data; and outputting the final cleaned data to a second data source.
In an embodiment, the processor further implements the following steps when executing the computer program: calculating the missing rate of a field as the ratio of the number of missing values in the field to the total number of records; determining the attribute importance of the field according to the indicators to be analyzed; and deleting or filling the fields containing missing values according to the missing rate and attribute importance of the field.
In an embodiment, the processor further implements the following steps when executing the computer program: when the missing rate of the field is lower than the preset missing-rate threshold and the attribute importance is lower than the preset importance-rating threshold, filling the field; when the missing rate of the field is not lower than the preset missing-rate threshold and the attribute importance is lower than the preset importance-rating threshold, deleting the field; and when the missing rate of the field is not lower than the preset missing-rate threshold and the attribute importance is higher than the preset importance-rating threshold, completing the missing values of the field.
In an embodiment, the processor further implements the following steps when executing the computer program: exploring the metadata describing the data attributes of the data in the first data source, identifying quality problems in the data from the metadata analysis, setting filtering rules according to those quality problems, and filtering the data in the data stream according to the filtering rules to obtain the data to be cleaned.
In an embodiment, the processor further implements the following steps when executing the computer program: row-level filtering, which removes unneeded rows from the data; and column-level filtering, which, when a row has multiple columns, selects and retains only the fields corresponding to the required columns.
The preset judgment rules include legality rules and logic rules. In an embodiment, the processor further implements the following steps when executing the computer program: if the preliminarily cleaned data does not conform to the legality rules, setting the data to the maximum value that conforms to the legality rules, or deleting it; if the preliminarily cleaned data does not conform to the logic rules, deleting the data and generating a warning instruction.
在一个实施例中,提供了一种计算机可读存储介质,其上存储有计算机程序,计算机程序被处理器执行时实现以下步骤:从第一数据源中获取数据,利用获取的数据建立一个独立的数据流;对数据流中的数据进行过滤处理,得到待清洗数据;对待清洗数据中包含缺失值的字段进行删除或填充,得到初步清洗数据;检测初步清洗数据是否符合预设的判定规则,删除不符合判定规则的数据,得到最终清洗数据;将最终清洗数据输出到第二数据源。In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When the computer program is executed by a processor, the following steps are implemented: obtain data from a first data source, and use the obtained data to establish an independent The data stream of the data stream; the data in the data stream is filtered to obtain the data to be cleaned; the fields containing missing values in the data to be cleaned are deleted or filled to obtain the preliminary cleaned data; whether the preliminary cleaned data meets the preset judgment rules, Delete the data that does not meet the judgment rule to obtain the final cleaned data; output the final cleaned data to the second data source.
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:根据字段的缺失值条数占总条数的比例,计算得到字段的缺失率;根据需要分析的指标,确定字段的属性重要程度;根据字段的缺失率和属性重要程度,对包含缺失值的字段进行删除或填充。In one embodiment, when the computer program is executed by the processor, the following steps are also implemented: according to the ratio of the number of missing values of the field to the total number, the missing rate of the field is calculated; and the attribute of the field is determined to be important according to the indicators to be analyzed. Degree; according to the missing rate of the field and the importance of the attribute, the field containing the missing value is deleted or filled.
在一个实施例中,计算机程序被处理器执行时还实现以下步骤:当字段的 缺失率低于预设的缺失率阈值且属性重要程度低于预设的重要评级阈值时,对字段进行填充;当字段的缺失率不低于预设的缺失率阈值且属性重要程度低于预设的重要评级阈值时,删除字段;当字段的缺失率不低于预设的缺失率阈值且属性重要程度高于预设的重要评级阈值时,对字段的缺失值进行补全。In one embodiment, when the computer program is executed by the processor, the following steps are further implemented: when the missing rate of the field is lower than the preset missing rate threshold and the attribute importance is lower than the preset important rating threshold, the field is filled; When the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance is lower than the preset important rating threshold, delete the field; when the missing rate of the field is not lower than the preset missing rate threshold and the attribute importance is high In the preset important rating threshold, the missing value of the field is completed.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: exploring metadata that describes the data attributes of the data in the first data source, analyzing the metadata to identify quality problems in the data, setting filtering rules according to the quality problems, and filtering the data in the data stream according to the filtering rules to obtain the data to be cleaned.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: row-level filtering, which removes unneeded rows from the data; and column-level filtering, which, when a row has multiple columns, selects and retains only the fields corresponding to the required columns.
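Row-level and column-level filtering as described can be sketched as follows; the row predicate and the column list are illustrative assumptions.

```python
def filter_rows(rows, predicate):
    """Row-level filtering: discard rows that are not needed."""
    return [row for row in rows if predicate(row)]

def filter_columns(rows, required_columns):
    """Column-level filtering: keep only the fields of the required columns."""
    return [{col: row[col] for col in required_columns if col in row}
            for row in rows]

rows = [
    {"id": 1, "name": "a", "debug": "x"},
    {"id": None, "name": "b", "debug": "y"},  # unneeded row, removed
]
kept = filter_rows(rows, lambda r: r["id"] is not None)
projected = filter_columns(kept, ["id", "name"])  # "debug" column dropped
```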
The preset judgment rules include legality rules and logic rules. In one embodiment, the computer program, when executed by the processor, further implements the following steps: if the preliminarily cleaned data does not comply with the legality rules, setting it to the maximum value that complies with the legality rules, or deleting it; if the preliminarily cleaned data does not comply with the logic rules, deleting it and generating a warning instruction.
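One way to sketch this handling: a legality rule is modeled here as a numeric upper bound (out-of-range values are capped at the legal maximum), and a logic rule as a cross-field predicate whose violation deletes the record and raises a warning. The field names and rule shapes are assumptions for the example.

```python
import warnings

def apply_judgment_rules(records, field, max_value, logic_rule):
    final = []
    for r in records:
        # Legality rule: cap values above the legal maximum at that maximum.
        if r[field] > max_value:
            r = {**r, field: max_value}
        # Logic rule: violating records are deleted and a warning is issued.
        if not logic_rule(r):
            warnings.warn(f"logic rule violated, record dropped: {r}")
            continue
        final.append(r)
    return final

records = [
    {"age": 30, "retired": False},
    {"age": 250, "retired": False},  # illegal age, capped to 120
    {"age": 20, "retired": True},    # fails the logic rule, deleted
]
result = apply_judgment_rules(
    records, field="age", max_value=120,
    logic_rule=lambda r: not (r["retired"] and r["age"] < 60),
)
```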
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random-access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double-data-rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For conciseness, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their description is relatively specific and detailed, but they should not be understood as limiting the scope of the invention patent. It should be pointed out that a person of ordinary skill in the art can make several modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

  1. A data cleaning method, the method comprising:
    obtaining data from a first data source, and using the obtained data to establish an independent data stream;
    filtering the data in the data stream to obtain data to be cleaned;
    deleting or filling fields containing missing values in the data to be cleaned to obtain preliminarily cleaned data;
    checking whether the preliminarily cleaned data complies with preset judgment rules, and deleting data that does not comply with the judgment rules to obtain finally cleaned data;
    outputting the finally cleaned data to a second data source.
  2. The method according to claim 1, wherein deleting or filling the fields containing missing values in the data to be cleaned comprises:
    calculating the missing rate of a field from the ratio of the number of records with missing values in the field to the total number of records;
    determining the attribute importance of the field according to the indicators to be analyzed;
    deleting or filling the fields containing missing values according to the field's missing rate and attribute importance.
  3. The method according to claim 2, wherein deleting or filling the fields containing missing values according to the field's missing rate and attribute importance comprises:
    when the missing rate of a field is below the preset missing-rate threshold and its attribute importance is below the preset importance-rating threshold, filling the field;
    when the missing rate of a field is not below the preset missing-rate threshold and its attribute importance is below the preset importance-rating threshold, deleting the field;
    when the missing rate of a field is not below the preset missing-rate threshold and its attribute importance is above the preset importance-rating threshold, completing the missing values of the field.
  4. The method according to claim 1, further comprising:
    exploring metadata that describes the data attributes of the data in the first data source, analyzing the metadata to identify quality problems in the data, and setting filtering rules according to the quality problems;
    wherein filtering the data in the data stream to obtain the data to be cleaned comprises: filtering the data in the data stream according to the filtering rules to obtain the data to be cleaned.
  5. The method according to any one of claims 1 to 4, wherein filtering the data in the data stream comprises:
    row-level filtering, which removes unneeded rows from the data;
    column-level filtering, which, when a row has multiple columns, selects and retains only the fields corresponding to the required columns.
  6. The method according to any one of claims 1 to 4, wherein the preset judgment rules include legality rules and logic rules, and checking whether the preliminarily cleaned data complies with the preset judgment rules comprises:
    if the preliminarily cleaned data does not comply with the legality rules, setting the preliminarily cleaned data to the maximum value that complies with the legality rules, or deleting it;
    if the preliminarily cleaned data does not comply with the logic rules, deleting the preliminarily cleaned data and generating a warning instruction.
  7. The method according to claim 1, wherein the first data source and the second data source are different data categories of the same distributed messaging system; further, the distributed messaging system is Kafka, the first data source and the second data source are two different Kafka Topics, and the data stream is a data stream based on Spark Streaming.
  8. A data cleaning apparatus, the apparatus comprising:
    a data acquisition module, configured to obtain data from a first data source and use the obtained data to establish an independent data stream;
    a data filtering module, configured to filter the data in the data stream to obtain data to be cleaned;
    a preliminary cleaning module, configured to delete or fill fields containing missing values in the data to be cleaned to obtain preliminarily cleaned data;
    a final cleaning module, configured to check whether the preliminarily cleaned data complies with preset judgment rules, and to delete data that does not comply with the judgment rules to obtain finally cleaned data;
    a data output module, configured to output the finally cleaned data to a second data source.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2019/109121 2019-04-17 2019-09-29 Data cleansing method WO2020211299A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3177209A CA3177209A1 (en) 2019-04-17 2019-09-29 Data cleaning method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910308949.0 2019-04-17
CN201910308949.0A CN110162519A (en) 2019-04-17 2019-04-17 Data clearing method

Publications (1)

Publication Number Publication Date
WO2020211299A1 true WO2020211299A1 (en) 2020-10-22

Family

ID=67639550

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/109121 WO2020211299A1 (en) 2019-04-17 2019-09-29 Data cleansing method

Country Status (3)

Country Link
CN (1) CN110162519A (en)
CA (1) CA3177209A1 (en)
WO (1) WO2020211299A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535697A (en) * 2021-07-07 2021-10-22 广州三叠纪元智能科技有限公司 Climbing frame data cleaning method, climbing frame control device and storage medium
CN114385606A (en) * 2021-12-09 2022-04-22 湖北省信产通信服务有限公司数字科技分公司 Big data cleaning method and system, storage medium and electronic equipment

Families Citing this family (15)

CN110162519A (en) * 2019-04-17 2019-08-23 苏宁易购集团股份有限公司 Data clearing method
CN110716928A (en) * 2019-09-09 2020-01-21 上海凯京信达科技集团有限公司 Data processing method, device, equipment and storage medium
CN110704410A (en) * 2019-09-27 2020-01-17 中冶赛迪重庆信息技术有限公司 Data cleaning method, system and equipment
CN110781176A (en) * 2019-11-06 2020-02-11 国网山东省电力公司威海供电公司 Power grid data quality improvement method based on data correlation
CN110990447B (en) * 2019-12-19 2023-09-15 北京锐安科技有限公司 Data exploration method, device, equipment and storage medium
CN111563071A (en) * 2020-04-03 2020-08-21 深圳价值在线信息科技股份有限公司 Data cleaning method and device, terminal equipment and computer readable storage medium
CN111966735A (en) * 2020-07-22 2020-11-20 山东高速信息工程有限公司 NIFI-based micro-service data interaction method and system
CN111859814B (en) * 2020-07-30 2023-07-28 中国电建集团昆明勘测设计研究院有限公司 Rock aging deformation prediction method and system based on LSTM deep learning
CN112287562B (en) * 2020-11-18 2023-03-10 国网新疆电力有限公司经济技术研究院 Power equipment retired data completion method and system
CN113268476A (en) * 2021-06-07 2021-08-17 一汽解放汽车有限公司 Data cleaning method and device applied to Internet of vehicles and computer equipment
CN113568811A (en) * 2021-07-28 2021-10-29 中国南方电网有限责任公司 Distributed safety monitoring data processing method
CN114356902A (en) * 2021-12-14 2022-04-15 中核武汉核电运行技术股份有限公司 Industrial data quality management method and device
CN115794795B (en) * 2022-12-08 2023-09-22 湖北华中电力科技开发有限责任公司 Power distribution station electricity consumption data standardization cleaning method, device, system and storage medium
CN116186698A (en) * 2022-12-16 2023-05-30 广东技术师范大学 Machine learning-based secure data processing method, medium and equipment
CN115809406B (en) * 2023-02-03 2023-05-12 佰聆数据股份有限公司 Fine granularity classification method, device, equipment and storage medium for electric power users

Citations (6)

US20160179599A1 (en) * 2012-10-11 2016-06-23 University Of Southern California Data processing framework for data cleansing
CN105989163A (en) * 2015-03-04 2016-10-05 中国移动通信集团福建有限公司 Data real-time processing method and system
CN106294745A (en) * 2016-08-10 2017-01-04 东方网力科技股份有限公司 Big data cleaning method and device
CN109255523A (en) * 2018-08-16 2019-01-22 北京奥技异科技发展有限公司 Analysis indexes computing platform based on KKS coding rule and big data framework
CN109492002A (en) * 2018-10-19 2019-03-19 浙江大学华南工业技术研究院 A kind of storage of smart grid big data and analysis system and processing method
CN110162519A (en) * 2019-04-17 2019-08-23 苏宁易购集团股份有限公司 Data clearing method

Family Cites Families (3)

CN107025301A (en) * 2017-04-25 2017-08-08 西安理工大学 Flight ensures the method for cleaning of data
CN108596386A (en) * 2018-04-20 2018-09-28 上海市司法局 A kind of prediction convict repeats the method and system of crime probability
CN109063964A (en) * 2018-07-02 2018-12-21 浙江百先得服饰有限公司 A kind of platform data processing system


Also Published As

Publication number Publication date
CA3177209A1 (en) 2020-10-22
CN110162519A (en) 2019-08-23


Legal Events

121: Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 19925369; Country of ref document: EP; Kind code of ref document: A1
NENP: Non-entry into the national phase. Ref country code: DE
122: Ep: pct application non-entry in european phase. Ref document number: 19925369; Country of ref document: EP; Kind code of ref document: A1
32PN: Ep: public notification in the ep bulletin as address of the addressee cannot be established. Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 22/04/2022)
ENP: Entry into the national phase. Ref document number: 3177209; Country of ref document: CA
122: Ep: pct application non-entry in european phase. Ref document number: 19925369; Country of ref document: EP; Kind code of ref document: A1