CN110162519A - Data clearing method - Google Patents

Publication number
CN110162519A
CN110162519A CN201910308949.0A CN201910308949A CN110162519A CN 110162519 A CN110162519 A CN 110162519A CN 201910308949 A CN201910308949 A CN 201910308949A CN 110162519 A CN110162519 A CN 110162519A
Authority
CN
China
Prior art keywords
data
field
cleaning
miss rate
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910308949.0A
Other languages
Chinese (zh)
Inventor
张礼成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xingyun Digital Technology Co Ltd
Original Assignee
Suningcom Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suningcom Group Co Ltd
Priority to CN201910308949.0A (CN110162519A)
Publication of CN110162519A
Priority to CA3177209A (CA3177209A1)
Priority to PCT/CN2019/109121 (WO2020211299A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Preliminary Treatment Of Fibers (AREA)

Abstract

This application relates to a data cleaning method. The method includes: obtaining data from a first data source and establishing an independent data stream from the obtained data; filtering the data in the data stream to obtain data to be cleaned; deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data; checking whether the preliminarily cleaned data satisfies preset decision rules and deleting the data that does not, to obtain final cleaned data; and outputting the final cleaned data to a second data source. The method can improve data security.

Description

Data cleaning method
Technical field
This application relates to the technical field of big data processing, and in particular to a data cleaning method.
Background
With the arrival of the network era, massive amounts of information continuously pour into the network, and data volume is growing at a rate of 50% per year. Supported by these huge data sources, business decisions are increasingly based on data analysis rather than, as in the past, on experience and intuition alone. Data cleaning is an indispensable step in the overall data analysis process, and the quality of its results directly affects model performance and the final analysis conclusions. Data cleaning refers to the process of re-examining and verifying data, with the aim of deleting duplicate data, correcting existing errors, and ensuring data consistency. In practice, data cleaning usually takes up 50% to 80% of the time of the data analysis process.
Data cleaning falls into two classes: offline data cleaning and real-time data cleaning. Offline cleaning can trade performance for finer-grained cleaning through complex processing, including missing value handling, outlier handling, duplicate value handling, null value filling, unit unification, whether to standardize, whether to delete unnecessary variables, and whether to sort. Compared with offline cleaning, real-time cleaning, because of its real-time requirements, leans toward missing value filling, filtering, and data validity checking. However, existing data cleaning processes are usually integrated with the data analysis process. The two are tightly coupled, the cleaning process is strongly affected by the rest of the data analysis code, data loss occurs easily, and data security is poor.
Summary of the invention
In view of the above technical problems, it is necessary to provide a data cleaning method that can improve data security.
A data cleaning method, the method including:
obtaining data from a first data source, and establishing an independent data stream from the obtained data;
filtering the data in the data stream to obtain data to be cleaned;
deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data;
checking whether the preliminarily cleaned data satisfies preset decision rules, and deleting the data that does not satisfy the decision rules, to obtain final cleaned data;
outputting the final cleaned data to a second data source.
In one embodiment, deleting or filling the fields in the data to be cleaned that contain missing values includes:
calculating the missing rate of a field as the ratio of the number of its missing-value records to the total number of records;
determining the attribute importance of the field according to the metrics to be analyzed;
deleting or filling the field containing missing values according to its missing rate and attribute importance.
In one embodiment, deleting or filling the field containing missing values according to its missing rate and attribute importance includes:
when the missing rate of the field is lower than a preset missing rate threshold and its attribute importance is lower than a preset importance threshold, filling the field;
when the missing rate of the field is not lower than the preset missing rate threshold and its attribute importance is lower than the preset importance threshold, deleting the field;
when the missing rate of the field is not lower than the preset missing rate threshold and its attribute importance is higher than the preset importance threshold, completing the missing values of the field.
In one embodiment, the method further includes:
examining the metadata that describes the attributes of the data in the first data source, deriving the quality problems present in the data by analyzing the metadata, and setting filtering rules according to the quality problems;
filtering the data in the data stream to obtain the data to be cleaned then includes: filtering the data in the data stream according to the filtering rules to obtain the data to be cleaned.
In one embodiment, filtering the data in the data stream includes:
row-level filtering, which removes the rows of the data that are not needed;
column-level filtering, which, when a row has multiple columns, selects and keeps only the fields corresponding to the required columns.
In one embodiment, the preset decision rules include validity rules and logic rules, and checking whether the preliminarily cleaned data satisfies the preset decision rules includes:
if the preliminarily cleaned data does not satisfy the validity rules, setting it to the maximum value that satisfies the validity rules, or deleting it;
if the preliminarily cleaned data does not satisfy the logic rules, deleting it and generating a warning indication.
In one embodiment, the first data source and the second data source are different data categories of the same distributed messaging system. Further, the distributed messaging system is Kafka, and the first data source and the second data source are two different Kafka topics; the data stream is based on Spark Streaming.
A data cleaning device, the device including:
a data acquisition module, configured to obtain data from a first data source and establish an independent data stream from the obtained data;
a data filtering module, configured to filter the data in the data stream to obtain data to be cleaned;
a preliminary cleaning module, configured to delete or fill the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data;
a final cleaning module, configured to check whether the preliminarily cleaned data satisfies preset decision rules and delete the data that does not satisfy the decision rules, to obtain final cleaned data;
a data output module, configured to output the final cleaned data to a second data source.
A computer device, including a memory, a processor, and a computer program stored on the memory and runnable on the processor, where the processor performs the following steps when executing the computer program:
obtaining data from a first data source, and establishing an independent data stream from the obtained data;
filtering the data in the data stream to obtain data to be cleaned;
deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data;
checking whether the preliminarily cleaned data satisfies preset decision rules, and deleting the data that does not satisfy the decision rules, to obtain final cleaned data;
outputting the final cleaned data to a second data source.
A computer-readable storage medium storing a computer program, where the computer program performs the following steps when executed by a processor:
obtaining data from a first data source, and establishing an independent data stream from the obtained data;
filtering the data in the data stream to obtain data to be cleaned;
deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data;
checking whether the preliminarily cleaned data satisfies preset decision rules, and deleting the data that does not satisfy the decision rules, to obtain final cleaned data;
outputting the final cleaned data to a second data source.
Compared with the prior art, the beneficial effects of the present invention are as follows:
The data cleaning method, device, computer device, and storage medium perform data cleaning through an independent data stream. Data obtained from the first data source is cleaned and then placed in another data source for subsequent business processing, so that the data cleaning process is separated from the data analysis code. This reduces the coupling between pieces of code and effectively improves data security;
Further, the present invention places data filtering as the first step of data cleaning, which reduces the amount of data that subsequently needs to be cleaned and greatly improves cleaning efficiency.
Brief description of the drawings
Fig. 1 is a flow diagram of the data cleaning method in one embodiment;
Fig. 2 is a structural block diagram of the data cleaning device in one embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the application and do not limit it.
In one embodiment, as shown in Fig. 1, this application provides a data cleaning method including the following steps:
Step 101: obtain data from the first data source, and establish an independent data stream from the obtained data.
Here, the first data source is the source from which the data is obtained; a data stream is an ordered sequence of bytes with a start and an end.
Specifically, the present invention performs data cleaning through an independent data stream, which separates the data cleaning process from the data analysis code and reduces the coupling between pieces of code.
Step 102: filter the data in the data stream to obtain data to be cleaned.
Specifically, placing data filtering as the first step of data cleaning effectively reduces the amount of data that subsequently needs to be cleaned and greatly improves cleaning efficiency.
Step 103: delete or fill the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data;
A missing value means that information is absent from the data, that is, the value of one or more attributes of the data is incomplete.
Step 104: check whether the preliminarily cleaned data satisfies preset decision rules, and delete the data that does not satisfy the decision rules, to obtain final cleaned data;
Step 105: output the final cleaned data to the second data source.
The second data source is another data source, different from the first data source, that stores data for use or processing by subsequent business.
Specifically, the data cleaning process of the present invention is independent of the other processing in data analysis and is not affected by other code, so data security is higher.
In the above data cleaning method, data cleaning is performed through an independent data stream. Data obtained from the first data source is cleaned and then placed in another data source for subsequent business processing, so that the data cleaning process is separated from the data analysis code. This reduces the coupling between pieces of code and effectively improves data security.
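Steps 101 to 105 can be sketched as a small pipeline that is independent of any analysis code. This is a minimal illustration only: the function names `clean_stream` and `run`, the per-record callbacks, and the use of plain Python lists in place of the two data sources are all assumptions, and missing-value handling is simplified to work record by record rather than field by field.

```python
from typing import Callable, Iterable, Iterator, Optional

Record = dict  # hypothetical record type: one dict per row


def clean_stream(
    source: Iterable[Record],
    keep: Callable[[Record], bool],
    handle_missing: Callable[[Record], Optional[Record]],
    is_valid: Callable[[Record], bool],
) -> Iterator[Record]:
    """Steps 101-104 as one independent generator over the first data source."""
    for record in source:                # step 101: read from the first data source
        if not keep(record):             # step 102: filtering comes first
            continue
        record = handle_missing(record)  # step 103: fill, or return None to drop
        if record is None:
            continue
        if is_valid(record):             # step 104: preset decision rules
            yield record


def run(source: Iterable[Record], sink: list, **stages) -> None:
    """Step 105: write the final cleaned records to the second data source."""
    sink.extend(clean_stream(source, **stages))


# Hypothetical input: two in-memory lists stand in for the two data sources.
raw = [
    {"channel": "a", "age": 30},
    {"channel": None, "age": 25},
    {"channel": "b", "age": None},
    {"channel": "c", "age": 200},
]
cleaned: list = []
run(
    raw,
    cleaned,
    keep=lambda r: r["channel"] is not None,
    handle_missing=lambda r: {**r, "age": 0} if r["age"] is None else r,
    is_valid=lambda r: 0 <= r["age"] <= 120,
)
```

Because the analysis code only ever reads from the sink, it cannot interfere with the cleaning stages, which is the decoupling the method aims for.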
As a specific implementation, the first data source and the second data source are different data categories of the same distributed messaging system. For example, the distributed messaging system is Kafka, and the first data source and the second data source are two different Kafka topics; the data stream is based on Spark Streaming.
In one embodiment, deleting or filling the fields in the data to be cleaned that contain missing values includes:
calculating the missing rate of a field as the ratio of the number of its missing-value records to the total number of records;
determining the attribute importance of the field according to the metrics to be analyzed;
deleting or filling the field containing missing values according to its missing rate and attribute importance.
The missing rate of a field is the ratio of the number of its missing-value records to the total number of records;
For example, if a wage field has 100 records in total and 20 of them are missing values, the missing rate is 20%.
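The wage example can be reproduced in a few lines. This is a sketch that assumes missing values are represented as `None`; the function name `miss_rate` is illustrative and does not come from the application.

```python
def miss_rate(values):
    """Fraction of a field's records that are missing (represented here as None)."""
    if not values:
        raise ValueError("field has no records")
    missing = sum(1 for v in values if v is None)
    return missing / len(values)


# Wage field from the example: 100 records, 20 of them missing.
wages = [None] * 20 + [5000] * 80
rate = miss_rate(wages)
```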
The criterion for a field's attribute importance is determined by the metrics to be analyzed. For example, if users need to be profiled or tagged to provide data for subsequent precision marketing, user attribute information must be collected, so attributes such as the user's age and gender are important fields.
In one embodiment, deleting or filling the field containing missing values according to its missing rate and attribute importance includes:
when the missing rate of the field is lower than a preset missing rate threshold and its attribute importance is lower than a preset importance threshold, filling the field;
Specifically, if the field is numeric, it is filled according to the data distribution. More specifically, if the data is evenly distributed, the field is filled with the mean; if the distribution is skewed, the field is filled with the median.
when the missing rate of the field is not lower than the preset missing rate threshold and its attribute importance is lower than the preset importance threshold, deleting the field;
when the missing rate of the field is not lower than the preset missing rate threshold and its attribute importance is higher than the preset importance threshold, completing the missing values of the field.
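The numeric fill described for the first case (mean for evenly distributed data, median for skewed data) could look like the following sketch. The text does not say how skew is detected, so the test used here, comparing the gap between mean and median against a fraction of the standard deviation, and the `skew_tolerance` parameter are assumptions made purely for illustration.

```python
from statistics import mean, median, pstdev


def fill_numeric(values, skew_tolerance=0.2):
    """Fill None entries with the mean, or with the median if the data looks skewed."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to estimate a fill value from
    m, med, sd = mean(present), median(present), pstdev(present)
    # Assumption: call the field skewed when mean and median differ by more
    # than skew_tolerance standard deviations.
    skewed = sd > 0 and abs(m - med) > skew_tolerance * sd
    fill = med if skewed else m
    return [fill if v is None else v for v in values]


symmetric = fill_numeric([1, 2, 3, None])          # filled with the mean
skewed = fill_numeric([1, 1, 1, 1, 100, None])     # filled with the median
```

In the second call the single large value pulls the mean far above the median, so the median is the more representative fill value.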
Specifically, completing the missing values of a field includes:
completing from other information, for example deriving gender, native place, date of birth, and age from the ID card number;
completing from the preceding and following data, for example, when a time series lacks data, the mean of the preceding and following values can be used as the completion value; when many values are missing, a value obtained by smoothing can be used as the completion value;
where completion is impossible, the value must be set aside but not deleted, because it will be used later.
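Completion from neighboring values in a time series can be sketched as below. The function name is hypothetical, and the sketch covers only the isolated-gap case; longer runs of missing values are left untouched here, whereas the text suggests smoothing for them without naming a particular method.

```python
def complete_with_neighbors(series):
    """Replace each isolated None with the mean of the values before and after it."""
    out = list(series)
    for i in range(1, len(series) - 1):
        prev, nxt = series[i - 1], series[i + 1]
        if series[i] is None and prev is not None and nxt is not None:
            out[i] = (prev + nxt) / 2
    return out


completed = complete_with_neighbors([10, None, 14, None, None, 20])
```

The values that remain `None` after this pass are the ones that would need smoothing or would be set aside for later use.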
As a specific implementation, the missing rate threshold can be any value from 90% to 95%.
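The three cases and the threshold can be combined into one decision function, sketched below. The 0 to 1 importance score, the default importance threshold of 0.5, and the `UNSPECIFIED` result are assumptions: the text does not say what happens when importance is high and the missing rate is low, or when importance equals the threshold, so the sketch reports those combinations instead of guessing.

```python
from enum import Enum


class Action(Enum):
    FILL = "fill"
    DELETE = "delete"
    COMPLETE = "complete"
    UNSPECIFIED = "unspecified"  # combination not covered by the described cases


def choose_action(miss_rate, importance,
                  miss_rate_threshold=0.9, importance_threshold=0.5):
    """Map a field's missing rate and importance score to a cleaning action."""
    high_missing = miss_rate >= miss_rate_threshold   # "not lower than" the threshold
    low_importance = importance < importance_threshold
    high_importance = importance > importance_threshold
    if not high_missing and low_importance:
        return Action.FILL
    if high_missing and low_importance:
        return Action.DELETE
    if high_missing and high_importance:
        return Action.COMPLETE
    return Action.UNSPECIFIED
```

For instance, `choose_action(0.2, 0.1)` selects filling, while `choose_action(0.95, 0.9)` selects completion.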
In one embodiment, before the data in the data stream is filtered, the metadata describing the attributes of the data in the first data source is first examined, the quality problems present in the data are then derived by analyzing the metadata, and filtering rules are set according to those quality problems. Step 102 then filters the data in the data stream according to the filtering rules to obtain the data to be cleaned.
Metadata, also called intermediary data or relay data, is data that describes data. It mainly describes data attributes and supports functions such as indicating storage locations, historical data, resource lookup, and file records.
Specifically, encapsulating the data attributes to be processed as metadata gives the program better extensibility. Formulating filtering rules that target the data's quality problems also helps improve the efficiency of data filtering.
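One way to picture metadata-driven filtering is to keep the field attributes in a small description and derive the filter from it, so new fields only require a metadata change. The sketch below is an illustration under stated assumptions: the metadata layout (`type`, `required`), the field names, and the function `build_filter` are hypothetical and not taken from the application.

```python
def build_filter(metadata):
    """Turn per-field metadata into a record-level filter predicate."""
    def keep(record):
        for name, meta in metadata.items():
            value = record.get(name)
            if value is None:
                if meta.get("required", False):
                    return False       # required field is missing
                continue
            if not isinstance(value, meta["type"]):
                return False           # value does not match the declared type
        return True
    return keep


# Hypothetical metadata for two log fields.
metadata = {
    "cid": {"type": str, "required": True},
    "duration": {"type": int, "required": False},
}
keep = build_filter(metadata)
rows = [
    {"cid": "home", "duration": 12},
    {"cid": None, "duration": 3},
    {"cid": "search", "duration": "n/a"},
    {"cid": "cart"},
]
filtered = [r for r in rows if keep(r)]
```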
In one embodiment, filtering the data in the data stream includes:
row-level filtering, which removes the rows of the data that are not needed;
column-level filtering, which, when a row has multiple columns, selects and keeps only the fields corresponding to the required columns.
Specifically, combining row-level and column-level filtering effectively speeds up data filtering.
For example, consider computing pv/uv per channel:
The log data contains nearly 200 fields, including the visitor's IP address, browser information, client device information, the specific access time, the specific page accessed, the referring page, and the access duration. The requirement in this embodiment is to count the number of clicks for each channel and the number of visits from distinct IPs.
Row-level filtering: keep only the log data related to a channel, filtering out the log data that contains no channel;
Column-level filtering: from the nearly 200 fields in the channel-related log data, select cid (channel name), uid (device identifier), and the IP address, and filter out the unneeded fields; the pv/uv of each channel can then be computed;
pv is short for page view, that is, the page view count: each visit by a user to a page of the website is recorded once, and a user's repeated visits to the same page accumulate in the pv total;
uv is short for unique visitor, meaning a natural person who accesses and browses the web page over the internet.
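The channel example can be sketched end to end with in-memory records. The field names `cid`, `uid`, and `ip` follow the text; the function names, the sample log entries, and the choice to count unique visitors by distinct IP (as the stated requirement suggests) are assumptions for illustration.

```python
from collections import defaultdict


def filter_rows(logs):
    """Row-level filtering: keep only log entries that carry a channel."""
    return [r for r in logs if r.get("cid")]


def filter_columns(logs, columns=("cid", "uid", "ip")):
    """Column-level filtering: keep only the fields needed for the statistic."""
    return [{c: r.get(c) for c in columns} for r in logs]


def pv_uv(logs):
    """Per channel: (page views, number of distinct IPs)."""
    pv = defaultdict(int)
    visitors = defaultdict(set)
    for r in logs:
        pv[r["cid"]] += 1
        visitors[r["cid"]].add(r["ip"])
    return {cid: (pv[cid], len(visitors[cid])) for cid in pv}


# Hypothetical log entries; real logs would carry nearly 200 fields.
logs = [
    {"cid": "home", "uid": "d1", "ip": "10.0.0.1", "browser": "x"},
    {"cid": "home", "uid": "d1", "ip": "10.0.0.1", "browser": "x"},
    {"cid": "home", "uid": "d2", "ip": "10.0.0.2", "browser": "y"},
    {"cid": None, "uid": "d3", "ip": "10.0.0.3", "browser": "y"},
    {"cid": "search", "uid": "d2", "ip": "10.0.0.2", "browser": "y"},
]
stats = pv_uv(filter_columns(filter_rows(logs)))
```

Filtering rows before columns means the column projection only touches records that will actually be counted.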
In this embodiment, with extensibility in mind (for example, later data processing may need to compute the user retention rate), data such as the access time of each IP address can additionally be recorded.
The user retention rate is the proportion of returning users among all users.
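Under that definition, a retention rate can be computed from two sets of user identifiers. The sketch assumes that "returning" means a user who appears both in an earlier period and in the current one; the function name and the choice of identifiers are illustrative.

```python
def retention_rate(previous_users, current_users):
    """Share of the current period's users who were already seen earlier."""
    current = set(current_users)
    if not current:
        return 0.0
    returning = current & set(previous_users)
    return len(returning) / len(current)


rate = retention_rate({"a", "b", "c"}, {"b", "c", "d", "e"})
```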
In one embodiment, the preset decision rules include validity rules and logic rules, and checking whether the preliminarily cleaned data satisfies the preset decision rules includes:
if the preliminarily cleaned data does not satisfy the validity rules, setting it to the maximum value that satisfies the validity rules, or deleting it;
Validity rules are rules on format requirements, such as those for numeric values and dates, and on field contents.
Specifically, a field type validity rule: the date field format is "YYYY-MM-DD".
A field content validity rule: gender is male, female, or unknown; the date of birth is earlier than or equal to today;
if the preliminarily cleaned data does not satisfy the logic rules, deleting it and generating a warning indication.
Logic rules are used to judge whether data conforms to common sense. For example, a person's age is generally between 0 and 120, so a record with an age of 200 is judged to be abnormal.
After cleaning with the validity rules and logic rules, data that does not meet the format requirements or the logic rules has been removed, leaving valid final cleaned data.
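The example rules above can be sketched as two predicates plus a final pass. The record layout (`gender`, `birth_date`, `age`), the function names, and the use of Python's `logging` module for the warning indication are assumptions; the sketch also implements only the deletion branch for validity failures, not the alternative of setting the value to the maximum valid value.

```python
import logging
import re
from datetime import date, datetime

logger = logging.getLogger("cleaning")
ALLOWED_GENDERS = {"male", "female", "unknown"}
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")  # YYYY-MM-DD


def is_legal(record, today=None):
    """Validity rules: date format, allowed gender values, birth date not in the future."""
    today = today or date.today()
    raw = record.get("birth_date")
    if not isinstance(raw, str) or not DATE_PATTERN.fullmatch(raw):
        return False
    try:
        birth = datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError:  # e.g. month 13
        return False
    return record.get("gender") in ALLOWED_GENDERS and birth <= today


def is_logical(record):
    """Logic rule: a person's age lies between 0 and 120."""
    age = record.get("age")
    return isinstance(age, int) and 0 <= age <= 120


def final_clean(records, today=None):
    kept = []
    for r in records:
        if not is_legal(r, today):
            continue  # deletion branch only
        if not is_logical(r):
            logger.warning("logic rule violated, record dropped: %r", r)
            continue
        kept.append(r)
    return kept
```

Passing `today` explicitly keeps the birth-date check reproducible when the function is exercised on fixed sample data.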
It should be understood that although the steps in the flowchart of Fig. 1 are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless expressly stated otherwise herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in Fig. 1 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in Fig. 2, a data cleaning device is provided, including a data acquisition module, a data filtering module, a preliminary cleaning module, a final cleaning module, and a data output module, where:
the data acquisition module is configured to obtain data from a first data source and establish an independent data stream from the obtained data;
the data filtering module is configured to filter the data in the data stream to obtain data to be cleaned;
the preliminary cleaning module is configured to delete or fill the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data;
the final cleaning module is configured to check whether the preliminarily cleaned data satisfies preset decision rules and delete the data that does not satisfy the decision rules, to obtain final cleaned data;
the data output module is configured to output the final cleaned data to a second data source.
In a specific implementation, the first data source and the second data source are different data categories of the same distributed messaging system.
In one embodiment, the preliminary cleaning module includes a missing rate submodule, an importance submodule, and a missing value processing submodule, where:
the missing rate submodule is configured to calculate the missing rate of a field as the ratio of the number of its missing-value records to the total number of records;
the importance submodule is configured to determine the attribute importance of the field according to the metrics to be analyzed;
the missing value processing submodule is configured to delete or fill the field containing missing values according to its missing rate and attribute importance.
Further, the missing value processing submodule includes a comparing unit and a primary processing unit, where:
the comparing unit is configured to compare the field's missing rate and attribute importance with the preset missing rate threshold and the preset importance threshold, respectively; the primary processing unit is configured to fill, delete, or complete the field.
When the missing rate of the field is lower than the preset missing rate threshold and its attribute importance is lower than the preset importance threshold, the field is filled;
when the missing rate of the field is not lower than the preset missing rate threshold and its attribute importance is lower than the preset importance threshold, the field is deleted;
when the missing rate of the field is not lower than the preset missing rate threshold and its attribute importance is higher than the preset importance threshold, the missing values of the field are completed.
In one embodiment, the data cleaning device further includes a data exploration module. Before the data in the data stream is filtered, the data exploration module first examines the metadata describing the attributes of the data in the first data source, then derives the quality problems present in the data by analyzing the metadata, and sets filtering rules according to the quality problems.
In one embodiment, the data filtering module includes a row-level filtering unit and a column-level filtering unit, where the row-level filtering unit is configured to remove the rows of the data that are not needed, and the column-level filtering unit is configured, when a row has multiple columns, to select and keep only the fields corresponding to the required columns.
In one embodiment, the final cleaning module includes a validity detection unit, a logic detection unit, and a final processing unit, where:
the validity detection unit is configured to check whether the preliminarily cleaned data satisfies the preset validity rules;
the logic detection unit is configured to check whether the preliminarily cleaned data satisfies the preset logic rules;
the final processing unit is configured to set preliminarily cleaned data that does not satisfy the validity rules to the maximum value that satisfies the validity rules, or delete it, and to delete preliminarily cleaned data that does not satisfy the logic rules and generate a warning indication.
For the specific limitations of the data cleaning device, refer to the limitations of the data cleaning method above, which are not repeated here. Each module of the data cleaning device can be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, the processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals over a network connection. The computer program implements a data cleaning method when executed by the processor. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and runnable on the processor. When executing the computer program, the processor performs the following steps: obtaining data from a first data source, and establishing an independent data stream from the obtained data; filtering the data in the data stream to obtain data to be cleaned; deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data; checking whether the preliminarily cleaned data satisfies preset decision rules, and deleting the data that does not satisfy the decision rules, to obtain final cleaned data; outputting the final cleaned data to a second data source.
In one embodiment, the processor further performs the following steps when executing the computer program: calculating the missing rate of a field as the ratio of the number of its missing-value records to the total number of records; determining the attribute importance of the field according to the metrics to be analyzed; deleting or filling the field containing missing values according to its missing rate and attribute importance.
In one embodiment, the processor further performs the following steps when executing the computer program: when the missing rate of the field is lower than the preset missing rate threshold and its attribute importance is lower than the preset importance threshold, filling the field; when the missing rate of the field is not lower than the preset missing rate threshold and its attribute importance is lower than the preset importance threshold, deleting the field; when the missing rate of the field is not lower than the preset missing rate threshold and its attribute importance is higher than the preset importance threshold, completing the missing values of the field.
In one embodiment, the processor further performs the following steps when executing the computer program: examining the metadata describing the attributes of the data in the first data source, deriving the quality problems present in the data by analyzing the metadata, setting filtering rules according to the quality problems, and filtering the data in the data stream according to the filtering rules to obtain the data to be cleaned.
In one embodiment, the processor further performs the following steps when executing the computer program: row-level filtering, which removes the rows of the data that are not needed; column-level filtering, which, when a row has multiple columns, selects and keeps only the fields corresponding to the required columns.
In one embodiment, the preset decision rules include validity rules and logic rules, and the processor further performs the following steps when executing the computer program: if the preliminarily cleaned data does not satisfy the validity rules, setting it to the maximum value that satisfies the validity rules, or deleting it; if the preliminarily cleaned data does not satisfy the logic rules, deleting it and generating a warning indication.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the following steps: obtaining data from a first data source and establishing an independent data stream from the obtained data; filtering the data in the data stream to obtain data to be cleaned; deleting or filling the fields in the data to be cleaned that contain missing values to obtain preliminary cleaning data; detecting whether the preliminary cleaning data satisfies preset decision rules and deleting the data that does not satisfy the decision rules to obtain final cleaning data; and outputting the final cleaning data to a second data source.
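The five steps can be chained as in the minimal Python sketch below. It uses in-memory lists in place of the first and second data sources, copies each record to stand in for the independent data stream, and handles missing values only by filling with defaults (field deletion is omitted for brevity). The function name `clean` and all parameter names are assumptions, not part of the source.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


def clean(
    source: Iterable[Record],
    keep: Callable[[Record], bool],
    fill_defaults: Dict[str, Any],
    is_valid: Callable[[Record], bool],
    sink: List[Record],
) -> List[Record]:
    """Run the obtain, filter, fill, validate, and output steps in order."""
    # 1. Obtain data and build an independent stream (copies leave the source untouched).
    stream = (dict(record) for record in source)
    # 2. Filter the stream to get the data to be cleaned.
    to_clean = (record for record in stream if keep(record))
    # 3. Fill missing values to get the preliminary cleaning data.
    preliminary = (
        {**record, **{k: v for k, v in fill_defaults.items() if record.get(k) is None}}
        for record in to_clean
    )
    # 4. Drop records that fail the decision rule to get the final cleaning data.
    final = [record for record in preliminary if is_valid(record)]
    # 5. Output to the second data source.
    sink.extend(final)
    return final
```

Generators keep the first three stages lazy, which loosely mirrors record-by-record stream processing; a real deployment would replace the lists with the actual source and sink.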
In one embodiment, the computer program, when executed by the processor, further implements the following steps: calculating the miss rate of a field as the ratio of the number of records in which the field's value is missing to the total number of records; determining the attribute importance of the field according to the indicators to be analyzed; and deleting or filling the field containing missing values according to its miss rate and attribute importance.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: when the miss rate of a field is lower than a preset miss rate threshold and its attribute importance is lower than a preset importance threshold, filling the field; when the miss rate of the field is not lower than the preset miss rate threshold and its attribute importance is lower than the preset importance threshold, deleting the field; and when the miss rate of the field is not lower than the preset miss rate threshold and its attribute importance is higher than the preset importance threshold, completing the missing values of the field.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: detecting, in the first data source, the metadata that describes the attributes of the data; analyzing the metadata to identify quality problems in the data; setting filtering rules according to the quality problems; and filtering the data in the data stream according to the filtering rules to obtain the data to be cleaned.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: row-level filtering, in which unneeded rows are removed from the data; and column-level filtering, in which, when a row has multiple columns, only the fields corresponding to the required columns are selected and retained.
The preset decision rules include legitimacy rules and logic rules. In one embodiment, the computer program, when executed by the processor, further implements the following steps: if the preliminary cleaning data does not satisfy a legitimacy rule, setting the preliminary cleaning data to the maximum value that satisfies the legitimacy rule, or deleting it; and if the preliminary cleaning data does not satisfy a logic rule, deleting the preliminary cleaning data and generating a warning indication.
Those of ordinary skill in the art will understand that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the application. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of this application, and these all fall within its scope of protection. Therefore, the scope of protection of this patent application shall be subject to the appended claims.

Claims (10)

1. A data cleaning method, characterized in that the method comprises:
obtaining data from a first data source, and establishing an independent data stream from the obtained data;
filtering the data in the data stream to obtain data to be cleaned;
deleting or filling the fields in the data to be cleaned that contain missing values to obtain preliminary cleaning data;
detecting whether the preliminary cleaning data satisfies preset decision rules, and deleting the data that does not satisfy the decision rules to obtain final cleaning data; and
outputting the final cleaning data to a second data source.
2. The method according to claim 1, characterized in that deleting or filling the fields in the data to be cleaned that contain missing values comprises:
calculating the miss rate of a field as the ratio of the number of records in which the field's value is missing to the total number of records;
determining the attribute importance of the field according to the indicators to be analyzed; and
deleting or filling the field containing missing values according to its miss rate and attribute importance.
3. The method according to claim 2, characterized in that deleting or filling the field containing missing values according to its miss rate and attribute importance comprises:
when the miss rate of the field is lower than a preset miss rate threshold and its attribute importance is lower than a preset importance threshold, filling the field;
when the miss rate of the field is not lower than the preset miss rate threshold and its attribute importance is lower than the preset importance threshold, deleting the field; and
when the miss rate of the field is not lower than the preset miss rate threshold and its attribute importance is higher than the preset importance threshold, completing the missing values of the field.
4. The method according to claim 1, characterized in that the method further comprises:
detecting, in the first data source, the metadata that describes the attributes of the data, analyzing the metadata to identify quality problems in the data, and setting filtering rules according to the quality problems;
wherein filtering the data in the data stream to obtain data to be cleaned comprises: filtering the data in the data stream according to the filtering rules to obtain the data to be cleaned.
5. The method according to claim 1, characterized in that filtering the data in the data stream comprises:
row-level filtering, in which unneeded rows are removed from the data; and
column-level filtering, in which, when a row has multiple columns, only the fields corresponding to the required columns are selected and retained.
6. The method according to claim 1, characterized in that the preset decision rules include legitimacy rules and logic rules, and detecting whether the preliminary cleaning data satisfies the preset decision rules comprises:
if the preliminary cleaning data does not satisfy a legitimacy rule, setting the preliminary cleaning data to the maximum value that satisfies the legitimacy rule, or deleting it; and
if the preliminary cleaning data does not satisfy a logic rule, deleting the preliminary cleaning data and generating a warning indication.
7. The method according to claim 1, characterized in that the first data source and the second data source are different data categories of the same distributed messaging system; further, the distributed messaging system is Kafka, and the first data source and the second data source are two different Kafka Topics; and the data stream is a data stream based on Spark Streaming.
8. A data cleaning device, characterized in that the device comprises:
a data acquisition module, configured to obtain data from a first data source and establish an independent data stream from the obtained data;
a data filtering module, configured to filter the data in the data stream to obtain data to be cleaned;
a preliminary cleaning module, configured to delete or fill the fields in the data to be cleaned that contain missing values to obtain preliminary cleaning data;
a final cleaning module, configured to detect whether the preliminary cleaning data satisfies preset decision rules and delete the data that does not satisfy the decision rules to obtain final cleaning data; and
a data output module, configured to output the final cleaning data to a second data source.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN201910308949.0A 2019-04-17 2019-04-17 Data clearing method Pending CN110162519A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201910308949.0A CN110162519A (en) 2019-04-17 2019-04-17 Data clearing method
CA3177209A CA3177209A1 (en) 2019-04-17 2019-09-29 Data cleaning method
PCT/CN2019/109121 WO2020211299A1 (en) 2019-04-17 2019-09-29 Data cleansing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910308949.0A CN110162519A (en) 2019-04-17 2019-04-17 Data clearing method

Publications (1)

Publication Number Publication Date
CN110162519A true CN110162519A (en) 2019-08-23

Family

ID=67639550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910308949.0A Pending CN110162519A (en) 2019-04-17 2019-04-17 Data clearing method

Country Status (3)

Country Link
CN (1) CN110162519A (en)
CA (1) CA3177209A1 (en)
WO (1) WO2020211299A1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704410A (en) * 2019-09-27 2020-01-17 中冶赛迪重庆信息技术有限公司 Data cleaning method, system and equipment
CN110716928A (en) * 2019-09-09 2020-01-21 上海凯京信达科技集团有限公司 Data processing method, device, equipment and storage medium
CN110781176A (en) * 2019-11-06 2020-02-11 国网山东省电力公司威海供电公司 Power grid data quality improvement method based on data correlation
CN110990447A (en) * 2019-12-19 2020-04-10 北京锐安科技有限公司 Data probing method, device, equipment and storage medium
CN111563071A (en) * 2020-04-03 2020-08-21 深圳价值在线信息科技股份有限公司 Data cleaning method and device, terminal equipment and computer readable storage medium
WO2020211299A1 (en) * 2019-04-17 2020-10-22 苏宁云计算有限公司 Data cleansing method
CN111859814A (en) * 2020-07-30 2020-10-30 中国电建集团昆明勘测设计研究院有限公司 Rock aging deformation prediction method and system based on LSTM deep learning
CN111966735A (en) * 2020-07-22 2020-11-20 山东高速信息工程有限公司 NIFI-based micro-service data interaction method and system
CN112287562A (en) * 2020-11-18 2021-01-29 国网新疆电力有限公司经济技术研究院 Power equipment retired data completion method and system
CN113268476A (en) * 2021-06-07 2021-08-17 一汽解放汽车有限公司 Data cleaning method and device applied to Internet of vehicles and computer equipment
CN113568811A (en) * 2021-07-28 2021-10-29 中国南方电网有限责任公司 Distributed safety monitoring data processing method
CN114385606A (en) * 2021-12-09 2022-04-22 湖北省信产通信服务有限公司数字科技分公司 Big data cleaning method and system, storage medium and electronic equipment
CN114549052A (en) * 2022-01-20 2022-05-27 深圳市宝视佳科技有限公司 Data-based accurate marketing method, device, equipment and storage medium
CN115809406A (en) * 2023-02-03 2023-03-17 佰聆数据股份有限公司 Power consumer fine-grained classification method, device, equipment and storage medium
CN116186698A (en) * 2022-12-16 2023-05-30 广东技术师范大学 Machine learning-based secure data processing method, medium and equipment
CN116303377A (en) * 2022-11-23 2023-06-23 南京视察者智能科技有限公司 Government affair data cleaning and filtering method
CN117540151A (en) * 2023-12-08 2024-02-09 深圳市亲邻科技有限公司 Data preprocessing method of data pushing system

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535697B (en) * 2021-07-07 2024-05-24 广州三叠纪元智能科技有限公司 Climbing frame data cleaning method, climbing frame control device and storage medium
CN114356902A (en) * 2021-12-14 2022-04-15 中核武汉核电运行技术股份有限公司 Industrial data quality management method and device
CN115794795B (en) * 2022-12-08 2023-09-22 湖北华中电力科技开发有限责任公司 Power distribution station electricity consumption data standardization cleaning method, device, system and storage medium
CN116961729B (en) * 2023-08-09 2024-10-11 深圳市恩斯仪器设备有限公司 Beidou-based voice communication method, system, equipment and medium
CN117290315B (en) * 2023-10-11 2024-06-25 河南师范大学 Data classification cleaning method
CN118467524A (en) * 2024-07-11 2024-08-09 浪潮智慧城市科技有限公司 Data cleaning method and device for supplementing data missing values

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160179599A1 (en) * 2012-10-11 2016-06-23 University Of Southern California Data processing framework for data cleansing
CN105989163A (en) * 2015-03-04 2016-10-05 中国移动通信集团福建有限公司 Data real-time processing method and system
CN107025301A (en) * 2017-04-25 2017-08-08 西安理工大学 Flight ensures the method for cleaning of data
CN108596386A (en) * 2018-04-20 2018-09-28 上海市司法局 A kind of prediction convict repeats the method and system of crime probability
CN109063964A (en) * 2018-07-02 2018-12-21 浙江百先得服饰有限公司 A kind of platform data processing system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294745A (en) * 2016-08-10 2017-01-04 东方网力科技股份有限公司 Big data cleaning method and device
CN109255523B (en) * 2018-08-16 2021-07-20 北京奥技异科技发展有限公司 Analytical index computing platform based on KKS coding rule and big data architecture
CN109492002B (en) * 2018-10-19 2021-03-23 浙江大学华南工业技术研究院 Smart power grid big data storage and analysis system and processing method
CN110162519A (en) * 2019-04-17 2019-08-23 苏宁易购集团股份有限公司 Data clearing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pan Tenghui et al.: "Data quality control design for database cleaning" (面向数据库清洗的数据质量控制设计), Information Technology (《信息技术》) *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020211299A1 (en) * 2019-04-17 2020-10-22 苏宁云计算有限公司 Data cleansing method
CN110716928A (en) * 2019-09-09 2020-01-21 上海凯京信达科技集团有限公司 Data processing method, device, equipment and storage medium
CN110704410A (en) * 2019-09-27 2020-01-17 中冶赛迪重庆信息技术有限公司 Data cleaning method, system and equipment
CN110781176A (en) * 2019-11-06 2020-02-11 国网山东省电力公司威海供电公司 Power grid data quality improvement method based on data correlation
CN110990447A (en) * 2019-12-19 2020-04-10 北京锐安科技有限公司 Data probing method, device, equipment and storage medium
CN110990447B (en) * 2019-12-19 2023-09-15 北京锐安科技有限公司 Data exploration method, device, equipment and storage medium
CN111563071A (en) * 2020-04-03 2020-08-21 深圳价值在线信息科技股份有限公司 Data cleaning method and device, terminal equipment and computer readable storage medium
CN111966735A (en) * 2020-07-22 2020-11-20 山东高速信息工程有限公司 NIFI-based micro-service data interaction method and system
CN111859814A (en) * 2020-07-30 2020-10-30 中国电建集团昆明勘测设计研究院有限公司 Rock aging deformation prediction method and system based on LSTM deep learning
CN111859814B (en) * 2020-07-30 2023-07-28 中国电建集团昆明勘测设计研究院有限公司 Rock aging deformation prediction method and system based on LSTM deep learning
CN112287562A (en) * 2020-11-18 2021-01-29 国网新疆电力有限公司经济技术研究院 Power equipment retired data completion method and system
CN112287562B (en) * 2020-11-18 2023-03-10 国网新疆电力有限公司经济技术研究院 Power equipment retired data completion method and system
CN113268476A (en) * 2021-06-07 2021-08-17 一汽解放汽车有限公司 Data cleaning method and device applied to Internet of vehicles and computer equipment
CN113568811A (en) * 2021-07-28 2021-10-29 中国南方电网有限责任公司 Distributed safety monitoring data processing method
CN114385606A (en) * 2021-12-09 2022-04-22 湖北省信产通信服务有限公司数字科技分公司 Big data cleaning method and system, storage medium and electronic equipment
CN114549052A (en) * 2022-01-20 2022-05-27 深圳市宝视佳科技有限公司 Data-based accurate marketing method, device, equipment and storage medium
CN116303377A (en) * 2022-11-23 2023-06-23 南京视察者智能科技有限公司 Government affair data cleaning and filtering method
CN116186698A (en) * 2022-12-16 2023-05-30 广东技术师范大学 Machine learning-based secure data processing method, medium and equipment
CN116186698B (en) * 2022-12-16 2024-08-16 田帅领 Machine learning-based secure data processing method, medium and equipment
CN115809406A (en) * 2023-02-03 2023-03-17 佰聆数据股份有限公司 Power consumer fine-grained classification method, device, equipment and storage medium
CN117540151A (en) * 2023-12-08 2024-02-09 深圳市亲邻科技有限公司 Data preprocessing method of data pushing system

Also Published As

Publication number Publication date
CA3177209A1 (en) 2020-10-22
WO2020211299A1 (en) 2020-10-22

Similar Documents

Publication Publication Date Title
CN110162519A (en) Data clearing method
CN108509485B (en) Data preprocessing method and device, computer equipment and storage medium
JP6771751B2 (en) Risk assessment method and system
US20200192894A1 (en) System and method for using data incident based modeling and prediction
CN108876133A (en) Risk assessment processing method, device, server and medium based on business information
US20120131438A1 (en) Method and System of Web Page Content Filtering
CN104077407B (en) A kind of intelligent data search system and method
CN101493913A (en) Method and system for assessing user credit in internet
CN110428322A (en) A kind of adaptation method and device of business datum
CN106408184A (en) User credit evaluation model based on multi-source heterogeneous data
CN107040397A (en) A kind of service parameter acquisition methods and device
CN104424231A (en) Multi-dimensional data processing method and device
CN109284369B (en) Method, system, device and medium for judging importance of securities news information
CN112991079B (en) Multi-card co-occurrence medical treatment fraud detection method, system, cloud end and medium
CN109584037A (en) Calculation method, device and the computer equipment that user credit of providing a loan scores
CN113537960B (en) Determination method, device and equipment for abnormal resource transfer link
CN110728301A (en) Credit scoring method, device, terminal and storage medium for individual user
Mahmood et al. Adaptive automated teller machines
CN107832333A (en) Method and system based on distributed treatment and DPI data structure user network data fingerprint
CN112347245A (en) Viewpoint mining method and device for investment and financing field mechanism and electronic equipment
CN112231420A (en) Data analysis method, data analysis device, electronic device, and storage medium
CN206497498U (en) A kind of integrated system of credit rating information data based on enterprise's reference business
CN111858278A (en) Log analysis method and system based on big data processing and readable storage device
CN109919667A (en) A kind of method and apparatus of the IP of enterprise for identification
CN112801784A (en) Bit currency address mining method and device for digital currency exchange

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210730

Address after: Room 834, Yingying building, No.99, Tuanjie Road, yanchuangyuan, Jiangbei new district, Nanjing, Jiangsu Province

Applicant after: Nanjing Xingyun Digital Technology Co.,Ltd.

Address before: 210000 No. 1 Suning Avenue, Xuanwu District, Nanjing City, Jiangsu Province

Applicant before: SUNING.COM Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190823