CN110162519A - Data clearing method - Google Patents
- Publication number: CN110162519A
- Application number: CN201910308949.0A
- Authority: CN (China)
- Prior art keywords: data, field, cleaning, miss rate, preset
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24568—Data stream processing; Continuous queries
Abstract
This application relates to a data cleaning method. The method includes: obtaining data from a first data source and establishing an independent data stream from the obtained data; filtering the data in the data stream to obtain data to be cleaned; deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data; detecting whether the preliminarily cleaned data satisfies preset decision rules and deleting the data that does not, to obtain final cleaned data; and outputting the final cleaned data to a second data source. The method can improve data security.
Description
Technical field
This application relates to the field of big data processing technology, and in particular to a data cleaning method.
Background
With the arrival of the network era, massive amounts of information continuously pour into the network, and the volume of data is growing at a rate of 50% per year. Supported by such huge data sources, business decisions are increasingly based on data analysis rather than, as was traditional, on experience and intuition alone. Data cleaning is an indispensable link in the whole data analysis process, and the quality of its result directly affects model performance and the final analysis conclusions. Data cleaning refers to the process of re-examining and verifying data, with the aim of deleting duplicate information, correcting existing errors, and ensuring data consistency. In practice, data cleaning usually takes up 50%-80% of the time of the data analysis process.
Data cleaning falls into two classes: offline data cleaning and real-time data cleaning. Offline cleaning can sacrifice performance to clean the data at a finer granularity through complex processing, including missing value handling, outlier handling, duplicate value handling, null value filling, unit unification, and deciding whether to standardize, whether to delete unnecessary variables, and whether to sort. Compared with offline cleaning, real-time cleaning, because of its real-time requirements, leans toward missing value filling, filtering, and validity checking of the data. However, existing data cleaning processes are usually integrated with the data analysis process: the two are tightly coupled, the cleaning process is strongly affected by the rest of the data analysis code, data loss occurs easily, and data security is poor.
Summary of the invention
Based on this, in view of the above technical problems, it is necessary to provide a data cleaning method that can improve data security.
A data cleaning method, the method comprising:
obtaining data from a first data source and establishing an independent data stream from the obtained data;
filtering the data in the data stream to obtain data to be cleaned;
deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data;
detecting whether the preliminarily cleaned data satisfies preset decision rules and deleting the data that does not, to obtain final cleaned data;
outputting the final cleaned data to a second data source.
In one embodiment, deleting or filling the fields in the data to be cleaned that contain missing values includes:
calculating the miss rate of a field as the ratio of the number of records in which the field is missing to the total number of records;
determining the attribute importance of the field according to the indicators to be analyzed;
deleting or filling the field containing missing values according to its miss rate and attribute importance.
In one embodiment, deleting or filling the field containing missing values according to its miss rate and attribute importance includes:
when the miss rate of the field is lower than a preset miss rate threshold and the attribute importance is lower than a preset importance rating threshold, filling the field;
when the miss rate of the field is not lower than the preset miss rate threshold and the attribute importance is lower than the preset importance rating threshold, deleting the field;
when the miss rate of the field is not lower than the preset miss rate threshold and the attribute importance is higher than the preset importance rating threshold, completing the missing values of the field.
In one embodiment, the method further includes:
detecting the metadata that describes the data attributes of the data in the first data source, analyzing the metadata to identify quality problems in the data, and setting filtering rules according to the quality problems;
filtering the data in the data stream to obtain data to be cleaned then includes: filtering the data in the data stream according to the filtering rules to obtain the data to be cleaned.
In one embodiment, filtering the data in the data stream includes:
row-level filtering, which removes unneeded rows from the data;
column-level filtering, which, when a row has multiple columns, selects and retains only the fields corresponding to the needed columns.
In one embodiment, the preset decision rules include legitimacy rules and logic rules, and detecting whether the preliminarily cleaned data satisfies the preset decision rules includes:
if the preliminarily cleaned data does not satisfy the legitimacy rules, setting it to the maximum value that satisfies the legitimacy rules, or deleting it;
if the preliminarily cleaned data does not satisfy the logic rules, deleting it and generating a warning indication.
In one embodiment, the first data source and the second data source are different data categories of the same distributed messaging system. Further, the distributed messaging system is Kafka, and the first data source and the second data source are two different Kafka topics; the data stream is a data stream based on Spark Streaming.
A data cleaning device, the device comprising:
a data acquisition module, for obtaining data from a first data source and establishing an independent data stream from the obtained data;
a data filtering module, for filtering the data in the data stream to obtain data to be cleaned;
a preliminary cleaning module, for deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data;
a final cleaning module, for detecting whether the preliminarily cleaned data satisfies preset decision rules and deleting the data that does not, to obtain final cleaned data;
a data output module, for outputting the final cleaned data to a second data source.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
obtaining data from a first data source and establishing an independent data stream from the obtained data;
filtering the data in the data stream to obtain data to be cleaned;
deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data;
detecting whether the preliminarily cleaned data satisfies preset decision rules and deleting the data that does not, to obtain final cleaned data;
outputting the final cleaned data to a second data source.
A computer-readable storage medium on which a computer program is stored, wherein the computer program implements the following steps when executed by a processor:
obtaining data from a first data source and establishing an independent data stream from the obtained data;
filtering the data in the data stream to obtain data to be cleaned;
deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data;
detecting whether the preliminarily cleaned data satisfies preset decision rules and deleting the data that does not, to obtain final cleaned data;
outputting the final cleaned data to a second data source.
Compared with the prior art, the beneficial effects of the invention are as follows:
In the data cleaning method, device, computer device, and storage medium, data cleaning is performed through an independent data stream: the data obtained from the first data source is cleaned and then placed in another data source for downstream business processing. This separates the data cleaning process from the data analysis code, reduces the coupling between the two, and effectively improves data security.
Further, the invention places data filtering first in the cleaning process, which reduces the amount of data that the later steps must clean and greatly improves cleaning efficiency.
Brief description of the drawings
Fig. 1 is a flow diagram of the data cleaning method in one embodiment;
Fig. 2 is a structural block diagram of the data cleaning device in one embodiment.
Detailed description
To make the objectives, technical solutions, and advantages of the application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the application and do not limit it.
In one embodiment, as shown in Fig. 1, the application provides a data cleaning method comprising the following steps:
Step 101: obtain data from the first data source and establish an independent data stream from the obtained data.
Here, the first data source is the source from which the data is obtained; a data stream is an ordered sequence of bytes with a beginning and an end.
Specifically, the invention performs data cleaning through an independent data stream, which separates the data cleaning process from the data analysis code and reduces the coupling between the two.
Step 102: filter the data in the data stream to obtain the data to be cleaned.
Specifically, placing data filtering first in the cleaning process effectively reduces the amount of data that the later steps must clean and greatly improves cleaning efficiency.
Step 103: delete or fill the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data.
A missing value means that information is absent from the data, i.e. the value of one or more attributes of a record is incomplete.
Step 104: detect whether the preliminarily cleaned data satisfies the preset decision rules and delete the data that does not, to obtain final cleaned data.
Step 105: output the final cleaned data to the second data source.
The second data source is a data source different from the first data source; it stores the data to be used or processed by downstream business.
Specifically, the data cleaning process of the invention is independent of the other processing in data analysis and is not affected by other code, so data security is higher.
In the above data cleaning method, data cleaning is performed through an independent data stream: the data obtained from the first data source is cleaned and then placed in another data source for downstream business processing. This separates the data cleaning process from the data analysis code, reduces the coupling between the two, and effectively improves data security.
In a specific embodiment, the first data source and the second data source are different data categories of the same distributed messaging system. For example, the distributed messaging system is Kafka, and the first data source and the second data source are two different Kafka topics; the data stream is a data stream based on Spark Streaming.
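The five steps can be sketched as follows. This is a minimal illustration, assuming plain Python iterables stand in for the two Kafka topics and for the Spark Streaming data stream. The names `clean_stream`, `raw_topic`, `clean_topic`, and the three callbacks are hypothetical and do not come from the application, and the sample rules (default age 30, age range 0 to 120) are illustrative only.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Optional

Record = dict


def clean_stream(
    source: Iterable[Record],
    keep: Callable[[Record], bool],
    handle_missing: Callable[[Record], Optional[Record]],
    is_valid: Callable[[Record], bool],
) -> Iterator[Record]:
    """Apply steps 102-104 lazily to records read from the first data source."""
    for record in source:                # step 101: stream over the source
        if not keep(record):             # step 102: filtering comes first
            continue
        record = handle_missing(record)  # step 103: fill or drop missing fields
        if record is None:               # callback may discard the record
            continue
        if is_valid(record):             # step 104: preset decision rules
            yield record


# Hypothetical in-memory stand-ins for the two topics.
raw_topic = deque([
    {"cid": "web", "age": 31},
    {"cid": None, "age": 40},     # no channel: removed in step 102
    {"cid": "app", "age": None},  # missing age: filled in step 103
    {"cid": "app", "age": 200},   # implausible age: removed in step 104
])
clean_topic = deque()


def fill_age(record: Record) -> Record:
    # Illustrative fill value only; a real job would derive it from the data.
    return {**record, "age": 30} if record["age"] is None else record


for cleaned in clean_stream(
    raw_topic,
    keep=lambda r: r["cid"] is not None,
    handle_missing=fill_age,
    is_valid=lambda r: 0 <= r["age"] <= 120,
):
    clean_topic.append(cleaned)          # step 105: write to the second source
```

Because the cleaning function only sees an input iterable and yields output records, downstream analysis code reads from `clean_topic` and never touches the cleaning logic, which is the decoupling the method aims for.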
In one embodiment, deleting or filling the fields in the data to be cleaned that contain missing values includes:
calculating the miss rate of a field as the ratio of the number of records in which the field is missing to the total number of records;
determining the attribute importance of the field according to the indicators to be analyzed;
deleting or filling the field containing missing values according to its miss rate and attribute importance.
The miss rate of a field is the ratio of the number of records in which the field is missing to the total number of records.
For example, if the wage field has 100 records in total and 20 of them are missing, the miss rate is 20%.
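The miss rate calculation can be sketched in a few lines. The function name `miss_rate` and the convention that a missing value is `None` or an absent key are assumptions made for this illustration.

```python
def miss_rate(records, field):
    """Share of records whose value for `field` is missing (None or absent)."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)


# The wage example from the text: 100 records, 20 of them missing.
records = [{"wage": 5000}] * 80 + [{"wage": None}] * 20
rate = miss_rate(records, "wage")
```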
The attribute importance of a field is judged by the indicators to be analyzed. For example, if users need to be profiled or tagged to supply data for later precision marketing, user attribute information must be collected, so attributes such as the user's age and gender are important fields.
In one embodiment, deleting or filling the field containing missing values according to its miss rate and attribute importance includes:
when the miss rate of the field is lower than the preset miss rate threshold and the attribute importance is lower than the preset importance rating threshold, filling the field.
Specifically, if the field holds numeric data, it is filled according to the data distribution: if the distribution is uniform, the field is filled with the mean; if the distribution is skewed, it is filled with the median.
when the miss rate of the field is not lower than the preset miss rate threshold and the attribute importance is lower than the preset importance rating threshold, deleting the field;
when the miss rate of the field is not lower than the preset miss rate threshold and the attribute importance is higher than the preset importance rating threshold, completing the missing values of the field.
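The three cases and the numeric fill rule can be sketched as below. The function names, the numeric importance scale, and the `skewed` flag supplied by the caller are assumptions for illustration; the text does not say how skew is detected, and it does not cover a field with a low miss rate and high importance, so that combination is reported as unspecified rather than guessed.

```python
from statistics import mean, median


def choose_action(rate, importance, rate_threshold, importance_threshold):
    """Map a field's miss rate and importance to one of the three cases."""
    low_rate = rate < rate_threshold
    if importance < importance_threshold:
        return "fill" if low_rate else "delete"
    if importance > importance_threshold and not low_rate:
        return "complete"
    return "unspecified"  # combination the text does not cover


def fill_numeric(values, skewed):
    """Fill None entries with the median (skewed data) or the mean."""
    present = [v for v in values if v is not None]
    fill = median(present) if skewed else mean(present)
    return [fill if v is None else v for v in values]
```

Note that `fill_numeric` raises `statistics.StatisticsError` when every value is missing, which matches the intent that such a field should be deleted or completed rather than filled.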
Specifically, completing the missing values of a field includes:
computing the value from other information, for example deriving gender, native place, date of birth, and age from the ID card number;
completing from neighboring data, for example when a time series lacks a data point, the mean of the preceding and following values can be used as the completion value, and when many values are missing, values obtained by smoothing can be used as completion values;
if a value cannot be completed, the record must be set aside but not deleted, because it will be used later.
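The two completion strategies can be sketched as follows. The function names are hypothetical. The first function assumes an 18-digit resident ID number in which characters 7 to 14 hold the birth date and an odd 17th digit denotes male; it does not verify the check character, and the sample number used with it should be treated as made up. The second function handles only isolated gaps; smoothing for longer gaps is not shown.

```python
from datetime import date


def complete_from_id(id_number):
    """Derive birth date and gender from an 18-character resident ID number."""
    if len(id_number) != 18 or not id_number[:17].isdigit():
        return None  # cannot be completed; the caller sets the record aside
    try:
        born = date(int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14]))
    except ValueError:  # digits do not form a real calendar date
        return None
    gender = "male" if int(id_number[16]) % 2 == 1 else "female"
    return {"date_of_birth": born.isoformat(), "gender": gender}


def complete_from_neighbors(series):
    """Replace an isolated None with the mean of the values before and after it."""
    out = list(series)
    for i in range(1, len(out) - 1):
        before, after = series[i - 1], series[i + 1]
        if out[i] is None and before is not None and after is not None:
            out[i] = (before + after) / 2
    return out
```

Returning `None` from `complete_from_id` models the last rule in the list: the caller can move such records to a holding area instead of deleting them.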
In a specific embodiment, the miss rate threshold may be any value between 90% and 95%.
In one embodiment, before the data in the data stream is filtered, the metadata describing the data attributes of the data in the first data source is first detected; the metadata is then analyzed to identify quality problems in the data, and filtering rules are set according to those quality problems. Step 102 then filters the data in the data stream according to the filtering rules to obtain the data to be cleaned.
Metadata, also called intermediary data or relay data, is data that describes data. It mainly describes information about data attributes and supports functions such as indicating storage location, historical data, resource lookup, and file records.
Specifically, encapsulating the data attributes to be processed as metadata makes the program more extensible. Formulating filtering rules that target the quality problems of the data also helps improve the efficiency of data filtering.
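One way to read this step is sketched below. The application does not specify the form of the metadata or how quality problems map to filtering rules, so the `METADATA` layout, the two problem labels, and both function names are assumptions made purely for illustration.

```python
# Hypothetical metadata: field name -> expected type and whether it is required.
METADATA = {
    "cid": {"type": str, "required": True},
    "uid": {"type": str, "required": True},
    "duration": {"type": (int, float), "required": False},
}


def find_quality_problems(sample, metadata):
    """Return the set of (field, problem) pairs observed in a sample of records."""
    problems = set()
    for record in sample:
        for field, spec in metadata.items():
            value = record.get(field)
            if value is None:
                if spec["required"]:
                    problems.add((field, "missing_required"))
            elif not isinstance(value, spec["type"]):
                problems.add((field, "wrong_type"))
    return problems


def build_filter(problems, metadata):
    """Build a row predicate that rejects only the problems actually observed."""
    def keep(record):
        for field, problem in problems:
            value = record.get(field)
            if problem == "missing_required" and value is None:
                return False
            if (problem == "wrong_type" and value is not None
                    and not isinstance(value, metadata[field]["type"])):
                return False
        return True
    return keep
```

Keeping the field descriptions in one metadata table means a new field or a changed type only requires editing the table, not the filtering code.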
In one embodiment, filtering the data in the data stream includes:
row-level filtering, which removes unneeded rows from the data;
column-level filtering, which, when a row has multiple columns, selects and retains only the fields corresponding to the needed columns.
Specifically, combining row-level filtering with column-level filtering effectively speeds up data filtering.
For example, consider computing pv/uv per channel:
The log data contains nearly 200 fields, such as the visitor's IP address, browser information, client device information, the specific access time, the specific page accessed, the upper-level page visited, and the access duration, and the requirement of this embodiment is to count the clicks of each channel and the number of visits by unique IPs.
Row-level filtering: retain only the log data related to channels, filtering out the log data that contains no channel;
Column-level filtering: from the nearly 200 fields of the channel-related log data, select cid (channel name), uid (device identifier), and the ip address, and filter out the unneeded fields; the pv/uv of each channel can then be computed;
pv is short for page view, i.e. the page view count: each visit by a user to any page of the website is recorded once, and a user's repeated visits to the same page are accumulated in the pv total;
uv is short for unique visitor, i.e. a natural person who accesses and browses the page over the Internet.
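The channel example can be sketched as follows. The field names `cid`, `uid`, and `ip` come from the text; the function names and the sample log rows are hypothetical, and uv is counted per distinct IP because the stated requirement is visits by unique IPs.

```python
from collections import defaultdict

WANTED = ("cid", "uid", "ip")  # columns kept by column-level filtering


def filter_logs(logs):
    """Row-level then column-level filtering of raw log rows."""
    for row in logs:
        if not row.get("cid"):                  # row-level: no channel, drop row
            continue
        yield {k: row.get(k) for k in WANTED}   # column-level: keep 3 fields


def pv_uv_by_channel(rows):
    """Return {channel: (pv, uv)}, counting uv as distinct IP addresses."""
    pv = defaultdict(int)
    visitors = defaultdict(set)
    for row in rows:
        pv[row["cid"]] += 1
        visitors[row["cid"]].add(row["ip"])
    return {cid: (pv[cid], len(visitors[cid])) for cid in pv}


logs = [
    {"cid": "web", "uid": "d1", "ip": "10.0.0.1", "browser": "x"},
    {"cid": "web", "uid": "d1", "ip": "10.0.0.1", "browser": "x"},
    {"cid": "web", "uid": "d2", "ip": "10.0.0.2", "browser": "y"},
    {"cid": "app", "uid": "d3", "ip": "10.0.0.3", "browser": "z"},
    {"uid": "d4", "ip": "10.0.0.4", "browser": "z"},  # no channel
]
stats = pv_uv_by_channel(filter_logs(logs))
```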
In this embodiment, with extensibility in mind, data such as the access time of each ip address can also be recorded, since later data processing may need to compute, for example, the user retention rate.
The user retention rate is the ratio of old users to total users.
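As a small illustration of this ratio, the sketch below treats an old user as a visitor identifier that already appeared in an earlier period. That interpretation, the use of sets of identifiers, and the function name are assumptions; the text gives only the ratio itself.

```python
def retention_rate(previous_visitors, current_visitors):
    """Share of current visitors that were already seen in the earlier period."""
    current = set(current_visitors)
    if not current:
        return 0.0
    returning = current & set(previous_visitors)
    return len(returning) / len(current)
```

For example, with earlier visitors `{"a", "b", "c"}` and current visitors `{"b", "c", "d", "e"}`, two of the four current visitors are old users.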
In one embodiment, the preset decision rules include legitimacy rules and logic rules, and detecting whether the preliminarily cleaned data satisfies the preset decision rules includes:
if the preliminarily cleaned data does not satisfy the legitimacy rules, setting it to the maximum value that satisfies the legitimacy rules, or deleting it.
Legitimacy rules are format requirements on values, dates, and the like, together with rules on field contents.
Specifically, a field type legitimacy rule: the date field format is "YYYY-MM-DD".
Field content legitimacy rules: gender is male, female, or unknown; the date of birth is earlier than or equal to today.
if the preliminarily cleaned data does not satisfy the logic rules, deleting it and generating a warning indication.
Logic rules judge whether the data is consistent with common sense. For example, a person's age is generally between 0 and 120, so an age of 200 indicates abnormal data.
After cleaning by the legitimacy rules and logic rules, the data that does not meet the format requirements or the logic rules has been removed, leaving valid final cleaned data.
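The example rules above can be sketched as follows. The field names `gender`, `date_of_birth`, and `age`, the function names, and the logger name are hypothetical. This sketch always deletes records that fail a legitimacy rule; the alternative of setting the value to the maximum legitimate value is not shown.

```python
import logging
import re
from datetime import date, datetime

log = logging.getLogger("cleaning")
DATE_FORMAT = re.compile(r"\d{4}-\d{2}-\d{2}")  # strict YYYY-MM-DD shape
MAX_AGE = 120


def is_legitimate(record, today):
    """Legitimacy rules: allowed gender values, date format, no future births."""
    if record.get("gender") not in ("male", "female", "unknown"):
        return False
    born = record.get("date_of_birth")
    if not isinstance(born, str) or not DATE_FORMAT.fullmatch(born):
        return False
    try:
        return datetime.strptime(born, "%Y-%m-%d").date() <= today
    except ValueError:  # right shape but not a real date, e.g. "1990-02-30"
        return False


def is_logical(record):
    """Logic rule: a person's age lies between 0 and MAX_AGE."""
    age = record.get("age")
    return isinstance(age, (int, float)) and 0 <= age <= MAX_AGE


def final_clean(records, today):
    cleaned = []
    for record in records:
        if not is_legitimate(record, today):
            continue  # delete data that fails a legitimacy rule
        if not is_logical(record):
            log.warning("logic rule violated, record deleted: %r", record)
            continue
        cleaned.append(record)
    return cleaned


sample = [
    {"gender": "female", "date_of_birth": "1990-03-07", "age": 33},
    {"gender": "female", "date_of_birth": "07/03/1990", "age": 33},  # bad format
    {"gender": "male", "date_of_birth": "2030-01-01", "age": 1},     # future date
    {"gender": "male", "date_of_birth": "1820-01-01", "age": 200},   # age 200
]
result = final_clean(sample, today=date(2024, 1, 1))
```

Passing `today` explicitly keeps the check deterministic, and the warning log stands in for the warning indication that the logic rule requires.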
It should be understood that although the steps in the flowchart of Fig. 1 are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in Fig. 1 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially; they may be executed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in Fig. 2, a data cleaning device is provided, comprising a data acquisition module, a data filtering module, a preliminary cleaning module, a final cleaning module, and a data output module, wherein:
the data acquisition module is used to obtain data from the first data source and establish an independent data stream from the obtained data;
the data filtering module is used to filter the data in the data stream to obtain data to be cleaned;
the preliminary cleaning module is used to delete or fill the fields in the data to be cleaned that contain missing values, to obtain preliminarily cleaned data;
the final cleaning module is used to detect whether the preliminarily cleaned data satisfies the preset decision rules and delete the data that does not, to obtain final cleaned data;
the data output module is used to output the final cleaned data to the second data source.
In a specific implementation, the first data source and the second data source are different data categories of the same distributed messaging system.
In one embodiment, the preliminary cleaning module includes a miss rate submodule, an importance submodule, and a missing value processing submodule, wherein:
the miss rate submodule is used to calculate the miss rate of a field as the ratio of the number of records in which the field is missing to the total number of records;
the importance submodule is used to determine the attribute importance of the field according to the indicators to be analyzed;
the missing value processing submodule is used to delete or fill the field containing missing values according to its miss rate and attribute importance.
Further, the missing value processing submodule includes a comparison unit and a primary processing unit, wherein:
the comparison unit is used to compare the miss rate and the attribute importance of the field with the preset miss rate threshold and the preset importance rating threshold, respectively; the primary processing unit is used to fill, delete, or complete the field.
When the miss rate of the field is lower than the preset miss rate threshold and the attribute importance is lower than the preset importance rating threshold, the field is filled;
when the miss rate of the field is not lower than the preset miss rate threshold and the attribute importance is lower than the preset importance rating threshold, the field is deleted;
when the miss rate of the field is not lower than the preset miss rate threshold and the attribute importance is higher than the preset importance rating threshold, the missing values of the field are completed.
In one embodiment, the data cleaning device further includes a data exploration module. Before the data in the data stream is filtered, the data exploration module first detects the metadata describing the data attributes of the data in the first data source, then analyzes the metadata to identify quality problems in the data, and sets filtering rules according to the quality problems.
In one embodiment, the data filtering module includes a row-level filtering unit and a column-level filtering unit, wherein: the row-level filtering unit is used to remove unneeded rows from the data; the column-level filtering unit is used, when a row has multiple columns, to select and retain only the fields corresponding to the needed columns.
In one embodiment, the final cleaning module includes a legitimacy detection unit, a logic detection unit, and a final processing unit, wherein:
the legitimacy detection unit is used to detect whether the preliminarily cleaned data satisfies the preset legitimacy rules;
the logic detection unit is used to detect whether the preliminarily cleaned data satisfies the preset logic rules;
the final processing unit is used to set preliminarily cleaned data that does not satisfy the legitimacy rules to the maximum value that satisfies the legitimacy rules, or to delete it, and to delete preliminarily cleaned data that does not satisfy the logic rules and generate a warning indication.
For the specific limitations of the data cleaning device, refer to the limitations of the data cleaning method above, which are not repeated here. Each module of the data cleaning device may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them to perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected via a system bus. The processor provides computing and control capability. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface is used to communicate with an external terminal over a network connection. The computer program, when executed by the processor, implements a data cleaning method. The display screen may be a liquid crystal display or an electronic ink display; the input device may be a touch layer covering the display screen, a key, trackball, or touchpad on the housing of the computer device, or an external keyboard, touchpad, or mouse.
In one embodiment, a kind of computer equipment is provided, including memory, processor and storage are on a memory
And the computer program that can be run on a processor, processor perform the steps of when executing computer program from the first data
Data are obtained in source, establish an independent data flow using the data of acquisition;Processing is filtered to the data in data flow,
Obtain data to be cleaned;The field in data to be cleaned including missing values is deleted or filled, obtains tentatively cleaning data;
Whether the preliminary cleaning data of detection meet preset decision rule, delete the data for not meeting decision rule, are finally cleaned
Data;Final cleaning data are output to the second data source.
In one embodiment, the missing values according to field are also performed the steps of when processor executes computer program
Item number accounts for the ratio of total number, and the miss rate of field is calculated;The index analyzed as needed determines the Importance of attribute of field
Degree;According to the miss rate of field and Importance of attribute degree, the field comprising missing values is deleted or filled.
In one embodiment, it also performs the steps of when processor executes computer program when the miss rate of field is low
In preset miss rate threshold value and when Importance of attribute degree is lower than preset important rating threshold, field is filled;Work as word
The miss rate of section is not less than preset miss rate threshold value and when Importance of attribute degree is lower than preset important rating threshold, cancel (CANCL)
Section;When field miss rate is not less than preset miss rate threshold value and Importance of attribute degree is higher than preset important rating threshold
When, completion is carried out to the missing values of field.
In one embodiment, when executing the computer program, the processor further performs the following steps: detecting the metadata that describes the attributes of the data in the first data source, analyzing the metadata to identify quality problems in the data, setting a filtering rule according to the quality problems, and filtering the data in the data stream according to the filtering rule to obtain the data to be cleaned.
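One possible reading of this metadata-driven step is sketched below. It assumes, purely for illustration, that the metadata is a mapping from field name to expected Python type and that a "quality problem" is a field whose sampled values contradict that type; the application does not define the metadata format, and the names `find_quality_problems` and `build_filter` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Set


def find_quality_problems(metadata: Dict[str, type], sample: Iterable[dict]) -> Set[str]:
    """Return the fields whose sampled values contradict the declared type."""
    problems = set()
    for record in sample:
        for field, expected in metadata.items():
            value = record.get(field)
            if value is not None and not isinstance(value, expected):
                problems.add(field)
    return problems


def build_filter(metadata: Dict[str, type], problems: Set[str]) -> Callable[[dict], bool]:
    """Build a filtering rule that rejects records with a badly typed problem field."""
    def keep(record: dict) -> bool:
        return all(
            record.get(f) is None or isinstance(record[f], metadata[f])
            for f in problems
        )
    return keep


if __name__ == "__main__":
    metadata = {"age": int, "name": str}
    sample = [{"age": "x", "name": "a"}, {"age": 3, "name": "b"}]
    keep = build_filter(metadata, find_quality_problems(metadata, sample))
    print([r for r in sample if keep(r)])
```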
In one embodiment, when executing the computer program, the processor further performs the following steps: row-level filtering, in which unneeded rows are removed from the data; and column-level filtering, in which, when a row has multiple columns, only the fields corresponding to the needed columns are selected and retained.
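Row-level and column-level filtering can be illustrated with two short helpers. The function names, the example rows, and the predicate are assumptions made for this sketch only.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def filter_rows(rows: List[Row], is_needed: Callable[[Row], bool]) -> List[Row]:
    """Row-level filtering: drop the rows that are not needed."""
    return [row for row in rows if is_needed(row)]


def select_columns(rows: List[Row], needed: Sequence[str]) -> List[Row]:
    """Column-level filtering: keep only the fields of the needed columns."""
    return [{col: row[col] for col in needed if col in row} for row in rows]


if __name__ == "__main__":
    rows = [
        {"id": 1, "type": "order", "amount": 9.5, "debug": "x"},
        {"id": 2, "type": "heartbeat", "amount": None, "debug": "y"},
    ]
    orders = filter_rows(rows, lambda r: r["type"] == "order")
    print(select_columns(orders, ["id", "amount"]))
```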
The preset decision rule includes a legitimacy rule and a logic rule. In one embodiment, when executing the computer program, the processor further performs the following steps: if the preliminary cleaned data does not meet the legitimacy rule, setting the preliminary cleaned data to the maximum value that satisfies the legitimacy rule, or deleting it; and if the preliminary cleaned data does not meet the logic rule, deleting the preliminary cleaned data and generating a warning indication.
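A minimal sketch of the two decision rules follows. It assumes the legitimacy rule is a numeric range `[lo, hi]` on one field and the logic rule is an arbitrary predicate over the record; these choices, the function name `apply_rules`, and the use of Python's `logging` module for the warning are illustrative assumptions, not details given in the application.

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("cleaning")
Record = Dict[str, float]


def apply_rules(
    records: List[Record],
    field: str,
    lo: float,
    hi: float,
    logic_rule: Callable[[Record], bool],
    replace_with_max: bool = True,
) -> Tuple[List[Record], List[str]]:
    """Return the final cleaned records and the warnings that were generated."""
    kept, warnings = [], []
    for record in records:
        if not (lo <= record[field] <= hi):      # legitimacy rule violated
            if not replace_with_max:
                continue                         # option 1: delete the record
            record = {**record, field: hi}       # option 2: set to the maximum legal value
        if not logic_rule(record):               # logic rule violated
            message = f"logic rule violated, record deleted: {record}"
            logger.warning(message)
            warnings.append(message)
            continue
        kept.append(record)
    return kept, warnings


if __name__ == "__main__":
    data = [
        {"age": 30, "start": 1, "end": 2},
        {"age": 200, "start": 1, "end": 2},
        {"age": 40, "start": 5, "end": 2},
    ]
    final, warned = apply_rules(data, "age", 0, 120, lambda r: r["start"] <= r["end"])
    print(final)
    print(len(warned))
```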
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program performs the following steps: obtaining data from a first data source and establishing an independent data stream from the obtained data; filtering the data in the data stream to obtain data to be cleaned; deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminary cleaned data; detecting whether the preliminary cleaned data meets a preset decision rule and deleting the data that does not meet the decision rule, to obtain final cleaned data; and outputting the final cleaned data to a second data source.
In one embodiment, when executed by the processor, the computer program further performs the following steps: calculating the missing rate of a field as the ratio of the number of records in which the field's value is missing to the total number of records; determining the attribute importance of the field according to the indicators to be analyzed; and deleting or filling the field containing missing values according to its missing rate and attribute importance.
In one embodiment, when executed by the processor, the computer program further performs the following steps: when the missing rate of a field is lower than a preset missing-rate threshold and its attribute importance is lower than a preset importance threshold, filling the field; when the missing rate of the field is not lower than the preset missing-rate threshold and its attribute importance is lower than the preset importance threshold, deleting the field; and when the missing rate of the field is not lower than the preset missing-rate threshold and its attribute importance is higher than the preset importance threshold, completing the missing values of the field.
In one embodiment, when executed by the processor, the computer program further performs the following steps: detecting the metadata that describes the attributes of the data in the first data source, analyzing the metadata to identify quality problems in the data, setting a filtering rule according to the quality problems, and filtering the data in the data stream according to the filtering rule to obtain the data to be cleaned.
In one embodiment, when executed by the processor, the computer program further performs the following steps: row-level filtering, in which unneeded rows are removed from the data; and column-level filtering, in which, when a row has multiple columns, only the fields corresponding to the needed columns are selected and retained.
The preset decision rule includes a legitimacy rule and a logic rule. In one embodiment, when executed by the processor, the computer program further performs the following steps: if the preliminary cleaned data does not meet the legitimacy rule, setting the preliminary cleaned data to the maximum value that satisfies the legitimacy rule, or deleting it; and if the preliminary cleaned data does not meet the logic rule, deleting the preliminary cleaned data and generating a warning indication.
Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present application, and all of these fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.
Claims (10)
1. A data cleaning method, wherein the method comprises:
obtaining data from a first data source, and establishing an independent data stream from the obtained data;
filtering the data in the data stream to obtain data to be cleaned;
deleting or filling the fields in the data to be cleaned that contain missing values, to obtain preliminary cleaned data;
detecting whether the preliminary cleaned data meets a preset decision rule, and deleting the data that does not meet the decision rule, to obtain final cleaned data;
outputting the final cleaned data to a second data source.
2. The method according to claim 1, wherein deleting or filling the fields in the data to be cleaned that contain missing values comprises:
calculating the missing rate of a field as the ratio of the number of records in which the field's value is missing to the total number of records;
determining the attribute importance of the field according to the indicators to be analyzed;
deleting or filling the field containing missing values according to the missing rate and the attribute importance of the field.
3. The method according to claim 2, wherein deleting or filling the field containing missing values according to the missing rate and the attribute importance of the field comprises:
when the missing rate of the field is lower than a preset missing-rate threshold and the attribute importance is lower than a preset importance threshold, filling the field;
when the missing rate of the field is not lower than the preset missing-rate threshold and the attribute importance is lower than the preset importance threshold, deleting the field;
when the missing rate of the field is not lower than the preset missing-rate threshold and the attribute importance is higher than the preset importance threshold, completing the missing values of the field.
4. The method according to claim 1, wherein the method further comprises:
detecting the metadata that describes the attributes of the data in the first data source, analyzing the metadata to identify quality problems in the data, and setting a filtering rule according to the quality problems;
wherein filtering the data in the data stream to obtain the data to be cleaned comprises: filtering the data in the data stream according to the filtering rule to obtain the data to be cleaned.
5. The method according to claim 1, wherein filtering the data in the data stream comprises:
row-level filtering, in which unneeded rows are removed from the data;
column-level filtering, in which, when a row has multiple columns, only the fields corresponding to the needed columns are selected and retained.
6. The method according to claim 1, wherein the preset decision rule includes a legitimacy rule and a logic rule, and detecting whether the preliminary cleaned data meets the preset decision rule comprises:
if the preliminary cleaned data does not meet the legitimacy rule, setting the preliminary cleaned data to the maximum value that satisfies the legitimacy rule, or deleting it;
if the preliminary cleaned data does not meet the logic rule, deleting the preliminary cleaned data and generating a warning indication.
7. The method according to claim 1, wherein the first data source and the second data source are different data categories of the same distributed message system; further, the distributed message system is Kafka, and the first data source and the second data source are two different Topics of Kafka; the data stream is a data stream based on Spark Streaming.
8. A data cleaning device, wherein the device comprises:
a data acquisition module, configured to obtain data from a first data source and establish an independent data stream from the obtained data;
a data filtering module, configured to filter the data in the data stream to obtain data to be cleaned;
a preliminary cleaning module, configured to delete or fill the fields in the data to be cleaned that contain missing values, to obtain preliminary cleaned data;
a final cleaning module, configured to detect whether the preliminary cleaned data meets a preset decision rule and delete the data that does not meet the decision rule, to obtain final cleaned data;
a data output module, configured to output the final cleaned data to a second data source.
9. A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program implements the steps of the method according to any one of claims 1 to 7 when executed by a processor.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910308949.0A CN110162519A (en) | 2019-04-17 | 2019-04-17 | Data clearing method |
CA3177209A CA3177209A1 (en) | 2019-04-17 | 2019-09-29 | Data cleaning method |
PCT/CN2019/109121 WO2020211299A1 (en) | 2019-04-17 | 2019-09-29 | Data cleansing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910308949.0A CN110162519A (en) | 2019-04-17 | 2019-04-17 | Data clearing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110162519A true CN110162519A (en) | 2019-08-23 |
Family ID=67639550
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910308949.0A Pending CN110162519A (en) | 2019-04-17 | 2019-04-17 | Data clearing method |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN110162519A (en) |
CA (1) | CA3177209A1 (en) |
WO (1) | WO2020211299A1 (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110704410A (en) * | 2019-09-27 | 2020-01-17 | 中冶赛迪重庆信息技术有限公司 | Data cleaning method, system and equipment |
CN110716928A (en) * | 2019-09-09 | 2020-01-21 | 上海凯京信达科技集团有限公司 | Data processing method, device, equipment and storage medium |
CN110781176A (en) * | 2019-11-06 | 2020-02-11 | 国网山东省电力公司威海供电公司 | Power grid data quality improvement method based on data correlation |
CN110990447A (en) * | 2019-12-19 | 2020-04-10 | 北京锐安科技有限公司 | Data probing method, device, equipment and storage medium |
CN111563071A (en) * | 2020-04-03 | 2020-08-21 | 深圳价值在线信息科技股份有限公司 | Data cleaning method and device, terminal equipment and computer readable storage medium |
WO2020211299A1 (en) * | 2019-04-17 | 2020-10-22 | 苏宁云计算有限公司 | Data cleansing method |
CN111859814A (en) * | 2020-07-30 | 2020-10-30 | 中国电建集团昆明勘测设计研究院有限公司 | Rock aging deformation prediction method and system based on LSTM deep learning |
CN111966735A (en) * | 2020-07-22 | 2020-11-20 | 山东高速信息工程有限公司 | NIFI-based micro-service data interaction method and system |
CN112287562A (en) * | 2020-11-18 | 2021-01-29 | 国网新疆电力有限公司经济技术研究院 | Power equipment retired data completion method and system |
CN113268476A (en) * | 2021-06-07 | 2021-08-17 | 一汽解放汽车有限公司 | Data cleaning method and device applied to Internet of vehicles and computer equipment |
CN113568811A (en) * | 2021-07-28 | 2021-10-29 | 中国南方电网有限责任公司 | Distributed safety monitoring data processing method |
CN114385606A (en) * | 2021-12-09 | 2022-04-22 | 湖北省信产通信服务有限公司数字科技分公司 | Big data cleaning method and system, storage medium and electronic equipment |
CN114549052A (en) * | 2022-01-20 | 2022-05-27 | 深圳市宝视佳科技有限公司 | Data-based accurate marketing method, device, equipment and storage medium |
CN115809406A (en) * | 2023-02-03 | 2023-03-17 | 佰聆数据股份有限公司 | Power consumer fine-grained classification method, device, equipment and storage medium |
CN116186698A (en) * | 2022-12-16 | 2023-05-30 | 广东技术师范大学 | Machine learning-based secure data processing method, medium and equipment |
CN116303377A (en) * | 2022-11-23 | 2023-06-23 | 南京视察者智能科技有限公司 | Government affair data cleaning and filtering method |
CN117540151A (en) * | 2023-12-08 | 2024-02-09 | 深圳市亲邻科技有限公司 | Data preprocessing method of data pushing system |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113535697B (en) * | 2021-07-07 | 2024-05-24 | 广州三叠纪元智能科技有限公司 | Climbing frame data cleaning method, climbing frame control device and storage medium |
CN114356902A (en) * | 2021-12-14 | 2022-04-15 | 中核武汉核电运行技术股份有限公司 | Industrial data quality management method and device |
CN115794795B (en) * | 2022-12-08 | 2023-09-22 | 湖北华中电力科技开发有限责任公司 | Power distribution station electricity consumption data standardization cleaning method, device, system and storage medium |
CN116961729B (en) * | 2023-08-09 | 2024-10-11 | 深圳市恩斯仪器设备有限公司 | Beidou-based voice communication method, system, equipment and medium |
CN117290315B (en) * | 2023-10-11 | 2024-06-25 | 河南师范大学 | Data classification cleaning method |
CN118467524A (en) * | 2024-07-11 | 2024-08-09 | 浪潮智慧城市科技有限公司 | Data cleaning method and device for supplementing data missing values |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160179599A1 (en) * | 2012-10-11 | 2016-06-23 | University Of Southern California | Data processing framework for data cleansing |
CN105989163A (en) * | 2015-03-04 | 2016-10-05 | 中国移动通信集团福建有限公司 | Data real-time processing method and system |
CN107025301A (en) * | 2017-04-25 | 2017-08-08 | 西安理工大学 | Flight ensures the method for cleaning of data |
CN108596386A (en) * | 2018-04-20 | 2018-09-28 | 上海市司法局 | A kind of prediction convict repeats the method and system of crime probability |
CN109063964A (en) * | 2018-07-02 | 2018-12-21 | 浙江百先得服饰有限公司 | A kind of platform data processing system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294745A (en) * | 2016-08-10 | 2017-01-04 | 东方网力科技股份有限公司 | Big data cleaning method and device |
CN109255523B (en) * | 2018-08-16 | 2021-07-20 | 北京奥技异科技发展有限公司 | Analytical index computing platform based on KKS coding rule and big data architecture |
CN109492002B (en) * | 2018-10-19 | 2021-03-23 | 浙江大学华南工业技术研究院 | Smart power grid big data storage and analysis system and processing method |
CN110162519A (en) * | 2019-04-17 | 2019-08-23 | 苏宁易购集团股份有限公司 | Data clearing method |
-
2019
- 2019-04-17 CN CN201910308949.0A patent/CN110162519A/en active Pending
- 2019-09-29 WO PCT/CN2019/109121 patent/WO2020211299A1/en active Application Filing
- 2019-09-29 CA CA3177209A patent/CA3177209A1/en active Pending
Non-Patent Citations (1)
Title |
---|
潘腾辉等: ""面向数据库清洗的数据质量控制设计"", 《信息技术》 * |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020211299A1 (en) * | 2019-04-17 | 2020-10-22 | 苏宁云计算有限公司 | Data cleansing method |
CN110716928A (en) * | 2019-09-09 | 2020-01-21 | 上海凯京信达科技集团有限公司 | Data processing method, device, equipment and storage medium |
CN110704410A (en) * | 2019-09-27 | 2020-01-17 | 中冶赛迪重庆信息技术有限公司 | Data cleaning method, system and equipment |
CN110781176A (en) * | 2019-11-06 | 2020-02-11 | 国网山东省电力公司威海供电公司 | Power grid data quality improvement method based on data correlation |
CN110990447A (en) * | 2019-12-19 | 2020-04-10 | 北京锐安科技有限公司 | Data probing method, device, equipment and storage medium |
CN110990447B (en) * | 2019-12-19 | 2023-09-15 | 北京锐安科技有限公司 | Data exploration method, device, equipment and storage medium |
CN111563071A (en) * | 2020-04-03 | 2020-08-21 | 深圳价值在线信息科技股份有限公司 | Data cleaning method and device, terminal equipment and computer readable storage medium |
CN111966735A (en) * | 2020-07-22 | 2020-11-20 | 山东高速信息工程有限公司 | NIFI-based micro-service data interaction method and system |
CN111859814A (en) * | 2020-07-30 | 2020-10-30 | 中国电建集团昆明勘测设计研究院有限公司 | Rock aging deformation prediction method and system based on LSTM deep learning |
CN111859814B (en) * | 2020-07-30 | 2023-07-28 | 中国电建集团昆明勘测设计研究院有限公司 | Rock aging deformation prediction method and system based on LSTM deep learning |
CN112287562A (en) * | 2020-11-18 | 2021-01-29 | 国网新疆电力有限公司经济技术研究院 | Power equipment retired data completion method and system |
CN112287562B (en) * | 2020-11-18 | 2023-03-10 | 国网新疆电力有限公司经济技术研究院 | Power equipment retired data completion method and system |
CN113268476A (en) * | 2021-06-07 | 2021-08-17 | 一汽解放汽车有限公司 | Data cleaning method and device applied to Internet of vehicles and computer equipment |
CN113568811A (en) * | 2021-07-28 | 2021-10-29 | 中国南方电网有限责任公司 | Distributed safety monitoring data processing method |
CN114385606A (en) * | 2021-12-09 | 2022-04-22 | 湖北省信产通信服务有限公司数字科技分公司 | Big data cleaning method and system, storage medium and electronic equipment |
CN114549052A (en) * | 2022-01-20 | 2022-05-27 | 深圳市宝视佳科技有限公司 | Data-based accurate marketing method, device, equipment and storage medium |
CN116303377A (en) * | 2022-11-23 | 2023-06-23 | 南京视察者智能科技有限公司 | Government affair data cleaning and filtering method |
CN116186698A (en) * | 2022-12-16 | 2023-05-30 | 广东技术师范大学 | Machine learning-based secure data processing method, medium and equipment |
CN116186698B (en) * | 2022-12-16 | 2024-08-16 | 田帅领 | Machine learning-based secure data processing method, medium and equipment |
CN115809406A (en) * | 2023-02-03 | 2023-03-17 | 佰聆数据股份有限公司 | Power consumer fine-grained classification method, device, equipment and storage medium |
CN117540151A (en) * | 2023-12-08 | 2024-02-09 | 深圳市亲邻科技有限公司 | Data preprocessing method of data pushing system |
Also Published As
Publication number | Publication date |
---|---|
CA3177209A1 (en) | 2020-10-22 |
WO2020211299A1 (en) | 2020-10-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | TA01 | Transfer of patent application right | Effective date of registration: 20210730; Address after: Room 834, Yingying building, No.99, Tuanjie Road, yanchuangyuan, Jiangbei new district, Nanjing, Jiangsu Province; Applicant after: Nanjing Xingyun Digital Technology Co.,Ltd.; Address before: 210000 No. 1 Suning Avenue, Xuanwu District, Nanjing City, Jiangsu Province; Applicant before: SUNING.COM Co.,Ltd. |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20190823 |