CN106919719A - An information completion method for big data - Google Patents

An information completion method for big data

- Publication number: CN106919719A (application CN201710156391.XA)
- Authority: CN (China)
- Prior art keywords: data, missing, value, tuple, missing data
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
        - G06F16/20—Information retrieval of structured data, e.g. relational data
          - G06F16/22—Indexing; Data structures therefor; Storage structures
            - G06F16/2219—Large Object storage; Management thereof
          - G06F16/24—Querying
            - G06F16/245—Query processing
              - G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
                - G06F16/2462—Approximate or statistical queries
                - G06F16/2471—Distributed queries
Abstract
The invention discloses an information completion method for big data. The method exploits a key characteristic of missing data: the value of a missing item is related to the other attribute values, and combinations of attribute values, in the tuple that contains it. By mining all such relevant evidence in each tuple containing missing data, and combining these pieces of evidence into an evidence chain for estimating the missing attribute value, the method finally estimates the value of the missing data from the evidence chain. Because the evidence chain used to predict the missing value is computed directly from the original data set, the present invention not only achieves high filling accuracy and strong robustness to high missing rates when filling missing values, but is also simple and easy to apply: it needs neither the distribution of the data in the data set nor domain knowledge, and no estimation model has to be trained on the data set, saving a substantial amount of time for data completion. The method runs on the Map-Reduce distributed programming framework and can therefore complete large-scale data sets in a distributed manner.
Description
Technical field
The present invention relates to an information completion method for big data, and belongs to the field of data preprocessing technology.
Background technology
In recent years, with the rapid development of information technology, the volume of data worldwide has kept growing at an astonishing speed, and the world has entered the big-data era. In real life, many factors, such as omissions during data entry, inconsistent measurement standards, and limitations of the collection conditions, cause data to be missing. Missing data not only damages the integrity of the data, but also biases the conclusions of data mining and data analysis. To avoid this, the missing data is often filled in beforehand. Information completion for big data has become an important data-preprocessing problem in the field of data mining. Traditional big-data completion methods generally suffer from low filling accuracy and limited robustness to high missing rates. There is therefore an urgent need for an algorithm that achieves both good filling accuracy and strong robustness to high missing rates on data sets with missing values, and that is also well suited to large-scale data sets.
Summary of the invention
The technical problem to be solved by the invention is to provide an information completion method for big data that has high filling accuracy and strong robustness to high missing rates, and that adapts to large-scale data sets by using a distributed filling method.
The present invention adopts the following technical scheme to solve the above technical problem:
An information completion method for big data comprises the following steps:
Step 1: let the data set be D, with m rows and n columns; each row is a data tuple and each column is an attribute. Scan every data tuple Dj (j = 1, …, m) of the data set, number the tuples, and find the incomplete data tuples and the positions of the missing data within them;
Step 2: combine the other, non-missing data in each incomplete data tuple to obtain the set of combinations of non-missing data in that tuple, which serves as the evidence chain for estimating the value of the missing data;
Step 3: from the values that the complete data tuples hold at the position corresponding to the missing data, obtain all possible values of the missing data, and compute the probability of each possible value p;
Step 4: join the position of the missing data from step 1, the set of non-missing-data combinations from step 2, and the possible values of the missing data from step 3; then join each non-missing-data combination in the set with each possible value of the missing data;
Step 5: enumerate the combinations of non-missing data over the whole data set and count the number of times each combination occurs;
Step 6: for each data tuple, select one attribute value in turn, combine the remaining attributes of the tuple to obtain a set of combinations, and count the number of times each combination in the set occurs together with the selected attribute value in the whole data set;
Step 7: from the result of step 5, extract the count in the whole data set of each non-missing-data combination of each incomplete tuple; from the result of step 6, extract the number of times each non-missing-data combination of each incomplete tuple occurs together with each possible value of the missing data; compute, for each incomplete tuple, the probability of each possible value of the missing data conditioned on its non-missing-data combinations, and take the value with the maximum probability as the filling value of the missing data.
As a preferred embodiment of the present invention, the probability of a possible value p of the missing data in step 3 is computed as:
P(p) = K(p) / m
where m is the number of all data tuples, K(p) is the number of times the possible value p occurs at the same missing position over all data tuples, and P(p) is the probability of the possible value p of the missing data.
As a preferred embodiment of the present invention, the probability in step 7 of a possible value of the missing data conditioned on its non-missing-data combinations is computed as:
W(S_j) = P(p) × Σ_{C(y,u) ∈ S_j} S(p ∪ C(y,u)) / S(C(y,u))
where P(p) is the probability of the possible value p of the missing data, S(p ∪ C(y,u)) is the number of times the non-missing-data combination C(y,u) of the incomplete tuple occurs together with the possible value of the missing data, S(C(y,u)) is the count in the whole data set of the non-missing-data combination C(y,u) of the incomplete tuple, W(S_j) is the confidence, i.e. the probability, of the evidence chain S_j of the possible value of the missing data, and S_j is the evidence chain of the missing data of the incomplete tuple numbered j.
As a preferred embodiment of the present invention, steps 2 and 3 are not limited to a particular order.
As a preferred embodiment of the present invention, steps 5 and 6 are not limited to a particular order.
Compared with the prior art, the above technical scheme of the present invention has the following technical effects:
1. The method of the invention was evaluated on UCI machine-learning data sets from which attribute values were randomly removed in different proportions to obtain experimental data sets with missing data; filling these missing data shows that the present invention achieves high filling accuracy and stability when filling missing values.
2. The present invention is suitable for big data. Most current methods process small missing data sets on a single machine, but with the development of informatization the data volume has increased sharply, and processing large-scale data sets on a single machine is clearly inappropriate. The present invention can fill large-scale data sets on a distributed data-processing platform based on the Map-Reduce programming model.
3. The method of the invention is simple and easy to apply: it needs neither the distribution of the data in the data set nor domain knowledge, nor does it need to train an estimation model on the data set, saving a substantial amount of time for data completion.
Brief description of the drawings
Fig. 1 is the algorithm sequence diagram of an information completion method for big data according to the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the drawings. The embodiments described with reference to the drawings are exemplary, are only used to explain the present invention, and are not to be construed as limiting the claims.
The present invention is an improved and comprehensive method, proposed to estimate the value of missing data from an evidence chain based on the set of related-attribute combinations in the tuple containing the missing value. The algorithm first estimates the filling value of the missing value: it scans every data tuple in the whole data set, marks the tuples with missing values as incomplete data tuples, and uses the combinations of the different complete attribute values in each incomplete tuple as evidence for estimating the value of the missing data. The large number of complete-attribute combinations in an incomplete tuple thus constitutes the evidence chain for estimating the missing data. The algorithm then scans the whole data set again to count the attribute-value sets of all data tuples. The core task of the algorithm is to compute, for each piece of evidence in the evidence chain, the confidence of the corresponding estimate of the missing value. This yields, for each candidate value, the sum of the confidences over all its evidence, and the estimate with the maximum confidence sum is chosen as the filling value.
To help the public understand the technical scheme of the present invention, the missing-data model involved in the invention, the principle of missing-data filling, and the Map-Reduce-based parallelization of the algorithm are briefly introduced first.
1. The missing-data model
Let the data set be D with m rows and n columns; that is, D has m data tuples, and each data tuple has n attributes. The data set D can then be defined as:
D = {A1, A2, A3, …, An}  (1)
where Ai (1 ≤ i ≤ n) denotes the i-th column attribute of the data set D.
A data tuple of the data set is denoted:
Dj = {Dj(Ai) | 1 ≤ j ≤ m, 1 ≤ i ≤ n}  (2)
where Dj(Ai) denotes the value of the i-th attribute of the j-th tuple of the data set D.
Definition 1. The missing-data model is defined as follows:
Dj(Ai) = ''  (3)
where Dj(Ai) = '' denotes that the i-th attribute value of the j-th tuple is missing.
When some Dj(Ai) = '' exists in a data tuple, that tuple is an incomplete data tuple, denoted:
Ij = {Dj(Ai) | ∃ Dj(Ai) = '', 1 ≤ j ≤ m, 1 ≤ i ≤ n}  (4)
Conversely, when no Dj(Ai) = '' exists in the tuple, that tuple is a complete data tuple, denoted:
Rj = {Dj(Ai) | Dj(Ai) != '', 1 ≤ j ≤ m, 1 ≤ i ≤ n}  (5)
Definition 2. The set of combinations of the non-missing data in an incomplete data tuple, i.e. of the attribute values related to the missing data, is defined as the evidence chain for estimating the missing value:
Sj = {C(y, u) | 1 ≤ y ≤ n, 1 ≤ u ≤ y}  (6)
where C(y, u) is a related-attribute-value combination for estimating the missing value, i.e. a choice of u unordered attribute values out of the y complete attribute values of the tuple; each such combination is a piece of evidence for estimating the missing value.
The main goal of the algorithm is to estimate, from the set Sj, the value of the missing data in the j-th data tuple.
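As an illustration (not part of the patent text), the evidence chain of Definition 2 can be sketched in Python; the list representation of a tuple and the `None` missing marker are assumptions of this sketch:

```python
from itertools import combinations

def evidence_chain(tuple_values, missing_marker=None):
    """Build the evidence chain S_j: every unordered combination C(y, u)
    of the y non-missing attribute values, for u = 1 .. y."""
    complete = [v for v in tuple_values if v != missing_marker]
    chain = []
    for u in range(1, len(complete) + 1):
        chain.extend(combinations(complete, u))
    return chain

# An incomplete tuple with three complete values and one missing value:
chain = evidence_chain(["Male", "Short", "No", None])
print(len(chain))  # 7: three singletons, three pairs, one triple
```

With y = 3 complete values the chain holds C(3,1) + C(3,2) + C(3,3) = 7 pieces of evidence, matching the worked example later in the description.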
2. The principle of missing-data filling
In any data tuple Dj (1 ≤ j ≤ m) of the data set D there may exist an attribute set, say A; the data tuple Dj contains A if and only if A ⊆ Dj. A rule of the form A ⇒ B holds in a tuple Dj (1 ≤ j ≤ m) of the data set D, where A ⊆ Dj, B ⊆ Dj, and A ∩ B = ∅. The ratio of the tuples in the data set D that contain both the set A and the set B, i.e. A ∪ B, is denoted P(A ∪ B).
Definition 3. The support count S denotes the number of occurrences of a given set in the data set; the support count of a rule in the whole data set D is then defined as:
S(A ⇒ B) = S(A ∪ B)  (7)
Definition 4. The ratio of the tuples in the data set D that contain the attribute set A and also contain the attribute set B is the conditional probability P(B | A); the confidence of a rule in the whole data set D is then defined as:
Confidence(A ⇒ B) = P(B | A)  (8)
and the confidence is computed as:
P(B | A) = S(A ∪ B) / S(A)  (9)
The core work of the present invention is to compute the support count of each possible value of the missing data together with its relevant evidence, i.e. S(p ∪ C(y, u)); then to compute the confidence of each piece of evidence in the evidence chain; and to add the confidences over all the evidence to obtain the confidence of the relevant evidence chain for that possible value. The possible value with the maximum evidence-chain confidence is finally taken as the filling value of the missing data.
3. Map-Reduce-based parallelization of the algorithm
Map-Reduce is a parallel programming framework and is currently the most popular computing model on cloud-computing platforms. Its basic idea is to apply a divide-and-conquer strategy to large-scale data sets. Map-Reduce performs its computation on data in Key/Value form. The core of its parallelization is the two operations Map and Reduce: the Map-Reduce framework first splits the data set into many small files of equal size and distributes them to different nodes; each node performs the Map computation, the results are sorted and merged, and the Values with the same Key are placed in the same set for the Reduce computation.
The present invention provides an algorithm based on the Map-Reduce programming framework to realize distributed operation of the algorithm. As shown in Fig. 1, the algorithm is divided into 5 stages: it first estimates the values of the missing data in the data set using the sets of related-attribute combinations of the missing data, and then fills the estimated values into the data set.
Stage 1: the algorithm scans the data set, marks each incomplete data tuple with a unique number, and gives the position of the missing data in each incomplete tuple, to determine which attribute of the tuple is missing. Each output record contains the number of an incomplete data tuple, the position of the missing data in that tuple, and the incomplete tuple itself. These records constitute the result file of this stage.
Stage 2: this stage is divided into 4 modules, which can be executed simultaneously.
Module 1: the algorithm scans the result file of stage 1 and computes the set Sj of combinations C(y, u) of the non-missing attribute values in each incomplete tuple; Sj serves as the evidence chain for estimating the value of the missing data. Each output record contains the number of an incomplete data tuple, the position of the missing data in that tuple, and the set Sj of non-missing-data combinations of the tuple. These records constitute the result file of this module.
Module 2: the algorithm counts, for each attribute of the data set, the values p it takes and the probability P(p) of each value; the value of the missing data will come from these p:
P(p) = K(p) / m  (10)
where K(·) denotes counting, K(p) is the number of times the attribute value p occurs on the same attribute as the missing data in the whole data set, and m is the number of data tuples.
Each output record contains the position of an attribute value, the attribute value p, and its probability P(p); these records constitute the output file of this module.
Module 3: the algorithm counts the number Oj of occurrences in the whole data set of each combination C(y, u) of the non-missing data of each data tuple, for the later query of the probabilities used to estimate the value of the missing data. Each output record contains a non-missing-data combination C(y, u) of a data tuple and its count Oj. These records constitute the output file of this module.
Module 4: the algorithm counts the number Tj of times a non-missing-data combination C(y, u) and a given attribute value, which must not appear in C(y, u), occur simultaneously in the whole data set, i.e. in the same data tuple. Concretely, the algorithm scans a data tuple and selects each attribute value in turn; after selecting an attribute value, it combines the remaining attributes of the tuple into a set of combinations, and then counts, for each combination in the set, the number of times it occurs in the same data tuple as the selected attribute value over the whole data set:
Tj = K(Dj(Ai) ∪ C(y, u)), 1 ≤ j ≤ m, 1 ≤ i ≤ n, 1 ≤ y ≤ n, 1 ≤ u ≤ y  (11)
Each output record contains the position of an attribute value in a data tuple, the attribute value, a combination from the combinations of the remaining attribute values of the tuple, and the count of the combination and the attribute value in the same data tuple.
Stage 3: the algorithm joins the evidence chains Sj for estimating the missing data, output by module 1 of stage 2, with the attribute-value records output by module 2 of stage 2. This yields, for each incomplete data tuple, its non-missing-data combinations C(y, u), the possible filling values p of each missing datum, and the probability P(p) with which each p occurs in the whole data set.
Concretely, the algorithm first joins the records output by module 1 of stage 2 with the records output by module 2 of stage 2 according to the missing position, obtaining a record that contains the missing position of an incomplete tuple, the possible filling values of the missing data, and the set Sj of non-missing-data combinations of the tuple. The algorithm then joins each non-missing-data combination C(y, u) in the Sj of this record with each possible value p of the missing data.
Each record output by this stage contains a non-missing-data combination C(y, u) of an incomplete tuple, the position of the missing data in the tuple, a possible value of the missing data, and the probability P(p) with which the possible value p occurs in the whole data set.
Stage 4: in the result file of module 3 of stage 2, the algorithm looks up the count Oj of each non-missing-data combination C(y, u) of the incomplete tuples appearing in the result file of stage 3; then S(C(y, u)) = Oj. In the result file of module 4 of stage 2, it looks up the number of times Tj that a non-missing-data combination C(y, u) of an incomplete tuple and a possible value p of the missing data occur simultaneously in the whole data set; then S(p ∪ C(y, u)) = Tj. We can now compute the probability of every possible missing-data value under the condition of the non-missing-attribute-value combinations C(y, u) of its tuple, and we choose the estimate with the maximum probability, i.e. the maximum sum of evidence confidences, as the final filling value.
Stage 5: the algorithm fills the possible values of the missing data estimated in stage 4 into the original missing data set D.
Stages 1 to 4 of the algorithm estimate the values of the missing data. In most data sets there is no obvious causal relation between attributes, but there can be correlation between them; the present invention captures this correlation through the set of related-attribute-value combinations of the missing data. In these stages Map-Reduce mainly computes the set of related-attribute-value combinations of the missing data in each incomplete tuple and estimates the value of the missing data.
Stage 1: marking the missing data set
Input: the data file containing missing values.
Output: data tuple label, data tuple.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
2.   ADD tupleindex into tuple
3.   IF tuple contains missingvalue THEN
       Outkey: tupleindex
       Outvalue: tuple
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     Outkey: key
     Outvalue: tuple
In stage 1 the Map function scans the data set and adds the label tupleindex to each data tuple; the Reduce function finally outputs data in the form (tupleindex, tuple).
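A minimal single-machine sketch of stage 1 (the 1-based numbering and the `None` missing marker are assumptions of this sketch, standing in for tupleindex and the empty missing value):

```python
def mark_missing(data, missing_marker=None):
    """Stage 1 sketch: number each tuple (1-based) and emit, for each
    missing value, a (tupleindex, missingindex, tuple) record."""
    records = []
    for j, row in enumerate(data, start=1):
        for i, v in enumerate(row, start=1):
            if v == missing_marker:
                records.append((j, i, row))
    return records

data = [
    ["Male", "Tall", "Yes", "Good"],
    ["Female", "Tall", "Yes", "Poor"],
    ["Male", "Short", "No", "Poor"],
    ["Female", "Tall", "No", "Good"],
    ["Male", "Short", "No", None],
    ["Female", None, "No", "Good"],
]
records = mark_missing(data)
print([(j, i) for j, i, _ in records])  # [(5, 4), (6, 2)]
```

This matches the stage-1 result of the embodiment: tuple 5 is missing attribute 4, tuple 6 attribute 2.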
Stage 2: this stage is divided into 4 modules, which can be executed simultaneously
Module 1: the related-attribute combination set of the missing data
Input: the result file of stage 1.
Output: data tuple label, missing-value position, set of related-attribute-value combinations of the missing value.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
2.   IF tuple contains missingvalue THEN
       ADD rest complete attributes into set comple-attri
3.   Calculate complete attribute combinations combi-attri in comple-attri
4.   Outkey: tupleindex + missingindex
     Outvalue: combi-attri
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     Outkey: key
     Outvalue: combi-attri
In module 1, missingindex is the position of the missing data in the incomplete data tuple. The Map function combines the complete attribute values comple-attri related to the missing data in each incomplete tuple into the combination set combi-attri. The Reduce function finally outputs data in the form (tupleindex, missingindex, combi-attri).
Module 2: the possible values of the missing data
Input: the result file of stage 1.
Output: attribute position, possible values of the missing data, probability of each possible value.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
2.   FOR each <attri-v, tuple> DO
       Outkey: attriindex
       Outvalue: attri-v
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     ADD value into list-pro
2. Calculate count of each value in list-pro divided by m as pro
3. Outkey: attriindex
   Outvalue: attriindex + list-pro + pro
In module 2 the algorithm scans each data tuple; the Map function records the value attri-v of each attribute and outputs each attribute number attriindex. The Reduce function derives the list of possible values list-pro of each attribute and the probability pro of each possible value.
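Module 2 can be sketched as follows (a single-machine Python illustration; 1-based attribute positions and the `None` missing marker are assumptions of this sketch):

```python
from collections import Counter

def value_probabilities(data, missing_marker=None):
    """Module 2 sketch: for each attribute position (1-based), the observed
    values p and their probabilities P(p) = K(p) / m (m = tuple count)."""
    m = len(data)
    probs = {}
    for i in range(len(data[0])):
        counts = Counter(row[i] for row in data if row[i] != missing_marker)
        probs[i + 1] = {p: k / m for p, k in counts.items()}
    return probs

data = [
    ["Male", "Tall", "Yes", "Good"],
    ["Female", "Tall", "Yes", "Poor"],
    ["Male", "Short", "No", "Poor"],
    ["Female", "Tall", "No", "Good"],
    ["Male", "Short", "No", None],
    ["Female", None, "No", "Good"],
]
probs = value_probabilities(data)
print(probs[4])  # Good occurs 3 times in 6 tuples: P(Good) = 0.5
```

These figures reproduce the module-2 result of the embodiment (Good: 50%, Poor: 33%, and so on).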
Module 3: counting the attribute-value combination sets
Input: the result file of stage 1.
Output: attribute-value combinations, count of each combination.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
     Calculate C(y, u) in tuple as combi-attri
2.   Outkey: combi-attri
     Outvalue: 1
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     Calculate number of combi-attri
2.   Outkey: combi-attri
     Outvalue: num_c
In module 3 the Map function computes the attribute-value combinations combi-attri in each data tuple, and the Reduce function computes the count num_c of each attribute-value combination of each data tuple in the whole data set.
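Module 3 can be sketched as follows (a single-machine Python illustration using frozensets for the unordered combinations, an assumption of this sketch):

```python
from collections import Counter
from itertools import combinations

def combination_counts(data, missing_marker=None):
    """Module 3 sketch: count, over the whole data set, every unordered
    combination C(y, u) of the non-missing values of each tuple (num_c)."""
    counts = Counter()
    for row in data:
        complete = [v for v in row if v != missing_marker]
        for u in range(1, len(complete) + 1):
            for combo in combinations(complete, u):
                counts[frozenset(combo)] += 1
    return counts

data = [
    ["Male", "Tall", "Yes", "Good"],
    ["Female", "Tall", "Yes", "Poor"],
    ["Male", "Short", "No", "Poor"],
    ["Female", "Tall", "No", "Good"],
    ["Male", "Short", "No", None],
    ["Female", None, "No", "Good"],
]
counts = combination_counts(data)
print(counts[frozenset({"No"})])             # 4
print(counts[frozenset({"Male", "Short"})])  # 2
```

These counts match the module-3 result of the embodiment (<No>: 4, <Male, Short>: 2, and so on).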
Module 4: counting, over the whole data set, the co-occurrence of an attribute-value combination and a given attribute value in the same data tuple
Input: the result file of stage 1.
Output: attribute-value combination, position of the attribute value, the attribute value, count of the combination and the attribute value in the same data tuple.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
2.   FOR each <attri-v, tuple> DO
       Calculate C(y, u) in rest complete attributes as combi-attri
3.     Outkey: combi-attri + attriindex + attri-v
       Outvalue: 1
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     Calculate number of combi-attri + attriindex + attri-v
2.   Outkey: combi-attri + attriindex + attri-v
     Outvalue: num_caa
In module 4 the Map function scans each data tuple, selects each attribute value attri-v in turn, and computes the attribute-value combinations combi-attri over the remaining attribute values; the Reduce function counts the number num_caa of times combi-attri and attri-v occur in the same data tuple over the whole data set.
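Module 4 can be sketched as follows (a single-machine Python illustration; the (1-based position, value, combination) key is an assumption of this sketch):

```python
from collections import Counter
from itertools import combinations

def value_combo_counts(data, missing_marker=None):
    """Module 4 sketch: count how often an attribute value attri-v occurs
    in the same tuple as each combination of the tuple's remaining
    non-missing values (num_caa)."""
    counts = Counter()
    for row in data:
        for i, v in enumerate(row):
            if v == missing_marker:
                continue
            rest = [w for k, w in enumerate(row)
                    if k != i and w != missing_marker]
            for u in range(1, len(rest) + 1):
                for combo in combinations(rest, u):
                    counts[(i + 1, v, frozenset(combo))] += 1
    return counts

data = [
    ["Male", "Tall", "Yes", "Good"],
    ["Female", "Tall", "Yes", "Poor"],
    ["Male", "Short", "No", "Poor"],
    ["Female", "Tall", "No", "Good"],
    ["Male", "Short", "No", None],
    ["Female", None, "No", "Good"],
]
caa = value_combo_counts(data)
print(caa[(4, "Good", frozenset({"No"}))])  # 2: tuples 4 and 6
```

These counts match the module-4/stage-4 figures of the embodiment (Good with <No>: 2, Poor with <Short>: 1, and so on).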
Stage 3: joining the related-attribute combination sets of the missing data with the possible values of the missing data
Input: the result file of module 1 of stage 2, the result file of module 2 of stage 2.
Output: related-attribute-value combinations of the incomplete tuple, missing-value position, possible values.
Map<Object,Text,Text,Text>
Input: key = offset, value = missingindex + combi-attri
1. FOR each <key, value> DO
     Split the value
2. Outkey: missingindex
   Outvalue: combi-attri
Map<Object,Text,Text,Text>
Input: key = offset, value = missingindex + pro-v
1. FOR each <key, value> DO
     Split the value
2. Outkey: missingindex
   Outvalue: pro-v
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
2. Outkey: offset
   Outvalue: combi-attri + missingindex + pro-v
In stage 3, pro-v is a possible value of the missing value. The first Map function splits each row of the result file of module 1 and submits missingindex as the key and combi-attri as the value to the Reduce. The second Map function splits each row of the possible-value file of the missing values and likewise submits missingindex as the key and pro-v as the value to the Reduce. The values with the same key are placed in the same valuelist, and the Reduce function joins the related-attribute combinations combi-attri of the missing data with the possible values pro-v; the final output has the form (combi-attri, missingindex, pro-v).
Stage 4: estimating the possible value of the missing value
Input: the result file CAacount of module 3 of stage 2, the result file CA-Aacount of module 4 of stage 2, the result file of stage 3.
Output: the estimate of the missing data.
Map<Object,Text,Text,Text>
Input: key = offset, value = missingindex + combi-attri + pro-v
1. FOR each <key, value> DO
     Split the value
2. Search count of combi-attri in CAacount record as num-combi-attri
3. Search count of combi-attri + pro-v in CA-Aacount record as num-combi-attri-a
4. Calculate num-combi-attri-a / num-combi-attri as credibility
5. Outkey: tupleindex + missingindex
   Outvalue: credibility + pro-v
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     Sum the credibility of each pro-v
2. IF sum of credibility is maximum THEN
     Outkey: offset
     Outvalue: pro-v
Stage 4 is the core of the algorithm and is used to estimate the value of the missing value. The Map function first splits each row of the result file of stage 3; it looks up in the file CAacount the count num-combi-attri of the related-attribute-value combination combi-attri of the missing data, i.e. S(combi-attri), and looks up in the file CA-Aacount the count num-combi-attri-a of the combination and the possible value, combi-attri + pro-v, appearing together in the same data tuple, i.e. S(combi-attri ∪ pro-v); from these it computes the credibility of the possible value of the missing data. The Reduce function adds up all the evidence credibilities for each possible value of the missing value, and the possible value pro-v with the maximum evidence sum is taken as the final filling value.
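The credibility computation of stage 4 can be sketched as follows, using the support counts of the embodiment's incomplete tuple (Male, Short, No, ?); the dictionary representation is an assumption of this sketch:

```python
def credibility(chain, s_c_counts, s_pc_counts):
    """Stage 4 sketch: sum S(combi-attri u pro-v) / S(combi-attri) over
    every evidence combination in the chain; zero-support evidence is
    skipped."""
    total = 0.0
    for combo in chain:
        s_c = s_c_counts.get(combo, 0)
        if s_c:
            total += s_pc_counts.get(combo, 0) / s_c
    return total

# Counts taken from the worked example for tuple 5 (grade missing).
s_c = {frozenset({"Male"}): 3, frozenset({"Short"}): 2, frozenset({"No"}): 4,
       frozenset({"Male", "Short"}): 2, frozenset({"Male", "No"}): 2,
       frozenset({"Short", "No"}): 2, frozenset({"Male", "Short", "No"}): 2}
good = {frozenset({"Male"}): 1, frozenset({"No"}): 2}  # S(Good u C)
poor = {c: 1 for c in s_c}                             # S(Poor u C)
chain = list(s_c)
print(credibility(chain, s_c, good))  # 1/3 + 2/4, about 0.83
print(credibility(chain, s_c, poor))  # about 3.08, so Poor would be chosen
```

Following the stage-4 pseudocode, this sum does not multiply in the prior P(p); adding that factor would not change the winner in this example.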
Stage 5: filling the values estimated in stage 4 into the original missing data set
Input: the original missing-data-set file, the estimated-value file of stage 4.
Output: the complete data set.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
2. Outkey: offset
   Outvalue: value
Map<Object,Text,Text,Text>
Input: key = offset, value = missingindex + pro-v
1. FOR each <key, value> DO
2. Outkey: offset
   Outvalue: missingindex + pro-v
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     ADD missingindex + pro-v into listA
2. FOR each in listA DO
     Append pro-v to value
3. Outkey: key
   Outvalue: com-tuple
In stage 5 the Map functions output the offsets of the original missing-data-set file and of the estimated-value file as keys, and output the values of the original data set and the "missingindex + pro-v" pairs of the estimated-value file as values. For each valuelist, the Reduce function stores "missingindex + pro-v" in listA and fills all the estimates in listA into the corresponding values of the missing data set, finally outputting the complete data tuples com-tuple.
Embodiment: the data set has 4 attributes: Sex, Height, Smokes (whether the person smokes), and Grade (school grade). A blank denotes missing data.
Sex | Height | Smokes | Grade
Male | Tall | Yes | Good
Female | Tall | Yes | Poor
Male | Short | No | Poor
Female | Tall | No | Good
Male | Short | No |
Female | | No | Good
Stage 1:
The 4th attribute of the fifth tuple is missing.
The 2nd attribute of the sixth tuple is missing.
Result file: 5, 4, [Male, Short, No, ]
6, 2, [Female, , No, Good]
[ ] denotes a tuple.
Stage 2:
Module 1:
Result file: 5, 4, {<Male>, <Short>, <No>, <Male, Short>, <Male, No>, <Short, No>, <Male, Short, No>}
6, 2, {<Female>, <No>, <Good>, <Female, No>, <Female, Good>, <No, Good>, <Female, No, Good>}
{ } denotes a set.
Module 2:
Result file: 1, Male: 50%, Female: 50%
2, Tall: 50%, Short: 33%
3, Yes: 33%, No: 67%
4, Good: 50%, Poor: 33%
The output is the attribute position and the probability of each attribute value.
Module 3:
Result file: <Male>: 3
<Female>: 3
<Tall>: 3
<Short>: 2
<Yes>: 2
<No>: 4
<Good>: 3
<Poor>: 2
<Male, Tall>: 1
<Male, Short>: 2
<Male, Yes>: 1
<Male, No>: 2
<Male, Good>: 1
<Tall, Yes>: 2
<Tall, Good>: 2
<Short, Yes>: 0
<Short, Good>: 0
<Yes, Good>: 1
<Male, Tall, Yes>: 1
<Male, Tall, Good>: 1
<Male, Yes, Good>: 1
<Male, Tall, Yes, Good>: 1 …
Module 4:
Taking the first data tuple as an example:
Result: 1, Male, <Tall>, 1
1, Male, <Yes>, 1
1, Male, <Good>, 1
1, Male, <Tall, Yes>, 1
1, Male, <Tall, Good>, 1
1, Male, <Yes, Good>, 1
1, Male, <Tall, Yes, Good>, 1 …
Stage 3:
Taking one of the data tuples as an example:
According to missing position 4, the record 5, 4, {<Male>, <Short>, <No>, <Male Short>, <Male No>, <Short No>, <Male Short No>} from Module 1 of Stage 2 is joined with the record 4, Good: 50%, Poor: 33% from Module 2 of Stage 2.
The record obtained is:
5, Good: 50%, Poor: 33%, {<Male>, <Short>, <No>, <Male Short>, <Male No>, <Short No>, <Male Short No>}
The algorithm then attaches each non-missing-data combination C(y, u) in the evidence chain Sj of this record to each possible missing-data value p.
The final output is:
<Male>, 4, Good: 50% + <Short>, 4, Good: 50% + <No>, 4, Good: 50% + <Male Short>, 4, Good: 50% + <Male No>, 4, Good: 50% + <Short No>, 4, Good: 50% + <Male Short No>, 4, Good: 50% + <Male>, 4, Poor: 33% + <Short>, 4, Poor: 33% + <No>, 4, Poor: 33% + <Male Short>, 4, Poor: 33% + <Male No>, 4, Poor: 33% + <Short No>, 4, Poor: 33% + <Male Short No>, 4, Poor: 33%
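The Stage 3 join can be sketched as a cartesian product of tuple 5's evidence chain (Module 1) with the candidate values and prior probabilities of attribute 4 (Module 2); the literals below are taken from the example:

```python
# Evidence chain of tuple 5 and candidate values for its missing attribute 4.
evidence = [("Male",), ("Short",), ("No",), ("Male", "Short"),
            ("Male", "No"), ("Short", "No"), ("Male", "Short", "No")]
candidates = {"Good": 0.50, "Poor": 0.33}

# Stage 3: attach every combination C(y, u) to every possible value p.
attached = [(combo, 4, p, prob)
            for p, prob in candidates.items()
            for combo in evidence]
print(len(attached))  # → 14: 7 combinations x 2 candidate values
```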
Stage 4:
For each non-missing-data combination C(y, u) of the incomplete data tuple in the output file of Stage 3, the algorithm looks up its quantity Oj in the output file of Module 3 of Stage 2.
Result: <Male>: quantity 3
<Short>: quantity 2
<No>: quantity 4
<Male Short>: quantity 2
<Male No>: quantity 2
<Short No>: quantity 2
<Male Short No>: quantity 2
It also looks up, in the output file of Module 4 of Stage 2, the number of times Tj that each non-missing-data combination C(y, u) of the incomplete data tuple occurs simultaneously with the possible missing-data value p in the whole data set.
Result: Good, <Male>: quantity 1
Good, <Short>: quantity 0
Good, <No>: quantity 2
Good, <Male Short>: quantity 0
Good, <Male No>: quantity 0
Good, <Short No>: quantity 0
Good, <Male Short No>: quantity 0
Poor, <Male>: quantity 1
Poor, <Short>: quantity 1
Poor, <No>: quantity 1
Poor, <Male Short>: quantity 1
Poor, <Male No>: quantity 1
Poor, <Short No>: quantity 1
Poor, <Male Short No>: quantity 1
F({<Male>, <Short>, <No>, <Male Short>, <Male No>, <Short No>, <Male Short No>} | Good) = 1/3 + 0/2 + 2/4 + 0/2 + 0/2 + 0/2 + 0/2 + 50% = 1.33
F({<Male>, <Short>, <No>, <Male Short>, <Male No>, <Short No>, <Male Short No>} | Poor) = 1/3 + 1/2 + 1/4 + 1/2 + 1/2 + 1/2 + 1/2 + 33% = 3.41
Stage 5: The maximum is taken, so the filling value for the missing 4th attribute of tuple 5 is "Poor".
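Stages 4 and 5 for tuple 5 can be sketched as follows; the counts `O` (Module 3 lookups), `T` (Module 4 lookups), and priors are the ones listed in the example, and `score`/`fill` are illustrative names:

```python
# Oj: quantity of each combination in the whole data set (Module 3 lookups).
O = {("Male",): 3, ("Short",): 2, ("No",): 4, ("Male", "Short"): 2,
     ("Male", "No"): 2, ("Short", "No"): 2, ("Male", "Short", "No"): 2}
# Tj: co-occurrences of each candidate value with each combination (Module 4).
T = {
    "Good": {("Male",): 1, ("Short",): 0, ("No",): 2, ("Male", "Short"): 0,
             ("Male", "No"): 0, ("Short", "No"): 0, ("Male", "Short", "No"): 0},
    "Poor": {("Male",): 1, ("Short",): 1, ("No",): 1, ("Male", "Short"): 1,
             ("Male", "No"): 1, ("Short", "No"): 1, ("Male", "Short", "No"): 1},
}
prior = {"Good": 0.50, "Poor": 0.33}

def score(p):
    """F = sum of Tj/Oj over the evidence chain, plus the prior P(p)."""
    return sum(T[p][c] / O[c] for c in O) + prior[p]

# Stage 5: fill with the candidate value of maximum F.
fill = max(prior, key=score)
print(round(score("Good"), 2), round(score("Poor"), 2), fill)
# → 1.33 3.41 Poor
```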
The above embodiment merely illustrates the technical idea of the present invention and does not limit its protection scope; any modification made on the basis of the technical scheme in accordance with the technical idea proposed by the present invention falls within the protection scope of the present invention.
Claims (5)
1. An information completion method for big data, characterized by comprising the following steps:
Step 1: let the data set be D, where D has m rows and n columns, each row being a data tuple and each column an attribute; scan and number each data tuple Dj, j = 1, ..., m, of the data set, and find the incomplete data tuples and the positions of the missing data within them;
Step 2: combine the other, non-missing data in each incomplete data tuple to obtain the set of non-missing-data combinations of that tuple, which serves as the chain of evidence for estimating the missing data value;
Step 3: obtain the possible values of the missing data from the non-missing data of the complete data tuples at the corresponding missing position, and calculate the probability of each possible missing-data value p;
Step 4: according to the missing positions from Step 1, join the set of non-missing-data combinations obtained in Step 2 with the possible missing-data values obtained in Step 3, and then attach each non-missing-data combination in the set to each possible missing-data value;
Step 5: combine the non-missing data of the whole data set and count the number of times each combination occurs;
Step 6: for each data tuple, select one of its attribute values, combine the remaining attributes of the tuple to obtain a set of combinations, and count the number of times each combination in the set occurs simultaneously with the selected attribute value in the whole data set;
Step 7: extract from the result of Step 5 the quantity of each non-missing-data combination of each incomplete data tuple in the whole data set, extract from the result of Step 6 the number of times each non-missing-data combination of each incomplete data tuple occurs simultaneously with each possible missing-data value, calculate the probability of each possible missing-data value conditioned on the non-missing-data combinations of the incomplete data tuple, and take the value with the maximum probability as the filling value of the missing data.
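Steps 1-7 of claim 1 can be tied together in one illustrative sketch. This is a reading of the claims under the assumption of at most one missing value per tuple, not the patent's reference implementation; `impute` and its internals are illustrative names, and the prior follows K(p)/m as in the embodiment:

```python
from itertools import combinations
from collections import Counter

def impute(dataset):
    """Illustrative end-to-end reading of claim 1 (steps 1-7), assuming
    at most one missing value (None) per data tuple."""
    m = len(dataset)

    # Steps 5 and 6: global counts of combinations and of
    # (selected value, combination of remaining attributes) pairs.
    combo_count, pair_count = Counter(), Counter()
    for row in dataset:
        present = [v for v in row if v is not None]
        for r in range(1, len(present) + 1):
            combo_count.update(combinations(present, r))
        for i, sel in enumerate(row):
            if sel is None:
                continue
            rest = [v for k, v in enumerate(row) if k != i and v is not None]
            for r in range(1, len(rest) + 1):
                for c in combinations(rest, r):
                    pair_count[(sel, c)] += 1

    filled = [list(row) for row in dataset]
    for row in filled:
        if None not in row:
            continue
        pos = row.index(None)                        # Step 1: missing position
        present = [v for v in row if v is not None]
        chain = [c for r in range(1, len(present) + 1)
                 for c in combinations(present, r)]  # Step 2: evidence chain
        # Step 3: candidate values and prior P(p) = K(p) / m.
        column = [r2[pos] for r2 in dataset if r2[pos] is not None]
        priors = {p: column.count(p) / m for p in set(column)}

        def score(p):                                # Steps 4 and 7
            return sum(pair_count[(p, c)] / combo_count[c]
                       for c in chain) + priors[p]

        row[pos] = max(priors, key=score)            # maximum-probability value
    return filled

result = impute([
    ["Male",   "Tall",  "Yes", "Good"],
    ["Female", "Tall",  "Yes", "Poor"],
    ["Male",   "Short", "No",  "Poor"],
    ["Female", "Tall",  "No",  "Good"],
    ["Male",   "Short", "No",  None],
    ["Female", None,    "No",  "Good"],
])
print(result[4][3], result[5][1])  # → Poor Tall
```

On the example data set this fills tuple 5 with "Poor", matching the embodiment; tuple 6's fill value is not given in the patent and is produced here purely by the sketch.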
2. The information completion method for big data according to claim 1, characterized in that the probability of the possible missing-data value p in Step 3 is calculated as:
P(p) = K(p) / m
wherein m is the number of all data tuples, K(p) is the number of times the possible missing-data value p occurs, across the data tuples, at the same position as the missing data, and P(p) is the probability of the possible value p.
3. The information completion method for big data according to claim 1, characterized in that the probability of the possible missing-data value conditioned on its non-missing-data combinations in the incomplete data tuple in Step 7 is calculated as:
F(Sj) = Σ S(p ∪ C(y, u)) / S(C(y, u)) + P(p), summed over all combinations C(y, u) in Sj,
wherein P(p) is the probability of the possible missing-data value p, S(p ∪ C(y, u)) is the number of times the non-missing-data combination C(y, u) in the incomplete data tuple occurs simultaneously with the possible missing-data value, S(C(y, u)) is the quantity of that non-missing-data combination of the incomplete data tuple in the whole data set, F(Sj) is the confidence, i.e. the probability, of the evidence chain Sj of the possible missing-data value, and Sj is the evidence chain of the missing data of the incomplete data tuple numbered j.
4. The information completion method for big data according to claim 1, characterized in that Step 2 and Step 3 are not restricted in order.
5. The information completion method for big data according to claim 1, characterized in that Step 5 and Step 6 are not restricted in order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710156391.XA CN106919719A (en) | 2017-03-16 | 2017-03-16 | A kind of information completion method towards big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106919719A true CN106919719A (en) | 2017-07-04 |
Family
ID=59460304
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710156391.XA Pending CN106919719A (en) | 2017-03-16 | 2017-03-16 | A kind of information completion method towards big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106919719A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200278471A1 (en) * | 2017-09-12 | 2020-09-03 | Schlumberger Technology Corporation | Dynamic representation of exploration and/or production entity relationships |
US11619761B2 (en) * | 2017-09-12 | 2023-04-04 | Schlumberger Technology Corporation | Dynamic representation of exploration and/or production entity relationships |
CN107766294A (en) * | 2017-10-31 | 2018-03-06 | 北京金风科创风电设备有限公司 | Method and device for recovering missing data |
CN107958027A (en) * | 2017-11-16 | 2018-04-24 | 南京邮电大学 | A QoS-guaranteed sensor network data collection method |
CN110413658A (en) * | 2019-07-23 | 2019-11-05 | 中经柏诚科技(北京)有限责任公司 | An evidence chain construction method based on fact association rules |
CN111737463A (en) * | 2020-06-04 | 2020-10-02 | 江苏名通信息科技有限公司 | Big data missing value filling method, device and computer program |
CN111737463B (en) * | 2020-06-04 | 2024-02-09 | 江苏名通信息科技有限公司 | Big data missing value filling method, device and computer readable memory |
CN116578557A (en) * | 2023-03-03 | 2023-08-11 | 齐鲁工业大学(山东省科学院) | Missing data filling method for data center |
CN116578557B (en) * | 2023-03-03 | 2024-04-02 | 齐鲁工业大学(山东省科学院) | Missing data filling method for data center |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170704 |