CN106919719A - An information completion method for big data - Google Patents

An information completion method for big data

Info

Publication number
CN106919719A
Authority
CN
China
Prior art keywords
data
missing
value
tuple
missing data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710156391.XA
Other languages
Chinese (zh)
Inventor
徐小龙
崇卫之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University filed Critical Nanjing Post and Telecommunication University
Priority to CN201710156391.XA priority Critical patent/CN106919719A/en
Publication of CN106919719A publication Critical patent/CN106919719A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2219Large Object storage; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries

Abstract

The invention discloses an information completion method for big data. The method makes full use of a characteristic of missing data: the value of a missing datum is correlated with the other attribute values, and combinations of attribute values, in the tuple where it resides. By mining all the relevant evidence present in each tuple that contains missing data, and synthesizing this evidence into an evidence chain for estimating the missing attribute value, the value of the missing datum is finally estimated through the evidence chain. Because the relevant evidence chains are computed directly from the original data set to predict the missing values, the present invention not only achieves high filling accuracy and strong robustness to high missing rates when filling missing values, but is also simple and easy to apply: it does not need to grasp the distribution of the data in the data set or any domain knowledge, nor does it need to train an estimation model on the data set, saving a substantial amount of time when completing the data. The method can run on the Map-Reduce distributed programming framework and can therefore complete large-scale data sets in a distributed manner.

Description

An information completion method for big data
Technical field
The present invention relates to an information completion method for big data, and belongs to the field of data preprocessing technology.
Background technology
In recent years, with the rapid development of information technology, global data has kept growing at an astonishing speed, and the world has marched into the big data era. In real life, many factors, such as omissions during data entry, inconsistent measurement rules, and the limitations of collection conditions, cause data to be missing. Missing data not only compromise the integrity of the data, but also bias the conclusions of data mining and data analysis. To avoid this, the missing data are often filled in beforehand. Information completion for big data has become one of the important data preprocessing problems in the data mining field, yet traditional big data completion methods generally suffer from low filling accuracy and limited robustness to high missing rates.
There is therefore an urgent need for an algorithm that achieves both good filling accuracy and strong robustness to high missing rates on missing data sets, and that is also well suited to large-scale data set environments.
Summary of the invention
The technical problem to be solved by the invention is to provide an information completion method for big data that has high filling accuracy and strong robustness to high missing rates, and that adapts to large-scale data sets by using a distributed filling method.
The present invention adopts the following technical scheme to solve the above technical problem:
An information completion method for big data, comprising the following steps:
Step 1: let the data set be D, with m rows and n columns of data; each row is a data tuple and each column is an attribute. Scan and number every data tuple Dj, j = 1, …, m, of the data set, and find the incomplete data tuples and the positions of the missing data within the incomplete tuples.
Step 2: combine the other, non-missing data in each incomplete data tuple to obtain the set of non-missing-data combinations of that tuple, which serves as the evidence chain for estimating the value of the missing datum.
Step 3: from the non-missing data of the complete data tuples at the corresponding missing positions, obtain all possible values of the missing datum, and compute the probability of each possible value p of the missing datum.
Step 4: join the positions of the missing data from step 1, the sets of non-missing-data combinations from step 2, and the possible values of the missing data from step 3; then pair each non-missing-data combination in the set with the possible values of the missing datum.
Step 5: form combinations of the non-missing data over the whole data set and count the number of times each combination occurs.
Step 6: for each data tuple, select one of its attribute values and combine the remaining attributes of the tuple, obtaining a set of combinations; count the number of times each combination in the set occurs simultaneously with the selected attribute value in the whole data set.
Step 7: extract from the result of step 5 the quantity of each non-missing-data combination of each incomplete data tuple in the whole data set, and extract from the result of step 6 the number of times each non-missing-data combination of each incomplete data tuple occurs simultaneously with each possible value of the missing datum; compute, for each incomplete data tuple, the probability of each possible value of the missing datum conditioned on its non-missing-data combinations, and take the value corresponding to the maximum probability as the filling value of the missing datum.
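As an illustration only, and not part of the patent text, the seven steps can be sketched in-memory in Python. The data set, attribute values, and all helper names below are invented for the example; weighting the summed evidence confidences by the candidate's prior probability is one plausible reading of the scoring:

```python
from itertools import combinations
from collections import Counter

MISSING = None

def complete_dataset(data):
    """Fill each missing cell using the evidence-chain scheme of steps 1-7."""
    m = len(data)
    filled = [list(row) for row in data]
    for j, row in enumerate(data):
        for i, v in enumerate(row):
            if v is not MISSING:
                continue
            # Step 2: evidence chain = all combinations of non-missing (pos, value) pairs
            present = [(k, row[k]) for k in range(len(row)) if row[k] is not MISSING]
            chain = [frozenset(c) for r in range(1, len(present) + 1)
                     for c in combinations(present, r)]
            # Step 3: possible values p at column i, with prior P(p) = K(p) / m
            candidates = Counter(r[i] for r in data if r[i] is not MISSING)
            # Steps 5-7: score each candidate by its summed evidence confidence
            best, best_score = None, -1.0
            for p, k in candidates.items():
                prior = k / m
                score = 0.0
                for ev in chain:
                    s_c = sum(1 for r in data
                              if all(r[pos] == val for pos, val in ev))
                    s_pc = sum(1 for r in data if r[i] == p
                               and all(r[pos] == val for pos, val in ev))
                    if s_c:
                        score += prior * (s_pc / s_c)
                if score > best_score:
                    best, best_score = p, score
            filled[j][i] = best
    return filled

# toy data set mirroring the embodiment: sex, height, smokes, grades
data = [
    ["M", "tall", "yes", "good"],
    ["F", "tall", "yes", "poor"],
    ["M", "short", "no", "poor"],
    ["F", "tall", "no", "good"],
    ["M", "short", "no", MISSING],
    ["F", MISSING, "no", "good"],
]
result = complete_dataset(data)
```

On this toy data the fifth tuple's grade is estimated as "poor" (it matches the short, non-smoking male of row 3) and the sixth tuple's height as "tall".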
As a preferred scheme of the present invention, the probability of the possible value p of the missing datum described in step 3 is computed as:
P(p) = K(p) / m
wherein m is the quantity of all data tuples, K(p) is the number of times the possible value p of the missing datum occurs at the same missing position across the data tuples, and P(p) is the probability of the possible value p of the missing datum.
As a preferred scheme of the present invention, the probability described in step 7 of a possible value of the missing datum in an incomplete data tuple, conditioned on its non-missing-data combinations, is computed as:
W_p(Sj) = P(p) · Σ_{C(y,u) ∈ Sj} S(p ∪ C(y, u)) / S(C(y, u))
wherein P(p) is the probability of the possible value p of the missing datum, S(p ∪ C(y, u)) is the number of times the non-missing-data combination C(y, u) of the incomplete tuple occurs simultaneously with the possible value of the missing datum, S(C(y, u)) is the quantity of each non-missing-data combination of the incomplete tuple in the whole data set, W_p(Sj) is the confidence, i.e. probability, of the evidence chain Sj of the possible value of the missing datum, and Sj is the evidence chain of the missing datum of the incomplete data tuple numbered j.
As a preferred scheme of the present invention, steps 2 and 3 are not restricted to any order.
As a preferred scheme of the present invention, steps 5 and 6 are not restricted to any order.
Compared with the prior art, the present invention adopts the above technical scheme and has the following technical effects:
1. The method of the invention was evaluated on UCI machine learning data, from which experimental data sets were obtained by randomly removing attribute values in different proportions; the missing data were then filled in. The results show that the present invention has high filling accuracy and stability when filling missing values.
2. The present invention is suitable for big data. Most current methods process small missing data sets on a single machine, but with the development of informatization the data volume has increased sharply, and processing large-scale data sets on a single machine is clearly inappropriate. The present invention can fill large-scale data sets on a distributed data processing platform based on the Map-Reduce programming model.
3. The method of the invention is simple and easy to apply: it does not need to grasp the distribution of the data in the data set or any domain knowledge, nor does it need to train an estimation model on the data set, saving a substantial amount of time when completing the data.
Brief description of the drawings
Fig. 1 is the algorithm sequence diagram of the information completion method for big data of the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings. The embodiments described with reference to the drawings are exemplary, are only used to explain the present invention, and are not to be construed as limiting the claims.
The present invention is an improved and comprehensive method, proposed to estimate the values of missing data through evidence chains based on the sets of correlated-attribute combinations in the tuples with missing values. The algorithm first estimates the filling values of the missing values: it scans every data tuple in the whole data set, marks the tuples with missing values as incomplete data tuples, and uses the combinations of the complete attribute values in each incomplete tuple as evidence for estimating the missing value. The large number of complete attribute combinations in an incomplete tuple thus constitutes the evidence chain for estimating the missing datum. The algorithm then scans the whole data set again to count the attribute-value sets of all data tuples. The core task of the algorithm is to compute, within the evidence chain, the confidence of each piece of evidence for each estimated missing value; this yields the sum of the confidences over all evidence for each candidate value, and the candidate with the maximum confidence sum is chosen as the filling value.
To help the public understand the technical scheme of the present invention, the deficiency model involved in the present invention, the principle of missing data filling, and the parallelization of the algorithm on Map-Reduce are briefly introduced first.
1. Deficiency model
Let the data set be D, with m rows and n columns of data, i.e. data set D has m data tuples with n attributes each; then data set D can be defined as:
D = {A1, A2, A3, …, An}  (1)
wherein Ai (1 ≤ i ≤ n) denotes the i-th column attribute of data set D.
A data tuple of the data set is denoted as:
Dj = {Dj(Ai) | 1 ≤ j ≤ m, 1 ≤ i ≤ n}  (2)
wherein Dj(Ai) denotes the value of the i-th attribute of the j-th tuple of data set D, i.e. the value Xj,i of row j, column i.
Definition 1. The deficiency model is defined as follows:
Dj(Ai) = ''  (3)
wherein Dj(Ai) = '' denotes that the i-th attribute value of the j-th tuple is missing.
When some Dj(Ai) = '' exists in a data tuple, the tuple is an incomplete data tuple, denoted as:
R̄j = {Dj(Ai) | ∃ Dj(Ai) = '', 1 ≤ j ≤ m, 1 ≤ i ≤ n}  (4)
Conversely, when no Dj(Ai) = '' exists in the tuple, the tuple is a complete data tuple, denoted as:
Rj = {Dj(Ai) | Dj(Ai) != '', 1 ≤ j ≤ m, 1 ≤ i ≤ n}  (5)
Definition 2. The set of combinations of the non-missing data in an incomplete data tuple R̄j, i.e. of the attribute values correlated with the missing datum, is defined as the evidence chain for estimating the missing value:
Sj = {C(y, u) | 1 ≤ y ≤ n, 1 ≤ u ≤ y}  (6)
wherein C(y, u) is a combination of correlated attribute values for estimating the missing value, i.e. u unordered attribute values chosen from the y complete attribute values, taken as one piece of evidence for estimating the missing value.
The main goal of the algorithm is to estimate the value of the missing datum of the j-th data tuple through the set Sj.
2. The principle of missing data filling
In any data tuple Dj (1 ≤ j ≤ m) of data set D there exist attribute sets; assume A and B are two of them, with A ⊂ Dj, B ⊂ Dj and A ∩ B = ∅. A rule of the form A ⇒ B then holds in tuple Dj of data set D. The proportion of tuples in D that contain both set A and set B, i.e. A ∪ B, is denoted P(A ∪ B).
Definition 3. The support count S denotes the number of occurrences of a given set in the data set; the support count of a rule in the whole data set D is then defined as:
S(A ⇒ B) = S(A ∪ B)  (7)
Definition 4. The proportion of tuples in data set D that contain attribute set A and also contain attribute set B is the conditional probability P(B | A); the confidence of the rule A ⇒ B in the whole data set D is then defined as:
confidence(A ⇒ B) = P(B | A)  (8)
The confidence is computed as:
P(B | A) = S(A ∪ B) / S(A)  (9)
The core work of the present invention is to compute the support counts of the possible values of the missing datum together with their relevant evidence chains, i.e. S(p ∪ Sj), and then compute the confidence of each piece of evidence in the chain; the confidences of the evidence are summed to obtain the confidence of the relevant evidence chain for that possible value, and the possible value with the maximum chain confidence is finally taken as the filling value of the missing datum.
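For illustration only, not from the patent, the support count of Definition 3 and the confidence of Definition 4 can be computed directly over a list of tuples; the toy data and function names below are invented, with item sets represented as sets of (column, value) pairs:

```python
def support(data, itemset):
    """S(itemset): number of tuples containing every (position, value) pair."""
    return sum(1 for row in data
               if all(row[pos] == val for pos, val in itemset))

def confidence(data, a, b):
    """Confidence of rule A => B, i.e. P(B|A) = S(A ∪ B) / S(A)."""
    s_a = support(data, a)
    return support(data, a | b) / s_a if s_a else 0.0

# toy tuples: sex, height, smokes, grades
data = [
    ["M", "tall", "yes", "good"],
    ["F", "tall", "yes", "poor"],
    ["M", "short", "no", "poor"],
    ["F", "tall", "no", "good"],
]
A = {(1, "tall")}            # antecedent: height = tall
B = {(3, "good")}            # consequent: grade = good
s_ab = support(data, A | B)  # tuples that are both tall and good
conf = confidence(data, A, B)
```

Here three tuples are tall, two of which are also good, so the rule's confidence is 2/3.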
3. Parallelization of the algorithm on Map-Reduce
Map-Reduce is a parallel programming framework and currently the most popular computing model on cloud platforms. Its basic idea is a divide-and-conquer strategy over large-scale data sets, computing on data in Key/Value form. The core of MapReduce parallelization lies in the two operations Map and Reduce: the Map-Reduce framework first splits the data set into many equal-sized small files distributed to different nodes; each node performs a Map computation, the results are sorted and merged, and the Values sharing the same Key are placed in the same set for the Reduce computation.
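As a generic illustration of the Map/shuffle/Reduce pattern described above, not code from the patent, the three phases can be simulated in-memory in a few lines of Python; all names are invented:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted values by key (the
    shuffle), then apply reducer to each (key, value-list) group."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):   # Map: emit (key, value) pairs
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

# word count, the canonical Map-Reduce example
lines = ["big data", "big evidence", "data data"]
counts = run_mapreduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
# counts == {"big": 2, "data": 3, "evidence": 1}
```

On a real cluster the shuffle and the per-key Reduce run distributed across nodes; the in-memory simulation only shows the data flow.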
The present invention provides the algorithm based on the Map-Reduce programming framework to realize its distributed operation. As shown in Fig. 1, the algorithm is broadly divided into 5 stages: it first estimates the values of the missing data in the data set using the sets of correlated-attribute combinations of the missing data, and then fills the estimated values into the data set.
Stage 1. The algorithm scans the data set, marks each incomplete data tuple with a unique number, and records the position of the missing datum in every incomplete tuple, so as to determine which attribute is missing in the tuple. Every output record contains the number of an incomplete data tuple, the position of the missing datum in that tuple, and the incomplete tuple itself. These records constitute the destination file of this stage.
Stage 2. This stage is divided into 4 modules, which can be executed simultaneously.
Module 1. The algorithm scans the destination file of stage 1 and computes, for each incomplete data tuple, the set Sj of combinations C(y, u) of its non-missing attribute values; Sj serves as the evidence chain for estimating the missing value. Every output record contains the number of an incomplete data tuple, the position of the missing datum in that tuple, and the set Sj of non-missing-data combinations of that tuple. These records constitute the destination file of this module.
Module 2. The algorithm counts, for each attribute of the data set, its values p and the probability P(p) of each value; the value of a missing datum will come from these p:
P(p) = K(p) / m  (10)
In the formula, K(·) denotes a counting function; K(p) denotes the number of times attribute value p occurs on the same attribute in the whole data set, and m denotes the quantity of data tuples.
Every output record contains the position of an attribute value, the value p, and the probability P(p) of p. These records constitute the output file of this module.
Module 3. The algorithm counts the quantity Oj of each combination set C(y, u) of the non-missing data of each data tuple in the whole data set, to be queried in a later step of the algorithm when estimating the probabilities of the missing values. Every output record contains a combination set C(y, u) of non-missing data of a data tuple and its quantity Oj. These records constitute the output file of this module.
Module 4. The algorithm counts the quantity Tj of times a non-missing-data combination C(y, u) of a data tuple and a certain attribute value, which must not itself appear in C(y, u), occur simultaneously, i.e. in the same data tuple, in the whole data set. Concretely, the algorithm first scans a data tuple and selects each of its attribute values in turn; after selecting an attribute value, it forms all permutations and combinations of the remaining attributes of the tuple, obtaining a set of combinations. It then counts, for each combination in the set, the quantity Tj of data tuples in the whole data set that contain both that combination and the selected attribute value:
Tj = K(Dj(Ai) (1 ≤ j ≤ m, 1 ≤ i ≤ n) ∪ C(y, u) (1 ≤ y ≤ n, 1 ≤ u ≤ y))  (11)
Every output record contains the position of an attribute value in a data tuple, the attribute value, a combination from the permutations of the remaining attribute values of the tuple, and the quantity of co-occurrences of that combination and the attribute value in the same data tuple.
Stage 3. The algorithm joins the evidence chains Sj on which the estimation of the missing data relies, output by module 1 of stage 2, with the attribute-value records output by module 2 of stage 2. This yields, for each incomplete data tuple, the combinations C(y, u) of its non-missing data together with each possible filling value p of the missing datum and the probability P(p) with which p occurs in the whole data set.
Concretely, the algorithm first joins, by missing position, the records output by module 1 of stage 2 with the records output by module 2 of stage 2, obtaining for each incomplete data tuple a record that contains the position of the missing datum, the possible filling values of the missing datum, and the set Sj of non-missing-data combinations of the tuple.
The algorithm then pairs each non-missing-data combination C(y, u) in the Sj of that record with each possible value p of the missing datum.
Every record output by this stage contains a combination C(y, u) of the non-missing data of an incomplete tuple, the position of the missing datum in that tuple, a possible value of the missing datum, and the probability P(p) with which that possible value p occurs in the whole data set.
Stage 4. For each record of the destination file of stage 3, the algorithm looks up in the destination file of module 3 of stage 2 the quantity Oj of the combination C(y, u) of non-missing data of the incomplete tuple, so that S(C(y, u)) = Oj, and looks up in the destination file of module 4 of stage 2 the number of times Tj the combination C(y, u) of non-missing data of the incomplete tuple and the possible value p of the missing datum occur simultaneously in the whole data set, so that S(p ∪ C(y, u)) = Tj. We can then compute the probability of every possible missing value conditioned on the non-missing-attribute-value combinations C(y, u) of its incomplete data tuple.
We choose the estimate with the maximum probability, i.e. the maximum evidence-chain confidence, as the final filling value.
Stage 5. The algorithm fills the possible values of the missing data estimated in stage 4 into the original missing data set D.
Stages 1 to 4 estimate the values of the missing data. Since in most data sets there is no obvious causal connection between attributes, but there can be correlations between them, we embody these correlations through the sets of attribute-value combinations correlated with the missing data. In these stages Map-Reduce mainly completes the computation of the correlated-attribute-value combination sets of the missing data in the incomplete tuples, and estimates the values of the missing data.
Stage 1, mark missing data collection
Input:Data file containing missing values.
Output:Data tuple label, data tuple.
Map<Object,Text,Text,Text>
Input:Key=offset, value=tuple
1.FOR each<key,value>DO
2.ADD tupleindex into tuple
3.FOR each<attri-v,tuple>DO
IF tuple contains missingvalue THEN
Outkey:tupleindex
Outvalue:tuple
Reduce<Text,Text,Text,Text>
1.FOR each in valuelist DO
Outkey:key
Outvalue:tuple
In stage 1 the Map function scans the data set and adds a label tupleindex to each data tuple; the final output data format of the Reduce function is (tupleindex, tuple).
Stage 2. This stage is divided into 4 modules, which can be executed simultaneously.
Module 1, missing data association attributes composite set
Input:The destination file of stage 1.
Output:Data tuple label, missing values position, missing values correlation attribute value composite set.
Map<Object,Text,Text,Text>
Input:Key=offset, value=tuple
1.FOR each<key,value>DO
2.IF tuple contains missingvalue THEN
ADD rest complete attribute into set comple-attri
3.Calculation complete attribute combination combi-attri in comple-attri
4.Outkey:tupleindex+missingindex
Outvalue:combi-attri
Reduce<Text,Text,Text,Text>
1.FOR each in valuelist DO
Outkey:key
Outvalue:comple-attri
In module 1, missingindex is the position of the missing datum in the incomplete data tuple. The Map function adds each combination comple-attri of the attribute values correlated with the missing datum in the incomplete tuple to the set combi-attri of attribute combinations. The final output data format of the Reduce function is (tupleindex, missingindex, combi-attri).
Module 2: possible values of the missing data
Input:The destination file of stage 1.
Output:Data tuple label, possible values of the missing datum, probabilities of the possible values.
Map<Object,Text,Text,Text>
Input:Key=offset, value=tuple
1.FOR each<key,value>DO
2.FOR each<attri-v,tuple>DO
Outkey:attriindex
Outvalue:attri-v
Reduce<Text,Text,Text,Text>
1.FOR each in valuelist DO
Add value into list-pro
2.Calculation valuelist-pro length divided by m as pro
3.Outkey:attriindex
Outvalue:attriindex+list-pro+pro
In module 2 the algorithm scans each data tuple; the Map function records the value attri-v of each attribute and outputs each attribute number attriindex. The Reduce function derives the list list-pro of possible values of each attribute and the probability pro of each possible value.
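By way of illustration, with invented names rather than the patent's code, module 2's per-attribute value probabilities P(p) = K(p)/m can be computed with a Counter over one column:

```python
from collections import Counter

MISSING = None

def attribute_value_probs(data, col):
    """P(p) = K(p) / m for every non-missing value p in column `col`."""
    m = len(data)
    counts = Counter(row[col] for row in data if row[col] is not MISSING)
    return {p: k / m for p, k in counts.items()}

# toy data mirroring the embodiment: sex, height, smokes, grades
data = [
    ["M", "tall", "yes", "good"],
    ["F", "tall", "yes", "poor"],
    ["M", "short", "no", "poor"],
    ["F", "tall", "no", "good"],
    ["M", "short", "no", MISSING],
    ["F", MISSING, "no", "good"],
]
height_probs = attribute_value_probs(data, 1)
# tall appears 3 times and short 2 times out of m = 6 tuples
```

Note that, as in formula (10), the denominator is the total tuple count m, not the count of non-missing entries in the column.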
The quantity of module 3, statistical attribute value composite set
Input:The destination file of stage 1.
Output:Property value composite set, property value composite set quantity.
Map<Object,Text,Text,Text>
Input:Key=offset, value=tuple
1.FOR each<key,value>DO
Calculation C(y,u)in tuple as combi-attri
2.Outkey:combi-attri
Outvalue:1
Reduce<Text,Text,Text,Text>
1.FOR each in valuelist DO
Calculation number of combi-attri
2.Outkey:combi-attri
Outvalue:num_c
In module 3 the Map function computes the attribute-value combinations combi-attri in each data tuple, and the Reduce function computes the quantity num_c of each attribute-value combination of each data tuple in the whole data set.
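Module 3's combination counting can be illustrated, again with invented names, by enumerating all combinations of (position, value) pairs per tuple and tallying them over the whole data set:

```python
from itertools import combinations
from collections import Counter

MISSING = None

def combination_counts(data):
    """Count, over the whole data set, every combination C(y, u) of
    non-missing (position, value) pairs appearing within a tuple."""
    counts = Counter()
    for row in data:
        present = [(i, v) for i, v in enumerate(row) if v is not MISSING]
        for r in range(1, len(present) + 1):
            for combo in combinations(present, r):
                counts[frozenset(combo)] += 1
    return counts

# toy data mirroring the embodiment: sex, height, smokes, grades
data = [
    ["M", "tall", "yes", "good"],
    ["F", "tall", "yes", "poor"],
    ["M", "short", "no", "poor"],
    ["F", "tall", "no", "good"],
    ["M", "short", "no", MISSING],
    ["F", MISSING, "no", "good"],
]
counts = combination_counts(data)
# e.g. <tall> occurs in 3 tuples, <male, short> in 2
```

Using frozensets as keys makes the combinations unordered, matching the unordered attribute-value combinations C(y, u) of Definition 2.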
Module 4: counting the quantity of times an attribute-value combination and a certain attribute value appear in the same data tuple in the whole data set
Input:The destination file of stage 1.
Output:Attribute-value combination, position of the attribute value, the attribute value, and the quantity of times the combination and the attribute value appear in the same data tuple.
Map<Object,Text,Text,Text>
Input:Key=offset, value=tuple
1.FOR each<key,value>DO
2.FOR each<attri-v,tuple>DO
Calculation C(y,u)in rest complete attribute as combi-attri
3.Outkey:combi-attri+attriindex+attri-v
Outvalue:1
Reduce<Text,Text,Text,Text>
1.FOR each in valuelist DO
Calculation number of combi-attri+attriindex+attri-v
2.Outkey:combi-attri+attriindex+attri-v
Outvalue:num_caa
In module 4 the Map function scans each data tuple, selects the attribute values attri-v of the tuple in turn, and then computes the attribute-value combinations combi-attri over the remaining attribute values; the Reduce function counts the quantity num_caa of times combi-attri and attri-v appear in the same data tuple in the whole data set.
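Module 4's co-occurrence counting can be sketched as follows; the helper and the toy tuples are invented for the example, and the quantity tallied corresponds to Tj of formula (11):

```python
from itertools import combinations
from collections import Counter

MISSING = None

def cooccurrence_counts(data):
    """For each tuple, pick each attribute value in turn and count how often
    it co-occurs, in the same tuple, with each combination of the remaining
    non-missing values (module 4's quantity T_j)."""
    counts = Counter()
    for row in data:
        present = [(i, v) for i, v in enumerate(row) if v is not MISSING]
        for picked in present:
            rest = [pv for pv in present if pv != picked]
            for r in range(1, len(rest) + 1):
                for combo in combinations(rest, r):
                    counts[(picked, frozenset(combo))] += 1
    return counts

# toy complete tuples: sex, height, smokes, grades
data = [
    ["M", "tall", "yes", "good"],
    ["F", "tall", "yes", "poor"],
    ["M", "short", "no", "poor"],
    ["F", "tall", "no", "good"],
]
counts = cooccurrence_counts(data)
# e.g. grade "good" co-occurs with <tall> in 2 tuples (rows 1 and 4)
```

Keying on (picked value, combination) pairs lets stage 4 later look up S(p ∪ C(y, u)) directly.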
Stage 3: joining the correlated-attribute-value combination sets of the missing data with the possible values of the missing data
Input:The destination file of module 1 of stage 2 and the destination file of module 2 of stage 2.
Output:Correlated-attribute-value combinations of the incomplete tuple, missing-value position, possible values.
Map<Object,Text,Text,Text>
Input:Key=offset, value=missingindex+combi-attri
1.FOR each<key,value>DO
Split the value
2.Outkey:missingindex
Outvalue:combi-attri
Map<Object,Text,Text,Text>
Input:Key=offset, value=missingindex+pro-v
1.FOR each<key,value>DO
Split the value
2.Outkey:missingindex
Outvalue:pro-v
Reduce<Text,Text,Text,Text>
1.FOR each in valuelist DO
2.Outkey:offset
Outvalue:combi-attri+missingindex+pro-v
In stage 3, pro-v is a possible value of the missing datum. The first Map splits each row of data of the destination file of module 1 and submits missingindex as the key and combi-attri as the value to the Reduce. The second Map splits each row of data of the possible-value file of the missing values, with missingindex as the key and pro-v as the value, and likewise submits them to the Reduce. Values with identical keys are placed in the same valuelist, and the Reduce joins the correlated-attribute combinations combi-attri of the missing data with the possible values pro-v; the final output data format is (combi-attri, missingindex, pro-v).
Stage 4: estimating the possible values of the missing values
Input:The destination file CAacount of module 3 of stage 2, the destination file CA-Aacount of module 4 of stage 2, and the destination file of stage 3.
Output:The estimate of missing data.
Map<Object,Text,Text,Text>
Input:Key=offset, value=missingindex+combi-attri+pro-v
1.FOR each<key,value>DO
Split the value
2.Search count of combi-attri in CAacount recorded as num-combi-attri
3.Search count of combi-attri+pro-v in CA-Aacount recorded as num-combi-attri-a
4.Calculation num-combi-attri-a/num-combi-attri as credibility
5.Outkey:tupleindex+missingindex
Outvalue:credibility+pro-v
Reduce<Text,Text,Text,Text>
1.FOR each in valuelist DO
Sum of credibility
2.IF sum of credibility is maximum THEN
Outkey:offset
Outvalue:pro-v
Stage 4 is the core of the algorithm and is used to estimate the missing values. The Map function first splits each row of data of the result file of stage 3; it looks up in file CAacount the number num-combi-attri of the correlated-attribute combination combi-attri of the missing datum, i.e. S(combi-attri), and looks up in file CA-Aacount the number num-combi-attri-a of times the correlated-attribute combination of the missing datum and the possible value of the missing value, combi-attri+pro-v, appear simultaneously in the same data tuple, i.e. S(combi-attri ∪ pro-v). From these it computes the credibility of the possible value of the missing datum. The Reduce function adds up all the credibility evidence for each estimated missing value and takes the possible value pro-v with the maximum evidence sum as the final filling value.
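Stage 4's selection step, summing the per-evidence credibilities S(p ∪ C)/S(C) for each candidate value and keeping the maximum, can be sketched as below; the count dictionaries and all names are invented for the example:

```python
def pick_filling_value(chain, candidates, s_combo, s_combo_value):
    """For each candidate value p, sum the credibility S(p ∪ C) / S(C)
    over every evidence C in the chain; return the argmax and its score."""
    best, best_sum = None, -1.0
    for p in candidates:
        total = 0.0
        for combo in chain:
            s_c = s_combo.get(combo, 0)
            if s_c:
                total += s_combo_value.get((combo, p), 0) / s_c
        if total > best_sum:
            best, best_sum = p, total
    return best, best_sum

# invented toy counts: two pieces of evidence, two candidate grades
chain = ["<male>", "<short,no>"]
s_combo = {"<male>": 3, "<short,no>": 2}                 # S(C(y,u))
s_combo_value = {("<male>", "good"): 1,                  # S(p ∪ C(y,u))
                 ("<male>", "poor"): 1,
                 ("<short,no>", "poor"): 1}
value, score = pick_filling_value(chain, ["good", "poor"], s_combo, s_combo_value)
```

With these counts "good" scores 1/3 and "poor" scores 1/3 + 1/2, so "poor" is chosen as the filling value.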
Stage 5: filling the values estimated for the missing values in stage 4 into the original missing data set
Input:The original missing-data-set file and the file of values estimated for the missing values in stage 4.
Output:Complete data set.
Map<Object,Text,Text,Text>
Input:Key=offset, value=tuple
1.FOR each<key,value>DO
2.Outkey:offset
Outvalue:value
Map<Object,Text,Text,Text>
Input:Key=offset, value=missingindex+pro-v
1.FOR each<key,value>DO
2.Outkey:offset
Outvalue:missingindex+pro-v
Reduce<Text,Text,Text,Text>
1.FOR each in valuelist DO
missingindex+pro-v in listA
2.FOR each in ListA DO
pro-v append to value
3.Outkey:key
Outvalue:com-tuple
In stage 5 the Map functions output the offsets of the original missing-data-set file and of the estimated-value file as keys, and output the values of the original missing data set and the "missingindex+pro-v" pairs of the estimated-value file as values. In each valuelist the Reduce function stores "missingindex+pro-v" in listA, fills all the estimates in listA into the values of the missing data set, and finally outputs the complete data tuples com-tuple.
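Stage 5's fill step, joining the estimates keyed by tuple index back into the original rows, can be sketched as follows; the dictionary shape and all names are invented for the example:

```python
MISSING = None

def fill_dataset(data, estimates):
    """estimates maps tuple index -> (missing position, estimated value);
    returns a completed copy of the data set, leaving the input untouched."""
    filled = [list(row) for row in data]
    for j, (missing_index, value) in estimates.items():
        filled[j][missing_index] = value
    return filled

# two incomplete toy tuples and their stage-4 estimates
data = [
    ["M", "short", "no", MISSING],
    ["F", MISSING, "no", "good"],
]
estimates = {0: (3, "poor"), 1: (1, "tall")}
complete = fill_dataset(data, estimates)
# complete == [["M", "short", "no", "poor"], ["F", "tall", "no", "good"]]
```

Copying the rows before filling mirrors the stage's output of new complete tuples com-tuple rather than in-place mutation of the input file.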
Embodiment:The data set has 4 attributes: sex, height, whether the person smokes, and school grades. '' denotes a missing datum.
Sex     Height   Smokes   Grades
Male    Tall     Yes      Good
Female  Tall     Yes      Poor
Male    Short    No       Poor
Female  Tall     No       Good
Male    Short    No       ''
Female  ''       No       Good
Stage 1:
The 4th attribute of the fifth tuple is missing.
The 2nd attribute of the sixth tuple is missing.
Destination file:5, 4, [Male, Short, No, '']
6, 2, [Female, '', No, Good], where [] denotes a tuple.
Stage 2:
Module 1:
Destination file:5, 4, {<Male>, <Short>, <No>, <Male Short>, <Male No>, <Short No>, <Male Short No>}
6, 2, {<Female>, <No>, <Good>, <Female No>, <Female Good>, <No Good>, <Female No Good>}
{} denotes a set.
Module 2:
Destination file:1, Male: 50%, Female: 50%
2, Tall: 50%, Short: 33%
3, Yes: 33%, No: 67%
4, Good: 50%, Poor: 33%
The output is the attribute position and the probability of each attribute value.
Module 3:
Destination file:
<man>: 3
<female>: 3
<tall>: 3
<short>: 2
<yes>: 2
<no>: 4
<good>: 3
<poor>: 2
<man, tall>: 1
<man, short>: 2
<man, yes>: 1
<man, no>: 2
<man, good>: 1
<tall, yes>: 2
<tall, good>: 2
<short, yes>: 0
<short, good>: 0
<yes, good>: 1
<man, tall, yes>: 1
<man, tall, good>: 1
<man, yes, good>: 1
<tall, yes, good>: 1……
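Module 3's counts for the embodiment can be reproduced with a short sketch (variable names are illustrative; a combination is keyed as a tuple of values in attribute order):

```python
from collections import Counter
from itertools import combinations

# Embodiment data set; None marks a missing value.
data = [
    ["man",    "tall",  "yes", "good"],
    ["female", "tall",  "yes", "poor"],
    ["man",    "short", "no",  "poor"],
    ["female", "tall",  "no",  "good"],
    ["man",    "short", "no",  None],
    ["female", None,    "no",  "good"],
]

# Module 3: count every non-empty combination of the non-missing values
# of each tuple, over the whole data set.
counts = Counter(
    combo
    for tup in data
    for present in [[v for v in tup if v is not None]]
    for r in range(1, len(present) + 1)
    for combo in combinations(present, r)
)
```

A Counter returns 0 for unseen keys, which matches records such as <short, yes>: 0 in the destination file.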
Module 4:
Taking the first data tuple as an example, with the value "man" of attribute 1 selected:
Result: 1, man, <tall>, 1
1, man, <yes>, 1
1, man, <good>, 1
1, man, <tall, yes>, 1
1, man, <tall, good>, 1
1, man, <yes, good>, 1
1, man, <tall, yes, good>, 1……
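Module 4's co-occurrence counting can be sketched similarly (the name `pair_counts` and the key shape `(position, value, combination)` are illustrative assumptions):

```python
from collections import Counter
from itertools import combinations

# Embodiment data set; None marks a missing value.
data = [
    ["man",    "tall",  "yes", "good"],
    ["female", "tall",  "yes", "poor"],
    ["man",    "short", "no",  "poor"],
    ["female", "tall",  "no",  "good"],
    ["man",    "short", "no",  None],
    ["female", None,    "no",  "good"],
]

# Module 4: for each tuple, pair each (attribute position, value) with every
# combination of the remaining values, and count the pairs over the data set.
pair_counts = Counter()
for tup in data:
    for pos, value in enumerate(tup, start=1):
        if value is None:
            continue
        rest = [v for i, v in enumerate(tup, start=1)
                if i != pos and v is not None]
        for r in range(1, len(rest) + 1):
            for combo in combinations(rest, r):
                pair_counts[(pos, value, combo)] += 1
```

Stage 4 reads these aggregated counts as the T_j values, e.g. "good, <no>: 2".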
Stage 3:
Taking incomplete data tuple 5 as an example: according to the missing position 4, the stage-2 module-1 record
5, 4, {<man>, <short>, <no>, <man, short>, <man, no>, <short, no>, <man, short, no>}
is joined with record 4 of stage-2 module 2, "4, good: 50%, poor: 33%".
The joined record is:
5, good: 50%, poor: 33%, {<man>, <short>, <no>, <man, short>, <man, no>, <short, no>, <man, short, no>}
The algorithm then joins each non-missing data combination C(y,u) in S_j of this record with each possible value p of the missing data.
The final result is output:
<man>, 4, good: 50% + <short>, 4, good: 50% + <no>, 4, good: 50% + <man, short>, 4, good: 50% + <man, no>, 4, good: 50% + <short, no>, 4, good: 50% + <man, short, no>, 4, good: 50% + <man>, 4, poor: 33% + <short>, 4, poor: 33% + <no>, 4, poor: 33% + <man, short>, 4, poor: 33% + <man, no>, 4, poor: 33% + <short, no>, 4, poor: 33% + <man, short, no>, 4, poor: 33%
Stage 4:
For each incomplete data tuple in the stage-3 destination file, the algorithm looks up in the destination file of stage-2 module 3 the count O_j of each non-missing data combination C(y,u).
Result: <man>: 3
<short>: 2
<no>: 4
<man, short>: 2
<man, no>: 2
<short, no>: 2
<man, short, no>: 2
It also looks up in the destination file of stage-2 module 4 the count T_j of times each non-missing data combination C(y,u) of the incomplete tuple occurs simultaneously with the possible value p of the missing data in the whole data set.
Result: good, <man>: 1
good, <short>: 0
good, <no>: 2
good, <man, short>: 0
good, <man, no>: 0
good, <short, no>: 0
good, <man, short, no>: 0
poor, <man>: 1
poor, <short>: 1
poor, <no>: 1
poor, <man, short>: 1
poor, <man, no>: 1
poor, <short, no>: 1
poor, <man, short, no>: 1
F({<man>, <short>, <no>, <man, short>, <man, no>, <short, no>, <man, short, no>} ⇒ good) = 1/3 + 0/2 + 2/4 + 0/2 + 0/2 + 0/2 + 0/2 + 50% = 1.33
F({<man>, <short>, <no>, <man, short>, <man, no>, <short, no>, <man, short, no>} ⇒ poor) = 1/3 + 1/2 + 1/4 + 1/2 + 1/2 + 1/2 + 1/2 + 33% = 3.41
Stage 5: the maximum is taken, so the filling value for the missing 4th attribute of tuple 5 is "poor".
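The whole estimation can be reproduced end to end for the embodiment. This sketch uses exact priors rather than the patent's rounded 50%/33%, so F comes out 1.33… and 3.42 rather than 1.33 and 3.41; the names `subsets` and `fill` are illustrative assumptions:

```python
from collections import Counter
from itertools import combinations

# Embodiment data set; None marks a missing value.
data = [
    ["man",    "tall",  "yes", "good"],
    ["female", "tall",  "yes", "poor"],
    ["man",    "short", "no",  "poor"],
    ["female", "tall",  "no",  "good"],
    ["man",    "short", "no",  None],
    ["female", None,    "no",  "good"],
]

def subsets(values):
    """All non-empty combinations of a list of attribute values."""
    return [c for r in range(1, len(values) + 1)
            for c in combinations(values, r)]

# O_j: occurrences of each combination over the whole data set (module 3).
O = Counter(c for tup in data
            for c in subsets([v for v in tup if v is not None]))

# T_j: co-occurrences of each (position, value) with every combination
# of the remaining values of the same tuple (module 4).
T = Counter()
for tup in data:
    for pos, value in enumerate(tup, start=1):
        if value is None:
            continue
        rest = [v for i, v in enumerate(tup, start=1)
                if i != pos and v is not None]
        for c in subsets(rest):
            T[(pos, value, c)] += 1

def fill(tup, pos):
    """Score each candidate value for the missing position with
    F = P(p) + sum(T_j / O_j) over the evidence chain; pick the maximum."""
    chain = subsets([v for i, v in enumerate(tup, start=1)
                     if i != pos and v is not None])
    candidates = {t[pos - 1] for t in data if t[pos - 1] is not None}
    scores = {}
    for p in candidates:
        prior = sum(1 for t in data if t[pos - 1] == p) / len(data)  # P(p)
        scores[p] = prior + sum(T[(pos, p, c)] / O[c] for c in chain if O[c])
    best = max(scores, key=scores.get)
    return best, scores[best]

value, f_value = fill(data[4], 4)  # tuple 5, missing attribute 4
```

For tuple 5 this selects "poor", matching the patent's worked result.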
The above embodiment only illustrates the technical idea of the present invention and does not limit its protection scope; any change made on the basis of the technical solution in accordance with the technical idea proposed by the present invention falls within the protection scope of the present invention.

Claims (5)

1. An information completion method for big data, characterized by comprising the following steps:
Step 1: let the data set be D, with m rows and n columns, each row being a data tuple and each column an attribute; scan and number each data tuple D_j, j = 1, …, m, of the data set, and find the incomplete data tuples and the positions of the missing data within them;
Step 2: combine the other, non-missing data in each incomplete data tuple to obtain the set of non-missing data combinations of that tuple, which serves as the chain of evidence for estimating the missing value;
Step 3: from the non-missing data of complete data tuples at the corresponding missing position, obtain the possible values of the missing data, and compute the probability of each possible value p;
Step 4: according to the missing-data position from step 1, join the set of non-missing data combinations obtained in step 2 with the possible missing-data values obtained in step 3, then join each non-missing data combination in the set with each possible value of the missing data;
Step 5: combine the non-missing data of the whole data set and count the number of occurrences of each combination;
Step 6: for each data tuple, select one of its attribute values and combine the remaining attributes of the tuple to obtain a set of combinations; count the number of times each combination in the set occurs simultaneously with the selected attribute value in the whole data set;
Step 7: extract from the result of step 5 the count, over the whole data set, of each non-missing data combination in each incomplete data tuple; extract from the result of step 6 the number of times each non-missing data combination in the incomplete data tuple occurs simultaneously with each possible value of the missing data; compute the probability of each possible missing value under the condition of the tuple's non-missing data combinations, and take the value with the maximum probability as the filling value of the missing data.
2. The information completion method for big data according to claim 1, characterized in that the probability of a possible missing-data value p in step 3 is computed as:
P(p) = K(p) / m,
where m is the number of all data tuples, K(p) is the number of times the possible value p occurs at the same missing position in the data tuples, and P(p) is the probability of the possible value p.
3. The information completion method for big data according to claim 1, characterized in that the probability in step 7 of a possible missing-data value under the condition of the incomplete tuple's non-missing data combinations is computed as:
F(S_j ⇒ p) = Σ S(p ∪ C(y,u)) / S(C(y,u)) + P(p),
where P(p) is the probability of the possible value p, S(p ∪ C(y,u)) is the number of times the non-missing data combination C(y,u) of the incomplete tuple occurs simultaneously with the possible value of the missing data, S(C(y,u)) is the count of each non-missing data combination of the incomplete tuple in the whole data set, F(S_j ⇒ p) is the confidence (probability) of the chain of evidence S_j for the possible missing value, and S_j is the chain of evidence of the missing data of the incomplete data tuple numbered j.
4. The information completion method for big data according to claim 1, characterized in that steps 2 and 3 are not constrained to any order.
5. The information completion method for big data according to claim 1, characterized in that steps 5 and 6 are not constrained to any order.
CN201710156391.XA 2017-03-16 2017-03-16 A kind of information completion method towards big data Pending CN106919719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710156391.XA CN106919719A (en) 2017-03-16 2017-03-16 A kind of information completion method towards big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710156391.XA CN106919719A (en) 2017-03-16 2017-03-16 A kind of information completion method towards big data

Publications (1)

Publication Number Publication Date
CN106919719A true CN106919719A (en) 2017-07-04

Family

ID=59460304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710156391.XA Pending CN106919719A (en) 2017-03-16 2017-03-16 A kind of information completion method towards big data

Country Status (1)

Country Link
CN (1) CN106919719A (en)


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200278471A1 (en) * 2017-09-12 2020-09-03 Schlumberger Technology Corporation Dynamic representation of exploration and/or production entity relationships
US11619761B2 (en) * 2017-09-12 2023-04-04 Schlumberger Technology Corporation Dynamic representation of exploration and/or production entity relationships
CN107766294A (en) * 2017-10-31 2018-03-06 北京金风科创风电设备有限公司 Method and device for recovering missing data
CN107958027A (en) * 2017-11-16 2018-04-24 南京邮电大学 A kind of Sensor Network data capture method ensured with QoS
CN110413658A (en) * 2019-07-23 2019-11-05 中经柏诚科技(北京)有限责任公司 A kind of chain of evidence construction method based on the fact the correlation rule
CN111737463A (en) * 2020-06-04 2020-10-02 江苏名通信息科技有限公司 Big data missing value filling method, device and computer program
CN111737463B (en) * 2020-06-04 2024-02-09 江苏名通信息科技有限公司 Big data missing value filling method, device and computer readable memory
CN116578557A (en) * 2023-03-03 2023-08-11 齐鲁工业大学(山东省科学院) Missing data filling method for data center
CN116578557B (en) * 2023-03-03 2024-04-02 齐鲁工业大学(山东省科学院) Missing data filling method for data center

Similar Documents

Publication Publication Date Title
CN106919719A (en) A kind of information completion method towards big data
CN107220277A (en) Image retrieval algorithm based on cartographical sketching
CN104765876B (en) Magnanimity GNSS small documents cloud storage methods
CN104572965A (en) Search-by-image system based on convolutional neural network
CN109359172A (en) A kind of entity alignment optimization method divided based on figure
CN107004141A (en) To the efficient mark of large sample group
CN104572833B (en) A kind of mapping ruler creation method and device
CN105320764A (en) 3D model retrieval method and 3D model retrieval apparatus based on slow increment features
CN110888859B (en) Connection cardinality estimation method based on combined deep neural network
CN115145906B (en) Preprocessing and completion method for structured data
CN104008420A (en) Distributed outlier detection method and system based on automatic coding machine
CN102902826A (en) Quick image retrieval method based on reference image indexes
CN109325062A (en) A kind of data dependence method for digging and system based on distributed computing
CN107844548A (en) A kind of data label method and apparatus
CN110533316A (en) A kind of LCA (Life Cycle Analysis) method, system and storage medium based on big data
CN114817575B (en) Large-scale electric power affair map processing method based on extended model
CN106649886A (en) Method for searching for images by utilizing depth monitoring hash of triple label
CN103744958B (en) A kind of Web page classification method based on Distributed Calculation
CN103678513B (en) A kind of interactively retrieval type generates method and system
CN103020319A (en) Real-time mobile space keyword approximate Top-k query method
Bannister et al. Windows into geometric events: Data structures for time-windowed querying of temporal point sets
CN109472343A (en) A kind of improvement sample data missing values based on GKNN fill up algorithm
CN107452001A (en) A kind of remote sensing images sequences segmentation method based on improved FCM algorithm
CN107451617A (en) One kind figure transduction semisupervised classification method
CN104731889B (en) A kind of method for estimating query result size

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20170704