CN106919719A - An information completion method for big data - Google Patents

An information completion method for big data

- Publication number: CN106919719A (application CN201710156391.XA)
- Authority: CN (China)
- Prior art keywords: data, missing, value, tuple, missing data
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
        - G06F16/20—Information retrieval of structured data, e.g. relational data
          - G06F16/22—Indexing; Data structures therefor; Storage structures
            - G06F16/2219—Large Object storage; Management thereof
          - G06F16/24—Querying
            - G06F16/245—Query processing
              - G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
                - G06F16/2462—Approximate or statistical queries
                - G06F16/2471—Distributed queries
Abstract
The invention discloses an information completion method for big data. The method exploits a key characteristic of missing data: the value of a missing item is related to the other attribute values, and combinations of attribute values, in the tuple that contains it. By mining all such relevant evidence in each tuple containing missing data, and combining these pieces of evidence into an evidence chain for estimating the missing attribute value, the method finally estimates the value of the missing data from the evidence chain. Because the evidence chain used to predict the missing value is computed directly from the original data set, the present invention not only achieves high filling accuracy and strong robustness to high missing rates when filling missing values, but is also simple and easy to apply: it needs neither the distribution of the data in the data set nor domain knowledge, and no estimation model has to be trained on the data set, saving a substantial amount of time for data completion. The method runs on the Map-Reduce distributed programming framework and can therefore complete large-scale data sets in a distributed manner.
Description
Technical field
The present invention relates to an information completion method for big data, and belongs to the field of data preprocessing technology.
Background technology
In recent years, with the rapid development of information technology, the volume of data worldwide has kept growing at an astonishing speed, and the world has entered the big-data era. In real life, many factors, such as omissions during data entry, inconsistent measurement standards, and limitations of the collection conditions, cause data to be missing. Missing data not only damages the integrity of the data, but also biases the conclusions of data mining and data analysis. To avoid this, the missing data is often filled in beforehand. Information completion for big data has become an important data-preprocessing problem in the field of data mining. Traditional big-data completion methods generally suffer from low filling accuracy and limited robustness to high missing rates. There is therefore an urgent need for an algorithm that achieves both good filling accuracy and strong robustness to high missing rates on data sets with missing values, and that is also well suited to large-scale data sets.
Summary of the invention
The technical problem to be solved by the invention is to provide an information completion method for big data that has high filling accuracy and strong robustness to high missing rates, and that adapts to large-scale data sets by using a distributed filling method.
The present invention adopts the following technical scheme to solve the above technical problem:
An information completion method for big data comprises the following steps:
Step 1: let the data set be D, with m rows and n columns; each row is a data tuple and each column is an attribute. Scan every data tuple Dj (j = 1, …, m) of the data set, number the tuples, and find the incomplete data tuples and the positions of the missing data within them;
Step 2: combine the other, non-missing data in each incomplete data tuple to obtain the set of combinations of non-missing data in that tuple, which serves as the evidence chain for estimating the value of the missing data;
Step 3: from the values that the complete data tuples hold at the position corresponding to the missing data, obtain all possible values of the missing data, and compute the probability of each possible value p;
Step 4: join the position of the missing data from step 1, the set of non-missing-data combinations from step 2, and the possible values of the missing data from step 3; then join each non-missing-data combination in the set with each possible value of the missing data;
Step 5: enumerate the combinations of non-missing data over the whole data set and count the number of times each combination occurs;
Step 6: for each data tuple, select one attribute value in turn, combine the remaining attributes of the tuple to obtain a set of combinations, and count the number of times each combination in the set occurs together with the selected attribute value in the whole data set;
Step 7: from the result of step 5, extract the count in the whole data set of each non-missing-data combination of each incomplete tuple; from the result of step 6, extract the number of times each non-missing-data combination of each incomplete tuple occurs together with each possible value of the missing data; compute, for each incomplete tuple, the probability of each possible value of the missing data conditioned on its non-missing-data combinations, and take the value with the maximum probability as the filling value of the missing data.
As a preferred embodiment of the present invention, the probability of a possible value p of the missing data in step 3 is computed as:
P(p) = K(p) / m
where m is the number of all data tuples, K(p) is the number of times the possible value p occurs at the same missing position over all data tuples, and P(p) is the probability of the possible value p of the missing data.
As a preferred embodiment of the present invention, the probability in step 7 of a possible value of the missing data conditioned on its non-missing-data combinations is computed as:
W(S_j) = P(p) × Σ_{C(y,u) ∈ S_j} S(p ∪ C(y,u)) / S(C(y,u))
where P(p) is the probability of the possible value p of the missing data, S(p ∪ C(y,u)) is the number of times the non-missing-data combination C(y,u) of the incomplete tuple occurs together with the possible value of the missing data, S(C(y,u)) is the count in the whole data set of the non-missing-data combination C(y,u) of the incomplete tuple, W(S_j) is the confidence, i.e. the probability, of the evidence chain S_j of the possible value of the missing data, and S_j is the evidence chain of the missing data of the incomplete tuple numbered j.
As a preferred embodiment of the present invention, steps 2 and 3 are not limited to a particular order.
As a preferred embodiment of the present invention, steps 5 and 6 are not limited to a particular order.
Compared with the prior art, the above technical scheme of the present invention has the following technical effects:
1. The method of the invention was evaluated on UCI machine-learning data sets from which attribute values were randomly removed in different proportions to obtain experimental data sets with missing data; filling these missing data shows that the present invention achieves high filling accuracy and stability when filling missing values.
2. The present invention is suitable for big data. Most current methods process small missing data sets on a single machine, but with the development of informatization the data volume has increased sharply, and processing large-scale data sets on a single machine is clearly inappropriate. The present invention can fill large-scale data sets on a distributed data-processing platform based on the Map-Reduce programming model.
3. The method of the invention is simple and easy to apply: it needs neither the distribution of the data in the data set nor domain knowledge, nor does it need to train an estimation model on the data set, saving a substantial amount of time for data completion.
Brief description of the drawings
Fig. 1 is the algorithm sequence diagram of an information completion method for big data according to the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the drawings. The embodiments described with reference to the drawings are exemplary, are only used to explain the present invention, and are not to be construed as limiting the claims.
The present invention is an improved and comprehensive method, proposed to estimate the value of missing data from an evidence chain based on the set of related-attribute combinations in the tuple containing the missing value. The algorithm first estimates the filling value of the missing value: it scans every data tuple in the whole data set, marks the tuples with missing values as incomplete data tuples, and uses the combinations of the different complete attribute values in each incomplete tuple as evidence for estimating the value of the missing data. The large number of complete-attribute combinations in an incomplete tuple thus constitutes the evidence chain for estimating the missing data. The algorithm then scans the whole data set again to count the attribute-value sets of all data tuples. The core task of the algorithm is to compute, for each piece of evidence in the evidence chain, the confidence of the corresponding estimate of the missing value. This yields, for each candidate value, the sum of the confidences over all its evidence, and the estimate with the maximum confidence sum is chosen as the filling value.
To help the public understand the technical scheme of the present invention, the missing-data model involved in the invention, the principle of missing-data filling, and the Map-Reduce-based parallelization of the algorithm are briefly introduced first.
1. The missing-data model
Let the data set be D with m rows and n columns; that is, D has m data tuples, and each data tuple has n attributes. The data set D can then be defined as:
D = {A1, A2, A3, …, An}  (1)
where Ai (1 ≤ i ≤ n) denotes the i-th column attribute of the data set D.
A data tuple of the data set is denoted:
Dj = {Dj(Ai) | 1 ≤ j ≤ m, 1 ≤ i ≤ n}  (2)
where Dj(Ai) denotes the value of the i-th attribute of the j-th tuple of the data set D.
Definition 1. The missing-data model is defined as follows:
Dj(Ai) = ''  (3)
where Dj(Ai) = '' denotes that the i-th attribute value of the j-th tuple is missing.
When some Dj(Ai) = '' exists in a data tuple, that tuple is an incomplete data tuple, denoted:
Ij = {Dj(Ai) | ∃ Dj(Ai) = '', 1 ≤ j ≤ m, 1 ≤ i ≤ n}  (4)
Conversely, when no Dj(Ai) = '' exists in the tuple, that tuple is a complete data tuple, denoted:
Rj = {Dj(Ai) | Dj(Ai) != '', 1 ≤ j ≤ m, 1 ≤ i ≤ n}  (5)
Definition 2. The set of combinations of the non-missing data in an incomplete data tuple, i.e. of the attribute values related to the missing data, is defined as the evidence chain for estimating the missing value:
Sj = {C(y, u) | 1 ≤ y ≤ n, 1 ≤ u ≤ y}  (6)
where C(y, u) is a related-attribute-value combination for estimating the missing value, i.e. a choice of u unordered attribute values out of the y complete attribute values of the tuple; each such combination is a piece of evidence for estimating the missing value.
The main goal of the algorithm is to estimate, from the set Sj, the value of the missing data in the j-th data tuple.
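As an illustration (not part of the patent text), the evidence chain of Definition 2 can be sketched in Python; the list representation of a tuple and the `None` missing marker are assumptions of this sketch:

```python
from itertools import combinations

def evidence_chain(tuple_values, missing_marker=None):
    """Build the evidence chain S_j: every unordered combination C(y, u)
    of the y non-missing attribute values, for u = 1 .. y."""
    complete = [v for v in tuple_values if v != missing_marker]
    chain = []
    for u in range(1, len(complete) + 1):
        chain.extend(combinations(complete, u))
    return chain

# An incomplete tuple with three complete values and one missing value:
chain = evidence_chain(["Male", "Short", "No", None])
print(len(chain))  # 7: three singletons, three pairs, one triple
```

With y = 3 complete values the chain holds C(3,1) + C(3,2) + C(3,3) = 7 pieces of evidence, matching the worked example later in the description.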
2. The principle of missing-data filling
In any data tuple Dj (1 ≤ j ≤ m) of the data set D there may exist an attribute set, say A; the data tuple Dj contains A if and only if A ⊆ Dj. A rule of the form A ⇒ B holds in a tuple Dj (1 ≤ j ≤ m) of the data set D, where A ⊆ Dj, B ⊆ Dj, and A ∩ B = ∅. The ratio of the tuples in the data set D that contain both the set A and the set B, i.e. A ∪ B, is denoted P(A ∪ B).
Definition 3. The support count S denotes the number of occurrences of a given set in the data set; the support count of a rule in the whole data set D is then defined as:
S(A ⇒ B) = S(A ∪ B)  (7)
Definition 4. The ratio of the tuples in the data set D that contain the attribute set A and also contain the attribute set B is the conditional probability P(B | A); the confidence of a rule in the whole data set D is then defined as:
Confidence(A ⇒ B) = P(B | A)  (8)
and the confidence is computed as:
P(B | A) = S(A ∪ B) / S(A)  (9)
The core work of the present invention is to compute the support count of each possible value of the missing data together with its relevant evidence, i.e. S(p ∪ C(y, u)); then to compute the confidence of each piece of evidence in the evidence chain; and to add the confidences over all the evidence to obtain the confidence of the relevant evidence chain for that possible value. The possible value with the maximum evidence-chain confidence is finally taken as the filling value of the missing data.
3. Map-Reduce-based parallelization of the algorithm
Map-Reduce is a parallel programming framework and is currently the most popular computing model on cloud-computing platforms. Its basic idea is to apply a divide-and-conquer strategy to large-scale data sets. Map-Reduce performs its computation on data in Key/Value form. The core of its parallelization is the two operations Map and Reduce: the Map-Reduce framework first splits the data set into many small files of equal size and distributes them to different nodes; each node performs the Map computation, the results are sorted and merged, and the Values with the same Key are placed in the same set for the Reduce computation.
The present invention provides an algorithm based on the Map-Reduce programming framework to realize distributed operation of the algorithm. As shown in Fig. 1, the algorithm is divided into 5 stages: it first estimates the values of the missing data in the data set using the sets of related-attribute combinations of the missing data, and then fills the estimated values into the data set.
Stage 1: the algorithm scans the data set, marks each incomplete data tuple with a unique number, and gives the position of the missing data in each incomplete tuple, to determine which attribute of the tuple is missing. Each output record contains the number of an incomplete data tuple, the position of the missing data in that tuple, and the incomplete tuple itself. These records constitute the result file of this stage.
Stage 2: this stage is divided into 4 modules, which can be executed simultaneously.
Module 1: the algorithm scans the result file of stage 1 and computes the set Sj of combinations C(y, u) of the non-missing attribute values in each incomplete tuple; Sj serves as the evidence chain for estimating the value of the missing data. Each output record contains the number of an incomplete data tuple, the position of the missing data in that tuple, and the set Sj of non-missing-data combinations of the tuple. These records constitute the result file of this module.
Module 2: the algorithm counts, for each attribute of the data set, the values p it takes and the probability P(p) of each value; the value of the missing data will come from these p:
P(p) = K(p) / m  (10)
where K(·) denotes counting, K(p) is the number of times the attribute value p occurs on the same attribute as the missing data in the whole data set, and m is the number of data tuples.
Each output record contains the position of an attribute value, the attribute value p, and its probability P(p); these records constitute the output file of this module.
Module 3: the algorithm counts the number Oj of occurrences in the whole data set of each combination C(y, u) of the non-missing data of each data tuple, for the later query of the probabilities used to estimate the value of the missing data. Each output record contains a non-missing-data combination C(y, u) of a data tuple and its count Oj. These records constitute the output file of this module.
Module 4: the algorithm counts the number Tj of times a non-missing-data combination C(y, u) and a given attribute value, which must not appear in C(y, u), occur simultaneously in the whole data set, i.e. in the same data tuple. Concretely, the algorithm scans a data tuple and selects each attribute value in turn; after selecting an attribute value, it combines the remaining attributes of the tuple into a set of combinations, and then counts, for each combination in the set, the number of times it occurs in the same data tuple as the selected attribute value over the whole data set:
Tj = K(Dj(Ai) ∪ C(y, u)), 1 ≤ j ≤ m, 1 ≤ i ≤ n, 1 ≤ y ≤ n, 1 ≤ u ≤ y  (11)
Each output record contains the position of an attribute value in a data tuple, the attribute value, a combination from the combinations of the remaining attribute values of the tuple, and the count of the combination and the attribute value in the same data tuple.
Stage 3: the algorithm joins the evidence chains Sj for estimating the missing data, output by module 1 of stage 2, with the attribute-value records output by module 2 of stage 2. This yields, for each incomplete data tuple, its non-missing-data combinations C(y, u), the possible filling values p of each missing datum, and the probability P(p) with which each p occurs in the whole data set.
Concretely, the algorithm first joins the records output by module 1 of stage 2 with the records output by module 2 of stage 2 according to the missing position, obtaining a record that contains the missing position of an incomplete tuple, the possible filling values of the missing data, and the set Sj of non-missing-data combinations of the tuple. The algorithm then joins each non-missing-data combination C(y, u) in the Sj of this record with each possible value p of the missing data.
Each record output by this stage contains a non-missing-data combination C(y, u) of an incomplete tuple, the position of the missing data in the tuple, a possible value of the missing data, and the probability P(p) with which the possible value p occurs in the whole data set.
Stage 4: in the result file of module 3 of stage 2, the algorithm looks up the count Oj of each non-missing-data combination C(y, u) of the incomplete tuples appearing in the result file of stage 3; then S(C(y, u)) = Oj. In the result file of module 4 of stage 2, it looks up the number of times Tj that a non-missing-data combination C(y, u) of an incomplete tuple and a possible value p of the missing data occur simultaneously in the whole data set; then S(p ∪ C(y, u)) = Tj. We can now compute the probability of every possible missing-data value under the condition of the non-missing-attribute-value combinations C(y, u) of its tuple, and we choose the estimate with the maximum probability, i.e. the maximum sum of evidence confidences, as the final filling value.
Stage 5: the algorithm fills the possible values of the missing data estimated in stage 4 into the original missing data set D.
Stages 1 to 4 of the algorithm estimate the values of the missing data. In most data sets there is no obvious causal relation between attributes, but there can be correlation between them; the present invention captures this correlation through the set of related-attribute-value combinations of the missing data. In these stages Map-Reduce mainly computes the set of related-attribute-value combinations of the missing data in each incomplete tuple and estimates the value of the missing data.
Stage 1: marking the missing data set
Input: the data file containing missing values.
Output: data tuple label, data tuple.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
2.   ADD tupleindex into tuple
3.   IF tuple contains missingvalue THEN
       Outkey: tupleindex
       Outvalue: tuple
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     Outkey: key
     Outvalue: tuple
In stage 1 the Map function scans the data set and adds the label tupleindex to each data tuple; the Reduce function finally outputs data in the form (tupleindex, tuple).
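A minimal single-machine sketch of stage 1 (the 1-based numbering and the `None` missing marker are assumptions of this sketch, standing in for tupleindex and the empty missing value):

```python
def mark_missing(data, missing_marker=None):
    """Stage 1 sketch: number each tuple (1-based) and emit, for each
    missing value, a (tupleindex, missingindex, tuple) record."""
    records = []
    for j, row in enumerate(data, start=1):
        for i, v in enumerate(row, start=1):
            if v == missing_marker:
                records.append((j, i, row))
    return records

data = [
    ["Male", "Tall", "Yes", "Good"],
    ["Female", "Tall", "Yes", "Poor"],
    ["Male", "Short", "No", "Poor"],
    ["Female", "Tall", "No", "Good"],
    ["Male", "Short", "No", None],
    ["Female", None, "No", "Good"],
]
records = mark_missing(data)
print([(j, i) for j, i, _ in records])  # [(5, 4), (6, 2)]
```

This matches the stage-1 result of the embodiment: tuple 5 is missing attribute 4, tuple 6 attribute 2.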
Stage 2: this stage is divided into 4 modules, which can be executed simultaneously
Module 1: the related-attribute combination set of the missing data
Input: the result file of stage 1.
Output: data tuple label, missing-value position, set of related-attribute-value combinations of the missing value.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
2.   IF tuple contains missingvalue THEN
       ADD rest complete attributes into set comple-attri
3.   Calculate complete attribute combinations combi-attri in comple-attri
4.   Outkey: tupleindex + missingindex
     Outvalue: combi-attri
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     Outkey: key
     Outvalue: combi-attri
In module 1, missingindex is the position of the missing data in the incomplete data tuple. The Map function combines the complete attribute values comple-attri related to the missing data in each incomplete tuple into the combination set combi-attri. The Reduce function finally outputs data in the form (tupleindex, missingindex, combi-attri).
Module 2: the possible values of the missing data
Input: the result file of stage 1.
Output: attribute position, possible values of the missing data, probability of each possible value.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
2.   FOR each <attri-v, tuple> DO
       Outkey: attriindex
       Outvalue: attri-v
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     ADD value into list-pro
2. Calculate count of each value in list-pro divided by m as pro
3. Outkey: attriindex
   Outvalue: attriindex + list-pro + pro
In module 2 the algorithm scans each data tuple; the Map function records the value attri-v of each attribute and outputs each attribute number attriindex. The Reduce function derives the list of possible values list-pro of each attribute and the probability pro of each possible value.
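Module 2 can be sketched as follows (a single-machine Python illustration; 1-based attribute positions and the `None` missing marker are assumptions of this sketch):

```python
from collections import Counter

def value_probabilities(data, missing_marker=None):
    """Module 2 sketch: for each attribute position (1-based), the observed
    values p and their probabilities P(p) = K(p) / m (m = tuple count)."""
    m = len(data)
    probs = {}
    for i in range(len(data[0])):
        counts = Counter(row[i] for row in data if row[i] != missing_marker)
        probs[i + 1] = {p: k / m for p, k in counts.items()}
    return probs

data = [
    ["Male", "Tall", "Yes", "Good"],
    ["Female", "Tall", "Yes", "Poor"],
    ["Male", "Short", "No", "Poor"],
    ["Female", "Tall", "No", "Good"],
    ["Male", "Short", "No", None],
    ["Female", None, "No", "Good"],
]
probs = value_probabilities(data)
print(probs[4])  # Good occurs 3 times in 6 tuples: P(Good) = 0.5
```

These figures reproduce the module-2 result of the embodiment (Good: 50%, Poor: 33%, and so on).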
Module 3: counting the attribute-value combination sets
Input: the result file of stage 1.
Output: attribute-value combinations, count of each combination.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
     Calculate C(y, u) in tuple as combi-attri
2.   Outkey: combi-attri
     Outvalue: 1
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     Calculate number of combi-attri
2.   Outkey: combi-attri
     Outvalue: num_c
In module 3 the Map function computes the attribute-value combinations combi-attri in each data tuple, and the Reduce function computes the count num_c of each attribute-value combination of each data tuple in the whole data set.
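Module 3 can be sketched as follows (a single-machine Python illustration using frozensets for the unordered combinations, an assumption of this sketch):

```python
from collections import Counter
from itertools import combinations

def combination_counts(data, missing_marker=None):
    """Module 3 sketch: count, over the whole data set, every unordered
    combination C(y, u) of the non-missing values of each tuple (num_c)."""
    counts = Counter()
    for row in data:
        complete = [v for v in row if v != missing_marker]
        for u in range(1, len(complete) + 1):
            for combo in combinations(complete, u):
                counts[frozenset(combo)] += 1
    return counts

data = [
    ["Male", "Tall", "Yes", "Good"],
    ["Female", "Tall", "Yes", "Poor"],
    ["Male", "Short", "No", "Poor"],
    ["Female", "Tall", "No", "Good"],
    ["Male", "Short", "No", None],
    ["Female", None, "No", "Good"],
]
counts = combination_counts(data)
print(counts[frozenset({"No"})])             # 4
print(counts[frozenset({"Male", "Short"})])  # 2
```

These counts match the module-3 result of the embodiment (<No>: 4, <Male, Short>: 2, and so on).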
Module 4: counting, over the whole data set, the co-occurrence of an attribute-value combination and a given attribute value in the same data tuple
Input: the result file of stage 1.
Output: attribute-value combination, position of the attribute value, the attribute value, count of the combination and the attribute value in the same data tuple.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
2.   FOR each <attri-v, tuple> DO
       Calculate C(y, u) in rest complete attributes as combi-attri
3.     Outkey: combi-attri + attriindex + attri-v
       Outvalue: 1
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     Calculate number of combi-attri + attriindex + attri-v
2.   Outkey: combi-attri + attriindex + attri-v
     Outvalue: num_caa
In module 4 the Map function scans each data tuple, selects each attribute value attri-v in turn, and computes the attribute-value combinations combi-attri over the remaining attribute values; the Reduce function counts the number num_caa of times combi-attri and attri-v occur in the same data tuple over the whole data set.
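Module 4 can be sketched as follows (a single-machine Python illustration; the (1-based position, value, combination) key is an assumption of this sketch):

```python
from collections import Counter
from itertools import combinations

def value_combo_counts(data, missing_marker=None):
    """Module 4 sketch: count how often an attribute value attri-v occurs
    in the same tuple as each combination of the tuple's remaining
    non-missing values (num_caa)."""
    counts = Counter()
    for row in data:
        for i, v in enumerate(row):
            if v == missing_marker:
                continue
            rest = [w for k, w in enumerate(row)
                    if k != i and w != missing_marker]
            for u in range(1, len(rest) + 1):
                for combo in combinations(rest, u):
                    counts[(i + 1, v, frozenset(combo))] += 1
    return counts

data = [
    ["Male", "Tall", "Yes", "Good"],
    ["Female", "Tall", "Yes", "Poor"],
    ["Male", "Short", "No", "Poor"],
    ["Female", "Tall", "No", "Good"],
    ["Male", "Short", "No", None],
    ["Female", None, "No", "Good"],
]
caa = value_combo_counts(data)
print(caa[(4, "Good", frozenset({"No"}))])  # 2: tuples 4 and 6
```

These counts match the module-4/stage-4 figures of the embodiment (Good with <No>: 2, Poor with <Short>: 1, and so on).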
Stage 3: joining the related-attribute combination sets of the missing data with the possible values of the missing data
Input: the result file of module 1 of stage 2, the result file of module 2 of stage 2.
Output: related-attribute-value combinations of the incomplete tuple, missing-value position, possible values.
Map<Object,Text,Text,Text>
Input: key = offset, value = missingindex + combi-attri
1. FOR each <key, value> DO
     Split the value
2. Outkey: missingindex
   Outvalue: combi-attri
Map<Object,Text,Text,Text>
Input: key = offset, value = missingindex + pro-v
1. FOR each <key, value> DO
     Split the value
2. Outkey: missingindex
   Outvalue: pro-v
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
2. Outkey: offset
   Outvalue: combi-attri + missingindex + pro-v
In stage 3, pro-v is a possible value of the missing value. The first Map function splits each row of the result file of module 1 and submits missingindex as the key and combi-attri as the value to the Reduce. The second Map function splits each row of the possible-value file of the missing values and likewise submits missingindex as the key and pro-v as the value to the Reduce. The values with the same key are placed in the same valuelist, and the Reduce function joins the related-attribute combinations combi-attri of the missing data with the possible values pro-v; the final output has the form (combi-attri, missingindex, pro-v).
Stage 4: estimating the possible value of the missing value
Input: the result file CAacount of module 3 of stage 2, the result file CA-Aacount of module 4 of stage 2, the result file of stage 3.
Output: the estimate of the missing data.
Map<Object,Text,Text,Text>
Input: key = offset, value = missingindex + combi-attri + pro-v
1. FOR each <key, value> DO
     Split the value
2. Search count of combi-attri in CAacount record as num-combi-attri
3. Search count of combi-attri + pro-v in CA-Aacount record as num-combi-attri-a
4. Calculate num-combi-attri-a / num-combi-attri as credibility
5. Outkey: tupleindex + missingindex
   Outvalue: credibility + pro-v
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     Sum the credibility of each pro-v
2. IF sum of credibility is maximum THEN
     Outkey: offset
     Outvalue: pro-v
Stage 4 is the core of the algorithm and is used to estimate the value of the missing value. The Map function first splits each row of the result file of stage 3; it looks up in the file CAacount the count num-combi-attri of the related-attribute-value combination combi-attri of the missing data, i.e. S(combi-attri), and looks up in the file CA-Aacount the count num-combi-attri-a of the combination and the possible value, combi-attri + pro-v, appearing together in the same data tuple, i.e. S(combi-attri ∪ pro-v); from these it computes the credibility of the possible value of the missing data. The Reduce function adds up all the evidence credibilities for each possible value of the missing value, and the possible value pro-v with the maximum evidence sum is taken as the final filling value.
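The credibility computation of stage 4 can be sketched as follows, using the support counts of the embodiment's incomplete tuple (Male, Short, No, ?); the dictionary representation is an assumption of this sketch:

```python
def credibility(chain, s_c_counts, s_pc_counts):
    """Stage 4 sketch: sum S(combi-attri u pro-v) / S(combi-attri) over
    every evidence combination in the chain; zero-support evidence is
    skipped."""
    total = 0.0
    for combo in chain:
        s_c = s_c_counts.get(combo, 0)
        if s_c:
            total += s_pc_counts.get(combo, 0) / s_c
    return total

# Counts taken from the worked example for tuple 5 (grade missing).
s_c = {frozenset({"Male"}): 3, frozenset({"Short"}): 2, frozenset({"No"}): 4,
       frozenset({"Male", "Short"}): 2, frozenset({"Male", "No"}): 2,
       frozenset({"Short", "No"}): 2, frozenset({"Male", "Short", "No"}): 2}
good = {frozenset({"Male"}): 1, frozenset({"No"}): 2}  # S(Good u C)
poor = {c: 1 for c in s_c}                             # S(Poor u C)
chain = list(s_c)
print(credibility(chain, s_c, good))  # 1/3 + 2/4, about 0.83
print(credibility(chain, s_c, poor))  # about 3.08, so Poor would be chosen
```

Following the stage-4 pseudocode, this sum does not multiply in the prior P(p); adding that factor would not change the winner in this example.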
Stage 5: filling the values estimated in stage 4 into the original missing data set
Input: the original missing-data-set file, the estimated-value file of stage 4.
Output: the complete data set.
Map<Object,Text,Text,Text>
Input: key = offset, value = tuple
1. FOR each <key, value> DO
2. Outkey: offset
   Outvalue: value
Map<Object,Text,Text,Text>
Input: key = offset, value = missingindex + pro-v
1. FOR each <key, value> DO
2. Outkey: offset
   Outvalue: missingindex + pro-v
Reduce<Text,Text,Text,Text>
1. FOR each in valuelist DO
     ADD missingindex + pro-v into listA
2. FOR each in listA DO
     Append pro-v to value
3. Outkey: key
   Outvalue: com-tuple
In stage 5 the Map functions output the offsets of the original missing-data-set file and of the estimated-value file as keys, and output the values of the original data set and the "missingindex + pro-v" pairs of the estimated-value file as values. For each valuelist, the Reduce function stores "missingindex + pro-v" in listA and fills all the estimates in listA into the corresponding values of the missing data set, finally outputting the complete data tuples com-tuple.
Embodiment: the data set has 4 attributes: Sex, Height, Smokes (whether the person smokes), and Grade (school grade). A blank denotes missing data.
Sex | Height | Smokes | Grade
Male | Tall | Yes | Good
Female | Tall | Yes | Poor
Male | Short | No | Poor
Female | Tall | No | Good
Male | Short | No |
Female | | No | Good
Stage 1:
The 4th attribute of the fifth tuple is missing.
The 2nd attribute of the sixth tuple is missing.
Result file: 5, 4, [Male, Short, No, ]
6, 2, [Female, , No, Good]
[ ] denotes a tuple.
Stage 2:
Module 1:
Result file: 5, 4, {<Male>, <Short>, <No>, <Male, Short>, <Male, No>, <Short, No>, <Male, Short, No>}
6, 2, {<Female>, <No>, <Good>, <Female, No>, <Female, Good>, <No, Good>, <Female, No, Good>}
{ } denotes a set.
Module 2:
Result file: 1, Male: 50%, Female: 50%
2, Tall: 50%, Short: 33%
3, Yes: 33%, No: 67%
4, Good: 50%, Poor: 33%
The output is the attribute position and the probability of each attribute value.
Module 3:
Result file: <Male>: 3
<Female>: 3
<Tall>: 3
<Short>: 2
<Yes>: 2
<No>: 4
<Good>: 3
<Poor>: 2
<Male, Tall>: 1
<Male, Short>: 2
<Male, Yes>: 1
<Male, No>: 2
<Male, Good>: 1
<Tall, Yes>: 2
<Tall, Good>: 2
<Short, Yes>: 0
<Short, Good>: 0
<Yes, Good>: 1
<Male, Tall, Yes>: 1
<Male, Tall, Good>: 1
<Male, Yes, Good>: 1
<Male, Tall, Yes, Good>: 1 …
Module 4:
Taking the first data tuple as an example:
Result: 1, Male, <Tall>, 1
1, Male, <Yes>, 1
1, Male, <Good>, 1
1, Male, <Tall, Yes>, 1
1, Male, <Tall, Good>, 1
1, Male, <Yes, Good>, 1
1, Male, <Tall, Yes, Good>, 1 …
Stage 3:
Taking one of the data tuples as an example:
According to missing position 4, the record 5, 4, {<Male>, <Short>, <No>, <Male Short>, <Male No>, <Short No>, <Male Short No>} from Module 1 of Stage 2 is joined with the record 4, Good: 50%, Poor: 33% from Module 2 of Stage 2.
The record obtained is:
5, Good: 50%, Poor: 33%, {<Male>, <Short>, <No>, <Male Short>, <Male No>, <Short No>, <Male Short No>}
The algorithm then attaches each non-missing-data combination C(y, u) in the evidence chain Sj of this record to each possible missing-data value p.
The final output is:
<Male>, 4, Good: 50% + <Short>, 4, Good: 50% + <No>, 4, Good: 50% + <Male Short>, 4, Good: 50% + <Male No>, 4, Good: 50% + <Short No>, 4, Good: 50% + <Male Short No>, 4, Good: 50% + <Male>, 4, Poor: 33% + <Short>, 4, Poor: 33% + <No>, 4, Poor: 33% + <Male Short>, 4, Poor: 33% + <Male No>, 4, Poor: 33% + <Short No>, 4, Poor: 33% + <Male Short No>, 4, Poor: 33%
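The Stage 3 join can be sketched as a cartesian product of tuple 5's evidence chain (Module 1) with the candidate values and prior probabilities of attribute 4 (Module 2); the literals below are taken from the example:

```python
# Evidence chain of tuple 5 and candidate values for its missing attribute 4.
evidence = [("Male",), ("Short",), ("No",), ("Male", "Short"),
            ("Male", "No"), ("Short", "No"), ("Male", "Short", "No")]
candidates = {"Good": 0.50, "Poor": 0.33}

# Stage 3: attach every combination C(y, u) to every possible value p.
attached = [(combo, 4, p, prob)
            for p, prob in candidates.items()
            for combo in evidence]
print(len(attached))  # → 14: 7 combinations x 2 candidate values
```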
Stage 4:
For each non-missing-data combination C(y, u) of the incomplete data tuple in the output file of Stage 3, the algorithm looks up its quantity Oj in the output file of Module 3 of Stage 2.
Result: <Male>: quantity 3
<Short>: quantity 2
<No>: quantity 4
<Male Short>: quantity 2
<Male No>: quantity 2
<Short No>: quantity 2
<Male Short No>: quantity 2
It also looks up, in the output file of Module 4 of Stage 2, the number of times Tj that each non-missing-data combination C(y, u) of the incomplete data tuple occurs simultaneously with the possible missing-data value p in the whole data set.
Result: Good, <Male>: quantity 1
Good, <Short>: quantity 0
Good, <No>: quantity 2
Good, <Male Short>: quantity 0
Good, <Male No>: quantity 0
Good, <Short No>: quantity 0
Good, <Male Short No>: quantity 0
Poor, <Male>: quantity 1
Poor, <Short>: quantity 1
Poor, <No>: quantity 1
Poor, <Male Short>: quantity 1
Poor, <Male No>: quantity 1
Poor, <Short No>: quantity 1
Poor, <Male Short No>: quantity 1
F({<Male>, <Short>, <No>, <Male Short>, <Male No>, <Short No>, <Male Short No>} | Good) = 1/3 + 0/2 + 2/4 + 0/2 + 0/2 + 0/2 + 0/2 + 50% = 1.33
F({<Male>, <Short>, <No>, <Male Short>, <Male No>, <Short No>, <Male Short No>} | Poor) = 1/3 + 1/2 + 1/4 + 1/2 + 1/2 + 1/2 + 1/2 + 33% = 3.41
Stage 5: The maximum is taken, so the filling value for the missing 4th attribute of tuple 5 is "Poor".
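Stages 4 and 5 for tuple 5 can be sketched as follows; the counts `O` (Module 3 lookups), `T` (Module 4 lookups), and priors are the ones listed in the example, and `score`/`fill` are illustrative names:

```python
# Oj: quantity of each combination in the whole data set (Module 3 lookups).
O = {("Male",): 3, ("Short",): 2, ("No",): 4, ("Male", "Short"): 2,
     ("Male", "No"): 2, ("Short", "No"): 2, ("Male", "Short", "No"): 2}
# Tj: co-occurrences of each candidate value with each combination (Module 4).
T = {
    "Good": {("Male",): 1, ("Short",): 0, ("No",): 2, ("Male", "Short"): 0,
             ("Male", "No"): 0, ("Short", "No"): 0, ("Male", "Short", "No"): 0},
    "Poor": {("Male",): 1, ("Short",): 1, ("No",): 1, ("Male", "Short"): 1,
             ("Male", "No"): 1, ("Short", "No"): 1, ("Male", "Short", "No"): 1},
}
prior = {"Good": 0.50, "Poor": 0.33}

def score(p):
    """F = sum of Tj/Oj over the evidence chain, plus the prior P(p)."""
    return sum(T[p][c] / O[c] for c in O) + prior[p]

# Stage 5: fill with the candidate value of maximum F.
fill = max(prior, key=score)
print(round(score("Good"), 2), round(score("Poor"), 2), fill)
# → 1.33 3.41 Poor
```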
The above embodiment merely illustrates the technical idea of the present invention and does not limit its protection scope; any modification made on the basis of the technical scheme in accordance with the technical idea proposed by the present invention falls within the protection scope of the present invention.
Claims (5)
1. An information completion method for big data, characterized by comprising the following steps:
Step 1: let the data set be D, where D has m rows and n columns, each row being a data tuple and each column an attribute; scan and number each data tuple Dj, j = 1, ..., m, of the data set, and find the incomplete data tuples and the positions of the missing data within them;
Step 2: combine the other, non-missing data in each incomplete data tuple to obtain the set of non-missing-data combinations of that tuple, which serves as the chain of evidence for estimating the missing data value;
Step 3: obtain the possible values of the missing data from the non-missing data of the complete data tuples at the corresponding missing position, and calculate the probability of each possible missing-data value p;
Step 4: according to the missing positions from Step 1, join the set of non-missing-data combinations obtained in Step 2 with the possible missing-data values obtained in Step 3, and then attach each non-missing-data combination in the set to each possible missing-data value;
Step 5: combine the non-missing data of the whole data set and count the number of times each combination occurs;
Step 6: for each data tuple, select one of its attribute values, combine the remaining attributes of the tuple to obtain a set of combinations, and count the number of times each combination in the set occurs simultaneously with the selected attribute value in the whole data set;
Step 7: extract from the result of Step 5 the quantity of each non-missing-data combination of each incomplete data tuple in the whole data set, extract from the result of Step 6 the number of times each non-missing-data combination of each incomplete data tuple occurs simultaneously with each possible missing-data value, calculate the probability of each possible missing-data value conditioned on the non-missing-data combinations of the incomplete data tuple, and take the value with the maximum probability as the filling value of the missing data.
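Steps 1-7 of claim 1 can be tied together in one illustrative sketch. This is a reading of the claims under the assumption of at most one missing value per tuple, not the patent's reference implementation; `impute` and its internals are illustrative names, and the prior follows K(p)/m as in the embodiment:

```python
from itertools import combinations
from collections import Counter

def impute(dataset):
    """Illustrative end-to-end reading of claim 1 (steps 1-7), assuming
    at most one missing value (None) per data tuple."""
    m = len(dataset)

    # Steps 5 and 6: global counts of combinations and of
    # (selected value, combination of remaining attributes) pairs.
    combo_count, pair_count = Counter(), Counter()
    for row in dataset:
        present = [v for v in row if v is not None]
        for r in range(1, len(present) + 1):
            combo_count.update(combinations(present, r))
        for i, sel in enumerate(row):
            if sel is None:
                continue
            rest = [v for k, v in enumerate(row) if k != i and v is not None]
            for r in range(1, len(rest) + 1):
                for c in combinations(rest, r):
                    pair_count[(sel, c)] += 1

    filled = [list(row) for row in dataset]
    for row in filled:
        if None not in row:
            continue
        pos = row.index(None)                        # Step 1: missing position
        present = [v for v in row if v is not None]
        chain = [c for r in range(1, len(present) + 1)
                 for c in combinations(present, r)]  # Step 2: evidence chain
        # Step 3: candidate values and prior P(p) = K(p) / m.
        column = [r2[pos] for r2 in dataset if r2[pos] is not None]
        priors = {p: column.count(p) / m for p in set(column)}

        def score(p):                                # Steps 4 and 7
            return sum(pair_count[(p, c)] / combo_count[c]
                       for c in chain) + priors[p]

        row[pos] = max(priors, key=score)            # maximum-probability value
    return filled

result = impute([
    ["Male",   "Tall",  "Yes", "Good"],
    ["Female", "Tall",  "Yes", "Poor"],
    ["Male",   "Short", "No",  "Poor"],
    ["Female", "Tall",  "No",  "Good"],
    ["Male",   "Short", "No",  None],
    ["Female", None,    "No",  "Good"],
])
print(result[4][3], result[5][1])  # → Poor Tall
```

On the example data set this fills tuple 5 with "Poor", matching the embodiment; tuple 6's fill value is not given in the patent and is produced here purely by the sketch.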
2. The information completion method for big data according to claim 1, characterized in that the probability of the possible missing-data value p in Step 3 is calculated as:
P(p) = K(p) / m
wherein m is the number of all data tuples, K(p) is the number of times the possible missing-data value p occurs, across the data tuples, at the same position as the missing data, and P(p) is the probability of the possible value p.
3. The information completion method for big data according to claim 1, characterized in that the probability of the possible missing-data value conditioned on its non-missing-data combinations in the incomplete data tuple in Step 7 is calculated as:
F(Sj) = Σ S(p ∪ C(y, u)) / S(C(y, u)) + P(p), summed over all combinations C(y, u) in Sj,
wherein P(p) is the probability of the possible missing-data value p, S(p ∪ C(y, u)) is the number of times the non-missing-data combination C(y, u) in the incomplete data tuple occurs simultaneously with the possible missing-data value, S(C(y, u)) is the quantity of that non-missing-data combination of the incomplete data tuple in the whole data set, F(Sj) is the confidence, i.e. the probability, of the evidence chain Sj of the possible missing-data value, and Sj is the evidence chain of the missing data of the incomplete data tuple numbered j.
4. The information completion method for big data according to claim 1, characterized in that Step 2 and Step 3 are not restricted in order.
5. The information completion method for big data according to claim 1, characterized in that Step 5 and Step 6 are not restricted in order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710156391.XA CN106919719A (en) | 2017-03-16 | 2017-03-16 | A kind of information completion method towards big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106919719A true CN106919719A (en) | 2017-07-04 |
Family
ID=59460304
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710156391.XA Pending CN106919719A (en) | 2017-03-16 | 2017-03-16 | A kind of information completion method towards big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106919719A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200278471A1 (en) * | 2017-09-12 | 2020-09-03 | Schlumberger Technology Corporation | Dynamic representation of exploration and/or production entity relationships |
US11619761B2 (en) * | 2017-09-12 | 2023-04-04 | Schlumberger Technology Corporation | Dynamic representation of exploration and/or production entity relationships |
CN107766294A (en) * | 2017-10-31 | 2018-03-06 | 北京金风科创风电设备有限公司 | Method and device for recovering missing data |
CN107958027A (en) * | 2017-11-16 | 2018-04-24 | 南京邮电大学 | A QoS-guaranteed sensor network data collection method |
CN110413658A (en) * | 2019-07-23 | 2019-11-05 | 中经柏诚科技(北京)有限责任公司 | An evidence chain construction method based on fact association rules |
CN111737463A (en) * | 2020-06-04 | 2020-10-02 | 江苏名通信息科技有限公司 | Big data missing value filling method, device and computer program |
CN111737463B (en) * | 2020-06-04 | 2024-02-09 | 江苏名通信息科技有限公司 | Big data missing value filling method, device and computer readable memory |
CN116578557A (en) * | 2023-03-03 | 2023-08-11 | 齐鲁工业大学(山东省科学院) | Missing data filling method for data center |
CN116578557B (en) * | 2023-03-03 | 2024-04-02 | 齐鲁工业大学(山东省科学院) | Missing data filling method for data center |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170704 |