CN107169520A

CN107169520A - Method for completing missing attributes in big data

Info

Publication number: CN107169520A
Application number: CN201710357512.7A
Authority: CN
Inventors: 郝虹; 于治楼; 段成德
Original assignee: Jinan Inspur Hi Tech Investment and Development Co Ltd
Current assignee: Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority date: 2017-05-19
Filing date: 2017-05-19
Publication date: 2017-09-15

Abstract

The present invention discloses a method for completing missing attributes in big data and relates to the fields of data processing and data mining. The method mainly comprises four stages: a training sample clustering stage; a stage of computing the similarity between the data to be completed and each cluster; a stage of determining the weight of each cluster; and a missing attribute completion stage. With the present invention, training samples with complete attributes are first clustered into a specified number of clusters; weights are then determined from the similarity between the data with the missing attribute and each cluster; finally, the missing attribute is completed with the weighted sum of the clusters' attribute values. Because the method takes full account of the correlation between the data with the missing attribute and the other samples, the completed attribute values are more accurate and do not impair later applications of the data.

Description

A method for completing missing attributes in big data

Technical field

The present invention discloses a method for completing missing attributes in big data and relates to the fields of data processing and data mining.

Background art

In the current Internet era, data of all kinds are generated continuously, and these data contain a wealth of latent knowledge. Decision makers in every industry recognize the value of such massive data and use new technologies such as cloud computing and data mining to extract knowledge from big data in support of decision making. However, because data sources are diverse and real-world data are complex, a considerable portion of the collected big data carries insufficient information or has missing attributes. The data are therefore incomplete and difficult to process further. Moreover, existing methods for completing missing attributes in big data usually substitute the mean over all samples or a fixed default value. These methods ignore the correlation between the data with the missing attribute and the other samples, so the completed attribute values have low accuracy, which in turn affects later applications of the data, such as precise recommendation and marketing.

The present invention provides a method for completing missing attributes in big data, which mainly comprises four stages: a training sample clustering stage; a stage of computing the similarity between the data to be completed and each cluster; a stage of determining the weight of each cluster; and a missing attribute completion stage. With the present invention, training samples with complete attributes are first clustered into a specified number of clusters; weights are then determined from the similarity between the data with the missing attribute and each cluster; finally, the missing attribute is completed with the weighted sum of the clusters' attribute values. Because the method takes full account of the correlation between the data with the missing attribute and the other samples, the completed attribute values are more accurate and do not impair later applications of the data.

Summary of the invention

The present invention provides a method for completing missing attributes in big data. The method is highly general and easy to implement, and has broad application prospects.

The specific scheme proposed by the present invention is as follows:

A method for completing missing attributes in big data:

Big data with complete attributes are used as training samples and clustered into a specified number of clusters; the similarity between the data with a missing attribute and each cluster is computed; weights are determined from that similarity; and the missing attribute of the data is completed with the weighted sum of the clusters' attribute values. The process essentially comprises four stages: A1, the training sample clustering stage; A2, the stage of computing the similarity between the data to be completed and each cluster; A3, the stage of determining the weight of each cluster; and A4, the missing attribute completion stage.

The big data with complete attributes are clustered into the specified clusters as training samples in the following manner: standardized sample data with complete attributes are sampled, the samples are divided into a number of clusters using a clustering method, and the sample mean of each cluster is computed.
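The clustering step can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes numeric attribute vectors, Euclidean distance, and a simple alternating k-medoids loop with a random start. The function names `euclidean`, `assign`, `k_medoids`, and `cluster_mean`, and the parameters `max_iter` and `seed`, are hypothetical.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(samples: List[Vector], medoids: List[int]) -> List[List[int]]:
    """Group sample indices by their nearest medoid."""
    groups: List[List[int]] = [[] for _ in medoids]
    for i, s in enumerate(samples):
        nearest = min(range(len(medoids)),
                      key=lambda c: euclidean(s, samples[medoids[c]]))
        groups[nearest].append(i)
    return groups


def k_medoids(samples: List[Vector], k: int,
              max_iter: int = 50, seed: int = 0) -> List[List[Vector]]:
    """Alternate assignment and medoid update until the medoids stop changing."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(samples)), k)  # assumed random initialisation
    for _ in range(max_iter):
        new_medoids = []
        for old, members in zip(medoids, assign(samples, medoids)):
            if not members:  # keep the old medoid if its group is empty
                new_medoids.append(old)
                continue
            # New medoid: the member with the smallest total distance to its group.
            new_medoids.append(min(
                members,
                key=lambda i: sum(euclidean(samples[i], samples[j])
                                  for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    groups = assign(samples, medoids)
    return [[samples[i] for i in g] for g in groups if g]


def cluster_mean(cluster: List[Vector]) -> List[float]:
    """Attribute-wise mean of the samples in one cluster."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]
```

Calling `cluster_mean` on each cluster returned by `k_medoids` yields the per-cluster sample means that the later stages use.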

The similarity between the data with a missing attribute and each cluster is computed as follows: with the missing attribute excluded, the distance between the data with the missing attribute and the sample mean of each cluster is computed.
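A minimal sketch of this distance computation is given below. The text does not name a distance metric, so Euclidean distance is an assumption here, as is the convention of marking a missing attribute with `None`; the function name `distance_ignoring_missing` is hypothetical.

```python
import math
from typing import Optional, Sequence


def distance_ignoring_missing(record: Sequence[Optional[float]],
                              cluster_mean: Sequence[float]) -> float:
    """Euclidean distance over the attributes that are present in `record`.

    Attributes equal to None (the assumed marker for "missing") are skipped,
    so the corresponding components of the cluster mean are ignored as well.
    """
    return math.sqrt(sum((x - m) ** 2
                         for x, m in zip(record, cluster_mean)
                         if x is not None))
```

Because the missing position is dropped from both vectors, a large value of the cluster mean in that position has no influence on the distance.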

The weights are determined as follows: the reciprocal of each distance between the data with the missing attribute and the cluster sample means is taken, and the reciprocals are summed; the ratio of each reciprocal distance to that sum is used as the weight of the corresponding cluster.
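The weighting rule can be sketched as below. The handling of a zero distance (a record that coincides with a cluster mean) is not addressed in the text; splitting the full weight among the zero-distance clusters is an assumption made here to avoid division by zero. The function name `inverse_distance_weights` is hypothetical.

```python
from typing import List, Sequence


def inverse_distance_weights(distances: Sequence[float]) -> List[float]:
    """Normalised inverse-distance weights, one per cluster; they sum to 1."""
    zero = [d == 0 for d in distances]
    if any(zero):
        # Assumption: clusters at distance 0 share the whole weight equally.
        share = 1.0 / sum(zero)
        return [share if z else 0.0 for z in zero]
    inverses = [1.0 / d for d in distances]
    total = sum(inverses)
    return [inv / total for inv in inverses]
```

Closer clusters receive larger weights, and the normalisation ensures the completed value stays within the range spanned by the cluster means.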

The missing attribute of the data is completed as follows: the weighted sum of the cluster sample means with their corresponding weights is computed and used as the value of the missing attribute.
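The completion step can be illustrated with the short sketch below. It assumes that missing attributes are marked with `None` and that the cluster means and weights have already been computed; the function name `impute_missing` is hypothetical.

```python
from typing import List, Optional, Sequence


def impute_missing(record: Sequence[Optional[float]],
                   cluster_means: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Fill each missing (None) attribute with the weighted sum of cluster means."""
    completed = list(record)
    for j, value in enumerate(record):
        if value is None:
            # Weighted sum of the j-th component of every cluster mean.
            completed[j] = sum(w * mean[j]
                               for w, mean in zip(weights, cluster_means))
    return completed
```

Attributes that are already present are copied unchanged; only the missing positions are replaced.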

The beneficial effects of the present invention are:

The present invention provides a method for completing missing attributes in big data, which mainly comprises four stages: a training sample clustering stage; a stage of computing the similarity between the data to be completed and each cluster; a stage of determining the weight of each cluster; and a missing attribute completion stage. With the present invention, training samples with complete attributes are first clustered into a specified number of clusters; weights are then determined from the similarity between the data with the missing attribute and each cluster; finally, the missing attribute is completed with the weighted sum of the clusters' attribute values. Because the method takes full account of the correlation between the data with the missing attribute and the other samples, the completed attribute values are more accurate and do not impair later applications of the data.

Brief description of the drawings

Fig. 1 is a schematic flow diagram of the four stages of the method of the present invention;

Fig. 2 is a detailed flow diagram of the method of the present invention.

Detailed description of the embodiments

The present invention provides a method for completing missing attributes in big data:

Big data with complete attributes are used as training samples and clustered into a specified number of clusters; the similarity between the data with a missing attribute and each cluster is computed; weights are determined from that similarity; and the missing attribute of the data is completed with the weighted sum of the clusters' attribute values.

The present invention is further described below with reference to the accompanying drawings.

The method of the present invention mainly comprises four stages:

A1: the training sample clustering stage;

A2: the stage of computing the similarity between the data to be completed and each cluster;

A3: the stage of determining the weight of each cluster;

A4: the missing attribute completion stage.

The specific process of stage A1 is: already-processed, standardized sample data with complete attributes are randomly sampled from the data warehouse; the samples are divided into k clusters using the k-medoids clustering method, where the value of k is determined from the expected number of data categories; and the sample mean of each cluster is computed;

The similarity in stage A2 is computed as follows: with the missing attribute excluded, the distance between the data to be completed and each cluster sample mean from stage A1 is computed;

The weights in stage A3 are computed as follows: the reciprocal of each distance from stage A2 is taken and the reciprocals are summed; the ratio of each reciprocal distance to the sum is used as the weight;

The attribute in stage A4 is completed as follows: the weighted sum of the means from stage A1 with the corresponding weights from stage A3 is computed and used as the value of the missing attribute.

The computed value is filled into the data with the missing attribute, which concludes the completion stage. Because the present invention takes into account the correlation between the data with the missing attribute and the other samples, the completed attribute values are more accurate and do not impair later applications of the data.
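Stages A2 to A4 can be written compactly as formulas. The notation below is introduced only for illustration and does not appear in the original: x is the data to be completed, M is the set of indices of its missing attributes, and μ_i is the sample mean of cluster i (i = 1, ..., k). Euclidean distance is assumed, since the text does not name a metric.

```latex
% A2: distance to each cluster mean, computed without the missing attributes
d_i = \sqrt{\sum_{j \notin M} \left( x_j - \mu_{i,j} \right)^2}, \qquad i = 1, \dots, k

% A3: normalised inverse-distance weights
w_i = \frac{1 / d_i}{\sum_{l=1}^{k} 1 / d_l}

% A4: completed value of each missing attribute
\hat{x}_j = \sum_{i=1}^{k} w_i \, \mu_{i,j}, \qquad j \in M

% Worked example with k = 2, d_1 = 1, d_2 = 3, \mu_{1,j} = 10, \mu_{2,j} = 20:
w_1 = \frac{1}{1 + 1/3} = \frac{3}{4}, \qquad
w_2 = \frac{1/3}{1 + 1/3} = \frac{1}{4}, \qquad
\hat{x}_j = \frac{3}{4} \cdot 10 + \frac{1}{4} \cdot 20 = 12.5
```

In the worked example the completed value lies closer to the mean of the nearer cluster, which is the intended effect of the inverse-distance weighting.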

Claims

1. A method for completing missing attributes in big data, characterized in that

2. The method according to claim 1, characterized in that the big data with complete attributes are clustered into the specified clusters as training samples by sampling standardized sample data with complete attributes, dividing the samples into a number of clusters using a clustering method, and computing the sample mean of each cluster.

3. The method according to claim 2, characterized in that the similarity between the data with a missing attribute and each cluster is computed by calculating, with the missing attribute excluded, the distance between the data with the missing attribute and the sample mean of each cluster.

4. The method according to claim 3, characterized in that the weights are determined by taking the reciprocal of each distance between the data with the missing attribute and the cluster sample means, summing the reciprocals, and using the ratio of each reciprocal distance to the sum as the weight.

5. The method according to claim 4, characterized in that the missing attribute of the data is completed by computing the weighted sum of the cluster sample means with their corresponding weights and using it as the value of the missing attribute.