CN107169520A - Method for completing missing attributes in big data - Google Patents

Method for completing missing attributes in big data

Info

Publication number
CN107169520A
Authority
CN
China
Prior art keywords
attribute
cluster
data
missing
completion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710357512.7A
Other languages
Chinese (zh)
Inventor
郝虹
于治楼
段成德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Inspur Hi Tech Investment and Development Co Ltd
Original Assignee
Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-05-19
Filing date
2017-05-19
Publication date
2017-09-15
Application filed by Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority to CN201710357512.7A
Publication of CN107169520A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for completing missing attributes in big data and relates to the fields of data processing and data mining. The method comprises four stages: a training-sample clustering stage; a stage of computing the similarity between the data to be completed and each cluster; a stage of determining the weight of each cluster; and a missing-attribute completion stage. Training samples with complete attributes are first clustered into a specified number of clusters. Weights are then determined from the similarity between the record with the missing attribute and each cluster. Finally, the missing attribute is filled with the weighted sum of the clusters' attribute values. Because the method takes into account the relationship between the incomplete record and the other samples, the completed attribute values are more accurate, and later applications that use the data are not adversely affected.

Description

Method for completing missing attributes in big data
Technical field
The invention discloses a method for completing missing attributes in big data and relates to the fields of data processing and data mining.
Background
In today's internet environment, data of all kinds is generated continuously, and this data contains a wealth of latent knowledge. Decision makers in every industry have recognized the value of such massive data and use new technologies such as cloud computing and data mining to extract knowledge from it in support of decision-making. However, because data sources are diverse and real-world data is complex, a considerable share of the collected big data carries insufficient information or has missing attributes. The data is therefore incomplete and difficult to process further. Existing methods for completing missing attributes in big data usually fill in the mean over all samples or a fixed default value. These methods ignore the relationship between the record with the missing attribute and the other samples, so the completed attribute values have low accuracy, which in turn degrades later applications that use the data, such as precise recommendation and marketing.
The invention therefore provides a method for completing missing attributes in big data that comprises four stages: a training-sample clustering stage; a stage of computing the similarity between the data to be completed and each cluster; a stage of determining the weight of each cluster; and a missing-attribute completion stage. Training samples with complete attributes are first clustered into a specified number of clusters. Weights are then determined from the similarity between the record with the missing attribute and each cluster. Finally, the missing attribute is filled with the weighted sum of the clusters' attribute values. Because the relationship between the incomplete record and the other samples is taken into account, the completed attribute values are more accurate, and later applications that use the data are not adversely affected.
Summary of the invention
The invention provides a method for completing missing attributes in big data that is highly general and easy to implement, and that has broad application prospects.
The specific scheme proposed by the invention is as follows:
A method for completing missing attributes in big data:
Big data with complete attributes is clustered, as training samples, into a specified number of clusters. The similarity between the record with the missing attribute and each cluster is computed, a weight is determined from that similarity, and the missing attribute of the record is filled with the weighted sum of the clusters' attribute values. The process comprises four stages: A1, the training-sample clustering stage; A2, the stage of computing the similarity between the data to be completed and each cluster; A3, the stage of determining the weight of each cluster; and A4, the missing-attribute completion stage.
The big data with complete attributes is clustered into the specified clusters as follows: standardized sample data with complete attributes is sampled, the samples are divided into a number of clusters using a clustering method, and the sample mean of each cluster is computed.
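The per-cluster sample mean described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the cluster labels have already been produced by some clustering method, and the names `cluster_means`, `samples`, and `labels` are hypothetical.

```python
from collections import defaultdict

def cluster_means(samples, labels):
    """Return {cluster_label: mean vector} for samples with complete attributes."""
    groups = defaultdict(list)
    for row, label in zip(samples, labels):
        groups[label].append(row)
    means = {}
    for label, rows in groups.items():
        n = len(rows)
        # Column-wise average over all samples assigned to this cluster.
        means[label] = [sum(col) / n for col in zip(*rows)]
    return means

# Hypothetical toy data: three complete samples, two clusters.
samples = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [10.0, 10.0, 10.0]]
labels = [0, 0, 1]
means = cluster_means(samples, labels)
```

Each mean vector keeps every attribute, including the one that will later be missing in the record to be completed, because the completion step reads that attribute from the cluster means.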
The similarity between the record with the missing attribute and each cluster is computed as follows: with the missing attribute excluded, the distance between the record and the sample mean of each cluster is computed.
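A small sketch of this distance computation is shown here. The text does not name a distance metric, so Euclidean distance is an assumption, as is the convention of marking a missing attribute with `None`; `observed_distance` is a hypothetical helper name.

```python
import math

def observed_distance(record, center):
    """Euclidean distance over the attributes that are present in `record`.

    Missing attributes are marked with None (an assumed convention) and are
    skipped, so the cluster mean's value for them does not affect the result.
    """
    return math.sqrt(sum((x - c) ** 2
                         for x, c in zip(record, center)
                         if x is not None))

# Hypothetical record with its second attribute missing.
record = [4.0, None, 8.0]
center = [1.0, 100.0, 4.0]
d = observed_distance(record, center)
```

Only the first and third attributes contribute here, so the large value 100.0 in the cluster mean has no influence on the distance.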
The weights are determined as follows: the reciprocal of each distance between the record with the missing attribute and a cluster sample mean is taken, the reciprocals are summed, and the ratio of each reciprocal distance to that sum is used as the weight of the corresponding cluster.
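The weighting rule can be sketched as below. The text does not say what happens when a distance is exactly zero (the reciprocal is undefined), so the handling of that case is an assumption made for this illustration; `inverse_distance_weights` is a hypothetical name.

```python
def inverse_distance_weights(distances):
    """Weight of cluster i = (1 / d_i) / sum over j of (1 / d_j)."""
    # Assumption (not in the source): if the record coincides with one or
    # more cluster means (distance 0), those clusters share all the weight.
    zeros = [d == 0 for d in distances]
    if any(zeros):
        hits = sum(zeros)
        return [1.0 / hits if z else 0.0 for z in zeros]
    inverses = [1.0 / d for d in distances]
    total = sum(inverses)
    return [inv / total for inv in inverses]

# Hypothetical distances to three cluster means.
weights = inverse_distance_weights([1.0, 2.0, 4.0])
```

The weights sum to 1 and decrease as the distance grows, so nearer clusters contribute more to the completed value.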
The missing attribute of the record is completed as follows: the weighted sum of the cluster sample means, each multiplied by its corresponding weight, is computed and used as the value of the missing attribute.
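The completion step itself reduces to a weighted sum over the cluster means. The sketch below assumes the weights are already known and that missing attributes are marked with `None`; `complete_record` and the toy values are hypothetical.

```python
def complete_record(record, centers, weights):
    """Fill each None in `record` with the weighted sum of the cluster means."""
    filled = list(record)
    for j, value in enumerate(record):
        if value is None:
            # Weighted sum of attribute j across all cluster means.
            filled[j] = sum(w * c[j] for w, c in zip(weights, centers))
    return filled

# Hypothetical cluster means and weights. The weights 0.75 and 0.25 match
# inverse-distance weighting for the observed value 1.5, whose distances to
# the first attribute of the two means are 0.5 and 1.5.
centers = [[1.0, 10.0], [3.0, 30.0]]
weights = [0.75, 0.25]
filled = complete_record([1.5, None], centers, weights)
```

Observed attributes are left unchanged; only the missing position is replaced.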
Beneficial effects of the invention:
The invention provides a method for completing missing attributes in big data that comprises four stages: a training-sample clustering stage; a stage of computing the similarity between the data to be completed and each cluster; a stage of determining the weight of each cluster; and a missing-attribute completion stage. Training samples with complete attributes are first clustered into a specified number of clusters. Weights are then determined from the similarity between the record with the missing attribute and each cluster. Finally, the missing attribute is filled with the weighted sum of the clusters' attribute values. Because the relationship between the incomplete record and the other samples is taken into account, the completed attribute values are more accurate, and later applications that use the data are not adversely affected.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the four stages of the method of the invention;
Fig. 2 is a detailed schematic flowchart of the method of the invention.
Detailed description
The invention provides a method for completing missing attributes in big data:
Big data with complete attributes is clustered, as training samples, into a specified number of clusters. The similarity between the record with the missing attribute and each cluster is computed, a weight is determined from that similarity, and the missing attribute of the record is filled with the weighted sum of the clusters' attribute values.
The invention is further described below with reference to the accompanying drawings.
The method of the invention comprises four stages:
A1: the training-sample clustering stage;
A2: the stage of computing the similarity between the data to be completed and each cluster;
A3: the stage of determining the weight of each cluster;
A4: the missing-attribute completion stage.
The detailed process of A1 is: randomly sample already-processed, standardized sample data with complete attributes from the data warehouse; divide the samples into k clusters using the k-medoids (k-center-point) clustering method, with the value of k determined by the anticipated number of data categories; and compute the sample mean of each cluster;
The similarity in A2 is computed as follows: with the missing attribute excluded, compute the distance between the data to be completed and each cluster sample mean from stage A1;
The weights in A3 are computed as follows: take the reciprocal of each distance from stage A2, sum the reciprocals, and use the ratio of each reciprocal distance to that sum as the weight;
The attribute in A4 is completed as follows: compute the weighted sum of the means from stage A1 with the corresponding weights from stage A3, and use it as the value of the missing attribute.
The completed attribute value is written back to the record with the missing attribute, which concludes the completion stage. Because the invention considers the relationship between the record with the missing attribute and the other samples, the completed attribute values are more accurate, and later applications that use the data are not adversely affected.
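Stages A1 to A4 can be strung together in a short end-to-end sketch. Everything here is illustrative rather than the patent's implementation: the k-medoids routine is a simplified alternating variant, Euclidean distance is assumed because no metric is specified, missing attributes are marked with `None`, the small floor on the distance is an added guard against division by zero, and all function names and data are hypothetical.

```python
import math
import random

def dist(a, b, keep=None):
    """Euclidean distance, optionally restricted to the attribute indices in `keep`."""
    idx = range(len(a)) if keep is None else keep
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in idx))

def k_medoids(samples, k, iters=20, seed=0):
    """Simplified alternating k-medoids; returns one cluster label per sample."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(samples)), k)
    labels = []
    for _ in range(iters):
        # Assign every sample to its nearest medoid.
        labels = [min(range(k), key=lambda c: dist(s, samples[medoids[c]]))
                  for s in samples]
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(min(
                members,
                key=lambda i: sum(dist(samples[i], samples[m]) for m in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels

def impute(record, samples, k):
    """Run stages A1 to A4 for one record whose missing attributes are None."""
    labels = k_medoids(samples, k)                      # A1: cluster the samples
    means = []
    for c in range(k):
        rows = [s for s, lab in zip(samples, labels) if lab == c]
        if rows:
            means.append([sum(col) / len(rows) for col in zip(*rows)])
    keep = [j for j, v in enumerate(record) if v is not None]
    dists = [dist(record, m, keep) for m in means]      # A2: distance on observed attributes
    inv = [1.0 / max(d, 1e-12) for d in dists]          # A3: reciprocal distances (floor is an assumption)
    weights = [v / sum(inv) for v in inv]
    return [v if v is not None                          # A4: weighted sum of cluster means
            else sum(w * m[j] for w, m in zip(weights, means))
            for j, v in enumerate(record)]

# Hypothetical complete samples forming two well-separated groups.
samples = [[1.0, 1.0, 10.0], [1.2, 0.8, 11.0], [0.8, 1.2, 9.0],
           [5.0, 5.0, 50.0], [5.2, 4.8, 51.0], [4.8, 5.2, 49.0]]
filled = impute([2.0, 2.0, None], samples, k=2)
```

With this toy data the two cluster means are close to (1, 1, 10) and (5, 5, 50). The record lies three times closer to the first mean than to the second on its observed attributes, so the weights come out near 0.75 and 0.25 and the completed third attribute is close to 20. Re-clustering on every call is done only to keep the sketch compact; in practice the cluster means would be computed once and reused.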

Claims (5)

1. A method for completing missing attributes in big data, characterized in that
big data with complete attributes is clustered, as training samples, into a specified number of clusters; the similarity between the record with the missing attribute and each cluster is computed; a weight is determined from that similarity; and the missing attribute of the record is filled with the weighted sum of the clusters' attribute values.
2. The method according to claim 1, characterized in that the big data with complete attributes is clustered into the specified clusters as training samples by sampling standardized sample data with complete attributes, dividing the samples into a number of clusters using a clustering method, and computing the sample mean of each cluster.
3. The method according to claim 2, characterized in that the similarity between the record with the missing attribute and each cluster is computed by calculating, with the missing attribute excluded, the distance between the record and the sample mean of each cluster.
4. The method according to claim 3, characterized in that the weight is determined by taking the reciprocal of each distance between the record with the missing attribute and a cluster sample mean, summing the reciprocals, and using the ratio of each reciprocal distance to that sum as the weight.
5. The method according to claim 4, characterized in that the missing attribute of the record is completed by computing the weighted sum of the cluster sample means with their corresponding weights and using it as the value of the missing attribute.
CN201710357512.7A 2017-05-19 2017-05-19 Method for completing missing attributes in big data Pending CN107169520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710357512.7A CN107169520A (en) 2017-05-19 2017-05-19 Method for completing missing attributes in big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710357512.7A CN107169520A (en) 2017-05-19 2017-05-19 Method for completing missing attributes in big data

Publications (1)

Publication Number Publication Date
CN107169520A (en) 2017-09-15

Family

ID=59815708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710357512.7A Pending CN107169520A (en) 2017-05-19 2017-05-19 Method for completing missing attributes in big data

Country Status (1)

Country Link
CN (1) CN107169520A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108932301A (en) * 2018-06-11 2018-12-04 天津科技大学 Data filling method and device
CN109063967A (en) * 2018-07-03 2018-12-21 阿里巴巴集团控股有限公司 Method, device and electronic equipment for processing risk-control scenario feature tensors
CN109710628A (en) * 2018-12-29 2019-05-03 深圳道合信息科技有限公司 Information processing method and device, system, computer and readable storage medium
CN113010500A (en) * 2019-12-18 2021-06-22 中国电信股份有限公司 Processing method and processing system for DPI data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103177088A (en) * 2013-03-08 2013-06-26 北京理工大学 Biomedicine missing data compensation method
CN104866578A (en) * 2015-05-26 2015-08-26 大连理工大学 Hybrid filling method for incomplete data
US20150356094A1 (en) * 2014-06-04 2015-12-10 Waterline Data Science, Inc. Systems and methods for management of data platforms
CN106326335A (en) * 2016-07-22 2017-01-11 浪潮集团有限公司 Big data classification method based on significant attribute selection
CN106407464A (en) * 2016-10-12 2017-02-15 南京航空航天大学 KNN-based improved missing data filling algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
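王策 (Wang Ce): "一种基于k-means算法和关联规则的缺失数据填补方法" [A missing-data imputation method based on the k-means algorithm and association rules], 《万方数据库》 (Wanfang Database) *
郝胜轩,宋宏,周晓锋 (Hao Shengxuan, Song Hong, Zhou Xiaofeng): "基于近邻噪声处理的KNN缺失数据填补算法" [A KNN missing-data imputation algorithm based on nearest-neighbor noise handling], 《计算机仿真》 (Computer Simulation) *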

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108932301A (en) * 2018-06-11 2018-12-04 天津科技大学 Data filling method and device
CN109063967A (en) * 2018-07-03 2018-12-21 阿里巴巴集团控股有限公司 Method, device and electronic equipment for processing risk-control scenario feature tensors
CN109063967B (en) * 2018-07-03 2021-08-27 创新先进技术有限公司 Processing method and device for risk-control scenario feature tensors, and electronic equipment
CN109710628A (en) * 2018-12-29 2019-05-03 深圳道合信息科技有限公司 Information processing method and device, system, computer and readable storage medium
CN109710628B (en) * 2018-12-29 2023-12-26 深圳巨湾科技有限公司 Information processing method, information processing device, information processing system, computer and readable storage medium
CN113010500A (en) * 2019-12-18 2021-06-22 中国电信股份有限公司 Processing method and processing system for DPI data

Similar Documents

Publication Publication Date Title
Bi et al. MobileNet based apple leaf diseases identification
Zheng et al. Oversampling method for imbalanced classification
US20210042664A1 (en) Model training and service recommendation
CN107169520A (en) Method for completing missing attributes in big data
CN108830416B (en) Advertisement click rate prediction method based on user behaviors
CN108984530A (en) Detection method and detection system for sensitive network content
CN103761311B (en) Sensibility classification method based on multi-source field instance migration
CN103106262B (en) The method and apparatus that document classification, supporting vector machine model generate
CN110134787A (en) News topic detection method
CN102663431B (en) Image matching calculation method on basis of region weighting
CN105912716A (en) Short text classification method and apparatus
CN106445954B (en) Business object display method and device
CN112541529A (en) Expression and posture fusion bimodal teaching evaluation method, device and storage medium
TW202042132A (en) Method for detecting abnormal transaction node, and device
Zhang et al. 3D object retrieval with multi-feature collaboration and bipartite graph matching
CN104915399A (en) Recommended data processing method based on news headline and recommended data processing method system based on news headline
CN103336832A (en) Video classifier construction method based on quality metadata
CN107194207A (en) Protein ligands binding site estimation method based on granularity support vector machine ensembles
CN104318241A (en) Local density spectral clustering similarity measurement algorithm based on Self-tuning
CN103488689A (en) Mail classification method and mail classification system based on clustering
CN109272056A (en) The method of data balancing method and raising data classification performance based on pseudo- negative sample
CN105609116A (en) Speech emotional dimensions region automatic recognition method
CN103744958B (en) Web page classification method based on distributed computing
Liang et al. MOPSO-based CNN for keyword selection on Google ads
CN106354787A (en) Entity coreference resolution method based on similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20170915)