CN107169520A - Method for completing missing attributes in big data - Google Patents
Method for completing missing attributes in big data
- Publication number
- CN107169520A
- Authority
- CN
- China
- Prior art keywords
- attribute
- cluster
- data
- missing
- completion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for completing missing attributes in big data, in the field of data processing and data mining. The method comprises four stages: clustering the training samples; computing the similarity between the record to be completed and each cluster; determining a weight for each cluster; and completing the missing attribute. Training samples whose attributes are complete are first grouped into a specified number of clusters; weights are then determined from the similarity between the record with the missing attribute and each cluster; finally, the missing attribute is filled with the weighted sum of the clusters' attribute values. Because the method takes into account the association between the incomplete record and the other samples, the completed attribute values are more accurate, and later applications that use the data are not adversely affected.
Description
Technical field
The invention discloses a method for completing missing attributes in big data and relates to the field of data processing and data mining.
Background art
In today's internet environment, data of all kinds is generated continuously, and this data holds a wealth of latent knowledge. Decision makers across industries have recognized the value of such massive data and use newer technologies such as cloud computing and data mining to extract knowledge from it in support of their decisions. However, because data sources are diverse and real-world data is complex, a considerable share of the collected big data carries too little information or has missing attributes. The resulting incomplete data is difficult to process further. Moreover, existing methods for completing missing attributes in big data usually substitute either the mean over all samples or some fixed default value. These methods ignore the association between the record with the missing attribute and the other samples, so the completed attribute values have low accuracy, which in turn degrades later applications of the data, such as precision recommendation and marketing.
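The baseline criticized above (filling with the mean over all samples) can be sketched as follows. This is only an illustration under assumptions: the function name, the list-of-lists data layout, and the toy values are hypothetical and do not come from the source.

```python
from statistics import mean

def impute_with_global_mean(complete_rows, record, missing_index):
    """Fill record[missing_index] with the mean of that attribute over all complete rows."""
    filled = list(record)
    filled[missing_index] = mean(row[missing_index] for row in complete_rows)
    return filled

# Hypothetical toy data: two groups of records with very different second attributes.
complete_rows = [[1.0, 10.0], [1.2, 12.0], [9.0, 90.0], [9.2, 92.0]]
record = [1.1, None]  # second attribute is missing

filled = impute_with_global_mean(complete_rows, record, 1)
print(filled)
```

In this toy data the incomplete record resembles the first two rows, whose second attribute is about 11, yet the global mean assigns it 51. That gap is the kind of inaccuracy the background section attributes to ignoring the relation between the record and the other samples.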
The invention therefore provides a method for completing missing attributes in big data, comprising four stages: clustering the training samples; computing the similarity between the record to be completed and each cluster; determining a weight for each cluster; and completing the missing attribute. Training samples whose attributes are complete are first grouped into a specified number of clusters; weights are then determined from the similarity between the record with the missing attribute and each cluster; finally, the missing attribute is filled with the weighted sum of the clusters' attribute values. Because the method takes into account the association between the incomplete record and the other samples, the completed attribute values are more accurate, and later applications that use the data are not adversely affected.
Summary of the invention
The invention provides a method for completing missing attributes in big data that is highly general and easy to implement, and it has broad application prospects.
The specific scheme proposed by the invention is as follows.
A method for completing missing attributes in big data:
Big data whose attributes are complete is used as training samples and grouped into a specified number of clusters. The similarity between the record with the missing attribute and each cluster is computed, weights are determined from that similarity, and the missing attribute of the record is filled with the weighted sum of the clusters' attribute values. The process comprises four stages: A1, the training sample clustering stage; A2, the stage of computing the similarity between the record to be completed and each cluster; A3, the cluster weight determination stage; and A4, the missing attribute completion stage.
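In symbols, the scheme can be summarized as below. The notation is an assumption introduced here for illustration and is not taken from the source: x is the record whose attribute m is missing, \mu_j is the sample mean of cluster j (j = 1, ..., k), the subscript \setminus m means "with attribute m removed", and dist is a distance function that the source leaves unspecified.

```latex
\[
d_j = \operatorname{dist}\bigl(x_{\setminus m},\ \mu_{j,\setminus m}\bigr), \qquad
w_j = \frac{1/d_j}{\sum_{i=1}^{k} 1/d_i}, \qquad
\hat{x}_m = \sum_{j=1}^{k} w_j\, \mu_{j,m}
\]
```

Since the weights w_j are positive and sum to 1, the estimate is a convex combination of the cluster means of attribute m, dominated by the clusters nearest to the record.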
The big data with complete attributes is clustered as training samples as follows: well-formed sample records with complete attributes are sampled, the samples are divided into a number of clusters using a clustering method, and the sample mean of each cluster is computed.
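A minimal sketch of this clustering step is given below. The passage only says "a clustering method", so the plain k-means (Lloyd's algorithm) used here is an assumption chosen for brevity; the function name, parameters, and toy data are likewise hypothetical.

```python
import math
import random

def kmeans_cluster_means(samples, k, iterations=20, seed=0):
    """Partition samples into k clusters with plain k-means and return each cluster's mean."""
    rng = random.Random(seed)
    # Start from k distinct samples chosen at random.
    centers = [list(p) for p in rng.sample(samples, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in samples:
            # Assign each sample to its nearest current center (Euclidean distance).
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the attribute-wise mean; keep the old center if a cluster is empty.
        centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[j]
            for j, members in enumerate(clusters)
        ]
    return centers

# Hypothetical complete records forming two well-separated groups.
samples = [[1.0, 10.0], [1.2, 12.0], [9.0, 90.0], [9.2, 92.0]]
means = sorted(kmeans_cluster_means(samples, 2))
print(means)
```

For this toy data the two cluster means are approximately [1.1, 11.0] and [9.1, 91.0]. On real data the result depends on the initial centers and on the choice of k.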
The similarity between the record with the missing attribute and each cluster is computed as follows: with the missing attribute removed, the distance between the record and each cluster's sample mean is calculated.
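The distance computation can be sketched as follows. The source does not name a distance metric, so Euclidean distance is an assumption here, as are the function name and the toy values.

```python
import math

def distances_to_cluster_means(record, cluster_means, missing_index):
    """Distance from the record to each cluster mean, with the missing attribute removed."""
    def observed(values):
        # Drop the missing attribute so only attributes present in the record are compared.
        return [v for i, v in enumerate(values) if i != missing_index]
    return [math.dist(observed(record), observed(mu)) for mu in cluster_means]

# Hypothetical cluster means over three attributes; the record lacks the third one.
cluster_means = [[1.0, 2.0, 10.0], [4.0, 7.0, 40.0]]
record = [1.0, 3.0, None]

distances = distances_to_cluster_means(record, cluster_means, 2)
print(distances)
```

Here the record is at distance 1 from the first cluster mean and 5 from the second, so it is more similar to the first cluster.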
The weights are determined as follows: the reciprocal of each distance between the record and a cluster sample mean is taken, the reciprocals are summed, and the ratio of each reciprocal to that sum is used as the weight of the corresponding cluster.
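The weight calculation amounts to normalizing the inverse distances, as in the sketch below. The function name is hypothetical. The source does not say what to do when a distance is exactly zero; this sketch makes no attempt to handle that case and would raise ZeroDivisionError.

```python
def inverse_distance_weights(distances):
    """Weight of each cluster = (1 / distance) divided by the sum of all inverse distances."""
    inverses = [1.0 / d for d in distances]
    total = sum(inverses)
    return [inv / total for inv in inverses]

# Hypothetical distances from a record to two cluster means.
weights = inverse_distance_weights([1.0, 4.0])
print(weights)
```

With distances 1 and 4, the inverse distances are 1 and 0.25, so the nearer cluster receives weight 0.8 and the farther one 0.2. The weights always sum to 1.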
The missing attribute of the record is completed as follows: the weighted sum of the cluster sample means is computed using the corresponding weights, and the result is taken as the value of the missing attribute.
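The completion step can be sketched as a weighted sum over the cluster means of the missing attribute. The function name, data layout, and numbers below are hypothetical and serve only to illustrate the arithmetic.

```python
def complete_missing_attribute(record, cluster_means, weights, missing_index):
    """Fill the missing attribute with the weighted sum of the cluster means for that attribute."""
    estimate = sum(w * mu[missing_index] for w, mu in zip(weights, cluster_means))
    completed = list(record)  # leave the input record unchanged
    completed[missing_index] = estimate
    return completed

# Hypothetical cluster means and weights (the weights must sum to 1).
cluster_means = [[1.0, 2.0, 10.0], [4.0, 7.0, 40.0]]
weights = [0.8, 0.2]

completed = complete_missing_attribute([1.0, 3.0, None], cluster_means, weights, 2)
print(completed)
```

In this example the estimate is 0.8 × 10 + 0.2 × 40 = 16, which lies between the two cluster means and closer to that of the more similar cluster.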
The beneficial effects of the invention are as follows.
The invention provides a method for completing missing attributes in big data, comprising four stages: clustering the training samples; computing the similarity between the record to be completed and each cluster; determining a weight for each cluster; and completing the missing attribute. Training samples whose attributes are complete are first grouped into a specified number of clusters; weights are then determined from the similarity between the record with the missing attribute and each cluster; finally, the missing attribute is filled with the weighted sum of the clusters' attribute values. Because the method takes into account the association between the incomplete record and the other samples, the completed attribute values are more accurate, and later applications that use the data are not adversely affected.
Brief description of the drawings
Fig. 1 is a schematic flow diagram of the four stages of the method;
Fig. 2 is a detailed flow diagram of the method.
Detailed description of the embodiments
The invention provides a method for completing missing attributes in big data:
Big data whose attributes are complete is used as training samples and grouped into a specified number of clusters. The similarity between the record with the missing attribute and each cluster is computed, weights are determined from that similarity, and the missing attribute of the record is filled with the weighted sum of the clusters' attribute values.
The invention is further described below with reference to the accompanying drawings.
The method comprises four stages:
A1: training sample clustering stage;
A2: stage of computing the similarity between the record to be completed and each cluster;
A3: cluster weight determination stage;
A4: missing attribute completion stage.
In stage A1, preprocessed, well-formed sample records with complete attributes are randomly sampled from the data warehouse and divided into k clusters using the k-medoids (k central points) clustering method, where k is chosen according to the anticipated number of data categories; the sample mean of each cluster is then computed.
In stage A2, similarity is computed as follows: with the missing attribute removed, the distance between the record to be completed and each cluster sample mean from stage A1 is calculated.
In stage A3, the weights are computed as follows: the reciprocal of each distance from stage A2 is taken, the reciprocals are summed, and the ratio of each reciprocal to that sum is used as the weight.
In stage A4, the attribute is completed as follows: the weighted sum of the means from stage A1 is computed using the corresponding weights from stage A3, and the result is taken as the value of the missing attribute.
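Stages A1 to A4 can be put together as in the following sketch. It is written under several assumptions that the source does not fix: a simple alternating k-medoids variant stands in for the "k central points" clustering, distances are Euclidean, a Python list stands in for the data warehouse, and all function and parameter names are hypothetical. A record that coincides exactly with a cluster mean on its observed attributes would cause a division by zero, a case the source does not address.

```python
import math
import random

def k_medoids(samples, k, iterations=20, seed=0):
    """Simple alternating k-medoids: assign to the nearest medoid, then re-pick each medoid."""
    rng = random.Random(seed)
    medoids = rng.sample(samples, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in samples:
            nearest = min(range(k), key=lambda j: math.dist(p, medoids[j]))
            clusters[nearest].append(p)
        # New medoid = the member with the smallest total distance to the other members.
        medoids = [
            min(members, key=lambda c: sum(math.dist(c, q) for q in members))
            if members else medoids[j]
            for j, members in enumerate(clusters)
        ]
    return [c for c in clusters if c]  # drop empty clusters

def impute(record, missing_index, complete_rows, k, sample_size, seed=0):
    """Complete record[missing_index] following stages A1 to A4."""
    rng = random.Random(seed)
    # A1: random sample of complete records, clustering, and per-cluster sample means.
    sample = rng.sample(complete_rows, min(sample_size, len(complete_rows)))
    clusters = k_medoids(sample, k, seed=seed)
    means = [[sum(col) / len(c) for col in zip(*c)] for c in clusters]
    # A2: distance to each cluster mean with the missing attribute removed.
    keep = [i for i in range(len(record)) if i != missing_index]
    dists = [math.dist([record[i] for i in keep], [mu[i] for i in keep]) for mu in means]
    # A3: normalized inverse distances as weights.
    inverses = [1.0 / d for d in dists]
    total = sum(inverses)
    weights = [inv / total for inv in inverses]
    # A4: weighted sum of the cluster means for the missing attribute.
    completed = list(record)
    completed[missing_index] = sum(w * mu[missing_index] for w, mu in zip(weights, means))
    return completed

# Hypothetical stand-in for complete records in the data warehouse.
complete_rows = [[1.0, 10.0], [1.2, 12.0], [9.0, 90.0], [9.2, 92.0]]
result = impute([1.0, None], 1, complete_rows, k=2, sample_size=4)
print(result)
```

For this toy data the cluster means are about [1.1, 11] and [9.1, 91], the distances on the observed attribute are 0.1 and 8.1, and the completed value is roughly 11.98. A global mean over the same rows would give 51.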
The computed value is then written into the record that was missing the attribute, which finishes the completion stage. Because the invention takes into account the association between the record with the missing attribute and the other samples, the completed attribute values are more accurate, and later applications that use the data are not adversely affected.
Claims (5)
1. A method for completing missing attributes in big data, characterized in that big data with complete attributes is clustered as training samples into a specified number of clusters, the similarity between the data with the missing attribute and each cluster is computed, weights are determined from that similarity, and the missing attribute of that data is completed with the weighted sum of the clusters' attribute values.
2. The method according to claim 1, characterized in that the big data with complete attributes is clustered as training samples by sampling well-formed sample data with complete attributes, dividing the samples into a number of clusters using a clustering method, and computing the sample mean of each cluster.
3. The method according to claim 2, characterized in that the similarity between the data with the missing attribute and each cluster is computed by calculating, with the missing attribute removed, the distance between that data and each cluster sample mean.
4. The method according to claim 3, characterized in that the weights are determined by taking the reciprocal of each distance between the data with the missing attribute and a cluster sample mean, summing the reciprocals, and using the ratio of each reciprocal to that sum as the weight.
5. The method according to claim 4, characterized in that the missing attribute of the data is completed by computing the weighted sum of the cluster sample means with the corresponding weights and taking the result as the value of the missing attribute.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710357512.7A CN107169520A (en) | 2017-05-19 | 2017-05-19 | Method for completing missing attributes in big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710357512.7A CN107169520A (en) | 2017-05-19 | 2017-05-19 | Method for completing missing attributes in big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107169520A (en) | 2017-09-15 |
Family
ID=59815708
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710357512.7A Pending CN107169520A (en) | Method for completing missing attributes in big data | 2017-05-19 | 2017-05-19 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107169520A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103177088A (en) * | 2013-03-08 | 2013-06-26 | 北京理工大学 | Biomedicine missing data compensation method |
CN104866578A (en) * | 2015-05-26 | 2015-08-26 | 大连理工大学 | Hybrid filling method for incomplete data |
US20150356094A1 (en) * | 2014-06-04 | 2015-12-10 | Waterline Data Science, Inc. | Systems and methods for management of data platforms |
CN106326335A (en) * | 2016-07-22 | 2017-01-11 | 浪潮集团有限公司 | Big data classification method based on significant attribute selection |
CN106407464A (en) * | 2016-10-12 | 2017-02-15 | 南京航空航天大学 | KNN-based improved missing data filling algorithm |
- 2017-05-19: CN application CN201710357512.7A, published as CN107169520A, status Pending
Non-Patent Citations (2)
Title |
---|
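王策 (Wang Ce): "一种基于k-means算法和关联规则的缺失数据填补方法" [A missing data imputation method based on the k-means algorithm and association rules], 《万方数据库》 [Wanfang Database] * |
郝胜轩, 宋宏, 周晓锋 (Hao Shengxuan, Song Hong, Zhou Xiaofeng): "基于近邻噪声处理的KNN缺失数据填补算法" [A KNN missing data imputation algorithm based on nearest-neighbor noise processing], 《计算机仿真》 [Computer Simulation] * |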
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108932301A (en) * | 2018-06-11 | 2018-12-04 | 天津科技大学 | Data filling method and device |
CN109063967A (en) * | 2018-07-03 | 2018-12-21 | 阿里巴巴集团控股有限公司 | A kind of processing method, device and the electronic equipment of air control scene characteristic tensor |
CN109063967B (en) * | 2018-07-03 | 2021-08-27 | 创新先进技术有限公司 | Processing method and device for wind control scene feature tensor and electronic equipment |
CN109710628A (en) * | 2018-12-29 | 2019-05-03 | 深圳道合信息科技有限公司 | Information processing method and device, system, computer and readable storage medium storing program for executing |
CN109710628B (en) * | 2018-12-29 | 2023-12-26 | 深圳巨湾科技有限公司 | Information processing method, information processing device, information processing system, computer and readable storage medium |
CN113010500A (en) * | 2019-12-18 | 2021-06-22 | 中国电信股份有限公司 | Processing method and processing system for DPI data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Bi et al. | MobileNet based apple leaf diseases identification | |
Zheng et al. | Oversampling method for imbalanced classification | |
US20210042664A1 (en) | Model training and service recommendation | |
CN107169520A (en) | Method for completing missing attributes in big data | |
CN108830416B (en) | Advertisement click rate prediction method based on user behaviors | |
CN108984530A (en) | A kind of detection method and detection system of network sensitive content | |
CN103761311B (en) | Sensibility classification method based on multi-source field instance migration | |
CN103106262B (en) | The method and apparatus that document classification, supporting vector machine model generate | |
CN110134787A (en) | A kind of news topic detection method | |
CN102663431B (en) | Image matching calculation method on basis of region weighting | |
CN105912716A (en) | Short text classification method and apparatus | |
CN106445954B (en) | Business object display method and device | |
CN112541529A (en) | Expression and posture fusion bimodal teaching evaluation method, device and storage medium | |
TW202042132A (en) | Method for detecting abnormal transaction node, and device | |
Zhang et al. | 3D object retrieval with multi-feature collaboration and bipartite graph matching | |
CN104915399A (en) | Recommended data processing method based on news headline and recommended data processing method system based on news headline | |
CN103336832A (en) | Video classifier construction method based on quality metadata | |
CN107194207A (en) | Protein ligands binding site estimation method based on granularity support vector machine ensembles | |
CN104318241A (en) | Local density spectral clustering similarity measurement algorithm based on Self-tuning | |
CN103488689A (en) | Mail classification method and mail classification system based on clustering | |
CN109272056A (en) | The method of data balancing method and raising data classification performance based on pseudo- negative sample | |
CN105609116A (en) | Speech emotional dimensions region automatic recognition method | |
CN103744958B (en) | A kind of Web page classification method based on Distributed Calculation | |
Liang et al. | MOPSO-based CNN for keyword selection on Google ads | |
CN106354787A (en) | Entity coreference resolution method based on similarity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170915 |