CN106326335A - Big data classification method based on significant attribute selection - Google Patents

Publication number
CN106326335A
Authority
CN
China
Prior art keywords
group
sample
attribute
big data
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610585702.XA
Other languages
Chinese (zh)
Inventor
郝虹
刘强
于治楼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Group Co Ltd
Original Assignee
Inspur Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Group Co Ltd
Priority to CN201610585702.XA
Publication of CN106326335A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Abstract

The invention belongs to the technical field of computers, and particularly relates to a big data classification method based on significant attribute selection. The method comprises the following steps: A, sampling, grouping and preprocessing samples; B, training the attribute weights of each group; and C, training a classification model for each group. Compared with the prior art, the method reduces the computational burden, improves classification accuracy and, in particular, mitigates the loss of classification accuracy when new samples are missing some non-significant attributes.

Description

A big data classification method based on significant attribute selection
Technical field
The present invention relates to the field of data processing, and specifically provides a big data classification method based on significant attribute selection.
Background art
In the real world, all kinds of data are produced constantly, and these data contain abundant latent knowledge. Decision makers in every industry have recognized the value of big data and use new technologies such as cloud computing and data mining to extract knowledge from it in support of decision-making. Because data sources are diverse and real data are complex, a considerable portion of the collected data carries insufficient information or has missing attributes. The data are therefore incomplete, and such data are difficult to classify.
In addition, existing big data mining methods are all built on data preprocessing. To mine latent knowledge as fully as possible, the preprocessing stage retains as many attributes of the collected data as it can. This increases the amount of later computation, and some irrelevant attributes also reduce mining accuracy.
Summary of the invention
The technical task of the present invention is to address the above deficiencies of the prior art by providing a big data classification method based on significant attribute selection. The method not only solves the problem of classifying new data with missing attributes, but also reduces the computational load on the system and improves classification accuracy.
The technical task of the present invention is achieved in the following manner: a big data classification method based on significant attribute selection, characterized by comprising the following steps:
A. Sample sampling, grouping and preprocessing
A1. Randomly sample the already processed, normalized sample data in the data warehouse, group the samples by label, and calculate the sample mean of each group;
A2. Within each group, randomly assign an initial coefficient to each attribute dimension of the sample data, calculate the average weighted distance between the samples in the group and the group's sample mean, and calculate the weighted average distance to the sample means of the other groups;
B. Attribute weight training for each group
Optimize the attribute coefficients of each group, determine the significant attributes of each group from the coefficients, and use them as the group attributes;
C. Classification model training for each group
Each group uses a binary classifier. The sample attributes selected during training are the group attributes, and the training result is one binary classifier per group.
Preferably, in step B the attribute coefficients of each group are optimized on the principle of minimizing the intra-group distance and maximizing the inter-group distance.
Preferably, the specific method of step C is:
Select a certain number of attributes with larger coefficients as the target group attributes, retain these attributes in the classification training samples, and use these training samples to train the binary classifier of the target group.
The big data classification method based on significant attribute selection of the present invention can extract a different set of significant attributes for each data label. Compared with the prior art, it has the following outstanding beneficial effects:
(1) it reduces the amount of computation at the classification stage and improves classification accuracy;
(2) even when new samples lack some non-significant attributes, the impact on classification accuracy is mitigated.
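Effect (2) can be illustrated with a small sketch. It assumes samples are stored as dictionaries with `None` marking a missing value; the attribute names (`age`, `income`, `zip_code`) and the helper names are hypothetical and do not come from the patent. The point is only that a sample missing a non-significant attribute can still be projected onto the group attributes and passed to that group's classifier.

```python
def usable_for_group(sample, group_attrs):
    """Return True if every group (significant) attribute is present in the sample."""
    return all(sample.get(name) is not None for name in group_attrs)

def project(sample, group_attrs):
    """Keep only the group attributes, in a fixed order, as a feature vector."""
    return [sample[name] for name in group_attrs]

# Hypothetical group attributes chosen by the weight training for one group.
group_attrs = ["age", "income"]

# A new sample whose non-significant attribute "zip_code" is missing.
new_sample = {"age": 41, "income": 5200.0, "zip_code": None}

ok = usable_for_group(new_sample, group_attrs)
features = project(new_sample, group_attrs) if ok else None
```

Because `zip_code` is not a group attribute, its absence does not change the feature vector that the group's classifier receives.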
Brief description of the drawings
Fig. 1 is a flow chart of the big data classification method based on significant attribute selection of the present invention;
Fig. 2 is a flow chart of the method of the embodiment.
Detailed description of the embodiments
The big data classification method based on significant attribute selection of the present invention is described in further detail below with reference to the drawings and the embodiment.
As shown in Fig. 1, the big data classification method based on significant attribute selection of the present invention comprises the following steps:
A. Sample sampling, grouping and preprocessing
A1. Randomly sample the already processed, normalized sample data in the data warehouse, group the samples by label, and calculate the sample mean of each group;
A2. Within each group, randomly assign an initial coefficient to each attribute dimension of the sample data, calculate the average weighted distance between the samples in the group and the group's sample mean, and calculate the weighted average distance to the sample means of the other groups;
B. Attribute weight training for each group
Optimize the attribute coefficients of each group on the principle of minimizing the intra-group distance and maximizing the inter-group distance, determine the significant attributes of each group from the coefficients, and use them as the group attributes;
C. Classification model training for each group
Each group uses a binary classifier. The sample attributes selected during training are the group attributes, and the training result is one binary classifier per group.
As shown in Fig. 2, the specific steps of the above method are:
Step 1: Randomly sample the already processed, normalized sample data in the data warehouse, group the samples by label, and calculate the sample mean of each group according to Formula 1.
Formula 1 is:
$$\bar{x}_d = \frac{1}{m}\sum_{i=1}^{m} x_i$$
where $m$ is the number of samples in the group, $x_i$ is a sample in the group, and $\bar{x}_d$ is the sample mean of the target group.
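Step 1 can be sketched in a few lines of plain Python. This is a minimal illustration that assumes each sample is a list of numeric attribute values with one label per sample; the function name `group_means` and the toy data are illustrative and not part of the patent.

```python
from collections import defaultdict

def group_means(samples, labels):
    """Group samples by label and return {label: mean vector} as in Formula 1."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    means = {}
    for y, rows in groups.items():
        m = len(rows)
        # Column-wise average: (1/m) * sum of x_i over the samples in the group.
        means[y] = [sum(col) / m for col in zip(*rows)]
    return means

# Toy data: two samples labelled "a", one labelled "b".
samples = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
labels = ["a", "a", "b"]
means = group_means(samples, labels)
```

Here `means["a"]` is the attribute-wise average of the two "a" samples, and a single-sample group is its own mean.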
Step 2: Randomly assign an initial coefficient to each attribute dimension of the target group's sample data, calculate the average weighted distance between the samples in the group and the group's sample mean, and calculate the weighted average distance to the sample means of the other groups.
The attribute coefficients of the sample data form a column vector, expressed as:
$$\vec{\partial}_d = (\partial_1, \partial_2, \ldots, \partial_f)^T$$
where $f$ is the number of sample attributes and $\vec{\partial}_d$ is the attribute coefficient vector of the target group; the initial coefficients are constrained to sum to 1.
The average weighted distance between the samples in the group and the group's sample mean is calculated according to Formula 2.
Formula 2 is: (formula image not reproduced in this text)
where $m$ and the other symbols are as defined above, and $D_1$ is the average weighted distance between the samples in the group and the group's sample mean.
The weighted average distance to the sample means of the other groups is calculated according to Formula 3.
Formula 3 is: (formula image not reproduced in this text)
where the other symbols are as defined above, $n$ is the number of groups, and $D_2$ is the weighted average distance to the sample means of the other groups.
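The bodies of Formula 2 and Formula 3 did not survive in this text, so their exact form is unknown. Purely as an assumption, one form consistent with the surrounding definitions is a coefficient-weighted Euclidean distance, with $x_{ik}$ the $k$-th attribute of sample $x_i$, $\bar{x}_{dk}$ the $k$-th attribute of the target group mean, and $\bar{x}_{jk}$ the $k$-th attribute of the mean of another group $j$. The choice of distance and the averaging over the $n-1$ other groups are both assumptions, not the patent's stated formulas.

```latex
% Assumed form of Formula 2: mean weighted distance from in-group samples to the group mean
D_1 = \frac{1}{m}\sum_{i=1}^{m}\sqrt{\sum_{k=1}^{f}\partial_k\,\bigl(x_{ik}-\bar{x}_{dk}\bigr)^2}

% Assumed form of Formula 3: mean weighted distance from the group mean to the other group means
D_2 = \frac{1}{n-1}\sum_{j \neq d}\sqrt{\sum_{k=1}^{f}\partial_k\,\bigl(\bar{x}_{dk}-\bar{x}_{jk}\bigr)^2}
```

Under this reading, an attribute that varies widely inside the group inflates $D_1$, while an attribute on which the group means differ inflates $D_2$.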
Step 3: On the principle of minimizing the intra-group distance and maximizing the inter-group distance, use a mathematical optimization method to adjust the attribute coefficients dynamically and thereby train the attribute weights. For example, one may choose to minimize a combined expression of the two distances as the training objective (the expression itself is not reproduced in this text).
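The weight training of Step 3 can be sketched as follows. The patent does not fix the distance, the objective or the optimizer in the text available here, so every choice below is an assumption: a coefficient-weighted Euclidean distance, the ratio $D_1 / D_2$ as the objective, and a simple random coordinate search that keeps the coefficients non-negative and summing to 1. All function names are illustrative.

```python
import math
import random

def weighted_dist(w, a, b):
    """Assumed weighted Euclidean distance between vectors a and b."""
    return math.sqrt(sum(wk * (ak - bk) ** 2 for wk, ak, bk in zip(w, a, b)))

def objective(w, group, mean, other_means):
    """Assumed objective D1 / D2: small for a tight, well separated group."""
    d1 = sum(weighted_dist(w, x, mean) for x in group) / len(group)
    d2 = sum(weighted_dist(w, mean, mu) for mu in other_means) / len(other_means)
    return d1 / (d2 + 1e-12)  # small constant avoids division by zero

def normalize(w):
    """Rescale coefficients so that they sum to 1."""
    total = sum(w)
    return [wk / total for wk in w]

def train_weights(group, mean, other_means, iters=2000, step=0.1, seed=0):
    """Random coordinate search over non-negative coefficients summing to 1."""
    rng = random.Random(seed)
    f = len(mean)
    w = normalize([rng.random() + 1e-6 for _ in range(f)])  # random initial coefficients
    best = objective(w, group, mean, other_means)
    for _ in range(iters):
        k = rng.randrange(f)
        cand = list(w)
        cand[k] = max(0.0, cand[k] + rng.uniform(-step, step))
        if sum(cand) == 0.0:
            continue
        cand = normalize(cand)
        score = objective(cand, group, mean, other_means)
        if score < best:  # keep the change only if the objective improves
            w, best = cand, score
    return w

# Toy group: attribute 0 is constant inside the group, attribute 1 is noisy.
group = [[0.0, 0.0], [0.0, 10.0], [0.0, -10.0]]
mean = [0.0, 0.0]
other_means = [[5.0, 0.0]]  # the other group differs only on attribute 0
w = train_weights(group, mean, other_means)
```

In this toy case the search shifts weight toward attribute 0, which separates the groups, and away from attribute 1, which only adds intra-group spread. A gradient-based or constrained solver could replace the random search without changing the idea.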
Step 4: Select a certain number of attributes with larger coefficients as the target group attributes, retain these attributes in the classification training samples, and use these training samples to train the binary classifier of the target group.
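Step 4 can be sketched as attribute selection followed by one-vs-rest training. The patent does not name a specific binary classifier, so the nearest-centroid classifier below is a stand-in chosen for brevity, and the number of retained attributes `k`, the example coefficients and the function names are assumptions.

```python
def top_k_attributes(weights, k):
    """Indices of the k largest coefficients, returned in ascending index order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])

def project(x, attrs):
    """Keep only the selected group attributes of a sample."""
    return [x[i] for i in attrs]

def centroid(rows):
    """Attribute-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def train_binary(samples, labels, target, attrs):
    """Train a one-vs-rest nearest-centroid classifier for the target group."""
    pos = [project(x, attrs) for x, y in zip(samples, labels) if y == target]
    neg = [project(x, attrs) for x, y in zip(samples, labels) if y != target]
    c_pos, c_neg = centroid(pos), centroid(neg)

    def predict(x):
        z = project(x, attrs)
        return sq_dist(z, c_pos) < sq_dist(z, c_neg)

    return predict

# Hypothetical trained coefficients for group "a": attribute 1 is non-significant.
weights = [0.7, 0.05, 0.25]
attrs = top_k_attributes(weights, 2)

samples = [[0.0, 5.0, 1.0], [0.2, -3.0, 1.1], [4.0, 4.0, 3.0], [4.2, -4.0, 3.2]]
labels = ["a", "a", "b", "b"]
is_a = train_binary(samples, labels, "a", attrs)
```

Since attribute 1 is dropped before training, its value in a new sample has no effect on the prediction, which is the behaviour the method relies on when non-significant attributes are missing.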

Claims (3)

1. A big data classification method based on significant attribute selection, characterized by comprising the following steps:
A. Sample sampling, grouping and preprocessing
A1. Randomly sample the already processed, normalized sample data in the data warehouse, group the samples by label, and calculate the sample mean of each group;
A2. Within each group, randomly assign an initial coefficient to each attribute dimension of the sample data, calculate the average weighted distance between the samples in the group and the group's sample mean, and calculate the weighted average distance to the sample means of the other groups;
B. Attribute weight training for each group
Optimize the attribute coefficients of each group, determine the significant attributes of each group from the coefficients, and use them as the group attributes;
C. Classification model training for each group
Each group uses a binary classifier. The sample attributes selected during training are the group attributes, and the training result is one binary classifier per group.
2. The big data classification method based on significant attribute selection according to claim 1, characterized in that: in step B, the attribute coefficients of each group are optimized on the principle of minimizing the intra-group distance and maximizing the inter-group distance.
3. The big data classification method based on significant attribute selection according to claim 1 or 2, characterized in that: in step C, attributes with larger coefficients are selected as the target group attributes, these attributes are retained in the classification training samples, and these training samples are used to train the binary classifier of the target group.
CN201610585702.XA 2016-07-22 2016-07-22 Big data classification method based on significant attribute selection Pending CN106326335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610585702.XA CN106326335A (en) 2016-07-22 2016-07-22 Big data classification method based on significant attribute selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610585702.XA CN106326335A (en) 2016-07-22 2016-07-22 Big data classification method based on significant attribute selection

Publications (1)

Publication Number Publication Date
CN106326335A true CN106326335A (en) 2017-01-11

Family

ID=57740254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610585702.XA Pending CN106326335A (en) 2016-07-22 2016-07-22 Big data classification method based on significant attribute selection

Country Status (1)

Country Link
CN (1) CN106326335A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169520A (en) * 2017-05-19 2017-09-15 济南浪潮高新科技投资发展有限公司 A method for completing missing attributes in big data
CN107248514A (en) * 2017-06-06 2017-10-13 上海华力微电子有限公司 A new ESD protection structure and its implementation method
TWI677843B (en) * 2017-09-15 2019-11-21 群益金鼎證券股份有限公司 Intelligent cluster suggestion system and method
CN113033722A (en) * 2021-05-31 2021-06-25 中铁第一勘察设计院集团有限公司 Sensor data fusion method and device, storage medium and computing equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050027678A1 (en) * 2003-07-30 2005-02-03 International Business Machines Corporation Computer executable dimension reduction and retrieval engine
CN104142986A (en) * 2014-07-24 2014-11-12 中国软件与技术服务股份有限公司 Big data situation analysis early warning method and system based on clustering
CN104156403A (en) * 2014-07-24 2014-11-19 中国软件与技术服务股份有限公司 Clustering-based big data normal-mode extracting method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
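刘永军: "Research and implementation of attribute selection algorithms for large data sets", China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology *
张靖: "Research on classification feature selection algorithms for high-dimensional small-sample data", China Doctoral Dissertations Full-text Database, Information Science and Technology *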

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169520A (en) * 2017-05-19 2017-09-15 济南浪潮高新科技投资发展有限公司 A method for completing missing attributes in big data
CN107248514A (en) * 2017-06-06 2017-10-13 上海华力微电子有限公司 A new ESD protection structure and its implementation method
CN107248514B (en) * 2017-06-06 2019-11-22 上海华力微电子有限公司 A new ESD protection structure and its implementation method
TWI677843B (en) * 2017-09-15 2019-11-21 群益金鼎證券股份有限公司 Intelligent cluster suggestion system and method
CN113033722A (en) * 2021-05-31 2021-06-25 中铁第一勘察设计院集团有限公司 Sensor data fusion method and device, storage medium and computing equipment
CN113033722B (en) * 2021-05-31 2021-08-17 中铁第一勘察设计院集团有限公司 Sensor data fusion method and device, storage medium and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170111