CN106326335A - Big data classification method based on significant attribute selection - Google Patents
- Publication number: CN106326335A
- Application number: CN201610585702.XA
- Authority: CN (China)
- Prior art keywords: group, sample, attribute, big data, method based
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
Abstract
The invention belongs to the technical field of computers, and particularly relates to a big data classification method based on significant attribute selection. The big data classification method based on significant attribute selection comprises the following steps: A, sampling, grouping and preprocessing samples; B, training attribute weight of each group; and C, training a classification model of each group. Compared with the prior art, the big data classification method based on significant attribute selection is capable of reducing computation burden, improving classification accuracy and particularly alleviating influence on classification accuracy when new samples lose part of non-significant attributes.
Description
Technical field
The present invention relates to the field of data processing, and specifically provides a big data classification method based on significant attribute selection.
Background
In the real world, all kinds of data are produced at every moment, and these data contain rich latent knowledge. Decision makers in all industries have recognized the value of big data, and use new technologies such as cloud computing and data mining to extract knowledge from big data in support of decision making. Because data sources are diverse and real data are complex, a considerable portion of the collected big data carries insufficient information or has missing attributes. Such data are incomplete and therefore difficult to classify.
In addition, existing big data mining methods are all built on data preprocessing. To mine latent knowledge as fully as possible, the preprocessing stage retains as many attributes of the collected data as it can. This increases the amount of downstream computation, and some irrelevant attributes also reduce mining accuracy.
Summary of the invention
The technical task of the present invention is to address the above deficiencies of the prior art by providing a big data classification method based on significant attribute selection. The method not only solves the problem of classifying new data whose attributes are missing, but also reduces the statistical computation load on the system and improves classification accuracy.
The technical task of the present invention is achieved as follows: a big data classification method based on significant attribute selection, characterized by comprising the following steps:
A. Sample sampling, grouping, and preprocessing
A1. Randomly sample the processed, normalized sample data in the data warehouse, group the samples by label, and compute the sample mean of each group;
A2. In each group, randomly assign an initial coefficient to each attribute dimension of the sample data, then compute the average weighted distance between the samples in the group and the group's sample mean, as well as the average weighted distance to the sample means of the other groups;
B. Attribute weight training for each group
Optimize the attribute coefficients of each group, determine the significant attributes of each group from the coefficients, and use them as the group attributes;
C. Classification model training for each group
Each group uses a binary classifier. The sample attributes used in training are the group attributes, and the training result is one binary classifier per group.
Preferably, in step B the attribute coefficients of each group are optimized according to the principle of minimizing the intra-group distance and maximizing the inter-group distance.
Preferably, step C is specifically carried out as follows:
Select a certain number of attributes with larger coefficients as the target group attributes, retain these attributes of the classification training samples, and use these training samples to train the binary classifier of the target group.
The big data classification method based on significant attribute selection of the present invention can extract a different set of significant attributes for each data label. Compared with the prior art, it has the following notable beneficial effects:
(1) it can reduce the amount of computation in the classification stage and improve classification accuracy;
(2) even when a new sample is missing some non-significant attributes, the impact on classification accuracy is alleviated and classification accuracy is improved.
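Effect (2) can be illustrated with a minimal sketch. It assumes that a group's classifier reads only that group's significant attributes, so a missing value elsewhere does not block classification. The helper names `can_classify` and `project`, the attribute indices, and the use of NaN to mark a missing value are all illustrative assumptions, not part of the patent text.

```python
import numpy as np

def can_classify(sample, group_attrs):
    """True if every significant attribute of the group is present (not NaN)."""
    return not np.isnan(sample[group_attrs]).any()

def project(sample, group_attrs):
    """Keep only the group's significant attributes, in index order."""
    return sample[group_attrs]

# Hypothetical group whose significant attributes are columns 0 and 2.
group_attrs = np.array([0, 2])
# New sample with attribute 1 missing (marked as NaN by assumption).
sample = np.array([1.5, np.nan, 0.5])

usable = can_classify(sample, group_attrs)
features = project(sample, group_attrs)
```

Under these assumptions the sample is still usable for this group, because the missing attribute is not among the ones its classifier consumes.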
Brief description of the drawings
Fig. 1 is a flow chart of the big data classification method based on significant attribute selection of the present invention;
Fig. 2 is a flow chart of the method of the embodiment.
Detailed description of the embodiments
The big data classification method based on significant attribute selection of the present invention is described in further detail below with reference to the drawings and the embodiment.
As shown in Fig. 1, the big data classification method based on significant attribute selection of the present invention comprises the following steps:
A. Sample sampling, grouping, and preprocessing
A1. Randomly sample the processed, normalized sample data in the data warehouse, group the samples by label, and compute the sample mean of each group;
A2. In each group, randomly assign an initial coefficient to each attribute dimension of the sample data, then compute the average weighted distance between the samples in the group and the group's sample mean, as well as the average weighted distance to the sample means of the other groups;
B. Attribute weight training for each group
Optimize the attribute coefficients of each group according to the principle of minimizing the intra-group distance and maximizing the inter-group distance, determine the significant attributes of each group from the coefficients, and use them as the group attributes;
C. Classification model training for each group
Each group uses a binary classifier. The sample attributes used in training are the group attributes, and the training result is one binary classifier per group.
As shown in Fig. 2, the specific steps of the above method are:
Step 1: Randomly sample the processed, normalized sample data in the data warehouse, group the samples by label, and compute the sample mean of each group according to Formula 1.
Formula 1 is:
$$x_d = \frac{1}{m}\sum_{i=1}^{m} x_i$$
where m is the number of samples in the group, x_i is a sample in the group, and x_d is the sample mean of the target group.
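Step 1 can be sketched in a few lines of NumPy. The function name `group_means` and the toy data are illustrative assumptions. The computation itself is the per-group arithmetic mean described above.

```python
import numpy as np

def group_means(X, y):
    """Return {label: mean vector} for samples X (n x f) grouped by label y."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

# Toy data: four samples with two attributes and two labels.
X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [12.0, 2.0]])
y = np.array([0, 0, 1, 1])

means = group_means(X, y)
```

Each entry of `means` plays the role of x_d for the corresponding group.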
Step 2: Randomly assign an initial coefficient to each attribute dimension of the target group's sample data, then compute the average weighted distance between the samples in the group and the sample mean, as well as the average weighted distance to the sample means of the other groups.
The attribute coefficients of the sample data form a column vector with f entries, where f is the number of sample attributes. This vector is the attribute coefficient vector of the target group, and the initial coefficients are constrained to sum to 1. [The expression for the vector is not reproduced in the source text.]
The average weighted distance between the samples in the group and the sample mean is computed according to Formula 2.
Formula 2 is: [formula not reproduced in the source text]
where D_1 is the average weighted distance between the samples in the group and the sample mean, and m and the other symbols are as described above.
The average weighted distance to the sample means of the other groups is computed according to Formula 3.
Formula 3 is: [formula not reproduced in the source text]
where n is the number of groups, D_2 is the average weighted distance to the sample means of the other groups, and the other symbols are as described above.
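Because Formulas 2 and 3 did not survive in the text, the following is only a sketch under stated assumptions. It assumes a weighted Euclidean distance, and it assumes the inter-group quantity is measured between the target group's mean and the means of the other n - 1 groups. The symbol `w` for the coefficient vector and all function names are hypothetical.

```python
import numpy as np

def weighted_distance(a, b, w):
    # Assumed form: sqrt(sum_k w_k * (a_k - b_k)^2), applied row-wise.
    return np.sqrt(np.sum(w * (a - b) ** 2, axis=-1))

def intra_group_distance(X_d, w):
    """Assumed D_1: mean weighted distance from each group sample to the group mean."""
    x_d = X_d.mean(axis=0)
    return weighted_distance(X_d, x_d, w).mean()

def inter_group_distance(x_d, other_means, w):
    """Assumed D_2: mean weighted distance from the group mean to the other group means."""
    return weighted_distance(np.asarray(other_means), x_d, w).mean()

X_d = np.array([[0.0, 0.0], [2.0, 0.0]])   # target group samples
other_means = np.array([[1.0, 4.0]])       # means of the other groups
w = np.array([0.5, 0.5])                   # coefficients summing to 1

D1 = intra_group_distance(X_d, w)
D2 = inter_group_distance(X_d.mean(axis=0), other_means, w)
```

Other distance forms (for example squared or absolute differences) would fit the same description, so the choice here should be read as one possibility only.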
Step 3: Following the principle of minimizing the intra-group distance and maximizing the inter-group distance, use a mathematical optimization method to dynamically adjust the attribute coefficients and thereby train the attribute weights. For example, a chosen objective expression can be minimized as the optimization training [the expression is not reproduced in the source text].
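One way to realize Step 3 is sketched below. The objective is lost from the text, so the ratio D_1 / D_2 is an assumption, as are the weighted Euclidean distance, the softmax parametrization that keeps the coefficients positive and summing to 1, and the finite-difference gradient descent. The sketch starts from uniform coefficients for reproducibility, whereas the text describes random initial coefficients. All names are hypothetical.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def objective(w, X_d, other_means):
    """Assumed objective D_1 / D_2: small intra-group, large inter-group distance."""
    x_d = X_d.mean(axis=0)
    d1 = np.sqrt(((X_d - x_d) ** 2 * w).sum(axis=1)).mean()
    d2 = np.sqrt(((other_means - x_d) ** 2 * w).sum(axis=1)).mean()
    return d1 / (d2 + 1e-12)

def train_weights(X_d, other_means, steps=200, lr=0.5, eps=1e-5):
    f = X_d.shape[1]
    theta = np.zeros(f)  # softmax(0) is uniform, so coefficients sum to 1
    for _ in range(steps):
        base = objective(softmax(theta), X_d, other_means)
        grad = np.zeros(f)
        for k in range(f):  # forward finite-difference gradient
            t = theta.copy()
            t[k] += eps
            grad[k] = (objective(softmax(t), X_d, other_means) - base) / eps
        theta -= lr * grad
    return softmax(theta)

# Attribute 0 separates the groups; attribute 1 is mostly noise.
X_d = np.array([[0.0, 5.0], [0.2, -5.0], [-0.2, 0.0]])
other_means = np.array([[4.0, 0.5]])

w = train_weights(X_d, other_means)
```

With this objective the weight tends to concentrate on the most discriminative attribute, so a fixed number of steps or a regularization term would be a reasonable design choice if several attributes should remain significant.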
Step 4: Select a certain number of attributes with larger coefficients as the target group attributes, retain these attributes of the classification training samples, and use these training samples to train the binary classifier of the target group.
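Step 4 can be sketched as follows. The text does not name a particular binary classifier, so plain logistic regression trained by gradient descent is used here as a stand-in. The coefficient values, the number of retained attributes `k`, and all function names are illustrative assumptions.

```python
import numpy as np

def select_significant(w, k):
    """Indices of the k attributes with the largest coefficients."""
    return np.sort(np.argsort(w)[::-1][:k])

def train_binary_classifier(X, t, steps=500, lr=0.1):
    """Logistic regression by gradient descent; returns (weights, bias)."""
    beta = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ beta + b)))
        beta -= lr * X.T @ (p - t) / len(t)
        b -= lr * np.mean(p - t)
    return beta, b

def predict(X, beta, b):
    return (X @ beta + b > 0).astype(int)

X = np.array([[-2.0, 9.0, 0.0], [-1.5, -7.0, 0.5], [-2.5, 3.0, -0.5],
              [2.0, 8.0, 0.0], [1.5, -6.0, 0.5], [2.5, 1.0, -0.5]])
y = np.array([0, 0, 0, 1, 1, 1])

w_group1 = np.array([0.7, 0.05, 0.25])   # hypothetical trained coefficients
idx = select_significant(w_group1, k=2)  # group attributes of group 1
t = (y == 1).astype(float)               # group 1 versus all other samples
beta, b = train_binary_classifier(X[:, idx], t)
pred = predict(X[:, idx], beta, b)
```

Repeating this for every label gives one binary classifier per group, each trained on its own subset of attributes.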
Claims (3)
1. A big data classification method based on significant attribute selection, characterized by comprising the following steps:
A. Sample sampling, grouping, and preprocessing
A1. Randomly sample the processed, normalized sample data in the data warehouse, group the samples by label, and compute the sample mean of each group;
A2. In each group, randomly assign an initial coefficient to each attribute dimension of the sample data, then compute the average weighted distance between the samples in the group and the group's sample mean, as well as the average weighted distance to the sample means of the other groups;
B. Attribute weight training for each group
Optimize the attribute coefficients of each group, determine the significant attributes of each group from the coefficients, and use them as the group attributes;
C. Classification model training for each group
Each group uses a binary classifier, the sample attributes used in training are the group attributes, and the training result is one binary classifier per group.
2. The big data classification method based on significant attribute selection according to claim 1, characterized in that: in step B, the attribute coefficients of each group are optimized according to the principle of minimizing the intra-group distance and maximizing the inter-group distance.
3. The big data classification method based on significant attribute selection according to claim 1 or 2, characterized in that: in step C, attributes with larger coefficients are selected as the target group attributes, these attributes of the classification training samples are retained, and these training samples are used to train the binary classifier of the target group.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610585702.XA CN106326335A (en) | 2016-07-22 | 2016-07-22 | Big data classification method based on significant attribute selection |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610585702.XA CN106326335A (en) | 2016-07-22 | 2016-07-22 | Big data classification method based on significant attribute selection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106326335A true CN106326335A (en) | 2017-01-11 |
Family ID=57740254
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610585702.XA Pending CN106326335A (en) | 2016-07-22 | 2016-07-22 | Big data classification method based on significant attribute selection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106326335A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107169520A (en) * | 2017-05-19 | 2017-09-15 | 济南浪潮高新科技投资发展有限公司 | A kind of big data lacks attribute complementing method |
CN107248514A (en) * | 2017-06-06 | 2017-10-13 | 上海华力微电子有限公司 | A kind of new E SD protection structures and its implementation |
TWI677843B (en) * | 2017-09-15 | 2019-11-21 | 群益金鼎證券股份有限公司 | Intelligent cluster suggestion system and method |
CN113033722A (en) * | 2021-05-31 | 2021-06-25 | 中铁第一勘察设计院集团有限公司 | Sensor data fusion method and device, storage medium and computing equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050027678A1 (en) * | 2003-07-30 | 2005-02-03 | International Business Machines Corporation | Computer executable dimension reduction and retrieval engine |
CN104142986A (en) * | 2014-07-24 | 2014-11-12 | 中国软件与技术服务股份有限公司 | Big data situation analysis early warning method and system based on clustering |
CN104156403A (en) * | 2014-07-24 | 2014-11-19 | 中国软件与技术服务股份有限公司 | Clustering-based big data normal-mode extracting method and system |
Non-Patent Citations (2)
Title |
---|
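刘永军 (Liu Yongjun): "Research and Implementation of Attribute Selection Algorithms for Large Data Sets" ("大数据集的属性选择算法的研究与实现"), China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology * |
张靖 (Zhang Jing): "Research on Classification Feature Selection Algorithms for High-Dimensional Small-Sample Data" ("面向高维小样本数据的分类特征选择算法研究"), China Doctoral Dissertations Full-text Database, Information Science and Technology * |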
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107169520A (en) * | 2017-05-19 | 2017-09-15 | 济南浪潮高新科技投资发展有限公司 | A kind of big data lacks attribute complementing method |
CN107248514A (en) * | 2017-06-06 | 2017-10-13 | 上海华力微电子有限公司 | A kind of new E SD protection structures and its implementation |
CN107248514B (en) * | 2017-06-06 | 2019-11-22 | 上海华力微电子有限公司 | A kind of new E SD protection structure and its implementation |
TWI677843B (en) * | 2017-09-15 | 2019-11-21 | 群益金鼎證券股份有限公司 | Intelligent cluster suggestion system and method |
CN113033722A (en) * | 2021-05-31 | 2021-06-25 | 中铁第一勘察设计院集团有限公司 | Sensor data fusion method and device, storage medium and computing equipment |
CN113033722B (en) * | 2021-05-31 | 2021-08-17 | 中铁第一勘察设计院集团有限公司 | Sensor data fusion method and device, storage medium and computing equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105956560B (en) | A kind of model recognizing method based on the multiple dimensioned depth convolution feature of pondization | |
CN104391942B (en) | Short essay eigen extended method based on semantic collection of illustrative plates | |
CN103678702B (en) | Video duplicate removal method and device | |
CN106326335A (en) | Big data classification method based on significant attribute selection | |
CN103487832B (en) | Supervision waveform classification is had in a kind of 3-D seismics signal | |
CN103559175B (en) | A kind of Spam Filtering System based on cluster and method | |
CN102629272A (en) | Clustering based optimization method for examination system database | |
CN105976056A (en) | Information extraction system based on bidirectional RNN | |
CN111274814B (en) | Novel semi-supervised text entity information extraction method | |
CN104036289A (en) | Hyperspectral image classification method based on spatial and spectral features and sparse representation | |
CN111046917B (en) | Object-based enhanced target detection method based on deep neural network | |
CN102289522A (en) | Method of intelligently classifying texts | |
CN105975455A (en) | Information analysis system based on bidirectional recursive neural network | |
CN105389480A (en) | Multiclass unbalanced genomics data iterative integrated feature selection method and system | |
CN103136327A (en) | Time series signifying method based on local feature cluster | |
CN107133640A (en) | Image classification method based on topography's block description and Fei Sheer vectors | |
CN104688252A (en) | Method for detecting fatigue status of driver through steering wheel rotation angle information | |
Liu et al. | Compositional balance analysis: an elegant method of geochemical pattern recognition and anomaly mapping for mineral exploration | |
CN103473556A (en) | Hierarchical support vector machine classifying method based on rejection subspace | |
CN105389471A (en) | Method for reducing training set of machine learning | |
CN109299753A (en) | A kind of integrated learning approach and system for Law Text information excavating | |
CN106203510A (en) | A kind of based on morphological feature with the hyperspectral image classification method of dictionary learning | |
CN105117740A (en) | Font identification method and device | |
CN103824063A (en) | Dynamic gesture recognition method based on sparse representation | |
CN108268461A (en) | A kind of document sorting apparatus based on hybrid classifer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170111 |