CN106326335A

CN106326335A - Big data classification method based on significant attribute selection

Info

Publication number: CN106326335A
Application number: CN201610585702.XA
Authority: CN
Inventors: 郝虹; 刘强; 于治楼
Original assignee: Inspur Group Co Ltd
Current assignee: Inspur Group Co Ltd
Priority date: 2016-07-22
Filing date: 2016-07-22
Publication date: 2017-01-11

Abstract

The invention belongs to the technical field of computers, and particularly relates to a big data classification method based on significant attribute selection. The big data classification method based on significant attribute selection comprises the following steps: A, sampling, grouping and preprocessing samples; B, training attribute weight of each group; and C, training a classification model of each group. Compared with the prior art, the big data classification method based on significant attribute selection is capable of reducing computation burden, improving classification accuracy and particularly alleviating influence on classification accuracy when new samples lose part of non-significant attributes.

Description

A big data classification method based on significant attribute selection

Technical field

The present invention relates to the field of data processing, and specifically provides a big data classification method based on significant attribute selection.

Background art

In the real world, all kinds of data are being produced at every moment, and these data contain abundant latent knowledge. Decision makers in all trades and professions have recognized the value of big data and use new technologies such as cloud computing and data mining to extract knowledge from it in support of decision-making. Because data sources are diverse and real data are complex, a considerable portion of the collected big data carries insufficient information or has missing attributes. Such data are incomplete and therefore difficult to classify.

In addition, existing big data mining methods are all built on data preprocessing. To mine latent knowledge as fully as possible, the preprocessing stage retains as many attributes of the collected data as it can. This increases the amount of computation in later stages, and some irrelevant attributes also reduce mining accuracy.

Summary of the invention

The technical task of the present invention is to address the above deficiencies of the prior art by providing a big data classification method based on significant attribute selection. The method not only solves the problem of classifying new data whose attributes are partly missing, but also reduces the statistical computation load on the system and improves classification accuracy.

The technical task of the present invention is achieved in the following manner: a big data classification method based on significant attribute selection, characterized by comprising the following steps:

A. Sample sampling, grouping, and preprocessing

A1. Randomly sample the already-processed, standardized sample data in the data warehouse, group the samples according to their labels, and calculate the sample mean of each group;

A2. Within each group, randomly assign an initial coefficient to each attribute dimension of the sample data, then calculate the average weighted distance between the samples in the group and the group sample mean, and the average weighted distance to the sample means of the other groups;

B. Attribute weight training for each group

Optimize the attribute coefficients of each group, determine the significant attributes of each group according to the coefficients, and use them as the group attributes;

C. Classification model training for each group

Each group uses a binary classifier. The sample attributes selected during training are the group attributes, and the training result is one binary classifier per group.

Preferably, step B optimizes the attribute coefficients of each group according to the principle of minimizing the intra-group distance and maximizing the inter-group distance.

Preferably, the specific method of step C is:

Select a certain number of attributes with larger coefficients as the target group attributes, retain only these attributes of the classification training samples, and use these training samples to train the binary classifier of the target group.

The big data classification method based on significant attribute selection of the present invention can extract a different set of significant attributes for each data label. Compared with the prior art, it has the following outstanding beneficial effects:

(1) It reduces the amount of computation in the classification stage and improves classification accuracy;

(2) Even when a new sample lacks some non-significant attributes, the impact on classification accuracy is alleviated, which improves classification accuracy.

Brief description of the drawings

Fig. 1 is a flow chart of the big data classification method based on significant attribute selection of the present invention;

Fig. 2 is a flow chart of the method of the embodiment.

Detailed description of the invention

The big data classification method based on significant attribute selection of the present invention is described in further detail below with reference to the drawings and an embodiment.

As shown in Fig. 1, the big data classification method based on significant attribute selection of the present invention comprises the following steps:

A. Sample sampling, grouping, and preprocessing

B. Attribute weight training for each group

Optimize the attribute coefficients of each group according to the principle of minimizing the intra-group distance and maximizing the inter-group distance, determine the significant attributes of each group according to the coefficients, and use them as the group attributes;

C. Classification model training for each group

As shown in Fig. 2, the specific steps of the above method are:

Step 1: Randomly sample the already-processed, standardized sample data in the data warehouse, group the samples according to their labels, and calculate the sample mean of each group according to formula one:

Formula one is:

\bar{x}_{d} = \frac{1}{m} \sum_{i=1}^{m} x_{i}

where m is the number of samples in the group, x_{i} is a sample in the group, and \bar{x}_{d} is the sample mean of the target group.
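Step 1 can be illustrated with a minimal sketch, assuming each sample is a list of numeric attribute values of equal length and each label is a hashable value. The function name `group_means` and the toy data are hypothetical and are not part of the patent.

```python
from collections import defaultdict


def group_means(samples, labels):
    """Group samples by label and return {label: per-attribute mean vector}."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    means = {}
    for y, rows in grouped.items():
        m = len(rows)  # number of samples in this group
        # zip(*rows) iterates over attribute columns, so each entry is (1/m) * sum_i x_i
        means[y] = [sum(col) / m for col in zip(*rows)]
    return means


# Toy data (hypothetical): three samples, two attributes, two labels.
samples = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
labels = ["a", "a", "b"]
means = group_means(samples, labels)
```

Here `means["a"]` is the per-attribute average of the two samples labelled "a", which corresponds to formula one applied to that group.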

Step 2: Randomly assign an initial coefficient to each attribute dimension of the target group sample data, then calculate the average weighted distance between the samples in the group and the group sample mean, and the average weighted distance to the sample means of the other groups.

The attribute coefficients of the sample data form a column vector, expressed as:

\vec{\partial}_{d} = (\partial_{1}, \partial_{2}, \ldots, \partial_{f})^{T}

where f is the number of sample attributes and \vec{\partial}_{d} is the attribute coefficient vector of the target group. The initial coefficients are constrained to sum to 1.

The average weighted distance between the samples in the group and the group sample mean is calculated according to formula two.

Formula two is: [formula not reproduced in the source text]

where m and the other symbols are as defined above, and D₁ is the average weighted distance between the samples in the group and the group sample mean.

The average weighted distance to the sample means of the other groups is calculated according to formula three.

Formula three is: [formula not reproduced in the source text]

where the previously introduced symbols are as defined above, n is the number of groups, and D₂ is the average weighted distance to the sample means of the other groups.
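The two distances of step 2 can be sketched as follows. Because formulas two and three are not reproduced in this text, the sketch assumes a weighted Euclidean distance, and it assumes that D₂ is measured between the target group mean and each of the other group means. Both choices, and all function names, are illustrative assumptions and not the patent's definitions.

```python
import math


def weighted_distance(x, y, w):
    """Weighted Euclidean distance between vectors x and y (assumed form)."""
    return math.sqrt(sum(wk * (xk - yk) ** 2 for xk, yk, wk in zip(x, y, w)))


def intra_group_distance(rows, mean, w):
    """D1 (assumed): average weighted distance from each group sample to the group mean."""
    return sum(weighted_distance(x, mean, w) for x in rows) / len(rows)


def inter_group_distance(target_mean, other_means, w):
    """D2 (assumed): average weighted distance from the target mean to the other group means."""
    return sum(weighted_distance(target_mean, mu, w) for mu in other_means) / len(other_means)


# Toy data (hypothetical): one target group and the means of two other groups.
rows = [[0.0, 0.0], [2.0, 0.0]]
mean = [1.0, 0.0]
others = [[1.0, 4.0], [1.0, -4.0]]
w = [0.5, 0.5]  # coefficients sum to 1
d1 = intra_group_distance(rows, mean, w)
d2 = inter_group_distance(mean, others, w)
```

In this toy case the group is compact along the first attribute and separated from the other groups along the second, so `d1` is smaller than `d2`.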

Step 3: Following the principle of minimizing the intra-group distance and maximizing the inter-group distance, use a mathematical optimization method to dynamically adjust the attribute coefficients and thereby train the attribute weights. For example, one may choose to minimize [expression not reproduced in the source text] as the training objective.
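A minimal sketch of step 3 follows. The source text does not reproduce the objective, so this sketch assumes the ratio D₁/D₂ with a weighted Euclidean distance, and it swaps in a simple exponentiated-gradient update with finite-difference gradients as the optimizer. It also starts from uniform coefficients instead of random ones so that the run is reproducible. All of these choices and names are assumptions made for illustration.

```python
import math


def wdist(x, y, w):
    """Weighted Euclidean distance (assumed form)."""
    return math.sqrt(sum(wk * (a - b) ** 2 for a, b, wk in zip(x, y, w)))


def objective(w, rows, mean, other_means):
    """Assumed objective D1 / D2: small when the group is compact and well separated."""
    d1 = sum(wdist(x, mean, w) for x in rows) / len(rows)
    d2 = sum(wdist(mean, mu, w) for mu in other_means) / len(other_means)
    return d1 / (d2 + 1e-12)  # small constant guards against division by zero


def train_weights(rows, mean, other_means, steps=200, lr=0.5, eps=1e-6):
    f = len(mean)
    w = [1.0 / f] * f  # uniform start; the coefficients sum to 1
    for _ in range(steps):
        base = objective(w, rows, mean, other_means)
        grad = []
        for k in range(f):
            bumped = list(w)
            bumped[k] += eps  # forward finite difference on coefficient k
            grad.append((objective(bumped, rows, mean, other_means) - base) / eps)
        # Multiplicative update keeps every coefficient non-negative;
        # the exponent is clipped so math.exp cannot overflow.
        w = [wk * math.exp(max(-50.0, min(50.0, -lr * g))) for wk, g in zip(w, grad)]
        total = sum(w)
        w = [wk / total for wk in w]  # renormalise so the coefficients sum to 1
    return w


# Toy data (hypothetical): attribute 0 is tight within the group and separates
# it from the other groups; attribute 1 is widely spread and uninformative.
rows = [[1.0, -3.0], [1.1, 3.0], [0.9, 0.0]]
mean = [1.0, 0.0]
others = [[5.0, 0.0], [-3.0, 0.2]]
w = train_weights(rows, mean, others)
```

On this toy data the trained coefficient of attribute 0 ends up larger than that of attribute 1, which is the behaviour the principle of small intra-group distance and large inter-group distance is meant to produce.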

Step 4: Select a certain number of attributes with larger coefficients as the target group attributes, retain only these attributes of the classification training samples, and use these training samples to train the binary classifier of the target group.
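Step 4 can be sketched as below. The patent does not name a particular binary classifier, so a nearest-centroid rule (target group versus all other samples) is swapped in purely for illustration. The function names, the coefficient values, and the toy data are hypothetical.

```python
def top_k_attributes(weights, k):
    """Indices of the k largest coefficients, returned in ascending index order."""
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return sorted(ranked[:k])


def project(x, attrs):
    """Keep only the selected (significant) attributes of a sample."""
    return [x[j] for j in attrs]


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_binary(samples, labels, target, attrs):
    """Nearest-centroid binary classifier (assumed choice): target group vs. the rest."""
    pos = [project(x, attrs) for x, y in zip(samples, labels) if y == target]
    neg = [project(x, attrs) for x, y in zip(samples, labels) if y != target]
    return {"attrs": attrs, "pos": centroid(pos), "neg": centroid(neg)}


def predict(model, x):
    """True if x is closer to the target group centroid than to the rest."""
    z = project(x, model["attrs"])
    d_pos = sum((a - b) ** 2 for a, b in zip(z, model["pos"]))
    d_neg = sum((a - b) ** 2 for a, b in zip(z, model["neg"]))
    return d_pos < d_neg


# Toy data (hypothetical): attribute 1 is noisy and receives a small coefficient.
samples = [[1.0, 9.0, 0.0], [1.2, -7.0, 0.2], [5.0, 8.0, 3.0], [5.5, -6.0, 2.8]]
labels = ["a", "a", "b", "b"]
attrs = top_k_attributes([0.7, 0.05, 0.25], k=2)
model = train_binary(samples, labels, "a", attrs)

# A new sample whose non-significant attribute 1 is missing can still be classified,
# because only the selected attributes are ever read.
is_a = predict(model, [1.0, None, 0.1])
```

Because the classifier reads only the group attributes, a missing value in a discarded attribute never enters the distance computation, which is the effect described under beneficial effect (2).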

Claims

1. A big data classification method based on significant attribute selection, characterized by comprising the following steps:

A. Sample sampling, grouping, and preprocessing

A1. Randomly sample the already-processed, standardized sample data in the data warehouse, group the samples according to their labels, and calculate the sample mean of each group;

A2. Within each group, randomly assign an initial coefficient to each attribute dimension of the sample data, then calculate the average weighted distance between the samples in the group and the group sample mean, and the average weighted distance to the sample means of the other groups;

B. Attribute weight training for each group

C. Classification model training for each group

Each group uses a binary classifier, the sample attributes selected during training are the group attributes, and the training result is one binary classifier per group.

2. The big data classification method based on significant attribute selection according to claim 1, characterized in that: in step B, the attribute coefficients of each group are optimized according to the principle of minimizing the intra-group distance and maximizing the inter-group distance.

3. The big data classification method based on significant attribute selection according to claim 1 or 2, characterized in that: in step C, attributes with larger coefficients are selected as the target group attributes, these attributes of the classification training samples are retained, and these training samples are used to train the binary classifier of the target group.