CN106446011A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN106446011A
CN106446011A
Authority
CN
China
Prior art keywords
matrix
training
sample
training sample
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610715951.6A
Other languages
Chinese (zh)
Other versions
CN106446011B (en)
Inventor
孙浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201610715951.6A priority Critical patent/CN106446011B/en
Publication of CN106446011A publication Critical patent/CN106446011A/en
Application granted granted Critical
Publication of CN106446011B publication Critical patent/CN106446011B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/313: Selection or weighting of terms for indexing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on the proximity to a decision surface, e.g. support vector machines

Abstract

The invention discloses a data processing method and device, relates to the technical field of computer applications, and solves the problem of low training efficiency caused by using excessively high-dimensional training samples to train an SVM (Support Vector Machine) model. The method comprises the following steps: obtaining an original sample matrix corresponding to each training sample, wherein the training samples are used for training the SVM model to obtain an SVM model for classifying data to be predicted, and the training samples comprise at least two different categories of training samples; and carrying out dimension reduction processing on the original sample matrices according to a PCA (Principal Component Analysis) algorithm to obtain dimension-reduced training samples. The data processing method and device are applied to the process of training the SVM model.

Description

Data processing method and device
Technical field
The present invention relates to the field of computer application technology, and in particular to a data processing method and device.
Background technology
A support vector machine (support vector machine, SVM) is a learning model used for pattern recognition, classification, and the like. In practical applications, an SVM model works best on two-class classification problems and is therefore commonly used to solve them. For example, when classifying mail, an unknown mail is input into the SVM model as data to be predicted, and through the two-class output of the model, a classification result is obtained indicating whether the unknown mail is a normal email or spam.
Generally, before classification is performed with an SVM model, the model must first be trained with known training samples. For example, a large number of normal emails and spam emails collected in advance are used as training samples to train the SVM model. During such training, however, the inventor found that for training samples whose dimension is too high, the dimension of the training set composed of them is equally high. An excessively high training-set dimension makes the computation required to train the SVM model very large and introduces considerable "noise data", so directly using high-dimensional training samples usually results in low training efficiency. Taking mail-class training samples as an example: if the "words" constituting the mail content are taken as units and a mail vector is composed of all the words, the vector dimension corresponding to each mail can reach hundreds of thousands, and the dimension of the training set composed of such samples will likewise reach hundreds of thousands. Such a high dimension inevitably increases the amount of computation for SVM training, and meaningless "noise data" such as function words contained in the mail content also increase; the large amount of computation and the greater amount of "noise data" inevitably reduce the efficiency of training the SVM model.
Content of the invention
In view of the above problems, the present invention is proposed to provide a data processing method and device that overcome the above problems or at least partially solve them.
To solve the above technical problem, in one aspect, the invention provides a data processing method, comprising:
obtaining an original sample matrix corresponding to each training sample, wherein the training samples are used for training a support vector machine (SVM) model to obtain an SVM model for classifying data to be predicted, and the training samples comprise at least two different categories of training samples;
carrying out dimension reduction processing on the original sample matrices according to a principal component analysis (PCA) algorithm to obtain dimension-reduced training samples.
In another aspect, the invention provides a data processing device, comprising:
an acquiring unit, configured to obtain the original sample matrix corresponding to each training sample, wherein the training samples are used for training a support vector machine (SVM) model to obtain an SVM model for classifying data to be predicted, and the training samples comprise at least two different categories of training samples;
a dimension reduction unit, configured to carry out dimension reduction processing on the original sample matrices according to a principal component analysis (PCA) algorithm to obtain the dimension-reduced training samples.
Through the above technical solution, the data processing method and device provided by the present invention can obtain the corresponding original sample matrix from the training samples used for training the SVM model, wherein the training samples comprise at least two different categories of training samples; dimension reduction processing is then carried out on the original sample matrices according to a principal component analysis (Principal Component Analysis, PCA) algorithm to obtain the dimension-reduced training samples. Compared with the prior art, the present invention can reduce the dimension of the training samples through the PCA algorithm; the reduction of the training sample dimension reduces the amount of computation when the training samples are used to train the SVM model, and at the same time eliminates some "noise data" in the training samples, thereby improving the efficiency of training the SVM model.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention can be understood more clearly and implemented according to the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art by reading the following detailed description of the preferred embodiments. The accompanying drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, identical parts are denoted by the same reference numerals. In the drawings:
Fig. 1 shows a flowchart of a data processing method provided by an embodiment of the present invention;
Fig. 2 shows a flowchart of another data processing method provided by an embodiment of the present invention;
Fig. 3 shows a block diagram of a data processing device provided by an embodiment of the present invention;
Fig. 4 shows a block diagram of another data processing device provided by an embodiment of the present invention.
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thoroughly understood and its scope fully conveyed to those skilled in the art.
To solve the problem of low efficiency when existing SVM models are trained with excessively high-dimensional training samples, an embodiment of the present invention provides a data processing method. As shown in Fig. 1, the method comprises:
101. Obtain the original sample matrix corresponding to each training sample.
In this embodiment, the training samples are used to train the SVM model so as to obtain an SVM model for classifying data to be predicted, wherein the training samples comprise at least two different categories of training samples. For mail, the different categories may include normal email and spam; for news, the different categories may include sports news, financial news, entertainment news, and so on.
The original sample matrices are obtained by converting the training samples, wherein each training sample corresponds to one original sample matrix. The conversion process is specifically as follows. First, a preset database corresponding to the training samples is constructed; the preset database is the element set comprising all elements that the training samples may involve, wherein each element corresponds to one dimension. Second, all the elements are sorted. Finally, the elements contained in each training sample are matched against the elements in the preset database, and the original sample matrix corresponding to each training sample is obtained according to the matching result.
The matching process is the same for every training sample; to simplify the description, it is illustrated here with a single training sample. Specifically: the training sample is split by element to obtain a training-sample element set; it is judged whether each element of the preset database is contained in the training-sample element set, with elements of the preset database that appear in the training-sample element set marked, generally as 1, and elements of the preset database that do not appear in the training-sample element set also marked, generally as 0. The 0s and 1s are then arranged according to the sorted order of the elements in the preset database, finally yielding the original sample matrix of the corresponding training sample, wherein the dimension of the original sample matrix is the same as the number of elements in the preset database.
In an example of this embodiment, suppose the sorted elements of the preset database are: red, orange, yellow, green, cyan, blue, purple. If the element set of one training sample is {orange, cyan, purple}, the original sample matrix corresponding to that training sample is [0,1,0,0,1,0,1]^T; if the element set of another training sample is {red, orange, yellow}, its original sample matrix is [1,1,1,0,0,0,0]^T.
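As a concrete illustration of the matching process above, the following sketch builds the 0/1 original sample matrix for the color example. The function name is an assumption for illustration; the patent does not prescribe an implementation.

```python
def build_original_sample_matrix(sample_elements, preset_database):
    """Mark 1 for each element of the sorted preset database that appears
    in the sample, 0 otherwise, yielding a vector of database dimension."""
    present = set(sample_elements)
    return [1 if e in present else 0 for e in preset_database]

# Sorted preset database: red, orange, yellow, green, cyan, blue, purple
db = ["red", "orange", "yellow", "green", "cyan", "blue", "purple"]

v1 = build_original_sample_matrix(["orange", "cyan", "purple"], db)
v2 = build_original_sample_matrix(["red", "orange", "yellow"], db)
# v1 == [0, 1, 0, 0, 1, 0, 1], v2 == [1, 1, 1, 0, 0, 0, 0]
```

The vector length always equals the number of elements in the preset database, matching the statement that each database element corresponds to one dimension.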
In practical applications, the richer the content of a training sample, the higher the dimension involved. Mail body content, for example, may involve the various characters of different languages, so the training dimension is very large and can normally reach hundreds of thousands of dimensions. Because the dimension of the training samples is very high, the original sample matrices obtained in this step are also of ultra-high dimension. In this embodiment, by executing the subsequent step 102, the dimension of the original sample matrices can be reduced, thereby achieving the purpose of reducing the amount of model training computation.
102. Carry out dimension reduction processing on the original sample matrices according to the principal component analysis (PCA) algorithm to obtain the dimension-reduced training samples.
Principal component analysis (PCA) is an algorithm that grasps the principal contradiction of things by extracting the main influencing factors, and it can project high-dimensional data into a lower-dimensional space. In this embodiment, dimension reduction processing is carried out on each original sample matrix according to the PCA algorithm to obtain the dimension-reduced original sample matrices, thereby obtaining the dimension-reduced training samples. The dimension-reduced training samples are obtained in order to train the SVM model that classifies the data to be predicted. Training the SVM model with lower-dimensional training samples reduces the amount of computation in the training process and improves the efficiency of training the SVM model.
With the data processing method provided by the embodiment of the present invention, the corresponding original sample matrix can be obtained from the training samples used for training the SVM model, wherein the training samples comprise at least two different categories of training samples; dimension reduction processing is then carried out on the original sample matrices according to the PCA algorithm to obtain the dimension-reduced training samples. Compared with the prior art, the embodiment of the present invention can reduce the dimension of the training samples through the PCA algorithm; the reduction of the training sample dimension reduces the amount of computation when the training samples are used to train the SVM model, and at the same time eliminates some "noise data" in the training samples, thereby improving the efficiency of training the SVM model.
Further, as a refinement and extension of the method shown in Fig. 1, another embodiment of the present invention provides a data processing method. As shown in Fig. 2, the method comprises:
201. Obtain the original sample matrix corresponding to each training sample.
The implementation of this step is the same as that of step 101 in Fig. 1 and is not repeated here.
202. Carry out dimension reduction processing on the original sample matrices according to the principal component analysis (PCA) algorithm to obtain the dimension-reduced training samples.
Specifically, carrying out dimension reduction processing on the original sample matrices according to the PCA algorithm includes the following process:
First, generate a feature matrix according to the original sample matrices.
Generating the feature matrix specifically includes:
(1) Calculate the mean of all the original sample matrices to obtain the central sample matrix.
Specifically, the formula for calculating the mean m of the original sample matrices is:

m = ( Σ_{i=1..c} Σ_{j=1..N_i} x_j^i ) / ( Σ_{i=1..c} N_i )

wherein x_j^i represents the j-th original sample matrix of the i-th classification, c is the total number of classifications, N_i represents the number of original sample matrices contained in the i-th classification, and i and j are natural numbers.
The central sample matrix H is obtained by subtracting m from each original sample matrix and arranging the centered samples as columns:

H = [ x_1^1 - m, x_2^1 - m, ..., x_{N_c}^c - m ]
(2) Calculate the transposed matrix of the central sample matrix, and multiply the central sample matrix by its transposed matrix to obtain the target sample matrix HH^T.
(3) Calculate the multiple eigenvalues μ of the target sample matrix HH^T and the eigenvector g corresponding to each eigenvalue.
In this embodiment, the multiple eigenvalues are the multiple nonzero eigenvalues.
When calculating the multiple nonzero eigenvalues of HH^T, a common method for finding the nonzero eigenvalues of a matrix may be used. In this embodiment, the calculation may also proceed as follows: first calculate the multiple nonzero eigenvalues μ of H^T H and the eigenvectors v corresponding to μ, then obtain the eigenvectors g using the following formula:

g = (1/√μ) Hv

It should be noted that HH^T and H^T H have the same nonzero eigenvalues, which are therefore both denoted μ.
(4) Select a predetermined number of eigenvectors in descending order of eigenvalue.
The predetermined number is determined according to the following formula:

( Σ_{j=1..i} μ_j ) / ( Σ_{j=1..k} μ_j ) ≥ θ

wherein θ is a predetermined threshold value close to but less than 1, generally taken as 0.9 or higher, which can be set according to the actual application; μ_j is the j-th largest eigenvalue of HH^T, i.e. μ_1 ≥ μ_2 ≥ ... ≥ μ_k, where k is the number of nonzero eigenvalues. The predetermined number is the smallest i satisfying the formula.
Selecting the predetermined number of eigenvectors means choosing the eigenvectors corresponding to the first i nonzero eigenvalues.
(5) Arrange the selected eigenvectors in the order of selection to obtain the feature matrix G, specifically as follows:

G = [g_1, ..., g_i]
Second, calculate the transposed matrix G^T of the feature matrix.
Calculating the transposed matrix of the feature matrix means changing the rows of the feature matrix into the corresponding columns; the new matrix obtained is generally denoted G^T. When transposing a matrix, the first row of the matrix generally becomes the first column of the transposed matrix, and the first column becomes the first row.
Third, multiply the transposed matrix of the feature matrix by each original sample matrix to obtain the dimension-reduced training samples.
The formula specifically involved in this step is: y = G^T x
wherein y is the matrix obtained after dimension reduction of the corresponding original sample matrix x. Each y is a training sample corresponding to one original sample matrix after dimension reduction.
In this embodiment, using the PCA algorithm, the purpose of reducing the dimension of the original sample matrices is achieved through the steps of generating the feature matrix, transposing the feature matrix, and multiplying the transposed matrix by the original sample matrices.
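The steps of step 202 can be sketched as follows. This is a minimal illustration under assumed conventions (samples stored as columns of X, the H^T H trick from sub-step (3), and the energy threshold θ from sub-step (4)), not a normative implementation of the patent.

```python
import numpy as np

def pca_reduce(X, theta=0.9):
    """X: (d, n) array whose columns are the original sample matrices.
    Returns the feature matrix G and Y = G^T X (dimension-reduced samples)."""
    m = X.mean(axis=1, keepdims=True)           # mean m of all samples
    H = X - m                                   # central sample matrix H
    # Eigen-decompose the smaller H^T H (n x n) instead of H H^T (d x d);
    # both share the same nonzero eigenvalues mu.
    mu, V = np.linalg.eigh(H.T @ H)
    order = np.argsort(mu)[::-1]                # sort eigenvalues descending
    mu, V = mu[order], V[:, order]
    keep = mu > 1e-10                           # nonzero eigenvalues only
    mu, V = mu[keep], V[:, keep]
    ratio = np.cumsum(mu) / mu.sum()            # cumulative eigenvalue ratio
    i = int(np.searchsorted(ratio, theta)) + 1  # smallest i with ratio >= theta
    G = (H @ V[:, :i]) / np.sqrt(mu[:i])        # g = H v / sqrt(mu)
    return G, G.T @ X                           # y = G^T x

rng = np.random.default_rng(0)
X = rng.random((1000, 20))                      # 20 samples of dimension 1000
G, Y = pca_reduce(X)
```

With n samples there are at most n-1 nonzero eigenvalues, so the reduced dimension is bounded by n-1 regardless of the original dimension d, which is the source of the computational saving the embodiment describes.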
203. Multiply the feature matrix by the training result vector obtained after training to obtain the optimized training result vector.
The training result vector is the vector obtained after the SVM model is trained with the dimension-reduced training samples. Specifically, the training result vector obtained for the SVM model is the support vector, which is the vector corresponding to the classification boundary used to distinguish data of different categories.
The specific process of optimizing the training result vector is: the training result vector, i.e. the support vector, is optimized by means of the feature matrix obtained in the above step 202. The specific optimization formula is as follows:

z' = Gz

wherein z is the training result vector and z' is the optimized training result vector.
204. Determine the classification matching set according to the optimized training result vector.
The classification matching set is a proper subset of the preset database, the preset database being the element set comprising all elements that the training samples may involve. The process of determining the classification matching set includes the following. First, find the nonzero values in the optimized training result vector. Second, extract the corresponding elements from the preset database according to those nonzero values. It should be noted that the training result vector is obtained from the dimension-reduced training samples, so its dimensions correspond to the dimensions of the dimension-reduced training samples; the dimension-reduced training samples are in turn obtained by dimension reduction of the original sample matrices, so the dimensions of the training result vector correspond to the dimensions of the original sample matrices. Since the dimensions of the original sample matrices correspond to the elements in the preset database, the dimensions of the optimized training result vector also correspond to the elements in the preset database, and the element of the corresponding dimension can therefore be extracted from the preset database according to the dimension at which each nonzero value of the optimized training result vector is located. Finally, the elements corresponding to the extracted nonzero values form the classification matching set, wherein each nonzero value serves as the coefficient of the corresponding element in the classification matching set.
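How steps 203 and 204 fit together can be sketched as follows, with assumed names and illustrative values: the reduced-space support vector z is mapped back through the feature matrix (z' = Gz), and the dimensions of the nonzero entries of z' index elements of the preset database.

```python
import numpy as np

db = ["red", "orange", "yellow", "green", "cyan", "blue", "purple"]

G = np.zeros((7, 2))                 # feature matrix from step 202 (illustrative)
G[1, 0] = 1.0                        # reduced dimension 0 maps to "orange"
G[6, 1] = 1.0                        # reduced dimension 1 maps to "purple"

z = np.array([0.8, 0.5])             # support vector in the reduced space
z_opt = G @ z                        # z' = G z, back in the original dimension

matching_set = {db[i]: float(w)      # nonzero coefficients name the elements
                for i, w in enumerate(z_opt) if abs(w) > 1e-12}
# matching_set == {"orange": 0.8, "purple": 0.5}
```

Only the elements with nonzero coefficients survive into the matching set, which is why it is a proper subset of the preset database.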
205. Classify the data to be predicted according to the classification matching set.
The specific process of classifying the data to be predicted according to the classification matching set is as follows:
First, perform multi-pattern matching of the classification matching set in the data to be predicted. Multi-pattern matching is the process of finding multiple pattern strings in one character string; in this embodiment it specifically refers to finding, in the data to be predicted, the multiple elements of the classification matching set. There are multiple algorithms for multi-pattern matching, such as the common trie tree (Trie tree), the AC automaton (Aho-Corasick automaton) algorithm, and the Wu-Manber (WM) algorithm. This embodiment does not limit which specific multi-pattern matching algorithm is used for matching. The matching result finally obtained determines which elements of the classification matching set are contained in the data to be predicted.
Second, accumulate the corresponding coefficients, in the classification matching set, of the elements that are present both in the classification matching set and in the data to be predicted.
Third, classify the data to be predicted according to the accumulation result.
Classifying the data to be predicted according to the accumulation result means determining the category of the data to be predicted. Specifically, a corresponding threshold range is set for each category, and the different threshold ranges do not intersect; the accumulation result is then compared with all the threshold ranges, and whichever threshold range the accumulation result falls into determines the category to which the data to be predicted belongs.
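The three-part classification of step 205 can be sketched as follows. For brevity the multi-pattern matching step (Trie, Aho-Corasick, or Wu-Manber) is replaced here by a simple set-membership test, and all element names and threshold ranges are illustrative assumptions.

```python
def classify(predict_elements, matching_set, thresholds):
    """Accumulate the coefficients of matching-set elements found in the
    data to be predicted, then pick the category whose non-intersecting
    threshold range [low, high) contains the accumulated result."""
    score = sum(w for e, w in matching_set.items() if e in predict_elements)
    for label, (low, high) in thresholds.items():
        if low <= score < high:
            return label
    return None

matching_set = {"prize": 1.5, "meeting": -1.0, "free": 2.0}
thresholds = {"spam": (1.0, 100.0), "normal": (-100.0, 1.0)}

spam_result = classify({"free", "prize", "hello"}, matching_set, thresholds)
normal_result = classify({"meeting", "hello"}, matching_set, thresholds)
# spam_result == "spam" (score 3.5), normal_result == "normal" (score -1.0)
```

Because only the matching-set elements carry coefficients, words outside the set (such as "hello") contribute nothing, which is the source of the classification speed-up the embodiment claims.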
With the data processing method provided by the embodiment of the present invention, in addition to reducing the amount of training computation by reducing the training sample dimension, the training result vector obtained by training can also be optimized, and the classification matching set is determined based on the optimized training result vector. When the data to be predicted are classified with the SVM model, classification only needs to proceed according to the elements contained in the classification matching set rather than according to all the elements that may constitute data of the different categories, so the amount of computation in the data classification process can be greatly reduced and the efficiency of data classification improved.
Further, for the method of generating the feature matrix involved in Fig. 2, the embodiment of the present invention also provides another method of generating the feature matrix, specifically as follows:
Since the training samples generally comprise training samples of a relatively high order of magnitude, if the feature-matrix calculations in the corresponding generation step of Fig. 2 are carried out on all of these training samples at once, the large number of training samples will place considerable computing pressure on the system. This embodiment therefore generates the feature matrix from the original sample matrices by way of grouping. The idea of grouping is to divide the high-order-of-magnitude training samples into multiple groups so that the number of training samples in each group is greatly reduced, and then carry out the corresponding calculation on the training samples in each group separately. Compared with calculating the feature matrix with all the training samples as one whole, the number of training samples involved in each calculation is greatly reduced, so the computing pressure on the system can be reduced. The specific process of generating the feature matrix from the original sample matrices by grouping comprises the following steps:
First, calculate the number M of matrix sets according to the preset dimension value.
The preset dimension value is a dimension value set according to the computing capability of the system; its setting generally follows the principle of being as large as possible within the computing capability of the system.
The specific formula for calculating the number M of matrix sets according to the preset dimension value is:

M = a / b

wherein a is the number of training samples of each classification contained in the training samples (it should be noted that when the training samples are selected, the number of training samples of each classification is equal), and b is the preset dimension value.
Second, evenly divide the original sample matrices of each classification into M matrix sets.
Before the original sample matrices of each classification are evenly divided into the M matrix sets, the original sample matrices corresponding to the training samples must first be classified according to the categories of the training samples to obtain the original sample matrices of the different categories.
Third, combine the matrix sets of the different categories in a permutation-and-combination manner to obtain multiple matrix groups, wherein each matrix group contains matrix sets of all the categories.
A specific example is given to illustrate the process of obtaining the multiple matrix groups. Suppose the training samples comprise training samples of two categories, each category containing 1000 training samples, so that the number of corresponding original sample matrices is also 1000 per category, and suppose the number M of matrix sets obtained from the preset dimension value is 10. Then 10 matrix sets corresponding to the original sample matrices of each category are obtained, and each matrix set of one category is combined with the corresponding matrix sets of the other category; the number of matrix groups finally obtained is 45.
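The splitting step underlying this example can be sketched as follows, assuming the reading M = a / b (consistent with a = 1000, b = 100, M = 10); the exact rule for pairing the resulting sets across categories is left as stated in the text.

```python
from math import ceil

def split_into_sets(class_samples, b):
    """Divide one class's a original sample matrices into M = ceil(a / b)
    roughly equal matrix sets (b is the preset dimension value)."""
    a = len(class_samples)
    M = ceil(a / b)
    size = ceil(a / M)
    return [class_samples[i * size:(i + 1) * size] for i in range(M)]

sets = split_into_sets(list(range(1000)), 100)   # a = 1000, b = 100
# len(sets) == 10, each set holds 100 sample matrices
```

Each per-group feature-matrix calculation then touches only b samples per class rather than a, which is the source of the reduced computing pressure.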
Fourth, calculate the sub-feature matrix of each matrix group respectively.
The way of calculating the sub-feature matrix of each matrix group is the same as the implementation of generating the feature matrix in Fig. 2, wherein a matrix group corresponds to all the original sample matrices and a sub-feature matrix corresponds to the feature matrix. Multiple sub-feature matrices are finally obtained.
Fifth, combine the multiple sub-feature matrices according to the order in which they were calculated to obtain the feature matrix.
The sub-feature matrices obtained in the previous step are linearly combined; in particular, the multiple sub-feature matrices are combined according to the order of calculating them to obtain the feature matrix.
An example is given to illustrate the way the feature matrix is obtained. Suppose the number of sub-feature matrices obtained is r, and the i-th sub-feature matrix is denoted G_i; then the feature matrix obtained after linearly combining the r sub-feature matrices is G = [G_1 G_2 ... G_r].
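The combination G = [G_1 G_2 ... G_r] amounts to a column-wise concatenation in calculation order; a sketch with illustrative shapes:

```python
import numpy as np

r = 4
# Sub-feature matrices G1..G4, each with 3 columns (illustrative values)
subs = [np.full((50, 3), i + 1.0) for i in range(r)]
G = np.hstack(subs)                  # G = [G1 G2 ... Gr]
# G.shape == (50, 12): the column counts of the sub-feature matrices add up
```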
In one application mode of this embodiment, the SVM model trained through the flow of Fig. 1 or Fig. 2 can be used to identify whether an unknown mail is spam and/or a normal email. Specifically, during identification: equal numbers of known normal emails and known spam emails form the training samples, and the dimension-reduced training samples are then obtained in the manner of Fig. 1 or Fig. 2, wherein it should be noted that the preset database is the set composed of all possible "words" constituting a mail. The training samples are then used to train the SVM model to obtain the corresponding support vector, the support vector is optimized, and the corresponding classification matching set is obtained according to the optimized support vector; the classification matching set contains only the "words" corresponding to the nonzero values in the optimized support vector. The coefficients corresponding to the words of the matching set that appear in the unknown mail are then accumulated, and according to the accumulated numerical result the mail is identified as spam or normal email.
Further, as an implementation of the above embodiments, another embodiment of the present invention also provides a data processing device for implementing the methods described in Fig. 1 and Fig. 2 above. As shown in Fig. 3, the device comprises: an acquiring unit 31 and a dimension reduction unit 32.
Acquiring unit 31, for obtaining the corresponding original sample matrix of each training sample, training sample is used for supporting Vector machine SVM model is trained obtaining the SVM model for treating that prediction data is classified, and wherein, training sample is comprising at least Two kinds of different classes of training samples.
The original sample matrix for obtaining after being changed is obtained by training sample, and wherein each training sample corresponds to an original sample This matrix.To original sample matrix conversion process it is specifically:First, the corresponding presetting database of construction training sample, this is pre- If data base is the element set of all elements for including that training sample may relate to, wherein each element corresponds to a dimension; Secondly, all of element is ranked up;Finally, respectively by the element for including in each training sample and presetting database Element is mated, and respectively obtains the corresponding original sample matrix of each training sample according to matching result.
Wherein, the process mated by each training sample is identical, is to simplify description here with a training sample The process of coupling is illustrated as a example by this, specifically:Training sample is split according to element, obtains training sample element set; Whether the element that training of judgement sample elements are concentrated is contained in presetting database, default by occurring in training sample element set The element of lane database is marked, and is generally designated as 1, and the element being not included in presetting database is also carried out labelling, leads to 0 is often labeled as;Then arranged according to the sequence of element in corresponding presetting database by 0 and 1, finally given corresponding training The original sample matrix of sample, the wherein dimension of original sample matrix are identical with the element number in presetting database.
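The marking procedure above can be sketched in a few lines of Python; the function and variable names are illustrative and not taken from the patent, and the sample "words" are invented for the example.

```python
def to_sample_vector(sample_words, database_words):
    """Mark each preset-database element 1 if it occurs in the training
    sample, else 0; the result's dimensionality equals the database size."""
    present = set(sample_words)
    return [1 if w in present else 0 for w in database_words]

# Hypothetical preset database: all "words" the samples may involve, sorted.
database = sorted(["cheap", "hello", "meeting", "offer", "win"])
vec = to_sample_vector(["hello", "win", "win"], database)
# Only "hello" and "win" occur, so vec == [0, 1, 0, 0, 1]
```

Repeating this for every training sample yields one original sample matrix (here a 0/1 vector) per sample.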
The dimensionality-reduction unit 32 is used to perform dimensionality reduction on the original sample matrices according to the principal component analysis (PCA) algorithm, obtaining the dimension-reduced training samples.
Principal component analysis (PCA) is an algorithm for extracting the main influencing factors of a phenomenon, a way of grasping its principal structure, and can project high-dimensional data into a lower-dimensional space. In this embodiment each original sample matrix is reduced in dimension according to the PCA algorithm, giving the dimension-reduced original sample matrices and thereby the dimension-reduced training samples. The dimension-reduced training samples are obtained in order to train the SVM model that classifies data to be predicted. Training the SVM model on lower-dimensional training samples reduces the amount of computation in the training process and improves the efficiency of training the SVM model.
Further, as shown in Fig. 4, the dimensionality-reduction unit 32 includes:
a generation module 321 for generating a feature matrix from the original sample matrices;
The feature matrix G is generated from the original sample matrices x.
a computing module 322 for computing the transpose of the feature matrix;
The transpose G^T of the feature matrix is computed; that is, the rows of the feature matrix become the corresponding columns, and the resulting new matrix is denoted G^T. When transposing, the first column of the matrix becomes the first row of the transposed matrix, and the first row becomes the first column.
a multiplication module 323 for multiplying the transpose of the feature matrix by the original sample matrices to obtain the dimension-reduced training samples.
The formula used in the multiplication module 323 is: y = G^T x
Here y is the matrix obtained by reducing the dimension of the corresponding original sample matrix; each y is the training sample corresponding to one dimension-reduced original sample matrix.
In this embodiment, the PCA algorithm achieves the goal of reducing the dimensionality of the original sample matrices through the steps of generating the feature matrix, transposing the feature matrix, and multiplying the transposed matrix by the original sample matrices.
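The three steps above can be sketched in Python with NumPy; the names and the toy feature matrix are assumptions made for illustration, with each original sample matrix treated as a column vector so the projection is y = G^T x.

```python
import numpy as np

def reduce_dimension(G, samples):
    """Project each original sample vector x to y = G^T x."""
    return [G.T @ x for x in samples]

# Toy example: 4-dimensional samples reduced to 2 dimensions using a
# hypothetical feature matrix whose columns are two selected eigenvectors.
G = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])
x = np.array([3.0, 4.0, 5.0, 6.0])
y = reduce_dimension(G, [x])[0]   # keeps the first two coordinates here
```

Each dimension-reduced vector y then serves as one training sample for the SVM model.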
Further, the generation module 321 is used to:
compute the mean of all the original sample matrices and obtain the central sample matrix;
compute the transpose of the central sample matrix, and multiply the central sample matrix by its transpose to obtain the target sample matrix;
compute multiple eigenvalues of the target sample matrix and the eigenvector corresponding to each eigenvalue;
select a predetermined number of eigenvectors in descending order of eigenvalue;
arrange the selected eigenvectors in the order of selection to obtain the feature matrix.
In detail, generating the feature matrix in the generation module 321 includes:
(1) computing the mean of all the original sample matrices and obtaining the central sample matrix;
Specifically, the mean m of the original sample matrices is computed as m = (1/N) Σi Σj x_j^i, where the inner sum runs over j = 1, ..., Ni, the outer sum runs over i = 1, ..., c, and N = N1 + ... + Nc is the total number of original sample matrices.
Here x_j^i denotes the j-th original sample matrix of the i-th class, c is the total number of classes, Ni is the number of original sample matrices contained in the i-th class, and i and j are natural numbers.
The central sample matrix H is then obtained by subtracting the mean from each original sample matrix and arranging the results as columns: H = [x_1^1 - m, ..., x_Nc^c - m].
(2) computing the transpose of the central sample matrix and multiplying the central sample matrix by its transpose to obtain the target sample matrix HH^T;
(3) computing multiple eigenvalues μ of the target sample matrix HH^T and the eigenvector g corresponding to each eigenvalue;
In this embodiment the multiple eigenvalues are the multiple nonzero eigenvalues.
The multiple nonzero eigenvalues of HH^T can be computed by the usual methods for finding the nonzero eigenvalues of a matrix. In this embodiment they can also be computed as follows: first compute the multiple nonzero eigenvalues μ of H^T H and the eigenvectors v corresponding to μ, then obtain the eigenvectors g by the formula g = Hv/√μ.
It should be noted that HH^T and H^T H have the same nonzero eigenvalues, which is why both are denoted μ.
(4) selecting a predetermined number of eigenvectors in descending order of eigenvalue;
The predetermined number i is determined by the condition (μ1 + ... + μi)/(μ1 + ... + μk) ≥ θ, where θ is a preset threshold close to but less than 1, typically 0.9 or higher, and settable according to the actual application; μj is the j-th largest eigenvalue of HH^T, i.e. μ1 ≥ μ2 ≥ ... ≥ μk, and k is the number of nonzero eigenvalues.
Selecting the predetermined number of eigenvectors means choosing the eigenvectors corresponding to the first i nonzero eigenvalues.
(5) arranging the selected eigenvectors in the order of selection to obtain the feature matrix G, which is obtained specifically as follows:
G = [g1, ..., gi]
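Steps (1) to (5), including the H^T H shortcut of step (3) and the threshold θ of step (4), can be sketched in Python with NumPy; all names are illustrative, and the tolerance used to decide which eigenvalues count as nonzero is an assumption.

```python
import numpy as np

def build_feature_matrix(samples, theta=0.9):
    """Centre the samples, eigendecompose the small Gram matrix H^T H,
    lift its eigenvectors to those of H H^T via g = H v / sqrt(mu), and
    keep the leading eigenvectors covering a fraction theta of the
    total nonzero-eigenvalue mass."""
    X = np.column_stack(samples)             # one sample per column
    m = X.mean(axis=1, keepdims=True)        # mean of all original samples
    H = X - m                                # central sample matrix
    mu, V = np.linalg.eigh(H.T @ H)          # small N x N Gram matrix
    order = np.argsort(mu)[::-1]             # descending eigenvalues
    mu, V = mu[order], V[:, order]
    keep = mu > 1e-10                        # nonzero eigenvalues only
    mu, V = mu[keep], V[:, keep]
    ratios = np.cumsum(mu) / mu.sum()
    i = int(np.searchsorted(ratios, theta)) + 1
    return (H @ V[:, :i]) / np.sqrt(mu[:i])  # g_j = H v_j / sqrt(mu_j)

samples = [np.array([2.0, 0.0, 0.0]),
           np.array([0.0, 1.0, 0.0]),
           np.array([-2.0, -1.0, 0.0])]
G = build_feature_matrix(samples, theta=0.9)
```

Because v is a unit eigenvector of H^T H with eigenvalue μ, each lifted column g = Hv/√μ is a unit eigenvector of HH^T, so only the small Gram matrix ever needs to be decomposed.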
Further, the generation module 321 is used to:
divide the original sample matrices evenly into multiple matrix groups;
compute the sub-feature matrix of each matrix group separately;
combine the multiple sub-feature matrices in the order in which they were computed to obtain the feature matrix.
Further, the generation module 321 is used to:
classify the original sample matrices corresponding to the training samples according to the classes of the training samples;
compute the number of matrix sets M according to a preset dimension value;
divide the original sample matrices of each class evenly into M matrix sets;
combine the matrix sets of the different classes in a permutation-and-combination manner to obtain the multiple matrix groups, where each matrix group contains matrix sets of all classes.
The generation module 321 also supports another way of generating the feature matrix, specifically:
First, the number of matrix sets M is computed according to a preset dimension value.
The preset dimension value is set according to the computing capability of the system, and its setting generally follows the principle of being as large as possible within that computing capability.
In the formula that computes the number of matrix sets M from the preset dimension value, a is the number of training samples of each class contained in the training samples (it should be noted that when the training samples are selected, the number of training samples of each class is equal) and b is the preset dimension value.
Second, the original sample matrices of each class are divided evenly into M matrix sets.
Before the original sample matrices of each class are divided evenly into the M matrix sets, the original sample matrices corresponding to the training samples must first be classified according to the classes of the training samples, yielding the original sample matrices of each class.
Third, the matrix sets of the different classes are combined in a permutation-and-combination manner to obtain the multiple matrix groups, where each matrix group contains matrix sets of all classes.
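A minimal sketch of the grouping steps follows. Since the exact formula for M and the exact "permutation and combination" rule are not spelled out here, the sketch takes M as given and combines one matrix set per class via a Cartesian product, which is one plausible reading; all names and data are illustrative.

```python
from itertools import product

def split_into_groups(samples_by_class, M):
    """Split each class's sample matrices into M equal sets, then form
    matrix groups by taking one set from every class (Cartesian product),
    so each group contains matrix sets of all classes."""
    sets_by_class = []
    for samples in samples_by_class.values():
        size = len(samples) // M
        sets_by_class.append([samples[i * size:(i + 1) * size]
                              for i in range(M)])
    return [sum(combo, []) for combo in product(*sets_by_class)]

# Hypothetical data: 2 classes with 4 samples each, split into M = 2 sets.
classes = {"spam": ["s1", "s2", "s3", "s4"],
           "ham":  ["h1", "h2", "h3", "h4"]}
groups = split_into_groups(classes, M=2)
# 2 sets per class and 2 classes give 4 groups, each mixing both classes
```

A sub-feature matrix would then be computed per group and the results concatenated as described above.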
Further, as shown in Fig. 4, the device also includes:
an optimizing unit 33 which, after the dimensionality reduction of the original sample matrices according to the principal component analysis (PCA) algorithm has produced the dimension-reduced training samples, and after the SVM model has been trained with the dimension-reduced training samples, multiplies the feature matrix by the training result vector obtained from training to obtain the optimized training result vector;
The training result vector is the vector obtained after training the SVM model with the dimension-reduced training samples; specifically, the training result vector obtained for the SVM model is a support vector, the vector corresponding to the classification boundary that separates data of different classes.
The specific procedure for optimizing the training result vector is to optimize the support vector with the feature matrix obtained in the dimensionality-reduction unit 32 above; the optimization formula is:
z′ = Gz
Here z is the training result vector and z′ is the optimized training result vector.
a determining unit 34 for determining the classification matching set from the optimized training result vector, the classification matching set being a proper subset of the preset database, where the preset database contains the element set of all elements that the training samples may involve;
The classification matching set is a proper subset of the preset database, and the preset database contains the element set of all elements that the training samples may involve. Determining the classification matching set includes: first, finding the nonzero values in the optimized training result vector; second, extracting the corresponding elements from the preset database according to those nonzero values. It should be noted that the training result vector is obtained from the dimension-reduced training samples, so its dimensions correspond to the dimensions of the dimension-reduced training samples; and since the dimension-reduced training samples are obtained by reducing the dimension of the original sample matrices, the dimensions of the training result vector correspond to the dimensions of the original sample matrices. Because the dimensions of the original sample matrices in turn correspond to the elements of the preset database, the dimensions of the optimized training result vector correspond to the elements of the preset database, and for each nonzero value of the optimized training result vector the element of the corresponding dimension can therefore be extracted from the preset database. Finally, the elements corresponding to the extracted nonzero values form the classification matching set, each nonzero value serving as the coefficient of the corresponding element in the classification matching set.
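Determining the classification matching set can be sketched as follows, assuming the optimized training result vector z′ is indexed in the same order as the sorted preset database; names and data are illustrative.

```python
def build_matching_set(z_optimized, database_words, eps=1e-12):
    """Keep only preset-database elements whose entry in the optimized
    support vector is nonzero; that entry becomes the element's
    coefficient in the classification matching set."""
    return {word: coef
            for word, coef in zip(database_words, z_optimized)
            if abs(coef) > eps}

# Hypothetical optimized vector z' = Gz, aligned with the database order.
database = ["cheap", "hello", "offer", "win"]
z_opt = [0.8, 0.0, -0.3, 0.5]
matches = build_matching_set(z_opt, database)
# "hello" is dropped because its entry is zero
```

The resulting dictionary is a proper subset of the database, as the text requires, with coefficients ready for accumulation at prediction time.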
a classification unit 35 for classifying the data to be predicted according to the classification matching set.
The specific procedure for classifying the data to be predicted according to the classification matching set is as follows:
First, multi-pattern matching is performed between the data to be predicted and the classification matching set. Multi-pattern matching is the problem of finding multiple pattern strings within one string; in this embodiment it specifically means finding, in the data to be predicted, the multiple elements of the classification matching set. There are several multi-pattern matching algorithms, such as the common dictionary tree (trie), the Aho-Corasick (AC) automaton algorithm, and the Wu-Manber (WM) algorithm; this embodiment does not restrict which specific multi-pattern matching algorithm is used. The matching result finally obtained determines which elements of the classification matching set are contained in the data to be predicted.
Second, the coefficients, in the classification matching set, of the elements present both in the classification matching set and in the data to be predicted are accumulated.
Third, the data to be predicted are classified according to the accumulated result.
Classifying the data to be predicted according to the accumulated result means determining the class of the data to be predicted. Specifically, a corresponding threshold range is set for each class, with no overlap between the different threshold ranges; the accumulated result is then compared with all the threshold ranges, and the threshold range into which the accumulated result falls decides which class the data to be predicted belong to.
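The three classification steps can be sketched in Python; a simple membership test stands in for a real multi-pattern matching algorithm such as Aho-Corasick, and the matching set and threshold ranges are invented for the example.

```python
def classify(text_words, matches, thresholds):
    """Accumulate the coefficients of matching-set elements found in the
    data to be predicted, then pick the class whose (non-overlapping)
    threshold range contains the accumulated result."""
    score = sum(matches[w] for w in text_words if w in matches)
    for label, (low, high) in thresholds.items():
        if low <= score < high:
            return label
    return None

# Hypothetical matching set (element -> coefficient) and threshold ranges.
matches = {"cheap": 0.8, "offer": 0.6, "meeting": -0.7}
thresholds = {"spam": (0.5, float("inf")),
              "normal": (float("-inf"), 0.5)}
label = classify(["cheap", "offer", "tomorrow"], matches, thresholds)
# accumulated score 0.8 + 0.6 = 1.4 falls in the "spam" range
```

Only the few elements in the matching set ever contribute to the score, which is exactly the computational saving the embodiment claims.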
With the data processing method provided by the embodiment of the present invention, besides reducing the amount of computation in the training process by reducing the dimensionality of the training samples, the training result vector obtained from training can also be optimized, and the classification matching set determined from the optimized training result vector. When the SVM model classifies data to be predicted, the classification needs to consider only the elements contained in the matching set rather than all the elements that might make up data of the different classes, so the amount of computation in the data classification process is greatly reduced and the efficiency of data classification is improved.
Further, the SVM model in the acquiring unit 31 is used to identify whether an unknown mail is spam and/or normal mail;
the training samples used for training comprise known normal mails and known spam mails.
In one application of this embodiment, the SVM model trained by the device of Fig. 3 or Fig. 4 can be used to identify whether an unknown mail is spam and/or normal mail. The identification proceeds as follows. Equal numbers of known normal mails and known spam mails form the training samples, and the dimension-reduced training samples are obtained by the device of Fig. 3 or Fig. 4; it should be noted that here the preset database is the set of all possible "words" that may make up a mail. The training samples are then used to train the SVM model, which yields the corresponding support vector; the support vector is optimized, and the corresponding classification matching set is obtained from the optimized support vector, the matching set containing only the "words" corresponding to the nonzero values in the optimized support vector. Finally, the coefficients of the words in the unknown mail that appear in the matching set are accumulated, and the accumulated result is used to identify whether the mail is spam or normal mail.
With the data processing device provided by the embodiment of the present invention, the original sample matrix corresponding to each training sample used to train the SVM model can be obtained, where the training samples include at least two different classes of training sample; then, according to the principal component analysis (PCA) algorithm, the original sample matrices are reduced in dimension, obtaining the dimension-reduced training samples. Compared with the prior art, the embodiment of the present invention can reduce the dimensionality of the training samples by the PCA algorithm, and the reduction in training-sample dimensionality lowers the amount of computation when the SVM model is trained with the training samples while also eliminating some "noise data" in the training samples, thereby improving the efficiency of training the SVM model.
The embodiment of the invention also discloses:
A1. A data processing method, the method comprising: obtaining the original sample matrix corresponding to each training sample, the training samples being used to train a support vector machine (SVM) model so as to obtain an SVM model that classifies data to be predicted, wherein the training samples include at least two different classes of training sample;
performing dimensionality reduction on the original sample matrices according to the principal component analysis (PCA) algorithm to obtain the dimension-reduced training samples.
A2. The method according to A1, wherein performing dimensionality reduction on the original sample matrices according to the PCA algorithm to obtain the dimension-reduced training samples comprises:
generating a feature matrix from the original sample matrices;
computing the transpose of the feature matrix;
multiplying the transpose of the feature matrix by the original sample matrices to obtain the dimension-reduced training samples.
A3. The method according to A2, wherein generating the feature matrix from the original sample matrices comprises:
computing the mean of all the original sample matrices to obtain the central sample matrix;
computing the transpose of the central sample matrix, and multiplying the central sample matrix by its transpose to obtain the target sample matrix;
computing multiple eigenvalues of the target sample matrix and the eigenvector corresponding to each eigenvalue;
selecting a predetermined number of eigenvectors in descending order of eigenvalue;
arranging the selected eigenvectors in the order of selection to obtain the feature matrix.
A4. The method according to A2 or A3, wherein generating the feature matrix from the original sample matrices comprises:
dividing the original sample matrices evenly into multiple matrix groups;
computing the sub-feature matrix of each matrix group separately;
combining the multiple sub-feature matrices in the order in which they were computed to obtain the feature matrix.
A5. The method according to A4, wherein dividing the original sample matrices evenly into multiple matrix groups comprises:
classifying the original sample matrices corresponding to the training samples according to the classes of the training samples;
computing the number of matrix sets M according to a preset dimension value;
dividing the original sample matrices of each class evenly into M matrix sets;
combining the matrix sets of the different classes in a permutation-and-combination manner to obtain the multiple matrix groups, wherein each matrix group contains matrix sets of all classes.
A6. The method according to A2, wherein after the dimensionality reduction of the original sample matrices according to the PCA algorithm has produced the dimension-reduced training samples, the method further comprises:
after the SVM model has been trained with the dimension-reduced training samples, multiplying the feature matrix by the training result vector obtained from training to obtain the optimized training result vector;
determining the classification matching set from the optimized training result vector, the classification matching set being a proper subset of the preset database, wherein the preset database contains the element set of all elements that the training samples may involve;
classifying the data to be predicted according to the classification matching set.
A7. The method according to A1, wherein the SVM model is used to identify whether an unknown mail is spam and/or normal mail;
the training samples used for training comprise known normal mails and known spam mails.
B8. A data processing device, the device comprising:
an acquiring unit for obtaining the original sample matrix corresponding to each training sample, the training samples being used to train a support vector machine (SVM) model so as to obtain an SVM model that classifies data to be predicted, wherein the training samples include at least two different classes of training sample;
a dimensionality-reduction unit for performing dimensionality reduction on the original sample matrices according to the principal component analysis (PCA) algorithm to obtain the dimension-reduced training samples.
B9. The device according to B8, wherein the dimensionality-reduction unit includes:
a generation module for generating a feature matrix from the original sample matrices;
a computing module for computing the transpose of the feature matrix;
a multiplication module for multiplying the transpose of the feature matrix by the original sample matrices to obtain the dimension-reduced training samples.
B10. The device according to B9, wherein the generation module is used to:
compute the mean of all the original sample matrices to obtain the central sample matrix;
compute the transpose of the central sample matrix, and multiply the central sample matrix by its transpose to obtain the target sample matrix;
compute multiple eigenvalues of the target sample matrix and the eigenvector corresponding to each eigenvalue;
select a predetermined number of eigenvectors in descending order of eigenvalue;
arrange the selected eigenvectors in the order of selection to obtain the feature matrix.
B11. The device according to B9 or B10, wherein the generation module is used to:
divide the original sample matrices evenly into multiple matrix groups;
compute the sub-feature matrix of each matrix group separately;
combine the multiple sub-feature matrices in the order in which they were computed to obtain the feature matrix.
B12. The device according to B11, wherein the generation module is used to:
classify the original sample matrices corresponding to the training samples according to the classes of the training samples;
compute the number of matrix sets M according to a preset dimension value;
divide the original sample matrices of each class evenly into M matrix sets;
combine the matrix sets of the different classes in a permutation-and-combination manner to obtain the multiple matrix groups, wherein each matrix group contains matrix sets of all classes.
B13. The device according to B9, wherein the device further comprises:
an optimizing unit which, after the dimensionality reduction of the original sample matrices according to the PCA algorithm has produced the dimension-reduced training samples, and after the SVM model has been trained with the dimension-reduced training samples, multiplies the feature matrix by the training result vector obtained from training to obtain the optimized training result vector;
a determining unit for determining the classification matching set from the optimized training result vector, the classification matching set being a proper subset of the preset database, wherein the preset database contains the element set of all elements that the training samples may involve;
a classification unit for classifying the data to be predicted according to the classification matching set.
B14. The device according to B8, wherein the SVM model in the acquiring unit is used to identify whether an unknown mail is spam and/or normal mail;
the training samples used for training comprise known normal mails and known spam mails.
In the above embodiments, the description of each embodiment has its own emphasis; for the parts not detailed in one embodiment, reference may be made to the related descriptions of the other embodiments.
It will be understood that related features in the above method and device may refer to one another. In addition, "first", "second" and the like in the above embodiments serve to distinguish the embodiments and do not indicate the relative merit of the embodiments.
Those skilled in the art will clearly appreciate that, for convenience and brevity of description, the specific working processes of the system, device and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein, and the structure required to construct such systems is apparent from the description above. Moreover, the present invention is not directed at any particular programming language. It will be understood that various programming languages may be used to implement the content of the invention described herein, and the above description of a particular language is made to disclose the best mode of carrying out the invention.
Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be appreciated that, to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments of the invention. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. The claims following the specific embodiments are hereby expressly incorporated into those specific embodiments, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components of an embodiment can be combined into one module or unit or component, and can furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be realized in hardware, or as software modules running on one or more processors, or as combinations thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to realize some or all of the functions of some or all of the components of the device named in the invention (such as the data processing device) according to embodiments of the invention. The invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for carrying out part or all of the method described herein. Such programs implementing the invention may be stored on computer-readable media, or may take the form of one or more signals; such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim, and the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere use of the words first, second and third does not indicate any ordering; these words may be interpreted as names.

Claims (10)

1. A data processing method, characterized in that the method comprises: obtaining an original sample matrix corresponding to each training sample, the training samples being used to train a support vector machine (SVM) model to obtain an SVM model for classifying data to be predicted, wherein the training samples comprise at least two different classes of training samples;
performing dimensionality reduction on the original sample matrix according to a principal component analysis (PCA) algorithm to obtain the training samples after dimensionality reduction.
2. The method according to claim 1, characterized in that performing dimensionality reduction on the original sample matrix according to the PCA algorithm to obtain the training samples after dimensionality reduction comprises:
generating an eigenmatrix according to the original sample matrix;
calculating a transposed matrix of the eigenmatrix;
multiplying the transposed matrix of the eigenmatrix by the original sample matrix to obtain the training samples after dimensionality reduction.
3. The method according to claim 2, characterized in that generating the eigenmatrix according to the original sample matrix comprises:
calculating the average of all the original sample matrices to obtain a central sample matrix;
calculating a transposed matrix of the central sample matrix, and multiplying the central sample matrix by the transposed matrix of the central sample matrix to obtain a target sample matrix;
calculating a plurality of eigenvalues of the target sample matrix and the eigenvector corresponding to each eigenvalue;
selecting a predetermined number of eigenvectors in descending order of eigenvalue;
arranging the selected eigenvectors in the order in which they were selected to obtain the eigenmatrix.
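The eigenmatrix construction of claim 3 and the projection of claim 2 can be illustrated with a minimal NumPy sketch. This is not the patent's implementation; the function and variable names are my own, and I read the "central sample matrix" as the mean-centered stack of samples, which is the standard PCA interpretation:

```python
import numpy as np

def eigenmatrix(X, k):
    """Sketch of claim 3: build the eigenmatrix from samples stacked as
    columns of X (d features x n samples). The 'central sample matrix' is
    read here as the mean-centered sample matrix."""
    C = X - X.mean(axis=1, keepdims=True)   # central sample matrix
    T = C @ C.T                             # central x central-transposed: target sample matrix
    vals, vecs = np.linalg.eigh(T)          # eigenvalues/eigenvectors of the symmetric T
    order = np.argsort(vals)[::-1][:k]      # predetermined number k, descending eigenvalue
    return vecs[:, order]                   # selected eigenvectors arranged in order: d x k

def project(X, F):
    """Claim 2: multiply the eigenmatrix transpose by the original sample matrix."""
    return F.T @ X                          # k x n training samples after dimensionality reduction

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 40))               # 40 samples with 10 features each
F = eigenmatrix(X, k=3)
Y = project(X, F)
print(F.shape, Y.shape)                     # (10, 3) (3, 40)
```

Because the target sample matrix is symmetric, `np.linalg.eigh` returns real eigenvalues, and the selected eigenvector columns are orthonormal, so the projection preserves the dominant variance directions of the training samples.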
4. The method according to claim 2 or 3, characterized in that generating the eigenmatrix according to the original sample matrix comprises:
evenly dividing the original sample matrices into a plurality of matrix groups;
calculating a sub-eigenmatrix of each matrix group separately;
combining the plurality of sub-eigenmatrices in the order in which they were calculated to obtain the eigenmatrix.
5. The method according to claim 4, characterized in that evenly dividing the original sample matrices into a plurality of matrix groups comprises:
classifying the original sample matrices corresponding to the training samples according to the classes of the training samples;
calculating a number of matrix sets M according to a preset dimension value;
evenly dividing the original sample matrices of each class into M matrix sets;
combining the matrix sets of different classes by permutation and combination to obtain the plurality of matrix groups, wherein each matrix group contains matrix sets of all classes.
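The class-wise splitting of claim 5 can be sketched as follows. The claim does not spell out the exact "permutation and combination" rule, so this illustration makes the simplest assumption, pairing the i-th matrix set of every class into the i-th group; all names are illustrative:

```python
import numpy as np

def split_into_groups(samples, labels, M):
    """Sketch of claims 4-5: split each class's sample columns evenly into
    M matrix sets, then combine one set per class into each of the M matrix
    groups, so every group contains matrix sets of all classes.
    (Pairing the i-th set of every class is an assumed combination rule.)"""
    groups = [[] for _ in range(M)]
    for cls in np.unique(labels):
        cls_cols = samples[:, labels == cls]          # this class's sample columns
        chunks = np.array_split(cls_cols, M, axis=1)  # M roughly equal matrix sets
        for i, chunk in enumerate(chunks):
            groups[i].append(chunk)
    return [np.hstack(g) for g in groups]             # each group spans all classes

X = np.arange(24).reshape(2, 12).astype(float)        # 12 samples, 2 features
y = np.array([0] * 6 + [1] * 6)                       # two classes, 6 samples each
groups = split_into_groups(X, y, M=3)
print([g.shape for g in groups])                      # [(2, 4), (2, 4), (2, 4)]
```

Per claim 4, a sub-eigenmatrix would then be computed for each of the three groups and the results concatenated in order, which keeps each eigendecomposition small when the sample count is large.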
6. The method according to claim 2, characterized in that after performing dimensionality reduction on the original sample matrix according to the PCA algorithm to obtain the training samples after dimensionality reduction, the method further comprises:
after training the SVM model with the training samples after dimensionality reduction, multiplying the eigenmatrix by the training result vector obtained from the training to obtain an optimized training result vector;
determining a classification matching set according to the optimized training result vector, the classification matching set being a proper subset of a preset database, wherein the preset database comprises the element set of all elements that the training samples may involve;
classifying the data to be predicted according to the classification matching set.
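The back-mapping of claim 6 can be illustrated as follows. Everything here is a stand-in: a random orthonormal matrix plays the eigenmatrix, a hand-picked vector plays the SVM's training result vector in the reduced space, a toy word list plays the preset database, and selecting the largest-magnitude weights is an assumed rule for picking the matching set, since the claim does not specify one:

```python
import numpy as np

# Assumed setup: F is the d x k eigenmatrix from the PCA step, w_red is the
# weight vector an SVM learned on the k-dimensional reduced samples.
d, k = 6, 2
F = np.linalg.qr(np.random.default_rng(1).normal(size=(d, k)))[0]  # stand-in eigenmatrix
w_red = np.array([1.5, -0.5])               # stand-in training result vector (reduced space)

w_opt = F @ w_red                           # optimized training result vector, back in d-dim space

vocab = ["free", "offer", "meeting", "report", "winner", "invoice"]  # stand-in preset database
top = np.argsort(np.abs(w_opt))[::-1][:3]   # strongest 3 original features (assumed selection rule)
matching_set = [vocab[i] for i in top]      # classification matching set: proper subset of vocab
print(w_opt.shape, len(matching_set))       # (6,) 3
```

Multiplying by the eigenmatrix maps the k reduced-space weights back onto the d original features, so each element of the preset database gets a weight and the most influential elements can form the matching set used to classify new data.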
7. The method according to claim 1, characterized in that the SVM model is used to recognize whether unknown mails are spam and/or normal emails;
the training samples comprise known normal emails and known spam used for the training.
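The spam use case of claim 7 can be tied to claims 1-3 in a toy end-to-end sketch. The vocabulary and emails are invented for illustration, a bag-of-words count matrix plays the role of the original sample matrices, and a nearest-centroid rule stands in for the SVM so the example stays dependency-free:

```python
import numpy as np

vocab = ["free", "winner", "money", "meeting", "agenda", "report"]

def counts(text):
    """Bag-of-words count vector over the toy vocabulary."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

spam = ["free money winner", "winner free free money"]             # known spam
ham = ["meeting agenda report", "report for the meeting agenda"]   # known normal email

X = np.stack([counts(t) for t in spam + ham], axis=1)  # original sample matrix, 6 x 4
y = np.array([1, 1, 0, 0])                             # 1 = spam, 0 = normal

# PCA reduction per claims 2-3: center, eigendecompose, project onto top-2
mean = X.mean(axis=1, keepdims=True)
C = X - mean
vals, vecs = np.linalg.eigh(C @ C.T)
F = vecs[:, np.argsort(vals)[::-1][:2]]                # eigenmatrix
Z = F.T @ C                                            # reduced training samples

# nearest-centroid classifier stands in for the SVM of the claims
centroids = {c: Z[:, y == c].mean(axis=1) for c in (0, 1)}

def classify(text):
    z = F.T @ (counts(text) - mean[:, 0])              # reduce the unknown mail the same way
    return min(centroids, key=lambda c: np.linalg.norm(z - centroids[c]))

print(classify("free winner money money"))             # 1 (classified as spam)
```

The point of the sketch is the pipeline shape: the same eigenmatrix built from the training mails also projects each unknown mail before classification, which is what lets the classifier operate in the reduced space.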
8. A data processing device, characterized in that the device comprises:
an acquiring unit, configured to obtain an original sample matrix corresponding to each training sample, the training samples being used to train a support vector machine (SVM) model to obtain an SVM model for classifying data to be predicted, wherein the training samples comprise at least two different classes of training samples;
a dimensionality reduction unit, configured to perform dimensionality reduction on the original sample matrix according to a principal component analysis (PCA) algorithm to obtain the training samples after dimensionality reduction.
9. The device according to claim 8, characterized in that the dimensionality reduction unit comprises:
a generation module, configured to generate an eigenmatrix according to the original sample matrix;
a computing module, configured to calculate a transposed matrix of the eigenmatrix;
a multiplication module, configured to multiply the transposed matrix of the eigenmatrix by the original sample matrix to obtain the training samples after dimensionality reduction.
10. The device according to claim 9, characterized in that the generation module is configured to:
calculate the average of all the original sample matrices to obtain a central sample matrix;
calculate a transposed matrix of the central sample matrix, and multiply the central sample matrix by the transposed matrix of the central sample matrix to obtain a target sample matrix;
calculate a plurality of eigenvalues of the target sample matrix and the eigenvector corresponding to each eigenvalue;
select a predetermined number of eigenvectors in descending order of eigenvalue;
arrange the selected eigenvectors in the order in which they were selected to obtain the eigenmatrix.
CN201610715951.6A 2016-08-24 2016-08-24 The method and device of data processing Active CN106446011B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610715951.6A CN106446011B (en) 2016-08-24 2016-08-24 The method and device of data processing


Publications (2)

Publication Number Publication Date
CN106446011A true CN106446011A (en) 2017-02-22
CN106446011B CN106446011B (en) 2019-11-26

Family

ID=58182611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610715951.6A Active CN106446011B (en) 2016-08-24 2016-08-24 The method and device of data processing

Country Status (1)

Country Link
CN (1) CN106446011B (en)


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104142960A (en) * 2013-05-10 2014-11-12 上海普华诚信信息技术有限公司 Internet data analysis system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QIN Yuping et al.: "Research on Spam Detection Based on C-SVM and KPCA", Computer Engineering and Applications *
GAO Hongbin et al.: "Research on Dimensionality Reduction of Data Streams Based on Kernel Principal Component Analysis", Computer Engineering and Applications *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194260A (en) * 2017-04-20 2017-09-22 中国科学院软件研究所 A kind of Linux Kernel association CVE intelligent Forecastings based on machine learning
CN107392257A (en) * 2017-08-03 2017-11-24 网易(杭州)网络有限公司 Acquisition methods, device, storage medium, processor and the service end of the sequence of operation
CN107392257B (en) * 2017-08-03 2020-05-12 网易(杭州)网络有限公司 Method and device for acquiring operation sequence, storage medium, processor and server
CN107729144A (en) * 2017-09-30 2018-02-23 广东欧珀移动通信有限公司 Application control method, apparatus, storage medium and electronic equipment
CN107729144B (en) * 2017-09-30 2020-01-14 Oppo广东移动通信有限公司 Application control method and device, storage medium and electronic equipment
CN113780339A (en) * 2021-08-03 2021-12-10 阿里巴巴(中国)有限公司 Model training, predicting and content understanding method and electronic equipment
CN113780339B (en) * 2021-08-03 2024-03-29 阿里巴巴(中国)有限公司 Model training, predicting and content understanding method and electronic equipment
CN114547482A (en) * 2022-03-03 2022-05-27 智慧足迹数据科技有限公司 Service feature generation method and device, electronic equipment and storage medium
CN114547482B (en) * 2022-03-03 2023-01-20 智慧足迹数据科技有限公司 Service feature generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106446011B (en) 2019-11-26

Similar Documents

Publication Publication Date Title
CN106446011A (en) Data processing method and device
CN108376220A (en) A kind of malice sample program sorting technique and system based on deep learning
CN105955962B (en) The calculation method and device of topic similarity
CN110232280B (en) Software security vulnerability detection method based on tree structure convolutional neural network
CN106611052A (en) Text label determination method and device
CN109948029A (en) Based on the adaptive depth hashing image searching method of neural network
CN112732583B (en) Software test data generation method based on clustering and multi-population genetic algorithm
CN108053030A (en) A kind of transfer learning method and system of Opening field
CN104809069A (en) Source node loophole detection method based on integrated neural network
CN115331732B (en) Gene phenotype training and predicting method and device based on graph neural network
CN109086886A (en) A kind of convolutional neural networks learning algorithm based on extreme learning machine
Gaucel et al. Learning dynamical systems using standard symbolic regression
CN106598827A (en) Method and device for extracting log data
CN115563610B (en) Training method, recognition method and device for intrusion detection model
CN111353313A (en) Emotion analysis model construction method based on evolutionary neural network architecture search
CN109598307A (en) Data screening method, apparatus, server and storage medium
CN112232087A (en) Transformer-based specific aspect emotion analysis method of multi-granularity attention model
CN109840413A (en) A kind of detection method for phishing site and device
CN109067800A (en) A kind of cross-platform association detection method of firmware loophole
CN105045913A (en) Text classification method based on WordNet and latent semantic analysis
CN106649385A (en) Data ranking method and device based on HBase database
Wakayama et al. Distributed forests for MapReduce-based machine learning
Kim et al. Tweaking deep neural networks
Dinu et al. Authorship Identification of Romanian Texts with Controversial Paternity.
CN109472276A (en) The construction method and device and mode identification method of pattern recognition model

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant