CN106446011A - Data processing method and device - Google Patents
- Publication number: CN106446011A (application CN201610715951.6A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- training
- sample
- training sample
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
The invention discloses a data processing method and device in the technical field of computer application, addressing the low efficiency of training an SVM (Support Vector Machine) model with training samples of excessively high dimension. The method comprises the following steps: obtaining the original sample matrix corresponding to each training sample, where the training samples are used to train the SVM model that will classify data to be predicted, and the training samples comprise at least two different categories of training samples; and performing dimension reduction on the original sample matrices according to a PCA (Principal Component Analysis) algorithm to obtain the dimension-reduced training samples. The data processing method and device are applied in the process of training the SVM model.
Description
Technical field
The present invention relates to the field of computer application technology, and in particular to a data processing method and device.
Background technology
A support vector machine (support vector machine, SVM) is a learning model used for pattern recognition, classification, and similar tasks. In practice, SVM models give the best results on two-class problems, so they are commonly used to solve binary classification. Mail classification is one example: an unknown mail is input as data to be predicted into the SVM model, and the two-class decision of the model yields the result that the unknown mail is either a normal mail or a spam mail.

Generally, before an SVM model is used for classification, it must first be trained on known training samples. For example, a large number of normal mails and spam mails collected in advance are used as training samples to train the SVM model.

However, during such training, the inventors found that for training samples of excessively high dimension, the dimension of the training set they form is equally high. An over-high training-set dimension makes the computation needed to train the SVM model very large and introduces a great deal of "noise data", so directly using such samples usually makes training the SVM model inefficient. Take mail training samples: if a mail vector is built with the "words" making up the mail content as units, the vector dimension of each mail can reach hundreds of thousands, so the dimension of the training set formed from these samples also reaches hundreds of thousands. Such a high dimension inevitably increases the computation of training the SVM model, and the mail content also contains many meaningless words that act as "noise data"; the large computation and abundant "noise data" inevitably reduce the efficiency of training the SVM model.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a data processing method and device that overcome the above problems or at least partially solve them.
To solve the above technical problem, in one aspect, the invention provides a data processing method, comprising:

obtaining the original sample matrix corresponding to each training sample, where the training samples are used to train a support vector machine (SVM) model so as to obtain an SVM model for classifying data to be predicted, and the training samples comprise at least two different categories of training samples;

performing dimension reduction on the original sample matrices according to a principal component analysis (PCA) algorithm to obtain the dimension-reduced training samples.
In another aspect, the invention provides a data processing device, comprising:

an acquiring unit, configured to obtain the original sample matrix corresponding to each training sample, where the training samples are used to train a support vector machine (SVM) model so as to obtain an SVM model for classifying data to be predicted, and the training samples comprise at least two different categories of training samples;

a dimension reduction unit, configured to perform dimension reduction on the original sample matrices according to a principal component analysis (PCA) algorithm to obtain the dimension-reduced training samples.
With the data processing method and device provided by the above technical solutions, the present invention can obtain the corresponding original sample matrix from the training samples used to train the SVM model, where the training samples comprise at least two different categories; it then performs dimension reduction on the original sample matrices according to a principal component analysis (Principal Component Analysis, PCA) algorithm to obtain the dimension-reduced training samples. Compared with the prior art, the present invention can reduce the dimension of the training samples through the PCA algorithm; the lower dimension reduces the computation required when the training samples are used to train the SVM model, and also eliminates some "noise data" in the training samples, thereby improving the efficiency of training the SVM model.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented according to the content of the description, and that the above and other objects, features, and advantages of the present invention may become more apparent, specific embodiments of the present invention are set forth below.
Description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The accompanying drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, identical parts are denoted by the same reference numerals. In the drawings:
Fig. 1 shows a flowchart of a data processing method provided in an embodiment of the present invention;

Fig. 2 shows a flowchart of another data processing method provided in an embodiment of the present invention;

Fig. 3 shows a block diagram of a data processing device provided in an embodiment of the present invention;

Fig. 4 shows a block diagram of another data processing device provided in an embodiment of the present invention.
Specific embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be embodied in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thoroughly understood and its scope fully conveyed to those skilled in the art.
To solve the problem that training an SVM model with training samples of excessively high dimension is inefficient, an embodiment of the present invention provides a data processing method. As shown in Fig. 1, the method includes:
101. Obtain the original sample matrix corresponding to each training sample.
In this embodiment, the training samples are used to train the SVM model so as to obtain an SVM model for classifying data to be predicted, and they comprise at least two different categories of training samples. For mails, the different categories may be normal mails and spam mails; for news, they may be sports news, financial news, entertainment news, and so on.
The original sample matrices are obtained by converting the training samples, each training sample corresponding to one original sample matrix. The conversion proceeds as follows. First, construct the preset database corresponding to the training samples, i.e. the element set of all elements the training samples may involve, where each element corresponds to one dimension. Second, sort all the elements. Finally, match the elements contained in each training sample against the elements in the preset database, and obtain each training sample's original sample matrix from the matching result.

The matching process is identical for every training sample, so to simplify the description it is illustrated with a single sample. Specifically: split the training sample by element to obtain the training sample element set; judge whether each element of that set is contained in the preset database, marking the database elements that appear in the training sample element set (usually as 1) and the database elements that do not appear in it (usually as 0); then arrange the 0s and 1s in the sorted order of the elements in the preset database, finally obtaining the original sample matrix of that training sample, whose dimension equals the number of elements in the preset database.
In an example of this embodiment, suppose that after sorting, the elements in the preset database are: red, orange, yellow, green, cyan, blue, purple. If the element set of one training sample is {orange, cyan, purple}, the original sample matrix of that training sample is [0, 1, 0, 0, 1, 0, 1]^T; if the element set of another training sample is {red, orange, yellow}, its original sample matrix is [1, 1, 1, 0, 0, 0, 0]^T.
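The marking process above can be sketched in a few lines of Python. The helper name `build_sample_matrix` and the representation of a column vector as a flat list are illustrative choices, not part of the patent:

```python
# A minimal sketch of the original-sample-matrix construction described above.

def build_sample_matrix(sample_elements, preset_database):
    """Mark each database element 1 if present in the sample, else 0,
    in the database's fixed sorted order, yielding the original sample
    matrix (a column vector, represented here as a flat list)."""
    present = set(sample_elements)
    return [1 if elem in present else 0 for elem in preset_database]

# The sorted preset database from the example: red, orange, yellow, green, cyan, blue, purple
database = ["red", "orange", "yellow", "green", "cyan", "blue", "purple"]

print(build_sample_matrix(["orange", "cyan", "purple"], database))
# first example of the embodiment: [0, 1, 0, 0, 1, 0, 1]
print(build_sample_matrix(["red", "orange", "yellow"], database))
# second example of the embodiment: [1, 1, 1, 0, 0, 0, 0]
```

At the hundreds-of-thousands dimension mentioned below, a sparse representation would be preferable to a dense 0/1 list, but the dense form matches the column-vector example here.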
In practical applications, the richer the content of the training samples, the higher the dimension involved. Mail body content, for example, may involve all kinds of characters in different languages, so the training dimension is very large and commonly reaches hundreds of thousands. Since the dimension of the training samples is very high, the original sample matrices obtained in this step are also of ultra-high dimension; in this embodiment, executing the subsequent step 102 reduces the dimension of the original sample matrices and thereby reduces the amount of model training computation.
102. Perform dimension reduction on the original sample matrices according to the principal component analysis (PCA) algorithm to obtain the dimension-reduced training samples.
Principal component analysis (PCA) is an algorithm for grasping the principal contradiction of a problem and extracting its main influencing factors; it can project high-dimensional data into a lower-dimensional space. This embodiment performs dimension reduction on each original sample matrix according to the PCA algorithm, obtaining the dimension-reduced original sample matrices and therefore the dimension-reduced training samples. These are obtained in order to train the SVM model that classifies the data to be predicted. Training the SVM model with lower-dimensional samples reduces the computation in the training process and improves the efficiency of training the SVM model.
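As a rough illustration of step 102, the NumPy sketch below (toy dimensions; all variable names beyond m, H, and G are assumptions) centers a matrix whose columns are original sample matrices and projects them onto a few leading principal components:

```python
import numpy as np

# Hedged sketch of PCA-based dimension reduction, not the patent's exact procedure.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))          # 8 samples, each a 50-dimensional column
m = X.mean(axis=1, keepdims=True)     # mean of all original sample matrices
H = X - m                             # central sample matrix
# leading eigenvectors of H H^T obtained via SVD of H
U, s, _ = np.linalg.svd(H, full_matrices=False)
G = U[:, :3]                          # eigenmatrix: keep 3 leading components
Y = G.T @ X                           # y = G^T x: training samples after reduction
print(Y.shape)                        # (3, 8)
```

Each column of Y is one dimension-reduced training sample; an SVM would then be trained on these 3-dimensional columns instead of the original 50-dimensional ones.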
With the data processing method provided by this embodiment of the present invention, the corresponding original sample matrix can be obtained from the training samples used to train the SVM model, where the training samples comprise at least two different categories; dimension reduction is then performed on the original sample matrices according to the PCA algorithm to obtain the dimension-reduced training samples. Compared with the prior art, this embodiment can reduce the dimension of the training samples through the PCA algorithm; the lower dimension reduces the computation required when the training samples are used to train the SVM model, and also eliminates some "noise data" in the training samples, thereby improving the efficiency of training the SVM model.
Further, as a refinement and extension of the method shown in Fig. 1, another embodiment of the present invention provides a data processing method. As shown in Fig. 2, the method includes:
201. Obtain the original sample matrix corresponding to each training sample.

This step is implemented in the same way as step 101 in Fig. 1, and the description is not repeated here.
202. Perform dimension reduction on the original sample matrices according to the principal component analysis (PCA) algorithm to obtain the dimension-reduced training samples.
Specifically, performing dimension reduction on the original sample matrices according to the PCA algorithm includes the following process:

First, generate the eigenmatrix from the original sample matrices.

Generating the eigenmatrix specifically includes:

(1) Calculate the mean of all original sample matrices and obtain the central sample matrix.

The formula for calculating the mean m of the original sample matrices is:

m = (1/N) Σ_{i=1}^{c} Σ_{j=1}^{N_i} x_j^i, where N = Σ_{i=1}^{c} N_i

where x_j^i denotes the j-th original sample matrix of the i-th category, c is the total number of categories, N_i denotes the number of original sample matrices contained in the i-th category, and i and j are natural numbers.

The central sample matrix H obtained is:

H = [x_1^1 − m, …, x_{N_1}^1 − m, …, x_1^c − m, …, x_{N_c}^c − m]
(2) Calculate the transposed matrix of the central sample matrix, and multiply the central sample matrix by its transposed matrix to obtain the target sample matrix HH^T.

(3) Calculate the multiple eigenvalues μ of the target sample matrix HH^T and the characteristic vector g corresponding to each eigenvalue.

In this embodiment, the multiple eigenvalues are the multiple nonzero eigenvalues.

The multiple nonzero eigenvalues of HH^T can be calculated by the usual methods for finding the nonzero eigenvalues of a matrix. In this embodiment, they can also be obtained as follows: first calculate the multiple nonzero eigenvalues μ of H^T H and the eigenvectors v corresponding to μ, then obtain the characteristic vectors g using the following formula:

g = Hv / √μ

It should be noted that HH^T and H^T H have the same nonzero eigenvalues, which are therefore both denoted μ.
(4) Select a predetermined number of characteristic vectors in descending order of eigenvalue.

The predetermined number is determined according to the following formula: choose the smallest i such that

(Σ_{j=1}^{i} μ_j) / (Σ_{j=1}^{k} μ_j) ≥ θ

where θ is a preset threshold close to but less than 1, usually 0.9 or higher, which can be set according to the actual application; μ_j is the j-th largest eigenvalue of HH^T, i.e. μ_1 ≥ μ_2 ≥ … ≥ μ_k, where k is the number of nonzero eigenvalues.

Selecting the predetermined number of characteristic vectors means choosing the characteristic vectors corresponding to the first i nonzero eigenvalues.
(5) Arrange the selected characteristic vectors in the order in which they were selected to obtain the eigenmatrix G, specifically:

G = [g_1, …, g_i]

Second, calculate the transposed matrix G^T of the eigenmatrix.

Calculating the transposed matrix of the eigenmatrix means turning the rows of the eigenmatrix into the corresponding columns, yielding a new matrix, usually denoted G^T. When transposing a matrix, the first row of the matrix typically becomes the first column of the transposed matrix, the second row the second column, and so on.
Third, multiply the transposed matrix of the eigenmatrix by the original sample matrices to obtain the dimension-reduced training samples.

The formula specifically involved in this step is:

y = G^T x

where y is the matrix obtained after dimension reduction of the corresponding original sample matrix x. Each y is the training sample corresponding to one original sample matrix after dimension reduction.
In this embodiment, the PCA algorithm reduces the dimension of the original sample matrices through the steps of generating the eigenmatrix, transposing the eigenmatrix, and multiplying the transposed matrix by the original sample matrices.
203. Multiply the eigenmatrix by the training result vector obtained from training, obtaining the optimized training result vector.

The training result vector is the vector obtained after the SVM model is trained with the dimension-reduced training samples; for an SVM model this training result vector is the support vector, i.e. the vector corresponding to the classification boundary that separates data of different categories.

The specific process of optimizing the training result vector is to optimize it with the eigenmatrix obtained in step 202 above; the concrete formula is:

z' = Gz

where z is the training result vector and z' is the optimized training result vector.
204. Determine the classification matching set according to the optimized training result vector.

The classification matching set is a proper subset of the preset database, the preset database being the element set of all elements the training samples may involve. Determining the classification matching set includes: first, finding the nonzero values in the optimized training result vector; second, extracting the corresponding elements from the preset database according to those nonzero values. It should be noted that the training result vector is obtained from the dimension-reduced training samples, so its dimensions correspond to the dimensions of the dimension-reduced training samples; and since the dimension-reduced training samples are obtained by reducing the original sample matrices, the dimensions of the training result vector correspond to the dimensions of the original sample matrices. Because the dimensions of the original sample matrices in turn correspond to the elements in the preset database, the dimensions of the optimized training result vector correspond to the elements in the preset database, and the element of the corresponding dimension can therefore be extracted from the preset database according to the dimension at which each nonzero value of the optimized training result vector is located. Finally, the elements corresponding to the extracted nonzero values form the classification matching set, each nonzero value serving as the coefficient of the corresponding element in the classification matching set.
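A minimal sketch of steps 203 and 204, with an invented 7-element database, eigenmatrix, and support vector purely for illustration:

```python
import numpy as np

# Map a support vector z from the reduced space back to the original
# dimensions via z' = G z, then read off the database elements whose
# coordinates are nonzero; those coordinates become the coefficients.
database = ["red", "orange", "yellow", "green", "cyan", "blue", "purple"]
G = np.array([[0.0,  0.5],
              [0.8,  0.0],
              [0.0,  0.0],
              [0.0,  0.0],
              [0.6,  0.0],
              [0.0,  0.0],
              [0.0, -0.5]])            # 7 original dims, 2 kept components
z = np.array([1.0, 2.0])               # training result (support) vector
z_opt = G @ z                          # optimized training result vector
matching_set = {database[d]: z_opt[d]  # element -> coefficient
                for d in np.flatnonzero(z_opt)}
print(sorted(matching_set))            # ['cyan', 'orange', 'purple', 'red']
```

Only four of the seven database elements survive into the matching set here, which is the size reduction step 205 later exploits.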
205. Classify the data to be predicted according to the classification matching set.

The specific process of classifying the data to be predicted according to the classification matching set is as follows:

First, perform multi-pattern matching of the classification matching set against the data to be predicted. Multi-pattern matching is the process of finding multiple pattern strings within one string; in this embodiment it specifically means finding, in the data to be predicted, the multiple elements of the classification matching set. There are several multi-pattern matching algorithms, common ones including the dictionary tree (Trie tree), the AC automaton (Aho-Corasick automaton) algorithm, and the Wu-Manber (WM) algorithm. This embodiment does not limit which multi-pattern matching algorithm is used. The final matching result determines which elements of the classification matching set are contained in the data to be predicted.

Second, accumulate the coefficients, in the classification matching set, of the elements that are present both in the classification matching set and in the data to be predicted.

Third, classify the data to be predicted according to the accumulated result, i.e. determine the category of the data to be predicted. Specifically: a corresponding threshold range is set for each category, the different threshold ranges having no intersection; the accumulated result is then compared with all the threshold ranges, and whichever threshold range the accumulated result falls into decides which category the data to be predicted belongs to.
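Step 205 can be sketched as below. A plain substring scan stands in for the multi-pattern matcher (the embodiment permits any such algorithm), and the matching-set coefficients and threshold ranges are invented for illustration:

```python
# Match the classification matching set against the data to be predicted,
# accumulate the coefficients of the elements found, and pick the category
# whose threshold range contains the accumulated sum.
matching_set = {"prize": 1.4, "free": 0.9, "meeting": -1.2, "report": -0.8}
ranges = {"spam": (0.0, float("inf")), "normal": (float("-inf"), 0.0)}

def classify(text):
    score = sum(coef for elem, coef in matching_set.items() if elem in text)
    for label, (lo, hi) in ranges.items():
        if lo <= score < hi:      # disjoint ranges, so at most one matches
            return label

print(classify("claim your free prize now"))   # spam   (0.9 + 1.4 = 2.3)
print(classify("meeting report attached"))     # normal (-1.2 - 0.8 = -2.0)
```

For long inputs and large matching sets, replacing the substring scan with an Aho-Corasick automaton keeps the matching pass linear in the text length.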
With the data processing method provided by this embodiment of the present invention, besides reducing the computation of the training process by reducing the dimension of the training samples, the training result vector obtained from training can also be optimized, and the classification matching set determined from the optimized training result vector. When the SVM model is used to classify data to be predicted, classification only needs to consider the elements contained in the classification matching set rather than all the elements that may make up data of the different categories, so the computation in the classification process can be greatly reduced and the efficiency of data classification improved.
Further, for the eigenmatrix generation involved in Fig. 2, the embodiment of the present invention also provides another method of generating the eigenmatrix, specifically as follows:

Since the training samples are generally of a high order of magnitude, performing the eigenmatrix calculations of the corresponding generation step in Fig. 2 on all of them at once would, because of the large number of training samples, put considerable computational pressure on the system. This embodiment therefore generates the eigenmatrix from the original sample matrices by way of grouping. The idea of grouping is to divide the high-magnitude training samples into multiple groups, so that the number of training samples in each group is greatly reduced; the corresponding calculation is then carried out on the training samples of each group. Compared with computing the eigenmatrix from all the training samples as one whole, the number of training samples handled at a time is greatly reduced, so the computational pressure on the system is reduced. The specific process of generating the eigenmatrix from the original sample matrices by grouping includes the following steps:
First, calculate the number M of matrix sets according to the preset dimension value.

The preset dimension value is set according to the computing capability of the system; its setting generally follows the principle of being as large as possible within the system's computing capability.
The specific formula for calculating the number M of matrix sets from the preset dimension value is:

M = ⌈a / b⌉

where a is the number of training samples of each category contained in the training samples (it should be noted that when the training samples are selected, the number of training samples of each category is equal), and b is the preset dimension value.
Second, divide the original sample matrices of each category evenly into M matrix sets.

Before the original sample matrices of each category are divided evenly into M matrix sets, the original sample matrices corresponding to the training samples must first be classified by the category of the training samples, obtaining the original sample matrices of each category.
Third, combine the matrix sets of the different categories by permutation and combination to obtain multiple matrix groups, where each matrix group contains matrix sets of all the categories.

A specific example illustrates how the multiple matrix groups are obtained. Suppose the training samples comprise two categories, each with 1000 training samples, so the number of corresponding original sample matrices is also 1000 per category; and suppose the number M of matrix sets obtained from the preset dimension value is 10. Then 10 matrix sets are obtained for the original sample matrices of each category, and each matrix set of one category is combined with the corresponding matrix sets of the other category, the number of matrix groups finally obtained being 45.
Fifth, calculate the sub-eigenmatrix of each matrix group respectively.

The sub-eigenmatrix of each matrix group is calculated in the same way as the eigenmatrix is generated in Fig. 2, with the matrix group playing the role of all the original sample matrices and the sub-eigenmatrix the role of the eigenmatrix. Multiple sub-eigenmatrices are finally obtained.
Sixth, combine the multiple sub-eigenmatrices in the order in which they were calculated to obtain the eigenmatrix.

The sub-eigenmatrices obtained in the fifth step are combined linearly; specifically, the multiple sub-eigenmatrices are combined in the order in which they were calculated to obtain the eigenmatrix. As an example: suppose the number of sub-eigenmatrices obtained is r, and denote the i-th sub-eigenmatrix G_i; then the eigenmatrix obtained by combining the r sub-eigenmatrices is G = [G_1 G_2 … G_r].
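The grouped procedure can be sketched as below, with toy sizes. The full cross-pairing of sets used here is one reading of the "permutation and combination" step (the embodiment's own example yields 45 groups from a different pairing), and the helper `sub_eigenmatrix` reuses an SVD shortcut rather than the exact sequence of Fig. 2:

```python
import numpy as np

# Split each category's sample matrices into M sets, pair sets across
# categories into groups, compute a sub-eigenmatrix per group, and
# concatenate the sub-eigenmatrices into G = [G1 G2 ... Gr].

def sub_eigenmatrix(H, n_keep):
    """Leading eigenvectors of H H^T, via SVD of the centered block."""
    U, s, _ = np.linalg.svd(H - H.mean(axis=1, keepdims=True),
                            full_matrices=False)
    return U[:, :n_keep]

rng = np.random.default_rng(2)
class_a = np.hsplit(rng.normal(size=(30, 8)), 2)   # category A: M=2 sets of 4
class_b = np.hsplit(rng.normal(size=(30, 8)), 2)   # category B: M=2 sets of 4
groups = [np.hstack([ga, gb]) for ga in class_a for gb in class_b]
subs = [sub_eigenmatrix(H, n_keep=2) for H in groups]
G = np.hstack(subs)                                # 4 groups x 2 columns each
print(G.shape)                                     # (30, 8)
```

Each eigen-decomposition here touches only 8 sample columns at a time instead of all 16, which is the computational-pressure argument made above.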
In one application of this embodiment, the SVM model trained by the flow of Fig. 1 or Fig. 2 can be used to identify whether unknown mails are spam mails and/or normal mails. Specifically, when performing the identification: an equal number of known normal mails and known spam mails form the training samples, and the dimension-reduced training samples are obtained in the manner of Fig. 1 or Fig. 2, where it should be noted that the preset database is the set formed of all the possible "words" that make up a mail; the training samples are then used to train the SVM model to obtain the corresponding support vector, the support vector is optimized, and the corresponding classification matching set is obtained from the optimized support vector, the classification matching set containing only the "words" corresponding to the nonzero values in the optimized support vector; then the coefficients of the words contained in the unknown mail that are present in the matching set are accumulated, and the accumulated numerical result is used to recognize whether the mail is a spam mail or a normal mail.
Further, as an implementation of the above embodiments, another embodiment of the present invention also provides a data processing device for realizing the methods described in Fig. 1 and Fig. 2 above. As shown in Fig. 3, the device includes an acquiring unit 31 and a dimension reduction unit 32.
Acquiring unit 31, for obtaining the corresponding original sample matrix of each training sample, training sample is used for supporting
Vector machine SVM model is trained obtaining the SVM model for treating that prediction data is classified, and wherein, training sample is comprising at least
Two kinds of different classes of training samples.
The original sample matrices are obtained by converting the training samples, where each training sample corresponds to one original sample matrix. The conversion of a training sample into an original sample matrix proceeds as follows. First, a preset database corresponding to the training samples is constructed; the preset database is the set of all elements that the training samples may involve, where each element corresponds to one dimension. Secondly, all the elements are sorted. Finally, the elements contained in each training sample are matched against the elements of the preset database, and the original sample matrix corresponding to each training sample is obtained from the matching result.

The matching process is identical for every training sample, so for brevity it is illustrated with a single training sample. Specifically, the training sample is split by element to obtain a training sample element set. Each element of the preset database is then checked against this set: elements of the preset database that appear in the training sample element set are marked, usually as 1, and elements of the preset database that do not appear are also marked, usually as 0. The 0s and 1s are arranged according to the ordering of the elements in the preset database, which finally yields the original sample matrix of the training sample, whose dimensionality equals the number of elements in the preset database.
The dimensionality reduction unit 32 is configured to perform dimensionality reduction on the original sample matrices according to a principal component analysis (PCA) algorithm, obtaining the training samples after dimensionality reduction.

Principal component analysis (PCA) is an algorithm for extracting the main influencing factors of a phenomenon, that is, for grasping its principal features, and it can project high-dimensional data into a lower-dimensional space. In this embodiment, dimensionality reduction is performed on each original sample matrix according to the PCA algorithm, yielding the reduced original sample matrices and thereby the reduced training samples. The reduced training samples are then used to train the SVM model for classifying data to be predicted. Training the SVM model with lower-dimensional training samples reduces the amount of calculation in the training process and improves the efficiency of training the SVM model.
Further, as shown in Fig. 4, the dimensionality reduction unit 32 includes:

a generation module 321, configured to generate an eigenmatrix G from the original sample matrices x;

a computing module 322, configured to calculate the transposed matrix G^T of the eigenmatrix; transposing turns each column of the eigenmatrix into the corresponding row of the new matrix, which is conventionally denoted G^T, with the first column of the matrix becoming the first row of the transposed matrix and the first row becoming the first column;

a multiplication module 323, configured to multiply the transposed matrix of the eigenmatrix by the original sample matrix to obtain the training sample after dimensionality reduction.

The formula used in the multiplication module 323 is:

y = G^T x

where y is the matrix after dimensionality reduction of the corresponding original sample matrix; each y is the training sample corresponding to one original sample matrix after dimensionality reduction.
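The formula y = G^T x can be illustrated with a small NumPy sketch; the matrix sizes and random data are invented for the example.

```python
import numpy as np

# Illustrative dimensionality reduction y = G^T x: G has d rows and i
# columns (one column per retained eigenvector), x is a d-dimensional
# sample (column vector), so y has only i dimensions.
rng = np.random.default_rng(0)
d, i = 6, 2                       # original and reduced dimensionality
G = rng.standard_normal((d, i))   # eigenmatrix (columns = eigenvectors)
x = rng.standard_normal((d, 1))   # one original sample matrix

y = G.T @ x                       # reduced training sample
print(y.shape)                    # (2, 1): i-dimensional after reduction
```

The reduction from d to i dimensions is what lowers the cost of the subsequent SVM training.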
In this embodiment, the PCA algorithm achieves the goal of reducing the dimensionality of the original sample matrices through the steps of generating the eigenmatrix, transposing the eigenmatrix, and multiplying the transposed matrix by the original sample matrices.
Further, the generation module 321 is configured to:

calculate the mean of all the original sample matrices to obtain a central sample matrix;

calculate the transposed matrix of the central sample matrix, and multiply the central sample matrix by its transposed matrix to obtain a target sample matrix;

calculate multiple eigenvalues of the target sample matrix and the eigenvector corresponding to each eigenvalue;

select a predetermined number of eigenvectors in descending order of eigenvalue;

arrange the selected eigenvectors in the order of selection to obtain the eigenmatrix.
Specifically, generating the eigenmatrix in the generation module 321 includes:

(1) calculating the mean of all the original sample matrices to obtain the central sample matrix;

Specifically, the mean m of the original sample matrices is calculated as:

m = (1/N) Σ_{i=1}^{c} Σ_{j=1}^{N_i} x_j^i, with N = Σ_{i=1}^{c} N_i

where x_j^i denotes the j-th original sample matrix of the i-th class, c is the total number of classes, N_i denotes the number of original sample matrices contained in the i-th class, and i and j are natural numbers.

The central sample matrix H is obtained by subtracting the mean from every original sample matrix:

H = [x_1^1 − m, x_2^1 − m, …, x_{N_c}^c − m]
(2) calculating the transposed matrix of the central sample matrix, and multiplying the central sample matrix by its transposed matrix to obtain the target sample matrix HH^T;

(3) calculating the multiple eigenvalues μ of the target sample matrix HH^T and the eigenvector g corresponding to each eigenvalue;

In this embodiment, the multiple eigenvalues are multiple non-zero eigenvalues.

The multiple non-zero eigenvalues of HH^T can be calculated by the usual methods for finding the non-zero eigenvalues of a matrix. In this embodiment they can also be calculated as follows: first calculate the multiple non-zero eigenvalues μ of H^T H and the eigenvectors v corresponding to μ, then obtain the eigenvectors g using the following formula:

g = Hv / √μ

It should be noted that HH^T and H^T H have the same non-zero eigenvalues, which are therefore all denoted μ.
(4) selecting a predetermined number of eigenvectors in descending order of eigenvalue;

The predetermined number i is determined according to the following inequality:

(μ1 + μ2 + … + μi) / (μ1 + μ2 + … + μk) ≥ θ

where θ is a predetermined threshold value close to but less than 1, usually taken as 0.9 or higher, which can be set according to the actual application; μj is the j-th largest eigenvalue of HH^T, that is, μ1 ≥ μ2 ≥ … ≥ μk, where k is the number of non-zero eigenvalues.

Selecting the predetermined number of eigenvectors means choosing the eigenvectors corresponding to the first i non-zero eigenvalues.

(5) arranging the selected eigenvectors in the order of selection to obtain the eigenmatrix G, specifically:

G = [g1, …, gi]
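Steps (1) to (5) can be sketched in NumPy as follows. This is a reading of the procedure under stated assumptions: the H^T H shortcut of step (3) is used, the non-zero tolerance and the random data are illustrative, and the function name is invented.

```python
import numpy as np

def generate_eigenmatrix(X, theta=0.9):
    """X: d x n matrix, one original sample matrix per column.
    Returns the eigenmatrix G (d columns of samples reduced to i eigenvectors)."""
    m = X.mean(axis=1, keepdims=True)       # (1) mean of all samples
    H = X - m                               # central sample matrix
    mu, V = np.linalg.eigh(H.T @ H)         # (3) eigenpairs of the smaller H^T H
    order = np.argsort(mu)[::-1]            # sort eigenvalues descending
    mu, V = mu[order], V[:, order]
    keep = mu > 1e-10                       # keep non-zero eigenvalues only
    mu, V = mu[keep], V[:, keep]
    # (4) smallest i with (mu_1+...+mu_i)/(mu_1+...+mu_k) >= theta
    i = int(np.searchsorted(np.cumsum(mu) / mu.sum(), theta) + 1)
    # map eigenvectors of H^T H back to those of H H^T: g = H v / sqrt(mu)
    G = H @ V[:, :i] / np.sqrt(mu[:i])
    return G                                # (5) G = [g1, ..., gi]

X = np.random.default_rng(1).standard_normal((50, 8))  # 8 samples, 50 dims
G = generate_eigenmatrix(X)
print(G.shape[0])  # 50 rows; the column count is the retained number i
```

Working with the n x n matrix H^T H instead of the d x d matrix HH^T is the point of the shortcut: with few samples in many dimensions (n « d), the eigendecomposition is far cheaper.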
Further, the generation module 321 is configured to:

divide the original sample matrices evenly into multiple matrix groups;

calculate the sub-eigenmatrix of each matrix group separately;

combine the multiple sub-eigenmatrices in the order in which they were calculated to obtain the eigenmatrix.
Further, the generation module 321 is configured to:

classify the original sample matrices corresponding to the training samples according to the classes of the training samples;

calculate the number M of matrix sets according to a preset dimension value;

divide the original sample matrices of each class evenly into M matrix sets;

combine matrix sets of different classes by permutation and combination to obtain multiple matrix groups, wherein each matrix group contains one matrix set of every class.
The generation module 321 also supports another way of generating the eigenmatrix, specifically:

First, the number M of matrix sets is calculated according to a preset dimension value.

The preset dimension value is set according to the computing capability of the system; as a rule, it should be as large as the computing capability of the system allows.

The specific formula for calculating the number M of matrix sets from the preset dimension value is:

M = a / b

where a is the number of training samples of each class contained in the training samples (it should be noted that when the training samples are selected, the number of training samples of each class is made equal) and b is the preset dimension value.

Second, the original sample matrices of each class are divided evenly into M matrix sets.

Before the original sample matrices of each class can be divided into M matrix sets, the original sample matrices corresponding to the training samples must first be classified according to the classes of the training samples, yielding the original sample matrices of each class.

Third, matrix sets of different classes are combined by permutation and combination to obtain multiple matrix groups, wherein each matrix group contains one matrix set of every class.
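Under the assumption that M = a / b and that the g-th set of every class is combined into the g-th group (one simple choice among the set combinations mentioned above), the grouping can be sketched as follows; the class names and sample placeholders are invented.

```python
# Split the a samples of each class into M = a / b sets of b matrices,
# then form groups so that every group contains one set of every class.
def make_matrix_groups(samples_by_class, b):
    a = len(next(iter(samples_by_class.values())))  # samples per class (equal)
    M = a // b                                      # number of matrix sets
    groups = []
    for g in range(M):
        group = {cls: samples[g * b:(g + 1) * b]    # g-th set of this class
                 for cls, samples in samples_by_class.items()}
        groups.append(group)                        # one set of every class
    return groups

samples = {"spam": list(range(6)), "normal": list(range(10, 16))}
groups = make_matrix_groups(samples, b=2)
print(len(groups))  # M = 6 / 2 = 3 groups, each containing both classes
```

Each group is then small enough (2b matrices here) to compute its sub-eigenmatrix within the system's capability.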
Further, as shown in Fig. 4, the device further includes:

an optimizing unit 33, configured to, after the dimensionality reduction of the original sample matrices according to the PCA algorithm has produced the reduced training samples and the SVM model has been trained with the reduced training samples, multiply the eigenmatrix by the training result vector obtained after training, thereby obtaining the optimized training result vector.

The training result vector is the result obtained after training the SVM model with the reduced training samples; specifically, the training result vector obtained for the SVM model is the support vector, that is, the vector corresponding to the classification boundary that separates data of different classes.

The specific optimization of the training result vector uses the eigenmatrix obtained in the dimensionality reduction unit 32 to optimize the support vector, and the concrete formula is:

z' = Gz

where z is the training result vector and z' is the optimized training result vector.
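A toy NumPy sketch of the formula z' = Gz: the support vector z lives in the reduced i-dimensional space, and multiplying by the eigenmatrix G maps it back into the original d-dimensional element space, so that each entry of z' lines up with one preset-database element. The eigenmatrix here is a made-up d x i matrix.

```python
import numpy as np

d, i = 5, 2
G = np.eye(d)[:, :i]             # toy eigenmatrix (d x i) for the sketch
z = np.array([[0.7], [-1.2]])    # training result (support) vector, i x 1
z_opt = G @ z                    # optimized training result vector, d x 1
print(z_opt.shape)               # (5, 1)
```

With this toy G, the first two entries of z_opt carry the support-vector weights and the remaining entries are zero, which is exactly the pattern the matching set in the next step exploits.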
a determining unit 34, configured to determine a classification matching set from the optimized training result vector, the classification matching set being a proper subset of the preset database, which contains the set of all elements that the training samples may involve.

Determining the classification matching set includes the following. First, the non-zero values in the optimized training result vector are looked up. Secondly, the element corresponding to each non-zero value of the optimized training result vector is extracted from the preset database. It should be noted that the training result vector is obtained from the reduced training samples, so its dimensions correspond to the dimensions of the reduced training samples; the reduced training samples are in turn obtained from the original sample matrices by dimensionality reduction, so the dimensions of the training result vector correspond to the dimensions of the original sample matrices. Since the dimensions of the original sample matrices correspond to the elements of the preset database, the dimensions of the optimized training result vector also correspond to the elements of the preset database, and the element of the corresponding dimension can therefore be extracted from the preset database for the dimension of each non-zero value of the optimized training result vector. Finally, the elements corresponding to the extracted non-zero values form the classification matching set, where each non-zero value serves as the coefficient of the corresponding element in the classification matching set.
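The look-up of non-zero values and their elements can be sketched as follows; the ordered preset database and the vector values are invented for the example.

```python
# Build the classification matching set: pair each non-zero entry of the
# optimized training result vector with the preset-database element of
# the same dimension, keeping the value as that element's coefficient.
def build_matching_set(z_opt, preset_database_sorted):
    """Returns {element: coefficient} for the non-zero entries of z_opt."""
    return {elem: coef
            for elem, coef in zip(preset_database_sorted, z_opt)
            if coef != 0}

database = ["buy", "cheap", "hello", "meeting", "now"]   # ordered elements
z_opt = [0.8, 0.5, 0.0, 0.0, -0.3]                       # optimized vector
print(build_matching_set(z_opt, database))
# {'buy': 0.8, 'cheap': 0.5, 'now': -0.3}, a proper subset of the database
```

Only three of the five database elements survive, which is why later classification needs to consider far fewer elements than the full preset database.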
a classifying unit 35, configured to classify the data to be predicted according to the classification matching set.

The specific classification of the data to be predicted according to the classification matching set is as follows.

First, multi-pattern matching of the classification matching set is performed on the data to be predicted. Multi-pattern matching is the problem of finding multiple pattern strings within one string; in this embodiment it specifically means finding, in the data to be predicted, the multiple elements of the classification matching set. There are several multi-pattern matching algorithms, common ones including the dictionary tree (Trie tree), the Aho-Corasick automaton (AC automaton) algorithm and the Wu-Manber (WM) algorithm. This embodiment does not limit which multi-pattern matching algorithm is used for matching. The final matching result determines which elements of the classification matching set are contained in the data to be predicted.

Second, the coefficients corresponding to those elements of the classification matching set that are also present in the data to be predicted are accumulated.

Third, the data to be predicted are classified according to the accumulated result.

Classifying the data to be predicted according to the accumulated result means determining the class of the data to be predicted. Specifically, a corresponding threshold range is set for each class, with no overlap between different threshold ranges; the accumulated result is then compared with all the threshold ranges, and the threshold range into which the accumulated result falls decides which class the data to be predicted belong to.
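The three steps can be sketched as follows; a plain substring scan stands in for the Trie/AC/Wu-Manber matchers named above, and the matching-set contents and threshold ranges are invented.

```python
# Match the matching-set elements in the data to be predicted, accumulate
# the matched coefficients, and pick the class whose (disjoint) threshold
# range contains the accumulated result.
def classify(text, matching_set, threshold_ranges):
    total = sum(coef for word, coef in matching_set.items() if word in text)
    for label, (lo, hi) in threshold_ranges.items():
        if lo <= total < hi:                  # ranges are disjoint by design
            return label
    return None

matching_set = {"buy": 0.8, "cheap": 0.5, "meeting": -0.6}
ranges = {"spam": (0.5, float("inf")), "normal": (float("-inf"), 0.5)}
print(classify("buy cheap pills now", matching_set, ranges))    # spam
print(classify("project meeting at 10", matching_set, ranges))  # normal
```

In production one of the named multi-pattern algorithms would replace the substring scan, since scanning each element separately is linear in the matching-set size.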
With the data processing method provided by the embodiment of the present invention, besides reducing the amount of calculation of the training process by reducing the dimensionality of the training samples, the training result vector obtained by training can also be optimized and the classification matching set determined from the optimized training result vector. When the SVM model is used to classify data to be predicted, only the elements contained in the matching set need to be considered, rather than all the elements that may constitute data of the different classes, which greatly reduces the amount of calculation in the data classification process and improves the efficiency of data classification.
Further, the SVM model in the acquiring unit 31 is used to recognize whether unknown mails are spam and/or normal mails;

the training samples comprise known normal mails and known spam mails used for training.

In one application of this embodiment, the SVM model trained by the device of Fig. 3 or Fig. 4 can be used to recognize whether unknown mails are spam and/or normal mails. The recognition specifically proceeds as follows. Equal numbers of known normal mails and known spam mails are composed into training samples, and the device of Fig. 3 or Fig. 4 then obtains the reduced training samples, it being noted that the preset database here is the set of all possible "words" that make up a mail. The training samples are then used to train the SVM model to obtain the corresponding support vector; the support vector is optimized, and the corresponding classification matching set is obtained from the optimized support vector, containing only the "words" corresponding to the non-zero values of the optimized support vector. Finally, the coefficients of the words contained in an unknown mail that are also present in the matching set are accumulated, and the accumulated result is used to recognize whether the mail is spam or a normal mail.
The data processing device provided by the embodiment of the present invention can obtain the corresponding original sample matrix from each of the training samples used to train the SVM model, wherein the training samples comprise at least two different classes of training samples, and then perform dimensionality reduction on the original sample matrices according to a principal component analysis (PCA) algorithm to obtain the reduced training samples. Compared with the prior art, the embodiment of the present invention can reduce the dimensionality of the training samples by the PCA algorithm; the reduced dimensionality lowers the amount of calculation when the SVM model is trained with the training samples, while some "noise data" in the training samples are also eliminated, thereby improving the efficiency of training the SVM model.
The embodiment of the invention also discloses:

A1. A data processing method, the method comprising: obtaining an original sample matrix corresponding to each training sample, the training samples being used to train a support vector machine (SVM) model so as to obtain an SVM model for classifying data to be predicted, wherein the training samples comprise at least two different classes of training samples;

performing dimensionality reduction on the original sample matrices according to a principal component analysis (PCA) algorithm to obtain the training samples after dimensionality reduction.

A2. The method according to A1, wherein performing dimensionality reduction on the original sample matrices according to the PCA algorithm to obtain the reduced training samples includes:

generating an eigenmatrix from the original sample matrices;

calculating the transposed matrix of the eigenmatrix;

multiplying the transposed matrix of the eigenmatrix by the original sample matrix to obtain the training sample after dimensionality reduction.

A3. The method according to A2, wherein generating the eigenmatrix from the original sample matrices includes:

calculating the mean of all the original sample matrices to obtain a central sample matrix;

calculating the transposed matrix of the central sample matrix, and multiplying the central sample matrix by its transposed matrix to obtain a target sample matrix;

calculating multiple eigenvalues of the target sample matrix and the eigenvector corresponding to each eigenvalue;

selecting a predetermined number of eigenvectors in descending order of eigenvalue;

arranging the selected eigenvectors in the order of selection to obtain the eigenmatrix.

A4. The method according to A2 or A3, wherein generating the eigenmatrix from the original sample matrices includes:

dividing the original sample matrices evenly into multiple matrix groups;

calculating the sub-eigenmatrix of each matrix group separately;

combining the multiple sub-eigenmatrices in the order in which they were calculated to obtain the eigenmatrix.

A5. The method according to A4, wherein dividing the original sample matrices evenly into multiple matrix groups includes:

classifying the original sample matrices corresponding to the training samples according to the classes of the training samples;

calculating the number M of matrix sets according to a preset dimension value;

dividing the original sample matrices of each class evenly into M matrix sets;

combining matrix sets of different classes by permutation and combination to obtain the multiple matrix groups, wherein each matrix group contains one matrix set of every class.

A6. The method according to A2, wherein after performing dimensionality reduction on the original sample matrices according to the PCA algorithm to obtain the reduced training samples, the method further includes:

after the SVM model has been trained with the reduced training samples, multiplying the eigenmatrix by the training result vector obtained after training to obtain an optimized training result vector;

determining a classification matching set from the optimized training result vector, the classification matching set being a proper subset of a preset database, the preset database containing the set of all elements that the training samples may involve;

classifying the data to be predicted according to the classification matching set.

A7. The method according to A1, wherein the SVM model is used to recognize whether unknown mails are spam and/or normal mails;

the training samples comprise known normal mails and known spam mails used for training.
B8. A data processing device, the device comprising:

an acquiring unit, configured to obtain an original sample matrix corresponding to each training sample, the training samples being used to train a support vector machine (SVM) model so as to obtain an SVM model for classifying data to be predicted, wherein the training samples comprise at least two different classes of training samples;

a dimensionality reduction unit, configured to perform dimensionality reduction on the original sample matrices according to a principal component analysis (PCA) algorithm to obtain the training samples after dimensionality reduction.

B9. The device according to B8, wherein the dimensionality reduction unit includes:

a generation module, configured to generate an eigenmatrix from the original sample matrices;

a computing module, configured to calculate the transposed matrix of the eigenmatrix;

a multiplication module, configured to multiply the transposed matrix of the eigenmatrix by the original sample matrix to obtain the training sample after dimensionality reduction.

B10. The device according to B9, wherein the generation module is configured to:

calculate the mean of all the original sample matrices to obtain a central sample matrix;

calculate the transposed matrix of the central sample matrix, and multiply the central sample matrix by its transposed matrix to obtain a target sample matrix;

calculate multiple eigenvalues of the target sample matrix and the eigenvector corresponding to each eigenvalue;

select a predetermined number of eigenvectors in descending order of eigenvalue;

arrange the selected eigenvectors in the order of selection to obtain the eigenmatrix.

B11. The device according to B9 or B10, wherein the generation module is configured to:

divide the original sample matrices evenly into multiple matrix groups;

calculate the sub-eigenmatrix of each matrix group separately;

combine the multiple sub-eigenmatrices in the order in which they were calculated to obtain the eigenmatrix.

B12. The device according to B11, wherein the generation module is configured to:

classify the original sample matrices corresponding to the training samples according to the classes of the training samples;

calculate the number M of matrix sets according to a preset dimension value;

divide the original sample matrices of each class evenly into M matrix sets;

combine matrix sets of different classes by permutation and combination to obtain the multiple matrix groups, wherein each matrix group contains one matrix set of every class.

B13. The device according to B9, wherein the device further includes:

an optimizing unit, configured to, after the dimensionality reduction of the original sample matrices according to the PCA algorithm has produced the reduced training samples and the SVM model has been trained with the reduced training samples, multiply the eigenmatrix by the training result vector obtained after training to obtain an optimized training result vector;

a determining unit, configured to determine a classification matching set from the optimized training result vector, the classification matching set being a proper subset of a preset database, the preset database containing the set of all elements that the training samples may involve;

a classifying unit, configured to classify the data to be predicted according to the classification matching set.

B14. The device according to B8, wherein the SVM model in the acquiring unit is used to recognize whether unknown mails are spam and/or normal mails;

the training samples comprise known normal mails and known spam mails used for training.
In the above embodiments, each embodiment is described with its own emphasis; for parts not described in detail in one embodiment, reference may be made to the relevant descriptions of the other embodiments.

It is to be understood that related features in the above method and device may refer to one another. In addition, "first", "second" and the like in the above embodiments are used to distinguish the embodiments and do not represent their relative merits.

Those skilled in the art will clearly appreciate that, for convenience and brevity of description, the specific working processes of the system, device and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other equipment. Various general-purpose systems may also be used with the teaching herein, and from the description above the structure required to construct such a system is apparent. Moreover, the present invention is not directed to any particular programming language. It is to be understood that various programming languages may be used to implement the content of the invention described herein, and the above description of a specific language is given in order to disclose the best mode of carrying out the invention.

Numerous specific details are set forth in the description provided herein. It is to be understood, however, that embodiments of the present invention may be practised without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it is to be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together in a single embodiment, figure or description thereof. However, the disclosed method is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, the inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will understand that the modules in the devices of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may furthermore be divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose.

Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include some features included in other embodiments rather than other features, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments may be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device according to embodiments of the present invention (such as the data processing device). The present invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of multiple such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any ordering; these words may be interpreted as names.
Claims (10)
1. a kind of method of data processing, it is characterised in that methods described includes:Obtain the corresponding original sample of each training sample
This matrix, the training sample is used for support vector machines model to be trained obtaining to treat what prediction data was classified
SVM model, wherein, the training sample includes at least two different classes of training samples;
Dimension-reduction treatment is carried out according to principal component analysiss PCA algorithm to the original sample matrix, obtains the training sample after dimensionality reduction.
2. The method according to claim 1, characterized in that performing dimensionality reduction on the original sample matrices according to the PCA algorithm to obtain the training samples after dimensionality reduction comprises:
generating an eigenmatrix according to the original sample matrices;
calculating the transposed matrix of the eigenmatrix; and
multiplying the transposed matrix of the eigenmatrix by the original sample matrices to obtain the training samples after dimensionality reduction.
3. The method according to claim 2, characterized in that generating the eigenmatrix according to the original sample matrices comprises:
calculating the average of all original sample matrices to obtain a central sample matrix;
calculating the transposed matrix of the central sample matrix, and multiplying the central sample matrix by its transposed matrix to obtain a target sample matrix;
calculating a plurality of eigenvalues of the target sample matrix and the eigenvector corresponding to each eigenvalue;
selecting a predetermined number of eigenvectors in descending order of eigenvalue; and
arranging the selected eigenvectors in the order in which they were selected to obtain the eigenmatrix.
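The steps of claims 2 and 3 can be sketched with NumPy. This is a hedged illustration, not the patented implementation: the sample layout (rows = samples, columns = features), the component count `k`, and the choice of multiplying the centered matrix's transpose by itself (rather than the reverse, which depends on matrix orientation) are all assumptions for the example.

```python
import numpy as np

# Illustrative data: 100 training samples with 20 features each
# (shapes and names are assumptions, not from the patent).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Claim 3, step 1: average the samples and center them (central sample matrix).
centered = X - X.mean(axis=0)

# Claim 3, step 2: multiply the central sample matrix by its transpose
# to obtain the target sample matrix (orientation assumed features x features).
target = centered.T @ centered            # 20 x 20

# Claim 3, step 3: eigenvalues and eigenvectors of the target sample matrix.
eigvals, eigvecs = np.linalg.eigh(target)  # eigh returns ascending eigenvalues

# Claim 3, steps 4-5: select k eigenvectors in descending eigenvalue order
# and arrange them, in selection order, into the eigenmatrix.
k = 5
order = np.argsort(eigvals)[::-1][:k]
eigenmatrix = eigvecs[:, order]            # 20 x k

# Claim 2: multiplying by the eigenmatrix projects the samples;
# X @ eigenmatrix equals (eigenmatrix.T @ X.T).T, i.e. the claimed
# transpose-then-multiply step up to orientation.
X_reduced = X @ eigenmatrix                # 100 x k training samples after reduction
print(X_reduced.shape)
```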
4. The method according to claim 2 or 3, characterized in that generating the eigenmatrix according to the original sample matrices comprises:
dividing the original sample matrices evenly into a plurality of matrix groups;
calculating a sub-eigenmatrix for each matrix group; and
combining the plurality of sub-eigenmatrices in the order in which they were calculated to obtain the eigenmatrix.
5. The method according to claim 4, characterized in that dividing the original sample matrices evenly into a plurality of matrix groups comprises:
classifying the original sample matrices corresponding to the training samples according to the categories of the training samples;
calculating a number of matrix sets M according to a preset dimension value;
dividing the original sample matrices of each category evenly into M matrix sets; and
combining matrix sets of different categories in a permutation-and-combination manner to obtain the plurality of matrix groups, wherein each matrix group contains matrix sets of all categories.
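The grouping of claim 5 can be sketched as follows. This is a hypothetical reading of the claim: the categories, the value of M, and the use of a Cartesian product to realize the "permutation-and-combination" step are assumptions for illustration.

```python
import itertools
import numpy as np

# Toy per-category samples (two categories, 40 samples x 20 features each;
# shapes and category names are illustrative assumptions).
rng = np.random.default_rng(1)
by_category = {
    "spam":   rng.normal(size=(40, 20)),
    "normal": rng.normal(size=(40, 20)),
}

# Claim 5: M matrix sets per category, here assumed derived
# from a preset dimension value.
M = 4
sets_per_category = {
    cat: np.array_split(samples, M)       # divide each category evenly into M sets
    for cat, samples in by_category.items()
}

# "Permutation-and-combination": each matrix group takes one matrix set
# from every category, so every group contains matrix sets of all categories.
matrix_groups = [
    np.vstack(combo)
    for combo in itertools.product(*sets_per_category.values())
]
print(len(matrix_groups))   # M ** number_of_categories groups
```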
6. The method according to claim 2, characterized in that after performing dimensionality reduction on the original sample matrices according to the PCA algorithm to obtain the training samples after dimensionality reduction, the method further comprises:
after training the SVM model using the training samples after dimensionality reduction, multiplying the eigenmatrix by the training result vector obtained from the training to obtain an optimized training result vector;
determining a classification matching set according to the optimized training result vector, the classification matching set being a proper subset of a preset database, the preset database comprising the set of all elements that the training samples may involve; and
classifying the data to be predicted according to the classification matching set.
7. The method according to claim 1, characterized in that the SVM model is used to identify whether unknown mails are spam and/or normal mails; and
the training samples comprise known normal mails and known spam used for the training.
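The spam-filtering application of claims 1 and 7 can be sketched end to end. The patent names no library; scikit-learn's `PCA` and `SVC` are used here as stand-ins, and the toy feature matrices, labels, and component count are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Toy feature vectors for known spam (label 1) and known normal
# mail (label 0) -- the training samples of claim 7.
rng = np.random.default_rng(2)
spam   = rng.normal(loc=1.0,  size=(50, 20))
normal = rng.normal(loc=-1.0, size=(50, 20))
X = np.vstack([spam, normal])
y = np.array([1] * 50 + [0] * 50)

# Claim 1: PCA dimensionality reduction before training the SVM model.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

model = SVC(kernel="linear").fit(X_reduced, y)

# Claim 7: classify an unknown mail by projecting it with the same
# eigenmatrix and feeding it to the trained SVM model.
unknown = rng.normal(loc=1.0, size=(1, 20))
label = model.predict(pca.transform(unknown))[0]
print("spam" if label == 1 else "normal")
```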
8. A data processing device, characterized in that the device comprises:
an acquiring unit configured to obtain an original sample matrix corresponding to each training sample, the training samples being used to train a support vector machine (SVM) model so as to obtain an SVM model for classifying data to be predicted, wherein the training samples comprise training samples of at least two different categories; and
a dimensionality reduction unit configured to perform dimensionality reduction on the original sample matrices according to a principal component analysis (PCA) algorithm to obtain training samples after dimensionality reduction.
9. The device according to claim 8, characterized in that the dimensionality reduction unit comprises:
a generation module configured to generate an eigenmatrix according to the original sample matrices;
a calculation module configured to calculate the transposed matrix of the eigenmatrix; and
a multiplication module configured to multiply the transposed matrix of the eigenmatrix by the original sample matrices to obtain the training samples after dimensionality reduction.
10. The device according to claim 9, characterized in that the generation module is configured to:
calculate the average of all original sample matrices to obtain a central sample matrix;
calculate the transposed matrix of the central sample matrix, and multiply the central sample matrix by its transposed matrix to obtain a target sample matrix;
calculate a plurality of eigenvalues of the target sample matrix and the eigenvector corresponding to each eigenvalue;
select a predetermined number of eigenvectors in descending order of eigenvalue; and
arrange the selected eigenvectors in the order in which they were selected to obtain the eigenmatrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610715951.6A CN106446011B (en) | 2016-08-24 | 2016-08-24 | The method and device of data processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610715951.6A CN106446011B (en) | 2016-08-24 | 2016-08-24 | The method and device of data processing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106446011A true CN106446011A (en) | 2017-02-22 |
CN106446011B CN106446011B (en) | 2019-11-26 |
Family
ID=58182611
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610715951.6A Active CN106446011B (en) | 2016-08-24 | 2016-08-24 | The method and device of data processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106446011B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107194260A (en) * | 2017-04-20 | 2017-09-22 | 中国科学院软件研究所 | A kind of Linux Kernel association CVE intelligent Forecastings based on machine learning |
CN107392257A (en) * | 2017-08-03 | 2017-11-24 | 网易(杭州)网络有限公司 | Acquisition methods, device, storage medium, processor and the service end of the sequence of operation |
CN107729144A (en) * | 2017-09-30 | 2018-02-23 | 广东欧珀移动通信有限公司 | Application control method, apparatus, storage medium and electronic equipment |
CN113780339A (en) * | 2021-08-03 | 2021-12-10 | 阿里巴巴(中国)有限公司 | Model training, predicting and content understanding method and electronic equipment |
CN114547482A (en) * | 2022-03-03 | 2022-05-27 | 智慧足迹数据科技有限公司 | Service feature generation method and device, electronic equipment and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104142960A (en) * | 2013-05-10 | 2014-11-12 | 上海普华诚信信息技术有限公司 | Internet data analysis system |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104142960A (en) * | 2013-05-10 | 2014-11-12 | 上海普华诚信信息技术有限公司 | Internet data analysis system |
Non-Patent Citations (2)
Title |
---|
QIN Yuping et al.: "Research on spam detection based on C-SVM and KPCA", Computer Engineering and Applications (《计算机工程与应用》) * |
GAO Hongbin et al.: "Research on data stream dimensionality reduction based on kernel principal component analysis", Computer Engineering and Applications (《计算机工程与应用》) * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107194260A (en) * | 2017-04-20 | 2017-09-22 | 中国科学院软件研究所 | A kind of Linux Kernel association CVE intelligent Forecastings based on machine learning |
CN107392257A (en) * | 2017-08-03 | 2017-11-24 | 网易(杭州)网络有限公司 | Acquisition methods, device, storage medium, processor and the service end of the sequence of operation |
CN107392257B (en) * | 2017-08-03 | 2020-05-12 | 网易(杭州)网络有限公司 | Method and device for acquiring operation sequence, storage medium, processor and server |
CN107729144A (en) * | 2017-09-30 | 2018-02-23 | 广东欧珀移动通信有限公司 | Application control method, apparatus, storage medium and electronic equipment |
CN107729144B (en) * | 2017-09-30 | 2020-01-14 | Oppo广东移动通信有限公司 | Application control method and device, storage medium and electronic equipment |
CN113780339A (en) * | 2021-08-03 | 2021-12-10 | 阿里巴巴(中国)有限公司 | Model training, predicting and content understanding method and electronic equipment |
CN113780339B (en) * | 2021-08-03 | 2024-03-29 | 阿里巴巴(中国)有限公司 | Model training, predicting and content understanding method and electronic equipment |
CN114547482A (en) * | 2022-03-03 | 2022-05-27 | 智慧足迹数据科技有限公司 | Service feature generation method and device, electronic equipment and storage medium |
CN114547482B (en) * | 2022-03-03 | 2023-01-20 | 智慧足迹数据科技有限公司 | Service feature generation method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106446011B (en) | 2019-11-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106446011A (en) | Data processing method and device | |
CN108376220A (en) | A kind of malice sample program sorting technique and system based on deep learning | |
CN105955962B (en) | The calculation method and device of topic similarity | |
CN110232280B (en) | Software security vulnerability detection method based on tree structure convolutional neural network | |
CN106611052A (en) | Text label determination method and device | |
CN109948029A (en) | Based on the adaptive depth hashing image searching method of neural network | |
CN112732583B (en) | Software test data generation method based on clustering and multi-population genetic algorithm | |
CN108053030A (en) | A kind of transfer learning method and system of Opening field | |
CN104809069A (en) | Source node loophole detection method based on integrated neural network | |
CN115331732B (en) | Gene phenotype training and predicting method and device based on graph neural network | |
CN109086886A (en) | A kind of convolutional neural networks learning algorithm based on extreme learning machine | |
Gaucel et al. | Learning dynamical systems using standard symbolic regression | |
CN106598827A (en) | Method and device for extracting log data | |
CN115563610B (en) | Training method, recognition method and device for intrusion detection model | |
CN111353313A (en) | Emotion analysis model construction method based on evolutionary neural network architecture search | |
CN109598307A (en) | Data screening method, apparatus, server and storage medium | |
CN112232087A (en) | Transformer-based specific aspect emotion analysis method of multi-granularity attention model | |
CN109840413A (en) | A kind of detection method for phishing site and device | |
CN109067800A (en) | A kind of cross-platform association detection method of firmware loophole | |
CN105045913A (en) | Text classification method based on WordNet and latent semantic analysis | |
CN106649385A (en) | Data ranking method and device based on HBase database | |
Wakayama et al. | Distributed forests for MapReduce-based machine learning | |
Kim et al. | Tweaking deep neural networks | |
Dinu et al. | Authorship Identification of Romanian Texts with Controversial Paternity. | |
CN109472276A (en) | The construction method and device and mode identification method of pattern recognition model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |