CN106971192A

CN106971192A - Image data classification system based on the Universum-combined matrixized Ho-Kashyap algorithm

Info

Publication number: CN106971192A
Application number: CN201611023336.5A
Authority: CN
Inventors: 王喆; 李冬冬; 朱昱锦; 崇传禹; 高大启
Original assignee: East China University of Science and Technology
Current assignee: East China University of Science and Technology
Priority date: 2016-11-21
Filing date: 2016-11-21
Publication date: 2017-07-21

Abstract

The present invention provides an image data classification system based on the Universum-combined matrixized Ho-Kashyap algorithm. First, an In-Between generation strategy is used to generate a certain number of third-class sample points lying between the two classes of samples, i.e. Universum samples. The Universum sample points are then substituted into the regularization term R_uni. Next, the regularization term is introduced into the matrixized HK classification model, forming the complete Universum-combined matrixized HK model. Finally, the model is trained to obtain the optimal parameters for the current training dataset and to generate the optimal classification decision surface. In the test phase, each test sample point is substituted into the decision function for judgment, and its class label is output. Compared with traditional classification techniques, the invention introduces Universum samples to make the contrast between the two original classes more distinct, which further improves accuracy.

Description

Image data classification system based on the Universum-combined matrixized Ho-Kashyap algorithm

Technical field

The present invention relates to the technical field of pattern classification, and more particularly to a Universum-combined matrixized Ho-Kashyap algorithm and system for recognizing and processing image datasets.

Background art

Pattern recognition studies how computers can imitate or realize the recognition abilities of humans or other animals, so that the objects under study can be recognized automatically. In recent years, pattern recognition technology has been widely applied in many important fields, including artificial intelligence, machine learning, computer engineering, robotics, neurobiology, medicine, criminal investigation, archaeology, geological exploration, astronautics and weapons technology. One classical problem in pattern recognition is the processing of two-dimensional data, i.e. data represented as matrices. In practice, matrix-represented data are common in image recognition problems such as face recognition, fingerprint recognition and spectrum recognition.

When traditional pattern classification methods handle image problems, each image sample must first be converted into a vector, and the vectorized samples are then processed. Classical methods include the support vector machine (Support Vector Machine, SVM), principal component analysis (Principal Component Analysis, PCA) and the Fisher linear discriminant (Fisher Linear Discriminant, FLD). Processing vectorized images raises two main problems. First, once an image is converted into a vector, the dimension of the vector is relatively high, and many classical feature-extraction methods then run into the small sample size problem, i.e. the size of the dataset is much smaller than its dimension. Examples include locality preserving projection (LPP), FLD and PCA. Such algorithms rely on eigenvalue decomposition, and the mismatch between dimension and sample count means that the underlying system of linear equations can only be solved approximately. High-dimensional samples also increase computational complexity and consume more memory for parameters such as the weight vector. Second, once an image is converted into a vector, the spatial structure among the image elements is destroyed. Unlike the elements of a vector sample, the elements of an image sample are not independently defined attributes; each represents the pixel information of the whole sample at a specific position. Destroying the original two-dimensional structure of the image can therefore, in theory, affect classification accuracy.

To address the problems that traditional pattern recognition methods face on two-dimensional datasets, a number of dedicated methods have been designed. Among them, methods that process two-dimensional samples directly have achieved notable success. Typical examples are two-dimensional principal component analysis (2DPCA) and the two-dimensional Fisher linear discriminant (2DFLD), which extend traditional feature-processing methods to two dimensions. There are also methods that extend classical classifiers to two dimensions, such as the support tensor machine (Support Tensor Machine, STM).

At present, the method for both direction respectively has deficiency.First kind method is only in the characteristic processing stage to the direct place of data set Reason, main purpose be dimensionality reduction to avoid or alleviate small sample problem, but still entered in follow-up sorting phase using conventional method Row processing, although so part solves produced problem one after two dimensional sample vectorization described above, can not solve problem Two.Equations of The Second Kind method is often complicated, it is necessary to adjust quantity of parameters to obtain optimal value due to being mostly nonlinear method. And matrix computations amount is the cube of exponent number, this kind of method is related to a large amount of matrix computations when handling many nonlinear steps, because This time complexity is high.If can design simple for structure, parameter is less, and the side that directly can be classified to 2-D data Method, it will further improve disposal ability of the Pattern classification techniques on image problem.

Summary of the invention

Existing techniques are structurally complex, inefficient and insufficiently accurate, and cannot handle image problems that demand precision or real-time response or that lack prior knowledge. To address this, the invention provides a classification method based on the Universum-combined matrixized Ho-Kashyap algorithm. For a two-class problem, Universum samples between the classes are first generated with the classical In-Between technique. A two-dimensional (matrixized) Ho-Kashyap (HK) model is then designed. Next, a regularization term that characterizes the relation between the Universum samples and the original samples is designed and substituted into the model designed in the second step. Finally, the optimal parameters of the whole model are solved by gradient descent. The resulting decision boundary maintains classification accuracy on image datasets while improving efficiency in both model design and model computation.

The technical solution adopted by the invention to solve its technical problem is as follows. First, based on the description of the specific image problem, the back end reduces the dimensionality of and denoises the collected samples using the classical LPP, FLD or PCA methods. Second, the matrix-represented dataset is divided into a training dataset and a test dataset. In the training step, an In-Between generation strategy first generates a certain number of third-class sample points lying between the two classes of samples, i.e. Universum samples. The Universum sample points are then substituted into the regularization term R_uni. Next, the regularization term is introduced into the matrixized HK classification model, forming the complete Universum-combined matrixized HK model. Finally, the model is trained to obtain the optimal parameters for the current training dataset and to generate the optimal classification decision surface. Third, in the test phase, the current test sample point is substituted into the trained decision function for judgment. Finally, the determined class label is output.

The technical solution can be further refined. In the first step of the training module, the method for generating Universum samples is not limited to In-Between; any method that quickly generates third-class samples lying between the two classes may be used. Furthermore, because a vector is a special case of a matrix, the model can also handle vector datasets. If the introduced Universum samples are ignored and the weight vector on one side of the model is excluded from iterative optimization, the model degenerates to the traditional modified HK algorithm (Modified Ho-Kashyap Algorithm, MHKS). Like MHKS, this method is a linear classifier, so it can determine the classification decision surface faster than nonlinear methods, which improves efficiency.

The invention has the following advantages. Because it classifies image data directly, it overcomes the small sample size problem and improves efficiency while preserving the structural integrity of the image data, and therefore achieves higher accuracy. Introducing Universum samples makes the contrast between the two original classes more distinct, which further improves accuracy. Because the method is linear, training time is shortened. It can be proven that, under Rademacher complexity, the upper bound on the generalization risk of this method does not exceed that of the original MHKS method.

Brief description of the drawings

Fig. 1 shows the system framework of the invention as applied to image pattern classification problems;

Fig. 2 compares the experimental results of the algorithm of the invention with those of other algorithms.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawings and embodiments. The method of the invention is divided into three modules.

Part I：Data acquisition

This module comprises two steps: first, the data are digitized; second, Universum samples are generated.

1) Digitize the real-world image problem: generate a matrix-represented dataset that subsequent modules can process. The matrix data generated after collection may be further reduced in dimension using classical methods. A matrix sample is denoted A; each element of the matrix corresponds to a converted pixel value of the sample, so the sample dimension is d = m × n.

2) Generate Universum samples with the In-Between method: Universum samples are defined as samples that lie in the same domain as the problem dataset but belong to none of its classes. For example, in a character classification problem where a two-class model separates the digits "5" and "8", the remaining digits "0", "1", "2", "3", "4", "6", "7" and "9" can be regarded as Universum samples. In other problems, where no ready-made Universum samples exist, they must be generated by some method. Here a typical generation algorithm, the In-Between method, is used. Its idea is to first identify the samples of the two classes that lie close to the decision boundary, connect boundary samples of different classes with line segments, and then generate a new sample at a random position along each segment. The generated samples are the Universum samples. In this method, to simplify computation, Universum samples are always generated at the midpoint of the segment joining the two samples.
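The midpoint variant of In-Between generation can be sketched in Python as follows. This is a minimal illustration rather than the procedure of the invention: the text does not say how samples "close to the decision boundary" are chosen, so the sketch assumes they are the cross-class pairs with the smallest Frobenius distance. The function name `in_between_universum` is hypothetical.

```python
import numpy as np

def in_between_universum(class_a, class_b, n_universum):
    """Generate Universum samples at midpoints of close cross-class pairs.

    class_a, class_b: arrays of shape (Na, m, n) and (Nb, m, n).
    Assumption: boundary samples are approximated by the cross-class
    pairs with the smallest Frobenius distance.
    """
    a_flat = class_a.reshape(len(class_a), -1)
    b_flat = class_b.reshape(len(class_b), -1)
    # Squared distance between every sample of A and every sample of B.
    d2 = ((a_flat[:, None, :] - b_flat[None, :, :]) ** 2).sum(axis=2)
    # Indices (i, j) of the n_universum closest cross-class pairs.
    order = np.argsort(d2, axis=None)[:n_universum]
    ia, ib = np.unravel_index(order, d2.shape)
    # Midpoint of the segment joining each pair.
    return 0.5 * (class_a[ia] + class_b[ib])
```

Because the midpoint is taken on the matrices themselves, the generated samples keep the m × n shape of the original data and can be fed to the matrixized model without vectorization.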

Part II：Train classification models

In this module, the collected dataset is substituted into the core algorithm of the invention for training. The main steps are as follows:

1) Design the regularization term R_uni: the Universum samples are substituted into the initial decision function as third-class samples; the formula for the regularization term is as follows:

2) Matrixize the traditional MHKS to generate the new model MatMHKS: the traditional MHKS model is based on the minimum mean squared error criterion, and MHKS is a modified HK algorithm. The objective function of the HK algorithm is as follows:

J_s(w, b) = ||Yw - b||²

where Y is the matrix formed from the vectorized samples, w is the weight vector, and b is a manually set non-negative bias correction vector. The goal of HK is to make the error Yw - b as close to 0 as possible. By enlarging the margin, MHKS turns this goal into the following inequality:

Yw ≥ 1_{N×1}

This yields the new objective function:
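For orientation, the MHKS objective as it is commonly written in the literature is sketched below. This is an assumption taken from the published MHKS formulation, not a reproduction of the formula of the invention. The symbol \tilde{w} (the weight vector without its bias component) is notation introduced here, and c is assumed to be the regularization weight.

```latex
\min_{w,\; b \ge 0} \; J(w, b) = (Yw - 1_{N\times 1} - b)^{T}(Yw - 1_{N\times 1} - b) + c\,\tilde{w}^{T}\tilde{w}
```

Under this form, driving the first term to zero gives Yw - 1_{N×1} = b ≥ 0, which agrees with the inequality Yw ≥ 1_{N×1} stated above.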

Matrixization builds on MHKS to process matrices directly. MatMHKS first splits the original weight vector w into a vector u acting on the rows of the matrix and a vector v acting on its columns, so that the basic decision function becomes:

The objective function of MatMHKS then becomes:

where v = [ṽ^T, v_0]^T, Y = [y_1, y_2, ..., y_N]^T and y_i = ψ_i[u^T A_i, 1]^T. For simplicity, S_1 and S_2 are taken to be two identity matrices.
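As an illustration of these definitions, the following Python sketch builds Y row by row from y_i = ψ_i[u^T A_i, 1]^T. It rests on two assumptions that the text does not state explicitly: the decision function has the bilinear form u^T A ṽ + v_0, where ṽ is the column weight vector without the bias v_0, and ψ_i is a class label in {+1, -1}. The function names are hypothetical.

```python
import numpy as np

def decision_value(A, u, v_tilde, v0):
    """Bilinear decision value u^T A v_tilde + v0 (assumed form)."""
    return float(u @ A @ v_tilde + v0)

def build_Y(samples, labels, u):
    """Stack rows y_i^T = psi_i * [u^T A_i, 1].

    samples: (N, m, n) matrix samples; labels: (N,) with entries +1 / -1
    (assumed label encoding); u: (m,) row weight vector.
    """
    N = len(samples)
    # Row k of proj is u^T A_k, a vector of length n.
    proj = np.einsum("i,kij->kj", u, samples)
    return labels[:, None] * np.hstack([proj, np.ones((N, 1))])
```

Under these assumptions, each entry of the product Yv with v = [ṽ^T, v_0]^T equals ψ_i times the decision value of sample A_i, so the MHKS-style requirement Yv ≥ 1_{N×1} asks every training sample to lie on the correct side of the decision surface with a margin.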

3) Introduce the regularization term R_uni into MatMHKS to form UMatMHKS, the matrixized HK classification model combined with the Universum method: HK, MHKS and MatMHKS all follow the same design framework, namely structural risk minimization:

min J = R_emp + c·R_reg

where R_emp is the traditional empirical risk, i.e. the sum of squared errors between observed and theoretical values; R_reg is the generalization risk, which extends the empirical risk so that the model applies to different datasets; and c is a penalty factor. Introducing the Universum regularization term R_uni designed in the previous step into this conventional framework gives the complete framework of the new method:

4) Generate the objective function under the new framework: the new model introduces Universum samples into the matrixized HK method,

and substituting the specific parameters yields the final objective function:

5) Solve for the optimal parameters by gradient descent: for the UMatMHKS objective function, first take the derivative with respect to each target parameter:

When the derivative with respect to a parameter equals 0, the parameter attains an extremum; the formulas for the extremum of each parameter are as follows:

Parameter b is solved differently from u and v. It is computed from the empirical risk term of the previous step, expressed by the following error equation, which also serves as the criterion for the stopping condition:
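The alternating procedure can be sketched in Python as below. This is one possible reading, not the formulas of the invention, and it rests on several assumptions. R_uni is taken to be the sum of squared decision values of the Universum samples, weighted by a separate hypothetical penalty `c_u`. The decision function is assumed bilinear, u^T A ṽ + v_0. S_1 and S_2 are identity matrices, with the bias v_0 left unpenalized. u and v are obtained in closed form by setting the corresponding derivative to zero, and b follows the classical HK update b ← b + ρ(e + |e|) with e = Yv - 1 - b.

```python
import numpy as np

def train_umatmhks_sketch(A, psi, A_uni, c=1.0, c_u=1.0,
                          rho=0.99, xi=1e-4, max_iter=1000, seed=0):
    """Alternating-update sketch.

    A: (N, m, n) training matrices; psi: (N,) labels in {+1, -1};
    A_uni: (M, m, n) Universum matrices. Returns (u, v_tilde, v0).
    """
    rng = np.random.default_rng(seed)
    N, m, n = A.shape
    u = rng.uniform(0.0, 1.0, size=m)      # random initial u in (0, 1)
    b = np.full(N, 1e-6)                   # small non-negative initial b
    S2 = np.eye(n + 1)
    S2[-1, -1] = 0.0                       # assumption: bias v0 is not penalized
    ones_N = np.ones((N, 1))
    ones_M = np.ones((len(A_uni), 1))
    for _ in range(max_iter):
        # Step 1: fix u, solve the zero-gradient condition for v = [v_tilde, v0].
        Y = psi[:, None] * np.hstack([np.einsum("i,kij->kj", u, A), ones_N])
        Z = np.hstack([np.einsum("i,kij->kj", u, A_uni), ones_M])
        v = np.linalg.solve(Y.T @ Y + c * S2 + c_u * Z.T @ Z, Y.T @ (1.0 + b))
        v_tilde, v0 = v[:-1], v[-1]
        # Step 2: fix v, solve the zero-gradient condition for u.
        P = A @ v_tilde                    # (N, m): row i is A_i v_tilde
        Q = A_uni @ v_tilde                # (M, m): same for Universum samples
        lhs = P.T @ P + c * np.eye(m) + c_u * Q.T @ Q
        rhs = P.T @ (psi * (1.0 + b) - v0) - c_u * v0 * Q.sum(axis=0)
        u = np.linalg.solve(lhs, rhs)
        # Step 3: HK-style update of b from the error vector, using the new u.
        Y = psi[:, None] * np.hstack([np.einsum("i,kij->kj", u, A), ones_N])
        e = Y @ v - 1.0 - b
        b_new = b + rho * (e + np.abs(e))  # b can only grow, so it stays >= 0
        converged = np.linalg.norm(b_new - b) <= xi
        b = b_new
        if converged:                      # stop once b no longer changes
            break
    return u, v_tilde, v0
```

Setting `c_u=0` removes the Universum term, in which case the sketch reduces to a MatMHKS-style alternating update under the same assumptions.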

Part III：Test unknown data

In this module, unknown data whose class label is to be determined is substituted into the trained model, which makes the decision. Let the unknown sample be A_i. The decision function is:

According to the decision function, the sample can be classified whenever the result is nonzero. A result of 0 means that the test sample is equally likely to belong to either class, and the classification model cannot decide.

Experimental design

1) Selection of experimental datasets: four classical image datasets are used in the experiments. The number of classes, sample dimension and size (total number of samples) of each dataset are listed in the following table.

All datasets are processed with ten rounds of Monte Carlo cross-validation: each class of a dataset is split into two parts and the sample order is shuffled; one part serves as test data and the other as training data, and this is repeated ten times. Samples are drawn with replacement. In the experiments, the ratio between the two parts is varied to observe how each classification model performs in practice, for example how accurate the different models are when the number of training samples is much smaller than the number of test samples.
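The ten-round split can be sketched in Python as follows. The sketch interprets "with replacement" as every round drawing afresh from the full dataset, so a sample may appear in the training part of several rounds; this interpretation, the function name `monte_carlo_splits` and the `train_fraction` parameter are assumptions made for illustration.

```python
import numpy as np

def monte_carlo_splits(labels, train_fraction, rounds=10, seed=0):
    """Yield (train_idx, test_idx) for repeated per-class random splits."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    for _ in range(rounds):
        train, test = [], []
        for cls in np.unique(labels):
            # Shuffle the indices of this class, then cut them in two parts.
            idx = rng.permutation(np.flatnonzero(labels == cls))
            k = int(round(train_fraction * len(idx)))
            train.extend(idx[:k])
            test.extend(idx[k:])
        yield np.array(train), np.array(test)
```

Varying `train_fraction` (for example 0.1, 0.3, 0.5) reproduces the kind of comparison described above, in which the training part may be much smaller than the test part.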

2) Algorithms compared: the core algorithm of the invention is UMatMHKS. In addition, MatMHKS, MHKS, SVM (Linear) and SVM (Non-Linear) are selected as baseline algorithms. SVM (Non-Linear) uses the radial basis function (RBF) kernel. The parameters are set as follows:

For UMatMHKS, MatMHKS and MHKS, the initial value of vector b is set to 10^-6, the stopping parameter ξ to 10^-4 and the learning rate p to 0.99. To prevent non-convergence, the maximum number of iterations is set to 1000. The penalty parameters c controlling the R_reg and R_uni terms are both selected from the set {10^-2, 10^-1, 10^0, 10^1, 10^2}. In addition, the initial values of the UMatMHKS weight vector u are set to random numbers greater than 0 and less than 1.

For SVM, the slack factor C is selected from {10^-2, 10^-1, 10^0, 10^1, 10^2}. For the nonlinear SVM, the kernel is computed as follows, with the kernel parameter σ set to the average pairwise distance between samples:

K(x_i, x_j) = exp(-||x_i - x_j||² / σ)
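A minimal Python sketch of this kernel follows. It assumes that σ is the mean Euclidean distance over all distinct sample pairs; the text does not specify whether the distance is squared or which samples enter the average. The function name `rbf_kernel_mean_sigma` is hypothetical.

```python
import numpy as np

def rbf_kernel_mean_sigma(X):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / sigma).

    X: (N, d) vectorized samples. sigma is taken as the mean Euclidean
    distance over distinct pairs (an assumption). Returns (K, sigma).
    """
    sq = (X ** 2).sum(axis=1)
    # Squared pairwise distances, clipped at 0 to absorb rounding error.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    iu = np.triu_indices(len(X), k=1)      # each distinct pair once
    sigma = np.sqrt(d2[iu]).mean()
    return np.exp(-d2 / sigma), sigma
```

Tying σ to the data in this way removes one hyperparameter from the search, leaving only the slack factor C to be selected for the nonlinear SVM.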

3) Performance metric: all experiments use classification accuracy (Classification Accuracy, Acc) to record the classification results of the different methods on each dataset. Each reported result is the one obtained with the optimal parameter configuration of the corresponding algorithm on that dataset, i.e. the best result. Acc ranges from 0 to 100; the higher the value, the better the algorithm classifies the current dataset.

The results of all models on each image dataset are shown in Fig. 2. The four plots show the classification accuracy of the compared algorithms on the four datasets for different proportions of training samples. On all datasets, most models become more accurate as the number of training samples increases. In particular, UMatMHKS achieves the best result among the compared models on all four image datasets.

Claims

1. An image data classification system based on the Universum-combined matrixized Ho-Kashyap algorithm, characterized in that the specific steps are:

1) Sample collection: based on the description of the specific image problem, the back end converts the collected samples into a matrix model that subsequent algorithms can process;

2) Generating Universum samples in training: an In-Between generation strategy is used to generate a certain number of third-class sample points lying between the two classes of samples, i.e. Universum samples;

3) Obtaining the Universum regularization term R_uni in training;

4) Obtaining the matrix model MatMHKS in training;

5) Introducing the regularization term R_uni into the matrixized model in training to obtain the final model UMatMHKS;

6) Finding the optimal parameters of the UMatMHKS objective function by gradient descent;

7) In the test phase, substituting the test sample into the decision function generated by the UMatMHKS model and classifying according to the sign of the result.

2. The training to obtain the Universum regularization term R_uni according to claim 1, characterized in that: a processing formula for the Universum samples is constructed independently and is introduced as a term into the original matrix model.

3. The training that introduces the regularization term R_uni into the matrixized model to obtain the final model UMatMHKS according to claim 1, characterized in that: R_uni is introduced into the traditional structural risk framework so that the solution space is further constrained, with the result that the upper bound on the generalization risk of UMatMHKS is not higher than that of the MatMHKS and MHKS models.

4. The use of gradient descent to find the optimal parameters of the UMatMHKS objective function according to claim 1, characterized in that: the optimal values of the two weight vectors u and v are obtained by alternating iteration, and the stopping decision is made using the error rate for the bias vector b.