CN101710392B

CN101710392B - Important information acquiring method based on variable boundary support vector machine

Info

Publication number: CN101710392B
Application number: CN2009102194509A
Authority: CN
Inventors: 张莉; 郑小皇; 王婷; 冯骁; 焦李成
Original assignee: Xidian University
Current assignee: Shaanxi Beidou Kang Xin Information Polytron Technologies Inc
Priority date: 2009-12-11
Filing date: 2009-12-11
Publication date: 2011-09-21
Anticipated expiration: 2029-12-11
Also published as: CN101710392A

Abstract

The invention discloses a method for acquiring important information based on a variable-boundary support vector machine, which mainly overcomes the defect in the prior art of ignoring the differences in importance between samples. The method comprises the following steps: searching for the required information on an information object to be assessed with a smart search engine and preprocessing the information to obtain an original training set; constructing a new training set from the original training set and introducing a variable boundary factor, where the factor is the absolute value of the difference between the importance labels of two samples in the original training set; inputting the new training set, using the factor as the boundary in each constraint of the support vector machine, training an information assessment model and obtaining an information assessment function; and inputting the sample feature vectors of the information to be assessed and acquiring the important information according to the magnitude of their function values. The invention achieves a high average accuracy in acquiring important information and can be used for grading information importance and assessing product quality.

Description

Important information acquiring method based on the variable boundary support vector machine

Technical field

The invention belongs to the technical field of information acquisition, and in particular relates to a method for acquiring important information. The method can be applied to grading the importance of information and to evaluating product quality.

Background technology

With the development of science and technology, the Internet now provides a massive volume of information resources, so being able to obtain the important information one wants has become increasingly important. Among information acquisition methods, information retrieval and search engines are an important approach. The core problem of a search engine is how to supply information according to the user's needs and how to grade the information that is retrieved.

In an information acquisition method, the information need is determined first, that is, a query of interest is given. Second, information is collected for the query. The information is then evaluated. The information evaluation system is a vital step in information acquisition: it grades each returned sample and assigns it a corresponding evaluation score, which reflects the importance of the information the sample contains. The quality of the information evaluation system determines whether the desired information can ultimately be obtained. The evaluation system is obtained by training on a training sample set with machine learning methods.

In information grading, the pairwise approach is a commonly used method. Among the returned text samples, the importance of the information is determined by comparing samples two at a time, which is a supervised method. Two samples form a sample pair; each pair is treated as a single pair sample and given a label, so the problem can be solved with supervised classification methods.

In 1998, Google founders Brin and Page proposed the PageRank method for grading web page information. However, it processes only a single feature and cannot fully reflect the importance of information. In information grading, the support vector machine is an important evaluation method: it can handle multiple features and better reflects the full content of the information. In 2000, Herbrich applied support vector machine theory to ordinal regression and proposed the ranking support vector machine for the first time, training on pair samples to obtain an information evaluation system for assessing the importance of information. In 2002, Joachims derived, from a different angle, a support vector machine evaluation model trained on pair samples and applied it to information scoring. Although the two use different sampling models, both study information grading with classification methods; that is, both obtain the information evaluation model by training a classifier on pair samples.

Although both of these support vector machine evaluation methods can handle multidimensional features, neither considers, during training, the differences between the importance levels of the information. When the training set has more than two importance levels, the information carried by different pair samples differs. Suppose the importance labels of the training set are Y = {1, 2, 3, 4, 5}. The pair sample formed by a sample of importance 5 and a sample of importance 1, and the pair sample formed by a sample of importance 3 and a sample of importance 2, both receive the label 1 and are treated equally, even though their importance differences are 4 and 1 respectively. In the above methods, the boundary of every constraint in the support vector machine optimization is a constant, so the differences between pair samples cannot be expressed. Important information is therefore lost and the grading results are inaccurate.

Summary of the invention

The object of the invention is to overcome the deficiencies of the above methods by providing a support vector machine information acquisition method based on a variable boundary, which introduces the difference information between pair samples into the support vector machine optimization, makes training on the samples more effective, ensures that important information is obtained, and improves the accuracy of the grading results.

To achieve the above object, the invention comprises the following steps:

Requirement information collection step: for the information object to be evaluated, use a smart search engine to extract, according to the query, the information to be collected into a text collection;

Information preprocessing step: using t-dimensional original features (t > 44) such as the term frequency and inverse document frequency of the text collection, perform feature extraction on the text collection, convert the extracted features into 45-dimensional numerical feature vectors, and reduce the dimensionality of these vectors to obtain the sample set (x_i, y_i), i = 1, ..., n, where x_1, ..., x_n are two-dimensional sample feature vectors, y_i is the importance label of a sample, and n is the number of samples;

The training step of information evaluation model:

Take r samples from the sample set obtained in the previous step as the original training set (r < n). In the original training set ((x_1, y_1), ..., (x_r, y_r)), any two two-dimensional sample feature vectors form a pair sample (x_i^{(1)}, x_i^{(2)}). If the importance label of the first sample feature vector x_i^{(1)} is greater than the importance label of the second sample feature vector x_i^{(2)}, the pair sample (x_i^{(1)}, x_i^{(2)}) is labeled z_i = 1; otherwise z_i = -1. The new training set ((x_i^{(1)}, x_i^{(2)}), z_i, d_i), i = 1, 2, ..., m, is then constructed, where m = O(n²) is the number of samples in the new training set and d_i, the absolute value of the difference between the importance labels of the two samples in the pair, is the variable boundary factor, which expresses the difference in importance within the pair sample;

Train on the new training set with the variable-boundary support vector machine information evaluation method to obtain the information evaluation function f(x) = wx, where w is the weight parameter obtained by training and x is the input two-dimensional sample feature vector;

Important information acquisition step: input the sample feature vectors of the information to be evaluated into the information evaluation function f(x) = wx, sort the samples in descending order of function value, and take the top-ranked samples as the important information to be acquired.

Because the invention introduces the variable boundary factor in the training step of the information evaluation model, the difference information between pair samples enters the support vector machine optimization and the difference in importance of each pair sample is expressed. Training on the samples becomes more effective and the accuracy of the grading results improves, which ensures the average accuracy of acquiring important information.

Description of drawings

Fig. 1 is the implementation flowchart of the invention;

Fig. 2 is the flowchart of the training process of the information evaluation model of the invention.

Embodiment

With reference to Fig. 1, the invention is implemented in the following steps:

Step 1, for the information object to be evaluated, use a smart search engine to extract, according to the query, the information to be collected into a text collection;

Step 2, using t-dimensional original features (t > 44) such as the term frequency and inverse document frequency of the text collection, perform feature extraction on the text collection, convert the extracted features into 45-dimensional numerical feature vectors, and reduce the dimensionality of these vectors to obtain the sample set (x_i, y_i), i = 1, ..., n, where n is the number of samples, x_1, ..., x_n are two-dimensional sample feature vectors, and y_i is the importance label of a sample, y_i ∈ {2, 1, 0}: '2' means the information contained in the sample is most important, '1' means it is partially important, and '0' means it is not important at all;

Step 3, construct the new training set.

Take r samples from the sample set obtained in step 2 as the original training set (r < n). In the original training set ((x_1, y_1), ..., (x_r, y_r)), any two two-dimensional sample feature vectors form a pair sample (x_i^{(1)}, x_i^{(2)}). If the importance label of the first sample feature vector x_i^{(1)} is greater than the importance label of the second sample feature vector x_i^{(2)}, the pair sample (x_i^{(1)}, x_i^{(2)}) is labeled z_i = 1; otherwise z_i = -1. The new training set ((x_i^{(1)}, x_i^{(2)}), z_i, d_i), i = 1, 2, ..., m, is then constructed, where m = O(n²) is the number of samples in the new training set and d_i, the absolute value of the difference between the importance labels of the two samples in the pair, is the variable boundary factor, which expresses the difference in importance within the pair sample.
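The construction in step 3 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `build_pairs`, the `skip_ties` option, and the three toy samples are assumptions introduced here. By default, pairs with equal importance labels are kept and, following the "otherwise" branch of the text, receive z = -1 with d = 0.

```python
from itertools import combinations

def build_pairs(X, y, skip_ties=False):
    """Build pair samples (x1, x2, z, d) from feature vectors X and labels y.

    z is +1 when the first sample is more important than the second, else -1.
    d is the variable boundary factor |y1 - y2|.
    skip_ties is a hypothetical option (not in the source) that drops
    pairs whose two importance labels are equal.
    """
    pairs = []
    for i, j in combinations(range(len(X)), 2):
        if skip_ties and y[i] == y[j]:
            continue
        z = 1 if y[i] > y[j] else -1
        d = abs(y[i] - y[j])
        pairs.append((X[i], X[j], z, d))
    return pairs

# Toy two-dimensional samples with importance labels in {2, 1, 0} (made-up values).
X = [(0.9, 0.1), (0.5, 0.4), (0.1, 0.8)]
y = [2, 1, 0]
pairs = build_pairs(X, y)
for x1, x2, z, d in pairs:
    print(x1, x2, z, d)
```

With r training samples this produces r(r-1)/2 pair samples, which matches the quadratic growth of the new training set stated above.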

Step 4, train the information evaluation model.

With reference to Fig. 2, train on the new training set with the variable-boundary support vector machine information evaluation method as follows:

(4a) Input the training sample set ((x_i^{(1)}, x_i^{(2)}), z_i, d_i), i = 1, 2, ..., m;

(4b) According to support vector machine theory, compute the weight parameter w from the input training set by the following formula:

w = Σ_{i = 1}^{m} d_{i} z_{i} α_{i} (x_{i}^{(1)} - x_{i}^{(2)}),

In the formula, z_i is the label of the i-th pair sample, d_i is the variable boundary factor, and α_i is the unknown Lagrange multiplier, 0 ≤ α_i ≤ C, which is found by solving the following quadratic programming problem:

Σ_{i = 1}^{m} d_{i} α_{i} - \frac{1}{2} Σ_{i = 1}^{m} Σ_{j = 1}^{m} α_{i} α_{j} z_{i} z_{j} < x_{i}^{(1)} - x_{i}^{(2)}, x_{j}^{(1)} - x_{j}^{(2)} >

where x_i^{(1)} is the first sample feature vector of the i-th pair sample, x_i^{(2)} is the second sample feature vector of the i-th pair sample, x_j^{(1)} is the first sample feature vector of the j-th pair sample, x_j^{(2)} is the second sample feature vector of the j-th pair sample, and z_j is the label of the j-th pair sample.
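As an illustration of (4a) and (4b), the following Python sketch solves the quadratic program numerically and then forms w with the formula given above. Several points are assumptions made for this sketch and are not stated in the patent: the quadratic program is treated as a maximization over 0 ≤ α_i ≤ C, projected gradient ascent is used as the solver, and the names `train_variable_boundary`, `delta`, and `steps` are hypothetical.

```python
import numpy as np

def train_variable_boundary(delta, z, d, C=1.0, steps=5000):
    """Solve the dual for alpha, then compute the weight vector w.

    delta : (m, dim) array, row i is x_i^(1) - x_i^(2)
    z     : (m,) pair labels in {+1, -1}
    d     : (m,) variable boundary factors
    """
    delta = np.asarray(delta, dtype=float)
    z = np.asarray(z, dtype=float)
    d = np.asarray(d, dtype=float)

    # Q[i, j] = z_i z_j <x_i^(1) - x_i^(2), x_j^(1) - x_j^(2)>
    Q = (z[:, None] * z[None, :]) * (delta @ delta.T)

    # Step size 1/L, where L is the largest eigenvalue of Q.
    L = max(np.linalg.eigvalsh(Q)[-1], 1e-12)

    alpha = np.zeros(len(d))
    for _ in range(steps):
        grad = d - Q @ alpha                            # gradient of the dual objective
        alpha = np.clip(alpha + grad / L, 0.0, C)       # project onto the box [0, C]

    # w = sum_i d_i z_i alpha_i (x_i^(1) - x_i^(2)), as in the formula above
    w = (d * z * alpha) @ delta
    return w, alpha

# Tiny made-up example: two pair differences along the coordinate axes.
w, alpha = train_variable_boundary([[1.0, 0.0], [0.0, 1.0]], z=[1, 1], d=[2, 1], C=10.0)
print(alpha, w)
```

Projected gradient ascent is chosen only because it fits in a few lines; any quadratic programming solver that handles box constraints could be substituted.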

Step 5, input the weight parameter w obtained from the training set and the sample feature vectors x of the information to be evaluated into the information evaluation function f(x) = wx, and sort the samples to be evaluated in descending order of function value. The samples then form an ordered list, and the samples at the front of the list are taken as the important information to be acquired.
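Step 5 amounts to scoring and sorting, which might look like the following in Python. The function name `rank_samples`, the `top_k` parameter, and the numeric values are illustrative assumptions.

```python
import numpy as np

def rank_samples(w, X, top_k=None):
    """Score each sample with f(x) = w.x and return indices in descending score order."""
    scores = np.asarray(X, dtype=float) @ np.asarray(w, dtype=float)
    order = np.argsort(-scores, kind="stable")  # largest score first
    return order[:top_k], scores

# Made-up weight vector and two-dimensional samples to be evaluated.
w = [4.0, 1.0]
X = [[0.1, 0.8], [0.9, 0.1], [0.5, 0.4]]
order, scores = rank_samples(w, X)
print(order, scores)
```

Passing `top_k` keeps only the front of the ordered list, which corresponds to the samples treated as important information.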

The effect of the invention is further illustrated by the following simulation experiments:

The invention is tested on the OHSUMED data set and compared with the existing ranksvm method.

The OHSUMED data set is derived from the U.S. medical information database MEDLINE. It contains 106 groups of medical information samples; the number of samples differs from group to group, and each sample has 45 original feature dimensions. The importance label is y ∈ {2, 1, 0}: '2' means the information contained in the sample is most important, '1' means it is partially important, and '0' means it is not important at all.

For each group of samples, PCA is applied to reduce the original 45-dimensional features, giving the sample set (x_i, y_i), i = 1, ..., n, where x_1, ..., x_n are two-dimensional sample feature vectors and n is the number of samples in the group.

PCA (principal component analysis) is a projection method that seeks the representation of the original data that is best in the mean-square sense. It reduces the dimensionality of the feature space by extracting the directions along which the data cloud is most widely scattered.
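A minimal sketch of this reduction from 45 dimensions to 2, assuming a plain SVD-based PCA in NumPy. The function name `pca_reduce` and the random placeholder data are assumptions; the experiments themselves were run in Matlab, and the exact PCA variant used there is not stated.

```python
import numpy as np

def pca_reduce(F, k=2):
    """Project the rows of F onto the k directions of largest variance."""
    F = np.asarray(F, dtype=float)
    centered = F - F.mean(axis=0)                 # remove the per-feature mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T                    # coordinates along the top-k components

# Placeholder data: 20 samples with 45 features each (random, for illustration only).
F = np.random.default_rng(0).normal(size=(20, 45))
X2 = pca_reduce(F)
print(X2.shape)
```

The rows of `vt` are ordered by decreasing singular value, so the first output column carries the largest spread of the data.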

We adopt the most common evaluation criterion, Mean Average Precision (MAP), referred to here as the average accuracy. It measures the average accuracy of acquiring important information.

MAP can only evaluate data sets with two kinds of labels. Therefore, when computing the MAP value, samples labeled '2' and samples labeled '1' in the data set are both relabeled as '1', and the remaining samples are left unchanged. In the i-th group of experiments, the average accuracy is computed as follows:

AP_{i} = \frac{Σ_{j = 1}^{N} (P(j) * pos(j))}{h},

P(j) = \frac{h_{j}}{j}

In the output ordered list, pos(j) = 1 when the sample at position j is labeled '1'; otherwise pos(j) = 0.

h is the number of samples labeled '1' in the ordered list, h_j is the number of samples labeled '1' among the first j samples of the ordered list, and N is the number of samples in the ordered list.
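The computation of AP_i can be sketched in Python as below. The helper names `binarize` and `average_precision`, the example list, and the choice to return 0.0 when the list contains no sample labeled '1' (a case the formula leaves undefined) are assumptions of this sketch.

```python
def binarize(labels):
    """Map importance labels '2' and '1' to 1 and leave '0' as 0."""
    return [1 if y >= 1 else 0 for y in labels]

def average_precision(pos):
    """AP = (sum over j of P(j) * pos(j)) / h, with P(j) = h_j / j.

    pos is the binary label sequence of the ordered list, best-ranked first.
    """
    h = sum(pos)
    if h == 0:
        return 0.0  # assumption: the formula is undefined when h = 0
    hits = 0        # h_j, the count of '1' labels among the first j samples
    total = 0.0
    for j, p in enumerate(pos, start=1):
        if p:
            hits += 1
            total += hits / j
    return total / h

# Made-up ordered list of original labels, best-ranked sample first.
ranked_labels = [2, 0, 1, 0]
ap = average_precision(binarize(ranked_labels))
print(ap)
```

Here the relevant samples sit at positions 1 and 3, so the result is (1/1 + 2/3) / 2. MAP is then the mean of such AP values over the experiments being averaged.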

1. Simulation conditions and content

The experiment selects 8 of the 106 groups of the OHSUMED data set and runs 8 groups of experiments, all in Matlab 7.0.1. The 8 groups are groups 1, 5, 6, 7, 9, 10, 11 and 13 of the OHSUMED data set, with 130, 56, 153, 54, 139, 34, 95 and 95 samples respectively. In each group of experiments, the samples of the group are divided into 4 disjoint parts of n/4 samples each. Each group undergoes 4 experiments; in each experiment, three parts serve as the training set and one part as the test set.
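The four-part protocol can be sketched as follows. How the samples were assigned to parts is not stated, so the random permutation, the seed, and the function name `four_fold_splits` are assumptions; when n is not divisible by 4, the sketch lets part sizes differ by one.

```python
import numpy as np

def four_fold_splits(n, seed=0):
    """Yield (train_indices, test_indices) for 4 disjoint parts of n samples."""
    idx = np.random.default_rng(seed).permutation(n)
    parts = np.array_split(idx, 4)  # 4 disjoint parts, sizes differ by at most 1
    for k in range(4):
        train = np.concatenate([parts[m] for m in range(4) if m != k])
        yield train, parts[k]

# Group 1 of the experiments has 130 samples.
splits = list(four_fold_splits(130))
for train, test in splits:
    print(len(train), len(test))
```

Each of the four experiments trains on three parts and tests on the remaining one, and the four resulting average accuracy values are then averaged.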

2. Simulation results

Each group of samples undergoes 4 experiments; the average accuracy obtained in each experiment is recorded and the 4 values are averaged. The results are shown in Table 1. C denotes the trade-off coefficient of the support vector machine and is selected from {1, 10, 100, 1000} in the experiments.

Table 1. Comparison of average accuracy

As the simulation results in Table 1 show, in all 8 groups of data the average accuracy with which the method of the invention acquires important information is higher than that of the existing ranksvm method.

Claims

1. A method for acquiring important information based on a variable-boundary support vector machine, comprising:

An information evaluation model training step:

If the importance label of the first sample feature vector x_i^{(1)} is greater than the importance label of the second sample feature vector x_i^{(2)}, the pair sample (x_i^{(1)}, x_i^{(2)}) is labeled z_i = 1, otherwise z_i = -1, giving the new training set with i = 1, 2, ..., m, where m = O(n²) is the number of samples in the new training set;

Train on the new training set with the following variable-boundary support vector machine information evaluation method to obtain the information evaluation function f(x) = wx, where w is the weight parameter obtained by training and x is the input two-dimensional sample feature vector:

First, input the training sample set, i = 1, 2, ..., m;

Then, according to support vector machine theory, compute the weight parameter w from the input training set by the following formula:

w = Σ_{i = 1}^{m} d_{i} z_{i} α_{i} (x_{i}^{(1)} - x_{i}^{(2)}),

Σ_{i = 1}^{m} d_{i} α_{i} - \frac{1}{2} Σ_{i = 1}^{m} Σ_{j = 1}^{m} α_{i} α_{j} z_{i} z_{j} < x_{i}^{(1)} - x_{i}^{(2)}, x_{j}^{(1)} - x_{j}^{(2)} >

where x_j^{(1)} is the first sample feature vector of the j-th pair sample, x_j^{(2)} is the second sample feature vector of the j-th pair sample, and z_j is the label of the j-th pair sample;