CN101710392B - Important information acquiring method based on variable boundary support vector machine - Google Patents

Important information acquiring method based on variable boundary support vector machine

Info

Publication number
CN101710392B
Authority
CN
China
Prior art keywords
information
sample
training set
style
vector machine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009102194509A
Other languages
Chinese (zh)
Other versions
CN101710392A (en)
Inventor
张莉
郑小皇
王婷
冯骁
焦李成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Beidou Kang Xin Information Polytron Technologies Inc
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN2009102194509A
Publication of CN101710392A
Application granted
Publication of CN101710392B
Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a method for acquiring important information based on a variable boundary support vector machine, which mainly overcomes the defect of the prior art of ignoring the information carried by differences in importance between samples. The method comprises the following steps: retrieving the required information about an information object to be assessed with a smart search engine and preprocessing it to obtain an original training set; constructing a new training set from the original training set and introducing a variable boundary factor, which is the absolute value of the difference between the importance labels of two samples in the original training set; inputting the new training set, using the factor as the boundary in each constraint of the support vector machine, training an information assessment model and obtaining an information assessment function; and inputting the sample feature vectors of the information to be assessed and acquiring the important information according to the magnitude of the function values of these feature vectors. The invention achieves a high mean average precision in acquiring important information and can be used for grading information importance and assessing product quality.

Description

Important information acquiring method based on a variable boundary support vector machine
Technical field
The invention belongs to the technical field of information acquisition and relates in particular to a method for acquiring important information. The method can be applied to grading the importance of information and to evaluating product quality.
Background art
At present, with the development of science and technology, the Internet offers a massive amount of information resources, so being able to obtain the important information we want is becoming more and more important. Among information acquisition methods, information retrieval and search engines are an important approach. The core problems of a search engine are how to provide information according to people's needs and how to grade the information obtained.
In an information acquisition method, the information need is determined first, that is, a query of interest is given. Next, information is collected for the query. The information is then evaluated. The information evaluation system is a vital step of information acquisition: it grades each returned sample and assigns it a corresponding evaluation score, and these scores reflect the importance of the information contained in each sample. The quality of the information evaluation system ultimately determines whether we can obtain the information we want. The information evaluation system is obtained by machine learning, by training on a training sample set.
In information grading, the pairwise comparison approach is commonly used. Among the returned information text samples, the importance of information is determined by comparing samples two by two, which is a supervised method. Two samples form a sample pair; each sample pair is treated as a single sample and given a label, so that this kind of problem can be solved with supervised classification methods.
In 1998, Google founders Brin and Page proposed the PageRank method for grading web page information. However, it processes only a single feature and cannot fully reflect the importance of information. In information grading, the support vector machine is an important evaluation method; it can handle multiple features and better reflects the full content of the information. In 2000, Herbrich proposed applying support vector machine theory to ordinal regression and introduced the ranking support vector machine for the first time, training on sample pairs to obtain an information evaluation system for evaluating the importance of information. In 2002, Joachims derived, from another angle, a support vector machine evaluation model trained on sample pairs and applied it to information scoring. Although the two use different sampling models, both study information grading with classification methods, that is, they obtain the information evaluation model by training a classifier on sample pairs.
The above two support vector machine information evaluation methods can handle multidimensional features, but neither considers, during training, the differences between degrees of importance. When the training set has more than two importance levels, the information carried by different sample pairs differs. Suppose the importance labels of the training set are Y={1,2,3,4,5}. The sample pair formed by a sample of importance 5 and a sample of importance 1 and the sample pair formed by a sample of importance 3 and a sample of importance 2 both receive the label 1 and are treated as equal. Moreover, in these support vector machine information evaluation methods the boundary of every constraint in the optimization is a constant, so the difference in information between sample pairs cannot be expressed. Very important information is thus lost, and the information grading results become inaccurate.
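The loss of information described above can be illustrated with a small sketch. It is illustrative only: the helper name `pair_label` is hypothetical and simply restates the fixed pair labeling rule of the prior-art methods.

```python
def pair_label(y_first, y_second):
    """Pair label used by fixed-boundary methods: +1 if the first sample is more important, else -1."""
    return 1 if y_first > y_second else -1

# The two pairs from the example: importance (5, 1) and importance (3, 2).
for y_a, y_b in [(5, 1), (3, 2)]:
    print(y_a, y_b, "label:", pair_label(y_a, y_b), "gap:", abs(y_a - y_b))
```

Both pairs print the same label although their importance gaps are 4 and 1; the gap is exactly the quantity that a constant boundary discards.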
Summary of the invention
The object of the invention is to overcome the deficiencies of the above methods by providing a support vector machine information acquisition method based on a variable boundary, which introduces the difference information between sample pairs into the optimization of the support vector machine, so that training on the samples is more effective, the acquisition of important information is ensured, and the accuracy of the information grading results is improved.
To achieve the above object, the invention comprises the following steps:
Step of collecting the required information: for the information object to be evaluated, the information to be collected is extracted into a text collection by a smart search engine according to the query need;
Information preprocessing step: using the t-dimensional primitive features (t>44) given by the term frequency and inverse document frequency of the text collection, feature extraction is performed on the text collection; the extracted features are converted into 45-dimensional numerical feature vectors, and these feature vectors are reduced in dimension to obtain the sample set $(x_i, y_i)$, $i=1,\dots,n$, where $x_1,\dots,x_n$ are two-dimensional sample feature vectors, $y_i$ is the sample importance label and n is the number of samples;
Training step of the information evaluation model:
r samples of the sample set obtained in the previous step are taken as the original training set (r<n). In the original training set $((x_1, y_1),\dots,(x_r, y_r))$, any two two-dimensional sample feature vectors form a sample pair $(x_i^{(1)}, x_i^{(2)})$. If the importance label of the first sample feature vector $x_i^{(1)}$ is greater than the importance label of the second sample feature vector $x_i^{(2)}$, the sample pair $(x_i^{(1)}, x_i^{(2)})$ is labeled $z_i=1$, otherwise $z_i=-1$. The new training set $\{((x_i^{(1)}, x_i^{(2)}), z_i, d_i)\}$, $i=1,2,\dots,m$, is then constructed, where $m=O(n^2)$ is the number of samples in the new training set and $d_i = |y_i^{(1)} - y_i^{(2)}|$, the absolute value of the difference between the importance labels of the two samples of the pair, is the variable boundary factor, which expresses the difference in importance within the sample pair;
Using the variable boundary support vector machine information evaluation method, the new training set is trained to obtain the information evaluation function $f(x)=w\cdot x$, where w is the weight parameter obtained by training and x is the input two-dimensional sample feature vector;
Step of acquiring the important information: the sample feature vectors of the information to be evaluated are input into the information evaluation function $f(x)=w\cdot x$ and sorted in descending order of their function values; the top-ranked samples are taken as the important information to be acquired.
Because the invention introduces the variable boundary factor in the training step of the information evaluation model, the difference information between sample pairs enters the optimization of the support vector machine and the difference in importance of each sample pair can be expressed. Training on the samples is therefore more effective and the accuracy of the information grading results is improved, which ensures the mean average precision of acquiring important information.
Description of drawings
Fig. 1 is the implementation flowchart of the invention;
Fig. 2 is the flowchart of the training process of the information evaluation model of the invention.
Embodiment
With reference to Fig. 1, the invention is implemented in the following steps:
Step 1: for the information object to be evaluated, the information to be collected is extracted into a text collection by a smart search engine according to the query need.
Step 2: using the t-dimensional primitive features (t>44) given by the term frequency and inverse document frequency of the text collection, feature extraction is performed on the text collection; the extracted features are converted into 45-dimensional numerical feature vectors, and these feature vectors are reduced in dimension to obtain the sample set $(x_i, y_i)$, $i=1,\dots,n$, where n is the number of samples, $x_1,\dots,x_n$ are two-dimensional sample feature vectors and $y_i$ is the sample importance label, $y_i \in \{2,1,0\}$: '2' means the information contained in the sample is most important, '1' means it is partially important, and '0' means it is not important at all.
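The two quantities named in Step 2, term frequency and inverse document frequency, can be sketched as follows. This is only an illustration of those two quantities under common textbook definitions (tf as relative count, idf as log(N/df)); the text does not specify how the 45 features are actually built, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["heart", "attack", "risk"], ["heart", "surgery"], ["diet", "risk", "risk"]]
weights = tf_idf(docs)
print(weights[2])
```

Terms that are frequent in one document but rare across the collection receive the largest weights.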
Step 3: construct the new training set.
r samples of the sample set obtained in Step 2 are taken as the original training set (r<n). In the original training set $((x_1, y_1),\dots,(x_r, y_r))$, any two two-dimensional sample feature vectors form a sample pair $(x_i^{(1)}, x_i^{(2)})$. If the importance label of the first sample feature vector $x_i^{(1)}$ is greater than the importance label of the second sample feature vector $x_i^{(2)}$, the sample pair $(x_i^{(1)}, x_i^{(2)})$ is labeled $z_i=1$, otherwise $z_i=-1$. The new training set $\{((x_i^{(1)}, x_i^{(2)}), z_i, d_i)\}$, $i=1,2,\dots,m$, is then constructed, where $m=O(n^2)$ is the number of samples in the new training set and $d_i = |y_i^{(1)} - y_i^{(2)}|$, the absolute value of the difference between the importance labels of the two samples of the pair, is the variable boundary factor, which expresses the difference in importance within the sample pair.
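The construction of Step 3 can be sketched as below. The sketch assumes that each unordered pair of samples is used once, which gives n(n-1)/2 pairs and agrees with m=O(n²); the function name `build_pairs` and the toy data are hypothetical.

```python
from itertools import combinations

def build_pairs(X, y):
    """Turn samples (X[i], y[i]) into pair records (x1, x2, z, d)."""
    pairs = []
    for i, j in combinations(range(len(X)), 2):
        z = 1 if y[i] > y[j] else -1   # pair label as stated in the text
        d = abs(y[i] - y[j])           # variable boundary factor
        pairs.append((X[i], X[j], z, d))
    return pairs

X = [(0.9, 0.1), (0.5, 0.4), (0.1, 0.2)]
y = [2, 1, 0]
for x1, x2, z, d in build_pairs(X, y):
    print(x1, x2, z, d)
```

Following the rule literally, a pair whose two samples carry the same importance label receives z = -1 and d = 0.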
Step 4: train the information evaluation model.
With reference to Fig. 2, the new training set is trained as follows with the variable boundary support vector machine information evaluation method:
(4a) Input the training sample set $\{((x_i^{(1)}, x_i^{(2)}), z_i, d_i)\}$, $i=1,2,\dots,m$;
(4b) According to support vector machine theory, calculate the weight parameter w of the input training set by the following formula:
$$w = \sum_{i=1}^{m} d_i z_i \alpha_i \left(x_i^{(1)} - x_i^{(2)}\right),$$
where $z_i$ is the label of the i-th sample pair, $d_i$ is the variable boundary factor, and $\alpha_i$ is the unknown Lagrange multiplier, $0 \le \alpha_i \le C$. The Lagrange multipliers are solved from the following quadratic programming formula:
$$\sum_{i=1}^{m} d_i \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j z_i z_j \left\langle x_i^{(1)} - x_i^{(2)},\; x_j^{(1)} - x_j^{(2)} \right\rangle$$
Here $x_i^{(1)}$ is the first sample feature vector of the i-th sample pair, $x_i^{(2)}$ is the second sample feature vector of the i-th sample pair, $x_j^{(1)}$ is the first sample feature vector of the j-th sample pair, $x_j^{(2)}$ is the second sample feature vector of the j-th sample pair, and $z_j$ is the label of the j-th sample pair.
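The effect of using $d_i$ as the boundary of each constraint can be sketched numerically. The sketch below does not solve the quadratic program given above; as a swapped-in, simpler technique it runs subgradient descent on a primal hinge-loss objective, assuming the constraints have the form $z_i\, w\cdot(x_i^{(1)} - x_i^{(2)}) \ge d_i - \xi_i$, which is one reading of "the factor as a boundary in each constraint". The function name, the learning rate, the epoch count and the toy data are all hypothetical.

```python
import numpy as np

def train_variable_margin(diffs, z, d, C=1.0, lr=0.01, epochs=500):
    """Minimize 0.5*||w||^2 + C * sum(max(0, d_i - z_i * w.diff_i)) by subgradient descent."""
    diffs = np.asarray(diffs, dtype=float)   # rows are x1 - x2 for each pair
    z = np.asarray(z, dtype=float)
    d = np.asarray(d, dtype=float)
    w = np.zeros(diffs.shape[1])
    for _ in range(epochs):
        margins = z * (diffs @ w)
        violated = margins < d               # pairs that fall inside their own boundary d_i
        hinge_grad = (z[violated][:, None] * diffs[violated]).sum(axis=0)
        w -= lr * (w - C * hinge_grad)
    return w

# Toy pairs: difference vectors, pair labels and variable boundary factors.
diffs = [[1.0, 0.0], [2.0, 0.0], [1.0, 0.0]]
z = [1, 1, 1]
d = [1, 2, 1]
w = train_variable_margin(diffs, z, d)
print(w)
```

With a constant boundary every pair would be asked for the same separation; here the pair with the larger importance gap is asked for a proportionally larger one.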
Step 5: the weight parameter w of the input training set and the sample feature vectors x of the information to be evaluated are input into the information evaluation function $f(x)=w\cdot x$. The samples to be evaluated are sorted in descending order of their function values, so that they form an ordered list, and the samples at the top of the list are taken as the important information to be acquired.
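Step 5 amounts to scoring and sorting, which can be sketched as follows. The weight vector and feature vectors are made-up values and the function name `rank_samples` is hypothetical.

```python
import numpy as np

def rank_samples(w, X):
    """Return the row indices of X sorted by f(x) = w.x, highest score first, with the sorted scores."""
    scores = np.asarray(X, dtype=float) @ np.asarray(w, dtype=float)
    order = np.argsort(-scores, kind="stable")
    return order, scores[order]

w = [1.0, 0.5]
X = [[0.2, 0.1], [0.9, 0.3], [0.5, 0.0]]
order, sorted_scores = rank_samples(w, X)
print(order.tolist(), sorted_scores.tolist())
```

The first entries of `order` index the samples that would be returned as important information.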
The effect of the invention is further illustrated by the following simulation experiments:
The invention was tested on the OHSUMED data set and compared with the existing RankSVM method.
The OHSUMED data set is derived from the U.S. medical information database MEDLINE. It contains 106 groups of medical information samples. The groups differ in size, each sample has 45 primitive features, and the sample importance label is $y \in \{2,1,0\}$: '2' means the information contained in the sample is most important, '1' means it is partially important, and '0' means it is not important at all.
PCA is applied to each group of samples to reduce the original 45-dimensional features, giving the sample set $(x_i, y_i)$, $i=1,\dots,n$, where $x_1,\dots,x_n$ are two-dimensional sample feature vectors and n is the number of samples in the group.
PCA (principal component analysis) is a projection method that seeks the representation of the raw data that is best in the mean-square sense. PCA reduces the dimension of the feature space by extracting the directions along which the data cloud scatters most.
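A minimal sketch of such a reduction from 45 to 2 dimensions is given below. It computes PCA through a singular value decomposition of the centered data, which is one standard way to obtain the directions of largest scatter; the random matrix merely stands in for one group of samples, and the function name `pca_reduce` is hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project the rows of X onto the n_components directions of largest variance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
features = rng.normal(size=(130, 45))   # stand-in for one group with 45 primitive features
reduced = pca_reduce(features)
print(reduced.shape)
```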
We adopted the most common evaluation criterion, mean average precision (MAP), which measures the average precision of acquiring important information.
MAP can only evaluate data sets with two kinds of labels. Therefore, when computing the MAP value, the samples labeled '2' and the samples labeled '1' in the data set are both relabeled '1', and the remaining samples are left unchanged. In the i-th group of experiments, the average precision is computed as follows:
$$AP_i = \frac{\sum_{j=1}^{N} P(j)\cdot pos(j)}{h}, \qquad P(j) = \frac{h_j}{j}$$
In the output ordered list, pos(j)=1 when the sample at position j is labeled '1'; otherwise pos(j)=0.
h is the number of samples labeled '1' in the ordered list, $h_j$ is the number of samples labeled '1' among the first j samples of the ordered list, and N is the number of samples in the ordered list.
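The average precision formula, together with the relabeling of '2' and '1' to '1', can be sketched as below. The function names are hypothetical, and returning 0 for a list without any important sample is an assumption, since the text does not cover that case.

```python
def binarize(labels):
    """Relabel importance 2 and 1 as 1; importance 0 stays 0."""
    return [1 if y >= 1 else 0 for y in labels]

def average_precision(ranked_labels):
    """AP of a ranked list of binary labels (1 = important, 0 = not important)."""
    h = sum(ranked_labels)
    if h == 0:
        return 0.0                 # assumption: AP is taken as 0 when no sample is important
    hits = 0
    total = 0.0
    for j, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1              # h_j, important samples among the first j
            total += hits / j      # P(j) * pos(j)
    return total / h

def mean_average_precision(ranked_lists):
    """Mean of the AP values of several ranked lists."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

ranked = [2, 0, 1, 0]              # importance labels in the order produced by the ranking
ap = average_precision(binarize(ranked))
print(ap)
```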
1. Simulation conditions and content
The experiments use 8 of the 106 groups of the OHSUMED data set, giving 8 groups of experiments, all run in Matlab 7.0.1. The 8 groups are groups 1, 5, 6, 7, 9, 10, 11 and 13 of the OHSUMED data set. Group 1 has 130 samples, group 5 has 56, group 6 has 153, group 7 has 54, group 9 has 139, group 10 has 34, group 11 has 95 and group 13 has 95. In each group of experiments, the samples of the group are divided into 4 disjoint parts of n/4 samples each. Four experiments are carried out on each group; in each experiment, three parts serve as the training set and one part as the test set.
2. Simulation results
Four experiments are carried out on each group of samples, the average precision obtained in each experiment is recorded, and the four values are averaged. The results are shown in Table 1. C denotes the trade-off coefficient of the support vector machine and is selected from {1, 10, 100, 1000} in the experiments.
Table 1. Comparison of average precision
[Table 1 is provided as an image in the original publication.]
As can be seen from the simulation results in Table 1, in the simulations on all 8 groups of data the average precision with which the method of the invention acquires important information is higher than that of the existing RankSVM method.
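The four-part protocol described above can be sketched as follows. How samples were assigned to the four parts is not stated, so the interleaved assignment used here is an assumption, and the function name `four_fold_splits` is hypothetical.

```python
def four_fold_splits(n, k=4):
    """Yield (train_indices, test_indices) over k disjoint, near-equal parts of range(n)."""
    indices = list(range(n))
    parts = [indices[i::k] for i in range(k)]   # assumption: interleaved assignment to parts
    for t in range(k):
        test = parts[t]
        train = [i for p, part in enumerate(parts) if p != t for i in part]
        yield train, test

# Group 1 of the experiments has 130 samples.
for train, test in four_fold_splits(130):
    print(len(train), len(test))
```

When n is not divisible by 4, as with 130 samples, the parts differ in size by at most one sample.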

Claims (1)

1. A method for acquiring important information based on a variable boundary support vector machine, comprising:
Step of collecting the required information: for the information object to be evaluated, the information to be collected is extracted into a text collection by a smart search engine according to the query need;
Information preprocessing step: using the t-dimensional primitive features (t>44) given by the term frequency and inverse document frequency of the text collection, feature extraction is performed on the text collection; the extracted features are converted into 45-dimensional numerical feature vectors, and these feature vectors are reduced in dimension to obtain the sample set $(x_i, y_i)$, $i=1,\dots,n$, where $x_1,\dots,x_n$ are two-dimensional sample feature vectors, $y_i$ is the sample importance label and n is the number of samples;
Training step of the information evaluation model:
r samples of the sample set obtained in the previous step are taken as the original training set (r<n). In the original training set $((x_1, y_1),\dots,(x_r, y_r))$, any two two-dimensional sample feature vectors form a sample pair $(x_i^{(1)}, x_i^{(2)})$. If the importance label of the first sample feature vector $x_i^{(1)}$ is greater than the importance label of the second sample feature vector $x_i^{(2)}$, the sample pair $(x_i^{(1)}, x_i^{(2)})$ is labeled $z_i=1$, otherwise $z_i=-1$. The new training set $\{((x_i^{(1)}, x_i^{(2)}), z_i, d_i)\}$, $i=1,2,\dots,m$, is then constructed, where $m=O(n^2)$ is the number of samples in the new training set and $d_i = |y_i^{(1)} - y_i^{(2)}|$, the absolute value of the difference between the importance labels of the two samples of the pair, is the variable boundary factor, which expresses the difference in importance within the sample pair;
The new training set is trained with the variable boundary support vector machine information evaluation method as follows to obtain the information evaluation function $f(x)=w\cdot x$, where w is the weight parameter obtained by training and x is the input two-dimensional sample feature vector:
First, input the training sample set $\{((x_i^{(1)}, x_i^{(2)}), z_i, d_i)\}$, $i=1,2,\dots,m$;
Then, according to support vector machine theory, calculate the weight parameter w of the input training set by the following formula:
$$w = \sum_{i=1}^{m} d_i z_i \alpha_i \left(x_i^{(1)} - x_i^{(2)}\right),$$
where $z_i$ is the label of the i-th sample pair, $d_i$ is the variable boundary factor, and $\alpha_i$ is the unknown Lagrange multiplier, $0 \le \alpha_i \le C$. The Lagrange multipliers are solved from the following quadratic programming formula:
$$\sum_{i=1}^{m} d_i \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j z_i z_j \left\langle x_i^{(1)} - x_i^{(2)},\; x_j^{(1)} - x_j^{(2)} \right\rangle$$
Here $x_i^{(1)}$ is the first sample feature vector of the i-th sample pair, $x_i^{(2)}$ is the second sample feature vector of the i-th sample pair, $x_j^{(1)}$ is the first sample feature vector of the j-th sample pair, $x_j^{(2)}$ is the second sample feature vector of the j-th sample pair, and $z_j$ is the label of the j-th sample pair;
Step of acquiring the important information: the sample feature vectors of the information to be evaluated are input into the information evaluation function $f(x)=w\cdot x$ and sorted in descending order of their function values; the top-ranked samples are taken as the important information to be acquired.
CN2009102194509A 2009-12-11 2009-12-11 Important information acquiring method based on variable boundary support vector machine Expired - Fee Related CN101710392B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102194509A CN101710392B (en) 2009-12-11 2009-12-11 Important information acquiring method based on variable boundary support vector machine

Publications (2)

Publication Number Publication Date
CN101710392A CN101710392A (en) 2010-05-19
CN101710392B true CN101710392B (en) 2011-09-21

Family

ID=42403177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102194509A Expired - Fee Related CN101710392B (en) 2009-12-11 2009-12-11 Important information acquiring method based on variable boundary support vector machine

Country Status (1)

Country Link
CN (1) CN101710392B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20100519

Assignee: Shaanxi Beidou Kang Xin information Polytron Technologies Inc

Assignor: Xidian University

Contract record no.: 2014610000064

Denomination of invention: Important information acquiring method based on variable boundary support vector machine

Granted publication date: 20110921

License type: Exclusive License

Record date: 20140409

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EC01 Cancellation of recordation of patent licensing contract

Assignee: Shaanxi Beidou Kang Xin information Polytron Technologies Inc

Assignor: Xidian University

Contract record no.: 2014610000064

Date of cancellation: 20150330

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
ASS Succession or assignment of patent right

Owner name: SHAANXI BEIDOU KANGXI INFORMATION TECHNOLOGY CO.,

Free format text: FORMER OWNER: XIDIAN UNIVERSITY

Effective date: 20150722

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150722

Address after: 710100, Shaanxi, Xi'an Aerospace base, Shenzhou four road, creating an International Plaza, block C, 7

Patentee after: Shaanxi Beidou Kang Xin information Polytron Technologies Inc

Address before: Xi'an City, Shaanxi province Taibai Road 710071 No. 2

Patentee before: Xidian University

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20110921

Termination date: 20161211
