CN101887419A

CN101887419A - Batch initiative rank learning algorithm

Info

Publication number: CN101887419A
Application number: CN2009100688805A
Authority: CN
Inventors: 蒯宇豪
Original assignee: Individual
Current assignee: Individual
Priority date: 2009-05-15
Filing date: 2009-05-15
Publication date: 2010-11-17

Abstract

The invention provides a batch active rank learning algorithm comprising six steps: preprocessing the queries input by users, including Chinese word segmentation and stop-word filtering; training a rank support vector machine (Ranking SVM) on the sample set already labeled by users to obtain an initial ranking model; computing the distance from each sample in the unlabeled sample set to the ranking decision surface to obtain the distance measure for batch selection; computing the angle difference between each sample in the unlabeled sample set and the samples already selected to obtain the angle measure for batch selection; integrating the angle measure and the distance measure to obtain the overall batch selection measure; and, according to the integrated measure, selecting multiple samples worth labeling and presenting them to users for labeling.

Description

Batch active rank learning algorithm

Technical field

The present invention relates to the field of information retrieval, and in particular to a batch active rank learning algorithm for information retrieval.

Background technology

With the rapid advance of Internet technology and the explosive growth in the volume of information on the Web, people increasingly rely on search engines to find the information they care about. This immense body of information, however, poses new challenges for the development of search engines. How to return query results to the user effectively, quickly, and accurately, and thereby improve the effectiveness of Web information retrieval, has become an urgent and significant research topic.

In current information retrieval research, learning to rank based on supervised learning has gradually become the focus of ranking research. Supervised learning to rank requires a large number of manually labeled samples. To reduce the manual labeling effort, active rank learning algorithms have been developed around the idea of selecting the samples most worth labeling and labeling only those. With such an algorithm, the user does not need to label all samples at the outset. Instead, only a portion of the samples is labeled at first, and an initial ranking model is learned from them. Then, in each round, the single sample most worth labeling is selected from the remaining unlabeled samples and labeled, the newly labeled sample is added to the training set, and a new ranking model is trained. Another sample is then selected from the remaining unlabeled samples, labeled, and added to the training set, and so on until the final ranking model is obtained.

Active learning reduces the amount of labeling that learning to rank requires, but this method has a problem: only one sample is selected for labeling each time, and the model is then retrained. Training takes a long time, and the annotators must wait a long time before labeling the next sample. If multiple samples can be selected each time, the total time of active rank learning is reduced, and so is the annotators' workload, that is, the labeling cost. Moreover, if several annotators are available, labeling can proceed in parallel, which improves the efficiency of active ranking.

Summary of the invention

The invention provides a batch active rank learning algorithm that reduces the cost of the large number of manually labeled samples that learning to rank requires.

The batch active rank learning algorithm proposed by the invention comprises six steps: preprocessing the query input by the user, including Chinese word segmentation and stop-word filtering; training the rank support vector machine (Ranking SVM) algorithm on the sample set labeled by the user to obtain an initial ranking model; calculating the distance between each sample in the unlabeled sample set and the ranking decision surface to obtain the distance measure for batch selection; calculating the angle difference between each sample in the unlabeled sample set and the samples already selected to obtain the angle measure for batch selection; integrating the angle measure and the distance measure to obtain the batch selection measure; and, according to the integrated batch selection measure, selecting multiple samples worth labeling and presenting them to the user for labeling.
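As an illustration of the preprocessing step, the following minimal Python sketch segments a query with forward maximum matching, the segmentation method named in the claims, and then drops stop words. The dictionary, the stop-word list, the `max_len` window, and the function names are hypothetical and chosen only for this example; filtering after segmentation (so that stop words are matched as whole tokens) is likewise an assumption of the sketch.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def preprocess(text, dictionary, stop_words):
    """Segment the text, then remove tokens that appear in the stop-word list."""
    return [t for t in forward_max_match(text, dictionary) if t not in stop_words]


# Toy dictionary and stop-word list (hypothetical, for illustration only).
DICTIONARY = {"信息", "检索", "信息检索", "排序", "学习"}
STOP_WORDS = {"的"}

tokens = preprocess("信息检索的排序学习", DICTIONARY, STOP_WORDS)
print(tokens)
```

Because the longest match is tried first, "信息检索" is kept as one token rather than being split into "信息" and "检索".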

The batch active rank learning algorithm proposed by the invention comprises: a preprocessing unit; training on the labeled samples using the rank support vector machine (Ranking SVM) algorithm; computing the distance measure, to select samples close to the ranking decision surface; computing the angle difference measure, to select samples that differ most from the already-selected sample set; integrating the distance measure and the angle difference measure to obtain the overall measure for batch selection; and, according to the integrated measure, selecting in batches the samples most worth labeling and presenting them to the user for labeling.
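The training step can be illustrated with a small Python sketch of the pairwise idea behind the rank support vector machine: ordered pairs of labeled samples become difference vectors with a +1 or -1 class label, and a linear soft-margin objective of the form 0.5*||w||^2 + C * sum of slacks is minimised over them. The hinge form of the slack, the plain subgradient-descent solver, the toy data, and all function names are assumptions made for this example; a production system would use a dedicated SVM solver.

```python
from itertools import combinations


def pairwise_transform(features, labels):
    """Turn graded samples into difference vectors labeled +1 or -1."""
    diffs, signs = [], []
    for (xa, ya), (xb, yb) in combinations(zip(features, labels), 2):
        if ya == yb:
            continue  # equally relevant pairs carry no ordering information
        diffs.append([a - b for a, b in zip(xa, xb)])
        signs.append(1 if ya > yb else -1)
    return diffs, signs


def objective(w, diffs, signs, C=1.0):
    """0.5*||w||^2 + C * sum of hinge slacks over the ordered pairs."""
    reg = 0.5 * sum(v * v for v in w)
    slack = sum(max(0.0, 1.0 - s * sum(wi * di for wi, di in zip(w, d)))
                for d, s in zip(diffs, signs))
    return reg + C * slack


def train_rank_svm(diffs, signs, C=1.0, lr=0.01, epochs=500):
    """Minimise the objective by subgradient descent (a stand-in for a real SVM solver)."""
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        grad = list(w)  # gradient of the regularisation term
        for d, s in zip(diffs, signs):
            # Only pairs that violate the margin contribute to the subgradient.
            if s * sum(wi * di for wi, di in zip(w, d)) < 1.0:
                grad = [g - C * s * di for g, di in zip(grad, d)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Hypothetical labeled documents for one query: two features, graded relevance.
X = [[1.0, 0.2], [0.6, 0.1], [0.1, 0.9]]
y = [2, 1, 0]

diffs, signs = pairwise_transform(X, y)
w = train_rank_svm(diffs, signs)
scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
print(w, scores)
```

The learned weight vector scores each document with a dot product, and sorting by that score gives the ranking.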

Embodiment

In the present invention, the query input by the user is preprocessed; the rank support vector machine (Ranking SVM) algorithm is trained on the sample set labeled by the user; the distance between each sample in the unlabeled sample set and the ranking decision surface is calculated; the angle difference between each sample in the unlabeled sample set and the samples already selected is calculated; the angle measure and the distance measure for batch selection are integrated; and multiple samples worth labeling are selected according to the integrated batch selection measure and presented to the user for labeling. Because the batch active rank learning algorithm selects only the samples most worth labeling, rather than having all samples labeled, it reduces the labeling cost of learning to rank.
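The two selection measures can be sketched in Python as follows. The formulas used here are assumptions: the distance is taken as the usual point-to-hyperplane distance |w·x| / ||w|| for a decision surface through the origin, and the angle measure is taken as the largest cosine similarity between a candidate and any already-selected sample. The function names and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def distance_to_surface(w, x):
    """Assumed form: distance from x to the hyperplane w.x = 0 (w must be nonzero)."""
    return abs(dot(w, x)) / norm(w)


def max_cos(x, selected):
    """Assumed form: largest cosine between x and any selected sample (0.0 if none).

    All vectors are assumed to be nonzero.
    """
    return max((dot(x, s) / (norm(x) * norm(s)) for s in selected), default=0.0)


# Hypothetical weight vector, unlabeled pool, and one already-selected sample.
w = [3.0, 4.0]
unlabeled = [[1.0, 0.0], [4.0, -3.0], [0.0, 2.0]]
selected = [[0.0, 1.0]]

distance = [distance_to_surface(w, x) for x in unlabeled]
angle = [max_cos(x, selected) for x in unlabeled]
print(distance, angle)
```

A small distance marks a sample the current model is unsure about, and a small maximum cosine marks a sample that points away from everything already chosen.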

Claims

1. A ranking algorithm based on a language model, characterized in that it comprises the following steps:

a preprocessing unit performs word segmentation and stop-word filtering on the query terms input by the user and on the candidate documents;

the rank support vector machine (Ranking SVM) algorithm is trained on the sample set labeled by the user to obtain an initial ranking model;

the distance between each sample in the unlabeled sample set and the ranking decision surface is calculated to obtain the distance measure for batch selection;

the angle difference between each sample in the unlabeled sample set and the samples already selected is calculated to obtain the angle measure for batch selection;

the angle measure and the distance measure for batch selection are integrated to obtain the batch selection measure;

according to the integrated batch selection measure, multiple samples worth labeling are selected and presented to the user for labeling.

2. the method for claim 1 is characterized in that also comprising before setting up inverted index candidate documents is carried out preprocessing part.At first to filter stop words, make similar " ", " ", " ", " it " and so on invalid speech filter out, can reduce the retrieval cost like this; Its less important Chinese word segmentation that carries out, method commonly used has the forward maximum matching algorithm, anyway maximum matching algorithm, two-way maximum matching algorithm etc., we have adopted the forward maximum matching algorithm.

3. the method for claim 1, it is characterized in that marked sample set with rank support vector machine algorithm training user, rank support vector machine algorithm is by forming ordered pair to ordered samples, sequencing problem is changed into classification problem, solve with algorithm of support vector machine then.Computing method are not:

\min_{\bar{w}} M(\bar{w}) = \frac{1}{2} \|\bar{w}\|^{2} + C \sum_{i=1}^{l} \xi_{i}

4. the method for claim 1 is characterized in that, calculates the distance do not mark each sample and ordering decision surface in the sample set, and computing formula is:

Wherein

The weighted vector of the order models that the study of expression rank support vector machine algorithm obtains.

5. The method of claim 1, characterized in that the angle difference between each sample in the unlabeled sample set and the samples already selected is calculated, the computing formula being:

6. The method of claim 1, characterized in that the distance measure and the angle difference measure are integrated to obtain the overall measure for batch active selection, the computing formula being:

f(\bar{x}_{i}) = \mathrm{distance}[i] + \lambda \cdot \mathrm{maxCos}[i]

7. The method of claim 1, characterized in that multiple samples worth labeling are selected according to the integrated batch selection measure and presented to the user for labeling. Because the batch active rank learning algorithm selects only the samples most worth labeling, rather than having all samples labeled, the labeling cost of learning to rank is reduced. Selecting multiple samples at a time also reduces the total time of active rank learning and the annotators' workload, that is, the labeling cost; and if several annotators are available, labeling can proceed in parallel, which improves the efficiency of active ranking.
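The integration and selection described in claims 6 and 7 can be sketched as a greedy loop in Python. This is an illustration under stated assumptions, not the patented procedure itself: the distance is assumed to be |w·x| / ||w||, maxCos is assumed to be the largest cosine to any sample already placed in the batch, and the sample with the smallest combined score distance[i] + λ·maxCos[i] is picked in each round, on the reading that samples near the decision surface and unlike the selected set are preferred. The names `select_batch` and `lam` and the toy data are hypothetical.

```python
import math


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def _cos(a, b):
    denom = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    return _dot(a, b) / denom if denom else 0.0


def select_batch(w, unlabeled, batch_size, lam=1.0):
    """Greedily pick indices that minimise distance[i] + lam * maxCos[i] (w must be nonzero)."""
    w_norm = math.sqrt(_dot(w, w))
    distance = [abs(_dot(w, x)) / w_norm for x in unlabeled]
    chosen = []
    while len(chosen) < min(batch_size, len(unlabeled)):
        best, best_score = None, None
        for i, x in enumerate(unlabeled):
            if i in chosen:
                continue
            # Similarity to the batch built so far; 0.0 while the batch is empty.
            max_cos = max((_cos(x, unlabeled[j]) for j in chosen), default=0.0)
            score = distance[i] + lam * max_cos
            if best_score is None or score < best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen


# Hypothetical weight vector and unlabeled pool.
w = [1.0, 0.0]
unlabeled = [[0.1, 1.0], [0.1, 0.9], [0.2, -1.0], [2.0, 0.0]]

batch = select_batch(w, unlabeled, batch_size=2)
print(batch)
```

With `lam=0.0` the loop falls back to picking the samples closest to the decision surface, which in this toy pool are two nearly identical vectors; the angle term steers the second pick toward a sample pointing in a different direction.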