CN101887419A - Batch initiative rank learning algorithm - Google Patents

Publication number
CN101887419A
Authority
CN
China
Prior art keywords
sample
mark
batch
algorithm
tolerance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2009100688805A
Other languages
Chinese (zh)
Inventor
蒯宇豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN2009100688805A
Publication of CN101887419A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a batch initiative (active) rank learning algorithm comprising six steps: preprocessing the queries entered by users, including Chinese word segmentation and stop-word filtering; training a ranking support vector machine (ranking SVM) on the sample set already labeled by users to obtain an initial ranking model; computing the distance from each sample in the unlabeled sample set to the ranking decision surface to obtain the distance metric for batch selection; computing the angle difference between each sample in the unlabeled sample set and the samples already selected to obtain the angle metric for batch selection; combining the angle metric and the distance metric into a single batch selection metric; and, according to the combined metric, selecting multiple samples most worth labeling and presenting them to users for labeling.

Description

Batch initiative rank learning algorithm
Technical field
The present invention relates to the field of information retrieval, and in particular to a batch initiative (active) rank learning algorithm for information retrieval.
Background art
With the rapid advance of Internet technology and the explosive growth of information on the Web, people increasingly rely on search engines to find the information they care about. The sheer volume of information, however, poses new challenges for search engines. How to return query results to users efficiently, quickly, and accurately, and thereby improve Web information retrieval, has become an urgent and significant research topic.
At present, learning to rank based on supervised learning is gradually becoming the focus of ranking research in information retrieval. Supervised learning to rank requires a large number of manually labeled samples. To reduce the manual labeling effort, active learning-to-rank algorithms have been proposed, built on the idea of selecting the samples most worth labeling and labeling only those. With such an algorithm, the user does not label all samples at the outset. Instead, the user first labels only a portion of the samples, from which an initial ranking model is learned. Then, in each round, the single sample most worth labeling is selected from the remaining unlabeled samples and labeled, the newly labeled sample is added to the training set, and a new ranking model is trained. Another sample is then selected from the remaining unlabeled samples, labeled, and added to the training set, and so on until the final ranking model is obtained. Active learning reduces the amount of labeling that learning to rank requires, but this approach has a drawback: only one sample is selected for labeling in each round, and the model is retrained afterwards. Training takes a long time, so annotators must wait a long time before labeling the next sample. If multiple samples can be selected in each round, the total time of active learning to rank is reduced, as is the annotators' workload, that is, the labeling cost. In addition, if several annotators are available, they can label in parallel, which improves the efficiency of active ranking.
Summary of the invention
The invention provides a batch initiative rank learning algorithm that reduces the cost of the large number of manually labeled samples required by learning to rank.
The batch initiative rank learning algorithm proposed by the present invention comprises six steps: (1) preprocessing the query entered by the user, including Chinese word segmentation and stop-word filtering; (2) training a ranking support vector machine (ranking SVM) on the user-labeled sample set to obtain an initial ranking model; (3) computing the distance between each sample in the unlabeled sample set and the ranking decision surface to obtain the distance metric for batch selection; (4) computing the angle difference between each sample in the unlabeled sample set and the samples already selected to obtain the angle metric for batch selection; (5) combining the angle metric and the distance metric to obtain the batch selection metric; (6) selecting, according to the combined batch selection metric, multiple samples most worth labeling and presenting them to the user for labeling.
Stated in terms of its components, the batch initiative rank learning algorithm proposed by the present invention comprises: a preprocessing unit; training on the labeled samples with the ranking SVM algorithm; computing the distance metric, which favors samples close to the ranking decision surface; computing the angle difference metric, which favors samples that differ most from the set of samples already selected; combining the distance metric and the angle difference metric into an overall batch selection metric; and, according to the combined metric, selecting in a batch the samples most worth labeling and presenting them to the user for labeling.
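The batch selection steps (distance metric, angle metric, combined metric, selection) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the distance formula in the source is an image that is not reproduced, so the sketch assumes the geometric distance |w·x| / ||w|| to a hyperplane through the origin, uses the cosine between feature vectors as the angle measure, and assumes that a lower combined score is preferred (close to the decision surface, dissimilar to samples already picked). The names `select_batch`, `distance_to_surface`, `cosine`, and `lam` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def distance_to_surface(w, x):
    # Assumed form: geometric distance from x to the hyperplane w.x = 0.
    return abs(dot(w, x)) / norm(w)


def cosine(a, b):
    # Cosine of the angle between two non-zero feature vectors.
    return dot(a, b) / (norm(a) * norm(b))


def select_batch(w, unlabeled, batch_size, lam=1.0):
    """Greedily pick batch_size sample indices with the lowest combined score.

    Score of candidate i = distance[i] + lam * maxCos[i], where maxCos[i] is
    the largest cosine between sample i and any sample already in the batch
    (taken as 0.0 while the batch is still empty).
    """
    selected = []
    remaining = list(range(len(unlabeled)))
    while remaining and len(selected) < batch_size:

        def score(i):
            max_cos = max(
                (cosine(unlabeled[i], unlabeled[j]) for j in selected),
                default=0.0,
            )
            return distance_to_surface(w, unlabeled[i]) + lam * max_cos

        best = min(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    w = [1.0, 0.0]  # hypothetical weight vector of the ranking model
    samples = [[0.1, 1.0], [0.12, 1.0], [0.5, 0.0]]
    print(select_batch(w, samples, batch_size=2))
```

With `lam=0.0` the sketch falls back to choosing the samples nearest the decision surface; a positive `lam` penalizes candidates that point in nearly the same direction as a sample already in the batch, which is one way to read the diversity requirement described above.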
Embodiment
In the present invention, the query entered by the user is preprocessed; a ranking SVM is trained on the user-labeled sample set; the distance between each sample in the unlabeled sample set and the ranking decision surface is computed; the angle difference between each sample in the unlabeled sample set and the samples already selected is computed; the angle metric and the distance metric for batch selection are combined; and, according to the combined batch selection metric, multiple samples most worth labeling are selected and presented to the user for labeling. Because the batch initiative rank learning algorithm presents only the samples most worth labeling to the user, rather than requiring all samples to be labeled, it reduces the labeling cost of learning to rank.
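The ranking SVM training step can be sketched as follows, under stated assumptions. The text only says that ordered pairs turn the ranking problem into a classification problem, so the sketch assumes the standard pairwise construction: for every pair of samples from the same query with different relevance labels, the difference vector of the more relevant minus the less relevant sample should score at least 1, with the slack measured by the hinge loss. A plain subgradient descent stands in for a real SVM solver. The function names, learning rate, and epoch count are hypothetical.

```python
def pairwise_differences(samples):
    """Build difference vectors from (feature_vector, relevance) samples.

    For every ordered pair where the first sample is more relevant than the
    second, emit x_first - x_second; the model should score it positively.
    """
    pairs = []
    for xa, ra in samples:
        for xb, rb in samples:
            if ra > rb:
                pairs.append([a - b for a, b in zip(xa, xb)])
    return pairs


def train_ranking_svm(pairs, C=1.0, epochs=200, lr=0.01):
    """Minimise 0.5 * ||w||^2 + C * sum_i max(0, 1 - w.d_i) by subgradient descent."""
    w = [0.0] * len(pairs[0])
    for _ in range(epochs):
        grad = list(w)  # gradient of the regularisation term
        for d in pairs:
            margin = sum(wi * di for wi, di in zip(w, d))
            if margin < 1.0:  # pair violates the margin, hinge term is active
                grad = [g - C * di for g, di in zip(grad, d)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


if __name__ == "__main__":
    # Hypothetical labeled samples for one query: (features, relevance grade).
    labeled = [([1.0, 0.0], 2), ([0.5, 0.5], 1), ([0.0, 1.0], 0)]
    w = train_ranking_svm(pairwise_differences(labeled))
    ranked = sorted(labeled, key=lambda s: score(w, s[0]), reverse=True)
    print([relevance for _, relevance in ranked])
```

The learned weight vector `w` plays the role of the initial ranking model; sorting candidates by `score(w, x)` gives the ranking, and the same `w` defines the decision surface used by the batch selection metrics.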

Claims (7)

1. A ranking algorithm based on a language model, characterized by comprising the following steps:
Performing, by a preprocessing unit, word segmentation and stop-word filtering on the query terms entered by the user and on the candidate documents;
Training a ranking support vector machine (ranking SVM) on the user-labeled sample set to obtain an initial ranking model;
Computing the distance between each sample in the unlabeled sample set and the ranking decision surface to obtain the distance metric for batch selection;
Computing the angle difference between each sample in the unlabeled sample set and the samples already selected to obtain the angle metric for batch selection;
Combining the angle metric and the distance metric to obtain the batch selection metric;
Selecting, according to the combined batch selection metric, multiple samples most worth labeling and presenting them to the user for labeling.
2. The method of claim 1, characterized by further comprising preprocessing the candidate documents before the inverted index is built. First, stop words are filtered out, so that uninformative words such as " ", " ", " ", and "it" are removed, which reduces the retrieval cost. Second, Chinese word segmentation is performed; commonly used methods include the forward maximum matching algorithm, the reverse maximum matching algorithm, and the bidirectional maximum matching algorithm, and the forward maximum matching algorithm is adopted here.
3. The method of claim 1, characterized in that the user-labeled sample set is trained with the ranking SVM algorithm. The ranking SVM algorithm forms ordered pairs from the ranked samples, thereby converting the ranking problem into a classification problem, which is then solved with a support vector machine. The objective is:
$$\min_{\bar{w}} M(\bar{w}) = \frac{1}{2}\|\bar{w}\|^2 + C \sum_{i=1}^{l} \xi_i$$
4. The method of claim 1, characterized in that the distance between each sample in the unlabeled sample set and the ranking decision surface is computed by the formula:
Figure F2009100688805C0000012
where
Figure F2009100688805C0000013
denotes the weight vector of the ranking model learned by the ranking SVM algorithm.
5. The method of claim 1, characterized in that the angle difference between each sample in the unlabeled sample set and the samples already selected is computed by the formula:
Figure F2009100688805C0000014
6. The method of claim 1, characterized in that the distance metric and the angle difference metric are combined to obtain the overall metric for active batch selection, computed by the formula:
$$f(\bar{x}_i) = \mathrm{distance}[i] + \lambda \cdot \mathrm{maxCos}[i]$$
7. The method of claim 1, characterized in that multiple samples most worth labeling are selected according to the combined batch selection metric and presented to the user for labeling. The batch initiative rank learning algorithm presents only the samples most worth labeling to the user, rather than requiring all samples to be labeled, which reduces the labeling cost of learning to rank. Selecting multiple samples at a time reduces the total time of active learning to rank and the annotators' workload, that is, the labeling cost; in addition, if several annotators are available, they can label in parallel, which improves the efficiency of active ranking.
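The preprocessing named in claim 2 (stop-word filtering and forward maximum matching) can be sketched as below. This is only an illustration under assumptions: the dictionary, the stop-word set, and the maximum word length of four characters are hypothetical, and the sketch filters stop words after segmentation because a token-level filter needs tokens to work on, whereas the claim lists filtering first.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by forward maximum matching.

    At each position, take the longest dictionary word (up to max_len
    characters) that starts there; fall back to a single character when no
    dictionary word matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    # Drop tokens that carry no retrieval value.
    return [t for t in tokens if t not in stop_words]


if __name__ == "__main__":
    # Hypothetical dictionary and stop-word list, for illustration only.
    dictionary = {"搜索", "引擎", "搜索引擎", "信息", "检索", "信息检索"}
    stop_words = {"的"}
    tokens = forward_max_match("搜索引擎的信息检索", dictionary)
    print(remove_stop_words(tokens, stop_words))
```

Forward maximum matching is greedy, so it prefers the longest word at each position ("搜索引擎" over "搜索"); the reverse and bidirectional variants mentioned in claim 2 differ only in scan direction and in how conflicts between the two scans are resolved.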
CN2009100688805A 2009-05-15 2009-05-15 Batch initiative rank learning algorithm Pending CN101887419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100688805A CN101887419A (en) 2009-05-15 2009-05-15 Batch initiative rank learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100688805A CN101887419A (en) 2009-05-15 2009-05-15 Batch initiative rank learning algorithm

Publications (1)

Publication Number Publication Date
CN101887419A true CN101887419A (en) 2010-11-17

Family

ID=43073345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100688805A Pending CN101887419A (en) 2009-05-15 2009-05-15 Batch initiative rank learning algorithm

Country Status (1)

Country Link
CN (1) CN101887419A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107688576A (en) * 2016-08-04 2018-02-13 中国科学院声学研究所 The structure and tendentiousness sorting technique of a kind of CNN SVM models
CN107688576B (en) * 2016-08-04 2020-06-16 中国科学院声学研究所 Construction and tendency classification method of CNN-SVM model

Similar Documents

Publication Publication Date Title
CN105653706B (en) A kind of multilayer quotation based on literature content knowledge mapping recommends method
CN104090890B (en) Keyword similarity acquisition methods, device and server
CN103778227B (en) The method screening useful image from retrieval image
CN104156433B (en) Image retrieval method based on semantic mapping space construction
CN106815252A (en) A kind of searching method and equipment
CN106709754A (en) Power user grouping method based on text mining
CN105808530B (en) Interpretation method and device in a kind of statistical machine translation
CN103605658B (en) A kind of search engine system analyzed based on text emotion
CN101853470A (en) Collaborative filtering method based on socialized label
CN105844424A (en) Product quality problem discovery and risk assessment method based on network comments
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
CN103123653A (en) Search engine retrieving ordering method based on Bayesian classification learning
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN107656920B (en) Scientific and technological talent recommendation method based on patents
CN110032733A (en) A kind of rumour detection method and system for news long text
CN104361102A (en) Expert recommendation method and system based on group matching
CN105677857B (en) method and device for accurately matching keywords with marketing landing pages
CN103593373A (en) Search result sorting method and search result sorting device
CN108090223B (en) Openers portrait method based on internet information
CN110543595A (en) in-station search system and method
CN103473128A (en) Collaborative filtering method for mashup application recommendation
CN108520038B (en) Biomedical literature retrieval method based on sequencing learning algorithm
CN106445994A (en) Mixed algorithm-based web page classification method and apparatus
CN102609539B (en) Search method and search system
CN103744954A (en) Word relevancy network model establishing method and establishing device thereof

Legal Events

Date Code Title Description
DD01 Delivery of document by public notice

Addressee: Kuai Yuhao

Document name: Notification of Passing Preliminary Examination of the Application for Invention

C06 Publication
PB01 Publication
DD01 Delivery of document by public notice

Addressee: Kuai Yuhao

Document name: Notification of Publication of the Application for Invention

C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20101117