CN103123653A

CN103123653A - Search engine retrieving ordering method based on Bayesian classification learning

Info

Publication number: CN103123653A
Application number: CN2013100831513A
Authority: CN
Inventors: 贾德星; 徐正礼; 魏金雷
Original assignee: Langchao Qilu Software Industry Co Ltd
Current assignee: Inspur Software Co Ltd
Priority date: 2013-03-15
Filing date: 2013-03-15
Publication date: 2013-05-29

Abstract

The invention discloses a search engine retrieval ranking method based on Bayesian classification learning, belonging to the field of computer applications. In this method, a query statement is treated as an n-dimensional feature vector B = {b1, b2, ..., bn} and each index document is treated as a class A. A Bayesian classification algorithm is trained on user search behavior data to build a query word-clicked document classification model. When retrieval results are scored, the similarity score between the query statement and the index document feature vector is combined with the probability value of the class to which the query statement belongs, yielding a new score. The retrieval results are then re-sorted by the new score and returned to the retrieval client. Compared with the prior art, the method improves and optimizes the ranking of retrieval results in a search engine and thereby improves retrieval accuracy, giving it good value for popularization and application.

Description

Search engine retrieval ranking method based on Bayesian classification learning

Technical field

The present invention relates to the field of computer applications, and specifically to a search engine retrieval ranking method based on Bayesian classification learning.

Background technology

When querying an index, a traditional search engine generally scores each document by its degree of similarity to the query statement: the more similar the document, the higher its score. The results are then sorted by score from high to low and returned to the user. Similarity is generally computed by converting the query words and the documents into feature vectors with the TF-IDF method and then scoring the similarity of those feature vectors. Many specific similarity measures exist, but all of them compare static features of the documents. They therefore handle polysemy and context-dependent query scenarios poorly, and cannot promptly reflect user demand for hot query words and hot index documents.
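As an illustration of the conventional scoring described above, the following is a minimal sketch of TF-IDF vectorization followed by cosine similarity, in plain Python. The whitespace tokenization, the smoothed IDF formula, and all function names are assumptions made for illustration; production search engines use their own variants.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant; others are equally valid).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (cnt / len(doc)) * idf[t] for t, cnt in tf.items()})
    return vectors, idf

def query_vector(tokens, idf):
    """Vectorize a query, ignoring terms that never occur in the index."""
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    return {t: (c / total) * idf[t] for t, c in tf.items()} if total else {}

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, docs):
    """Return (document index, score) pairs sorted from high to low score."""
    tokenized = [d.lower().split() for d in docs]
    vectors, idf = tfidf_vectors(tokenized)
    q = query_vector(query.lower().split(), idf)
    scored = [(i, cosine(q, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because the score depends only on term statistics of the documents, two users issuing the same query always receive the same order, which is the limitation the invention addresses.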

Summary of the invention

The technical task of the present invention is to address the above deficiencies of the prior art by providing a search engine retrieval ranking method based on Bayesian classification learning. The method improves and optimizes the ranking of retrieval results in a search engine, thereby improving retrieval precision and helping users find the results they need more quickly.

The technical task of the present invention is achieved as follows. The search engine retrieval ranking method based on Bayesian classification learning is characterized in that: a query statement is treated as an n-dimensional feature vector B = {b1, b2, ..., bn} and each index document is treated as a class A; a Bayesian classification algorithm is trained on user search behavior data to build a query word-clicked document classification model; when retrieval results are scored, the similarity score between the query statement and the index document feature vector is combined with the probability value of the class to which the query statement belongs to obtain a new score; and the retrieval results are re-sorted by the new score and returned to the retrieval client.

The method is implemented through the following specific steps:

A. Record the user query log

A log component in the search engine records user query behavior data. Each log entry comprises: the user identifier, the query statement, the query time, the number of result documents retrieved, and the identifier of the document the user clicked;

B. Train the Bayesian classification model

Parse the user query behavior data recorded by the log component entry by entry. Segment each query statement into words to obtain the n-dimensional feature vector B = {b1, b2, ..., bn}, where b1 ... bn are the words of the segmented query statement, and take the document A clicked by the user as the class. Then train a Bayesian classification algorithm on the behavior data, computing P(A), P(B), and P(B|A), to build the query word-clicked document classification model;

C. Compute the ranking of retrieval results

1. First, use a conventional document feature vector similarity method to search the index with the user's query statement B and obtain the result set doc(n) of the top n index documents, each entry comprising a document identifier id and a score score;

2. Call the Bayesian learning system to compute the set classfiler(m) of the top m classes for query statement B, each entry comprising a document identifier id and a probability value p;

3. Recompute the score of each document in doc(n) with the formula

$$\mathrm{final}_n = \mathrm{score}_n + p_n$$

where $\mathrm{score}_n$ is the similarity score of document n, $p_n$ is the probability value of document n (set $p_n = 0$ if document n does not appear in the classfiler(m) set), and $\mathrm{final}_n$ is the final score of document n;

4. Re-sort the documents in doc(n) by their final scores $\mathrm{final}_n$, and return the new ranking to the retrieval client.
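Sub-steps 3 and 4 of step C can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `rerank_additive` and the representation of doc(n) and classfiler(m) as lists of `(id, value)` tuples are assumptions.

```python
def rerank_additive(doc_results, classifier_results):
    """Add the click probability to each similarity score and re-sort.

    doc_results:        list of (doc_id, score) pairs, i.e. doc(n)
    classifier_results: list of (doc_id, p) pairs, i.e. classfiler(m)
    """
    prob = dict(classifier_results)
    # A document absent from classfiler(m) gets probability 0.
    rescored = [(doc_id, score + prob.get(doc_id, 0.0))
                for doc_id, score in doc_results]
    # Highest final score first; ties keep their original relative order.
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

For example, with doc(n) = `[("d1", 0.9), ("d2", 0.8), ("d3", 0.5)]` and classfiler(m) = `[("d3", 0.6), ("d2", 0.05)]`, document `d3` moves from last place to first. Note that the additive form is only meaningful when similarity scores and probabilities are on comparable scales; the source does not say whether the scores are normalized.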

In step B, the classification model may be trained on a single machine or by distributed computing.

In step B, the trained classification model is stored in a file, a database, or memory.

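When computing the final score of document n, multiplication may be used instead of addition, that is:

$$\mathrm{final}_n = \mathrm{score}_n \times p_n$$

where $\mathrm{score}_n$ is the similarity score, $p_n$ is the probability value, and $\mathrm{final}_n$ is the final score of document n. In this case, a document n with $p_n = 0$ can be assigned a standardized probability value (the minimum probability value or the average probability value) so that its score does not become 0.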
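The multiplicative variant can be sketched in the same style. The source does not specify over which set the minimum or average probability is taken; this sketch assumes it is taken over the values in classfiler(m). The function name and the `fallback` parameter are hypothetical.

```python
def rerank_multiplicative(doc_results, classifier_results, fallback="min"):
    """Multiply each similarity score by the click probability and re-sort.

    Documents missing from classfiler(m) receive a fallback probability
    (assumed here: the minimum or mean of the probabilities in classfiler(m))
    so that their final score is not forced to 0.
    """
    prob = dict(classifier_results)
    if not prob:
        # No click evidence at all: keep the plain similarity ranking.
        return sorted(doc_results, key=lambda item: item[1], reverse=True)
    if fallback == "min":
        default_p = min(prob.values())
    elif fallback == "mean":
        default_p = sum(prob.values()) / len(prob)
    else:
        raise ValueError("fallback must be 'min' or 'mean'")
    rescored = [(doc_id, score * prob.get(doc_id, default_p))
                for doc_id, score in doc_results]
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

The choice of fallback matters: with the minimum, an unclicked document can never outrank a clicked document of equal similarity, while with the mean it can outrank documents whose click probability is below average.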

Compared with the prior art, the method of the invention has the following notable beneficial effects:

(1) By analyzing users' search behavior logs and applying Bayesian classification learning, the method improves and optimizes the query results of the search engine and helps users find the results they need more quickly. Statistical analysis of log data shows that, with this method, the average number of result page turns per user query can be reduced by 50%;

(2) By building a classification model for each individual user, a more personalized search engine tailored to that user can also be built, further improving the search experience.

Description of drawings

Figure 1 is a working model diagram of the search engine retrieval ranking method based on Bayesian classification learning of the present invention.

Embodiment

The search engine retrieval ranking method based on Bayesian classification learning of the present invention is described in detail below through a specific embodiment, with reference to the accompanying drawing.

Embodiment:

As shown in the drawing, the search engine retrieval ranking method based on Bayesian classification learning of the present invention treats the query statement as the known condition B and the clicked document as A. It computes the probability P(A|B) that the user clicks document A given query statement B, and then adds this probability to, or multiplies it by, the score obtained from comparing document feature vectors to obtain the final score of each document.
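For reference, the probability P(A|B) follows from Bayes' theorem applied to the quantities P(A), P(B), and P(B|A) that the method trains. The factorization over individual query words is the standard naive Bayes independence assumption; the text does not state it explicitly, so it is an assumption here.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad
P(B \mid A) \approx \prod_{i=1}^{n} P(b_i \mid A),
\qquad
P(B) = \sum_{A'} P(B \mid A')\, P(A')
```

For a fixed query B, the denominator P(B) is the same for every candidate document, so it does not change the order of the classes; it is needed only to turn the numerator into a probability value that can be combined with the similarity score.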

The specific implementation steps are as follows:

1. Record the user query log

A log component in the search engine records behavior data such as the user's query requests and clicks on retrieval results. Each log entry comprises: the user identifier, the query statement, the query time, the number of result documents retrieved, the identifier of the document the user clicked, and so on.
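A log entry with the fields listed above could be represented as follows. This is a sketch under assumptions: the field names, the JSON-lines serialization, and the helper functions are hypothetical, since the source specifies only which pieces of information are recorded.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class QueryLogEntry:
    user_id: str                   # user identifier
    query: str                     # query statement as typed by the user
    query_time: str                # query time, e.g. an ISO 8601 timestamp
    result_count: int              # number of result documents retrieved
    clicked_doc_id: Optional[str]  # clicked document, None if no click

def format_log_line(entry: QueryLogEntry) -> str:
    """Serialize one entry as a single JSON line."""
    return json.dumps(asdict(entry), ensure_ascii=False)

def parse_log_line(line: str) -> QueryLogEntry:
    """Parse one JSON line back into a log entry."""
    raw = json.loads(line)
    return QueryLogEntry(
        user_id=raw["user_id"],
        query=raw["query"],
        query_time=raw["query_time"],
        result_count=int(raw["result_count"]),
        clicked_doc_id=raw.get("clicked_doc_id"),
    )
```

Only entries with a non-empty `clicked_doc_id` carry a class label, so only those are usable as training examples in the next step.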

2. Train the Bayesian classification model

Parse the user query behavior data recorded in the log entry by entry. Segment each query statement into words to obtain the n-dimensional feature vector B = {b1, b2, ..., bn}, where b1 ... bn are the words of the segmented query statement, and take the document A clicked by the user as the class. Then train a Bayesian classification algorithm on the behavior data, computing P(A), P(B), and P(B|A), to build the "query word-clicked document" classification model. The classification model may be trained on a single machine or by distributed computing. The trained classification model may be stored in a file, a database, or memory.
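The training step can be sketched as a small multinomial naive Bayes classifier whose classes are clicked documents. This is an illustration under assumptions, not the patented implementation: the class name `ClickClassifier`, the Laplace smoothing parameter `alpha`, the skipping of unseen query words, and the in-memory storage are all choices made here.

```python
import math
from collections import Counter, defaultdict

class ClickClassifier:
    """Naive Bayes over query words (features B) and clicked documents (classes A)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing (assumed)
        self.doc_clicks = Counter()              # clicks per document, for P(A)
        self.term_counts = defaultdict(Counter)  # word counts per document, for P(b_i|A)
        self.vocab = set()

    def train(self, click_log):
        """click_log: iterable of (query_words, clicked_doc_id) pairs."""
        for words, doc_id in click_log:
            self.doc_clicks[doc_id] += 1
            for word in words:
                self.term_counts[doc_id][word] += 1
                self.vocab.add(word)

    def classify(self, words, m):
        """Return the top m (doc_id, P(A|B)) pairs for the given query words."""
        total_clicks = sum(self.doc_clicks.values())
        if total_clicks == 0:
            return []
        known = [w for w in words if w in self.vocab]  # skip unseen words
        log_joint = {}
        for doc_id, clicks in self.doc_clicks.items():
            counts = self.term_counts[doc_id]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(clicks / total_clicks)            # log P(A)
            for word in known:                                 # + log P(b_i|A)
                score += math.log((counts[word] + self.alpha) / denom)
            log_joint[doc_id] = score
        # Normalize by P(B) = sum over classes of P(B|A) P(A), in log space.
        peak = max(log_joint.values())
        weights = {d: math.exp(s - peak) for d, s in log_joint.items()}
        norm = sum(weights.values())
        ranked = sorted(((d, w / norm) for d, w in weights.items()),
                        key=lambda item: item[1], reverse=True)
        return ranked[:m]
```

The output of `classify` has the shape of the classfiler(m) set used in the ranking step: a list of document identifiers with probability values. Working in log space avoids numeric underflow when queries contain many words.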

3. Compute the ranking of retrieval results

2. Call the Bayesian learning system to compute the set classfiler(m) of the top m classes for query statement B, each entry comprising the class (that is, the document identifier id) and the probability value p;

3. Recompute the score of each document in doc(n) with the formula

$$\mathrm{final}_n = \mathrm{score}_n + p_n$$

where $\mathrm{score}_n$ is the similarity score of document n, $p_n$ is the probability value of document n (set $p_n = 0$ if document n does not appear in the classfiler(m) set), and $\mathrm{final}_n$ is the final score of document n;

4. Re-sort the documents in doc(n) by their final scores $\mathrm{final}_n$, and return the new ranking to the retrieval client.

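When computing the final score of document n, multiplication may be used instead of addition, that is:

$$\mathrm{final}_n = \mathrm{score}_n \times p_n$$

where $\mathrm{score}_n$ is the similarity score, $p_n$ is the probability value, and $\mathrm{final}_n$ is the final score of document n. In this case, a document n with $p_n = 0$ can be assigned a standardized probability value (the minimum probability value or the average probability value) so that its score does not become 0.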

Claims

1. A search engine retrieval ranking method based on Bayesian classification learning, characterized in that:

a query statement is treated as an n-dimensional feature vector B = {b1, b2, ..., bn} and each index document is treated as a class A, and a Bayesian classification algorithm is trained on user search behavior data to build a query word-clicked document classification model;

when retrieval results are scored, the similarity score between the query statement and the index document feature vector is combined with the probability value of the class to which the query statement belongs to obtain a new score, and the retrieval results are re-sorted by the new score and returned to the retrieval client.

2. The method according to claim 1, characterized by comprising the following steps:

A. Record the user query log

B. Train the Bayesian classification model

C. Compute the ranking of retrieval results

First, use a conventional document feature vector similarity method to search the index with the user's query statement B and obtain the result set doc(n) of the top n index documents, each entry comprising a document identifier id and a score score;

Call the Bayesian learning system to compute the set classfiler(m) of the top m classes for query statement B, each entry comprising a document identifier id and a probability value p;

Recompute the score of each document in doc(n) with the formula

$$\mathrm{final}_n = \mathrm{score}_n + p_n$$

where $\mathrm{score}_n$ is the similarity score of document n, $p_n$ is the probability value of document n (set $p_n = 0$ if document n does not appear in the classfiler(m) set), and $\mathrm{final}_n$ is the final score of document n;

Re-sort the documents in doc(n) by their final scores $\mathrm{final}_n$, and return the new ranking to the retrieval client.

3. The method according to claim 2, characterized in that in step B, the classification model is trained on a single machine or by distributed computing.

4. The method according to claim 2, characterized in that in step B, the trained classification model is stored in a file, a database, or memory.