CN103123653A - Search engine retrieving ordering method based on Bayesian classification learning - Google Patents

Info

Publication number
CN103123653A
Authority
CN
China
Prior art keywords
document
search engine
user
query statement
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013100831513A
Other languages
Chinese (zh)
Inventor
贾德星
徐正礼
魏金雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Langchao Qilu Software Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Langchao Qilu Software Industry Co Ltd
Priority to CN2013100831513A
Publication of CN103123653A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a search engine retrieval ranking method based on Bayesian classification learning and belongs to the field of computer applications. In the method, a query statement is treated as an n-dimensional feature vector B = {b1, b2, ..., bn} and each indexed document as a category A. A Bayesian classification algorithm is trained on user search behavior data to build a query word-clicked document classification model. When retrieval results are scored, the similarity score between the query statement and the indexed document's feature vector is combined with the probability value of the category to which the query statement belongs to obtain a new score. The search results are then re-sorted by the new score and returned to the retrieval client. Compared with the prior art, the method improves and optimizes the ranking of search results in a search engine and thereby improves retrieval accuracy, giving it good value for wider adoption and application.

Description

Search engine retrieval ranking method based on Bayesian classification learning
Technical field
The present invention relates to the field of computer applications, and specifically to a search engine retrieval ranking method based on Bayesian classification learning.
Background
When a traditional search engine queries its index, it generally scores each indexed document by its similarity to the query statement: the higher the similarity, the higher the document's score. The results are then sorted by score from high to low and returned to the user. Similarity is generally computed by converting the query words and the documents into feature vectors with the TF-IDF method and then scoring the similarity between those vectors. Many specific similarity measures exist, but all of them compare static features of the documents. They therefore handle polysemy and context-dependent query scenarios poorly, and they cannot reflect in a timely way users' search demand for hot query words and hot indexed documents.
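The traditional scoring described above can be sketched as follows. This is a minimal illustration, not the scoring of any particular search engine: the smoothed IDF formula, the cosine measure, the function names (`idf_table`, `tf_idf`, `cosine`), and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def idf_table(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1.0
            for term, count in df.items()}

def tf_idf(tokens, idf):
    """Sparse TF-IDF vector (term -> weight); terms unknown to the corpus are dropped."""
    tf = Counter(tokens)
    return {term: (count / len(tokens)) * idf[term]
            for term, count in tf.items() if term in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical, already tokenized index documents.
docs = {
    "d1": ["bayes", "classifier", "training"],
    "d2": ["search", "engine", "ranking"],
    "d3": ["search", "log", "bayes"],
}
idf = idf_table(list(docs.values()))
query = tf_idf(["search", "ranking"], idf)
scores = {doc_id: cosine(query, tf_idf(tokens, idf)) for doc_id, tokens in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

The score depends only on the text of the query and the documents, which is the static behavior the background criticizes: nothing in it changes when users start clicking a different document for the same query.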
Summary of the invention
The technical task of the present invention is to address the above deficiencies of the prior art by providing a search engine retrieval ranking method based on Bayesian classification learning. The method improves and optimizes the ranking of retrieval results in a search engine, thereby improving retrieval precision and helping users find the results they need more quickly.
The technical task is achieved as follows. In the search engine retrieval ranking method based on Bayesian classification learning, the query statement is treated as an n-dimensional feature vector B = {b1, b2, ..., bn} and each indexed document as a category A. A Bayesian classification algorithm is trained on user search behavior data to build a query word-clicked document classification model. When retrieval results are scored, the similarity score between the query statement and the indexed document's feature vector is combined with the probability value of the category to which the query statement belongs, giving a new score; the retrieval results are then re-sorted by the new score and returned to the retrieval client.
The method is implemented through the following steps:
A. Record the user query log
A log component in the search engine records user query behavior data. Each log entry contains: the user identifier, the query statement, the query time, the number of result documents retrieved, and the identifier of the document the user clicked;
B. Train the Bayesian classification model
The user query behavior data recorded by the log component is parsed entry by entry. Each query statement is segmented into words to obtain an n-dimensional feature vector B = {b1, b2, ..., bn}, where b1 ... bn are the words of the segmented query statement, and the document A that the user clicked is taken as the category. A Bayesian classification algorithm is then trained on the behavior data to compute P(A), P(B), and P(B|A), thereby building the query word-clicked document classification model;
C. Compute the ranking of retrieval results
1. First, using a traditional document feature vector similarity method, search the index with the user's query statement B to obtain the result set doc(n) of the top n indexed documents, each with a document identifier (id) and a score (score);
2. Call the Bayesian learning system to compute the set classfiler(m) of the top m categories to which query statement B belongs, each with a document identifier (id) and a probability value (p);
3. Recompute the score of each document in doc(n) with the formula $\mathrm{Score}_n = \mathrm{Sim}_n + P_n$, where $\mathrm{Sim}_n$ is the similarity score of document n, $P_n$ is the probability value of document n (set $P_n = 0$ if document n does not appear in the classfiler(m) set), and $\mathrm{Score}_n$ is the final score of document n;
4. Re-sort the documents in doc(n) by their final scores $\mathrm{Score}_n$ and return the results to the retrieval client in the new order.
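The additive re-scoring of step C can be sketched as below. The function name `rerank_additive`, the in-memory shapes chosen for doc(n) and classfiler(m), and the sample scores are illustrative assumptions; the source only fixes the rule that the final score is the similarity score plus the probability value, with 0 for documents missing from classfiler(m).

```python
def rerank_additive(doc_results, classifier_results):
    """Re-score and re-sort retrieval results.

    doc_results: list of (doc_id, similarity_score) pairs, i.e. doc(n).
    classifier_results: dict doc_id -> probability, i.e. classfiler(m).
    """
    rescored = [
        # Documents absent from the classifier output contribute a probability of 0.
        (doc_id, sim + classifier_results.get(doc_id, 0.0))
        for doc_id, sim in doc_results
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Hypothetical inputs: three retrieved documents, two of them with click history.
doc_n = [("doc-1", 0.62), ("doc-2", 0.58), ("doc-3", 0.40)]
classifier_m = {"doc-2": 0.30, "doc-3": 0.05}
ranked = rerank_additive(doc_n, classifier_m)
```

In this example doc-2 overtakes doc-1 because its click probability outweighs the small gap in similarity. The source does not say whether the similarity scores are normalized; if they are not on a scale comparable to [0, 1], the probability term may have little effect, so some normalization would likely be needed in practice.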
In step B, the classification model may be trained on a single machine or with distributed computing.
In step B, the trained classification model is stored in a file, a database, or memory.
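As one possible reading of the file storage option, the sketch below writes a trained model to a JSON file and reads it back. The model layout (a `priors` table and per-document `term_counts`), the file name, and the helper names `save_model` and `load_model` are assumptions for illustration; the source does not prescribe a storage format.

```python
import json
import os
import tempfile

# Hypothetical trained model: P(A) per document and raw term counts per document.
model = {
    "priors": {"doc-1": 2 / 3, "doc-2": 1 / 3},
    "term_counts": {
        "doc-1": {"bayes": 2, "classifier": 1, "theorem": 1},
        "doc-2": {"search": 1, "ranking": 1},
    },
}

def save_model(model, path):
    """Serialize the model to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle, ensure_ascii=False)

def load_model(path):
    """Read a model previously written by save_model."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

path = os.path.join(tempfile.mkdtemp(), "click_model.json")
save_model(model, path)
restored = load_model(path)
```

Keeping the loaded dictionary in memory at query time and refreshing the file after each training run would cover two of the three storage options the text lists.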
The final score of document n may also be computed by multiplication, that is, $\mathrm{Score}_n = \mathrm{Sim}_n \times P_n$, where $\mathrm{Sim}_n$ is the similarity score and $P_n$ the probability value of document n. In this case, a document n with $P_n = 0$ can be assigned a standardized probability value (the minimum or the average probability value) so that its score does not become 0.
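A sketch of the multiplicative variant with the fallback probability follows. The function name `rerank_multiplicative`, the `fallback` parameter, and the choice to use 1.0 when no probabilities exist at all are assumptions made for the example; the source only names the minimum and the average probability value as possible substitutes.

```python
from statistics import mean

def rerank_multiplicative(doc_results, classifier_results, fallback="min"):
    """Re-score results as similarity * probability.

    Documents with no (or zero) probability receive a substitute value so that
    their final score does not collapse to 0.
    """
    probabilities = [p for p in classifier_results.values() if p > 0]
    if not probabilities:
        default = 1.0  # assumption: with no click evidence, keep the similarity order
    elif fallback == "min":
        default = min(probabilities)
    else:
        default = mean(probabilities)
    rescored = [
        (doc_id, sim * (classifier_results.get(doc_id) or default))
        for doc_id, sim in doc_results
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Hypothetical inputs: doc-1 was retrieved but has no click history.
doc_n = [("doc-1", 0.62), ("doc-2", 0.58), ("doc-3", 0.40)]
classifier_m = {"doc-2": 0.30, "doc-3": 0.05}
ranked_min = rerank_multiplicative(doc_n, classifier_m, fallback="min")
ranked_avg = rerank_multiplicative(doc_n, classifier_m, fallback="mean")
```

The two fallbacks differ in how harshly they treat documents without click history: the minimum places such a document on the same footing as the least-clicked document, while the average treats it as typical.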
Compared with the prior art, the method of the present invention has the following notable benefits:
(1) By analyzing users' search behavior logs with Bayesian classification learning, the method improves and optimizes the search engine's query results and helps users find the results they need more quickly. According to a statistical analysis of log data, the average number of result-page turns per query can be reduced by 50% after the method is applied;
(2) By building a classification model for each individual user, a more personalized search engine that belongs to the user can also be built, further improving the search experience.
Description of drawings
Figure 1 is a working model diagram of the search engine retrieval ranking method based on Bayesian classification learning of the present invention.
Embodiment
The search engine retrieval ranking method based on Bayesian classification learning of the present invention is described in detail below through a specific embodiment, with reference to the accompanying drawing.
Embodiment:
As shown in the drawing, the search engine retrieval ranking method based on Bayesian classification learning of the present invention takes the query statement as the known condition B and the clicked document as A. It computes the probability P(A|B) that the user clicks document A given query statement B, and then adds this probability to, or multiplies it by, the score obtained from the document feature vector comparison to give the final score of each document.
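The probability P(A|B) follows from Bayes' theorem and the three quantities the training step computes. The factorization over individual query words shown second is the usual naive Bayes independence assumption; the source speaks only of a "Bayesian classification algorithm", so treating the words as conditionally independent is an assumption of this sketch.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad
P(B \mid A) \approx \prod_{i=1}^{n} P(b_i \mid A),
\qquad
P(B) = \sum_{A'} P(B \mid A')\, P(A')
```

For a fixed query B the denominator P(B) is the same for every document, so it does not affect the order of the categories; it matters only when the probability value itself is added to or multiplied with the similarity score.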
The specific implementation steps are as follows:
1. Record the user query log
A log component in the search engine records behavior data such as the user's query requests and clicks on retrieval results. Each log entry contains the user identifier, the query statement, the query time, the number of result documents retrieved, the identifier of the document the user clicked, and so on.
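A log entry with the listed fields could be represented as below. The class name `QueryLogEntry`, the field names, and the tab-separated line layout with an ISO 8601 timestamp are assumptions for illustration; the source lists the fields but not a storage format.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class QueryLogEntry:
    user_id: str            # user identifier
    query: str              # query statement as typed by the user
    query_time: datetime    # when the query was issued
    result_count: int       # number of result documents retrieved
    clicked_doc_id: str     # identifier of the document the user clicked

def parse_log_line(line: str) -> QueryLogEntry:
    """Parse one tab-separated log line (the layout is hypothetical)."""
    user_id, query, query_time, result_count, clicked_doc_id = (
        line.rstrip("\n").split("\t")
    )
    return QueryLogEntry(
        user_id=user_id,
        query=query,
        query_time=datetime.fromisoformat(query_time),
        result_count=int(result_count),
        clicked_doc_id=clicked_doc_id,
    )

entry = parse_log_line("u42\tbayes classifier\t2013-03-15T09:30:00\t120\tdoc-7\n")
```

Only the query statement and the clicked document identifier are needed for training; the user identifier becomes relevant if a separate model is built per user.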
2. Train the Bayesian classification model
The user query behavior data recorded in the log is parsed entry by entry. Each query statement is segmented into words to obtain an n-dimensional feature vector B = {b1, b2, ..., bn}, where b1 ... bn are the words of the segmented query statement, and the document A that the user clicked is taken as the category. A Bayesian classification algorithm is then trained on the behavior data to compute P(A), P(B), and P(B|A), thereby building the "query word-clicked document" classification model. The classification model can be trained on a single machine or with distributed computing. The trained classification model can be stored in a file, a database, or memory.
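The training step can be sketched as simple counting over the click log. This is an illustration under stated assumptions rather than the patented implementation: it uses a naive Bayes model with add-one (Laplace) smoothing, assumes the queries are already segmented into words, and the names `train` and `term_likelihood` as well as the three-entry log are hypothetical.

```python
from collections import Counter, defaultdict

def train(click_log):
    """Collect the counts behind P(A) and P(b_i | A).

    click_log: iterable of (query_terms, clicked_doc_id) pairs.
    """
    doc_counts = Counter()              # clicks per document, used for P(A)
    term_counts = defaultdict(Counter)  # query word counts per clicked document
    vocabulary = set()
    for terms, doc_id in click_log:
        doc_counts[doc_id] += 1
        term_counts[doc_id].update(terms)
        vocabulary.update(terms)
    total_clicks = sum(doc_counts.values())
    priors = {doc_id: count / total_clicks for doc_id, count in doc_counts.items()}
    return priors, term_counts, vocabulary

def term_likelihood(term, doc_id, term_counts, vocabulary):
    """P(b_i | A) with add-one smoothing (the smoothing is an assumption)."""
    counts = term_counts[doc_id]
    return (counts[term] + 1) / (sum(counts.values()) + len(vocabulary))

# Hypothetical segmented click log.
log = [
    (["bayes", "classifier"], "doc-1"),
    (["bayes", "theorem"], "doc-1"),
    (["search", "ranking"], "doc-2"),
]
priors, term_counts, vocabulary = train(log)
```

P(B) is not stored here because it can be obtained at query time by summing P(B|A)P(A) over all documents. Since the model consists only of counts, partial counts from separate log shards can be added together, which is one way the distributed training mentioned in the text could work.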
3. Compute the ranking of retrieval results
1. First, using a traditional document feature vector similarity method, search the index with the user's query statement B to obtain the result set doc(n) of the top n indexed documents, each with a document identifier (id) and a score (score);
2. Call the Bayesian learning system to compute the set classfiler(m) of the top m categories to which query statement B belongs, each with a category (that is, a document identifier, id) and a probability value (p);
3. Recompute the score of each document in doc(n) with the formula $\mathrm{Score}_n = \mathrm{Sim}_n + P_n$, where $\mathrm{Sim}_n$ is the similarity score of document n, $P_n$ is the probability value of document n (set $P_n = 0$ if document n does not appear in the classfiler(m) set), and $\mathrm{Score}_n$ is the final score of document n;
4. Re-sort the documents in doc(n) by their final scores $\mathrm{Score}_n$ and return the results to the retrieval client in the new order.
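The call to the Bayesian learning system that produces classfiler(m) can be sketched as below. The function name `top_m_categories`, the hand-written model counts, and the use of naive Bayes with add-one smoothing are assumptions for illustration; the source does not specify how the top m categories are computed.

```python
import math

def top_m_categories(query_terms, priors, term_counts, vocabulary_size, m):
    """Return the m most probable clicked documents as (doc_id, probability) pairs."""
    if not priors:
        return []
    log_joint = {}
    for doc_id, prior in priors.items():
        counts = term_counts.get(doc_id, {})
        total = sum(counts.values())
        # log P(A) + sum of log P(b_i | A), with add-one smoothing.
        score = math.log(prior)
        for term in query_terms:
            score += math.log((counts.get(term, 0) + 1) / (total + vocabulary_size))
        log_joint[doc_id] = score
    # Normalize by P(B): subtract the maximum first for numerical stability.
    peak = max(log_joint.values())
    weights = {doc_id: math.exp(score - peak) for doc_id, score in log_joint.items()}
    norm = sum(weights.values())
    posterior = {doc_id: weight / norm for doc_id, weight in weights.items()}
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)[:m]

# Hypothetical trained model with a five-word vocabulary.
priors = {"doc-1": 2 / 3, "doc-2": 1 / 3}
term_counts = {
    "doc-1": {"bayes": 2, "classifier": 1, "theorem": 1},
    "doc-2": {"search": 1, "ranking": 1},
}
classifier_m = top_m_categories(["bayes"], priors, term_counts, vocabulary_size=5, m=1)
```

The returned pairs correspond to the id and p fields of classfiler(m); converting them to a dictionary gives a lookup table for the probability value of each document in doc(n) during re-scoring.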
The final score of document n may also be computed by multiplication, that is, $\mathrm{Score}_n = \mathrm{Sim}_n \times P_n$, where $\mathrm{Sim}_n$ is the similarity score and $P_n$ the probability value of document n. In this case, a document n with $P_n = 0$ can be assigned a standardized probability value (the minimum or the average probability value) so that its score does not become 0.

Claims (4)

1. A search engine retrieval ranking method based on Bayesian classification learning, characterized in that:
the query statement is treated as an n-dimensional feature vector B = {b1, b2, ..., bn} and each indexed document as a category A, and a Bayesian classification algorithm is trained on user search behavior data to build a query word-clicked document classification model;
when retrieval results are scored, the similarity score between the query statement and the indexed document's feature vector is combined with the probability value of the category to which the query statement belongs to obtain a new score, and the retrieval results are re-sorted by the new score and returned to the retrieval client.
2. The method according to claim 1, characterized in that it comprises the following steps:
A. Record the user query log
A log component in the search engine records user query behavior data. Each log entry contains: the user identifier, the query statement, the query time, the number of result documents retrieved, and the identifier of the document the user clicked;
B. Train the Bayesian classification model
The user query behavior data recorded by the log component is parsed entry by entry. Each query statement is segmented into words to obtain an n-dimensional feature vector B = {b1, b2, ..., bn}, where b1 ... bn are the words of the segmented query statement, and the document A that the user clicked is taken as the category. A Bayesian classification algorithm is then trained on the behavior data to compute P(A), P(B), and P(B|A), thereby building the query word-clicked document classification model;
C. Compute the ranking of retrieval results
First, using a traditional document feature vector similarity method, search the index with the user's query statement B to obtain the result set doc(n) of the top n indexed documents, each with a document identifier (id) and a score (score);
Call the Bayesian learning system to compute the set classfiler(m) of the top m categories to which query statement B belongs, each with a document identifier (id) and a probability value (p);
Recompute the score of each document in doc(n) with the formula $\mathrm{Score}_n = \mathrm{Sim}_n + P_n$, where $\mathrm{Sim}_n$ is the similarity score of document n, $P_n$ is the probability value of document n (set $P_n = 0$ if document n does not appear in the classfiler(m) set), and $\mathrm{Score}_n$ is the final score of document n;
Re-sort the documents in doc(n) by their final scores $\mathrm{Score}_n$ and return the results to the retrieval client in the new order.
3. The method according to claim 2, characterized in that in step B the classification model is trained on a single machine or with distributed computing.
4. The method according to claim 2, characterized in that in step B the trained classification model is stored in a file, a database, or memory.
CN2013100831513A 2013-03-15 2013-03-15 Search engine retrieving ordering method based on Bayesian classification learning Pending CN103123653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013100831513A CN103123653A (en) 2013-03-15 2013-03-15 Search engine retrieving ordering method based on Bayesian classification learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013100831513A CN103123653A (en) 2013-03-15 2013-03-15 Search engine retrieving ordering method based on Bayesian classification learning

Publications (1)

Publication Number Publication Date
CN103123653A (en) 2013-05-29

Family

ID=48454629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013100831513A Pending CN103123653A (en) 2013-03-15 2013-03-15 Search engine retrieving ordering method based on Bayesian classification learning

Country Status (1)

Country Link
CN (1) CN103123653A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440242A (en) * 2013-06-26 2013-12-11 北京亿赞普网络技术有限公司 User search behavior-based personalized recommendation method and system
CN104050240A (en) * 2014-05-26 2014-09-17 北京奇虎科技有限公司 Method and device for determining categorical attribute of search query word
CN106294363A (en) * 2015-05-15 2017-01-04 厦门美柚信息科技有限公司 A kind of forum postings evaluation methodology, Apparatus and system
CN106354856A (en) * 2016-09-05 2017-01-25 北京百度网讯科技有限公司 Enhanced deep neural network search method and device based on artificial intelligence
CN106649515A (en) * 2016-10-17 2017-05-10 中国电子技术标准化研究院 Real-time micro-blog classifier based on multiple search models
CN107092626A (en) * 2015-12-31 2017-08-25 达索系统公司 The retrieval of the result of precomputation model
CN107092681A (en) * 2017-04-21 2017-08-25 安徽富驰信息技术有限公司 A kind of judicial retrieval result based on user behavior feature learns sort method automatically
CN107977452A (en) * 2017-12-15 2018-05-01 金陵科技学院 A kind of information retrieval system and method based on big data
CN108804408A (en) * 2017-04-27 2018-11-13 安徽富驰信息技术有限公司 Information extraction system based on domain-specialist knowledge system and information extraction method
WO2019233117A1 (en) * 2018-06-06 2019-12-12 众安信息技术服务有限公司 Routing method, device and equipment for on-line analytical processing engine
CN111061836A (en) * 2019-12-18 2020-04-24 焦点科技股份有限公司 Custom scoring method suitable for Lucene full-text search engine
CN111753048A (en) * 2020-05-21 2020-10-09 高新兴科技集团股份有限公司 Document retrieval method, device, equipment and storage medium
CN111859138A (en) * 2020-07-27 2020-10-30 小红书科技有限公司 Searching method and device
CN111919208A (en) * 2019-01-25 2020-11-10 微软技术许可有限责任公司 Scoring documents in document retrieval
CN113254933A (en) * 2021-06-24 2021-08-13 福建省海峡信息技术有限公司 Deep learning sequencing model-based user behavior data auditing method and system
CN115017200A (en) * 2022-06-02 2022-09-06 北京百度网讯科技有限公司 Search result sorting method and device, electronic equipment and storage medium
CN117909491A (en) * 2024-03-18 2024-04-19 中国标准化研究院 Document metadata analysis method and system based on Bayesian network

Citations (4)

Publication number Priority date Publication date Assignee Title
CN1996316A (en) * 2007-01-09 2007-07-11 天津大学 Search engine searching method based on web page correlation
CN101038596A (en) * 2007-04-29 2007-09-19 北京搜狗科技发展有限公司 Method and system for classifying website
US8005767B1 (en) * 2007-02-13 2011-08-23 The United States Of America As Represented By The Secretary Of The Navy System and method of classifying events
CN102419755A (en) * 2010-09-28 2012-04-18 阿里巴巴集团控股有限公司 Method and device for sorting search results

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN1996316A (en) * 2007-01-09 2007-07-11 天津大学 Search engine searching method based on web page correlation
US8005767B1 (en) * 2007-02-13 2011-08-23 The United States Of America As Represented By The Secretary Of The Navy System and method of classifying events
CN101038596A (en) * 2007-04-29 2007-09-19 北京搜狗科技发展有限公司 Method and system for classifying website
CN102419755A (en) * 2010-09-28 2012-04-18 阿里巴巴集团控股有限公司 Method and device for sorting search results

Non-Patent Citations (3)

Title
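Yang Jianwu (杨建武): "Text similarity search based on an inverted index", Computer Engineering (《计算机工程》) *
Chen Hongtao (陈红涛): "Research and application of user behavior based on search logs", Wanfang Data, Dissertations, Computer Application Technology (《万方数据-学位首页-计算机应用技术》) *
Ma Yao (马尧): "Design and implementation of a personalized social search engine based on multi-dimensional user feature modeling", China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库-信息科技辑》) *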

Cited By (22)

Publication number Priority date Publication date Assignee Title
CN103440242A (en) * 2013-06-26 2013-12-11 北京亿赞普网络技术有限公司 User search behavior-based personalized recommendation method and system
CN104050240A (en) * 2014-05-26 2014-09-17 北京奇虎科技有限公司 Method and device for determining categorical attribute of search query word
CN106294363A (en) * 2015-05-15 2017-01-04 厦门美柚信息科技有限公司 A kind of forum postings evaluation methodology, Apparatus and system
CN107092626A (en) * 2015-12-31 2017-08-25 达索系统公司 The retrieval of the result of precomputation model
CN106354856A (en) * 2016-09-05 2017-01-25 北京百度网讯科技有限公司 Enhanced deep neural network search method and device based on artificial intelligence
CN106649515A (en) * 2016-10-17 2017-05-10 中国电子技术标准化研究院 Real-time micro-blog classifier based on multiple search models
CN107092681A (en) * 2017-04-21 2017-08-25 安徽富驰信息技术有限公司 A kind of judicial retrieval result based on user behavior feature learns sort method automatically
CN108804408A (en) * 2017-04-27 2018-11-13 安徽富驰信息技术有限公司 Information extraction system based on domain-specialist knowledge system and information extraction method
CN107977452A (en) * 2017-12-15 2018-05-01 金陵科技学院 A kind of information retrieval system and method based on big data
WO2019233117A1 (en) * 2018-06-06 2019-12-12 众安信息技术服务有限公司 Routing method, device and equipment for on-line analytical processing engine
CN111919208A (en) * 2019-01-25 2020-11-10 微软技术许可有限责任公司 Scoring documents in document retrieval
CN111061836A (en) * 2019-12-18 2020-04-24 焦点科技股份有限公司 Custom scoring method suitable for Lucene full-text search engine
CN111061836B (en) * 2019-12-18 2022-07-22 焦点科技股份有限公司 Custom scoring method suitable for Lucene full-text retrieval engine
CN111753048A (en) * 2020-05-21 2020-10-09 高新兴科技集团股份有限公司 Document retrieval method, device, equipment and storage medium
CN111753048B (en) * 2020-05-21 2024-02-02 高新兴科技集团股份有限公司 Document retrieval method, device, equipment and storage medium
CN111859138A (en) * 2020-07-27 2020-10-30 小红书科技有限公司 Searching method and device
CN111859138B (en) * 2020-07-27 2024-05-14 小红书科技有限公司 Searching method and device
CN113254933A (en) * 2021-06-24 2021-08-13 福建省海峡信息技术有限公司 Deep learning sequencing model-based user behavior data auditing method and system
CN115017200A (en) * 2022-06-02 2022-09-06 北京百度网讯科技有限公司 Search result sorting method and device, electronic equipment and storage medium
CN115017200B (en) * 2022-06-02 2023-08-25 北京百度网讯科技有限公司 Method and device for sorting search results, electronic equipment and storage medium
CN117909491A (en) * 2024-03-18 2024-04-19 中国标准化研究院 Document metadata analysis method and system based on Bayesian network
CN117909491B (en) * 2024-03-18 2024-05-14 中国标准化研究院 Document metadata analysis method and system based on Bayesian network

Similar Documents

Publication Publication Date Title
CN103123653A (en) Search engine retrieving ordering method based on Bayesian classification learning
WO2020108608A1 (en) Search result processing method, device, terminal, electronic device, and storage medium
KR102080362B1 (en) Query expansion
CN104685501B (en) Text vocabulary is identified in response to visual query
US9589208B2 (en) Retrieval of similar images to a query image
US10366093B2 (en) Query result bottom retrieval method and apparatus
CN107862070B (en) Online classroom discussion short text instant grouping method and system based on text clustering
CN103605658B (en) A kind of search engine system analyzed based on text emotion
CN104252456B (en) A kind of weight method of estimation, apparatus and system
CN104834693A (en) Depth-search-based visual image searching method and system thereof
CN104156433B (en) Image retrieval method based on semantic mapping space construction
CN104199965A (en) Semantic information retrieval method
CN107590128B (en) Paper homonymy author disambiguation method based on high-confidence characteristic attribute hierarchical clustering method
CN101174273A (en) News event detecting method based on metadata analysis
CN109564573A (en) Platform from computer application metadata supports cluster
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN105005590B (en) A kind of generation method of the interim abstract of the special topic of information media
CN103218436A (en) Similar problem retrieving method fusing user category labels and device thereof
CN107895303B (en) Personalized recommendation method based on OCEAN model
CN102693316B (en) Linear generalization regression model based cross-media retrieval method
CN103823906A (en) Multi-dimension searching sequencing optimization algorithm and tool based on microblog data
CN106649605B (en) Method and device for triggering promotion keywords
CN103729365A (en) Searching method and system
CN103778206A (en) Method for providing network service resources
CN103744958B (en) A kind of Web page classification method based on Distributed Calculation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130529
