CN103123653A - Search engine retrieving ordering method based on Bayesian classification learning - Google Patents
Search engine retrieving ordering method based on Bayesian classification learning
- Publication number
- CN103123653A
- Authority
- CN
- China
- Prior art keywords
- document
- search engine
- user
- query statement
- query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a search engine retrieval ranking method based on Bayesian classification learning and belongs to the field of computer applications. In the method, a query statement is treated as an n-dimensional feature vector B = {b1, b2, ..., bn} and an indexed document is treated as a category A; a Bayesian classification algorithm is then trained on user search behavior data to build a query word-click document classification model. When retrieval results are scored, the similarity score between the query statement and the indexed document's feature vector is combined with the probability value of the category to which the query statement belongs, yielding a new score; the retrieval results are then re-sorted by the new score and returned to the retrieval client. Compared with the prior art, the method improves and optimizes the ordering of retrieval results in a search engine, thereby improving retrieval accuracy, and has high value for popularization and application.
Description
Technical field
The present invention relates to the field of computer applications, and specifically to a search engine retrieval ranking method based on Bayesian classification learning.
Background art
When searching an index, a traditional search engine generally scores each indexed document by its similarity to the query statement: the more similar the document, the higher its score. The results are sorted by score from high to low and returned to the user. Similarity is usually computed by converting the query words and the documents into feature vectors with the TF-IDF method and then scoring the similarity between those vectors. Many specific similarity measures exist, but all of them compare static features of the documents. They therefore handle polysemy and context-dependent queries poorly, and they cannot promptly reflect users' demand for trending query words and trending indexed documents.
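The traditional scoring described above can be illustrated with a minimal sketch. It assumes pre-tokenized documents, a smoothed IDF, and cosine similarity as the vector comparison; the function names `tfidf_vectors`, `cosine`, and `rank` are hypothetical and do not come from the patent, which does not fix a particular similarity measure.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return ({term: weight} vector per tokenized document, idf table)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption of this sketch) so no weight is zero.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Score every document against the query and sort from high to low."""
    vectors, idf = tfidf_vectors(docs)
    qtf = Counter(query)
    # Query terms unseen in the index get weight 0.
    qvec = {t: (qtf[t] / len(query)) * idf.get(t, 0.0) for t in qtf}
    scored = [(i, cosine(qvec, vec)) for i, vec in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the score depends only on term statistics of the documents, two users issuing the same query always receive the same order, which is the limitation the background section points out.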
Summary of the invention
The technical task of the present invention is to address the above deficiencies of the prior art by providing a search engine retrieval ranking method based on Bayesian classification learning. The method improves and optimizes the ordering of retrieval results in a search engine, thereby improving the retrieval precision of the search engine and helping users find the results they need more quickly.
The technical task is accomplished as follows. The search engine retrieval ranking method based on Bayesian classification learning is characterized in that a query statement is treated as an n-dimensional feature vector B = {b1, b2, ..., bn} and an indexed document is treated as a category A; a Bayesian classification algorithm is trained on user search behavior data to build a query word-click document classification model. When retrieval results are scored, the similarity score between the query statement and the indexed document's feature vector is combined with the probability value of the category to obtain a new score, and the retrieval results are re-sorted by the new score and returned to the retrieval client.
The method is implemented through the following specific steps:
A. Record the user query log
A log component in the search engine records user query behavior data. Each log entry includes: the user ID, the query statement, the query time, the number of result documents retrieved, and the identifier of the document the user clicked;
B. Train the Bayesian classification model
Parse the user query behavior data recorded by the log component entry by entry. Segment each query statement into words to obtain the n-dimensional feature vector B = {b1, b2, ..., bn}, where b1 ... bn are the words of the segmented query statement, and take the document A that the user clicked as the category. Then train on the behavior data with a Bayesian classification algorithm, computing P(A), P(B), and P(B|A), to build the query word-click document classification model;
C. Compute the ranking of the retrieval results
1. First, using a traditional document feature vector similarity method, search the index with the user's query statement B to obtain the result set doc(n) of the top n indexed documents, each with a document identifier (id) and a score (score);
2. Call the Bayesian learning system to compute the set classfiler(m) of the top m categories to which query statement B belongs, each with a document identifier (id) and a probability value (p);
3. Recompute the score of each document in doc(n) with the formula $F_n = \mathrm{score}_n + p_n$, where $\mathrm{score}_n$ is the similarity score of document n, $p_n$ is the probability value of document n (set $p_n = 0$ if document n does not appear in the set classfiler(m)), and $F_n$ is the final score of document n;
4. Re-sort the documents in doc(n) by their final scores $F_n$ and return the new ranking to the retrieval client.
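Sub-steps 3 and 4 of step C can be sketched as follows. This is a minimal illustration that assumes doc(n) and classfiler(m) are passed in as lists of (id, value) pairs; the function name `rerank_additive` is hypothetical.

```python
def rerank_additive(doc_results, classifier_results):
    """Combine similarity scores and class probabilities by addition.

    doc_results:        list of (doc_id, score) pairs, i.e. doc(n)
    classifier_results: list of (doc_id, p) pairs, i.e. classfiler(m)
    Returns (doc_id, final_score) pairs sorted from high to low.
    """
    prob = dict(classifier_results)
    # A document missing from classfiler(m) gets probability 0.
    rescored = [(doc_id, score + prob.get(doc_id, 0.0))
                for doc_id, score in doc_results]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

For example, with similarity scores d1 = 0.9, d2 = 0.8, d3 = 0.5 and click probabilities d3 = 0.6, d2 = 0.05, document d3 moves from last to first place.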
In step B, the classification model may be trained on a single machine or by distributed computing.
In step B, the trained classification model may be stored in a file, a database, or memory.
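The logging and training steps, together with file storage of the model, might look like the sketch below. It assumes single-machine training, whitespace tokenization as a stand-in for real word segmentation, and a count-based model serialized as JSON; `LogRecord`, `tokenize`, `train`, `save_model`, and `load_model` are hypothetical names, and the patent does not prescribe this model layout.

```python
import json
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class LogRecord:
    """One query-log entry with the fields listed in the logging step."""
    user_id: str
    query: str
    query_time: str
    result_count: int
    clicked_doc_id: str


def tokenize(query):
    # Placeholder segmentation (assumption): lowercase whitespace split.
    return query.lower().split()


def train(records):
    """Collect the counts from which P(A) and P(b_i | A) can be derived."""
    doc_clicks = Counter()              # clicks per document
    word_counts = defaultdict(Counter)  # query-word counts per clicked document
    vocab = set()
    for record in records:
        words = tokenize(record.query)
        doc_clicks[record.clicked_doc_id] += 1
        word_counts[record.clicked_doc_id].update(words)
        vocab.update(words)
    return {
        "total_clicks": sum(doc_clicks.values()),
        "doc_clicks": dict(doc_clicks),
        "word_counts": {doc: dict(c) for doc, c in word_counts.items()},
        "vocab": sorted(vocab),
    }


def save_model(model, path):
    """Store the trained model in a file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(model, handle, ensure_ascii=False)


def load_model(path):
    """Load a model previously written by save_model."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Storing raw counts rather than probabilities keeps the model easy to update incrementally as new log entries arrive; the same dictionary could equally be kept in memory or written to a database table.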
When computing the final score $F_n$ of document n, multiplication may be used instead, that is, $F_n = \mathrm{score}_n \times p_n$, where $\mathrm{score}_n$ is the similarity score and $p_n$ the probability value of document n. In this case, a document n with $p_n = 0$ may be assigned a standardized probability value (the minimum probability value or the average probability value) so that its score does not become 0.
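The multiplicative variant can be sketched in the same style. The `fallback` parameter and the function name `rerank_multiplicative` are hypothetical, and using a factor of 1.0 when the classifier returns nothing is an assumption of this sketch, not something the patent specifies.

```python
def rerank_multiplicative(doc_results, classifier_results, fallback="min"):
    """Combine similarity scores and class probabilities by multiplication.

    Documents absent from classifier_results receive a standardized
    probability (minimum or mean of the returned probabilities) so that
    their final score is not forced to 0.
    """
    prob = dict(classifier_results)
    if not prob:
        default = 1.0  # assumption: keep similarity scores unchanged
    elif fallback == "min":
        default = min(prob.values())
    elif fallback == "mean":
        default = sum(prob.values()) / len(prob)
    else:
        raise ValueError("fallback must be 'min' or 'mean'")
    rescored = [(doc_id, score * prob.get(doc_id, default))
                for doc_id, score in doc_results]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

The choice of fallback changes the outcome: the minimum keeps unclicked documents below every clicked one with a comparable similarity score, while the mean lets a document with a high similarity score stay near the top.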
Compared with the prior art, the method of the invention has the following notable advantages:
(1) By analyzing users' search behavior logs and applying Bayesian classification learning, the method improves and optimizes the query results of the search engine and helps users find the results they need more quickly. Statistical analysis of log data shows that, with this method, the average number of result-page turns per query can be reduced by 50%;
(2) By building a classification model for each individual user, a more personalized search engine tailored to that user can be built, further improving the search experience.
Description of drawings
Figure 1 is a working model diagram of the search engine retrieval ranking method based on Bayesian classification learning of the present invention.
Embodiment
The search engine retrieval ranking method based on Bayesian classification learning of the present invention is described in detail below through a specific embodiment and with reference to the accompanying drawing.
Embodiment:
As shown in the drawing, the search engine retrieval ranking method based on Bayesian classification learning of the present invention treats the query statement as the known condition B and the clicked document as A. It computes the probability that the user clicks document A given query statement B, and then adds this probability value P(A|B) to, or multiplies it by, the score obtained from the document feature vector comparison to produce the final score of each document.
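The probability P(A|B) follows from Bayes' theorem applied to the quantities P(A), P(B), and P(B|A) that the training step computes. The factorization of P(B|A) over individual query words shown below assumes a naive Bayes model, in which the words are conditionally independent given the clicked document; the patent only says "Bayesian classification algorithm", so this is one plausible reading rather than the stated method.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad
P(B \mid A) \approx \prod_{i=1}^{n} P(b_i \mid A),
\qquad
P(B) = \sum_{A'} P(B \mid A')\, P(A')
```

Here P(A) is the share of all logged clicks that went to document A, and P(b_i | A) is the relative frequency of word b_i among the queries that led to a click on A.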
The specific implementation steps are as follows:
1. Record the user query log
A log component in the search engine records behavior data such as users' query requests and their clicks on retrieval results. Each log entry includes the user ID, the query statement, the query time, the number of result documents retrieved, the identifier of the document the user clicked, and so on.
2. Train the Bayesian classification model
Parse the user query behavior data recorded in the log entry by entry. Segment each query statement into words to obtain the n-dimensional feature vector B = {b1, b2, ..., bn}, where b1 ... bn are the words of the segmented query statement, and take the document A that the user clicked as the category. Then train on the behavior data with a Bayesian classification algorithm, computing P(A), P(B), and P(B|A), to build the "query word-click document" classification model. The classification model may be trained on a single machine or by distributed computing. The trained classification model may be stored in a file, a database, or memory.
3. Compute the ranking of the retrieval results
1. First, using a traditional document feature vector similarity method, search the index with the user's query statement B to obtain the result set doc(n) of the top n indexed documents, each with a document identifier (id) and a score (score);
2. Call the Bayesian learning system to compute the set classfiler(m) of the top m categories to which query statement B belongs, each with a category (that is, a document identifier, id) and a probability value (p);
3. Recompute the score of each document in doc(n) with the formula $F_n = \mathrm{score}_n + p_n$, where $\mathrm{score}_n$ is the similarity score of document n, $p_n$ is the probability value of document n (set $p_n = 0$ if document n does not appear in the set classfiler(m)), and $F_n$ is the final score of document n;
4. Re-sort the documents in doc(n) by their final scores $F_n$ and return the new ranking to the retrieval client.
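The call to the Bayesian learning system in the ranking step, which produces classfiler(m), could be sketched as below. The sketch assumes a naive Bayes model built from click counts with Laplace smoothing (`alpha`); the function name `top_m_classes` and its parameters are hypothetical, and the patent does not specify the smoothing or the exact form of the classifier.

```python
import math


def top_m_classes(query_words, doc_clicks, word_counts, m, alpha=1.0):
    """Return the m most probable clicked documents as (doc_id, p) pairs.

    doc_clicks:  {doc_id: number of logged clicks}
    word_counts: {doc_id: {word: count in queries that led to a click}}
    """
    vocab = {w for counts in word_counts.values() for w in counts}
    vocab.update(query_words)
    total_clicks = sum(doc_clicks.values())
    log_joint = {}
    for doc_id, clicks in doc_clicks.items():
        counts = word_counts.get(doc_id, {})
        denom = sum(counts.values()) + alpha * len(vocab)
        log_p = math.log(clicks / total_clicks)  # log P(A)
        for word in query_words:
            # log P(b_i | A) with Laplace smoothing
            log_p += math.log((counts.get(word, 0) + alpha) / denom)
        log_joint[doc_id] = log_p
    if not log_joint:
        return []
    # Normalize by P(B) = sum over A of P(B|A) P(A), in a numerically safe way.
    peak = max(log_joint.values())
    unnorm = {d: math.exp(lp - peak) for d, lp in log_joint.items()}
    total = sum(unnorm.values())
    ranked = sorted(((d, v / total) for d, v in unnorm.items()),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:m]
```

The returned pairs have the same shape as classfiler(m) in the steps above, so they can be fed directly into the score recomputation.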
Claims (4)
1. A search engine retrieval ranking method based on Bayesian classification learning, characterized in that:
a query statement is treated as an n-dimensional feature vector B = {b1, b2, ..., bn} and an indexed document is treated as a category A, and a Bayesian classification algorithm is trained on user search behavior data to build a query word-click document classification model;
when retrieval results are scored, the similarity score between the query statement and the indexed document's feature vector is combined with the probability value of the category to obtain a new score, and the retrieval results are re-sorted by the new score and returned to the retrieval client.
2. The method according to claim 1, characterized by comprising the following steps:
A. Record the user query log
A log component in the search engine records user query behavior data. Each log entry includes: the user ID, the query statement, the query time, the number of result documents retrieved, and the identifier of the document the user clicked;
B. Train the Bayesian classification model
Parse the user query behavior data recorded by the log component entry by entry. Segment each query statement into words to obtain the n-dimensional feature vector B = {b1, b2, ..., bn}, where b1 ... bn are the words of the segmented query statement, and take the document A that the user clicked as the category. Then train on the behavior data with a Bayesian classification algorithm, computing P(A), P(B), and P(B|A), to build the query word-click document classification model;
C. Compute the ranking of the retrieval results
First, using a traditional document feature vector similarity method, search the index with the user's query statement B to obtain the result set doc(n) of the top n indexed documents, each with a document identifier (id) and a score (score);
Call the Bayesian learning system to compute the set classfiler(m) of the top m categories to which query statement B belongs, each with a document identifier (id) and a probability value (p);
Recompute the score of each document in doc(n) with the formula $F_n = \mathrm{score}_n + p_n$, where $\mathrm{score}_n$ is the similarity score of document n, $p_n$ is the probability value of document n (set $p_n = 0$ if document n does not appear in the set classfiler(m)), and $F_n$ is the final score of document n.
3. The method according to claim 2, characterized in that, in step B, the classification model is trained on a single machine or by distributed computing.
4. The method according to claim 2, characterized in that, in step B, the trained classification model is stored in a file, a database, or memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013100831513A CN103123653A (en) | 2013-03-15 | 2013-03-15 | Search engine retrieving ordering method based on Bayesian classification learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013100831513A CN103123653A (en) | 2013-03-15 | 2013-03-15 | Search engine retrieving ordering method based on Bayesian classification learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103123653A true CN103123653A (en) | 2013-05-29 |
Family
ID=48454629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2013100831513A Pending CN103123653A (en) | 2013-03-15 | 2013-03-15 | Search engine retrieving ordering method based on Bayesian classification learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103123653A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1996316A (en) * | 2007-01-09 | 2007-07-11 | 天津大学 | Search engine searching method based on web page correlation |
CN101038596A (en) * | 2007-04-29 | 2007-09-19 | 北京搜狗科技发展有限公司 | Method and system for classifying website |
US8005767B1 (en) * | 2007-02-13 | 2011-08-23 | The United States Of America As Represented By The Secretary Of The Navy | System and method of classifying events |
CN102419755A (en) * | 2010-09-28 | 2012-04-18 | 阿里巴巴集团控股有限公司 | Method and device for sorting search results |
Non-Patent Citations (3)
Title |
---|
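Yang Jianwu: "Text similarity search based on inverted index", Computer Engineering * |
Chen Hongtao: "Research and application of user behavior based on search logs", Wanfang Data, Dissertations, Computer Application Technology * |
Ma Yao: "Design and implementation of a personalized social search engine based on multi-dimensional user feature modeling", China Master's Theses Full-text Database, Information Science and Technology * |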
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103440242A (en) * | 2013-06-26 | 2013-12-11 | 北京亿赞普网络技术有限公司 | User search behavior-based personalized recommendation method and system |
CN104050240A (en) * | 2014-05-26 | 2014-09-17 | 北京奇虎科技有限公司 | Method and device for determining categorical attribute of search query word |
CN106294363A (en) * | 2015-05-15 | 2017-01-04 | 厦门美柚信息科技有限公司 | A kind of forum postings evaluation methodology, Apparatus and system |
CN107092626A (en) * | 2015-12-31 | 2017-08-25 | 达索系统公司 | The retrieval of the result of precomputation model |
CN106354856A (en) * | 2016-09-05 | 2017-01-25 | 北京百度网讯科技有限公司 | Enhanced deep neural network search method and device based on artificial intelligence |
CN106649515A (en) * | 2016-10-17 | 2017-05-10 | 中国电子技术标准化研究院 | Real-time micro-blog classifier based on multiple search models |
CN107092681A (en) * | 2017-04-21 | 2017-08-25 | 安徽富驰信息技术有限公司 | A kind of judicial retrieval result based on user behavior feature learns sort method automatically |
CN108804408A (en) * | 2017-04-27 | 2018-11-13 | 安徽富驰信息技术有限公司 | Information extraction system based on domain-specialist knowledge system and information extraction method |
CN107977452A (en) * | 2017-12-15 | 2018-05-01 | 金陵科技学院 | A kind of information retrieval system and method based on big data |
WO2019233117A1 (en) * | 2018-06-06 | 2019-12-12 | 众安信息技术服务有限公司 | Routing method, device and equipment for on-line analytical processing engine |
CN111919208A (en) * | 2019-01-25 | 2020-11-10 | 微软技术许可有限责任公司 | Scoring documents in document retrieval |
CN111061836A (en) * | 2019-12-18 | 2020-04-24 | 焦点科技股份有限公司 | Custom scoring method suitable for Lucene full-text search engine |
CN111061836B (en) * | 2019-12-18 | 2022-07-22 | 焦点科技股份有限公司 | Custom scoring method suitable for Lucene full-text retrieval engine |
CN111753048A (en) * | 2020-05-21 | 2020-10-09 | 高新兴科技集团股份有限公司 | Document retrieval method, device, equipment and storage medium |
CN111753048B (en) * | 2020-05-21 | 2024-02-02 | 高新兴科技集团股份有限公司 | Document retrieval method, device, equipment and storage medium |
CN111859138A (en) * | 2020-07-27 | 2020-10-30 | 小红书科技有限公司 | Searching method and device |
CN111859138B (en) * | 2020-07-27 | 2024-05-14 | 小红书科技有限公司 | Searching method and device |
CN113254933A (en) * | 2021-06-24 | 2021-08-13 | 福建省海峡信息技术有限公司 | Deep learning sequencing model-based user behavior data auditing method and system |
CN115017200A (en) * | 2022-06-02 | 2022-09-06 | 北京百度网讯科技有限公司 | Search result sorting method and device, electronic equipment and storage medium |
CN115017200B (en) * | 2022-06-02 | 2023-08-25 | 北京百度网讯科技有限公司 | Method and device for sorting search results, electronic equipment and storage medium |
CN117909491A (en) * | 2024-03-18 | 2024-04-19 | 中国标准化研究院 | Document metadata analysis method and system based on Bayesian network |
CN117909491B (en) * | 2024-03-18 | 2024-05-14 | 中国标准化研究院 | Document metadata analysis method and system based on Bayesian network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20130529 |