CN111831936A - Information retrieval result sorting method, computer equipment and storage medium - Google Patents
Information retrieval result sorting method, computer equipment and storage medium
- Publication number
- CN111831936A (application CN202010656908.3A)
- Authority
- CN
- China
- Prior art keywords
- document
- query
- ranking
- model
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9538—Presentation of query results
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of information retrieval and provides an information retrieval result ranking method comprising the steps of annotating training data, extracting text features, and training a learning function. Its beneficial effect is that query-related information, such as click data, web-page anchor text, and PageRank scores, is merged into a feature model, and a ranking model is automatically constructed using learning-to-rank technology; the method therefore has broad application prospects in information retrieval, natural language processing, data mining, and related fields.
Description
Technical Field
The present invention relates to the field of information retrieval technologies, and in particular, to a method for sorting information retrieval results, a computer device, and a storage medium.
Background
In the field of information retrieval, traditional ranking methods are realized by constructing a ranking function, and sorting is generally carried out according to relevance. Typically, a query submitted to a search engine returns a set of relevant documents, which are then ranked according to the relevance between the query keywords and each document and returned to the user.
As the number of factors affecting relevance grows, traditional ranking methods become difficult to apply. They struggle to fuse multiple kinds of information: for example, the vector space model builds a relevance function using tf·idf weights, which makes other information hard to exploit; and when a model has many parameters, tuning them is difficult and overfitting is likely to occur.
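The vector-space baseline criticized above can be made concrete with a short sketch. The toy corpus, the query, and the raw tf·idf weighting below are illustrative assumptions only; they show why a single hand-built relevance function leaves no room for other signals such as clicks or PageRank:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf*idf vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs], idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical corpus and query
docs = [["ranking", "model", "search"],
        ["cat", "dog"],
        ["search", "engine", "ranking"]]
vecs, idf = tfidf_vectors(docs)
query_vec = {t: idf.get(t, 0.0) for t in ["search", "ranking"]}

# The only ranking signal available is the single tf*idf relevance function
order = sorted(range(len(docs)), key=lambda i: -cosine(query_vec, vecs[i]))
```

The off-topic document ends up last, but any additional evidence about the documents would have to be bolted onto this one hand-tuned function, which is exactly the limitation the learning-to-rank approach addresses.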
Based on this, the application provides an information retrieval result sorting method, a computer device and a storage medium.
Disclosure of Invention
An embodiment of the present invention provides an information retrieval result sorting method, a computer device, and a storage medium, and aims to solve the technical problems in the background art.
The embodiment of the invention is realized in such a way that the information retrieval result ordering method comprises the following steps:
annotating training data
Searching for documents relevant to the query and ranking the documents according to relevance;
text feature extraction
Determining the features of a document, converting the document into a feature vector, and forming a training example containing the feature vector and the relevance label;
training learning functions
Define Q = {q_1, q_2, …, q_m} as the query set, where q_i is the i-th query; D = {D_1, D_2, …, D_m} is the set of document collections associated with Q, where D_i = {d_{i,1}, d_{i,2}, …, d_{i,n_i}} is the document set related to query q_i and d_{i,j} is the j-th document in D_i; Y = {1, 2, …, l} is the set of relevance grades, and Y_i = {y_{i,1}, y_{i,2}, …, y_{i,n_i}} is the set of relevance labels of the documents related to query q_i;
From this, the original training set S = {(q_i, D_i, Y_i)}, i = 1, 2, …, m, is obtained. A feature vector x_{i,j} = φ(q_i, d_{i,j}) is generated for each query-document pair (q_i, d_{i,j}), i = 1, 2, …, m; j = 1, 2, …, n_i, where φ is a feature function;
Let X_i = {x_{i,1}, x_{i,2}, …, x_{i,n_i}} and set the training data set S' = {(X_i, Y_i)}, i = 1, 2, …, m. A local ranking model f(q, d) = f(x) is trained to assign a score to a given query-document pair (q, d), and a score list is output for the training data set S';
The documents in D_i are ranked by the score list. A ranking list π_i is defined as a bijective mapping on the document indices: π_i represents a permutation of all documents in D_i, and π_i(j) denotes the position of the j-th document in π_i. Ranking with f(q_i, d_i) selects, for query q_i, a permutation π_i ∈ Π_i of the document set D_i;
For a test set containing new queries and new documents, feature vectors x_{m+1} are created and scored with the trained ranking model; ranking by score yields π_{m+1};
Evaluating the performance of the ranking model;
Evaluating the model with the MAP (mean average precision) metric.
As a further technical scheme of the invention, the performance evaluation of the ranking model comprises the following steps: the ranking list output by the ranking model is compared with the ranking list given as ground truth. Given a query q_i and its related documents D_i, let π_i be a ranking list of D_i and y_i the relevance labels of D_i, measured by DCG; the DCG at position k is then:
DCG(k) = Σ_{j: π_i(j) ≤ k} G(y_{i,j}) · D(π_i(j))
where G is a gain function, D is a position discount function (commonly G(y) = 2^y - 1 and D(r) = 1/log_2(1 + r)), and π_i(j) is the position of d_{i,j} in π_i.
As a further technical scheme of the invention, the steps of evaluating the model with the MAP metric are:
Relevance is divided into two levels, 1 and 0. Given a query q_i, its related documents D_i, a ranking list π_i of D_i, and the relevance labels y_i of D_i, the average precision of query q_i is:
AP(q_i) = ( Σ_{j=1}^{n_i} P(j) · y_{i,j} ) / ( Σ_{j=1}^{n_i} y_{i,j} )
where P(j) is the precision at the ranked position of document d_{i,j}. The precision is measured at every relevant position in this way, and the average-precision values of all queries are averaged to obtain the MAP.
It is another object of an embodiment of the present invention to provide a computer device, including a memory and a processor, where the memory stores therein a computer program, and the computer program, when executed by the processor, causes the processor to execute the steps of the information retrieval result ranking method.
It is another object of an embodiment of the present invention to provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, causes the processor to execute the steps of the information retrieval result ranking method.
Compared with the prior art, the invention has the beneficial effects that query-related information, such as click data, web-page anchor text, and PageRank scores, is merged into a feature model, and a ranking model is automatically constructed using learning-to-rank technology; the method therefore has broad application prospects in information retrieval, natural language processing, data mining, and related fields.
Drawings
Fig. 1 is a schematic diagram of an information retrieval result sorting method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Specific implementations of the present invention are described in detail below with reference to specific embodiments.
As shown in fig. 1, in an embodiment of the present invention, an information retrieval result sorting method includes the following steps:
annotating training data
Searching for documents relevant to the query and ranking the documents according to relevance. Specifically, the degree of relevance can be divided into 5 grades from 1 to 5, where 1 denotes weak relevance, 5 denotes the strongest relevance, and the other values denote relevance in between; the relevance labels can be annotated manually, or a manual scoring mechanism can be simulated from user click records;
text feature extraction
Determining the features of the document, converting the document into a feature vector X, and forming a training example (X, Y) together with the relevance label Y. Commonly used features include: term frequency of the query words in the document, IDF of the query words, document length, number of inlinked pages, PageRank value, URL characteristics, and the proximity of the query words within the document;
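The feature-extraction step above can be sketched as follows. The sketch covers only a small subset of the listed features (total query-term tf, IDF of matched query terms, document length), and the IDF table and tokens are hypothetical examples, not values from the invention:

```python
from collections import Counter

def extract_features(query_terms, doc_tokens, idf):
    """Map a (query, document) pair to a feature vector x = phi(q, d)."""
    tf = Counter(doc_tokens)
    qtf = sum(tf[t] for t in query_terms)                          # total tf of query terms
    qidf = sum(idf.get(t, 0.0) for t in query_terms if tf[t] > 0)  # IDF of matched terms
    doclen = len(doc_tokens)                                       # document length
    return [qtf, qidf, doclen]

# Hypothetical IDF table and document
idf = {"search": 0.4, "ranking": 0.4, "cat": 1.1}
x = extract_features(["search", "ranking"],
                     ["search", "engine", "ranking", "search"], idf)
# x = [3, 0.8, 4]
```

In a full system one such vector x_{i,j} would be produced for every query-document pair, together with its relevance label, to form the training examples (X, Y).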
training learning functions
Define Q = {q_1, q_2, …, q_m} as the query set, where q_i is the i-th query; D = {D_1, D_2, …, D_m} is the set of document collections associated with Q, where D_i = {d_{i,1}, d_{i,2}, …, d_{i,n_i}} is the document set related to query q_i and d_{i,j} is the j-th document in D_i; Y = {1, 2, …, l} is the set of relevance grades, and Y_i = {y_{i,1}, y_{i,2}, …, y_{i,n_i}} is the set of relevance labels of the documents related to query q_i;
From this, the original training set S = {(q_i, D_i, Y_i)}, i = 1, 2, …, m, is obtained. A feature vector x_{i,j} = φ(q_i, d_{i,j}) is generated for each query-document pair (q_i, d_{i,j}), i = 1, 2, …, m; j = 1, 2, …, n_i, where φ is a feature function;
Let X_i = {x_{i,1}, x_{i,2}, …, x_{i,n_i}} and set the training data set S' = {(X_i, Y_i)}, i = 1, 2, …, m. A local ranking model f(q, d) = f(x) is trained to assign a score to a given query-document pair (q, d), and a score list is output for the training data set S';
The documents in D_i are ranked by the score list. A ranking list π_i is defined as a bijective mapping on the document indices: π_i represents a permutation of all documents in D_i, and π_i(j) denotes the position of the j-th document in π_i. Ranking with f(q_i, d_i) selects, for query q_i, a permutation π_i ∈ Π_i of the document set D_i;
For a test set containing new queries and new documents, feature vectors x_{m+1} are created and scored with the trained ranking model; ranking by score yields π_{m+1};
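The training and scoring steps described above can be sketched as follows. The description does not fix a particular learning algorithm, so this sketch uses a simple linear model f(x) = w·x trained with a pointwise squared loss as an assumed stand-in; the feature values and labels are illustrative:

```python
def train_linear(X, y, lr=0.01, epochs=500):
    """Fit w so that w . x approximates the relevance label y
    (least squares via stochastic gradient descent)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - target
            for i, xi in enumerate(x):
                w[i] -= lr * err * xi
    return w

def rank(w, X):
    """Return pi: document indices sorted by descending score f(x) = w . x."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
    return sorted(range(len(X)), key=lambda j: -scores[j])

# Training data S': feature vectors x_{i,j} with relevance labels y_{i,j}
X_train = [[3.0, 0.9], [1.0, 0.2], [0.0, 0.0], [2.0, 0.7]]
y_train = [3, 1, 0, 2]
w = train_linear(X_train, y_train)

# A new query's documents are scored with the trained model and
# ranked by score to obtain the permutation pi
X_new = [[0.5, 0.1], [2.5, 0.8], [1.5, 0.4]]
pi = rank(w, X_new)  # best-scoring document first
```

Any learning-to-rank algorithm (pointwise, pairwise, or listwise) could replace `train_linear` here; the surrounding scoring-and-permutation machinery stays the same.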
The performance of the ranking model is evaluated by comparing the ranking list output by the ranking model with the ranking list given as ground truth. Given a query q_i and its related documents D_i, let π_i be a ranking list of D_i and y_i the relevance labels of D_i, measured by DCG; the DCG at position k is then:
DCG(k) = Σ_{j: π_i(j) ≤ k} G(y_{i,j}) · D(π_i(j))
where G is a gain function, D is a position discount function (commonly G(y) = 2^y - 1 and D(r) = 1/log_2(1 + r)), and π_i(j) is the position of d_{i,j} in π_i;
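The DCG computation above can be sketched as follows, using the common choices G(y) = 2^y - 1 for the gain and D(r) = 1/log2(1 + r) for the position discount; these particular functions are conventional assumptions, since the description only requires some gain and some discount:

```python
import math

def dcg_at_k(labels_in_rank_order, k):
    """DCG(k) = sum over the top-k positions r of G(y_r) * D(r),
    with gain G(y) = 2**y - 1 and discount D(r) = 1 / log2(1 + r)."""
    total = 0.0
    for r, y in enumerate(labels_in_rank_order[:k], start=1):
        total += (2 ** y - 1) / math.log2(1 + r)
    return total

# Relevance labels of one query's documents, listed in ranked order
labels = [3, 2, 3, 0, 1]
score = dcg_at_k(labels, 3)  # 7/1 + 3/log2(3) + 7/2, approximately 12.39
```

Dividing by the DCG of the ideal (label-sorted) ranking would give the normalized variant NDCG, which is often reported instead of raw DCG.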
The model is evaluated with the MAP (mean average precision) metric as a performance statistic. Relevance is divided into two levels, 1 and 0, and, given a query q_i, its related documents D_i, a ranking list π_i of D_i, and the relevance labels y_i of D_i, the average precision of query q_i is:
AP(q_i) = ( Σ_{j=1}^{n_i} P(j) · y_{i,j} ) / ( Σ_{j=1}^{n_i} y_{i,j} )
where P(j) is the precision at the ranked position of document d_{i,j}. The precision is measured at every relevant position in this way, and the average-precision values of all queries are averaged to obtain the MAP.
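The MAP computation above can be sketched as follows; the per-query label lists are illustrative, with binary relevance given in ranked order:

```python
def average_precision(labels_in_rank_order):
    """AP = (sum over relevant positions r of precision@r) / (# relevant docs)."""
    hits, total = 0, 0.0
    for r, y in enumerate(labels_in_rank_order, start=1):
        if y == 1:
            hits += 1
            total += hits / r  # precision at position r
    return total / hits if hits else 0.0

def mean_average_precision(per_query_labels):
    """MAP: mean of the average-precision values over all queries."""
    aps = [average_precision(labels) for labels in per_query_labels]
    return sum(aps) / len(aps)

# Two queries with binary relevance labels in ranked order
queries = [[1, 0, 1, 0], [0, 1, 1]]
map_score = mean_average_precision(queries)  # (5/6 + 7/12) / 2 = 17/24
```

Note that AP only credits precision at the positions of relevant documents, so pushing relevant documents toward the top of the ranking directly raises the score.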
The embodiment of the invention also provides computer equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the computer program is executed by the processor, so that the processor executes the steps of the information retrieval result sorting method.
The embodiment of the invention also provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the processor is enabled to execute the steps of the information retrieval result sorting method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).
The embodiment of the invention provides an information retrieval result sorting method, together with a computer device and a computer-readable storage medium based on it. Query-related information, such as click data, web-page anchor text, and PageRank scores, is merged into a feature model, and a ranking model is automatically constructed using learning-to-rank technology, giving the method broad application prospects in information retrieval, natural language processing, data mining, and related fields.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (5)
1. An information retrieval result sorting method is characterized by comprising the following steps:
annotating training data
Searching for documents relevant to the query, and ranking the documents according to the relevance;
text feature extraction
Determining the features of a document, converting the document into a feature vector, and forming a training example containing the feature vector and the relevance label;
training learning functions
Define Q = {q_1, q_2, …, q_m} as the query set, where q_i is the i-th query; D = {D_1, D_2, …, D_m} is the set of document collections associated with Q, where D_i = {d_{i,1}, d_{i,2}, …, d_{i,n_i}} is the document set related to query q_i and d_{i,j} is the j-th document in D_i; Y = {1, 2, …, l} is the set of relevance grades, and Y_i = {y_{i,1}, y_{i,2}, …, y_{i,n_i}} is the set of relevance labels of the documents related to query q_i;
From this, the original training set S = {(q_i, D_i, Y_i)}, i = 1, 2, …, m, is obtained. A feature vector x_{i,j} = φ(q_i, d_{i,j}) is generated for each query-document pair (q_i, d_{i,j}), i = 1, 2, …, m; j = 1, 2, …, n_i, where φ is a feature function;
Let X_i = {x_{i,1}, x_{i,2}, …, x_{i,n_i}} and set the training data set S' = {(X_i, Y_i)}, i = 1, 2, …, m. A local ranking model f(q, d) = f(x) is trained to assign a score to a given query-document pair (q, d), and a score list is output for the training data set S';
The documents in D_i are ranked by the score list. A ranking list π_i is defined as a bijective mapping on the document indices: π_i represents a permutation of all documents in D_i, and π_i(j) denotes the position of the j-th document in π_i. Ranking with f(q_i, d_i) selects, for query q_i, a permutation π_i ∈ Π_i of the document set D_i;
For a test set containing new queries and new documents, feature vectors x_{m+1} are created and scored with the trained ranking model; ranking by score yields π_{m+1};
Evaluating the performance of the ranking model;
Evaluating the model with the MAP (mean average precision) metric.
2. The method as claimed in claim 1, wherein the performance evaluation of the ranking model comprises: comparing the ranking list output by the ranking model with the ranking list given as ground truth, wherein, given a query q_i and its related documents D_i, π_i is a ranking list of D_i and y_i is the set of relevance labels of D_i, measured by DCG, and the DCG at position k is:
DCG(k) = Σ_{j: π_i(j) ≤ k} G(y_{i,j}) · D(π_i(j))
where G is a gain function, D is a position discount function, and π_i(j) is the position of d_{i,j} in π_i.
3. The method of claim 1, wherein the step of evaluating the model with the MAP metric comprises:
dividing relevance into two levels, 1 and 0, wherein, given a query q_i, its related documents D_i, a ranking list π_i of D_i, and the relevance labels y_i of D_i, the average precision of query q_i is:
AP(q_i) = ( Σ_{j=1}^{n_i} P(j) · y_{i,j} ) / ( Σ_{j=1}^{n_i} y_{i,j} )
where P(j) is the precision at the ranked position of document d_{i,j}; and measuring the precision at every relevant position in this way and averaging the average-precision values of all queries to obtain the MAP.
4. A computer arrangement, comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to carry out the steps of the information retrieval result ranking method of any of claims 1 to 3.
5. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the steps of the information retrieval result ranking method according to any of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010656908.3A CN111831936A (en) | 2020-07-09 | 2020-07-09 | Information retrieval result sorting method, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010656908.3A CN111831936A (en) | 2020-07-09 | 2020-07-09 | Information retrieval result sorting method, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111831936A true CN111831936A (en) | 2020-10-27 |
Family
ID=72901268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010656908.3A Pending CN111831936A (en) | 2020-07-09 | 2020-07-09 | Information retrieval result sorting method, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111831936A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113806660A (en) * | 2021-09-17 | 2021-12-17 | 北京百度网讯科技有限公司 | Data evaluation method, training method, device, electronic device and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090024607A1 (en) * | 2007-07-20 | 2009-01-22 | Microsoft Corporation | Query selection for effectively learning ranking functions |
US20090037401A1 (en) * | 2007-07-31 | 2009-02-05 | Microsoft Corporation | Information Retrieval and Ranking |
US20090132515A1 (en) * | 2007-11-19 | 2009-05-21 | Yumao Lu | Method and Apparatus for Performing Multi-Phase Ranking of Web Search Results by Re-Ranking Results Using Feature and Label Calibration |
US20100076949A1 (en) * | 2008-09-09 | 2010-03-25 | Microsoft Corporation | Information Retrieval System |
US20100082606A1 (en) * | 2008-09-24 | 2010-04-01 | Microsoft Corporation | Directly optimizing evaluation measures in learning to rank |
US20100250523A1 (en) * | 2009-03-31 | 2010-09-30 | Yahoo! Inc. | System and method for learning a ranking model that optimizes a ranking evaluation metric for ranking search results of a search query |
CN102043776A (en) * | 2009-10-14 | 2011-05-04 | 南开大学 | Inquiry-related multi-ranking-model integration algorithm |
US20160335263A1 (en) * | 2015-05-15 | 2016-11-17 | Yahoo! Inc. | Method and system for ranking search content |
CN108520038A (en) * | 2018-03-31 | 2018-09-11 | 大连理工大学 | A kind of Biomedical literature search method based on Ranking Algorithm |
- 2020-07-09: application CN202010656908.3A filed; patent CN111831936A, status: active, Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090024607A1 (en) * | 2007-07-20 | 2009-01-22 | Microsoft Corporation | Query selection for effectively learning ranking functions |
US20090037401A1 (en) * | 2007-07-31 | 2009-02-05 | Microsoft Corporation | Information Retrieval and Ranking |
US20090132515A1 (en) * | 2007-11-19 | 2009-05-21 | Yumao Lu | Method and Apparatus for Performing Multi-Phase Ranking of Web Search Results by Re-Ranking Results Using Feature and Label Calibration |
US20100076949A1 (en) * | 2008-09-09 | 2010-03-25 | Microsoft Corporation | Information Retrieval System |
US20100082606A1 (en) * | 2008-09-24 | 2010-04-01 | Microsoft Corporation | Directly optimizing evaluation measures in learning to rank |
US20100250523A1 (en) * | 2009-03-31 | 2010-09-30 | Yahoo! Inc. | System and method for learning a ranking model that optimizes a ranking evaluation metric for ranking search results of a search query |
CN102043776A (en) * | 2009-10-14 | 2011-05-04 | 南开大学 | Inquiry-related multi-ranking-model integration algorithm |
US20160335263A1 (en) * | 2015-05-15 | 2016-11-17 | Yahoo! Inc. | Method and system for ranking search content |
CN108520038A (en) * | 2018-03-31 | 2018-09-11 | 大连理工大学 | A kind of Biomedical literature search method based on Ranking Algorithm |
Non-Patent Citations (4)
Title |
---|
周祖坤; 杨光; 冯小坤: "Learning-to-rank algorithms for document information retrieval", 自动化技术与应用 (Techniques of Automation and Applications), no. 02 *
王扬; 黄亚楼; 谢茂强; 刘杰; 卢敏; 廖振: "A multi-query-dependent fusion algorithm for ranking support vector machines", 计算机研究与发展 (Journal of Computer Research and Development), no. 04 *
蔡飞; 陈洪辉; 舒振: "Research on learning-to-rank algorithms based on user relevance feedback", 国防科技大学学报 (Journal of National University of Defense Technology), no. 02 *
薛剑; 吕立; 孙咏; 王丹妮: "Research on a listwise learning-to-rank method using a position-based loss", 小型微型计算机系统 (Journal of Chinese Computer Systems), no. 01 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113806660A (en) * | 2021-09-17 | 2021-12-17 | 北京百度网讯科技有限公司 | Data evaluation method, training method, device, electronic device and storage medium |
CN113806660B (en) * | 2021-09-17 | 2024-04-26 | 北京百度网讯科技有限公司 | Data evaluation method, training device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110502621B (en) | Question answering method, question answering device, computer equipment and storage medium | |
CN110674429B (en) | Method, apparatus, device and computer readable storage medium for information retrieval | |
Niwattanakul et al. | Using of Jaccard coefficient for keywords similarity | |
CN102236640B (en) | Disambiguation of named entities | |
US9189548B2 (en) | Document search engine including highlighting of confident results | |
Balog et al. | Formal models for expert finding in enterprise corpora | |
CN102902806B (en) | A kind of method and system utilizing search engine to carry out query expansion | |
CN110377558B (en) | Document query method, device, computer equipment and storage medium | |
CN110321408B (en) | Searching method and device based on knowledge graph, computer equipment and storage medium | |
US8019758B2 (en) | Generation of a blended classification model | |
CN110377560B (en) | Method and device for structuring resume information | |
CN108182186B (en) | Webpage sorting method based on random forest algorithm | |
WO2011152925A2 (en) | Detection of junk in search result ranking | |
CN105045875A (en) | Personalized information retrieval method and apparatus | |
CN105653562A (en) | Calculation method and apparatus for correlation between text content and query request | |
CN113821646A (en) | Intelligent patent similarity searching method and device based on semantic retrieval | |
CN111026787A (en) | Network point retrieval method, device and system | |
CN111831936A (en) | Information retrieval result sorting method, computer equipment and storage medium | |
CN111723179A (en) | Feedback model information retrieval method, system and medium based on concept map | |
CN111966869A (en) | Phrase extraction method and device, electronic equipment and storage medium | |
CN115630144A (en) | Document searching method and device and related equipment | |
CN107423298B (en) | Searching method and device | |
CN112765311A (en) | Method for searching referee document | |
CN112163065A (en) | Information retrieval method, system and medium | |
CN101661480A (en) | Method and system for ensuring name of organization in different languages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||