CN111831936A - Information retrieval result sorting method, computer equipment and storage medium - Google Patents

Info

Publication number
CN111831936A
Authority
CN
China
Prior art keywords
document
query
ranking
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010656908.3A
Other languages
Chinese (zh)
Inventor
黎阳
申义
侯颖
刘大伟
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weihai Tianxin Modern Service Technology Research Institute Co ltd
Original Assignee
Weihai Tianxin Modern Service Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-07-09
Filing date
2020-07-09
Publication date
2020-10-27
Application filed by Weihai Tianxin Modern Service Technology Research Institute Co ltd
Priority to CN202010656908.3A
Publication of CN111831936A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9538: Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of information retrieval and provides an information retrieval result sorting method comprising the steps of annotating training data, extracting text features, training a learning function, and the like. Its beneficial effects are as follows: query-related information, such as click data, web page anchor text, and PageRank scores, is merged into a feature model, and a ranking model is constructed automatically using learning-to-rank techniques, so that the method has broad application prospects in fields such as information retrieval, natural language processing, and data mining.

Description

Information retrieval result sorting method, computer equipment and storage medium
Technical Field
The present invention relates to the field of information retrieval technologies, and in particular, to a method for sorting information retrieval results, a computer device, and a storage medium.
Background
In the field of information retrieval, traditional sorting methods work by constructing a ranking function, and results are generally sorted by relevance. Typically, a query submitted to a search engine returns a set of relevant documents, which are then ranked according to the relevance between the search keywords and each document and returned to the user.
As the number of factors affecting relevance grows, traditional sorting methods become difficult to use. They have trouble fusing multiple kinds of information. For example, a vector space model constructs its relevance function using TF-IDF as the weight, which makes other information hard to exploit. If the model has many parameters, tuning them is difficult and overfitting is likely to occur.
Based on this, the application provides an information retrieval result sorting method, a computer device and a storage medium.
Disclosure of Invention
An embodiment of the present invention provides an information retrieval result sorting method, a computer device, and a storage medium, and aims to solve the technical problems in the background art.
The embodiment of the invention is realized in such a way that the information retrieval result ordering method comprises the following steps:
annotating training data
Searching for documents relevant to the query, and ranking the documents according to relevance;
text feature extraction
Determining the feature quantity of a document, converting the document into a feature vector, and forming a training example containing the feature vector and the correlation;
training learning functions
Define $Q = \{q_1, q_2, \ldots, q_m\}$ as the query set, where $q_i$ is the $i$-th query. Let $D$ be the set of documents associated with the query set $Q$, where $D_i = \{d_{i,1}, d_{i,2}, \ldots, d_{i,n_i}\}$ is the set of documents related to query $q_i$ and $d_{i,j}$ denotes the $j$-th document in $D_i$. Let $Y = \{1, 2, \ldots, l\}$ be the set of relevance grades, where $y_i = \{y_{i,1}, y_{i,2}, \ldots, y_{i,n_i}\}$ is the set of relevance grades of the documents related to query $q_i$;
from this, the original training set can be obtained as
$$S = \{(q_i, D_i), y_i\}_{i=1}^{m}$$
A feature vector
$$x_{i,j} = \phi(q_i, d_{i,j})$$
is generated from each query-document pair $(q_i, d_{i,j})$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n_i$, where $\phi$ is the feature function;
with $x_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}\}$, the training data set is set to
$$S' = \{(x_i, y_i)\}_{i=1}^{m}$$
where $x \in \mathcal{X}$ and $\mathcal{X} \subseteq \mathbb{R}^d$;
a local ranking model $f(q, d) = f(x)$ is trained on the training data set $S'$ to assign a score to a given query-document pair $(q, d)$, and a score list is output;
the documents in the document set $D_i$ are ranked by the score list. A ranking list $\pi_i$ on $D_i$ is defined as a bijective mapping on the document indices, $\Pi_i$ denotes the set of all possible such mappings on $D_i$, and $\pi_i(j)$ denotes the position of the $j$-th document in $\pi_i$. Ranking uses the scores $f(q_i, d_i)$ to select a permutation $\pi_i \in \Pi_i$ for query $q_i$ and document set $D_i$;
using a test set containing a new query and new documents, a feature vector $x_{m+1}$ is created and scored with the trained ranking model, and the documents are sorted by score to obtain $\pi_{m+1}$;
evaluating the performance of the ranking model;
evaluating the MAP (mean average precision) of the model.
As a further technical scheme of the invention, the performance evaluation of the ranking model comprises: comparing the ranking list output by the ranking model with the ranking list given as ground truth. Given a query $q_i$ and its related documents $D_i$, let $\pi_i$ be the ranking list on $D_i$ and $y_i$ the set of relevance grades of $D_i$. Measured by DCG (discounted cumulative gain), the DCG at position $k$ is:
$$DCG(k) = \sum_{j:\, \pi_i(j) \le k} G(j)\, D(\pi_i(j))$$
where $G(\cdot)$ is a gain function, $D(\cdot)$ is a position discount function, and $\pi_i(j)$ is the position of $d_{i,j}$ in $\pi_i$.
As a further technical scheme of the invention, evaluating the MAP (mean average precision) of the model comprises:
with relevance divided into two levels, 1 and 0, and given a query $q_i$, its related documents $D_i$, the ranking list $\pi_i$ on $D_i$, and the set of relevance labels $y_i$ of $D_i$, the average precision of query $q_i$ is:
$$AP = \frac{\sum_{j=1}^{n_i} P(j)\, y_{i,j}}{\sum_{j=1}^{n_i} y_{i,j}}$$
where $P(j)$ is the precision measured up to the position of $d_{i,j}$; the average precision values are then averaged over all queries to obtain the MAP.
It is another object of an embodiment of the present invention to provide a computer device, including a memory and a processor, where the memory stores therein a computer program, and the computer program, when executed by the processor, causes the processor to execute the steps of the information retrieval result ranking method.
It is another object of an embodiment of the present invention to provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, causes the processor to execute the steps of the information retrieval result ranking method.
Compared with the prior art, the invention has the following beneficial effects: query-related information, such as click data, web page anchor text, and PageRank scores, is merged into a feature model, and a ranking model is constructed automatically using learning-to-rank techniques, so that the method has broad application prospects in fields such as information retrieval, natural language processing, and data mining.
Drawings
Fig. 1 is a schematic diagram of an information retrieval result sorting method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Specific implementations of the present invention are described in detail below with reference to specific embodiments.
As shown in fig. 1, in an embodiment of the present invention, an information retrieval result sorting method includes the following steps:
annotating training data
Searching for documents relevant to the query, and ranking the documents according to relevance. Specifically, relevance can be graded on a scale from 1 to 5, where 1 denotes weak relevance, 5 denotes the strongest relevance, and the intermediate values denote degrees of relevance in between. The grades can be assigned manually, or a manual scoring mechanism can be simulated from users' click records.
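As an illustration of simulating manual grading from click records, the following minimal sketch maps a document's click-through rate to a grade from 1 to 5. The function name `grades_from_clicks`, the linear bucketing rule, and the sample counts are assumptions made for this example; the source does not specify how click records are converted into grades.

```python
def grades_from_clicks(impressions, clicks, levels=5):
    """Map each document's click-through rate (CTR) to an integer grade in 1..levels.

    Hypothetical rule: linear bucketing of CTR. The source only states that
    click records can simulate manual scoring, not how.
    """
    grades = {}
    for doc_id, shown in impressions.items():
        ctr = clicks.get(doc_id, 0) / shown if shown else 0.0
        # CTR 0.0 falls in grade 1; CTR 1.0 is capped at the top grade.
        grades[doc_id] = 1 + min(levels - 1, int(ctr * levels))
    return grades


# Made-up impression and click counts for three documents of one query.
impressions = {"d1": 100, "d2": 100, "d3": 50}
clicks = {"d1": 90, "d2": 35, "d3": 2}
labels = grades_from_clicks(impressions, clicks)
print(labels)
```

In practice click data is biased by display position, so a real system would likely correct for that before bucketing; this sketch ignores position bias.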
text feature extraction
Determining the feature quantity of the document, converting the document into a feature vector X, and, together with its relevance Y, forming a training example (X, Y) containing the feature vector and the relevance. Commonly used features include: term frequency (TF) of the query terms, inverse document frequency (IDF) of the query terms, document length, number of linked pages, PageRank value, URL looseness, and the proximity value of the query terms.
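The conversion of a query-document pair into a feature vector can be sketched as follows. This is a minimal example under stated assumptions: the function name `extract_features`, the choice of four features (summed TF, summed smoothed IDF, document length, PageRank), the IDF smoothing, and the sample statistics are all hypothetical and are not taken from the source.

```python
import math


def extract_features(query_terms, doc_tokens, doc_freq, num_docs, pagerank):
    """Build a feature vector for one query-document pair.

    Features (an illustrative subset): summed term frequency of the query
    terms, summed smoothed IDF, document length, and a PageRank value.
    """
    tf = sum(doc_tokens.count(t) for t in query_terms)
    # Smoothed IDF; the +1 smoothing is an assumption of this sketch.
    idf = sum(
        math.log((num_docs + 1) / (doc_freq.get(t, 0) + 1)) for t in query_terms
    )
    return [float(tf), idf, float(len(doc_tokens)), pagerank]


# Made-up document, collection statistics, and PageRank value.
doc = "learning to rank methods rank documents by relevance".split()
x = extract_features(
    ["rank", "relevance"], doc, {"rank": 9, "relevance": 4}, num_docs=99, pagerank=0.15
)
print(x)
```

Pairing each such vector with the relevance grade of its document gives the training examples (X, Y) described above.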
training learning functions
Define $Q = \{q_1, q_2, \ldots, q_m\}$ as the query set, where $q_i$ is the $i$-th query. Let $D$ be the set of documents associated with the query set $Q$, where $D_i = \{d_{i,1}, d_{i,2}, \ldots, d_{i,n_i}\}$ is the set of documents related to query $q_i$ and $d_{i,j}$ denotes the $j$-th document in $D_i$. Let $Y = \{1, 2, \ldots, l\}$ be the set of relevance grades, where $y_i = \{y_{i,1}, y_{i,2}, \ldots, y_{i,n_i}\}$ is the set of relevance grades of the documents related to query $q_i$.
From this, the original training set can be obtained as
$$S = \{(q_i, D_i), y_i\}_{i=1}^{m}$$
A feature vector
$$x_{i,j} = \phi(q_i, d_{i,j})$$
is generated from each query-document pair $(q_i, d_{i,j})$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n_i$, where $\phi$ is the feature function.
With $x_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}\}$, the training data set is set to
$$S' = \{(x_i, y_i)\}_{i=1}^{m}$$
where $x \in \mathcal{X}$ and $\mathcal{X} \subseteq \mathbb{R}^d$.
A local ranking model $f(q, d) = f(x)$ is trained on the training data set $S'$ to assign a score to a given query-document pair $(q, d)$, and a score list is output.
The documents in the document set $D_i$ are ranked by the score list. A ranking list $\pi_i$ on $D_i$ is defined as a bijective mapping on the document indices, $\Pi_i$ denotes the set of all possible such mappings on $D_i$, and $\pi_i(j)$ denotes the position of the $j$-th document in $\pi_i$. Ranking uses the scores $f(q_i, d_i)$ to select a permutation $\pi_i \in \Pi_i$ for query $q_i$ and document set $D_i$.
Using a test set containing a new query and new documents, a feature vector $x_{m+1}$ is created and scored with the trained ranking model, and the documents are sorted by score to obtain $\pi_{m+1}$.
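The training and ranking steps can be sketched as follows. The source does not name a learning algorithm, so this example assumes a pointwise approach: a linear scoring function f(x) = w · x fitted by batch gradient descent on squared error. The function names, learning rate, epoch count, and toy data are all hypothetical.

```python
def train_pointwise(X, y, lr=0.1, epochs=500):
    """Fit a linear scoring function f(x) = w . x by batch gradient descent
    on mean squared error (an assumed pointwise learning-to-rank setup)."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sum(wk * xk for wk, xk in zip(w, xi)) - yi
            for k, xk in enumerate(xi):
                grad[k] += 2.0 * err * xk / n
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


def rank_positions(scores):
    """Return pi, where pi[j] is the 1-based position of document j
    when documents are sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    pi = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        pi[j] = position
    return pi


# Toy training set: feature vectors with relevance grades as targets.
X_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0]]
y_train = [2.0, 1.0, 3.0, 1.0]
w = train_pointwise(X_train, y_train)

# Feature vectors of the documents retrieved for a new query.
X_new = [[0.2, 0.1], [0.9, 0.8], [0.4, 0.0]]
scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in X_new]
pi_new = rank_positions(scores)
print(pi_new)
```

Here `pi_new[j]` plays the role of $\pi_{m+1}(j)$, the position assigned to the $j$-th document of the new query. Pairwise or listwise losses could be substituted for the squared error without changing the ranking step.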
The performance of the ranking model is evaluated by comparing the ranking list output by the ranking model with the ranking list given as ground truth. Given a query $q_i$ and its related documents $D_i$, let $\pi_i$ be the ranking list on $D_i$ and $y_i$ the set of relevance grades of $D_i$. Measured by DCG (discounted cumulative gain), the DCG at position $k$ is:
$$DCG(k) = \sum_{j:\, \pi_i(j) \le k} G(j)\, D(\pi_i(j))$$
where $G(\cdot)$ is a gain function, $D(\cdot)$ is a position discount function, and $\pi_i(j)$ is the position of $d_{i,j}$ in $\pi_i$.
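A DCG computation over a ranking list might look like the sketch below. The source does not give concrete forms for the gain and discount functions, so this example assumes the commonly used choices G(j) = 2^{y_{i,j}} - 1 and D(p) = 1 / log2(1 + p); the function name and sample data are also hypothetical.

```python
import math


def dcg_at_k(relevance, pi, k):
    """DCG at position k: sum of G(j) * D(pi[j]) over documents j with pi[j] <= k.

    relevance[j] is the grade of document j; pi[j] is its 1-based position.
    """
    total = 0.0
    for y_j, pos in zip(relevance, pi):
        if pos <= k:
            gain = 2 ** y_j - 1                  # assumed gain function
            discount = 1.0 / math.log2(1 + pos)  # assumed position discount
            total += gain * discount
    return total


# Three documents with grades 3, 1, 2, ranked at positions 1, 3, 2.
y_i = [3, 1, 2]
pi_i = [1, 3, 2]
dcg = dcg_at_k(y_i, pi_i, k=2)
print(dcg)
```

Dividing this value by the DCG of the ideal ordering would give the normalized variant (NDCG), which the source does not mention.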
The MAP (mean average precision), a statistic also used to measure the performance of object detection models, is then evaluated for the model. Relevance is divided into two levels, 1 and 0. Given a query $q_i$, its related documents $D_i$, the ranking list $\pi_i$ on $D_i$, and the set of relevance labels $y_i$ of $D_i$, the average precision of query $q_i$ is:
$$AP = \frac{\sum_{j=1}^{n_i} P(j)\, y_{i,j}}{\sum_{j=1}^{n_i} y_{i,j}}$$
where $P(j)$ is the precision measured up to the position of $d_{i,j}$. The average precision values are then averaged over all queries to obtain the MAP.
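The MAP evaluation can be illustrated with the following sketch. It assumes the usual definition of precision at a document's position, namely the fraction of relevant documents among those ranked at or above it, which the source does not spell out. The function names, the convention of returning 0.0 for a query with no relevant documents, and the sample data are assumptions.

```python
def average_precision(relevance, pi):
    """Average precision for one query with binary labels.

    relevance[j] is 1 or 0; pi[j] is the 1-based position of document j.
    """
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0  # convention chosen for this sketch
    total = 0.0
    for j, y_j in enumerate(relevance):
        if y_j == 1:
            # Precision up to the position of document j: relevant documents
            # ranked at or above it, divided by that position.
            hits = sum(y_k for y_k, pos in zip(relevance, pi) if pos <= pi[j])
            total += hits / pi[j]
    return total / num_relevant


def mean_average_precision(queries):
    """Average the per-query AP values over all queries."""
    return sum(average_precision(y, pi) for y, pi in queries) / len(queries)


# Two queries, each given as (binary labels, positions).
queries = [([1, 0, 1], [1, 2, 3]), ([0, 1], [1, 2])]
map_score = mean_average_precision(queries)
print(map_score)
```

For the first query the relevant documents sit at positions 1 and 3, giving AP = (1 + 2/3) / 2; for the second the only relevant document is at position 2, giving AP = 1/2.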
The embodiment of the invention also provides computer equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the computer program is executed by the processor, so that the processor executes the steps of the information retrieval result sorting method.
The embodiment of the invention also provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the processor is enabled to execute the steps of the information retrieval result sorting method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The embodiment of the invention provides an information retrieval result sorting method, together with a computer device and a computer-readable storage medium based on that method. Query-related information, such as click data, web page anchor text, and PageRank scores, is merged into a feature model, and a ranking model is constructed automatically using learning-to-rank techniques, so that the method has broad application prospects in fields such as information retrieval, natural language processing, and data mining.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (5)

1. An information retrieval result sorting method is characterized by comprising the following steps:
annotating training data
Searching for documents relevant to the query, and ranking the documents according to the relevance;
text feature extraction
Determining the feature quantity of a document, converting the document into a feature vector, and forming a training example containing the feature vector and the correlation;
training learning functions
Define $Q = \{q_1, q_2, \ldots, q_m\}$ as the query set, where $q_i$ is the $i$-th query. Let $D$ be the set of documents associated with the query set $Q$, where $D_i = \{d_{i,1}, d_{i,2}, \ldots, d_{i,n_i}\}$ is the set of documents related to query $q_i$ and $d_{i,j}$ denotes the $j$-th document in $D_i$. Let $Y = \{1, 2, \ldots, l\}$ be the set of relevance grades, where $y_i = \{y_{i,1}, y_{i,2}, \ldots, y_{i,n_i}\}$ is the set of relevance grades of the documents related to query $q_i$;
from this, the original training set can be obtained as
$$S = \{(q_i, D_i), y_i\}_{i=1}^{m}$$
A feature vector
$$x_{i,j} = \phi(q_i, d_{i,j})$$
is generated from each query-document pair $(q_i, d_{i,j})$, $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n_i$, where $\phi$ is the feature function;
with $x_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,n_i}\}$, the training data set is set to
$$S' = \{(x_i, y_i)\}_{i=1}^{m}$$
where $x \in \mathcal{X}$ and $\mathcal{X} \subseteq \mathbb{R}^d$;
a local ranking model $f(q, d) = f(x)$ is trained on the training data set $S'$ to assign a score to a given query-document pair $(q, d)$, and a score list is output;
the documents in the document set $D_i$ are ranked by the score list. A ranking list $\pi_i$ on $D_i$ is defined as a bijective mapping on the document indices, $\Pi_i$ denotes the set of all possible such mappings on $D_i$, and $\pi_i(j)$ denotes the position of the $j$-th document in $\pi_i$. Ranking uses the scores $f(q_i, d_i)$ to select a permutation $\pi_i \in \Pi_i$ for query $q_i$ and document set $D_i$;
using a test set containing a new query and new documents, a feature vector $x_{m+1}$ is created and scored with the trained ranking model, and the documents are sorted by score to obtain $\pi_{m+1}$;
evaluating the performance of the ranking model;
evaluating the MAP (mean average precision) of the model.
2. The method as claimed in claim 1, wherein evaluating the performance of the ranking model comprises: comparing the ranking list output by the ranking model with the ranking list given as ground truth. Given a query $q_i$ and its related documents $D_i$, let $\pi_i$ be the ranking list on $D_i$ and $y_i$ the set of relevance grades of $D_i$. Measured by DCG (discounted cumulative gain), the DCG at position $k$ is:
$$DCG(k) = \sum_{j:\, \pi_i(j) \le k} G(j)\, D(\pi_i(j))$$
where $G(\cdot)$ is a gain function, $D(\cdot)$ is a position discount function, and $\pi_i(j)$ is the position of $d_{i,j}$ in $\pi_i$.
3. The method of claim 1, wherein evaluating the MAP (mean average precision) of the model comprises:
with relevance divided into two levels, 1 and 0, and given a query $q_i$, its related documents $D_i$, the ranking list $\pi_i$ on $D_i$, and the set of relevance labels $y_i$ of $D_i$, the average precision of query $q_i$ is:
$$AP = \frac{\sum_{j=1}^{n_i} P(j)\, y_{i,j}}{\sum_{j=1}^{n_i} y_{i,j}}$$
where $P(j)$ is the precision measured up to the position of $d_{i,j}$; the average precision values are then averaged over all queries to obtain the MAP.
4. A computer device, comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to carry out the steps of the information retrieval result sorting method of any of claims 1 to 3.
5. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the steps of the information retrieval result ranking method according to any of claims 1 to 3.
CN202010656908.3A 2020-07-09 2020-07-09 Information retrieval result sorting method, computer equipment and storage medium Pending CN111831936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010656908.3A CN111831936A (en) 2020-07-09 2020-07-09 Information retrieval result sorting method, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010656908.3A CN111831936A (en) 2020-07-09 2020-07-09 Information retrieval result sorting method, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111831936A true CN111831936A (en) 2020-10-27

Family

ID=72901268

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010656908.3A Pending CN111831936A (en) 2020-07-09 2020-07-09 Information retrieval result sorting method, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111831936A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090024607A1 (en) * 2007-07-20 2009-01-22 Microsoft Corporation Query selection for effectively learning ranking functions
US20090037401A1 (en) * 2007-07-31 2009-02-05 Microsoft Corporation Information Retrieval and Ranking
US20090132515A1 (en) * 2007-11-19 2009-05-21 Yumao Lu Method and Apparatus for Performing Multi-Phase Ranking of Web Search Results by Re-Ranking Results Using Feature and Label Calibration
US20100076949A1 (en) * 2008-09-09 2010-03-25 Microsoft Corporation Information Retrieval System
US20100082606A1 (en) * 2008-09-24 2010-04-01 Microsoft Corporation Directly optimizing evaluation measures in learning to rank
US20100250523A1 (en) * 2009-03-31 2010-09-30 Yahoo! Inc. System and method for learning a ranking model that optimizes a ranking evaluation metric for ranking search results of a search query
CN102043776A (en) * 2009-10-14 2011-05-04 南开大学 Inquiry-related multi-ranking-model integration algorithm
US20160335263A1 (en) * 2015-05-15 2016-11-17 Yahoo! Inc. Method and system for ranking search content
CN108520038A (en) * 2018-03-31 2018-09-11 大连理工大学 A kind of Biomedical literature search method based on Ranking Algorithm

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
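周祖坤; 杨光; 冯小坤: "Learning-to-rank algorithms for document information retrieval", Techniques of Automation and Applications (自动化技术与应用), no. 02 *
王扬; 黄亚楼; 谢茂强; 刘杰; 卢敏; 廖振: "A multi-query-dependent ranking support vector machine fusion algorithm", Journal of Computer Research and Development (计算机研究与发展), no. 04 *
蔡飞; 陈洪辉; 舒振: "Research on learning-to-rank algorithms based on user relevance feedback", Journal of National University of Defense Technology (国防科技大学学报), no. 02 *
薛剑; 吕立; 孙咏; 王丹妮: "Research on a listwise learning-to-rank method applying position information loss", Journal of Chinese Computer Systems (小型微型计算机系统), no. 01 *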

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113806660A (en) * 2021-09-17 2021-12-17 北京百度网讯科技有限公司 Data evaluation method, training method, device, electronic device and storage medium
CN113806660B (en) * 2021-09-17 2024-04-26 北京百度网讯科技有限公司 Data evaluation method, training device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination