CN111831936A - Information retrieval result sorting method, computer equipment and storage medium - Google Patents
Information retrieval result sorting method, computer equipment and storage medium
- Publication number
- CN111831936A (application CN202010656908.3A)
- Authority
- CN
- China
- Prior art keywords
- document
- query
- ranking
- model
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9538—Presentation of query results
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of information retrieval and provides an information retrieval result ranking method comprising the steps of annotating training data, extracting text features, and training a learning function. Its beneficial effect is that query-related information, such as click data, web-page anchor text, and PageRank scores, is merged into a feature model, and a ranking model is automatically constructed using learning-to-rank technology; the method therefore has broad application prospects in information retrieval, natural language processing, data mining, and related fields.
Description
Technical Field
The present invention relates to the field of information retrieval technologies, and in particular, to a method for sorting information retrieval results, a computer device, and a storage medium.
Background
In the field of information retrieval, traditional ranking methods are realized by constructing a ranking function, and sorting is generally carried out according to relevance. Typically, a query submitted to a search engine returns a set of relevant documents, which are then ranked according to the relevance between the query keywords and each document and returned to the user.
As the number of factors affecting relevance grows, traditional ranking methods become difficult to apply. They struggle to fuse multiple kinds of information: for example, the vector space model builds a relevance function using tf·idf weights, which makes other information hard to exploit; and when a model has many parameters, tuning them is difficult and overfitting is likely to occur.
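The vector-space baseline criticized above can be made concrete with a short sketch. The toy corpus, the query, and the raw tf·idf weighting below are illustrative assumptions only; they show why a single hand-built relevance function leaves no room for other signals such as clicks or PageRank:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf*idf vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs], idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical corpus and query
docs = [["ranking", "model", "search"],
        ["cat", "dog"],
        ["search", "engine", "ranking"]]
vecs, idf = tfidf_vectors(docs)
query_vec = {t: idf.get(t, 0.0) for t in ["search", "ranking"]}

# The only ranking signal available is the single tf*idf relevance function
order = sorted(range(len(docs)), key=lambda i: -cosine(query_vec, vecs[i]))
```

The off-topic document ends up last, but any additional evidence about the documents would have to be bolted onto this one hand-tuned function, which is exactly the limitation the learning-to-rank approach addresses.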
Based on this, the application provides an information retrieval result sorting method, a computer device and a storage medium.
Disclosure of Invention
An embodiment of the present invention provides an information retrieval result sorting method, a computer device, and a storage medium, and aims to solve the technical problems in the background art.
The embodiment of the invention is realized in such a way that the information retrieval result ordering method comprises the following steps:
annotating training data
Searching for documents relevant to the query and ranking the documents according to relevance;
text feature extraction
Determining the features of a document, converting the document into a feature vector, and forming a training example containing the feature vector and the relevance label;
training learning functions
Define Q = {q_1, q_2, …, q_m} as the query set, where q_i is the i-th query; D = {D_1, D_2, …, D_m} is the set of document collections associated with Q, where D_i = {d_{i,1}, d_{i,2}, …, d_{i,n_i}} is the document set related to query q_i and d_{i,j} is the j-th document in D_i; Y = {1, 2, …, l} is the set of relevance grades, and Y_i = {y_{i,1}, y_{i,2}, …, y_{i,n_i}} is the set of relevance labels of the documents related to query q_i;
From this, the original training set S = {(q_i, D_i, Y_i)}, i = 1, 2, …, m, is obtained. A feature vector x_{i,j} = φ(q_i, d_{i,j}) is generated for each query-document pair (q_i, d_{i,j}), i = 1, 2, …, m; j = 1, 2, …, n_i, where φ is a feature function;
Let X_i = {x_{i,1}, x_{i,2}, …, x_{i,n_i}} and set the training data set S' = {(X_i, Y_i)}, i = 1, 2, …, m. A local ranking model f(q, d) = f(x) is trained to assign a score to a given query-document pair (q, d), and a score list is output for the training data set S';
The documents in D_i are ranked by the score list. A ranking list π_i is defined as a bijective mapping on the document indices: π_i represents a permutation of all documents in D_i, and π_i(j) denotes the position of the j-th document in π_i. Ranking with f(q_i, d_i) selects, for query q_i, a permutation π_i ∈ Π_i of the document set D_i;
For a test set containing new queries and new documents, feature vectors x_{m+1} are created and scored with the trained ranking model; ranking by score yields π_{m+1};
Evaluating the performance of the ranking model;
Evaluating the model with the MAP (mean average precision) metric.
As a further technical scheme of the invention, the performance evaluation of the ranking model comprises the following steps: the ranking list output by the ranking model is compared with the ranking list given as ground truth. Given a query q_i and its related documents D_i, let π_i be a ranking list of D_i and y_i the relevance labels of D_i, measured by DCG; the DCG at position k is then:
DCG(k) = Σ_{j: π_i(j) ≤ k} G(y_{i,j}) · D(π_i(j))
where G is a gain function, D is a position discount function (commonly G(y) = 2^y - 1 and D(r) = 1/log_2(1 + r)), and π_i(j) is the position of d_{i,j} in π_i.
As a further technical scheme of the invention, the steps of evaluating the model with the MAP metric are:
Relevance is divided into two levels, 1 and 0. Given a query q_i, its related documents D_i, a ranking list π_i of D_i, and the relevance labels y_i of D_i, the average precision of query q_i is:
AP(q_i) = ( Σ_{j=1}^{n_i} P(j) · y_{i,j} ) / ( Σ_{j=1}^{n_i} y_{i,j} )
where P(j) is the precision at the ranked position of document d_{i,j}. The precision is measured at every relevant position in this way, and the average-precision values of all queries are averaged to obtain the MAP.
It is another object of an embodiment of the present invention to provide a computer device, including a memory and a processor, where the memory stores therein a computer program, and the computer program, when executed by the processor, causes the processor to execute the steps of the information retrieval result ranking method.
It is another object of an embodiment of the present invention to provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, causes the processor to execute the steps of the information retrieval result ranking method.
Compared with the prior art, the invention has the beneficial effects that query-related information, such as click data, web-page anchor text, and PageRank scores, is merged into a feature model, and a ranking model is automatically constructed using learning-to-rank technology; the method therefore has broad application prospects in information retrieval, natural language processing, data mining, and related fields.
Drawings
Fig. 1 is a schematic diagram of an information retrieval result sorting method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Specific implementations of the present invention are described in detail below with reference to specific embodiments.
As shown in fig. 1, in an embodiment of the present invention, an information retrieval result sorting method includes the following steps:
annotating training data
Searching for documents relevant to the query and ranking the documents according to relevance. Specifically, the degree of relevance can be divided into 5 grades from 1 to 5, where 1 denotes weak relevance, 5 denotes the strongest relevance, and the other values denote relevance in between; the relevance labels can be annotated manually, or a manual scoring mechanism can be simulated from user click records;
text feature extraction
Determining the features of the document, converting the document into a feature vector X, and forming a training example (X, Y) together with the relevance label Y. Commonly used features include: term frequency of the query words in the document, IDF of the query words, document length, number of inlinked pages, PageRank value, URL characteristics, and the proximity of the query words within the document;
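The feature-extraction step above can be sketched as follows. The sketch covers only a small subset of the listed features (total query-term tf, IDF of matched query terms, document length), and the IDF table and tokens are hypothetical examples, not values from the invention:

```python
from collections import Counter

def extract_features(query_terms, doc_tokens, idf):
    """Map a (query, document) pair to a feature vector x = phi(q, d)."""
    tf = Counter(doc_tokens)
    qtf = sum(tf[t] for t in query_terms)                          # total tf of query terms
    qidf = sum(idf.get(t, 0.0) for t in query_terms if tf[t] > 0)  # IDF of matched terms
    doclen = len(doc_tokens)                                       # document length
    return [qtf, qidf, doclen]

# Hypothetical IDF table and document
idf = {"search": 0.4, "ranking": 0.4, "cat": 1.1}
x = extract_features(["search", "ranking"],
                     ["search", "engine", "ranking", "search"], idf)
# x = [3, 0.8, 4]
```

In a full system one such vector x_{i,j} would be produced for every query-document pair, together with its relevance label, to form the training examples (X, Y).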
training learning functions
Define Q = {q_1, q_2, …, q_m} as the query set, where q_i is the i-th query; D = {D_1, D_2, …, D_m} is the set of document collections associated with Q, where D_i = {d_{i,1}, d_{i,2}, …, d_{i,n_i}} is the document set related to query q_i and d_{i,j} is the j-th document in D_i; Y = {1, 2, …, l} is the set of relevance grades, and Y_i = {y_{i,1}, y_{i,2}, …, y_{i,n_i}} is the set of relevance labels of the documents related to query q_i;
From this, the original training set S = {(q_i, D_i, Y_i)}, i = 1, 2, …, m, is obtained. A feature vector x_{i,j} = φ(q_i, d_{i,j}) is generated for each query-document pair (q_i, d_{i,j}), i = 1, 2, …, m; j = 1, 2, …, n_i, where φ is a feature function;
Let X_i = {x_{i,1}, x_{i,2}, …, x_{i,n_i}} and set the training data set S' = {(X_i, Y_i)}, i = 1, 2, …, m. A local ranking model f(q, d) = f(x) is trained to assign a score to a given query-document pair (q, d), and a score list is output for the training data set S';
The documents in D_i are ranked by the score list. A ranking list π_i is defined as a bijective mapping on the document indices: π_i represents a permutation of all documents in D_i, and π_i(j) denotes the position of the j-th document in π_i. Ranking with f(q_i, d_i) selects, for query q_i, a permutation π_i ∈ Π_i of the document set D_i;
For a test set containing new queries and new documents, feature vectors x_{m+1} are created and scored with the trained ranking model; ranking by score yields π_{m+1};
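The training and scoring steps described above can be sketched as follows. The description does not fix a particular learning algorithm, so this sketch uses a simple linear model f(x) = w·x trained with a pointwise squared loss as an assumed stand-in; the feature values and labels are illustrative:

```python
def train_linear(X, y, lr=0.01, epochs=500):
    """Fit w so that w . x approximates the relevance label y
    (least squares via stochastic gradient descent)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - target
            for i, xi in enumerate(x):
                w[i] -= lr * err * xi
    return w

def rank(w, X):
    """Return pi: document indices sorted by descending score f(x) = w . x."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in X]
    return sorted(range(len(X)), key=lambda j: -scores[j])

# Training data S': feature vectors x_{i,j} with relevance labels y_{i,j}
X_train = [[3.0, 0.9], [1.0, 0.2], [0.0, 0.0], [2.0, 0.7]]
y_train = [3, 1, 0, 2]
w = train_linear(X_train, y_train)

# A new query's documents are scored with the trained model and
# ranked by score to obtain the permutation pi
X_new = [[0.5, 0.1], [2.5, 0.8], [1.5, 0.4]]
pi = rank(w, X_new)  # best-scoring document first
```

Any learning-to-rank algorithm (pointwise, pairwise, or listwise) could replace `train_linear` here; the surrounding scoring-and-permutation machinery stays the same.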
The performance of the ranking model is evaluated by comparing the ranking list output by the ranking model with the ranking list given as ground truth. Given a query q_i and its related documents D_i, let π_i be a ranking list of D_i and y_i the relevance labels of D_i, measured by DCG; the DCG at position k is then:
DCG(k) = Σ_{j: π_i(j) ≤ k} G(y_{i,j}) · D(π_i(j))
where G is a gain function, D is a position discount function (commonly G(y) = 2^y - 1 and D(r) = 1/log_2(1 + r)), and π_i(j) is the position of d_{i,j} in π_i;
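The DCG computation above can be sketched as follows, using the common choices G(y) = 2^y - 1 for the gain and D(r) = 1/log2(1 + r) for the position discount; these particular functions are conventional assumptions, since the description only requires some gain and some discount:

```python
import math

def dcg_at_k(labels_in_rank_order, k):
    """DCG(k) = sum over the top-k positions r of G(y_r) * D(r),
    with gain G(y) = 2**y - 1 and discount D(r) = 1 / log2(1 + r)."""
    total = 0.0
    for r, y in enumerate(labels_in_rank_order[:k], start=1):
        total += (2 ** y - 1) / math.log2(1 + r)
    return total

# Relevance labels of one query's documents, listed in ranked order
labels = [3, 2, 3, 0, 1]
score = dcg_at_k(labels, 3)  # 7/1 + 3/log2(3) + 7/2, approximately 12.39
```

Dividing by the DCG of the ideal (label-sorted) ranking would give the normalized variant NDCG, which is often reported instead of raw DCG.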
The model is evaluated with the MAP (mean average precision) metric as a performance statistic. Relevance is divided into two levels, 1 and 0, and, given a query q_i, its related documents D_i, a ranking list π_i of D_i, and the relevance labels y_i of D_i, the average precision of query q_i is:
AP(q_i) = ( Σ_{j=1}^{n_i} P(j) · y_{i,j} ) / ( Σ_{j=1}^{n_i} y_{i,j} )
where P(j) is the precision at the ranked position of document d_{i,j}. The precision is measured at every relevant position in this way, and the average-precision values of all queries are averaged to obtain the MAP.
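The MAP computation above can be sketched as follows; the per-query label lists are illustrative, with binary relevance given in ranked order:

```python
def average_precision(labels_in_rank_order):
    """AP = (sum over relevant positions r of precision@r) / (# relevant docs)."""
    hits, total = 0, 0.0
    for r, y in enumerate(labels_in_rank_order, start=1):
        if y == 1:
            hits += 1
            total += hits / r  # precision at position r
    return total / hits if hits else 0.0

def mean_average_precision(per_query_labels):
    """MAP: mean of the average-precision values over all queries."""
    aps = [average_precision(labels) for labels in per_query_labels]
    return sum(aps) / len(aps)

# Two queries with binary relevance labels in ranked order
queries = [[1, 0, 1, 0], [0, 1, 1]]
map_score = mean_average_precision(queries)  # (5/6 + 7/12) / 2 = 17/24
```

Note that AP only credits precision at the positions of relevant documents, so pushing relevant documents toward the top of the ranking directly raises the score.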
The embodiment of the invention also provides computer equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the computer program is executed by the processor, so that the processor executes the steps of the information retrieval result sorting method.
The embodiment of the invention also provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the processor is enabled to execute the steps of the information retrieval result sorting method.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).
The embodiment of the invention provides an information retrieval result sorting method, together with a computer device and a computer-readable storage medium based on it. Query-related information, such as click data, web-page anchor text, and PageRank scores, is merged into a feature model, and a ranking model is automatically constructed using learning-to-rank technology, giving the method broad application prospects in information retrieval, natural language processing, data mining, and related fields.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (5)
1. An information retrieval result sorting method is characterized by comprising the following steps:
annotating training data
Searching for documents relevant to the query, and ranking the documents according to the relevance;
text feature extraction
Determining the features of a document, converting the document into a feature vector, and forming a training example containing the feature vector and the relevance label;
training learning functions
Define Q = {q_1, q_2, …, q_m} as the query set, where q_i is the i-th query; D = {D_1, D_2, …, D_m} is the set of document collections associated with Q, where D_i = {d_{i,1}, d_{i,2}, …, d_{i,n_i}} is the document set related to query q_i and d_{i,j} is the j-th document in D_i; Y = {1, 2, …, l} is the set of relevance grades, and Y_i = {y_{i,1}, y_{i,2}, …, y_{i,n_i}} is the set of relevance labels of the documents related to query q_i;
From this, the original training set S = {(q_i, D_i, Y_i)}, i = 1, 2, …, m, is obtained. A feature vector x_{i,j} = φ(q_i, d_{i,j}) is generated for each query-document pair (q_i, d_{i,j}), i = 1, 2, …, m; j = 1, 2, …, n_i, where φ is a feature function;
Let X_i = {x_{i,1}, x_{i,2}, …, x_{i,n_i}} and set the training data set S' = {(X_i, Y_i)}, i = 1, 2, …, m. A local ranking model f(q, d) = f(x) is trained to assign a score to a given query-document pair (q, d), and a score list is output for the training data set S';
The documents in D_i are ranked by the score list. A ranking list π_i is defined as a bijective mapping on the document indices: π_i represents a permutation of all documents in D_i, and π_i(j) denotes the position of the j-th document in π_i. Ranking with f(q_i, d_i) selects, for query q_i, a permutation π_i ∈ Π_i of the document set D_i;
For a test set containing new queries and new documents, feature vectors x_{m+1} are created and scored with the trained ranking model; ranking by score yields π_{m+1};
Evaluating the performance of the ranking model;
Evaluating the model with the MAP (mean average precision) metric.
2. The method as claimed in claim 1, wherein the performance evaluation of the ranking model comprises: comparing the ranking list output by the ranking model with the ranking list given as ground truth, wherein, given a query q_i and its related documents D_i, π_i is a ranking list of D_i and y_i is the set of relevance labels of D_i, measured by DCG, and the DCG at position k is:
DCG(k) = Σ_{j: π_i(j) ≤ k} G(y_{i,j}) · D(π_i(j))
where G is a gain function, D is a position discount function, and π_i(j) is the position of d_{i,j} in π_i.
3. The method of claim 1, wherein the step of evaluating the model with the MAP metric comprises:
dividing relevance into two levels, 1 and 0, wherein, given a query q_i, its related documents D_i, a ranking list π_i of D_i, and the relevance labels y_i of D_i, the average precision of query q_i is:
AP(q_i) = ( Σ_{j=1}^{n_i} P(j) · y_{i,j} ) / ( Σ_{j=1}^{n_i} y_{i,j} )
where P(j) is the precision at the ranked position of document d_{i,j}; and measuring the precision at every relevant position in this way and averaging the average-precision values of all queries to obtain the MAP.
4. A computer arrangement, comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to carry out the steps of the information retrieval result ranking method of any of claims 1 to 3.
5. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the steps of the information retrieval result ranking method according to any of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010656908.3A CN111831936A (en) | 2020-07-09 | 2020-07-09 | Information retrieval result sorting method, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010656908.3A CN111831936A (en) | 2020-07-09 | 2020-07-09 | Information retrieval result sorting method, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111831936A true CN111831936A (en) | 2020-10-27 |
Family
ID=72901268
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010656908.3A Pending CN111831936A (en) | 2020-07-09 | 2020-07-09 | Information retrieval result sorting method, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111831936A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113806660A (en) * | 2021-09-17 | 2021-12-17 | 北京百度网讯科技有限公司 | Data evaluation method, training method, device, electronic device and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090024607A1 (en) * | 2007-07-20 | 2009-01-22 | Microsoft Corporation | Query selection for effectively learning ranking functions |
US20090037401A1 (en) * | 2007-07-31 | 2009-02-05 | Microsoft Corporation | Information Retrieval and Ranking |
US20090132515A1 (en) * | 2007-11-19 | 2009-05-21 | Yumao Lu | Method and Apparatus for Performing Multi-Phase Ranking of Web Search Results by Re-Ranking Results Using Feature and Label Calibration |
US20100076949A1 (en) * | 2008-09-09 | 2010-03-25 | Microsoft Corporation | Information Retrieval System |
US20100082606A1 (en) * | 2008-09-24 | 2010-04-01 | Microsoft Corporation | Directly optimizing evaluation measures in learning to rank |
US20100250523A1 (en) * | 2009-03-31 | 2010-09-30 | Yahoo! Inc. | System and method for learning a ranking model that optimizes a ranking evaluation metric for ranking search results of a search query |
CN102043776A (en) * | 2009-10-14 | 2011-05-04 | 南开大学 | Inquiry-related multi-ranking-model integration algorithm |
US20160335263A1 (en) * | 2015-05-15 | 2016-11-17 | Yahoo! Inc. | Method and system for ranking search content |
CN108520038A (en) * | 2018-03-31 | 2018-09-11 | 大连理工大学 | A kind of Biomedical literature search method based on Ranking Algorithm |
- 2020-07-09: application CN202010656908.3A filed; patent CN111831936A, status: active, Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090024607A1 (en) * | 2007-07-20 | 2009-01-22 | Microsoft Corporation | Query selection for effectively learning ranking functions |
US20090037401A1 (en) * | 2007-07-31 | 2009-02-05 | Microsoft Corporation | Information Retrieval and Ranking |
US20090132515A1 (en) * | 2007-11-19 | 2009-05-21 | Yumao Lu | Method and Apparatus for Performing Multi-Phase Ranking of Web Search Results by Re-Ranking Results Using Feature and Label Calibration |
US20100076949A1 (en) * | 2008-09-09 | 2010-03-25 | Microsoft Corporation | Information Retrieval System |
US20100082606A1 (en) * | 2008-09-24 | 2010-04-01 | Microsoft Corporation | Directly optimizing evaluation measures in learning to rank |
US20100250523A1 (en) * | 2009-03-31 | 2010-09-30 | Yahoo! Inc. | System and method for learning a ranking model that optimizes a ranking evaluation metric for ranking search results of a search query |
CN102043776A (en) * | 2009-10-14 | 2011-05-04 | 南开大学 | Inquiry-related multi-ranking-model integration algorithm |
US20160335263A1 (en) * | 2015-05-15 | 2016-11-17 | Yahoo! Inc. | Method and system for ranking search content |
CN108520038A (en) * | 2018-03-31 | 2018-09-11 | 大连理工大学 | A kind of Biomedical literature search method based on Ranking Algorithm |
Non-Patent Citations (4)
Title |
---|
周祖坤; 杨光; 冯小坤: "Learning-to-rank algorithms for document information retrieval", 自动化技术与应用 (Techniques of Automation and Applications), no. 02 *
王扬; 黄亚楼; 谢茂强; 刘杰; 卢敏; 廖振: "A multi-query-dependent fusion algorithm for ranking support vector machines", 计算机研究与发展 (Journal of Computer Research and Development), no. 04 *
蔡飞; 陈洪辉; 舒振: "Research on learning-to-rank algorithms based on user relevance feedback", 国防科技大学学报 (Journal of National University of Defense Technology), no. 02 *
薛剑; 吕立; 孙咏; 王丹妮: "Research on a listwise learning-to-rank method using a position-based loss", 小型微型计算机系统 (Journal of Chinese Computer Systems), no. 01 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113806660A (en) * | 2021-09-17 | 2021-12-17 | 北京百度网讯科技有限公司 | Data evaluation method, training method, device, electronic device and storage medium |
CN113806660B (en) * | 2021-09-17 | 2024-04-26 | 北京百度网讯科技有限公司 | Data evaluation method, training device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110502621B (en) | Question answering method, question answering device, computer equipment and storage medium | |
CN110674429B (en) | Method, apparatus, device and computer readable storage medium for information retrieval | |
Niwattanakul et al. | Using of Jaccard coefficient for keywords similarity | |
CN102236640B (en) | Disambiguation of named entities | |
US9189548B2 (en) | Document search engine including highlighting of confident results | |
Balog et al. | Formal models for expert finding in enterprise corpora | |
CN102902806B (en) | A kind of method and system utilizing search engine to carry out query expansion | |
CN110377558B (en) | Document query method, device, computer equipment and storage medium | |
CN110321408B (en) | Searching method and device based on knowledge graph, computer equipment and storage medium | |
US8019758B2 (en) | Generation of a blended classification model | |
CN110377560B (en) | Method and device for structuring resume information | |
CN108182186B (en) | Webpage sorting method based on random forest algorithm | |
WO2011152925A2 (en) | Detection of junk in search result ranking | |
CN105045875A (en) | Personalized information retrieval method and apparatus | |
CN105653562A (en) | Calculation method and apparatus for correlation between text content and query request | |
CN113821646A (en) | Intelligent patent similarity searching method and device based on semantic retrieval | |
CN111026787A (en) | Network point retrieval method, device and system | |
CN111831936A (en) | Information retrieval result sorting method, computer equipment and storage medium | |
CN111723179A (en) | Feedback model information retrieval method, system and medium based on concept map | |
CN111966869A (en) | Phrase extraction method and device, electronic equipment and storage medium | |
CN115630144A (en) | Document searching method and device and related equipment | |
CN107423298B (en) | Searching method and device | |
CN112765311A (en) | Method for searching referee document | |
CN112163065A (en) | Information retrieval method, system and medium | |
CN101661480A (en) | Method and system for ensuring name of organization in different languages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||