CN111919208A - Scoring documents in document retrieval - Google Patents

Scoring documents in document retrieval

Info

Publication number
CN111919208A
CN111919208A (application CN201980022597.0A)
Authority
CN
China
Prior art keywords
matching
query
score
term
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980022597.0A
Other languages
Chinese (zh)
Inventor
邓维维
顾晨
谭屯子
符丁
方一鑫
罗丹
张祺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC filed Critical Microsoft Technology Licensing LLC
Publication of CN111919208A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification

Abstract

The present disclosure provides methods and apparatuses for scoring documents in document retrieval. In some implementations, a query may be received, the query including one or more terms. A database may be searched to obtain a document containing at least one of the one or more terms. Each of the one or more terms may be classified according to whether the term is contained in the document, each classified term being either a matching term or a non-matching term. A matching pattern between the query and the document may be obtained, the matching pattern indicating a combination of the classified terms. A relevance score of the document relative to the query may be determined based at least on the matching pattern.

Description

Scoring documents in document retrieval
Background
Information retrieval (e.g., document retrieval) handles the representation, storage, organization of, and access to information items. In document retrieval, a plurality of documents may be obtained for a given query by searching. Inverted indexes have been widely used for document searching, particularly for large document collections, such as web searches, sponsored searches, and e-commerce searches. When multiple documents are obtained, a scoring process may be applied to the obtained documents to determine respective scores for the documents. The top k documents may then be selected from the searched documents and presented for the given query, based on their scores as determined by some scoring method.
Disclosure of Invention
This summary is provided to introduce a selection of concepts that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure propose methods and apparatuses for scoring documents in document retrieval. In some implementations, a query may be received, the query including one or more terms. A database may be searched to obtain a document containing at least one of the one or more terms. Each of the one or more terms may be classified according to whether the term is contained in the document, each classified term being either a matching term or a non-matching term. A matching pattern between the query and the document may be obtained, the matching pattern indicating a combination of the classified terms. A relevance score of the document relative to the query may be determined based at least on the matching pattern.
It should be noted that one or more of the above aspects include the features described in detail below and particularly pointed out in the claims. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects. These features are indicative of but a few of the various ways in which the principles of various aspects may be employed and the present disclosure is intended to include all such aspects and their equivalents.
Drawings
The disclosed aspects will hereinafter be described in conjunction with the appended drawings, which are provided to illustrate, but not to limit, the disclosed aspects.
FIG. 1 illustrates an exemplary architecture of a system for document retrieval over a network, according to an embodiment.
FIG. 2 illustrates an exemplary process for document retrieval, according to an embodiment.
FIG. 3 illustrates an exemplary inverted index showing exemplary matching patterns between documents and queries, according to embodiments.
FIG. 4 illustrates an exemplary Deep Neural Network (DNN) scoring model for document retrieval, according to an embodiment.
FIG. 5 illustrates another exemplary process for document retrieval, according to an embodiment.
FIG. 6(A) illustrates an exemplary inverted index prior to cursor movement, according to an embodiment.
FIG. 6(B) illustrates another exemplary inverted index after cursor movement, according to an embodiment.
FIG. 7 illustrates a flow diagram of an exemplary method for scoring documents in document retrieval, according to an embodiment.
FIG. 8 illustrates an exemplary apparatus for scoring documents in document retrieval, according to an embodiment.
FIG. 9 illustrates another exemplary apparatus for scoring documents in document retrieval, according to an embodiment.
Detailed Description
The present disclosure will now be discussed with reference to several exemplary embodiments. It is to be understood that the discussion of the embodiments is merely intended to enable those skilled in the art to better understand and thereby practice the embodiments of the present disclosure, and is not intended to suggest any limitation as to the scope of the disclosure.
In information retrieval (IR), such as document retrieval, given a query Q and a document database D, the top k documents may be returned according to some scoring function score(d, Q), d ∈ D. Herein, the score of a document relative to a query represents the relevance between the query and the document, and may be referred to as a relevance score. For top-k document retrieval in existing systems, especially for text retrieval tasks, the inverted index is usually the main choice for capturing the original text representation. In document retrieval, the relevance scores of the documents are used to rank the respective documents so that the top k documents can be returned and presented to the user. For most current scoring functions, a re-ranking process may be required to optimize the candidates for the best top k documents. This adds the extra cost of ranking the candidates in two phases, e.g., a top-k retrieval phase and a re-ranking phase. Furthermore, because current scoring functions in document retrieval are unsupervised, e.g., based on statistical formulas or probabilistic models, potential candidates may be lost in the top-k retrieval phase due to the limitations of the scoring function. Such scoring functions cannot benefit from available sets of labeled data, and only the matching terms in the query are considered in document scoring.
Embodiments of the present disclosure provide methods for scoring documents in document retrieval, for example, by utilizing a supervised machine learning (ML) scoring function that considers the matching and non-matching terms between a query and a document, the coexistence of matching and non-matching terms, and the locations of matching and non-matching terms in the query. As used herein, a matching term refers to a term in the query that is contained in the document, while a non-matching term refers to a term in the query that is not contained in the document. In some embodiments, logistic regression (LR) and deep neural network (DNN) models may be used as examples of machine learning scoring models. As used herein, a "term" in a query may refer to an index unit, which may include one or more of characters, words, phrases, numbers, and the like. The proposed machine learning scoring function outperforms conventional scoring functions in both accuracy and efficiency of document retrieval, and can be applied to various tasks such as web search or recommendation systems. For example, the machine learning scoring function may be used for search recommendations, news recommendations, advertisement recommendations, product recommendations, question-and-answer applications, and the like.
FIG. 1 illustrates an exemplary architecture of a system 100 for document retrieval over a network, according to an embodiment.
In fig. 1, network 110 interconnects terminal device 120 and server 130.
Network 110 may be any type of network capable of interconnecting network entities, and may be a single network or a combination of networks. In terms of coverage, network 110 may be a local area network (LAN), a wide area network (WAN), etc. In terms of carrier medium, network 110 may be a wired network, a wireless network, etc. In terms of data switching technology, network 110 may be a circuit-switched network, a packet-switched network, etc.
Terminal device 120 may be any type of electronic computing device capable of connecting to network 110, accessing a server or website on network 110, processing data or signals, and so forth. For example, terminal device 120 may be a desktop computer, a laptop computer, a tablet computer, a smart phone, an AI terminal, and the like. Although only one terminal device is shown in fig. 1, it should be understood that a different number of terminal devices may be connected to network 110.
In an implementation, terminal device 120 may include a query input component 121, a display interface 122, a memory 123, a processor 124, and at least one communication connection 125 for connecting to network 110.
The query input component 121 may be configured to: a search query is received from a user attempting to obtain retrieved information (e.g., documents) via an input device (not shown in fig. 1) in communication with terminal device 120. For example, an input device may include a mouse, joystick, keys, microphone, I/O components, or any other component capable of receiving user input and transmitting an indication of that input to terminal device 120. A search query or query as used herein may include one or more terms that reflect a user's intent to attempt to obtain retrieved information or documents from a database 160 (e.g., a website, cloud storage, document corpus, etc.) that include at least one term in the query. In some examples, document retrieval may be performed by utilizing an inverted index from database 160. Database 160 may include a large number of documents, such as web pages, advertisements, news, and the like.
The top k retrieved documents may be returned from server 130 and presented to the user via display interface 122. Display interface 122 may be a component of terminal device 120 itself, or may be incorporated into a third-party device external to terminal device 120, such as a display, television, or projector.
Memory 123 may be one or more devices for storing data, which may include read-only memory (ROM), random access memory (RAM), magnetic RAM, core memory, magnetic disk storage media, optical storage media, flash memory, and/or other machine-readable media for storing information. The term "machine-readable medium" may include, but is not limited to, portable or fixed storage devices, optical storage devices, wireless channels, and various other media capable of storing, containing, or carrying instruction(s) and/or data.
Processor 124 may include any number of processing units and is programmed to execute computer-executable instructions for implementing aspects of the present disclosure. In some examples, processor 124 is programmed to execute computer-executable instructions stored in memory 123 to implement a method such as that shown in fig. 7.
Communication connection 125 may include a wireless connection or a wired connection to connect terminal device 120 to network 110 or any other device.
The server 130 may be connected to a database 160. Although not shown in fig. 1, server 130 may incorporate database 160. Server 130 may provide the data or retrieved documents to terminal device 120. As one example, server 130 may be a web server that provides data over a network. Server 130 may provide data on a website to terminal device 120 via network 110.
The server 130 can include a query receiving component 140 and a scoring model 150 that includes a feature weight index 151. The query receiving component 140 may receive a query input by a user from a terminal device 120, for example, over the network 110. The received query may be provided to scoring model 150 for searching database 160 for one or more documents that match the query. That is, the matching documents contain one or more terms in the query. A plurality of features associated with the query may be generated from the query, each feature having a weight in the feature weight index 151. The scoring model 150 may calculate or determine a score for a matching document based at least on the features generated from the query and the corresponding weights in the feature weight index 151. Although the feature weight index 151 is shown as being included in the scoring model 150 in fig. 1, it may be stored in a storage unit separate from the scoring model 150 and/or the server 130, and may provide the scoring model 150 with a respective weight for each feature (if applicable). The top k documents may be selected from the matching documents based on their scores and provided to the terminal device 120, e.g., via the network 110, for presentation to the user via the display interface 122.
Although the query receiving component 140 and scoring model 150 are shown in fig. 1 as being incorporated into the server 130, they may be configured as separate components from the server 130.
It should be understood that all of the components shown in fig. 1 may be exemplary, and any of the components in terminal device 120 may be added, omitted, or replaced in fig. 1 according to particular application needs.
FIG. 2 illustrates an exemplary process 200 for document retrieval, according to an embodiment.
At 202, a query Q may be received, for example, from a user. The query may be a search query containing one or more terms. For example, the query "home depot wood floor" may be received here. The exemplary terms contained in this query are "home", "depot", "wood", and "floor".
At 204, database D may be searched for documents containing at least one of the one or more terms, for example, by utilizing an inverted index structure. Such documents may be referred to herein as "matching documents". In the example here, documents containing at least one of the terms "home", "depot", "wood", and "floor" may be obtained. Each matching document may be assigned a DocID in the inverted index structure, where the DocID is the ID of a document d ∈ D containing at least one term and may be displayed as Doc1, Doc3, Doc7, etc., as in the simple inverted index structure in fig. 3, or simply as <1>, <3>, <7>. The inverted index structure as used herein may include two main parts: a dictionary and posting lists. The dictionary contains the terms, each of which may be followed by attributes such as inverse document frequency (IDF), term frequency (TF), the upper bound of the associated posting list, and the posting list's address, which are not shown in this disclosure. Items such as IDF and TF are well known, and a detailed description thereof is omitted herein for simplicity. Each posting list may contain a series of postings represented by DocIDs. Each term in the query is mapped to a posting list. For example, for term 1 (in this example, "home") shown in fig. 3, the posting list may be <Doc3, Doc7, Doc9, Doc13, ...>.
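The dictionary-plus-posting-list layout described above can be sketched as a minimal in-memory inverted index. This is an illustrative sketch with a made-up toy corpus (the documents and DocIDs are assumptions, not from the patent):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted posting list of DocIDs containing it."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        # Use a set so a term repeated in one document yields one posting.
        for term in set(text.split()):
            index[term].append(doc_id)
    return dict(index)

# Hypothetical toy corpus echoing the shape of fig. 3.
docs = {
    1: "depot floor",
    3: "home floor",
    7: "home depot wood",
}
index = build_inverted_index(docs)
```

Because documents are visited in DocID order, each posting list comes out sorted, which is what cursor-style traversal over an inverted index relies on.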
At 206, each term in the query may be classified according to whether the term is contained in the document. Each classified term is either a matching term or a non-matching term. For example, as shown in fig. 3, for a document Doc3 containing the terms "home" and "floor", term 1 (home) and term 4 (floor) in the query "home depot wood floor" are classified as matching terms between the query and Doc3, and term 2 (depot) and term 3 (wood) are classified as non-matching terms.
At 208, a respective matching pattern between each matching document and the query may be obtained. The matching pattern indicates a combination of the classified terms in the query. Herein, the combination identifies each term in the query, the classification of each term, and the location of each term in the query. That is, the matching pattern indicates the respective location and classification of each term in the query. Herein, a matching pattern is represented by a vector, each dimension of which has a binary value: 1 at the position of a matching term in the query and 0 at the position of a non-matching term. If there are 4 terms in the query and only the first term matches a particular document, the matching pattern between the query and that document may be represented as [1000], where a "1" indicates that the query-document pair has a matching term at that location and a "0" indicates no matching term at that location. Taking the query "home depot wood floor" as an example, if the matching pattern between the query and a specific matching document is denoted as [1001], the pattern indicates that the document contains the first and fourth terms of the query and does not contain the second and third terms. An exemplary matching pattern determination will be explained in conjunction with fig. 3 described below.
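A matching pattern of this form can be computed directly from the query terms and the document terms; a minimal sketch, assuming whitespace-tokenized terms:

```python
def matching_pattern(query_terms, doc_terms):
    """Return a bit-vector: 1 where the query term occurs in the document."""
    doc_set = set(doc_terms)
    return [1 if term in doc_set else 0 for term in query_terms]

query = ["home", "depot", "wood", "floor"]
# A document containing "home" and "floor" yields the pattern [1, 0, 0, 1].
pattern = matching_pattern(query, ["home", "floor", "tiles"])
```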
At 210, a respective relevance score relative to the query may be calculated for each matching document, e.g., by a machine learning scoring model, based at least on the matching patterns obtained at 208. For example, the relevance score of a matching document may be determined by calculating a score corresponding to the matching pattern between the query and the matching document. For ease of explanation, the calculated score corresponding to the matching pattern is hereinafter referred to as the "raw score". The machine learning scoring models may include logistic regression (LR) models, gradient boosting decision tree (GBDT) models, deep neural network (DNN) models, and the like. Herein, the LR model and the DNN model are used below as exemplary scoring models to calculate relevance scores for matching documents. It should be understood that any other suitable scoring model may be used to implement the present disclosure, depending on particular practical requirements.
Taking the query Q "home depot wood floor" and a specific document A containing the terms "home" and "floor" as an example, the calculation of the raw score corresponding to the matching pattern is described below.
Raw score calculation
From the above query Q and document A, the matching pattern between query Q and document A can be obtained as [1001]. Based at least on query Q, and in particular on the matching and non-matching terms in the query and the matching pattern [1001], multiple related features of different types, each having a weight, may be generated. In some examples, the individual weights may be trained offline and pre-assigned to each feature. The raw score corresponding to the matching pattern may then be calculated based on the weights of the plurality of relevant features, e.g., as the sum of the weights of those features.
The weight of each feature may be trained or learned from a set of training data. The training data may include a historical query and a plurality of retrieved historical documents. Each historical document may be tagged with a relevance score that is, for example, calculated based on the number of clicks or manually assigned. The weights of the features may be updated periodically or on demand based on an updated training data set. For example, the updated training data may include retrieved historical documents with updated or different scores and/or documents retrieved for new queries.
In one implementation, the different types may be predefined or pre-designed. For example, the following types may be predefined:
QueryMatchedTerm: a matching term in the query,
QueryUnmatchedTerm: a non-matching term in the query,
QueryMatchedTerm cross QueryUnmatchedTerm: the combination of any matching term and any non-matching term in the query,
QueryMatchedTerm cross QueryMatchedOtherTerm: the combination of any matching term and any other matching term in the query,
QueryLength cross QueryMatchedLength: the combination of the number of terms in the query and the number of matching terms,
QueryContinuousMatchedTerms: consecutive matching terms in the query,
QueryMatchedChunk: a chunk of matching terms in the query, whether consecutive or not,
QueryMatchedUnmatchedBigram: a bigram of one matching term and one non-matching term in the query, e.g., two consecutive terms of which one matches and the other does not,
QueryUnmatchedBigram: a bigram of non-matching terms in the query, e.g., two consecutive non-matching terms,
QueryMatchedTerm cross QueryMatchedPattern: the combination of any matching term in the query and the corresponding matching pattern,
QueryUnmatchedTerm cross QueryMatchedPattern: the combination of any non-matching term in the query and the corresponding matching pattern.
It should be appreciated that although several feature types are listed above, any other suitable feature types may be predefined or pre-designed depending on the particular implementation requirements.
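Several of the feature types above can be generated mechanically from the query and its matching pattern. The sketch below covers only a subset of the types; the feature-string format (`Type:value`) is an illustrative assumption, not the patent's actual encoding:

```python
def generate_features(query_terms, pattern):
    """Generate a subset of the predefined feature types from a matching pattern."""
    matched = [t for t, m in zip(query_terms, pattern) if m]
    unmatched = [t for t, m in zip(query_terms, pattern) if not m]
    feats = []
    # QueryMatchedTerm and QueryUnmatchedTerm features.
    feats += [f"QueryMatchedTerm:{t}" for t in matched]
    feats += [f"QueryUnmatchedTerm:{t}" for t in unmatched]
    # Cross features: every (matching, non-matching) term pair.
    feats += [f"QueryMatchedTermXQueryUnmatchedTerm:{m}_{u}"
              for m in matched for u in unmatched]
    # Query length crossed with the number of matching terms.
    feats.append(f"QueryLengthXQueryMatchedLength:{len(query_terms)}_{len(matched)}")
    return feats

feats = generate_features(["home", "depot", "wood", "floor"], [1, 0, 0, 1])
```

With 2 matching and 2 non-matching terms this yields 2 + 2 + 4 + 1 = 9 features, each of which would look up its trained weight in the feature weight index.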
Table 1 shows exemplary feature types and the features generated from the query Q "home depot wood floor" and the matching pattern [1001].
TABLE 1 (rendered as images in the original publication; for each feature type above, it lists the concrete features generated from the query and its matching pattern)
Herein, the LR model and the DNN model will be discussed below as examples of implementing a machine learning scoring function by using the above-described features.
The log loss (LogLoss) is the loss function of these models, as shown in formula (1):
LogLoss(y, p) = -(y·log(p) + (1 - y)·log(1 - p))    formula (1)
where y represents the label and p(·) is the predicted probability of the model.
The LR model for the score, score(d, Q), is represented by formula (2) below:
score(d, Q) = p(y = 1 | x) = 1 / (1 + e^(-(w^T x + b)))    formula (2)
where "y = 1" represents "the document is relevant to the query", p(·) returns a likelihood value, d represents the document, Q represents the query, w is the weight vector of the query's features, x is the feature input vector, T denotes the transpose, and b denotes the bias.
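Formulas (1) and (2) can be written out directly in code. A minimal sketch with illustrative weights and feature vectors (the trained weights themselves are not given in the text):

```python
import math

def log_loss(y, p):
    """Formula (1): binary cross-entropy for label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def lr_score(weights, x, bias):
    """Formula (2): sigmoid of w^T x + b, i.e., p(y = 1 | x)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative 3-feature input: the first two features fire, the third does not.
p = lr_score([0.5, -0.2, 0.1], [1, 1, 0], bias=0.0)
```

Here z = 0.5 - 0.2 = 0.3, so the score is sigmoid(0.3), a likelihood strictly between 0 and 1, as required for the log loss.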
An exemplary DNN model for calculating a relevance score for a matching document at 210 is described below in conjunction with fig. 4.
In some examples, after training, some features with lower weights may be removed from the feature list for model capacity considerations.
When the raw score corresponding to the exemplary matching pattern [1001] is calculated, a relevance score for the matching document A will be determined based on the raw score, e.g., the same as the raw score, a multiple of the raw score, etc. The relevance score for each matching document may be calculated based on the respective raw scores corresponding to the matching patterns for each matching document.
Optionally, at 212, a plurality of matching documents (including document A) having respective relevance scores may be returned or provided to the user in order as the document retrieval result. For example, the matching documents returned or provided may be the top k matching documents, ranked based on their calculated relevance scores.
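Selecting the top k matching documents by relevance score is a standard partial sort; a minimal sketch (the DocIDs and scores are illustrative):

```python
import heapq

def top_k(scored_docs, k):
    """Return the k (doc_id, score) pairs with the highest scores, descending."""
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[1])

# Illustrative (DocID, relevance score) pairs.
ranked = top_k([(1, 0.31), (3, 0.92), (7, 0.85), (9, 0.40)], k=2)
```

`heapq.nlargest` avoids fully sorting all candidates, which matters when the number of matching documents is much larger than k.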
It should be understood that process 200 may also include any other or alternative steps for document retrieval in accordance with embodiments of the present disclosure.
FIG. 3 illustrates an exemplary inverted index 300 showing an exemplary matching pattern 310 for document 3, according to an embodiment. The inverted index may display each term in the query and a number of documents that match any term in the query.
A simple inverted index is shown in fig. 3. Each row represents an entry in the query and each column represents a document indexed by a DocID. In fig. 3, the exemplary query includes four terms, term 1(home), term 2(depot), term 3(wood), term 4(floor), and for each term in the query, there are several matching documents indicated by the DocID in the inverted index. For example, for entry 1, there are matching documents indicated by Doc3, Doc7, Doc9, Doc13 in the inverted index; for entry 2, there are matching documents indicated by Doc1, Doc7, Doc10, Doc11, and so on. From the inverted index in FIG. 3, matching patterns between queries and documents can be obtained. For example, the matching pattern between the query and Doc1 is [0101 ]; the matching pattern between the query and Doc3 is [1001], which is illustratively highlighted by dashed box 310 in FIG. 3; the matching pattern between the query and Doc7 is [1110], and so on.
It should be understood that the structure of the inverted index 300 shown in FIG. 3 is merely exemplary; any other structure of an inverted index for document retrieval may be possible for the present disclosure.
FIG. 4 illustrates an exemplary Deep Neural Network (DNN) scoring model 400 for document retrieval, according to an embodiment.
As shown in fig. 4, features of different feature types (e.g., features of feature type i, features of feature type j, etc.) are input into the DNN model 400 as sparse inputs. In the embedding layer, instead of one weight per feature, the DNN model 400 applies one weight vector per feature. For example, the weight vector of the a-th feature of feature type i is a k-dimensional vector v_a^(i) ∈ R^k.
In fig. 4, a and b are feature indices for feature types i and j, respectively. Each vector in the embedding layer passes through the pooling layer to obtain one weight vector over all features of each feature type. After the pooled vectors are non-linearly transformed by two hidden layers (e.g., hidden layer 1 and hidden layer 2), a score for the document is generated and output through the output layer.
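The embed-pool-hidden-output flow of fig. 4 can be sketched in plain Python. All sizes, weights, and the choice of sum-pooling and ReLU activations are illustrative assumptions; the patent does not specify them:

```python
import random

def matvec(M, v):
    """Matrix-vector product over plain lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def relu(v):
    return [max(0.0, x) for x in v]

def dnn_score(feature_ids_per_type, embeddings, W1, W2, w_out):
    """Sum-pool each feature type's embedding vectors, concatenate the pooled
    vectors, pass them through two hidden layers, and emit a scalar score."""
    pooled = []
    for ids in feature_ids_per_type:
        vecs = [embeddings[i] for i in ids]
        pooled += [sum(col) for col in zip(*vecs)]  # sum-pooling per dimension
    h1 = relu(matvec(W1, pooled))   # hidden layer 1
    h2 = relu(matvec(W2, h1))       # hidden layer 2
    return sum(w * x for w, x in zip(w_out, h2))  # output layer

random.seed(0)
k = 4  # assumed embedding dimension
embeddings = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(10)]
W1 = [[random.uniform(-1, 1) for _ in range(2 * k)] for _ in range(8)]
W2 = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(8)]
w_out = [random.uniform(-1, 1) for _ in range(8)]
# Two feature types: the first fires features 0 and 3, the second fires feature 7.
score = dnn_score([[0, 3], [7]], embeddings, W1, W2, w_out)
```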
FIG. 5 illustrates another exemplary process 500 for document retrieval, according to an embodiment. The exemplary process 500 optimizes process 200 by considering a plurality of potential matching patterns for a given query and pre-computing scores for those potential matching patterns, which may be performed before, simultaneously with, or after: searching the database to obtain matching documents, classifying each term in the query, and/or obtaining the matching patterns between the query and the matching documents. In some examples, the plurality of potential matching patterns for a given query may include all possible or potential matching patterns for the given query. In some other examples, the plurality of potential matching patterns may include fewer than all of the potential matching patterns.
At 502, a query Q may be received, for example, from a user. The query may be a search query and include one or more terms. For example, the query "home depot wood floor" may be received here. The exemplary terms contained in this query are "home", "depot", "wood", and "floor".
At 504, a plurality of potential matching patterns for query Q may be determined. For example, if there are n terms in query Q, each term may either match or not match the document indicated by the DocID in the inverted index. Thus, up to 2^n potential matching patterns may be determined for query Q, constituting all possible or potential matching patterns. For example, a query containing three terms may have 2^3 = 8 potential matching patterns: [000], [001], [010], [100], [011], [101], [110], [111]. Since an all-zero matching pattern (e.g., [000]) indicates a document matching no term of the query, such a pattern may be ignored when scoring documents for document retrieval. In this case, the number of potential matching patterns for which a score is to be calculated may be less than the total number for the given query, e.g., the 2^n - 1 potential matching patterns excluding the all-zero pattern.
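Enumerating the non-zero matching patterns for an n-term query is a simple bit loop; a minimal sketch:

```python
def potential_patterns(n):
    """All non-all-zero matching patterns for a query of n terms."""
    patterns = []
    for bits in range(1, 2 ** n):  # start at 1 to skip the all-zero pattern
        # Most significant bit corresponds to the first query term.
        patterns.append([(bits >> (n - 1 - i)) & 1 for i in range(n)])
    return patterns

patterns = potential_patterns(3)  # 2^3 - 1 = 7 patterns
```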
At 506, a raw score and/or an upper bound score may be obtained for each of the plurality of potential matching patterns determined at 504.
The raw score corresponding to each potential matching pattern may be obtained by a calculation based on the weights of the plurality of features associated with the query, similar to the raw score calculation described above. Thus, for simplicity, a detailed description of the raw score calculation for each potential matching pattern may be omitted herein.
Herein, the upper bound score corresponding to a matching pattern may be the highest of the raw scores of the sub-patterns of the matching pattern. The sub-patterns of a particular matching pattern are the set of all matching patterns whose matching terms are a subset of the matching terms of that particular pattern, with at least one matching term. For example, if the matching pattern is [101], its sub-patterns are [100], [001], and [101]; if the matching pattern is [010], its only sub-pattern is [010]; if the matching pattern is [111], its sub-patterns are [001], [010], [011], [100], [101], [110], and [111]. The upper bound score of a matching pattern may be used to skip less relevant documents among the plurality of matching documents to improve retrieval performance. The obtaining of the upper bound score corresponding to a matching pattern is described below.
Upper bound score acquisition
Given a query Q = [t1, t2, ..., tm] with m different terms, a raw score may be calculated for each potential matching pattern, and the upper bound score of a particular potential matching pattern may be derived or obtained from the raw scores of its sub-patterns.
If the query has 3 terms {a, b, c}, the upper bound score for each potential matching pattern can be obtained as follows:

upperbound({a, b, c})
= maxScore({1/0, 1/0, 1/0})
= max(maxScore({1/0, 1/0, 0}), maxScore({0, 1/0, 1/0}), maxScore({1/0, 0, 1/0}), score({1, 1, 1}))
= max(upperbound({a, b}), upperbound({b, c}), upperbound({a, c}), score({1, 1, 1}))
Table 2 lists the plurality of potential matching patterns for an exemplary query containing 3 terms, together with the calculated raw score and upper bound score for each potential matching pattern.
Matching pattern    Raw score    Upper bound score
001                 2            2
010                 3            3
011                 7            7
100                 6            6
101                 5            6
110                 8            8
111                 9            9

TABLE 2
It should be understood that the particular raw scores and upper bound scores for the matching patterns shown in Table 2 are merely exemplary, and that other scores may be calculated based on different weights for different features of different queries.
Each upper bound score for a listed potential matching pattern may be obtained as follows:
For the potential matching pattern [001], upperbound([001]) = max{raw score([001])} = 2;
for the potential matching pattern [010], upperbound([010]) = max{raw score([010])} = 3;
for the potential matching pattern [011], upperbound([011]) = max{raw score([001]), raw score([010]), raw score([011])} = 7;
for the potential matching pattern [100], upperbound([100]) = max{raw score([100])} = 6;
for the potential matching pattern [101], upperbound([101]) = max{raw score([001]), raw score([100]), raw score([101])} = 6;
for the potential matching pattern [110], upperbound([110]) = max{raw score([100]), raw score([010]), raw score([110])} = 8;
for the potential matching pattern [111], upperbound([111]) = max{raw score([001]), raw score([010]), raw score([100]), raw score([011]), raw score([101]), raw score([110]), raw score([111])} = 9.
As can be seen from the upper bound scores obtained above and Table 2, although the raw score "5" of the matching pattern [101] (terms 1 and 3 match) is lower than the raw score "6" of its sub-pattern [100] (only term 1 matches), the upper bound score of the pattern [101] is equal to the highest raw score "6" among the scores of its sub-patterns.
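The values in Table 2 can be reproduced with a short sketch (illustrative only; the raw scores below are the exemplary numbers from Table 2, not values prescribed by the method):

```python
# Exemplary raw scores from Table 2, keyed by matching pattern.
raw = {(0, 0, 1): 2, (0, 1, 0): 3, (0, 1, 1): 7, (1, 0, 0): 6,
       (1, 0, 1): 5, (1, 1, 0): 8, (1, 1, 1): 9}

def upper_bound(pattern, raw_scores):
    """Highest raw score over the pattern's sub-patterns, i.e. over all
    patterns whose matching positions are a subset of pattern's."""
    return max(score for q, score in raw_scores.items()
               if all(qb <= pb for qb, pb in zip(q, pattern)))

print(upper_bound((1, 0, 1), raw))  # 6, contributed by sub-pattern (1, 0, 0)
print(upper_bound((1, 1, 1), raw))  # 9
```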
Since documents that do not have any matching terms (i.e., the matching pattern between the query and such documents is "0...00") may not be retrieved or may be of little use in document retrieval, their scores need not be calculated.
The obtained raw score or upper bound score for each potential matching pattern for the query may be fed to operation 514 to determine a relevance score for the matching document.
At 508, the database may be searched for matching documents, similar to operation 204 in FIG. 2. Accordingly, a detailed description of operation 508 may be omitted herein for simplicity.
At 510, each term in the query may be classified according to whether the term is contained in the document, similar to operation 206 in FIG. 2. Accordingly, a detailed description of operation 510 may be omitted herein for the sake of simplicity.
At 512, a corresponding match pattern between each matching document and the query may be obtained, similar to operation 208 in FIG. 2. Accordingly, a detailed description of operation 512 may be omitted herein for the sake of simplicity.
At 514, a respective relevance score for each matching document may be determined for the query based on the raw score or upper bound score corresponding to each potential matching pattern obtained at 506 and the matching pattern between the document and the query obtained at 512. In some examples, the score corresponding to the same potential matching pattern as the matching pattern of the matching document obtained at 512 may be selected from the plurality of raw scores or upper bound scores obtained at 506, and the relevance score of the matching document may be determined according to the selected score.
Optionally, at 516, as a result of the document retrieval, a plurality of matching documents (e.g., the top k documents) with corresponding relevance scores may be sequentially returned or provided to the user, similar to operation 212 in FIG. 2. Therefore, a detailed description of step 516 may be omitted herein for simplicity.
It should be understood that although operations 504, 506 are shown as being performed in parallel with operations 508, 510, 512 in fig. 5, they may be performed before, concurrently with, or after operations 508, 510, and 512.
Fig. 6(a) shows an exemplary inverted index 610 before cursor movement, and fig. 6(B) shows another exemplary inverted index 620 after cursor movement by using an upper bound score, according to an embodiment.
The use of upper bound scores may enable retrieval optimizations over the inverted index. For the m different terms in query Q, there may be m corresponding posting lists, possibly including an empty list, in the inverted index.
Taking a query containing three terms as an example, as shown in FIG. 6(A), there may be three posting lists, each with a cursor. In a retrieval process based on upper bound scores, a threshold may be used to determine a "pivot" document. Herein, a pivot document may refer to a document that may have a score above the threshold. In this example, the threshold θ for determining the pivot document may be assumed to be θ = 8. A pivot document among the plurality of matching documents associated with the query may be determined by ranking the posting lists in the order of the DocIDs to which the cursors currently point. For example, in FIG. 6(A), cursor 1 for term 1 in the query currently points to <4>, cursor 2 for term 2 currently points to <3>, and cursor 3 for term 3 currently points to <7>. The ranked posting lists, based on the term each cursor corresponds to and the DocID it points to, are shown in Table 3 below.
Posting list ID (p)    1         2         3
TermID                 Term 2    Term 1    Term 3
DocID                  <3>       <4>       <7>

TABLE 3
The pivot document may be determined step by step as follows, according to the order in Table 3 in combination with the upper bound scores in Table 2:
For p = 1, upperbound([0,1,0]) = 3, which does not satisfy score > θ (θ = 8), so the retrieval process goes to the next posting list, p = 2;
for p = 2, upperbound([1,1,0]) = 8, which does not satisfy score > θ (θ = 8), so the retrieval process goes to the next posting list, p = 3;
for p = 3, upperbound([1,1,1]) = 9, which is greater than θ (θ = 8), so the retrieval process returns DocID <7> to indicate the pivot document. DocID <7> may be referred to herein as the pivot DocID.
When the pivot DocID is determined, as shown in FIG. 6(A), the cursors for term 1 and term 2 (cursor 1 and cursor 2) may each be moved to the first DocID that is greater than or equal to the pivot DocID <7> of term 3, so that the score of the pivot document can be fully evaluated.
FIG. 6(B) shows the state after the cursors are moved, in which cursor 1 points to <10> and cursor 2 points to <7>. Thus, the matching pattern of document <7> is "011", and based on the matching pattern "011", document <7> has a score of 7, which is smaller than the threshold 8. Therefore, document <7> may be skipped, and cursor 2 and cursor 3 may be moved to their corresponding next DocIDs. The posting lists may be ranked again based on the current DocIDs to find the next pivot DocID, until all posting lists have moved to the end, or until the next pivot cannot be found because each of the remaining upper bound scores is less than or equal to the threshold θ.
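The pivot-selection step just described (a WAND-style strategy) may be sketched as follows. This is a simplified illustration, not the patented implementation; all names, and the modeling of posting lists as plain Python lists, are assumptions. The example data mirrors FIG. 6(A) and the upper bound scores of Table 2.

```python
def find_pivot(postings, cursors, upper_bound, theta):
    """Return the pivot DocID, or None if no document can beat theta.

    postings: term index -> sorted list of DocIDs; cursors: term index ->
    current offset; upper_bound: matching pattern (tuple) -> bound score.
    """
    n = len(postings)
    live = [t for t in range(n) if cursors[t] < len(postings[t])]
    # Rank posting lists by the DocID each cursor currently points to.
    ranked = sorted(live, key=lambda t: postings[t][cursors[t]])
    pattern = [0] * n
    for t in ranked:  # accumulate terms until the bound clears theta
        pattern[t] = 1
        if upper_bound[tuple(pattern)] > theta:
            return postings[t][cursors[t]]  # pivot DocID
    return None

# FIG. 6(A): cursor 1 -> <4>, cursor 2 -> <3>, cursor 3 -> <7>; theta = 8.
postings = {0: [4, 10], 1: [3, 7], 2: [7, 9]}
cursors = {0: 0, 1: 0, 2: 0}
ub = {(0, 1, 0): 3, (1, 1, 0): 8, (1, 1, 1): 9}
print(find_pivot(postings, cursors, ub, 8))  # 7
```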
Algorithm 1 below presents pseudo code for recursively computing the upper bound score of a matching pattern as described above.
Algorithm 1 CalculateUpperBound algorithm
Input: postingListSet, a pattern in which each element is either "true" or "false"; patternLen, the number of "true" values in postingListSet.
(The body of the pseudo code is shown as figures in the original publication and is not reproduced here.)
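Since the pseudo code body is available only as figures, the following is a speculative Python reconstruction of the recursive computation, with memoization standing in for whatever bookkeeping Algorithm 1 actually uses; function names are hypothetical. The raw scores are the exemplary values from Table 2.

```python
from functools import lru_cache

def make_upper_bound(raw_score):
    """Build a memoized upper bound function from a dict mapping each
    non-zero matching pattern (tuple of 0/1 flags) to its raw score."""
    @lru_cache(maxsize=None)
    def upper_bound(pattern):
        best = raw_score[pattern]
        ones = [i for i, bit in enumerate(pattern) if bit]
        if len(ones) > 1:
            for i in ones:  # recurse with one matching term dropped
                child = list(pattern)
                child[i] = 0
                best = max(best, upper_bound(tuple(child)))
        return best
    return upper_bound

raw = {(0, 0, 1): 2, (0, 1, 0): 3, (0, 1, 1): 7, (1, 0, 0): 6,
       (1, 0, 1): 5, (1, 1, 0): 8, (1, 1, 1): 9}
ub = make_upper_bound(raw)
print(ub((1, 0, 1)), ub((1, 1, 1)))  # 6 9
```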
FIG. 7 illustrates a flow diagram of an exemplary method 700 for scoring documents in document retrieval, according to an embodiment.
At 710, a query is received, wherein the query includes one or more terms.
At 720, the database can be searched for a document containing at least one of the one or more terms.
At 730, each of the one or more terms is classified according to whether it is contained in the document. Each classified term may be a matching term or a non-matching term.
At 740, matching patterns between the query and the document are obtained. The matching pattern may indicate an integration of each classified entry.
At 750, a relevance score for the document relative to the query is determined based at least on the matching pattern.
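Steps 710-750 can be strung together in a brief sketch (illustrative Python only; modeling documents as term sets and using a precomputed pattern-to-score lookup are assumptions, not requirements of the method):

```python
def retrieve_and_score(query_terms, documents, pattern_scores):
    """Classify each query term per document (730), form the matching
    pattern (740), and look up the relevance score (750)."""
    results = {}
    for doc_id, doc_terms in documents.items():
        pattern = tuple(int(t in doc_terms) for t in query_terms)
        if any(pattern):  # 720: keep documents containing at least one term
            results[doc_id] = pattern_scores[pattern]
    return results

docs = {7: {"a", "b"}, 9: {"c"}, 11: {"x"}}
scores = {(1, 1, 0): 8, (0, 0, 1): 2}
print(retrieve_and_score(["a", "b", "c"], docs, scores))  # {7: 8, 9: 2}
```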
In one implementation, the integration identifies each term in the query, a classification of each term, and a location of each term in the query.
In one implementation, the method 700 may further include: determining a plurality of potential matching patterns for the query; and obtaining a plurality of scores corresponding to the plurality of potential matching patterns by calculating a score corresponding to each potential matching pattern of the plurality of potential matching patterns.
In one implementation, the relevance score is determined by selecting a score from a plurality of scores that corresponds to the same potential matching pattern as the matching pattern.
In one implementation, calculating a score corresponding to each potential matching pattern includes: generating a plurality of features of different types for the potential matching pattern based on the potential matching pattern and the query, wherein each feature has a weight; and calculating a score corresponding to the potential matching pattern based on the weights of the plurality of features.
In one implementation, calculating a score corresponding to each potential matching pattern includes: calculating at least one raw score corresponding to at least one sub-pattern of the potential matching pattern; and obtaining a score corresponding to the potential matching pattern by selecting a highest raw score among the at least one raw score.
In one implementation, the relevance score is determined by calculating a score corresponding to the matching pattern.
In one implementation, calculating a score corresponding to the matching pattern includes: generating a plurality of features of different types for the matching pattern based on the matching pattern and the query, wherein each feature has a weight; and calculating a score corresponding to the matching pattern based on the weights of the plurality of features.
In one implementation, the different types include at least one of: a matching term in the query, a non-matching term in the query, a combination of any matching term and any non-matching term in the query, a combination of any matching term and any other matching term in the query, a combination of the number of terms in the query and the number of matching terms, consecutive matching terms in the query, a block of matching terms in the query, a bigram of any matching term and any non-matching term in the query, a bigram of non-matching terms in the query, a combination of any matching term in the query and the corresponding matching pattern, and a combination of any non-matching term in the query and the corresponding matching pattern.
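A few of the listed feature types can be illustrated with a toy scorer; the feature keys and weights below are made-up values for illustration, not trained weights:

```python
def score_pattern(terms, pattern, weights):
    """Sum the weights of some of the feature types listed above;
    features absent from `weights` contribute 0."""
    matched = [t for t, bit in zip(terms, pattern) if bit]
    unmatched = [t for t, bit in zip(terms, pattern) if not bit]
    feats = [("match", t) for t in matched]             # matching terms
    feats += [("miss", t) for t in unmatched]           # non-matching terms
    feats += [("match_pair", a, b)                      # matching-term pairs
              for i, a in enumerate(matched) for b in matched[i + 1:]]
    feats.append(("counts", len(terms), len(matched)))  # term/match counts
    return sum(weights.get(f, 0.0) for f in feats)

w = {("match", "a"): 2.0, ("match", "c"): 1.5,
     ("match_pair", "a", "c"): 0.5, ("counts", 3, 2): 1.0}
print(score_pattern(["a", "b", "c"], (1, 0, 1), w))  # 5.0
```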
In one implementation, the weight of each feature is trained based on a set of training data. Each training data may include a historical query and a plurality of retrieved historical documents, and each historical document may be labeled with a relevance value.
In one implementation, the matching pattern is represented by a vector, and the value of each dimension of the vector is zero or one.
It should be appreciated that method 700 may also include any steps/operations for scoring documents in document retrieval as described above in accordance with embodiments of the present disclosure.
FIG. 8 illustrates an exemplary apparatus 800 for scoring documents in document retrieval, according to an embodiment.
The apparatus 800 may include: a query receiving module 810 for receiving a query, the query including one or more terms; a search module 820 for searching a database to obtain a document containing at least one of the one or more terms; a classification module 830 for classifying each of the one or more terms according to whether the term is contained in the document, wherein each classified term is a matching term or a non-matching term; a matching pattern obtaining module 840 for obtaining a matching pattern between the query and the document, the matching pattern indicating an integration of each classified term; and a score determination module 850 for determining a relevance score of the document relative to the query based at least on the matching pattern.
In one implementation, the integration identifies each term in the query, a classification of each term, and location information for each term in the query.
In one implementation, the matching pattern obtaining module 840 is further configured to determine a plurality of potential matching patterns for the query, and the score determining module 850 is further configured to: a plurality of scores corresponding to the plurality of potential matching patterns is obtained by calculating a score corresponding to each potential matching pattern of the plurality of potential matching patterns.
In one implementation, the relevance score is determined by selecting a score from a plurality of scores that corresponds to the same potential matching pattern as the matching pattern.
In one implementation, the score determination module 850 is further configured to: generating a plurality of features of different types for the potential matching pattern based on the potential matching pattern and the query, wherein each feature has a weight; and calculating a score corresponding to the potential matching pattern based on the weights of the plurality of features.
In one implementation, the different types include at least one of: a matching term in the query, a non-matching term in the query, a combination of any matching term and any non-matching term in the query, a combination of any matching term and any other matching term in the query, a combination of the number of terms in the query and the number of matching terms, consecutive matching terms in the query, a block of matching terms in the query, a bigram of any matching term and any non-matching term in the query, a bigram of non-matching terms in the query, a combination of any matching term in the query and the corresponding matching pattern, and a combination of any non-matching term in the query and the corresponding matching pattern.
In one implementation, the score determination module 850 is further configured to: calculating at least one raw score corresponding to at least one sub-pattern of the potential matching pattern; and obtaining a score corresponding to the potential matching pattern by selecting a highest raw score among the at least one raw score.
In one implementation, the relevance score is determined by calculating a score corresponding to the matching pattern.
Additionally, apparatus 800 may also include any other modules configured to score documents in document retrieval according to embodiments of the present disclosure as described above.
FIG. 9 illustrates another exemplary apparatus 900 for scoring documents in document retrieval, according to an embodiment.
The apparatus 900 may include at least one processor 910. The apparatus 900 may also include a memory 920 coupled to the processor 910. The memory 920 may store computer-executable instructions that, when executed, cause the processor 910 to perform any of the operations of the method for scoring documents in document retrieval according to embodiments of the present disclosure as described above.
Embodiments of the present disclosure may be embodied in non-transitory computer readable media. A non-transitory computer-readable medium may include instructions that, when executed, cause one or more processors to perform any operations of a method for scoring documents in document retrieval according to embodiments of the present disclosure as described above.
It should be understood that all operations in the methods described above are exemplary only, and the present disclosure is not limited to any operations in the methods or the order of the operations, but rather should encompass all other equivalents under the same or similar concepts.
It should also be understood that all of the modules in the above described apparatus may be implemented in various ways. These modules may be implemented as hardware, software, or a combination thereof. In addition, any of these modules may be further divided functionally into sub-modules or combined together.
The processor has been described in connection with various apparatus and methods. These processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software depends upon the particular application and the overall design constraints imposed on the system. By way of example, the processor, any portion of the processor, or any combination of processors presented in this disclosure may be implemented as a microprocessor, microcontroller, Digital Signal Processor (DSP), Field Programmable Gate Array (FPGA), Programmable Logic Device (PLD), state machine, gated logic, discrete hardware circuitry, and other suitable processing components configured to perform the various functions described throughout this disclosure. The functionality of a processor, any portion of a processor, or any combination of processors presented in this disclosure may be implemented as software executed by a microprocessor, microcontroller, DSP, or other suitable platform.
Software should be viewed broadly as representing instructions, instruction sets, code segments, program code, programs, subroutines, software modules, applications, software packages, routines, objects, threads of execution, procedures, functions, and the like. The software may reside in a computer readable medium. The computer readable medium may include, for example, memory, which may be, for example, a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, a Random Access Memory (RAM), a Read Only Memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), a register, or a removable disk. Although the memory is shown as being separate from the processor in various aspects presented throughout this disclosure, the memory may be located internal to the processor (e.g., a cache or registers).
The above description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims (20)

1. A method for scoring documents in a document retrieval, comprising:
receiving a query, the query comprising one or more terms;
searching a database to obtain a document containing at least one of the one or more terms;
classifying each of the one or more terms according to whether the term is contained in the document, wherein each classified term is a matching term or a non-matching term;
obtaining a matching pattern between the query and the document, the matching pattern indicating an integration of each classified term; and
determining a relevance score for the document relative to the query based at least on the matching pattern.
2. The method of claim 1, wherein the integration identifies each term in the query, a classification of each term, and a location of each term in the query.
3. The method of claim 1, further comprising:
determining a plurality of potential matching patterns for the query; and
obtaining a plurality of scores corresponding to the plurality of potential matching patterns by calculating a score corresponding to each potential matching pattern of the plurality of potential matching patterns.
4. The method of claim 3, wherein the relevance score is determined by selecting a score from the plurality of scores that corresponds to a same potential matching pattern as the matching pattern.
5. The method of claim 3, wherein calculating the score corresponding to each potential matching pattern comprises:
generating a plurality of features of different types for the potential matching pattern based on the potential matching pattern and the query, wherein each feature has a weight; and
calculating the score corresponding to the potential matching pattern based on the weights of the plurality of features.
6. The method of claim 3, wherein calculating the score corresponding to each potential matching pattern comprises:
calculating at least one raw score corresponding to at least one sub-pattern of the potential matching pattern; and
obtaining a score corresponding to the potential matching pattern by selecting a highest original score among the at least one original score.
7. The method of claim 1, wherein the relevance score is determined by calculating a score corresponding to the matching pattern.
8. The method of claim 7, wherein calculating the score corresponding to the matching pattern comprises:
generating a plurality of features of different types for the matching pattern based on the matching pattern and the query, wherein each feature has a weight; and
calculating the score corresponding to the matching pattern based on the weights of the plurality of features.
9. The method of claim 5 or 8, wherein the different types comprise at least one of:
a matching term in the query,
a non-matching term in the query,
a combination of any matching term and any non-matching term in the query,
a combination of any matching term and any other matching term in the query,
a combination of the number of terms in the query and the number of matching terms,
consecutive matching terms in the query,
a block of matching terms in the query,
a bigram of any matching term and any non-matching term in the query,
a bigram of non-matching terms in the query,
a combination of any matching term in the query and the corresponding matching pattern, and
a combination of any non-matching term in the query and the corresponding matching pattern.
10. The method of claim 5 or 8, wherein the weight for each feature is trained based on a set of training data, each training data comprising a historical query and a plurality of retrieved historical documents, each historical document labeled with a relevance value.
11. The method of claim 1, wherein the matching pattern is represented by a vector and the value of each dimension of the vector is zero or one.
12. An apparatus for scoring documents in document retrieval, comprising:
a query receiving module for receiving a query, the query comprising one or more terms;
a search module to search a database for a document containing at least one of the one or more terms;
a classification module to classify the one or more terms according to whether each of the terms is contained in the document, wherein each classified term is a matching term or a non-matching term;
a matching pattern obtaining module to obtain a matching pattern between the query and the document, the matching pattern indicating an integration of each classified term; and
a score determination module to determine a relevance score for the document relative to the query based at least on the matching pattern.
13. The apparatus of claim 12, wherein the integration identifies each term in the query, a classification of each term, and a location of each term in the query.
14. The apparatus of claim 12, wherein the matching pattern obtaining module is further for determining a plurality of potential matching patterns for the query, and the score determining module is further for: obtaining a plurality of scores corresponding to the plurality of potential matching patterns by calculating a score corresponding to each potential matching pattern of the plurality of potential matching patterns.
15. The apparatus of claim 14, wherein the relevance score is determined by selecting a score from the plurality of scores that corresponds to a same potential matching pattern as the matching pattern.
16. The apparatus of claim 14, wherein the score determination module is further configured to:
generating a plurality of features of different types for the potential matching pattern based on the potential matching pattern and the query, wherein each feature has a weight; and
calculating the score corresponding to the potential matching pattern based on the weights of the plurality of features.
17. The apparatus of claim 16, wherein the different types comprise at least one of:
a matching term in the query,
a non-matching term in the query,
a combination of any matching term and any non-matching term in the query,
a combination of any matching term and any other matching term in the query,
a combination of the number of terms in the query and the number of matching terms,
consecutive matching terms in the query,
a block of matching terms in the query,
a bigram of any matching term and any non-matching term in the query,
a bigram of non-matching terms in the query,
a combination of any matching term in the query and the corresponding matching pattern, and
a combination of any non-matching term in the query and the corresponding matching pattern.
18. The apparatus of claim 14, wherein the score determination module is further configured to:
calculating at least one raw score corresponding to at least one sub-pattern of the potential matching pattern; and
obtaining a score corresponding to the potential matching pattern by selecting a highest original score among the at least one original score.
19. The apparatus of claim 12, wherein the relevance score is determined by calculating a score corresponding to the matching pattern.
20. An apparatus for scoring documents in document retrieval, comprising:
one or more processors; and
a memory storing computer-executable instructions that, when executed, cause the one or more processors to:
receiving a query, the query comprising one or more terms;
searching a database to obtain a document containing at least one of the one or more terms;
classifying each of the one or more terms according to whether the term is contained in the document, wherein each classified term is a matching term or a non-matching term;
obtaining a matching pattern between the query and the document, the matching pattern indicating an integration of each classified term; and
determining a relevance score for the document relative to the query based at least on the matching pattern.
CN201980022597.0A 2019-01-25 2019-01-25 Scoring documents in document retrieval Pending CN111919208A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/073262 WO2020151015A1 (en) 2019-01-25 2019-01-25 Scoring documents in document retrieval

Publications (1)

Publication Number Publication Date
CN111919208A true CN111919208A (en) 2020-11-10

Family

ID=71735418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980022597.0A Pending CN111919208A (en) 2019-01-25 2019-01-25 Scoring documents in document retrieval

Country Status (2)

Country Link
CN (1) CN111919208A (en)
WO (1) WO2020151015A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103123653A (en) * 2013-03-15 2013-05-29 山东浪潮齐鲁软件产业股份有限公司 Search engine retrieving ordering method based on Bayesian classification learning
CN105814564A (en) * 2013-12-14 2016-07-27 微软技术许可有限责任公司 Query techniques and ranking results for knowledge-based matching
CN106446122A (en) * 2016-09-19 2017-02-22 华为技术有限公司 Information retrieval method and device and computation device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100146299A1 (en) * 2008-10-29 2010-06-10 Ashwin Swaminathan System and method for confidentiality-preserving rank-ordered search
CN106919565B (en) * 2015-12-24 2020-12-22 航天信息股份有限公司 MapReduce-based document retrieval method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103123653A (en) * 2013-03-15 2013-05-29 山东浪潮齐鲁软件产业股份有限公司 Search engine retrieving ordering method based on Bayesian classification learning
CN105814564A (en) * 2013-12-14 2016-07-27 微软技术许可有限责任公司 Query techniques and ranking results for knowledge-based matching
CN106446122A (en) * 2016-09-19 2017-02-22 华为技术有限公司 Information retrieval method and device and computation device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHRISTOPHER D. MANNING et al.: "An Introduction to Information Retrieval", Cambridge University Press, pages 1-5 *

Also Published As

Publication number Publication date
WO2020151015A1 (en) 2020-07-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination