CN115878564A - Document retrieval method and device - Google Patents


Info

Publication number
CN115878564A
CN115878564A
Authority
CN
China
Prior art keywords
index
atomic
document
feature vector
ranking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111154038.0A
Other languages
Chinese (zh)
Inventor
张安
张泽品
聂光耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202111154038.0A priority Critical patent/CN115878564A/en
Publication of CN115878564A publication Critical patent/CN115878564A/en
Pending legal-status Critical Current

Abstract

The present application relates to the technical field of document retrieval. Embodiments of the application provide a document retrieval method and apparatus, where the method includes: acquiring a query statement; generating a document filter and a scorer based on the query statement; accessing an index repository to acquire the index resources required by the document filter and the scorer, where the index resources form a coarse-ranking index resource pool and include index information of each document; filtering out documents that meet the requirements of the document filter based on the index resources in the coarse-ranking index resource pool; calculating a relevance score for each filtered document, where the relevance score is computed from a coarse-ranking feature vector of the document and the coarse-ranking feature vector is computed by the scorer; and selecting documents according to their relevance scores, where the selected documents serve as the retrieval result of the coarse-ranking stage. With the method and apparatus, the index resources in the coarse-ranking index resource pool can be shared, improving retrieval efficiency.

Description

Document retrieval method and device
Technical Field
The present application relates to the field of information retrieval, and in particular, to a method and an apparatus for document retrieval.
Background
Information retrieval is deeply applied in web search and is also widely used in vertical domains such as music, video, and news.
The main task of information retrieval is to find, within massive data, the small number of documents relevant to a user's query. A retrieval engine scores the relevant documents on each relevance dimension, generates a document sequence in decreasing order of relevance, and returns it to the user.
Generally, in a massive-index retrieval scenario, user experience requires the retrieval engine to process a query and return results as quickly as possible. Fine ranking, which produces better orderings, typically relies on a ranking model, is costly, and cannot be applied directly to the full set of relevant documents that satisfy the filtering conditions. To return high-quality ranking results in a short time, commercial search engines generally adopt a two-stage retrieval process consisting of a coarse-ranking stage and a fine-ranking stage. In the coarse-ranking stage, a filter cheaply selects, from the full data, the set of relevant documents satisfying the filtering conditions; a scorer then computes a relevance score for each relevant document, and the TOPn documents (the n highest-ranked documents) are taken as the coarse-ranking result. In the fine-ranking stage, a feature extraction module extracts a feature vector for each of the TOPn documents returned by the coarse-ranking stage, and these vectors are fed into a ranking model that reorders the TOPn relevant documents to obtain a better ranking.
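As a rough sketch, the two-stage process described above might look like the following. This is a minimal illustration rather than the patented scheme; `passes_filter`, `coarse_score`, `fine_features`, and `ranking_model` are hypothetical placeholders supplied by the caller.

```python
import heapq

def two_stage_search(query, documents, passes_filter, coarse_score,
                     fine_features, ranking_model, n=10, m=3):
    """Coarse stage: cheap filter + score, keep the TOPn documents.
    Fine stage: costly feature extraction + ranking model on TOPn only."""
    # Coarse ranking: filter the full collection at low cost, then score.
    candidates = [d for d in documents if passes_filter(query, d)]
    topn = heapq.nlargest(n, candidates, key=lambda d: coarse_score(query, d))
    # Fine ranking: rerank only the TOPn with the expensive model.
    reranked = sorted(topn,
                      key=lambda d: ranking_model(fine_features(query, d)),
                      reverse=True)
    return reranked[:m]
```

Because the ranking model only ever sees n documents, its cost stays bounded regardless of how many documents pass the filter.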
Since the collection of documents indexed by a retrieval engine is large, most of the data in such massive document indexes is stored on media such as disks, and reading from or writing to such media is an I/O operation. Query processing in a search engine is in fact I/O-intensive: the engine must read a large amount of index information from these media, introducing substantial I/O. Reducing I/O during query processing is therefore the key to optimizing a search engine's query performance.
Disclosure of Invention
In view of this, an embodiment of the present application provides a document retrieval scheme. The document retrieval includes a coarse-ranking stage, in which an index resource pool is generated by querying an index repository for the index resources required by the atomic filters and atomic relevance operators that the coarse-ranking stage depends on; the document filter and the scorer reuse the index resources in this coarse-ranking index resource pool, reducing I/O in the coarse-ranking stage, and the scorer reuses the calculation results and/or intermediate results of each atomic relevance operator, reducing the amount of computation in the coarse-ranking stage. In some embodiments, the document retrieval further includes a fine-ranking stage, in which the index repository is queried, for the documents in the coarse-ranking result, for the index resources required by the atomic relevance operators specific to the fine-ranking stage; a fine-ranking resource pool is generated for sharing, and the calculation results and/or intermediate results of the atomic relevance operators depended on by the coarse-ranking stage and of the atomic relevance operators specific to the fine-ranking stage are shared when computing the fine-ranking feature vectors. Through this I/O sharing and computation sharing of atomic relevance operators, document retrieval efficiency is improved and user satisfaction increases.
To achieve the above object, a first aspect of the present application provides a document retrieval method including a coarse-ranking stage, where the coarse-ranking stage includes: acquiring a query statement; generating a document filter and a scorer based on the query statement; accessing an index repository to acquire the index resources required by the document filter and the scorer, where the index resources form a coarse-ranking index resource pool and include index information of each document; filtering out documents that meet the requirements of the document filter based on the index resources in the coarse-ranking index resource pool; performing relevance score calculation on the filtered documents, where the relevance score is computed from the coarse-ranking feature vector of each document and the coarse-ranking feature vector is computed by the scorer; and selecting documents according to their relevance scores, where the selected documents serve as the retrieval result of the coarse-ranking stage.
In the coarse-ranking stage, a coarse-ranking index resource pool is generated from the index resources required by the document filter and the scorer; the document filter and the scorer reuse the index resources in this pool to filter and score documents, so for each document to be processed, each index resource incurs only one I/O operation, reducing I/O in the coarse-ranking stage and improving document retrieval efficiency.
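One way the I/O sharing above could be realized is a per-query resource pool that loads each index resource from the repository at most once. The repository's `load` interface below is an assumption for illustration, not the patent's API.

```python
class CoarseResourcePool:
    """Loads each required index resource (e.g. a posting list) exactly once,
    so the document filter and the scorer reuse the same in-memory copy."""
    def __init__(self, repository):
        self.repository = repository
        self.pool = {}       # resource key -> loaded index resource
        self.io_count = 0    # number of actual repository reads performed

    def get(self, term):
        if term not in self.pool:
            # One I/O per distinct resource, no matter how many atomic
            # filters and operators request it.
            self.pool[term] = self.repository.load(term)
            self.io_count += 1
        return self.pool[term]
```

Both the filter and the scorer would call `pool.get(term)`; a resource shared by several atomic filters and relevance operators is read from disk only once per query.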
As a possible implementation of the first aspect, the scorer includes atomic relevance operators and the document filter includes atomic filters, where acquiring the index resources required by the document filter and the scorer includes: acquiring the index resources required by the atomic relevance operators and the index resources required by the atomic filters.
In this way, the index resources required by the document filter and the scorer are determined from those required by the atomic relevance operators and the atomic filters, and the coarse-ranking index resource pool is constructed accordingly, so that the pool satisfies the index-resource needs of the document filter and the scorer while occupying only a moderate amount of memory.
As a possible implementation of the first aspect, filtering out documents that meet the document filter's requirements includes: initially filtering out, based on the index resources in the coarse-ranking index resource pool, documents that meet the requirements of each atomic filter; and, from the initially filtered documents, filtering out documents that meet the requirements of the document filter, where the document filter describes the combinational logic of the multiple atomic filters.
In this way, documents meeting the document filter's requirements are selected, according to the filtering results of the atomic filters, using the combinational logic of the atomic filters; the document filter thus shares the atomic filters' results, improving filtering efficiency.
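A minimal sketch of this combinational logic, with hypothetical atomic filter names, might look as follows: each atomic filter is evaluated once per document, and the document filter combines the shared results.

```python
def atomic_results(doc, atomic_filters):
    # Each atomic filter runs once per document; its result is shared.
    return {name: f(doc) for name, f in atomic_filters.items()}

def document_filter(results):
    # Combinational logic over the atomic results,
    # here: lang_zh AND (title_match OR body_match).
    return results["lang_zh"] and (results["title_match"] or results["body_match"])

doc = {"lang": "zh", "title": "document retrieval", "body": "ranking"}
filters = {
    "lang_zh": lambda d: d["lang"] == "zh",
    "title_match": lambda d: "retrieval" in d["title"],
    "body_match": lambda d: "retrieval" in d["body"],
}
r = atomic_results(doc, filters)
passed = document_filter(r)
```

If several clauses of the document filter reference the same atomic filter, they all read the same entry in `r` rather than re-evaluating it.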
As a possible implementation of the first aspect, computing the coarse-ranking feature vector by the scorer includes: computing the coarse-ranking feature vector from the calculation results of the atomic relevance operators, where the calculation results and/or intermediate results of the atomic relevance operators are shared during the computation of the coarse-ranking feature vector.
In this way, the coarse-ranking feature vector is computed from the atomic relevance operators' calculation results, and the scorer scores each document's relevance according to its coarse-ranking feature vector, so the operators' results are shared during document scoring, reducing the amount of retrieval computation and improving scoring efficiency.
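The computation sharing described above can be sketched as per-document memoization of atomic relevance operator results; the operator names and signatures below are illustrative assumptions, not the patent's definitions.

```python
class SharedOperators:
    """Caches atomic relevance operator results per (operator, document),
    so the coarse-ranking feature vector reuses them instead of
    recomputing."""
    def __init__(self, operators):
        self.operators = operators  # name -> fn(query, doc)
        self.cache = {}             # (name, doc_id) -> result
        self.calls = 0              # actual operator evaluations

    def result(self, name, query, doc_id, doc):
        key = (name, doc_id)
        if key not in self.cache:
            self.cache[key] = self.operators[name](query, doc)
            self.calls += 1
        return self.cache[key]

    def coarse_feature_vector(self, names, query, doc_id, doc):
        # Every feature that depends on the same operator hits the cache.
        return [self.result(n, query, doc_id, doc) for n in names]
```

Scoring the same document twice, or computing two features that depend on the same operator, triggers only one evaluation of that operator.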
As a possible implementation of the first aspect, the atomic relevance operators include the atomic relevance operators used in the coarse-ranking stage to compute the coarse-ranking feature vector.
In this way, the atomic relevance operators used to compute the coarse-ranking feature vector are identified, which bounds the index resources the operators require in the coarse-ranking stage to a reasonable range; the calculation results and/or intermediate results of the operators in this range are stored for sharing, avoiding excessive memory consumption.
As a possible implementation of the first aspect, the method further includes a fine-ranking stage, where the fine-ranking stage includes: for the documents in the retrieval result of the coarse-ranking stage, obtaining the fine-ranking feature vector of each document, where the fine-ranking feature vector is computed from the calculation results and/or intermediate results of the atomic relevance operators; and selecting and/or sorting the retrieval result of the coarse-ranking stage with a ranking model based on the fine-ranking feature vectors, where the selected and/or sorted result serves as the retrieval result of the fine-ranking stage.
In this way, the fine-ranking feature vectors of the documents retrieved in the coarse-ranking stage are computed in the fine-ranking stage, and the final retrieval result is then selected by sorting, further improving retrieval accuracy.
As a possible implementation of the first aspect, computing the fine-ranking feature vector from the calculation results of the atomic relevance operators includes: the fine-ranking feature vector includes a first feature vector, and the atomic relevance operators from which the first feature vector is computed belong to the atomic relevance operators used in computing the coarse-ranking feature vector; during the computation of the first feature vector, the calculation results and/or intermediate results of the atomic relevance operators used in computing the coarse-ranking feature vector are shared.
In this way, the calculation results and/or intermediate results of the atomic relevance operators used in the coarse-ranking stage are shared when the first feature vector of the fine-ranking feature vector is computed, reducing the amount of retrieval computation and improving scoring efficiency.
As a possible implementation of the first aspect, the computation of the first feature vector is performed in the coarse-ranking stage.
In this way, by computing the first feature vector in the coarse-ranking stage, the calculation results and/or intermediate results of the atomic relevance operators occupy memory only during the coarse-ranking stage, reducing memory consumption.
As a possible implementation of the first aspect, computing the fine-ranking feature vector from the calculation results of the atomic relevance operators includes: the fine-ranking feature vector includes a second feature vector, and the atomic relevance operators from which the second feature vector is computed include atomic relevance operators not used in computing the coarse-ranking feature vector; during the computation of the fine-ranking feature vector, the calculation results and/or intermediate results of these unused atomic relevance operators are shared.
In this way, the calculation results and/or intermediate results of the atomic relevance operators specific to the fine-ranking stage are shared when the second feature vector of the fine-ranking feature vector is computed, reducing the amount of retrieval computation and improving scoring efficiency.
As a possible implementation of the first aspect, computing the calculation results and/or intermediate results of the atomic relevance operators not used in the coarse-ranking stage includes: accessing the index repository to obtain the index resources of the unused atomic relevance operators, where these index resources form a fine-ranking index resource pool; and computing the calculation results and/or intermediate results of the unused atomic relevance operators based on the index resources in the fine-ranking index resource pool.
In this way, a fine-ranking index resource pool is constructed from the index resources required by the atomic relevance operators not used in the coarse-ranking stage and is shared when the second feature vector is computed, reducing I/O in the fine-ranking stage and improving document retrieval efficiency.
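Under the same assumptions as the earlier sketches, the fine-ranking feature vector could be assembled from a first part that reuses results cached during coarse ranking and a second part computed by fine-ranking-only operators, whose results are likewise shared. This is an illustrative sketch, not the patent's implementation; all names are invented.

```python
def fine_feature_vectors(coarse_cache, coarse_ops, fine_ops, query, topn_docs):
    """coarse_cache: (op_name, doc_id) -> result, filled during coarse ranking.
    The first part of each fine-ranking vector reuses those cached results;
    the second part evaluates fine-only operators (whose index resources
    would come from a separate fine-ranking resource pool), sharing each
    result as well."""
    fine_cache = {}
    vectors = {}
    for doc_id, doc in topn_docs.items():
        # First feature vector: no recomputation, read the coarse cache.
        first = [coarse_cache[(name, doc_id)] for name in coarse_ops]
        # Second feature vector: fine-only operators, computed once each.
        second = []
        for name, op in fine_ops.items():
            key = (name, doc_id)
            if key not in fine_cache:
                fine_cache[key] = op(query, doc)
            second.append(fine_cache[key])
        vectors[doc_id] = first + second
    return vectors
```

Because the first part is a pure cache read, the fine-ranking stage pays only for operators that the coarse-ranking stage never evaluated.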
To achieve the above object, a second aspect of the present application provides a document retrieval apparatus applied to a coarse-ranking stage, including: an acquisition module, a shared query construction module, a filtering module, a scoring module, and a coarse-ranking module. The acquisition module is configured to acquire a query statement; the shared query construction module is configured to generate a document filter and a scorer based on the query statement; the acquisition module is further configured to access an index repository and acquire the index resources required by the document filter and the scorer, where the index resources form a coarse-ranking index resource pool and include index information of each document; the filtering module is configured to filter out documents meeting the document filter's requirements based on the index resources in the coarse-ranking index resource pool; the scoring module is configured to calculate a relevance score for the filtered documents, where the relevance score is computed from the coarse-ranking feature vector of each document and the coarse-ranking feature vector is computed by the scorer; and the coarse-ranking module is configured to select documents according to their relevance scores, where the selected documents serve as the retrieval result of the coarse-ranking stage.
In the coarse-ranking stage, a coarse-ranking index resource pool is generated from the index resources required by the document filter and the scorer; the document filter and the scorer reuse the index resources in this pool to filter and score documents, so for each document to be processed, each index resource incurs only one I/O operation, reducing I/O in the coarse-ranking stage and improving document retrieval efficiency.
As a possible implementation of the second aspect, the scorer includes atomic relevance operators and the document filter includes atomic filters, where acquiring the index resources required by the document filter and the scorer includes: acquiring the index resources required by the atomic relevance operators and the index resources required by the atomic filters.
In this way, the index resources required by the document filter and the scorer are determined from those required by the atomic relevance operators and the atomic filters, and the coarse-ranking index resource pool is constructed accordingly, so that the pool satisfies the index-resource needs of the document filter and the scorer while occupying only a moderate amount of memory.
As a possible implementation of the second aspect, the filtering module is specifically configured to initially filter out, based on the index resources in the coarse-ranking index resource pool, documents meeting the requirements of each atomic filter; the filtering module is further specifically configured to filter out, from the initially filtered documents, documents meeting the requirements of the document filter, which describes the combinational logic of the multiple atomic filters.
In this way, documents meeting the document filter's requirements are selected, according to the filtering results of the atomic filters, using the combinational logic of the atomic filters; the document filter thus shares the atomic filters' results, improving filtering efficiency.
As a possible implementation of the second aspect, when computing the coarse-ranking feature vector, the scoring module is specifically configured to compute the coarse-ranking feature vector from the calculation results of the atomic relevance operators, where the calculation results of the atomic relevance operators are shared during the computation of the coarse-ranking feature vector.
In this way, the coarse-ranking feature vector is computed from the atomic relevance operators' calculation results, and the scorer scores each document's relevance according to its coarse-ranking feature vector, so the operators' results are shared during document scoring, reducing the amount of retrieval computation and improving scoring efficiency.
As a possible implementation of the second aspect, the atomic relevance operators include the atomic relevance operators used in the coarse-ranking stage to compute the coarse-ranking feature vector.
In this way, the atomic relevance operators used to compute the coarse-ranking feature vector are identified, which bounds the index resources the operators require in the coarse-ranking stage to a reasonable range; the calculation results and/or intermediate results of the operators in this range are stored for sharing, avoiding excessive memory consumption.
As a possible implementation of the second aspect, the apparatus is further applied to a fine-ranking stage and further includes a feature extraction module and a fine-ranking module; the feature extraction module is configured to obtain, for the documents in the retrieval result of the coarse-ranking stage, the fine-ranking feature vector of each document, computed from the calculation results and/or intermediate results of the atomic relevance operators; the fine-ranking module is configured to select documents with a ranking model based on the fine-ranking feature vectors, where the selected documents serve as the retrieval result of the fine-ranking stage.
In this way, the fine-ranking feature vectors of the documents retrieved in the coarse-ranking stage are computed in the fine-ranking stage, and the final retrieval result is then selected by sorting, further improving retrieval accuracy.
As a possible implementation of the second aspect, the feature extraction module includes a first feature extraction module configured to extract the first feature vector of the fine-ranking feature vector, where the atomic relevance operators from which the first feature vector is computed belong to the atomic relevance operators used in computing the coarse-ranking feature vector; while the first feature extraction module computes the first feature vector, the calculation results and/or intermediate results of the atomic relevance operators used in computing the coarse-ranking feature vector are shared.
In this way, the calculation results and/or intermediate results of the atomic relevance operators used in the coarse-ranking stage are shared when the first feature vector of the fine-ranking feature vector is computed, reducing the amount of retrieval computation and improving scoring efficiency.
As a possible implementation of the second aspect, the first feature extraction module is executed in the coarse-ranking stage.
In this way, by computing the first feature vector in the coarse-ranking stage, the calculation results and/or intermediate results of the atomic relevance operators occupy memory only during the coarse-ranking stage, reducing memory consumption.
As a possible implementation of the second aspect, the feature extraction module includes a second feature extraction module configured to extract the second feature vector of the fine-ranking feature vector, where the atomic relevance operators from which the second feature vector is computed include atomic relevance operators not used in computing the coarse-ranking feature vector; while the second feature extraction module computes the second feature vector, the calculation results and/or intermediate results of these unused atomic relevance operators are shared.
In this way, the calculation results and/or intermediate results of the atomic relevance operators specific to the fine-ranking stage are shared when the second feature vector of the fine-ranking feature vector is computed, reducing the amount of retrieval computation and improving scoring efficiency.
As a possible implementation of the second aspect, the acquisition module is further configured to access the index repository and acquire the index resources of the atomic relevance operators not used in the coarse-ranking stage, where these index resources form a fine-ranking index resource pool; when computing the calculation results and/or intermediate results of the unused atomic relevance operators, the second feature extraction module computes based on the index resources in the fine-ranking index resource pool.
In this way, a fine-ranking index resource pool is constructed from the index resources required by the atomic relevance operators not used in the coarse-ranking stage and is shared when the second feature vector is computed, reducing I/O in the fine-ranking stage and improving document retrieval efficiency.
To achieve the above object, a third aspect of the present application provides a computing device comprising at least one processor and at least one memory, the memory storing program instructions that, when executed by the at least one processor, cause the at least one processor to perform the method of the first aspect described above.
To achieve the above object, a fourth aspect of the present application provides a computer-readable storage medium having stored thereon program instructions that, when executed by a computer, cause the computer to implement the method of the first aspect described above.
To achieve the above object, a fifth aspect of the present application provides a computer program product comprising program instructions that, when executed by a computer, cause the computer to implement the method of the first aspect.
These and other aspects of the present application will be more readily apparent in the following description of the embodiment(s).
Drawings
The features of the present application, and the connections between them, are further described below with reference to the drawings. The drawings are exemplary; some may omit features that are conventional in the art and not essential to the application, or may additionally show features that are not essential to the application, and the combinations of features shown in the drawings are not intended to limit the application. Throughout the specification, the same reference numerals designate the same components. The drawings are as follows:
FIG. 1 is a schematic flowchart of the two-stage retrieval of ElasticSearch + Learning to Rank Plugin;
FIG. 2A is a schematic structural diagram of an ElasticSearch query tree;
FIG. 2B is a schematic structural diagram of a filter based on an ElasticSearch query tree;
FIG. 2C is a schematic structural diagram of a scorer based on an ElasticSearch query tree;
FIG. 3 is a schematic diagram of the shortcomings of the ElasticSearch + Learning to Rank Plugin scheme;
FIG. 4 is a schematic diagram of an application scenario according to embodiments of the present application;
FIG. 5A is a flowchart illustrating a first embodiment of a document retrieval method according to the present application;
FIG. 5B is a schematic diagram of the data flow for constructing the coarse-ranking index resource pool of the present application;
FIG. 5C is a schematic diagram comparing a conventional scoring method with the scoring method of the first embodiment of the present application;
FIG. 5D is a schematic diagram of the calculation of the shared index resources and the atomic relevance operators according to the first embodiment of the present application;
FIG. 6A is a flowchart illustrating a second embodiment of a document retrieval method according to the present application;
FIG. 6B is a schematic diagram of a data flow constructed by an index resource pool according to the second embodiment of the present application;
FIG. 6C is a schematic diagram of the calculation of the shared index resources and the atomic relevance operators according to the second embodiment of the present application;
FIG. 7A is a flowchart illustrating a document retrieval method according to an embodiment of the present application;
FIG. 7B is a content structure diagram of a shared query in accordance with an embodiment of the present application;
FIG. 7C is a flowchart illustrating a method for constructing a global shared query according to an embodiment of the present application;
FIG. 7D is a flowchart illustrating a method for coarse retrieval of each document, in accordance with an embodiment of the present application;
FIG. 7E is a schematic flow chart diagram illustrating a method for fine-search of each document, in accordance with an embodiment of the present application;
FIG. 8A is a schematic structural diagram of a first embodiment of a document retrieval device according to the present application;
FIG. 8B is a diagram illustrating a second exemplary embodiment of a document retrieval device according to the present application;
FIG. 9 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings; the described embodiments are only a part of the embodiments of the present application, not all of them. The components of the embodiments, as generally described and illustrated in the figures, can be arranged and designed in a wide variety of configurations. The following detailed description is therefore not intended to limit the scope of the claimed application, but is merely representative of selected embodiments. All other embodiments obtained by a person skilled in the art from the embodiments of the present application without creative effort fall within the protection scope of the present application.
The terms "first", "second", "third", and the like, or "module A", "module B", "module C", and the like in the description and claims are used to distinguish similar elements and do not necessarily describe a particular sequential or chronological order; it should be understood that, where permissible, specific orders or sequences may be interchanged so that the embodiments of the present application can be implemented in orders other than those illustrated or described herein.
In the following description, reference numerals indicating steps, such as S110, S120, etc., do not necessarily mean that the steps must be performed; where the context allows, the order of steps may be interchanged, or steps may be performed simultaneously.
The term "comprising" as used in the specification and claims should not be construed as being limited to the contents listed thereafter; it does not exclude other elements or steps. It is thus to be interpreted as specifying the presence of the stated features, integers, steps or components as referred to, but does not preclude the presence or addition of one or more other features, integers, steps or components, or groups thereof. Thus, the expression "a device comprising means a and B" should not be limited to a device consisting of only components a and B.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Thus, appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment, but may be. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments, as would be apparent to one of ordinary skill in the art from this disclosure.
Before further detailed description of the embodiments of the present application, technical terms related to the examples of the present application will be described. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. In the case of inconsistency, the meaning described in the present specification or the meaning derived from the content described in the present specification shall control. In addition, the terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
To accurately describe the technical contents in the present application and to accurately understand the present application, the terms used in the present specification are given the following explanations or definitions before the description of the specific embodiments.
1. Query term (term): a word indexed in a search engine. In a search engine, the query input by a user generally consists of a sentence; the sentence is segmented to obtain an ordered sequence of terms, and the search engine performs retrieval with the term as the basic unit.
2. Document (document, or doc): in the field of information retrieval, data such as web pages and items can be abstracted into documents. A document is an ordered sequence of terms. In some scenarios, such as web search, a document may also be divided into multiple fields; the fields are segmented independently of one another, and each field is itself an ordered sequence of terms.
3. Inverted list (posting list): a document index table used to store the mapping from a term to the documents in which it appears. It is an ordered list, arranged from small to large by document identifier (doc id), of the documents that contain the term represented by the list. For example, if term "A" appears in the documents with ids 0, 2 and 5, the inverted list of A can be expressed as {0,2,5}. In order to be able to find phrases made up of multiple terms in a document, it is also necessary to record the positions of the term in each document. Assuming that A appears at positions 11, 25 and 46 of document 0, at position 54 of document 2, and at positions 14 and 39 of document 5, the inverted list is {(0,<11,25,46>), (2,<54>), (5,<14,39>)}. The posting list may also record the number of times the term appears in each doc (the term frequency), as well as other attributes the term has on the doc. By constructing this inverted list structure over the input data in advance, the retrieval engine achieves fast retrieval over massive indexes. The information contained in inverted lists is referred to as inverted index resources.
4. Forward index: an index method for recording document attributes. It records the static attribute information of each doc, such as the web page quality, web page URL and web page click count of a document. It is mostly used for reading static attributes that act on the relevance scoring process; the static attributes can also be used directly or indirectly as relevance features of the document in the fine ranking process. The information of the forward index is referred to as forward index resources.
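The two index structures defined above can be sketched in a few lines. The sketch below is purely illustrative (the corpus, attribute names and URLs are hypothetical, not from this application): a positional inverted index maps each term to a doc-id-ordered list of postings with positions, and a forward index maps each doc id to its static attributes.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Positional inverted index: term -> list of (doc_id, [positions]),
    sorted by ascending doc id, as in the inverted list described above.
    The term frequency of a term in a doc is the length of its position list."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(pos)
    return {t: sorted(p.items()) for t, p in index.items()}

# Hypothetical forward index: doc_id -> static attributes of the document
# (web page quality, URL, click count), read during relevance scoring.
forward_index = {
    0: {"url": "https://example.com/a", "quality": 0.9, "clicks": 120},
    1: {"url": "https://example.com/b", "quality": 0.4, "clicks": 35},
    2: {"url": "https://example.com/c", "quality": 0.7, "clicks": 60},
}
```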
ElasticSearch with the Learning to Rank plugin (ElasticSearch (ES) + Learning to Rank plugin) is one implementation of a two-stage retrieval process, and Fig. 1 shows this two-stage retrieval process. The ES stage, i.e. the rough ranking stage, implements the screening and scoring functions of the rough ranking stage shown in Fig. 1 and outputs the top documents (i.e. the top n documents after sorting). The Learning to Rank plugin stage, i.e. the fine ranking stage, implements the feature vector extraction for those top documents shown in Fig. 1, ranks them according to the feature vectors, and outputs the top documents (i.e. the top k documents after ranking, k less than n) as the final output result.
In the ES stage, a query tree is generated based on the query sentence or words input by the user, and Fig. 2A shows the structure of such a query tree. Each leaf of the tree is a query that cannot be segmented further, called an atomic query; the branch labeled "must" under the root bool node is used to filter documents, and the branch labeled "should" is used only to score, together with the "must" branch, the documents that pass the filtering.
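A minimal sketch of such a query tree follows; the terms and the exact shape are illustrative, not the specific tree of Fig. 2A. The "must" branch carries the filtering atomic queries, the "should" branch the scoring-only ones, and the leaves are the atomic queries.

```python
# Illustrative query tree: a bool root with "must" (filtering) and
# "should" (scoring-only) branches; each leaf is an atomic query.
query_tree = {
    "bool": {
        "must": [{"term": "document"}, {"term": "retrieval"}],
        "should": [{"term": "method"}],
    }
}

def atomic_queries(tree):
    """Collect the leaves of the query tree, i.e. the atomic queries."""
    node = tree["bool"]
    return [leaf["term"] for leaf in node["must"] + node["should"]]
```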
Filters and scorers are generated based on the query tree, and Figs. 2B and 2C illustrate the structures of the filter and scorer generated from Fig. 2A. Both the filter and the scorer are tree structures, also called the filter tree and the scoring tree. Each leaf of the filter is an atomic filter corresponding to one atomic query, and the results of the atomic filters are combined by logical operations to produce the filter's filtering result. In Fig. 2C, bm25 is an operator computed from the result of an atomic query. The ES stage performs I/O based on the leaves of the query tree; the atomic filter and the operator at the same position in Fig. 2B and Fig. 2C correspond to the same atomic query, and their I/O can be multiplexed by reusing the result of that one atomic query. The ES stage has the following disadvantages:
(1) Atomic filters at different positions of the filter do not multiplex I/O, even if they correspond to the same atomic query.
(2) Operators at different positions of the scorer that correspond to the same atomic query neither multiplex I/O nor share computation.
(3) A leaf at one position of the filter and a leaf at another position of the scorer do not multiplex I/O, even if they correspond to the same atomic query.
In the Learning to Rank plugin stage, the computed feature vectors are high-dimensional: each dimension corresponds to a feature tree, each leaf of each feature tree is again an operator corresponding to an atomic query, and the feature of each dimension is computed in the same way as the scoring tree. Because a high-dimensional feature vector corresponds to a large number of I/Os, the I/O overhead per document is much larger than in the rough ranking stage. The Learning to Rank plugin stage performs I/O based on the leaves of each feature tree, with the following disadvantages:
(1) Within one feature tree, operators at different positions do not multiplex I/O even if they correspond to the same atomic query, and the computation of the atomic operators is not shared.
(2) Across different feature trees, operators corresponding to the same atomic query do not multiplex I/O, and the computation of the atomic operators is not shared.
(3) Even if an atomic operator in a feature tree of the Learning to Rank plugin stage and an atomic operator in the filter or scorer of the ES stage correspond to the same atomic query, they do not multiplex I/O, and the computation of the atomic operator is not shared.
FIG. 3 is a schematic diagram of the drawbacks of the ES + Learning to Rank plugin scheme. The I/Os represented by the dotted arrows drawn from the index resource are redundant I/Os; they appear among the three left arrows drawn from the index resource. In the calculation layer, the calculation modules drawn with dashed boxes, in both the scorer and the filter, represent repeated calculations.
The embodiments of the present application provide a document retrieval method and apparatus. The document retrieval includes a rough ranking stage. In the rough ranking stage, the index repository is queried, based on the index resources required by the atomic filters and the atomic correlation operators that the rough ranking depends on, to generate a rough index resource pool; the atomic filters and the atomic correlation operators multiplex the index resources in this pool, reducing the I/O of the rough ranking stage; and the scorer multiplexes the calculation results and/or intermediate results of the atomic correlation operators, reducing the amount of computation in the rough ranking stage. In some embodiments, the document retrieval further includes a fine ranking stage, in which the calculation results and/or intermediate results of the atomic correlation operators of the rough ranking stage are shared when extracting the fine ranking features; one round of I/O is performed on the index repository, for the documents in the rough ranking result, based on the index resources required by the atomic correlation operators exclusive to the fine ranking stage, to generate a fine ranking resource pool for sharing; and the computation of those fine-ranking-exclusive atomic correlation operators is likewise shared during the calculation of the fine ranking feature vectors. By sharing I/O and the calculation results and/or intermediate results of the atomic correlation operators, document retrieval efficiency is improved, and user satisfaction is improved.
Embodiments of the present application will be described in detail below based on fig. 4 to 9.
First, an application scenario according to each embodiment of the present application is schematically described with reference to fig. 4. Fig. 4 shows a structure of an application scenario related to embodiments of the present application, including: search client 410, search server 420, and index repository 430.
The search client 410 is used for the user to input search conditions and to display search results. The search client 410 may be a mobile phone, tablet, laptop, desktop computer or the like, and is connected to the search server 420 by a wireless or wired connection.
The search server 420 is configured to search the index repository 430 based on the search condition input by the user and obtain the search result. The search server 420 may comprise a plurality of servers that each perform part of the search to improve efficiency, in which case a coordination controller may also be included to coordinate the search servers 420.
The index repository 430 consists of index resources stored on storage devices and may be distributed over a plurality of storage devices, which may be hard disks, optical disks, tapes and the like. Typically, the index repository 430 is deployed together with, or near, the search server 420 to increase the I/O rate. When the search server 420 and the index repository 430 are deployed in a distributed manner, there may be a correspondence between the distribution nodes of the index repository 430 and the data nodes of the search server 420, so that each data node of the search server 420 performs I/O from a nearby distribution node of the index repository 430. In that case, a coordinating node is also included to coordinate the data nodes.
A first embodiment of a document retrieval method according to the present application is described below with reference to fig. 5A to 5D.
[First] document retrieval method embodiment
The first method embodiment runs on the search server 420. In the first method embodiment, a document filter and a scorer are constructed based on the search statement input by the user, an index resource pool is constructed based on the resources required by the document filter and the scorer, and the document filter and the scorer are used to filter, score and sort the documents covered by the index resource pool to determine the search result, so that the index resources and calculation results in the shared resource pool are reused during retrieval and retrieval efficiency is improved.
Fig. 5A shows a first flowchart of a document retrieval method according to a first embodiment of the present application, including step S510 to step S560.
S510: and acquiring a retrieval statement input by a user.
The search statement is a sentence or words input by the user through the search client 410; when a plurality of words are input, they may be separated by punctuation marks such as spaces or commas. The search condition may also be words combined with a logical expression.
When the retrieval condition is a sentence, words contained in the sentence are extracted by segmenting the sentence.
S520: and generating a document filter and a scoring device according to the retrieval sentences.
In some embodiments, a query tree and a scoring tree are constructed according to the retrieval statement by using an ElasticSearch method, and then a document filter and a scorer are constructed according to the query tree and the scoring tree.
In this embodiment, an atomic query is generated from each word (keyword) of the search statement, and an atomic filter and an atomic correlation operator are generated from each atomic query. The document filter includes a number of atomic filters and describes the combinational logic, generated from the search statement, over these atomic filters. The scorer includes a plurality of atomic correlation operators as well as a rough ranking feature vector calculated by the atomic correlation operators according to the search statement; the relevance score of a document can be calculated from the rough ranking feature vector. In some embodiments, this step may include steps S521 through S524.
S521: and generating an atomic query according to the key words of the retrieval statement.
Wherein, the atomic query can be generated by using the method of ElasticSearch according to the key words of the retrieval statement.
S522: And constructing a rough ranking feature vector, and generating the relationship between the rough ranking features in the feature vector and the atomic correlation operators.
The scoring method for the relevance score of a document can be generated from the search statement using the ElasticSearch method, and the rough ranking feature vector is determined from this method; this can also be described as registering the rough ranking features required by the scorer's relevance scoring method.
And generating the relationship between the rough ranking features and the atomic correlation operators according to the relevance scoring method, wherein each atomic correlation operator corresponds to one atomic query; the atomic correlation operators of the rough ranking stage are also called rough ranking atomic correlation operators.
Wherein the relationship between the rough ranking features and the atomic correlation operators includes:
which atomic correlation operators each rough ranking feature depends on (these become the rough ranking atomic correlation operators);
the functional relationship between each rough ranking feature value and the calculation results and/or intermediate results of the associated rough ranking atomic correlation operators.
Determining the relationship between the rough ranking features and the atomic correlation operators can also be described as registering the required atomic correlation operators based on the calculation method of the rough ranking features.
S523: And constructing the document filter by using the atomic filters.
A preliminary document filter is generated from the search statement using the ElasticSearch filtering method; the relationship between the document filter and the atomic filters is generated from it, wherein each atomic filter corresponds to one atomic query; and the document filter is then constructed according to the relationship between the document filter and the atomic filters.
Wherein the relationship between the document filter and the atomic filters may include the following:
which atomic filters the document filter is related to;
the logical relationship between the document filter and the associated atomic filters.
In some embodiments, each data node builds its own document filter.
Wherein determining the relationship of the document filter to the atomic filter may also be described as registering the required atomic filter based on the filtering method of the document filter.
Therefore, the document filter constructed from the logical relationship of the associated atomic filters solves the problem that, when different leaf nodes of the query tree correspond to the same atomic filter, neither the index resources corresponding to that atomic filter nor the result of the atomic filter can be shared; it realizes the sharing of both the index resources corresponding to the atomic filters and the results of the atomic filters.
S524: And constructing a scorer by using the rough ranking features and the rough ranking atomic correlation operators.
A rough ranking feature executor can be constructed using the relationship between the rough ranking features and the atomic correlation operators.
When ElasticSearch is used to determine the rough ranking features, the relationship between the relevance score of a document and the rough ranking features can also be determined, and the scorer is constructed from this relationship: the scorer calculates the rough ranking features using the calculation results and/or useful intermediate results of the rough ranking atomic correlation operators, and calculates the relevance score of the document from the rough ranking features.
In some embodiments, an atomic correlation operator pool is constructed for holding the calculation results and/or useful intermediate results of the atomic correlation operators, including those of the rough ranking atomic correlation operators.
In some embodiments, a rough ranking feature executor is constructed using the relationship between the rough ranking features and the atomic correlation operators; the executor obtains the calculation results and/or useful intermediate results of the relevant atomic correlation operators from the atomic correlation operator pool, and calculates each rough ranking feature value using the functional relationship between that feature value and the results of the relevant rough ranking atomic correlation operators. The scorer is then constructed using the relationship between the relevance score of the document and the rough ranking features; that is, the scorer calculates the relevance score of the document from the output of the rough ranking feature executor.
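The operator pool and feature executor described above can be sketched as follows. This is a minimal illustration, not the claimed implementation; the operator names a1/a2 and the two features are hypothetical.

```python
class AtomicOperatorPool:
    """Caches the calculation result of each atomic correlation operator
    per document, so that every rough ranking feature reusing the same
    operator triggers one computation instead of many."""
    def __init__(self):
        self._cache = {}
        self.computations = 0   # counts actual operator evaluations

    def get(self, op_name, doc_id, compute):
        key = (op_name, doc_id)
        if key not in self._cache:
            self._cache[key] = compute()
            self.computations += 1
        return self._cache[key]

def rough_features(pool, doc_id):
    """Two hypothetical rough ranking features; both depend on operator a2,
    whose result is computed once and then shared through the pool."""
    a1 = pool.get("a1", doc_id, lambda: 1.0)
    a2 = pool.get("a2", doc_id, lambda: 2.0)
    f1 = a1 + a2
    f2 = pool.get("a2", doc_id, lambda: 2.0) * 3   # cache hit, no recompute
    return [f1, f2]
```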
In some embodiments, each data node builds its own scorer.
Compared with a scorer constructed from the scoring tree, the scorer constructed from the rough ranking features and the rough ranking atomic correlation operators solves the problem that, when different leaf nodes of the scoring tree correspond to the same atomic correlation operator, neither the index resources corresponding to that operator nor its calculation results and/or intermediate results can be shared; it realizes the sharing of both.
S530: and accessing the index resource library, acquiring index resources required by the document filter and the scorer, and forming a rough index resource pool.
The index resources include information such as the inverted index resources and forward index resources of each document. The index resources required by the scorer include those required by the atomic correlation operators, and the index resources required by the document filter include those required by the atomic filters.
In some embodiments, when the index resources are stored centrally, this step forms a single rough index resource pool.
In some embodiments, when the index resources are stored distributed across data nodes, this step forms a rough index resource pool on each data node.
It should be noted that this step can also be described as registering the rough index resource based on the index resources required by the document filter and the scorer, and forming a rough index resource pool by the index resources required by the document filter and the scorer.
FIG. 5B illustrates the data flow of constructing the rough index resource pool: the required atomic filters are registered based on the document filter; the required rough ranking features are registered based on the scorer; the required atomic correlation operators are registered based on the rough ranking features; and the required index resources are registered to the shared resource pool based on the atomic filters and the atomic correlation operators, so that the rough index resource pool is constructed by acquiring the required index resources from the index repository.
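The registration chain of Fig. 5B can be sketched as a deduplicating collection of resource requests, so that each distinct index resource is fetched from the repository exactly once. All mapping names and the fetch callback are illustrative assumptions.

```python
def build_rough_index_pool(filter_to_resources, feature_to_ops,
                           op_to_resources, fetch):
    """Collect the index resources registered by the atomic filters and by
    the atomic correlation operators behind the rough ranking features,
    deduplicate them, and fetch each distinct resource exactly once."""
    needed = set()
    for resources in filter_to_resources.values():
        needed.update(resources)
    for ops in feature_to_ops.values():
        for op in ops:
            needed.update(op_to_resources[op])
    return {r: fetch(r) for r in needed}   # one I/O per distinct resource
```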
S540: and filtering out the documents meeting the requirements of the document filter based on the index resources in the rough index resource pool.
Firstly, documents meeting the requirements of each atomic filter are filtered out based on the index resources in the rough index resource pool; then, among the documents passing this first filtering, the documents meeting the requirements of the document filter are filtered out, wherein the document filter describes the combinational logic over the plurality of atomic filters.
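The two-pass filtering just described can be sketched as follows: each atomic filter is evaluated once per document, and the document filter's combinational logic is then applied over the cached atomic results. The filter predicates in the test are illustrative stand-ins for real atomic filters.

```python
def filter_documents(doc_ids, atomic_filters, combine):
    """Sketch of step S540: first evaluate each atomic filter once per
    document and cache its result, then apply the document filter's
    combinational logic over the cached atomic results."""
    atomic_results = {
        name: {d: pred(d) for d in doc_ids}   # each atomic filter runs once
        for name, pred in atomic_filters.items()
    }
    return [d for d in doc_ids
            if combine({n: r[d] for n, r in atomic_results.items()})]
```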
In some embodiments, when each data node stores the rough index resource pool of the node, the filtering of this step is completed at each data node.
S550: and calculating a relevance score according to the filtered documents.
The relevance score of a document is calculated based on the document's rough ranking feature vector; the rough ranking feature vector is calculated by the scorer using the calculation results of the rough ranking atomic correlation operators, and those calculation results are shared during the calculation.
In some embodiments, when each data node filters out its own document according to the method of step S540, the filtered documents are scored on the data node based on the method of this step.
Fig. 5C compares the conventional query-tree-based scoring method (left) with the rough-ranking-feature-based scoring method of the present application (right), where ax denotes an atomic correlation operator on the right and an atomic operator on the left. On the left, the a2 and a3 atomic operators are each invoked twice and are isolated from each other, so neither the index resource queries nor the atomic operator calculations can be shared. On the right, through the rough ranking features f1, f2 and f3, the index resource queries of the atomic correlation operators and the calculations of the atomic correlation operators are shared, which improves the scoring efficiency of the scorer.
The sharing of index resources and of atomic correlation operator calculations in steps S540 and S550 is described below, and Fig. 5D shows its flow. The index resources in the rough index resource pool are shared between the atomic filters and the atomic correlation operators, among the atomic filters, and among the atomic correlation operators; the document filter shares the results of the atomic filters during filtering; and each rough ranking feature shares the calculations of the atomic correlation operators during its calculation.
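The I/O sharing of Fig. 5D can be sketched with a pool that fetches each posting list once and serves both an atomic filter and an atomic correlation operator. The repository contents reuse the inverted list example for term "A" from the terminology section; class and function names are illustrative.

```python
class RoughIndexResourcePool:
    """Posting lists are fetched from the index repository once and then
    shared by the atomic filters and the atomic correlation operators."""
    def __init__(self, repository):
        self._repo = repository
        self._pool = {}
        self.io_count = 0   # counts repository fetches

    def postings(self, term):
        if term not in self._pool:
            self._pool[term] = self._repo[term]   # one I/O per term
            self.io_count += 1
        return self._pool[term]

def atomic_filter_contains(pool, term, doc_id):
    """Atomic filter: does the document contain the term?"""
    return any(d == doc_id for d, _ in pool.postings(term))

def atomic_operator_tf(pool, term, doc_id):
    """Atomic correlation operator: term frequency from the positions list."""
    return next((len(p) for d, p in pool.postings(term) if d == doc_id), 0)
```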
S560: and selecting the documents according to the relevance scores, wherein the selected documents are used as the retrieval results in the rough ranking stage.
In some embodiments, all of the filtered documents are ranked together, and the TOPn documents are selected as the search result of the rough ranking stage.
In some embodiments, dynamic ranking is performed as each filtered document is scored: only the current TOPn documents are retained, and after all the filtered documents have been processed, the remaining TOPn documents are used as the search result of the rough ranking stage.
In some embodiments, each data node has its own rough index resource pool and performs filtering and scoring separately; the filtered documents of each data node are sorted within the node, and the TOPn documents of all nodes are then merged and sorted to obtain the global TOPn documents.
In some embodiments, the search result of this step is output to the user as the final search result, and the search is ended.
In some embodiments, the fine-ranking stage search is continued in the documents of the search result of this step.
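The dynamic-ranking variant above (keep only the current TOPn while streaming scores) is commonly realized with a bounded min-heap; this sketch is illustrative, and merging the per-node TOPn lists of the distributed variant works the same way.

```python
import heapq

def top_n(scored_docs, n):
    """Stream over (doc_id, score) pairs while keeping only the current
    top n in a min-heap; return the doc ids in descending score order."""
    heap = []
    for doc_id, score in scored_docs:
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return [doc_id for score, doc_id in sorted(heap, reverse=True)]
```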
In summary, in the document retrieval method of the first embodiment, the document retrieval includes a rough ranking stage. In the rough ranking stage, the index repository is queried, based on the index resources required by the atomic filters and the atomic correlation operators that the rough ranking depends on, to generate a rough index resource pool; each atomic filter and each atomic correlation operator multiplexes the index resources in the pool, so the index resources of each document to be processed undergo I/O only once and the I/O of the rough ranking stage is reduced; and the scorer multiplexes the calculation results and/or intermediate results of the atomic correlation operators, so the amount of computation in the rough ranking stage is reduced, document retrieval efficiency is improved, and user satisfaction is improved.
A second embodiment of the document retrieval method according to the present application is described below with reference to fig. 6A to 6C.
[Second] document retrieval method embodiment
The second method embodiment runs on the search server 420. On the basis of the first method embodiment, the second method embodiment extracts fine ranking feature vectors for the documents obtained in the rough ranking stage, sorts the documents according to the fine ranking feature vector values, and selects the final search result. The computation of the atomic correlation operators is shared during the extraction, and for the atomic correlation operators not used in the first method embodiment, a fine ranking index resource pool is constructed so that their resources are shared, improving the retrieval efficiency of the fine ranking stage.
In the second method embodiment, the fine ranking feature vector is calculated based on the calculation results and/or intermediate results of the atomic correlation operators.
In some embodiments, the fine ranking feature vector includes a first feature vector, and the atomic correlation operators on which the calculation of the first feature vector depends all belong to the atomic correlation operators used in the rough ranking feature vector calculation, i.e. the rough ranking atomic correlation operators. The calculation results and/or intermediate results of the rough ranking atomic correlation operators are shared during the calculation of the first feature vector. In some of these embodiments, the calculation of the first feature vector is performed in the rough ranking stage; in others, it is performed in the fine ranking stage.
In some embodiments, the fine ranking feature vector includes a second feature vector, and the atomic correlation operators on which the calculation of the second feature vector depends include atomic correlation operators not used in the rough ranking feature vector calculation; these unused operators are called fine ranking atomic correlation operators. Their calculation results and/or intermediate results are shared during the calculation.
In some embodiments, the index repository is accessed to acquire the index resources of the fine ranking atomic correlation operators; these index resources form a fine ranking index resource pool, and the calculation results and/or intermediate results of the fine ranking atomic correlation operators are calculated based on the index resources in this pool.
Fig. 6A shows the flow of the second embodiment of the document retrieval method according to the present application, which includes steps S610 to S680. The following description focuses on the steps that differ between the second and the first method embodiments.
S610: and acquiring a retrieval statement input by a user.
The method in this step is the same as step S510 in the first embodiment of the method.
S620: and generating a document filter, a scorer and a fine-ranking feature executor according to the retrieval statement.
In this step, the construction of the fine feature executor is added to step S520 in the first embodiment of the method. This step includes steps S621 to S626.
S621: and generating an atomic query according to each word of the retrieval statement.
The method of this step is the same as step S521 of the first embodiment of the method.
S622: And constructing a rough ranking feature vector and a fine ranking feature vector, and generating the relationships of the rough ranking features and the fine ranking features with the atomic correlation operators.
Compared with step S522 of the first method embodiment, this step adds the construction of the fine ranking feature vector and of the relationship between the fine ranking features and the atomic correlation operators. The added portions are explained below.
And constructing a fine ranking feature vector, and a method for obtaining it, from the search statement by using the Learning to Rank plugin method. This can also be described as registering the fine ranking features using the Learning to Rank plugin method.
And determining the relationship between each fine ranking feature and the atomic correlation operators according to the method for obtaining the fine ranking feature vector. The relationship between the fine ranking features and the atomic correlation operators includes:
which atomic correlation operators each fine ranking feature depends on;
the functional relationship between each fine ranking feature value and the calculation results and/or intermediate results of the associated atomic correlation operators.
Wherein different fine ranking features may be associated with the same atomic correlation operator.
Determining the relationship between each fine ranking feature and the atomic correlation operators can also be described as registering the required atomic correlation operators based on the calculation method of the fine ranking features.
S623 to S624: document filters and scorers are built.
The method executed here is the same as steps S523 to S524 of the first method embodiment.
S625: construct the fine ranking feature executor based on the relationship between the fine ranking features and the atomic correlation operators.
In some embodiments, the calculation results and/or intermediate results of the fine ranking atomic correlation operators are saved in the atomic correlation operator pool for sharing.
The fine ranking feature executor obtains the calculation results and/or useful intermediate results of the associated atomic correlation operators from the atomic correlation operator pool, and calculates each fine ranking feature value using the functional relation between that feature value and the calculation results and/or intermediate results of the associated atomic correlation operators.
S626: judge whether the atomic correlation operators of each fine ranking feature are all coarse ranking atomic correlation operators, and mark the first feature vector and the second feature vector within the fine ranking feature vector.
When the atomic correlation operators on which a fine ranking feature depends are all coarse ranking atomic correlation operators, that feature belongs to the first feature vector; when they are not all coarse ranking atomic correlation operators, the feature belongs to the second feature vector.
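The judgment in S626 reduces to a set-membership test over operator names. A minimal sketch, assuming each fine ranking feature's operator set is known; the operator/feature assignments mirror FIG. 7B but are otherwise illustrative.

```python
# Sketch of step S626: a fine ranking feature whose operators are all
# coarse ranking operators goes into the first feature vector; a feature
# needing any operator outside that set goes into the second.

coarse_operators = {"S1", "S2", "S3", "S4", "S5", "S6"}

fine_feature_operators = {
    "F3": {"S1", "S3"},
    "F5": {"S1", "S4"},
    "F6": {"S7"},          # S7 is not used in the coarse ranking stage
}

first_vector, second_vector = [], []
for feature, ops in sorted(fine_feature_operators.items()):
    if ops <= coarse_operators:   # all operators are coarse ranking ones
        first_vector.append(feature)
    else:
        second_vector.append(feature)
```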
S630 to S660: acquire the index resources required by the document filter and the scorer to form a coarse ranking index resource pool, filter the documents with the document filter based on the index resources in that pool, score and sort the filtered documents with the scorer, and obtain the retrieval result of the coarse ranking stage. The method is the same as steps S630 to S660 of the first method embodiment.
S670: access the index resource library, acquire the index resources required by the atomic correlation operators of the second feature vector, and form a fine ranking index resource pool.
The fine ranking index resource pool is deployed at the same position as the coarse ranking index resource pool.
In some embodiments, the fine ranking index resource pool and the coarse ranking index resource pool are combined into one index resource pool.
In some embodiments, each data node forms its own fine ranking index resource pool.
FIG. 6B is a data flow diagram of index resource pool construction according to this method embodiment. On the basis of FIG. 5B, a fine ranking resource pool is added: the fine ranking atomic correlation operators are registered according to the fine ranking features, and the index resources they require are registered to the shared resource pool, so that the fine ranking index resource pool is constructed by acquiring the required index resources from the index resource library.
S680: obtain the fine ranking feature vector of each document in the retrieval result of the coarse ranking stage. The method comprises:
calculating the first feature vector with the first-feature-vector executor, according to the calculation results and/or intermediate results of the coarse ranking atomic correlation operators in the atomic correlation operator pool;
calculating the second feature vector according to the calculation results and/or intermediate results of the coarse ranking and fine ranking atomic correlation operators, wherein the calculation results and/or intermediate results of the fine ranking atomic correlation operators are calculated based on the index resources of the fine ranking index resource pool and are saved in the atomic correlation operator pool for sharing.
In some embodiments, the first feature vector is obtained in the coarse ranking stage and the second feature vector is obtained in the fine ranking stage, and the two are then combined into the fine ranking feature vector.
In some embodiments, when each data node performs the coarse ranking retrieval, the documents of that node's coarse ranking retrieval result are the intersection of the global retrieval result and the documents on the node, and each data node obtains the fine ranking feature vectors of the documents in that intersection.
S690: the ranking model selects documents based on the fine ranking feature vectors, and the selected documents constitute the retrieval result of the fine ranking stage.
The ranking model is selected using the Learning to Rank plugin.
In some embodiments, each data node obtains the fine ranking feature vectors of the documents corresponding to that node, and the ranking model combines the fine ranking feature vectors of the documents from all nodes and then ranks the documents.
FIG. 6C shows the flow of sharing index resources and atomic correlation operator calculations in this step. The index resources in the fine ranking index resource pool are shared among the fine ranking atomic correlation operators; in their calculation, the first features among the fine ranking features share the coarse ranking atomic correlation operator calculations, and the second features share the fine ranking atomic correlation operator calculations.
In summary, the second document retrieval method embodiment of the present application adds a fine ranking stage to the first method embodiment. When the fine ranking features are extracted, the calculation results and/or intermediate results of the coarse ranking atomic correlation operators are shared; the index resources required only by the fine ranking atomic correlation operators are read from the index resource library in a single pass of I/O over the documents of the coarse ranking result, forming a fine ranking resource pool for sharing; and the calculation results and/or intermediate results of those fine-ranking-only atomic correlation operators are likewise shared when the fine ranking feature vectors are calculated. By sharing the I/O and the calculation results and/or intermediate results of the atomic correlation operators, the retrieval efficiency of the fine ranking stage is improved, and user satisfaction is improved accordingly.
A specific embodiment of a document retrieval method according to the present application is described below with reference to fig. 7A to 7E.
Before introducing the document retrieval method of this embodiment, and to facilitate understanding of the technical solution, the framework for implementing the method is first introduced with reference to FIG. 7B. The related concepts, such as the functions, actions, and requirements of the included modules, are introduced as follows:
1. Index resource pool: caches the index resources read from the index resource library into memory, the index resources comprising inverted (reverse) indexes and forward indexes. Through the index resource pool, the same index resource needs to be read only once (one I/O); which index resources the pool must acquire is registered to it by the filter module and the atomic operator calculation module before the query is executed. The characteristics of the index resource pool may specifically include:
a) Before the index resources are accessed for filtering and correlation operator calculation, the atomic correlation operators (Si) and atomic filters (filteri) register the inverted and forward index resources required during query execution to the index resource pool; each resource has a unique identifier, ensuring that the same resource is read from the index resource library only once (i.e., one I/O).
b) During query execution, before a document is processed, the index resource pool reads that document's index data, within the index resources read from the index resource library, into memory (i.e., into the index resource pool), so that the atomic correlation operators (Si) and atomic filters (filteri) can directly access the information in the pool read-only (addressing by identifier). Therefore, if multiple atomic correlation operators (Si) and atomic filters (filteri) access the same index resource, they share it in the index resource pool without repeatedly reading from the index resource library.
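The register-then-read-once behavior of the index resource pool can be sketched as a small cache keyed by the unique resource identifier. This is a minimal illustration, not the engine's implementation; `load_from_repository` stands in for the real repository I/O.

```python
# Sketch of the index resource pool: operators and filters register the
# resource identifiers they need; each distinct resource is read from
# the index resource library exactly once, then shared read-only.

class IndexResourcePool:
    def __init__(self, load_from_repository):
        self._load = load_from_repository
        self._registered = set()
        self._cache = {}
        self.io_count = 0            # how many repository reads happened

    def register(self, resource_id):
        # The same identifier may be registered by many operators/filters.
        self._registered.add(resource_id)

    def get(self, resource_id):
        if resource_id not in self._cache:
            self._cache[resource_id] = self._load(resource_id)
            self.io_count += 1       # one I/O per distinct resource
        return self._cache[resource_id]

repo = {"inv:title": [1, 3], "fwd:len": {1: 10, 3: 7}}
pool = IndexResourcePool(lambda rid: repo[rid])
for rid in ("inv:title", "fwd:len", "inv:title"):  # duplicate registration
    pool.register(rid)
postings = pool.get("inv:title")
postings_again = pool.get("inv:title")   # served from cache, no extra I/O
lengths = pool.get("fwd:len")
```

Two `get` calls for the same identifier trigger a single repository read; all consumers then share the one cached copy.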
2. Shared query (SharedQuery): before executing a user query statement, the search engine must generate from it a query execution plan, covering how documents are to be filtered and how the relevance score is to be calculated. This involves the user's query words, the required atomic filters (filteri), the coarse and fine ranking features, and the atomic correlation operators (Si) they depend on; the query execution plan is carried by SharedQuery, which contains the global static information of the user query that the search engine needs to execute. In this embodiment, the atomic filters (filteri) and atomic correlation operators (Si) generated from the atomic queries (Qi) access the index resource pool to obtain inverted and forward resources, so as to judge whether a document satisfies the filter conditions and to calculate the correlation operator scores. SharedQuery may specifically involve:
a) Queries list: the list of atomic queries (Qi) required in the search process. Each atomic query (Qi) has a unique name qName within SharedQuery (seven different atomic queries Q1 through Q7 are defined in the Queries list on the left of FIG. 7B).
b) Features list: a feature (Feature) is essentially an aggregate computation over the results or intermediate results of atomic correlation operators (Si), so feature execution depends on those operators. The Features list is defined before the two-stage search is performed. A feature is defined as Feature = <fName, values, exec(S)>, where the elements have the following meanings:
fName is the unique Feature (Feature) name within the current SharedQuery;
values is a list of strings marking which atomic Queries the Feature (Feature) depends on (the atomic query Qi has been defined in the Queries list);
exec(S) is the execution function definition, responsible for calculating the feature value from the results and intermediate results produced by the atomic correlation operators (Si) generated by the dependent atomic queries (Qi), where S is the set of atomic correlation operators (Si) generated when scoring the atomic queries (Qi) referenced in values.
c) Filter: a composite filter execution plan for filtering documents, comprising which atomic filters (filteri) generated by atomic queries (Qi) defined in the Queries list are used, and how the filter condition of a document is calculated (the Filter shown in FIG. 7B takes the OR of the filter conditions of Q1 and Q3, then the AND of that result with the filter condition of Q6; that is, a document in the coarse ranking stage must satisfy the filter condition of Q1 or Q3, and must also satisfy the condition of Q6).
d) Coarse ranking feature list: marks which of the features (Features) are required for calculating the correlation score in the coarse ranking stage.
e) Fine ranking feature list: marks which of the above features are to be extracted in the fine ranking stage on the TOPk result of the coarse ranking stage. These features are further divided into first features and second features, executed in different stages:
First (L1) features: fine ranking features that depend only on atomic correlation operators already used in the coarse ranking feature calculation; they can be executed in the coarse ranking stage.
Second (L2) features: fine ranking features that depend on atomic correlation operators not used in the coarse ranking feature calculation; they can be executed in the fine ranking stage.
The first features can be calculated incidentally after the atomic correlation operators (Si) are calculated in the coarse ranking stage, so extracting them in the fine ranking stage requires no re-reading of index resources and introduces no extra I/O. The L2 (second) features use index resources not involved in the coarse ranking stage, so they are suited to reading the index resources and being calculated in the fine ranking stage.
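The static plan carried by SharedQuery can be pictured as plain data, mirroring FIG. 7B. The sketch below uses ordinary dicts as stand-ins for the engine's plan objects; only a few of the features F1–F6 are spelled out, and the field names are illustrative.

```python
# Sketch of a SharedQuery's global static plan, mirroring FIG. 7B.
# Plain dicts stand in for the engine's real plan objects; only some
# features are shown for brevity.

shared_query = {
    "queries": ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7"],
    "features": {
        # Feature = <fName, values (dependent Qi), exec(S)>
        "F1": {"values": ["Q1"], "exec": "position"},
        "F2": {"values": ["Q6", "Q2", "Q5"], "exec": "sum"},
        "F5": {"values": ["Q1", "Q4"], "exec": "ax+by"},
    },
    "filter": "(Q1 || Q3) && Q6",
    "coarse_features": ["F1", "F2", "F4"],
    "fine_features": {"L1": ["F3", "F5"], "L2": ["F6"]},
}

def depends_on(feature_name):
    """Atomic queries a feature depends on, per its `values` list."""
    return shared_query["features"][feature_name]["values"]
```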
3. Atomic correlation operator pool (Scorer pool): during user query execution, the atomic correlation operators (Si) and corresponding atomic filters (filteri) are generated from the atomic queries (Qi) in the Queries list. Specifically, for a document that satisfies an atomic filter (filteri) condition (judged by accessing the required inverted and forward index resources in the index resource pool with the atomic filter), the atomic correlation operator (Si) can perform the correlation operator calculation and record the intermediate results required for calculating each feature (Feature). The atomic correlation operator pool may specifically involve:
a) Maintaining the atomic correlation operators (Si) generated by each atomic query (Qi) in an atomic correlation operator pool, wherein each atomic correlation operator (Si) name corresponds to the qName of the atomic query (Qi). For example, S1 corresponds to Q1, and S6 corresponds to Q6.
As shown in fig. 7B, Q1 to Q6 that need to be executed in the coarse sorting stage generate corresponding S1 to S6, and S1 to S6 are maintained in the Scorer pool.
b) After the atom correlation operator pool is generated, index resources required by each atom correlation operator (Si) in the atom correlation operator pool are registered in the index resource pool.
4. Document filter (Filter): filters out the qualifying documents using the index resources read from the index resource pool. Each atomic filter (filteri) can directly read the hit information in the index resource pool to judge whether its own condition is satisfied; the document filter is a composite filter that judges whether the composite condition is satisfied by referring to the judgment results of the atomic filters (filteri). Multiple composite filters can refer to the judgment result of the same atomic filter (filteri), sharing its output. The document filter may specifically involve:
a) A document filter is generated according to the logical formula of the Filter defined in SharedQuery, and filters documents during the coarse ranking stage. The document filter generates the atomic filters (filteri) corresponding to the atomic correlation operators (Si) by referring to the atomic queries (Qi) required by the Filter's logical formula, and implements the composite filter on their basis (each atomic filter takes the qName of its atomic query (Qi) as a unique identifier, and the document filter expresses the composite filtering logic by referring to these qNames).
b) When the composite filter judges whether a document satisfies the condition, the atomic filters (filteri) are executed first to judge whether the target document satisfies each atomic filter condition, and the results are recorded; the composite filtering logic is then executed, reading (read-only) the recorded filtering result of each referenced atomic filter (filteri). If several composite filter logics reference the same qName, the filteri with that qName is executed only once, so no redundant I/O against the index resource library is introduced.
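The execute-once-then-reference pattern for atomic filters can be sketched as follows. The composite formula (filter1 || filter3) && filter6 follows FIG. 7B; the hit sets and call counting are illustrative.

```python
# Sketch of the composite document filter (Q1 || Q3) && Q6: each atomic
# filter runs once per document and its result is recorded, then the
# composite logic only reads those recorded results.

def make_atomic_filters():
    calls = {"filter1": 0, "filter3": 0, "filter6": 0}
    hits = {"filter1": {1, 2}, "filter3": {3}, "filter6": {1, 3}}

    def run(name, doc):
        calls[name] += 1               # count executions per atomic filter
        return doc in hits[name]
    return run, calls

run, calls = make_atomic_filters()

def composite_filter(doc):
    # Evaluate every referenced atomic filter once and record the results...
    r = {n: run(n, doc) for n in ("filter1", "filter3", "filter6")}
    # ...then the composite logic reads the recorded results only.
    return (r["filter1"] or r["filter3"]) and r["filter6"]

passed = [d for d in (1, 2, 3, 4) if composite_filter(d)]
```

Even if several composite formulas referenced filter1, only the recorded result would be read again; the atomic filter itself runs once per document.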
5. Feature executor (LeafFeature): feature calculation depends on the outputs and recorded intermediate results of the atomic correlation operators (Si). Multiple features (Fi) may depend on the output and intermediate results of the same atomic correlation operator (Si), i.e., they share that operator's execution results. In this embodiment, the calculation of the atomic correlation operators (Si) and the calculation of the features (Fi) are separated into two steps: first, the atomic correlation operators (Si) calculate their outputs and record the additional intermediate results required by the features (Fi); then each feature (Fi) takes the outputs and intermediate results of the required atomic correlation operators (Si) as its calculation inputs, reading them when its value is computed. Specifically:
a) Just as an atomic query (Qi) can "define how to do, regardless of the specific index data", so can a feature (Fi): it defines "which atomic queries (Qi) in the Queries list are referenced, and how the feature value is calculated, regardless of the specific data". And just as an atomic filter (filteri) and an atomic correlation operator (Si) can be generated from an atomic query (Qi) to read specific data for filtering and scoring, a corresponding feature executor (LeafFeature) can be generated from a feature (Fi) to calculate the feature value of the document being processed.
b) The LeafFeature can be generated after the atomic correlation operator pool has been generated. When generated, the LeafFeature binds the atomic queries (Qi) referenced by its feature (Fi) to the generated atomic correlation operators (Si).
c) When a LeafFeature calculates a document's value on the corresponding feature (Fi), the query process guarantees that the atomic correlation operators (Si) in the pool have already produced their final results and recorded the necessary intermediate results for the current document. The LeafFeature only needs to run the exec(S) execution function from its feature triple: the set of bound atomic correlation operators (Si) is passed into exec, which reads the cached calculation results and intermediate results of those operators from the atomic correlation operator pool and calculates the value of the feature (Fi) from them.
For example, Feature2 in FIG. 7B references Q2, Q5, Q6; when LeafFeature2 is created, it binds S2, S5, S6 created from Q2, Q5, Q6, with exec(S) = score(S2) + score(S5) + score(S6), where score(s) denotes the relevance operator score of the document obtained from scorer s.
6. Execution planning: for the Features list defined in SharedQuery, it must be decided in which stage (coarse ranking or fine ranking) each feature (Fi) should be executed, so as to avoid redundant index resource I/O and calculation. The decision logic may be as follows:
i. Let Qα be the set of atomic queries referenced by the coarse ranking features; let Fα and Sα be the set of atomic filters and the set of atomic correlation operators generated by Qα, respectively; and let Rα be the set of index resources that Fα and Sα need to access.
ii. Let the sets of atomic queries referenced by the first features and the second features be Qβ1 and Qβ2, with corresponding Sβ1, Rβ1 and Sβ2, Rβ2.
iii. The coarse ranking features must be executed in the coarse ranking stage, so the Qα they reference must be executed there; that is, the coarse ranking stage must read Rα and execute Fα and Sα.
iv. For a fine ranking feature, if the index resources to be accessed by the atomic filters and atomic correlation operators generated by the atomic queries it references fall within Rα, it is a first feature; otherwise it is a second feature.
v. The first features can be executed in the coarse ranking stage, and the second features are executed separately in the fine ranking stage.
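The decision logic can be expressed as a subset test on index resources: a fine ranking feature whose required resources fall within Rα is an L1 (first) feature, otherwise it is L2 (second). The resource identifiers below are illustrative stand-ins.

```python
# Sketch of the execution-plan decision: classify each fine ranking
# feature by whether the index resources it needs are already inside
# R_alpha, the resources read for the coarse ranking stage.

r_alpha = {"inv:Q1", "inv:Q3", "inv:Q4", "fwd:doclen"}

fine_feature_resources = {
    "F3": {"inv:Q1", "inv:Q3"},
    "F5": {"inv:Q1", "inv:Q4"},
    "F6": {"inv:Q7"},     # Q7's index is not touched in the coarse stage
}

l1 = sorted(f for f, r in fine_feature_resources.items() if r <= r_alpha)
l2 = sorted(f for f, r in fine_feature_resources.items() if not r <= r_alpha)
```

L1 features thus cost no extra I/O, which is exactly why they are computed incidentally during the coarse ranking stage.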
7. Coarse ranking stage (or Query stage): responsible for executing the scoring and feature extraction logic required by the coarse ranking stage. That is, for each document to be processed in the coarse ranking stage, logic along the following lines is executed:
i. The index resource pool reads the inverted and forward index resources required by the coarse ranking atomic filters and atomic correlation operators from the index resource library: for example, it reads the values of the index resources Rα in the execution plan on the document, i.e., the required inverted and forward index resources.
ii. Read the Rα index data in the index resource pool and execute the atomic filters Fα; the document filter is then invoked to judge, from the result of each filter in Fα, whether the document satisfies the composite filter condition.
iii. If the filter condition is satisfied, the operators Sα in the atomic correlation operator pool read values from the Rα index data in the index resource pool, calculate the Sα scores, and record the intermediate results of interest to the features.
iv. Calculate the coarse ranking feature values, e.g., F1, F2, F4 in FIG. 7B.
v. Calculate the coarse ranking score from the coarse ranking feature values: using the document's coarse ranking features as input, calculate the coarse ranking score of the document.
vi. If the document may enter TOPk, execute Sβ1 and calculate the first features, recording the results. When executing Sβ1, the resource pool does not need to read Rβ1 additionally, because Rβ1 ⊆ Rα.
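The per-document flow of the coarse ranking stage can be sketched end to end: filter, score the atomic operators, compute the coarse features and score, and compute the L1 features incidentally. All scoring functions and values are toy stand-ins, not the engine's formulas.

```python
# Sketch of the coarse (Query) stage per-document pipeline:
# composite filter -> atomic operator scores -> coarse features ->
# coarse score, plus incidental L1 features for TOPk candidates.

def coarse_stage(doc, index):
    # Composite filter (Q1 || Q3) && Q6, as in FIG. 7B.
    if not ((doc in index["Q1"] or doc in index["Q3"]) and doc in index["Q6"]):
        return None                    # fails the filter, not scored
    # Toy atomic operator scores, one per query (stand-in for Si).
    s = {q: float(len(index[q])) for q in ("Q1", "Q2", "Q3", "Q4", "Q5", "Q6")}
    f1 = s["Q1"]                       # toy coarse features F1, F2
    f2 = s["Q6"] + s["Q2"] + s["Q5"]
    coarse_score = f1 + f2
    # L1 features reuse operators already computed: no extra index I/O.
    l1 = {"F3": s["Q1"] * s["Q3"], "F5": 0.7 * s["Q1"] + 0.3 * s["Q4"]}
    return {"doc": doc, "score": coarse_score, "l1": l1}

index = {q: {1, 3} for q in ("Q1", "Q2", "Q3", "Q4", "Q5", "Q6")}
hit = coarse_stage(1, index)
miss = coarse_stage(2, index)          # doc 2 fails the filter
```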
8. Fine ranking stage (or Fetch stage): responsible for extracting the second (L2) fine ranking feature values; that is, for the TOPn result of the coarse ranking stage, the following logic is executed:
i. The index resource pool reads the values of the Rβ2 index resources on the documents from the index resource library.
ii. Read the Rβ2 index data in the index resource pool, calculate the operators Sβ2 in the atomic correlation operator pool, and calculate the second fine ranking feature values.
9. Result aggregation: after the second fine ranking features have been executed for each document in TOPn, the aggregation module combines the first feature values recorded in the coarse ranking stage and the second fine ranking feature results into a complete feature vector, which is input into the fine ranking model for reordering, as one of the outputs of the two-stage retrieval.
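The aggregation step can be sketched as a merge of the two partial vectors followed by a rerank. The model here is a toy weighted sum standing in for the Learning to Rank model; document IDs, feature values, and weights are all illustrative.

```python
# Sketch of result aggregation: for each TOPn document, merge the L1
# feature values recorded in the coarse stage with the L2 values from
# the fine stage into one vector, then rerank with a (toy) model.

l1_values = {101: {"F3": 3.0, "F5": 1.7}, 102: {"F3": 1.0, "F5": 2.0}}
l2_values = {101: {"F6": 0.2}, 102: {"F6": 0.9}}

def full_vector(doc):
    merged = {**l1_values[doc], **l2_values[doc]}
    return [merged[f] for f in ("F3", "F5", "F6")]  # fixed feature order

def toy_model(vec):
    # Stand-in for the fine ranking model: a weighted sum.
    weights = [0.5, 0.3, 0.2]
    return sum(w * v for w, v in zip(weights, vec))

reranked = sorted(l1_values, key=lambda d: toy_model(full_vector(d)),
                  reverse=True)
```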
In the following, a specific embodiment of the document retrieval method of the present application is further described with reference to fig. 7A to 7E:
detailed description of a document retrieval method
This embodiment of the document retrieval method runs on the retrieval server 420 in a distributed deployment: the retrieval server 420 comprises a coordinating node and multiple data nodes; the data nodes store local index resource libraries and perform local coarse and fine ranking retrieval, and the coordinating node forms the global retrieval result from the local retrieval results of the data nodes. The local retrieval performed by the data nodes is described below with reference to FIGS. 7A to 7E.
This embodiment takes the second method embodiment as an example and performs the search with the two-stage ElasticSearch + Learning to Rank plugin approach. FIG. 7A shows the detailed flow of the embodiment, comprising steps S710 to S790.
S710: the retrieval server acquires a retrieval sentence.
The retrieval statement is a sentence or one or more words input by the user through the retrieval client; when multiple words are input, they may be separated by punctuation marks such as spaces, commas, or enumeration commas. The retrieval condition may also be words combined with a logical expression.
When the retrieval condition is a sentence, the words it contains are extracted by word segmentation.
The retrieval server is any data node or the coordinating node; the data node closest to the retrieval client 410 may be selected.
S720: the retrieval server generates a global shared query (SharedQuery) from the retrieval statement.
Wherein the content of the global shared query includes:
each atomic query (Qi) required to execute the search, each defined feature (Fi), the list of features (Fi) involved in the coarse ranking stage, the list of features (Fi) involved in the fine ranking stage, and the execution plan of the composite filter (Filter). FIG. 7B shows the contents of the global shared query and the node shared query, which are described in detail below in conjunction with this embodiment.
The features comprise the coarse ranking features in the coarse ranking feature vector and the fine ranking features in the fine ranking feature vector, and the fine ranking feature vector is divided into a first feature vector and a second feature vector. For convenience, in the following description the coarse ranking feature vector is represented by a coarse ranking feature list, the fine ranking feature vector by a fine ranking feature list, the first feature vector by an L1 feature list, and the second feature vector by an L2 feature list.
Fig. 7C shows a flow of the global shared query construction method of this step, including the following steps S721 to S724.
S721: generate multiple atomic queries from the words of the retrieval statement input by the user.
The atomic queries are generated according to the ElasticSearch method and may be stored in an atomic query list.
In this embodiment, as shown in fig. 7B, 7 atomic queries Q1-Q7 are generated.
S722: individual features (Fi) are created and a function is defined for each feature.
Each feature (Fi) is generated from the user's retrieval statement using the ElasticSearch + Learning to Rank plugin method.
The defined function describes the algorithm for obtaining the feature by means of at least one atomic correlation operator (Si) corresponding to an atomic query (Qi). Since the scoring tree is obtained from the query tree, each atomic correlation operator (Si) corresponds to an atomic query (Qi).
In this embodiment, as shown in FIG. 7B, the created features include six features F1 to F6. Feature F1 is defined as <F1, value[Q1], position>, meaning that F1 calculates its feature value from the keyword position output by S1 corresponding to Q1. Feature F2 is defined as <F2, value[Q6, Q2, Q5], sum>, meaning that F2 calculates its feature value from S6, S2, S5 corresponding to Q6, Q2, Q5 by summing their scores. Feature F5 is defined as <F5, value[Q1, Q4], ax+by>, meaning that its feature value is calculated as the function a·S1 + b·S4 from S1 and S4 corresponding to Q1 and Q4.
S723: create the logical formula of the document filter (Filter), which consists of a logical expression and filter objects, each filter object being one of the atomic queries (Qi) used above.
In this embodiment, as shown in FIG. 7B, the logical formula of the created document filter (Filter) is (Q1 || Q3) && Q6, indicating that the document filter takes the OR of the filter conditions of Q1 and Q3, and then the AND of that result with the filter condition of Q6.
S724: mark the coarse ranking feature vector and the fine ranking feature vector, and the first feature vector and the second feature vector within the fine ranking feature vector.
The coarse ranking feature list marks which features (Fi) are needed when the correlation score is calculated in the coarse ranking stage. As shown in FIG. 7B, the coarse ranking feature list is [F1, F2, F4], i.e., F1, F2, and F4 are involved in calculating the correlation score in the coarse ranking stage.
The fine ranking feature list marks which of the above features (Fi) are to be extracted in the fine ranking stage on the TOPk result of the coarse ranking stage. As shown in FIG. 7B, the fine ranking feature list is [F3, F5, F6], i.e., F3, F5, and F6 are involved in calculating the correlation score in the fine ranking stage.
The L1 feature list contained in the fine ranking feature list records the features whose calculation is contained in the coarse ranking stage, and the L2 feature list contained in the fine ranking feature list records the features whose calculation is contained in the fine ranking stage. As shown in FIG. 7B, the L1 feature list relates to F3 and F5, and the L2 feature list relates to F6.
As shown in FIG. 7B, the atomic queries (Qi) related to the features (Fi) recorded in the L1 feature list are a subset of the atomic queries (Qi) related to the features (Fi) recorded in the coarse ranking feature list; the atomic queries related to the L1 feature list comprise Q1, Q3, and Q4, and the atomic queries related to the L2 feature list comprise Q7.
S730: the retrieval server sends the created global shared query (SharedQuery) to each data node through the coordinating node.
In this specific embodiment, the data nodes are distributed, each data node queries different node index resources, and the node index resource of each data node includes a reverse index resource and a forward index resource of a document on the data node.
S740: each data node executes the coarse ranking stage query and returns the query result.
The query result comprises the TOPk documents of the data node's coarse ranking stage and their relevance scores, and also comprises the L1 feature values of the documents calculated in the coarse ranking stage. For each data node, the executed steps include the following substeps S741 to S746:
s741: and the data node generates each atomic correlation operator (Si) and atomic filter (filter) of the rough stage based on the rough stage atomic query (Qi) information in the global shared query (SharedQuery).
Wherein, the related atom correlation operator (Si) is defined for each rough row characteristic (Fi) marked in the rough row stage when each atom correlation operator (Si) is generated, wherein part of the atom correlation operators (Si) can also be the related atom correlation operators (Si) defined for each L1 characteristic (Fi) marked in the fine row stage. As shown in FIG. 7B, the rough feature list is [ F1, F2, F4], and the atomic correlation operators involved according to the definition of the F1, F2 and F4 features of the rough feature are S1-S6, and S7 is not involved, so the generated atomic correlation operators include S1-S6, and the atomic correlation operators involved according to the F3 and F5 features of the L1 feature are S1, S3 and S4. Wherein, each Si can be stored in the form of atom correlation operator pool (scorer pool).
The atomic filters obtained from the logic formula (Q1 || Q3) && Q6 of the document filter (Filter) are Filter1, Filter3, and Filter6, corresponding to the atomic queries Q1, Q3, and Q6 and the atomic correlation operators S1, S3, and S6, respectively.
S742: according to the index resources required by each atomic correlation operator (Si) and each atomic filter (Filteri), the corresponding index resources are obtained from the index resource library of the data node and cached, forming the coarse-ranking index resource pool. The index resources here include inverted index resources and forward index resources.
Specifically: first, the coarse-ranking index resource pool is created, and each atomic correlation operator (Si) and each atomic filter (Filteri) registers the resources it requires with the pool; then, the required index resources are read from the index resource library according to the registered requirements and cached.
In this way, when different atomic correlation operators and atomic filters involve the same index resource, the index resource library is accessed once and the resource is cached; subsequent uses obtain the resource from the pool instead of repeatedly accessing the index resource library.
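The register-then-read flow of S742 can be sketched as follows. This is a minimal, hypothetical Python rendering (the class, the key names, and the dict-backed repository are all invented for illustration), showing how overlapping registrations from different operators and filters trigger only one read per distinct resource:

```python
# Hypothetical sketch of the coarse-ranking index resource pool of S742:
# operators and filters first register the resources they need, then the pool
# reads each distinct resource from the index resource library exactly once.
class IndexResourcePool:
    def __init__(self, index_repository):
        self._repo = index_repository      # backing index resource library
        self._needed = set()               # registered resource keys
        self._cache = {}                   # resource key -> cached resource
        self.repo_reads = 0                # I/O counter, for illustration

    def register(self, resource_key):
        """Called by each atomic operator/filter for every resource it needs."""
        self._needed.add(resource_key)

    def load(self):
        """Read every registered resource from the repository once and cache it."""
        for key in self._needed:
            if key not in self._cache:
                self._cache[key] = self._repo[key]
                self.repo_reads += 1

    def get(self, resource_key):
        """Later accesses hit the cache instead of the repository."""
        return self._cache[resource_key]


# Two operators and one filter registering an overlapping resource ("term:q1"):
repo = {"term:q1": [1, 2, 5], "term:q3": [2, 7], "term:q6": [1, 5, 9]}
pool = IndexResourcePool(repo)
for key in ["term:q1", "term:q3", "term:q1", "term:q6"]:   # duplicate register
    pool.register(key)
pool.load()
```

Despite four registrations, only three repository reads occur, which is the repeated-I/O saving the step describes.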
S743: the corresponding atomic filters (Filteri) are bound according to the logic formula of the document filter (Filter) and the filtering objects (i.e., the atomic queries) in that formula, generating the document filter.
As shown in FIG. 7B, after binding the atomic filters Filter1, Filter3, and Filter6 in this embodiment, the generated document filter is (Filter1 || Filter3) && Filter6.
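As a hedged illustration of binding atomic filters into the document filter for the logic formula (Q1 || Q3) && Q6, the following sketch uses toy posting lists in place of the real index resources; all names and values are invented:

```python
# Minimal, illustrative rendering of S743: each atomic filter is a membership
# test over a posting list, and the document filter composes them according to
# the logic formula (Q1 || Q3) && Q6 from the embodiment.
def make_atomic_filter(posting_list):
    postings = set(posting_list)
    return lambda doc_id: doc_id in postings

filter1 = make_atomic_filter([1, 2, 5])   # docs matching atomic query Q1
filter3 = make_atomic_filter([2, 7])      # docs matching atomic query Q3
filter6 = make_atomic_filter([1, 5, 9])   # docs matching atomic query Q6

def document_filter(doc_id):
    # composite filtering condition: (Q1 || Q3) && Q6
    return (filter1(doc_id) or filter3(doc_id)) and filter6(doc_id)

hits = [d for d in [1, 2, 5, 7, 9] if document_filter(d)]
```

Only documents satisfying the whole composite condition survive; the rest are dropped before any scoring work is done.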
S744: according to the definition of each feature (Fi), the corresponding atomic correlation operators (Si) are bound via each atomic query (Qi) in the definition, generating each feature calculator (LeafFeaturei).
As shown in FIG. 7B, according to the definition of F1, after binding the atomic correlation operator S1 corresponding to the atomic query Q1, the feature calculator LeafFeature1 = position(S1) is generated, where position denotes a position function. Similarly, according to the definition of F2, after binding the atomic correlation operators S2, S5, and S6 corresponding to the atomic queries Q2, Q5, and Q6, the feature calculator LeafFeature2 = score(S2) + score(S5) + score(S6) is generated.
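The binding in S744 might look like the following sketch, in which `position` and `score` outputs are represented as plain dictionary entries fetched from the operator pool; every value is invented for illustration:

```python
# Hypothetical sketch of S744: feature calculators bound to the atomic
# correlation operators named in each feature's definition.
def leaf_feature1(results):
    # F1 = position(S1): reads the match position produced by operator S1
    return results["S1"]["position"]

def leaf_feature2(results):
    # F2 = score(S2) + score(S5) + score(S6): sums three operator scores
    return (results["S2"]["score"]
            + results["S5"]["score"]
            + results["S6"]["score"])

# Per-document operator outputs, as they would come from the operator pool:
operator_results = {
    "S1": {"position": 3},
    "S2": {"score": 0.5},
    "S5": {"score": 0.25},
    "S6": {"score": 0.25},
}
f1 = leaf_feature1(operator_results)   # feature value of F1
f2 = leaf_feature2(operator_results)   # feature value of F2
```

Because the calculators only read from the shared results, the same operator output can feed several features without being recomputed.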
S745: each document in the coarse-ranking index resource pool is obtained and, in turn, subjected to the filtering, scoring, sorting, and L1 feature calculation of the coarse-ranking stage, and the relevant results of the coarse-ranking stage are output.
Document identifiers (doc ids) are recorded in each posting list of the registered index resources, so every document identifier that may satisfy the document filter can be obtained. FIG. 7D shows the flow of performing filtering, scoring, sorting, and L1 feature calculation for each document identifier in turn, comprising steps S74501-S74510:
S74501: the identifier of the document currently to be processed, i.e., the target document, is obtained, and each index resource containing the target document is obtained from the index resources of the coarse-ranking index resource pool.
S74502: each atomic filter (Filteri) is executed to obtain the atomic filtering result of each atomic filter for the target document.
In this embodiment, the filtering results of the target document under Filter1, Filter3, and Filter6 are obtained.
S74503: the document filter is executed, i.e., a logic operation is performed according to the logic formula of the document filter (the logic formula is the composite filtering condition), based on the filtering result of each atomic filter (Filteri), obtaining the filtering result for the target document.
In this embodiment, the filtering result of the target document is obtained through the document filter's logic formula (Filter1 || Filter3) && Filter6.
S74504: whether the target document is hit is judged; if so, the next step is executed, otherwise step S74510 is executed.
If the filtering result of step S74503 is empty, the target document does not satisfy the composite filtering condition of the document filter, i.e., the document is not hit.
S74505: each atomic correlation operator (Si) of the coarse-ranking stage is executed; the forward index resources of the target document and its inverted index resources in the posting lists are obtained from the coarse-ranking index resource pool for scoring, referred to below as atomic scoring. During atomic scoring, the calculation result and/or intermediate result of each atomic correlation operator (Si) is stored in the atomic correlation operator pool to facilitate subsequent sharing.
In this embodiment, the atomic correlation operators S1-S6 are each executed to perform atomic scoring, and the calculation results and intermediate results of each operator are stored in the atomic correlation operator pool.
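A minimal sketch of the sharing in S74505, assuming a dictionary-backed operator pool: each atomic correlation operator's result is computed once per document and served from the pool afterwards. The scoring function and the term-frequency table are purely illustrative:

```python
# Hedged sketch of the atomic correlation operator pool of S74505: results
# (and intermediates) are memoized per (operator, document) so that later
# feature calculations reuse them instead of re-scoring.
class ScorerPool:
    def __init__(self):
        self._results = {}       # (operator_id, doc_id) -> cached result
        self.compute_calls = 0   # how many real scorings ran, for illustration

    def score(self, operator_id, doc_id, scoring_fn):
        key = (operator_id, doc_id)
        if key not in self._results:
            self._results[key] = scoring_fn(doc_id)
            self.compute_calls += 1
        return self._results[key]

pool = ScorerPool()
tf = {5: 2, 7: 1}                      # toy term frequencies as "index resources"
s1 = lambda doc: tf.get(doc, 0) * 0.5  # invented scoring rule for operator S1

a = pool.score("S1", 5, s1)   # computed once
b = pool.score("S1", 5, s1)   # served from the pool, no recomputation
```

The second call returns the cached value, which is the repeated-computation saving the step describes.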
S74506: each feature calculator (LeafFeaturei) of the coarse-ranking stage is executed, i.e., the feature value of each corresponding feature (Fi) is obtained according to the functional formula of that feature calculator, based on the calculation results and/or intermediate results of the atomic correlation operators.
The calculation results and/or intermediate results of the atomic correlation operators (Si) are obtained directly from the atomic correlation operator pool.
In this embodiment, the features F1, F2, and F4 involved in the coarse-ranking stage are obtained from the coarse-ranking feature list, and their feature values are calculated by importing the calculation results of each atomic correlation operator (Si) into the feature calculators LeafFeature1, LeafFeature2, and LeafFeature4 generated in step S744.
S74507: the coarse-ranking relevance score of the target document is calculated from the coarse-ranking feature values.
The coarse-ranking relevance score is calculated by a predefined algorithm that takes the feature value of each feature in the coarse-ranking feature list as input and outputs the coarse-ranking relevance score.
S74508: whether the coarse-ranking relevance score of the target document is above the current TOPk ranking is judged; if so, the next step is executed, otherwise step S74510 is executed.
TOPk is less than or equal to TOPn, and the value of k is determined based on the number of data nodes and TOPn.
S74509: the target document is included in the TOPk ranking, the TOPk ranking is refreshed, and each feature value of the L1 feature list within the fine-ranking feature list is calculated for the target document.
Alternatively, the L1 feature values may be calculated in the fine-ranking stage.
Since the operators (Si) involved in the feature calculations of the L1 feature list are a subset of those involved in the coarse-ranking feature list, calculating the L1 features can reuse the resources in the coarse-ranking index resource pool and the cached calculation results and/or intermediate results of the operators, without additional I/O.
In this embodiment, since the L1 feature list contains F3 and F5, and the operators S1, S3, and S4 required to calculate their feature values were already computed in step S74505, the required calculation results and/or intermediate results of S1, S3, and S4 can be imported into LeafFeature3 and LeafFeature5 (corresponding to F3 and F5), following the principle of step S74506, to calculate the feature values of F3 and F5.
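The TOPk maintenance of S74508/S74509 can be sketched with a size-k min-heap, with the L1 features computed at admission time as described above. The scoring inputs and the L1 feature function are invented for illustration:

```python
# Illustrative sketch of S74508/S74509: a min-heap of size k keeps the current
# TOPk; when a document is admitted, its L1 feature values are computed at once
# (reusing cached operator results), and evicted documents drop their features.
import heapq

def coarse_rank(scored_docs, k, compute_l1):
    heap = []          # (score, doc_id) min-heap holding the current TOPk
    l1_features = {}   # doc_id -> L1 feature values
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
            l1_features[doc_id] = compute_l1(doc_id)
        elif score > heap[0][0]:          # beats the worst of the current TOPk
            _, evicted = heapq.heapreplace(heap, (score, doc_id))
            l1_features.pop(evicted, None)
            l1_features[doc_id] = compute_l1(doc_id)
    topk = sorted(heap, reverse=True)     # highest score first
    return [(d, s) for s, d in topk], l1_features

docs = [(1, 0.3), (2, 0.9), (3, 0.5), (4, 0.7)]   # (doc_id, coarse score)
topk, l1 = coarse_rank(docs, 2, lambda d: {"F3": d * 0.1, "F5": d * 0.2})
```

Only documents that survive in TOPk keep L1 feature values, matching the claim that L1 extraction piggybacks on the coarse-ranking pass.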
S74510: whether there are unprocessed documents in the coarse-ranking index resource pool is judged; if so, the flow returns to step S74501 to read the next document identifier, otherwise S74511 is executed.
S74511: at this point the data node has obtained the TOPk documents of the node, their coarse-ranking relevance scores, and the feature values of the L1 feature list of those documents; the data node sends the TOPk documents and the corresponding coarse-ranking relevance scores to the coordinating node as the query result.
S750: after aggregating the query results of the data nodes, the coordinating node generates a global query result and sends it to each data node.
The coordinating node sorts the TOPk documents provided by each data node according to their coarse-ranking relevance scores and selects the TOPn documents as the global query result.
The coordinating node may also send the TOPn documents only to the data nodes related to them, where a related data node is one whose reported TOPk documents intersect the TOPn documents.
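Under invented per-node inputs, the coordinator-side aggregation of S750 and the per-node intersection performed afterwards might look like this sketch:

```python
# Hypothetical rendering of the global aggregation: the coordinating node
# merges each data node's TOPk by coarse-ranking score into a global TOPn,
# then each node intersects its reported TOPk with TOPn to obtain its TOPx.
def aggregate(per_node_topk, n):
    merged = [item for topk in per_node_topk.values() for item in topk]
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:n]                       # global TOPn as (doc_id, score)

per_node = {
    "node_a": [(1, 0.9), (2, 0.4)],         # node_a's TOPk
    "node_b": [(3, 0.8), (4, 0.6)],         # node_b's TOPk
}
topn = aggregate(per_node, 3)
topn_ids = {doc for doc, _ in topn}

# Each node's TOPx = intersection of its TOPk with the global TOPn
topx_a = [doc for doc, _ in per_node["node_a"] if doc in topn_ids]
topx_b = [doc for doc, _ in per_node["node_b"] if doc in topn_ids]
```

A node whose documents all fell outside TOPn ends up with an empty TOPx and does no fine-ranking work.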
S760: a data node receiving the global query result takes the intersection of its reported TOPk documents and the received TOPn documents to obtain the TOPx documents to be processed in the fine-ranking stage. The following steps S761-S765 are then performed:
S761: according to the L2 list of the fine-ranking feature list, atomic correlation operators (Si) are generated based on the atomic queries (Qi) involved in the definition of each feature (Fi) in the L2 list, and the fine-ranking atomic correlation operators (Si) are determined.
The fine-ranking atomic correlation operators (Si) are those not used in the coarse-ranking stage.
In this embodiment, the L2 list contains F6; from the definition of F6, <F6, [Q7], sum>, the feature value of F6 is calculated based on the atomic correlation operator S7 corresponding to Q7, so the corresponding S7 is generated based on Q7.
S762: the index resources required by the fine-ranking atomic correlation operators (Si) for the TOPx documents are obtained from the index resource library of the data node and cached in the fine-ranking index resource pool.
In this embodiment, the index resources cached in the fine-ranking index resource pool are those required by S7 for the TOPx documents, including the inverted index resources in the posting lists and the forward index resources.
In other embodiments, the newly registered index resources may instead be cached in the coarse-ranking index resource pool.
S763: the corresponding atomic correlation operators (Si) are bound according to the definition of each feature (Fi) in the L2 list and each atomic query (Qi) in the definition, generating each feature calculator (LeafFeaturei) of the L2 list. For details, refer to step S744 above, which is not repeated here.
In this embodiment, LeafFeature6 is generated for calculating the feature value of F6 in a subsequent step.
S764: for each document in TOPx, the feature values of the features in the fine-ranking L2 list are obtained in turn. FIG. 7E shows the detailed flow of this step, comprising steps S76401-S76404.
S76401: the identifier of the document currently being processed in TOPx, i.e., the target document, is obtained, and each forward index resource and inverted index resource containing the target document is obtained from the index resources of the fine-ranking index resource pool.
S76402: the fine-ranking atomic correlation operators (Si) determined in step S761 are executed to perform atomic scoring; see step S74505, not repeated here.
S76403: each feature calculator (LeafFeaturei) involved in the L2 list is executed to obtain the feature value of each corresponding feature; see step S74506, not repeated here.
In this embodiment, the feature value of F6 is calculated by executing LeafFeature6.
S76404: the data node has now obtained the feature values of the L2 feature list for all TOPx documents, and sends these fine-ranking L2 feature values of the TOPx documents to the coordinating node.
S770: the coordinating node collects the L2 feature values of the TOPx subsets sent by the data nodes, covering the TOPn documents, and for each document in TOPn merges the cached L1 feature vector and the received L2 feature vector into a complete fine-ranking feature vector.
S780: the coordinating node sorts the TOPn documents according to their fine-ranking feature vectors and selects the TOPm documents as the final search result.
The fine-ranking feature vector of each document in TOPn may be input into a Learning-to-Rank (LTR) model, and the documents ranked by the scores computed by the model.
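A hedged sketch of this selection step: a hand-weighted linear model stands in for a trained LTR model here, and all feature values and weights are invented for illustration:

```python
# Illustrative fine-ranking over fine-ranking feature vectors; a simple linear
# scorer plays the role of the (trained) LTR model mentioned in the text.
def ltr_score(feature_vector, weights):
    return sum(f * w for f, w in zip(feature_vector, weights))

def fine_rank(docs_with_features, weights, m):
    scored = [(doc, ltr_score(fv, weights)) for doc, fv in docs_with_features]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:m]]   # final TOPm search results

# fine-ranking feature vector = L1 features (e.g., F3, F5) ++ L2 features (F6)
docs = [
    (101, [0.2, 0.5, 0.1]),
    (102, [0.9, 0.1, 0.3]),
    (103, [0.4, 0.4, 0.9]),
]
weights = [1.0, 1.0, 2.0]   # invented model weights
topm = fine_rank(docs, weights, 2)
```

In practice the linear scorer would be replaced by whatever ranking model the system deploys; only the select-TOPm shape is the point here.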
S790: the coordinating node returns the final search result to the retrieval server, and the retrieval server sends it to the retrieval client for display.
[ first embodiment of apparatus ]
The document retrieval apparatus implements the method of the first method embodiment.
Fig. 8A shows the structure of the first apparatus embodiment, which can be applied to the coarse-ranking stage and includes: an acquisition module 810, a shared query construction module 820, a filtering module 830, a scoring module 840, and a coarse-ranking module 850. In some embodiments, when multiple data nodes perform the document filtering and scoring of the coarse-ranking stage, the first apparatus embodiment further includes an aggregation module.
The obtaining module 810 executes the method of step S510 of the method embodiment for obtaining the retrieval statement.
Shared query building module 820 performs the method of method embodiment step S520 for generating filters and scorers based on the search statement.
The obtaining module 810 further executes the method of step S530 in the method embodiment, and is configured to access the index resource library, obtain the index resources required by the filter and the scorer, and form a rough index resource pool.
The filtering module 830 executes the method of step S540 of the method embodiment, and is configured to filter out documents that meet the requirement of the filter based on the index resources in the rough index resource pool.
The scoring module 840 performs the method of step S550 of the method embodiment, for calculating relevance scores for the filtered documents based on their coarse-ranking feature vectors, which are calculated by the scorer.
The coarse-ranking module 850 performs the method of step S560 of the method embodiment, for selecting documents according to the relevance scores; the selected documents serve as the search results of the coarse-ranking stage.
In some embodiments, when multiple data nodes perform the document filtering and scoring of the coarse-ranking stage, the aggregation module aggregates the documents selected by the scoring modules 840 of the data nodes to generate the global retrieval result of the coarse-ranking stage.
In some embodiments, the scorer comprises an atomic relevance operator, the document filter comprises an atomic filter, and the obtaining the index resources required by the document filter and the scorer comprises: and acquiring index resources required by the atom correlation operator and index resources required by the atom filter.
In some embodiments, the filtering module 830 is specifically configured to first filter out the documents meeting each atomic filter based on the index resources in the coarse-ranking index resource pool, and then, among the documents passing this first filtering, filter out the documents meeting the document filter, where the document filter describes a combinational logic of a plurality of atomic filters.
In some embodiments, when calculating the coarse-ranking feature vector, the scoring module 840 is specifically configured to calculate it based on the calculation results of the atomic correlation operators, and the calculation results of the atomic correlation operators are shared during the calculation of the coarse-ranking feature vector.
In some embodiments, the atomic correlation operators include those used in the coarse-ranking stage to compute the coarse-ranking feature vector.
[ second example of device ]
A second embodiment of a document retrieval device performs the method of the second embodiment of the method.
Fig. 8B shows the structure of the second apparatus embodiment, which can be applied to the fine-ranking stage; on the basis of the modules of the first apparatus embodiment it adds a feature extraction module 860 and a fine-ranking module 870, described below.
The shared query construction module 820 also performs the method of step S620 of the second method embodiment, for generating the document filter, the scorer, and the fine-ranking feature executor from the search statement.
The obtaining module 810 also performs the method of step S670 of the method embodiment, for accessing the index resource library, obtaining the index resources required by the atomic correlation operators of the second feature vector, and forming the fine-ranking index resource pool.
The feature extraction module 860 performs the method of step S680 of the method embodiment, for obtaining the fine-ranking feature vector of each document in the search results of the coarse-ranking stage, where the fine-ranking feature vector is calculated based on the calculation results and/or intermediate results of the atomic correlation operators.
The feature extraction module 860 includes a first feature extraction module 8610 and a second feature extraction module 8620.
The first feature extraction module 8610 is configured to extract the first feature vector within the fine-ranking feature vector; the atomic correlation operators on which the first feature vector is calculated belong to those used in the coarse-ranking feature vector calculation. During calculation of the first feature vector, the calculation results and/or intermediate results of those operators are shared.
The second feature extraction module 8620 is configured to extract the second feature vector within the fine-ranking feature vector; the atomic correlation operators on which the second feature vector is calculated include operators not used in the coarse-ranking feature vector calculation, and the calculation results and/or intermediate results of these unused operators are shared during calculation of the second feature vector.
In some embodiments, the first feature extraction module 8610 runs in the coarse-ranking stage; in other embodiments it runs in the fine-ranking stage. When the first feature extraction module 8610 runs in the coarse-ranking stage, the apparatus further comprises a feature aggregation module configured to combine the first feature vector and the second feature vector into the fine-ranking feature vector.
The fine ranking module 870 is configured to select a document based on the fine ranking feature vector by the ranking model, where the selected document is used as a search result of the fine ranking stage.
In some embodiments, the obtaining module 810 is further configured to access the index resource library and obtain the index resources of the unused atomic correlation operators, these index resources forming the fine-ranking index resource pool; when computing the calculation results and/or intermediate results of the unused atomic correlation operators, the second feature extraction module 8620 computes based on the index resources of the fine-ranking index resource pool.
[ analysis of advantageous effects ]
1. Based on the shared index resource pool, the embodiments of the application reduce I/O accesses to the index resource library, avoid the time consumed by repeated I/O accesses, solve the repeated-I/O problem shown by the dotted arrows in FIG. 3, and improve retrieval efficiency.
2. In the embodiments of the application, atomic scoring is shared: the atomic correlation operators are reused in the calculation of each feature value, and the feature values can in turn be shared for calculating the relevance score. This sharing effectively avoids repeated computation, solving the repeated-computation problem shown by the dotted lines in the calculation layer of FIG. 3, thereby shortening computation time and improving retrieval efficiency.
3. The relevance score calculation of the embodiments of the application is more flexible: it can not only reproduce score calculation based on a scoring tree, but also support calculation modes such as conditional judgment and complex mathematical logic. The specific analysis is as follows:
As shown in FIG. 5C, the relevance score of the application may be the coarse-ranking relevance score or the fine-ranking relevance score. In the embodiments of the application, from the definition of a feature (Feature), a feature is essentially an aggregate calculation over the results or intermediate results of atomic correlation operators (Si). The first layer of complex correlation operators (adjacent to the leaf nodes) can therefore be replaced by features (Fi), and the higher layers of complex relations can be replaced by calculations between features. Hence the application proposes a calculation method that computes the coarse-ranking score using the shared features (Fi) as input.
Now the coarse-ranking score calculation is taken as an example to show that the embodiments of the application can implement all functions of the scoring-tree scheme using the feature-based calculation. As shown in FIG. 5C, the coarse-ranking score formula is score = max(max(a1, a2), (a2 + a3) × a3). On the left is the existing scheme, in which the scoring tree is computed bottom-up and the root node score is the document's relevance score; there, the atomic correlation operators a2 and a3 must be computed repeatedly, introducing extra computation and I/O. In the scheme on the right, the features (F1 to F3) are used directly as input to calculate the coarse-ranking score. The conversion is as follows (note: this merely shows that the feature-based coarse-ranking score calculation of the embodiments can realize scoring-tree-based calculation; it is not limited to being converted from a scoring tree):
The complex correlation operators adjacent to the atomic correlation operators are defined as the following three shared features; after replacing these complex operators with features, it can be observed that the atomic correlation operators a1, a2, and a3 each need to be computed only once in the calculation of F1, F2, and F3:
F1=max(a1,a2);
F2=sum(a2,a3);
F3=a3。
The coarse-ranking score is then defined as a calculation between features: defining score = max(F1, F2 × F3), it can be seen that this scoring method can replace the scoring-tree-based calculation of the coarse-ranking score. Moreover, when features are used as input to calculate the coarse-ranking relevance score, the score calculation can not only reproduce the functions of the scoring tree but also express programming-language-level constructs such as conditional judgment and complex mathematical logic, making it more flexible than the scoring-tree scheme.
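The equivalence claimed here can be checked directly. The following sketch computes the same score both ways for the formula score = max(max(a1, a2), (a2 + a3) × a3), with the feature form evaluating each atomic operator only once:

```python
# Scoring-tree form vs. shared-feature form of the same coarse-ranking score.
def score_tree(a1, a2, a3):
    # scoring-tree form: a2 and a3 each appear twice in the tree
    return max(max(a1, a2), (a2 + a3) * a3)

def score_features(a1, a2, a3):
    # shared-feature form: each ai feeds the features exactly once
    F1 = max(a1, a2)      # F1 = max(a1, a2)
    F2 = a2 + a3          # F2 = sum(a2, a3)
    F3 = a3               # F3 = a3
    return max(F1, F2 * F3)

example = (score_tree(1, 2, 3), score_features(1, 2, 3))
```

Both forms perform the same arithmetic, so the results agree exactly; the feature form simply reuses each operator's value instead of recomputing it.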
4. In the embodiments of the application, the fine-ranking features are divided into L1 features and L2 features, so that some fine-ranking features (the L1 features, F3 and F5 in FIG. 7B) can be calculated using data of the coarse-ranking stage, and the fine-ranking feature vector is then formed by combining the L1 features with the L2 features (F6 in FIG. 7B) calculated in the fine-ranking stage. This realizes resource sharing between L1 feature extraction and the coarse-ranking relevance score calculation.
[ computing device embodiments ]
Fig. 9 is a schematic structural diagram of a computing device 900 according to an embodiment of the present disclosure. The computing device 900 includes: a processor 910, a memory 920, and a communication interface 930.
It is to be appreciated that the communication interface 930 in the computing device 900 shown in FIG. 9 may be used to communicate with other devices.
The processor 910 may be connected to the memory 920. The memory 920 may be used to store the program codes and data. Therefore, the memory 920 may be a memory module inside the processor 910, an external memory module independent of the processor 910, or a component including a memory module inside the processor 910 and an external memory module independent of the processor 910.
It should be understood that, in the embodiments of the present application, the processor 910 may employ a Central Processing Unit (CPU). The processor may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. Alternatively, the processor 910 may employ one or more integrated circuits for executing related programs to implement the technical solutions provided in the embodiments of the present application.
The memory 920 may include a read-only memory and a random access memory, and provides instructions and data to the processor 910. A portion of the memory 920 may also include non-volatile random access memory. For example, the memory 920 may also store device type information.
When the computing device 900 is running, the processor 910 executes the computer-executable instructions in the memory 920 to perform the operational steps of the above-described method.
It should be understood that the computing device 900 according to the embodiment of the present application may correspond to a corresponding main body for executing the method according to the embodiments of the present application, and the above and other operations and/or functions of the respective modules in the computing device 900 are respectively for implementing the corresponding flows of the methods according to the embodiments of the present application, and are not described herein again for brevity.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
[ storage medium embodiments ]
Embodiments of the present application also provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, is configured to perform at least one of the aspects described in the various embodiments of the present application.
The computer storage media of embodiments of the present application may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that the foregoing is merely illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the application. Therefore, although the present application has been described in detail with reference to the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit.

Claims (23)

1. A document retrieval method, comprising a coarse ranking stage, the coarse ranking stage comprising:
obtaining a retrieval statement;
generating a document filter and a scorer based on the retrieval statement;
accessing an index resource library and obtaining the index resources required by the document filter and the scorer, wherein the index resources form a coarse-ranking index resource pool and comprise index information of each document;
filtering out, based on the index resources in the coarse-ranking index resource pool, documents that meet the requirements of the document filter;
calculating a relevance score for each filtered document, wherein the relevance score is calculated based on a coarse-ranking feature vector of the document, and the coarse-ranking feature vector is calculated based on the scorer;
and selecting documents according to the relevance scores, the selected documents serving as the retrieval result of the coarse ranking stage.
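As an illustrative aside (not part of the claims), the coarse ranking stage of claim 1 can be sketched in a few lines. All identifiers here (`coarse_rank`, the inverted index, the term-frequency scorer) are hypothetical names assumed for illustration; the shared inverted index stands in for the coarse-ranking index resource pool used by both the filter and the scorer:

```python
from collections import Counter

def coarse_rank(query, documents, top_k=3):
    """Hypothetical sketch: filter documents with a query-derived filter,
    then score survivors with a coarse feature (term-frequency sum)."""
    terms = query.lower().split()

    # "Index resource pool": one shared inverted index, built once and
    # reused by both the document filter and the scorer.
    index = {}
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)

    # Document filter: keep documents matching every query term.
    candidates = set(documents)
    for term in terms:
        candidates &= index.get(term, set())

    # Scorer: coarse relevance score per surviving candidate.
    def score(doc_id):
        tf = Counter(documents[doc_id].lower().split())
        return sum(tf[t] for t in terms)

    return sorted(candidates, key=score, reverse=True)[:top_k]
```

The filter runs before any scoring, so the (comparatively expensive) relevance computation touches only documents that survived filtering.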
2. The method of claim 1, wherein the scorer comprises atomic relevance operators and the document filter comprises atomic filters,
and wherein obtaining the index resources required by the document filter and the scorer comprises: obtaining the index resources required by the atomic relevance operators and the index resources required by the atomic filters.
3. The method of claim 2, wherein filtering out documents that meet the requirements of the document filter comprises:
first filtering out, based on the index resources in the coarse-ranking index resource pool, the documents that meet the requirements of each atomic filter;
and then filtering out, from the documents obtained in the first filtering, the documents that meet the requirements of the document filter, wherein the document filter describes the combinational logic of a plurality of the atomic filters.
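To illustrate the two-pass filtering of claim 3 (again, not part of the claims): each atomic filter first selects documents independently over the index resources, and the document filter then combines those per-filter results with boolean logic. The field names and the particular combination below are invented for the example:

```python
# Hypothetical per-document index information.
docs = {
    1: {"lang": "en", "year": 2021, "source": "web"},
    2: {"lang": "de", "year": 2022, "source": "arxiv"},
    3: {"lang": "en", "year": 2018, "source": "web"},
}

# First pass: each atomic filter runs on its own and yields a document set.
atomic = {
    "en":    {d for d, m in docs.items() if m["lang"] == "en"},
    "2020+": {d for d, m in docs.items() if m["year"] >= 2020},
    "arxiv": {d for d, m in docs.items() if m["source"] == "arxiv"},
}

# Second pass: the document filter applies combinational logic to the
# atomic-filter results, e.g. (en AND 2020+) OR arxiv.
selected = (atomic["en"] & atomic["2020+"]) | atomic["arxiv"]
```

Because the atomic results are materialized first, the same atomic filter can appear in several document filters without being re-evaluated.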
4. The method of claim 2, wherein calculating the coarse-ranking feature vector based on the scorer comprises:
calculating the coarse-ranking feature vector based on the calculation result of each atomic relevance operator, wherein the calculation results of the atomic relevance operators are shared during the calculation of the coarse-ranking feature vectors.
5. The method of any of claims 2-4, wherein the atomic relevance operators comprise the atomic relevance operators used in the coarse ranking stage to calculate the coarse-ranking feature vector.
6. The method of claim 4 or 5, further comprising a fine ranking stage, the fine ranking stage comprising:
obtaining, for each document in the retrieval result of the coarse ranking stage, a fine-ranking feature vector of the document, wherein the fine-ranking feature vector is calculated based on the calculation results and/or intermediate results of the atomic relevance operators;
and selecting and/or sorting the retrieval result of the coarse ranking stage with a ranking model based on the fine-ranking feature vectors, the selected and/or sorted result serving as the retrieval result of the fine ranking stage.
7. The method of claim 6, wherein calculating the fine-ranking feature vector based on the calculation results of the atomic relevance operators comprises:
the fine-ranking feature vector comprises a first feature vector, and the atomic relevance operators on which the calculation of the first feature vector is based belong to the atomic relevance operators used in the coarse-ranking feature vector calculation;
and during the calculation of the first feature vector, the calculation results and/or intermediate results of the atomic relevance operators used in the coarse-ranking feature vector calculation are shared.
8. The method of claim 7, wherein the calculation of the first feature vector is performed during the coarse ranking stage.
9. The method of any of claims 6-8, wherein calculating the fine-ranking feature vector based on the calculation results and/or intermediate results of the atomic relevance operators comprises:
the fine-ranking feature vector comprises a second feature vector, and the atomic relevance operators on which the calculation of the second feature vector is based comprise atomic relevance operators not used in the coarse-ranking feature vector calculation;
and during the fine-ranking feature vector calculation, the calculation results and/or intermediate results of the unused atomic relevance operators are shared.
10. The method of claim 9, wherein calculating the calculation results and/or intermediate results of the unused atomic relevance operators comprises:
accessing the index resource library and obtaining the index resources of the unused atomic relevance operators, wherein the index resources form a fine-ranking index resource pool;
and calculating the calculation results and/or intermediate results of the unused atomic relevance operators based on the index resources of the fine-ranking index resource pool.
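The sharing of atomic-operator results between the coarse and fine ranking stages (claims 6-10) amounts to memoizing each operator's output per document. The cache class and operator names below are assumptions made for this sketch, not structures from the patent:

```python
class OperatorCache:
    """Memoizes atomic-operator results so the coarse and fine ranking
    stages share them instead of recomputing."""
    def __init__(self):
        self._cache = {}
        self.misses = 0  # counts how many operator evaluations actually ran

    def compute(self, op_name, doc_id, fn):
        key = (op_name, doc_id)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = fn()  # run the operator only on a cache miss
        return self._cache[key]

cache = OperatorCache()

def coarse_features(doc_id):
    # Coarse stage uses a hypothetical operator "tf".
    return [cache.compute("tf", doc_id, lambda: 1.0)]

def fine_features(doc_id):
    # Fine stage reuses "tf" (a cache hit) and adds a new operator "bm25";
    # only the operator unused in the coarse stage is computed here.
    return [cache.compute("tf", doc_id, lambda: 1.0),
            cache.compute("bm25", doc_id, lambda: 2.0)]

coarse_features("d1")
fine_features("d1")
```

After both calls, the "tf" operator has run exactly once for document "d1"; the fine stage paid only for the operator that the coarse stage had not evaluated.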
11. A document retrieval apparatus, applied to a coarse ranking stage, comprising: an acquisition module, a shared query construction module, a filtering module, a scoring module, and a coarse ranking module; wherein
the acquisition module is configured to obtain a retrieval statement;
the shared query construction module is configured to generate a document filter and a scorer based on the retrieval statement;
the acquisition module is further configured to access an index resource library and obtain the index resources required by the document filter and the scorer, wherein the index resources form a coarse-ranking index resource pool and comprise index information of each document;
the filtering module is configured to filter out, based on the index resources in the coarse-ranking index resource pool, documents that meet the requirements of the document filter;
the scoring module is configured to calculate a relevance score for each filtered document, wherein the relevance score is calculated based on a coarse-ranking feature vector of the document, and the coarse-ranking feature vector is calculated based on the scorer;
and the coarse ranking module is configured to select documents according to the relevance scores, the selected documents serving as the retrieval result of the coarse ranking stage.
12. The apparatus of claim 11, wherein the scorer comprises atomic relevance operators and the document filter comprises atomic filters,
and wherein obtaining the index resources required by the document filter and the scorer comprises: obtaining the index resources required by the atomic relevance operators and the index resources required by the atomic filters.
13. The apparatus of claim 12, wherein the filtering module is specifically configured to first filter out, based on the index resources in the coarse-ranking index resource pool, the documents that meet the requirements of each atomic filter;
and the filtering module is further configured to filter out, from the documents obtained in the first filtering, the documents that meet the requirements of the document filter, wherein the document filter describes the combinational logic of a plurality of the atomic filters.
14. The apparatus of claim 12, wherein the scoring module is specifically configured to calculate the coarse-ranking feature vector based on the calculation result of each atomic relevance operator, wherein the calculation results of the atomic relevance operators are shared during the calculation of the coarse-ranking feature vectors.
15. The apparatus of any of claims 12-14, wherein the atomic relevance operators comprise the atomic relevance operators used in the coarse ranking stage to calculate the coarse-ranking feature vector.
16. The apparatus of claim 14 or 15, further applied to a fine ranking stage and further comprising a feature extraction module and a fine ranking module; wherein
the feature extraction module is configured to obtain, for each document in the retrieval result of the coarse ranking stage, a fine-ranking feature vector of the document, wherein the fine-ranking feature vector is calculated based on the calculation results and/or intermediate results of the atomic relevance operators;
and the fine ranking module is configured to select and/or sort the retrieval result of the coarse ranking stage with a ranking model based on the fine-ranking feature vectors, the selected and/or sorted result serving as the retrieval result of the fine ranking stage.
17. The apparatus of claim 16, wherein the feature extraction module comprises a first feature extraction module configured to extract a first feature vector of the fine-ranking feature vector, wherein the atomic relevance operators on which the calculation of the first feature vector is based belong to the atomic relevance operators used in the coarse-ranking feature vector calculation;
and during the calculation of the first feature vector by the first feature extraction module, the calculation results and/or intermediate results of the atomic relevance operators used in the coarse-ranking feature vector calculation are shared.
18. The apparatus of claim 17, wherein the first feature extraction module performs its calculation during the coarse ranking stage.
19. The apparatus of any of claims 16-18, wherein the feature extraction module comprises a second feature extraction module configured to extract a second feature vector of the fine-ranking feature vector, wherein the atomic relevance operators on which the calculation of the second feature vector is based comprise atomic relevance operators not used in the coarse-ranking feature vector calculation;
and during the calculation of the second feature vector by the second feature extraction module, the calculation results and/or intermediate results of the unused atomic relevance operators are shared.
20. The apparatus of claim 19, wherein
the acquisition module is further configured to access the index resource library and obtain the index resources of the unused atomic relevance operators, wherein the index resources form a fine-ranking index resource pool;
and the second feature extraction module, when calculating the calculation results and/or intermediate results of the unused atomic relevance operators, performs the calculation based on the index resources of the fine-ranking index resource pool.
21. A computing device comprising at least one processor and at least one memory, the memory storing program instructions that, when executed by the at least one processor, cause the at least one processor to implement the method of any of claims 1-10.
22. A computer-readable storage medium having stored thereon program instructions, which, when executed by a computer, cause the computer to carry out the method of any one of claims 1-10.
23. A computer program product, characterized in that it comprises program instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 10.
CN202111154038.0A 2021-09-29 2021-09-29 Document retrieval method and device Pending CN115878564A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111154038.0A CN115878564A (en) 2021-09-29 2021-09-29 Document retrieval method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111154038.0A CN115878564A (en) 2021-09-29 2021-09-29 Document retrieval method and device

Publications (1)

Publication Number Publication Date
CN115878564A true CN115878564A (en) 2023-03-31

Family

ID=85756324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111154038.0A Pending CN115878564A (en) 2021-09-29 2021-09-29 Document retrieval method and device

Country Status (1)

Country Link
CN (1) CN115878564A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116610795A (en) * 2023-07-14 2023-08-18 深圳须弥云图空间科技有限公司 Text retrieval method and device
CN116610795B (en) * 2023-07-14 2024-03-15 深圳须弥云图空间科技有限公司 Text retrieval method and device

Similar Documents

Publication Publication Date Title
CN102521233B (en) Adapting to image searching database
US20100328312A1 (en) Personal music recommendation mapping
US20130311487A1 (en) Semantic search using a single-source semantic model
KR100903961B1 (en) Indexing And Searching Method For High-Demensional Data Using Signature File And The System Thereof
US11106719B2 (en) Heuristic dimension reduction in metadata modeling
CN107590128B (en) Paper homonymy author disambiguation method based on high-confidence characteristic attribute hierarchical clustering method
Gao et al. SeCo-LDA: Mining service co-occurrence topics for recommendation
WO2015051481A1 (en) Determining collection membership in a data graph
Kim et al. A web service for author name disambiguation in scholarly databases
US20110179013A1 (en) Search Log Online Analytic Processing
Yang et al. A new approach to journal co-citation matrix construction based on the number of co-cited articles in journals
CN115878564A (en) Document retrieval method and device
Danesh et al. Ensemble-based clustering of large probabilistic graphs using neighborhood and distance metric learning
CN114207598A (en) Electronic form conversion
KR101823463B1 (en) Apparatus for providing researcher searching service and method thereof
Yang et al. LAZY R-tree: The R-tree with lazy splitting algorithm
CN115617978A (en) Index name retrieval method and device, electronic equipment and storage medium
Jain et al. Organizing query completions for web search
JP7292235B2 (en) Analysis support device and analysis support method
JP6065001B2 (en) Data search device, data search method, and data search program
CN113157915A (en) Naive Bayes text classification method based on cluster environment
Tejasree et al. An improved differential bond energy algorithm with fuzzy merging method to improve the document clustering for information mining
Dai et al. Osprey: a heterogeneous search framework for spatial-temporal similarity
Cai et al. Non-structured data integration access policy using hadoop
Toke et al. Enhancing text mining using side information

Legal Events

Date Code Title Description
PB01 Publication