CN112163065A - Information retrieval method, system and medium - Google Patents
Information retrieval method, system and medium
- Publication number
- CN112163065A (application CN202010927319.4A)
- Authority
- CN
- China
- Prior art keywords
- document set
- pseudo
- document
- semantic
- query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
Abstract
The invention discloses an information retrieval method, system, and medium. A pseudo-relevant document set is obtained from a plurality of query keywords; the pseudo-relevant document set is processed according to a weight retrieval model and the semantic query of a concept network to obtain a plurality of target expansion candidate words; and the target expansion candidate words are searched to obtain the final retrieval result. The disclosed method, system, and medium greatly improve the efficiency and effectiveness of user queries.
Description
Technical Field
The present invention relates to the field of information retrieval technologies, and in particular, to an information retrieval method, system, and medium.
Background
With the rapid development of Internet technology, network information has grown explosively, and browsing and acquiring information through search engines has become an important part of people's daily lives. However, network resources are vast and varied; while this brings convenience, it also makes it difficult for users to acquire the information they actually need efficiently and accurately. To process the ever-growing mass of data more effectively, information retrieval, as a classical text-processing technology, has become a key research focus of information processing.
Pseudo-Relevance Feedback (PRF) provides an automatic local analysis method that automates the manual part of relevance feedback, so that users can obtain better retrieval performance without additional interaction. The method first performs an ordinary retrieval pass and returns the documents most relevant to the user's initial query as an initial result set; it then assumes that the top-ranked documents are relevant, and finally performs relevance feedback on that assumption as before. The BERT model is a method of pre-trained language representation that creates contextual semantic representations from the words surrounding each term, pre-trained on large context-dependent corpora, with source code and models released for multiple languages.
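As a rough sketch of the feedback loop just described — with `search` and `extract_terms` as hypothetical placeholder callables, not components named in this disclosure — one round of pseudo-relevance feedback might look like:

```python
def pseudo_relevance_feedback(search, extract_terms, query, k=10, n_terms=5):
    """One round of pseudo-relevance feedback.

    search(query) -> ranked list of documents for a list of query terms
    extract_terms(docs, n) -> up to n expansion terms drawn from docs
    (Both callables are placeholders for illustration.)
    """
    initial = search(query)                 # ordinary retrieval pass
    pseudo_relevant = initial[:k]           # assume the top-k documents are relevant
    expansion = extract_terms(pseudo_relevant, n_terms)
    return search(query + expansion)        # re-run the expanded query
```

No user interaction is needed: the top-ranked documents stand in for explicit relevance judgments.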
However, the amount of information obtained through a single model is too large, and the accuracy of the obtained expansion words is insufficient; if BERT is used directly to score all documents, problems such as excessive computation and insufficient accuracy arise.
Disclosure of Invention
In view of the above disadvantages of the prior art, the present invention provides an information retrieval method, system, and medium that solve the prior-art technical problems of excessive information volume and insufficient accuracy when BERT is used directly to score all documents.
In order to achieve the purpose, the invention adopts the following technical scheme:
an information retrieval method, comprising the steps of: s1, providing a plurality of query keywords to obtain a pseudo-relevant document set; s2, processing the pseudo-relevant document set according to the weight retrieval model and semantic query of the concept network to obtain a plurality of target expansion candidate words; and S3, searching the target expansion candidate words to obtain a final search result.
Preferably, the S1 specifically includes the following steps: S11, providing a plurality of query keywords to obtain a target document set; S12, inputting the target document set into a BM25 model to obtain the BM25 score of each document in the target document set, and selecting the top-ranked documents by score, recorded as a first document set; S13, inputting the first document set into a BERT model to obtain the BERT score of each document in the first document set; S14, linearly fusing the BM25 score of each document in the target document set with the BERT score of each document in the first document set to obtain a second document set and the score of each of its documents; and S15, selecting the top-ranked documents from the second document set by score as the pseudo-relevant document set.
Preferably, the S2 specifically includes the following steps: S21, taking all the words in the pseudo-relevant document set as expansion candidate words, calculating the importance score of each expansion candidate word in the pseudo-relevant document set, and selecting the top-ranked expansion candidate words by score as a first expansion candidate word set; S22, applying the expansion candidate words in the first document selected from the pseudo-relevant document set to a concept network to obtain their semantic vectors in the concept network, applying the query keywords to the concept network to obtain their semantic vectors in the concept network, and calculating the semantic distance between the two semantic vectors; S23, calculating the semantic distance between each expansion candidate word and all the query keywords, and selecting the top-ranked expansion candidate words in ascending order of semantic distance as a second expansion candidate word set; and S24, fusing the first expansion candidate word set and the second expansion candidate word set to obtain a plurality of target expansion candidate words.
Preferably, in S21, the calculation formula of the importance score of the candidate expansion word in the pseudo-related document set is as follows:
where Score(w) represents the importance score of an expansion candidate word w, V(D1) represents the vector of the pseudo-relevant document set, w represents a word in the i-th document di of the pseudo-relevant document set D1, and N represents the number of documents in D1.
Preferably, in S22, the formula for calculating the semantic distance is:
where dist represents the semantic distance, V(w) represents the concept-net semantic vector of an expansion candidate word w selected from the i-th document of the pseudo-relevant document set, V(Qs) represents the concept-net semantic vector of a query keyword, N represents the number of documents in the pseudo-relevant document set D1, Qs represents the s-th word in the query keyword Q, and the similarity between V(w) and V(Qs) is computed via cosine similarity.
Preferably, in S23, the formula for calculating the semantic distance between the expanded candidate word and all the query keywords is:
where |Q| represents the total number of query keywords, w represents an expansion candidate word, Qs represents the s-th word in the query keyword Q, and dist(w, Q) represents the semantic distance between the candidate word w and all query keywords Q.
In order to solve the above technical problems, the present invention provides another technical solution as follows: an information retrieval system comprising a processor and a memory; the memory stores a computer readable program executable by the processor; when executing the computer readable program, the processor implements the steps of the information retrieval method described in any one of the above.
In order to solve the above technical problems, the present invention provides another technical solution as follows: a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement steps in an information retrieval method as described above.
Compared with the prior art, the information retrieval method, system, and medium provided by the invention obtain a pseudo-relevant document set from a plurality of query keywords; process the pseudo-relevant document set according to a weight retrieval model and the semantic query of a concept network to obtain a plurality of target expansion candidate words; and search those candidate words to obtain the final retrieval result. By integrating the weight retrieval model and the concept network's semantic query into query expansion, both document scores and expansion words carry semantic features and have higher semantic relevance than the initial query; this overcomes semantic confusion in polysemous environments, extracts more information actually relevant to the query, improves retrieval precision, and saves retrieval time.
Drawings
FIG. 1 is a flowchart illustrating an information retrieval method according to a preferred embodiment of the present invention;
FIG. 2 is a flowchart illustrating the step S1 shown in FIG. 1;
fig. 3 is a flowchart illustrating the step of S2 shown in fig. 1.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
In the description of the present application, the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating relative importance or as implicitly indicating the number of technical features indicated. Thus, unless otherwise specified, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature; "plurality" means two or more. The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that one or more other features, integers, steps, operations, elements, components, and/or combinations thereof may be present or added.
Further, terms of orientation or positional relationship indicated by "center", "lateral", "upper", "lower", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, are described based on the orientation or relative positional relationship shown in the drawings, are simply for convenience of description of the present application, and do not indicate that the referred device or element must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present application.
Furthermore, unless expressly stated or limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly and may include, for example, fixed connections, removable connections, and integral connections; can be mechanically or electrically connected; either directly or indirectly through intervening media, or through both elements. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
Example 1
Referring to fig. 1, fig. 1 is a flowchart illustrating an information retrieval method according to a preferred embodiment of the present invention. The information retrieval method S10 provided by the invention comprises the following steps:
s1, providing a plurality of query keywords to obtain a pseudo-relevant document set;
s2, processing the pseudo-relevant document set according to the weight retrieval model and semantic query of the concept network to obtain a plurality of target expansion candidate words;
and S3, searching the target expansion candidate words to obtain a final search result.
The information retrieval method provided by the invention obtains a pseudo-relevant document set from a plurality of query keywords; processes the pseudo-relevant document set according to a weight retrieval model and the semantic query of a concept network to obtain a plurality of target expansion candidate words; and searches those candidate words to obtain the final retrieval result. By integrating the weight retrieval model and the concept network's semantic query into query expansion, both document scores and expansion words carry semantic features and have higher semantic relevance than the initial query; this overcomes semantic confusion in polysemous environments, extracts more information actually relevant to the query, improves retrieval precision, and saves retrieval time.
Specifically, as shown in fig. 2, the S1 specifically includes the following steps:
s11, providing a plurality of query keywords to obtain a target document set;
s12, inputting the target document set into a BM25 model to obtain the BM25 score of each document in the target document set, and selecting the top-ranked documents by score, recorded as a first document set;
s13, inputting the first document set into a BERT model to obtain the BERT score of each document in the first document set;
s14, linearly fusing the BM25 score of each document in the target document set with the BERT score of each document in the first document set to obtain a second document set and the score of each document;
and S15, selecting the top N' documents of the second document set by score, from high to low, as the pseudo-relevant document set.
When a user searches on a query topic, the information retrieval system builds a query index over the target document set; when the user submits the query topic, the system preprocesses it into query keywords, and the target document set is retrieved through those keywords. The retrieval system then screens the target document set with the classical BM25 retrieval model, computing the BM25 score of each document, ranks the results from high to low to obtain a first query result, and extracts the top-scoring documents, recorded as a first document set. Each document in the first document set is then re-evaluated with a BERT model: each sentence of each document is scored against the original query using BERT semantic similarity, yielding a second document set and a score for each of its documents, from which the top-scoring documents are selected as the pseudo-relevant document set. The cutoff values for these selections can be preset by those skilled in the art as appropriate.
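A minimal sketch of the S12–S15 score fusion follows, assuming plain dictionaries of scores and a tunable interpolation weight `alpha` (the fusion weight is not specified in the disclosure):

```python
def fuse_scores(bm25_scores, bert_scores, alpha=0.5):
    """Linearly fuse BM25 and BERT scores for the BERT-rescored subset.

    bm25_scores: doc id -> BM25 score over the target document set
    bert_scores: doc id -> BERT score over the top BM25 documents only
    alpha: interpolation weight (an assumption; not given in the disclosure)
    """
    return {doc: alpha * bm25_scores[doc] + (1 - alpha) * bert
            for doc, bert in bert_scores.items()}

def top_k(scores, k):
    """Return the k highest-scoring document ids, best first."""
    return [doc for doc, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]
```

The pseudo-relevant document set would then be `top_k(fused, n)` for a preset cutoff n.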
As shown in fig. 3, the S2 specifically includes the following steps:
s21, taking all the words in the pseudo-relevant document set as expansion candidate words, calculating the importance score of each expansion candidate word in the pseudo-relevant document set, and selecting the top m1 expansion candidate words as a first expansion candidate word set according to the scores from high to low;
s22, applying the expansion candidate words in the first document selected from the pseudo-relevant document set to a concept network to obtain semantic vectors of the expansion candidate words in the concept network, applying the query keywords to the concept network to obtain semantic vectors of the query keywords in the concept network, and calculating the semantic distance between the two semantic vectors; and
s23, calculating the semantic distance between each expansion candidate word and all the query keywords, and selecting the top-ranked expansion candidate words in ascending order of semantic distance as a second expansion candidate word set;
and S24, fusing the first expansion candidate word set and the second expansion candidate word set to obtain a plurality of target expansion candidate words.
The target expansion candidate words are selected by applying the weight retrieval model and the semantic query of a concept net to the pseudo-relevant document set; compared with the traditional BM25 model, the obtained expansion words are more precise and yield a better retrieval effect.
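The fusion of the two candidate word sets in S21–S24 can be illustrated with the sketch below; the union-style fusion and the cutoffs `m1`/`m2` are assumptions, since the disclosure leaves the exact fusion rule unspecified:

```python
def select_expansion_terms(importance, distance, m1, m2):
    """Fuse the two expansion candidate word sets.

    importance: word -> importance score in the pseudo-relevant set (higher is better)
    distance:   word -> semantic distance to the query (lower is better)
    """
    by_importance = sorted(importance, key=lambda w: -importance[w])[:m1]
    by_distance = sorted(distance, key=lambda w: distance[w])[:m2]
    # Union of the two lists, preserving order and dropping duplicates.
    return list(dict.fromkeys(by_importance + by_distance))
```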
Specifically, in S21, the calculation formula of the importance score of the candidate expansion word in the pseudo-relevant document set is as follows:
where Score(w) represents the importance score of an expansion candidate word w, V(D1) represents the vector of the pseudo-relevant document set, w represents a word in the i-th document di of the pseudo-relevant document set D1, and N represents the number of documents in D1.
In S22, the formula for calculating the semantic distance is:
where dist represents the semantic distance, V(w) represents the concept-net semantic vector of an expansion candidate word w selected from the i-th document of the pseudo-relevant document set, V(Qs) represents the concept-net semantic vector of a query keyword, N represents the number of documents in the pseudo-relevant document set D1, Qs represents the s-th word in the query keyword Q, and the similarity between V(w) and V(Qs) is computed via cosine similarity.
In S23, the calculation formula of the semantic distance between the expanded candidate word and all the query keywords is:
where |Q| represents the total number of query keywords, w represents an expansion candidate word, Qs represents the s-th word in the query keyword Q, and dist(w, Q) represents the semantic distance between the candidate word w and all query keywords Q.
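Since the formula images are not reproduced in the text, the semantic-distance computation can only be sketched from the symbol definitions; a plausible reading — cosine similarity turned into a distance and averaged over all query keywords — is:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_distance(word_vec, query_vecs):
    """Average distance (1 - cosine similarity) between a candidate word's
    concept-net vector and the vectors of all query keywords.
    The 1 - cosine form is an assumption, not taken from the disclosure."""
    return sum(1.0 - cosine(word_vec, q) for q in query_vecs) / len(query_vecs)
```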
The final retrieval result is obtained by searching the plurality of target expansion candidate words. Because this result draws on the BM25 model, the BERT model, and concept-network semantics, retrieval precision and feedback efficiency are higher than with the traditional BM25 model alone.
Example 2
The invention also provides an information retrieval system, which comprises a processor and a memory, wherein the memory is stored with a computer program, and the computer program is executed by the processor to realize the information retrieval method provided by the embodiment 1.
The information retrieval system provided in this embodiment is used to implement the information retrieval method, and therefore, the information retrieval system also has the technical effects of the information retrieval method, and details are not repeated here.
Example 3
The present invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the information retrieval method of embodiment 1.
The computer storage medium provided in this embodiment is used to implement the information retrieval method, and therefore, the technical effects of the information retrieval method are also achieved by the computer storage medium, which is not described herein again.
In summary, the information retrieval method, system, and medium provided by the invention obtain a pseudo-relevant document set from a plurality of query keywords; process the pseudo-relevant document set according to a weight retrieval model and the semantic query of a concept network to obtain a plurality of target expansion candidate words; and search those candidate words to obtain the final retrieval result. By integrating the weight retrieval model and the concept network's semantic query into query expansion, both document scores and expansion words carry semantic features and have higher semantic relevance than the initial query; this overcomes semantic confusion in polysemous environments, extracts more information actually relevant to the query, improves retrieval precision, and saves retrieval time.
The preferred embodiments of the present invention have been described above in detail, but the present invention is not limited thereto. Within the scope of the technical idea of the invention, many simple modifications can be made to the technical solution of the invention, including combinations of various technical features in any other suitable way, and these simple modifications and combinations should also be regarded as the disclosure of the invention, and all fall within the scope of the invention.
Claims (8)
1. An information retrieval method, comprising the steps of:
s1, providing a plurality of query keywords to obtain a pseudo-relevant document set;
s2, processing the pseudo-relevant document set according to the weight retrieval model and semantic query of the concept network to obtain a plurality of target expansion candidate words;
and S3, searching the target expansion candidate words to obtain a final search result.
2. The information retrieval method according to claim 1, wherein the S1 specifically includes the steps of:
s11, providing a plurality of query keywords to obtain a target document set;
s12, inputting the target document set into a BM25 model to obtain the BM25 score of each document in the target document set, and selecting the top-ranked documents by score, recorded as a first document set;
s13, inputting the first document set into a BERT model to obtain the BERT score of each document in the first document set;
s14, linearly fusing the BM25 score of each document in the target document set with the BERT score of each document in the first document set to obtain a second document set and the score of each document;
and S15, selecting the top-ranked documents from the second document set by score as the pseudo-relevant document set.
3. The information retrieval method according to claim 1, wherein the S2 specifically includes the steps of:
s21, taking all the words in the pseudo-relevant document set as expansion candidate words, calculating the importance score of each expansion candidate word in the pseudo-relevant document set, and selecting the top m1 expansion candidate words as a first expansion candidate word set according to the scores from high to low;
s22, applying the expansion candidate words in the first document selected from the pseudo-relevant document set to a concept network to obtain semantic vectors of the expansion candidate words in the concept network, applying the query keywords to the concept network to obtain semantic vectors of the query keywords in the concept network, and calculating the semantic distance between the two semantic vectors; and
s23, calculating the semantic distance between each expansion candidate word and all the query keywords, and selecting the top-ranked expansion candidate words in ascending order of semantic distance as a second expansion candidate word set;
and S24, fusing the first expansion candidate word set and the second expansion candidate word set to obtain a plurality of target expansion candidate words.
4. The information retrieval method according to claim 3, wherein in S21, the calculation formula of the importance score of the candidate expansion word in the pseudo-relevant document set is:
where Score(w) represents the importance score of an expansion candidate word w, V(D1) represents the vector of the pseudo-relevant document set, w represents a word in the i-th document di of the pseudo-relevant document set D1, and N represents the number of documents in D1.
5. The information retrieval method according to claim 3, wherein in the step S22, the formula for calculating the semantic distance is:
where dist represents the semantic distance, V(w) represents the concept-net semantic vector of an expansion candidate word w selected from the i-th document of the pseudo-relevant document set, V(Qs) represents the concept-net semantic vector of a query keyword, N represents the number of documents in the pseudo-relevant document set D1, Qs represents the s-th word in the query keyword Q, and the similarity between V(w) and V(Qs) is computed via cosine similarity.
6. The information retrieval method as claimed in claim 3, wherein in the step S23, the formula for calculating the semantic distance between the expanded candidate word and all the query keywords is:
7. An information retrieval system comprising a processor and a memory;
the memory has stored thereon a computer readable program executable by the processor;
the processor, when executing the computer readable program, implements the steps in the information retrieval method of any one of claims 1-6.
8. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to perform the steps in the information retrieval method as recited in any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010927319.4A CN112163065A (en) | 2020-09-07 | 2020-09-07 | Information retrieval method, system and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010927319.4A CN112163065A (en) | 2020-09-07 | 2020-09-07 | Information retrieval method, system and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112163065A true CN112163065A (en) | 2021-01-01 |
Family
ID=73857343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010927319.4A Pending CN112163065A (en) | 2020-09-07 | 2020-09-07 | Information retrieval method, system and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112163065A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113569566A (en) * | 2021-07-30 | 2021-10-29 | 苏州七星天专利运营管理有限责任公司 | Vocabulary extension method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107993724B (en) | Medical intelligent question and answer data processing method and device | |
CN103136352B (en) | Text retrieval system based on double-deck semantic analysis | |
JP6266080B2 (en) | Method and system for evaluating matching between content item and image based on similarity score | |
US9792304B1 (en) | Query by image | |
KR101721338B1 (en) | Search engine and implementation method thereof | |
US8463593B2 (en) | Natural language hypernym weighting for word sense disambiguation | |
US9183261B2 (en) | Lexicon based systems and methods for intelligent media search | |
US8977624B2 (en) | Enhancing search-result relevance ranking using uniform resource locators for queries containing non-encoding characters | |
US20090076800A1 (en) | Dual Cross-Media Relevance Model for Image Annotation | |
CN105045852A (en) | Full-text search engine system for teaching resources | |
WO2011130008A2 (en) | Automatic query suggestion generation using sub-queries | |
JP2017220205A (en) | Method and system for dynamically rankings images to be matched with content in response to search query | |
US20090094019A1 (en) | Efficiently Representing Word Sense Probabilities | |
KR20180125746A (en) | System and Method for Sentence Embedding and Similar Question Retrieving | |
US10810181B2 (en) | Refining structured data indexes | |
US8768105B2 (en) | Method for searching a database using query images and an image anchor graph-based ranking algorithm | |
CN112612875B (en) | Query term automatic expansion method, device, equipment and storage medium | |
CN115563313A (en) | Knowledge graph-based document book semantic retrieval system | |
US8229970B2 (en) | Efficient storage and retrieval of posting lists | |
CN111737413A (en) | Feedback model information retrieval method, system and medium based on concept net semantics | |
JP5497105B2 (en) | Document retrieval apparatus and method | |
CN112163065A (en) | Information retrieval method, system and medium | |
CN111125299B (en) | Dynamic word stock updating method based on user behavior analysis | |
CN114519132A (en) | Formula retrieval method and device based on formula reference graph | |
CN112199461A (en) | Document retrieval method, device, medium and equipment based on block index structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20210101 |