CN112163065A - Information retrieval method, system and medium - Google Patents
Information retrieval method, system and medium
- Publication number
- CN112163065A (application CN202010927319.4A)
- Authority
- CN
- China
- Prior art keywords
- document set
- pseudo
- document
- semantic
- query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3338—Query expansion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
Abstract
The invention discloses an information retrieval method, system, and medium. A pseudo-relevant document set is obtained from a plurality of query keywords; the pseudo-relevant document set is processed according to a weight retrieval model and the semantic query of a concept network to obtain a plurality of target expansion candidate words; and the target expansion candidate words are searched to obtain the final retrieval result. The disclosed method, system, and medium greatly improve the efficiency and effectiveness of user queries.
Description
Technical Field
The present invention relates to the field of information retrieval technologies, and in particular, to an information retrieval method, system, and medium.
Background
With the rapid development of Internet technology, network information has grown explosively, and browsing and acquiring information through search engines has become an important part of people's daily lives. However, network resources are vast and varied; while this brings convenience, it also makes it difficult for users to acquire the information they actually need efficiently and accurately. To process the ever-growing mass of data more effectively, information retrieval, as a classical text-processing technology, has become a key research focus of information processing.
Pseudo-Relevance Feedback (PRF) provides an automatic local analysis method that automates the manual part of relevance feedback, so that users can obtain better retrieval performance without additional interaction. The method first performs an ordinary retrieval pass and returns the documents most relevant to the user's initial query as an initial result set; it then assumes that the top-ranked documents are relevant, and finally performs relevance feedback on that assumption as before. The BERT model is a method of pre-trained language representation that creates contextual semantic representations from the words surrounding each term, pre-trained on large context-dependent corpora, with source code and models released for multiple languages.
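As a rough sketch of the feedback loop just described — with `search` and `extract_terms` as hypothetical placeholder callables, not components named in this disclosure — one round of pseudo-relevance feedback might look like:

```python
def pseudo_relevance_feedback(search, extract_terms, query, k=10, n_terms=5):
    """One round of pseudo-relevance feedback.

    search(query) -> ranked list of documents for a list of query terms
    extract_terms(docs, n) -> up to n expansion terms drawn from docs
    (Both callables are placeholders for illustration.)
    """
    initial = search(query)                 # ordinary retrieval pass
    pseudo_relevant = initial[:k]           # assume the top-k documents are relevant
    expansion = extract_terms(pseudo_relevant, n_terms)
    return search(query + expansion)        # re-run the expanded query
```

No user interaction is needed: the top-ranked documents stand in for explicit relevance judgments.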
However, the amount of information obtained through a single model is too large, and the accuracy of the obtained expansion words is insufficient; if BERT is used directly to score all documents, problems such as excessive computation and insufficient accuracy arise.
Disclosure of Invention
In view of the above disadvantages of the prior art, the present invention provides an information retrieval method, system, and medium that solve the prior-art technical problems of excessive information volume and insufficient accuracy when BERT is used directly to score all documents.
In order to achieve the purpose, the invention adopts the following technical scheme:
an information retrieval method, comprising the steps of: s1, providing a plurality of query keywords to obtain a pseudo-relevant document set; s2, processing the pseudo-relevant document set according to the weight retrieval model and semantic query of the concept network to obtain a plurality of target expansion candidate words; and S3, searching the target expansion candidate words to obtain a final search result.
Preferably, the S1 specifically includes the following steps: S11, providing a plurality of query keywords to obtain a target document set; S12, inputting the target document set into a BM25 model to obtain the BM25 score of each document in the target document set, and selecting the top-ranked documents by score, recorded as a first document set; S13, inputting the first document set into a BERT model to obtain the BERT score of each document in the first document set; S14, linearly fusing the BM25 score of each document in the target document set with the BERT score of each document in the first document set to obtain a second document set and the score of each of its documents; and S15, selecting the top-ranked documents from the second document set by score as the pseudo-relevant document set.
Preferably, the S2 specifically includes the following steps: S21, taking all the words in the pseudo-relevant document set as expansion candidate words, calculating the importance score of each expansion candidate word in the pseudo-relevant document set, and selecting the top-ranked expansion candidate words by score as a first expansion candidate word set; S22, applying the expansion candidate words in the first document selected from the pseudo-relevant document set to a concept network to obtain their semantic vectors in the concept network, applying the query keywords to the concept network to obtain their semantic vectors in the concept network, and calculating the semantic distance between the two semantic vectors; S23, calculating the semantic distance between each expansion candidate word and all the query keywords, and selecting the top-ranked expansion candidate words in ascending order of semantic distance as a second expansion candidate word set; and S24, fusing the first expansion candidate word set and the second expansion candidate word set to obtain a plurality of target expansion candidate words.
Preferably, in S21, the calculation formula of the importance score of the candidate expansion word in the pseudo-related document set is as follows:
where Score(w) represents the importance score of an expansion candidate word w, V(D1) represents the vector of the pseudo-relevant document set, w represents a word in the i-th document di of the pseudo-relevant document set D1, and N represents the number of documents in D1.
Preferably, in S22, the formula for calculating the semantic distance is:
where dist represents the semantic distance, V(w) represents the concept-net semantic vector of an expansion candidate word w selected from the i-th document of the pseudo-relevant document set, V(Qs) represents the concept-net semantic vector of a query keyword, N represents the number of documents in the pseudo-relevant document set D1, Qs represents the s-th word in the query keyword Q, and the similarity between V(w) and V(Qs) is computed via cosine similarity.
Preferably, in S23, the formula for calculating the semantic distance between the expanded candidate word and all the query keywords is:
where |Q| represents the total number of query keywords, w represents an expansion candidate word, Qs represents the s-th word in the query keyword Q, and dist(w, Q) represents the semantic distance between the candidate word w and all query keywords Q.
In order to solve the above technical problems, the present invention provides another technical solution as follows: an information retrieval system comprising a processor and a memory; the memory stores a computer readable program executable by the processor; when executing the computer readable program, the processor implements the steps of the information retrieval method described in any one of the above.
In order to solve the above technical problems, the present invention provides another technical solution as follows: a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement steps in an information retrieval method as described above.
Compared with the prior art, the information retrieval method, system, and medium provided by the invention obtain a pseudo-relevant document set from a plurality of query keywords; process the pseudo-relevant document set according to a weight retrieval model and the semantic query of a concept network to obtain a plurality of target expansion candidate words; and search those candidate words to obtain the final retrieval result. By integrating the weight retrieval model and the concept network's semantic query into query expansion, both document scores and expansion words carry semantic features and have higher semantic relevance than the initial query; this overcomes semantic confusion in polysemous environments, extracts more information actually relevant to the query, improves retrieval precision, and saves retrieval time.
Drawings
FIG. 1 is a flowchart illustrating an information retrieval method according to a preferred embodiment of the present invention;
FIG. 2 is a flowchart illustrating the step S1 shown in FIG. 1;
fig. 3 is a flowchart illustrating the step of S2 shown in fig. 1.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
In the description of the present application, the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating relative importance or as implicitly indicating the number of technical features indicated. Thus, unless otherwise specified, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature; "plurality" means two or more. The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that one or more other features, integers, steps, operations, elements, components, and/or combinations thereof may be present or added.
Further, terms of orientation or positional relationship indicated by "center", "lateral", "upper", "lower", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, are described based on the orientation or relative positional relationship shown in the drawings, are simply for convenience of description of the present application, and do not indicate that the referred device or element must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present application.
Furthermore, unless expressly stated or limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly and may include, for example, fixed connections, removable connections, and integral connections; can be mechanically or electrically connected; either directly or indirectly through intervening media, or through both elements. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
Example 1
Referring to fig. 1, fig. 1 is a flowchart illustrating an information retrieval method according to a preferred embodiment of the present invention. The information retrieval method S10 provided by the invention comprises the following steps:
s1, providing a plurality of query keywords to obtain a pseudo-relevant document set;
s2, processing the pseudo-relevant document set according to the weight retrieval model and semantic query of the concept network to obtain a plurality of target expansion candidate words;
and S3, searching the target expansion candidate words to obtain a final search result.
The information retrieval method provided by the invention obtains a pseudo-relevant document set from a plurality of query keywords; processes the pseudo-relevant document set according to a weight retrieval model and the semantic query of a concept network to obtain a plurality of target expansion candidate words; and searches those candidate words to obtain the final retrieval result. By integrating the weight retrieval model and the concept network's semantic query into query expansion, both document scores and expansion words carry semantic features and have higher semantic relevance than the initial query; this overcomes semantic confusion in polysemous environments, extracts more information actually relevant to the query, improves retrieval precision, and saves retrieval time.
Specifically, as shown in fig. 2, the S1 specifically includes the following steps:
s11, providing a plurality of query keywords to obtain a target document set;
s12, inputting the target document set into a BM25 model to obtain the BM25 score of each document in the target document set, and selecting the top-ranked documents by score, recorded as a first document set;
s13, inputting the first document set into a BERT model to obtain the BERT score of each document in the first document set;
s14, linearly fusing the BM25 score of each document in the target document set with the BERT score of each document in the first document set to obtain a second document set and the score of each document;
and S15, selecting the top N' documents of the second document set by score, from high to low, as the pseudo-relevant document set.
When a user searches on a query topic, the information retrieval system builds a query index over the target document set; when the user submits the query topic, the system preprocesses it into query keywords, and the target document set is retrieved through those keywords. The retrieval system then screens the target document set with the classical BM25 retrieval model, computing the BM25 score of each document, ranks the results from high to low to obtain a first query result, and extracts the top-scoring documents, recorded as a first document set. Each document in the first document set is then re-evaluated with a BERT model: each sentence of each document is scored against the original query using BERT semantic similarity, yielding a second document set and a score for each of its documents, from which the top-scoring documents are selected as the pseudo-relevant document set. The cutoff values for these selections can be preset by those skilled in the art as appropriate.
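A minimal sketch of the S12–S15 score fusion follows, assuming plain dictionaries of scores and a tunable interpolation weight `alpha` (the fusion weight is not specified in the disclosure):

```python
def fuse_scores(bm25_scores, bert_scores, alpha=0.5):
    """Linearly fuse BM25 and BERT scores for the BERT-rescored subset.

    bm25_scores: doc id -> BM25 score over the target document set
    bert_scores: doc id -> BERT score over the top BM25 documents only
    alpha: interpolation weight (an assumption; not given in the disclosure)
    """
    return {doc: alpha * bm25_scores[doc] + (1 - alpha) * bert
            for doc, bert in bert_scores.items()}

def top_k(scores, k):
    """Return the k highest-scoring document ids, best first."""
    return [doc for doc, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]
```

The pseudo-relevant document set would then be `top_k(fused, n)` for a preset cutoff n.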
As shown in fig. 3, the S2 specifically includes the following steps:
s21, taking all the words in the pseudo-relevant document set as expansion candidate words, calculating the importance score of each expansion candidate word in the pseudo-relevant document set, and selecting the top m1 expansion candidate words as a first expansion candidate word set according to the scores from high to low;
s22, applying the expansion candidate words in the first document selected from the pseudo-relevant document set to a concept network to obtain semantic vectors of the expansion candidate words in the concept network, applying the query keywords to the concept network to obtain semantic vectors of the query keywords in the concept network, and calculating the semantic distance between the two semantic vectors; and
s23, calculating the semantic distance between each expansion candidate word and all the query keywords, and selecting the top-ranked expansion candidate words in ascending order of semantic distance as a second expansion candidate word set;
and S24, fusing the first expansion candidate word set and the second expansion candidate word set to obtain a plurality of target expansion candidate words.
The target expansion candidate words are selected by applying the weight retrieval model and the semantic query of a concept net to the pseudo-relevant document set; compared with the traditional BM25 model, the obtained expansion words are more precise and yield a better retrieval effect.
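The fusion of the two candidate word sets in S21–S24 can be illustrated with the sketch below; the union-style fusion and the cutoffs `m1`/`m2` are assumptions, since the disclosure leaves the exact fusion rule unspecified:

```python
def select_expansion_terms(importance, distance, m1, m2):
    """Fuse the two expansion candidate word sets.

    importance: word -> importance score in the pseudo-relevant set (higher is better)
    distance:   word -> semantic distance to the query (lower is better)
    """
    by_importance = sorted(importance, key=lambda w: -importance[w])[:m1]
    by_distance = sorted(distance, key=lambda w: distance[w])[:m2]
    # Union of the two lists, preserving order and dropping duplicates.
    return list(dict.fromkeys(by_importance + by_distance))
```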
Specifically, in S21, the calculation formula of the importance score of the candidate expansion word in the pseudo-relevant document set is as follows:
where Score(w) represents the importance score of an expansion candidate word w, V(D1) represents the vector of the pseudo-relevant document set, w represents a word in the i-th document di of the pseudo-relevant document set D1, and N represents the number of documents in D1.
In S22, the formula for calculating the semantic distance is:
where dist represents the semantic distance, V(w) represents the concept-net semantic vector of an expansion candidate word w selected from the i-th document of the pseudo-relevant document set, V(Qs) represents the concept-net semantic vector of a query keyword, N represents the number of documents in the pseudo-relevant document set D1, Qs represents the s-th word in the query keyword Q, and the similarity between V(w) and V(Qs) is computed via cosine similarity.
In S23, the calculation formula of the semantic distance between the expanded candidate word and all the query keywords is:
where |Q| represents the total number of query keywords, w represents an expansion candidate word, Qs represents the s-th word in the query keyword Q, and dist(w, Q) represents the semantic distance between the candidate word w and all query keywords Q.
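Since the formula images are not reproduced in the text, the semantic-distance computation can only be sketched from the symbol definitions; a plausible reading — cosine similarity turned into a distance and averaged over all query keywords — is:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_distance(word_vec, query_vecs):
    """Average distance (1 - cosine similarity) between a candidate word's
    concept-net vector and the vectors of all query keywords.
    The 1 - cosine form is an assumption, not taken from the disclosure."""
    return sum(1.0 - cosine(word_vec, q) for q in query_vecs) / len(query_vecs)
```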
The final retrieval result is obtained by searching the plurality of target expansion candidate words. Because this result draws on the BM25 model, the BERT model, and concept-network semantics, retrieval precision and feedback efficiency are higher than with the traditional BM25 model alone.
Example 2
The invention also provides an information retrieval system, which comprises a processor and a memory, wherein the memory is stored with a computer program, and the computer program is executed by the processor to realize the information retrieval method provided by the embodiment 1.
The information retrieval system provided in this embodiment is used to implement the information retrieval method, and therefore, the information retrieval system also has the technical effects of the information retrieval method, and details are not repeated here.
Example 3
The present invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the information retrieval method of embodiment 1.
The computer storage medium provided in this embodiment is used to implement the information retrieval method, and therefore, the technical effects of the information retrieval method are also achieved by the computer storage medium, which is not described herein again.
In summary, the information retrieval method, system, and medium provided by the invention obtain a pseudo-relevant document set from a plurality of query keywords; process the pseudo-relevant document set according to a weight retrieval model and the semantic query of a concept network to obtain a plurality of target expansion candidate words; and search those candidate words to obtain the final retrieval result. By integrating the weight retrieval model and the concept network's semantic query into query expansion, both document scores and expansion words carry semantic features and have higher semantic relevance than the initial query; this overcomes semantic confusion in polysemous environments, extracts more information actually relevant to the query, improves retrieval precision, and saves retrieval time.
The preferred embodiments of the present invention have been described above in detail, but the present invention is not limited thereto. Within the scope of the technical idea of the invention, many simple modifications can be made to the technical solution of the invention, including combinations of various technical features in any other suitable way, and these simple modifications and combinations should also be regarded as the disclosure of the invention, and all fall within the scope of the invention.
Claims (8)
1. An information retrieval method, comprising the steps of:
s1, providing a plurality of query keywords to obtain a pseudo-relevant document set;
s2, processing the pseudo-relevant document set according to the weight retrieval model and semantic query of the concept network to obtain a plurality of target expansion candidate words;
and S3, searching the target expansion candidate words to obtain a final search result.
2. The information retrieval method according to claim 1, wherein the S1 specifically includes the steps of:
s11, providing a plurality of query keywords to obtain a target document set;
s12, inputting the target document set into a BM25 model to obtain the BM25 score of each document in the target document set, and selecting the top-ranked documents by score, recorded as a first document set;
s13, inputting the first document set into a BERT model to obtain the BERT score of each document in the first document set;
s14, linearly fusing the BM25 score of each document in the target document set with the BERT score of each document in the first document set to obtain a second document set and the score of each document;
and S15, selecting the top-ranked documents from the second document set by score as the pseudo-relevant document set.
3. The information retrieval method according to claim 1, wherein the S2 specifically includes the steps of:
s21, taking all the words in the pseudo-relevant document set as expansion candidate words, calculating the importance score of each expansion candidate word in the pseudo-relevant document set, and selecting the top m1 expansion candidate words as a first expansion candidate word set according to the scores from high to low;
s22, applying the expansion candidate words in the first document selected from the pseudo-relevant document set to a concept network to obtain semantic vectors of the expansion candidate words in the concept network, applying the query keywords to the concept network to obtain semantic vectors of the query keywords in the concept network, and calculating the semantic distance between the two semantic vectors; and
s23, calculating the semantic distance between each expansion candidate word and all the query keywords, and selecting the top-ranked expansion candidate words in ascending order of semantic distance as a second expansion candidate word set;
and S24, fusing the first expansion candidate word set and the second expansion candidate word set to obtain a plurality of target expansion candidate words.
4. The information retrieval method according to claim 3, wherein in S21, the calculation formula of the importance score of the candidate expansion word in the pseudo-relevant document set is:
where Score(w) represents the importance score of an expansion candidate word w, V(D1) represents the vector of the pseudo-relevant document set, w represents a word in the i-th document di of the pseudo-relevant document set D1, and N represents the number of documents in D1.
5. The information retrieval method according to claim 3, wherein in the step S22, the formula for calculating the semantic distance is:
where dist represents the semantic distance, V(w) represents the concept-net semantic vector of an expansion candidate word w selected from the i-th document of the pseudo-relevant document set, V(Qs) represents the concept-net semantic vector of a query keyword, N represents the number of documents in the pseudo-relevant document set D1, Qs represents the s-th word in the query keyword Q, and the similarity between V(w) and V(Qs) is computed via cosine similarity.
6. The information retrieval method as claimed in claim 3, wherein in the step S23, the formula for calculating the semantic distance between the expanded candidate word and all the query keywords is:
7. An information retrieval system comprising a processor and a memory;
the memory has stored thereon a computer readable program executable by the processor;
the processor, when executing the computer readable program, implements the steps in the information retrieval method of any one of claims 1-6.
8. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to perform the steps in the information retrieval method as recited in any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010927319.4A CN112163065A (en) | 2020-09-07 | 2020-09-07 | Information retrieval method, system and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010927319.4A CN112163065A (en) | 2020-09-07 | 2020-09-07 | Information retrieval method, system and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112163065A true CN112163065A (en) | 2021-01-01 |
Family
ID=73857343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010927319.4A Pending CN112163065A (en) | 2020-09-07 | 2020-09-07 | Information retrieval method, system and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112163065A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113569566A (en) * | 2021-07-30 | 2021-10-29 | 苏州七星天专利运营管理有限责任公司 | Vocabulary extension method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107993724B (en) | Medical intelligent question and answer data processing method and device | |
CN103136352B (en) | Text retrieval system based on double-deck semantic analysis | |
JP6266080B2 (en) | Method and system for evaluating matching between content item and image based on similarity score | |
US9792304B1 (en) | Query by image | |
KR101721338B1 (en) | Search engine and implementation method thereof | |
US8463593B2 (en) | Natural language hypernym weighting for word sense disambiguation | |
US9183261B2 (en) | Lexicon based systems and methods for intelligent media search | |
US8977624B2 (en) | Enhancing search-result relevance ranking using uniform resource locators for queries containing non-encoding characters | |
US20090076800A1 (en) | Dual Cross-Media Relevance Model for Image Annotation | |
CN105045852A (en) | Full-text search engine system for teaching resources | |
WO2011130008A2 (en) | Automatic query suggestion generation using sub-queries | |
JP2017220205A (en) | Method and system for dynamically rankings images to be matched with content in response to search query | |
US20090094019A1 (en) | Efficiently Representing Word Sense Probabilities | |
KR20180125746A (en) | System and Method for Sentence Embedding and Similar Question Retrieving | |
US10810181B2 (en) | Refining structured data indexes | |
US8768105B2 (en) | Method for searching a database using query images and an image anchor graph-based ranking algorithm | |
CN112612875B (en) | Query term automatic expansion method, device, equipment and storage medium | |
CN115563313A (en) | Knowledge graph-based document book semantic retrieval system | |
US8229970B2 (en) | Efficient storage and retrieval of posting lists | |
CN111737413A (en) | Feedback model information retrieval method, system and medium based on concept net semantics | |
JP5497105B2 (en) | Document retrieval apparatus and method | |
CN112163065A (en) | Information retrieval method, system and medium | |
CN111125299B (en) | Dynamic word stock updating method based on user behavior analysis | |
CN114519132A (en) | Formula retrieval method and device based on formula reference graph | |
CN112199461A (en) | Document retrieval method, device, medium and equipment based on block index structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20210101 |