CN116821280A - Document retrieval method, device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN116821280A
Authority
CN
China
Prior art keywords
document
vocabulary
search
searched
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310677833.0A
Other languages
Chinese (zh)
Inventor
徐峰
潘晓明
陈曦
周亚
崔海雪
章晗
孙乐义
朱丹
万海波
袁林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaan Securities Co ltd
Original Assignee
Huaan Securities Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaan Securities Co ltd
Priority to CN202310677833.0A
Publication of CN116821280A

Abstract

The invention provides a document retrieval method and apparatus, an electronic device, and a storage medium. The method comprises: receiving content to be retrieved that is input by a user; matching the content to be retrieved against an auxiliary search vocabulary table to obtain a preset number of auxiliary search words; and performing document retrieval using the Cartesian product of the content to be retrieved and the auxiliary search words as a set of search conditions. The method can reduce retrieval time and improve the accuracy of document retrieval.

Description

Document retrieval method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of retrieval, and in particular, to a document retrieval method, apparatus, electronic device, and storage medium.
Background
In the related art, a document retrieval method typically segments the text to be retrieved into words, applies simple processing to the segmentation results, and then uses those results as query conditions, computing their similarity to, and matching them against, all keywords extracted from the document library to obtain the retrieval results. Because the numbers of documents and keywords to be searched are large, computing similarity and matching the search keywords directly against every keyword extracted from the documents is time-consuming, and the retrieval results and retrieval coverage are not accurate enough.
Disclosure of Invention
The invention provides a document retrieval method, a document retrieval device, an electronic device and a storage medium, which are used for reducing retrieval time and improving accuracy of document retrieval.
The invention provides a document retrieval method, which comprises the following steps:
receiving content to be retrieved that is input by a user;
matching the content to be retrieved against an auxiliary search vocabulary table to obtain a preset number of auxiliary search words;
and performing document retrieval using the Cartesian product of the content to be retrieved and the auxiliary search words as a set of search conditions.
According to the document retrieval method provided by the invention, the auxiliary search vocabulary table comprises one or more of the following:
at least one vocabulary group obtained by applying a clustering algorithm to a plurality of words, or to the words in one or more of a near-synonym table, a synonym table, and an industry vocabulary table, each vocabulary group comprising at least one word;
at least one keyword extracted from at least one standardized document, wherein the at least one standardized document is among all the standardized documents that can be retrieved by the document retrieval;
the total number of times that the at least one keyword is extracted across the at least one standardized document;
a synonym table;
a near-synonym table;
an industry vocabulary table.
According to the document retrieval method provided by the invention, matching the input content to be retrieved against the auxiliary search vocabulary table to obtain the preset number of auxiliary search words comprises:
performing word segmentation on the input content to be retrieved to obtain one or more keywords to be retrieved;
matching each of the one or more keywords to be retrieved against the auxiliary search vocabulary table to obtain the auxiliary search words corresponding to each keyword to be retrieved;
and, for each keyword to be retrieved, selecting from all of its corresponding auxiliary search words the top N auxiliary search words with the highest total number of extractions from the at least one standardized document as the auxiliary search words of that keyword, wherein N is a preset value.
According to the document retrieval method provided by the invention, performing document retrieval using the Cartesian product of the content to be retrieved and the auxiliary search words as a set of search conditions comprises:
forming one or more vocabulary sets, each consisting of one keyword to be retrieved and the N auxiliary search words corresponding to it;
taking the Cartesian product of the one or more vocabulary sets as the set of search conditions;
using the set of search conditions as the query condition for a pre-constructed document keyword database, and searching in the pre-constructed document keyword database;
wherein the pre-constructed document keyword database is built from input standardized documents: a TF-IDF algorithm is used to extract the keywords of each input standardized document together with their TF-IDF values, and the top M keywords are stored along with the corresponding document ID and storage path, M being a preset value.
According to the document retrieval method provided by the invention, the method further comprises the following steps:
inputting the standardized document into a document preprocessing module to obtain the keywords extracted from the standardized document, wherein the document preprocessing module is used for extracting the keywords of the standardized document by adopting a TF-IDF algorithm.
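The keyword extraction performed by the document preprocessing module can be illustrated with a minimal sketch. The source names the TF-IDF algorithm but does not give a formula, so the variant used here (term frequency normalized by document length, inverse document frequency as log(N / df)) is an assumption, as are the function name `top_m_keywords` and the toy token lists.

```python
import math
from collections import Counter


def top_m_keywords(documents, m):
    """Return the top-m TF-IDF keywords per document.

    documents: {doc_id: list of tokens} (hypothetical, already segmented).
    Returns {doc_id: [(keyword, tfidf_value), ...]} sorted by descending value.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    result = {}
    for doc_id, tokens in documents.items():
        term_counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
        # Sort by score (descending), then alphabetically for a stable order.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        result[doc_id] = ranked[:m]
    return result


# Toy standardized documents (illustrative data only).
docs = {
    "doc1": ["ipo", "ipo", "listing", "rules"],
    "doc2": ["margin", "trading", "rules"],
    "doc3": ["bond", "issuance", "rules"],
}
keywords = top_m_keywords(docs, m=2)
print([term for term, _ in keywords["doc1"]])
```

In this toy data, "rules" appears in every document, so its TF-IDF value is zero and it is not kept among the top M keywords.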
According to the document retrieval method provided by the invention, the method further comprises the following steps:
outputting a document ID and a path corresponding to one or more standardized documents under the condition that the keywords in the search condition set are matched with M keywords of the one or more standardized documents stored in the pre-constructed document keyword database; and under the condition that the keywords in the search condition set are not matched with the keywords in the pre-constructed document keyword database, taking the search condition set as a query condition, and carrying out document search in a standard document library.
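The two-stage lookup described above (keyword database first, standard document library as a fallback) might be sketched as follows. The source does not define precisely when a search condition "matches" a document; this sketch assumes a condition matches when every word in it is among the document's stored keywords. The record layout, the `retrieve` function, and the `scan_library` fallback are all hypothetical.

```python
def retrieve(conditions, keyword_db, fallback_search):
    """Search the keyword database; fall back to the document library on a miss.

    conditions: list of tuples of words (the search condition set).
    keyword_db: {doc_id: {"keywords": set of str, "path": str}} (assumed layout).
    fallback_search: callable taking the conditions and returning (doc_id, path) pairs.
    """
    hits = []
    for doc_id, record in keyword_db.items():
        # Assumption: a condition matches if all of its words are stored keywords.
        if any(all(word in record["keywords"] for word in cond) for cond in conditions):
            hits.append((doc_id, record["path"]))
    if hits:
        return hits
    # No stored keyword matched: query the standard document library instead.
    return fallback_search(conditions)


keyword_db = {
    "D1": {"keywords": {"ipo", "listing"}, "path": "/docs/d1.txt"},
    "D2": {"keywords": {"margin", "trading"}, "path": "/docs/d2.txt"},
}

# Hypothetical full-text library used only when the keyword database misses.
library = {"D3": ("/docs/d3.txt", "rules on bond issuance")}


def scan_library(conditions):
    return [
        (doc_id, path)
        for doc_id, (path, text) in library.items()
        if any(all(word in text for word in cond) for cond in conditions)
    ]


print(retrieve([("ipo", "listing")], keyword_db, scan_library))
print(retrieve([("bond", "issuance")], keyword_db, scan_library))
```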
The invention also provides a document retrieval device, comprising:
a receiving module, configured to receive content to be retrieved that is input by a user;
a matching module, configured to match the content to be retrieved against an auxiliary search vocabulary table to obtain a preset number of auxiliary search words;
and a retrieval module, configured to perform document retrieval using the Cartesian product of the content to be retrieved and the auxiliary search words as a set of search conditions.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the document retrieval method when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the document retrieval method.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements the document retrieval method.
According to the document retrieval method, the device, the electronic equipment and the storage medium, the document retrieval is performed after the content to be retrieved is properly expanded, so that the retrieval time can be shortened, and the accuracy of document retrieval can be improved.
Drawings
To illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a document retrieval method provided by the invention;
FIG. 2 is a second schematic flow chart of the document retrieval method provided by the invention;
FIG. 3 is a third schematic flow chart of the document retrieval method provided by the invention;
FIG. 4 is a schematic structural diagram of the document retrieval apparatus provided by the invention;
FIG. 5 is a schematic structural diagram of the electronic device provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The securities industry is heavily regulated, and securities practitioners must look up a large amount of document information in the course of their work to determine the relevant requirements and specifications. In the related art, a document retrieval method typically segments the text to be retrieved into words, applies simple processing to the segmentation results, and then uses those results as query conditions, computing their similarity to, and matching them against, all keywords extracted from the document library to obtain the retrieval results. Because the numbers of documents and keywords to be searched are large, computing similarity and matching the search keywords directly against every keyword extracted from the documents is time-consuming, and the retrieval results and retrieval coverage are not accurate enough.
The invention provides a document retrieval method, a document retrieval device, an electronic device and a storage medium, which are used for reducing retrieval time and improving accuracy of document retrieval.
FIG. 1 is a schematic flow chart of a document retrieval method provided by the invention, as shown in FIG. 1, the method comprises the following steps:
step 100, receiving content to be retrieved input by a user;
alternatively, the content to be searched may be a word, a sentence, or a piece of text, which may be chinese or english, which is not limited in the present invention.
Alternatively, in order to expand the content to be retrieved, such as a vocabulary, a sentence, or a text, input by the user may be received first.
For example, the user may input the words "search", "match" and/or "query" as the content to be searched, or input the sentence "receive the content to be searched" input by the user as the content to be searched, or input a text "because the number of the searched documents and the number of the searched keywords are large, the similarity calculation and matching are directly performed on the search keywords and all the keywords extracted from the documents, which takes a long time and has inaccurate search results and search coverage. "as content to be retrieved". The content input by the user in this example is merely for illustration, and the content to be retrieved that the user can actually input is not limited to the content input by the user in this example.
Step 110, matching the content to be retrieved against an auxiliary search vocabulary table to obtain a preset number of auxiliary search words;
Optionally, so that the content to be retrieved can be expanded, an auxiliary search vocabulary table may be constructed first; given a word, the table can be used to look up the corresponding auxiliary search words.
Optionally, the auxiliary search words may be synonyms, near-synonyms, and the like, which can improve retrieval accuracy.
Optionally, after the content to be retrieved input by the user is received, it may be matched against the auxiliary search vocabulary table to obtain the auxiliary search words corresponding to the content to be retrieved.
Optionally, the auxiliary search vocabulary table may include a plurality of vocabulary groups, each of which may include a plurality of words; the words in a vocabulary group may be synonyms or near-synonyms of one another, or may be related in some other way, which the invention does not limit.
Optionally, matching the content to be retrieved against the auxiliary search vocabulary table may consist of looking up each word contained in the content to be retrieved in the table; if the word is found, the other words in the one or more vocabulary groups to which it belongs may be used as auxiliary search words.
For example, if the content to be retrieved includes word A, word A may be looked up in the auxiliary search vocabulary table; if word A is found and the vocabulary group to which it belongs also includes word C and word D, then word C and word D may be used as the auxiliary search words corresponding to the content to be retrieved.
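The group lookup in this example can be sketched in a few lines. This is only an illustration under assumptions: the auxiliary search vocabulary table is modeled as a plain list of vocabulary groups, and the function name `match_auxiliary_words` is hypothetical.

```python
def match_auxiliary_words(word, vocabulary_groups):
    """Return the other words of every vocabulary group that contains `word`."""
    auxiliary = []
    for group in vocabulary_groups:
        if word in group:
            for other in group:
                # Skip the query word itself and avoid duplicates across groups.
                if other != word and other not in auxiliary:
                    auxiliary.append(other)
    return auxiliary


# Hypothetical auxiliary search vocabulary table: a list of vocabulary groups.
groups = [
    ["word A", "word C", "word D"],
    ["word X", "word Y"],
]

print(match_auxiliary_words("word A", groups))  # ['word C', 'word D']
print(match_auxiliary_words("word Z", groups))  # []
```

A word that belongs to several groups collects the other members of all of them, which matches the "one or more vocabulary groups" wording of the description.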
Step 120, performing document retrieval using the Cartesian product of the content to be retrieved and the auxiliary search words as a set of search conditions.
Optionally, after the auxiliary search words corresponding to the content to be retrieved are obtained, the Cartesian product of the content to be retrieved and the auxiliary search words may be used as the set of search conditions for document retrieval.
Optionally, after the content to be retrieved is matched against the auxiliary search vocabulary table, one or more words in the content to be retrieved and the auxiliary search words corresponding to each of them are obtained; each word in the content to be retrieved may be combined with its auxiliary search words into a vocabulary set, and the Cartesian product of these vocabulary sets is used as the set of search conditions for document retrieval.
For example, suppose the content to be retrieved includes word A and word B, the auxiliary search words of word A are word a and word b, and the auxiliary search words of word B are word c and word d. Word A, word a, and word b may be combined into set A, and word B, word c, and word d into set B. The Cartesian product of set A and set B, that is, {word A, word B}, {word A, word c}, {word A, word d}, {word a, word B}, {word a, word c}, {word a, word d}, {word b, word B}, {word b, word c}, and {word b, word d}, may then be used as the set of search conditions for document retrieval.
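As a minimal sketch of this step, the search condition set can be built with `itertools.product`. The sketch assumes that set A holds word A with its auxiliary search words (word a, word b) and set B holds word B with its auxiliary search words (word c, word d); the function name `build_search_conditions` is hypothetical.

```python
from itertools import product


def build_search_conditions(vocabulary_sets):
    """Return the Cartesian product of the vocabulary sets as a list of tuples."""
    return list(product(*vocabulary_sets))


# Each set holds one keyword to be retrieved plus its auxiliary search words.
set_a = ["word A", "word a", "word b"]
set_b = ["word B", "word c", "word d"]

conditions = build_search_conditions([set_a, set_b])
for condition in conditions:
    print(condition)
print(len(conditions))  # 9
```

Each tuple takes exactly one word from every set, so two sets of three words yield nine search conditions.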
According to the document retrieval method provided by the invention, the document retrieval is performed after the content to be retrieved is properly expanded, so that the retrieval time can be reduced and the accuracy of the document retrieval can be improved.
Optionally, the auxiliary search vocabulary table includes one or more of the following:
at least one vocabulary group obtained by applying a clustering algorithm to a plurality of words, or to the words in one or more of a near-synonym table, a synonym table, and an industry vocabulary table;
at least one keyword extracted from at least one standardized document, wherein the at least one standardized document is among all the standardized documents that can be retrieved by the document retrieval;
the total number of times that the at least one keyword is extracted across the at least one standardized document;
a synonym table;
a near-synonym table;
an industry vocabulary table.
Optionally, the clustering algorithm may be K-Means clustering, mean-shift clustering, expectation-maximization (EM) clustering with a Gaussian mixture model, or another clustering algorithm; the invention does not limit this.
For example, with K-Means clustering, k groups are chosen and the center point of each group is randomly initialized. n words are selected from a standard Chinese dictionary as samples, the distance from each word to each of the k center points is calculated, and each word is assigned to the group of its nearest center point. Once the assignment is complete, the center point of each group is recalculated as the new center point. The assignment and recalculation steps are repeated until the center point of each group changes little between iterations, which yields k vocabulary groups. The words in each vocabulary group may be words that are readily associated with one another or that are related in some way.
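The K-Means procedure described above can be sketched in plain Python. The source does not say how a word is represented as a point, so the toy 2-D coordinates below are purely hypothetical stand-ins for word vectors; the function name `k_means`, the seed, and the tolerance are likewise assumptions.

```python
import math
import random


def k_means(vectors, k, seed=0, tol=1e-6, max_iter=100):
    """Cluster {word: coordinates} into k vocabulary groups (lists of words)."""
    rng = random.Random(seed)
    words = sorted(vectors)
    # Randomly initialize the k center points from the sample words.
    centers = [vectors[w] for w in rng.sample(words, k)]
    groups = []
    for _ in range(max_iter):
        # Assignment step: each word joins the group of its nearest center.
        groups = [[] for _ in range(k)]
        for w in words:
            distances = [math.dist(vectors[w], c) for c in centers]
            groups[distances.index(min(distances))].append(w)
        # Update step: recompute each center as the mean of its group.
        new_centers = []
        for group, old_center in zip(groups, centers):
            if group:
                dims = zip(*(vectors[w] for w in group))
                new_centers.append(tuple(sum(d) / len(group) for d in dims))
            else:
                new_centers.append(old_center)  # keep the center of an empty group
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centers barely moved: stop iterating
            break
    return groups


# Hypothetical 2-D word vectors; real systems would use learned embeddings.
toy_vectors = {
    "find": (0.0, 0.0), "detect": (0.1, 0.0), "perceive": (0.0, 0.1),
    "interest": (10.0, 10.0), "hobby": (10.1, 10.0), "specialty": (10.0, 10.1),
}
for group in k_means(toy_vectors, k=2):
    print(sorted(group))
```

With two well-separated clumps of points, the two resulting vocabulary groups correspond to the two clumps; the order in which the groups are returned depends on the random initialization.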
Optionally, all existing words, for example all words recorded in a standard Chinese dictionary, may be divided into a plurality of vocabulary groups by a clustering algorithm. The division rule may be to place words that are synonyms of one another in the same vocabulary group; or to place words that are near-synonyms of one another in the same vocabulary group; or, based on terms commonly used in an industry, to place the Chinese full name, English full name, and abbreviation of an industry term in the same vocabulary group; or to place words that are readily associated with one another or related in some way in the same vocabulary group. The invention does not limit this.
For example, the words that are synonyms for each other may be grouped into the same vocabulary group by a clustering algorithm, such as grouping the words "find", "perceive", "invent", and "detect" into the same vocabulary group.
For example, words that are near-synonyms of one another, such as "very", "prominent", "Zhuo Shu", "extra", "extraordinary", and "extreme", may be placed in the same vocabulary group by a clustering algorithm.
For example, based on the industry vocabulary of the securities industry, the English abbreviation "IPO", the English full name "Initial public offering", and the Chinese full name for "initial public offering" may be placed in the same vocabulary group by a clustering algorithm.
For example, the words "stock" and "securities" are neither synonyms nor near-synonyms, nor are they the Chinese full name, English full name, and abbreviation of a single industry term, but each readily brings the other to mind, so "stock" and "securities" may be placed in the same vocabulary group by a clustering algorithm.
Optionally, the more words contained in the vocabulary group, the more auxiliary search words the content to be searched can be matched with.
Optionally, the synonyms and near-synonyms of a word, together with words readily associated with it, may be placed in the same vocabulary group by a clustering algorithm. Compared with matching auxiliary search words using only a synonym library and a near-synonym library, vocabulary groups obtained by a clustering algorithm allow more auxiliary search words to be matched, which improves retrieval accuracy.
For example, "interests", "hobbies", and "specialties" may be placed in the same vocabulary group by a clustering algorithm. "Interests" and "hobbies" are near-synonyms, so each can be obtained as an auxiliary search word from a near-synonym library. "Specialties" is not a near-synonym of "interests" or "hobbies" and cannot be obtained from a near-synonym library alone, but it is readily associated with them, so a clustering algorithm can place all three in the same vocabulary group, which improves retrieval accuracy.
Alternatively, the extraction method for extracting at least one keyword from at least one standardized document may be a manual extraction, or an algorithm extraction, or any other method capable of extracting a keyword from a standardized document, which is not limited in the present invention.
For example, word A, word B, word C, and word D may be extracted manually from standardized document A as its keywords.
For example, word E, word F, word G, and word H may be extracted from standardized document B by the TF-IDF algorithm as its keywords.
Optionally, the total number of times that a keyword extracted from a standardized document is extracted may be the number of times that keyword is extracted across all standardized documents.
For example, if keyword A is extracted from one standardized document and is also extracted from 8 other standardized documents, the total number of times keyword A is extracted across all standardized documents is 9.
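The counting in this example can be reproduced with a short sketch. It assumes that each standardized document contributes at most one extraction per keyword, which is consistent with the example (9 documents, total of 9); the function name `total_extraction_counts` and the document names are hypothetical.

```python
from collections import Counter


def total_extraction_counts(keywords_by_document):
    """Count, for each keyword, how many documents it was extracted from."""
    counts = Counter()
    for keywords in keywords_by_document.values():
        # set() ensures a document counts once per keyword (an assumption).
        counts.update(set(keywords))
    return counts


# Nine hypothetical standardized documents that all yield "keyword A".
extracted = {f"doc{i}": ["keyword A", f"other {i}"] for i in range(1, 10)}
extracted["doc10"] = ["keyword B"]

counts = total_extraction_counts(extracted)
print(counts["keyword A"])  # 9
print(counts["keyword B"])  # 1
```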
Alternatively, the synonym table may be a synonym table constructed based on a standard Chinese dictionary, and inputting a word into the synonym table may obtain all synonyms of the word.
For example, by entering the word "find" in the synonym table, synonyms such as "find", "perceive", "detect", "invention", "detect" and the like can be obtained.
Optionally, words in the synonym table that are synonyms of one another may be placed in the same vocabulary group by a clustering algorithm, yielding a plurality of vocabulary groups.
For example, if the synonym table contains two pairs of synonyms, such as "find" and "detect", and "interest" and "hobbies", a clustering algorithm may place "find" and "detect" in vocabulary group A and "interest" and "hobbies" in vocabulary group B.
Optionally, the near-synonym table may be constructed from a standard Chinese dictionary; inputting a word into the near-synonym table returns all near-synonyms of that word.
For example, inputting the word "very" into the near-synonym table may return "prominent", "Zhuo Shu", "extra", "extraordinary", "extreme", and the like.
Optionally, words in the near-synonym table that are near-synonyms of one another may be placed in the same vocabulary group by a clustering algorithm, yielding a plurality of vocabulary groups.
For example, if the near-synonym table contains the words "very", "prominent", "mystery", and "magic", a clustering algorithm may place the near-synonyms "very" and "prominent" in vocabulary group A and "mystery" and "magic" in vocabulary group B.
Optionally, the industry vocabulary table may be constructed from terms commonly used in an industry, and may establish the correspondence among the Chinese full name, English full name, and abbreviation of each term.
For example, inputting the English abbreviation "IPO" into the industry vocabulary table of the securities industry returns the English full name "Initial public offering" and the corresponding Chinese full name.
For example, inputting the Chinese full name for "market value" into the industry vocabulary table of the securities industry returns the English full name "Market capitalization".
Optionally, matching the content to be retrieved against the auxiliary search vocabulary table may determine: whether the content to be retrieved contains keywords extracted from the standardized documents, and the total number of times each such keyword is extracted across all standardized documents; all synonyms of the content to be retrieved; all near-synonyms of the content to be retrieved; and/or the Chinese full name, English full name, and/or abbreviation corresponding to any industry term in the content to be retrieved.
Optionally, words in the industry vocabulary table that are the Chinese full name, English full name, and/or abbreviation of the same term may be placed in the same vocabulary group by a clustering algorithm, yielding a plurality of vocabulary groups.
For example, if the industry vocabulary table contains "IPO", "Initial public offering", the Chinese full name for "initial public offering", the Chinese full name for "market value", and "Market capitalization", a clustering algorithm may place "IPO", "Initial public offering", and the corresponding Chinese full name in vocabulary group A, and the Chinese full name for "market value" and "Market capitalization" in vocabulary group B.
Optionally, the auxiliary search vocabulary may be maintained and expanded in a manually annotated manner.
According to the document retrieval method provided by the invention, the auxiliary retrieval vocabulary is constructed, so that the document retrieval can be performed after the content to be retrieved is expanded based on the auxiliary retrieval vocabulary, the scope of the retrieval vocabulary can be increased, and the accuracy of document retrieval is improved.
Optionally, matching the input content to be retrieved against the auxiliary search vocabulary table to obtain a preset number of auxiliary search words includes:
performing word segmentation on the input content to be retrieved to obtain one or more keywords to be retrieved;
matching each of the one or more keywords to be retrieved against the auxiliary search vocabulary table to obtain the auxiliary search words corresponding to each keyword to be retrieved;
and, from all the auxiliary search words matched for each of the one or more keywords to be retrieved, selecting the top N auxiliary search words that are keywords extracted from the standardized documents and have the highest total number of extractions across all standardized documents, wherein N is a preset value.
Optionally, after receiving the content to be searched input by the user, word segmentation processing can be performed on the content to be searched to obtain one or more word segmentation results, and stop words in the word segmentation results can be removed to obtain one or more keywords to be searched.
Alternatively, the word segmentation processing of the content to be retrieved may be performed manually, or the word segmentation processing of the content to be retrieved may be performed by using a word segmentation algorithm, or the word segmentation processing of the content to be retrieved may be performed by using a word segmentation model, which is not limited in the present invention.
Optionally, a dictionary may be maintained, and each character string in the content to be retrieved is matched against the words in the dictionary one by one; if a character string matches a word in the dictionary, that string is taken as a segmentation result. Alternatively, the frequency with which characters appear adjacent to one another may be used as a measure of how likely they are to form a word: the frequency of each combination of adjacent characters in a corpus is counted, and when the frequency exceeds a threshold, the combination is considered to form a word. Other word segmentation rules may also be used; the invention does not limit this.
Optionally, removing stop words may consist of removing words in the content to be retrieved that cannot affect the retrieval result, such as function words or modal particles; alternatively, a stop-word list may be maintained manually, and every word of the content to be retrieved that appears in the list is removed.
For example, when the content to be retrieved input by the user is "the market value of a company", word segmentation and stop-word removal yield the keywords to be retrieved "company" and "market value".
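Dictionary-based segmentation followed by stop-word removal might look like the sketch below. The source only says that character strings are matched against a dictionary; forward maximum matching is one common way to do that and is used here as an assumption. The Chinese input string, the dictionary, and the stop-word list are illustrative choices meant to mirror the example, not data from the source.

```python
def segment(text, dictionary):
    """Forward maximum matching: at each position take the longest dictionary word."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def keywords_to_retrieve(text, dictionary, stop_words):
    """Segment the text, then drop every token found in the stop-word list."""
    return [token for token in segment(text, dictionary) if token not in stop_words]


# Illustrative data: "公司" (company), "的" (of), "市值" (market value).
dictionary = {"公司", "市值", "的"}
stop_words = {"的"}

print(segment("公司的市值", dictionary))                           # ['公司', '的', '市值']
print(keywords_to_retrieve("公司的市值", dictionary, stop_words))  # ['公司', '市值']
```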
Optionally, after obtaining the keywords to be searched, each keyword to be searched can be respectively matched with an auxiliary search vocabulary to obtain whether each keyword to be searched is a keyword extracted from a standardized document, the total number of times the keyword is extracted from all standardized documents, and auxiliary search vocabularies corresponding to each keyword to be searched one by one. Specifically, the keyword to be searched can be searched in the auxiliary search vocabulary, and if the vocabulary exists in the auxiliary search vocabulary, other vocabularies of the vocabulary group to which the vocabulary belongs are used as the auxiliary search vocabulary of the keyword to be searched.
For example, the content to be searched input by the user may be subjected to word segmentation to obtain a keyword a to be searched, the keyword a to be searched is matched with an auxiliary search vocabulary, it is determined that the keyword a to be searched belongs to keywords extracted from the standardized document a, and both the vocabulary group 1 and the vocabulary group 2 include the keyword a to be searched, wherein the vocabulary group 1 further includes a vocabulary C and a vocabulary D, the vocabulary group 2 further includes a vocabulary E and a vocabulary F, and then it may be determined that the total number of times the keyword a to be searched is extracted from all the standardized documents is 1, and the auxiliary search vocabulary corresponding to the keyword a to be searched is the vocabulary C, the vocabulary D, the vocabulary E and the vocabulary F.
Alternatively, after obtaining the auxiliary search vocabulary of each keyword to be searched, the first N auxiliary search vocabularies belonging to the keywords extracted from the standardized documents and having the highest total number of times extracted from all the standardized documents may be selected as final auxiliary search vocabularies, and N may be a preset value.
For example, matching keyword A to be searched against the auxiliary search vocabulary yields 5 auxiliary search words: word a, word b, word c, word d and word e. Word a, word b, word c and word d are keywords extracted from the standardized documents, with total extraction counts across all standardized documents of 8, 6, 4 and 2 respectively. The 3 auxiliary search words with the highest total extraction counts, namely word a, word b and word c, may then be selected as the final auxiliary search words.
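The selection in this example can be sketched as below. The function name `select_top_n` and the dictionary layout are assumptions made for illustration; the counts are those given in the example.

```python
def select_top_n(candidates, extraction_counts, n):
    """Keep the n candidates extracted most often from the standardized documents."""
    # Only words that were actually extracted from some standardized document qualify.
    eligible = [w for w in candidates if w in extraction_counts]
    # Highest total extraction count first; ties keep their original order.
    eligible.sort(key=lambda w: extraction_counts[w], reverse=True)
    return eligible[:n]


counts = {"a": 8, "b": 6, "c": 4, "d": 2}  # word e was never extracted
final_words = select_top_n(["a", "b", "c", "d", "e"], counts, 3)
```

Word e is dropped because it is not a keyword extracted from any standardized document, and word d is dropped because only the N = 3 highest counts are kept.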
In one embodiment of the invention, Table 1 shows one possible result of matching against the auxiliary search vocabulary.
Table 1: auxiliary vocabulary
where w1 is another keyword to be searched, wa1 is the first auxiliary search word matched to that keyword, and wa2 is the second auxiliary search word matched to that keyword.
According to the document retrieval method provided by the invention, among the matched auxiliary search words, only the top N that are keywords extracted from the standardized documents and have the highest total extraction counts across all standardized documents are retained. This selects the words most likely to match and bounds the number of auxiliary search words, which shortens the retrieval time and improves the accuracy of document retrieval.
Optionally, performing document retrieval using the Cartesian product of the content to be retrieved and the auxiliary search words as a search condition set includes:
taking the Cartesian product of the one or more sets, each consisting of one of the keywords to be searched and its corresponding N auxiliary search words, as the search condition set, and using the search condition set as the query condition to search a pre-constructed document keyword database;
wherein the pre-constructed document keyword database is built from the input standardized documents: a TF-IDF algorithm extracts the keywords of each input standardized document and their TF-IDF values, and the first M keywords are stored together with the corresponding document ID and storage path, M being a preset value.
Optionally, the search condition set may be constructed as the Cartesian product of the one or more sets, each consisting of a keyword to be searched and its corresponding N auxiliary search words; for two sets $A$ and $B$, $A \times B = \{(x, y) \mid x \in A \land y \in B\}$.
For example, if the auxiliary search words of keyword A to be searched are word a and word b, and the auxiliary search words of keyword B to be searched are word c and word d, the search condition set may be {keyword A, keyword B}, {keyword A, word c}, {keyword A, word d}, {word a, keyword B}, {word a, word c}, {word a, word d}, {word b, keyword B}, {word b, word c} and {word b, word d}.
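The construction in this example can be sketched with the standard-library `itertools.product`. The function name `build_search_conditions` is hypothetical. The sketch also drops repeated words and repeated conditions, which is only one possible reading of the optional deduplication mentioned in this embodiment.

```python
from itertools import product


def build_search_conditions(word_sets):
    """Cartesian product of the per-keyword word sets, without repeats."""
    conditions = []
    seen = set()
    for combo in product(*word_sets):
        # Drop a word repeated inside one condition, keeping first-seen order.
        condition = tuple(dict.fromkeys(combo))
        key = frozenset(condition)
        if key not in seen:  # skip conditions that contain the same words
            seen.add(key)
            conditions.append(condition)
    return conditions


set_a = ["keyword A", "word a", "word b"]  # keyword A and its auxiliary words
set_b = ["keyword B", "word c", "word d"]  # keyword B and its auxiliary words
conditions = build_search_conditions([set_a, set_b])
```

Two sets of three words each give nine conditions, matching the enumeration in the example. With more keywords the product grows multiplicatively, which is why bounding N matters.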
Optionally, if the same word appears more than once in the constructed search condition set, only one occurrence may be retained.
Optionally, the auxiliary search vocabulary may be queried as soon as the user inputs the content to be searched, and the query result may be displayed to the user as a search prompt.
Optionally, the pre-constructed document keyword database may be built from the input standardized documents by using the TF-IDF algorithm to extract the keywords of each standardized document and their TF-IDF values, and storing the first M extracted keywords in the database together with the corresponding document ID and storage path, where M is a preset value.
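A minimal sketch of building such a database is given below. It assumes the documents are already segmented into tokens, uses one common TF-IDF variant (term frequency times the logarithm of the inverse document frequency), and assumes that "the first M keywords" means the M highest-scoring ones. The function name `build_keyword_db`, the record layout and the sample paths are hypothetical.

```python
import math
from collections import Counter


def build_keyword_db(docs, m):
    """docs maps doc_id -> (storage_path, tokens); returns the keyword database."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for _, tokens in docs.values():
        df.update(set(tokens))
    db = {}
    for doc_id, (path, tokens) in docs.items():
        tf = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        top = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:m]
        db[doc_id] = {"path": path, "keywords": top}
    return db


docs = {
    "d1": ("/docs/d1.txt", ["bond", "bond", "yield", "market"]),
    "d2": ("/docs/d2.txt", ["stock", "market", "market", "value"]),
}
keyword_db = build_keyword_db(docs, m=2)
```

In this toy corpus the word "market" occurs in both documents, so its inverse document frequency is zero and it is not kept among the M = 2 keywords of either document.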
Optionally, if the search condition set contains keywords among the M extracted keywords of a standardized document, that standardized document is identified as a document being searched for.
According to the document retrieval method provided by the invention, retrieving from the pre-constructed document keyword database bounds the number of keywords to be matched, because the database holds only a limited number of keywords extracted from each standardized document; this reduces the retrieval time while improving the accuracy of document retrieval.
Optionally, the method further comprises:
inputting the standardized document into a document preprocessing module to obtain the keywords extracted from the standardized document, wherein the document preprocessing module is used for extracting the keywords of the standardized document by adopting a TF-IDF algorithm.
Optionally, all the standardized documents may be input to the document preprocessing module, which extracts the keywords of every standardized document using the TF-IDF algorithm and counts the number of times each keyword is extracted.
Optionally, the keywords of all the standardized documents, together with their extraction counts, may then be added to the auxiliary search vocabulary.
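The counting step can be sketched as follows, assuming that a keyword is counted once for each standardized document from which it is extracted. The function name `count_extractions` and the sample data are hypothetical.

```python
from collections import Counter


def count_extractions(keywords_per_doc):
    """keywords_per_doc maps doc_id -> extracted keywords; returns total counts."""
    counts = Counter()
    for keywords in keywords_per_doc.values():
        counts.update(set(keywords))  # each document counts a keyword once
    return counts


extracted = {
    "d1": ["bond", "yield"],
    "d2": ["bond", "market"],
}
extraction_counts = count_extractions(extracted)
```

The resulting mapping from keyword to count is what would be added to the auxiliary search vocabulary alongside the vocabulary groups.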
According to the document retrieval method provided by the invention, inputting the standardized documents into the document preprocessing module yields the keywords extracted from them and the number of times each keyword is extracted; adding these keywords and counts to the auxiliary search vocabulary makes it convenient to determine the search condition set based on the auxiliary search vocabulary.
Optionally, the method further comprises:
outputting a document ID and a path corresponding to one or more standardized documents under the condition that the keywords in the search condition set are matched with M keywords of the one or more standardized documents stored in the pre-constructed document keyword database; and under the condition that the keywords in the search condition set are not matched with the keywords in the pre-constructed document keyword database, taking the search condition set as a query condition, and carrying out document search in a standard document library.
Optionally, the keywords in the search condition set are considered to match the M keywords of the one or more standardized documents stored in the pre-constructed document keyword database when at least one keyword in the search condition set matches at least one keyword stored for those documents.
For example, suppose one condition A in the search condition set is {word a, word b}, the 3 keywords extracted from standardized document A are word a, word b and word c, and the 3 keywords extracted from standardized document B are word a, word b and word d. When condition A is used as the search condition against the pre-constructed document keyword database, it matches both standardized document A and standardized document B, and the database may output the document ID and path of standardized document A and the document ID and path of standardized document B.
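A minimal sketch of this lookup, using the "at least one keyword in common" reading of the matching rule. The in-memory dictionary `KEYWORD_DB`, its paths and the function name `search_keyword_db` are hypothetical stand-ins for the pre-constructed database.

```python
# Hypothetical contents of the document keyword database (M = 3),
# mirroring the example above.
KEYWORD_DB = {
    "A": {"path": "/docs/A.pdf", "keywords": {"a", "b", "c"}},
    "B": {"path": "/docs/B.pdf", "keywords": {"a", "b", "d"}},
}


def search_keyword_db(condition, db=KEYWORD_DB):
    """Return (doc_id, path) pairs for documents sharing a keyword with the condition."""
    wanted = set(condition)
    return sorted(
        (doc_id, entry["path"])
        for doc_id, entry in db.items()
        if wanted & entry["keywords"]  # at least one keyword in common
    )


hits = search_keyword_db({"a", "b"})
```

A stricter variant could require every word of the condition to be present; the text leaves room for either reading, and the example above is satisfied by both.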
FIG. 2 is a second flow chart of the document retrieval method provided by the invention. As shown in FIG. 2, in one embodiment of the invention, word segmentation is first performed on the content to be searched input by the user to obtain word segmentation results. The word segmentation results are then matched against an auxiliary search vocabulary, composed of a synonym and paraphrase dictionary, an industry word stock and keywords extracted from standardized documents, to obtain auxiliary search words; the keywords extracted from the standardized documents are obtained by inputting the standardized documents into the document preprocessing module. Finally, the Cartesian product of the sets formed by the word segmentation results and their auxiliary search words is used as the search condition to search the keyword database and obtain the search result.
FIG. 3 is a third flow chart of the document searching method according to the present invention, as shown in FIG. 3, in one embodiment of the present invention, the document searching method includes the steps of:
1. receiving input content to be retrieved, wherein the content to be retrieved comprises phrases and sentences;
2. performing word segmentation on the input phrases and sentences to be retrieved, respectively, to obtain all keywords to be searched;
3. matching each keyword obtained in the second step against the pre-constructed auxiliary search vocabulary to obtain a first auxiliary search word set, from which the search condition set is formed;
4. using the search condition set obtained in the third step as the query condition to search the pre-constructed document keyword database;
5. if the search condition set matches keywords in the database in the fourth step, directly outputting the corresponding document ID and path; if no relevant document is retrieved, performing document retrieval in the standard document library stored in a distributed search engine (ES).
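The two-tier lookup in steps 4 and 5 can be sketched as below. Both lookups are passed in as callables so that no particular database or search-engine API is assumed; `retrieve`, `keyword_db_lookup` and `full_text_search` are hypothetical names, and the lambdas are toy stand-ins.

```python
def retrieve(conditions, keyword_db_lookup, full_text_search):
    """Try the keyword database first; fall back to the full document library."""
    hits = []
    for condition in conditions:
        for hit in keyword_db_lookup(condition):
            if hit not in hits:  # report each document once
                hits.append(hit)
    if hits:
        return hits
    # No condition matched any stored keyword: query the document library instead.
    return full_text_search(conditions)


# Toy stand-ins for the two back ends.
keyword_lookup = lambda cond: [("A", "/docs/A.pdf")] if "a" in cond else []
library_search = lambda conds: [("Z", "/lib/Z.pdf")]

fast_path = retrieve([("a", "c")], keyword_lookup, library_search)
fallback = retrieve([("x", "y")], keyword_lookup, library_search)
```

The slower full-text search is reached only when the small keyword database yields nothing, which is the source of the retrieval-time saving described in the text.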
Fig. 4 is a schematic structural view of a document retrieval apparatus provided by the present invention, and as shown in fig. 4, the document retrieval apparatus 400 includes: a receiving module 410, a matching module 420 and a retrieving module 430, wherein,
a receiving module 410, configured to receive content to be retrieved input by a user;
the matching module 420 is configured to match the content to be retrieved against an auxiliary search vocabulary to obtain a preset number of auxiliary search words;
the retrieving module 430 is configured to perform document retrieval by using the Cartesian product of the content to be retrieved and the auxiliary search words as a search condition set.
According to the document retrieval device provided by the invention, the document retrieval is performed after the content to be retrieved is properly expanded, so that the retrieval time can be reduced and the accuracy of the document retrieval can be improved.
It can be understood that the document retrieval device provided by the present invention corresponds to the document retrieval method provided by each embodiment, and the relevant technical features of the document retrieval device provided by the present invention may refer to the relevant technical features of the document retrieval method provided by each embodiment, which are not described herein.
Fig. 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 5, the electronic device may include: a processor 510, a communication interface (Communications Interface) 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform a document retrieval method comprising: receiving content to be retrieved input by a user; matching the content to be retrieved against an auxiliary search vocabulary to obtain a preset number of auxiliary search words; and performing document retrieval using the Cartesian product of the content to be retrieved and the auxiliary search words as a search condition set.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer readable storage medium. When executed by a processor, the computer program performs the document retrieval method provided by the methods described above, the method comprising: receiving content to be retrieved input by a user; matching the content to be retrieved against an auxiliary search vocabulary to obtain a preset number of auxiliary search words; and performing document retrieval using the Cartesian product of the content to be retrieved and the auxiliary search words as a search condition set.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the document retrieval method provided by the methods described above, the method comprising: receiving content to be retrieved input by a user; matching the content to be retrieved against an auxiliary search vocabulary to obtain a preset number of auxiliary search words; and performing document retrieval using the Cartesian product of the content to be retrieved and the auxiliary search words as a search condition set.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A document retrieval method, comprising:
receiving content to be retrieved input by a user;
matching the content to be retrieved against an auxiliary search vocabulary to obtain a preset number of auxiliary search words;
and performing document retrieval using the Cartesian product of the content to be retrieved and the auxiliary search words as a search condition set.
2. The document retrieval method of claim 1, wherein the auxiliary retrieval vocabulary includes one or more of:
at least one vocabulary group obtained by applying a clustering algorithm to a plurality of words, or to the words in one or more of a hyponym vocabulary, a synonym vocabulary and an industry vocabulary, each vocabulary group comprising at least one word;
at least one keyword extracted from at least one standardized document, wherein the at least one standardized document is among all the standardized documents that can be retrieved by the document retrieval;
the total number of times the at least one keyword extracted from the at least one standardized document is extracted in the at least one standardized document;
a synonym table;
a near-synonym table;
an industry vocabulary.
3. The document retrieval method according to claim 2, wherein matching the content to be retrieved against the auxiliary search vocabulary to obtain a preset number of auxiliary search words includes:
performing word segmentation on the input content to be retrieved to obtain one or more keywords to be searched;
matching the one or more keywords to be searched against the auxiliary search vocabulary, respectively, to obtain the auxiliary search words corresponding to each of the one or more keywords to be searched;
and selecting, from all the auxiliary search words corresponding to each of the one or more keywords to be searched, the N auxiliary search words with the highest total number of times extracted from the at least one standardized document as the auxiliary search words of that keyword, wherein N is a preset value.
4. A document retrieval method according to claim 3, wherein said performing document retrieval using a cartesian product of the content to be retrieved and the auxiliary retrieval vocabulary as a set of retrieval conditions comprises:
forming one or more vocabulary sets, each consisting of one of the one or more keywords to be searched and the N auxiliary search words corresponding to it;
taking the Cartesian product of the one or more vocabulary sets as the search condition set;
using the search condition set as the query condition of a pre-constructed document keyword database, and searching in the pre-constructed document keyword database;
wherein the pre-constructed document keyword database is built from the input standardized documents: a TF-IDF algorithm extracts the keywords of each input standardized document and their TF-IDF values, and the first M keywords are stored together with the corresponding document ID and storage path, M being a preset value.
5. The document retrieval method according to claim 2, wherein the method further comprises:
inputting the standardized document into a document preprocessing module to obtain the keywords extracted from the standardized document, wherein the document preprocessing module is used for extracting the keywords of the standardized document by adopting a TF-IDF algorithm.
6. The document retrieval method as recited in claim 4, wherein the method further comprises:
outputting a document ID and a path corresponding to one or more standardized documents under the condition that the keywords in the search condition set are matched with M keywords of the one or more standardized documents stored in the pre-constructed document keyword database; and under the condition that the keywords in the search condition set are not matched with the keywords in the pre-constructed document keyword database, taking the search condition set as a query condition, and carrying out document search in a standard document library.
7. A document retrieval apparatus, comprising:
the receiving module is used for receiving the content to be retrieved input by the user;
the matching module is used for matching the content to be retrieved against the auxiliary search vocabulary to obtain a preset number of auxiliary search words;
and the retrieval module is used for performing document retrieval using the Cartesian product of the content to be retrieved and the auxiliary search words as a search condition set.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the document retrieval method according to any one of claims 1 to 6 when the program is executed.
9. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the document retrieval method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the document retrieval method of any one of claims 1 to 6.
CN202310677833.0A 2023-06-07 2023-06-07 Document retrieval method, device, electronic equipment and storage medium Pending CN116821280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310677833.0A CN116821280A (en) 2023-06-07 2023-06-07 Document retrieval method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310677833.0A CN116821280A (en) 2023-06-07 2023-06-07 Document retrieval method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116821280A true CN116821280A (en) 2023-09-29

Family

ID=88121411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310677833.0A Pending CN116821280A (en) 2023-06-07 2023-06-07 Document retrieval method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116821280A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination