CN115048495A - Document retrieval method, document retrieval device, electronic equipment and medium - Google Patents

Document retrieval method, document retrieval device, electronic equipment and medium

Info

Publication number
CN115048495A
CN115048495A
Authority
CN
China
Prior art keywords
document
vocabulary
weight
frequency
vocabularies
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210874619.XA
Other languages
Chinese (zh)
Inventor
田琳
陈旻炜
何启凡
王玉龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202210874619.XA priority Critical patent/CN115048495A/en
Publication of CN115048495A publication Critical patent/CN115048495A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/338Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0281Customer communication at a business location, e.g. providing product or service information, consulting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02Banking, e.g. interest calculation or account maintenance

Abstract

The disclosure provides a document retrieval method, which can be used in the financial field or other fields. The document retrieval method comprises the following steps: acquiring a retrieval keyword; extracting a feature vocabulary of each document in a canonical document set, wherein the feature vocabulary is obtained based on the word frequency-inverse document frequency and position information of the vocabulary; and matching the retrieval keyword with the feature vocabularies to obtain a retrieval result. The present disclosure also provides a document retrieval apparatus, a device, a storage medium, and a program product.

Description

Document retrieval method, document retrieval device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of document retrieval, and may also be used in the financial field or other fields, and in particular, to a document retrieval method, apparatus, device, medium, and program product.
Background
Before answering a customer's question, bank customer service staff generally search a document library using the question keywords. The current document retrieval method matches the question keywords against every vocabulary of every document in the document library to find the document relevant to the customer's question.
In the process of implementing the concept of the present disclosure, the inventors found that, because a single document contains a large number of vocabularies, matching the question keywords against every vocabulary of every document in the document library results in low retrieval accuracy and low retrieval speed.
Disclosure of Invention
In view of the above, the present disclosure provides a document retrieval method, apparatus, device, medium, and program product.
According to a first aspect of the present disclosure, there is provided a document retrieval method applied to a canonical document set including a plurality of documents, each of the plurality of documents including a plurality of words, the method including: acquiring a retrieval keyword; extracting a feature vocabulary of each document in the canonical document set, wherein the feature vocabulary is obtained based on the word frequency-inverse document frequency and position information of the vocabulary; and matching the retrieval keyword with the feature vocabularies to obtain a retrieval result.
According to an embodiment of the present disclosure, the step of extracting the feature vocabulary of each document in the canonical document set includes: preprocessing a document of a feature vocabulary to be extracted to obtain a vocabulary set corresponding to the document, wherein the vocabulary set comprises n vocabularies, and n is an integer greater than or equal to 1; determining a weight for each of the n vocabularies; and obtaining the characteristic vocabulary of the document based on the weight.
According to the embodiment of the present disclosure, the step of preprocessing the document of the feature vocabulary to be extracted to obtain the vocabulary set corresponding to the document includes: performing word segmentation processing on a document of a feature word to be extracted to obtain a candidate word set; and removing the stop vocabulary in the candidate vocabulary set based on the general stop vocabulary table to obtain a vocabulary set corresponding to the document.
According to an embodiment of the present disclosure, the step of determining the weight of each of the n vocabularies comprises: calculating the word frequency-inverse document frequency of each vocabulary in the n vocabularies; determining a word frequency-inverse document frequency weight for each of the n vocabularies based on the word frequency-inverse document frequency; determining a location weight for each of the n words based on location information for each of the n words; and determining the weight of each vocabulary in the n vocabularies according to the word frequency-inverse document frequency weight and the position weight of each vocabulary in the n vocabularies.
According to an embodiment of the present disclosure, the step of calculating the word frequency-inverse document frequency of each of the n vocabularies comprises: calculating the word frequency of each word in the n words; calculating an inverse document frequency for each of the n vocabularies; and multiplying the word frequency of each vocabulary in the n vocabularies by the inverse document frequency to obtain the word frequency-inverse document frequency of each vocabulary in the n vocabularies.
According to an embodiment of the present disclosure, the determining the weight of each of the n vocabularies according to the word frequency-inverse document frequency weight and the position weight of each of the n vocabularies comprises: multiplying the word frequency-inverse document frequency weight and the position weight of each vocabulary in the n vocabularies to obtain the weight of each vocabulary in the n vocabularies.
According to an embodiment of the present disclosure, the step of obtaining the feature vocabulary of the document based on the magnitude of the weight includes: presetting a critical weight; and extracting the vocabulary with the weight larger than the critical weight from the n vocabularies to obtain the characteristic vocabulary of the document.
A second aspect of the present disclosure provides a document retrieval apparatus including: the acquisition module is used for acquiring a search keyword; the extraction module is used for extracting a characteristic vocabulary of each document in the standard document set, wherein the characteristic vocabulary is obtained based on the word frequency-inverse document frequency and the position information of the vocabulary; and the retrieval module is used for matching the retrieval keywords with the characteristic words to obtain retrieval results.
According to an embodiment of the present disclosure, the extraction module includes: the preprocessing submodule is used for preprocessing a document of a feature vocabulary to be extracted to obtain a vocabulary set corresponding to the document, wherein the vocabulary set comprises n vocabularies, and n is an integer greater than or equal to 1; a first determining submodule for determining a weight of each of the n vocabularies; and the second determining submodule is used for obtaining the characteristic vocabulary of the document based on the size of the weight.
According to the embodiment of the disclosure, the preprocessing submodule comprises a word segmentation submodule, and is used for performing word segmentation on a document of a feature word to be extracted to obtain a candidate word set; and the screening submodule is used for removing the stop vocabulary in the candidate vocabulary set based on the universal stop vocabulary table to obtain the vocabulary set corresponding to the document.
According to an embodiment of the present disclosure, the first determining submodule includes a first calculating submodule configured to calculate a word frequency-inverse document frequency of each of the n vocabularies; a third determining submodule, configured to determine a word frequency-inverse document frequency weight of each of the n vocabularies based on the word frequency-inverse document frequency; a fourth determining submodule, configured to determine a position weight of each of the n vocabularies based on position information of each of the n vocabularies; and a fifth determining submodule, configured to determine a weight of each of the n vocabularies according to the word frequency-inverse document frequency weight and the position weight of each of the n vocabularies.
According to an embodiment of the present disclosure, the first computation submodule is further configured to compute a word frequency of each of the n vocabularies; calculating an inverse document frequency for each of the n vocabularies; and multiplying the word frequency of each vocabulary in the n vocabularies by the inverse document frequency to obtain the word frequency-inverse document frequency of each vocabulary in the n vocabularies.
According to an embodiment of the present disclosure, the fifth determining sub-module is further configured to multiply the word frequency-inverse document frequency weight and the position weight of each of the n vocabularies to obtain the weight of each of the n vocabularies.
According to an embodiment of the present disclosure, the second determining submodule includes a preset submodule for presetting the critical weight; and the extraction submodule is used for extracting the vocabularies with the weight larger than the critical weight from the n vocabularies to obtain the characteristic vocabularies of the documents.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above-described document retrieval method.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described document retrieval method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described document retrieval method.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of embodiments of the disclosure, which proceeds with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a document retrieval method, apparatus, device, medium and program product according to embodiments of the disclosure;
FIG. 2 schematically shows a flow diagram of a document retrieval method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram for extracting feature vocabulary according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart for determining a weight for each vocabulary according to an embodiment of the present disclosure;
FIG. 5 is a block diagram schematically illustrating the structure of a document retrieval apparatus according to an embodiment of the present disclosure; and
FIG. 6 schematically shows a block diagram of an electronic device suitable for implementing a document retrieval method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
The terms used in the embodiments are explained as follows:
A canonical document is a document with a standardized format, such as a bank document. Such a document follows a specific writing format, and its core vocabulary usually appears at specific positions such as the title, the beginning, the end, and the heads of paragraphs, so the importance of a given vocabulary in the document can be determined from the position information of that vocabulary.
Term frequency-inverse document frequency (abbreviated TF-IDF) is used to assess the importance of a word to a document within a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency at which it appears across the corpus.
Stop words are words such as "according to" and "will" that are automatically filtered out before or after processing natural language data or text in information retrieval, in order to save storage space and improve retrieval efficiency.
Banking businesses are varied in type and extensive in content, and they continue to grow rapidly with the development of the internet. Even with intelligent customer service widely deployed, bank customer service staff inevitably need to answer questions from a large number of customers every day. At present, before answering a customer question, customer service staff generally search a document library using question keywords, matching the question keywords against every vocabulary of every document in the library to obtain the corresponding document content. Answering questions places high demands on timeliness and accuracy, but the method of matching the question keywords against every vocabulary of every document in the document library suffers from low retrieval precision and low retrieval speed.
In view of the foregoing problems, an embodiment of the present disclosure provides a document retrieval method applied to a canonical document set including a plurality of documents, each of the plurality of documents including a plurality of words, the method including: acquiring a retrieval keyword; extracting a feature vocabulary of each document in the canonical document set, wherein the feature vocabulary is obtained based on the word frequency-inverse document frequency and position information of the vocabulary; and matching the retrieval keyword with the feature vocabularies to obtain a retrieval result.
It should be noted that the method and apparatus provided by the present disclosure may be used for document retrieval in the financial field, and may also be used for document retrieval in any field other than the financial field.
Fig. 1 schematically shows an application scenario diagram of a document retrieval method, apparatus, device, medium, and program product according to embodiments of the present disclosure.
As shown in fig. 1, the application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the document retrieval method provided by the embodiment of the present disclosure may be generally executed by the server 105. Accordingly, the document retrieval apparatus provided by the embodiment of the present disclosure may be generally disposed in the server 105. The document retrieval method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the document retrieval apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The document retrieval method of the disclosed embodiment will be described in detail below through fig. 2 to 4 based on the scenario described in fig. 1.
FIG. 2 schematically shows a flow chart of a document retrieval method according to an embodiment of the present disclosure.
As shown in fig. 2, the document retrieval method of this embodiment includes operations S210 to S230.
In operation S210, a retrieval keyword is acquired. The retrieval keyword is either input directly by a user or extracted by customer service staff from the user's input information, and retrieval in the document set is carried out based on this keyword.
In operation S220, a feature vocabulary of each document in the canonical document set is extracted. The feature vocabulary is obtained based on the word frequency-inverse document frequency and the position information of the vocabulary; it is the vocabulary most representative of the document and is used to quickly locate the document within the document set.
In operation S230, the retrieval keyword is matched with the feature vocabularies to obtain a retrieval result. The document set contains many documents, but the feature vocabulary of each document is small and is the most representative of that document. By extracting feature vocabularies, the documents in the set are converted into groups of feature vocabularies, which greatly reduces the number of words that need to be matched. Matching the retrieval keyword against these groups of feature vocabularies therefore greatly improves both retrieval speed and retrieval accuracy.
For each document in the canonical document set, the document retrieval method provided by the embodiments of the disclosure extracts the most representative feature vocabulary using the word frequency-inverse document frequency and the position information of the vocabulary, thereby reducing the number of vocabularies in the set. During document retrieval, a result can be obtained quickly by matching the retrieval keyword only against the feature vocabularies, rather than against every vocabulary of every document in the document library, which greatly improves document retrieval efficiency and retrieval precision.
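As a minimal sketch of how operations S210 to S230 could fit together, the following Python example precomputes, for each document, its set of feature vocabularies and then matches a retrieval keyword against them; the function names, the index structure, and the sample data are illustrative assumptions and are not part of the disclosure.

    # Minimal sketch of the retrieval flow (S210-S230); all names are illustrative assumptions.
    def retrieve(keyword, feature_index):
        """feature_index maps a document id to its set of feature vocabularies."""
        results = []
        for doc_id, feature_words in feature_index.items():
            # A document matches when the retrieval keyword hits one of its few
            # feature vocabularies, instead of being compared with every word of the text.
            if keyword in feature_words:
                results.append(doc_id)
        return results

    # Toy index of two canonical documents.
    feature_index = {
        "doc_1": {"account", "interest", "deposit"},
        "doc_2": {"loan", "repayment", "rate"},
    }
    print(retrieve("loan", feature_index))  # ['doc_2']

Because each document contributes only a handful of feature vocabularies, the matching loop touches far fewer terms than a full scan of every word of every document in the library.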
Fig. 3 schematically shows a flow chart of extracting feature vocabulary according to an embodiment of the present disclosure.
As shown in fig. 3, the step of extracting the feature vocabulary of each document in the canonical document set of this embodiment includes operations S310 to S330.
In operation S310, a document of feature words to be extracted is preprocessed to obtain a word set corresponding to the document, where the word set includes n words, and n is an integer greater than or equal to 1.
According to the embodiment of the present disclosure, the step of preprocessing the document whose feature vocabulary is to be extracted to obtain the vocabulary set corresponding to the document includes: performing word segmentation on the document to obtain a candidate vocabulary set; and removing the stop words in the candidate vocabulary set based on a general stop word list to obtain the vocabulary set corresponding to the document. Word segmentation is a technique that uses an algorithm to extract the words in a sentence and convert the sentence into a combination of words. The specific word segmentation technique and the general stop word list may be chosen according to the actual situation: any technique that can convert a document into words, and any stop word list that removes common stop words to improve retrieval efficiency, may be used, and the embodiments of the present disclosure do not limit this choice. Removing stop words during preprocessing reduces the number of interfering vocabularies and further improves the speed of weight calculation and retrieval.
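A possible sketch of this preprocessing step in Python is shown below; it uses a simple regular-expression tokenizer in place of a full word-segmentation tool, and the stop-word list and function names are illustrative assumptions only.

    import re

    # Illustrative general stop-word list; a real system would load a complete
    # stop-word table appropriate to the language of the canonical documents.
    STOP_WORDS = {"the", "a", "of", "to", "and", "according", "will"}

    def preprocess(document_text):
        """Segment a document and drop stop words, returning its vocabulary list."""
        # Word segmentation: here a simple regex split stands in for any
        # segmentation algorithm that turns sentences into word combinations.
        candidate_words = re.findall(r"\w+", document_text.lower())
        # Remove stop words based on the general stop-word list.
        return [w for w in candidate_words if w not in STOP_WORDS]

    print(preprocess("According to the terms of the deposit account, interest accrues daily."))
    # ['terms', 'deposit', 'account', 'interest', 'accrues', 'daily']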
In operation S320, a weight of each of the n vocabularies is determined.
FIG. 4 schematically illustrates a flow diagram for determining a weight for each vocabulary according to an embodiment of the present disclosure.
As shown in fig. 4, the step of determining the weight of each of the n vocabularies of this embodiment includes operations S410 to S440.
In operation S410, a word frequency-inverse document frequency of each of the n vocabularies is calculated.
According to an embodiment of the present disclosure, the step of calculating the word frequency-inverse document frequency of each of the n vocabularies comprises: calculating the word frequency of each word in the n words; calculating an inverse document frequency for each of the n vocabularies; and multiplying the word frequency of each vocabulary in the n vocabularies by the inverse document frequency to obtain the word frequency-inverse document frequency of each vocabulary in the n vocabularies. Common words can be filtered out and important words can be reserved based on the word frequency-inverse document frequency.
Illustratively, for a specific word, calculating the word frequency-inverse document frequency involves calculating a word frequency and an inverse document frequency. The word frequency TF indicates how often a word occurs in a document; a high TF means the word appears frequently in that document. For a specific word t_i in document d_j, its word frequency TF_{i,j} is calculated as:

TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}

where n_{i,j} is the number of occurrences of the specific word t_i in document d_j, and \sum_k n_{k,j} is the total number of occurrences of all words in document d_j.

The main idea of the inverse document frequency IDF is: the fewer the documents in a document set that contain a word, the greater the IDF of that word, i.e., the word has good category-distinguishing ability and can be used to single out the documents that contain it. The inverse document frequency is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the resulting quotient. For a specific word t_i in the canonical document set, the inverse document frequency IDF_i is calculated as:

IDF_i = \log_{10} \frac{|D|}{|\{\, j : t_i \in d_j \,\}|}

where |D| is the total number of documents in the canonical document set and |\{ j : t_i \in d_j \}| is the number of documents d_j that contain the specific word t_i.
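A direct Python transcription of the two formulas above is sketched here; the function name and the list-of-word-lists input format are assumptions made for illustration, and no smoothing term is added beyond what the formulas state.

    import math
    from collections import Counter

    def tf_idf(documents):
        """documents: list of word lists; returns one {word: TF-IDF} dict per document."""
        total_docs = len(documents)                        # |D|
        # Number of documents containing each word, |{ j : t_i in d_j }|.
        doc_freq = Counter(w for doc in documents for w in set(doc))
        scores = []
        for doc in documents:
            counts = Counter(doc)
            total_words = sum(counts.values())             # sum_k n_{k,j}
            scores.append({
                w: (n / total_words)                       # TF_{i,j}
                   * math.log10(total_docs / doc_freq[w])  # IDF_i
                for w, n in counts.items()
            })
        return scores

    print(tf_idf([["loan", "rate", "loan"], ["deposit", "rate"]]))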
In operation S420, a word frequency-inverse document frequency weight is determined for each of the n vocabularies based on the word frequency-inverse document frequency.
In operation S430, a position weight of each of the n vocabularies is determined based on the position information of each of the n vocabularies. Since a canonical document has a specific writing format, the core vocabulary of the document usually appears at specific positions, for example, in the title, at the beginning, at the end, and at the heads of paragraphs; therefore, the importance of a vocabulary in the document can be determined from its position information. For example, the position information of a vocabulary includes the title, the beginning, the end, the paragraph head, and so on, and different weight values may be set according to the position: the weight values corresponding to the title, the beginning, the end, and the paragraph head are set as a first position weight, a second position weight, a third position weight, and a fourth position weight, respectively, where these values decrease in that order. For any vocabulary, it is first determined whether the vocabulary appears in the title of the document; if so, its position weight is set to the first position weight. Otherwise, it is determined whether the vocabulary appears at the beginning of the document; if so, its position weight is set to the second position weight. Otherwise, it is determined whether the vocabulary appears at the end of the document; if so, its position weight is set to the third position weight. Otherwise, it is determined whether the vocabulary appears at a paragraph head of the document; if so, its position weight is set to the fourth position weight. If, after these checks in this order, the vocabulary appears in none of the four positions, its position weight is set to a fifth position weight, where the fourth position weight is greater than the fifth position weight, indicating that the vocabulary's position weight is relatively low. This is because, in the documents of the canonical document set, the title carries rich information; when a vocabulary appears in the title of a document, giving it a higher weight value strengthens its importance. In practical applications, the specific values of the first through fifth position weights may be set as needed, and the embodiments of the present disclosure do not specifically limit them.
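The title / beginning / end / paragraph-head priority described above could be written as the following Python sketch; the concrete weight values and the assumed document structure are placeholders, since the disclosure only requires that the five position weights decrease in order.

    # Placeholder values; the disclosure only requires
    # first > second > third > fourth > fifth position weight.
    FIRST, SECOND, THIRD, FOURTH, FIFTH = 1.0, 0.8, 0.6, 0.4, 0.2

    def position_weight(word, doc):
        """doc is assumed to expose its title, beginning, end and paragraph heads as word lists."""
        if word in doc["title"]:
            return FIRST       # the title carries the richest information
        if word in doc["beginning"]:
            return SECOND
        if word in doc["end"]:
            return THIRD
        if word in doc["paragraph_heads"]:
            return FOURTH
        return FIFTH           # the word appears only at ordinary positions

    doc = {"title": ["deposit", "rate"], "beginning": ["account"],
           "end": [], "paragraph_heads": ["interest"]}
    print(position_weight("interest", doc))  # 0.4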
In operation S440, the weight of each of the n vocabularies is determined according to the word frequency-inverse document frequency weight and the position weight of each of the n vocabularies. The word frequency-inverse document frequency weight and the position weight may be added or multiplied according to the actual situation, or each weight may be given a different coefficient; for example, the coefficient of the word frequency-inverse document frequency weight may be set to 0.6 and the coefficient of the position weight to 0.4, and the weights multiplied by their respective coefficients and then added, so as to represent the relative importance of the two weights. The specific calculation method may be selected according to the actual situation, and the embodiments of the present disclosure are not limited in this regard. Because the final weight integrates both the word frequency-inverse document frequency weight and the position weight, the feature vocabulary finally screened out is the vocabulary most representative of the document, which greatly improves retrieval precision.
According to an embodiment of the present disclosure, the step of determining the weight of each of the n vocabularies according to the word frequency-inverse document frequency weight and the location weight of each of the n vocabularies comprises: multiplying the word frequency-inverse document frequency weight and the position weight of each vocabulary in the n vocabularies to obtain the weight of each vocabulary in the n vocabularies.
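Both combination variants mentioned above could be expressed as follows; the 0.6/0.4 coefficients are simply the example values from the description, not fixed parameters, and the function name is an illustrative assumption.

    def combined_weight(tfidf_weight, pos_weight, multiply=True):
        """Fuse the word frequency-inverse document frequency weight and the position weight."""
        if multiply:
            # Variant of this embodiment: simple product of the two weights.
            return tfidf_weight * pos_weight
        # Alternative from the description: weighted sum with example
        # coefficients 0.6 (TF-IDF weight) and 0.4 (position weight).
        return 0.6 * tfidf_weight + 0.4 * pos_weight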
Returning to fig. 3, in operation S330, the feature vocabulary of the document is obtained based on the magnitude of the weights. For example, a critical weight may be set, and for each document the words whose weight is greater than the critical weight are extracted as its feature words; alternatively, the number of feature words to be extracted may be set, for example, the words whose weights rank in the top 3 of each document are taken as its feature words. The choice may be made according to the number of feature words actually required and the preferred extraction manner, and the embodiments of the present disclosure are not limited in this regard. Because feature word extraction is carried out according to the magnitude of word weights, extraction is reduced to a numerical comparison, which makes the extraction process simpler while ensuring that the extracted feature words represent the document. At the same time, the number of feature words finally obtained can be flexibly chosen according to the actual retrieval requirements, which improves retrieval efficiency.
According to an embodiment of the present disclosure, the step of obtaining the feature vocabulary of the document based on the magnitude of the weight includes: presetting a critical weight; and extracting the vocabulary with the weight larger than the critical weight from the n vocabularies to obtain the characteristic vocabularies of the document, thereby greatly reducing the number of the characteristic vocabularies and improving the retrieval speed.
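A sketch of operation S330 under the critical-weight variant is given below; the threshold and sample weights are placeholders, and the top-k alternative mentioned above is included for comparison.

    def feature_vocabulary(word_weights, critical_weight=None, top_k=None):
        """word_weights: {word: weight} for one document; returns its feature vocabulary."""
        if critical_weight is not None:
            # Keep every word whose weight exceeds the preset critical weight.
            return {w for w, s in word_weights.items() if s > critical_weight}
        # Alternative from the description: keep the k highest-weighted words.
        ranked = sorted(word_weights, key=word_weights.get, reverse=True)
        return set(ranked[:top_k])

    print(feature_vocabulary({"loan": 0.9, "bank": 0.2, "rate": 0.7}, critical_weight=0.5))
    # {'loan', 'rate'}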
Based on the document retrieval method, the disclosure also provides a document retrieval device. The apparatus will be described in detail below with reference to fig. 5.
Fig. 5 schematically shows a block diagram of the structure of a document retrieval apparatus according to an embodiment of the present disclosure.
As shown in fig. 5, the document retrieval apparatus 500 of this embodiment is applied to a canonical document set including a plurality of documents each including a plurality of vocabularies, and includes an acquisition module 510, an extraction module 520, and a retrieval module 530.
An obtaining module 510, configured to obtain a search keyword. In an embodiment, the obtaining module 510 may be configured to perform the operation S210 described above, which is not described herein again.
An extracting module 520, configured to extract a feature vocabulary of each document in the canonical document set, where the feature vocabulary is obtained based on word frequency-inverse document frequency and location information of the vocabulary. In an embodiment, the extracting module 520 may be configured to perform the operation S220 described above, which is not described herein again.
And the retrieval module 530 is configured to match the retrieval keyword with the feature vocabulary to obtain a retrieval result. In an embodiment, the retrieving module 530 may be configured to perform the operation S230 described above, which is not described herein again.
According to an embodiment of the present disclosure, the extraction module includes: the preprocessing submodule is used for preprocessing a document of a feature vocabulary to be extracted to obtain a vocabulary set corresponding to the document, wherein the vocabulary set comprises n vocabularies, and n is an integer greater than or equal to 1; a first determining submodule for determining a weight of each of the n vocabularies; and the second determining submodule is used for obtaining the characteristic vocabulary of the document based on the size of the weight.
According to the embodiment of the disclosure, the preprocessing submodule comprises a word segmentation submodule, and is used for performing word segmentation on a document of a feature word to be extracted to obtain a candidate word set; and the screening submodule is used for removing the stop vocabulary in the candidate vocabulary set based on the universal stop vocabulary table to obtain the vocabulary set corresponding to the document.
According to an embodiment of the present disclosure, the first determining submodule includes a first calculating submodule configured to calculate a word frequency-inverse document frequency weight for each of the n vocabularies; a third determining submodule, configured to determine a position weight of each of the n vocabularies based on position information of each of the n vocabularies; and a fourth determining submodule, configured to determine a weight of each of the n vocabularies according to the word frequency-inverse document frequency weight and the position weight of each of the n vocabularies.
According to an embodiment of the present disclosure, the first computation submodule is further configured to compute a word frequency of each of the n vocabularies; calculating an inverse document frequency for each of the n vocabularies; and multiplying the word frequency of each vocabulary in the n vocabularies by the inverse document frequency to obtain the word frequency-inverse document frequency weight of each vocabulary in the n vocabularies.
According to an embodiment of the present disclosure, the fourth determining sub-module is further configured to multiply the word frequency-inverse document frequency weight and the position weight of each of the n vocabularies, so as to obtain a weight of each of the n vocabularies.
According to an embodiment of the present disclosure, the second determining submodule includes a preset submodule for presetting the critical weight; and the extraction submodule is used for extracting the vocabularies with the weight larger than the critical weight from the n vocabularies to obtain the characteristic vocabularies of the documents.
According to an embodiment of the present disclosure, any plurality of the obtaining module 510, the extracting module 520, and the retrieving module 530 may be combined and implemented in one module, or any one of the modules may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 510, the extracting module 520, and the retrieving module 530 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable manner of integrating or packaging a circuit, or may be implemented in any one of or a suitable combination of software, hardware, and firmware. Alternatively, at least one of the obtaining module 510, the extracting module 520 and the retrieving module 530 may be at least partly implemented as a computer program module, which when executed may perform a corresponding function.
FIG. 6 schematically shows a block diagram of an electronic device suitable for implementing a document retrieval method according to an embodiment of the present disclosure.
As shown in fig. 6, an electronic device 600 according to an embodiment of the present disclosure includes a processor 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. Processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 601 may also include onboard memory for caching purposes. Processor 601 may include a single processing unit or multiple processing units for performing different actions of a method flow according to embodiments of the disclosure.
In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. The processor 601 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 602 and/or RAM 603. It is to be noted that the programs may also be stored in one or more memories other than the ROM 602 and RAM 603. The processor 601 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
Electronic device 600 may also include input/output (I/O) interface 605, input/output (I/O) interface 605 also connected to bus 604, according to an embodiment of the disclosure. The electronic device 600 may also include one or more of the following components connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 602 and/or RAM 603 described above and/or one or more memories other than the ROM 602 and RAM 603.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated in the flow chart. The program code is for causing a computer system to perform the methods of the embodiments of the disclosure when the computer program product is run on the computer system.
The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 601. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed in the form of a signal on a network medium, downloaded and installed through the communication section 609, and/or installed from the removable medium 611. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program, when executed by the processor 601, performs the above-described functions defined in the system of the embodiments of the present disclosure. The above described systems, devices, apparatuses, modules, units, etc. may be implemented by computer program modules according to embodiments of the present disclosure.
In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high level procedural and/or object oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (11)

1. A document retrieval method applied to a canonical collection of documents that includes a plurality of documents, each of the plurality of documents including a plurality of words, the method comprising:
acquiring a retrieval keyword;
extracting a feature vocabulary of each document in the canonical collection of documents, wherein the feature vocabulary is obtained based on a word frequency-inverse document frequency and position information of the vocabulary; and
matching the retrieval keyword with the feature vocabulary to obtain a retrieval result.
2. The method of claim 1, wherein the step of extracting a feature vocabulary for each document in the set of canonical documents comprises:
preprocessing a document of a feature vocabulary to be extracted to obtain a vocabulary set corresponding to the document, wherein the vocabulary set comprises n vocabularies, and n is an integer greater than or equal to 1;
determining a weight for each of the n vocabularies; and
obtaining the feature vocabulary of the document based on a magnitude of the weight.
3. The method according to claim 2, wherein the step of preprocessing the document with feature vocabulary to be extracted to obtain a vocabulary set corresponding to the document comprises:
performing word segmentation processing on a document of a feature word to be extracted to obtain a candidate word set; and
removing stop words in the candidate word set based on a general stop word list to obtain the vocabulary set corresponding to the document.
4. The method of claim 3, wherein the step of determining the weight of each of the n words comprises:
calculating the word frequency-inverse document frequency of each vocabulary in the n vocabularies;
determining a word frequency-inverse document frequency weight for each of the n vocabularies based on the word frequency-inverse document frequency;
determining a location weight for each of the n words based on location information for each of the n words; and
determining the weight of each of the n words according to the word frequency-inverse document frequency weight and the location weight of each of the n words.
5. The method of claim 4, wherein said step of calculating the word frequency-inverse document frequency for each of said n words comprises:
calculating the word frequency of each word in the n words;
calculating an inverse document frequency for each of the n vocabularies; and
multiplying the word frequency of each vocabulary in the n vocabularies by the inverse document frequency to obtain the word frequency-inverse document frequency of each vocabulary in the n vocabularies.
6. The method of claim 4, wherein determining the weight for each of the n vocabularies based on the word frequency-inverse document frequency weight and the location weight for each of the n vocabularies comprises:
multiplying the word frequency-inverse document frequency weight and the position weight of each vocabulary in the n vocabularies to obtain the weight of each vocabulary in the n vocabularies.
7. The method of claim 2, wherein the step of deriving the feature vocabulary of the document based on the magnitude of the weight comprises:
presetting a critical weight; and
extracting the words whose weights are larger than the critical weight from the n words to obtain the feature vocabulary of the document.
8. A document retrieval apparatus for use with a canonical collection of documents that includes a plurality of documents that each include a plurality of words, the apparatus comprising:
the acquisition module is used for acquiring search keywords;
the extraction module is used for extracting a characteristic vocabulary of each document in the standard document set, wherein the characteristic vocabulary is obtained based on the word frequency-inverse document frequency and the position information of the vocabulary; and
and the retrieval module is used for matching the retrieval keywords with the characteristic words to obtain retrieval results.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.
10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 7.
11. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 7.
CN202210874619.XA 2022-07-22 2022-07-22 Document retrieval method, document retrieval device, electronic equipment and medium Pending CN115048495A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210874619.XA CN115048495A (en) 2022-07-22 2022-07-22 Document retrieval method, document retrieval device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210874619.XA CN115048495A (en) 2022-07-22 2022-07-22 Document retrieval method, document retrieval device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN115048495A true CN115048495A (en) 2022-09-13

Family

ID=83166704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210874619.XA Pending CN115048495A (en) 2022-07-22 2022-07-22 Document retrieval method, document retrieval device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN115048495A (en)

Similar Documents

Publication Publication Date Title
CN110069698B (en) Information pushing method and device
US9495387B2 (en) Images for a question answering system
CN111538837A (en) Method and device for analyzing enterprise operation range information
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
CN114091426A (en) Method and device for processing field data in data warehouse
CN113660541A (en) News video abstract generation method and device
CN111651552A (en) Structured information determination method and device and electronic equipment
CN111435406A (en) Method and device for correcting database statement spelling errors
CN111126073B (en) Semantic retrieval method and device
CN115048495A (en) Document retrieval method, document retrieval device, electronic equipment and medium
CN114691850A (en) Method for generating question-answer pairs, training method and device of neural network model
CN111368036B (en) Method and device for searching information
CN114445179A (en) Service recommendation method and device, electronic equipment and computer readable medium
CN113656538A (en) Method and device for generating regular expression, computing equipment and storage medium
CN112016017A (en) Method and device for determining characteristic data
CN113177116B (en) Information display method and device, electronic equipment, storage medium and program product
CN116610782B (en) Text retrieval method, device, electronic equipment and medium
CN106777403B (en) Information pushing method and device
CN110737757B (en) Method and apparatus for generating information
CN113704408A (en) Retrieval method, retrieval apparatus, electronic device, storage medium, and program product
CN117435616A (en) Recommendation method and device for production problem records, electronic equipment and medium
CN117149651A (en) Test method, test device, test equipment and storage medium
CN116483954A (en) Data processing method, device, equipment and storage medium
CN117951253A (en) Text retrieval method, apparatus, electronic device, storage medium, and program product
CN117113963A (en) Document difference output method, device, apparatus, medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination