CN117591640A - Document retrieval method, device, equipment and medium - Google Patents

Publication number
CN117591640A
Authority
CN
China
Prior art keywords
document
documents
vector
search result
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311369150.5A
Other languages
Chinese (zh)
Inventor
许磊超
张大成
韩堃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Orion Star Technology Co Ltd
Original Assignee
Beijing Orion Star Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Orion Star Technology Co Ltd
Priority to CN202311369150.5A
Publication of CN117591640A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G06F40/35 - Discourse or dialogue representation


Abstract

Embodiments of the present application relate to a document retrieval method, apparatus, device, and medium that reduce the search lexicon that must be maintained while also accommodating scenarios with very few documents. The method comprises: acquiring a question to be answered; performing word vector conversion on the question to obtain a corresponding semantic vector, retrieving, from a pre-established document vector library and based on the semantic vector, a target document vector whose similarity to the semantic vector satisfies a first preset condition, and taking the document corresponding to the target document vector as a first search result; performing word segmentation on the question to obtain the words contained in the question, screening out from the obtained words the target words present in a predetermined key vocabulary, and searching a preset document set based on the target words to obtain documents containing the target words as a second search result; and taking some or all of the documents in the first search result and/or some or all of the documents in the second search result as candidate documents for answering the question.

Description

Document retrieval method, device, equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a document retrieval method, apparatus, device, and medium.
Background
With the development of artificial intelligence technology, intelligent products such as intelligent customer service systems and intelligent robots have come into wide use. Such products can hold a dialogue with a user, for example by receiving a question that the user raises by voice or text and giving a corresponding answer.
In the related art, an intelligent product that answers user questions usually incorporates a search engine: after a question is received by voice or text, keywords are extracted from the question, and related documents are then obtained through keyword retrieval.
However, obtaining documents by search-engine keyword retrieval has the following problems in practice. First, a search engine is a retrieval technology based on an inverted index, and a huge lexicon must be maintained to ensure correct word segmentation and therefore correct document recall, which makes the system complex and costly. Second, the search engine scores the search results for a user's question using document-based weights; this scoring works better the more documents there are and performs poorly when there are few documents.
In summary, in the related art an intelligent product that retrieves related documents through a search engine must maintain a huge search lexicon and is ill-suited to scenarios with a small number of documents.
Disclosure of Invention
Embodiments of the present application provide a document retrieval method, apparatus, device, and medium that, when retrieving related documents for a question input by a user, reduce the search lexicon that must be maintained while also accommodating scenarios with very few documents.
In a first aspect, an embodiment of the present application provides a document retrieval method, including:
acquiring a question to be answered;
performing word vector conversion on the question to obtain a corresponding semantic vector, retrieving, from a pre-established document vector library and based on the semantic vector, a target document vector whose similarity to the semantic vector satisfies a first preset condition, and taking the document corresponding to the target document vector as a first search result;
performing word segmentation on the question to obtain the words contained in the question, screening out from the obtained words the target words present in a predetermined key vocabulary, and searching a preset document set based on the target words to obtain documents containing the target words as a second search result, wherein the key vocabulary is determined based on inverse document frequency (IDF) values of the words contained in the document fragments obtained by segmenting the documents in the preset document set;
and taking some or all of the documents in the first search result and/or some or all of the documents in the second search result as candidate documents for answering the question.
In a possible implementation of the method provided by the embodiments of the present application, screening out from the obtained words the target words present in the predetermined key vocabulary includes:
performing word vector conversion on the obtained words contained in the question to obtain corresponding word vectors;
determining, based on the obtained word vectors, synonyms of each word, wherein the similarity between the word vector of each synonym and the word vector of the corresponding word is greater than a preset threshold;
and screening out, from the obtained words contained in the question and the synonyms of each word, the target words present in the predetermined key vocabulary.
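The synonym expansion described above can be sketched as follows. This is only an illustration under stated assumptions: the toy table `WORD_VECTORS`, the use of cosine similarity, the threshold of 0.8, and all function names are hypothetical and are not specified by the application.

```python
import math

# Hypothetical toy word vectors; a real system would use a trained embedding model.
WORD_VECTORS = {
    "doctor": [0.9, 0.1, 0.0],
    "physician": [0.88, 0.12, 0.02],
    "parking": [0.0, 0.2, 0.95],
    "fee": [0.1, 0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def expand_with_synonyms(words, vectors, threshold):
    """Return the words plus every known word whose vector similarity exceeds the threshold."""
    expanded = set(words)
    for word in words:
        if word not in vectors:
            continue  # no vector available, so no synonyms can be found
        for candidate, vector in vectors.items():
            if candidate != word and cosine(vectors[word], vector) > threshold:
                expanded.add(candidate)
    return expanded

def screen_target_words(question_words, vectors, key_vocabulary, threshold=0.8):
    """Keep only the question words and synonyms that appear in the key vocabulary."""
    return expand_with_synonyms(question_words, vectors, threshold) & set(key_vocabulary)
```

With this sketch, a question word that is absent from the key vocabulary can still produce a target word through one of its synonyms.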
In a possible implementation of the method provided by the embodiments of the present application, the key vocabulary is determined as follows:
segmenting the documents contained in the preset document set to obtain a plurality of document fragments;
calculating the IDF value of each word contained in the document fragments;
and selecting the words whose IDF values satisfy a second preset condition as the key vocabulary corresponding to the preset document set.
In a possible implementation of the method provided by the embodiments of the present application, taking some or all of the documents in the first search result and/or some or all of the documents in the second search result as candidate documents for answering the question includes:
sorting the documents in the first search result based on the vector similarity between the document vector of each document in the first search result and the semantic vector corresponding to the question, to obtain a first sorting result;
determining, based on the key vocabulary, the keywords contained in each document in the second search result, determining a score for each document in the second search result from the IDF values of the keywords it contains by using a preset algorithm, and sorting the documents in the second search result by score to obtain a second sorting result;
and selecting, based on the first sorting result, some or all of the documents in the first search result as candidate documents for answering the question, and/or selecting, based on the second sorting result, some or all of the documents in the second search result as candidate documents for answering the question.
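The two sorting steps and the final selection can be sketched as follows. The application does not specify the preset scoring algorithm; summing the IDF values of the keywords in each document is an assumption made here for illustration, as are the function names and the cut-offs `k1` and `k2`.

```python
def rank_first_result(similarities):
    """similarities: doc_id -> similarity to the question vector. Highest first."""
    return sorted(similarities, key=lambda doc_id: similarities[doc_id], reverse=True)

def rank_second_result(doc_words, idf):
    """doc_words: doc_id -> set of words in the document; idf: key word -> IDF value.

    Assumed scoring rule: the score of a document is the sum of the IDF values
    of the key-vocabulary words it contains.
    """
    scores = {
        doc_id: sum(idf[word] for word in words if word in idf)
        for doc_id, words in doc_words.items()
    }
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ranked, scores

def select_candidates(first_ranked, second_ranked, k1, k2):
    """Take the top k1 and top k2 documents of the two rankings, without duplicates."""
    candidates = []
    for doc_id in first_ranked[:k1] + second_ranked[:k2]:
        if doc_id not in candidates:
            candidates.append(doc_id)
    return candidates
```

Setting `k1` or `k2` to zero corresponds to using only one of the two search results, which the "and/or" wording above permits.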
In a possible implementation, the method provided by the embodiments of the present application further includes:
updating the key vocabulary and the document vector library when it is determined that the documents in the preset document set have been updated.
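One way to realize this update step is sketched below. How an update is detected is not stated in the application; comparing a content hash of the document set is an assumption, and the class `RetrievalIndex` and the builder callbacks are hypothetical names.

```python
import hashlib

class RetrievalIndex:
    """Holds the key vocabulary and document vector library for one document set."""

    def __init__(self, build_vocabulary, build_vectors):
        # Caller-supplied builders: documents -> key vocabulary / vector library.
        self._build_vocabulary = build_vocabulary
        self._build_vectors = build_vectors
        self._fingerprint = None
        self.key_vocabulary = None
        self.document_vectors = None

    @staticmethod
    def _fingerprint_of(documents):
        """Hash document ids and texts in a stable order (assumed change detector)."""
        digest = hashlib.sha256()
        for doc_id in sorted(documents):
            digest.update(doc_id.encode("utf-8"))
            digest.update(b"\x00")
            digest.update(documents[doc_id].encode("utf-8"))
            digest.update(b"\x00")
        return digest.hexdigest()

    def refresh(self, documents):
        """Rebuild both structures if the document set changed; return True if rebuilt."""
        fingerprint = self._fingerprint_of(documents)
        if fingerprint == self._fingerprint:
            return False
        self.key_vocabulary = self._build_vocabulary(documents)
        self.document_vectors = self._build_vectors(documents)
        self._fingerprint = fingerprint
        return True
```

Rebuilding both structures together keeps the key vocabulary and the vector library consistent with the same version of the document set.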
In a second aspect, an embodiment of the present application provides a document retrieval apparatus, including:
an acquisition unit for acquiring a question to be answered;
a first retrieval unit for performing word vector conversion on the question to obtain a corresponding semantic vector, retrieving, from a pre-established document vector library and based on the semantic vector, a target document vector whose similarity to the semantic vector satisfies a first preset condition, and taking the document corresponding to the target document vector as a first search result;
a second retrieval unit for performing word segmentation on the question to obtain the words contained in the question, screening out from the obtained words the target words present in a predetermined key vocabulary, and searching a preset document set based on the target words to obtain documents containing the target words as a second search result, wherein the key vocabulary is determined based on inverse document frequency (IDF) values of the words contained in the document fragments obtained by segmenting the documents in the preset document set;
and a processing unit for taking some or all of the documents in the first search result and/or some or all of the documents in the second search result as candidate documents for answering the question.
In a possible implementation of the apparatus provided by the embodiments of the present application, the second retrieval unit is specifically configured to:
perform word vector conversion on the obtained words contained in the question to obtain corresponding word vectors;
determine, based on the obtained word vectors, synonyms of each word, wherein the similarity between the word vector of each synonym and the word vector of the corresponding word is greater than a preset threshold;
and screen out, from the obtained words contained in the question and the synonyms of each word, the target words present in the predetermined key vocabulary.
In a possible implementation of the apparatus provided by the embodiments of the present application, the second retrieval unit determines the key vocabulary as follows:
segmenting the documents contained in the preset document set to obtain a plurality of document fragments;
calculating the IDF value of each word contained in the document fragments;
and selecting the words whose IDF values satisfy a second preset condition as the key vocabulary corresponding to the preset document set.
In a possible implementation of the apparatus provided by the embodiments of the present application, the processing unit is specifically configured to:
sort the documents in the first search result based on the vector similarity between the document vector of each document in the first search result and the semantic vector corresponding to the question, to obtain a first sorting result;
determine, based on the key vocabulary, the keywords contained in each document in the second search result, determine a score for each document in the second search result from the IDF values of the keywords it contains by using a preset algorithm, and sort the documents in the second search result by score to obtain a second sorting result;
and select, based on the first sorting result, some or all of the documents in the first search result as candidate documents for answering the question, and/or select, based on the second sorting result, some or all of the documents in the second search result as candidate documents for answering the question.
In a possible implementation, the apparatus provided by the embodiments of the present application further includes an updating unit for updating the key vocabulary and the document vector library when it is determined that the documents in the preset document set have been updated.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor, at least one memory, and computer program instructions stored in the memory, which when executed by the processor, implement the method as provided by the first aspect of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement a method as provided in the first aspect of embodiments of the present application.
In the embodiments of the present application, after a question to be answered is acquired, on the one hand, word vector conversion is performed on the question to obtain a corresponding semantic vector, a target document vector whose similarity to the semantic vector satisfies a first preset condition is retrieved from a pre-established document vector library based on the semantic vector, and the document corresponding to the target document vector is taken as a first search result; on the other hand, word segmentation is performed on the question to obtain the words contained in the question, the target words present in a predetermined key vocabulary are screened out from the obtained words, and a preset document set is searched based on the target words to obtain documents containing the target words as a second search result. Finally, some or all of the documents in the first search result and/or some or all of the documents in the second search result are taken as candidate documents for answering the question.
On the one hand, with retrieval based on semantic vectors, the documents in the search result can be ranked by the similarity between their document vectors and the semantic vector regardless of whether there are many documents or few, which improves retrieval precision and the reliability of the search results. On the other hand, word segmentation is performed on the question to be answered and the words it contains are filtered using the key vocabulary (for example, some special words are filtered), after which the search is performed on the filtered key words. Retrieval can therefore be completed by maintaining only a general lexicon rather than a huge one, which reduces system complexity and cost. By combining retrieval based on semantic vectors with retrieval based on the filtered key words, the embodiments of the present application, compared with the prior-art use of a search engine, do not require a huge search lexicon to be maintained, reduce system complexity and cost, and can still score the retrieved documents in scenarios with few documents, thereby improving retrieval precision and the reliability of the search results.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
FIG. 1 is a schematic diagram of an optional application scenario in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a document retrieval method provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of document retrieval based on the words contained in a question, provided in an embodiment of the present application;
FIG. 4 is a detailed flow chart of a document retrieval method provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a document retrieval apparatus provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of another electronic device provided in an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are some, but not all, of the embodiments of the technical solutions of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments described in the present application without inventive effort fall within the scope of protection of the technical solutions of the present application.
The following briefly describes the design concept of the embodiment of the present application:
With the development of artificial intelligence technology, intelligent products such as intelligent customer service systems and intelligent robots have come into wide use. Such products can hold a dialogue with a user, for example by receiving a question that the user raises by voice or text and giving a corresponding answer.
In the related art, an intelligent product that answers user questions usually incorporates a search engine: after a question is received by voice or text, keywords are extracted from the question, and related documents are then obtained through keyword retrieval.
However, obtaining documents by search-engine keyword retrieval has the following problems in practice. First, a search engine is a retrieval technology based on an inverted index, and a huge lexicon must be maintained to ensure correct word segmentation and therefore correct document recall, which makes the system complex and costly. Second, the search engine scores the search results for a user's question using document-based weights; this scoring works better the more documents there are and performs poorly when there are few documents.
In view of this, embodiments of the present application provide a document retrieval method, apparatus, device, and medium. After a question to be answered is acquired, on the one hand, word vector conversion is performed on the question to obtain a corresponding semantic vector, a target document vector whose similarity to the semantic vector satisfies a first preset condition is retrieved from a pre-established document vector library based on the semantic vector, and the document corresponding to the target document vector is taken as a first search result; on the other hand, word segmentation is performed on the question to obtain the words contained in the question, the target words present in a predetermined key vocabulary are screened out from the obtained words, and a preset document set is searched based on the target words to obtain documents containing the target words as a second search result. Finally, some or all of the documents in the first search result and/or some or all of the documents in the second search result are taken as candidate documents for answering the question.
On the one hand, with retrieval based on semantic vectors, the documents in the search result can be ranked by the similarity between their document vectors and the semantic vector regardless of whether there are many documents or few, which improves retrieval precision and the reliability of the search results. On the other hand, word segmentation is performed on the question to be answered and the words it contains are filtered using the key vocabulary (for example, some special words are filtered), after which the search is performed on the filtered key words. Retrieval can therefore be completed by maintaining only a general lexicon rather than a huge one, which reduces system complexity and cost. By combining retrieval based on semantic vectors with retrieval based on the filtered key words, the embodiments of the present application, compared with the prior-art use of a search engine, do not require a huge search lexicon to be maintained, reduce system complexity and cost, and can still score the retrieved documents in scenarios with few documents, thereby improving retrieval precision and the reliability of the search results.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and are not intended to limit the present application, and embodiments and features of embodiments of the present application may be combined with each other without conflict.
FIG. 1 is a schematic diagram of an application scenario in an embodiment of the present application. The application scenario includes any one of a plurality of intelligent products 110 and any one of a plurality of servers 120.
In this embodiment of the present application, the intelligent products 110 include, but are not limited to, mobile phones, computers, and intelligent robots; the server 120 is a background server of the intelligent product. The server 120 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
Note that the document retrieval method in the embodiments of the present application may be executed by the server 120 or by the intelligent product 110; the embodiments of the present application do not limit this.
Taking execution in the intelligent product 110 as an example, after the intelligent product 110 acquires the question to be answered, on the one hand, it performs word vector conversion on the question to obtain a corresponding semantic vector, retrieves, from a pre-established document vector library and based on the semantic vector, a target document vector whose similarity to the semantic vector satisfies a first preset condition, and takes the document corresponding to the target document vector as a first search result; on the other hand, it performs word segmentation on the question to obtain the words contained in the question, screens out from the obtained words the target words present in a predetermined key vocabulary, and searches a preset document set based on the target words to obtain documents containing the target words as a second search result. Finally, it takes some or all of the documents in the first search result and/or some or all of the documents in the second search result as candidate documents for answering the question.
In an optional embodiment, the intelligent product 110 and the server 120 may communicate over a communication network, which may be wired or wireless.
It should be noted that the number of intelligent products and servers and the manner of communication are not limited in practice. When there are multiple servers, they may be organized as a blockchain, with each server being a node on the blockchain; this is not specifically limited in the embodiments of the present application.
The document retrieval method provided by the exemplary embodiments of the present application is described below with reference to the accompanying drawings and the application scenario described above. It should be noted that the application scenario is shown only to aid understanding of the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect.
FIG. 2 shows a flow chart of an implementation of the document retrieval method in an embodiment of the present application, which comprises steps S201-S204 as follows:
S201, acquiring a question to be answered.
The question to be answered may be a question input by a user by voice or text, a question the user input in the past, a question the user inputs in real time, or a test question generated by a tester or by other application software; the embodiments of the present application do not limit this.
It should be noted that if the user asks by voice, the voice data may be converted into text after it is acquired in order to obtain the user's question; the manner of converting voice data into text is not limited in the embodiments of the present application.
S202, performing word vector conversion on the question to obtain a corresponding semantic vector, retrieving, from a pre-established document vector library and based on the semantic vector, a target document vector whose similarity to the semantic vector satisfies a first preset condition, and taking the document corresponding to the target document vector as a first search result.
It should be noted that the pre-established document vector library may be obtained by performing word vector conversion on the documents in the preset document set. In practice, the documents may first be segmented into a plurality of document fragments, and word vector conversion may then be performed on the fragments to obtain document vectors, thereby establishing the document vector library.
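Building the document vector library from fragments might look like the sketch below. The fixed-length character split, the `max_chars` parameter, and the caller-supplied `embed` function are assumptions for illustration; the application leaves both the segmentation method and the embedding model open.

```python
def split_into_fragments(text, max_chars=200):
    """Split a document into fragments of at most max_chars characters (assumed strategy)."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def build_vector_library(documents, embed, max_chars=200):
    """documents: doc_id -> text; embed: fragment text -> vector.

    Returns a list of (doc_id, fragment, vector) entries, one per fragment.
    """
    library = []
    for doc_id, text in documents.items():
        for fragment in split_into_fragments(text, max_chars):
            library.append((doc_id, fragment, embed(fragment)))
    return library
```

Keeping the document identifier next to each fragment vector lets a matching fragment be traced back to the document it came from.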
The preset document set may be configured for the specific field in which the intelligent product is used. For example, for an intelligent voice robot used in medical settings such as hospitals, documents about the hospital where it is deployed, documents about diseases, and the like may be added to the preset document set; for an intelligent robot used in settings such as an enterprise building, documents introducing the enterprise building may be added to the preset document set.
In a specific implementation, an existing language model such as word2vec, GloVe, ELMo, or BERT may be used both to perform word vector conversion on the question to obtain the corresponding semantic vector and to perform word vector conversion on the documents in the preset document set to obtain the document vectors; the embodiments of the present application do not limit this.
In practice, after the semantic vector corresponding to the question is obtained, the similarity between that semantic vector and each document vector in the pre-established document vector library is calculated, the target document vectors whose similarity satisfies the first preset condition are retrieved, and the documents or document fragments corresponding to the target document vectors are taken as the first search result. The first preset condition may be set according to the actual situation, for example that the similarity is greater than a set threshold (such as 0.9 or 0.95).
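A minimal sketch of this threshold-based retrieval is given below. Cosine similarity is assumed as the similarity measure, since the application does not name one, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_by_vector(question_vector, library, threshold=0.9):
    """library: doc_id -> document vector.

    Returns (doc_id, similarity) pairs whose similarity exceeds the threshold,
    sorted from most to least similar.
    """
    hits = []
    for doc_id, vector in library.items():
        similarity = cosine_similarity(question_vector, vector)
        if similarity > threshold:
            hits.append((doc_id, similarity))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Because each document is scored independently against the question, the ranking does not depend on how many documents the library holds.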
S203, performing word segmentation on the question to obtain the words contained in the question, screening out from the obtained words the target words present in a predetermined key vocabulary, and searching a preset document set based on the target words to obtain documents containing the target words as a second search result, wherein the key vocabulary is determined based on inverse document frequency (IDF) values of the words contained in the document fragments obtained by segmenting the documents in the preset document set.
The predetermined key vocabulary is determined in advance from the preset document set, specifically as follows: the documents in the preset document set are segmented into a plurality of document fragments, the IDF value of each word contained in the document fragments is calculated, and the words whose IDF values satisfy a second preset condition are selected as the key vocabulary corresponding to the preset document set.
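The IDF calculation over document fragments can be sketched as below. The application does not fix a formula; the common unsmoothed form idf(w) = log(N / df(w)), where N is the number of fragments and df(w) the number of fragments containing w, is assumed here, and the fragments are assumed to be already tokenized.

```python
import math

def compute_idf(fragments):
    """fragments: list of token lists, one per document fragment.

    Returns word -> log(N / df), with N the number of fragments and df the
    number of fragments that contain the word.
    """
    total = len(fragments)
    fragment_counts = {}
    for tokens in fragments:
        for word in set(tokens):  # count each word once per fragment
            fragment_counts[word] = fragment_counts.get(word, 0) + 1
    return {word: math.log(total / count) for word, count in fragment_counts.items()}
```

Under this formula a word that occurs in every fragment gets an IDF of zero, while a word confined to a few fragments gets a higher value.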
The second preset condition may be that the IDF value is greater than a set value, or that the word falls within the first n% of all words sorted by IDF value in descending order, where n is a natural number greater than or equal to 1 and less than or equal to 100. Other conditions may of course be used in other embodiments of the present application, which is not specifically limited.
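A minimal sketch of building the key vocabulary under either example condition is given below. The application leaves the IDF calculation to existing methods, so the formula log(N / df) used here is an assumption, as are the function names and the pre-tokenized toy fragments.

```python
import math

def compute_idf(fragments):
    """IDF per word over tokenized fragments, using log(N / df) (assumed formula)."""
    total = len(fragments)
    document_frequency = {}
    for fragment in fragments:
        for word in set(fragment):  # count each word once per fragment
            document_frequency[word] = document_frequency.get(word, 0) + 1
    return {w: math.log(total / df) for w, df in document_frequency.items()}

def build_key_vocabulary(idf, top_percent=99, min_idf=None):
    """Keep words above min_idf, or else the first top_percent% by descending IDF."""
    if min_idf is not None:
        return {w: v for w, v in idf.items() if v > min_idf}
    ranked = sorted(idf.items(), key=lambda item: item[1], reverse=True)
    keep = math.ceil(len(ranked) * top_percent / 100)
    return dict(ranked[:keep])

# Hypothetical document fragments, already segmented into words.
fragments = [
    ["robot", "battery", "the"],
    ["robot", "charging", "the"],
    ["warranty", "the"],
]
idf = compute_idf(fragments)
key_vocabulary = build_key_vocabulary(idf, top_percent=80)
```

With these toy fragments the word "the" appears in every fragment, receives an IDF of zero, and is the only word dropped at the 80% cut-off.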
In a specific implementation, the word segmentation of the question may use an existing method, which is not limited in the embodiments of the present application. After the words contained in the question are obtained, they are filtered against the predetermined key vocabulary to screen out the target words that exist in it, and the preset document set is then searched based on the target words to obtain the documents containing them as the second search result.
It should be noted that, in other embodiments of the present application, the words contained in the question may instead be filtered by maintaining a stop-word list: unimportant stop words are filtered out and the remaining important words are kept as target words, and the preset document set is then searched based on the target words to obtain the documents containing them as the second search result.
In one example, as shown in fig. 3, step S203 may include the following steps when implemented:
S2031, segmenting the documents in the preset document set to obtain document fragments. The specific segmentation method may be an existing method, which is not limited in the embodiments of the present application.
S2032, calculating the IDF value of each word in the document fragment, wherein the specific calculation mode can be an existing mode, and the embodiment of the application is not limited to this.
S2033, the IDF values of all words are arranged in descending order.
S2034, retaining the first 99% of the words in the descending order to obtain the key vocabulary corresponding to the preset document set.
It should be noted that S2031-S2034 may be completed in advance in practical application; as long as the documents in the preset document set are not changed or updated, these steps are performed only once and need not be repeated each time a question to be answered is acquired.
S2035, a question to be answered is obtained.
S2036, word segmentation processing is carried out on the questions to obtain words contained in the questions.
S2037, among the words contained in the question, the target word existing in the key vocabulary is determined.
S2038, searching in the preset document set based on the target word, to obtain a document containing the target word.
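Steps S2035 to S2038 can be sketched as follows. The whitespace-based `segment` function is only a placeholder for whichever existing word segmentation method is used, and the key vocabulary, document set, and function names are hypothetical.

```python
def segment(text):
    """Placeholder word segmentation: lowercase and split on whitespace."""
    return text.lower().split()

def second_search(question, key_vocabulary, document_set):
    """Return the target words and the ids of documents containing any of them."""
    words = segment(question)                                  # S2036
    target_words = [w for w in words if w in key_vocabulary]   # S2037
    hits = []
    for doc_id, text in document_set.items():                  # S2038
        doc_words = set(segment(text))
        if any(w in doc_words for w in target_words):
            hits.append(doc_id)
    return target_words, hits

# Hypothetical key vocabulary (word -> IDF value) and preset document set.
key_vocabulary = {"battery": 1.1, "warranty": 1.1}
document_set = {
    "doc_a": "Battery charging guide",
    "doc_b": "Warranty terms",
    "doc_c": "Getting started",
}
target_words, second_result = second_search(
    "How long does the battery last", key_vocabulary, document_set
)
```

Re-segmenting every document per query keeps the sketch short; a practical system would more likely precompute an inverted index from words to document ids.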
S204, taking part or all of the documents in the first search result and/or part or all of the documents in the second search result as candidate documents for answering the question.
In a specific implementation, after the first search result and the second search result are obtained, the documents in each result may first be scored and sorted by score, and part or all of the documents are then selected from the first search result and/or the second search result, based on the sorting results, as candidate documents for answering the question. Specifically:
the documents in the first search result are sorted based on the vector similarity between the document vector of each document in the first search result and the semantic vector corresponding to the question, to obtain a first sorting result; the keywords contained in each document in the second search result are determined based on the key vocabulary, a score value for each document in the second search result is determined with a preset algorithm based on the IDF values of the keywords the document contains, and the documents in the second search result are sorted by score value to obtain a second sorting result; part or all of the documents are then selected from the first search result based on the first sorting result, and/or from the second search result based on the second sorting result, as candidate documents for answering the question.
When the documents in the second search result are ranked, the IDF value of each keyword may be preprocessed so that important words contribute a higher score when the preset algorithm determines the score value of each document from the keywords it contains. For example, the square of each IDF value is calculated, and the sum of the squared IDF values of all keywords contained in a document is taken as the score value of that document. In other embodiments of the present application the preset algorithm may of course be another algorithm, which is not limited in the embodiments of the present application.
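The example scoring rule, the sum of squared IDF values of the keywords a document contains, can be sketched as below. Counting each distinct keyword once per document is an assumption, since the text does not say whether repeated occurrences are counted; the function names and toy values are hypothetical.

```python
def score_document(doc_words, key_vocabulary_idf):
    """Sum of squared IDF values over the distinct keywords in the document."""
    keywords = set(doc_words) & set(key_vocabulary_idf)
    return sum(key_vocabulary_idf[w] ** 2 for w in keywords)

def rank_second_result(documents, key_vocabulary_idf):
    """Sort (doc_id, score) pairs by score in descending order."""
    scored = [
        (doc_id, score_document(words, key_vocabulary_idf))
        for doc_id, words in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical IDF values and segmented documents from the second search result.
key_vocabulary_idf = {"battery": 2.0, "robot": 1.0, "warranty": 3.0}
documents = {
    "doc_a": ["robot", "battery", "guide"],
    "doc_b": ["warranty", "terms"],
    "doc_c": ["robot", "overview"],
}
second_ranking = rank_second_result(documents, key_vocabulary_idf)
```

Squaring widens the gap between rare and common keywords: here a single keyword with IDF 3.0 outscores two keywords with IDF 2.0 and 1.0, which a plain sum would not distinguish as sharply.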
In the prior art, documents are retrieved with a search engine. Because a search engine is insensitive to semantics, it cannot retrieve the correct document when the question contains synonyms, and a synonym dictionary usually has to be introduced externally together with secondary development.
To handle questions that may contain synonyms, in the embodiments of the present application the synonyms of each word may be determined first and then also used as candidate words for the search. Specifically: after the words contained in the question are obtained, word vector conversion is performed on them to obtain the corresponding word vectors, and the synonyms of each word are determined based on these word vectors, where the similarity between the word vector of a synonym and the word vector of the word is greater than a preset threshold; the target words that exist in the predetermined key vocabulary are then screened out from the words contained in the question and the synonyms of each word. The preset threshold may be set according to an empirical value, for example 0.9 or 0.95.
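The synonym expansion described above might look like the following sketch. Cosine similarity over a small in-memory table of word vectors is an assumption made for illustration; the table `word_vectors`, the function names, and the two-dimensional toy vectors are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def find_synonyms(word, word_vectors, threshold=0.9):
    """Words whose vector similarity to `word` exceeds the preset threshold."""
    if word not in word_vectors:
        return []
    return [
        other for other, vector in word_vectors.items()
        if other != word
        and cosine_similarity(word_vectors[word], vector) > threshold
    ]

def expand_and_filter(words, word_vectors, key_vocabulary, threshold=0.9):
    """Screen target words from the question words plus their synonyms."""
    target_words = []
    for word in words:
        for candidate in [word] + find_synonyms(word, word_vectors, threshold):
            if candidate in key_vocabulary and candidate not in target_words:
                target_words.append(candidate)
    return target_words

# Hypothetical word vectors and key vocabulary.
word_vectors = {
    "price": [1.0, 0.0],
    "cost": [0.95, 0.05],
    "battery": [0.0, 1.0],
}
key_vocabulary = {"cost", "battery"}
synonym_targets = expand_and_filter(["price"], word_vectors, key_vocabulary)
```

In this toy case "price" is not in the key vocabulary, but its synonym "cost" is, so a document that only uses "cost" can still be retrieved.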
In addition, in a specific implementation, if it is determined that the documents in the preset document set have been updated, for example a document has been added or deleted or a document version has been updated, the embodiments of the present application may update the key vocabulary and the document vector library corresponding to the preset document set.
Taking execution of the document retrieval method in an intelligent product as an example, a specific flow of the document retrieval method provided by the embodiments of the present application is described below with reference to fig. 4. As shown in fig. 4, the flow includes:
S401, acquiring the latest document vector library and key vocabulary.
In particular, the maintenance of the document vector library and the key vocabulary may be performed in a server, and when the preset document set is updated, for example, the document is updated, the document is deleted, etc., the server updates the document vector library and the key vocabulary corresponding to the preset document set.
The intelligent product may periodically obtain the latest document vector library and the key vocabulary from the server, or obtain the latest document vector library and the key vocabulary from the server when the intelligent product is started every time, or obtain the latest document vector library and the key vocabulary from the server every time a question to be replied is obtained, which is not limited in the embodiment of the present application.
S402, obtaining a question to be answered.
S403, performing word vector conversion on the question to obtain the corresponding semantic vector.
S404, calculating the similarity between the obtained semantic vector and each document vector in the document vector library.
S405, determining a first search result based on the calculated similarity. For example, the documents corresponding to the document vectors whose similarity to the semantic vector is greater than a set threshold may be used as the first search result.
S406, sorting the documents contained in the first search result based on the similarity between the document vector of each document in the first search result and the semantic vector corresponding to the question.
S407, performing word segmentation processing on the question to obtain the words contained in the question.
S408, performing word vector conversion on the words contained in the question, and determining the synonyms of each word.
S409, determining target words existing in the key vocabulary in terms contained in the question and synonyms of each term.
S410, searching in a preset document set based on the target words to obtain documents containing the target words as a second search result.
S411, calculating the grading value of each document in the second search result by using a preset algorithm based on the IDF value of the keyword contained in each document in the second search result, and sorting the documents contained in the second search result.
It should be noted that S403 to S406 constitute the document retrieval method based on the semantic vector, and S407 to S411 constitute the document retrieval method based on the words contained in the question. The two retrieval methods are used in combination in the embodiments of the present application: they may be executed in parallel, S403 to S406 may be executed first and then S407 to S411, or S407 to S411 may be executed first and then S403 to S406, which is not limited in the embodiments of the present application.
S412, selecting part or all of the documents from the first search results based on the first sorting results as candidate documents for answering the questions, and/or selecting part or all of the documents from the second search results based on the second sorting results as candidate documents for answering the questions.
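The combination of the two retrieval paths in S403 to S412 can be sketched as follows, here with the parallel execution option. The two path functions are stubs that stand in for S403-S406 and S407-S411 and return already sorted document ids; the top-k cut-offs and the removal of duplicates are assumptions, since the text only requires selecting part or all of each result.

```python
from concurrent.futures import ThreadPoolExecutor

def select_candidates(first_ranking, second_ranking, top_k_first=2, top_k_second=2):
    """Take the top documents of each ranking, keeping the first occurrence of each id."""
    candidates = []
    for doc_id in first_ranking[:top_k_first] + second_ranking[:top_k_second]:
        if doc_id not in candidates:
            candidates.append(doc_id)
    return candidates

def retrieve(question, semantic_path, keyword_path):
    """Run both retrieval paths in parallel, then perform the S412 selection."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first_future = pool.submit(semantic_path, question)   # S403-S406
        second_future = pool.submit(keyword_path, question)   # S407-S411
        return select_candidates(first_future.result(), second_future.result())

# Stub retrieval paths returning hypothetical sorted document ids.
def semantic_path(question):
    return ["doc_a", "doc_b", "doc_c"]

def keyword_path(question):
    return ["doc_b", "doc_d"]

candidates = retrieve("How long does the battery last", semantic_path, keyword_path)
```

Because the two paths do not depend on each other, running them sequentially in either order produces the same candidate list as the parallel version.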
Based on the same inventive concept, the embodiment of the present application further provides a document retrieval device, as shown in fig. 5, where the document retrieval device provided in the embodiment of the present application includes:
an obtaining unit 501, configured to obtain a question to be answered;
a first search unit 502, configured to perform word vector conversion on the question to obtain a corresponding semantic vector, search, based on the semantic vector, a pre-established document vector library for a target document vector whose similarity to the semantic vector meets a first preset condition, and take a document corresponding to the target document vector as a first search result;
a second search unit 503, configured to perform word segmentation processing on the question to obtain the words contained in the question, screen out from the obtained words the target words existing in a predetermined key vocabulary, and search in a preset document set based on the target words to obtain documents containing the target words as a second search result, wherein the key vocabulary is determined based on the inverse document frequency (IDF) values of the words contained in the document fragments obtained by segmenting the documents in the preset document set;
a processing unit 504, configured to use part or all of the documents in the first search result and/or part or all of the documents in the second search result as candidate documents for replying to the question.
In a possible implementation, the second retrieving unit 503 is specifically configured to:
performing word vector conversion on the obtained words contained in the question to obtain corresponding word vectors;
based on the obtained word vectors, determining synonyms of each word respectively, wherein the similarity between the word vector of the synonym of each word and the word vector corresponding to each word is larger than a preset threshold;
and screening out target words existing in the predetermined key vocabulary from the obtained words contained in the question and the synonyms of each word.
In one possible implementation, the second retrieval unit 503 determines the key vocabulary in the following way:
segmenting documents contained in a preset document set to obtain a plurality of document fragments;
calculating an IDF value of words contained in each document fragment;
and selecting words with IDF values meeting a second preset condition as a key vocabulary corresponding to the preset document set.
In a possible implementation, the processing unit 504 is specifically configured to:
sorting the documents in the first search result based on the vector similarity between the document vector of each document in the first search result and the semantic vector corresponding to the question, to obtain a first sorting result;
determining keywords contained in each document in the second search result based on the key vocabulary, determining a score value corresponding to each document in the second search result based on an IDF value of the keywords contained in each document by using a preset algorithm, and sorting the documents in the second search result according to the score value corresponding to each document to obtain a second sorting result;
selecting part or all of the documents from the first search result based on the first sorting result as candidate documents for replying to the question, and/or selecting part or all of the documents from the second search result based on the second sorting result as candidate documents for replying to the question.
In one possible embodiment, the document retrieval device further includes: an updating unit 505, configured to update the key vocabulary and the document vector library when it is determined that the documents in the preset document set have been updated.
The embodiment of the application also provides electronic equipment based on the same inventive concept as the embodiment of the method. The electronic device may answer questions posed by the user via document retrieval. In one embodiment, the electronic device may be a server, such as server 120 shown in FIG. 1. In this embodiment, the electronic device may be configured as shown in fig. 6, including a memory 601, a communication module 603, and one or more processors 602.
A memory 601 for storing a computer program for execution by the processor 602. The memory 601 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, programs required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 601 may be a volatile memory, such as a random-access memory (RAM); the memory 601 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 601 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 601 may also be a combination of the above memories.
The processor 602 may include one or more central processing units (central processing unit, CPU) or digital processing units, etc. A processor 602 for implementing the above-described document retrieval method when calling the computer program stored in the memory 601.
The communication module 603 is used for communicating with terminal devices and other servers.
The specific connection medium between the memory 601, the communication module 603, and the processor 602 is not limited in the embodiment of the present application. The embodiment of the present disclosure is shown in fig. 6, where the memory 601 and the processor 602 are connected by a bus 604, where the bus 604 is shown in bold lines in fig. 6, and the connection between other components is merely illustrative, and not limiting. The bus 604 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 6, but not only one bus or one type of bus.
The memory 601 serves as a computer storage medium in which computer-executable instructions for implementing the document retrieval method of the embodiments of the present application are stored. The processor 602 is configured to perform the document retrieval method described above, as shown in fig. 2.
In another embodiment, the electronic device may be other electronic devices, such as the smart product 110 shown in FIG. 1. In this embodiment, the structure of the electronic device may include, as shown in fig. 7: communication component 710, memory 720, display unit 730, camera 740, sensor 750, audio circuit 760, bluetooth module 770, processor 780, etc.
The communication component 710 is used for communicating with a server. In some embodiments, it may include a wireless fidelity (WiFi) module. WiFi is a short-range wireless transmission technology, and the electronic device may help the user send and receive information through the WiFi module.
Memory 720 may be used to store software programs and data. The processor 780 performs various functions and data processing of the intelligent product 110 by running software programs or data stored in the memory 720. Memory 720 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Memory 720 stores an operating system that enables smart product 110 to operate. The memory 720 in the present application may store an operating system and various application programs, and may also store codes for executing the document retrieval method according to the embodiment of the present application.
The display unit 730 may also be used to display information input by a user or information provided to the user and a graphical user interface (graphical user interface, GUI) of various menus of the smart product 110. In particular, the display unit 730 may include a display screen 732 disposed on the front side of the smart product 110. The display 732 may be configured in the form of a liquid crystal display, light emitting diodes, or the like. The display unit 730 may be used to present images or text in embodiments of the present application.
The display unit 730 may also be used to receive input numeric or character information and to generate signal inputs related to user settings and function control of the smart product 110. Specifically, the display unit 730 may include a touch screen 731 disposed on the front surface of the smart product 110, which may collect touch operations performed by the user on or near it, such as clicking buttons and dragging scroll boxes.
The touch screen 731 may cover the display screen 732, or the touch screen 731 may be integrated with the display screen 732 to realize input and output functions of the intelligent product 110, and after integration, the touch screen may be simply referred to as a touch display screen. The display unit 730 may display an application program and corresponding operation steps.
The camera 740 may be used to capture still images, and a user may send images captured by the camera 740 to a user of a chat partner through a client. The camera 740 may be one or more. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to a processor 780 for conversion into a digital image signal.
The smart product may also include at least one sensor 750, such as an acceleration sensor 751, a distance sensor 752, a fingerprint sensor 753, a temperature sensor 754. The intelligent product can be also provided with other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, light sensors, motion sensors and the like.
Audio circuitry 760, speaker 761, microphone 762 may provide an audio interface between a user and smart product 110. The audio circuit 760 may transmit the received electrical signal converted from audio data to the speaker 761, where it is converted into a sound signal by the speaker 761 and output. The smart product 110 may also be configured with a volume button for adjusting the volume of the sound signal. On the other hand, microphone 762 converts the collected sound signals into electrical signals, which are received by audio circuit 760 and converted into audio data, which are output to communication component 710 for transmission to, for example, another smart product 110, or to memory 720 for further processing.
The bluetooth module 770 is used for exchanging information with other bluetooth devices having a bluetooth module through a bluetooth protocol. For example, the smart product may establish a bluetooth connection with a wearable electronic device (e.g., a smart watch) that also has a bluetooth module through the bluetooth module 770, thereby performing data interaction.
The processor 780 is a control center of the smart product, connects various parts of the entire terminal using various interfaces and lines, and performs various functions of the smart product and processes data by running or executing software programs stored in the memory 720 and calling data stored in the memory 720. In some embodiments, the processor 780 may include one or more processing units; the processor 780 may also integrate an application processor that primarily processes operating systems, user interfaces, applications, etc., and a baseband processor that primarily processes wireless communications. It will be appreciated that the baseband processor described above may not be integrated into the processor 780. The processor 780 may run an operating system, an application program, a user interface display and a touch response, and a document retrieval method according to an embodiment of the present application. In addition, a processor 780 is coupled to the display unit 730.
In some possible embodiments, aspects of the document retrieval method provided herein may also be implemented in the form of a program product comprising program code for causing a computer device to perform the steps of the document retrieval method according to the various exemplary embodiments of the present application as described herein above, when the program product is run on a computer device, e.g. the computer device may perform the steps as shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code and may run on a computing device. However, the program product of the present application is not limited thereto, and in the embodiments of the present application, the readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus produce means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. A document retrieval method, comprising:
acquiring a question to be replied;
performing word vector conversion on the question to obtain a corresponding semantic vector, searching, based on the semantic vector, a pre-established document vector library for a target document vector whose similarity to the semantic vector meets a first preset condition, and taking a document corresponding to the target document vector as a first search result;
performing word segmentation processing on the question to obtain words contained in the question, screening out from the obtained words target words existing in a predetermined key vocabulary, and searching in a preset document set based on the target words to obtain documents containing the target words as a second search result, wherein the key vocabulary is determined based on inverse document frequency (IDF) values of words contained in document fragments obtained by segmenting the documents in the preset document set;
and taking part or all of the documents in the first search result and/or part or all of the documents in the second search result as candidate documents for replying to the question.
2. The method of claim 1, wherein screening out, from the obtained words, target words that exist in the predetermined key vocabulary comprises:
performing word vector conversion on the obtained words contained in the question to obtain corresponding word vectors;
determining, based on the obtained word vectors, synonyms of each word, wherein the similarity between the word vector of each synonym of a word and the word vector of that word is greater than a preset threshold;
and screening out, from the obtained words contained in the question and the synonyms of each word, target words that exist in the predetermined key vocabulary.
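The synonym expansion of claim 2 might look like the following sketch. The `word_vectors` dictionary is a hypothetical stand-in for a trained word-vector model, cosine similarity is assumed as the similarity measure, and the threshold value and function names are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def expand_with_synonyms(words, word_vectors, threshold=0.8):
    """Append every known word whose vector is closer than `threshold` to a question word."""
    expanded = list(words)
    for w in words:
        if w not in word_vectors:
            continue  # no vector available, so no synonyms can be found
        for cand, vec in word_vectors.items():
            if cand != w and cand not in expanded and cosine(word_vectors[w], vec) > threshold:
                expanded.append(cand)
    return expanded


def screen_target_words(words, word_vectors, key_vocab, threshold=0.8):
    """Keep only the question words and synonyms that appear in the key vocabulary."""
    expanded = expand_with_synonyms(words, word_vectors, threshold)
    return [w for w in expanded if w in key_vocab]
```

With this expansion, a question word that is absent from the key vocabulary can still yield a target word through a sufficiently similar synonym.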
3. The method of claim 1, wherein the key vocabulary is determined by:
segmenting the documents contained in the preset document set to obtain a plurality of document fragments;
calculating IDF values of the words contained in each document fragment;
and selecting words whose IDF values meet a second preset condition as the key vocabulary corresponding to the preset document set.
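A minimal sketch of building the key vocabulary as in claim 3 follows. Several details are assumptions rather than part of the claim: documents are split into fixed-size windows of whitespace-separated tokens, IDF uses the common form log(N / df) with N the number of fragments and df the number of fragments containing the word, and the "second preset condition" is taken to be a minimum IDF value.

```python
import math


def split_into_fragments(doc, size=4):
    """Assumed segmentation: consecutive windows of `size` whitespace-separated tokens."""
    tokens = doc.lower().split()
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def idf_values(fragments):
    """IDF per word as log(N / df), computed over fragments rather than whole documents."""
    n = len(fragments)
    df = {}
    for frag in fragments:
        for w in set(frag):  # count each word at most once per fragment
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / count) for w, count in df.items()}


def build_key_vocab(docs, size=4, min_idf=1.0):
    """Keep words whose IDF is at least `min_idf` (the assumed second preset condition)."""
    fragments = [f for d in docs for f in split_into_fragments(d, size)]
    return {w: v for w, v in idf_values(fragments).items() if v >= min_idf}
```

Words that occur in most fragments receive a low IDF and drop out, so the resulting vocabulary holds the rarer, more distinctive terms.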
4. The method according to claim 3, wherein taking part or all of the documents in the first search result and/or part or all of the documents in the second search result as candidate documents for answering the question comprises:
sorting the documents in the first search result based on the vector similarity between the document vector of each document in the first search result and the semantic vector corresponding to the question, to obtain a first sorting result;
determining, based on the key vocabulary, the keywords contained in each document in the second search result, determining, using a preset algorithm, a score for each document in the second search result based on the IDF values of the keywords contained in that document, and sorting the documents in the second search result according to their scores to obtain a second sorting result;
selecting, based on the first sorting result, part or all of the documents from the first search result as candidate documents for answering the question, and/or selecting, based on the second sorting result, part or all of the documents from the second search result as candidate documents for answering the question.
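The two rankings of claim 4 can be sketched as follows. The claim leaves the "preset algorithm" open; summing the IDF values of the distinct keywords in a document is assumed here purely for illustration, as are the top-k selection and the hypothetical function names.

```python
def rank_first(first_result, similarities):
    """Sort document ids by vector similarity to the question, highest first."""
    return sorted(first_result, key=lambda doc_id: similarities[doc_id], reverse=True)


def score_by_idf(doc_text, key_vocab_idf):
    """Assumed preset algorithm: sum the IDF values of the distinct keywords in the document."""
    keywords = {w for w in doc_text.lower().split() if w in key_vocab_idf}
    return sum(key_vocab_idf[w] for w in keywords)


def rank_second(second_result, key_vocab_idf):
    """Sort document ids by keyword score, highest first.

    `second_result` maps document id to document text.
    """
    return sorted(
        second_result,
        key=lambda doc_id: score_by_idf(second_result[doc_id], key_vocab_idf),
        reverse=True,
    )


def select_candidates(first_ranked, second_ranked, k1, k2):
    """Take the top k1 and top k2 documents of each ranking and merge without duplicates."""
    return list(dict.fromkeys(first_ranked[:k1] + second_ranked[:k2]))
```

Choosing `k1` and `k2` separately lets a caller weight either branch more heavily, or disable one branch by setting its limit to zero.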
5. The method according to any one of claims 1-4, further comprising:
updating the key vocabulary and the document vector library upon determining that the documents in the preset document set have been updated.
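One possible way to keep both structures in step with the document set, as claim 5 requires, is sketched below. How an update is detected is not specified by the claim; comparing a SHA-256 fingerprint of the document set is an assumption, as are the `RetrievalIndex` class, the caller-supplied `embed` function, and the simplification of treating each document as a single fragment when computing IDF.

```python
import hashlib
import math


class RetrievalIndex:
    """Holds the document vector library and key vocabulary for one document set."""

    def __init__(self, embed):
        self.embed = embed        # hypothetical text -> vector function
        self.fingerprint = None   # fingerprint of the last indexed document set
        self.doc_vectors = {}
        self.key_vocab = {}

    @staticmethod
    def _fingerprint(docs):
        """Hash ids and contents so that any change to the set changes the result."""
        h = hashlib.sha256()
        for doc_id in sorted(docs):
            h.update(doc_id.encode("utf-8") + b"\x00")
            h.update(docs[doc_id].encode("utf-8") + b"\x00")
        return h.hexdigest()

    def sync(self, docs, min_idf=0.5):
        """Rebuild both structures if the document set changed; return True if rebuilt."""
        fp = self._fingerprint(docs)
        if fp == self.fingerprint:
            return False
        self.fingerprint = fp
        # Rebuild the document vector library.
        self.doc_vectors = {doc_id: self.embed(text) for doc_id, text in docs.items()}
        # Rebuild the key vocabulary (each document treated as one fragment here).
        df = {}
        for text in docs.values():
            for w in set(text.lower().split()):
                df[w] = df.get(w, 0) + 1
        n = len(docs)
        idf = {w: math.log(n / count) for w, count in df.items()}
        self.key_vocab = {w: v for w, v in idf.items() if v >= min_idf}
        return True
```

Rebuilding both structures in one call avoids the case where the key vocabulary reflects newer documents than the vector library, or the reverse.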
6. A document retrieval apparatus, comprising:
an acquisition unit for acquiring a question to be answered;
a first retrieval unit configured to perform word vector conversion on the question to obtain a corresponding semantic vector, retrieve, from a pre-established document vector library and based on the semantic vector, a target document vector whose similarity to the semantic vector meets a first preset condition, and take the document corresponding to the target document vector as a first search result;
a second retrieval unit configured to perform word segmentation on the question to obtain the words contained in the question, screen out, from the obtained words, target words that exist in a predetermined key vocabulary, and search a preset document set based on the target words to obtain documents containing the target words as a second search result, wherein the key vocabulary is determined based on inverse document frequency (IDF) values of the words contained in document fragments obtained by segmenting the documents in the preset document set;
and a processing unit configured to take part or all of the documents in the first search result and/or part or all of the documents in the second search result as candidate documents for answering the question.
7. The apparatus of claim 6, wherein the second retrieval unit determines the key vocabulary by:
segmenting the documents contained in the preset document set to obtain a plurality of document fragments;
calculating IDF values of the words contained in each document fragment;
and selecting words whose IDF values meet a second preset condition as the key vocabulary corresponding to the preset document set.
8. The apparatus according to claim 7, wherein the processing unit is specifically configured to:
sort the documents in the first search result based on the vector similarity between the document vector of each document in the first search result and the semantic vector corresponding to the question, to obtain a first sorting result;
determine, based on the key vocabulary, the keywords contained in each document in the second search result, determine, using a preset algorithm, a score for each document in the second search result based on the IDF values of the keywords contained in that document, and sort the documents in the second search result according to their scores to obtain a second sorting result;
select, based on the first sorting result, part or all of the documents from the first search result as candidate documents for answering the question, and/or select, based on the second sorting result, part or all of the documents from the second search result as candidate documents for answering the question.
9. An electronic device, comprising: at least one processor, at least one memory, and computer program instructions stored in the memory, which when executed by the processor, implement the method of any one of claims 1-5.
10. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1-5.
CN202311369150.5A 2023-10-20 2023-10-20 Document retrieval method, device, equipment and medium Pending CN117591640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311369150.5A CN117591640A (en) 2023-10-20 2023-10-20 Document retrieval method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311369150.5A CN117591640A (en) 2023-10-20 2023-10-20 Document retrieval method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117591640A true CN117591640A (en) 2024-02-23

Family

ID=89922451

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311369150.5A Pending CN117591640A (en) 2023-10-20 2023-10-20 Document retrieval method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117591640A (en)

Similar Documents

Publication Publication Date Title
US20210397980A1 (en) Information recommendation method and apparatus, electronic device, and readable storage medium
US11263208B2 (en) Context-sensitive cross-lingual searches
CN114612749B (en) Neural network model training method and device, electronic device and medium
CN111400504A (en) Method and device for identifying enterprise key people
CN112765387A (en) Image retrieval method, image retrieval device and electronic equipment
CN113656587A (en) Text classification method and device, electronic equipment and storage medium
CN115879508A (en) Data processing method and related device
CN115952274A (en) Data generation method, training method and device based on deep learning model
CN111078849A (en) Method and apparatus for outputting information
CN112328896B (en) Method, apparatus, electronic device, and medium for outputting information
CN116186197A (en) Topic recommendation method, device, electronic equipment and storage medium
CN110895587A (en) Method and device for determining target user
CN116541536B (en) Knowledge-enhanced content generation system, data generation method, device, and medium
CN116962516A (en) Data query method, device, equipment and storage medium
CN117591640A (en) Document retrieval method, device, equipment and medium
CN114238745A (en) Method and device for providing search result, electronic equipment and medium
CN113868481A (en) Component acquisition method and device, electronic equipment and storage medium
CN113780827A (en) Article screening method and device, electronic equipment and computer readable medium
CN115730047A (en) Intelligent question-answering method, equipment, device and storage medium
CN111382365A (en) Method and apparatus for outputting information
CN112860813B (en) Method and device for retrieving information
CN115809364B (en) Object recommendation method and model training method
US20230138741A1 (en) Social network adapted response
CN117743555B (en) Reply decision information transmission method, device, equipment and computer readable medium
CN115794984B (en) Data storage method, data retrieval method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination