CN110866102A - Search processing method - Google Patents

Search processing method

Info

Publication number
CN110866102A
CN110866102A (application CN201911082817.7A)
Authority
CN
China
Prior art keywords
document
massive
keyword
documents
probabilities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911082817.7A
Other languages
Chinese (zh)
Inventor
潘心冰
李明明
曾光
张红若
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Co Ltd filed Critical Inspur Software Co Ltd
Priority to CN201911082817.7A priority Critical patent/CN110866102A/en
Publication of CN110866102A publication Critical patent/CN110866102A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model

Abstract

The embodiment of the invention discloses a retrieval processing method that can improve the efficiency of retrieval over massive document collections. The retrieval processing method comprises the following steps: obtaining a question and extracting at least one keyword from it; determining a massive document library for retrieving the answer corresponding to the question; extracting documents associated with the question from the massive document library, according to their degree of association with the at least one keyword, to form an associated document set; and retrieving the answer corresponding to the question from the associated document set. Because the associated documents are first selected from the massive document library according to the question, and the answer is then searched for only within those documents, the efficiency of massive-scale retrieval is improved.

Description

Search processing method
Technical Field
The invention relates to the field of retrieval, in particular to a retrieval processing method.
Background
In the information age, information grows explosively, and rapidly retrieving the answer to a user's question from massive amounts of information has become one of the key problems in the field of intelligent dialogue systems. As the number of documents grows, for example in massive collections such as product manuals and legal documents, the volume of data to be searched is huge, which often results in slow queries or even query failure.
Disclosure of Invention
The embodiment of the invention provides a retrieval processing method that can improve the efficiency of massive-scale retrieval.
The embodiment of the invention adopts the following technical scheme:
a search processing method, comprising:
obtaining a question, and extracting at least one keyword from the question;
determining a massive document library for retrieving answers corresponding to the questions;
extracting documents associated with the question from the massive document library, according to the degree of association with the keywords, to form an associated document set;
and retrieving the result corresponding to the question from the associated document set.
Optionally, the extracting, according to the degree of association with the at least one keyword, documents associated with the question from the massive document library to form an associated document set includes:
obtaining the topic of each document in the massive document library, and matching each of the at least one keyword against the topic of each document in the massive document library to obtain a first series of probabilities for the keywords;
matching each of the at least one keyword against each document in the massive document library by semantic similarity to obtain a second series of probabilities for the keywords;
and extracting documents associated with the question from the massive document library according to the first series of probabilities and the second series of probabilities to form the associated document set.
Optionally, the obtaining the topic of each document in the massive document library includes:
constructing a topic model based on the LDA algorithm;
and determining the topic of each document in the massive document library according to the topic model.
Optionally, the determining the topic of each document in the massive document library according to the topic model includes:
determining a series of candidate topics of each document in the massive document library and the probability of each candidate topic according to the topic model;
and determining the topic of each document in the massive document library according to the probability of each candidate topic, where each document in the massive document library may have one or more topics.
Optionally, the matching each of the at least one keyword against each document in the massive document library by semantic similarity to obtain a second series of probabilities for the keywords includes:
establishing a semantic similarity model of the massive document library according to at least one of the TF-IDF algorithm, the BM25 algorithm, and the ES algorithm;
and matching each of the at least one keyword against each document in the massive document library by semantic similarity, based on the semantic similarity model, to obtain the second series of probabilities for the keywords.
Optionally, the extracting, according to the first series of probabilities and the second series of probabilities, the documents associated with the question from the massive document library to form the associated document set includes:
determining, according to the first series of probabilities and the second series of probabilities, a composite probability of the degree of association between each document in the massive document library and the question;
and sorting the documents in the massive document library according to the composite probability, and extracting the documents associated with the question from the massive document library to form the associated document set.
Optionally, the determining, according to the first series of probabilities and the second series of probabilities, the composite probability of the degree of association between each document in the massive document library and the question includes:
adding the first series of probabilities and the second series of probabilities with weights to obtain the composite probability of the degree of association between each document in the massive document library and the question.
Optionally, the sorting the documents in the massive document library according to the composite probability, and extracting the documents associated with the question from the massive document library to form the associated document set includes:
sorting the documents in the massive document library from high to low according to the composite probability;
and taking a set number of documents, starting from the first document in the ordering, as the documents associated with the question to form the associated document set.
Optionally, the obtaining a question and extracting at least one keyword from the question includes:
receiving the question input by a user;
and performing a preprocessing operation on the question to obtain the at least one keyword, where the preprocessing operation comprises one or more of word segmentation, error correction, stop-word removal, entity recognition, compression of long and complex sentences, and coreference resolution.
Optionally, the retrieving the result corresponding to the question from the associated document set includes:
establishing a deep learning model;
and querying the answer corresponding to the question from the associated document set according to the deep learning model.
According to the retrieval processing method of the above technical scheme, a question is obtained, at least one keyword is extracted from the question, a massive document library for retrieving the answer corresponding to the question is determined, documents associated with the question are extracted from the massive document library according to the degree of association with the at least one keyword to form an associated document set, and the answer corresponding to the question is retrieved from the associated document set. Because the associated documents are selected from the massive document library according to the question and the answer is searched for only within those documents, the efficiency of massive-scale retrieval is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flowchart of a retrieval processing method according to an embodiment of the present invention;
fig. 2 is a second flowchart of the search processing method according to the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The embodiment of the invention fuses topic extraction with semantic similarity analysis to extract the documents associated with the question from a massive document library to form an associated document set; a machine reading comprehension method based on deep learning then retrieves the answer to the query from those associated articles, so that answers are retrieved quickly and accurately from massive documents.
Example 1
As shown in fig. 1, the present embodiment provides a retrieval processing method, including:
11. A question is obtained, and at least one keyword is extracted from it.
12. A massive document library for retrieving the answer corresponding to the question is determined.
13. Documents associated with the question are extracted from the massive document library, according to their degree of association with the keywords, to form an associated document set.
14. The result corresponding to the question is retrieved from the associated document set.
In one embodiment, the extracting documents associated with the question from the massive document library, according to the degree of association with the at least one keyword, to form an associated document set comprises:
obtaining the topic of each document in the massive document library, and matching each of the at least one keyword against the topic of each document in the massive document library to obtain a first series of probabilities for the keywords;
matching each of the at least one keyword against each document in the massive document library by semantic similarity to obtain a second series of probabilities for the keywords;
and extracting documents associated with the question from the massive document library according to the first series of probabilities and the second series of probabilities to form the associated document set.
In one embodiment, the obtaining the topic of each document in the massive document library includes:
constructing a topic model based on the LDA algorithm;
and determining the topic of each document in the massive document library according to the topic model.
Specifically, taking the LDA (Latent Dirichlet Allocation) document topic generation model as an example, topic extraction means deriving a document's topics from its content. More specifically, a document is regarded as a word sequence {a, b, c, d, …}, and each word corresponds to different topics with corresponding probabilities: for example, word a corresponds to topic A with probability p1, word b corresponds to topic B with probability p2, and so on. Different word sequences, such as abc, acd, and cda, form different candidate document topics, and the sequence with the highest probability is taken as the document topic. The LDA modeling process is the process of generating these topic probabilities.
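To make the simplified word-sequence description above concrete, here is a minimal Python sketch. The word-to-topic probabilities are made-up illustrative numbers; in a real system they would come from a trained LDA model (e.g. gensim's LdaModel). Each topic is scored by accumulating per-word probabilities, and the highest-scoring topic is taken as the document topic.

```python
# Toy word -> topic probability table (illustrative values, not from a real
# trained model). In the patent's notation: word a corresponds to topic A
# with probability p1, word b to topic B with probability p2, etc.
word_topic_probs = {
    "a": {"A": 0.7, "B": 0.2},
    "b": {"A": 0.1, "B": 0.8},
    "c": {"A": 0.5, "B": 0.4},
}

def document_topic(words):
    """Score each candidate topic by summing the per-word probabilities
    over the document's word sequence, then return the best topic."""
    scores = {}
    for w in words:
        for topic, p in word_topic_probs.get(w, {}).items():
            scores[topic] = scores.get(topic, 0.0) + p
    return max(scores, key=scores.get)

print(document_topic(["a", "c"]))  # -> "A": topic A dominates for these words
```

This collapses LDA's generative model into a lookup table for illustration only; the actual inference (Gibbs sampling or variational Bayes) is what produces such probabilities.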
In one embodiment, the determining the topic of each document in the massive document library according to the topic model includes:
determining a series of candidate topics of each document in the massive document library and the probability of each candidate topic according to the topic model;
and determining the topic of each document in the massive document library according to the probability of each candidate topic, where each document may have one or more topics.
For example, each document in the massive document library yields, through the LDA algorithm, a series of topics with associated probability values; the topics are sorted by probability value, and the topic with the top probability is taken as the document topic, following the word-sequence modeling described above.
As another example, a series of topics with probability values is obtained for each document in the massive document library through the LDA algorithm; the topics are sorted by probability value, and the first few (a settable number of) topics are taken as the document topics. These topics reflect the content of the document: the document can be regarded as a set of topics obeying a certain probability distribution, so that a topic drawn at random from the document appears with a certain probability. Each topic in turn consists of words, and a word drawn at random from the topic obeys a certain probability distribution, i.e., each word is included in the topic with a certain probability. This forms a probability distribution from words to topics and from topics to text.
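The top-N selection just described can be sketched in a few lines; the topic labels and probabilities below are illustrative, and the parameter n plays the role of the settable number of topics to keep.

```python
def top_topics(topic_probs, n=2):
    """Rank a document's candidate topics by probability (high to low)
    and keep only the top n as the document's topics."""
    ranked = sorted(topic_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [topic for topic, _ in ranked[:n]]

# Hypothetical candidate topics for one document, with probabilities.
doc_topics = {"tea": 0.55, "agriculture": 0.30, "health": 0.10, "travel": 0.05}
print(top_topics(doc_topics, n=2))  # -> ['tea', 'agriculture']
```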
In one embodiment, the matching each of the at least one keyword against each document in the massive document library by semantic similarity to obtain the second series of probabilities for the keywords includes:
establishing a semantic similarity model of the massive document library according to at least one of the TF-IDF algorithm, the BM25 algorithm, and the ES algorithm;
and matching each of the at least one keyword against each document in the massive document library by semantic similarity, based on the semantic similarity model, to obtain the second series of probabilities for the keywords.
Specifically, a semantic similarity algorithm is used to calculate the similarity between the keywords and the documents. It may be TF-IDF, BM25, or the ES (Elasticsearch) algorithm, or a neural-network-based model such as DSSM (Deep Structured Semantic Model), CNN-DSSM (Convolutional Neural Network DSSM), or LSTM-DSSM (Long Short-Term Memory DSSM).
Specifically, TF-IDF has two components. TF (term frequency) represents how often a given word appears in a document; IDF (inverse document frequency) indicates the importance of a word. For a massive text collection: segment each text into words, then compute the term frequency (TF) of each word in the current article, i.e., its number of occurrences divided by the total number of words in the document; compute the IDF of each word, i.e., divide the total number of documents by the number of documents containing the word and take the logarithm of the quotient, which measures how broadly important the word is; finally compute TF-IDF as the term frequency (TF) multiplied by the inverse document frequency (IDF). TF-IDF is proportional to the number of times a word appears in a document and inversely proportional to how often the word appears across the whole corpus. TF-IDF thus yields a probability, namely the degree of association between a question word and the current article, for use in subsequent steps. Similarly, algorithms such as BM25, ES, DSSM, CNN-DSSM, and LSTM-DSSM model the massive texts and output ranked lists of texts similar to the question. BM25 adds document and query weights, making it effectively an improved version of TF-IDF that can raise the accuracy of the output; ES is built on Lucene underneath.
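The TF-IDF computation described step by step above can be sketched in a few lines of Python. This is the plain, unsmoothed form given in the text (real implementations such as scikit-learn's apply additional smoothing); the corpus below is a toy stand-in for the massive document library.

```python
import math

def tf_idf(word, doc, corpus):
    """TF-IDF as described above: term frequency in the current document,
    times the log of (total documents / documents containing the word)."""
    tf = doc.count(word) / len(doc)          # occurrences / total words
    df = sum(1 for d in corpus if word in d)  # documents containing the word
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Toy corpus of pre-segmented documents (illustrative).
corpus = [
    ["chrysanthemum", "tea", "growing", "environment"],
    ["green", "tea", "brewing"],
    ["contract", "law", "clauses"],
]
score = tf_idf("chrysanthemum", corpus[0], corpus)
print(round(score, 4))
```

Note how "tea", which occurs in two of the three documents, scores lower in `corpus[0]` than "chrysanthemum", which occurs in only one, matching the inverse-proportionality property stated above.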
The algorithms differ slightly in detail, but for massive-scale reading comprehension their inputs and outputs are the same. The embodiment of the present invention is therefore not limited to the three methods above; other modules for calculating semantic similarity may also be adopted.
In one embodiment, the extracting documents associated with the question from the massive document library according to the first series of probabilities and the second series of probabilities to form the associated document set comprises:
determining, according to the first series of probabilities and the second series of probabilities, a composite probability of the degree of association between each document in the massive document library and the question;
and sorting the documents in the massive document library according to the composite probability, and extracting the documents associated with the question from the massive document library to form the associated document set.
In one embodiment, the determining a composite probability of the degree of association between each document in the massive document library and the question according to the first series of probabilities and the second series of probabilities includes:
adding the first series of probabilities and the second series of probabilities with weights to obtain the composite probability of the degree of association between each document in the massive document library and the question.
In one embodiment, the sorting the documents in the massive document library according to the composite probability, and extracting the documents associated with the question from the massive document library to form the associated document set includes:
sorting the documents in the massive document library from high to low according to the composite probability;
and taking a set number of documents, starting from the first document in the ordering, as the documents associated with the question to form the associated document set.
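A minimal sketch of the combine, sort, and truncate procedure above, assuming the two probability series are already available per document; lam1 and lam2 are assumed tunable weights (the patent does not fix their values), and top_k is the settable number of documents to keep.

```python
def rank_documents(p_topic, p_semantic, lam1=0.5, lam2=0.5, top_k=2):
    """Combine the topic-match and semantic-similarity probability series
    by a weighted sum, sort documents high to low, and keep the top_k."""
    combined = {
        doc: lam1 * p_topic[doc] + lam2 * p_semantic[doc]
        for doc in p_topic
    }
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:top_k]

# Hypothetical per-document probabilities from the two matching stages.
p_topic = {"Doc1": 0.8, "Doc2": 0.3, "Doc3": 0.5}
p_semantic = {"Doc1": 0.7, "Doc2": 0.6, "Doc3": 0.2}
print(rank_documents(p_topic, p_semantic))  # -> ['Doc1', 'Doc2']
```

The returned list is the associated document set; only these documents are handed to the downstream answer-extraction model, which is where the efficiency gain comes from.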
In one embodiment, the obtaining a question and extracting at least one keyword from the question comprises:
receiving the question input by a user;
and performing preprocessing operations such as word segmentation and stop-word removal on the question to obtain the at least one keyword. Word segmentation may use open-source tools such as jieba or HanLP; for example, for the question "How is the weather today?", the segmentation result is roughly: "today", "weather", "how". Stop words include Chinese function words such as "的" and "地". The preprocessing operation comprises one or more of word segmentation, error correction, stop-word removal, entity recognition, compression of long and complex sentences, and coreference resolution, which can be combined according to the application scenario.
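A stdlib-only sketch of the segment-and-filter step. A real Chinese pipeline would use a segmenter such as jieba or HanLP as mentioned above; here an English question and a tiny illustrative stop-word list stand in for them.

```python
import re

# Illustrative stop-word list; a production system would use a much
# larger, language-appropriate list.
STOP_WORDS = {"the", "is", "of", "what", "a", "how"}

def extract_keywords(question):
    """Lowercase, tokenize on letter runs, and drop stop words: a stand-in
    for the segmentation + stop-word-removal preprocessing step."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(extract_keywords("What is the growing environment of chrysanthemum tea?"))
# -> ['growing', 'environment', 'chrysanthemum', 'tea']
```

Error correction, entity recognition, and coreference resolution would be additional stages in the same pipeline, each consuming and producing the token list.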
In one embodiment, the retrieving the result corresponding to the question from the associated document set includes:
establishing a deep learning model;
and querying the answer corresponding to the question from the associated document set according to the deep learning model.
Specifically, when obtaining answers to questions from massive texts, the answer can be obtained from the associated document set based on the deep learning model. The deep learning model is obtained by training on a large data set; combined with the algorithm, the model is used to extract the answer corresponding to the question from the documents.
The retrieval processing method of this embodiment obtains a question, extracts at least one keyword from the question, determines a massive document library for retrieving the answer corresponding to the question, extracts documents associated with the question from the massive document library according to the degree of association with the at least one keyword to form an associated document set, and retrieves the answer corresponding to the question from the associated document set. Because the associated documents are selected from the massive document library according to the question and the answer is searched for only within those documents, the efficiency of massive-scale retrieval is improved.
Example 2
The present embodiment describes the retrieval processing method of the present invention in detail with reference to a specific example. As shown in fig. 2, the method includes:
21. A massive document library is acquired.
This embodiment uses a massive document library storing a large number of documents; the documents contained in it are Doc1, Doc2, Doc3, ….
22. A question is acquired and processed to obtain keywords.
For example, Doc1 is an article introducing chrysanthemum tea, and the input question is "What is the growing environment of chrysanthemum tea?". Preprocessing the question analyzes and processes it, including removing stop words (for example, "of"), correcting typos, segmenting words, and deleting non-keyword words, converting the question into the keywords: "chrysanthemum tea" and "growing environment".
23. The keywords are matched against the documents in the massive document library by topic and by semantic similarity.
Specifically, topic and semantic similarity matching models are generated for the documents in the massive document library (in other embodiments, this step may also precede step 22). Taking the LDA document topic generation model as an example, a document is regarded as a word sequence {a, b, c, d, …}; each word corresponds to different topics with corresponding probabilities (for example, word a corresponds to topic A with probability P1, word b to topic B with probability P2, and so on), and different word sequences, such as abc, acd, and cda, form different document topics, with the highest-probability sequence taken as the document topic. In this way each word corresponds to the document with a probability P1(w|d).
Further, a model is built through a semantic similarity algorithm (such as the TF-IDF algorithm) to obtain the probability P2(w|d) of each word given a document. The probabilities of the question words given each document are then obtained through the two models: from the topic model, a series of probabilities P1("chrysanthemum tea"|Doc1), P1("growing environment"|Doc1), P1("chrysanthemum tea"|Doc2), …; and from the semantic similarity algorithm, P2("chrysanthemum tea"|Doc1), P2("growing environment"|Doc1), P2("chrysanthemum tea"|Doc2), ….
24. The matching results of the keywords against the documents in the massive document library are comprehensively ranked by combining topic matching and semantic similarity matching.
Specifically, the probabilities generated by the two methods are weighted and summed to obtain the composite probability that document Doc generates word w: P(w|d) = λ1·P1(w|d) + λ2·P2(w|d). For example, the probability of "chrysanthemum tea" given Doc1 is P("chrysanthemum tea"|Doc1) = λ1·P1("chrysanthemum tea"|Doc1) + λ2·P2("chrysanthemum tea"|Doc1), and likewise for "growing environment". From these, the probability of "chrysanthemum tea" and "growing environment" together given Doc1, that is, the probability that Doc1 generates the question "What is the growing environment of chrysanthemum tea?", i.e., the degree to which the question is related to Doc1, can be calculated. In the same way, the probabilities that the question is related to Doc2, Doc3, Doc4, … are obtained.
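As a worked numeric example of the composite probability, the sketch below uses hypothetical per-keyword probabilities for Doc1 and assumes, as in a unigram generation model (the patent does not state this explicitly), that the per-keyword composite probabilities are multiplied to score the whole question.

```python
# Hypothetical per-keyword probabilities for Doc1 (illustrative numbers).
p1 = {"chrysanthemum tea": 0.6, "growing environment": 0.4}  # topic match
p2 = {"chrysanthemum tea": 0.8, "growing environment": 0.5}  # semantic match

lam1, lam2 = 0.5, 0.5  # assumed weights for the two series

def question_doc_probability(keywords):
    """Composite probability that Doc1 generates the whole question:
    weighted sum per keyword, multiplied across keywords (an assumption)."""
    prob = 1.0
    for w in keywords:
        prob *= lam1 * p1[w] + lam2 * p2[w]
    return prob

# (0.5*0.6 + 0.5*0.8) * (0.5*0.4 + 0.5*0.5) = 0.7 * 0.45 = 0.315
print(round(question_doc_probability(["chrysanthemum tea", "growing environment"]), 4))
```

Repeating this for Doc2, Doc3, … yields the per-document scores that feed the comprehensive ranking in step 24.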
25. The answer corresponding to the question is determined from the comprehensive ranking based on a deep learning model.
Specifically, after the probabilities are ranked, the set of articles most relevant to the question is obtained; the question and this article set are submitted to a deep learning model (such as BERT) to obtain the answer corresponding to the question.
The retrieval processing method of this embodiment obtains a question, extracts keywords from the question, determines a massive document library for retrieving the answer corresponding to the question, extracts documents associated with the question from the massive document library according to the degree of association with the keywords to form an associated document set, and retrieves the answer corresponding to the question from the associated document set. Because the associated documents are selected from the massive document library according to the question and the answer is searched for only within those documents, the efficiency of massive-scale retrieval is improved.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A search processing method, comprising:
obtaining a question, and extracting at least one keyword from the question;
determining a massive document library for retrieving answers corresponding to the questions;
extracting documents associated with the question from the massive document library, according to the degree of association with the at least one keyword, to form an associated document set;
and retrieving the answer corresponding to the question from the associated document set.
2. The method according to claim 1, wherein the extracting documents associated with the question from the massive document library, according to the degree of association with the at least one keyword, to form an associated document set comprises:
obtaining the topic of each document in the massive document library, and matching each of the at least one keyword against the topic of each document in the massive document library to obtain a first series of probabilities for the keywords;
matching each of the at least one keyword against each document in the massive document library by semantic similarity to obtain a second series of probabilities for the keywords;
and extracting documents associated with the question from the massive document library according to the first series of probabilities and the second series of probabilities to form the associated document set.
3. The method of claim 2, wherein the obtaining the topic of each document in the massive document library comprises:
constructing a topic model based on the LDA algorithm;
and determining the topic of each document in the massive document library according to the topic model.
4. The method of claim 3, wherein the determining the topic of each document in the massive document library according to the topic model comprises:
determining at least one candidate topic of each document in the massive document library and the probability of each candidate topic according to the topic model;
and determining the topic of each document in the massive document library according to the probability of each candidate topic, where each document in the massive document library may have one or more topics.
5. The method of claim 2, wherein the matching each of the at least one keyword against each document in the massive document library by semantic similarity to obtain a second series of probabilities for the keywords comprises:
establishing a semantic similarity model of the massive document library according to at least one of the TF-IDF algorithm, the BM25 algorithm, and the ES algorithm;
and matching each of the at least one keyword against each document in the massive document library by semantic similarity, based on the semantic similarity model, to obtain the second series of probabilities for the keywords.
6. The method according to any one of claims 2 to 5, wherein extracting documents relevant to the question from the massive document library according to the first series of probabilities and the second series of probabilities to form the associated document set comprises:
determining a combined probability of relevance between each document in the massive document library and the question according to the first series of probabilities and the second series of probabilities;
and ranking the documents in the massive document library according to the combined probability, and extracting documents relevant to the question from the massive document library to form the associated document set.
7. The method of claim 6, wherein determining the combined probability of relevance between each document in the massive document library and the question according to the first series of probabilities and the second series of probabilities comprises:
computing a weighted sum of the first series of probabilities and the second series of probabilities to obtain the combined probability of relevance between each document in the massive document library and the question.
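The weighted addition in claim 7 reduces to a one-line formula per document. In this sketch the two probability series and the equal weights are assumed values; the patent does not specify the weights, which would normally be tuned.

```python
# Hypothetical per-document probabilities from the two matching stages:
# topic matching (first series) and semantic similarity (second series).
first_series = {"d1": 0.80, "d2": 0.10, "d3": 0.55}
second_series = {"d1": 0.40, "d2": 0.90, "d3": 0.60}

def combined_probability(doc_id, w_topic=0.5, w_sim=0.5):
    """Weighted sum of the two series for one document; the weights
    here are assumptions, not values given in the patent."""
    return w_topic * first_series[doc_id] + w_sim * second_series[doc_id]

combined = {d: combined_probability(d) for d in first_series}
```

With equal weights, "d1" combines to 0.60, "d3" to 0.575, and "d2" to 0.50, so a document strong in only one signal can still be outranked by one that is moderately strong in both.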
8. The method of claim 6, wherein ranking the documents in the massive document library according to the combined probability and extracting documents relevant to the question from the massive document library to form the associated document set comprises:
ranking the documents in the massive document library from the highest combined probability to the lowest;
and taking a set number of documents, starting from the first document in the ranking, as the documents relevant to the question to form the associated document set.
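The ranking-and-truncation step of claim 8 is a sort followed by a top-k cut. The combined probabilities and the "set number" of 2 below are illustrative assumptions.

```python
# Hypothetical combined relevance probabilities for each document.
combined_probs = {"d1": 0.60, "d2": 0.50, "d3": 0.575, "d4": 0.20}

def associated_document_set(probs, set_number=2):
    """Sort documents from highest to lowest combined probability and
    take the first `set_number` as the associated document set."""
    ranking = sorted(probs, key=probs.get, reverse=True)
    return ranking[:set_number]

top_docs = associated_document_set(combined_probs)
print(top_docs)  # -> ['d1', 'd3']
```

Only the documents in this truncated set are passed on to the answer-retrieval stage, which keeps the later, more expensive matching confined to a small candidate pool.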
9. The method of any one of claims 1 to 5, wherein obtaining a question and extracting at least one keyword from the question comprises:
receiving the question input by a user;
and performing a preprocessing operation on the question to obtain the at least one keyword, wherein the preprocessing operation comprises one or more of word segmentation, error correction, stop-word removal, entity recognition, compression of long or complex sentences, and coreference resolution.
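Of the preprocessing operations claim 9 lists, segmentation and stop-word removal are simple enough to sketch. The stop-word list and the whitespace "segmenter" below are toy assumptions; a real pipeline would use a proper Chinese word segmenter and add the remaining stages.

```python
# Assumed stop-word list; whitespace splitting stands in for a real
# word segmenter, which Chinese text would require.
STOP_WORDS = {"the", "a", "of", "is", "what", "in"}

def extract_keywords(question):
    """Minimal preprocessing: segment, lowercase, drop stop words.
    Error correction, entity recognition, sentence compression and
    coreference resolution would be further stages in this pipeline."""
    tokens = question.lower().split()  # toy segmentation
    return [t for t in tokens if t not in STOP_WORDS]

print(extract_keywords("What is the capital of France"))
# -> ['capital', 'france']
```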
10. The method according to any one of claims 1 to 5, wherein retrieving the answer corresponding to the question from the associated document set comprises:
establishing a deep learning model;
and querying the answer corresponding to the question from the associated document set according to the deep learning model.
CN201911082817.7A 2019-11-07 2019-11-07 Search processing method Pending CN110866102A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911082817.7A CN110866102A (en) 2019-11-07 2019-11-07 Search processing method

Publications (1)

Publication Number Publication Date
CN110866102A true CN110866102A (en) 2020-03-06

Family

ID=69654403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911082817.7A Pending CN110866102A (en) 2019-11-07 2019-11-07 Search processing method

Country Status (1)

Country Link
CN (1) CN110866102A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112052326A (en) * 2020-09-30 2020-12-08 民生科技有限责任公司 Intelligent question and answer method and system based on long and short text matching
CN112711657A (en) * 2021-01-06 2021-04-27 北京中科深智科技有限公司 Question-answering method and question-answering system
CN113239148A (en) * 2021-05-14 2021-08-10 廖伟智 Scientific and technological resource retrieval method based on machine reading understanding
WO2022088672A1 (en) * 2020-10-29 2022-05-05 平安科技(深圳)有限公司 Machine reading comprehension method and apparatus based on bert, and device and storage medium
CN115408491A (en) * 2022-11-02 2022-11-29 京华信息科技股份有限公司 Text retrieval method and system for historical data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699625A (en) * 2013-12-20 2014-04-02 北京百度网讯科技有限公司 Method and device for retrieving based on keyword
CN104050235A (en) * 2014-03-27 2014-09-17 浙江大学 Distributed information retrieval method based on set selection
CN109271505A (en) * 2018-11-12 2019-01-25 深圳智能思创科技有限公司 A kind of question answering system implementation method based on problem answers pair
CN109766423A (en) * 2018-12-29 2019-05-17 上海智臻智能网络科技股份有限公司 Answering method and device neural network based, storage medium, terminal
CN109977399A (en) * 2019-03-05 2019-07-05 国网青海省电力公司 A kind of data analysing method and device based on NLP technology


Similar Documents

Publication Publication Date Title
CN110442760B (en) Synonym mining method and device for question-answer retrieval system
CN106649818B (en) Application search intention identification method and device, application search method and server
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN106156204B (en) Text label extraction method and device
CN109885773B (en) Personalized article recommendation method, system, medium and equipment
CN109508414B (en) Synonym mining method and device
CN110866102A (en) Search processing method
CN106599054B (en) Method and system for classifying and pushing questions
CN111291188B (en) Intelligent information extraction method and system
CN111159363A (en) Knowledge base-based question answer determination method and device
CN106708929B (en) Video program searching method and device
CN112632228A (en) Text mining-based auxiliary bid evaluation method and system
Noaman et al. Naive Bayes classifier based Arabic document categorization
CN111190997A (en) Question-answering system implementation method using neural network and machine learning sequencing algorithm
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
Shawon et al. Website classification using word based multiple n-gram models and random search oriented feature parameters
CN106570196B (en) Video program searching method and device
Kurniawan et al. Indonesian twitter sentiment analysis using Word2Vec
CN107239455B (en) Core word recognition method and device
Al Mostakim et al. Bangla content categorization using text based supervised learning methods
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
Saputra et al. Keyphrases extraction from user-generated contents in healthcare domain using long short-term memory networks
CN114996455A (en) News title short text classification method based on double knowledge maps
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200306