CN118069811A

CN118069811A - LLM-based document knowledge question-answering method and system

Info

Publication number: CN118069811A
Application number: CN202410273307.2A
Authority: CN
Inventors: 金震; 张京日; 万俊; 张府涛
Original assignee: Beijing SunwayWorld Science and Technology Co Ltd
Current assignee: Beijing SunwayWorld Science and Technology Co Ltd
Priority date: 2024-03-11
Filing date: 2024-03-11
Publication date: 2024-05-24
Anticipated expiration: 2044-03-11
Also published as: CN118069811B

Abstract

The invention provides an LLM-based document knowledge question-answering method and system. The method comprises the following steps: receiving questioning information from a visitor at a question-answering interface, calculating the word embedding distance between the questioning information and the knowledge in a pre-established question-answer knowledge base, and determining the target knowledge to be recalled, wherein the question-answer knowledge base is created from question-answer documents uploaded by the visitor and stores the question-answer documents together with their corresponding word embeddings; combining the questioning information with the recalled target knowledge to generate prompt information; performing question-answering processing through the LLM according to the prompt information and the target knowledge to obtain recall content; and feeding the recall content back to the visitor, thereby answering the visitor's question. The method and system help the visitor obtain answers to questions quickly and accurately from a massive knowledge base.

Description

LLM-based document knowledge question-answering method and system

Technical Field

The invention relates to the technical field of knowledge base questions and answers, in particular to a document knowledge question and answer method and system based on LLM.

Background

Knowledge base question-answering technology answers questions posed by visitors on the basis of a knowledge base and natural language processing. In recent years, it has been widely used and continuously improved alongside the development of artificial intelligence and natural language processing. As the technology matures further, knowledge base question-answering is expected to play an increasingly important role in various fields and to provide accurate, real-time and personalized question-answering services for visitors. A number of technical schemes already exist, including rule-based question-answering systems, question-answering systems based on statistical methods, question-answering systems based on knowledge graphs, and question-answering systems based on deep learning. Each scheme has its own applicable scenarios, advantages and disadvantages, and the choice depends on the complexity of the problem, the required accuracy and the particularities of the field. There are also hybrid schemes that integrate several methods to improve the accuracy and performance of the question-answering system.

At present, finding the descriptions related to a question in a massive knowledge base, and then summarizing and organizing an answer that matches them, is a challenging task: the answer must be accurate and complete, and redundant content must be reduced. For small batches of QA-pair data, the need is generally met by calculating the pairwise similarity between the question and the entries of the knowledge base. For massive data, however, pairwise similarity calculation consumes a great amount of computing resources and produces redundant content, so this common practice is essentially unusable.

Disclosure of Invention

The invention provides an LLM-based document knowledge question-answering method and system, which uses word embedding, a vector database and an LLM model to let visitors question their own question-answer knowledge base. The method can accurately answer the visitor's questions from a massive knowledge base and ensures the accuracy of the answers.

In one aspect, the invention provides a document knowledge question-answering method based on LLM, which comprises the following steps:

receiving questioning information from a visitor at a question-answering interface, calculating the word embedding distance between the questioning information and the knowledge in a pre-established question-answer knowledge base, and determining the target knowledge to be recalled; wherein the question-answer knowledge base is created from the question-answer document uploaded by the visitor, and the question-answer document and its corresponding word embedding are stored in the question-answer knowledge base;

combining the questioning information with the recalled target knowledge to generate prompt information;

performing question-answering processing through the LLM according to the prompt information and the target knowledge to obtain recall content;

and feeding the recall content back to the visitor, so as to answer the visitor's own questioning information.
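The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the knowledge base is a plain list of (text, vector) pairs rather than a vector database, cosine distance stands in for the unspecified "word embedding distance", and `recall`, `answer` and the `llm` callable are hypothetical names introduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def recall(question_vec: Vector,
           knowledge_base: List[Tuple[str, Vector]],
           k: int = 2) -> List[str]:
    """Return the k knowledge texts closest to the question vector."""
    ranked = sorted(knowledge_base,
                    key=lambda item: cosine_distance(question_vec, item[1]))
    return [text for text, _ in ranked[:k]]


def answer(question: str,
           question_vec: Vector,
           knowledge_base: List[Tuple[str, Vector]],
           llm: Callable[[str], str],
           k: int = 2) -> str:
    """Recall target knowledge, build a prompt, and ask the (assumed) LLM."""
    target = recall(question_vec, knowledge_base, k)
    prompt = ("Known information:\n" + "\n".join(target)
              + "\nQuestion: " + question)
    return llm(prompt)
```

Passing the LLM in as a callable keeps the sketch independent of any particular model or client library.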

Preferably, the step of creating the question-answer knowledge base specifically includes:

Acquiring a question-answer document uploaded by a visitor; wherein the question-answer document contains target knowledge associated with the question information;

when a vectorization operation request from the visitor is received, segmenting the question-answer document into words by using a word segmentation model, and dividing the segmented text into paragraphs according to indicators;

converting each word into a word vector by using a word embedding model, summing the word vectors in each paragraph, and expressing the paragraphs in a vector form to obtain a word embedding result corresponding to the question-answer document;

and storing the question-answer document and the word embedding result corresponding to the question-answer document into a vector database to finish the creation and update of the question-answer knowledge base.
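The paragraph-vector construction described above (convert each word to a vector, then sum the vectors within a paragraph) might look like the following sketch. Several things here are assumptions for illustration only: the embedding table is a toy dictionary, whitespace splitting stands in for the word segmentation model, a blank line is used as the paragraph indicator, and an in-memory list stands in for the vector database.

```python
from typing import Dict, List, Tuple


def paragraph_vector(words: List[str],
                     embeddings: Dict[str, List[float]],
                     dim: int) -> List[float]:
    """Sum the word vectors of one paragraph; unknown words are skipped."""
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


def build_knowledge_base(document: str,
                         embeddings: Dict[str, List[float]],
                         dim: int,
                         indicator: str = "\n\n") -> List[Tuple[str, List[float]]]:
    """Split a document into paragraphs and store (paragraph, vector) pairs."""
    store = []
    for paragraph in document.split(indicator):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        words = paragraph.lower().split()  # stand-in for a segmentation model
        store.append((paragraph, paragraph_vector(words, embeddings, dim)))
    return store
```

Summation follows the text; averaging or length normalization would be a different design choice and is not what the description states.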

Preferably, when the LLM model performs question-answering processing, a corresponding question-answering induction scheme can be selected according to extraction parameters set by a visitor; the four question-answer induction schemes are a first question-answer induction scheme, a second question-answer induction scheme, a third question-answer induction scheme and a fourth question-answer induction scheme respectively;

the first question-answer induction scheme is to transmit recalled target knowledge to the LLM model for one time to be summarized, so that answers are obtained;

The second question-answer induction scheme is that each knowledge segment forming target knowledge is transmitted to the LLM model to be summarized, and then the result obtained by the summary of all knowledge segments is transmitted to the LLM model to be summarized again, so that an answer is obtained;

The third question-answer induction scheme is that the nth knowledge segment is transmitted to the LLM model to be summarized, then the summary obtained from the nth knowledge segment is transmitted to the LLM model together with the (n+1)th knowledge segment to be summarized again, and these steps are repeated until all knowledge segments have been summarized, so that an answer is obtained;

the fourth question-answer induction scheme is to summarize each knowledge segment once, obtain a score, and finally select a summary with the highest score to obtain an answer.
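The four induction schemes differ only in how the knowledge segments are fed to the model, which a short sketch makes concrete. The function names, the `llm` callable and the `score` callable are hypothetical; in particular, the text does not say how the score in the fourth scheme is produced, so it is passed in as an assumed function.

```python
from typing import Callable, List

LLM = Callable[[str], str]


def scheme_one(segments: List[str], llm: LLM) -> str:
    """Send all recalled segments to the model in a single call."""
    return llm("\n".join(segments))


def scheme_two(segments: List[str], llm: LLM) -> str:
    """Summarize each segment, then summarize the partial summaries."""
    partial = [llm(seg) for seg in segments]
    return llm("\n".join(partial))


def scheme_three(segments: List[str], llm: LLM) -> str:
    """Carry a running summary forward and fold in one segment at a time."""
    if not segments:
        raise ValueError("at least one knowledge segment is required")
    summary = llm(segments[0])
    for seg in segments[1:]:
        summary = llm(summary + "\n" + seg)
    return summary


def scheme_four(segments: List[str], llm: LLM,
                score: Callable[[str], float]) -> str:
    """Summarize each segment once and keep the highest-scoring summary."""
    if not segments:
        raise ValueError("at least one knowledge segment is required")
    summaries = [llm(seg) for seg in segments]
    return max(summaries, key=score)
```

The first scheme uses one model call but is limited by context length; the second and fourth use one call per segment and can run in parallel; the third is sequential because each call depends on the previous summary.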

Preferably, the combining the question information with the recalled target knowledge, and generating the prompt information includes:

preprocessing each piece of questioning information and its corresponding recalled target knowledge to obtain data to be identified;

Inputting the data to be identified into a trained keyword generation model to obtain keywords related to the context;

and screening the generated keywords according to a preset rule to obtain prompt information.

Preferably, the step of obtaining the question-answer document uploaded by the visitor is as follows:

Monitoring a document dragging event of a visitor;

when a visitor drags the question-answer document to a designated position, acquiring information of the dragged question-answer document;

Calculating the checksum of the question-answer document, and mapping the document content into a check code with a fixed length; and attaching the calculated checksum to the end of the document;

uploading the question-answer document, re-calculating the checksum of the received question-answer document after the uploading is successful, and comparing the checksum with the checksum attached to the end of the question-answer document;

if the two are inconsistent, generating prompt information that the document is damaged or tampered, and feeding back the prompt information to the visitor.
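The checksum round trip described above can be sketched with Python's standard `hashlib`. SHA-256 is an assumption made for this example, since the text only requires a hash that maps the content to a fixed-length check code; `attach_checksum` and `verify_upload` are hypothetical names.

```python
import hashlib

# Length in bytes of the fixed-size check code (32 for SHA-256).
DIGEST_SIZE = hashlib.sha256().digest_size


def attach_checksum(content: bytes) -> bytes:
    """Append the SHA-256 digest of the content to the end of the document."""
    return content + hashlib.sha256(content).digest()


def verify_upload(received: bytes) -> bytes:
    """Recompute the checksum after upload and compare it with the attached one.

    Returns the original content, or raises ValueError if the document was
    damaged or tampered with in transit.
    """
    if len(received) < DIGEST_SIZE:
        raise ValueError("document is too short to contain a checksum")
    content, attached = received[:-DIGEST_SIZE], received[-DIGEST_SIZE:]
    if hashlib.sha256(content).digest() != attached:
        raise ValueError("document is damaged or has been tampered with")
    return content
```

Note that a plain hash appended to the file detects accidental corruption, but an attacker who can modify the file can also recompute the hash; detecting deliberate tampering would need a keyed construction, which the text does not describe.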

Preferably, after the recall content is fed back to the visitor to complete the question and answer of the visitor to the own question information, the method further includes:

Calculating the distance between two questioning vectors by using cosine similarity; wherein a questioning vector is the vector form obtained by preprocessing and vectorizing the questioning information;

When the distance is smaller than a first preset threshold value and the number of times the distance has been smaller than the first preset threshold value exceeds a second preset threshold value, extracting keywords from the questioning information based on word frequency, searching for a matching template in a preset template rule base, and generating a plurality of recommended questions according to the matching template and the keywords;

A list of recommended questions is generated and fed back to the visitor to assist the visitor in asking questions.
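A rough sketch of the repeated-question check and the template-based recommendation follows. It assumes that the "distance" is 1 minus cosine similarity, that consecutive questions are the ones compared, and that every template in the rule base matches; the names `is_repeating` and `recommend` and the `{keyword}` placeholder are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def is_repeating(question_vectors: List[Sequence[float]],
                 distance_threshold: float,
                 count_threshold: int) -> bool:
    """True when near-duplicate consecutive questions exceed the count threshold."""
    close = sum(
        1
        for prev, cur in zip(question_vectors, question_vectors[1:])
        if cosine_distance(prev, cur) < distance_threshold
    )
    return close > count_threshold


def recommend(questions: List[str], templates: List[str],
              top_n: int = 1) -> List[str]:
    """Fill each template with the most frequent words of the questions."""
    counts = Counter(word for q in questions for word in q.lower().split())
    keywords = [word for word, _ in counts.most_common(top_n)]
    return [t.format(keyword=k) for k in keywords for t in templates]
```

A production version would remove stop words before counting and select templates by some matching rule, neither of which is specified in enough detail here to sketch.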

Preferably, the step of dividing the question-answer document into words by using the word segmentation model specifically includes:

Carrying out corpus cleaning on the question-answer documents to obtain texts to be divided;

performing word segmentation on the text to be segmented by using the constructed hidden Markov model, calculating the probability of each segmentation result, and obtaining a word segmentation mode with the maximum probability;

and carrying out rationality judgment according to the probability of co-occurrence of adjacent words to obtain the divided words.
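The maximum-probability segmentation step can be illustrated with a small Viterbi decoder over the common B/M/E/S tagging (begin, middle, end of a word, or single-character word). The tag set, the probability tables and the function names are assumptions for this sketch; the text does not specify how its hidden Markov model is parameterized, and the co-occurrence rationality check is left out.

```python
import math
from typing import Dict, List

STATES = "BMES"  # Begin, Middle, End of a word; Single-character word


def viterbi(text: str,
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]],
            floor: float = 1e-8) -> List[str]:
    """Return the most probable B/M/E/S tag sequence for the text."""
    if not text:
        return []

    def lg(p: float) -> float:
        # Log-probability with a small floor so unseen events stay finite.
        return math.log(p if p > 0 else floor)

    # best[t][s] = (log-probability of the best path ending in s, previous state)
    best = [{s: (lg(start_p.get(s, 0)) + lg(emit_p[s].get(text[0], 0)), None)
             for s in STATES}]
    for ch in text[1:]:
        row = {}
        for s in STATES:
            prev, score = max(
                ((p, best[-1][p][0] + lg(trans_p[p].get(s, 0))) for p in STATES),
                key=lambda pair: pair[1],
            )
            row[s] = (score + lg(emit_p[s].get(ch, 0)), prev)
        best.append(row)

    # A valid segmentation must finish on a word end (E) or a single (S).
    state = max("ES", key=lambda s: best[-1][s][0])
    tags = [state]
    for t in range(len(text) - 1, 0, -1):
        state = best[t][state][1]
        tags.append(state)
    tags.reverse()
    return tags


def segment(text: str, tags: List[str]) -> List[str]:
    """Cut the text after every character tagged E or S."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words
```

In practice the three probability tables would be estimated from a segmented corpus; the toy values used to exercise this sketch carry no linguistic meaning.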

Preferably, when receiving the question information of the visitor, the method further includes the steps of:

receiving a plurality of question-answer documents uploaded by a visitor;

Analyzing the multiple question-answer documents to obtain text contents, and converting the text contents into a plain text format to obtain multiple texts to be classified;

and classifying the texts to be classified based on the keywords and the classification rules, and storing the texts to be classified in the corresponding question-answer knowledge base according to the classification results.
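Keyword-and-rule classification of uploaded documents could be as simple as the sketch below. The rule format (a label mapped to a keyword list), the hit-count scoring and the `general` fallback label are all assumptions made for illustration; the text does not define its classification rules.

```python
from typing import Dict, List


def classify(text: str, rules: Dict[str, List[str]],
             default: str = "general") -> str:
    """Pick the label whose keywords occur most often in the text."""
    lowered = text.lower()
    best_label, best_hits = default, 0
    for label, keywords in rules.items():
        hits = sum(lowered.count(keyword) for keyword in keywords)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label


def route(texts: List[str],
          rules: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group plain-text documents by the knowledge base they belong to."""
    bases: Dict[str, List[str]] = {}
    for text in texts:
        bases.setdefault(classify(text, rules), []).append(text)
    return bases
```

Substring counting is deliberately crude; matching on segmented words would avoid false hits inside longer words.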

An LLM-based document knowledge question-answering system comprises a question-answer knowledge base, a data receiving module, a data processing module, a prompt information generation module, an LLM model and an answer feedback module, wherein:

The data receiving module is used for receiving the questioning information of the visitor;

The data processing module is used for carrying out distance calculation on the questioning information and recall knowledge in a pre-established questioning and answering knowledge base through word embedding distance when the questioning information of the visitor is received, and determining recall target knowledge;

The prompt information generation module is used for combining the question information with the recalled target knowledge to generate prompt information;

the LLM model is used for performing question-answering processing according to the prompt information and the target knowledge to obtain recall content;

the answer feedback module is used for feeding back recall content to the visitor so as to complete the question and answer of the visitor to own question information.

Preferably, the question-answer knowledge base includes:

Data uploading unit: used for acquiring a question-answer document uploaded by a visitor; wherein the question-answer document contains knowledge associated with the question information;

Data dividing processing unit: used for, when a vectorization operation request from the visitor is received, segmenting the question-answer document into words by using a word segmentation model, and dividing the segmented text into paragraphs according to indicators;

Embedding unit: used for converting each word into a word vector by using a word embedding model, summing the word vectors in each paragraph, and representing the paragraphs in vector form to obtain a word embedding result corresponding to the question-answer document;

Knowledge base update creation unit: used for storing the question-answer document and its corresponding word embedding result into a vector database to complete the creation and update of the question-answer knowledge base.

The application has the following beneficial effects: the visitor uploads his or her own question-answer knowledge base, which is then stored in the vector database so that the data are mapped into a high-dimensional space; knowledge related to a question can therefore be calculated and recalled quickly, after which the LLM model completes the knowledge summarization and answer generation. The visitor's questions can be answered accurately from a massive question-answer knowledge base, the requirements of answer accuracy, completeness and low redundancy are met, and the visitor can question his or her own question-answer knowledge base through a chat window.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:

FIG. 1 is a flowchart of an LLM-based document knowledge question-answering method in embodiment 1 of the present invention;

FIG. 2 is a flowchart showing the steps for creating a knowledge base of questions and answers in accordance with embodiment 1 of the present invention;

FIG. 3 is a flowchart showing the steps for obtaining a question-answer document uploaded by the visitor in embodiment 1 of the present invention;

FIG. 4 is a display interface for obtaining an uploaded question-answering document in embodiment 2 of the present invention;

FIG. 5 is a display interface of an automatic question-answering application based on local knowledge in embodiment 2 of the present invention;

FIG. 6 is a display interface of an automatic question-answering application based on local knowledge in embodiment 2 of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.

Referring to fig. 1-3, embodiment 1 provides an LLM-based document knowledge question-answering method, comprising the steps of:

The working principle of the technical scheme is as follows:

When the questioning information of the visitor is received, the method first preprocesses and vectorizes it, converting the questioning information into a vector representation that captures its semantic information, and recalls the target knowledge related to the questioning information from the pre-established question-answer knowledge base by using the word embedding distance. Prompt information is then generated from the questioning information and the recalled target knowledge; the prompt information helps the model understand the intention of the visitor's input, so that the output meets the visitor's requirement. For example, the prompt information may read: "Based on the known information, extract the answer that matches the question. Answer only my question, concisely, without giving examples and without adding content that is not present in the known information. If no matching answer is found in the known information, reply directly that no answer was found in the question-answer knowledge base and the question cannot be answered for you." The prompt information is then used to guide the LLM model to generate the corresponding answer, yielding the recall content. Finally, the recall content is fed back to the visitor, for example by displaying it on a web page or mobile application interface, or by sending an email or message notification, which completes the visitor's question-answering against his or her own question-answer knowledge base.
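The example instruction in the paragraph above can be turned into a reusable prompt template. The sketch below is one possible wording, not the template used by the applicant; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the exact layout of the "known information" and "question" fields is an assumption.

```python
from typing import List

# Paraphrase of the instruction given in the description; wording is illustrative.
PROMPT_TEMPLATE = (
    "Based on the known information below, extract the answer that matches "
    "the question. Answer concisely, give no examples, and add nothing that "
    "is not in the known information. If the known information contains no "
    "matching answer, reply that no answer was found in the question-answer "
    "knowledge base.\n\n"
    "Known information:\n{knowledge}\n\n"
    "Question: {question}"
)


def build_prompt(question: str, target_knowledge: List[str]) -> str:
    """Combine the question with the recalled target knowledge."""
    knowledge = "\n".join("- " + segment for segment in target_knowledge)
    return PROMPT_TEMPLATE.format(knowledge=knowledge, question=question)
```

Keeping the refusal instruction inside the template is what lets the model decline when the recalled knowledge does not cover the question, rather than answer from its own general knowledge.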

The beneficial effects of the technical scheme are as follows: the visitor uploads his or her own question-answer knowledge base, which is then stored in the vector database so that the data are mapped into a high-dimensional space; knowledge related to a question can therefore be calculated and recalled quickly, after which the LLM model completes the knowledge summarization and answer generation. The visitor's questions can be answered accurately from a massive question-answer knowledge base, the requirements of answer accuracy, completeness and low redundancy are met, and the visitor can question his or her own question-answer knowledge base through a chat window.

In one embodiment, the step of creating the question-answer knowledge base specifically includes:

The working principle of the technical scheme is as follows:

The application lets the visitor question his or her own question-answer knowledge base. Therefore, when the visitor first uses the question-answer knowledge base, a local document needs to be uploaded; file input in various formats such as docx, pdf and markdown is supported. After uploading a question-answer document containing knowledge associated with the question information, the visitor may click the question-answer knowledge base vectorization control in the interaction screen. When the vectorization operation request from the visitor is received, the question-answer document is preprocessed (word segmentation, stop-word removal, stemming and other operations) so that the question can be further analyzed and understood: the document is segmented into words by the word segmentation model, and the segmented text is divided into paragraphs according to indicators. Each word is then converted into a word vector by the word embedding model, and the word vectors in each paragraph are summed so that the paragraph is expressed in vector form, giving the word embedding result corresponding to the question-answer document; the word embedding captures the relations and semantic information among the words. Finally, the question-answer document and its corresponding word embedding are stored in a vector database, which completes the creation of the question-answer knowledge base. When the target knowledge is recalled, the paragraphs most relevant to the question can be screened out by comparing the similarity of the question vector and the paragraph vectors, and the similarity of the question vector and the word vectors is then further compared to determine the target information to be recalled.

The beneficial effects of the technical scheme are that:

the application adopts the word embedding model to process the document, the word embedding can understand the meaning of the word and the context of the word, capture richer semantic information, and can quickly search and match text paragraphs related to the question when inquiring the answer of the question, thereby providing more accurate answer.

In one embodiment, when the LLM model performs question-answering processing, a corresponding question-answering induction scheme can be selected according to extraction parameters set by a visitor; the four question-answer induction schemes are a first question-answer induction scheme, a second question-answer induction scheme, a third question-answer induction scheme and a fourth question-answer induction scheme respectively;

The technical scheme has the working principle and beneficial effects that: when the LLM model is utilized to refine the answers, the answers obtained in different refining modes are different, a visitor can set refining parameters by himself, and a corresponding question-answer induction scheme is selected, so that the experience of the visitor is enhanced, and the requirement of the visitor on the answers is met.

In one embodiment, the combining the questioning information with the recalled target knowledge, generating the hint information comprises:

The preprocessing includes, but is not limited to, text cleaning, word segmentation and part-of-speech tagging of each piece of questioning information and its corresponding recalled target knowledge, that is, of the dialogue data. The keyword generation model may employ a recurrent neural network or a long short-term memory (LSTM) network, both of which handle the input dialogue context well. The preset rules may be rules based on part of speech, semantics or grammatical structure; for example, the word frequency of the generated context prompt information cannot be 5 times, or, if the current question-answer dialogue refers to "banking business", the generated context prompt information must be related to banking business.

The technical scheme has the working principle and beneficial effects that: the application obtains the keywords related to the context by using the keyword generation model, screens the keywords by using the preset rule, and ensures that the generated context prompt information accords with the context and the semantics of the question-answer dialogue.

In one embodiment, the step of obtaining the question-answer document uploaded by the visitor is:

Monitoring a document dragging event of a visitor;

The working principle and the beneficial effects of the technical scheme are as follows: the application answers questions about documents uploaded by the visitor, so the security and integrity of the documents are particularly important; if a document is damaged or tampered with during transmission, the subsequent steps cannot be performed. Therefore, the checksum of the question-answer document is calculated before the document is uploaded and is appended to the end of the document. The checksum is computed by a hash function that maps the contents of a file to a fixed-length check code. After the question-answer document has been uploaded successfully, the checksum is recalculated with the same hash function and the two checksums are compared. If they match, the next operation is executed; if they do not match, prompt information stating that the document is damaged or has been tampered with is generated and fed back to the visitor. A document can be uploaded simply by dragging it to the designated position, which keeps the operation simple and improves the visitor's experience.

In one embodiment, after the recall content is fed back to the visitor to complete the question and answer of the visitor to the own question information, the method further comprises:

The working principle and the beneficial effects of the technical scheme are as follows: the way the visitor phrases a question may cause the generated answer to differ from the answer the visitor expects. Therefore, during each round of question-answering, the distance between two questioning vectors is calculated by using cosine similarity, and this distance is used to detect whether the visitor is satisfied with the answer. If the distance is smaller than a first preset threshold value and the number of times the distance has been smaller than the first preset threshold value exceeds a second preset threshold value, the visitor is repeatedly asking the same question but, because of the phrasing, is not obtaining the expected answer. In that case, keywords are extracted from the multiple pieces of questioning information based on word frequency, a matching template is searched for in a preset template rule base, and a plurality of recommended questions are generated according to the matching template and the keywords; a list of recommended questions is then generated and fed back to the visitor to assist the visitor in asking questions.

In one embodiment, the step of dividing the question-answer document into terms by using the word segmentation model specifically includes:

The working principle and the beneficial effects of the technical scheme are as follows: according to the application, the word is segmented through the word segmentation model, so that the words in the document can be automatically identified, human errors are reduced, the word segmentation accuracy is improved, and the efficiency is improved.

In one embodiment, when receiving the question information of the visitor, the method calculates the distance between the question information and the recall knowledge in the pre-created question-answer knowledge base through the word embedding distance, and before determining the recall target knowledge, the method further includes:

receiving a plurality of question-answer documents uploaded by a visitor;

The working principle and the beneficial effects of the technical scheme are as follows: when a visitor uploads a plurality of question-answering documents, the formats of the documents can be converted into plain text formats, the plurality of texts to be classified are classified based on keywords and classification rules, and the texts to be classified are stored in corresponding question-answering knowledge bases according to classification results. In use, the visitor can select the database as required and conduct questions and answers.

Referring to fig. 4-6, embodiment 2 provides an automatic question-answering application based on local knowledge that enables visitors to ask questions about an enterprise's bidding platform operation guide. The specific implementation steps comprise:

Step 1: receiving the visitor's settings for the model parameters, including the number k of top results returned by the vector search (top-k), the history length, the temperature, the top-p selection mechanism and the like, as shown in fig. 3;

Step 2: when a visitor drags the operation guide document to a designated position, acquiring information of the dragged operation guide document and finishing file uploading, and supporting file input of docx, pdf, markdown and other formats;

step 3: when a vectorization operation request of a visitor is received, the question-answer knowledge base is segmented, words are embedded, and then the segmented question-answer knowledge base is stored in a vector database, and data are mapped into a high-dimensional space;

step 4: receiving the questioning information of the visitor, for example "what is the system construction target" as shown in fig. 5, or "how does the supplier look up the contact of the purchasing unit" as shown in fig. 6;

Step 5: processing the questioning information, and recalling target knowledge related to the questioning information from a questioning and answering knowledge base according to the word embedding distance; generating prompt information according to the questioning information and the recalled target knowledge; and finally, returning a corresponding answer by using the LLM model and displaying the answer on a display interface, wherein the display results are shown in figures 5 and 6.
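The visitor-adjustable parameters named in Step 1 can be grouped into a small validated settings object. This is only a sketch: the class name `ModelSettings`, the default values and the accepted ranges are assumptions chosen for illustration and are not stated in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSettings:
    """Visitor-adjustable parameters; defaults are illustrative assumptions."""
    top_k: int = 3              # number of nearest results taken from vector search
    history_length: int = 5     # number of past dialogue turns kept as context
    temperature: float = 0.7    # sampling temperature of the LLM
    top_p: float = 0.9          # nucleus (top-p) sampling threshold

    def __post_init__(self) -> None:
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        if self.history_length < 0:
            raise ValueError("history_length must not be negative")
        if self.temperature < 0:
            raise ValueError("temperature must not be negative")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in the interval (0, 1]")
```

Validating at construction time means a malformed setting from the interface is rejected before any retrieval or model call is made.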

Embodiment 3, a document knowledge question-answering system based on LLM, including question-answering knowledge base, data receiving module, data processing module, prompt message generation module, LLM model and answer feedback module, wherein:

The working principle of the technical scheme is as follows: the data receiving module receives the questioning information of the visitor in real time. When questioning information is received, the data processing module preprocesses and vectorizes it, converting it into a vector representation that captures its semantic information, and recalls the target knowledge related to the questioning information from the pre-established question-answer knowledge base by using the word embedding distance. The prompt information generation module then generates prompt information from the questioning information and the recalled target knowledge; the prompt information helps the model understand the intention of the visitor's input, so that the output meets the visitor's requirement. The prompt information is then used to guide the LLM model to generate the corresponding answer, yielding the recall content. Finally, the answer feedback module displays the recall content on a web page or mobile application interface, or notifies the visitor by email or message, which completes the visitor's question-answering against his or her own question-answer knowledge base.

In one embodiment, the question-answer knowledge base comprises:

The beneficial effects of the technical scheme are that: the application adopts the word embedding model to process the document, the word embedding can understand the meaning of the word and the context of the word, capture richer semantic information, and can quickly search and match text paragraphs related to the question when inquiring the answer of the question, thereby providing more accurate answer.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. The LLM-based document knowledge question-answering method is characterized by comprising the following steps of:

2. The LLM-based document knowledge question-answering method according to claim 1, wherein the step of creating a question-answer knowledge base specifically comprises:

3. The LLM-based document knowledge question-answering method of claim 2, wherein when the LLM model performs question-answering processing, a corresponding question-answering induction scheme can be selected according to extraction parameters set by a visitor; the four question-answer induction schemes are a first question-answer induction scheme, a second question-answer induction scheme, a third question-answer induction scheme and a fourth question-answer induction scheme respectively;

4. The LLM-based document knowledge question-answering method according to claim 2, wherein the combining the question information with the recalled target knowledge to generate the hint information comprises:

5. The LLM-based document knowledge question-answering method according to claim 2, wherein the step of obtaining the question-answering document uploaded by the visitor by himself is:

Monitoring a document dragging event of a visitor;

6. The LLM-based document knowledge question-answering method according to claim 1, wherein after feeding back recall content to a visitor to complete question-answering of the visitor's own question information, further comprising:

7. The LLM-based document knowledge question-answering method according to claim 2, wherein the step of dividing the question-answering document into words using a word segmentation model comprises:

8. The LLM-based document knowledge question-answering method according to claim 1, wherein when receiving the question information of the visitor, the method calculates a distance between the question information and recall knowledge in a pre-created question-answer knowledge base through a word embedding distance, and further comprises, before determining the recall target knowledge:

receiving a plurality of question-answer documents uploaded by a visitor;

9. The document knowledge question-answering system based on the LLM is characterized by comprising a question-answering knowledge base, a data receiving module, a data processing module, a prompt information generating module, an LLM model and an answer feedback module, wherein:

10. The LLM based document knowledge question-answering system according to claim 9, wherein the question-answering knowledge base comprises: