CN116842126B - Method, medium and system for realizing accurate output of knowledge base by using LLM - Google Patents

Method, medium and system for realizing accurate output of knowledge base by using LLM

Info

Publication number
CN116842126B
Authority
CN
China
Prior art keywords
knowledge
output
vector
text
llm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311090665.1A
Other languages
Chinese (zh)
Other versions
CN116842126A (en)
Inventor
周书田
于海洋
王炳文
彭晓彬
孙桂英
洪锋
薛雁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Wangxin Information Technology Co ltd
Original Assignee
Qingdao Wangxin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Wangxin Information Technology Co ltd filed Critical Qingdao Wangxin Information Technology Co ltd
Priority to CN202311090665.1A priority Critical patent/CN116842126B/en
Publication of CN116842126A publication Critical patent/CN116842126A/en
Application granted granted Critical
Publication of CN116842126B publication Critical patent/CN116842126B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/322Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method, medium and system for realizing accurate output of a knowledge base by using an LLM, belonging to the technical field of accurate knowledge-base output. The method for realizing accurate output of a knowledge base by using an LLM comprises the following steps: vectorizing the knowledge in a knowledge base to obtain a knowledge vector data set containing a plurality of knowledge vectors; acquiring a user's question and vectorizing it to obtain a question vector; matching the question vector against the knowledge vector data set to obtain the M knowledge vectors with the highest matching degree; performing text processing on the M obtained knowledge vectors to obtain a corresponding question text serving as a prompt, and submitting it to N LLM models to obtain N output texts; and performing relevance analysis on the obtained N output texts, taking the output text with the highest relevance as the output result. The method, medium and system better exploit the language understanding and generation capabilities of the LLM, and realize accurate retrieval and expression over a large-scale knowledge base.

Description

Method, medium and system for realizing accurate output of knowledge base by using LLM
Technical Field
The invention belongs to the technical field of accurate output of a knowledge base, and particularly relates to a method, medium and system for realizing accurate output of the knowledge base by using LLM.
Background
With the rapid development of the Internet, a huge text knowledge base has formed on the network, which provides great convenience for people to learn and acquire knowledge. However, how to quickly and accurately obtain the required knowledge from the vast amount of web text remains an open challenge. The traditional text matching method based on word vectors has low matching accuracy. In recent years, Large Language Model (LLM) technology has developed rapidly and demonstrated strong capability in natural language understanding tasks. An LLM is a large-scale, deep-learning-based natural language processing model that learns the grammar and semantics of natural language and can therefore generate human-readable text. A "language model" is an AI model that processes only language characters (or symbols), finds the rules within them, and can automatically generate content conforming to those rules from a prompt. LLMs are typically neural network models trained on large-scale corpora, such as massive text data from the Internet. These models typically possess billions to trillions of parameters and can handle various natural language processing tasks such as natural language generation, text classification, text summarization, machine translation, speech recognition, and so on. How to use the powerful language understanding capability of LLMs to realize accurate retrieval and expression over a large-scale text knowledge base is a question worth exploring. At present, the related technology for realizing accurate knowledge-base output with an LLM is not mature. The existing approach is mainly based on a semantic matching strategy: the LLM encodes the question and the knowledge base, the similarity between the encodings is computed, and the knowledge text with the highest similarity is selected as the output.
This approach has two problems: 1) relying on semantic matching alone cannot fully exploit the language generation capability of the LLM; and 2) context semantics are not considered during matching, so the output is inaccurate. To realize accurate knowledge-base output, it is necessary to study how to better utilize the dual capabilities of language understanding and generation in LLMs, and to generate knowledge expressions conforming to the context on the basis of fully understanding the semantics. This requires further introducing context modeling on top of the encoded semantic representations, which lets the LLM fully understand the semantics and contextual information of the question and thereby generate accurate, fluent knowledge expressions that fit the context.
In general, the prior art cannot effectively solve the problem of accurate knowledge-base output, and a new technical scheme is urgently needed to better exploit the language understanding and generation capabilities of LLMs, so as to realize accurate retrieval and expression over a large-scale knowledge base.
Disclosure of Invention
In view of the above, the invention provides a method, medium and system for realizing accurate output of a knowledge base by using an LLM, which solve the technical problems that the prior art cannot exploit the language understanding and generation capabilities of LLMs and cannot realize accurate retrieval and expression over a large-scale knowledge base.
The invention is realized in the following way:
the first aspect of the present invention provides a method for implementing accurate output of a knowledge base by using LLM, comprising the steps of:
s10, carrying out vectorization processing on knowledge in a knowledge base to obtain a knowledge vector data set containing a plurality of knowledge vectors;
s20, acquiring a problem of a user and carrying out vectorization processing to obtain a problem vector;
s30, matching the knowledge vector data set with the problem vector to obtain M knowledge vectors with highest matching degree;
s40, performing text processing on the M obtained knowledge vectors to obtain a corresponding problem text serving as a prompt;
s50, submitting the obtained campt to N LLM models to obtain N output texts;
s60, performing correlation analysis on the obtained N output texts, and taking the output text with the highest correlation as an output result.
Based on the technical scheme, the method for realizing accurate output of the knowledge base by using the LLM can be further improved as follows:
the step of performing relevance analysis on the obtained N output texts and taking the output text with the highest relevance as an output result specifically includes:
s61, carrying out vectorization processing on the obtained N output texts to obtain N output vectors;
s62, carrying out correlation analysis on each output vector and a knowledge base to obtain the correlation of each output vector;
s63, if the correlation degree of the output vector with the largest correlation degree is larger than a correlation degree threshold value, taking the output text corresponding to the output vector with the largest correlation degree as an output result; if no output vector greater than the correlation threshold exists, repeating the steps S40-S60 or repeating the steps S30-S60 after adjusting the value of M until an output text meeting the correlation requirement is obtained or the maximum cycle number is exceeded; and if the maximum number of the loops is exceeded, taking the output text corresponding to the output vector with the highest correlation degree in the conventional loops as an output result.
M is adjusted, typically by M+1.
Further, if no output vector exceeds the relevance threshold, steps S40-S60 are repeated until an output text meeting the relevance requirement is obtained or the maximum number of loops is exceeded, and a step of optimizing the prompt is further included after step S40 is repeated, specifically:
Step 1, summarizing the N output texts obtained in the previous loop by using the LLM to obtain N summary texts;
Step 2, merging the prompt with each of the N summary texts to obtain N merged texts;
Step 3, performing relevance analysis between the N merged texts and the user's question, and taking the merged text with the highest relevance as the target text;
Step 4, submitting the target text to the LLM model that generated it for analysis, generating a new prompt to replace the original prompt, thereby realizing the optimization of the prompt.
Wherein, the N LLM models are all accessed via API calls.
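Since the text only specifies that the N LLM models are reached over API calls, the fan-out can be sketched as below; the JSON field names and model names are illustrative assumptions, not any specific vendor's schema:

```python
import json

def build_llm_request(prompt, model_name, temperature=0.0):
    """Assemble one JSON request body for an LLM API call.
    The field names are illustrative, not any specific vendor's schema."""
    return json.dumps({"model": model_name,
                       "prompt": prompt,
                       "temperature": temperature})

def fan_out(prompt, model_names):
    """Build one request body per model: the N LLMs receive the same prompt."""
    return [build_llm_request(prompt, m) for m in model_names]

bodies = fan_out("question: ...", ["model-a", "model-b", "model-c"])
```

Each body would be POSTed to the corresponding model's endpoint, yielding the N output texts of step S50.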
In the steps of vectorizing the knowledge in the knowledge base and of acquiring and vectorizing the user's question, the vectorization method is to process the knowledge text in the knowledge base, or the user's question text, into a vector by Word2Vec.
The step of matching the question vector against the knowledge vector data set to obtain the M knowledge vectors with the highest matching degree specifically includes: calculating the similarity between the question vector and each knowledge vector in the knowledge vector data set, and selecting the M knowledge vectors with the highest matching degree as the question matching result, wherein the similarity calculation method is cosine similarity.
The step of performing text processing on the obtained M knowledge vectors to obtain a corresponding question text serving as a prompt specifically includes:
mapping the obtained M knowledge vectors back into natural language and converting them into text expressions;
and splicing the texts obtained from the text expressions into a prompt sequence.
Wherein M = 5 and N = 3.
A second aspect of the present invention provides a computer readable storage medium storing program instructions which, when executed, perform the above method for implementing accurate output of a knowledge base using an LLM.
A third aspect of the present invention provides a system for implementing knowledge base precision output using LLM, comprising the computer readable storage medium described above.
Specifically, the method for realizing accurate output of a knowledge base by using an LLM provided by the invention obtains candidate knowledge through semantic matching, and performs multi-round interactive refinement using the generation capability of the LLM, thereby realizing accurate retrieval and expression of knowledge-base text. The method has the following technical effects:
1. improving the accuracy of knowledge retrieval
According to the invention, through vectorization expression of questions and knowledge, preliminary matching is performed by calculating the similarity between vectors, so that the probability that knowledge related to the questions is retrieved can be improved, and the interference of a large amount of irrelevant knowledge is avoided. Compared with the traditional method which only relies on keyword matching, the vector matching of the invention significantly improves the accuracy of knowledge retrieval.
2. Enhancing the correctness and fluency of knowledge expression
By submitting the matched knowledge, as a prompt, to the LLM to generate a response text, the language generation capability of the LLM can be fully exploited, and a knowledge expression conforming to the context can be generated. Compared with directly outputting the retrieved knowledge text, the response text synthesized by the method is more fluent and accurate in grammar and semantics.
3. Iterative optimization of knowledge output
The invention designs a multi-round interaction mechanism based on relevance analysis, so that the knowledge expression can be iteratively optimized until the precision requirement is met. Meanwhile, the effect of each round can be further improved by summary-based optimization of the prompt. This strategy of progressive refinement significantly improves the quality of the knowledge expression.
4. Greatly improves the utilization efficiency of the knowledge base
The invention greatly improves the utilization efficiency of a massive knowledge base by quickly locating related knowledge through semantic matching and expressing it with the aid of the LLM's generation capability. The user can quickly obtain the desired knowledge without having to read the entire knowledge base word by word.
5. The method has strong expansibility
The framework of the invention has a modular design, and each module, such as vector representation, semantic matching and language generation, can be flexibly adjusted, for example by adopting more advanced vector representation and matching algorithms or integrating multiple LLMs. Therefore, the method has strong expansibility and room for further optimization.
In general, the method for realizing accurate output of the knowledge base by utilizing LLM realizes efficient and accurate retrieval and expression of a large-scale text knowledge base by organically combining semantic matching and language generation. The method improves the quality of knowledge service, provides an effective novel solution for knowledge acquisition, and has important technical progress significance. It is believed that with the continuous progress of core technology, the application prospect of the invention is very broad.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for realizing accurate output of a knowledge base by using LLM;
FIG. 2 is a flowchart of performing relevance analysis on the obtained N output texts and taking the output text with the highest relevance as the output result.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
As shown in FIG. 1, a flowchart of a method for implementing accurate output of a knowledge base by using an LLM according to the first aspect of the present invention is provided, where the method includes the following steps:
S10, vectorizing the knowledge in a knowledge base to obtain a knowledge vector data set containing a plurality of knowledge vectors;
S20, acquiring a user's question and vectorizing it to obtain a question vector;
S30, matching the question vector against the knowledge vector data set to obtain the M knowledge vectors with the highest matching degree;
S40, performing text processing on the M obtained knowledge vectors to obtain a corresponding question text serving as a prompt;
S50, submitting the obtained prompt to N LLM models to obtain N output texts;
S60, performing relevance analysis on the obtained N output texts, and taking the output text with the highest relevance as the output result.
Wherein S60 includes:
S61, vectorizing the obtained N output texts to obtain N output vectors;
S62, performing relevance analysis between each output vector and the knowledge base to obtain the relevance of each output vector;
S63, if the relevance of the output vector with the highest relevance is greater than a relevance threshold, taking the output text corresponding to that output vector as the output result; if no output vector exceeds the relevance threshold, repeating steps S40-S60, or adjusting the value of M and repeating steps S30-S60, until an output text meeting the relevance requirement is obtained or the maximum number of loops is exceeded; and if the maximum number of loops is exceeded, taking the output text corresponding to the output vector with the highest relevance among the completed loops as the output result.
For S60, specifically:
S61, vectorizing the obtained N output texts to obtain N output vectors:
For the N output texts t_1, t_2, ..., t_N, vectorization using a pre-trained language model or the like yields the output vectors v_1, v_2, ..., v_N.
A model such as BERT can represent the input text through multiple layers of encoding, and the resulting vector is taken as the text vector.
S62, performing relevance analysis between each output vector and the knowledge base to obtain the relevance of each output vector:
For each output vector v_i:
(1) Calculate the similarity sim(v_i, k_j) between v_i and every knowledge vector k_j in the knowledge base, e.g. by cosine similarity;
(2) Set a relevance threshold τ; if sim(v_i, k_j) > τ, v_i and k_j are considered related; typically τ = 90%;
(3) Obtain the set R_i of knowledge vectors related to v_i.
An efficient vector search algorithm can be employed to accelerate this process.
The relevance of v_i is defined as the sum of its similarities to its related knowledge vectors: rel(v_i) = Σ_{k_j ∈ R_i} sim(v_i, k_j).
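A minimal sketch of this relevance definition (threshold the similarities at τ, then sum the similarities of the related knowledge vectors); the toy 2-D vectors are illustrative only:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def relevance(output_vec, knowledge_vecs, tau=0.9):
    """rel(v) = sum of cosine similarities to every knowledge vector
    whose similarity to v exceeds the threshold tau (step S62)."""
    sims = [cosine(output_vec, k) for k in knowledge_vecs]
    return sum(s for s in sims if s > tau)

# toy demo: the output vector matches the first knowledge vector exactly
kvecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
score = relevance(np.array([1.0, 0.0]), kvecs, tau=0.9)
```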
S63:
(1) Set a relevance threshold T, by default T = 90%;
(2) If there exists an output vector v_i satisfying rel(v_i) > T, select the output text corresponding to the v_i with the highest relevance as the result;
(3) Otherwise, perform multiple rounds of generation attempts;
(4) Loop optimization:
During the multi-round generation, the related-knowledge matching and relevance calculation are continuously refined.
Further, in the above technical solution, if no output vector exceeds the relevance threshold, steps S40-S60 are repeated until an output text meeting the relevance requirement is obtained or the maximum number of loops is exceeded, and a step of optimizing the prompt is further included after step S40 is repeated, specifically:
Step 1, summarizing the N output texts obtained in the previous loop by using the LLM to obtain N summary texts;
Step 2, merging the prompt with each of the N summary texts to obtain N merged texts;
Step 3, performing relevance analysis between the N merged texts and the user's question, and taking the merged text with the highest relevance as the target text;
Step 4, submitting the target text to the LLM model that generated it for analysis, generating a new prompt to replace the original prompt, thereby realizing the optimization of the prompt.
Here the LLM is used to automatically generate a new prompt from the input, a capability supported by currently popular LLMs.
In the above technical solution, the N LLM models are all accessed via API calls.
In the above technical solution, in the steps of vectorizing the knowledge in the knowledge base and of acquiring and vectorizing the user's question, the vectorization method is to use Word2Vec to process the knowledge text in the knowledge base, or the user's question text, into vectors.
The purpose of step S10 is to vectorize the knowledge in the knowledge base to obtain a knowledge vector data set comprising a plurality of knowledge vectors. The specific implementation mode can adopt the following method:
1. construction of knowledge base
First, a knowledge base containing rich knowledge needs to be built. The knowledge base can store professional knowledge in different fields, such as industry knowledge of medicine, law and finance, and can also store common general knowledge of daily life. The knowledge base can be constructed by extracting knowledge from existing knowledge sources such as encyclopedias, dictionaries and professional books, by manually writing knowledge points, or by constructing a knowledge graph from question-answer pairs. The constructed knowledge base should cover as much domain knowledge as possible, and each knowledge point should be expressed clearly and accurately.
2. Knowledge representation
For each knowledge point in the knowledge base, proper expression is needed to facilitate subsequent vectorization processing. It is contemplated that the expression may be in natural language, such as expressing a medical knowledge point as a piece of text. It is also contemplated to use a more succinct structuring approach to expression, such as a triples structure in a knowledge-graph. It is necessary to ensure that the expression of each knowledge point reflects its semantic content.
3. Text vectorization
If the knowledge points are expressed in natural language, the text can be regarded as a sequence: each word is mapped into a dense vector by a word embedding method, and the vector representation of the whole text sequence is obtained through a model. Common text vectorization models include the Bag-of-Words (BoW) model, the TF-IDF model, the Word2Vec model, and pre-trained language models such as BERT. These models can learn the semantic features of text and map the text into a semantic vector space of fixed dimension.
For example, given a piece of knowledge-point text T = (w_1, w_2, ..., w_n), wherein w_i represents the i-th word, Word2Vec can be used to obtain a word vector e_i for each word w_i; all word vectors are then averaged to form the text vector: v = (1/n) Σ_{i=1}^{n} e_i.
Alternatively, a pre-trained language model such as BERT can be adopted: the text sequence is input, and the model outputs the corresponding text vector representation.
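The Word2Vec averaging scheme above can be sketched as follows, with a hand-built toy word-vector table standing in for a trained model:

```python
import numpy as np

def text_vector(words, word_vecs):
    """Average the word vectors of a token sequence into one text vector,
    v = (1/n) * sum(e_i), as in the averaging scheme above.
    word_vecs maps a word to a 1-D numpy array; unknown words are skipped."""
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else None

# toy word-vector table standing in for a trained Word2Vec model
wv = {"knowledge": np.array([1.0, 0.0]), "base": np.array([0.0, 1.0])}
v = text_vector(["knowledge", "base"], wv)
```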
4. Knowledge graph vectorization
If knowledge points are expressed in the form of a knowledge graph, vectorization of entities and relationships can be considered. Common knowledge-graph vectorization models include translation models such as TransE, TransH and TransR. The main idea of these models is to capture semantic translation relationships between entities and relationships.
For example, given a triple (h, r, t), wherein h and t represent the head and tail entities and r represents the relationship between them, the TransE model learns vector representations of the entities and the relationship such that h + r ≈ t, i.e. the relation vector can be regarded as a translation from the head entity to the tail entity in the vector space.
By vectorization, each entity and relationship in the knowledge-graph is mapped into a dense vector of a fixed dimension, thereby achieving vectorized representation of the knowledge-graph.
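A minimal sketch of the TransE scoring idea described above: a triple fits well when the translation h + r lands close to t (the vectors here are hand-picked toys, not learned embeddings):

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility: a triple (h, r, t) fits well when h + r ≈ t,
    so a smaller L2 distance ||h + r - t|| means a more plausible triple."""
    return float(np.linalg.norm(h + r - t))

head = np.array([1.0, 0.0])
rel = np.array([0.0, 1.0])
tail = np.array([1.0, 1.0])
good = transe_score(head, rel, tail)          # exact translation
bad = transe_score(head, rel, np.zeros(2))    # poor fit: larger distance
```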
5. Vector normalization
Vectorizing different knowledge points may yield vectors of widely different magnitudes. To eliminate this scale effect, all knowledge vectors can be normalized: v' = (v − μ) / σ, wherein v is the original knowledge vector, v' is the normalized vector, and μ and σ are the mean and standard deviation of all knowledge vectors, respectively.
After normalization, the average value of all knowledge vectors is 0, and the standard deviation is 1, so that the influence of different vector orders can be eliminated, and the subsequent vector matching operation is facilitated.
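A sketch of this normalization, interpreting μ and σ per dimension over the whole knowledge-vector set (an assumption; the text does not specify scalar versus per-dimension statistics):

```python
import numpy as np

def normalize_vectors(vecs):
    """Z-score normalize a set of knowledge vectors: subtract the
    per-dimension mean and divide by the per-dimension standard deviation,
    so the normalized set has mean 0 and std 1 in every dimension."""
    arr = np.asarray(vecs, dtype=float)
    mu = arr.mean(axis=0)
    sigma = arr.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard: leave constant dimensions unscaled
    return (arr - mu) / sigma

# toy set: the two dimensions differ by an order of magnitude
normed = normalize_vectors([[1.0, 10.0], [3.0, 30.0]])
```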
6. Constructing a knowledge vector dataset
All knowledge in the knowledge base is subjected in turn to the above vectorization and normalization, finally yielding a data set containing a plurality of knowledge vectors, recorded as K = {v'_1, v'_2, ..., v'_n}, wherein v'_i denotes the normalized vector of the i-th knowledge point.
At this point, both the textual knowledge and the structured knowledge in the knowledge base have been mapped to dense vectors of fixed dimension and stored in the knowledge vector data set. When a user question is input, it can be vectorized, and the knowledge vectors most relevant to it can then be searched for in the knowledge vector space, thereby realizing accurate knowledge matching.
The whole step S10 realizes the mapping from unstructured text to structured knowledge vectors, and is the basis of knowledge accurate reasoning. Through a reasonable vectorization method, semantic features of knowledge points can be abstracted, and relations among knowledge are established in a vector space, so that a data basis is provided for realizing knowledge reasoning by using LLM.
Of course, the creation and vectorization of knowledge bases can also be performed directly using currently popular tools, such as Wenda (https://github.com/l15y/wenda).
In the above technical solution, the step of matching the question vector against the knowledge vector data set to obtain the M knowledge vectors with the highest matching degree specifically includes: calculating the similarity between the question vector and each knowledge vector in the knowledge vector data set, and selecting the M knowledge vectors with the highest matching degree as the question matching result, wherein the similarity calculation method is cosine similarity.
Similar to the vectorization of knowledge points in step S10, step S20 first vectorizes the input natural language question, expressing it as a dense vector of fixed dimension. Specifically, a text vectorization model introduced in step S10, such as Word2Vec or BERT, can be used to encode the question text to obtain the question vector q.
The purpose of step S30 is to match the question vector against the knowledge vector data set, so as to obtain the M knowledge vectors with the highest matching degree. The specific implementation can adopt the following method:
1. similarity calculation
Given the question vector q and the knowledge vector data set K = {k_1, k_2, ..., k_n}, the similarity sim(q, k_i) between q and every knowledge vector k_i needs to be calculated to measure the degree of semantic relatedness between the question and each knowledge point.
Common similarity calculation methods include cosine similarity, Euclidean distance, and the like:
(1) Cosine similarity
sim(q, k_i) = cos θ = (q · k_i) / (|q| |k_i|), wherein θ is the angle between the two vectors and q · k_i denotes the vector dot product. Cosine similarity considers how close the directions of the two vectors are; its value range is [-1, 1], with 1 representing complete similarity.
(2) Euclidean distance
d(q, k_i) = |q − k_i|. The Euclidean distance measures the actual distance between two vector points; similar vectors have a small Euclidean distance.
Both methods have advantages and disadvantages, and can be chosen according to actual requirements.
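The trade-off between the two metrics can be seen in a small sketch: cosine similarity ignores vector magnitude while Euclidean distance does not (the vectors below are illustrative toys):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: closeness of direction, range [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    """Euclidean distance: smaller means more similar."""
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different magnitude
c = np.array([0.0, 1.0])   # orthogonal to a
```

Here cosine similarity rates a and b as identical (same direction) even though their Euclidean distance is nonzero, which is why the choice depends on whether magnitude carries meaning in the embedding.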
3. Top-K matching
After the similarity between every knowledge vector k_i and the question vector q has been calculated, the knowledge vectors can be ranked by similarity, and the K knowledge vectors with the highest similarity are selected as the question matching result; that is, the returned Top-K set contains the K knowledge vectors with the highest matching degree, where K > M.
The choice of K requires comprehensive consideration of the actual effect. A smaller K may not cover all knowledge points involved in the question, while a larger K increases the difficulty of reasoning. A first attempt can usually choose K so as to obtain 5-10 knowledge vectors, and typically M = 5. In this step, the M knowledge vectors with the highest matching degree may then be directly selected as the question matching result.
4. Vector index optimization
Alternatively, to increase the efficiency of large-scale knowledge vector searching, a vector indexing technique may be used to index the knowledge base. Common vector indexing algorithms include tree indexes (e.g., KD-trees), hash indexes (e.g., LSH), and the like. These methods can greatly increase the search speed.
For example, a KD-tree can be used to build an index for the knowledge vector data set K. When a question vector q is input, the nearest-neighbor vectors can be found quickly by searching the KD-tree, and the Top-K vectors are then selected from them without traversing the whole set K, thereby reducing the amount of matching computation.
5. Ranking model optimization
Alternatively, the vector ranking after the similarity calculation may be optimized; it is not necessarily required to rank strictly by similarity. A ranking model, such as RankSVM, may be trained to learn a linear ranking function, e.g. of the form f(q, k_i) = q^T W k_i (the exact form depends on the chosen model).
Such a model can capture the interaction between the query question and the knowledge vectors through dot products, achieve a more accurate ordering, and determine the M highest-ranked knowledge vectors.
In summary, step S30 achieves fast matching between the question and the large-scale knowledge vectors through question vectorization and similarity matching, and provides a candidate knowledge set for subsequent knowledge retrieval. Techniques such as cosine similarity and vector indexing can further improve matching efficiency and quality.
In the above technical solution, performing text processing on the obtained M knowledge vectors to obtain a corresponding question text as a prompt specifically includes:
mapping the obtained M knowledge vectors into natural language, converting them into text expressions;
and splicing the resulting texts into a prompt sequence.
The purpose of step S40 is to generate text from the M knowledge vectors obtained in step S30 and submit the generated text to the LLMs as a prompt. Step S50 submits the generated prompt to N different LLM models to obtain N output texts. The specific implementation is as follows:
1. knowledge vector decoding
In step S30, the M knowledge vectors most relevant to the question were obtained by matching against the question vector. These knowledge vectors must first be mapped back into natural language, i.e., converted into text expressions.
A seq2seq model can be trained to decode a vector into text; that is, an autoencoder is trained to learn the vector-to-text and text-to-vector mappings simultaneously, such as:

Encoder: v = E(t), where t represents a text and v represents the corresponding vector;

Decoder: t' = D(v), where t' is the text obtained by decoding from v;

The training goal is to minimize the reconstruction loss:

L = Σ_t ||t − D(E(t))||²

The trained decoder D can then decode an input knowledge vector v_i into its corresponding text t_i.
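Training a real seq2seq autoencoder is beyond a short sketch, but for vectors that come from the knowledge base itself, a nearest-neighbor lookup over the stored encodings already behaves like the trained decoder D (it satisfies D(E(t)) = t on the stored corpus). The texts and vectors below are invented for illustration.

```python
import numpy as np

class NearestNeighborDecoder:
    """Stand-in for the trained decoder D: maps a knowledge vector back
    to the text whose stored encoding is closest to it.  A real system
    would use a trained seq2seq decoder instead."""
    def __init__(self, texts, vectors):
        self.texts = texts
        self.vectors = np.asarray(vectors, dtype=float)

    def decode(self, v):
        # Euclidean distance to every stored encoding; closest text wins.
        dists = np.linalg.norm(self.vectors - np.asarray(v, dtype=float), axis=1)
        return self.texts[int(np.argmin(dists))]

texts = ["reset the router", "update the firmware"]
vecs = [[1.0, 0.0], [0.0, 1.0]]
dec = NearestNeighborDecoder(texts, vecs)
print(dec.decode([0.9, 0.1]))
```

The decoded texts t_1, ..., t_M are what the next subsection splices into the prompt sequence.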
2. Prompt generation
After the M knowledge vectors have been decoded into texts t_1, ..., t_M, they can be spliced into a prompt sequence as the input to the LLM:

Prompt = concat(t_1, t_2, ..., t_M)

The composition of the prompt sequence can be designed according to the requirements of different LLMs. For example, "question:" may be added before each decoded text as a question description, and "answer:" appended at the end as an answer cue.

Adding cues that conform to natural language habits enables the LLM to perform better on the question-answer generation task.
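The splicing step can be sketched as plain string assembly. The exact markers ("knowledge:", "question:", "answer:") are illustrative choices, not the patent's mandated format, which it leaves open to each LLM's requirements.

```python
def build_prompt(knowledge_texts, question):
    """Splice the decoded knowledge texts into one prompt sequence,
    with question/answer markers appended as cues for the LLM."""
    lines = [f"knowledge: {t}" for t in knowledge_texts]
    lines.append(f"question: {question}")
    lines.append("answer:")
    return "\n".join(lines)

p = build_prompt(["reset the router", "update the firmware"],
                 "How do I fix my connection?")
print(p)
```

Ending the sequence with an open "answer:" cue invites the model to continue the text with the answer, which is the usual completion-style prompting pattern.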
3. LLM prediction
Optionally, the previously generated prompt, denoted Prompt, may be input to a plurality of different LLM models to generate answer texts.

Commonly used LLMs include GPT, BERT, etc., and their training modes include self-supervised pre-training, fine-tuning, etc. Taking Prompt as the input sequence of an LLM, a reply to the prompt can be generated as the answer to the question.

Assume there are N LLM models; running each of them on Prompt yields N different predicted texts:

a_i = LLM_i(Prompt), i = 1, ..., N

where a_i is the answer generated by the i-th LLM.
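Fanning the same prompt out to N models is a simple loop. The stub lambdas below stand in for real model endpoints (claim 2 suggests the N models are reached via API calls); their canned replies are invented for the example.

```python
# Stand-ins for N deployed LLM endpoints; a real system would issue an
# API call to each model here with the same prompt.
llms = [
    lambda prompt: "Restart the router first.",
    lambda prompt: "Try updating the firmware.",
    lambda prompt: "Restart the router, then update the firmware.",
]

def predict_all(prompt, models):
    """Submit one prompt to all N models, collecting answers a_1..a_N."""
    return [m(prompt) for m in models]

answers = predict_all("knowledge: ...\nquestion: ...\nanswer:", llms)
print(len(answers))
```

Since the N calls are independent, they can also be issued concurrently, which keeps the added latency close to that of a single model.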
4. Model integration
The above process produces the predicted answers of a plurality of LLM models. To synthesize these answers, the idea of model integration (model ensemble) may be employed.

Specifically, the N LLM prediction results are used as the context of a new prompt and then input to an integration model to generate the final answer, such as:

a_final = LLM_ens(a_1, a_2, ..., a_N, Prompt)

Here, LLM_ens is a more powerful integrated language model, possibly obtained through pre-training on additional large-scale multi-task data. It can synthesize the predictions of each LLM and output a final answer of higher quality.
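Constructing the ensemble prompt — the N candidate answers as context, followed by the original question — can be sketched as below. The "candidate"/"final answer" labels are illustrative; the patent only requires that the N predictions become context for the integration model.

```python
def build_ensemble_prompt(question, answers):
    """Pack the N candidate answers as context for the integration
    model LLM_ens, which is asked for the final, synthesized answer."""
    ctx = "\n".join(f"candidate {i + 1}: {a}" for i, a in enumerate(answers))
    return f"{ctx}\nquestion: {question}\nfinal answer:"

ep = build_ensemble_prompt("How do I fix my connection?",
                           ["Restart the router.", "Update the firmware."])
print(ep)
```

The resulting string is then submitted to the stronger integration model exactly like an ordinary prompt, so the ensemble step reuses the same LLM-invocation machinery as step S50.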
Through steps S40-S50, the retrieved relevant knowledge points are encoded as a prompt, and integrated prediction is used to generate the final answer. Prompt design and model integration are the keys to drawing on the advantages of multiple LLMs.
Wherein, in the above technical solution, M = 5 and N = 3.
A second aspect of the present invention provides a computer readable storage medium having stored therein program instructions that, when executed, are configured to perform a method for implementing accurate output of a knowledge base using LLM as described above.
A third aspect of the present invention provides a system for implementing knowledge base precision output using LLM, comprising the computer readable storage medium described above.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (8)

1. A method for realizing accurate output of a knowledge base by using LLM is characterized by comprising the following steps:
s10, carrying out vectorization processing on knowledge in a knowledge base to obtain a knowledge vector data set containing a plurality of knowledge vectors;
s20, acquiring a problem of a user and carrying out vectorization processing to obtain a problem vector;
s30, matching the knowledge vector data set with the problem vector to obtain M knowledge vectors with highest matching degree;
s40, performing text processing on the M obtained knowledge vectors to obtain a corresponding problem text serving as a prompt;
s50, submitting the obtained campt to N LLM models to obtain N output texts;
s60, performing correlation analysis on the N obtained output texts, and taking the output text with the highest correlation as an output result; the method specifically comprises the following steps:
s61, carrying out vectorization processing on the obtained N output texts to obtain N output vectors;
s62, carrying out correlation analysis on each output vector and a knowledge base to obtain the correlation of each output vector;
s63, if the correlation degree of the output vector with the largest correlation degree is larger than a correlation degree threshold value, taking the output text corresponding to the output vector with the largest correlation degree as an output result; if no output vector greater than the correlation threshold exists, repeating the steps S40-S60 or repeating the steps S30-S60 after adjusting the value of M until an output text meeting the correlation requirement is obtained or the maximum cycle number is exceeded; if the maximum number of the loops is exceeded, taking an output text corresponding to the output vector with the highest correlation degree in the previous loops as an output result;
the method further comprises the step of optimizing the promt after repeatedly executing the step S40, specifically:
step 1, summarizing N output texts obtained in the previous cycle by using LLM to obtain N summarized texts;
step 2, merging the prompt with the obtained N summary texts to obtain N merged texts;
step 3, carrying out relevance analysis on the N combined texts and the problems of the user, and taking the combined text with the highest relevance as a target text;
and step 4, submitting the target text to the LLM model that generated the target text for analysis, generating a new prompt to replace the original prompt, thereby realizing the optimization of the prompt.
2. The method for realizing accurate output of a knowledge base by utilizing LLM according to claim 1, wherein the N LLM models are all invoked by means of API calls.
3. The method for realizing accurate output of a knowledge base by utilizing LLM according to claim 1, wherein in the step of vectorizing knowledge in the knowledge base and obtaining a problem of a user and vectorizing, the vectorizing method is to use Word2Vec to process knowledge text in the knowledge base or problem text of the user as a vector.
4. The method for realizing accurate output of a knowledge base by utilizing LLM according to claim 1, wherein the step of matching the knowledge vector data set with the problem vector to obtain the M knowledge vectors with the highest matching degree specifically comprises: calculating the similarity between the problem vector and each knowledge vector in the knowledge vector data set, and selecting the M knowledge vectors with the highest matching degree as the problem matching result, wherein the similarity calculation method is cosine similarity.
5. The method for realizing accurate output of knowledge base by utilizing LLM according to claim 1, wherein the step of performing text processing with the obtained M knowledge vectors to obtain corresponding question text as a prompt specifically comprises:
mapping the obtained M knowledge vectors into natural language and converting the natural language into text expression;
and splicing the texts obtained by text expression into a prompt sequence.
6. The method for realizing accurate output of a knowledge base by using LLM according to claim 1, wherein m=5; n=3.
7. A computer readable storage medium, wherein program instructions are stored in the computer readable storage medium, and when the program instructions are executed, the program instructions are configured to perform a method for implementing accurate output of a knowledge base using LLM according to any one of claims 1-6.
8. A system for implementing knowledge base precision output using LLM, comprising the computer readable storage medium of claim 7.