CN116842126B - Method, medium and system for realizing accurate output of knowledge base by using LLM - Google Patents

Method, medium and system for realizing accurate output of knowledge base by using LLM

Info

Publication number
CN116842126B
Authority
CN
China
Prior art keywords
knowledge
output
vector
text
llm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311090665.1A
Other languages
Chinese (zh)
Other versions
CN116842126A (en)
Inventor
周书田
于海洋
王炳文
彭晓彬
孙桂英
洪锋
薛雁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Wangxin Information Technology Co ltd
Original Assignee
Qingdao Wangxin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Wangxin Information Technology Co ltd filed Critical Qingdao Wangxin Information Technology Co ltd
Priority to CN202311090665.1A priority Critical patent/CN116842126B/en
Publication of CN116842126A publication Critical patent/CN116842126A/en
Application granted granted Critical
Publication of CN116842126B publication Critical patent/CN116842126B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/322Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method, medium and system for realizing accurate output of a knowledge base by using an LLM, belonging to the technical field of accurate knowledge-base output. The method for realizing accurate output of a knowledge base by using an LLM comprises the following steps: vectorizing the knowledge in a knowledge base to obtain a knowledge vector data set containing a plurality of knowledge vectors; acquiring a user's question and vectorizing it to obtain a question vector; matching the question vector against the knowledge vector data set to obtain the M knowledge vectors with the highest matching degree; performing text processing on the M obtained knowledge vectors to obtain a corresponding question text serving as a prompt, and submitting it to N LLM models to obtain N output texts; and performing relevance analysis on the obtained N output texts, taking the output text with the highest relevance as the output result. The method, medium and system better exploit the language understanding and generation capabilities of the LLM, and realize accurate retrieval and expression over a large-scale knowledge base.

Description

Method, medium and system for realizing accurate output of knowledge base by using LLM
Technical Field
The invention belongs to the technical field of accurate output of a knowledge base, and particularly relates to a method, medium and system for realizing accurate output of the knowledge base by using LLM.
Background
With the rapid development of the Internet, a huge text knowledge base has formed on the network, which provides great convenience for people to learn and acquire knowledge. However, how to quickly and accurately obtain the required knowledge from the vast amount of web text remains an open challenge. The traditional text matching method based on word vectors has low matching accuracy. In recent years, Large Language Model (LLM) technology has developed rapidly and demonstrated strong capability in natural language understanding tasks. An LLM is a large-scale, deep-learning-based natural language processing model that learns the grammar and semantics of natural language and can therefore generate human-readable text. A "language model" is an AI model that processes only language characters (or symbols), finds the rules within them, and can automatically generate content conforming to those rules from a prompt. LLMs are typically neural network models trained on large-scale corpora, such as massive text data from the Internet. These models typically possess billions to trillions of parameters and can handle various natural language processing tasks such as natural language generation, text classification, text summarization, machine translation, speech recognition, and so on. How to use the powerful language understanding capability of LLMs to realize accurate retrieval and expression over a large-scale text knowledge base is a question worth exploring. At present, the related technology for realizing accurate knowledge-base output with an LLM is not mature. The existing approach is mainly based on a semantic matching strategy: the LLM encodes the question and the knowledge base, the similarity between the encodings is computed, and the knowledge text with the highest similarity is selected as the output.
This approach has two problems: 1) relying on semantic matching alone cannot fully exploit the language generation capability of the LLM; and 2) context semantics are not considered during matching, so the output is inaccurate. To realize accurate knowledge-base output, it is necessary to study how to better utilize the dual capabilities of language understanding and generation in LLMs, and to generate knowledge expressions conforming to the context on the basis of fully understanding the semantics. This requires further introducing context modeling on top of the encoded semantic representations, which lets the LLM fully understand the semantics and contextual information of the question and thereby generate accurate, fluent knowledge expressions that fit the context.
In general, the prior art cannot effectively solve the problem of accurate knowledge-base output, and a new technical scheme is urgently needed to better exploit the language understanding and generation capabilities of LLMs, so as to realize accurate retrieval and expression over a large-scale knowledge base.
Disclosure of Invention
In view of the above, the invention provides a method, medium and system for realizing accurate output of a knowledge base by using an LLM, which solve the technical problems that the prior art cannot exploit the language understanding and generation capabilities of LLMs and cannot realize accurate retrieval and expression over a large-scale knowledge base.
The invention is realized in the following way:
the first aspect of the present invention provides a method for implementing accurate output of a knowledge base by using LLM, comprising the steps of:
s10, carrying out vectorization processing on knowledge in a knowledge base to obtain a knowledge vector data set containing a plurality of knowledge vectors;
s20, acquiring a problem of a user and carrying out vectorization processing to obtain a problem vector;
s30, matching the knowledge vector data set with the problem vector to obtain M knowledge vectors with highest matching degree;
s40, performing text processing on the M obtained knowledge vectors to obtain a corresponding problem text serving as a prompt;
s50, submitting the obtained campt to N LLM models to obtain N output texts;
s60, performing correlation analysis on the obtained N output texts, and taking the output text with the highest correlation as an output result.
Based on the technical scheme, the method for realizing accurate output of the knowledge base by using the LLM can be further improved as follows:
the step of performing relevance analysis on the obtained N output texts and taking the output text with the highest relevance as an output result specifically includes:
s61, carrying out vectorization processing on the obtained N output texts to obtain N output vectors;
s62, carrying out correlation analysis on each output vector and a knowledge base to obtain the correlation of each output vector;
s63, if the correlation degree of the output vector with the largest correlation degree is larger than a correlation degree threshold value, taking the output text corresponding to the output vector with the largest correlation degree as an output result; if no output vector greater than the correlation threshold exists, repeating the steps S40-S60 or repeating the steps S30-S60 after adjusting the value of M until an output text meeting the correlation requirement is obtained or the maximum cycle number is exceeded; and if the maximum number of the loops is exceeded, taking the output text corresponding to the output vector with the highest correlation degree in the conventional loops as an output result.
M is adjusted, typically by M+1.
Further, if no output vector exceeds the relevance threshold, steps S40-S60 are repeated until an output text meeting the relevance requirement is obtained or the maximum number of loops is exceeded, and a step of optimizing the prompt is further included after step S40 is repeated, specifically:
Step 1, summarizing the N output texts obtained in the previous loop by using the LLM to obtain N summary texts;
Step 2, merging the prompt with each of the N summary texts to obtain N merged texts;
Step 3, performing relevance analysis between the N merged texts and the user's question, and taking the merged text with the highest relevance as the target text;
Step 4, submitting the target text to the LLM model that generated it for analysis, generating a new prompt to replace the original prompt, thereby realizing the optimization of the prompt.
Wherein, the N LLM models are all accessed via API calls.
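Since the text only specifies that the N LLM models are reached over API calls, the fan-out can be sketched as below; the JSON field names and model names are illustrative assumptions, not any specific vendor's schema:

```python
import json

def build_llm_request(prompt, model_name, temperature=0.0):
    """Assemble one JSON request body for an LLM API call.
    The field names are illustrative, not any specific vendor's schema."""
    return json.dumps({"model": model_name,
                       "prompt": prompt,
                       "temperature": temperature})

def fan_out(prompt, model_names):
    """Build one request body per model: the N LLMs receive the same prompt."""
    return [build_llm_request(prompt, m) for m in model_names]

bodies = fan_out("question: ...", ["model-a", "model-b", "model-c"])
```

Each body would be POSTed to the corresponding model's endpoint, yielding the N output texts of step S50.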
In the steps of vectorizing the knowledge in the knowledge base and of acquiring and vectorizing the user's question, the vectorization method is to process the knowledge text in the knowledge base, or the user's question text, into a vector by Word2Vec.
The step of matching the question vector against the knowledge vector data set to obtain the M knowledge vectors with the highest matching degree specifically includes: calculating the similarity between the question vector and each knowledge vector in the knowledge vector data set, and selecting the M knowledge vectors with the highest matching degree as the question matching result, wherein the similarity calculation method is cosine similarity.
The step of performing text processing on the obtained M knowledge vectors to obtain a corresponding question text serving as a prompt specifically includes:
mapping the obtained M knowledge vectors back into natural language and converting them into text expressions;
and splicing the texts obtained from the text expressions into a prompt sequence.
Wherein M = 5 and N = 3.
A second aspect of the present invention provides a computer readable storage medium storing program instructions which, when executed, perform the above method for implementing accurate output of a knowledge base using an LLM.
A third aspect of the present invention provides a system for implementing knowledge base precision output using LLM, comprising the computer readable storage medium described above.
Specifically, the method for realizing accurate output of a knowledge base by using an LLM provided by the invention obtains candidate knowledge through semantic matching, and performs multi-round interactive refinement using the generation capability of the LLM, thereby realizing accurate retrieval and expression of knowledge-base text. The method has the following technical effects:
1. improving the accuracy of knowledge retrieval
According to the invention, through vectorization expression of questions and knowledge, preliminary matching is performed by calculating the similarity between vectors, so that the probability that knowledge related to the questions is retrieved can be improved, and the interference of a large amount of irrelevant knowledge is avoided. Compared with the traditional method which only relies on keyword matching, the vector matching of the invention significantly improves the accuracy of knowledge retrieval.
2. Enhancing the correctness and fluency of knowledge expression
By submitting the matched knowledge, as a prompt, to the LLM to generate a response text, the language generation capability of the LLM can be fully exploited, and a knowledge expression conforming to the context can be generated. Compared with directly outputting the retrieved knowledge text, the response text synthesized by the method is more fluent and accurate in grammar and semantics.
3. Iterative optimization of knowledge output
The invention designs a multi-round interaction mechanism based on relevance analysis, so that the knowledge expression can be iteratively optimized until the precision requirement is met. Meanwhile, the effect of each round can be further improved by summary-based optimization of the prompt. This strategy of progressive refinement significantly improves the quality of the knowledge expression.
4. Greatly improves the utilization efficiency of the knowledge base
The invention greatly improves the utilization efficiency of a massive knowledge base by quickly locating related knowledge through semantic matching and expressing it with the aid of the LLM's generation capability. The user can quickly obtain the desired knowledge without having to read the entire knowledge base word by word.
5. The method has strong expansibility
The framework of the invention has a modular design, and each module, such as vector representation, semantic matching and language generation, can be flexibly adjusted, for example by adopting more advanced vector representation and matching algorithms or integrating multiple LLMs. Therefore, the method has strong expansibility and room for further optimization.
In general, the method for realizing accurate output of the knowledge base by utilizing LLM realizes efficient and accurate retrieval and expression of a large-scale text knowledge base by organically combining semantic matching and language generation. The method improves the quality of knowledge service, provides an effective novel solution for knowledge acquisition, and has important technical progress significance. It is believed that with the continuous progress of core technology, the application prospect of the invention is very broad.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for realizing accurate output of a knowledge base by using LLM;
FIG. 2 is a flowchart of performing relevance analysis on the obtained N output texts and taking the output text with the highest relevance as the output result.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
As shown in FIG. 1, a flowchart of a method for implementing accurate output of a knowledge base by using an LLM according to the first aspect of the present invention is provided, where the method includes the following steps:
S10, vectorizing the knowledge in a knowledge base to obtain a knowledge vector data set containing a plurality of knowledge vectors;
S20, acquiring a user's question and vectorizing it to obtain a question vector;
S30, matching the question vector against the knowledge vector data set to obtain the M knowledge vectors with the highest matching degree;
S40, performing text processing on the M obtained knowledge vectors to obtain a corresponding question text serving as a prompt;
S50, submitting the obtained prompt to N LLM models to obtain N output texts;
S60, performing relevance analysis on the obtained N output texts, and taking the output text with the highest relevance as the output result.
Wherein S60 includes:
S61, vectorizing the obtained N output texts to obtain N output vectors;
S62, performing relevance analysis between each output vector and the knowledge base to obtain the relevance of each output vector;
S63, if the relevance of the output vector with the highest relevance is greater than a relevance threshold, taking the output text corresponding to that output vector as the output result; if no output vector exceeds the relevance threshold, repeating steps S40-S60, or adjusting the value of M and repeating steps S30-S60, until an output text meeting the relevance requirement is obtained or the maximum number of loops is exceeded; and if the maximum number of loops is exceeded, taking the output text corresponding to the output vector with the highest relevance among the completed loops as the output result.
For S60, specifically:
S61, vectorizing the obtained N output texts to obtain N output vectors:
For the N output texts t_1, t_2, ..., t_N, vectorization using a pre-trained language model or the like yields the output vectors v_1, v_2, ..., v_N.
A model such as BERT can represent the input text through multiple layers of encoding, and the resulting vector is taken as the text vector.
S62, performing relevance analysis between each output vector and the knowledge base to obtain the relevance of each output vector:
For each output vector v_i:
(1) Calculate the similarity sim(v_i, k_j) between v_i and every knowledge vector k_j in the knowledge base, e.g. by cosine similarity;
(2) Set a relevance threshold τ; if sim(v_i, k_j) > τ, v_i and k_j are considered related; typically τ = 90%;
(3) Obtain the set R_i of knowledge vectors related to v_i.
An efficient vector search algorithm can be employed to accelerate this process.
The relevance of v_i is defined as the sum of its similarities to its related knowledge vectors: rel(v_i) = Σ_{k_j ∈ R_i} sim(v_i, k_j).
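A minimal sketch of this relevance definition (threshold the similarities at τ, then sum the similarities of the related knowledge vectors); the toy 2-D vectors are illustrative only:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def relevance(output_vec, knowledge_vecs, tau=0.9):
    """rel(v) = sum of cosine similarities to every knowledge vector
    whose similarity to v exceeds the threshold tau (step S62)."""
    sims = [cosine(output_vec, k) for k in knowledge_vecs]
    return sum(s for s in sims if s > tau)

# toy demo: the output vector matches the first knowledge vector exactly
kvecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
score = relevance(np.array([1.0, 0.0]), kvecs, tau=0.9)
```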
S63:
(1) Set a relevance threshold T, by default T = 90%;
(2) If there exists an output vector v_i satisfying rel(v_i) > T, select the output text corresponding to the v_i with the highest relevance as the result;
(3) Otherwise, perform multiple rounds of generation attempts;
(4) Loop optimization:
During the multi-round generation, the related-knowledge matching and relevance calculation are continuously refined.
Further, in the above technical solution, if no output vector exceeds the relevance threshold, steps S40-S60 are repeated until an output text meeting the relevance requirement is obtained or the maximum number of loops is exceeded, and a step of optimizing the prompt is further included after step S40 is repeated, specifically:
Step 1, summarizing the N output texts obtained in the previous loop by using the LLM to obtain N summary texts;
Step 2, merging the prompt with each of the N summary texts to obtain N merged texts;
Step 3, performing relevance analysis between the N merged texts and the user's question, and taking the merged text with the highest relevance as the target text;
Step 4, submitting the target text to the LLM model that generated it for analysis, generating a new prompt to replace the original prompt, thereby realizing the optimization of the prompt.
Here the LLM is used to automatically generate a new prompt from the input, a capability supported by currently popular LLMs.
In the above technical solution, the N LLM models are all accessed via API calls.
In the above technical solution, in the steps of vectorizing the knowledge in the knowledge base and of acquiring and vectorizing the user's question, the vectorization method is to use Word2Vec to process the knowledge text in the knowledge base, or the user's question text, into vectors.
The purpose of step S10 is to vectorize the knowledge in the knowledge base to obtain a knowledge vector data set comprising a plurality of knowledge vectors. The specific implementation mode can adopt the following method:
1. construction of knowledge base
First, a knowledge base containing rich knowledge needs to be built. The knowledge base can store professional knowledge in different fields, such as industry knowledge of medicine, law and finance, and can also store common general knowledge of daily life. The knowledge base can be constructed by extracting knowledge from existing knowledge sources such as encyclopedias, dictionaries and professional books, by manually writing knowledge points, or by constructing a knowledge graph from question-answer pairs. The constructed knowledge base should cover as much domain knowledge as possible, and each knowledge point should be expressed clearly and accurately.
2. Knowledge representation
For each knowledge point in the knowledge base, proper expression is needed to facilitate subsequent vectorization processing. It is contemplated that the expression may be in natural language, such as expressing a medical knowledge point as a piece of text. It is also contemplated to use a more succinct structuring approach to expression, such as a triples structure in a knowledge-graph. It is necessary to ensure that the expression of each knowledge point reflects its semantic content.
3. Text vectorization
If the knowledge points are expressed in natural language, the text can be regarded as a sequence: each word is mapped into a dense vector by a word embedding method, and the vector representation of the whole text sequence is obtained through a model. Common text vectorization models include the Bag-of-Words (BoW) model, the TF-IDF model, the Word2Vec model, and pre-trained language models such as BERT. These models can learn the semantic features of text and map the text into a semantic vector space of fixed dimension.
For example, given a piece of knowledge-point text T = (w_1, w_2, ..., w_n), wherein w_i represents the i-th word, Word2Vec can be used to obtain a word vector e_i for each word w_i; all word vectors are then averaged to form the text vector: v = (1/n) Σ_{i=1}^{n} e_i.
Alternatively, a pre-trained language model such as BERT can be adopted: the text sequence is input, and the model outputs the corresponding text vector representation.
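The Word2Vec averaging scheme above can be sketched as follows, with a hand-built toy word-vector table standing in for a trained model:

```python
import numpy as np

def text_vector(words, word_vecs):
    """Average the word vectors of a token sequence into one text vector,
    v = (1/n) * sum(e_i), as in the averaging scheme above.
    word_vecs maps a word to a 1-D numpy array; unknown words are skipped."""
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else None

# toy word-vector table standing in for a trained Word2Vec model
wv = {"knowledge": np.array([1.0, 0.0]), "base": np.array([0.0, 1.0])}
v = text_vector(["knowledge", "base"], wv)
```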
4. Knowledge graph vectorization
If knowledge points are expressed in the form of a knowledge graph, vectorization of entities and relationships can be considered. Common knowledge-graph vectorization models include translation models such as TransE, TransH and TransR. The main idea of these models is to capture semantic translation relationships between entities and relationships.
For example, given a triple (h, r, t), wherein h and t represent the head and tail entities and r represents the relationship between them, the TransE model learns vector representations of the entities and the relationship such that h + r ≈ t, i.e. the relation vector can be regarded as a translation from the head entity to the tail entity in the vector space.
By vectorization, each entity and relationship in the knowledge-graph is mapped into a dense vector of a fixed dimension, thereby achieving vectorized representation of the knowledge-graph.
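A minimal sketch of the TransE scoring idea described above: a triple fits well when the translation h + r lands close to t (the vectors here are hand-picked toys, not learned embeddings):

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility: a triple (h, r, t) fits well when h + r ≈ t,
    so a smaller L2 distance ||h + r - t|| means a more plausible triple."""
    return float(np.linalg.norm(h + r - t))

head = np.array([1.0, 0.0])
rel = np.array([0.0, 1.0])
tail = np.array([1.0, 1.0])
good = transe_score(head, rel, tail)          # exact translation
bad = transe_score(head, rel, np.zeros(2))    # poor fit: larger distance
```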
5. Vector normalization
Vectorizing different knowledge points may yield vectors of widely different magnitudes. To eliminate this scale effect, all knowledge vectors can be normalized: v' = (v − μ) / σ, wherein v is the original knowledge vector, v' is the normalized vector, and μ and σ are the mean and standard deviation of all knowledge vectors, respectively.
After normalization, the average value of all knowledge vectors is 0, and the standard deviation is 1, so that the influence of different vector orders can be eliminated, and the subsequent vector matching operation is facilitated.
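A sketch of this normalization, interpreting μ and σ per dimension over the whole knowledge-vector set (an assumption; the text does not specify scalar versus per-dimension statistics):

```python
import numpy as np

def normalize_vectors(vecs):
    """Z-score normalize a set of knowledge vectors: subtract the
    per-dimension mean and divide by the per-dimension standard deviation,
    so the normalized set has mean 0 and std 1 in every dimension."""
    arr = np.asarray(vecs, dtype=float)
    mu = arr.mean(axis=0)
    sigma = arr.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard: leave constant dimensions unscaled
    return (arr - mu) / sigma

# toy set: the two dimensions differ by an order of magnitude
normed = normalize_vectors([[1.0, 10.0], [3.0, 30.0]])
```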
6. Constructing a knowledge vector dataset
All knowledge in the knowledge base is subjected in turn to the above vectorization and normalization, finally yielding a data set containing a plurality of knowledge vectors, recorded as K = {v'_1, v'_2, ..., v'_n}, wherein v'_i denotes the normalized vector of the i-th knowledge point.
At this point, both the textual knowledge and the structured knowledge in the knowledge base have been mapped to dense vectors of fixed dimension and stored in the knowledge vector data set. When a user question is input, it can be vectorized, and the knowledge vectors most relevant to it can then be searched for in the knowledge vector space, thereby realizing accurate knowledge matching.
The whole step S10 realizes the mapping from unstructured text to structured knowledge vectors, and is the basis of knowledge accurate reasoning. Through a reasonable vectorization method, semantic features of knowledge points can be abstracted, and relations among knowledge are established in a vector space, so that a data basis is provided for realizing knowledge reasoning by using LLM.
Of course, the creation and vectorization of knowledge bases can also be performed directly using currently popular tools, such as Wenda (https://github.com/l15y/wenda).
In the above technical solution, the step of matching the question vector against the knowledge vector data set to obtain the M knowledge vectors with the highest matching degree specifically includes: calculating the similarity between the question vector and each knowledge vector in the knowledge vector data set, and selecting the M knowledge vectors with the highest matching degree as the question matching result, wherein the similarity calculation method is cosine similarity.
Similar to the vectorization of knowledge points in step S10, step S20 first vectorizes the input natural language question, expressing it as a dense vector of fixed dimension. Specifically, a text vectorization model introduced in step S10, such as Word2Vec or BERT, can be used to encode the question text to obtain the question vector q.
The purpose of step S30 is to match the question vector against the knowledge vector data set, so as to obtain the M knowledge vectors with the highest matching degree. The specific implementation can adopt the following method:
1. similarity calculation
Given the question vector q and the knowledge vector data set K = {k_1, k_2, ..., k_n}, the similarity sim(q, k_i) between q and every knowledge vector k_i needs to be calculated to measure the degree of semantic relatedness between the question and each knowledge point.
Common similarity calculation methods include cosine similarity, Euclidean distance, and the like:
(1) Cosine similarity
sim(q, k_i) = cos θ = (q · k_i) / (|q| |k_i|), wherein θ is the angle between the two vectors and q · k_i denotes the vector dot product. Cosine similarity considers how close the directions of the two vectors are; its value range is [-1, 1], with 1 representing complete similarity.
(2) Euclidean distance
d(q, k_i) = |q − k_i|. The Euclidean distance measures the actual distance between two vector points; similar vectors have a small Euclidean distance.
Both methods have advantages and disadvantages, and can be chosen according to actual requirements.
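The trade-off between the two metrics can be seen in a small sketch: cosine similarity ignores vector magnitude while Euclidean distance does not (the vectors below are illustrative toys):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: closeness of direction, range [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_dist(a, b):
    """Euclidean distance: smaller means more similar."""
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different magnitude
c = np.array([0.0, 1.0])   # orthogonal to a
```

Here cosine similarity rates a and b as identical (same direction) even though their Euclidean distance is nonzero, which is why the choice depends on whether magnitude carries meaning in the embedding.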
3. Top-K matching
After the similarity between every knowledge vector k_i and the question vector q has been calculated, the knowledge vectors can be ranked by similarity, and the K knowledge vectors with the highest similarity are selected as the question matching result; that is, the returned Top-K set contains the K knowledge vectors with the highest matching degree, where K > M.
The choice of K requires comprehensive consideration of the actual effect. A smaller K may not cover all knowledge points involved in the question, while a larger K increases the difficulty of reasoning. A first attempt can usually choose K so as to obtain 5-10 knowledge vectors, and typically M = 5. In this step, the M knowledge vectors with the highest matching degree may then be directly selected as the question matching result.
4. Vector index optimization
Alternatively, to increase the efficiency of large-scale knowledge vector searching, a vector indexing technique may be used to index the knowledge base. Common vector indexing algorithms include tree indexes (e.g., KD-trees), hash indexes (e.g., LSH), and the like. These methods can greatly increase the search speed.
For example, a KD-tree can be used to build an index for the knowledge vector data set K. When a question vector q is input, the nearest-neighbor vectors can be found quickly by searching the KD-tree, and the Top-K vectors are then selected from them without traversing the whole set K, thereby reducing the amount of matching computation.
5. Ranking model optimization
Alternatively, the vector ranking after the similarity calculation may be optimized; it is not necessarily required to rank strictly by similarity. A ranking model, such as RankSVM, may be trained to learn a linear ranking function, e.g. of the form f(q, k_i) = q^T W k_i (the exact form depends on the chosen model).
Such a model can capture the interaction between the query question and the knowledge vectors through dot products, achieve a more accurate ordering, and determine the M highest-ranked knowledge vectors.
In summary, step S30 achieves fast matching between the question and the large-scale knowledge vectors through question vectorization and similarity matching, and provides a candidate knowledge set for subsequent knowledge retrieval. Techniques such as cosine similarity and vector indexing can further improve matching efficiency and quality.
In the above technical solution, performing text processing on the obtained M knowledge vectors to obtain a corresponding question text as a prompt specifically includes:
mapping the obtained M knowledge vectors into natural language, converting them into text expressions;
and splicing the resulting texts into a prompt sequence.
The purpose of step S40 is to generate text from the M knowledge vectors obtained in step S30 and submit the generated text to the LLMs as a prompt. Step S50 submits the generated prompt to N different LLM models to obtain N output texts. The specific implementation is as follows:
1. knowledge vector decoding
In step S30, the M knowledge vectors most relevant to the question were obtained by matching against the question vector. These knowledge vectors must first be mapped back into natural language, i.e., converted into text expressions.
A seq2seq model can be trained to decode a vector into text; that is, an autoencoder is trained to learn the vector-to-text and text-to-vector mappings simultaneously, such as:

Encoder: v = E(t), where t represents a text and v represents the corresponding vector;

Decoder: t' = D(v), where t' is the text obtained by decoding from v;

The training goal is to minimize the reconstruction loss:

L = Σ_t ||t − D(E(t))||²

The trained decoder D can then decode an input knowledge vector v_i into its corresponding text t_i.
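Training a real seq2seq autoencoder is beyond a short sketch, but for vectors that come from the knowledge base itself, a nearest-neighbor lookup over the stored encodings already behaves like the trained decoder D (it satisfies D(E(t)) = t on the stored corpus). The texts and vectors below are invented for illustration.

```python
import numpy as np

class NearestNeighborDecoder:
    """Stand-in for the trained decoder D: maps a knowledge vector back
    to the text whose stored encoding is closest to it.  A real system
    would use a trained seq2seq decoder instead."""
    def __init__(self, texts, vectors):
        self.texts = texts
        self.vectors = np.asarray(vectors, dtype=float)

    def decode(self, v):
        # Euclidean distance to every stored encoding; closest text wins.
        dists = np.linalg.norm(self.vectors - np.asarray(v, dtype=float), axis=1)
        return self.texts[int(np.argmin(dists))]

texts = ["reset the router", "update the firmware"]
vecs = [[1.0, 0.0], [0.0, 1.0]]
dec = NearestNeighborDecoder(texts, vecs)
print(dec.decode([0.9, 0.1]))
```

The decoded texts t_1, ..., t_M are what the next subsection splices into the prompt sequence.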
2. Prompt generation
After the M knowledge vectors have been decoded into texts t_1, ..., t_M, they can be spliced into a prompt sequence as the input to the LLM:

Prompt = concat(t_1, t_2, ..., t_M)

The composition of the prompt sequence can be designed according to the requirements of different LLMs. For example, "question:" may be added before each decoded text as a question description, and "answer:" appended at the end as an answer cue.

Adding cues that conform to natural language habits enables the LLM to perform better on the question-answer generation task.
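The splicing step can be sketched as plain string assembly. The exact markers ("knowledge:", "question:", "answer:") are illustrative choices, not the patent's mandated format, which it leaves open to each LLM's requirements.

```python
def build_prompt(knowledge_texts, question):
    """Splice the decoded knowledge texts into one prompt sequence,
    with question/answer markers appended as cues for the LLM."""
    lines = [f"knowledge: {t}" for t in knowledge_texts]
    lines.append(f"question: {question}")
    lines.append("answer:")
    return "\n".join(lines)

p = build_prompt(["reset the router", "update the firmware"],
                 "How do I fix my connection?")
print(p)
```

Ending the sequence with an open "answer:" cue invites the model to continue the text with the answer, which is the usual completion-style prompting pattern.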
3. LLM prediction
Optionally, the previously generated prompt, denoted Prompt, may be input to a plurality of different LLM models to generate answer texts.

Commonly used LLMs include GPT, BERT, etc., and their training modes include self-supervised pre-training, fine-tuning, etc. Taking Prompt as the input sequence of an LLM, a reply to the prompt can be generated as the answer to the question.

Assume there are N LLM models; running each of them on Prompt yields N different predicted texts:

a_i = LLM_i(Prompt), i = 1, ..., N

where a_i is the answer generated by the i-th LLM.
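Fanning the same prompt out to N models is a simple loop. The stub lambdas below stand in for real model endpoints (claim 2 suggests the N models are reached via API calls); their canned replies are invented for the example.

```python
# Stand-ins for N deployed LLM endpoints; a real system would issue an
# API call to each model here with the same prompt.
llms = [
    lambda prompt: "Restart the router first.",
    lambda prompt: "Try updating the firmware.",
    lambda prompt: "Restart the router, then update the firmware.",
]

def predict_all(prompt, models):
    """Submit one prompt to all N models, collecting answers a_1..a_N."""
    return [m(prompt) for m in models]

answers = predict_all("knowledge: ...\nquestion: ...\nanswer:", llms)
print(len(answers))
```

Since the N calls are independent, they can also be issued concurrently, which keeps the added latency close to that of a single model.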
4. Model integration
The above process produces the predicted answers of a plurality of LLM models. To synthesize these answers, the idea of model integration (model ensemble) may be employed.

Specifically, the N LLM prediction results are used as the context of a new prompt and then input to an integration model to generate the final answer, such as:

a_final = LLM_ens(a_1, a_2, ..., a_N, Prompt)

Here, LLM_ens is a more powerful integrated language model, possibly obtained through pre-training on additional large-scale multi-task data. It can synthesize the predictions of each LLM and output a final answer of higher quality.
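Constructing the ensemble prompt — the N candidate answers as context, followed by the original question — can be sketched as below. The "candidate"/"final answer" labels are illustrative; the patent only requires that the N predictions become context for the integration model.

```python
def build_ensemble_prompt(question, answers):
    """Pack the N candidate answers as context for the integration
    model LLM_ens, which is asked for the final, synthesized answer."""
    ctx = "\n".join(f"candidate {i + 1}: {a}" for i, a in enumerate(answers))
    return f"{ctx}\nquestion: {question}\nfinal answer:"

ep = build_ensemble_prompt("How do I fix my connection?",
                           ["Restart the router.", "Update the firmware."])
print(ep)
```

The resulting string is then submitted to the stronger integration model exactly like an ordinary prompt, so the ensemble step reuses the same LLM-invocation machinery as step S50.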
Through steps S40-S50, the retrieved relevant knowledge points are encoded as a prompt, and integrated prediction is used to generate the final answer. Prompt design and model integration are the keys to drawing on the advantages of multiple LLMs.
Wherein, in the above technical solution, M = 5 and N = 3.
A second aspect of the present invention provides a computer readable storage medium having stored therein program instructions that, when executed, are configured to perform a method for implementing accurate output of a knowledge base using LLM as described above.
A third aspect of the present invention provides a system for implementing knowledge base precision output using LLM, comprising the computer readable storage medium described above.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (8)

1. A method for realizing accurate output of a knowledge base by using LLM is characterized by comprising the following steps:
s10, carrying out vectorization processing on knowledge in a knowledge base to obtain a knowledge vector data set containing a plurality of knowledge vectors;
s20, acquiring a problem of a user and carrying out vectorization processing to obtain a problem vector;
s30, matching the knowledge vector data set with the problem vector to obtain M knowledge vectors with highest matching degree;
s40, performing text processing on the M obtained knowledge vectors to obtain a corresponding problem text serving as a prompt;
s50, submitting the obtained campt to N LLM models to obtain N output texts;
s60, performing correlation analysis on the N obtained output texts, and taking the output text with the highest correlation as an output result; the method specifically comprises the following steps:
s61, carrying out vectorization processing on the obtained N output texts to obtain N output vectors;
s62, carrying out correlation analysis on each output vector and a knowledge base to obtain the correlation of each output vector;
s63, if the correlation degree of the output vector with the largest correlation degree is larger than a correlation degree threshold value, taking the output text corresponding to the output vector with the largest correlation degree as an output result; if no output vector greater than the correlation threshold exists, repeating the steps S40-S60 or repeating the steps S30-S60 after adjusting the value of M until an output text meeting the correlation requirement is obtained or the maximum cycle number is exceeded; if the maximum number of the loops is exceeded, taking an output text corresponding to the output vector with the highest correlation degree in the previous loops as an output result;
the method further comprises the step of optimizing the promt after repeatedly executing the step S40, specifically:
step 1, summarizing N output texts obtained in the previous cycle by using LLM to obtain N summarized texts;
step 2, merging the prompt with the obtained N summary texts to obtain N merged texts;
step 3, carrying out relevance analysis on the N combined texts and the problems of the user, and taking the combined text with the highest relevance as a target text;
and step 4, submitting the target text to the LLM model that generated the target text for analysis, generating a new prompt to replace the original prompt, thereby realizing the optimization of the prompt.
2. The method for realizing accurate output of a knowledge base by utilizing LLM according to claim 1, wherein the N LLM models are all invoked by means of API calls.
3. The method for realizing accurate output of a knowledge base by utilizing LLM according to claim 1, wherein in the step of vectorizing knowledge in the knowledge base and obtaining a problem of a user and vectorizing, the vectorizing method is to use Word2Vec to process knowledge text in the knowledge base or problem text of the user as a vector.
4. The method for realizing accurate output of a knowledge base by utilizing LLM according to claim 1, wherein the step of matching the knowledge vector data set with the problem vector to obtain the M knowledge vectors with the highest matching degree specifically comprises: calculating the similarity between the problem vector and each knowledge vector in the knowledge vector data set, and selecting the M knowledge vectors with the highest matching degree as the problem matching result, wherein the similarity calculation method is cosine similarity.
5. The method for realizing accurate output of knowledge base by utilizing LLM according to claim 1, wherein the step of performing text processing with the obtained M knowledge vectors to obtain corresponding question text as a prompt specifically comprises:
mapping the obtained M knowledge vectors into natural language and converting the natural language into text expression;
and splicing the texts obtained by text expression into a prompt sequence.
6. The method for realizing accurate output of a knowledge base by using LLM according to claim 1, wherein m=5; n=3.
7. A computer readable storage medium, wherein program instructions are stored in the computer readable storage medium, and when the program instructions are executed, the program instructions are configured to perform a method for implementing accurate output of a knowledge base using LLM according to any one of claims 1-6.
8. A system for implementing knowledge base precision output using LLM, comprising the computer readable storage medium of claim 7.