CN118210908A - Retrieval enhancement method and device, electronic equipment and storage medium - Google Patents

Retrieval enhancement method and device, electronic equipment and storage medium

Info

Publication number
CN118210908A
CN118210908A
Authority
CN
China
Prior art keywords
text
corpus
index
block
text block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410627966.1A
Other languages
Chinese (zh)
Other versions
CN118210908B (en)
Inventor
孔晶
左智
黄杰
Current Assignee
Shanghai Puhua Science And Technology Development Co ltd
Original Assignee
Shanghai Puhua Science And Technology Development Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Puhua Science And Technology Development Co ltd
Priority to CN202410627966.1A
Publication of CN118210908A
Application granted
Publication of CN118210908B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a retrieval enhancement method and device, electronic equipment, and a storage medium. Corpus is extracted from files in a knowledge base, and key information in the files is extracted and metadata identified; the extracted corpus is cut into text blocks; the text blocks are vectorized to obtain a low-dimensional vector index, the metadata is merged with the low-dimensional vector index to generate a multi-dimensional vector index, and the multi-dimensional vector index is stored in a vector database; after the user question is vectorized, the vector database is searched to determine a plurality of candidate text corpora; preset processing is performed on the candidate text corpora and the user question, and a prompt is then generated in combination with historical dialogue text, where the preset processing includes extracting high-weight candidate text corpora through a fusion ranking algorithm and compressing irrelevant content; the prompt is input into a large language model to output query content, and a search result is generated after inductive reasoning over the query content. The method can improve the accuracy of knowledge search.

Description

Retrieval enhancement method and device, electronic equipment and storage medium
Technical Field
Embodiments of the invention relate to the technical field of knowledge question answering, and in particular to a retrieval enhancement method and device, electronic equipment, and a storage medium.
Background
The engineering industry has knowledge files of various types with complex content and high specialization, including: industry specifications, enterprise specifications, design specifications, reports, work correspondence, quality and safety technical data, historical project data, and the like.
A traditional knowledge base generally uses a large language model for question search. Through large-scale pre-training, such a model learns patterns and features in text data, has strong text generation and understanding capability, and performs well on natural language processing tasks. However, enterprises often build their own knowledge bases in vertical fields such as engineering consulting, engineering design, and engineering management, storing accumulated knowledge, technical data, historical project data, and the like. Because a general large language model has not been trained on the vertical field, it lacks grounding when answering and cannot answer accurately based on the knowledge base.
Disclosure of Invention
The invention provides a retrieval enhancement method and device, electronic equipment, and a storage medium, which solve the problem in the prior art that searching with a large language model cannot produce accurate answers based on a knowledge base.
According to an aspect of the present invention, there is provided a retrieval enhancement method including:
extracting corpus from the files in the knowledge base through a data connector, and extracting key information in the files and identifying metadata through a metadata identifier;
cutting the extracted corpus into text through a chunk splitter to obtain at least one text block;
performing vectorization on the at least one text block through a vector computation library to obtain a low-dimensional vector index, merging the metadata with the low-dimensional vector index to generate a multi-dimensional vector index, and storing the multi-dimensional vector index in a vector database;
after vectorizing the user question, searching in the vector database through a multi-dimensional index manager to determine a plurality of candidate text corpora;
performing preset processing on the candidate text corpora and the user question through a result synthesizer and then generating a prompt in combination with historical dialogue text, wherein the preset processing includes extracting high-weight candidate text corpora from the candidate text corpora through a fusion ranking algorithm and compressing irrelevant content;
and inputting the prompt into a large language model through the result synthesizer to output query content, and generating a search result after inductive reasoning over the query content.
According to another aspect of the present invention, there is provided a retrieval enhancement device comprising:
an extraction module, configured to extract corpus from the files in the knowledge base through the data connector, and to extract key information in the files and identify metadata through the metadata identifier;
a cutting module, configured to cut the extracted corpus into text through the chunk splitter to obtain at least one text block;
a merging module, configured to perform vectorization on the at least one text block through the vector computation library to obtain a low-dimensional vector index, merge the metadata with the low-dimensional vector index to generate a multi-dimensional vector index, and store the multi-dimensional vector index in the vector database;
a search module, configured to search in the vector database through the multi-dimensional index manager after vectorizing the user question, to determine a plurality of candidate text corpora;
a processing module, configured to generate a prompt in combination with historical dialogue text after performing preset processing on the candidate text corpora and the user question through the result synthesizer, wherein the preset processing includes extracting high-weight candidate text corpora from the candidate text corpora through the fusion ranking algorithm and compressing irrelevant content;
and a generation module, configured to input the prompt into the large language model through the result synthesizer to output query content, and to generate a search result after inductive reasoning over the query content.
According to another aspect of the present invention, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor;
Wherein the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the retrieval enhancement method of any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to execute a retrieval enhancement method according to any embodiment of the present invention.
The technical scheme of the embodiments combines a large language model with knowledge retrieval, solving the prior-art problem that directly querying a general large language model yields low-accuracy results because the model has not been trained on the engineering field, and achieving the beneficial effect of improving the accuracy of knowledge retrieval.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a retrieval enhancement method according to a first embodiment of the present invention;
Fig. 2 is a flow chart of a retrieval enhancement method according to a second embodiment of the present invention;
Fig. 3 is a schematic diagram of a text chunking method according to a second embodiment of the present invention;
Fig. 4 is a flow chart of a retrieval enhancement method according to a third embodiment of the present invention;
Fig. 5 is an overall operation flowchart of a retrieval enhancement method provided by an exemplary embodiment of the present invention;
Fig. 6 is a detailed schematic diagram of a first part of a retrieval enhancement method according to an exemplary embodiment of the present invention;
Fig. 7 is a detailed schematic diagram of a second part of a retrieval enhancement method according to an exemplary embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a retrieval enhancement device according to a fourth embodiment of the present invention;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention. It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit illustrated steps. The scope of the invention is not limited in this respect.
The term "including" and variations thereof as used herein are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that references to "a" and "a plurality" in this disclosure are illustrative rather than limiting, and those skilled in the art will appreciate that they should be construed as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the devices in the embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Embodiment 1
Fig. 1 is a flow chart of a retrieval enhancement method according to a first embodiment of the present invention. The method is applicable to question-answer search over a knowledge base in the engineering industry, and may be performed by a retrieval enhancement device, which may be implemented in software and/or hardware and is generally integrated on an electronic device. In this embodiment, the electronic device includes but is not limited to a computer device. As shown in fig. 1, the retrieval enhancement method provided by the first embodiment of the present invention includes the following steps:
S110, corpus is extracted from the files in the knowledge base through the data connector, and key information in the files is extracted and metadata identified through the metadata identifier.
The knowledge base may be an enterprise engineering private knowledge base. Metadata is data describing key information attributes.
In this embodiment, the data connector may load files in the enterprise engineering private knowledge base and extract text corpus with different cleaning methods according to the file type; the metadata identifier may extract key information from the knowledge base corpus using entity recognition and relation extraction techniques, where the key information may include project name, project address, project type, design unit, file type, creator, process model, specialty, file properties, and the like.
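As a non-authoritative illustration of the metadata identifier's role, the sketch below substitutes simple regular-expression rules for the entity recognition and relation extraction techniques named above; the field names and the `extract_metadata` helper are hypothetical and not part of the patent.

```python
import re

# Hypothetical patterns: a real metadata identifier would use an NER /
# relation-extraction model rather than fixed "Field: value" regexes.
METADATA_PATTERNS = {
    "project_name": r"Project Name[:：]\s*(\S+)",
    "project_type": r"Project Type[:：]\s*(\S+)",
    "design_unit":  r"Design Unit[:：]\s*(\S+)",
    "file_type":    r"File Type[:：]\s*(\S+)",
}

def extract_metadata(text: str) -> dict:
    """Extract key-information attributes (metadata) from a file's text."""
    meta = {}
    for field, pattern in METADATA_PATTERNS.items():
        m = re.search(pattern, text)
        if m:
            meta[field] = m.group(1)
    return meta
```

The returned dictionary plays the role of the metadata tags that are later merged into the multi-dimensional vector index.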
S120, the extracted corpus is cut into text through the chunk splitter to obtain at least one text block.
The chunk splitter cuts the text corpus appropriately, ensuring the accuracy of semantic search while keeping the text blocks as small as possible to reduce interfering information.
In this embodiment, a text chunking method may be used to convert large files into small text blocks that are easy to search. The tree structure obtained after cutting comprises pages, paragraphs, and sentences: one page can be split into a plurality of paragraphs, and one paragraph can be split into a plurality of sentences. The page is the parent node of its paragraphs, and the paragraph is the parent node of its sentences; all pages may form a page text block set, all paragraphs a paragraph text block set, and all sentences a sentence text block set.
S130, vectorization is performed on the at least one text block through the vector computation library to obtain a low-dimensional vector index, the metadata is merged with the low-dimensional vector index to generate a multi-dimensional vector index, and the multi-dimensional vector index is stored in the vector database.
In this embodiment, the multi-dimensional index manager may send the text blocks to the vector computation library; the vector computation library may call the Embedding model to compute text vectors, obtain a low-dimensional vector index, and return it to the multi-dimensional index manager, which then constructs index data containing multi-dimensional information, i.e., the multi-dimensional vector index. The establishment of the multi-dimensional vector index comprises: identifying the metadata attributes of each file through the metadata identifier; extracting key information from file titles and abstracts using entity recognition and relation extraction techniques, such as metadata tags for project name, project address, project type, design unit, file type, creator, process model, specialty, and file properties, and storing the extracted metadata in the text block metadata; calling the Embedding model to perform text word-vector embedding on the appropriately cut three-level text block set, vectorizing the condensed abstract when the text block type is a page or paragraph (since the text content is large) and the text content itself when the text block type is a sentence; and integrating the text block content, the index relations, the vector value of the abstract or sentence, and the metadata tag information to form multi-dimensional vector index data, which is stored in the multi-dimensional vector library of the vector database.
The vector computation library vectorizes input text, converting it from a high-order vector space into a low-dimensional index representation. An Embedding model may be adopted that vectorizes Chinese well and, being small, can run on a low-cost graphics card.
It is understood that the multi-dimensional index manager interfaces with the vector database: it can store the constructed multi-dimensional vector index data in the vector database and can also perform vector lookup from it.
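The merge of metadata with the low-dimensional vector index can be pictured with the minimal sketch below. The toy hash-based `embed` function merely stands in for the Embedding model, and all record field names are illustrative assumptions rather than the patent's actual schema.

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy stand-in for the Embedding model: hash tokens into a small
    vector and L2-normalise it (a real system calls a trained model)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_multidim_index(block: dict, metadata: dict) -> dict:
    """Merge a text block's low-dimensional vector with its metadata tags
    into one multi-dimensional vector index record."""
    # Pages and paragraphs embed their condensed abstract; sentences embed
    # the text content itself, as the embodiment describes.
    source = block["abstract"] if block["type"] in ("PAGE", "PARA") else block["content"]
    return {
        "index": block["index"],
        "type": block["type"],
        "vector": embed(source),
        "content": block["content"],
        "metadata": metadata,  # project name, file type, creator, ...
    }
```

Each such record is what the multi-dimensional index manager would store in the vector database.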
S140, after the user question is vectorized, searching is performed in the vector database through the multi-dimensional index manager to determine a plurality of candidate text corpora.
In this embodiment, the task scheduler may obtain the user question through the question-answer interaction interface and vectorize it, and the multi-dimensional index manager performs an appropriate search in the vector database according to the content of the question to find a plurality of candidate text corpora that match its semantics.
User questions may be trained on in advance for the engineering field to obtain common question patterns, such as instruction extraction and whether the question is compound; the trained instruction set can then be used to identify the pattern of a user question. When a user question is identified as compound, it can be decomposed into a plurality of sub-queries, each of which is vectorized and searched separately.
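The decomposition of a compound question into sub-queries might look like the following sketch, which replaces the trained instruction set with hard-coded conjunction markers; both the markers and the `decompose_question` name are assumptions made only for illustration.

```python
import re

# Hypothetical markers standing in for the trained question-pattern
# instruction set described in the embodiment.
COMPOUND_MARKERS = re.compile(r"\b(and|versus|compared with)\b")

def decompose_question(question: str) -> list[str]:
    """Split a compound user question into sub-queries; each sub-query is
    then vectorized and searched separately."""
    if not COMPOUND_MARKERS.search(question):
        return [question]
    parts = COMPOUND_MARKERS.split(question)
    # re.split with a capturing group keeps the markers at odd positions; drop them.
    return [p.strip() for i, p in enumerate(parts) if i % 2 == 0 and p.strip()]
```

A production system would classify the question first and only decompose when the compound pattern is recognised.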
S150, preset processing is performed on the candidate text corpora and the user question through the result synthesizer, and a prompt is then generated in combination with the historical dialogue text, where the preset processing includes extracting high-weight candidate text corpora from the candidate text corpora through a fusion ranking algorithm and compressing irrelevant content.
The preset processing may include screening, sorting, and aggregation. A user question is understood as the question the user wants to query.
In this embodiment, the result synthesizer may screen, sort, and aggregate the candidate text corpora against the user question through the fusion ranking algorithm, remove low-weight candidate text corpora, keep high-weight ones, and compress irrelevant content to obtain the final text corpus.
In this embodiment, the result synthesizer may construct the prompt required by the LLM through a Prompt template, combining the final text corpus and the historical dialogue text.
Different Prompt templates may be used according to the category of the user question, including a regular-question template, a compound-question template, and a question template combined with historical text.
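The three template categories could be organised roughly as below; the template wording and the `build_prompt` helper are invented for illustration and do not reproduce the patent's actual Prompt templates.

```python
# Hypothetical templates for the three categories named above: a regular
# question, a compound question, and a question combined with history.
TEMPLATES = {
    "regular": ("Answer using only the context below.\n"
                "Context:\n{context}\nQuestion: {question}"),
    "compound": ("Answer each sub-question using the context below.\n"
                 "Context:\n{context}\nSub-questions: {question}"),
    "with_history": ("Previous dialogue:\n{history}\n"
                     "Context:\n{context}\nQuestion: {question}"),
}

def build_prompt(category: str, question: str, context: str, history: str = "") -> str:
    """Select the template matching the question category and fill it in."""
    # str.format ignores unused keyword arguments, so templates without a
    # {history} slot still work with the same call signature.
    return TEMPLATES[category].format(context=context, question=question,
                                      history=history)
```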
S160, the prompt is input into the large language model through the result synthesizer to output query content, and a search result is generated after inductive reasoning over the query content.
In this embodiment, the result synthesizer may input the prompt generated in step S150 into the LLM for querying, so that the large language model answers according to the prompt and outputs a query result; the result synthesizer may further induce and reason over the query content to generate condensed information that is convenient for the user to read.
When the user question is compound, a plurality of LLM submissions may be initiated to obtain a plurality of returned results; if the user question involves comparative reasoning, the returned results are merged and submitted to the LLM again to obtain a final merged result, which is pushed to the user.
According to the retrieval enhancement method provided by this embodiment, corpus is first extracted from files in the knowledge base through the data connector, and key information is extracted and metadata identified through the metadata identifier; secondly, the extracted corpus is cut through the chunk splitter to obtain at least one text block; the at least one text block is vectorized through the vector computation library to obtain a low-dimensional vector index, the metadata is merged with the low-dimensional vector index to generate a multi-dimensional vector index, and the multi-dimensional vector index is stored in the vector database; then, the user question is vectorized and searched in the vector database through the multi-dimensional index manager to determine a plurality of candidate text corpora; next, preset processing is performed on the candidate text corpora and the user question through the result synthesizer, and a prompt is generated in combination with the historical dialogue text, where the preset processing includes extracting high-weight candidate text corpora through the fusion ranking algorithm and compressing irrelevant content; finally, the prompt is input into the large language model through the result synthesizer to output query content, and a search result is generated after inductive reasoning. The method involves no training of a large language model, does not consume large numbers of tokens of a public large language model, and is therefore low-cost; when a new knowledge base is ingested, the large language model need not be retrained, and the searched data always stays up to date.
Embodiment 2
Fig. 2 is a flow chart of a retrieval enhancement method according to a second embodiment of the present invention, which is optimized on the basis of the above embodiment. For details not described here, refer to the first embodiment.
As shown in fig. 2, a search enhancement method provided in a second embodiment of the present invention includes the following steps:
S210, corpus is extracted from the files in the knowledge base through the data connector, and key information in the files is extracted and metadata identified through the metadata identifier.
S220, the corpus extracted from each file is acquired.
S230, for the corpus extracted from each file, text is extracted by page number to form a page text block set; for each page text block in the page text block set, natural paragraphs are extracted to form a paragraph text block set; and for each paragraph text block in the paragraph text block set, sentences are segmented using punctuation marks as separators to form a sentence text block set.
The page text block set, the paragraph text block set, and the sentence text block set all comprise the following contents:
index number; text block type; the file from which the text block originates; the page number of the file where the text block is located; the index of the text block preceding the current text block; the index of the text block following the current text block; the parent text block index of the current text block; the child text block indexes of the current text block; text content; and a condensed content abstract.
Fig. 3 is a schematic diagram of a text chunking method according to the second embodiment of the present invention. As shown in fig. 3, after cutting, a page text block serves as the root node of the tree structure, and a paragraph text block serves both as a child node of the page text block and as the parent node of its sentence text blocks.
The text chunking method comprises the following steps:
Step one, text information in each file of the knowledge base is read through the data connector;
Step two, text is extracted by page number to form the page text block set S1;
Step three, S1 is traversed in a loop, and natural paragraphs are extracted according to the "\n\n" mark and paragraph beginnings to form the paragraph text block set S2;
Step four, S2 is traversed in a loop, and sentences are segmented using punctuation marks (such as periods, question marks, exclamation marks, and semicolons) as separators to form the sentence text block set S3.
Each text block in the above sets includes:
INDEX: index number, for quickly locating the text block;
TYPE: text block type, page PAGE / paragraph PARA / sentence SENT;
FILE: the file from which the text block originates;
PAGE: the page number of the file where the text block is located; when a block spans pages, the starting page number is recorded;
PREVIOUS: the index of the text block preceding the current text block;
NEXT: the index of the text block following the current text block;
PARENT: the parent text block index of the current text block;
CHILD: the child text block indexes of the current text block;
ABSTRACT: when TYPE=PAGE/PARA, the LLM is called to condense a content abstract;
CONTENT: text content.
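The four chunking steps and the field list above can be sketched as follows. This is a simplified illustration: the FILE, PREVIOUS/NEXT, and ABSTRACT fields are omitted for brevity, and the function name is hypothetical.

```python
import re

def chunk_text(pages: list[str]) -> list[dict]:
    """Split page texts into the three-level PAGE/PARA/SENT text block
    sets, recording the PARENT/CHILD index relations."""
    blocks = []

    def add(type_, content, parent, page_no):
        idx = len(blocks)
        blocks.append({"INDEX": idx, "TYPE": type_, "CONTENT": content,
                       "PAGE": page_no, "PARENT": parent, "CHILD": []})
        if parent is not None:
            blocks[parent]["CHILD"].append(idx)
        return idx

    for page_no, page in enumerate(pages, start=1):
        p_idx = add("PAGE", page, None, page_no)
        # Step three: natural paragraphs split on the "\n\n" mark.
        for para in (p.strip() for p in page.split("\n\n") if p.strip()):
            para_idx = add("PARA", para, p_idx, page_no)
            # Step four: sentences split on period/question/exclamation/semicolon.
            for sent in re.split(r"(?<=[.!?;])\s+", para):
                if sent:
                    add("SENT", sent, para_idx, page_no)
    return blocks
```

The resulting flat list with PARENT/CHILD indexes is the tree of fig. 3: pages at the root, paragraphs below them, sentences as leaves.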
S240, vectorization is performed on the at least one text block through the vector computation library to obtain a low-dimensional vector index, the metadata is merged with the low-dimensional vector index to generate a multi-dimensional vector index, and the multi-dimensional vector index is stored in the vector database.
S250, after the user question is vectorized, searching is performed in the vector database through the multi-dimensional index manager to determine a plurality of candidate text corpora.
S260, preset processing is performed on the candidate text corpora, the query question, and the historical dialogue text through the result synthesizer to generate a prompt, where the preset processing includes extracting high-weight candidate text corpora from the candidate text corpora through the fusion ranking algorithm and compressing irrelevant content.
S270, the prompt is input into the large language model to output query content, and a search result is generated after inductive reasoning over the query content.
The text cutting mode used in the retrieval enhancement method of this embodiment converts large files into small text blocks that are convenient to search, effectively improving query efficiency.
Embodiment 3
Fig. 4 is a flow chart of a retrieval enhancement method according to a third embodiment of the present invention, which is optimized on the basis of the above embodiments. For details not described here, refer to the first and second embodiments.
As shown in fig. 4, a retrieval enhancement method provided in a third embodiment of the present invention includes the following steps:
S310, corpus is extracted from the files in the knowledge base through the data connector, and key information in the files is extracted and metadata identified through the metadata identifier.
S320, the extracted corpus is cut into text through the chunk splitter to obtain at least one text block.
S330, vectorization is performed on the at least one text block through the vector computation library to obtain a low-dimensional vector index, the metadata is merged with the low-dimensional vector index to generate a multi-dimensional vector index, and the multi-dimensional vector index is stored in the vector database.
S340, after the user question is vectorized, searching is performed in the vector database through the multi-dimensional index manager to determine a plurality of candidate text corpora.
S350, according to the query task dispatched by the task scheduler, the multi-dimensional index manager is called to query the vector database and obtain a plurality of vector retrieval results.
The task scheduler may dispatch query tasks to the result synthesizer; after receiving a query task, the result synthesizer may call the multi-dimensional index manager to query the vector database and obtain a plurality of vector retrieval results.
S360, high-weight candidate text corpora are extracted from the candidate text corpora through the fusion ranking algorithm according to the user question, and irrelevant content is compressed to obtain the final text corpus.
Specifically, in a vector database, calculating vector similarity through cosine similarity, and forming a result set by the first n multidimensional vectors with highest similarity; newly adding a weight attribute in the result set, and putting a similarity value into the weight attribute; performing metadata hit weighting operation and text aggregation operation on the result set; combining text contents to form a target text corpus for the first m result sets after the metadata hit weighting operation and the aggregate text operation; and after compressing the target text corpus, segmenting the text into new text blocks with preset lengths, calculating the similarity of the new text blocks, merging and deduplicating the new text blocks with high similarity, discarding the new text blocks with low similarity to the user problem, and compressing the rest text blocks to form the final text corpus.
Further, the metadata hit weighting operation includes: comparing the keywords of the user questions with metadata in the result set; if the same, the weight is added by 1.
The key words of the user questions are used for weight adjustment, so that the search relevance is improved, and the answer quality is improved.
Further, the text aggregation operation includes: taking the results of all text blocks whose type is sentence; merging the parent text block index values of these results and de-duplicating; judging the hit percentage of the child nodes of each parent text block; and if the hit percentage exceeds a preset value, extracting the text content of the parent text block and discarding the child-node text blocks.
This step merges and aggregates the extracted texts: when a plurality of highly correlated sentences are hit, the text content of their parent paragraph is taken instead, yielding more valuable information.
S370, according to the task type, combining the final text corpus and the historical conversation text, and constructing a prompt word through a template.
The first 10 historical session texts of the user questions and answers can be provided as prompt words to the large model, so that the large model can integrate and answer in combination with the historical information.
S380, inputting the prompt word into a large model through a result synthesizer to output query content, and generating a search result after induction reasoning of the query content.
The third embodiment of the invention provides a retrieval enhancement method, which embodies the process of generating the prompt word. The method uses the fusion ranking algorithm to search text, so that the query accuracy can be improved.
The embodiment of the invention provides several specific implementation modes based on the technical scheme of each embodiment.
As a specific implementation manner of the present embodiment, fig. 5 is an overall operation flowchart of a search enhancement method provided by an exemplary embodiment of the present invention, as shown in fig. 5, including the following steps:
1. Extract text corpus from the files in the enterprise engineering private knowledge base, and at the same time extract key information and identify the related metadata.
2. Properly cut the extracted corpus into text blocks, call an Embedding model to vectorize them, merge the result with the metadata to generate a multi-dimensional vector index, and store it in a vector database.
3. Obtain the user question through the question-answer interaction interface, vectorize it, and retrieve in the vector database to find a plurality of semantically matching candidate text corpora.
A) For the engineering field, user questions can be trained in advance to obtain common question patterns and extract instructions (for example, whether the question is compound); the trained instruction set can then be used to identify the pattern of a user question;
b) When the question is identified as compound, it is split into a plurality of sub-queries, each of which is searched by vector retrieval separately.
4. After necessary screening and sorting of the retrieved text corpora, take the first k corpora and fill them into a Prompt template, converting them into the prompt-word information required by the LLM model.
A) During screening, metadata filtering is considered: if the user question contains specific metadata keywords, corpora unrelated to those keywords are removed;
b) When the user selects high-precision search, keyword search and vector search are combined, and matched corpora receive a higher weight.
5. Call the LLM model to summarize the provided material and return the final result to the user.
A) When the user raises a compound question, a plurality of LLM submissions are initiated and a plurality of results are returned;
b) If the user instruction includes comparison or reasoning, the returned results are merged and submitted to the LLM again to obtain a final merged result, which is pushed to the user;
c) The first 10 session histories of the user's questions and answers are also provided again to the LLM as part of the Prompt, for an integrated answer based on past information.
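The five-step flow above can be sketched end to end. This is a hedged outline only: the " and " splitting rule is a naive stand-in for the trained instruction set, and `retrieve` / `ask_llm` are placeholder callables for the vector search and the LLM call.

```python
def split_compound_question(question):
    # Naive stand-in for the trained instruction set: treat " and " as
    # the marker of a compound query and split it into sub-queries.
    parts = [p.strip() for p in question.split(" and ")]
    return parts if len(parts) > 1 else [question]

def answer(question, retrieve, ask_llm, history=None):
    # Steps 3-5: decompose, retrieve per sub-query, summarize with the
    # LLM; if several sub-answers exist, merge them with a second call.
    history = history or []
    sub_queries = split_compound_question(question)
    partial = [ask_llm(q, retrieve(q), history) for q in sub_queries]
    if len(partial) > 1:
        return ask_llm(question, partial, history)  # merge step (5b)
    return partial[0]
```

A compound question thus triggers several LLM submissions whose answers are merged by a final call, matching steps 5a and 5b.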
Further, fig. 6 is a schematic diagram of a first part of details of a search enhancement method according to an exemplary embodiment of the present invention, as shown in fig. 6, including the following procedures:
Various files in the enterprise private knowledge base are cleaned and their corpus extracted through the data connector; the metadata identifier extracts key information from the files and identifies metadata; the block-cutting divider cuts the text to obtain text blocks; the text blocks are input into the vector calculation library for text vector calculation to obtain vector indexes; and the multidimensional index manager builds multi-dimensional vector indexes from the metadata and the vector indexes and stores them into the vector database.
The data connector is used for loading files and extracting text corpus, adopting different cleaning methods according to the file type; the metadata identifier extracts key information from the knowledge base corpus using entity recognition and relation extraction techniques; the block-cutting divider cuts the text corpus appropriately, ensuring the accuracy of semantic search while keeping the text blocks as small as possible to reduce interference; the vector calculation library vectorizes the input text, converting the high-dimensional vector space into a low-dimensional index representation; the multidimensional index manager interfaces with the vector database, constructs index data containing multi-dimensional information, and implements data storage and vector search.
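The data connector's type-based dispatch can be sketched minimally. The cleaner table and its two entries are hypothetical placeholders; a real connector would handle PDF, Word, and other formats.

```python
def extract_corpus(filename, raw):
    # Data-connector sketch: choose a cleaning routine by file extension.
    # The two cleaners below are illustrative placeholders only.
    cleaners = {
        ".txt": lambda text: text.strip(),
        ".html": lambda text: text.replace("<p>", "").replace("</p>", "\n").strip(),
    }
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    clean = cleaners.get(ext, lambda text: text)  # unknown types pass through
    return clean(raw)
```

Dispatching on file type keeps the cleaning rules independent, so new formats can be added without touching the pipeline.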
Further, fig. 7 is a second detailed schematic diagram of a retrieval enhancement method according to an exemplary embodiment of the present invention, as shown in fig. 7, including the following procedures:
After receiving a user question, the task scheduler identifies the corresponding instruction via the instruction library and arranges tasks accordingly; if the question is a compound query, it is decomposed into a plurality of sub-questions, and the multidimensional index manager searches the vector database for each of them, outputting vector-similarity text blocks. The task scheduler also judges whether to combine history: if so, the previous N history sessions are fetched to obtain the history session text. The task scheduler then performs word segmentation on the user question and passes the keywords to the result synthesizer; the result synthesizer searches the text through the fusion ranking algorithm using the vector-similarity text blocks and the keywords, constructs a prompt word from a template combined with the history session text, inputs the prompt word into the LLM large language model for text understanding, and returns the output query result to the user.
The instruction library is an industry question instruction library trained in advance, used to identify whether a user question is a compound query, whether the session context needs to be combined, whether specific keywords are contained, and whether a special operation instruction is required. The task scheduler provides the user question-answer interface, performs task allocation and scheduling according to the user question, accesses the LLM large language model, and returns the query result to the user; when the question is complex, it may create a plurality of LLM session instances that execute tasks separately, with the results finally merged. The result synthesizer is the core of query-result generation and has the following responsibilities: a) according to the query task distributed by the task scheduler, invoking the multidimensional index manager to obtain a plurality of vector retrieval results; after the query question is vectorized, the vector index and metadata are jointly used for screening, and results below a certain threshold are discarded; b) discarding results below the threshold through the fusion ranking algorithm, taking the first few high-weight query results, and compressing irrelevant content; c) constructing the prompt word required by the LLM model through a Prompt template according to the task type; d) calling the LLM large language model to summarize and reason over the queried text, generating condensed information suitable for the user to read.
Training of the engineering-field instruction library: user questions are trained in advance so that question patterns can be identified and user intent understood:
A. Extracting user problems in a knowledge base search log, cleaning to remove repeated and nonsensical problems and obtaining a user problem set Q;
B. Manually labeling the question set Q, identifying the nature of each question: whether it is a compound query, a comparison query, a keyword query, or a multi-task query;
C. Applying stop-word removal (Stopwords Removal), stemming (Stemming), lemmatization (Lemmatization) and other processing to the set Q, reducing the dimensionality of the feature space;
D. Performing vectorization preprocessing on the set Q using the Word2Vec method, with 70% of the set taken as the training set and 30% as the test set;
E. Using the Scikit-learn library with the BernoulliNB (Bernoulli naive Bayes) classifier to perform classification learning;
F. Performing cross-validation with the test set and adjusting the smoothing parameter to obtain the optimal user question instruction classification model.
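Steps A through F can be illustrated with a minimal hand-rolled Bernoulli naive Bayes classifier. This is a self-contained stand-in for scikit-learn's BernoulliNB; the toy questions and labels are invented, and the Word2Vec vectorization and cross-validation steps are omitted for brevity.

```python
import math
from collections import defaultdict

class TinyBernoulliNB:
    # Minimal Bernoulli naive Bayes over binary word-presence features,
    # standing in for scikit-learn's BernoulliNB in step E.
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing parameter (tuned in step F)

    def fit(self, docs, labels):
        self.vocab = sorted({w for d in docs for w in d.split()})
        self.classes = sorted(set(labels))
        counts = {c: defaultdict(int) for c in self.classes}
        totals = {c: 0 for c in self.classes}
        for d, y in zip(docs, labels):
            totals[y] += 1
            for w in set(d.split()):
                counts[y][w] += 1
        n = len(docs)
        self.log_prior = {c: math.log(totals[c] / n) for c in self.classes}
        self.log_p, self.log_q = {}, {}
        for c in self.classes:
            for w in self.vocab:
                p = (counts[c][w] + self.alpha) / (totals[c] + 2 * self.alpha)
                self.log_p[(c, w)] = math.log(p)      # word present
                self.log_q[(c, w)] = math.log(1 - p)  # word absent
        return self

    def predict(self, doc):
        present = set(doc.split())
        scores = {}
        for c in self.classes:
            s = self.log_prior[c]
            for w in self.vocab:
                s += self.log_p[(c, w)] if w in present else self.log_q[(c, w)]
            scores[c] = s
        return max(scores, key=scores.get)

# Toy labelled questions (invented) for the compound-vs-simple pattern.
clf = TinyBernoulliNB().fit(
    ["compare cost and duration", "what is the cost",
     "compare A and B", "what is duration"],
    ["compound", "simple", "compound", "simple"],
)
```

In practice scikit-learn's `BernoulliNB` would replace this class, with the smoothing parameter `alpha` tuned by cross-validation as step F describes.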
The text searching method by the fusion ranking algorithm comprises the following steps:
A. Obtaining the user query question and calling an Embedding model to vectorize it;
B. Calculating vector similarity through cosine similarity (Cosine Similarity) in a vector database, and discarding a text block set with similarity lower than a threshold parameter K; taking the first n multidimensional vector records with highest similarity to form a result set R;
C. Adding a WEIGHT attribute in R and putting the similarity value into the WEIGHT attribute;
D. Performing a METADATA hit weighting operation on R: taking the keywords of the user question after word segmentation and comparing them with the METADATA in the result set R; for each METADATA hit, WEIGHT = WEIGHT + 1;
E. Performing a text aggregation operation on R: taking all results with TYPE = SENTENCE; merging their PARENT index values and de-duplicating; then judging the hit percentage of the CHILD nodes of each PARENT, and if more than 50% are hit, taking the CONTENT of the PARENT block and discarding the CHILD text blocks;
F. Taking the first m results after weighted sorting and aggregation, and combining their CONTENT text to form corpus valuable for the question;
G. Finally, to avoid submitting redundant text to the LLM, which would increase cost and slow down operation, a compression operation is performed on the final corpus to remove redundant and low-correlation content: the merged text is cut again into blocks of a specified length L; the similarity of the new blocks is calculated, and text blocks with high mutual similarity are merged and de-duplicated; text blocks whose similarity to the question is below the threshold C are also discarded; the remainder is compressed to form the corpus M.
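Steps A through F above can be sketched as follows. This is a simplified illustration: the parent-block aggregation of step E and the compression of step G are omitted, and the record layout (`vector`, `metadata`, `content`) is an assumed rendering of the multi-dimensional index records.

```python
import math

def cosine(a, b):
    # Cosine Similarity used in step B.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def fusion_rank(query_vec, keywords, records, k=0.3, n=5, m=2):
    hits = []
    for rec in records:
        sim = cosine(query_vec, rec["vector"])
        if sim < k:                          # step B: discard below threshold K
            continue
        hits.append({**rec, "weight": sim})  # step C: WEIGHT attribute
    hits.sort(key=lambda r: -r["weight"])
    hits = hits[:n]                          # step B: keep the top-n records
    for rec in hits:                         # step D: metadata hit weighting
        values = {str(v) for v in rec.get("metadata", {}).values()}
        rec["weight"] += sum(1 for kw in keywords if kw in values)
    hits.sort(key=lambda r: -r["weight"])
    return " ".join(rec["content"] for rec in hits[:m])  # step F: merge CONTENT
```

A record whose metadata matches a query keyword gains a full unit of weight, so metadata hits can outrank a slightly higher raw cosine similarity.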
The details of generating the Prompt and calling the large language model are as follows:
A. The text corpus is supplied to the LLM large language model in prompt-word form, constraining the model to summarize only on the basis of the supplied corpus and not to extrapolate, so as to avoid hallucinated answers;
B. According to the recognized category of the user question, the prompt words are assembled using different Prompt templates:
a) Conventional question template:
There is the following question:
"$user question"
Please answer based only on the text; if the text does not mention it, answer "No answer for now" directly.
Text:
"$corpus M"
b) Combined question template:
There is the following question:
"$user question"
Please answer the question based only on the plurality of texts provided below; if the texts do not mention it, answer "No answer for now" directly.
Text 1:
"$corpus M1"
Text 2:
"$corpus M2"
c) Question template combined with history:
There is the following question:
"$user question"
Please answer the question based only on the texts and the user history session provided below; if the texts and history session do not mention it, answer "No answer for now" directly.
Text 1:
"$corpus M1"
Text 2:
"$corpus M2"
History session:
"$history session".
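The three templates can be rendered by a small helper. The English wording below is an illustrative rendering of the templates above, and `build_prompt` is a hypothetical name.

```python
def build_prompt(question, corpora, history=None):
    # Selects among the conventional, combined, and history templates
    # depending on how many corpora and whether history is supplied.
    parts = ["There is the following question:", f'"{question}"']
    scope = "the text" if len(corpora) == 1 else "the texts"
    if history:
        scope += " and the user history session"
    parts.append(f"Please answer based only on {scope} provided below; "
                 'if not mentioned there, answer "No answer for now" directly.')
    if len(corpora) == 1:
        parts += ["Text:", f'"{corpora[0]}"']
    else:
        for i, m in enumerate(corpora, 1):
            parts += [f"Text {i}:", f'"{m}"']
    if history:
        parts += ["History session:", '"' + "\n".join(history) + '"']
    return "\n".join(parts)
```

Keeping the restriction sentence in every variant enforces the "summarize only, do not extrapolate" constraint from step A.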
Example IV
Fig. 8 is a schematic structural diagram of a search enhancement device according to a fourth embodiment of the present invention, where the device is applicable to knowledge base search questions and answers in engineering industry, and the device may be implemented by software and/or hardware and is generally integrated on an electronic device.
As shown in fig. 8, the apparatus includes an extraction module 110, a cutting module 120, a merging module 130, a searching module 140, a processing module 150, and a generating module 160.
The extraction module 110 is configured to perform corpus extraction on the files in the knowledge base through the data connector, extract key information in the files through the metadata identifier, and identify metadata;
A cutting module 120, configured to perform text cutting on the extracted corpus by using a dicer to obtain at least one text block;
The merging module 130 is configured to perform vectorization processing on the at least one text block through a vector computation library to obtain a low-dimensional vector index, merge the low-dimensional vector index with the metadata to generate a multi-dimensional vector index, and store the multi-dimensional vector index into a vector database;
The searching module 140 is configured to search the vector database through the multidimensional index manager after performing vectorization processing on the user problem to determine a plurality of candidate text corpora;
The processing module 150 is configured to perform preset processing on the candidate text corpus and the user problem through a result synthesizer, and then generate a prompt word in combination with the history dialogue text, where the preset processing includes extracting a candidate text corpus with high weight from the candidate text corpus through a fusion ranking algorithm, and compressing irrelevant contents;
And the generating module 160 is used for inputting the prompt words into the big model through the result synthesizer to output query contents, and generating search results after summarizing and reasoning the query contents.
In this embodiment, the extraction module 110 performs corpus extraction on the files in the knowledge base through the data connector, extracts key information in the files through the metadata identifier, and identifies metadata; secondly, the cutting module 120 performs text cutting on the extracted corpus through a block cutting divider to obtain at least one text block; the merging module 130 performs vectorization processing on the at least one text block through a vector calculation library to obtain a low-dimensional vector index, merges the low-dimensional vector index with the metadata to generate a multi-dimensional vector index, and stores the multi-dimensional vector index into a vector database; then the searching module 140 performs vectorization processing on the user problem and searches in the vector database through the multidimensional index manager to determine a plurality of candidate text corpora; the processing module 150 performs preset processing on the candidate text corpus and the user problem through a result synthesizer, and then generates a prompt word by combining with the history conversation text, wherein the preset processing comprises extracting the candidate text corpus with high weight from the candidate text corpus through a fusion ranking algorithm, and compressing irrelevant contents; and finally, inputting the prompt word into a large model through a result synthesizer through a generation module 160 to output query contents, and generating a search result after induction and reasoning of the query contents.
The embodiment provides a retrieval enhancement device, which can improve the accuracy of knowledge searching.
Further, the cutting module 120 includes:
the acquisition unit is used for acquiring the corpus extracted from each file;
the first extraction unit is used for extracting texts according to page numbers aiming at the corpus extracted from each file to form a page text block set;
The second extraction unit is used for extracting natural segments to form a paragraph text block set aiming at each page text block in the page text block set;
And the segmentation unit is used for carrying out sentence segmentation according to punctuation marks as sentence separators for each paragraph text block in the paragraph text block set to form a sentence text block set.
On the basis of the above optimization, the page text block set, the paragraph text block set and the sentence text block set each comprise the following contents:
Index number; text block type; which file the text block originates from; page number of the file where the text block is located; the previous text block index of the current text block; the following text block index of the current text block; the parent text block index of the current text block; the sub-text block indexes of the current text block; text content; and a condensed content abstract.
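The attribute list above can be rendered as a record type. The field names are an illustrative mapping of the listed contents, not identifiers from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TextBlock:
    # One entry of the page / paragraph / sentence text block sets.
    index: int                          # index number
    block_type: str                     # text block type: page, paragraph, sentence
    source_file: str                    # which file the text block originates from
    page: int                           # page number of the file where the block is
    prev_index: Optional[int] = None    # previous text block index
    next_index: Optional[int] = None    # following text block index
    parent_index: Optional[int] = None  # parent text block index
    child_indexes: List[int] = field(default_factory=list)  # sub-text block indexes
    content: str = ""                   # text content
    summary: str = ""                   # condensed content abstract
```

The parent/child index fields are what the text aggregation operation traverses when it promotes sentence hits to their parent paragraph.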
Further, the processing module 150 includes:
The acquisition unit is used for calling the multidimensional index manager to acquire a plurality of vector retrieval results from the vector database according to the query task distributed by the task scheduler;
The extraction unit is used for extracting the candidate text corpus with high weight from the candidate text corpus through a fusion ranking algorithm according to the user problem, and compressing irrelevant contents to obtain a final text corpus;
and the construction unit is used for constructing the prompt word through the template according to the task type and combining the final text corpus and the historical conversation text.
Based on the above technical scheme, the extraction unit is specifically configured to:
in a vector database, calculating vector similarity through cosine similarity, and forming a result set by the first n multidimensional vectors with highest similarity;
Newly adding a weight attribute in the result set, and putting a similarity value into the weight attribute;
Performing metadata hit weighting operation and text aggregation operation on the result set;
Combining text contents to form a target text corpus for the first m result sets after the metadata hit weighting operation and the aggregate text operation;
and after compressing the target text corpus, segmenting the text into new text blocks with preset lengths, calculating the similarity of the new text blocks, merging and deduplicating the new text blocks with high similarity, discarding the new text blocks with low similarity to the user problem, and compressing the rest text blocks to form the final text corpus.
Based on the above technical solution, the metadata hit weighting operation includes: comparing the keywords of the user question with the metadata in the result set; if a keyword matches, the weight is incremented by 1. The text aggregation operation includes: taking the results of all text blocks whose type is sentence; merging the parent text block index values of these results and de-duplicating; judging the hit percentage of the child nodes of each parent text block; and if the hit percentage exceeds a preset value, extracting the text content of the parent text block and discarding the child-node text blocks.
The retrieval enhancement device can execute the retrieval enhancement method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example five
Fig. 9 shows a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 9, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the search enhancement method.
In some embodiments, the retrieval enhancement method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more of the steps of the retrieval enhancement method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the retrieval enhancement method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system, overcoming the defects of high management difficulty and weak service expansibility found in traditional physical hosts and VPS services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A retrieval enhancement method, the method comprising:
Extracting corpus from the files in the knowledge base through the data connector, extracting key information in the files through the metadata identifier and identifying metadata;
Text cutting is carried out on the extracted corpus through a block cutting divider to obtain at least one text block;
Carrying out vectorization processing on the at least one text block through a vector calculation library to obtain a low-dimensional vector index, merging the metadata with the low-dimensional vector index to generate a multi-dimensional vector index, and storing the multi-dimensional vector index into a vector database;
After vectorization processing is carried out on the user problem, searching is carried out in the vector database through a multidimensional index manager to determine a plurality of candidate text corpora;
The method comprises the steps of carrying out preset processing on the candidate text corpus and a user problem through a result synthesizer, and then generating a prompt word by combining a history conversation text, wherein the preset processing comprises the steps of extracting the candidate text corpus with high weight from the candidate text corpus through a fusion ranking algorithm, and compressing irrelevant contents;
and inputting the prompt word into a large model through a result synthesizer to output query contents, and generating a search result after induction and reasoning of the query contents.
2. The method of claim 1, wherein text cutting the extracted corpus by a block-cutting divider to obtain at least one text block comprises:
acquiring the corpus extracted from each file;
extracting texts according to page numbers aiming at the corpus extracted from each file to form a page text block set;
extracting natural segments for each page text block in the page text block set to form a paragraph text block set;
And for each paragraph text block in the paragraph text block set, sentence segmentation is carried out according to punctuation marks as sentence separators to form a sentence text block set.
3. The method of claim 2, wherein each block in the page text block set, the paragraph text block set, and the sentence text block set comprises:
an index number; a text block type; the file from which the text block originates; the page number of the file where the text block is located; the index of the text block preceding the current text block; the index of the text block following the current text block; the index of the parent text block of the current text block; the indexes of the child text blocks of the current text block; text content; and a condensed content summary.
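The per-block attributes enumerated in claim 3 map naturally onto a simple record. The field names below are illustrative choices, not taken from the specification:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TextBlock:
    """One text block carrying the attributes listed in claim 3."""
    index: int                       # index number
    block_type: str                  # "page", "paragraph", or "sentence"
    source_file: str                 # file the block originates from
    page_number: int                 # page of the file where the block sits
    prev_index: Optional[int]        # index of the preceding text block
    next_index: Optional[int]        # index of the following text block
    parent_index: Optional[int]      # index of the parent text block
    child_indexes: List[int] = field(default_factory=list)  # child block indexes
    text: str = ""                   # text content
    summary: str = ""                # condensed content summary
```

The prev/next and parent/child indexes together give both a linear reading order and the page → paragraph → sentence tree that claim 7's aggregation step walks.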
4. The method according to claim 1, wherein generating the prompt through the result synthesizer in combination with historical conversation text after performing the preset processing on the candidate text corpora and the user question comprises:
according to the query task dispatched by the task scheduler, invoking the multi-dimensional index manager to query the vector database and obtain a plurality of vector retrieval results;
according to the user question, extracting high-weight candidate text corpora from the candidate text corpora through the fusion ranking algorithm and compressing irrelevant content to obtain the final text corpus;
and according to the task type, combining the final text corpus with the historical conversation text and constructing the prompt through a template.
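The template-based prompt construction in claim 4's last step can be sketched as below. The template wording, the task-type keys, and the field names are all assumptions for illustration; the patent does not disclose its templates.

```python
def build_prompt(task_type, final_corpus, history, question):
    """Fill a per-task-type template with the final text corpus, the
    historical conversation text, and the user question (claim 4)."""
    templates = {
        "qa": ("Answer the question using only the context below.\n"
               "Context:\n{corpus}\n\nHistory:\n{history}\n\n"
               "Question: {question}"),
        "summarize": ("Summarize the context below.\n"
                      "Context:\n{corpus}\n\nHistory:\n{history}\n\n"
                      "Focus: {question}"),
    }
    template = templates[task_type]
    return template.format(corpus="\n".join(final_corpus),
                           history="\n".join(history),
                           question=question)
```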
5. The method of claim 4, wherein extracting the high-weight candidate text corpora from the candidate text corpora through the fusion ranking algorithm according to the user question and compressing irrelevant content to obtain the final text corpus comprises:
in the vector database, calculating vector similarity through cosine similarity, and forming a result set from the first n multi-dimensional vectors with the highest similarity;
adding a new weight attribute to the result set, and placing the similarity value in the weight attribute;
performing a metadata hit weighting operation and a text aggregation operation on the result set;
for the first m results after the metadata hit weighting operation and the text aggregation operation, merging their text content to form a target text corpus;
and after compressing the target text corpus, segmenting the text into new text blocks of a preset length, calculating the similarity of the new text blocks, merging and deduplicating new text blocks with high similarity, discarding new text blocks with low similarity to the user question, and compressing the remaining text blocks to form the final text corpus.
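The first two steps of claim 5 (cosine-similarity scoring and seeding the new weight attribute) can be sketched as follows. A production system would use a vector database's native search rather than this brute-force scan; the dictionary layout is an assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_n(query_vec, indexed_blocks, n):
    """Score each stored vector against the query and keep the n most
    similar, storing the similarity in a new `weight` attribute, as in
    the first two steps of claim 5."""
    scored = []
    for block in indexed_blocks:
        hit = dict(block)                       # copy; do not mutate the index
        hit["weight"] = cosine(query_vec, block["vector"])
        scored.append(hit)
    scored.sort(key=lambda h: h["weight"], reverse=True)
    return scored[:n]
```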
6. The method of claim 5, wherein the metadata hit weighting operation comprises:
comparing keywords of the user question with the metadata in the result set;
and if they are the same, increasing the weight by 1.
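The metadata hit weighting of claim 6 reduces to a keyword-against-metadata comparison with a fixed +1 bonus. This sketch assumes each hit carries a `metadata` list of string values and the `weight` attribute from claim 5; whether multiple matching keywords stack is an assumption here, as the claim leaves it open.

```python
def metadata_hit_weighting(question_keywords, result_set):
    """Add 1 to a hit's weight for each user-question keyword that
    equals one of the hit's metadata values (claim 6)."""
    for hit in result_set:
        for kw in question_keywords:
            if kw in hit.get("metadata", ()):
                hit["weight"] += 1
    return result_set
```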
7. The method of claim 5, wherein the text aggregation operation comprises:
taking the results of all text blocks whose type is sentence, and merging and deduplicating the parent text block index values of those results;
determining the hit percentage of the child nodes of each parent text block;
and if the hit percentage exceeds a preset value, extracting the text content of the parent text block and discarding the child node text blocks.
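The text aggregation of claim 7 (promote a paragraph when enough of its sentences were hit) can be sketched as follows. The 0.5 default threshold and the record layout are assumptions; the patent only says the hit percentage must exceed a preset value.

```python
from collections import defaultdict

def aggregate_sentences(hits, blocks_by_index, threshold=0.5):
    """Group sentence-type hits by their parent text block; when the
    fraction of a parent's child sentences that were hit exceeds the
    threshold, promote the parent's full record and drop the child
    hits (claim 7). `blocks_by_index` maps block index -> block record."""
    hit_children = defaultdict(set)
    for hit in hits:
        if hit["type"] == "sentence":
            hit_children[hit["parent_index"]].add(hit["index"])
    promoted, kept = [], []
    for hit in hits:
        if hit["type"] != "sentence":
            kept.append(hit)
            continue
        parent = blocks_by_index[hit["parent_index"]]
        ratio = len(hit_children[parent["index"]]) / len(parent["child_indexes"])
        if ratio > threshold:
            # Promote each parent at most once.
            if parent["index"] not in {p["index"] for p in promoted}:
                promoted.append(parent)
        else:
            kept.append(hit)
    return promoted + kept
```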
8. A retrieval enhancement device, the device comprising:
an extraction module, configured to extract corpus from the files in the knowledge base through a data connector, and to extract key information from the files and identify metadata through a metadata identifier;
a cutting module, configured to perform text cutting on the extracted corpus through a chunk splitter to obtain at least one text block;
a merging module, configured to perform vectorization on the at least one text block through a vector computation library to obtain a low-dimensional vector index, merge the metadata with the low-dimensional vector index to generate a multi-dimensional vector index, and store the multi-dimensional vector index in a vector database;
a searching module, configured to search in the vector database through a multi-dimensional index manager after vectorizing the user question, to determine a plurality of candidate text corpora;
a processing module, configured to generate a prompt in combination with historical conversation text after performing preset processing on the candidate text corpora and the user question through a result synthesizer, wherein the preset processing comprises extracting high-weight candidate text corpora from the candidate text corpora through a fusion ranking algorithm and compressing irrelevant content;
and a generation module, configured to input the prompt into a large model through the result synthesizer to output query content, and to generate a search result after inductive reasoning on the query content.
9. An electronic device, the electronic device comprising:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory stores a computer program executable by the at least one processor, to enable the at least one processor to perform the retrieval enhancement method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to implement the retrieval enhancement method of any one of claims 1-7.
CN202410627966.1A 2024-05-21 2024-05-21 Retrieval enhancement method and device, electronic equipment and storage medium Active CN118210908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410627966.1A CN118210908B (en) 2024-05-21 2024-05-21 Retrieval enhancement method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN118210908A true CN118210908A (en) 2024-06-18
CN118210908B CN118210908B (en) 2024-08-13

Family

ID=91454834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410627966.1A Active CN118210908B (en) 2024-05-21 2024-05-21 Retrieval enhancement method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118210908B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118394793A (en) * 2024-06-28 2024-07-26 浪潮电子信息产业股份有限公司 Retrieval enhancement method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210350065A1 (en) * 2020-05-08 2021-11-11 Bold Limited Systems and methods for creating enhanced documents for perfect automated parsing
CN116756295A (en) * 2023-08-16 2023-09-15 北京盛通知行教育科技集团有限公司 Knowledge base retrieval method, device and storage medium
CN117172319A (en) * 2023-09-19 2023-12-05 上海图源素数字科技有限公司 Natural resource industry knowledge base construction method and system based on large language model
CN117501375A (en) * 2021-03-31 2024-02-02 西罗纳医疗公司 System and method for artificial intelligence assisted image analysis
CN117591635A (en) * 2023-12-20 2024-02-23 沪渝人工智能研究院 Text segmentation retrieval method for large model question and answer
CN117688163A (en) * 2024-01-29 2024-03-12 杭州有赞科技有限公司 Online intelligent question-answering method and device based on instruction fine tuning and retrieval enhancement generation
CN117807199A (en) * 2023-12-13 2024-04-02 北京中科金财科技股份有限公司 Dialogue method and dialogue system based on document retrieval enhancement machine language model
CN117951274A (en) * 2024-01-29 2024-04-30 上海岩芯数智人工智能科技有限公司 RAG knowledge question-answering method and device based on fusion vector and keyword retrieval


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZENRRAN: "A Comprehensive Guide to Advanced Retrieval-Augmented Generation (RAG): Principles, Chunking, Encoding, Indexing, Fine-Tuning, Agents, and Outlook", Retrieved from the Internet <URL:https://blog.csdn.net/qq_27590277/article/details/135212045> *


Also Published As

Publication number Publication date
CN118210908B (en) 2024-08-13

Similar Documents

Publication Publication Date Title
WO2021093755A1 (en) Matching method and apparatus for questions, and reply method and apparatus for questions
EP3958145A1 (en) Method and apparatus for semantic retrieval, device and storage medium
CN118210908B (en) Retrieval enhancement method and device, electronic equipment and storage medium
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN107357777B (en) Method and device for extracting label information
CN114610845B (en) Intelligent question-answering method, device and equipment based on multiple systems
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN116028618B (en) Text processing method, text searching method, text processing device, text searching device, electronic equipment and storage medium
CN117076719A (en) Database joint query method, device and equipment based on large language model
TW202001621A (en) Corpus generating method and apparatus, and human-machine interaction processing method and apparatus
CN117112595A (en) Information query method and device, electronic equipment and storage medium
CN113609847B (en) Information extraction method, device, electronic equipment and storage medium
WO2024152550A1 (en) Picture processing method and apparatus, and electronic device and storage medium
CN112925912A (en) Text processing method, and synonymous text recall method and device
CN116226533A (en) News associated recommendation method, device and medium based on association prediction model
CN118093805A (en) Question answering method and device, electronic equipment and storage medium
CN115795023B (en) Document recommendation method, device, equipment and storage medium
Wu et al. The study on Lucene based IETM information retrieval
CN116644724B (en) Method, device, equipment and storage medium for generating bid
CN114925185B (en) Interaction method, model training method, device, equipment and medium
US20230162031A1 (en) Method and system for training neural network for generating search string
CN117633180A (en) Question-answer matching method and device, electronic equipment and storage medium
CN117573800A (en) Paragraph retrieval method, device, equipment and storage medium
CN113032609A (en) Picture retrieval method and device, electronic equipment and storage medium
CN118228713A (en) Method and device for generating demand document, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant