CN117725182A - Data retrieval method, device, equipment and storage medium based on large language model - Google Patents


Info

Publication number
CN117725182A
CN117725182A
Authority
CN
China
Prior art keywords
text
vector
detection
information
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311776820.5A
Other languages
Chinese (zh)
Inventor
戴铮
陈新华
徐晹
刘泉
陈思仪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Aerospace Tianlu New Material Testing Co ltd
Original Assignee
Hunan Aerospace Tianlu New Material Testing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Aerospace Tianlu New Material Testing Co ltd filed Critical Hunan Aerospace Tianlu New Material Testing Co ltd
Priority to CN202311776820.5A
Publication of CN117725182A
Legal status: Pending

Abstract

The application relates to a data retrieval method, device, equipment and storage medium based on a large language model. The method comprises the following steps: constructing a detection information knowledge base; converting each text segment in the detection information knowledge base into a corresponding text vector and storing the text vector in a vector database; fine-tuning a pre-trained large language model with a detection and inspection business dialogue corpus; inputting a question text entered by the user into the fine-tuned large language model to obtain an answer text corresponding to the question text; if the answer text includes the user's detection requirement, converting the detection requirement into a corresponding keyword vector and searching the vector database for text vectors matching the keyword vector to obtain matching vectors; and decompiling the matching vectors to obtain text information, then screening that text information against the detection requirement to obtain the retrieval result. With this method, the knowledge of the large model can be expanded without additional model training, effectively reducing the training cost.

Description

Data retrieval method, device, equipment and storage medium based on large language model
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a data retrieval method, apparatus, device, and storage medium based on a large language model.
Background
With the advent of ChatGPT, the use of large language models in various industries has increased. Meanwhile, the number of testing institutions and laboratories in China's inspection and testing industry keeps growing, the industry is gradually becoming market-oriented, and its data volume is rising sharply. For a large language model to keep learning the newly added inspection and testing data, the model would have to be trained continuously. Yet large language models often contain hundreds of millions of parameters, and training them requires a persistent computing cluster with high-speed network interfaces and hardware accelerators such as GPUs for training and fine-tuning, which inevitably increases the training cost of the model. How to combine the inspection and testing industry with large language models while fully considering cost effectiveness and technical feasibility, thereby creating more opportunities and potential for the industry's development, has therefore become a technical problem to be solved.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data retrieval method, apparatus, device, and storage medium based on a large language model.
A method of data retrieval based on a large language model, the method comprising:
constructing a detection information knowledge base; the detection information knowledge base comprises a plurality of text segments; each text segment includes a number of attributes associated with detection requirements;
converting each text segment in the detection information knowledge base into a corresponding text vector, and storing the text vector in a vector database;
acquiring a detection inspection business dialogue corpus, and fine-tuning a pre-trained large language model by using the detection inspection business dialogue corpus;
acquiring a question text input by a user, and inputting the question text into the fine-tuned large language model to obtain an answer text corresponding to the question text;
monitoring whether the answer text comprises the detection requirement of a user in real time, if so, converting the detection requirement into a corresponding keyword vector, and searching a text vector matched with the keyword vector in the vector database to obtain a matching vector;
decompiling the matching vector to obtain corresponding text information, screening out information related to the detection requirement in the text information according to the detection requirement, and obtaining a retrieval result.
In one embodiment, the method further comprises: inputting the search result and the question text into the pre-trained large language model to obtain corresponding generated content.
In one embodiment, the method further comprises: the tail of each text segment in the detection information knowledge base comprises a cutting symbol; each attribute in the text segment comprises a separation symbol; the attribute comprises a detection item, a detection standard, a detection device and an information parameter of the detection device.
In one embodiment, the method further comprises: and cutting by using the cutting symbols of each text segment in the detection information knowledge base to obtain a plurality of text blocks, and converting each text block into a corresponding text vector.
In one embodiment, the method further comprises: judging whether the answer text accords with a preset answer style, and if so, taking the text content which accords with the answer style as the detection requirement of a user.
In one embodiment, the method further comprises: calculating cosine similarity between each text vector and the keyword vector in the vector database; and judging the magnitude relation between the cosine similarity corresponding to each text vector and a threshold value, and taking the text vector higher than the threshold value as a matching vector.
A large language model based data retrieval device, the device comprising:
the knowledge base construction module is used for constructing a detection information knowledge base; the detection information knowledge base comprises a plurality of text segments; each text segment includes a number of attributes associated with detection requirements;
the text processing module is used for converting each text segment in the detection information knowledge base into a corresponding text vector, and storing the text vector in the vector database;
the model fine tuning module is used for acquiring a detection and inspection business dialogue corpus and fine tuning a pre-trained large language model by utilizing the detection and inspection business dialogue corpus;
the demand acquisition module is used for acquiring a question text input by a user, inputting the question text into the fine-tuned large language model, and obtaining an answer text corresponding to the question text;
the text matching module is used for monitoring whether the answer text comprises the detection requirement of the user in real time, if so, converting the detection requirement into a corresponding keyword vector, and searching a text vector matched with the keyword vector in the vector database to obtain a matching vector;
and the result output module is used for decompiling the matching vector to obtain corresponding text information, screening out information related to the detection requirement in the text information according to the detection requirement, and obtaining a retrieval result.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
constructing a detection information knowledge base; the detection information knowledge base comprises a plurality of text segments; each text segment includes a number of attributes associated with detection requirements;
converting each text segment in the detection information knowledge base into a corresponding text vector, and storing the text vector in a vector database;
acquiring a detection inspection business dialogue corpus, and fine-tuning a pre-trained large language model by using the detection inspection business dialogue corpus;
acquiring a question text input by a user, and inputting the question text into the fine-tuned large language model to obtain an answer text corresponding to the question text;
monitoring whether the answer text comprises the detection requirement of a user in real time, if so, converting the detection requirement into a corresponding keyword vector, and searching a text vector matched with the keyword vector in the vector database to obtain a matching vector;
decompiling the matching vector to obtain corresponding text information, screening out information related to the detection requirement in the text information according to the detection requirement, and obtaining a retrieval result.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
constructing a detection information knowledge base; the detection information knowledge base comprises a plurality of text segments; each text segment includes a number of attributes associated with detection requirements;
converting each text segment in the detection information knowledge base into a corresponding text vector, and storing the text vector in a vector database;
acquiring a detection inspection business dialogue corpus, and fine-tuning a pre-trained large language model by using the detection inspection business dialogue corpus;
acquiring a question text input by a user, and inputting the question text into the fine-tuned large language model to obtain an answer text corresponding to the question text;
monitoring whether the answer text comprises the detection requirement of a user in real time, if so, converting the detection requirement into a corresponding keyword vector, and searching a text vector matched with the keyword vector in the vector database to obtain a matching vector;
decompiling the matching vector to obtain corresponding text information, screening out information related to the detection requirement in the text information according to the detection requirement, and obtaining a retrieval result.
According to the data retrieval method, device, equipment and storage medium based on the large language model, constructing the detection information knowledge base supplies the large model with information about detection institutions and the relevant detection standards. The texts in the knowledge base are segmented and stored in vector form. A large language model fine-tuned on a detection and inspection business dialogue corpus then serves as a semantic analyzer that understands the user's question and extracts the user's requirement, while the model's output is monitored in real time. Once the model has obtained the user's detection requirement, that requirement is compared for similarity against the vector database to retrieve related texts, and the matched texts are processed before output so that the final data carries complete semantic information and no semantically interfering content, providing the large language model with clean and reliable data for the subsequent reasoning process. The embodiment of the invention thus expands the knowledge of the large model without additional model training, effectively reducing the training cost.
Drawings
FIG. 1 is a flow diagram of a data retrieval method based on a large language model in one embodiment;
FIG. 2 is a schematic diagram of a data retrieval process facing a knowledge base of detection information in one embodiment;
FIG. 3 is a block diagram of a large language model based data retrieval device in one embodiment;
fig. 4 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, there is provided a data retrieval method based on a large language model, including the steps of:
step 102, constructing a detection information knowledge base.
The detection information knowledge base comprises a plurality of text segments. The tail of each text segment carries a cutting symbol, and each segment comprises a number of attributes related to detection requirements, separated from one another by a separation symbol, which may be a line feed. The attributes comprise the detection item, the detection standard, the detection device and the information parameters of the detection device; a specific detection requirement can be pinpointed through these key attributes. A special symbol, such as $, @ or #, is appended to the end of each piece of information so that after the vector database retrieves the corresponding text segment, the segment can be cut out, making the retrieved information easier to parse and process; meanwhile, problems can be located quickly by inspecting the cut text segments, improving troubleshooting efficiency. Separating the attributes with line feeds emphasizes the independence of each attribute: it keeps the data tidy when it is entered into the vector database and improves retrieval accuracy, because each attribute can be retrieved independently, making the retrieval result more precise.
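A minimal sketch of the text-segment layout described above: attributes separated by line feeds, and each segment terminated by one of the special symbols ($, @ or #) so it can be cut apart after retrieval. The attribute wording, standard numbers and device names below are hypothetical illustrations, not values from the patent; '#' is chosen as the cutting symbol here.

```python
SEGMENT_END = "#"   # cutting symbol at the tail of every text segment
ATTR_SEP = "\n"     # line feed separating the independent attributes

def build_segment(item, standard, device, params):
    """Assemble one knowledge-base text segment from its four attributes."""
    return ATTR_SEP.join([
        f"Detection item: {item}",
        f"Detection standard: {standard}",
        f"Detection device: {device}",
        f"Device parameters: {params}",
    ]) + SEGMENT_END

def split_segments(raw):
    """Cut stored text back into individual segments at the cutting symbol."""
    return [s.strip() for s in raw.split(SEGMENT_END) if s.strip()]

# Hypothetical example data (standards and devices invented for illustration).
raw_kb = (build_segment("impurity content", "GB/T 1234", "ICP-MS", "0.1 ppm")
          + build_segment("moisture", "GB/T 5678", "moisture analyzer", "0.01%"))
segments = split_segments(raw_kb)
```

Because each attribute sits on its own line, a segment recovered after retrieval can be parsed attribute by attribute, which is the independence property the paragraph above emphasizes.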
The detection information knowledge base can provide more comprehensive and accurate background knowledge for the learning and reasoning process of the large model, so that the application capability of the large model in the detection and inspection field is improved.
And 104, converting each text segment in the detection information knowledge base into a corresponding text vector, and storing the text vector in a vector database.
The text data in the detection information knowledge base is divided into a number of independent text blocks using a text cutting technique, and the text blocks are processed with the Embedding function of text2vec. Embedding is a technique that converts text data into vectors: each word or character in the text is mapped to a fixed-length vector. After Embedding, the text blocks are stored in the vector database.
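The step above can be sketched as follows. The hashing "embedder" here is only a self-contained placeholder standing in for a real sentence-embedding model such as text2vec, and the in-memory list stands in for a real vector database; both are assumptions for illustration.

```python
import hashlib

DIM = 64  # fixed vector length

def embed(text):
    """Map a text block to a fixed-length vector (toy placeholder for a
    real embedding model such as text2vec)."""
    vec = [0.0] * DIM
    for token in text.split():
        # Hash each token into one of DIM buckets and count occurrences.
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        vec[h % DIM] += 1.0
    return vec

vector_db = []  # simple in-memory vector store: (vector, original text) pairs

def store(text_block):
    """Embed a text block and keep its original text for later recovery."""
    vector_db.append((embed(text_block), text_block))

for block in ("Detection item: impurity content", "Detection item: moisture"):
    store(block)
```

Storing the original text next to its vector is what makes the later "decompiling" step possible: the matched vector can be traded back for the readable text block it came from.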
And 106, acquiring a detection and inspection business dialogue corpus, and fine-tuning the pre-trained large language model by using the detection and inspection business dialogue corpus.
The detection and inspection business dialogue corpus is a multi-round dialogue corpus built, after communication and study with business personnel in the inspection and testing industry, by simulating conversations between business personnel and clients using the dialogue techniques common in the business, so that the large model gains the ability to capture users' detection requirements. The pre-trained large language model may be an open-source model such as ChatGLM or LLaMA (Large Language Model Meta AI).
And step 108, acquiring a question text input by a user, and inputting the question text into the trimmed large language model to obtain an answer text corresponding to the question text.
Step 110, monitoring whether the answer text includes the detection requirement of the user in real time, if so, converting the detection requirement into a corresponding keyword vector, and searching a text vector matched with the keyword vector in a vector database to obtain a matching vector.
Keyword extraction is performed on the user's question by the large model, and the keywords usually include detection items. For example, if the user wants to test for impurities in ores, text blocks related to impurities can be matched in the vector database through the detection item "impurity". Because every text block is marked and cut with the special symbols, a text block matched through the detection item "impurity" contains the complete detection device, device parameters and detection standard that go with it. This information can be output as supporting evidence, improving its reliability.
As shown in the data retrieval flow diagram for the detection information knowledge base in fig. 2, when a user poses a question, several rounds of dialogue with the language model yield the detection requirement behind the question, which is processed by text2vec and converted into a vector representation. The cosine similarity between the keyword vector and each text-block vector is then computed to find the text blocks most similar to the user's question, quickly and accurately matching the user's detection requirement against the existing text blocks. In the question-processing stage, text2vec converts the question into a vector that captures its semantics and meaning. By comparing the question vector with the text-block vectors, their cosine similarity can be calculated, and vectors scoring above a set threshold are selected as candidates. Matching user questions to text blocks through vector similarity in this way provides better service and support for the user, helps the model connect to an external knowledge base, and reduces the cost of acquiring new knowledge.
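The similarity search can be sketched as below: cosine similarity between the keyword vector and every stored text-block vector, keeping the blocks that score above a threshold. The three-dimensional vectors and the block texts are tiny hand-made examples, not real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, vector_db, threshold=0.8):
    """Return (text, score) pairs whose cosine similarity >= threshold."""
    scored = ((text, cosine(query_vec, vec)) for vec, text in vector_db)
    return [(text, s) for text, s in scored if s >= threshold]

# Hand-made example store: (vector, original text block) pairs.
vector_db = [
    ([1.0, 0.0, 1.0], "impurity detection / ICP-MS / GB/T 1234"),
    ([0.0, 1.0, 0.0], "moisture detection / halogen analyzer"),
]
matches = search([1.0, 0.0, 0.9], vector_db)
```

Only the first block clears the 0.8 threshold for this query vector; the second, pointing in an orthogonal direction, scores 0 and is discarded.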
And 112, decompiling the matching vector to obtain corresponding text information, screening out information related to the detection requirement in the text information according to the detection requirement, and obtaining a retrieval result.
Text information is recovered from the matched text blocks by decompiling them. Keyword information in the user's question, extracted with the text-understanding capability of the large model, is then used to screen the related texts: filtering by keyword yields the content actually related to the question, so information relevant to the user's question is better understood and extracted. Specifically, the similarity computation produces a score in the range [-1, 1], with higher scores meaning greater similarity, so matched vector text information is obtained by setting the threshold to 0.8. The matched text is not necessarily fully related to the keywords, however, so each matched text is checked for the keywords: if they are present the text is kept, otherwise it is removed. Filtering out the text blocks irrelevant to the search text in this way ensures the accuracy of the retrieved information.
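The screening step just described can be sketched as follows: candidates above the 0.8 similarity threshold are kept only if they actually contain the extracted keyword, and everything else is filtered out. The candidate texts and scores are hypothetical.

```python
THRESHOLD = 0.8  # cosine scores fall in [-1, 1]; 0.8 is the cut-off used above

def screen(scored_blocks, keyword):
    """scored_blocks: (text, similarity) pairs.
    Keep only blocks that clear the threshold AND contain the keyword."""
    return [text for text, score in scored_blocks
            if score >= THRESHOLD and keyword in text]

# Hypothetical retrieval candidates with their similarity scores.
candidates = [
    ("Detection item: impurity\nDetection standard: GB/T 1234", 0.93),
    ("Detection item: particle size\nDetection standard: GB/T 9999", 0.82),
    ("Detection item: impurity\nDetection device: ICP-MS", 0.55),
]
results = screen(candidates, "impurity")
```

The second candidate clears the threshold but lacks the keyword, and the third contains the keyword but scores below the threshold, so only the first survives both checks.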
After the retrieval result is obtained, the retrieval result is subjected to text reasoning through a large language model, so that accurate information is finally output for the user, and the answer conforming to the human language habit is obtained.
According to the data retrieval method based on the large language model, constructing the detection information knowledge base supplies the large model with information about detection institutions and the relevant detection standards. The texts in the knowledge base are segmented and stored in vector form. A large language model fine-tuned on a detection and inspection business dialogue corpus then serves as a semantic analyzer that understands the user's question and extracts the user's requirement, while the model's output is monitored in real time. Once the model has obtained the user's detection requirement, that requirement is compared for similarity against the vector database to retrieve related texts, and the matched texts are processed before output so that the final data carries complete semantic information and no semantically interfering content, providing the large language model with clean and reliable data for the subsequent reasoning process. The embodiment of the invention thus expands the knowledge of the large model without additional model training, effectively reducing the training cost.
In one embodiment, the method further comprises: and inputting the search result and the problem text into a pre-trained large language model to obtain corresponding generated content.
In one embodiment, detecting the end of each text segment in the knowledge base includes cutting a symbol; each attribute in the text segment comprises a separation symbol; the attributes include detection items, detection criteria, detection devices, and information parameters of the detection devices.
In one embodiment, the step of converting each text segment in the detected information knowledge base to a corresponding text vector comprises: and cutting by using the cutting symbols of each text segment in the detection information knowledge base to obtain a plurality of text blocks, and converting each text block into a corresponding text vector.
In one embodiment, monitoring in real time whether the answer text includes a detected need of the user includes: judging whether the answer text accords with a preset answer style, and if so, taking the text content which accords with the answer style as the detection requirement of the user.
In this embodiment, when a user asks a question, the large model fine-tuned on the detection and inspection business dialogue corpus conducts a business dialogue with the user and guides the user to provide a specific detection requirement. Whether to search the knowledge base is decided by monitoring the output of the large language model: once the model has obtained the user's detection requirement through guided dialogue, it outputs an answer in a specific format. For example, if the user asks which enterprises can test wolfberry (goji berry), the large language model guides the user to state the specific requirement by asking which items of the wolfberry need to be tested, thereby obtaining the detection sample and the corresponding detection items. When text in this structured style is detected, it is used as the keyword for searching the knowledge base. In particular, a string-matching method can be used to judge whether the large model has output a text answer in the specific format.
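A hedged sketch of that string-matching check: the model's answer is tested against an assumed fixed answer template, and when it matches, the captured fields become the detection requirement used to search the knowledge base. The "Detection sample / Detection item" template below is hypothetical; the real style is whatever format is fixed during fine-tuning.

```python
import re

# Assumed structured answer style (hypothetical template).
ANSWER_STYLE = re.compile(
    r"Detection sample:\s*(?P<sample>\S+).*?Detection item:\s*(?P<item>\S+)",
    re.S,  # let .*? cross line breaks between the two fields
)

def extract_requirement(answer_text):
    """Return (sample, item) if the answer follows the preset style, else None."""
    m = ANSWER_STYLE.search(answer_text)
    return (m.group("sample"), m.group("item")) if m else None

req = extract_requirement(
    "Detection sample: wolfberry\nDetection item: pesticide-residue"
)
```

Free-form conversational answers fail the match and return `None`, so the knowledge-base search is triggered only once the dialogue has produced a structured requirement.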
In one embodiment, searching the vector database for a text vector matching the keyword vector to obtain a matching vector includes: calculating the cosine similarity between each text vector in the vector database and the keyword vector; and comparing the cosine similarity of each text vector with the threshold, taking the text vectors above the threshold as matching vectors. It should be understood that, although the steps in the flowchart of fig. 1 are displayed in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily completed at the same moment but may be performed at different times; these sub-steps or stages likewise need not be executed sequentially, and may be performed in turn or alternately with at least a portion of the other steps, or of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 3, there is provided a data retrieval device based on a large language model, including: a knowledge base construction module 302, a text processing module 304, a model fine tuning module 306, a demand acquisition module 308, a text matching module 310, and a result output module 312, wherein:
a knowledge base construction module 302, configured to construct a detection information knowledge base; the detection information knowledge base comprises a plurality of text segments; each text segment includes a number of attributes associated with detection requirements;
the text processing module 304 is configured to convert each text segment in the detected information knowledge base into a corresponding text vector, and store the text vector in the vector database;
the model fine tuning module 306 is configured to obtain a detection test business dialogue corpus, and fine tune a pre-trained large language model using the detection test business dialogue corpus;
the requirement acquisition module 308 is configured to acquire a question text input by a user, input the question text into the fine-tuned large language model, and obtain an answer text corresponding to the question text;
the text matching module 310 is configured to monitor in real time whether the answer text includes a detection requirement of the user, if yes, convert the detection requirement into a corresponding keyword vector, and search a text vector matching the keyword vector in the vector database to obtain a matching vector;
the result output module 312 is configured to decompile the matching vector to obtain corresponding text information, screen out information related to the detection requirement from the text information according to the detection requirement, and obtain a search result.
In one embodiment, the apparatus is further configured to input the search result and the question text into the pre-trained large language model to obtain corresponding generated content.
In one embodiment, the tail of each text segment in the detection information knowledge base comprises a cutting symbol; the attributes in each text segment are separated by a separation symbol; the attributes include detection items, detection standards, detection devices, and information parameters of the detection devices.
In one embodiment, the apparatus is further configured to cut each text segment in the detection information knowledge base at its cutting symbol to obtain a number of text blocks, and to convert each text block into a corresponding text vector.
In one embodiment, the apparatus is further configured to judge whether the answer text conforms to the preset answer style and, if so, to take the text content conforming to the answer style as the user's detection requirement.
In one embodiment, the apparatus is further configured to calculate the cosine similarity between each text vector in the vector database and the keyword vector, compare the cosine similarity of each text vector with the threshold, and take the text vectors above the threshold as matching vectors. For specific limitations on the data retrieval apparatus based on the large language model, reference may be made to the limitations on the data retrieval method based on the large language model above; detailed descriptions are omitted here. The modules in the above data retrieval apparatus may be implemented in whole or in part by software, hardware, or combinations thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a data retrieval method based on a large language model. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the structure shown in fig. 4 is merely a block diagram and does not constitute a limitation on the computer device to which the present solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an embodiment a computer device is provided comprising a memory storing a computer program and a processor implementing the steps of the method of the above embodiments when the computer program is executed.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method of the above embodiments.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-volatile computer readable storage medium; when executed, the program may carry out the steps of the method embodiments described above. Any reference to memory, storage, a database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of these technical features are described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.
The foregoing examples represent only a few embodiments of the present application and are described in considerable detail, but they are not to be construed as limiting the scope of the application. It should be noted that various modifications and improvements could be made by those skilled in the art without departing from the spirit of the present application, and these would fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application should be determined by the appended claims.

Claims (9)

1. A method for data retrieval based on a large language model, the method comprising:
constructing a detection information knowledge base; the detection information knowledge base comprises a plurality of text segments; each text segment comprises a number of attributes associated with a detection requirement;
converting each text segment in the detection information knowledge base into a corresponding text vector, and storing the text vector in a vector database;
acquiring a detection and inspection business dialogue corpus, and fine-tuning a pre-trained large language model by using the detection and inspection business dialogue corpus;
acquiring a question text input by a user, and inputting the question text into the fine-tuned large language model to obtain an answer text corresponding to the question text;
monitoring in real time whether the answer text comprises a detection requirement of the user; if so, converting the detection requirement into a corresponding keyword vector, and searching the vector database for a text vector matching the keyword vector to obtain a matching vector;
decompiling the matching vector to obtain corresponding text information, screening out information related to the detection requirement in the text information according to the detection requirement, and obtaining a retrieval result.
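The flow of claim 1 can be sketched end to end as follows. This is an illustrative toy only: `embed()` is a hypothetical bag-of-words stand-in for the text-to-vector conversion (the patent does not disclose the actual embedding model), and the semicolon-delimited segment format and threshold of 0.3 are assumptions; only the control flow mirrors the claim.

```python
# Toy sketch of the claim-1 pipeline: build a vector database from
# text segments, match a detection requirement against it, recover
# ("decompile") the matched text, and screen it for related attributes.
from collections import Counter
import math

def embed(text):
    # Hypothetical stand-in for the real text-to-vector conversion.
    return Counter(text.lower().replace(";", " ").split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-2: build the knowledge base and store (text, vector) pairs,
# so that a matched vector can later be mapped back to its text.
segments = [
    "tensile strength test; GB/T 228; universal testing machine; 100 kN",
    "hardness test; GB/T 230; Rockwell hardness tester; HRC scale",
]
vector_db = [(seg, embed(seg)) for seg in segments]

def retrieve(detection_requirement, db, threshold=0.3):
    # Steps 5-6: convert the requirement into a keyword vector, match
    # it against the database, then screen the recovered text for
    # attributes that mention a requirement term.
    kw_vec = embed(detection_requirement)
    matched = [seg for seg, vec in db if cosine(kw_vec, vec) > threshold]
    terms = set(kw_vec)
    return [attr.strip() for seg in matched for attr in seg.split(";")
            if terms & set(attr.lower().split())]

result = retrieve("tensile strength test", vector_db)
```

In this toy run, only the first segment clears the similarity threshold, and the screening step then keeps just the attribute containing the requirement terms.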
2. The method according to claim 1, wherein the method further comprises:
and inputting the search result and the problem text into a pre-trained large language model to obtain corresponding generated content.
3. The method of claim 1, wherein the end of each of the text segments in the knowledge base of detected information comprises a cut symbol;
each attribute in the text segment comprises a separation symbol; the attribute comprises a detection item, a detection standard, a detection device and an information parameter of the detection device.
4. A method according to claim 3, wherein the step of converting each text segment in the detected information repository into a corresponding text vector comprises:
and cutting by using the cutting symbols of each text segment in the detection information knowledge base to obtain a plurality of text blocks, and converting each text block into a corresponding text vector.
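The segmentation in claims 3-4 can be sketched as follows, assuming `##` as the cut symbol between text segments and `|` as the separation symbol between attributes; the actual symbols are not specified in the source, and the attribute values are invented examples.

```python
# Minimal sketch of claim 4: split the knowledge base text into blocks
# at the cut symbol, then split each block into its attributes
# (detection item, standard, device, device parameters).
def split_into_blocks(knowledge_base_text, cut_symbol="##", sep_symbol="|"):
    blocks = []
    for segment in knowledge_base_text.split(cut_symbol):
        segment = segment.strip()
        if segment:
            blocks.append([attr.strip() for attr in segment.split(sep_symbol)])
    return blocks

kb = ("tensile test | GB/T 228 | universal testing machine | 100 kN ##"
      "hardness test | GB/T 230 | Rockwell tester | HRC scale ##")
blocks = split_into_blocks(kb)
```

Each resulting block would then be passed to the text-to-vector conversion as a unit.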
5. The method of any of claims 1-4, wherein monitoring in real time whether the answer text comprises a detection requirement of the user comprises:
judging whether the answer text accords with a preset answer style, and if so, taking the text content which accords with the answer style as the detection requirement of a user.
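The style check in claim 5 could be implemented as pattern matching over the model's answer. The marker format below (`[detection requirement: ...]`) is a hypothetical preset answer style invented for illustration; the real style is not disclosed in the source.

```python
# Sketch of claim 5: if the answer text matches the preset answer
# style, extract the conforming content as the detection requirement.
import re

ANSWER_STYLE = re.compile(r"\[detection requirement:\s*(.+?)\]")

def extract_detection_requirement(answer_text):
    """Return the detection requirement if the answer matches the preset style, else None."""
    match = ANSWER_STYLE.search(answer_text)
    return match.group(1) if match else None

req = extract_detection_requirement(
    "Certainly. [detection requirement: tensile strength of aluminium alloy]")
```

In practice, constraining the fine-tuned model to such a fixed style is what makes this real-time monitoring step a cheap string match rather than another model call.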
6. The method of claim 5, wherein searching the vector database for a text vector matching the keyword vector to obtain a matching vector comprises:
calculating cosine similarity between each text vector and the keyword vector in the vector database;
and comparing the cosine similarity corresponding to each text vector with a threshold value, and taking text vectors whose cosine similarity is higher than the threshold value as matching vectors.
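The matching rule of claim 6 can be sketched directly: cosine similarity between the keyword vector and every stored text vector, keeping those above a threshold. The threshold value 0.8 and the three-dimensional example vectors are illustrative assumptions; the source specifies neither.

```python
# Sketch of claim 6: cosine-similarity matching against a threshold.
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|); 0.0 for a zero vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def match_vectors(keyword_vec, text_vectors, threshold=0.8):
    # Keep every text vector whose similarity to the keyword vector
    # exceeds the threshold.
    return [v for v in text_vectors if cosine_similarity(keyword_vec, v) > threshold]

kw = [1.0, 0.0, 1.0]
db = [[1.0, 0.1, 0.9], [0.0, 1.0, 0.0]]
matches = match_vectors(kw, db)
```

Here the first stored vector is nearly parallel to the keyword vector (similarity ≈ 0.996) and is kept, while the orthogonal second vector (similarity 0) is rejected.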
7. A large language model based data retrieval apparatus, the apparatus comprising:
the knowledge base construction module is used for constructing a detection information knowledge base; the detection information knowledge base comprises a plurality of text segments; each text segment includes a number of attributes associated with the detected demand;
the text processing module is used for converting each text segment in the detection information knowledge base into a corresponding text vector, and storing the text vector in the vector database;
the model fine tuning module is used for acquiring a detection and inspection business dialogue corpus and fine tuning a pre-trained large language model by utilizing the detection and inspection business dialogue corpus;
the demand acquisition module is used for acquiring a question text input by a user and inputting the question text into the fine-tuned large language model to obtain an answer text corresponding to the question text;
the text matching module is used for monitoring whether the answer text comprises the detection requirement of the user in real time, if so, converting the detection requirement into a corresponding keyword vector, and searching a text vector matched with the keyword vector in the vector database to obtain a matching vector;
and the result output module is used for decompiling the matching vector to obtain corresponding text information, screening out information related to the detection requirement in the text information according to the detection requirement, and obtaining a retrieval result.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202311776820.5A 2023-12-21 2023-12-21 Data retrieval method, device, equipment and storage medium based on large language model Pending CN117725182A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311776820.5A CN117725182A (en) 2023-12-21 2023-12-21 Data retrieval method, device, equipment and storage medium based on large language model


Publications (1)

Publication Number Publication Date
CN117725182A true CN117725182A (en) 2024-03-19

Family

ID=90205065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311776820.5A Pending CN117725182A (en) 2023-12-21 2023-12-21 Data retrieval method, device, equipment and storage medium based on large language model

Country Status (1)

Country Link
CN (1) CN117725182A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination