CN117056531A - Domain knowledge driven large language model fine tuning method, system, equipment and storage medium - Google Patents
- Publication number
- CN117056531A (application CN202311096075.XA)
- Authority
- CN
- China
- Prior art keywords
- knowledge
- enterprise
- language model
- domain
- fine tuning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
Abstract
The invention relates to a domain-knowledge-driven large language model (LLM) fine-tuning method, system, device, and storage medium, which combine a domain-knowledge-driven LLM with an enterprise knowledge base to build domain-specific AI applications. The method fine-tunes the LLM using a large pre-trained language model together with in-enterprise domain knowledge, so that the model better meets enterprise-specific requirements. The system also introduces an enterprise knowledge base backed by a vector database: storing domain knowledge as vector representations improves retrieval efficiency. Finally, by combining the LLM with the enterprise knowledge base and applying prompt engineering, an enterprise-specific AI application is built, enabling more accurate and efficient natural language processing and knowledge application.
Description
Technical Field
The invention relates to artificial-intelligence large language models, and in particular to a domain-knowledge-driven large language model fine-tuning method, system, device, and storage medium.
Background
With the continued development of natural language processing (NLP), the advent and use of large language models (such as the GPT family) has attracted considerable attention. Through pre-training on large-scale general corpora, these models acquire general language patterns and knowledge, providing a strong foundation for a variety of NLP tasks. In domain-specific applications, however, these general-purpose models may be limited by a lack of domain expertise, making it difficult to meet the domain's demanding requirements.
Traditional rule-based, domain-specialized NLP methods require the manual construction of complex rules and features; they are labor-intensive and hard to maintain. In some fields, expert knowledge changes and is updated rapidly, so traditional methods cannot keep pace with new information. A more efficient, automated, and adaptive method is therefore needed to adapt large language models to specific fields and fully exploit their potential there.
To address these problems, the invention provides a domain-knowledge-driven large language model (LLM) fine-tuning method and system that combine the powerful generative capability of a large-scale language model with the expertise captured in in-enterprise domain knowledge, enabling more accurate and targeted natural language processing and knowledge application. Through the pre-trained large model and the fine-tuning method, the system not only learns language patterns from large amounts of general data but also acquires domain-specific information from enterprise knowledge, achieving better domain adaptability.
The system also introduces the concept of a vector database, storing in-enterprise domain knowledge in vectorized form to improve retrieval efficiency. The invention can extract and analyze knowledge in real time and feed it back into the knowledge base in a self-sustaining loop, helping enterprises keep the knowledge base up to date and accurate so that their AI applications give answers and suggestions based on the latest information.
In summary, the invention provides an innovative method and system for natural language processing and knowledge application in the enterprise domain, allowing a large language model to adapt better to a specific field and providing users with higher-quality, more personalized information support.
Disclosure of Invention
To address the problems that general-purpose models may be limited by a lack of domain expertise and struggle to meet demanding in-domain requirements, that traditional rule-based specialized NLP methods require manually constructed complex rules and features and are laborious and hard to maintain, and that rapidly changing expert knowledge in specific fields outpaces traditional methods, the invention provides a domain-knowledge-driven large language model (LLM) fine-tuning method and system that combine the powerful generative capability of a large-scale language model with the expertise of in-enterprise domain knowledge to achieve more accurate and targeted natural language processing and knowledge application.
The invention first provides a domain-knowledge-driven large language model fine-tuning method, comprising the following steps:
pre-training a large language model using a large-scale general corpus;
constructing an enterprise knowledge graph: organizing and building a knowledge graph of the enterprise domain, guiding the large language model during the fine-tuning stage to extract and analyze knowledge in real time according to the enterprise-domain context and knowledge, and feeding the extracted knowledge back into the knowledge base in a self-sustaining loop;
vectorizing the entities and relation elements in the enterprise knowledge graph and storing them in a vector database; and
receiving user input, guiding the user to interact with the AI application through prompt engineering, and generating targeted replies using the vectorized knowledge and domain knowledge in the enterprise knowledge base.
In some embodiments, extracting and analyzing knowledge in real time and feeding it back into the knowledge base comprises: fully analyzing the text and voice dialogue records generated during contacts between the enterprise and its customers; extracting high-frequency and recent customer questions; and, through the combined processing of a question-extraction model, a question-clustering model, a question-ranking model, and a large language model, organizing the questions into newly added knowledge content for quick manual confirmation, thereby completing a closed-loop maintenance pipeline from front-end content capture, through automatic analysis and processing, to final manual confirmation and storage.
The invention also provides another domain-knowledge-driven large language model fine-tuning method, comprising the following steps:
synchronizing data in real time from a database or cloud service into a vector database through a data pipeline to form a knowledge base; and
when a user converses with the enterprise AI application, performing semantic retrieval over the user's question in the enterprise knowledge base, then sending the retrieved relevant answers and the question, together with a suitable prompt, to the large model, and returning the final answer to the user.
The invention also provides a domain-knowledge-driven large language model fine-tuning system comprising a large-language-model pre-training module, an enterprise-domain knowledge introduction and fine-tuning module, an enterprise vector-database knowledge storage module, an enterprise-specific AI application module, and an enterprise-specific AI application result output module.
In some embodiments, the pre-training module is configured to pre-train a large language model using a large-scale general corpus.
In some embodiments, the enterprise-domain knowledge introduction and fine-tuning module is configured to construct an enterprise knowledge graph, organize and build the knowledge graph of the enterprise domain, guide the large language model during fine-tuning to extract and analyze knowledge in real time according to the enterprise-domain context and knowledge, and feed the extracted knowledge back into the knowledge base.
In some embodiments, the enterprise vector-database knowledge storage module is configured to vectorize the entities and relation elements in the enterprise knowledge graph and store them in a vector database.
In some embodiments, the enterprise-specific AI application module is configured to receive user input, guide the user to interact with the AI application through prompt engineering, and generate targeted replies using the vectorized knowledge and domain knowledge in the enterprise knowledge base.
The invention also provides an electronic device comprising a memory and a processor, wherein at least one program instruction is stored in the memory, and the processor implements any of the above large language model fine-tuning methods by loading and executing the at least one program instruction.
The invention also provides a computer storage medium in which at least one program instruction is stored; the at least one program instruction is loaded and executed by a processor to implement any of the above large language model fine-tuning methods.
The beneficial effects of the invention are as follows. First, the domain-knowledge-driven fine-tuning method and system exploit the strengths of both the large model and the enterprise: they make full use of the enterprise's existing knowledge while also leveraging the large model's powerful expression and reasoning capability, fusing the two. Second, the system gives the AI application long-term memory: because of token limits, a large model alone has only short-term memory and cannot memorize all of an enterprise's knowledge, whereas an external knowledge base can integrate the enterprise's massive data assets and help the AI application build long-term memory. Third, enterprise data remains relatively safe and controllable: an enterprise can build its own knowledge base locally and avoid leaking core data assets. Finally, deployment cost is low: with this scheme an enterprise need not invest heavily in building its own local large model, saving training costs that can run into the tens of millions.
Drawings
FIG. 1 is a flow chart of one embodiment of a knowledge-driven large language model fine tuning method and system in accordance with the present invention;
FIG. 2 is a system architecture diagram of a knowledge-driven large language model fine tuning method and system in accordance with the present invention;
FIG. 3 is a schematic diagram of a knowledge base self-circulation and maintenance architecture of a knowledge-driven large language model fine tuning method and system in accordance with an embodiment of the present invention.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs; the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention; the terms "comprising" and "having" and any variations thereof in the description of the invention and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
The implementation flow, system architecture, and knowledge-base self-maintenance loop of the domain-knowledge-driven large language model (LLM) fine-tuning method and system provided by the invention are described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to FIG. 1, a flowchart of one embodiment of the domain-knowledge-driven large language model (LLM) fine-tuning method and system is shown.
The domain-knowledge-driven LLM fine-tuning method and system include the following steps. First, pre-train the large language model using a large-scale general corpus. Next, introduce enterprise-domain knowledge: construct an enterprise knowledge graph and guide the LLM during fine-tuning to better adapt to the enterprise-domain context and knowledge. Then, store the knowledge in the enterprise vector database: vectorize the entities, relations, and other elements of the enterprise knowledge graph and store them in the vector database. Next, build the enterprise-specific AI application: after user input enters the system, guide the user to interact with the AI application through prompt engineering and generate replies using the vectorized domain knowledge in the enterprise knowledge base, automatically producing answers and solutions. Finally, output the enterprise-specific AI application result: generate answers and suggestions and provide information support.
S101: pre-training stage of the domain-knowledge-driven LLM. The large language model is pre-trained using a large-scale general corpus.
Specifically, the pre-training stage lets the model automatically capture and learn the internal patterns of language from massive text data through unsupervised learning, for more effective use in subsequent tasks. After pre-training, the model possesses a general language representation capability, and the learned knowledge can be transferred to various natural language processing tasks. This transfer-learning approach lets the large model perform strongly across different fields and tasks without training each task separately from scratch.
S102: enterprise-domain knowledge introduction and fine-tuning stage of the domain-knowledge-driven LLM. An enterprise knowledge graph is constructed, and the LLM is guided during fine-tuning to better adapt to the enterprise-domain context and knowledge.
Specifically, the method first converts enterprise-domain knowledge into a knowledge-graph representation. PDF files and other materials containing domain knowledge are collected from sources such as internal enterprise documents, databases, and domain experts; a PDF-parsing stage converts them into processable text or image data, and text-recognition algorithms such as optical character recognition (OCR) extract the relevant information. Table extraction then identifies and extracts tabular content from the PDFs, using image processing, layout analysis, and structure recognition to determine the entities, relations, attributes, and other elements of the domain knowledge graph. Finally, data-consolidation methods such as coordinate aggregation and text aggregation convert all extracted content into the enterprise-domain knowledge graph.
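As a loose illustration of the graph-construction step above (not the patent's actual implementation), the sketch below assembles (entity, relation, entity) triples from lines that a hypothetical PDF/OCR parsing stage might emit; the `A - relation -> B` line format is an assumption made purely for illustration.

```python
import re

# Hypothetical intermediate format a PDF/OCR parsing stage might emit;
# a real system would use NER and table-structure recognition instead.
TRIPLE_LINE = re.compile(r"^\s*(.+?)\s*-\s*(\w+)\s*->\s*(.+?)\s*$")

def extract_triples(parsed_text):
    """Collect (head entity, relation, tail entity) triples for the
    enterprise knowledge graph from already-parsed document text."""
    triples = []
    for line in parsed_text.splitlines():
        m = TRIPLE_LINE.match(line)
        if m:
            triples.append(m.groups())
    return triples

doc = """ProductX - manufactured_by -> PlantA
PlantA - located_in -> Shenzhen
(an OCR artifact line with no triple is skipped)"""
print(extract_triples(doc))
```

In a real pipeline the triples would then be deduplicated and merged into the existing graph rather than printed.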
Further, based on the enterprise-domain knowledge graph, a domain-specific loss function is designed that guides the LLM during fine-tuning to attend more closely to in-domain entities, relations, and knowledge structure. During fine-tuning the LLM is optimized iteratively: in each iteration, the enterprise-domain knowledge graph, the domain-specific loss function, and the domain training data are fed into the training process to continually adjust the model's weights and parameters. This iterative optimization gradually adapts the model to the enterprise's domain requirements.
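The interplay between a general objective and a domain-knowledge term can be sketched with a toy scalar model. This is purely illustrative: the patent does not specify the loss, and `alpha`, the two quadratic terms, and the gradient-descent settings are all assumptions.

```python
def combined_loss(w, alpha=0.3):
    """Toy stand-in for a domain-adapted objective: a 'general' term
    pulls the parameter toward 0, a 'domain knowledge' term pulls it
    toward 2, and alpha weights the domain term."""
    general = w ** 2
    domain = (w - 2.0) ** 2
    return (1 - alpha) * general + alpha * domain

def finetune(w=0.0, lr=0.1, steps=200, alpha=0.3):
    """Plain gradient descent on the combined loss; with a real LLM this
    would be an optimizer step over the model's parameters."""
    for _ in range(steps):
        grad = 2 * (1 - alpha) * w + 2 * alpha * (w - 2.0)
        w -= lr * grad
    return w

print(round(finetune(), 4))  # settles at 2 * alpha = 0.6
```

The fine-tuned parameter lands between the two pulls, mirroring how a weighted domain term shifts a pre-trained model toward in-domain behavior without discarding the general objective.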
S103: enterprise vector-database knowledge storage stage of the domain-knowledge-driven LLM. Entities, relations, and other elements of the enterprise knowledge graph are vectorized and stored in a vector database.
Specifically, the enterprise-domain knowledge graph is first converted into vector representations: entities, relations, attributes, and other elements are mapped to points in a vector space to ease subsequent computation and retrieval. An index is then built over the enterprise-domain knowledge vectors and a vector-retrieval technique is introduced; the system builds a hash-table index over the vector representations in order to store and manage large amounts of vector data efficiently. Finally, when the system needs to retrieve specific knowledge from the knowledge base, vector retrieval finds the most similar vector representations.
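A minimal sketch of the vectorize-and-retrieve step, using a toy bag-of-words embedding over a fixed vocabulary in place of a real embedding model; the vocabulary, the in-memory store, and cosine-ranked retrieval are illustrative assumptions rather than the patent's implementation.

```python
import math

# Toy vocabulary standing in for a learned embedding model.
VOCAB = ["warranty", "policy", "shipping", "times", "details", "return"]

def embed(text):
    """Map text to a unit vector; a real system would call an
    embedding model or API here."""
    vec = [0.0] * len(VOCAB)
    for tok in text.lower().split():
        if tok in VOCAB:
            vec[VOCAB.index(tok)] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """In-memory stand-in for the vector database."""
    def __init__(self):
        self._items = []  # (original text, vector) pairs

    def add(self, text):
        self._items.append((text, embed(text)))

    def search(self, query, k=1):
        # Rank stored items by cosine similarity (dot product of
        # unit vectors) against the query embedding.
        qv = embed(query)
        scored = sorted(
            self._items,
            key=lambda item: sum(a * b for a, b in zip(qv, item[1])),
            reverse=True,
        )
        return [text for text, _ in scored[:k]]

store = VectorStore()
store.add("warranty policy details")
store.add("shipping times")
print(store.search("warranty policy"))  # -> ['warranty policy details']
```

A production system would replace the linear scan with the hash-table or approximate-nearest-neighbor index the description mentions.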
Further, the domain-knowledge-driven LLM extracts and analyzes enterprise-domain knowledge in real time and feeds it back into the knowledge base in a self-sustaining loop, as shown in FIG. 3. Text and voice dialogue records generated during contacts between the enterprise and its customers are fully analyzed; high-frequency and recent customer questions are extracted; and through the combined processing of a question-extraction model, a question-clustering model, a question-ranking model, and a large language model, the questions are organized into newly added knowledge content for quick manual confirmation, completing a closed-loop maintenance pipeline from front-end content capture, through automatic analysis and processing, to final manual confirmation and storage.
Information is extracted with AI capabilities: OCR text extraction from images, automatic speech recognition (ASR) on the audio tracks of audio and video, and document-format content extraction are fused to quickly produce text that a large language model can understand. FAQs and graph information are extracted by the large model, and word-sense-disambiguation (WSD) techniques ensure that, at storage time, the information is effectively merged into, enhances, and supplements the existing knowledge base. The domain-knowledge-driven LLM supports extracting knowledge from plain text, Word, PPT, PDF, WMA, MP3/MP4, video, images, web pages, dialogue records, custom-format files, and so on. The extracted output can be FAQs or a knowledge graph, supporting both rapid enhancement of existing content and advanced processing once the information is structured. Finally, a one-click import into the knowledge base automatically matches historical information for merging, replacement, and enhancement, keeping the domain knowledge base up to date.
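The question-clustering and ranking steps of the self-sustaining loop can be sketched with a simple token-overlap (Jaccard) grouping. The threshold, greedy clustering, and frequency ranking below are assumptions for illustration, not the patent's stated models.

```python
def jaccard(a, b):
    """Token-set similarity between two questions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster_questions(questions, threshold=0.5):
    """Greedily group similar customer questions; each cluster's
    first member acts as its representative."""
    clusters = []
    for q in questions:
        for cluster in clusters:
            if jaccard(q, cluster[0]) >= threshold:
                cluster.append(q)
                break
        else:
            clusters.append([q])
    return clusters

def faq_candidates(questions, threshold=0.5):
    """Rank clusters by frequency: high-frequency questions become
    knowledge-base candidates pending manual confirmation."""
    clusters = cluster_questions(questions, threshold)
    return [c[0] for c in sorted(clusters, key=len, reverse=True)]

logs = [
    "how do i reset my password",
    "how to reset my password",
    "what are your opening hours",
]
print(faq_candidates(logs))
```

The ranked representatives would then go through the manual quick-confirmation step before entering the knowledge base.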
S104: enterprise-specific AI application construction stage of the domain-knowledge-driven LLM. After user input enters the system, the user is guided to interact with the AI application through prompt engineering; replies are generated using the vectorized domain knowledge in the enterprise knowledge base, and answers and solutions are produced automatically.
Specifically, the user's question is fed to the domain-knowledge-driven large language model as a prompt: the enterprise AI application passes the user's question together with context information to the large model, which generates the corresponding reply. Domain knowledge enters a data pipeline as file and data corpora; vectorized corpus embeddings associate the domain knowledge with general knowledge, building the domain-knowledge-driven large language model. In addition, an automated process converts user input into a format the model can understand, such as text preprocessing and conversion into a form suitable for model input, ensuring input-data quality.
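The prompt-construction step can be sketched as plain string assembly; the template wording below is an illustrative assumption, not the patent's prompt.

```python
def build_prompt(question, retrieved_snippets):
    """Combine retrieved enterprise knowledge with the user's
    question into a single prompt for the large model."""
    context = "\n".join(f"- {s}" for s in retrieved_snippets)
    return (
        "Answer using only the enterprise knowledge below.\n"
        "If the knowledge is insufficient, say so.\n\n"
        f"Knowledge:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the warranty period?",
    ["Standard warranty is 24 months.", "Extended plans add 12 months."],
)
print(prompt)
```

Grounding the model in retrieved snippets this way is what lets the enterprise application answer from its own knowledge base rather than from the model's general training data alone.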
S105: enterprise-specific AI application result output stage of the domain-knowledge-driven LLM. Replies are generated by the domain-knowledge-driven large language model according to the user's questions and returned to the user as output.
Referring to FIG. 2, a system architecture diagram of the domain-knowledge-driven large language model (LLM) fine-tuning method and system is shown.
Specifically, the enterprise first builds a knowledge base from its private data. Data is synchronized in real time from databases or cloud services into a vector database through a data pipeline, forming the enterprise's knowledge base. During this process, the large model's embedding interface is called to vectorize the corpus before it is stored in the vector database. When a user converses with the enterprise AI application, the application first performs semantic retrieval over the user's question in the enterprise knowledge base, then sends the retrieved relevant answers and the question, together with a suitable prompt, to the large model, and returns the final answer to the user.
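Putting the pieces together, the retrieve-then-generate loop just described can be sketched end to end. `fake_llm` is a stand-in for the real large-model call, and keyword-overlap retrieval is a toy substitute for the semantic vector search; both are assumptions for illustration.

```python
def retrieve(knowledge_base, question, k=1):
    """Toy keyword retrieval standing in for semantic vector search."""
    q_tokens = set(question.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q_tokens & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def fake_llm(prompt):
    """Stand-in for the large model: echoes the first knowledge line."""
    for line in prompt.splitlines():
        if line.startswith("- "):
            return line[2:]
    return "I don't know."

def answer(knowledge_base, question):
    # Retrieve, build the prompt, and generate the final reply.
    snippets = retrieve(knowledge_base, question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = f"Use the knowledge below.\n{context}\nQuestion: {question}"
    return fake_llm(prompt)

kb = ["Returns are accepted within 30 days.", "Support is available 24/7."]
print(answer(kb, "how many days for returns"))
```

Swapping `retrieve` for the vector search of S103 and `fake_llm` for the fine-tuned model yields the full architecture of FIG. 2.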
The embodiment of the invention also provides an electronic device, characterized by comprising a memory and a processor, wherein at least one program instruction is stored in the memory, and the processor loads and executes the at least one program instruction to implement the large language model (LLM) fine-tuning method of any of the above embodiments.
The embodiment of the invention also provides a computer storage medium, characterized in that at least one program instruction is stored therein, and the at least one program instruction is loaded and executed by a processor to implement the large language model (LLM) fine-tuning method of any of the above embodiments.
The foregoing examples illustrate only a few embodiments of the invention; they are described in detail but are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the invention. Accordingly, the scope of protection should be determined by the appended claims.
Claims (10)
1. A domain-knowledge-driven large language model fine-tuning method, characterized by comprising the following steps:
pre-training a large language model using a large-scale general corpus;
constructing an enterprise knowledge graph: organizing and building a knowledge graph of the enterprise domain, guiding the large language model during the fine-tuning stage to extract and analyze knowledge in real time according to the enterprise-domain context and knowledge, and feeding the extracted knowledge back into the knowledge base in a self-sustaining loop;
vectorizing the entities and relation elements in the enterprise knowledge graph and storing them in a vector database; and
receiving user input, guiding the user to interact with the AI application through prompt engineering, and generating targeted replies using the vectorized knowledge and domain knowledge in the enterprise knowledge base.
2. The large language model fine-tuning method of claim 1, characterized in that extracting and analyzing knowledge in real time and feeding it back into the knowledge base comprises:
fully analyzing the text and voice dialogue records generated during contacts between the enterprise and its customers; extracting high-frequency and recent customer questions; and, through the combined processing of a question-extraction model, a question-clustering model, a question-ranking model, and a large language model, organizing the questions into newly added knowledge content for quick manual confirmation, thereby completing a closed-loop maintenance pipeline from front-end content capture, through automatic analysis and processing, to final manual confirmation and storage.
3. A domain-knowledge-driven large language model fine-tuning method, characterized by comprising the following steps:
synchronizing data in real time from a database or cloud service into a vector database through a data pipeline to form a knowledge base; and
when a user converses with the enterprise AI application, performing semantic retrieval over the user's question in the enterprise knowledge base, then sending the retrieved relevant answers and the question, together with a suitable prompt, to the large model, and returning the final answer to the user.
4. A domain-knowledge-driven large language model fine-tuning system, characterized by comprising a large-language-model pre-training module, an enterprise-domain knowledge introduction and fine-tuning module, an enterprise vector-database knowledge storage module, an enterprise-specific AI application module, and an enterprise-specific AI application result output module.
5. The large language model fine tuning method system of claim 4, wherein the pre-training large language module is configured to pre-train the large language model using a large-scale generic corpus.
6. The system of claim 4, wherein the knowledge introduction and fine tuning module is configured to construct an enterprise knowledge graph by organizing knowledge in the enterprise domain, to guide the large language model during the fine tuning stage to extract and analyze knowledge in real time according to the context and knowledge of the enterprise domain, and to feed the extracted knowledge back into the knowledge base.
7. The large language model fine tuning system of claim 4, wherein the enterprise vector database knowledge storage module is configured to vectorize the entity and relation elements in the enterprise knowledge graph and store the resulting vectors in a vector database.
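The vectorization step of claim 7 could be sketched as below: knowledge-graph triples are serialized to text and their vectors stored in a dictionary standing in for the vector database. The `toy_embed` function and the dict-based store are illustrative assumptions, not the claimed storage module.

```python
# Hypothetical sketch: serialize (entity, relation, entity) triples and store
# their embedding vectors in a toy in-memory "vector database".
def toy_embed(text):
    """Stand-in embedding: lowercase token counts as a sparse vector."""
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def triple_to_text(head, relation, tail):
    """Flatten one knowledge-graph triple into a text string."""
    return f"{head} {relation} {tail}"

def vectorize_graph(triples, embed_fn):
    """Return a toy vector store mapping serialized triple -> embedding."""
    store = {}
    for h, r, t in triples:
        text = triple_to_text(h, r, t)
        store[text] = embed_fn(text)
    return store
```

In practice `embed_fn` would be a real embedding model and `store` a dedicated vector database, so that the stored vectors can serve the semantic retrieval described in the other claims.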
8. The system of claim 4, wherein the enterprise-specific AI application construction module is configured to receive user input, guide the user to interact with the AI application through prompt techniques, and generate a targeted reply using the vectorized knowledge and domain knowledge in the enterprise knowledge base.
9. An electronic device, comprising a memory in which at least one program instruction is stored and a processor that loads and executes the at least one program instruction to implement the large language model fine tuning method of any one of claims 1 to 3.
10. A computer storage medium, in which at least one program instruction is stored, the at least one program instruction being loaded and executed by a processor to implement the large language model fine tuning method of any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311096075.XA CN117056531A (en) | 2023-08-29 | 2023-08-29 | Domain knowledge driven large language model fine tuning method, system, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117056531A true CN117056531A (en) | 2023-11-14 |
Family
ID=88653364
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311096075.XA Pending CN117056531A (en) | 2023-08-29 | 2023-08-29 | Domain knowledge driven large language model fine tuning method, system, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117056531A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117312535A (en) * | 2023-11-28 | 2023-12-29 | 中国平安财产保险股份有限公司 | Method, device, equipment and medium for processing problem data based on artificial intelligence |
CN117391192A (en) * | 2023-12-08 | 2024-01-12 | 杭州悦数科技有限公司 | Method and device for constructing knowledge graph from PDF by using LLM based on graph database |
CN117573857A (en) * | 2024-01-16 | 2024-02-20 | 北京凌云雀科技有限公司 | Intelligent document implementation method, device equipment and medium based on large model |
CN117633252A (en) * | 2023-12-14 | 2024-03-01 | 广州华微明天软件技术有限公司 | Auxiliary retrieval method integrating knowledge graph and large language model |
CN117649129A (en) * | 2023-12-25 | 2024-03-05 | 宏景科技股份有限公司 | Multi-agent cooperative system and strategy method suitable for industrial digitization |
CN117669737A (en) * | 2023-12-20 | 2024-03-08 | 中科星图数字地球合肥有限公司 | Method for constructing and using large language model in end-to-end geographic industry |
CN118069716A (en) * | 2024-04-17 | 2024-05-24 | 三峡高科信息技术有限责任公司 | Auxiliary decision making system based on knowledge enhancement strong model in group type enterprise background |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN117056531A (en) | Domain knowledge driven large language model fine tuning method, system, equipment and storage medium | |
CN107679039B (en) | Method and device for determining statement intention | |
CN111274790B (en) | Chapter-level event embedding method and device based on syntactic dependency graph | |
CN116501306B (en) | Method for generating interface document code based on natural language description | |
CN107861954B (en) | Information output method and device based on artificial intelligence | |
CN113761868B (en) | Text processing method, text processing device, electronic equipment and readable storage medium | |
CN116821307B (en) | Content interaction method, device, electronic equipment and storage medium | |
Bai et al. | Applied research of knowledge in the field of artificial intelligence in the intelligent retrieval of teaching resources | |
CN111143571A (en) | Entity labeling model training method, entity labeling method and device | |
CN116932776A (en) | Knowledge graph-based large model knowledge updating method and device | |
CN116644168A (en) | Interactive data construction method, device, equipment and storage medium | |
CN117573842B (en) | Document retrieval method and automatic question-answering method | |
CN115858750A (en) | Power grid technical standard intelligent question-answering method and system based on natural language processing | |
CN117725183A (en) | Reordering method and device for improving retrieval performance of AI large language model | |
CN110287294A (en) | Intellectual property concept answers method and system automatically | |
CN117271724A (en) | Intelligent question-answering implementation method and system based on large model and semantic graph | |
CN118410175A (en) | Intelligent manufacturing capacity diagnosis method and device based on large language model and knowledge graph | |
CN111859950A (en) | Method for automatically generating lecture notes | |
CN107622047B (en) | Design decision knowledge extraction and expression method | |
CN113837307A (en) | Data similarity calculation method and device, readable medium and electronic equipment | |
CN117874204A (en) | Knowledge question-answering method, system, storage medium and computer equipment | |
CN114490922A (en) | Natural language understanding model training method and device | |
CN116186219A (en) | Man-machine dialogue interaction method, system and storage medium | |
CN112559753A (en) | Management framework of natural language text processing and analyzing task based on business process management technology | |
Wang et al. | Sentence compression with reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||