CN117151245A - Private knowledge content generation method based on general knowledge large model and transfer learning - Google Patents

Private knowledge content generation method based on general knowledge large model and transfer learning

Info

Publication number
CN117151245A
CN117151245A CN202311172659.0A CN202311172659A CN117151245A CN 117151245 A CN117151245 A CN 117151245A CN 202311172659 A CN202311172659 A CN 202311172659A CN 117151245 A CN117151245 A CN 117151245A
Authority
CN
China
Prior art keywords
reference material
data
similarity
vector
private
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311172659.0A
Other languages
Chinese (zh)
Inventor
张慧
丁鲲
张骁雄
蒋国权
刘姗姗
刘茗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202311172659.0A
Publication of CN117151245A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a private knowledge content generation method based on a general knowledge large model and transfer learning. The method comprises: obtaining a structured database, and performing inter-table foreign key association on the structured relational data in the structured database to obtain table association data; processing the table association data to obtain a text data set and an instruction data set, and performing LoRA fine-tuning on the general knowledge large model with the text data set and the instruction data set to obtain a private knowledge large model; acquiring real-time private data, constructing a reference material vector library from the vectors of the text segments corresponding to the real-time private data, and selecting from the library the reference material vector closest to the input vector corresponding to the user input to obtain the corresponding reference material; and splicing the user input and the reference material according to an instruction template, and inputting the spliced result into the private knowledge large model to obtain the corresponding generated content. The method improves the timeliness and effectiveness of the generated content.

Description

Private knowledge content generation method based on general knowledge large model and transfer learning
Technical Field
The application relates to the technical field of deep learning, and in particular to a private knowledge content generation method based on a general knowledge large model and transfer learning.
Background
Currently, large pre-trained language models such as OpenAI's ChatGPT, Baidu's ERNIE Bot (Wenxin Yiyan), Kunlun Wanwei's Tiangong, and Alibaba's Tongyi Qianwen basically adopt parameters on the scale of billions and training corpora of billions of tokens, and are trained with techniques such as instruction fine-tuning and reinforcement learning, thereby showing excellent content generation capability. In vertical domains, a number of excellent transfer models have also appeared, such as DoctorGLM, a Chinese medical consultation model trained on ChatGLM-6B and Chinese medical dialogue data; LexiLaw, a Chinese legal large model trained on Chinese-LLaMA and Chinese legal-domain dialogue and question-answering data; and XuanYuan, a Chinese financial large model trained on BLOOM and public Chinese financial data.
However, general-purpose pre-trained large language models such as ChatGPT and ERNIE Bot, and vertical-domain large models such as DoctorGLM and LexiLaw, are all trained on their own domain data. Lacking enterprise private knowledge, these large models cannot provide truly professional and reliable answers for enterprise question answering and content generation. Conventional general knowledge large models mostly use plain text data or manually curated instruction data for domain fine-tuning, whereas enterprise private knowledge is mostly structured, relational data stored in relational databases such as Oracle and MySQL, so conventional general knowledge large models cannot be trained directly on this high-value private data. Conventional general knowledge large models are also slow to update themselves with real-time information and cannot effectively distinguish new knowledge from old, so they cannot perform reasoning, prediction, and content generation based on the latest facts, and every information update requires costly retraining. Because enterprise content generation has high timeliness requirements, conventional general knowledge large models cannot truly meet the needs of enterprise users.
Disclosure of Invention
In view of the above, it is necessary to provide a private knowledge content generation method based on a general knowledge large model and transfer learning.
A private knowledge content generation method based on a general knowledge large model and transfer learning, the method comprising:
obtaining a structured database corresponding to the enterprise private data required by the current service mode, and performing inter-table foreign key association on the structured relational data in the structured database to obtain table association data;
processing the table association data with an sql2text method and an sql2Instruction method, respectively, to obtain a text data set and an instruction data set, and performing LoRA fine-tuning on a pre-trained general knowledge large model with the text data set and the instruction data set to obtain a private knowledge large model;
acquiring real-time private data, constructing a reference material vector library from the vectors of the text segments corresponding to the real-time private data, acquiring user input, converting the user input into an input vector, and selecting from the reference material vector library the reference material vector closest to the input vector to obtain the reference material corresponding to the user input;
and splicing the user input and the reference material according to an instruction template, and inputting the spliced result into the private knowledge large model to obtain the corresponding generated content.
In one embodiment, the method further comprises: acquiring preset associated field pairs, the associated field pairs being related to the service mode provided by the private knowledge large model; and performing inter-table foreign key association on the structured relational data containing the associated field pairs to obtain the table association data.
In one embodiment, the method further comprises: exporting the table association data as SQL statements, and constructing an instruction template set according to a plurality of preset question modes and answer modes; and filling in data according to the SQL statements and the instruction template set to obtain the instruction data set.
In one embodiment, the method further comprises: obtaining a mixed data set from the text data set and the instruction data set; decomposing the parameters to be trained into a dimension-reducing matrix and a dimension-increasing matrix, and obtaining the parameters of the private knowledge large model from the parameters of the general knowledge large model and the parameters to be trained; and inputting the mixed data set into the general knowledge large model and iteratively training the parameters of the private knowledge large model, stopping the iteration when the difference between the model output and the true answer is minimized, to obtain the private knowledge large model.
In one embodiment, the method further comprises: converting the real-time private data into text with the sql2text method, and dividing the text into a plurality of natural segments; and converting each natural segment into a multidimensional vector to obtain a reference material vector, and obtaining the reference material vector library from the reference material vectors corresponding to the natural segments.
In one embodiment, the method further comprises: calculating the cosine similarity between the input vector and each reference material vector in the reference material vector library, and obtaining the reference material vector closest to the input vector by comparing the cosine similarity of each reference material vector with a preset threshold; and obtaining the reference material corresponding to the user input from the natural segment corresponding to the closest reference material vector.
In one embodiment, the method further comprises: obtaining the maximum input length supported by the private knowledge large model and a similarity difference interval, the similarity difference interval comprising a minimum difference and a maximum difference; sorting the cosine similarities corresponding to the reference material vectors in descending order to obtain a similarity list; and traversing each cosine similarity in the similarity list, calculating the similarity difference between the current cosine similarity and the preceding similarity in the ranking that meets a preset condition, and obtaining the reference material corresponding to the user input according to the relationship between the similarity difference and the similarity difference interval; wherein the total length of the user input and the reference material is less than the maximum input length.
In one embodiment, the method further comprises: if the similarity difference lies within the similarity difference interval, retaining the reference material corresponding to the current similarity in the ranking.
In one embodiment, the method further comprises: if the similarity difference is smaller than the minimum difference, discarding the reference material corresponding to the current similarity in the ranking.
In one embodiment, the method further comprises: if the similarity difference is larger than the maximum difference, discarding the reference material vector corresponding to the current similarity in the ranking, and taking the reference material corresponding to the preceding similarity that meets the preset condition as the reference material corresponding to the user input.
In the above private knowledge content generation method based on a general knowledge large model and transfer learning, the general knowledge large model is trained on domain private knowledge to obtain the private knowledge large model; real-time private data is acquired, and a reference material vector library is constructed from the vectors of the text segments corresponding to the real-time private data; user input is acquired and converted into an input vector, and the reference material vector closest to the input vector is selected from the reference material vector library to obtain the reference material corresponding to the user input; and the user input and the reference material are spliced according to the instruction template, and the spliced result is input into the private knowledge large model to obtain the corresponding generated content. The embodiments of the application can improve the timeliness and effectiveness of the generated content.
Drawings
FIG. 1 is a flow diagram of a private knowledge content generation method based on a general knowledge large model and transfer learning in one embodiment;
FIG. 2 is a schematic diagram of the inference flow of the private knowledge large model in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in FIG. 1, a private knowledge content generation method based on a general knowledge large model and transfer learning is provided, which includes the following steps:
Step 102, obtaining a structured database corresponding to the enterprise private data required by the current service mode, and performing inter-table foreign key association on the structured relational data in the structured database to obtain table association data.
Private knowledge refers to the proprietary knowledge assets accumulated within an enterprise, including both the experiential knowledge and skills formed within the enterprise and its internal business intelligence. A large proportion of enterprise private knowledge is stored as structured data, in the form of relationships and attributes, in structured databases such as Oracle and MySQL; examples include product parameter data, description data, trend data, customer data, and financial record data. Inter-table foreign key association is performed according to the association fields corresponding to the service mode, where the service modes include predictive analysis of future product parameters. Product parameters comprise technical parameters and descriptive parameters, and descriptive parameters include the product's scope of application and research direction. Predictive analysis means deriving the parameters of a new product from the technical and descriptive parameters of historical products. In the predictive analysis mode, the product's 'front type' (predecessor) and 'modification' (modified variant) fields can be associated, so that a single record gives the parameters of the current product as well as those of its front type and modification products. Other service modes are implemented similarly to the predictive analysis mode and are not described again here.
Step 104, processing the table association data with an sql2text method and an sql2Instruction method, respectively, to obtain a text data set and an instruction data set, and performing LoRA fine-tuning on the pre-trained general knowledge large model with the text data set and the instruction data set to obtain a private knowledge large model.
A general knowledge large model is a machine learning model with a very large number of parameters, usually trained by deep learning. It targets the general domain and has not been incrementally trained on data from a specific industry; examples include ChatGLM, MOSS, and LLaMA.
During training of the general knowledge large model, the table association data is converted into a text data set by the sql2text method and used for incremental pre-training, so that the model acquires more enterprise private knowledge. An instruction template set is constructed manually according to the question modes and answer modes required for content generation in enterprise development, and the table association data is then converted into an instruction data set by the sql2Instruction method and used for instruction fine-tuning, so that the model better understands the intent behind the questions entered by enterprise users. The domain fine-tuning of the general knowledge large model combines the text data set and the instruction data set and is based on the LoRA technique.
Step 106, acquiring real-time private data, constructing a reference material vector library from the vectors of the text segments corresponding to the real-time private data, acquiring user input, converting the user input into an input vector, and selecting from the reference material vector library the reference material vector closest to the input vector to obtain the reference material corresponding to the user input.
Real-time private data refers to the more time-sensitive private data in the current service mode, that is, data not included in the data used to fine-tune the general knowledge large model. In addition, the real-time private data is the latest business data that can be acquired, and its specific content is determined by the research target. The user input is the question or instruction that the user wants to ask.
Step 108, splicing the user input and the reference material according to the instruction template, and inputting the spliced result into the private knowledge large model to obtain the corresponding generated content.
The content generation technology is a technology for automatically generating content such as articles, news, reports, and guides related to a specific field using artificial intelligence technology. The technology can automatically generate a large amount of high-quality content, provide timely, accurate and useful information for field practitioners, and improve the information transmission efficiency and the industry competitiveness. According to the embodiment of the application, related content related to future product parameter descriptions can be generated, so that enterprise users can assist in producing new-generation products by utilizing the future product parameters, and know possible future parameters of the new-generation products, so that the enterprise users can deploy other products suitable for the new-generation products in advance.
In the above private knowledge content generation method based on a general knowledge large model and transfer learning, the general knowledge large model is trained on domain private knowledge to obtain the private knowledge large model; real-time private data is acquired, and a reference material vector library is constructed from the vectors of the text segments corresponding to the real-time private data; user input is acquired and converted into an input vector, and the reference material vector closest to the input vector is selected from the reference material vector library to obtain the reference material corresponding to the user input; and the user input and the reference material are spliced according to the instruction template, and the spliced result is input into the private knowledge large model to obtain the corresponding generated content. By using highly time-sensitive domain data as reference material, the embodiments of the application effectively address the low timeliness of conventional general knowledge large models and their inability to distinguish new knowledge from old.
In one embodiment, the step of performing inter-table foreign key association on the structured relational data in the structured database to obtain table association data includes: acquiring preset associated field pairs, the associated field pairs being related to the service mode provided by the private knowledge large model; and performing inter-table foreign key association on the structured relational data containing the associated field pairs to obtain the table association data.
Specifically, taking the predictive analysis mode as an example, inter-table foreign key association is performed on the data stored in Oracle and MySQL structured databases, mainly through the two fields 'front type' and 'modification'. As a result, a single record gives the parameters, description, and development trend of the current product as well as those of its front type and modification products.
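The association through the two fields can be sketched as a self-join. This is only a minimal illustration: the `product` table, its column names (`front_type`, `modification`, `length`, `trend`), and the sample values are hypothetical, and SQLite stands in for the Oracle or MySQL database mentioned above.

```python
import sqlite3

# Hypothetical schema: `front_type` and `modification` reference other rows
# of the same `product` table. None of these names come from the source.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    name TEXT PRIMARY KEY,
    length TEXT,
    trend TEXT,
    front_type TEXT REFERENCES product(name),
    modification TEXT REFERENCES product(name)
);
INSERT INTO product VALUES ('ship 10', '90 m',  'trend A', NULL,      'ship 1');
INSERT INTO product VALUES ('ship 1',  '100 m', 'trend B', 'ship 10', 'ship 12');
INSERT INTO product VALUES ('ship 12', '110 m', 'trend C', 'ship 1',  NULL);
""")


def associate(connection):
    """Join each product with its front type and its modification product."""
    cursor = connection.execute("""
        SELECT p.name, p.length, p.trend,
               f.name, f.length, f.trend,
               m.name, m.length, m.trend
        FROM product AS p
        LEFT JOIN product AS f ON p.front_type = f.name
        LEFT JOIN product AS m ON p.modification = m.name
        ORDER BY p.name
    """)
    return cursor.fetchall()


rows = associate(conn)
```

Each returned row then carries the current product, its front type, and its modification side by side, which is the shape that the later conversion steps consume. `LEFT JOIN` is used so that products without a front type or modification are still kept.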
In one embodiment, the step of processing the table association data with the sql2text method to obtain a text data set includes: exporting the table association data as SQL statements, and filling in data according to the SQL statements and a pre-constructed conversion template to obtain the text data set.
Specifically, the sql2text method converts the table association data in the structured database into a natural-language text data set on which the model can be trained. First, the structured table association data is exported as SQL statements. For example, exporting the record for the product "ship 1" yields the following SQL statement:
INSERT INTO table_name ("category", "length", ..., "development trend", "front-category", "front-length", ..., "front-development trend", "modification-category", "modification-length", ..., "modification-development trend") VALUES
(value 1, value 2, ..., value n, front-value 1, front-value 2, ..., front-value n, modification-value 1, modification-value 2, ..., modification-value n)
Then, a conversion template is constructed, for example:
Ship 1, xxxxx (brief description of the product), has the following basic parameters: length: xxx; width: xxx; ...; application scenario: xxx; development trend: xxx. The front type of ship 1 is ship 10, whose basic parameters are as follows: .... The modification of ship 1 is ship 12, whose basic parameters are as follows: ....
Finally, the data is filled in based on the exported SQL statement and the conversion template, which completes the construction of the text data set.
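The template-filling step can be sketched as follows. This is a minimal illustration under stated assumptions: each exported record is represented as a dictionary rather than parsed from an SQL statement, and the field names, template wording, and sample values are all hypothetical.

```python
# Hypothetical conversion template; the source only shows its general shape.
TEMPLATE = (
    "{name}, {description}, has the following basic parameters: "
    "length: {length}; width: {width}; development trend: {trend}. "
    "The front type of {name} is {front_name}, whose length is {front_length}. "
    "The modification of {name} is {mod_name}, whose length is {mod_length}."
)


def sql2text(records, template=TEMPLATE):
    """Fill the conversion template once per associated record."""
    return [template.format(**record) for record in records]


# One associated record, standing in for the values of an exported statement.
records = [{
    "name": "ship 1",
    "description": "a hypothetical transport ship",
    "length": "100 m",
    "width": "20 m",
    "trend": "larger hulls",
    "front_name": "ship 10",
    "front_length": "90 m",
    "mod_name": "ship 12",
    "mod_length": "110 m",
}]

texts = sql2text(records)
```

Each element of `texts` is one natural-language passage; together they would form the text data set used for incremental pre-training.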
In one embodiment, the step of processing the table association data with the sql2Instruction method to obtain the instruction data set includes: exporting the table association data as SQL statements, and constructing an instruction template set according to a plurality of preset question modes and answer modes; and filling in data according to the SQL statements and the instruction template set to obtain the instruction data set.
Specifically, the sql2Instruction method converts the table association data in the structured database into an instruction data set on which the model can be trained. First, the structured table association data is exported as SQL statements; this step is the same as the corresponding step in generating the text data set.
Then, an instruction template set is constructed manually according to the question modes and answer modes required for content generation in enterprise development. To enrich the diversity of question and answer modes in content generation, multiple instruction templates are constructed. For example, when the user provides the parameter indexes and development trend data of a certain ship and asks for a prediction of the parameter indexes of its next-generation product, an instruction template can be constructed as follows:
Input: Known information: {Ship 1, xxxxx (brief description of the product), has the following basic parameters: length: xxx; width: xxx; ...; application scenario: xxx; development trend: xxx.}. Based on the known information above, answer the user's question concisely and professionally. If the answer cannot be obtained from it, say "the question cannot be answered from the known information" or "not enough relevant information was provided"; adding fabricated content to the answer is not allowed, and the answer must be in Chinese. The question is: {What are the basic parameters of ship 12, the next-generation product of ship 1?}
Output: From the provided parameter information and development trend of ship 1, it can be inferred that the basic parameters of the next-generation product, ship 12, are as follows: length: xxx; width: xxx; ...; application scenario: xxx;
Finally, the data is filled in based on the exported SQL statement and the instruction template set, which completes the construction of the instruction data set.
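The instruction data set construction can be sketched in the same spirit. The question templates, answer template, record fields, and the `input`/`output` sample layout below are hypothetical; the source only states that several manually written templates are filled with exported data.

```python
# Hypothetical question and answer templates (several questions per record
# to diversify the question mode).
QUESTION_TEMPLATES = [
    "What are the basic parameters of {mod_name}, the next-generation product of {name}?",
    "Predict the basic parameters of {mod_name} from the information about {name}.",
]
KNOWN_TEMPLATE = "{name}: length: {length}; development trend: {trend}."
ANSWER_TEMPLATE = (
    "From the provided parameters and development trend of {name}, the basic "
    "parameters of {mod_name} can be inferred as follows: length: {mod_length}."
)
PROMPT_TEMPLATE = (
    "Known information: {known} Based on the known information above, answer "
    "the user's question concisely and professionally. The question is: {question}"
)


def sql2instruction(records):
    """Build one input/output sample per (record, question template) pair."""
    samples = []
    for record in records:
        known = KNOWN_TEMPLATE.format(**record)
        answer = ANSWER_TEMPLATE.format(**record)
        for question_template in QUESTION_TEMPLATES:
            question = question_template.format(**record)
            prompt = PROMPT_TEMPLATE.format(known=known, question=question)
            samples.append({"input": prompt, "output": answer})
    return samples


records = [{
    "name": "ship 1", "length": "100 m", "trend": "larger hulls",
    "mod_name": "ship 12", "mod_length": "110 m",
}]
samples = sql2instruction(records)
```

Crossing every record with every question template is one simple way to obtain the diversity of question modes mentioned above; other pairings are equally possible.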
In one embodiment, the step of performing LoRA fine-tuning on the pre-trained general knowledge large model with the text data set and the instruction data set to obtain the private knowledge large model includes: obtaining a mixed data set from the text data set and the instruction data set; decomposing the parameters to be trained into a dimension-reducing matrix and a dimension-increasing matrix, and obtaining the parameters of the private knowledge large model from the parameters of the general knowledge large model and the parameters to be trained; and inputting the mixed data set into the general knowledge large model and iteratively training the parameters of the private knowledge large model, stopping the iteration when the difference between the model output and the true answer is minimized, to obtain the private knowledge large model.
In this embodiment, the text data set helps the large model learn more enterprise private knowledge, and the instruction data set helps the large model better understand the intent behind the user's questions or instructions, resulting in professional and reliable generated content.
LoRA is a lightweight fine-tuning technique for large models. Let the parameters of the original general knowledge large model be $W_0$, the parameters of the private knowledge large model obtained after fine-tuning be $W$, and the parameter update learned over the whole training process be $\Delta W$. Then:
$$W = W_0 + \Delta W$$
The parameters to be trained, $\Delta W$, are decomposed into two matrices, a dimension-reducing matrix $A$ and a dimension-increasing matrix $B$, so that:
$$W = W_0 + \Delta W = W_0 + BA$$
where $W_0 \in \mathbb{R}^{d \times k}$ denotes the parameters of the initial general knowledge large model; $B \in \mathbb{R}^{d \times r}$ is the dimension-increasing matrix, initialized as an all-zero matrix; $A \in \mathbb{R}^{r \times k}$ is the dimension-reducing matrix, initialized from a random Gaussian distribution; and $r$ is the rank, a preset hyperparameter with $r \le \min(d, k)$. Throughout training, $A$ and $B$ are the trainable parameters; all other quantities are either preset hyperparameters or fixed parameters.
In the forward pass during training, $W_0$ and $\Delta W$ are multiplied by the same input $x$, namely:
$$h = W_0 x + \Delta W x = W_0 x + BAx$$
The training objective of the model is:
$$\min\,(h' - h)$$
where $h'$ is the true answer and $h$ is the model output. Training iterates in this way until the model converges.
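The forward computation above can be sketched in plain Python. This is a toy illustration only: the dimensions, the Gaussian standard deviation of 0.01, and the helper names are assumptions, and a real fine-tuning run would use a deep learning framework rather than nested lists.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def lora_init(d, k, r, seed=0):
    """A (r x k) is drawn from a Gaussian; B (d x r) starts as all zeros."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(k)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d)]
    return A, B


def lora_forward(W0, A, B, x):
    """h = W0 x + B (A x), where x is a k x 1 column vector."""
    return add(matmul(W0, x), matmul(B, matmul(A, x)))


# Toy sizes chosen for readability, not taken from the source.
d, k, r = 4, 3, 2
W0 = [[float(i + j) for j in range(k)] for i in range(d)]  # frozen weights
x = [[1.0], [2.0], [3.0]]
A, B = lora_init(d, k, r)
h = lora_forward(W0, A, B, x)
```

Because $B$ starts at zero, the product $BA$ is zero before training, so the first forward pass reproduces the output of the frozen model exactly. Only $A$ and $B$ are updated, which is $r(d+k)$ values instead of $dk$; for an illustrative layer with $d = k = 4096$ and $r = 8$, that is 65,536 trainable values instead of 16,777,216.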
In one embodiment, as shown in FIG. 2, a schematic diagram of the inference flow of the private knowledge large model is provided. The step of constructing a reference material vector library from the vectors of the text segments corresponding to the real-time private data includes: converting the real-time private data into text with the sql2text method, and dividing the text into a plurality of natural segments; and converting each natural segment into a multidimensional vector to obtain a reference material vector, and obtaining the reference material vector library from the reference material vectors corresponding to the natural segments.
Specifically, the more time-sensitive private data is converted into text by the sql2text method, and the text is divided into natural segments of roughly 200 to 300 words each. Each segment is then converted into a 768-dimensional vector with the open-source vector conversion model text2vec-base-Chinese from the Hugging Face website, and the vectors are stored to obtain the reference material vector library.
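The segmentation step can be sketched as a greedy packing of sentences. The sentence-splitting rule, the use of character counts as the length measure, and the 300-character cap are assumptions made for illustration; the vector conversion itself is not shown because it depends on the embedding model.

```python
import re


def split_into_segments(text, max_len=300):
    """Greedily pack whole sentences into segments of at most max_len characters.

    A single sentence longer than max_len becomes its own oversized segment.
    """
    # Split after sentence-ending punctuation (Chinese or Western), keeping it.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])", text) if s]
    segments, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_len:
            segments.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        segments.append(current)
    return segments


# Five synthetic 100-character sentences stand in for converted private data.
text = ("a" * 99 + ".") * 5
segments = split_into_segments(text, max_len=300)
```

Splitting on sentence boundaries keeps each segment readable when it is later spliced into the model input, and concatenating the segments reproduces the original text.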
In one embodiment, the step of selecting from the reference material vector library the reference material vector closest to the input vector to obtain the reference material corresponding to the user input includes: calculating the cosine similarity between the input vector and each reference material vector in the reference material vector library, and obtaining the reference material corresponding to the user input by comparing the cosine similarity of each reference material vector with a preset threshold.
Specifically, the user enters a question or instruction, which is converted into a 768-dimensional vector by the open-source vector conversion model text2vec-base-Chinese. A similarity threshold is designed dynamically according to the actual distribution characteristics of the private data. The cosine similarity between the input question vector and each vector in the reference material vector library is calculated one by one, and the paragraph texts whose cosine similarity meets the threshold requirement are selected as the reference materials for the question. The cosine similarity is calculated as follows:
$$\cos(q, y) = \frac{\sum_{i} q_i y_i}{\sqrt{\sum_{i} q_i^2}\,\sqrt{\sum_{i} y_i^2}}$$
where $q_i$ and $y_i$ are the $i$-th components of the input question vector $q$ and of a vector $y$ in the reference material vector library, respectively.
Then, the input question and the original text of the reference material paragraphs are spliced according to the instruction template to obtain the final input, and the final input is fed into the fine-tuned private knowledge large model to obtain the final generated content.
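Retrieval and splicing can be sketched as follows. The three-dimensional toy vectors stand in for the 768-dimensional embeddings, and the function names, the fixed threshold of 0.5, and the prompt wording are hypothetical.

```python
import math


def cosine_similarity(q, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(q, y))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_q == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_q * norm_y)


def retrieve(query_vector, library, threshold):
    """Return (segment, similarity) pairs at or above threshold, best first."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in library]
    hits = [(text, sim) for text, sim in scored if sim >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


def splice(question, references):
    """Combine the reference segments and the question into one model input."""
    known = "\n".join(references)
    return (
        f"Known information: {known}\n"
        "Based on the known information above, answer the user's question "
        "concisely and professionally.\n"
        f"The question is: {question}"
    )


# Toy library of (segment text, vector) pairs.
library = [
    ("seg A", [1.0, 0.0, 0.0]),
    ("seg B", [1.0, 1.0, 0.0]),
    ("seg C", [0.0, 1.0, 0.0]),
]
hits = retrieve([1.0, 0.0, 0.0], library, threshold=0.5)
prompt = splice("What is X?", [text for text, _ in hits])
```

A linear scan over the library is enough for a sketch; a large library would normally use a vector index instead.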
In one embodiment, the step of obtaining the reference material corresponding to the user input according to the magnitude relation between the cosine similarity corresponding to each reference material vector and the preset threshold value includes: obtaining a maximum input length and similarity difference interval supported by a private knowledge large model; the similarity difference interval comprises a minimum difference value and a maximum difference value; ordering cosine similarity corresponding to each reference material vector according to descending order to obtain a similarity list; traversing each cosine similarity in the similarity list, calculating a similarity difference value between the current cosine similarity and the last-ordered similarity meeting a preset condition, and obtaining a reference material corresponding to user input according to the relationship between the similarity difference value and a similarity difference value interval; the total length of the user input and the reference material is less than the maximum input length.
In this embodiment, a single fixed threshold is not applied to every data distribution; the threshold is defined in relation to the specific service data, which improves the accuracy of the reference materials in the model input. If too many paragraphs in the private database have a similarity to the input question vector above the threshold, the final input formed after splicing becomes too long, which seriously degrades the effect of the large model. If several paragraphs in the private database have almost identical similarity to the input question vector and all exceed the set threshold, splicing them all with the input question greatly increases the redundancy of the final input, which also degrades the effect of the large model. To address these problems, a method is designed that sets the similarity threshold dynamically according to the actual distribution of the private data, thereby improving the stability of the model input.
In one embodiment, the method further comprises: if the similarity difference value is in the similarity difference value interval, reserving a reference material corresponding to the current sorting similarity; if the similarity difference is smaller than the minimum difference, discarding the reference material corresponding to the current sorting similarity; if the similarity difference is larger than the maximum difference, discarding the reference material vector corresponding to the current sorting similarity, and taking the reference material corresponding to the last sorting similarity meeting the preset condition as the reference material corresponding to the user input.
In this embodiment, the method for dynamically designing the similarity threshold according to the actual distribution of the private data is as follows:
The maximum input length supported by the model is n characters, and the similarity difference interval is set to (S_i, S_j), for example (0.01, 0.1). First, the list of cosine similarities between the input question vector and the reference material vectors in the reference material library is obtained and arranged in descending order. The first reference material vector, which has the highest similarity, serves as the baseline. If the similarity difference between the second reference material vector and the first reference material vector lies within the interval (S_i, S_j), the second reference material vector is retained, and the comparison continues with the similarity difference between the third reference material vector and the second reference material vector. If the similarity difference is smaller than S_i, the second reference material vector is discarded, and the comparison continues with the similarity difference between the third reference material vector and the first reference material vector. If the similarity difference is greater than S_j, the second reference material vector is discarded, the comparison stops, and the first paragraph is selected as the final reference material. The process continues in this way until all qualifying paragraphs are obtained, subject to the requirement that the total length of the input question plus the paragraphs be less than the maximum input length n. For a private domain database, in practical application, multiple groups of (S_i, S_j) are designed as grid parameters, and the optimal group of (S_i, S_j) parameters is selected according to the content generation effect.
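The dynamic threshold procedure above can be sketched as follows. This is an illustrative Python reading of the embodiment, not the applicant's implementation: the function name select_references, the treatment of the interval endpoints as inclusive, the measurement of length in characters with len, and stopping as soon as the length budget would be exceeded are all assumptions.

```python
from typing import List, Optional, Tuple


def select_references(
    question: str,
    scored: List[Tuple[float, str]],  # (cosine similarity, paragraph text)
    s_min: float,                     # S_i: minimum similarity difference
    s_max: float,                     # S_j: maximum similarity difference
    max_len: int,                     # n: maximum input length in characters
) -> List[str]:
    """Pick reference paragraphs by similarity gaps instead of a fixed cutoff."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    kept: List[str] = []
    used = len(question)
    last_kept_sim: Optional[float] = None  # similarity of the last retained paragraph

    for sim, paragraph in ranked:
        if last_kept_sim is not None:
            diff = last_kept_sim - sim
            if diff < s_min:
                # Nearly the same similarity as the last retained paragraph:
                # discard as redundant and keep comparing against the same baseline.
                continue
            if diff > s_max:
                # Similarity drops too sharply: discard and stop comparing.
                break
        # Assumption: stop once question + paragraphs would reach max_len.
        if used + len(paragraph) >= max_len:
            break
        kept.append(paragraph)
        used += len(paragraph)
        last_kept_sim = sim
    return kept


candidates = [(0.90, "para A"), (0.895, "para B"), (0.85, "para C"), (0.60, "para D")]
print(select_references("Q?", candidates, s_min=0.01, s_max=0.1, max_len=100))
# -> ['para A', 'para C']
```

Here "para B" is dropped because it is within 0.01 of the retained "para A", and the traversal stops at "para D" because its gap to "para C" exceeds 0.1. A grid search over several (s_min, s_max) pairs, as the embodiment suggests, would call this function once per pair and compare the resulting generation quality.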
It should be understood that, although the steps in the flowchart of Fig. 1 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated herein, the execution of the steps is not strictly limited to this order, and the steps may be executed in other orders. Moreover, at least some of the steps in Fig. 1 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed in sequence, but they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this description.
The above examples merely represent a few embodiments of the present application. They are described specifically and in detail, but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of the application should be determined by the appended claims.

Claims (10)

1. A private knowledge content generation method based on a general knowledge large model and transfer learning, the method comprising:
obtaining a structured database corresponding to enterprise private data required by a current service mode, and carrying out foreign key association between tables on structured relational data in the structured database to obtain table association data;
processing the table association data using an sql2text method and an sql2Instruction method respectively to obtain a text data set and an instruction data set, and performing LoRA fine-tuning on a pre-trained general knowledge large model using the text data set and the instruction data set to obtain a private knowledge large model;
acquiring real-time private data, constructing a reference material vector library according to a plurality of segmented text vectors corresponding to the real-time private data, acquiring user input, performing vector conversion on the user input to obtain an input vector, and selecting a reference material vector closest to the input vector from the reference material vector library to obtain a reference material corresponding to the user input;
and splicing the user input and the reference material according to an instruction template, and inputting the spliced result into the private knowledge large model to obtain corresponding generated content.
2. The method of claim 1, wherein the step of performing table-to-table foreign key association on the structured relational data in the structured database to obtain table association data comprises:
setting an associated field pair according to the current task mode;
and carrying out foreign key association between tables on the structured relational data containing the associated field pair to obtain table association data.
3. The method of claim 1, wherein the step of processing the table association data using an sql2Instruction method to obtain an instruction data set comprises:
exporting the table association data into sql sentences, and constructing an instruction template set according to a plurality of preset question modes and answer modes;
and filling data according to the sql statement and the instruction template set to obtain an instruction data set.
4. The method of claim 1, wherein the step of performing LoRA fine-tuning on a pre-trained general knowledge large model using the text data set and the instruction data set to obtain a private knowledge large model comprises:
obtaining a mixed data set according to the text data set and the instruction data set;
decomposing parameters to be trained into a dimension-reducing matrix and a dimension-increasing matrix, and obtaining parameters of a private knowledge large model according to the parameters of the general knowledge large model and the parameters to be trained;
and inputting the mixed data set into the general knowledge large model, and iteratively training the parameters of the private knowledge large model until the difference between the model output and the real answer is minimized, then stopping iteration to obtain the private knowledge large model.
5. The method according to claim 1, wherein the step of constructing a reference material vector library from a plurality of segmented text vectors corresponding to the real-time private data comprises:
converting the real-time private data into text by adopting an sql2text method, and dividing the text into a plurality of natural segments;
and converting each natural segment into a multidimensional vector to obtain a reference material vector, and obtaining a reference material vector library according to the reference material vectors corresponding to the natural segments.
6. The method of claim 1, wherein the step of selecting a reference material vector from the reference material vector library that is closest to the input vector to obtain the reference material corresponding to the user input comprises:
and calculating cosine similarity between the input vector and the reference material vectors in the reference material vector library, and obtaining the reference material corresponding to the user input according to the magnitude relation between the cosine similarity corresponding to each reference material vector and a preset threshold value.
7. The method of claim 6, wherein the step of obtaining the reference material corresponding to the user input according to the magnitude relation between the cosine similarity corresponding to each reference material vector and the preset threshold value comprises:
obtaining the maximum input length supported by the private knowledge large model and a similarity difference interval; the similarity difference interval comprises a minimum difference value and a maximum difference value;
ordering cosine similarity corresponding to each reference material vector according to descending order to obtain a similarity list;
traversing each cosine similarity in the similarity list, calculating a similarity difference value between the current cosine similarity and the last-ordered similarity meeting a preset condition, and obtaining a reference material corresponding to user input according to the relationship between the similarity difference value and the similarity difference value interval; the total length of the user input and the reference material is less than the maximum input length.
8. The method of claim 7, wherein the method further comprises:
and if the similarity difference value is in the similarity difference value interval, reserving a reference material corresponding to the current sorting similarity.
9. The method of claim 7, wherein the method further comprises:
and if the similarity difference value is smaller than the minimum difference value, discarding the reference material corresponding to the current sorting similarity.
10. The method of claim 7, wherein the method further comprises:
and if the similarity difference value is larger than the maximum difference value, discarding the reference material vector corresponding to the current sorting similarity, and taking the reference material corresponding to the last sorting similarity meeting the preset condition as the reference material corresponding to the user input.
CN202311172659.0A 2023-09-12 2023-09-12 Private knowledge content generation method based on general knowledge large model and transfer learning Pending CN117151245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311172659.0A CN117151245A (en) 2023-09-12 2023-09-12 Private knowledge content generation method based on general knowledge large model and transfer learning

Publications (1)

Publication Number Publication Date
CN117151245A true CN117151245A (en) 2023-12-01

Family

ID=88907826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311172659.0A Pending CN117151245A (en) 2023-09-12 2023-09-12 Private knowledge content generation method based on general knowledge large model and transfer learning

Country Status (1)

Country Link
CN (1) CN117151245A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117709441A (en) * 2024-02-06 2024-03-15 云南联合视觉科技有限公司 Method for training professional medical large model through gradual migration field
CN117709441B (en) * 2024-02-06 2024-05-03 云南联合视觉科技有限公司 Method for training professional medical large model through gradual migration field
CN117952121A (en) * 2024-03-27 2024-04-30 北方健康医疗大数据科技有限公司 Medical text quality assessment method, system, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination