CN117828049A - Data processing method and related device - Google Patents

Data processing method and related device

Info

Publication number
CN117828049A
Authority
CN
China
Prior art keywords
model
large model
data
target
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311847859.1A
Other languages
Chinese (zh)
Inventor
吴伟华
薛玉洁
张妙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN HARZONE TECHNOLOGY CO LTD
Original Assignee
SHENZHEN HARZONE TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN HARZONE TECHNOLOGY CO LTD filed Critical SHENZHEN HARZONE TECHNOLOGY CO LTD
Priority to CN202311847859.1A priority Critical patent/CN117828049A/en
Publication of CN117828049A publication Critical patent/CN117828049A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a data processing method and a related device. The method comprises the following steps: acquiring a related document set of a target field, and pre-training a first large model on the related document set to obtain a second large model; acquiring an instruction dataset pre-constructed for each subdivision field of the target field, and performing low-rank matrix incremental weight training on the second large model with the instruction dataset to obtain a fine-tuned third large model corresponding to the subdivision field; and constructing a reward model according to a pre-constructed score-ranking dataset and the third large model, taking the third large model as the agent and the reward model as the environment, and performing reinforcement learning training on the third large model to obtain a target large model for the target field. By adopting the embodiments of the application, the response speed and accuracy of an intelligent consultation system in a professional field can be improved.

Description

Data processing method and related device
Technical Field
The application relates to the technical field of artificial intelligence or reinforcement learning, in particular to a data processing method and a related device.
Background
Artificial intelligence and natural language processing techniques are increasingly applied in this field. With the dramatic growth of social information, organizations face challenges in processing large-scale unstructured text data, in tasks including but not limited to consultation services and information retrieval. To improve the professionalism and efficiency of domain intelligent consultation systems, researchers have begun to explore training methods for generative large models. A generative large model can produce natural, fluent text replies based on a given context; however, training a generative large model that meets professional-knowledge requirements remains a challenging problem. Therefore, how to improve the response speed and accuracy of an intelligent consultation system in a professional field is a problem that needs to be solved.
Disclosure of Invention
The embodiment of the application provides a data processing method and a related device, which can improve the response speed and accuracy of an intelligent consultation system in the professional field.
In a first aspect, an embodiment of the present application provides a data processing method, where the method includes:
acquiring a related document set in the target field, and pre-training a first large model through the related document set to obtain a second large model;
acquiring an instruction data set which is pre-constructed for each subdivision field of the target field, and performing low-rank matrix incremental weight training on the second large model through the instruction data set to obtain a fine-tuned third large model corresponding to the subdivision field;
and constructing a reward model according to a pre-constructed score-ranking dataset and the third large model, taking the third large model as the agent and the reward model as the environment, and performing reinforcement learning training on the third large model to obtain a target large model for the target field.
In a second aspect, embodiments of the present application provide a data processing apparatus, the apparatus including: a first acquisition unit, a second acquisition unit and a reinforcement unit, wherein,
the first acquisition unit is used for acquiring a related document set of the target field, and pre-training the first large model through the related document set to obtain a second large model;
the second acquisition unit is used for acquiring an instruction dataset pre-constructed for each subdivision field of the target field, and performing low-rank matrix incremental weight training on the second large model through the instruction dataset to obtain a fine-tuned third large model corresponding to the subdivision field;
the reinforcement unit is used for constructing a reward model according to a pre-constructed score-ranking dataset and the third large model, taking the third large model as the agent and the reward model as the environment, and performing reinforcement learning training on the third large model to obtain a target large model for the target field.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps in the first aspect of the embodiment of the present application.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program causes a computer to perform some or all of the steps as described in the first aspect of the embodiments of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product, wherein the computer program product comprises a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
By implementing the embodiment of the application, the following beneficial effects are achieved:
It can be seen that, in the data processing method and related apparatus described in the embodiments of the present application, a related document set of the target field is acquired, a first large model is pre-trained on the related document set to obtain a second large model, an instruction dataset pre-constructed for each subdivision field of the target field is acquired, low-rank matrix incremental weight training is performed on the second large model with the instruction dataset to obtain a fine-tuned third large model corresponding to the subdivision field, a reward model is constructed according to a pre-constructed score-ranking dataset and the third large model, and, with the third large model as the agent and the reward model as the environment, reinforcement learning training is performed on the third large model to obtain a target large model for the target field. First, a preset generative large model is pre-trained on large-scale unstructured and structured domain text data and general Chinese data, which forms the basis for language learning and becomes the starting point of the subsequent work. Second, a carefully prepared instruction dataset is constructed for each subdivision field, and low-rank matrix incremental weight training is performed on the pre-trained large model with these field-specific data, so that the model acquires field-specific knowledge and professional terminology. Third, a reward model is introduced which, based on the pre-constructed score-ranking dataset and the fine-tuned large model, accurately evaluates the quality of the generated text replies; the feedback of the reward model becomes the basis of reinforcement learning training. The fine-tuned large model is regarded as the agent and, under the guidance of the reward model, continuously optimizes its own reply strategy through repeated interaction with the environment, learning and adjustment. In this way, the large model can generate professional and fluent domain replies, provide efficient and accurate technical support for consultation services, achieve an important breakthrough in the field of unstructured text data processing, and improve the response speed and accuracy of an intelligent consultation system in the professional field.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a general flow of large model training provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of data preprocessing and pre-training according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a supervised fine tuning phase provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a flow chart for constructing a reward model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a reinforcement learning flow based on PPO algorithm according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of a semantic type and relationship type design of knowledge graph provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 9 is a block diagram of functional units of a data processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The electronic devices described in the embodiments of the present application may include smart phones (such as Android mobile phones, iOS mobile phones, windows Phone mobile phones, etc.), tablet computers, palm computers, automobile recorders, video matrixes, traffic guidance platforms, servers, notebook computers, mobile internet devices (MID, mobile Internet Devices), wearable devices (such as smartwatches, bluetooth headsets), etc., which are merely examples, but not exhaustive, including but not limited to the electronic devices described above.
In the embodiment of the application, the large model refers to a large language model (LLM).
In the related art, supervised learning methods have limitations in data labeling and model generalization and cannot meet the requirements of the field. To overcome these problems, the embodiments of the application provide an innovative auxiliary consultation method and a domain generative large model training method: a large-scale language model is pre-trained, and low-rank matrix incremental weight training is performed with instruction datasets of the subdivision fields, so that the generative large model is fine-tuned for the domain. Meanwhile, a reward model and reinforcement learning are introduced; the fine-tuned large model serves as the agent and the reward model as the environment for reinforcement learning training, yielding a more specialized and efficient domain generative large model.
In the embodiment of the application, a preset generative large model is pre-trained on unstructured domain text data and general Chinese data to obtain a large-scale language model serving as the base; based on an instruction dataset pre-constructed for each subdivision field, low-rank matrix incremental weight training is performed on the large model to obtain a fine-tuned large model corresponding to the subdivision-field task; a reward model is constructed according to a pre-constructed score-ranking dataset and the fine-tuned large model; and the fine-tuned large model is taken as the agent, the reward model is taken as the environment, and reinforcement learning training is performed on the fine-tuned large model to obtain the domain generative large model. The domain generative large model trained in this way can greatly improve the professionalism of the large model's replies within the knowledge field when answering consultations.
The embodiments of the application improve the response speed and accuracy of the domain intelligent consultation system and provide a feasible solution to the problem of large-scale unstructured text data processing. In tasks such as consultation replies, a generative large model trained by this method has a higher level of expertise and adaptive capability, providing powerful technical support for the work.
The embodiments of the present application are described in detail below.
Referring to fig. 1, fig. 1 is a flow chart of a data processing method according to an embodiment of the present application, and as shown in the drawing, the data processing method includes:
101. and acquiring a related document set in the target field, and pre-training the first large model through the related document set to obtain a second large model.
In the embodiment of the application, the target field may be set by a user or by system default, and the target field may be one or more fields. The target field may include at least one of: law, education, psychology, crime, property, finance, and the like, which is not limited herein.
In a specific implementation, a related document set of the target field can be acquired, and the first large model is pre-trained on the related document set to obtain a second large model. For example, a preset generative large model is pre-trained on large-scale unstructured and structured domain text data and general Chinese data, which forms the basis for language learning and becomes the starting point of the subsequent work. The second large model may be a Chinese large model.
In the pre-training stage, three types of data (unstructured domain text data, structured domain data, and general Chinese data) are used to build the corpus required for training, so as to enrich the knowledge of the large model.
102. And acquiring an instruction data set which is constructed in advance for each subdivision field of the target field, and performing low-rank matrix incremental weight training on the second large model through the instruction data set to obtain a fine-tuned third large model corresponding to the subdivision field.
In the embodiment of the application, this step mainly performs the low-rank matrix incremental weight training: an instruction dataset pre-constructed for each subdivision field of the target field is acquired, and low-rank matrix incremental weight training is performed on the second large model through the instruction dataset to obtain a fine-tuned third large model corresponding to the subdivision field.
In a specific implementation, a carefully prepared instruction dataset is constructed for each subdivision field, and low-rank matrix incremental weight training is performed on the pre-trained large model with these field-specific data. This step aims to let the large model adapt to and learn the knowledge and expressions of the specific field, realizing fine-tuning of the model and making it more specialized.
In the design of the large-model fine-tuning stage, a variety of field-specific refinement tasks are used to fine-tune the model so that the large model is suitable for many different tasks. For each task, a low-rank adapter is introduced to fine-tune the corresponding module, so full-parameter fine-tuning is not needed and the adapter modules can be loaded on demand.
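As an illustrative, non-limiting sketch of this low-rank incremental weight training, the following example uses the Hugging Face transformers and peft libraries under assumed paths, hyper-parameters, target-module names and dataset field names; none of these values are prescribed by the present application.
```python
# Sketch of per-field low-rank incremental weight training (LoRA). The base model
# weights stay frozen; only the small rank-r increment matrices are trained and
# saved, so a separate adapter can be loaded on demand for each subdivision field.
# Paths, hyper-parameters and dataset field names are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "path/to/second-large-model"            # the pre-trained domain base model
tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

lora_cfg = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])   # assumed module names
model = get_peft_model(model, lora_cfg)

def tokenize(example):
    # One instruction-dataset record for this subdivision field: prompt + reply.
    text = example["instruction"] + "\n" + example["response"]
    return tokenizer(text, truncation=True, max_length=1024)

ds = load_dataset("json", data_files="subfield_instructions.jsonl")["train"]
ds = ds.map(tokenize, remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-subfield", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=1e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

model.save_pretrained("lora-subfield/adapter")  # only the increment weights are stored
```
Because only the small adapter matrices are trained and saved, one adapter can be kept per subdivision field and loaded on demand, which matches the module-on-demand design described above.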
103. And constructing a reward model according to a pre-constructed score-ranking dataset and the third large model, taking the third large model as the agent and the reward model as the environment, and performing reinforcement learning training on the third large model to obtain a target large model for the target field.
In the embodiment of the application, a reward model is constructed according to the pre-constructed score-ranking dataset and the third large model; the third large model is taken as the agent, the reward model is taken as the environment, and reinforcement learning training is performed on the third large model to obtain the target large model for the target field. That is, a reward model is introduced which, based on the pre-constructed score-ranking dataset and the fine-tuned large model, accurately evaluates the quality of the generated text replies. The feedback of the reward model becomes the basis of reinforcement learning training. The fine-tuned large model is regarded as the agent and, under the guidance of the reward model, continuously optimizes its own reply strategy through repeated interaction with the environment, learning and adjustment. Through multiple iterations, the large model can generate professional and fluent domain replies, provide efficient and accurate technical support for consultation services, and achieve an important breakthrough in the field of unstructured text data processing.
Through the design of the reinforcement learning stage, the model is continuously optimized and strengthened with the different behavioral feedback of law enforcement personnel and users.
In the embodiment of the present application, in order to improve working efficiency and quality, the large language model can, through analysis and mining of large amounts of data, help institutions better perform tasks such as intelligence analysis, administrative management, public opinion monitoring and law enforcement assistance. As shown in FIG. 2, the training may include the following steps:
Step S1, pre-training a large text model with a large-scale document dataset related to the law enforcement field; specifically, the large-scale document dataset is preprocessed and used to obtain a Chinese large text model;
Step S2, further fine-tuning to obtain a question-answer reference model for the law enforcement field; specifically, supervised fine-tuning is performed on the Chinese large text model with a large-scale consultation dialogue dataset to obtain a Chinese question-answer reference model, which can generate multiple machine replies for a given input user question, and these machine replies can be used for feedback labeling;
Step S3, training an automatic reply effect evaluation model based on the question-answer feedback labels that users return for the machine replies; specifically, the feedback labels are used to train an automatic reply effect reward model;
Step S4, further training with a reinforcement learning method; specifically, reinforcement learning is performed on the Chinese question-answer reference model under the guidance of the automatic reply effect evaluation model.
In a specific implementation, first, a preset generative large model is pre-trained on large-scale unstructured and structured domain text data and general Chinese data, which forms the basis for language learning and becomes the starting point of the subsequent work. Then, a carefully prepared instruction dataset is constructed for each subdivision field, and low-rank matrix incremental weight training is performed on the pre-trained large model with these field-specific data, so that the model acquires field-specific knowledge and professional expressions. Next, a reward model is introduced which, based on the pre-constructed score-ranking dataset and the fine-tuned large model, accurately evaluates the quality of the generated text replies; the feedback of the reward model becomes the basis of reinforcement learning training. The fine-tuned large model is regarded as the agent and, under the guidance of the reward model, continuously optimizes its own reply strategy through repeated interaction with the environment, learning and adjustment. Through multiple iterations, the large model can generate professional and fluent domain replies, provide efficient and accurate technical support for consultation services, and achieve an important breakthrough in the field of unstructured text data processing.
Optionally, in step 101, the pre-training the first large model through the related document set to obtain a second large model may be performed according to the following steps:
processing the related document set to obtain text data, table data and map data; and pre-training the first large model through the text data, the table data and the map data to obtain the second large model.
In the embodiment of the application, the related document set can be processed to obtain text data, table data and map data, and the first large model is pre-trained through the text data, the table data and the map data to obtain the second large model.
In a specific implementation, unstructured domain text data and general Chinese data can be used to pre-train a preset generative large model. The pre-training process uses a large-scale dataset, so that the model can learn general language rules and context associations; the pre-trained model becomes the base and provides the foundation for subsequent fine-tuning and reinforcement learning.
In the pre-training stage, there are massive unstructured text data and structured table data in the field; in addition, general text data need to be input for pre-training, and the preprocessing methods differ for the different types of data. As shown in FIG. 3, the text data are, for example, laws and regulations; the table data are, for example, specific cases of illegal behaviors (fine schedules); the map data are specific entities and relations (knowledge-graph data; the specific contents are detailed in the drawing). These contents can then be input into the large model for training.
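As a hedged illustration of this preprocessing step, the sketch below flattens the three kinds of domain data (regulation text, violation/fine tables, and knowledge-graph triples) into a single plain-text pre-training corpus; the file layout, column names and serialization templates are assumptions made for the example only.
```python
# Sketch: merge text, table and knowledge-graph data into one pre-training corpus.
# File names, column names and the serialization templates are assumptions.
import csv
import json
from pathlib import Path

corpus = []

# 1) Unstructured text data, e.g. laws and regulations (one document per file).
for path in sorted(Path("corpus/regulations").glob("*.txt")):
    corpus.append(path.read_text(encoding="utf-8").strip())

# 2) Table data, e.g. a fine schedule: each row is verbalized into a sentence.
with open("corpus/fine_schedule.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        corpus.append(f"Violation: {row['violation']}. Penalty: {row['penalty']}.")

# 3) Map (knowledge-graph) data: (entity, relation, entity) triples as statements.
with open("corpus/graph_triples.jsonl", encoding="utf-8") as f:
    for line in f:
        head, relation, tail = json.loads(line)["triple"]
        corpus.append(f"{head} {relation} {tail}.")

# General Chinese data would be appended in the same way before training starts.
Path("corpus/pretrain.txt").write_text("\n\n".join(corpus), encoding="utf-8")
```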
The input of the large model may be an embedding vector obtained through text and position embedding (Text & Position Embedding), and the large model may include multiple layers, such as a Masked Multi-Head Self-Attention layer, a Layer Normalization (Layer Norm) layer, a Feed-Forward layer, another Layer Norm layer, and the like.
In a specific implementation, the main process of pre-training is unsupervised pre-training. Given an unsupervised token corpus U = {u_1, ..., u_n} and a standard language-modeling objective, the following likelihood is maximized:
L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \theta)
where k is the size of the context window and \theta denotes the model parameters. A multi-layer Transformer decoder, a variant of the Transformer, is used as the language model (large model). The model applies multi-head self-attention over the input context tokens, followed by position-wise feed-forward layers, to produce an output distribution over the target tokens:
h_0 = U W_e + W_p
h_l = transformer_block(h_{l-1}), \quad l = 1, \ldots, n
P(u) = softmax(h_n W_e^T)
where U is the context token vector, W_e is the token embedding matrix, n is the number of layers, W_p is the position embedding matrix, transformer_block denotes one layer of the multi-layer Transformer decoder, h_l is the hidden state of the l-th layer, h_n is the final hidden layer, and P(u) is the probability distribution over the generated tokens.
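Equivalently, maximizing L_1(U) amounts to minimizing the standard causal language-modeling cross-entropy over shifted tokens; the short sketch below shows this computation, assuming a model that returns logits in the usual [batch, sequence, vocabulary] layout.
```python
# Sketch of the objective L1(U): the causal-LM cross-entropy over shifted tokens,
# whose minimization is equivalent to maximizing sum_i log P(u_i | u_{i-k..i-1}).
import torch
import torch.nn.functional as F

def causal_lm_loss(model, input_ids: torch.LongTensor) -> torch.Tensor:
    """input_ids: [batch, seq_len]; the model is assumed to return .logits [B, T, V]."""
    logits = model(input_ids).logits
    shift_logits = logits[:, :-1, :]          # predictions for positions 1..T-1
    shift_labels = input_ids[:, 1:]           # the tokens actually observed there
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1))
```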
Optionally, the method further comprises the following steps:
freezing the second large model during fine-tuning, and learning the knowledge of a specific downstream task through an Adapter module; the Adapter module comprises two feed-forward layers and an intermediate layer, the two feed-forward layers being a first feed-forward layer and a second feed-forward layer, wherein the first feed-forward layer projects the representation down to the intermediate (bottleneck) dimension and the second feed-forward layer projects it back up to the original dimension.
In the embodiment of the application, in the fine-tuning stage, question-answer data are provided from both the user side and the law enforcement side; the law enforcement side covers intelligence analysis, law enforcement assistance, case-handling records, administrative penalty decisions, legal documents, and the like.
By way of example, as shown in FIG. 4, law enforcement question-answer data and case-handling records may be acquired and used for supervised fine-tuning of the large model.
Furthermore, the second large model can be frozen during fine-tuning, and the knowledge of a specific downstream task is learned through an Adapter module. The Adapter module comprises two feed-forward layers and an intermediate layer, the two feed-forward layers being a first feed-forward layer and a second feed-forward layer, wherein the first feed-forward layer projects the representation down to the intermediate (bottleneck) dimension and the second feed-forward layer projects it back up to the original dimension.
In a specific implementation, after pre-training the model on raw text, the model parameters are adapted to the supervised target task. Assume a labeled dataset C in which each instance consists of a sequence of input tokens x^1, ..., x^m together with a label y, so the training data are the pairs <X, y>. The input is passed through the pre-trained model to obtain the final activation h_l^m of the Transformer block, which then enters an Adapter feed-forward layer; the specific formula is as follows:
P(y \mid x^1, \ldots, x^m) = softmax(h_l^m W_y)
where W_y is a parameter matrix and y is the predicted value. This gives the following objective function:
L_2(C) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m)
Language modeling is then used as an auxiliary objective for fine-tuning, which helps improve the generalization of the supervised model and accelerates convergence. The optimization objective is:
L_3(C) = L_2(C) + \lambda \cdot L_1(C)
The model structure is shown on the right side of FIG. 4: the main body of the pre-trained model is frozen during fine-tuning, and the Adapter module learns the knowledge of the specific downstream task. As shown on the right side of the figure, the Adapter module comprises two feed-forward layers and an intermediate layer; the first feed-forward layer projects the representation down to the intermediate (bottleneck) dimension, and the second feed-forward layer projects it back up to the original dimension. In summary, the only additional parameters needed when fine-tuning a task are W_y and the embeddings of the delimiter tokens; the other parameters can remain frozen.
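A minimal PyTorch sketch of such a bottleneck Adapter, with the pre-trained body frozen, is given below; the hidden size, bottleneck size, activation and zero initialization of the up-projection are common adapter conventions assumed for illustration, not values fixed by the present application.
```python
# Sketch of the Adapter module used during supervised fine-tuning. The first
# feed-forward layer reduces the dimension to a small bottleneck, the second
# restores it, and a residual connection keeps the frozen base behaviour intact.
# Sizes, activation and initialization are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)   # first FF layer: reduce dim
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)     # second FF layer: restore dim
        nn.init.zeros_(self.up.weight)                      # start as a near-identity map
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))          # residual connection

def freeze_pretrained_body(model: nn.Module) -> None:
    # Only adapter parameters receive gradients; the task head W_y would likewise
    # stay trainable while the pre-trained body remains frozen.
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name.lower()
```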
Optionally, in step 103, constructing the reward model according to the pre-constructed score-ranking dataset and the third large model may be implemented as follows:
inputting the pre-constructed score-ranking dataset into the third large model to obtain question-answer pairs; determining a score-ranked set according to the question-answer pairs; and constructing the reward model according to the score-ranked set, wherein the reward model is used for evaluating the quality of the generated text.
In the embodiment of the application, the pre-constructed score-ranking dataset can be input into the third large model to obtain question-answer pairs, a score-ranked set is then determined according to the question-answer pairs, and a reward model is constructed according to the score-ranked set and used for evaluating the quality of the generated text.
In a specific implementation, the reward model is constructed as follows. First, the fine-tuned model (SFT model) generates a series of answers for a series of prompts, and a score-ranked set is produced through labeling and feedback by users and law enforcement personnel. The ELO rating system, a method for calculating the relative skill levels of players that is commonly used in competitive games and sporting events, is used to quantify the quality and professionalism of the text replies generated by the model based on these human rankings of the output results; a reward model is then constructed to score the output results of the SFT model. The reward model is used to evaluate the quality of the generated text and to guide the subsequent reinforcement learning process, ensuring that the generated replies meet the requirements of the knowledge field and are fluent and natural.
For further illustration, as shown in FIG. 5, a prompts dataset is obtained and input to the SFT model to obtain output results; the model output results are manually ranked by satisfaction, and the ELO system converts the rankings into scalar scores, from which the reward model is obtained.
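A hedged sketch of such a reward model follows: a scalar value head placed on the fine-tuned (SFT) backbone and trained with a pairwise ranking loss derived from the human score rankings; the backbone interface and the last-token pooling are assumptions of the example.
```python
# Sketch of reward-model training on score-ranked reply pairs. The model produces
# a scalar score per reply; the pairwise loss pushes the score of the preferred
# (higher-ranked) reply above that of the rejected one. The backbone interface
# and the pooling choice (last-token state) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone              # the fine-tuned SFT transformer body
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        hidden = self.backbone(input_ids).last_hidden_state    # [B, T, H]
        return self.value_head(hidden[:, -1, :]).squeeze(-1)   # one score per reply

def pairwise_ranking_loss(reward_model: RewardModel,
                          chosen_ids: torch.LongTensor,
                          rejected_ids: torch.LongTensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): derived from the human rankings
    # (e.g. after ELO conversion to scalar scores) over the SFT model's outputs.
    return -F.logsigmoid(reward_model(chosen_ids) - reward_model(rejected_ids)).mean()
```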
Optionally, in step 103, reinforcement learning training is performed on the third large model to obtain a target large model in the target field, which may be implemented as follows:
acquiring test data, inputting the test data into the third large model to obtain target data, and acquiring feedback content aiming at the target data in a feedback interface of an application end;
and updating the parameters of the third large model and the reward model with the feedback content by using a PPO reinforcement learning algorithm, to obtain the target large model.
The application end may include a user end or a law enforcement end.
In the embodiment of the application, test data are acquired, the test data are input into a third large model to obtain target data, feedback content aiming at the target data in a feedback interface of an application end is acquired, a PPO reinforcement learning algorithm is adopted, and the feedback content is utilized to update parameters of the third large model and a reward model to obtain the target large model.
In a specific implementation, during the reinforcement learning training process, the fine-tuned large model is regarded as the agent, the reward model is regarded as the environment, and training is performed with the PPO reinforcement learning algorithm. In this process, the model adjusts its text-generation strategy based on the feedback (reward signal) from law enforcement personnel or users, in order to generate more specialized and efficient domain replies. Through multiple rounds of iteration, the model gradually optimizes its reply strategy and improves its performance on consultation-reply tasks.
For illustration, as shown in FIG. 6, the pre-trained language model and the pre-trained reward model are prepared, the policy is initialized through prompt-distillation learning on an HHH dataset, and reinforcement learning (the PPO algorithm) based on law enforcement personnel and users yields the RLHF policy; feedback content is obtained from the user/law-enforcement feedback interface, a feedback comparison dataset is built from this feedback content, and the reward model is fine-tuned on the feedback comparison dataset.
In the embodiment of the present application, throughout the reinforcement learning process, the language model and the reward model are modeled with the strategy of Reinforcement Learning from Human Feedback (RLHF). The specific steps are as follows: supervised learning is performed with a manually labeled dataset that meets the 3H (Helpful, Honest, Harmless; HHH) requirements; for the reward model, the feedback of users and law enforcement departments collected while using the question-answer consultation application is used to fine-tune the model; in the iterative learning step, a PPO reinforcement learning algorithm is adopted to update the parameters of the pre-trained language model and the reward model, where the update relies on an interactive interface for users or law enforcement departments through which different dimensions of the model are evaluated, including the generation quality of the model, the relevance of the content, the ranking values of multiple generated results, whether the generated results of the model are harmful, and the like.
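The following is a heavily hedged sketch of this PPO stage, written against the classic trl PPOTrainer interface (roughly versions 0.4 to 0.11; signatures differ in later releases); the model path, prompts and the constant stand-in reward are placeholders, and in practice the reward would come from the reward model constructed above.
```python
# Hedged sketch of RLHF with PPO: the fine-tuned model is the agent (policy), the
# reward model plays the environment that scores each generated reply, and PPO
# updates the policy (and its value head) from those rewards. Assumes the classic
# trl <= 0.11 PPOTrainer API; paths, prompts and the reward value are placeholders.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

SFT = "path/to/third-large-model"                       # fine-tuned SFT model
tokenizer = AutoTokenizer.from_pretrained(SFT)
policy = AutoModelForCausalLMWithValueHead.from_pretrained(SFT)      # agent
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(SFT)   # frozen reference

ppo_trainer = PPOTrainer(PPOConfig(batch_size=2, mini_batch_size=1),
                         policy, ref_model, tokenizer)

prompts = ["How should this type of violation be penalized?",
           "What materials are required to file this kind of case?"]  # placeholders

queries = [tokenizer(p, return_tensors="pt").input_ids[0] for p in prompts]
responses, rewards = [], []
for q in queries:
    response = ppo_trainer.generate(q, return_prompt=False, max_new_tokens=128)
    responses.append(response[0])
    # In practice the scalar reward comes from the reward model of the previous
    # stage, scoring query + reply; a constant stands in to keep the sketch runnable.
    rewards.append(torch.tensor(1.0))

stats = ppo_trainer.step(queries, responses, rewards)   # one PPO update of the agent
```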
Optionally, the method further comprises the following steps:
acquiring a target question and identifying a domain label of the target question; and inputting the target question and the domain label into the target large model to obtain a retrieval result for the target question.
In the embodiment of the application, a target question is acquired, the domain label of the target question is identified, and the target question and the domain label are input into the target large model to obtain the retrieval result for the target question.
For example, as shown in FIG. 7, the base model is loaded, and the domain label and the question are input; the incremental weights are then loaded, specifically the ordinary-user-side incremental weights or the law-enforcement-side incremental weights; the knowledge base is loaded, specifically the latest law-enforcement laws and regulations knowledge base; and answers are generated: a user-side answer and a law-enforcement-side answer are generated respectively from the user question and the retrieval results.
In a specific implementation, the method may include the following steps: (1) loading the finally trained pre-trained language model, identifying the domain label from the input question, and inputting the domain label and the question into the large model; (2) loading the incremental weights of the ordinary-user side or the law-enforcement side according to the domain label and the input question; (3) loading the latest laws and regulations knowledge base for retrieval enhancement, so that the user's retrieval has a degree of timeliness; (4) finally, the large model generates the corresponding user-side answer or law-enforcement-side answer from the user question and the retrieval results.
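A hedged sketch of this answering flow is given below, assuming peft-style incremental weights and a retriever object exposing a simple search method; the paths, domain-label values and prompt template are illustrative assumptions.
```python
# Sketch of the answering flow (1)-(4): identify the domain label, load the matching
# incremental weights (ordinary-user side or law-enforcement side), retrieve from the
# latest laws-and-regulations knowledge base, and generate the side-specific answer.
# Paths, label names, the retriever interface and the prompt template are assumptions.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "path/to/target-large-model-base"
ADAPTERS = {"user": "adapters/user_side",
            "law_enforcement": "adapters/law_enforcement_side"}

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE)

def answer(question: str, domain_label: str, retriever) -> str:
    # (2) load the incremental weights that match the identified domain label
    model = PeftModel.from_pretrained(base_model, ADAPTERS[domain_label])
    # (3) retrieval over the regulation knowledge base keeps the answer up to date
    passages = retriever.search(question, top_k=3)       # assumed retriever interface
    prompt = ("Relevant regulations:\n" + "\n".join(passages)
              + f"\nQuestion: {question}\nAnswer:")
    inputs = tokenizer(prompt, return_tensors="pt")
    # (4) the target large model generates the side-specific answer
    output = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0][inputs.input_ids.shape[1]:],
                            skip_special_tokens=True)
```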
An embodiment of the present application may include the following four parts. 1. First, pre-training and base construction of the large model: a preset generative large model is pre-trained on unstructured domain text data and general Chinese data. This pre-training process uses a large-scale dataset so that the model can learn general language rules and context associations; the pre-trained model becomes the base and provides the foundation for subsequent fine-tuning and reinforcement learning. 2. Then, fine-tuning and low-rank matrix incremental weight training: for each subdivision field, a carefully prepared instruction dataset is constructed, and low-rank matrix incremental weight training is performed on the pre-trained large model with these field-specific data. This step aims to let the large model adapt to and learn the knowledge and expressions of the specific field, realizing fine-tuning of the model and making it more specialized. 3. Next, construction of the reward model: a score-ranking dataset is pre-constructed and used to quantify the quality and professionalism of the text replies generated by the model, and a reward model is constructed based on this score-ranking dataset and the fine-tuned large model. The reward model is used to evaluate the quality of the generated text and to guide the subsequent reinforcement learning process, ensuring that the generated replies meet the requirements of the knowledge field and are fluent and natural. 4. Finally, the reinforcement learning training process: the fine-tuned large model is taken as the agent, the reward model is taken as the environment, and training is performed with a reinforcement learning algorithm. In this process, the model adjusts its text-generation strategy based on the feedback (reward signal) from law enforcement personnel or users, in order to generate more specialized and efficient domain replies. Through multiple rounds of iteration, the model gradually optimizes its reply strategy and improves its performance on consultation-reply tasks.
Therefore, the trained generative large model has high professionalism, accuracy and adaptive capability in the field, and provides advanced technical support for consultation services.
It can be seen that, in the data processing method described in the embodiments of the present application, a related document set of the target field is acquired, a first large model is pre-trained on the related document set to obtain a second large model, an instruction dataset pre-constructed for each subdivision field of the target field is acquired, low-rank matrix incremental weight training is performed on the second large model with the instruction dataset to obtain a fine-tuned third large model corresponding to the subdivision field, a reward model is constructed according to a pre-constructed score-ranking dataset and the third large model, and, with the third large model as the agent and the reward model as the environment, reinforcement learning training is performed on the third large model to obtain a target large model for the target field. First, a preset generative large model is pre-trained on large-scale unstructured and structured domain text data and general Chinese data, which forms the basis for language learning and becomes the starting point of the subsequent work. Second, a carefully prepared instruction dataset is constructed for each subdivision field, and low-rank matrix incremental weight training is performed on the pre-trained large model with these field-specific data, so that the model acquires field-specific knowledge and professional terminology. Third, a reward model is introduced which, based on the pre-constructed score-ranking dataset and the fine-tuned large model, accurately evaluates the quality of the generated text replies; the feedback of the reward model becomes the basis of reinforcement learning training. The fine-tuned large model is regarded as the agent and, under the guidance of the reward model, continuously optimizes its own reply strategy through repeated interaction with the environment, learning and adjustment. In this way, the large model can generate professional and fluent domain replies, provide efficient and accurate technical support for consultation services, achieve an important breakthrough in the field of unstructured text data processing, and improve the response speed and accuracy of an intelligent consultation system in the professional field.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application, as shown in the drawing, the electronic device includes a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and in the embodiment of the present application, the programs include instructions for executing the following steps:
acquiring a related document set in the target field, and pre-training a first large model through the related document set to obtain a second large model;
acquiring an instruction data set which is pre-constructed for each subdivision field of the target field, and performing low-rank matrix incremental weight training on the second large model through the instruction data set to obtain a fine-tuned third large model corresponding to the subdivision field;
and constructing a reward model according to a pre-constructed score-ranking dataset and the third large model, taking the third large model as the agent and the reward model as the environment, and performing reinforcement learning training on the third large model to obtain a target large model for the target field.
Optionally, in the pre-training the first large model by the related document set to obtain the second large model, the program includes instructions for:
processing the related document set to obtain text data, table data and map data;
and pre-training the first large model through the text data, the table data and the map data to obtain the second large model.
Optionally, the above program further comprises instructions for performing the steps of:
freezing the second large model during fine-tuning, and learning the knowledge of a specific downstream task through an Adapter module; the Adapter module comprises two feed-forward layers and an intermediate layer, the two feed-forward layers being a first feed-forward layer and a second feed-forward layer, wherein the first feed-forward layer projects the representation down to the intermediate (bottleneck) dimension and the second feed-forward layer projects it back up to the original dimension.
Optionally, in the constructing a reward model according to a pre-constructed score-ranking dataset and the third large model, the above program includes instructions for performing the steps of:
inputting the pre-constructed score-ranking dataset into the third large model to obtain question-answer pairs;
determining a score-ranked set according to the question-answer pairs;
and constructing the reward model according to the score-ranked set, wherein the reward model is used for evaluating the quality of the generated text.
Optionally, in the performing reinforcement learning training on the third large model to obtain a target large model of the target field, the program includes instructions for performing the following steps:
acquiring test data, inputting the test data into the third large model to obtain target data, and acquiring feedback content aiming at the target data in a feedback interface of an application end;
and updating the parameters of the third large model and the reward model with the feedback content by using a PPO reinforcement learning algorithm, to obtain the target large model.
Optionally, the above program further comprises instructions for performing the steps of:
acquiring a target question and identifying a domain label of the target question;
and inputting the target question and the domain label into the target large model to obtain a retrieval result for the target question.
It can be seen that, in the electronic device described in the embodiments of the present application, a related document set of the target field is acquired, a first large model is pre-trained on the related document set to obtain a second large model, an instruction dataset pre-constructed for each subdivision field of the target field is acquired, low-rank matrix incremental weight training is performed on the second large model with the instruction dataset to obtain a fine-tuned third large model corresponding to the subdivision field, a reward model is constructed according to a pre-constructed score-ranking dataset and the third large model, and, with the third large model as the agent and the reward model as the environment, reinforcement learning training is performed on the third large model to obtain a target large model for the target field. First, a preset generative large model is pre-trained on large-scale unstructured and structured domain text data and general Chinese data, which forms the basis for language learning and becomes the starting point of the subsequent work. Second, a carefully prepared instruction dataset is constructed for each subdivision field, and low-rank matrix incremental weight training is performed on the pre-trained large model with these field-specific data, so that the model acquires field-specific knowledge and professional terminology. Third, a reward model is introduced which, based on the pre-constructed score-ranking dataset and the fine-tuned large model, accurately evaluates the quality of the generated text replies; the feedback of the reward model becomes the basis of reinforcement learning training. The fine-tuned large model is regarded as the agent and, under the guidance of the reward model, continuously optimizes its own reply strategy through repeated interaction with the environment, learning and adjustment. In this way, the large model can generate professional and fluent domain replies, provide efficient and accurate technical support for consultation services, achieve an important breakthrough in the field of unstructured text data processing, and improve the response speed and accuracy of an intelligent consultation system in the professional field.
Fig. 9 is a block diagram of functional units of a data processing apparatus 900 according to an embodiment of the present application, where the data processing apparatus 900 may include: a first acquisition unit 901, a second acquisition unit 902, and a reinforcement unit 903, wherein,
the first obtaining unit 901 is configured to obtain a related document set in the target field, and pretrain the first large model through the related document set to obtain a second large model;
the second obtaining unit 902 is configured to obtain an instruction data set pre-configured for each subdivision region of the target region, and perform low-rank matrix incremental weight training on the second large model through the instruction data set to obtain a fine-tuned third large model corresponding to the subdivision region;
the reinforcement unit 903 is configured to construct a reward model according to a pre-constructed score ranking dataset and the third large model, take the third large model as an agent, take the reward model as an environment, and perform reinforcement learning training on the third large model to obtain a target large model in the target field.
Optionally, in the aspect that the first large model is pre-trained through the related document set to obtain a second large model, the first obtaining unit 901 is configured to:
processing the related document set to obtain text data, table data and map data;
and pre-training the first large model through the text data, the table data and the map data to obtain the second large model.
Optionally, the data processing apparatus 900 is further specifically configured to:
freezing the second large model during fine-tuning, and learning the knowledge of a specific downstream task through an Adapter module; the Adapter module comprises two feed-forward layers and an intermediate layer, the two feed-forward layers being a first feed-forward layer and a second feed-forward layer, wherein the first feed-forward layer projects the representation down to the intermediate (bottleneck) dimension and the second feed-forward layer projects it back up to the original dimension.
Optionally, in the aspect of constructing a reward model according to a pre-constructed score-ranking dataset and the third large model, the reinforcement unit 903 is specifically configured to:
inputting the pre-constructed score-ranking dataset into the third large model to obtain question-answer pairs;
determining a score-ranked set according to the question-answer pairs;
and constructing the reward model according to the score-ranked set, wherein the reward model is used for evaluating the quality of the generated text.
Optionally, in the performing reinforcement learning training on the third large model to obtain a target large model in the target field, the reinforcement unit 903 is specifically configured to:
acquiring test data, inputting the test data into the third large model to obtain target data, and acquiring feedback content aiming at the target data in a feedback interface of an application end;
and updating the parameters of the third large model and the reward model with the feedback content by using a PPO reinforcement learning algorithm, to obtain the target large model.
Optionally, the data processing apparatus 900 is further specifically configured to:
acquiring a target question and identifying a domain label of the target question;
and inputting the target question and the domain label into the target large model to obtain a retrieval result for the target question.
It can be seen that, the data processing apparatus described in the embodiments of the present application acquires a related document set of the target field, pre-trains a first large model on the related document set to obtain a second large model, acquires an instruction dataset pre-constructed for each subdivision field of the target field, performs low-rank matrix incremental weight training on the second large model with the instruction dataset to obtain a fine-tuned third large model corresponding to the subdivision field, constructs a reward model according to a pre-constructed score-ranking dataset and the third large model, and, with the third large model as the agent and the reward model as the environment, performs reinforcement learning training on the third large model to obtain a target large model for the target field. First, a preset generative large model is pre-trained on large-scale unstructured and structured domain text data and general Chinese data, which forms the basis for language learning and becomes the starting point of the subsequent work. Second, a carefully prepared instruction dataset is constructed for each subdivision field, and low-rank matrix incremental weight training is performed on the pre-trained large model with these field-specific data, so that the model acquires field-specific knowledge and professional terminology. Third, a reward model is introduced which, based on the pre-constructed score-ranking dataset and the fine-tuned large model, accurately evaluates the quality of the generated text replies; the feedback of the reward model becomes the basis of reinforcement learning training. The fine-tuned large model is regarded as the agent and, under the guidance of the reward model, continuously optimizes its own reply strategy through repeated interaction with the environment, learning and adjustment. In this way, the large model can generate professional and fluent domain replies, provide efficient and accurate technical support for consultation services, achieve an important breakthrough in the field of unstructured text data processing, and improve the response speed and accuracy of an intelligent consultation system in the professional field.
It may be understood that the functions of each program module of the data processing apparatus of the present embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not repeated herein.
The embodiment of the application also provides a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, where the computer program causes a computer to execute part or all of the steps of any one of the methods described in the embodiments of the method, where the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the methods described in the method embodiments above. The computer program product may be a software installation package, said computer comprising an electronic device.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, such as the above-described division of units, merely a division of logic functions, and there may be additional manners of dividing in actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs associated hardware, and the program may be stored in a computer-readable memory, which may include a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, a person skilled in the art may make modifications to the specific implementations and the application scope according to the ideas of the present application. In view of the above, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A method of data processing, the method comprising:
acquiring a related document set in the target field, and pre-training a first large model through the related document set to obtain a second large model;
acquiring an instruction data set which is pre-constructed for each subdivision field of the target field, and performing low-rank matrix incremental weight training on the second large model through the instruction data set to obtain a fine-tuned third large model corresponding to the subdivision field;
and constructing a reward model according to a pre-constructed scoring and ranking data set and the third large model, taking the third large model as an agent, taking the reward model as an environment, and performing reinforcement learning training on the third large model to obtain a target large model in the target field.
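Claim 1 frames the fine-tuned model as an agent and the reward model as its environment. The skeleton below is a deliberately simplified illustration of that interaction loop, with toy callables standing in for both models; the function names and the random placeholder reward are assumptions for illustration only, not the claimed implementation.

```python
import random
random.seed(0)

# Stand-ins for the fine-tuned (third) large model and the reward model; in the
# claimed method these would be neural networks, here they are toy callables.
def agent_generate(question: str) -> str:
    return f"draft reply to: {question}"

def reward_model_score(question: str, reply: str) -> float:
    return random.uniform(-1.0, 1.0)   # placeholder for the learned reward

def policy_update(question: str, reply: str, reward: float) -> None:
    # a real implementation would back-propagate a PPO-style loss here (see claim 5)
    print(f"update with reward {reward:+.2f} for question: {question!r}")

for question in ["How to file a complaint?", "What documents are required?"]:
    reply = agent_generate(question)              # the agent acts
    reward = reward_model_score(question, reply)  # the environment returns feedback
    policy_update(question, reply, reward)        # the agent adjusts its reply strategy
```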
2. The method of claim 1, wherein the pre-training the first large model through the set of related documents to obtain a second large model comprises:
processing the related document set to obtain text data, table data and map data;
and pre-training the first large model through the text data, the table data and the map data to obtain the second large model.
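As one possible reading of claim 2, the text data, table data and map data can all be linearized into plain sequences before the causal-LM pre-training step. The helper functions below are hypothetical illustrations, not part of the claimed method.

```python
def table_to_text(table: dict) -> str:
    """Flatten a table (header -> column values) into a plain-text sequence."""
    rows = zip(*table.values())
    lines = [", ".join(f"{h}: {v}" for h, v in zip(table.keys(), row)) for row in rows]
    return "\n".join(lines)

def graph_to_text(triples: list) -> str:
    """Flatten knowledge-graph triples (head, relation, tail) into sentences."""
    return "\n".join(f"{h} {r} {t}." for h, r, t in triples)

corpus = []
corpus.append("Domain regulation text ...")                                   # text data
corpus.append(table_to_text({"indicator": ["A", "B"], "value": [1, 2]}))      # table data
corpus.append(graph_to_text([("entity_1", "is_subclass_of", "entity_2")]))    # map data
# `corpus` would then be tokenized and fed to the pre-training objective.
```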
3. The method according to claim 1 or 2, characterized in that the method further comprises:
freezing the second large model during fine tuning, and learning knowledge of a specific downstream task through an Adapter module; the Adapter module comprises two feedforward layers and an intermediate layer, wherein the two feedforward layers are a first feedforward layer and a second feedforward layer respectively, the first feedforward layer and the intermediate layer play a role in reducing dimension, and the second feedforward layer and the intermediate layer play a role in lifting dimension.
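A minimal sketch of an Adapter module of the kind described in claim 3, assuming the standard bottleneck design (a down-projecting feed-forward layer, a non-linear intermediate layer, an up-projecting feed-forward layer, and a residual connection); the sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: only these parameters are trained; the frozen
    second large model provides the hidden states."""
    def __init__(self, hidden_size: int = 4096, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # first feed-forward layer: reduce dimension
        self.act = nn.GELU()                            # intermediate layer
        self.up = nn.Linear(bottleneck, hidden_size)    # second feed-forward layer: lift dimension

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter()
print(adapter(torch.randn(2, 10, 4096)).shape)  # torch.Size([2, 10, 4096])
```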
4. The method according to claim 1 or 2, wherein the constructing a reward model according to a pre-constructed scoring and ranking data set and the third large model comprises:
inputting the pre-constructed scoring and ranking data set into the third large model to obtain question-answer pairs;
determining a scoring ordered set according to the question-answer pairs;
and constructing the reward model according to the scoring ordered set, wherein the reward model is used for evaluating the quality of the generated text.
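One common way to train a reward model from a scoring ordered set of question-answer pairs is a pairwise ranking loss; the sketch below assumes that interpretation and uses illustrative tensors in place of real model outputs.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_chosen: torch.Tensor,
                          reward_rejected: torch.Tensor) -> torch.Tensor:
    """Train the reward model so that the higher-ranked answer in each
    question-answer pair receives the larger scalar reward."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# rewards produced by a scalar head on top of the fine-tuned model, one per answer
r_good = torch.tensor([1.8, 0.7])   # answers ranked higher in the ordered set
r_bad = torch.tensor([0.2, -0.5])   # answers ranked lower
print(pairwise_ranking_loss(r_good, r_bad))
```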
5. The method according to claim 1 or 2, wherein the performing reinforcement learning training on the third large model to obtain a target large model of the target field includes:
acquiring test data, inputting the test data into the third large model to obtain target data, and acquiring feedback content for the target data from a feedback interface of an application end;
and updating parameters of the third large model and the reward model with the feedback content by using a PPO reinforcement learning algorithm, to obtain the target large model.
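Claim 5 names the PPO reinforcement learning algorithm. A minimal sketch of its clipped surrogate objective follows, with the advantages assumed to be derived from the reward model's scores (or user feedback) on generated replies; function and variable names are illustrative assumptions.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantage: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate: replies judged better become more likely, while the
    probability ratio between new and old policy stays inside [1 - eps, 1 + eps]."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# toy log-probabilities and advantages in place of real rollout statistics
loss = ppo_clip_loss(torch.tensor([-1.1, -0.9]), torch.tensor([-1.0, -1.0]),
                     torch.tensor([0.5, -0.3]))
print(loss)
```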
6. The method according to claim 1 or 2, characterized in that the method further comprises:
acquiring a target question and identifying a domain label of the target question;
and inputting the target question and the domain label into the target large model to obtain a retrieval result for the target question.
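The domain-label step of claim 6 can be pictured as a lightweight routing stage before generation. The keyword classifier below is a deliberately simplified placeholder for whatever label-identification model the system actually uses; the labels, keywords and prompt format are assumptions for illustration only.

```python
DOMAIN_KEYWORDS = {
    "tax_policy": ["tax", "invoice"],
    "social_security": ["pension", "insurance"],
}

def identify_domain_label(question: str) -> str:
    """Toy keyword classifier standing in for the domain-label identification step."""
    for label, keywords in DOMAIN_KEYWORDS.items():
        if any(k in question.lower() for k in keywords):
            return label
    return "general"

def build_prompt(question: str) -> str:
    label = identify_domain_label(question)
    # the domain label and the question are passed to the target large model together
    return f"[domain: {label}]\nQuestion: {question}\nAnswer:"

print(build_prompt("How do I reissue a lost invoice?"))
```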
7. A data processing apparatus, the apparatus comprising: a first obtaining unit, a second obtaining unit and a reinforcement unit, wherein,
the first obtaining unit is used for acquiring a related document set in the target field, and pre-training the first large model through the related document set to obtain a second large model;
the second obtaining unit is configured to obtain an instruction data set pre-constructed for each subdivision field of the target field, and perform low-rank matrix incremental weight training on the second large model through the instruction data set to obtain a fine-tuned third large model corresponding to the subdivision field;
the reinforcement unit is used for constructing a reward model according to a pre-constructed scoring and ranking data set and the third large model, taking the third large model as an agent, taking the reward model as an environment, and performing reinforcement learning training on the third large model to obtain a target large model of the target field.
8. The apparatus of claim 7, wherein, in terms of pre-training the first large model through the related document set to obtain the second large model, the first obtaining unit is specifically configured to:
processing the related document set to obtain text data, table data and map data;
and pre-training the first large model through the text data, the table data and the map data to obtain the second large model.
9. An electronic device, comprising a processor and a memory, the memory storing one or more programs configured to be executed by the processor, the one or more programs comprising instructions for performing the steps in the method of any one of claims 1-6.
10. A computer-readable storage medium, characterized in that a computer program for electronic data exchange is stored thereon, wherein the computer program causes a computer to perform the method according to any one of claims 1-6.
CN202311847859.1A 2023-12-28 2023-12-28 Data processing method and related device Pending CN117828049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311847859.1A CN117828049A (en) 2023-12-28 2023-12-28 Data processing method and related device

Publications (1)

Publication Number Publication Date
CN117828049A 2024-04-05

Family

ID=90523823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311847859.1A Pending CN117828049A (en) 2023-12-28 2023-12-28 Data processing method and related device

Country Status (1)

Country Link
CN (1) CN117828049A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination