CN117291245A - Model training method, device, computer equipment and storage medium - Google Patents

Model training method, device, computer equipment and storage medium

Info

Publication number
CN117291245A
CN117291245A (application CN202311244963.1A)
Authority
CN
China
Prior art keywords
model
language
data set
parameters
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311244963.1A
Other languages
Chinese (zh)
Inventor
陈孝良
李良斌
常乐
黄赟贺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202311244963.1A priority Critical patent/CN117291245A/en
Publication of CN117291245A publication Critical patent/CN117291245A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a model training method, a model training apparatus, a computer device, and a storage medium, belonging to the field of computer technology. The method comprises the following steps: pre-training a large language model based on a first data set; determining a second data set based on the first data set and a model task; determining a parameter constraint condition based on the weights of the parameters in the large language model, where a weight represents how important the corresponding parameter is to the large language model's learning of language in the pre-training stage, and the parameter constraint condition is used to constrain the amount of change of the parameters; and adjusting the parameters of the pre-trained large language model based on the second data set and the parameter constraint condition. The method can effectively suppress catastrophic forgetting of the large language model during parameter adjustment, so that the model achieves excellent performance on the task while retaining the rich knowledge learned in the pre-training stage.

Description

Model training method, device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a model training method, a model training device, a computer device, and a storage medium.
Background
In the model training process, pre-training and fine-tuning of a neural network model are two core steps. Specifically, a model is typically first pre-trained on a large-scale general-purpose data set and then fine-tuned on a task-specific data set so that it better adapts to new tasks and scenarios. This training approach, however, faces an important challenge: catastrophic forgetting (Catastrophic Forgetting). That is, when the model begins to learn a new task, it may forget the knowledge originally learned during the pre-training stage. For example, a model pre-trained on a multilingual data set may be able to recognize and understand various languages; however, after the model is fine-tuned on a Chinese data set, its understanding of other languages may degrade significantly. How to suppress catastrophic forgetting of the model during training is therefore a focus of research.
Disclosure of Invention
The embodiments of the present application provide a model training method, apparatus, computer device, and storage medium, which can effectively suppress catastrophic forgetting of a large language model during parameter adjustment, so that the model achieves excellent performance on the task while retaining the rich knowledge learned in the pre-training stage. The technical solution is as follows:
In one aspect, a model training method is provided, the method comprising:
pre-training a large language model based on a first dataset comprising language text in a plurality of languages;
determining a second data set based on the first data set and a model task, the second data set including a portion of language text in the first data set and language text associated with the model task;
determining parameter constraint conditions based on the weights of the parameters in the large language model, wherein the weights are used for representing the importance degree of the corresponding parameters on the large language model learning language in a pre-training stage, and the parameter constraint conditions are used for constraining the variable quantity of the parameters;
and adjusting parameters of the pre-trained large language model based on the second data set and the parameter constraint condition, wherein the large language model with the parameters adjusted is used for executing the model task through any one of the languages.
In another aspect, there is provided a model training apparatus, the apparatus comprising:
the first training module is used for pre-training the large language model based on a first data set, wherein the first data set comprises language texts of a plurality of languages;
A first determining module configured to determine, based on the first data set and a model task, a second data set including a portion of language text in the first data set and language text related to the model task;
the second determining module is used for determining parameter constraint conditions based on the weights of the parameters in the large language model, wherein the weights are used for representing the importance degree of the corresponding parameters to the large language model learning language in the pre-training stage, and the parameter constraint conditions are used for constraining the variation of the parameters;
and the second training module is used for adjusting parameters of the large language model after pre-training based on the second data set and the parameter constraint condition, and the large language model after parameter adjustment is used for executing the model task through any one of the languages.
In some embodiments, the first determining module is configured to obtain, for any language in the first data set, a language text belonging to the language; the language texts of the plurality of languages are used as a first text subset; acquiring a second text subset based on the model task; and taking the collection of the first text subset and the second text subset as the second data set.
In some embodiments, the first determining module includes:
a first determining unit, configured to determine, for any language text in the first data set, a subject of the language text based on semantics of the language text;
the first acquisition unit is used for acquiring language texts of which the topics are related to the model tasks from the first data set to obtain a third text subset;
the second acquisition unit is used for acquiring a second text subset based on the model task;
and the merging unit is used for taking the second text subset and the third text subset as the second data set.
In some embodiments, the first obtaining unit is configured to determine, for any language text in the first data set, a topic similarity based on a topic of the language text and the model task, where the topic similarity is used to represent a degree of correlation between the topic of the language text and the model task; and obtaining language texts with the topic similarity reaching a similarity threshold value from the first data set, and obtaining the third text subset.
In some embodiments, the second determining module includes:
The third acquisition unit is used for acquiring the weight of the parameter in the large language model;
the processing unit is used for multiplying any parameter of the large language model by the weight of the parameter to obtain a constraint item of the parameter;
and the second determining unit is used for determining the parameter constraint condition based on constraint terms of a plurality of parameters in the large language model, wherein the parameter constraint condition is a part of loss of the large language model.
In some embodiments, the second determining unit is configured to filter out a plurality of parameters with weights reaching a weight threshold from all parameters in the large language model; and summing the constraint terms of the parameters to obtain a parameter constraint condition.
In some embodiments, the second determining module is configured to determine, for any parameter of the large language model, a learning rate of the parameter based on a weight of the parameter, the learning rate of the parameter being inversely related to the weight of the parameter; and determining parameter constraint conditions based on the learning rates of a plurality of parameters in the large language model.
In some embodiments, the second training module is further configured to adjust the parameters of the large language model by at least one of a weight decay strategy and Dropout.
In another aspect, a computer device is provided that includes a processor and a memory for storing at least one segment of a computer program that is loaded and executed by the processor to implement a model training method in an embodiment of the present application.
In another aspect, a computer readable storage medium having stored therein at least one segment of a computer program that is loaded and executed by a processor to implement a model training method as in embodiments of the present application is provided.
In another aspect, a computer program product is provided, comprising a computer program stored in a computer readable storage medium, the computer program being read from the computer readable storage medium by a processor of a computer device, the computer program being executed by the processor to cause the computer device to perform the model training method provided in each of the above aspects or in various alternative implementations of each of the aspects.
After the large language model is pre-trained, part of the language text is obtained from the first data set used for pre-training and combined with language text related to the model task to form a second data set, so that the language text used for pre-training is learned again when the parameters of the large language model are subsequently adjusted for the model task; in other words, the large language model can review and reuse the experience of the pre-training stage. In addition, a parameter constraint condition is determined according to how important each parameter in the large language model was to learning language in the pre-training stage, so that the amount by which the parameters change can be constrained when the parameters are subsequently adjusted for the model task, preventing the large language model from excessively modifying the knowledge learned in the pre-training stage. Together, these two measures effectively suppress catastrophic forgetting of the large language model during parameter adjustment, so that the model achieves excellent performance on the task while retaining the rich knowledge learned in the pre-training stage.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an implementation environment of a model training method according to an embodiment of the present application;
FIG. 2 is a flow chart of a model training method provided in accordance with an embodiment of the present application;
FIG. 3 is a flow chart of another model training method provided in accordance with an embodiment of the present application;
FIG. 4 is a block diagram of a model training apparatus provided in accordance with an embodiment of the present application;
FIG. 5 is a block diagram of another model training apparatus provided in accordance with an embodiment of the present application;
fig. 6 is a block diagram of a terminal according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used to distinguish between identical or similar items that have substantially the same function and function, and it should be understood that there is no logical or chronological dependency between the "first," "second," and "nth" terms, nor is it limited to the number or order of execution.
The term "at least one" in this application means one or more, and the meaning of "a plurality of" means two or more.
It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals referred to in this application are all authorized by the users or fully authorized by all parties, and the collection, use, and processing of the relevant data comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the language text referred to in this application is obtained with sufficient authorization.
The model training method provided by the embodiment of the application can be executed by computer equipment. In some embodiments, the computer device is a terminal or a server. In the following, taking a computer device as an example, an implementation environment of a model training method provided in an embodiment of the present application will be described, and fig. 1 is a schematic diagram of an implementation environment of a model training method provided in an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102. The terminal 101 and the server 102 can be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
In some embodiments, terminal 101 is, but is not limited to, a smart phone, tablet, notebook, desktop, smart speaker, smart watch, smart voice-interactive device, smart home appliance, vehicle-mounted terminal, etc. The terminal 101 runs an application program capable of acquiring language text. The application may be a communication-type application, a conference-type application, a question-answer-type application, or a document-reading-type application. Illustratively, the terminal 101 is a terminal used by a user. The terminal 101 may obtain language text entered by the user. The terminal 101 may then send the language text to the server 102, and the server 102 trains a large language model based on the obtained language text.
Those skilled in the art will recognize that there may be more or fewer terminals. For example, there may be only one terminal, or there may be tens or hundreds of terminals, or more. The number of terminals and the device types are not limited in the embodiments of the present application.
In some embodiments, the server 102 is a stand-alone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data, and artificial intelligence platforms. The server 102 may construct a data set from language text acquired from the plurality of terminals 101. The server 102 then trains the large language model on the data set. In some embodiments, the server 102 takes on the primary computing work and the terminal 101 takes on the secondary computing work; alternatively, the server 102 takes on the secondary computing work and the terminal 101 takes on the primary computing work; alternatively, a distributed computing architecture is used for collaborative computing between the server 102 and the terminal 101.
Fig. 2 is a flowchart of a model training method provided according to an embodiment of the present application, and referring to fig. 2, a server is used as an example in the embodiment of the present application. The model training method comprises the following steps:
201. the server pre-trains the large language model based on a first data set comprising language text in a plurality of languages.
In this embodiment of the present application, for any language text in the first data set, the language to which the language text belongs may be Chinese, English, Japanese, or Italian, which is not limited in this embodiment of the present application. The large language model (Large Language Model, LLM) may be GPT-3 (Generative Pre-trained Transformer 3), PaLM (Pathways Language Model), or LLaMA (Large Language Model Meta AI), to which the embodiments of the present application are not limited. The server pre-trains the large language model on the first data set. In the pre-training stage, the large language model learns language knowledge from language texts in multiple languages. The language knowledge may be vocabulary, grammar, or context information of the language text, etc., which is not limited by the embodiments of the present application.
202. The server determines a second data set based on the first data set and the model task, the second data set including a portion of the language text in the first data set and the language text associated with the model task.
In the embodiment of the present application, the server obtains the partial language text from the first data set, and the embodiment of the present application does not limit the manner of obtaining the partial language text. The server obtains language text related to the model task based on the model task. The server may then aggregate the partial language text obtained from the first dataset and the language text obtained in connection with the model task into a second dataset. Accordingly, the second data set includes data used by the pre-training stage and data related to the model task. The embodiment of the application does not limit the model task.
203. The server determines a parameter constraint condition based on the weights of the parameters in the large language model, where a weight represents how important the corresponding parameter is to the large language model's learning of language in the pre-training stage, and the parameter constraint condition is used to constrain the amount of change of the parameters.
In the embodiment of the application, for any parameter in the large language model, the server can determine the weight of the parameter according to the importance degree of the parameter to the large language model learning language in the pre-training stage. And then, the server determines parameter constraint conditions corresponding to the parameters according to the weights of the parameters. The parameter constraint of the parameter can constrain the amount of change of the parameter in step 204 to avoid excessive amounts of change of the parameter.
204. The server adjusts parameters of the pre-trained large language model based on the second data set and the parameter constraint condition, and the large language model after parameter adjustment is used for executing model tasks through any one of a plurality of languages.
In the embodiment of the application, the server adjusts the parameters of the pre-trained large language model based on the second data set, so that the large language model can learn the knowledge of the model task and review the knowledge learned in the pre-training stage. On the basis, the server can also adjust the parameters of the pre-trained large language model based on parameter constraint conditions so as to limit the excessive variation of the parameters and prevent the large language model from excessively modifying the knowledge learned in the pre-training stage. For any of a plurality of languages, the server may perform model tasks through the language text of that language. That is, the server may perform model tasks through the large language model based on the large language model understanding language text in a certain language. The large language model with the parameters adjusted supports model tasks of multiple languages. For example, large language models support tax question-answering tasks in multiple languages.
After the large language model is pre-trained, part of the language text is obtained from the first data set used for pre-training and combined with language text related to the model task to form a second data set, so that the language text used for pre-training is learned again when the parameters of the large language model are subsequently adjusted for the model task; in other words, the large language model can review and reuse the experience of the pre-training stage. In addition, a parameter constraint condition is determined according to how important each parameter in the large language model was to learning language in the pre-training stage, so that the amount by which the parameters change can be constrained when the parameters are subsequently adjusted for the model task, preventing the large language model from excessively modifying the knowledge learned in the pre-training stage. Together, these two measures effectively suppress catastrophic forgetting of the large language model during parameter adjustment, so that the model achieves excellent performance on the task while retaining the rich knowledge learned in the pre-training stage.
Fig. 3 is a flowchart of another model training method provided according to an embodiment of the present application, and referring to fig. 3, an example of the model training method is described in the embodiment of the present application. The model training method comprises the following steps:
301. The server pre-trains the large language model based on a first data set comprising language text in a plurality of languages.
In the embodiment of the present application, the first data set may be a large-scale multilingual data set including language texts in multiple languages, and the embodiment of the present application does not limit the number of languages and language texts in each language. For any language text in the first data set, the language text may be a general language in life, or may be a technical term in a certain field, which is not limited in the embodiment of the present application. The server may obtain the first data set from a local storage, or may also obtain the first data set from another computer device, where the manner of obtaining the first data set in the embodiment of the present application is not limited. Then, the server pre-trains the large language model through the first data set so that the large language model learns language knowledge of multiple languages.
The pre-training mode may be self-supervised learning, semi-supervised learning, contrastive learning, or the like, which is not limited in the embodiments of the present application. The large language model may be a model based on the Transformer structure, which is not limited by the embodiments of the present application. The language knowledge learned by the large language model in the pre-training stage includes at least one of vocabulary, grammar, or context information of the language text.
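As a concrete illustration of this pre-training step, the following is a minimal sketch of self-supervised (causal language modeling) pre-training on multilingual text. It assumes the Hugging Face transformers library; the "gpt2" checkpoint, the tiny in-memory first_dataset, and the hyperparameters are illustrative placeholders rather than anything specified by the patent.

```python
# Minimal pre-training sketch (assumption: causal-LM self-supervised objective with Hugging Face transformers).
# The model name, optimizer settings, and first_dataset below are illustrative, not from this application.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # stand-in for the large language model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# First data set: language text in multiple languages.
first_dataset = ["这是一个中文句子。", "This is an English sentence.", "これは日本語の文です。"]

model.train()
for epoch in range(1):
    for texts in DataLoader(first_dataset, batch_size=1, shuffle=True):
        batch = tokenizer(list(texts), return_tensors="pt", truncation=True)
        # Self-supervised objective: predict each token from its left context.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```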
302. The server determines a second data set based on the first data set and the model task, the second data set including a portion of the language text in the first data set and the language text associated with the model task.
In the embodiment of the application, the server may randomly acquire a part of language text from the first data set; alternatively, the server may obtain the portion of the language text from the first dataset based on some rule, which is not limited in this embodiment of the present application. The server determines language text associated with the model task based on the model task. The model tasks may be text classification, emotion analysis, question-answer tasks, intent recognition, or machine translation, among other similar tasks. The language text related to the model task may be language text of multiple languages, or may be language text of a single language, which is not limited in the embodiment of the present application. Where "single language" may be a certain language in the first data set, embodiments of the present application are not limited thereto. The server constructs a second dataset from the partial language text obtained from the first dataset and the language text associated with the model task.
Where the model task is a text classification, the language text of the model task may include language text of multiple topics to enable a subsequent large language model to understand the topics of the language text entered. The plurality of topics may include financial, sports, social, entertainment, and the like topics. Accordingly, the server obtains language text for a plurality of topics. The server then constructs a second data set based on the first data set and the language text of the plurality of topics. The server then adjusts parameters of the pre-trained large language model based on the second dataset. The large language model with the parameters adjusted can identify the subject of the language text. The embodiment of the application does not limit the number of language texts of each theme.
In the case where the model task is emotion analysis, the language text of the model task may include language text of multiple emotions to enable a subsequent large language model to understand the emotion of the input language text. The various emotions may include categories of positive, negative, neutral, and the like. Accordingly, the server obtains language texts of various emotions. The server then constructs a second dataset based on the first dataset and the language text of the plurality of emotions. The server then adjusts parameters of the pre-trained large language model based on the second dataset. The large language model with the parameters adjusted can identify emotion reflected by the language text. The embodiment of the application does not limit the number of language texts of each emotion.
In the case where the model task is a question-answer task, the language text of the model task may include a plurality of dialog texts to enable a subsequent large language model to understand the entered language text and answer. Each dialog text includes text belonging to a "question" and text belonging to an "answer". Accordingly, the server obtains a plurality of dialog texts. Each dialog text includes a question text and an answer text. The server then constructs a second data set based on the first data set and the plurality of dialog texts. The server then adjusts parameters of the pre-trained large language model based on the second dataset. The large language model with the parameters adjusted can answer the language text. The embodiment of the application does not limit the language to which each dialogue text belongs.
For example, the first data set includes language text in multiple languages, such as Chinese, English, Japanese, and Italian. Accordingly, the large language model can learn language knowledge of multiple languages such as Chinese, English, Japanese, and Italian based on the first data set. The model task is tax question answering. The language text related to tax question answering includes Chinese language text. The server obtains a portion of the language text from the first data set. The partial language text may include language text in multiple languages or may be language text in a single language. The "single language" may be Chinese, i.e., the language text related to the model task belongs to the same language.
In the case where the model task is intent recognition, the language text of the model task may include language text of multiple intents, so that the subsequent large language model can understand the intent of the user from the entered language text. The various intents may be weather queries, song searches, casual chit-chat, and so on. Accordingly, the server obtains language text of multiple intents. The server then constructs the second data set based on the first data set and the language text of the plurality of intents. The server then adjusts the parameters of the pre-trained large language model based on the second data set. The large language model with adjusted parameters can recognize the semantics of language text and thereby understand the intent of the user. The embodiment of the application does not limit the number of language texts of each intent. The language text of each intent may contain keywords indicating the corresponding intent.
Where the model task is machine translation, the language text of the model task may include a plurality of text pairs to enable a subsequent large language model to translate the input language text into another language. Accordingly, the server obtains a plurality of text pairs. Each text pair contains text belonging to a source language and text belonging to a target language. The source language and the target language are different languages. The server then constructs a second data set based on the first data set and the plurality of text pairs. The server then adjusts parameters of the pre-trained large language model based on the second dataset. The large language model with the parameters adjusted can translate the language text belonging to the source language into the language text belonging to the target language; alternatively, language text belonging to the target language is translated into language text belonging to the source language.
In some embodiments, the server may obtain language text of each language from the first data set to construct the second data set. Accordingly, the process of the server determining the second data set based on the first data set and the model task includes: for any language in the first data set, the server obtains language text belonging to that language; the server takes the language texts of the multiple languages as a first text subset; the server then obtains a second text subset based on the model task; and the server uses the union of the first text subset and the second text subset as the second data set. The first text subset can be regarded as a replay buffer for reviewing the experience of the pre-training stage. In the solution provided by this embodiment, the second data set is constructed from language text of each language in the first data set together with language text related to the model task. When the parameters of the large language model are subsequently adjusted based on the second data set, the model can therefore learn the knowledge of the model task from the task-related text while also reviewing and reusing the experience of each language in the first data set. This effectively suppresses catastrophic forgetting during parameter adjustment, so that the large language model with adjusted parameters still retains the ability to understand multiple languages while achieving excellent performance on the task.
The number of language texts of each language acquired by the server from the first data set may be the same or different, which is not limited in the embodiment of the present application. Alternatively, the server may obtain language text for each language from the first dataset in accordance with the number proportion of language text for each language in the first dataset.
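A possible realization of this replay-buffer construction, sampling each language in proportion to its share of the first data set and merging the result with the task-related text, is sketched below. The (language, text) tuple format, the quota rule, and the function name are assumptions made for illustration.

```python
# Sketch of building the second data set: a replay subset sampled per language in proportion
# to that language's share of the first data set, plus the task-related text.
# Assumption: first_dataset is a list of (language, text) pairs; names are illustrative.
import random
from collections import defaultdict

def build_second_dataset(first_dataset, task_texts, replay_size=1000, seed=0):
    random.seed(seed)
    by_language = defaultdict(list)
    for language, text in first_dataset:
        by_language[language].append(text)

    total = sum(len(texts) for texts in by_language.values())
    first_subset = []  # replay buffer reviewing the pre-training experience
    for language, texts in by_language.items():
        quota = max(1, round(replay_size * len(texts) / total))   # proportional quota per language
        first_subset.extend(random.sample(texts, min(quota, len(texts))))

    return first_subset + list(task_texts)   # union of the first and second text subsets
```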
In some embodiments, the server may also obtain language text related to the model task from the first data set to construct the second data set. Accordingly, the process of the server determining the second data set based on the first data set and the model task includes: for any language text in the first data set, the server determines the topic of the language text based on its semantics; the server then obtains, from the first data set, the language texts whose topics are related to the model task, yielding a third text subset; the server obtains a second text subset based on the model task; and the server uses the second text subset and the third text subset as the second data set. In the solution provided by this embodiment, the second data set is constructed from the task-related language text in the first data set, so that when the parameters of the large language model are adjusted based on the second data set, the model reviews and reuses the experience of the pre-training stage, which effectively suppresses catastrophic forgetting during parameter adjustment, while the model also learns the knowledge of the model task from both language text it has already seen and language text it has not yet seen.
Wherein the language text in the first dataset associated with the model task may comprise a plurality of languages. Correspondingly, the second data set is constructed through the language text related to the model task in the first data set, so that the large-scale language model can learn the model task from the angles of multiple languages, namely, the model can acquire excellent performance on the task, and meanwhile, the large-scale language model with the parameters adjusted still has the capability of understanding the languages of multiple languages.
When obtaining the third text subset, the server may filter the language text from the first data set based on the similarity between the topic of the language text and the model task. Accordingly, the process in which the server obtains, from the first data set, the language texts whose topics are related to the model task to obtain the third text subset includes: for any language text in the first data set, the server determines a topic similarity based on the topic of the language text and the model task, the topic similarity representing the degree of correlation between the topic of the language text and the model task; the server then obtains, from the first data set, the language texts whose topic similarity reaches a similarity threshold, yielding the third text subset. The size of the similarity threshold is not limited in this embodiment. In this solution, the second data set is constructed from the language texts in the first data set whose correlation with the model task reaches the similarity threshold. When the parameters of the large language model are adjusted based on the second data set, the model reviews and reuses the experience of the pre-training stage through the task-related text from the first data set, which effectively suppresses catastrophic forgetting during parameter adjustment; it additionally learns the knowledge of the model task from both seen and unseen language text, which can improve the performance of the large language model on the model task and also improve training efficiency.
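One way to realize this topic-similarity filter is to embed each text and a short task description and compare them with cosine similarity. The sketch below assumes the sentence-transformers library; the specific encoder checkpoint, threshold value, and function name are illustrative assumptions, not details taken from the application.

```python
# Sketch of the topic-similarity filter: keep first-data-set texts whose topic is close to the
# model-task description. Assumes sentence-transformers; encoder name and threshold are illustrative.
from sentence_transformers import SentenceTransformer, util

def select_task_related_texts(first_dataset_texts, task_description, threshold=0.5):
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    task_vec = encoder.encode(task_description, convert_to_tensor=True)
    third_subset = []
    for text in first_dataset_texts:
        text_vec = encoder.encode(text, convert_to_tensor=True)
        similarity = util.cos_sim(text_vec, task_vec).item()   # topic similarity
        if similarity >= threshold:                            # reaches the similarity threshold
            third_subset.append(text)
    return third_subset
```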
303. The server determines a parameter constraint condition based on the weights of the parameters in the large language model, where a weight represents how important the corresponding parameter is to the large language model's learning of language in the pre-training stage, and the parameter constraint condition is used to constrain the amount of change of the parameters.
In the embodiment of the present application, the server may determine the weight of a parameter in the large language model according to how important that parameter was to the large language model's learning of language in the pre-training stage. The weight of a parameter is positively correlated with its importance. The server then determines the parameter constraint condition according to the weights of the parameters in the large language model. The parameter constraint condition constrains the amount of change of the parameters of the large language model in step 304, so that the parameters do not change so much that the language knowledge learned in the pre-training stage is forgotten. The parameter constraint condition may limit the amount of change of a parameter to a preset range; alternatively, the amount of change of the parameters may be constrained as part of the loss of the large language model; alternatively, the change of a parameter may be constrained by controlling its learning rate. The embodiment of the present application does not limit this.
In some embodiments, the server may constrain the amount of change of the parameters by making the parameter constraint condition part of the loss of the large language model. Accordingly, the process of the server determining the parameter constraint condition based on the weights of the parameters in the large language model includes: the server obtains the weights of the parameters in the large language model; for any parameter of the large language model, the server multiplies the parameter by its weight to obtain a constraint term for that parameter; the server then determines the parameter constraint condition based on the constraint terms of a plurality of parameters in the large language model, the parameter constraint condition being part of the loss of the large language model. That is, the server adds up the constraint terms of the plurality of parameters to obtain the parameter constraint condition, and then adds the parameter constraint condition to the loss function to determine the loss of the large language model. In this solution, the parameter constraint condition is determined according to how important each parameter was to learning language in the pre-training stage and is made part of the loss of the large language model. Because the loss of the large language model decreases during training, the value contributed by the parameter constraint condition also decreases, which constrains the amount of change of the parameters, prevents the large language model from excessively modifying the knowledge learned in the pre-training stage, and thus effectively suppresses catastrophic forgetting during parameter adjustment, so that the model achieves excellent performance on the task while retaining the rich knowledge learned in the pre-training stage.
Wherein the plurality of parameters may be all parameters of the large language model. Accordingly, the server determines parameter constraints based on constraints of all parameters in the large language model. Alternatively, the plurality of parameters may be some parameters whose weights satisfy the condition among all the parameters of the large language model. Accordingly, the server determines parameter constraints based on constraints of some parameters in the large language model. This is not limiting in this embodiment of the present application.
Optionally, the plurality of parameters are the subset of all parameters of the large language model whose weights satisfy a condition. The embodiment of the application does not limit the condition that the weights must satisfy. Accordingly, the process of the server determining the parameter constraint condition based on the constraint terms of the plurality of parameters in the large language model includes: the server screens out, from all the parameters of the large language model, the parameters whose weights reach a weight threshold; the server then sums the constraint terms of these parameters to obtain the parameter constraint condition. In this solution, the parameter constraint condition is determined from the parameters that are most important to learning language in the pre-training stage, so that the amount of change of these parameters can be constrained when the parameters of the large language model are subsequently adjusted for the model task, preventing the model from excessively modifying the knowledge learned in the pre-training stage and effectively suppressing catastrophic forgetting during parameter adjustment, so that the model achieves excellent performance on the task while retaining the rich knowledge learned in the pre-training stage. In addition, compared with a scheme that limits the change of all parameters, only the more important parameters need to be constrained and the parameters that matter little for language learning need not be attended to, which saves computation and improves training efficiency.
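A minimal sketch of such a constraint term is shown below, following the common EWC-style formulation suggested by Equation one later in this description: each parameter's deviation from its pre-trained value is penalized, weighted by the parameter's importance, and only parameters whose weight reaches the threshold contribute. The dictionaries `weights` and `pretrained_params`, the threshold, and λ are assumed inputs for illustration.

```python
# Sketch of the parameter constraint condition as an extra loss term (EWC-style, see Equation one).
# Assumption: `weights` and `pretrained_params` map parameter names to tensors saved after pre-training;
# the weight threshold and lambda value are illustrative.
import torch

def parameter_constraint(model, pretrained_params, weights, weight_threshold=1e-3, lam=100.0):
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        importance = weights[name]
        mask = (importance >= weight_threshold).float()   # keep only parameters whose weight reaches the threshold
        # Constraint term: importance-weighted squared change relative to the pre-trained value.
        penalty = penalty + (mask * importance * (param - pretrained_params[name]) ** 2).sum()
    return (lam / 2.0) * penalty
```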
In some embodiments, the server may constrain the change of a parameter by controlling its learning rate. Accordingly, the process of the server determining the parameter constraint condition based on the weights of the parameters in the large language model includes: for any parameter of the large language model, the server determines the learning rate of the parameter based on its weight; the server then determines the parameter constraint condition based on the learning rates of a plurality of parameters in the large language model. The learning rate of a parameter is inversely related to its weight. In this solution, the learning rate of each parameter is determined according to how important that parameter was to learning language in the pre-training stage: the more important the parameter, the lower its learning rate. Limiting the amount of change of a parameter through its learning rate prevents the large language model from excessively modifying the knowledge learned in the pre-training stage, effectively suppressing catastrophic forgetting during parameter adjustment, so that the model achieves excellent performance on the task while retaining the rich knowledge learned in the pre-training stage.
Wherein the plurality of parameters may be all parameters of the large language model. Accordingly, the server determines parameter constraints based on the learning rate of all parameters in the large language model. Alternatively, the plurality of parameters may be some parameters whose weights satisfy the condition among all the parameters of the large language model. Accordingly, the server determines parameter constraints based on the learning rate of the partial parameters in the large language model. This is not limiting in this embodiment of the present application. The condition that the weight satisfies may refer to the weight reaching a weight threshold, which is not limited by the embodiment of the present application.
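The learning-rate variant could look like the sketch below, which assigns each parameter tensor its own optimizer group with a learning rate that shrinks as the importance weight grows. The specific scaling rule (base_lr / (1 + importance)) and the use of a per-tensor mean importance are assumptions made for illustration.

```python
# Sketch of constraining parameter change through per-parameter learning rates that are
# inversely related to the parameter weights. The scaling rule below is an illustrative assumption.
import torch

def build_weighted_lr_optimizer(model, weights, base_lr=1e-5):
    param_groups = []
    for name, param in model.named_parameters():
        importance = weights[name].float().mean().item()     # scalar importance for this tensor
        lr = base_lr / (1.0 + importance)                     # higher weight -> lower learning rate
        param_groups.append({"params": [param], "lr": lr})
    return torch.optim.AdamW(param_groups)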
The embodiment of the application does not limit how the weight of a parameter is determined. Optionally, the process of the server determining the weights of the parameters in the large language model includes: the server fits the posterior probability of the pre-trained large language model to obtain a Gaussian distribution; the server then determines a Fisher information matrix based on the Gaussian distribution; the server determines the variance of the posterior probability of the large language model based on the Fisher information matrix; the server then uses the variance of the posterior probability of the large language model as the weights of the parameters in the large language model.
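In practice, the diagonal of the Fisher information matrix is often approximated by the mean squared gradient of the pre-training loss over a sample of pre-training data, which is the approach sketched below. The data loader, loss function, and batch count are assumed inputs; this is one common way to obtain such weights, not necessarily the exact procedure of the application.

```python
# Sketch of estimating the parameter weights from the diagonal Fisher information,
# approximated by the mean squared gradient of the pre-training loss (assumed inputs, illustrative names).
import torch

def estimate_fisher_weights(model, data_loader, loss_fn, num_batches=100):
    fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    model.eval()
    seen = 0
    for batch in data_loader:
        if seen >= num_batches:
            break
        model.zero_grad()
        loss_fn(model, batch).backward()                      # gradient of the pre-training loss
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.detach() ** 2          # squared gradient ~ diagonal Fisher entry
        seen += 1
    return {name: f / max(seen, 1) for name, f in fisher.items()}
```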
304. The server adjusts parameters of the pre-trained large language model based on the second data set and the parameter constraint condition, and the large language model after parameter adjustment is used for executing model tasks through any one of a plurality of languages.
In the embodiment of the present application, the server adjusts the parameters of the pre-trained large language model based on the second data set, so that the large language model can learn the knowledge of the model task while reviewing the knowledge learned in the pre-training stage. This way of adjusting the parameters can be regarded as experience replay (Experience Replay). On this basis, the server can also adjust the parameters of the pre-trained large language model based on the parameter constraint condition, so as to limit excessive changes of the parameters and prevent the large language model from excessively modifying the knowledge learned in the pre-training stage. This way of adjusting the parameters can be regarded as elastic weight consolidation (Elastic Weight Consolidation, EWC). "Adjusting" here may also be referred to as "fine-tuning".
In some embodiments, the server may train the large language model by the following equation one.
Equation one:

L(θ) = L_B(θ) + Σ_i (λ/2) · F_i · (θ_i − θ*_{A,i})²

wherein, L(θ) represents the loss of the large language model; A represents the pre-training task; B represents the model task; L_B(θ) represents the loss of the model task; λ represents the importance of the old pre-training task A relative to the new model task B; θ represents the parameters of the large language model; F_i represents the weight of the i-th parameter; θ*_{A,i} represents the value of the i-th parameter after pre-training on task A; and i is the index of a parameter, used to distinguish the parameters of the large language model.
In some embodiments, the server may also adjust the parameters of the large language model using at least one of a weight decay strategy (Weight Decay) and Dropout. By adopting regularization strategies such as weight decay and Dropout, over-fitting of the large language model can be prevented, catastrophic forgetting can be further suppressed, and the generalization capability of the large language model can be improved. The weight decay strategy suppresses over-fitting by applying a penalty to the weights of the model; Dropout increases the robustness of the model by randomly discarding the outputs of a portion of the neurons.
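Putting the pieces together, one fine-tuning step on the second data set could look like the sketch below: the task loss and the parameter constraint (Equation one) are added, weight decay is applied through AdamW, and Dropout stays active because the model is in training mode. It reuses the `parameter_constraint` helper and the `weights`/`pretrained_params` dictionaries assumed in the earlier sketches; all names and hyperparameters are illustrative.

```python
# Sketch of one fine-tuning step combining experience replay data, the parameter constraint
# (Equation one), weight decay via AdamW, and Dropout. Helpers and names are assumptions from above.
import torch

def fine_tune_step(model, batch, task_loss_fn, optimizer, pretrained_params, weights, lam=100.0):
    model.train()                                             # keeps Dropout layers active
    task_loss = task_loss_fn(model, batch)                    # loss of the model task B on the second data set
    penalty = parameter_constraint(model, pretrained_params, weights, lam=lam)
    loss = task_loss + penalty                                # Equation one: task loss plus constraint term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Weight decay strategy applied through the optimizer (illustrative values):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
```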
305. The server evaluates the large language model based on a third data set that includes a plurality of language texts associated with the model tasks.
In the present embodiment, the third data set is a test set that is independent of the first data set and the second data set. That is, the language text in the third data set is different from the language text in both the first data set and the second data set. The third data set may include language text related to the model task, and may also include language text in the same languages as the first data set, which is not limited in this embodiment of the present application. The server evaluates the large language model on the third data set; the specific evaluation method is not limited. The evaluation result of the large language model may indicate the performance of the large language model in executing the model task, or may indicate the large language model's ability to understand multiple languages, which is not limited in this embodiment of the present application.
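For illustration only, a simple held-out evaluation might be sketched as follows, assuming the third data set consists of (text, label) pairs and that accuracy is the chosen metric; both assumptions go beyond what the application specifies.

```python
# Sketch of evaluating the adjusted model on the independent third data set (held-out test set).
# Assumption: (text, label) pairs and an accuracy metric; predict_fn maps text to a task output.
import torch

@torch.no_grad()
def evaluate(model, third_dataset, predict_fn):
    model.eval()
    correct = 0
    for text, label in third_dataset:
        if predict_fn(model, text) == label:
            correct += 1
    return correct / max(1, len(third_dataset))
```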
After the large language model is pre-trained, part of the language text is obtained from the first data set used for pre-training and combined with language text related to the model task to form a second data set, so that the language text used for pre-training is learned again when the parameters of the large language model are subsequently adjusted for the model task; in other words, the large language model can review and reuse the experience of the pre-training stage. In addition, a parameter constraint condition is determined according to how important each parameter in the large language model was to learning language in the pre-training stage, so that the amount by which the parameters change can be constrained when the parameters are subsequently adjusted for the model task, preventing the large language model from excessively modifying the knowledge learned in the pre-training stage. Together, these two measures effectively suppress catastrophic forgetting of the large language model during parameter adjustment, so that the model achieves excellent performance on the task while retaining the rich knowledge learned in the pre-training stage.
Fig. 4 is a block diagram of a model training apparatus provided according to an embodiment of the present application. The model training apparatus is used for executing the steps when the model training method is executed, referring to fig. 4, and the model training apparatus includes: a first training module 401, a first determination module 402, a second determination module 403, and a second training module 404.
A first training module 401, configured to pretrain the large language model based on a first data set, where the first data set includes language texts in multiple languages;
a first determining module 402 for determining a second data set based on the first data set and the model task, the second data set comprising a portion of language text in the first data set and language text associated with the model task;
a second determining module 403, configured to determine a parameter constraint condition based on a weight of a parameter in the large language model, where the weight is used to represent an importance degree of the corresponding parameter to the large language model learning language in the pre-training stage, and the parameter constraint condition is used to constrain a variation of the parameter;
the second training module 404 is configured to adjust parameters of the pre-trained large language model based on the second data set and the parameter constraint condition, where the large language model after parameter adjustment is used to execute the model task in any one of multiple languages.
In some embodiments, fig. 5 is a block diagram of another model training apparatus provided in accordance with an embodiment of the present application. Referring to fig. 5, a first determining module 402 is configured to obtain, for any language in the first data set, a language text belonging to the language; a language text of a plurality of languages is used as a first text subset; acquiring a second text subset based on the model task; and taking the collection of the first text subset and the second text subset as a second data set.
In some embodiments, with continued reference to fig. 5, the first determination module 402 includes:
a first determining unit 4021 configured to determine, for any language text in the first data set, a subject of the language text based on semantics of the language text;
a first obtaining unit 4022, configured to obtain, from the first data set, a language text related to the subject and the model task, to obtain a third text subset;
a second obtaining unit 4023 configured to obtain a second text subset based on the model task;
a merging unit 4024, configured to use the second text subset and the third text subset as the second data set.
In some embodiments, with continued reference to fig. 5, the first obtaining unit 4022 is configured to determine, for any language text in the first data set, a topic similarity based on a topic and a model task of the language text, the topic similarity being used to represent a degree of correlation between the topic and the model task of the language text; and obtaining the language text with the topic similarity reaching the similarity threshold value from the first data set to obtain a third text subset.
In some embodiments, with continued reference to fig. 5, the second determination module 403 includes:
a third obtaining unit 4031 for obtaining the weight of the parameter in the large language model;
the processing unit 4032 is configured to multiply, for any parameter of the large language model, the parameter with a weight of the parameter to obtain a constraint term of the parameter;
a second determining unit 4033, configured to determine a parameter constraint condition based on constraint terms of a plurality of parameters in the large language model, where the parameter constraint condition is a part of a loss of the large language model.
In some embodiments, with continued reference to fig. 5, the second determining unit 4033 is configured to screen out, from all parameters in the large language model, a plurality of parameters whose weights reach a weight threshold, and sum the constraint terms of the plurality of parameters to obtain the parameter constraint condition.
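A minimal PyTorch sketch of how such a constraint term could enter the loss is given below. It assumes the per-parameter weights are already available as tensors keyed by parameter name, and it penalizes the weighted squared deviation from the pre-trained parameter values so that the term actually restricts the amount of change; this functional form, the 0.1 threshold, and the helper names are assumptions, not details fixed by the application, whose claims only state that each parameter is multiplied by its weight.

    # Illustrative sketch; weights, threshold, and the squared-deviation form are assumptions.
    import torch

    def parameter_constraint(model, pretrained_params, weights, weight_threshold=0.1):
        """Sum the constraint terms of parameters whose weights reach the weight threshold."""
        penalty = 0.0
        for name, param in model.named_parameters():
            w = weights[name]                              # importance weight, same shape as param
            mask = (w >= weight_threshold).float()         # keep only sufficiently important parameters
            delta = param - pretrained_params[name]        # change relative to the pre-trained value
            penalty = penalty + (mask * w * delta.pow(2)).sum()
        return penalty

    # During fine-tuning the constraint is part of the loss, for example:
    # loss = task_loss + reg_coefficient * parameter_constraint(model, pretrained_params, weights)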
In some embodiments, with continued reference to fig. 5, the second determining module 403 is configured to: for any parameter of the large language model, determine a learning rate of the parameter based on the weight of the parameter, the learning rate of the parameter being inversely related to the weight of the parameter; and determine the parameter constraint condition based on the learning rates of a plurality of parameters in the large language model.
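Below is a minimal sketch of realizing this inverse relation through per-parameter learning rates in a PyTorch optimizer; the 1/(1+w) mapping, the base learning rate, the choice of AdamW, and the use of the mean weight per tensor are assumptions chosen purely for illustration.

    # Illustrative sketch; the mapping from weight to learning rate is an assumption.
    import torch

    def build_optimizer(model, weights, base_lr=1e-4):
        """Give each parameter tensor a learning rate that shrinks as its importance weight grows."""
        param_groups = []
        for name, param in model.named_parameters():
            w = float(weights[name].float().mean())  # scalar importance for this parameter tensor
            lr = base_lr / (1.0 + w)                 # larger weight -> smaller learning rate
            param_groups.append({"params": [param], "lr": lr})
        return torch.optim.AdamW(param_groups)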
In some embodiments, with continued reference to fig. 5, the second training module 404 is further configured to adjust parameters of the large language model using at least one of a weight decay policy and Dropout.
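For completeness, a minimal sketch of applying both techniques in PyTorch is shown below; the layer sizes, dropout probability, and weight-decay coefficient are arbitrary illustrative values rather than values taken from this application.

    # Illustrative sketch: Dropout is added in the model definition,
    # and the weight decay strategy is applied through the optimizer.
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(768, 768),
        nn.ReLU(),
        nn.Dropout(p=0.1),  # randomly zeroes activations during fine-tuning
        nn.Linear(768, 768),
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)  # weight decay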
The embodiment of the application provides a model training apparatus. After the large language model is pre-trained, language texts and language texts related to the model task are obtained from the first data set used for pre-training to form a second data set, so that the texts used for pre-training can be learned again when the parameters of the large language model are subsequently adjusted for the model task; in other words, the large language model can review and reuse the experience of the pre-training stage. In addition, a parameter constraint condition can be determined according to the importance of each parameter of the large language model to learning languages in the pre-training stage, so that the amount by which the parameters may change during the subsequent task-oriented adjustment is constrained, preventing the large language model from excessively modifying the knowledge learned in the pre-training stage. Together, these two mechanisms effectively suppress catastrophic forgetting during parameter adjustment, so that the model achieves excellent performance on the task while retaining the rich knowledge learned in the pre-training stage.
It should be noted that the model training apparatus provided in the foregoing embodiment is illustrated only by way of the above division of functional modules. In practical applications, the above functions may be allocated to different functional modules as required; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the model training apparatus provided in the foregoing embodiment and the model training method embodiment belong to the same concept; for the specific implementation process, refer to the method embodiment, and details are not repeated here.
In the embodiment of the present application, the computer device may be configured as a terminal or a server. When the computer device is configured as a terminal, the terminal serves as the execution body of the technical solution provided in the embodiment of the present application; when the computer device is configured as a server, the server serves as the execution body. Alternatively, the technical solution may be implemented through interaction between the terminal and the server, which is not limited in the embodiment of the present application.
Fig. 6 is a block diagram of a terminal 600 according to an embodiment of the present application. The terminal 600 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 600 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 600 includes: a processor 601 and a memory 602.
The processor 601 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 601 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor: the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 601 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one computer program for execution by processor 601 to implement the model training method provided by the method embodiments herein.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603, and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 603 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 604, a display 605, a camera assembly 606, audio circuitry 607, and a power supply 608.
The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, the memory 602, and the peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 604 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 604 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 604 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. In some embodiments, the radio frequency circuit 604 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 604 may also include NFC (Near Field Communication) related circuitry, which is not limited in this application.
The display screen 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the ability to collect touch signals on or above its surface. The touch signal may be input to the processor 601 as a control signal for processing. In this case, the display screen 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 605, disposed on the front panel of the terminal 600; in other embodiments, there may be at least two display screens 605, respectively disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display screen 605 may be a flexible display screen disposed on a curved surface or a folded surface of the terminal 600. The display screen 605 may even be arranged in a non-rectangular irregular pattern, that is, an irregularly shaped screen. The display screen 605 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 606 is used to capture images or video. In some embodiments, the camera assembly 606 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fusion shooting functions. In some embodiments, the camera assembly 606 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing, or inputting the electric signals to the radio frequency circuit 604 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 600. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 607 may also include a headphone jack.
The power supply 608 is used to supply power to the various components in the terminal 600. The power supply 608 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 608 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is charged through a wired line, and a wireless rechargeable battery is charged through a wireless coil. The rechargeable battery may also be used to support fast-charging technology.
In some embodiments, the terminal 600 further includes one or more sensors 609. The one or more sensors 609 include, but are not limited to: acceleration sensor 610, gyroscope sensor 611, pressure sensor 612, optical sensor 613, and proximity sensor 614.
The acceleration sensor 610 may detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 610 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 601 may control the display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 610. The acceleration sensor 610 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 611 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 611 may collect a 3D motion of the user to the terminal 600 in cooperation with the acceleration sensor 610. The processor 601 may implement the following functions based on the data collected by the gyro sensor 611: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 612 may be disposed on a side frame of the terminal 600 and/or at a lower layer of the display screen 605. When the pressure sensor 612 is disposed on a side frame of the terminal 600, a holding signal of the user on the terminal 600 may be detected, and the processor 601 performs left- or right-hand recognition or a shortcut operation according to the holding signal collected by the pressure sensor 612. When the pressure sensor 612 is disposed at the lower layer of the display screen 605, the processor 601 controls an operable control on the UI according to the pressure operation of the user on the display screen 605. The operable control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 613 is used to collect the intensity of ambient light. In one embodiment, processor 601 may control the display brightness of display 605 based on the intensity of ambient light collected by optical sensor 613. Specifically, when the intensity of the ambient light is high, the display brightness of the display screen 605 is turned up; when the ambient light intensity is low, the display brightness of the display screen 605 is turned down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 613.
A proximity sensor 614, also known as a distance sensor, is typically provided on the front panel of the terminal 600. The proximity sensor 614 is used to collect the distance between the user and the front of the terminal 600. In one embodiment, when the proximity sensor 614 detects that the distance between the user and the front of the terminal 600 gradually decreases, the processor 601 controls the display 605 to switch from the bright screen state to the off screen state; when the proximity sensor 614 detects that the distance between the user and the front surface of the terminal 600 gradually increases, the processor 601 controls the display screen 605 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 6 is not limiting of the terminal 600 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present application. The server 700 may vary considerably depending on configuration or performance, and may include one or more processors (Central Processing Units, CPUs) 701 and one or more memories 702. The memory 702 stores at least one computer program, and the at least one computer program is loaded and executed by the processor 701 to implement the model training method provided in the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may further include other components for implementing device functions, which are not described herein.
The present application also provides a computer-readable storage medium in which at least one computer program is stored. The at least one computer program is loaded and executed by a processor of a computer device to implement the operations performed by the computer device in the model training method of the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
Embodiments of the present application also provide a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the model training methods provided in the various alternative implementations described above.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing descriptions are merely preferred embodiments of the present application and are not intended to limit the present application. Any modifications, equivalent substitutions, improvements, and the like made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims (12)

1. A method of model training, the method comprising:
pre-training a large language model based on a first dataset comprising language text in a plurality of languages;
determining a second data set based on the first data set and a model task, the second data set including a portion of language text in the first data set and language text associated with the model task;
determining parameter constraint conditions based on the weights of the parameters in the large language model, wherein the weights are used for representing the importance degree of the corresponding parameters on the large language model learning language in a pre-training stage, and the parameter constraint conditions are used for constraining the variable quantity of the parameters;
and adjusting parameters of the pre-trained large language model based on the second data set and the parameter constraint condition, wherein the large language model with the parameters adjusted is used for executing the model task through any one of the languages.
2. The method of claim 1, wherein the determining a second data set based on the first data set and a model task comprises:
for any language in the first data set, acquiring a language text belonging to the language;
the language texts of the plurality of languages are used as a first text subset;
acquiring a second text subset based on the model task;
and taking the union of the first text subset and the second text subset as the second data set.
3. The method of claim 1, wherein the determining a second data set based on the first data set and a model task comprises:
for any language text in the first data set, determining a theme of the language text based on the semantics of the language text;
obtaining, from the first data set, language texts whose topics are related to the model task, to obtain a third text subset;
acquiring a second text subset based on the model task;
and taking the second text subset and the third text subset as the second data set.
4. A method according to claim 3, wherein said obtaining, from said first dataset, language text whose subject matter is related to said model task, resulting in a third subset of text, comprises:
for any language text in the first data set, determining a topic similarity based on the topic of the language text and the model task, wherein the topic similarity is used for representing the degree of correlation between the topic of the language text and the model task;
and obtaining language texts with the topic similarity reaching a similarity threshold value from the first data set, and obtaining the third text subset.
5. The method of claim 1, wherein determining parameter constraints based on weights of parameters in the large language model comprises:
acquiring the weight of parameters in the large language model;
multiplying any parameter of the large language model by the weight of the parameter to obtain a constraint term of the parameter;
determining the parameter constraint conditions based on constraint terms of a plurality of parameters in the large language model, wherein the parameter constraint conditions are a part of a loss of the large language model.
6. The method of claim 5, wherein the determining the parameter constraint conditions based on constraint terms of a plurality of parameters in the large language model comprises:
screening out, from all parameters in the large language model, a plurality of parameters with weights reaching a weight threshold value;
and summing the constraint terms of the plurality of parameters to obtain the parameter constraint conditions.
7. The method of claim 1, wherein determining parameter constraints based on weights of parameters in the large language model comprises:
for any parameter of the large language model, determining the learning rate of the parameter based on the weight of the parameter, wherein the learning rate of the parameter is inversely related to the weight of the parameter;
and determining parameter constraint conditions based on the learning rates of a plurality of parameters in the large language model.
8. The method according to claim 1, wherein the method further comprises:
and adjusting the parameters of the large language model by using at least one of a weight decay strategy and Dropout.
9. A model training apparatus, the apparatus comprising:
the first training module is used for pre-training the large language model based on a first data set, wherein the first data set comprises language texts of a plurality of languages;
a first determining module configured to determine, based on the first data set and a model task, a second data set including a portion of language text in the first data set and language text related to the model task;
the second determining module is used for determining parameter constraint conditions based on the weights of the parameters in the large language model, wherein the weights are used for representing the importance degree of the corresponding parameters to the large language model learning language in the pre-training stage, and the parameter constraint conditions are used for constraining the variation of the parameters;
and the second training module is used for adjusting parameters of the large language model after pre-training based on the second data set and the parameter constraint condition, and the large language model after parameter adjustment is used for executing the model task through any one of the languages.
10. A computer device, comprising a processor and a memory, wherein the memory is configured to store at least one computer program, and the at least one computer program is loaded and executed by the processor to perform the model training method of any one of claims 1 to 8.
11. A computer-readable storage medium, wherein the computer-readable storage medium is configured to store at least one computer program, and the at least one computer program is used for performing the model training method of any one of claims 1 to 8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the model training method of any of claims 1 to 8.