CN117350354A - Training method and device for large model, electronic equipment and storage medium


Info

Publication number
CN117350354A
Authority
CN
China
Prior art keywords
parameter
target
matrix corresponding
parameter matrix
target parameter
Prior art date
Legal status
Granted
Application number
CN202311228444.6A
Other languages
Chinese (zh)
Other versions
CN117350354B (en)
Inventor
Name withheld at the inventor's request
Current Assignee
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd
Priority to CN202311228444.6A
Publication of CN117350354A
Application granted
Publication of CN117350354B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G06N 3/092 - Reinforcement learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure relates to a training method, apparatus, electronic device, and storage medium for a large model. The method comprises the following steps: determining target parameters in a target data processing model; for any one of the target parameters, initializing a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter according to the number of rows and columns of an original parameter matrix corresponding to the target parameter; inputting a training sample into the target data processing model, and outputting a prediction result corresponding to the training sample through the target data processing model; determining a value of a loss function corresponding to the target data processing model according to a prediction result corresponding to the training sample and a label corresponding to the training sample; and updating a first parameter matrix corresponding to the target parameter, a second parameter matrix corresponding to the target parameter and a third parameter matrix corresponding to a non-target parameter in the target data processing model according to the value of the loss function.

Description

Training method and device for large model, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular relates to a training method and device for a large model, electronic equipment and a storage medium.
Background
When a data processing model is large in scale (that is, the data processing model is a large model), training the data processing model places a heavy demand on the graphics card, and a single graphics card can hardly complete the training of a large-scale data processing model, so the training cost of a large-scale data processing model is high.
Disclosure of Invention
The present disclosure provides a training solution for large models.
According to an aspect of the present disclosure, there is provided a training method of a large model, including:
determining target parameters in a target data processing model;
for any target parameter, initializing a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter according to the number of rows and the number of columns of the original parameter matrix corresponding to the target parameter, wherein the number of rows of the first parameter matrix corresponding to the target parameter is equal to the number of rows of the original parameter matrix corresponding to the target parameter, the number of columns of the first parameter matrix corresponding to the target parameter is smaller than the number of columns of the original parameter matrix corresponding to the target parameter, the number of columns of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the original parameter matrix corresponding to the target parameter, and the number of rows of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the first parameter matrix corresponding to the target parameter;
inputting a training sample into the target data processing model, and outputting a prediction result corresponding to the training sample through the target data processing model;
determining a value of a loss function corresponding to the target data processing model according to a prediction result corresponding to the training sample and a label corresponding to the training sample;
and updating a first parameter matrix corresponding to the target parameter, a second parameter matrix corresponding to the target parameter and a third parameter matrix corresponding to a non-target parameter in the target data processing model according to the value of the loss function.
In one possible implementation manner, the determining the target parameters in the target data processing model includes:
and for any parameter in the target data processing model, determining the parameter as a target parameter in response to the original parameter matrix corresponding to the parameter meeting a preset condition.
In one possible implementation, the preset condition includes at least one of:
the product of the number of rows and the number of columns is greater than or equal to a first preset value;
the number of rows is greater than or equal to a second preset value;
the number of columns is greater than or equal to a third preset value.
In one possible implementation, the method further includes:
acquiring a capacity value of a video memory;
and determining at least one of the first preset value, the second preset value and the third preset value according to the capacity value.
In a possible implementation manner, the number of columns of the first parameter matrix corresponding to the target parameter is at least one order of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter.
In one possible implementation manner, the original parameter matrix corresponding to the target parameter is kept fixed during the training process of the target data processing model.
In one possible implementation manner, the inputting the training sample into the target data processing model, outputting, by the target data processing model, a prediction result corresponding to the training sample includes:
calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter to obtain a first product corresponding to the target parameter;
determining the sum of the original parameter matrix corresponding to the target parameter and the first product corresponding to the target parameter as the latest total parameter matrix corresponding to the target parameter;
inputting the training sample into the target data processing model, and obtaining a prediction result corresponding to the training sample based on the latest total parameter matrix corresponding to the target parameter and the latest third parameter matrix corresponding to the non-target parameter.
In one possible implementation manner, the updating the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter, and the third parameter matrix corresponding to the non-target parameter in the target data processing model according to the value of the loss function includes:
determining a first gradient of a first parameter matrix corresponding to the target parameter, a second gradient of a second parameter matrix corresponding to the target parameter and a third gradient of a third parameter matrix corresponding to the non-target parameter according to the value of the loss function;
updating the first parameter matrix corresponding to the target parameter according to the first gradient of the first parameter matrix corresponding to the target parameter;
updating a second parameter matrix corresponding to the target parameter according to a second gradient of the second parameter matrix corresponding to the target parameter;
and updating the third parameter matrix corresponding to the non-target parameter according to the third gradient of the third parameter matrix corresponding to the non-target parameter.
In one possible implementation, the method further includes:
and storing the first gradient of the first parameter matrix corresponding to the target parameter, the second gradient of the second parameter matrix corresponding to the target parameter and the third gradient of the third parameter matrix corresponding to the non-target parameter in a video memory.
In one possible implementation, the method further includes:
and storing first optimizer state information corresponding to a first parameter matrix corresponding to the target parameter, second optimizer state information corresponding to a second parameter matrix corresponding to the target parameter and third optimizer state information corresponding to a third parameter matrix corresponding to the non-target parameter in a video memory.
In one possible implementation manner, after updating the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter, and the third parameter matrix corresponding to the non-target parameter in the target data processing model according to the value of the loss function, the method further includes:
responding to the end of training of the target data processing model, calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter to obtain a second product corresponding to the target parameter;
and determining the sum of the original parameter matrix corresponding to the target parameter and the second product corresponding to the target parameter as an updated parameter matrix corresponding to the target parameter.
In one possible implementation, the loss function includes a first loss function corresponding to a task that predicts a next word.
In one possible implementation, the loss function includes a second loss function corresponding to the reinforcement learning task.
In one possible implementation, the target data processing model is a text processing model, and the training sample is training text.
According to an aspect of the present disclosure, there is provided a data processing method including:
acquiring a target data processing model obtained by training with the above training method of a large model;
inputting the data to be processed into the target data processing model, and outputting a data processing result corresponding to the data to be processed through the target data processing model.
In one possible implementation, the data to be processed is text to be processed.
According to an aspect of the present disclosure, there is provided a training apparatus of a large model, including:
the first determining module is used for determining target parameters in the target data processing model;
the initialization module is used for initializing a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter according to the number of rows and the number of columns of the original parameter matrix corresponding to the target parameter for any target parameter, wherein the number of rows of the first parameter matrix corresponding to the target parameter is equal to the number of rows of the original parameter matrix corresponding to the target parameter, the number of columns of the first parameter matrix corresponding to the target parameter is smaller than the number of columns of the original parameter matrix corresponding to the target parameter, the number of columns of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the original parameter matrix corresponding to the target parameter, and the number of rows of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the first parameter matrix corresponding to the target parameter;
the prediction module is used for inputting a training sample into the target data processing model and outputting a prediction result corresponding to the training sample through the target data processing model;
the second determining module is used for determining the value of the loss function corresponding to the target data processing model according to the prediction result corresponding to the training sample and the label corresponding to the training sample;
and the updating module is used for updating the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter and the third parameter matrix corresponding to the non-target parameter in the target data processing model according to the value of the loss function.
In one possible implementation manner, the first determining module is configured to:
and for any parameter in the target data processing model, determining the parameter as a target parameter in response to the original parameter matrix corresponding to the parameter meeting a preset condition.
In one possible implementation, the preset condition includes at least one of:
the product of the number of rows and the number of columns is greater than or equal to a first preset value;
the number of rows is greater than or equal to a second preset value;
the number of columns is greater than or equal to a third preset value.
In one possible implementation, the apparatus further includes:
the second acquisition module is used for acquiring the capacity value of the video memory;
and a third determining module, configured to determine at least one of the first preset value, the second preset value and the third preset value according to the capacity value.
In a possible implementation manner, the number of columns of the first parameter matrix corresponding to the target parameter is at least one order of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter.
In one possible implementation manner, the original parameter matrix corresponding to the target parameter is kept fixed during the training process of the target data processing model.
In one possible implementation, the prediction module is configured to:
calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter to obtain a first product corresponding to the target parameter;
determining the sum of the original parameter matrix corresponding to the target parameter and the first product corresponding to the target parameter as the latest total parameter matrix corresponding to the target parameter;
inputting the training sample into the target data processing model, and obtaining a prediction result corresponding to the training sample based on the latest total parameter matrix corresponding to the target parameter and the latest third parameter matrix corresponding to the non-target parameter.
In one possible implementation, the updating module is configured to:
determining a first gradient of a first parameter matrix corresponding to the target parameter, a second gradient of a second parameter matrix corresponding to the target parameter and a third gradient of a third parameter matrix corresponding to the non-target parameter according to the value of the loss function;
updating the first parameter matrix corresponding to the target parameter according to the first gradient of the first parameter matrix corresponding to the target parameter;
updating a second parameter matrix corresponding to the target parameter according to a second gradient of the second parameter matrix corresponding to the target parameter;
and updating the third parameter matrix corresponding to the non-target parameter according to the third gradient of the third parameter matrix corresponding to the non-target parameter.
In one possible implementation, the apparatus further includes:
the first storage module is used for storing the first gradient of the first parameter matrix corresponding to the target parameter, the second gradient of the second parameter matrix corresponding to the target parameter and the third gradient of the third parameter matrix corresponding to the non-target parameter in the video memory.
In one possible implementation, the apparatus further includes:
the second storage module is used for storing, in a video memory, first optimizer state information corresponding to a first parameter matrix corresponding to the target parameter, second optimizer state information corresponding to a second parameter matrix corresponding to the target parameter and third optimizer state information corresponding to a third parameter matrix corresponding to the non-target parameter.
In one possible implementation, the apparatus further includes:
the calculation module is used for responding to the end of training of the target data processing model, calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter, and obtaining the second product corresponding to the target parameter;
and a fourth determining module, configured to determine the sum of the original parameter matrix corresponding to the target parameter and the second product corresponding to the target parameter as an updated parameter matrix corresponding to the target parameter.
In one possible implementation, the loss function includes a first loss function corresponding to a task that predicts a next word.
In one possible implementation, the loss function includes a second loss function corresponding to the reinforcement learning task.
In one possible implementation, the target data processing model is a text processing model, and the training sample is training text.
According to an aspect of the present disclosure, there is provided a data processing apparatus including:
the first acquisition module is used for acquiring a target data processing model obtained by training with the above training apparatus of a large model;
the data processing module is used for inputting the data to be processed into the target data processing model, and outputting a data processing result corresponding to the data to be processed through the target data processing model.
In one possible implementation, the data to be processed is text to be processed.
According to an aspect of the present disclosure, there is provided an electronic apparatus including: one or more processors; a memory for storing executable instructions; wherein the one or more processors are configured to invoke the executable instructions stored by the memory to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
According to an aspect of the present disclosure, there is provided a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, wherein when the code runs in an electronic device, a processor in the electronic device performs the above method.
In the embodiments of the present disclosure, target parameters in a target data processing model are determined; for any target parameter, a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter are initialized according to the number of rows and the number of columns of the original parameter matrix corresponding to the target parameter, where the number of rows of the first parameter matrix corresponding to the target parameter is equal to the number of rows of the original parameter matrix corresponding to the target parameter, the number of columns of the first parameter matrix corresponding to the target parameter is smaller than the number of columns of the original parameter matrix corresponding to the target parameter, the number of columns of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the original parameter matrix corresponding to the target parameter, and the number of rows of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the first parameter matrix corresponding to the target parameter; a training sample is input into the target data processing model, and a prediction result corresponding to the training sample is output through the target data processing model; a value of a loss function corresponding to the target data processing model is determined according to the prediction result corresponding to the training sample and the label corresponding to the training sample; and the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter and a third parameter matrix corresponding to a non-target parameter in the target data processing model are updated according to the value of the loss function. In this way, when training the target data processing model, the original parameter matrix with a larger number of parameters does not need to be updated for the target parameter; only the third parameter matrix and the first and second parameter matrices with a smaller number of parameters need to be updated, so a larger-scale data processing model can be trained with limited hardware resources (for example, in an environment with only a single graphics card). Compared with training schemes for data processing models in the related art, the embodiments of the present disclosure can reduce video memory overhead, lower the requirement on video memory capacity, save hardware cost, shorten training time, and achieve a better effect for the trained target data processing model.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
FIG. 1 shows a flow chart of a training method for a large model provided by an embodiment of the present disclosure.
FIG. 2 illustrates a block diagram of a large model training apparatus provided by an embodiment of the present disclosure.
Fig. 3 shows a block diagram of an electronic device 1900 provided by an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
The embodiments of the present disclosure provide a training method of a large model. Target parameters in a target data processing model are determined; for any target parameter, a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter are initialized according to the number of rows and the number of columns of the original parameter matrix corresponding to the target parameter, where the number of rows of the first parameter matrix corresponding to the target parameter is equal to the number of rows of the original parameter matrix corresponding to the target parameter, the number of columns of the first parameter matrix corresponding to the target parameter is smaller than the number of columns of the original parameter matrix corresponding to the target parameter, the number of columns of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the original parameter matrix corresponding to the target parameter, and the number of rows of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the first parameter matrix corresponding to the target parameter; a training sample is input into the target data processing model, and a prediction result corresponding to the training sample is output through the target data processing model; a value of a loss function corresponding to the target data processing model is determined according to the prediction result corresponding to the training sample and the label corresponding to the training sample; and the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter and a third parameter matrix corresponding to a non-target parameter in the target data processing model are updated according to the value of the loss function. In this way, when training the target data processing model, the original parameter matrix with a larger number of parameters does not need to be updated for the target parameter; only the third parameter matrix and the first and second parameter matrices with a smaller number of parameters need to be updated, so a larger-scale data processing model can be trained with limited hardware resources (for example, in an environment with only a single graphics card). Compared with training schemes for data processing models in the related art, the embodiments of the present disclosure can reduce video memory overhead, lower the requirement on video memory capacity, save hardware cost, shorten training time, and achieve a better effect for the trained target data processing model.
The following describes in detail the training method of the large model provided in the embodiments of the present disclosure with reference to the accompanying drawings.
FIG. 1 shows a flow chart of a training method for a large model provided by an embodiment of the present disclosure. In one possible implementation manner, the execution subject of the training method of the large model may be a training apparatus of the large model; for example, the training method of the large model may be executed by a terminal device, a server, or another electronic device. The terminal device may be User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the training method of the large model may be implemented by a processor invoking computer readable instructions stored in a memory. As shown in FIG. 1, the training method of the large model includes steps S11 to S15.
In step S11, target parameters in the target data processing model are determined.
In step S12, for any one of the target parameters, initializing a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter according to the number of rows and the number of columns of the original parameter matrix corresponding to the target parameter, where the number of rows of the first parameter matrix corresponding to the target parameter is equal to the number of rows of the original parameter matrix corresponding to the target parameter, the number of columns of the first parameter matrix corresponding to the target parameter is smaller than the number of columns of the original parameter matrix corresponding to the target parameter, the number of columns of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the original parameter matrix corresponding to the target parameter, and the number of rows of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the first parameter matrix corresponding to the target parameter.
In step S13, a training sample is input into the target data processing model, and a prediction result corresponding to the training sample is output through the target data processing model.
In step S14, a value of a loss function corresponding to the target data processing model is determined according to the prediction result corresponding to the training sample and the label corresponding to the training sample.
In step S15, according to the value of the loss function, a first parameter matrix corresponding to the target parameter, a second parameter matrix corresponding to the target parameter, and a third parameter matrix corresponding to a non-target parameter in the target data processing model are updated.
The embodiments of the present disclosure can be applied to technical fields such as text, speech, image, video, and multi-modal processing. Moreover, the target data processing model in the embodiments of the present disclosure may be used for AIGC (Artificial Intelligence Generated Content) and the like, without limitation.
In one possible implementation, the target data processing model is a text processing model, and the training sample is training text. In the case of applying the embodiments of the present disclosure to the field of text processing technology, the target data processing model may be a target text processing model, and the training sample may be training text. After training of the target text processing model is completed, the input of the target text processing model may be the text to be processed.
In the case of applying the embodiments of the present disclosure to the field of speech processing technology, the target data processing model may be a target speech processing model, and the training sample may be training speech. After the training of the target speech processing model is completed, the input of the target speech processing model may be the speech to be processed.
In the case of applying the embodiments of the present disclosure to the field of image processing technology, the target data processing model may be a target image processing model, and the training sample may be a training image. After the training of the target image processing model is completed, the input of the target image processing model may be an image to be processed.
In the case of applying the embodiments of the present disclosure to the field of video processing technology, the target data processing model may be a target video processing model, and the training sample may be a training video. After the training of the target video processing model is completed, the input of the target video processing model may be the video to be processed.
In the case of applying the embodiments of the present disclosure to the technical field of multi-modal processing, the target data processing model may be a target multi-modal processing model, and the training sample may be multi-modal training data. After the training of the target multi-modal processing model is completed, the input of the target multi-modal processing model may be the multi-modal data to be processed.
In the disclosed embodiments, the target data processing model may be a large model.
In the disclosed embodiments, the learnable parameters in the target data processing model may be divided into target parameters and non-target parameters. Wherein the number of target parameters may be at least one. For example, a plurality of target parameters in a target data processing model may be determined. The non-target parameters may represent parameters other than the target parameters among the learnable parameters of the target data processing model. The number of non-target parameters may be at least one. The greater the number of target parameters, the higher the training efficiency of the target data processing model; the greater the number of non-target parameters, the better the training effect of the target data processing model.
In one possible implementation manner, the determining the target parameters in the target data processing model includes: and for any parameter in the target data processing model, determining the parameter as a target parameter in response to the original parameter matrix corresponding to the parameter meeting a preset condition.
In this implementation, the preset condition may represent a preset condition for determining the target parameter. In the implementation manner, for any parameter in the target data processing model, if an original parameter matrix corresponding to the parameter meets a preset condition, the parameter can be determined as a target parameter; if the original parameter matrix corresponding to the parameter does not meet the preset condition, the parameter can be determined to be a non-target parameter.
In the implementation manner, for any parameter in the target data processing model, the parameter is determined to be the target parameter in response to the original parameter matrix corresponding to the parameter meeting the preset condition, so that the division between the target parameter and the non-target parameter can be realized efficiently based on the preset condition.
As an example of this implementation, the preset condition includes at least one of: the product of the number of rows and the number of columns is larger than or equal to a first preset value; the number of lines is larger than or equal to a second preset value; the number of columns is greater than or equal to a third preset value.
In one example, the preset condition may be: the product of the number of rows and the number of columns is greater than or equal to a first preset value. In this example, for any parameter in the target data processing model, if the product of the number of rows and the number of columns of the original parameter matrix corresponding to the parameter is greater than or equal to a first preset value, the parameter may be determined as the target parameter; if the product of the number of rows and the number of columns of the original parameter matrix corresponding to the parameter is smaller than a first preset value, the parameter can be determined to be a non-target parameter.
In another example, the preset condition may be: the number of rows is greater than or equal to a second preset value. In this example, for any parameter in the target data processing model, if the number of rows of the original parameter matrix corresponding to the parameter is greater than or equal to the second preset value, the parameter may be determined as a target parameter; if the number of rows of the original parameter matrix corresponding to the parameter is smaller than the second preset value, the parameter may be determined as a non-target parameter.
In another example, the preset condition may be: the number of columns is greater than or equal to a third preset value. In this example, for any parameter in the target data processing model, if the number of columns of the original parameter matrix corresponding to the parameter is greater than or equal to a third preset value, the parameter may be determined as the target parameter; if the number of columns of the original parameter matrix corresponding to the parameters is smaller than a third preset value, the parameters can be determined to be non-target parameters.
With this example, parameters whose original parameter matrices have a larger number of rows and/or columns can be determined as target parameters, which helps reduce video memory overhead.
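As a minimal illustration of this selection rule, the sketch below assumes a PyTorch-style model whose learnable parameters include 2-D weight matrices; the function name and the threshold values are hypothetical placeholders, not values prescribed by this disclosure.

```python
import torch.nn as nn

def select_target_parameters(model: nn.Module,
                             first_preset: int = 1_000_000,  # rows * cols threshold (hypothetical)
                             second_preset: int = 1024,      # row-count threshold (hypothetical)
                             third_preset: int = 1024):      # column-count threshold (hypothetical)
    """Split learnable parameters into target / non-target sets by the shape of each matrix."""
    target, non_target = [], []
    for name, p in model.named_parameters():
        if p.dim() == 2:
            rows, cols = p.shape
            if rows * cols >= first_preset or rows >= second_preset or cols >= third_preset:
                target.append(name)   # original parameter matrix meets a preset condition
                continue
        non_target.append(name)       # everything else is a non-target parameter
    return target, non_target
```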
In one example, the method further comprises: acquiring a capacity value of a video memory; and determining at least one of the first preset value, the second preset value and the third preset value according to the capacity value.
In one example, the preset condition may be: the product of the number of rows and the number of columns is greater than or equal to a first preset value. In this example, a capacity value of the video memory may be obtained in advance, and the first preset value may be determined according to the capacity value. The first preset value may be positively correlated with a capacity value of the video memory. That is, the larger the capacity value of the video memory, the larger the first preset value may be; the smaller the capacity value of the video memory, the smaller the first preset value may be.
In another example, the preset condition may be: the number of lines is greater than or equal to a second preset value. In this example, a capacity value of the video memory may be obtained in advance, and the second preset value may be determined according to the capacity value. The second preset value may be positively correlated with the capacity value of the video memory. That is, the larger the capacity value of the video memory, the larger the second preset value may be; the smaller the capacity value of the video memory, the smaller the second preset value may be.
In another example, the preset condition may be: the number of columns is greater than or equal to a third preset value. In this example, a capacity value of the video memory may be obtained in advance, and the third preset value may be determined according to the capacity value. The third preset value may be positively correlated with the capacity value of the video memory. That is, the larger the capacity value of the video memory, the larger the third preset value may be; the smaller the capacity value of the video memory, the smaller the third preset value may be.
In this example, by acquiring the capacity value of the video memory and determining at least one of the first preset value, the second preset value and the third preset value according to the capacity value, the video memory resource can be fully utilized on the premise of meeting the video memory requirement of each parameter in the target data processing model, and the training effect of the target data processing model can be improved.
In one example, at least one of the first preset value, the second preset value, and the third preset value may be determined in conjunction with at least one of a length of input data of a target data processing model, a parameter number of the target data processing model, and the like.
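One way to realize the positive correlation described above is sketched below. This is purely an illustrative heuristic under assumed scaling constants (bytes per parameter, fraction of memory reserved), not a formula given by this disclosure.

```python
def presets_from_memory(capacity_bytes: int,
                        bytes_per_param: int = 4,
                        fraction_for_full_updates: float = 0.1):
    """Hypothetical heuristic: matrices whose full-rank update would exceed a fixed
    fraction of the video memory budget are treated as target parameters."""
    budget_params = int(capacity_bytes * fraction_for_full_updates / bytes_per_param)
    first_preset = budget_params                # threshold on rows * cols
    second_preset = int(budget_params ** 0.5)   # threshold on rows
    third_preset = int(budget_params ** 0.5)    # threshold on cols
    return first_preset, second_preset, third_preset
```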
In another possible implementation, the target parameters and non-target parameters in the target data processing model may be determined according to a user specification.
In the embodiment of the present disclosure, after determining the target parameters in the target data processing model, for any target parameter, the first parameter matrix corresponding to the target parameter and the second parameter matrix corresponding to the target parameter may be initialized according to the number of rows and columns of the original parameter matrix corresponding to the target parameter. The number of rows of the first parameter matrix corresponding to the target parameter is equal to the number of rows of the original parameter matrix corresponding to the target parameter, the number of columns of the first parameter matrix corresponding to the target parameter is smaller than the number of columns of the original parameter matrix corresponding to the target parameter, the number of columns of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the original parameter matrix corresponding to the target parameter, and the number of rows of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the first parameter matrix corresponding to the target parameter.
For example, the size of the original parameter matrix corresponding to a certain target parameter is m×n, that is, the number of rows of the original parameter matrix corresponding to the target parameter is m and the number of columns is n. Then, the first parameter matrix corresponding to the target parameter may have a size of m×r, that is, the number of rows of the first parameter matrix corresponding to the target parameter may be m and the number of columns may be r. The size of the second parameter matrix corresponding to the target parameter may be r×n, that is, the number of rows of the second parameter matrix corresponding to the target parameter may be r and the number of columns may be n. Here, r < m and r < n.
In the embodiments of the present disclosure, the first parameter matrix and the second parameter matrix may be initialized for different target parameters, respectively.
In a possible implementation manner, the number of columns of the first parameter matrix corresponding to the target parameter is at least one order of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter.
For example, the number of columns of the first parameter matrix corresponding to the target parameter may be an order of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter; for another example, the number of columns of the first parameter matrix corresponding to the target parameter may be two orders of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter; for another example, the number of columns of the first parameter matrix corresponding to the target parameter may be three orders of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter; etc.
For example, the size of an original parameter matrix corresponding to a certain target parameter is m×n, the size of the first parameter matrix corresponding to the target parameter is m×r, and the size of the second parameter matrix corresponding to the target parameter is r×n, with r << m and r << n. For example, r may be 8, 16, 4, 64, 2, etc., without limitation.
In this implementation manner, for any target parameter, setting the number of columns of the first parameter matrix corresponding to the target parameter and the number of rows of the second parameter matrix corresponding to the target parameter to be at least one order of magnitude smaller than the number of rows and the number of columns of the original parameter matrix corresponding to the target parameter helps save a large amount of video memory.
In one possible implementation manner, for any target parameter, a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter may be initialized with a gaussian distribution, so as to increase the generalization capability of the target data processing model.
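A minimal initialization sketch, assuming PyTorch and the shapes described above (original matrix m×n, first matrix m×r, second matrix r×n); the function name and the standard deviation are hypothetical. Following the implementation above, both matrices are drawn from a Gaussian distribution here.

```python
import torch

def init_low_rank_pair(m: int, n: int, r: int, std: float = 0.02):
    """Initialize the first (m x r) and second (r x n) parameter matrices with a Gaussian."""
    A = torch.randn(m, r) * std  # first parameter matrix
    B = torch.randn(r, n) * std  # second parameter matrix
    return torch.nn.Parameter(A), torch.nn.Parameter(B)
```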
In embodiments of the present disclosure, a target data processing model may be trained using a training sample set, where the training sample set may include a plurality of training samples. For any training sample, the training sample can be input into a target data processing model, the feature vector corresponding to the training sample is extracted through the target data processing model, and the prediction result corresponding to the training sample is obtained based on the feature vector corresponding to the training sample.
In the training process of the target data processing model, W0 + AB can be used as the total parameter matrix corresponding to any target parameter to participate in the calculation of forward propagation and backward propagation, where W0 may represent the original parameter matrix corresponding to the target parameter, A may represent the first parameter matrix corresponding to the target parameter, and B may represent the second parameter matrix corresponding to the target parameter. When the parameters of the target data processing model are updated, W0 may be kept fixed, and only A, B and W1 are updated, where W1 may represent a third parameter matrix corresponding to a non-target parameter.
In one possible implementation manner, the original parameter matrix corresponding to the target parameter is kept fixed during the training process of the target data processing model. In this implementation, during the training of the target data processing model, the original parameter matrix corresponding to the target parameter remains fixed, i.e. it remains frozen and is not further adjusted. By keeping the original parameter matrix corresponding to each target parameter fixed during training of the target data processing model, a large amount of video memory can be saved.
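The sketch below is an assumption-laden illustration in PyTorch, not a reference implementation of this disclosure: it wraps a frozen original weight W0 of shape m×n together with trainable matrices A (m×r) and B (r×n), and its forward pass uses the total parameter matrix W0 + AB as described above.

```python
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    """Linear transform whose original weight W0 is frozen; only A and B are trained."""
    def __init__(self, weight: torch.Tensor, r: int = 8, std: float = 0.02):
        super().__init__()
        m, n = weight.shape
        self.W0 = nn.Parameter(weight.clone(), requires_grad=False)  # original parameter matrix, kept fixed
        self.A = nn.Parameter(torch.randn(m, r) * std)               # first parameter matrix (m x r)
        self.B = nn.Parameter(torch.randn(r, n) * std)               # second parameter matrix (r x n)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.W0 + self.A @ self.B   # latest total parameter matrix
        return x @ W                    # x: (..., m) -> (..., n)
```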
In one possible implementation manner, the inputting the training sample into the target data processing model and outputting, by the target data processing model, a prediction result corresponding to the training sample includes: calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter to obtain a first product corresponding to the target parameter; determining the sum of the original parameter matrix corresponding to the target parameter and the first product corresponding to the target parameter as the latest total parameter matrix corresponding to the target parameter; inputting the training sample into the target data processing model, and obtaining a prediction result corresponding to the training sample based on the latest total parameter matrix corresponding to the target parameter and the latest third parameter matrix corresponding to the non-target parameter.
In this implementation, for any target parameter, the first product corresponding to the target parameter may represent the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter during training of the target data processing model.
For example, for any target parameter, the latest total parameter matrix corresponding to the target parameter may be determined according to W0 + AB, where W0 may represent the original parameter matrix corresponding to the target parameter, A may represent the first parameter matrix corresponding to the target parameter, B may represent the second parameter matrix corresponding to the target parameter, and AB may represent the first product corresponding to the target parameter.
In this implementation manner, the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter is calculated to obtain the first product corresponding to the target parameter, the sum of the original parameter matrix corresponding to the target parameter and the first product corresponding to the target parameter is determined as the latest total parameter matrix corresponding to the target parameter, the training sample is input into the target data processing model, and the prediction result corresponding to the training sample is obtained based on the latest total parameter matrix corresponding to the target parameter and the latest third parameter matrix corresponding to the non-target parameter, so that forward propagation can be computed based on the latest total parameter matrix corresponding to each target parameter in the training process of the target data processing model.
As an example of this implementation, the updating the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter, and the third parameter matrix corresponding to the non-target parameter in the target data processing model according to the value of the loss function includes: determining a first gradient of a first parameter matrix corresponding to the target parameter, a second gradient of a second parameter matrix corresponding to the target parameter and a third gradient of a third parameter matrix corresponding to the non-target parameter according to the value of the loss function; updating the first parameter matrix corresponding to the target parameter according to the first gradient of the first parameter matrix corresponding to the target parameter; updating a second parameter matrix corresponding to the target parameter according to a second gradient of the second parameter matrix corresponding to the target parameter; and updating the third parameter matrix corresponding to the non-target parameter according to the third gradient of the third parameter matrix corresponding to the non-target parameter.
In this implementation, the first gradient may represent a gradient of a first parameter matrix corresponding to the target parameter, the second gradient may represent a gradient of a second parameter matrix corresponding to the target parameter, and the third gradient may represent a gradient of a third parameter matrix corresponding to the non-target parameter.
In this implementation manner, for any target parameter, a gradient of a latest total parameter matrix corresponding to the target parameter may be determined according to the value of the loss function, and a first gradient of a first parameter matrix corresponding to the target parameter and a second gradient of a second parameter matrix corresponding to the target parameter may be determined according to the gradient of the latest total parameter matrix corresponding to the target parameter. In the implementation manner, in the process of training the target data processing model, the calculation of backward propagation can be realized based on the total parameter matrix corresponding to each target parameter.
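A training-step sketch under the same assumptions (hypothetical names; a standard autograd optimizer step, not the specific procedure of this disclosure): because W0 has requires_grad=False, backpropagation produces gradients only for A, B, and the non-target parameter matrices, and only those are updated.

```python
import torch

def train_step(model, optimizer, loss_fn, sample, label):
    """One update: gradients flow to A, B and non-target matrices only; W0 stays frozen."""
    prediction = model(sample)            # forward pass uses W0 + A @ B internally
    loss = loss_fn(prediction, label)     # value of the loss function
    optimizer.zero_grad()
    loss.backward()                       # produces the first, second and third gradients
    optimizer.step()                      # updates only parameters with requires_grad=True
    return loss.item()
```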
In one possible implementation, the method further includes: and storing the first gradient of the first parameter matrix corresponding to the target parameter, the second gradient of the second parameter matrix corresponding to the target parameter and the third gradient of the third parameter matrix corresponding to the non-target parameter in a video memory.
In the implementation manner, in the process of training the target data processing model, only the first gradient of the first parameter matrix corresponding to each target parameter, the second gradient of the second parameter matrix corresponding to each target parameter and the third gradient of the third parameter matrix corresponding to each non-target parameter can be stored in the video memory, and the gradients of the original parameter matrix corresponding to each target parameter are not required to be stored in the video memory, so that the video memory can be saved.
Compared with the related art, in which the gradient of the original parameter matrix corresponding to the target parameter needs to be stored in the video memory, this implementation only needs to store, in the video memory, the first gradient of the first parameter matrix corresponding to the target parameter and the second gradient of the second parameter matrix corresponding to the target parameter. For example, the number of parameters of the original parameter matrix corresponding to the target parameter is m×n, while the total number of parameters of the first parameter matrix and the second parameter matrix corresponding to the target parameter is (m+n)×r. Since r can be set to be at least one order of magnitude smaller than m and n, this implementation can save a large amount of video memory.
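For concreteness, with hypothetical shapes m = n = 4096 and r = 8 (values chosen for illustration only), the full-rank update would store 4096 × 4096 = 16,777,216 gradient entries per target parameter, whereas the low-rank pair stores only (4096 + 4096) × 8 = 65,536, roughly a 256× reduction:

```python
m, n, r = 4096, 4096, 8                   # hypothetical example shapes
full = m * n                              # gradient entries for the original parameter matrix
low_rank = (m + n) * r                    # gradient entries for the first + second matrices
print(full, low_rank, full / low_rank)    # 16777216 65536 256.0
```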
As an example of this implementation, the method further comprises: and storing first optimizer state information corresponding to a first parameter matrix corresponding to the target parameter, second optimizer state information corresponding to a second parameter matrix corresponding to the target parameter and third optimizer state information corresponding to a third parameter matrix corresponding to the non-target parameter in a video memory.
In this example, the first optimizer state information may represent state information of an optimizer corresponding to a first parameter matrix corresponding to the target parameter, the second optimizer state information may represent state information of an optimizer corresponding to a second parameter matrix corresponding to the target parameter, and the third optimizer state information may represent state information of an optimizer corresponding to a third parameter matrix corresponding to the non-target parameter.
Here, the optimizer state information may represent the data that the optimizer needs to use when performing gradient updates. For example, in the case of employing an SGD (Stochastic Gradient Descent) optimizer, the optimizer state information may include momentum; as another example, in the case of employing an Adam optimizer, the optimizer state information may include a first-order moment and a second-order moment; and so on.
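As a minimal sketch of this arrangement (assuming PyTorch; the parameter shapes and learning rates below are illustrative only), the optimizer can be constructed over the first, second and third parameter matrices alone, so that its state information is allocated only for those matrices:

```python
import torch

# Hypothetical trainable matrices for one target parameter and one non-target parameter.
A = torch.nn.Parameter(torch.randn(12288, 4) * 0.02)   # first parameter matrix (m x r)
B = torch.nn.Parameter(torch.randn(4, 4096) * 0.02)    # second parameter matrix (r x n)
W1 = torch.nn.Parameter(torch.zeros(4096))             # third parameter matrix (non-target parameter)

# Only A, B and W1 are handed to the optimizer, so the optimizer state information
# (momentum for SGD, first-/second-order moments for Adam) is created for them only.
optimizer = torch.optim.Adam([A, B, W1], lr=1e-4)
# optimizer = torch.optim.SGD([A, B, W1], lr=1e-2, momentum=0.9)  # SGD alternative
```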
In this example, in the process of training the target data processing model, only the first optimizer state information corresponding to the first parameter matrix corresponding to each target parameter, the second optimizer state information corresponding to the second parameter matrix corresponding to each target parameter, and the third optimizer state information corresponding to the third parameter matrix corresponding to each non-target parameter may be stored in the video memory, without storing the optimizer state information corresponding to the original parameter matrix corresponding to each target parameter in the video memory, so that the video memory can be saved.
Compared with the related art, in which the optimizer state information corresponding to the original parameter matrix corresponding to the target parameter needs to be stored in the video memory, this implementation only needs to store, in the video memory, the first optimizer state information corresponding to the first parameter matrix corresponding to the target parameter and the second optimizer state information corresponding to the second parameter matrix corresponding to the target parameter. For example, the number of parameters of the original parameter matrix corresponding to the target parameter is m×n, while the total number of parameters of the first parameter matrix and the second parameter matrix is (m+n)×r. Since r can be set to be at least one order of magnitude smaller than m and n, this implementation can save a large amount of video memory compared with the related art, which needs to store the optimizer state information corresponding to the original parameter matrix corresponding to the target parameter in the video memory.
In one possible implementation, the loss function includes a first loss function corresponding to a task that predicts a next word. In the implementation manner, the target data processing model is trained by adopting the first loss function corresponding to the task of predicting the next word, so that the accuracy of data processing of the target data processing model is improved.
In other possible implementations, a loss function corresponding to a task of predicting the next word, a loss function corresponding to a task of predicting the next token, or the like may also be employed, which is not limited herein.
In one possible implementation, the loss function includes a second loss function corresponding to the reinforcement learning task. In the implementation mode, the target data processing model is trained by adopting the second loss function corresponding to the reinforcement learning task, so that the semantic safety is improved.
In one possible implementation manner, after updating the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter, and the third parameter matrix corresponding to the non-target parameter in the target data processing model according to the value of the loss function, the method further includes: in response to the end of training of the target data processing model, calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter to obtain a second product corresponding to the target parameter; and determining the sum of the original parameter matrix corresponding to the target parameter and the second product corresponding to the target parameter as an updated parameter matrix corresponding to the target parameter.
In this implementation, for any target parameter, the second product corresponding to the target parameter may represent a product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter after training of the target data processing model is completed.
In this implementation, for any target parameter, an updated parameter matrix corresponding to the target parameter may be determined according to W0+BA. Wherein W0 may represent an original parameter matrix corresponding to the target parameter, A may represent a first parameter matrix corresponding to the target parameter, B may represent a second parameter matrix corresponding to the target parameter, and BA may represent a second product corresponding to the target parameter.
In this implementation manner, in the training process of the target data processing model, the original parameter matrix W0 corresponding to each target parameter does not need to be updated; after the training of the target data processing model is finished, the updated parameter matrix corresponding to each target parameter can be obtained simply by adding the corresponding second product to the original parameter matrix W0.
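A minimal sketch of this merge step after training ends is given below (assuming PyTorch and plain tensors; under the shapes used in this disclosure, A is m×r and B is r×n, so the product the text writes as BA is computed as A @ B):

```python
import torch

m, n, r = 12288, 4096, 4
W0 = torch.randn(m, n)           # original parameter matrix, kept frozen during training
A = torch.randn(m, r) * 0.02     # latest first parameter matrix after training
B = torch.randn(r, n) * 0.02     # latest second parameter matrix after training

second_product = A @ B              # the "second product", shape m x n
updated_W = W0 + second_product     # updated parameter matrix corresponding to the target parameter
```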
The training method of the large model provided by the embodiment of the disclosure can be applied to the technical fields of artificial intelligence and the like, and is not limited herein. The training method of the large model provided by the embodiment of the disclosure can be used for training a data processing model with a larger scale (i.e. a large model), and is not limited herein.
The training method of the large model provided by the embodiment of the disclosure is described below through a specific application scenario. In this application scenario, the video memory capacity of the graphics card may be 32 GB; the target data processing model may adopt the network structure of ChatGLM-6B, and the length of the input data may be 256; the target data processing model may be used in the field of medical question answering.
In this application scenario, query_key_value, dense, dense_h_to_4h and dense_4h_to_h in ChatGLM-6B can be determined as target parameters, and the other learnable parameters in ChatGLM-6B can be determined as non-target parameters. The size of the original parameter matrix corresponding to query_key_value is 12288×4096, the size of the original parameter matrix corresponding to dense is 4096×4096, the size of the original parameter matrix corresponding to dense_h_to_4h is 16384×4096, and the size of the original parameter matrix corresponding to dense_4h_to_h is 4096×16384. The original parameter matrices corresponding to these 4 target parameters are all relatively large.
The size of the original parameter matrix W0 corresponding to the target parameter may be m×n. The first parameter matrix A corresponding to the target parameter and the second parameter matrix B corresponding to the target parameter may be initialized, where the size of the first parameter matrix A corresponding to the target parameter may be m×r, and the size of the second parameter matrix B corresponding to the target parameter may be r×n. Wherein r << m, and r << n. The first parameter matrix A corresponding to the target parameter and the second parameter matrix B corresponding to the target parameter may be initialized with a Gaussian distribution.
Taking the target parameter query_key_value as an example, the size of the original parameter matrix corresponding to query_key_value is 12288×4096; a first parameter matrix A corresponding to query_key_value and a second parameter matrix B corresponding to query_key_value can be initialized, where the size of the first parameter matrix A corresponding to query_key_value may be 12288×r, and the size of the second parameter matrix B corresponding to query_key_value may be r×4096. Here, r is a hyperparameter that can be set manually.
In the process of training the target data processing model, W0+BA can be used as a total parameter matrix corresponding to any target parameter to participate in calculation of forward propagation and backward propagation. When the parameters of the target data processing model are updated, W0 may be fixed, and only A, B and W1 are updated, where W1 may represent a third parameter matrix corresponding to the non-target parameters. In addition, in the training process of the target data processing model, only the gradient and the optimizer state information of the first parameter matrix A, the second parameter matrix B and the third parameter matrix W1 need to be stored in the video memory, and the gradient and the optimizer state information of the original parameter matrix W0 corresponding to the target parameters do not need to be stored in the video memory.
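A minimal sketch of this forward computation for a single target parameter is shown below (assuming PyTorch; this only illustrates the W0+BA calculation and is not the actual ChatGLM-6B code):

```python
import torch

def forward_with_total_matrix(x, W0, A, B, bias=None):
    # W0: frozen original parameter matrix of shape (m, n)
    # A : first parameter matrix of shape (m, r), trainable
    # B : second parameter matrix of shape (r, n), trainable
    # x : input of shape (batch, n)
    W = W0 + A @ B          # latest total parameter matrix (the text writes this as W0+BA)
    y = x @ W.t()           # output of shape (batch, m)
    if bias is not None:
        y = y + bias        # a bias would be a non-target (third) parameter, updated directly
    return y

m, n, r = 12288, 4096, 4
W0 = torch.randn(m, n)                              # created without requires_grad, so it stays fixed
A = torch.nn.Parameter(torch.randn(m, r) * 0.02)    # Gaussian-initialized first parameter matrix
B = torch.nn.Parameter(torch.randn(r, n) * 0.02)    # Gaussian-initialized second parameter matrix
y = forward_with_total_matrix(torch.randn(2, n), W0, A, B)
```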
In addition, in the application scenario, the target data processing model may be trained using a first loss function corresponding to a task that predicts a next word.
ChatGLM-6B was trained in a single-graphics-card environment with r=4; the number of trained parameters was reduced from more than 6 billion to approximately 7.34 million. Tests show that, by adopting the training method of the large model provided by the embodiment of the disclosure, a 6B-scale data processing model can be trained on a graphics card with 32 GB of video memory. Moreover, the training effect obtained by the training method of the large model provided by the embodiment of the disclosure is better than that of a BERT model with 300 million parameters.
The embodiment of the disclosure also provides a data processing method, which comprises the following steps: acquiring a target data processing model obtained by training the training method of the large model; inputting the data to be processed into the target data processing model, and outputting a data processing result corresponding to the data to be processed through the target data processing model.
In one possible implementation, the data to be processed is text to be processed.
The embodiment of the disclosure also provides a training method of the text processing model. In one possible implementation manner, the execution subject of the training method of the text processing model may be a training apparatus of the text processing model, for example, the training method of the text processing model may be executed by a terminal device or a server or other electronic devices. The terminal device may be a user device, a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant, a handheld device, a computing device, an in-vehicle device, a wearable device, or the like. In some possible implementations, the training method of the text processing model may be implemented by a processor invoking computer readable instructions stored in a memory.
The training method of the text processing model comprises the following steps: determining target parameters in a target text processing model; for any target parameter, initializing a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter according to the number of rows and the number of columns of the original parameter matrix corresponding to the target parameter, wherein the number of rows of the first parameter matrix corresponding to the target parameter is equal to the number of rows of the original parameter matrix corresponding to the target parameter, the number of columns of the first parameter matrix corresponding to the target parameter is smaller than the number of columns of the original parameter matrix corresponding to the target parameter, the number of columns of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the original parameter matrix corresponding to the target parameter, and the number of rows of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the first parameter matrix corresponding to the target parameter; inputting training texts into the target text processing model, and outputting prediction results corresponding to the training texts through the target text processing model; determining a value of a loss function corresponding to the target text processing model according to a prediction result corresponding to the training text and a label corresponding to the training text; and updating a first parameter matrix corresponding to the target parameter, a second parameter matrix corresponding to the target parameter and a third parameter matrix corresponding to a non-target parameter in the target text processing model according to the value of the loss function.
In the embodiment of the disclosure, the target text processing model may adopt a network structure such as ChatGLM-6B, LLaMA-13B, BLOOM, and the like, which is not limited herein. The text processing model may also be referred to as a language model or the like, and is not limited herein.
The target text processing model may be used in applications such as dialog systems (e.g., medical questions and answers), text generation, language correction, text classification, information retrieval and search engines, speech recognition, speech synthesis, automatic summarization, machine translation, intelligent text analysis, etc., without limitation.
In the disclosed embodiments, the learnable parameters in the target text processing model may be divided into target parameters and non-target parameters. Wherein the number of target parameters may be at least one. For example, a plurality of target parameters in a target text processing model may be determined. The non-target parameters may represent parameters other than the target parameters among the learnable parameters of the target text processing model. The number of non-target parameters may be at least one. The greater the number of target parameters, the higher the training efficiency of the target text processing model; the greater the number of non-target parameters, the better the training effect of the target text processing model.
In one possible implementation manner, the determining the target parameters in the target text processing model includes: and for any parameter in the target text processing model, determining the parameter as a target parameter in response to the original parameter matrix corresponding to the parameter meeting a preset condition.
In this implementation, the preset condition may represent a preset condition for determining the target parameter. In the implementation manner, for any parameter in the target text processing model, if an original parameter matrix corresponding to the parameter meets a preset condition, the parameter can be determined as a target parameter; if the original parameter matrix corresponding to the parameter does not meet the preset condition, the parameter can be determined to be a non-target parameter.
In the implementation manner, for any parameter in the target text processing model, the parameter is determined to be the target parameter in response to the original parameter matrix corresponding to the parameter meeting the preset condition, so that the division between the target parameter and the non-target parameter can be realized efficiently based on the preset condition.
As an example of this implementation, the preset condition includes at least one of: the product of the number of rows and the number of columns is larger than or equal to a first preset value; the number of lines is larger than or equal to a second preset value; the number of columns is greater than or equal to a third preset value.
In one example, the preset condition may be: the product of the number of rows and the number of columns is greater than or equal to a first preset value. In this example, for any parameter in the target text processing model, if the product of the number of rows and the number of columns of the original parameter matrix corresponding to the parameter is greater than or equal to a first preset value, the parameter may be determined as the target parameter; if the product of the number of rows and the number of columns of the original parameter matrix corresponding to the parameter is smaller than a first preset value, the parameter can be determined to be a non-target parameter.
In another example, the preset condition may be: the number of lines is greater than or equal to a second preset value. In this example, for any parameter in the target text processing model, if the number of rows of the original parameter matrix corresponding to the parameter is greater than or equal to a second preset value, the parameter may be determined as the target parameter; if the number of lines of the original parameter matrix corresponding to the parameters is smaller than a second preset value, the parameters can be determined to be non-target parameters.
In another example, the preset condition may be: the number of columns is greater than or equal to a third preset value. In this example, for any parameter in the target text processing model, if the number of columns of the original parameter matrix corresponding to the parameter is greater than or equal to a third preset value, the parameter may be determined as the target parameter; if the number of columns of the original parameter matrix corresponding to the parameters is smaller than a third preset value, the parameters can be determined to be non-target parameters.
By adopting this example, parameters whose original parameter matrices have a larger number of rows and/or columns can be determined as target parameters, which helps reduce video memory overhead.
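A minimal sketch of this selection rule is given below (assuming a PyTorch-style model whose learnable parameters can be inspected by name; the first preset value used here is hypothetical):

```python
import torch

def split_parameters(model, first_preset_value=1_000_000):
    """Split learnable parameters into target / non-target parameters using the preset
    condition 'product of the number of rows and the number of columns >= first preset value'."""
    target_names, non_target_names = [], []
    for name, p in model.named_parameters():
        if p.dim() == 2 and p.shape[0] * p.shape[1] >= first_preset_value:
            target_names.append(name)       # trained via first / second parameter matrices
        else:
            non_target_names.append(name)   # third parameter matrix, trained directly
    return target_names, non_target_names
```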
In one example, the method further comprises: acquiring a capacity value of a video memory; and determining at least one of the first preset value, the second preset value and the third preset value according to the capacity value.
In one example, the preset condition may be: the product of the number of rows and the number of columns is greater than or equal to a first preset value. In this example, a capacity value of the video memory may be obtained in advance, and the first preset value may be determined according to the capacity value. The first preset value may be positively correlated with a capacity value of the video memory. That is, the larger the capacity value of the video memory, the larger the first preset value may be; the smaller the capacity value of the video memory, the smaller the first preset value may be.
In another example, the preset condition may be: the number of lines is greater than or equal to a second preset value. In this example, a capacity value of the video memory may be obtained in advance, and the second preset value may be determined according to the capacity value. The second preset value may be positively correlated with the capacity value of the video memory. That is, the larger the capacity value of the video memory, the larger the second preset value may be; the smaller the capacity value of the video memory, the smaller the second preset value may be.
In another example, the preset condition may be: the number of columns is greater than or equal to a third preset value. In this example, a capacity value of the video memory may be obtained in advance, and the third preset value may be determined according to the capacity value. The third preset value may be positively correlated with the capacity value of the video memory. That is, the larger the capacity value of the video memory, the larger the third preset value may be; the smaller the capacity value of the video memory, the smaller the third preset value may be.
In this example, by acquiring the capacity value of the video memory and determining at least one of the first preset value, the second preset value and the third preset value according to the capacity value, the video memory resource can be fully utilized on the premise of meeting the video memory requirement of each parameter in the target text processing model, and the training effect of the target text processing model can be improved.
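As an illustration only (the mapping below is a made-up heuristic, not a formula from this disclosure), the first preset value could be chosen to grow with the available video memory capacity:

```python
def first_preset_value_from_capacity(video_memory_gib: float, scale: int = 30_000) -> int:
    # Hypothetical positively-correlated mapping: more video memory -> a larger first preset
    # value, so that fewer parameters need to be treated as target parameters.
    return int(video_memory_gib * scale)

# e.g. 32 GiB of video memory -> a first preset value of 960000 under this assumed scale
```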
In one example, at least one of the first preset value, the second preset value, and the third preset value may be determined in conjunction with at least one of a length of input text of a target text processing model, a parameter number of the target text processing model, and the like.
In another possible implementation, the target parameters and non-target parameters in the target text processing model may be determined according to a user's specification.
In the embodiment of the present disclosure, after determining the target parameters in the target text processing model, for any target parameter, the first parameter matrix corresponding to the target parameter and the second parameter matrix corresponding to the target parameter may be initialized according to the number of rows and columns of the original parameter matrix corresponding to the target parameter. The number of rows of the first parameter matrix corresponding to the target parameter is equal to the number of rows of the original parameter matrix corresponding to the target parameter, the number of columns of the first parameter matrix corresponding to the target parameter is smaller than the number of columns of the original parameter matrix corresponding to the target parameter, the number of columns of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the original parameter matrix corresponding to the target parameter, and the number of rows of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the first parameter matrix corresponding to the target parameter.
For example, the size of the original parameter matrix corresponding to a certain target parameter is m×n, that is, the number of rows of the original parameter matrix corresponding to the target parameter is m, and the number of columns is n. Then, the first parameter matrix corresponding to the target parameter may have a size of m×r, that is, the number of rows of the first parameter matrix corresponding to the target parameter may be m, and the number of columns may be r. The size of the second parameter matrix corresponding to the target parameter may be r×n, that is, the number of rows of the second parameter matrix corresponding to the target parameter may be r, and the number of columns may be n. Wherein r < m, and r < n.
In the embodiments of the present disclosure, the first parameter matrix and the second parameter matrix may be initialized for different target parameters, respectively.
In a possible implementation manner, the number of columns of the first parameter matrix corresponding to the target parameter is at least one order of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter.
For example, the number of columns of the first parameter matrix corresponding to the target parameter may be an order of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter; for another example, the number of columns of the first parameter matrix corresponding to the target parameter may be two orders of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter; for another example, the number of columns of the first parameter matrix corresponding to the target parameter may be three orders of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter; etc.
For example, the size of an original parameter matrix corresponding to a certain target parameter is m×n, the size of a first parameter matrix corresponding to the target parameter is m×r, and the size of a second parameter matrix corresponding to the target parameter is r×n, where r << m and r << n. For example, r may be 8, 16, 4, 64, 2, etc., without limitation.
In this implementation manner, for any target parameter, by setting the number of columns of the first parameter matrix corresponding to the target parameter and the number of rows of the second parameter matrix corresponding to the target parameter to be at least one order of magnitude smaller than the number of rows and the number of columns of the original parameter matrix corresponding to the target parameter, it is helpful to save a large amount of video memory resources.
In one possible implementation manner, for any target parameter, a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter may be initialized by using gaussian distribution, so as to increase the generalization capability of the target text processing model.
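A minimal sketch of this Gaussian initialization is shown below (assuming PyTorch; the standard deviation 0.02 is an assumed value, not specified in this disclosure):

```python
import torch

m, n, r = 12288, 4096, 8   # r is one of the example values mentioned above

A = torch.nn.Parameter(torch.empty(m, r))   # first parameter matrix
B = torch.nn.Parameter(torch.empty(r, n))   # second parameter matrix
torch.nn.init.normal_(A, mean=0.0, std=0.02)
torch.nn.init.normal_(B, mean=0.0, std=0.02)
```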
In embodiments of the present disclosure, a target text processing model may be trained using a training text set, where the training text set may include a plurality of training texts. For any training text, the training text can be input into a target text processing model, the feature vector corresponding to the training text is extracted through the target text processing model, and the prediction result corresponding to the training text is obtained based on the feature vector corresponding to the training text.
In the training process of the target text processing model, W0+BA can be used as a total parameter matrix corresponding to any target parameter to participate in calculation of forward propagation and backward propagation. Wherein W0 may represent an original parameter matrix corresponding to the target parameter, a may represent a first parameter matrix corresponding to the target parameter, and B may represent a second parameter matrix corresponding to the target parameter. When the parameters of the target text processing model are updated, W0 may be fixed, and only A, B and W1 are updated, where W1 may represent a third parameter matrix corresponding to the non-target parameters.
In one possible implementation manner, the original parameter matrix corresponding to the target parameter is kept fixed during the training process of the target text processing model. In this implementation, during the training of the target text processing model, the original parameter matrix corresponding to the target parameter remains fixed, i.e., the original parameter matrix corresponding to the target parameter remains frozen and is not further adjusted. By keeping the original parameter matrix corresponding to each target parameter fixed during the training of the target text processing model, a large amount of video memory resources can be saved.
In one possible implementation manner, the inputting training text into the target text processing model and outputting, by the target text processing model, a prediction result corresponding to the training text includes: calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter to obtain a first product corresponding to the target parameter; determining the sum of the original parameter matrix corresponding to the target parameter and the first product corresponding to the target parameter as the latest total parameter matrix corresponding to the target parameter; and inputting the training text into the target text processing model, and obtaining a prediction result corresponding to the training text based on the latest total parameter matrix corresponding to the target parameter and the latest third parameter matrix corresponding to the non-target parameter.
In this implementation, for any target parameter, the first product corresponding to the target parameter may represent the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter during training of the target text processing model.
For example, for any target parameter, the latest total parameter matrix corresponding to the target parameter may be determined according to W0+BA. Wherein W0 may represent an original parameter matrix corresponding to the target parameter, A may represent a first parameter matrix corresponding to the target parameter, B may represent a second parameter matrix corresponding to the target parameter, and BA may represent a first product corresponding to the target parameter.
In this implementation manner, the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter is calculated to obtain the first product corresponding to the target parameter, the sum of the original parameter matrix corresponding to the target parameter and the first product corresponding to the target parameter is determined as the latest total parameter matrix corresponding to the target parameter, the training text is input into the target text processing model, and the prediction result corresponding to the training text is obtained based on the latest total parameter matrix corresponding to the target parameter and the latest third parameter matrix corresponding to the non-target parameter, so that forward propagation can be computed based on the latest total parameter matrix corresponding to each target parameter in the training process of the target text processing model.
As an example of this implementation manner, the updating, according to the value of the loss function, the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter, and the third parameter matrix corresponding to the non-target parameter in the target text processing model includes: determining a first gradient of a first parameter matrix corresponding to the target parameter, a second gradient of a second parameter matrix corresponding to the target parameter and a third gradient of a third parameter matrix corresponding to the non-target parameter according to the value of the loss function; updating the first parameter matrix corresponding to the target parameter according to the first gradient of the first parameter matrix corresponding to the target parameter; updating a second parameter matrix corresponding to the target parameter according to a second gradient of the second parameter matrix corresponding to the target parameter; and updating the third parameter matrix corresponding to the non-target parameter according to the third gradient of the third parameter matrix corresponding to the non-target parameter.
In this implementation, the first gradient may represent a gradient of a first parameter matrix corresponding to the target parameter, the second gradient may represent a gradient of a second parameter matrix corresponding to the target parameter, and the third gradient may represent a gradient of a third parameter matrix corresponding to the non-target parameter.
In this implementation manner, for any target parameter, a gradient of a latest total parameter matrix corresponding to the target parameter may be determined according to the value of the loss function, and a first gradient of a first parameter matrix corresponding to the target parameter and a second gradient of a second parameter matrix corresponding to the target parameter may be determined according to the gradient of the latest total parameter matrix corresponding to the target parameter. In the implementation manner, in the training process of the target text processing model, the backward propagation calculation can be realized based on the total parameter matrix corresponding to each target parameter.
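As a sketch of why the gradient of the latest total parameter matrix is sufficient: with W = W0 + AB (A of size m×r, B of size r×n; this is the product the text writes as BA), the chain rule gives the first gradient as (gradient of W)·Bᵀ and the second gradient as Aᵀ·(gradient of W). A minimal numerical check, assuming PyTorch:

```python
import torch

m, n, r = 6, 5, 2
W0 = torch.randn(m, n)                       # frozen original parameter matrix
A = torch.randn(m, r, requires_grad=True)    # first parameter matrix
B = torch.randn(r, n, requires_grad=True)    # second parameter matrix

W = W0 + A @ B        # latest total parameter matrix
loss = W.sum()        # stand-in for the real loss function
loss.backward()

grad_W = torch.ones(m, n)                       # gradient of the total parameter matrix for this toy loss
assert torch.allclose(A.grad, grad_W @ B.t())   # first gradient derived from the total-matrix gradient
assert torch.allclose(B.grad, A.t() @ grad_W)   # second gradient derived from the total-matrix gradient
```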
In one possible implementation, the method further includes: and storing the first gradient of the first parameter matrix corresponding to the target parameter, the second gradient of the second parameter matrix corresponding to the target parameter and the third gradient of the third parameter matrix corresponding to the non-target parameter in a video memory.
In the implementation manner, in the training process of the target text processing model, only the first gradient of the first parameter matrix corresponding to each target parameter, the second gradient of the second parameter matrix corresponding to each target parameter and the third gradient of the third parameter matrix corresponding to each non-target parameter can be stored in the video memory, and the gradients of the original parameter matrix corresponding to each target parameter are not required to be stored in the video memory, so that the video memory can be saved.
Compared with the related art, in which the gradient of the original parameter matrix corresponding to the target parameter needs to be stored in the video memory, this implementation only needs to store, in the video memory, the first gradient of the first parameter matrix corresponding to the target parameter and the second gradient of the second parameter matrix corresponding to the target parameter. For example, the number of parameters of the original parameter matrix corresponding to the target parameter is m×n, while the total number of parameters of the first parameter matrix and the second parameter matrix corresponding to the target parameter is (m+n)×r. Since r can be set to be at least one order of magnitude smaller than m and n, this implementation can save a large amount of video memory compared with the related art, which needs to store the gradient of the original parameter matrix corresponding to the target parameter in the video memory.
As an example of this implementation, the method further comprises: and storing first optimizer state information corresponding to a first parameter matrix corresponding to the target parameter, second optimizer state information corresponding to a second parameter matrix corresponding to the target parameter and third optimizer state information corresponding to a third parameter matrix corresponding to the non-target parameter in a video memory.
In this example, the first optimizer state information may represent state information of an optimizer corresponding to a first parameter matrix corresponding to the target parameter, the second optimizer state information may represent state information of an optimizer corresponding to a second parameter matrix corresponding to the target parameter, and the third optimizer state information may represent state information of an optimizer corresponding to a third parameter matrix corresponding to the non-target parameter.
Here, the optimizer state information may represent the data that the optimizer needs to use when performing gradient updates. For example, in the case of employing an SGD (Stochastic Gradient Descent) optimizer, the optimizer state information may include momentum; as another example, in the case of employing an Adam optimizer, the optimizer state information may include a first-order moment and a second-order moment; and so on.
In this example, in the process of training the target text processing model, only the first optimizer state information corresponding to the first parameter matrix corresponding to each target parameter, the second optimizer state information corresponding to the second parameter matrix corresponding to each target parameter, and the third optimizer state information corresponding to the third parameter matrix corresponding to each non-target parameter may be stored in the video memory, without storing the optimizer state information corresponding to the original parameter matrix corresponding to each target parameter in the video memory, so that the video memory can be saved.
Compared with the related art, in which the optimizer state information corresponding to the original parameter matrix corresponding to the target parameter needs to be stored in the video memory, this implementation only needs to store, in the video memory, the first optimizer state information corresponding to the first parameter matrix corresponding to the target parameter and the second optimizer state information corresponding to the second parameter matrix corresponding to the target parameter. For example, the number of parameters of the original parameter matrix corresponding to the target parameter is m×n, while the total number of parameters of the first parameter matrix and the second parameter matrix is (m+n)×r. Since r can be set to be at least one order of magnitude smaller than m and n, this implementation can save a large amount of video memory compared with the related art, which needs to store the optimizer state information corresponding to the original parameter matrix corresponding to the target parameter in the video memory.
In one possible implementation, the loss function includes a first loss function corresponding to a task that predicts a next word. In the implementation manner, the target text processing model is trained by adopting the first loss function corresponding to the task of predicting the next word, so that the accuracy of text processing of the target text processing model is improved.
In other possible implementations, a loss function corresponding to a task of predicting the next word, a loss function corresponding to a task of predicting the next token, or the like may also be employed, which is not limited herein.
In one possible implementation, the loss function includes a second loss function corresponding to the reinforcement learning task. In the implementation mode, the target text processing model is trained by adopting the second loss function corresponding to the reinforcement learning task, so that the semantic safety is improved.
In one possible implementation manner, after updating the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter, and the third parameter matrix corresponding to the non-target parameter in the target text processing model according to the value of the loss function, the method further includes: in response to the end of training of the target text processing model, calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter to obtain a second product corresponding to the target parameter; and determining the sum of the original parameter matrix corresponding to the target parameter and the second product corresponding to the target parameter as an updated parameter matrix corresponding to the target parameter.
In this implementation, for any target parameter, the second product corresponding to the target parameter may represent a product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter after training of the target text processing model is completed.
In this implementation, for any target parameter, an updated parameter matrix corresponding to the target parameter may be determined according to W0+BA. Wherein W0 may represent an original parameter matrix corresponding to the target parameter, A may represent a first parameter matrix corresponding to the target parameter, B may represent a second parameter matrix corresponding to the target parameter, and BA may represent a second product corresponding to the target parameter.
In the implementation manner, in the training process of the target text processing model, the original parameter matrix W0 corresponding to each target parameter does not need to be updated, and the updated parameter matrix corresponding to each target parameter can be obtained only by adding the original parameter matrix W0 to the second product after the training of the target text processing model is finished.
The training method of the text processing model provided by the embodiment of the disclosure can be applied to the technical fields of artificial intelligence and the like, and is not limited herein. The training method of the text processing model provided by the embodiment of the disclosure can be used for training a text processing model with a larger scale (i.e. a large model), and is not limited herein.
The training method of the text processing model provided by the embodiment of the disclosure is described below through a specific application scenario. In this application scenario, the video memory capacity of the graphics card may be 32 GB; the target text processing model may adopt the network structure of ChatGLM-6B, and the length of the input text may be 256; the target text processing model may be used in the field of medical question answering.
In this application scenario, query_key_value, dense, dense_h_to_4h and dense_4h_to_h in ChatGLM-6B can be determined as target parameters, and the other learnable parameters in ChatGLM-6B can be determined as non-target parameters. The size of the original parameter matrix corresponding to query_key_value is 12288×4096, the size of the original parameter matrix corresponding to dense is 4096×4096, the size of the original parameter matrix corresponding to dense_h_to_4h is 16384×4096, and the size of the original parameter matrix corresponding to dense_4h_to_h is 4096×16384. The original parameter matrices corresponding to these 4 target parameters are all relatively large.
The size of the original parameter matrix W0 corresponding to the target parameter may be m×n. The first parameter matrix A corresponding to the target parameter and the second parameter matrix B corresponding to the target parameter may be initialized, where the size of the first parameter matrix A corresponding to the target parameter may be m×r, and the size of the second parameter matrix B corresponding to the target parameter may be r×n. Wherein r << m, and r << n. The first parameter matrix A corresponding to the target parameter and the second parameter matrix B corresponding to the target parameter may be initialized with a Gaussian distribution.
Taking the target parameter query_key_value as an example, the size of the original parameter matrix corresponding to query_key_value is 12288×4096; a first parameter matrix A corresponding to query_key_value and a second parameter matrix B corresponding to query_key_value can be initialized, where the size of the first parameter matrix A corresponding to query_key_value may be 12288×r, and the size of the second parameter matrix B corresponding to query_key_value may be r×4096. Here, r is a hyperparameter that can be set manually.
In the training process of the target text processing model, W0+BA can be used as the total parameter matrix corresponding to any target parameter to participate in the calculation of forward propagation and backward propagation. When the parameters of the target text processing model are updated, W0 may be fixed, and only A, B and W1 are updated, where W1 may represent a third parameter matrix corresponding to the non-target parameters. In addition, in the training process of the target text processing model, only the gradients and the optimizer state information of the first parameter matrix A, the second parameter matrix B and the third parameter matrix W1 need to be stored in the video memory, and the gradient and the optimizer state information of the original parameter matrix W0 corresponding to the target parameters do not need to be stored in the video memory.
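A minimal sketch of this training setup for a single linear sub-layer is shown below (assuming PyTorch; smaller stand-in sizes are used so the sketch runs quickly, the real query_key_value shape being 12288×4096 with r=4, and the inputs and labels are placeholders rather than the medical question-and-answer corpus):

```python
import torch
import torch.nn.functional as F

m, n, r = 512, 256, 4   # stand-in sizes; the application scenario would use 12288 x 4096 and r = 4

W0 = torch.randn(m, n) * 0.02                      # original parameter matrix, frozen (no gradient)
A = torch.nn.Parameter(torch.randn(m, r) * 0.02)   # first parameter matrix
B = torch.nn.Parameter(torch.randn(r, n) * 0.02)   # second parameter matrix
W1 = torch.nn.Parameter(torch.zeros(m))            # third parameter matrix (a non-target parameter)

# Gradients and optimizer state information exist only for A, B and W1, not for W0.
optimizer = torch.optim.Adam([A, B, W1], lr=1e-4)

for step in range(10):                   # placeholder training loop
    x = torch.randn(4, n)                # stand-in for hidden states of the training text
    target = torch.randn(4, m)           # stand-in for the supervision signal
    W = W0 + A @ B                       # total parameter matrix used for forward propagation
    pred = x @ W.t() + W1
    loss = F.mse_loss(pred, target)      # stand-in for the actual loss function
    optimizer.zero_grad()
    loss.backward()                      # backward propagation through W0 + BA
    optimizer.step()                     # updates A, B and W1 only; W0 stays fixed

updated_W = W0 + A @ B                   # updated parameter matrix merged once after training
```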
In addition, in the application scenario, the target text processing model may be trained using a first loss function corresponding to a task that predicts a next word.
ChatGLM-6B was trained in a single-graphics-card environment with r=4; the number of trained parameters was reduced from more than 6 billion to approximately 7.34 million. Tests show that, by adopting the training method of the text processing model provided by the embodiment of the disclosure, a 6B-scale text processing model can be trained on a graphics card with 32 GB of video memory. Moreover, the training effect obtained by the training method of the text processing model provided by the embodiment of the disclosure is better than that of a BERT model with 300 million parameters.
The embodiment of the disclosure also provides a text processing method, which comprises the following steps: acquiring a target text processing model obtained by training the training method of the text processing model; inputting the text to be processed into the target text processing model, and outputting a text processing result corresponding to the text to be processed through the target text processing model.
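A minimal sketch of the text processing step is given below (assuming the trained target text processing model and its tokenizer expose a Hugging Face-style generate()/decode() interface; this interface is an assumption for illustration, not part of this disclosure):

```python
import torch

def process_text(model, tokenizer, text_to_process, max_new_tokens=128):
    # Feed the text to be processed into the trained target text processing model
    # and return the corresponding text processing result.
    inputs = tokenizer(text_to_process, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```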
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from the principles and logic; for the sake of brevity, such combinations are not described in detail in the present disclosure. It will be appreciated by those skilled in the art that, in the above-described methods of the embodiments, the specific order of execution of the steps should be determined by their functions and possible internal logic.
In addition, the disclosure further provides a training apparatus, a data processing apparatus, an electronic device, a computer-readable storage medium, and a computer program product, all of which can be used to implement any of the training methods for a large model or the data processing methods provided by the disclosure; for the corresponding technical solutions and technical effects, reference may be made to the corresponding descriptions in the method section, which are not repeated here.
FIG. 2 illustrates a block diagram of a large model training apparatus provided by an embodiment of the present disclosure. As shown in fig. 2, the training device of the large model includes:
a first determining module 21 for determining target parameters in a target data processing model;
the initialization module 22 is configured to initialize, for any of the target parameters, a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter according to a number of rows and a number of columns of an original parameter matrix corresponding to the target parameter, where the number of rows of the first parameter matrix corresponding to the target parameter is equal to the number of rows of the original parameter matrix corresponding to the target parameter, the number of columns of the first parameter matrix corresponding to the target parameter is smaller than the number of columns of the original parameter matrix corresponding to the target parameter, the number of columns of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the original parameter matrix corresponding to the target parameter, and the number of rows of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the first parameter matrix corresponding to the target parameter;
The prediction module 23 is configured to input a training sample into the target data processing model, and output a prediction result corresponding to the training sample through the target data processing model;
a second determining module 24, configured to determine a value of a loss function corresponding to the target data processing model according to a prediction result corresponding to the training sample and a label corresponding to the training sample;
and the updating module 25 is configured to update the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter, and the third parameter matrix corresponding to the non-target parameter in the target data processing model according to the value of the loss function.
In one possible implementation, the first determining module 21 is configured to:
and for any parameter in the target data processing model, determining the parameter as a target parameter in response to the original parameter matrix corresponding to the parameter meeting a preset condition.
In one possible implementation, the preset condition includes at least one of:
the product of the number of rows and the number of columns is larger than or equal to a first preset value;
the number of lines is larger than or equal to a second preset value;
the number of columns is greater than or equal to a third preset value.
In one possible implementation, the apparatus further includes:
the second acquisition module is used for acquiring the capacity value of the video memory;
and a third determining module, configured to determine at least one of the first preset value, the second preset value and the third preset value according to the capacity value.
In a possible implementation manner, the number of columns of the first parameter matrix corresponding to the target parameter is at least one order of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter.
In one possible implementation manner, the original parameter matrix corresponding to the target parameter is kept fixed during the training process of the target data processing model.
In one possible implementation, the prediction module is configured to:
calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter to obtain a first product corresponding to the target parameter;
determining the sum of the original parameter matrix corresponding to the target parameter and the first product corresponding to the target parameter as the latest total parameter matrix corresponding to the target parameter;
inputting the training sample into the target data processing model, and obtaining a prediction result corresponding to the training sample based on the latest total parameter matrix corresponding to the target parameter and the latest third parameter matrix corresponding to the non-target parameter.
In one possible implementation, the updating module is configured to:
determining a first gradient of a first parameter matrix corresponding to the target parameter, a second gradient of a second parameter matrix corresponding to the target parameter and a third gradient of a third parameter matrix corresponding to the non-target parameter according to the value of the loss function;
updating the first parameter matrix corresponding to the target parameter according to the first gradient of the first parameter matrix corresponding to the target parameter;
updating a second parameter matrix corresponding to the target parameter according to a second gradient of the second parameter matrix corresponding to the target parameter;
and updating the third parameter matrix corresponding to the non-target parameter according to the third gradient of the third parameter matrix corresponding to the non-target parameter.
In one possible implementation, the apparatus further includes:
the first storage module is used for storing the first gradient of the first parameter matrix corresponding to the target parameter, the second gradient of the second parameter matrix corresponding to the target parameter and the third gradient of the third parameter matrix corresponding to the non-target parameter in the video memory.
In one possible implementation, the apparatus further includes:
The second storage module is used for storing, in a video memory, first optimizer state information corresponding to a first parameter matrix corresponding to the target parameter, second optimizer state information corresponding to a second parameter matrix corresponding to the target parameter and third optimizer state information corresponding to a third parameter matrix corresponding to the non-target parameter.
In one possible implementation, the apparatus further includes:
the calculation module is used for responding to the end of training of the target data processing model, calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter, and obtaining the second product corresponding to the target parameter;
and a fourth determining module, configured to determine the sum of the original parameter matrix corresponding to the target parameter and the second product corresponding to the target parameter as an updated parameter matrix corresponding to the target parameter.
In one possible implementation, the loss function includes a first loss function corresponding to a task that predicts a next word.
In one possible implementation, the loss function includes a second loss function corresponding to the reinforcement learning task.
According to an aspect of the present disclosure, there is provided a data processing apparatus including:
the first acquisition module is used for acquiring a target data processing model obtained by training of the training device of the large model;
the data processing module is used for inputting the data to be processed into the target data processing model, and outputting a data processing result corresponding to the data to be processed through the target data processing model.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementation and technical effects of the functions or modules may refer to the descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. Wherein the computer readable storage medium may be a non-volatile computer readable storage medium or may be a volatile computer readable storage medium.
The disclosed embodiments also propose a computer program comprising computer readable code which, when run in an electronic device, causes a processor in the electronic device to carry out the above method.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in an electronic device, causes a processor in the electronic device to perform the above method.
The embodiment of the disclosure also provides an electronic device, including: one or more processors; a memory for storing executable instructions; wherein the one or more processors are configured to invoke the executable instructions stored by the memory to perform the above-described method.
The electronic device may be provided as a terminal, server or other form of device.
Fig. 3 shows a block diagram of an electronic device 1900 provided by an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server or a terminal device. Referring to FIG. 3, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical-user-interface-based operating system developed by Apple Inc. (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, and the electronic circuitry can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
The foregoing description of the embodiments focuses on the differences between the embodiments; for parts that are the same as or similar to one another, reference may be made between the embodiments, and the details are not repeated here for brevity.
If the technical solutions of the embodiments of the present disclosure involve personal information, a product applying these technical solutions clearly informs the individual of the personal information processing rules and obtains the individual's separate consent before processing the personal information. If the technical solutions involve sensitive personal information, a product applying them obtains the individual's separate consent and additionally satisfies the requirement of "explicit consent" before processing the sensitive personal information. For example, a clear and conspicuous sign may be placed at a personal information collection device such as a camera to inform individuals that they are entering a personal information collection range and that personal information will be collected; an individual who voluntarily enters the collection range is regarded as consenting to the collection of his or her personal information. Alternatively, on a device that processes personal information, with conspicuous signs or information indicating the personal information processing rules, personal authorization may be obtained through a pop-up message, by asking the individual to upload his or her personal information, or the like. The personal information processing rules may include information such as the personal information processor, the purpose of the personal information processing, the processing manner, and the types of personal information to be processed.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies in the marketplace, and to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (20)

1. A method of training a large model, comprising:
determining target parameters in a target data processing model;
for any target parameter, initializing a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter according to the number of rows and the number of columns of the original parameter matrix corresponding to the target parameter, wherein the number of rows of the first parameter matrix corresponding to the target parameter is equal to the number of rows of the original parameter matrix corresponding to the target parameter, the number of columns of the first parameter matrix corresponding to the target parameter is smaller than the number of columns of the original parameter matrix corresponding to the target parameter, the number of columns of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the original parameter matrix corresponding to the target parameter, and the number of rows of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the first parameter matrix corresponding to the target parameter;
inputting a training sample into the target data processing model, and outputting a prediction result corresponding to the training sample through the target data processing model;
determining a value of a loss function corresponding to the target data processing model according to a prediction result corresponding to the training sample and a label corresponding to the training sample;
and updating a first parameter matrix corresponding to the target parameter, a second parameter matrix corresponding to the target parameter and a third parameter matrix corresponding to a non-target parameter in the target data processing model according to the value of the loss function.
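By way of illustration only, the following is a minimal sketch of the matrix initialization of claim 1, assuming PyTorch; the rank value r, the initialization scheme, and the function name are assumptions that the claim leaves open.

```python
import torch

def init_low_rank_pair(W: torch.Tensor, r: int):
    """For an original parameter matrix W with n rows and m columns, build a first
    parameter matrix A (n rows, r columns, r < m) and a second parameter matrix B
    (r rows, m columns), so that A @ B has the same shape as W."""
    n, m = W.shape
    assert r < m, "the first matrix must have fewer columns than the original matrix"
    A = (torch.randn(n, r) * 0.01).requires_grad_(True)  # first parameter matrix (assumed init)
    B = torch.zeros(r, m, requires_grad=True)            # second parameter matrix (assumed init)
    return A, B

# Example: a 4096 x 4096 original matrix paired with rank-8 factor matrices
W = torch.randn(4096, 4096)       # original parameter matrix, kept fixed during training
A, B = init_low_rank_pair(W, r=8)
```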
2. The method of claim 1, wherein determining target parameters in a target data processing model comprises:
and for any parameter in the target data processing model, determining the parameter as a target parameter in response to the original parameter matrix corresponding to the parameter meeting a preset condition.
3. The method of claim 2, wherein the preset conditions include at least one of:
the product of the number of rows and the number of columns is greater than or equal to a first preset value;
the number of rows is greater than or equal to a second preset value;
the number of columns is greater than or equal to a third preset value.
4. A method according to claim 3, characterized in that the method further comprises:
acquiring a capacity value of a video memory;
and determining at least one of the first preset value, the second preset value and the third preset value according to the capacity value.
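By way of illustration only, the following is a sketch of the target parameter selection of claims 2 to 4, assuming PyTorch; the concrete threshold values and the memory-based heuristic are assumptions, as the claims leave them open.

```python
import torch

def select_target_parameters(model: torch.nn.Module,
                             first_preset: int,    # threshold on rows * columns
                             second_preset: int,   # threshold on the number of rows
                             third_preset: int):   # threshold on the number of columns
    """Return names of parameters whose original matrix meets at least one preset condition."""
    targets = []
    for name, p in model.named_parameters():
        if p.dim() != 2:
            continue
        rows, cols = p.shape
        if rows * cols >= first_preset or rows >= second_preset or cols >= third_preset:
            targets.append(name)
    return targets

# Claim 4: the preset values may be derived from the video memory capacity;
# the scaling rule below is purely an assumed example.
if torch.cuda.is_available():
    capacity = torch.cuda.get_device_properties(0).total_memory  # capacity value, in bytes
    first_preset = capacity // 1024                              # assumed heuristic
```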
5. A method according to any one of claims 1 to 3, wherein the number of columns of the first parameter matrix corresponding to the target parameter is at least one order of magnitude smaller than the number of columns of the original parameter matrix corresponding to the target parameter.
6. A method according to any one of claims 1 to 3, wherein the matrix of raw parameters corresponding to the target parameters remains fixed during training of the target data processing model.
7. A method according to any one of claims 1 to 3, wherein the inputting a training sample into the target data processing model and outputting a prediction result corresponding to the training sample through the target data processing model comprises:
calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter to obtain a first product corresponding to the target parameter;
determining the sum of the original parameter matrix corresponding to the target parameter and the first product corresponding to the target parameter as the latest total parameter matrix corresponding to the target parameter;
inputting the training sample into the target data processing model, and obtaining a prediction result corresponding to the training sample based on the latest total parameter matrix corresponding to the target parameter and the latest third parameter matrix corresponding to the non-target parameter.
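By way of illustration only, the following is a sketch of the forward computation of claim 7, assuming PyTorch and a layer whose weight acts on row vectors; the layer convention and function names are assumptions.

```python
import torch

def total_parameter_matrix(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # First product A @ B plus the fixed original matrix gives the latest total matrix
    return W + A @ B

def forward_target_layer(x: torch.Tensor, W: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # The prediction uses the latest total parameter matrix for the target parameter;
    # non-target parameters use their own (third) parameter matrices unchanged.
    return x @ total_parameter_matrix(W, A, B)
```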
8. The method of claim 7, wherein updating the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter, and the third parameter matrix corresponding to the non-target parameter in the target data processing model according to the value of the loss function comprises:
determining a first gradient of a first parameter matrix corresponding to the target parameter, a second gradient of a second parameter matrix corresponding to the target parameter and a third gradient of a third parameter matrix corresponding to the non-target parameter according to the value of the loss function;
updating the first parameter matrix corresponding to the target parameter according to the first gradient of the first parameter matrix corresponding to the target parameter;
updating the second parameter matrix corresponding to the target parameter according to the second gradient of the second parameter matrix corresponding to the target parameter;
and updating the third parameter matrix corresponding to the non-target parameter according to the third gradient of the third parameter matrix corresponding to the non-target parameter.
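By way of illustration only, the following is a self-contained sketch of the update step of claims 1 and 8, assuming PyTorch autograd and an AdamW optimizer; the optimizer choice, toy shapes, and the bias standing in for a non-target parameter are assumptions.

```python
import torch
import torch.nn.functional as F

# Toy shapes for illustration
W = torch.randn(64, 64)                                # original matrix, frozen
A = (torch.randn(64, 4) * 0.01).requires_grad_(True)   # first parameter matrix
B = torch.zeros(4, 64, requires_grad=True)             # second parameter matrix
bias = torch.zeros(64, requires_grad=True)             # stands in for a non-target (third) parameter

# W is excluded from the optimizer, so it receives neither gradients nor
# optimizer state; only A, B and the non-target parameter are updated.
optimizer = torch.optim.AdamW([A, B, bias], lr=1e-4)

x = torch.randn(8, 64)
target = torch.randn(8, 64)
pred = x @ (W + A @ B) + bias          # forward with the latest total matrix
loss = F.mse_loss(pred, target)

optimizer.zero_grad()
loss.backward()    # first, second and third gradients computed by autograd
optimizer.step()   # in-place update of A, B and the non-target parameter
```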
9. The method of claim 8, wherein the method further comprises:
and storing the first gradient of the first parameter matrix corresponding to the target parameter, the second gradient of the second parameter matrix corresponding to the target parameter and the third gradient of the third parameter matrix corresponding to the non-target parameter in a video memory.
10. The method according to claim 9, wherein the method further comprises:
and storing first optimizer state information corresponding to a first parameter matrix corresponding to the target parameter, second optimizer state information corresponding to a second parameter matrix corresponding to the target parameter and third optimizer state information corresponding to a third parameter matrix corresponding to the non-target parameter in a video memory.
11. A method according to any one of claims 1 to 3, characterized in that after said updating of the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter and the third parameter matrix corresponding to the non-target parameter in the target data processing model according to the value of the loss function, the method further comprises:
in response to the end of training of the target data processing model, calculating the product of the latest first parameter matrix corresponding to the target parameter and the latest second parameter matrix corresponding to the target parameter to obtain a second product corresponding to the target parameter;
and determining the sum of the original parameter matrix corresponding to the target parameter and the second product corresponding to the target parameter as an updated parameter matrix corresponding to the target parameter.
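By way of illustration only, the following is a sketch of the post-training merge of claim 11, assuming PyTorch; the function name is an assumption.

```python
import torch

def merge_after_training(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Second product A @ B added to the fixed original matrix yields the updated
    parameter matrix for the target parameter once training has ended."""
    with torch.no_grad():
        return W + A @ B
```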
12. A method according to any one of claims 1 to 3, characterized in that the loss function comprises a first loss function corresponding to a task of predicting a next word.
13. A method according to any one of claims 1 to 3, wherein the penalty function comprises a second penalty function corresponding to a reinforcement learning task.
14. A method according to any one of claims 1 to 3, wherein the target data processing model is a text processing model and the training sample is a training text.
15. A method of data processing, comprising:
obtaining a target data processing model trained by the training method of the large model according to any one of claims 1 to 14;
inputting the data to be processed into the target data processing model, and outputting a data processing result corresponding to the data to be processed through the target data processing model.
16. The method of claim 15, wherein the data to be processed is text to be processed.
17. A training device for a large model, comprising:
the first determining module is used for determining target parameters in the target data processing model;
the initialization module is used for initializing a first parameter matrix corresponding to the target parameter and a second parameter matrix corresponding to the target parameter according to the number of rows and the number of columns of the original parameter matrix corresponding to the target parameter for any target parameter, wherein the number of rows of the first parameter matrix corresponding to the target parameter is equal to the number of rows of the original parameter matrix corresponding to the target parameter, the number of columns of the first parameter matrix corresponding to the target parameter is smaller than the number of columns of the original parameter matrix corresponding to the target parameter, the number of columns of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the original parameter matrix corresponding to the target parameter, and the number of rows of the second parameter matrix corresponding to the target parameter is equal to the number of columns of the first parameter matrix corresponding to the target parameter;
the prediction module is used for inputting a training sample into the target data processing model and outputting a prediction result corresponding to the training sample through the target data processing model;
the second determining module is used for determining the value of the loss function corresponding to the target data processing model according to the prediction result corresponding to the training sample and the label corresponding to the training sample;
and the updating module is used for updating the first parameter matrix corresponding to the target parameter, the second parameter matrix corresponding to the target parameter and the third parameter matrix corresponding to the non-target parameter in the target data processing model according to the value of the loss function.
18. A data processing apparatus, comprising:
a first obtaining module, configured to obtain a target data processing model obtained by training the training device for a large model according to claim 17;
the data processing module is used for inputting the data to be processed into the target data processing model, and outputting a data processing result corresponding to the data to be processed through the target data processing model.
19. An electronic device, comprising:
one or more processors;
a memory for storing executable instructions;
wherein the one or more processors are configured to invoke the executable instructions stored in the memory to perform the method of any one of claims 1 to 16.
20. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 16.
CN202311228444.6A 2023-09-21 2023-09-21 Training method and device for large model, electronic equipment and storage medium Active CN117350354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311228444.6A CN117350354B (en) 2023-09-21 2023-09-21 Training method and device for large model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311228444.6A CN117350354B (en) 2023-09-21 2023-09-21 Training method and device for large model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117350354A true CN117350354A (en) 2024-01-05
CN117350354B CN117350354B (en) 2024-06-18

Family

ID=89370176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311228444.6A Active CN117350354B (en) 2023-09-21 2023-09-21 Training method and device for large model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117350354B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348571A (en) * 2016-11-29 2019-10-18 华为技术有限公司 A kind of neural network model training method, device, chip and system
WO2020048354A1 (en) * 2018-09-04 2020-03-12 杭州海康威视数字技术股份有限公司 Neural network model compression method and apparatus, and computer device
CN111178524A (en) * 2019-12-24 2020-05-19 中国平安人寿保险股份有限公司 Data processing method, device, equipment and medium based on federal learning
US20210365820A1 (en) * 2020-05-22 2021-11-25 Playtika Ltd. Fast and accurate machine learning by applying efficient preconditioner to kernel ridge regression
CN112541159A (en) * 2020-09-30 2021-03-23 华为技术有限公司 Model training method and related equipment
CN114116684A (en) * 2022-01-27 2022-03-01 中国传媒大学 Docker containerization-based deep learning large model and large data set version management method
CN114741389A (en) * 2022-03-29 2022-07-12 网易(杭州)网络有限公司 Model parameter adjusting method and device, electronic equipment and storage medium
CN115114462A (en) * 2022-04-28 2022-09-27 腾讯科技(深圳)有限公司 Model training method and device, multimedia recommendation method and device and storage medium
CN115063875A (en) * 2022-08-16 2022-09-16 北京百度网讯科技有限公司 Model training method, image processing method, device and electronic equipment
CN116522132A (en) * 2023-03-13 2023-08-01 之江实验室 Traffic data complement method, device and storage medium
CN116542315A (en) * 2023-04-28 2023-08-04 中山大学 Large-scale neural network parameter compression method and system based on tensor decomposition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEE S et al.: "A training method for low rank convolutional neural networks based on alternating tensor compose-decompose method", Applied Sciences, vol. 11, no. 2, 31 December 2021 (2021-12-31), page 643 *
CUI Jianfeng; DENG Zeping; SHEN Fei; SHI Wenwu: "Single-channel speech separation based on non-negative matrix factorization and long short-term memory network", Science Technology and Engineering, no. 12, 30 April 2019 (2019-04-30), pages 211-215 *

Also Published As

Publication number Publication date
CN117350354B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
US20210192288A1 (en) Method and apparatus for processing data
CN108629414B (en) Deep hash learning method and device
CN113470619B (en) Speech recognition method, device, medium and equipment
CN109858045B (en) Machine translation method and device
WO2020182123A1 (en) Method and device for pushing statement
CN112712795B (en) Labeling data determining method, labeling data determining device, labeling data determining medium and electronic equipment
CN111738010B (en) Method and device for generating semantic matching model
CN113449070A (en) Multimodal data retrieval method, device, medium and electronic equipment
CN110019849B (en) Attention mechanism-based video attention moment retrieval method and device
CN112883968A (en) Image character recognition method, device, medium and electronic equipment
CN111133458A (en) Enhancing neural networks
US20230367972A1 (en) Method and apparatus for processing model data, electronic device, and computer readable medium
CN113140012B (en) Image processing method, device, medium and electronic equipment
CN111026849B (en) Data processing method and device
CN117171573A (en) Training method, device, equipment and storage medium for multi-modal model
CN117350354B (en) Training method and device for large model, electronic equipment and storage medium
CN112685996B (en) Text punctuation prediction method and device, readable medium and electronic equipment
CN112669816B (en) Model training method, voice recognition method, device, medium and equipment
CN111460214B (en) Classification model training method, audio classification method, device, medium and equipment
CN117217288B (en) Fine tuning method and device for large model, electronic equipment and storage medium
CN116821327A (en) Text data processing method, apparatus, device, readable storage medium and product
CN117350360B (en) Fine tuning method and device for large model, electronic equipment and storage medium
CN117688386A (en) Parameter adjustment method and device for large model, electronic equipment and storage medium
CN117350360A (en) Fine tuning method and device for large model, electronic equipment and storage medium
CN116029291B (en) Keyword recognition method, keyword recognition device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant