CN116432770A - Model training, reasoning and construction method, device, equipment and storage medium - Google Patents

Model training, reasoning and construction method, device, equipment and storage medium

Info

Publication number
CN116432770A
Authority
CN
China
Prior art keywords
trained
target
input data
processing result
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310208687.7A
Other languages
Chinese (zh)
Other versions
CN116432770B (en)
Inventor
江泽胤子
毛超杰
黄子渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202310208687.7A priority Critical patent/CN116432770B/en
Publication of CN116432770A publication Critical patent/CN116432770A/en
Application granted granted Critical
Publication of CN116432770B publication Critical patent/CN116432770B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a model training, reasoning, construction method, apparatus, device and storage medium. According to the method and the device, at least one structure to be trained is added on the basis of the pre-training model, so that at least one target structure in the pre-training model corresponds to the at least one structure to be trained one by one, the structures to be trained with corresponding relations and the target structures are connected in parallel, and the structures to be trained with corresponding relations and the target structures can process input data in parallel. Further, according to the first processing result of the structure to be trained on the input data, the second processing result of the target structure on the input data is adjusted, so that the output of the structure to be trained can influence the final output information of the machine learning model. Because the structure to be trained is connected to the target structure in the pre-training model in parallel, the coupling degree of the structure to be trained and the pre-training model is reduced, and therefore the flexibility of the tuning training method for the pre-training model is improved.

Description

Model training, reasoning and construction method, device, equipment and storage medium
Technical Field
The disclosure relates to the field of information technology, and in particular relates to a model training, reasoning and constructing method, device, equipment and storage medium.
Background
Currently, a Pre-trained Model can be used as an initialization model for downstream tasks. A transfer learning method performs lightweight adjustment on part of the structure in the pre-training model, so that the adjusted model can process downstream tasks.
However, the structure for achieving the lightweight adjustment is tightly coupled with the pre-training model, resulting in an inflexible tuning training method for the pre-training model.
Disclosure of Invention
In order to solve the above technical problems or at least partially solve the above technical problems, the present disclosure provides a model training, reasoning, construction method, apparatus, device and storage medium to improve flexibility of tuning training method for pre-training model.
In a first aspect, an embodiment of the present disclosure provides a machine learning model training method, where the machine learning model includes a pre-training model and at least one structure to be trained, the pre-training model includes at least one target structure, the at least one structure to be trained and the at least one target structure are in one-to-one correspondence, and the structure to be trained and the target structure that have a correspondence are connected in parallel, and the method includes:
Acquiring sample data of a target task, wherein the sample data comprises data to be processed and a labeling result of the data to be processed;
inputting the data to be processed into the machine learning model, so that the structure to be trained and the target structure with corresponding relations process input data in parallel, wherein the input data is the data to be processed or data obtained according to the data to be processed;
adjusting a second processing result of the target structure on the input data according to a first processing result of the structure to be trained on the input data to obtain an adjusted processing result, wherein the output information of the machine learning model is related to the adjusted processing result;
and adjusting parameters in the at least one structure to be trained according to the output information and the labeling result.
In a second aspect, an embodiment of the present disclosure provides a machine learning model reasoning method, the machine learning model including a pre-trained model and at least one trained structure, the at least one trained structure being trained according to the method of the first aspect, the pre-trained model including at least one target structure, the at least one trained structure and the at least one target structure being in one-to-one correspondence, the trained structure and the target structure having a correspondence being connected in parallel, the method comprising:
Acquiring target data of a target task;
inputting the target data into the machine learning model, so that the trained structure and the target structure with corresponding relations process input data in parallel, wherein the input data is the target data or data obtained according to the target data;
adjusting a second processing result of the target structure on the input data according to a first processing result of the trained structure on the input data, and obtaining an adjusted processing result;
and determining an output result of the target data according to the adjusted processing result.
In a third aspect, an embodiment of the present disclosure provides a machine learning model building method, the method including:
determining at least one target structure to be adjusted in the pre-training model according to the target task;
acquiring at least one structure to be trained, so that the at least one structure to be trained corresponds to the at least one target structure one by one;
and connecting the structure to be trained with the target structure in a corresponding relation in parallel to obtain a machine learning model, wherein the machine learning model comprises the pre-training model and the at least one structure to be trained, and the training method for the at least one structure to be trained adopts the method according to the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides an image processing method, including:
acquiring a target image to be processed;
inputting the target image into a machine learning model, the machine learning model comprising a pre-trained model and at least one trained structure, the at least one trained structure being trained according to the method as described in the first aspect, the pre-trained model comprising at least one target structure, the at least one trained structure and the at least one target structure being in one-to-one correspondence, the trained structure and the target structure having a correspondence being connected in parallel, and the trained structure and the target structure having a correspondence being processed input data in parallel, the input data being the target image or data derived from the target image;
adjusting a second processing result of the target structure on the input data according to a first processing result of the trained structure on the input data, and obtaining an adjusted processing result;
and determining an image processing result of the target image according to the adjusted processing result.
In a fifth aspect, an embodiment of the present disclosure provides a machine learning model training apparatus, where the machine learning model includes a pre-training model and at least one structure to be trained, the pre-training model includes at least one target structure, the at least one structure to be trained and the at least one target structure are in one-to-one correspondence, and the structure to be trained and the target structure that have a correspondence are connected in parallel, and the apparatus includes:
The acquisition module is used for acquiring sample data of a target task, wherein the sample data comprises data to be processed and a labeling result of the data to be processed;
the input module is used for inputting the data to be processed into the machine learning model, so that the structure to be trained and the target structure with corresponding relations process the input data in parallel, wherein the input data is the data to be processed or the data obtained according to the data to be processed;
the first adjusting module is used for adjusting a second processing result of the target structure on the input data according to a first processing result of the structure to be trained on the input data to obtain an adjusted processing result, and the output information of the machine learning model is related to the adjusted processing result;
and the second adjusting module is used for adjusting parameters in the at least one structure to be trained according to the output information and the labeling result.
In a sixth aspect, embodiments of the present disclosure provide an electronic device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the methods of the first to fourth aspects.
In a seventh aspect, embodiments of the present disclosure provide a computer readable storage medium having stored thereon a computer program for execution by a processor to implement the methods of the first to fourth aspects.
According to the model training, reasoning, construction method, device, equipment and storage medium, at least one structure to be trained is added on the basis of the pre-training model, so that at least one target structure in the pre-training model corresponds to the at least one structure to be trained one by one, the structures to be trained with corresponding relations and the target structures are connected in parallel, and the structures to be trained with corresponding relations and the target structures can process input data in parallel. Further, according to the first processing result of the structure to be trained on the input data, the second processing result of the target structure on the input data is adjusted, so that the output of the structure to be trained can influence the final output information of the machine learning model. Further, according to the output information and a preset labeling result, parameters in at least one structure to be trained are adjusted. Thereby realizing tuning training of the pre-training model. Because the structure to be trained is connected to the target structure in the pre-training model in parallel, the coupling degree of the structure to be trained and the pre-training model is reduced, and therefore the flexibility of the tuning training method for the pre-training model is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of a machine learning model training method provided by an embodiment of the present disclosure;
fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a prior art pre-training model;
FIG. 4 is a schematic diagram of a machine learning model provided by an embodiment of the present disclosure;
FIG. 5 is a unified framework provided by embodiments of the present disclosure;
FIG. 6 is a schematic diagram of a prior art Prefix;
FIG. 7 is a schematic diagram of a parallel structure of Prefix provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a prior art Prompt;
fig. 9 is a schematic diagram of a parallel structure of Prompt provided by an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a prior art Adapter;
FIG. 11 is a schematic diagram of a parallel architecture of an Adapter provided by an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a plurality of parallel structures provided in another embodiment of the present disclosure;
FIG. 13 is a schematic view of a first preset structure according to another embodiment of the present disclosure;
FIG. 14 is a schematic view of a second preset structure provided in another embodiment of the present disclosure;
fig. 15 is a schematic view of a third preset structure according to another embodiment of the present disclosure;
FIG. 16 is a schematic diagram of a plurality of preset structures connected in parallel according to another embodiment of the present disclosure;
FIG. 17 is a schematic diagram illustrating respective instantiations of a plurality of structures to be trained provided in accordance with another embodiment of the present disclosure;
FIG. 18 is a flow chart of a machine learning model reasoning method provided by another embodiment of the present disclosure;
FIG. 19 is a flowchart of a machine learning model construction method provided by another embodiment of the present disclosure;
FIG. 20 is a flowchart of an image processing method according to another embodiment of the present disclosure;
fig. 21 is a schematic diagram of an application scenario provided in another embodiment of the present disclosure;
FIG. 22 is a schematic structural diagram of a machine learning model training apparatus provided by an embodiment of the present disclosure;
fig. 23 is a schematic structural diagram of an embodiment of an electronic device according to an embodiment of the disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the disclosure.
It should be noted that, in the training stage and the reasoning stage, the target tasks (including, but not limited to, target tasks set by the user and target tasks determined according to actual application requirements), the sample data and the target data (including, but not limited to, data used for analysis, stored data and displayed data) are all information and data authorized by the user or fully authorized by all parties. The collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entries are provided for the user to choose to authorize or refuse.
In addition, the machine learning model training method and the reasoning method provided by the application can relate to the following terms for explanation, and the details are as follows:
Pre-trained Model: a model produced by large-scale training and optimization on a representative data set. It is commonly used as an initialization model for downstream-task training, to speed up training or to achieve better results.
Base Model (Foundation Model): a model trained with large amounts of data, powerful computing power and a well-designed structure, which can be adapted to a range of downstream tasks in different fields.
Transfer Learning: improving the learning of a new task by transferring knowledge from related tasks that have already been learned. Transfer learning is a machine learning method that reuses a model developed for task A as the starting point when developing a model for task B.
Parameter efficient transfer learning method (Parameter-efficient Transfer Learning, PETL): based on the pre-training model, the tuning training method is realized by modifying a few parameters or adding a few additional parameters.
Transformer: a network architecture, typically consisting of several layers of encoders and decoders.
Vision Transformer: a network architecture that applies the Transformer architecture to the vision field.
Transformer Block: a Transformer sub-block, typically composed of a multi-head attention layer and a feed-forward network.
Multi-head Attention (MHA): a module that typically performs a relevance calculation over Query, Key and Value vectors.
Feed-forward Network (FFN): typically consists of several fully connected layers and an activation function.
Multi-Layer Perceptron (MLP): a network module typically composed of one or more hidden layers and activation functions.
Adapter: a tuning training method, typically acting on a feed-forward network, implemented as a small module consisting of two fully connected layers and an activation function.
Prompt: a tuning training method, usually implemented by a learnable parameter spliced with the input.
Prefix: a tuning training method, usually implemented by two learnable parameters spliced with the Key and the Value in a multi-head attention layer.
Typically, a Pre-trained Model can be used as an initialization model for downstream tasks. A transfer learning method performs lightweight adjustment on part of the structure in the pre-training model, so that the adjusted model can process downstream tasks. However, the structure for achieving the lightweight adjustment is tightly coupled with the pre-training model, resulting in an inflexible tuning training method for the pre-training model. In view of this problem, embodiments of the present disclosure provide model training, reasoning and building methods, which are described below in connection with specific embodiments.
Fig. 1 is a flowchart of a machine learning model training method provided in an embodiment of the present disclosure. The method can be performed by a machine learning model training device, which can be implemented in software and/or hardware, and the device can be configured in an electronic device, such as a server or a terminal, wherein the terminal specifically comprises a mobile phone, a computer, a tablet computer, or the like. The server may specifically be a cloud server, and the machine learning model training method may be executed by the cloud server, where a plurality of computing nodes (cloud servers) may be deployed in the cloud server, and each computing node has processing resources such as computation and storage. At the cloud, a service may be provided by multiple computing nodes, although one computing node may provide one or more services. The cloud may provide the service by providing a service interface to the outside, and the user invokes the service interface to use the corresponding service. The service interface includes a software development kit (Software Development Kit, abbreviated as SDK), an application program interface (Application Programming Interface, abbreviated as API), and the like. In addition, the machine learning model training method described in this embodiment may be applied to an application scenario as shown in fig. 2. As shown in fig. 2, the application scenario includes a terminal 21 and a server 22, where the server 22 may train the machine learning model and then keep the trained machine learning model locally or deploy it to the terminal 21 or other servers.
In this embodiment, the machine learning model includes a pre-training model and at least one structure to be trained, the pre-training model includes at least one target structure, the at least one structure to be trained and the at least one target structure are in one-to-one correspondence, and the structure to be trained and the target structure having a correspondence are connected in parallel.
Because the pre-training model can be used as an initialization model in the downstream task training, on the basis of the pre-training model, the existing partial parameters in the pre-training model are further modified by adopting a parameter efficient transfer learning method, or new parameters are added to the pre-training model under the condition that the existing parameters in the pre-training model are kept unchanged, which is equivalent to further tuning and training the pre-training model, so that the pre-training model can evolve into a machine learning model capable of processing the downstream task.
As shown in fig. 3, assuming that the pre-training model is a Transformer, the Transformer includes a plurality of sub-blocks (blocks), each of which has a structure 31 as shown in fig. 3. The pre-training model is assumed to have a certain number of parameters, but when the pre-training model is applied to a downstream task, processing the data to be processed in the downstream task with these parameters alone cannot reach the expected performance index. Therefore, it is necessary to modify some of the existing parameters in the pre-training model, or to add new parameters to the pre-training model while keeping its existing parameters unchanged, so that the pre-training model can evolve into a machine learning model capable of handling the downstream task. In this embodiment, for example, as shown in fig. 4, some structures to be trained may be added on the basis of the pre-training model to obtain the machine learning model to be trained shown in fig. 4, where each structure to be trained includes one or more parameters to be trained. Further, the machine learning model to be trained is trained according to sample data of the downstream task; specifically, the one or more structures to be trained in the machine learning model are trained. That is, the existing parameters in the pre-training model can be kept unchanged during training, while the parameters in each structure to be trained are continuously adjusted. When training is completed, the parameters inside each structure to be trained can be regarded as newly added parameters. For example, if the pre-training model has 50 existing parameters and 5 parameters are newly added, the machine learning model shown in fig. 4 has 55 parameters when training is completed. This is equivalent to expanding the existing parameters of the pre-training model, and with the additional parameters the model can reach the expected performance index, such as the expected accuracy, when processing the data to be processed in the downstream task.
That is, the embodiment of the disclosure derives a unified framework (U-Tuning) for parameter-efficient transfer learning by re-examining the design paradigm of parameter-efficient transfer learning methods. Specifically, the framework is composed of an operation module (OP) with frozen parameters and a unified tuning module (U-Tuner). The operation module (OP) may in particular be a pre-trained model or a part of the structure inside a pre-trained model, such as the MHA, the FFN or the sub-blocks shown in fig. 4. The unified tuning module (U-Tuner) may specifically be a structure to be trained as shown in fig. 4. As shown in fig. 4, the MHA, the FFN and the sub-block are each connected in parallel with one structure to be trained, and the part of the pre-training model that is connected in parallel with a structure to be trained is recorded as a target structure. The machine learning model shown in fig. 4, for example, includes 3 target structures and 3 structures to be trained; the 3 target structures and the 3 structures to be trained correspond to each other one by one, and the structures to be trained and the target structures having a correspondence are connected in parallel. It will be appreciated that the machine learning model shown in fig. 4 is only a schematic illustration, and the present embodiment does not limit the number of target structures or the number of structures to be trained. In addition, the target structure is not specifically limited; for example, besides the MHA, the FFN and the sub-block, the target structure may be a sub-layer inside the MHA or the FFN, such as the feed-forward layer (Feed Forward) in the FFN. Furthermore, in other embodiments, the target structure may also be a Transformer.
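For illustration only, the following PyTorch-style sketch (not part of the patent text) shows one way the parallel connection described above could be realized in code; the class name, argument names and the additive adjustment are assumptions made for readability.

```python
import torch
import torch.nn as nn

class ParallelTuned(nn.Module):
    """Sketch: wrap a frozen target structure (OP) with a parallel structure to be trained (U-Tuner)."""

    def __init__(self, target_structure: nn.Module, structure_to_train: nn.Module):
        super().__init__()
        self.target = target_structure        # existing parameters from the pre-training model
        self.tuner = structure_to_train       # newly added parameters to be trained
        for p in self.target.parameters():
            p.requires_grad_(False)           # keep the pre-trained parameters unchanged

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        second_result = self.target(x)        # second processing result (target structure)
        first_result = self.tuner(x)          # first processing result (structure to be trained)
        return second_result + first_result   # adjusted processing result
```

In this sketch the adjustment is a simple addition of the two branch outputs, matching the x′ = OP(x) + U-Tuner(x) form discussed later; other adjustment operators could be substituted.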
The training method of the machine learning model is described in detail below with reference to fig. 2 and 4, and as shown in fig. 1, the specific steps of the method are as follows:
s101, acquiring sample data of a target task, wherein the sample data comprises data to be processed and labeling results of the data to be processed.
For example, with the downstream task denoted as the target task, the server 22 may obtain sample data of the target task, the sample data including data to be processed and a labeling result of the data to be processed. For example, the target task is an image classification task. The data to be processed may be a sample image, and the labeling result of the data to be processed may be a classification label of the sample image. For example, the sample image is a scene image, and the classification label of the sample image is the classification label of a scene, such as a natural landscape or an office environment. The present embodiment does not limit the manner in which the server 22 obtains the sample data. For example, the server 22 may store sample data of the target task locally. Alternatively, the server 22 may receive sample data of the target task from the terminal 21 or another server. Still alternatively, the server 22 may crawl sample data of the target task from the network.
S102, inputting the data to be processed into the machine learning model, so that the structure to be trained and the target structure with corresponding relations process the input data in parallel, wherein the input data is the data to be processed or the data obtained according to the data to be processed.
For example, the server 22 may input the sample image into the machine learning model shown in fig. 4, i.e., the sample image may be the input 40 shown in fig. 4. Alternatively, the input 40 shown in fig. 4 may be a representation vector of the sample image. As a further alternative, the input 40 shown in fig. 4 may be the image obtained after the sample image undergoes initialization processing, including but not limited to image cropping, image scaling, image sharpening, and brightness adjustment. That is, the input 40 shown in fig. 4 may be the sample image or data derived from the sample image.
In addition, the data obtained from the sample image may also be an intermediate result output by an intermediate layer in the machine learning model. For example, as shown in fig. 4, if sub-block 41 is the first sub-block in the bottom-to-top order, then the input 40 shown in fig. 4 may be the sample image. If sub-block 41 is not the first sub-block in the sequence, e.g., sub-block 41 is the second sub-block in the sequence, then the input 40 shown in fig. 4 may be an intermediate result of the processing of the sample image by the first sub-block. Similarly, inside sub-block 41, the input data of sub-block 41 (i.e., the data input to sub-block 41) and the input data of the structure to be trained 42 (i.e., the data input to the structure to be trained 42) are each the input 40 shown in fig. 4, and the input data of the MHA and the input data of the structure to be trained 43 are each the input 40 shown in fig. 4. The input data of the FFN and the input data of the structure to be trained 44 are each the intermediate result output by the operator 45.
Since the structure to be trained and the target structure having the correspondence are connected in parallel, they can process input data in parallel, but the input data may differ for different correspondences. For example, the input data processed in parallel by the sub-block 41 and the structure to be trained 42 is the input 40, while the input data processed in parallel by the FFN and the structure to be trained 44 is data obtained from the input 40.
S103, adjusting a second processing result of the target structure on the input data according to a first processing result of the structure to be trained on the input data to obtain an adjusted processing result, wherein the output information of the machine learning model is related to the adjusted processing result.
For example, taking the MHA and the structure to be trained 43 shown in fig. 4 as an example, since the MHA and the structure to be trained 43 are connected in parallel, the MHA and the structure to be trained 43 process corresponding input data in parallel, and assuming that a result obtained by processing the input data by the structure to be trained 43 is denoted as a first processing result and a result obtained by processing the input data by the MHA is denoted as a second processing result, the second processing result may be adjusted according to the first processing result. For example, the first processing result and the second processing result are vectors with the same dimension, and the first processing result is added on the basis of the second processing result, so that the second processing result can be adjusted, and becomes a new processing result, and the new processing result is recorded as an adjusted processing result. In particular, the adjustment process may be implemented by an operator 45 as shown in fig. 4.
Similarly, after the FFN and the structure to be trained 44 process the corresponding input data in parallel, the processing result of the FFN on the input data is adjusted according to the processing result of the structure to be trained 44 on the input data, so as to obtain an adjusted processing result, and the adjustment process is implemented through an operator 46.
Similarly, after the sub-block 41 and the structure to be trained 42 process the corresponding input data in parallel, the processing result of the sub-block 41 on the input data is adjusted according to the processing result of the structure to be trained 42 on the input data, so as to obtain an adjusted processing result, and the adjustment process is implemented through an operator 47.
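As a reading aid for the data flow just described (operators 45, 46 and 47 in fig. 4), here is a hypothetical sketch of one sub-block; residual connections and layer normalization of a real Transformer block are omitted, and all names are assumptions rather than elements of the patent.

```python
import torch
import torch.nn as nn

class TunedSubBlock(nn.Module):
    """Sketch of the sub-block 41 data flow in Fig. 4 (simplified, hypothetical names)."""

    def __init__(self, mha: nn.Module, ffn: nn.Module,
                 u_block: nn.Module, u_mha: nn.Module, u_ffn: nn.Module):
        super().__init__()
        self.mha, self.ffn = mha, ffn                                  # frozen target structures
        self.u_block, self.u_mha, self.u_ffn = u_block, u_mha, u_ffn  # structures to be trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.mha(x) + self.u_mha(x)   # operator 45: adjust the MHA's processing result
        h = self.ffn(h) + self.u_ffn(h)   # operator 46: adjust the FFN's processing result
        return h + self.u_block(x)        # operator 47: adjust the sub-block's processing result
```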
In addition, the sub-block 41 shown in fig. 4 is only one sub-block in the machine learning model; the output of the sub-block 41, i.e. the output of the operator 47, may further be used as the input of the next sub-block above the sub-block 41, and the processing performed by that sub-block on its input is similar to the processing performed by the sub-block 41 on the input 40, which is not repeated here. And so on; after the uppermost sub-block in the machine learning model has finished processing, the output information of the machine learning model is obtained, and the output information is related to the adjusted processing result. That is, the adjusted processing results described above participate in the computation from the input of the machine learning model to its output information; in other words, the output of each structure to be trained shown in fig. 4 affects the output information of the machine learning model.
S104, adjusting parameters in the at least one structure to be trained according to the output information and the labeling result.
For example, the parameters in each structure to be trained shown in fig. 4 are adjusted according to the similarity or difference between the output information of the machine learning model and the labeling result of the data to be processed, so that the output information gradually approaches the labeling result over subsequent training iterations. For example, parameter adjustment may be guided by gradient data computed between the output information of the machine learning model and the labeling result of the data to be processed. A gradient is a vector along whose direction the directional derivative of a function at a point is maximal, i.e. the function changes fastest along that direction at that point, with the greatest rate of change. Based on this principle, the adjustment direction of the parameters can be guided so that the output information of the machine learning model approaches the labeling result of the data to be processed.
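A minimal sketch of one such training step is shown below, assuming the pre-trained parameters have already been frozen (requires_grad=False) and only the structures to be trained are passed to the optimizer; the function and variable names are illustrative, not taken from the patent.

```python
import torch

def train_step(model, optimizer, loss_fn, data_to_process, labeling_result):
    """One tuning-training step: only the parameters of the structures to be trained are updated."""
    output_info = model(data_to_process)           # forward pass through the Fig. 4 model
    loss = loss_fn(output_info, labeling_result)   # difference between output and labeling result
    optimizer.zero_grad()
    loss.backward()                                # gradients reach only the trainable parameters
    optimizer.step()
    return loss.item()

# Example optimizer over the newly added parameters only (illustrative):
# optimizer = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-3)
```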
In addition, when each structure to be trained as shown in fig. 4 is trained, the machine learning model as shown in fig. 4 may also be referred to as a tuned pre-training model.
According to the embodiment of the disclosure, at least one structure to be trained is added on the basis of the pre-training model, so that at least one target structure in the pre-training model corresponds to the at least one structure to be trained one by one, the structure to be trained with the corresponding relation is connected with the target structure in parallel, and the structure to be trained with the corresponding relation and the target structure can process input data in parallel. Further, according to the first processing result of the structure to be trained on the input data, the second processing result of the target structure on the input data is adjusted, so that the output of the structure to be trained can influence the final output information of the machine learning model. Further, according to the output information and a preset labeling result, parameters in at least one structure to be trained are adjusted. Thereby realizing tuning training of the pre-training model. Because the structure to be trained is connected to the target structure in the pre-training model in parallel, the coupling degree of the structure to be trained and the pre-training model is reduced, and therefore the flexibility of the tuning training method for the pre-training model is improved.
The structure shown in fig. 5 is the unified framework for parameter-efficient transfer learning in this embodiment, called U-Tuning. U-Tuning consists of an operation module (OP) with frozen parameters and a unified tuning module (U-Tuner), and the two are connected in parallel. The U-Tuner may also be referred to as a unified lightweight trainable structure. The framework allows the U-Tuner to be inserted flexibly. Experiments show that the framework is sufficiently general and can achieve better performance on various downstream tasks, with both high generalization and high performance. In addition, the framework can realize not only existing parameter-efficient transfer learning methods but also new parameter-efficient transfer learning methods.
As shown in fig. 5, x represents an input, and the OP may be a Transformer structure or a partial structure within a Transformer. OP(x) represents the result of processing the input with the parameters in the OP, which are the parameters already present in the pre-trained model. When the U-Tuner is the structure to be trained, U-Tuner(x) represents the result of processing the input with the parameters in the U-Tuner. OP(x) + U-Tuner(x) represents the result of adjusting OP(x) according to U-Tuner(x).
On the basis of the above embodiment, the pre-training model includes a plurality of sub-blocks, the sub-blocks including a multi-head attention layer and a feed-forward network, the multi-head attention layer or the feed-forward network including a plurality of sub-layers; the at least one target structure is at least one of the sub-layer, the multi-headed attention layer, the feed forward network, the sub-block, the pre-trained model.
As shown in fig. 4, the pre-training model includes a plurality of sub-blocks, each sub-block including a multi-head attention layer and a feed-forward network, each of which includes a plurality of sub-layers. The OP as described above may be a sub-layer, a multi-head attention layer, a feed forward network, a sub-block, or a pre-trained model.
It can be seen that the structure to be trained in this embodiment may be effective on a part of the structure of each level inside the pre-training model, and may even be effective on the pre-training model. That is, the structure to be trained in this embodiment not only can adjust and optimize the output distribution of the partial structure of each level in the pre-training model, but also can adjust and optimize the output distribution of the pre-training model. In the existing tuning training method, the structure for realizing light weight adjustment only takes effect on part of the structure inside the pre-training model, for example, only takes effect on the MHA or FFN, so that the rest of the structure which is not adjusted cannot adapt to the data distribution of the downstream task. Therefore, compared with the prior art, the method and the device can improve the adaptability of the optimized pre-training model to downstream tasks.
The present embodiment has studied the existing parameter efficient transfer learning methods, and found that each of the existing parameter efficient transfer learning methods can be equivalently expressed as a parallel combination of an existing operation (e.g., MHA or FFN in a Transformer) and a newly introduced substructure (e.g., a structure similar to MHA or FFN).
For example, FIG. 6 is a schematic diagram of a conventional Prefix. As shown in fig. 4, the MHA includes three known parameters, W_q, W_k and W_v, which are projection weights and belong to the existing parameters in the pre-training model. Denoting the input as x, multiplying x by W_q gives Q, the representation vector of the Query; multiplying x by W_k gives K, the representation vector of the Key; and multiplying x by W_v gives V, the representation vector of the Value. K_pre denotes the parameters to be trained spliced onto K, and V_pre denotes the parameters to be trained spliced onto V. The existing Prefix performs tuning training on the pre-training model through K_pre and V_pre. Specifically, the output MHA_pre of the MHA shown in fig. 6 can be expressed as the following formula (1):

MHA_pre = Attn(xW_q, [K_pre; xW_k], [V_pre; xW_v])    (1)

where x represents the input, Attn represents the computation performed inside the MHA on the three vectors xW_q, [K_pre; xW_k] and [V_pre; xW_v], [K_pre; xW_k] represents the vector obtained by splicing K_pre and xW_k, and [V_pre; xW_v] represents the vector obtained by splicing V_pre and xW_v.

The following formula (2) can be derived from formula (1) above:

MHA′_pre = (1 − λ)Attn(Q, K, V) + λAttn(Q, K_pre, V_pre)    (2)

where λ represents a weight coefficient. Therefore, in the present embodiment, the parallel structure shown in fig. 7 can be derived from the structure shown in fig. 6. The parallel structure includes an existing operation 71 and a newly introduced substructure 72. The output of the existing operation 71 is Attn(Q, K, V), and the output of the newly introduced substructure 72 is Attn(Q, K_pre, V_pre). The output of the parallel structure as a whole is MHA′_pre in formula (2). The existing operation 71 and the newly introduced substructure 72 may be computed in parallel.
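As an illustration of formula (2), the sketch below computes the two attention branches in parallel and blends them with the weight coefficient λ; treating λ as a plain scalar and using unbatched matrices are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def attn(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # scaled dot-product attention over (query, key, value)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

def prefix_parallel(x, W_q, W_k, W_v, K_pre, V_pre, lam):
    """Formula (2): MHA'_pre = (1 - λ)·Attn(Q, K, V) + λ·Attn(Q, K_pre, V_pre)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return (1 - lam) * attn(Q, K, V) + lam * attn(Q, K_pre, V_pre)
```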
Fig. 8 is a schematic diagram of a conventional Prompt. For example, the existing Prompt performs tuning training on a pre-trained model by means of a learnable parameter spliced with the input; the dimension of the learnable parameter is the same as the dimension of the input. Specifically, the output MHA_pro of the MHA shown in fig. 8 can be expressed as the following formula (3):

MHA_pro = Attn([x; x_pro]W_q, [x; x_pro]W_k, [x; x_pro]W_v)    (3)

where x represents the input, x_pro represents the learnable parameter, and [x; x_pro] represents the vector obtained by splicing x and x_pro. W_q, W_k and W_v are the projection weights in the MHA and belong to the existing parameters in the pre-training model.

The following formula (4) can be derived from formula (3) above:

MHA′_pro = (1 − λ)Attn(Q, K, V) + λAttn(Q, K_pro, V_pro)    (4)

where λ represents a weight coefficient, K_pro = x_pro·W_k, and V_pro = x_pro·W_v. Therefore, in the present embodiment, the parallel structure shown in fig. 9 can be derived from the structure shown in fig. 8. The parallel structure includes an existing operation 91 and a newly introduced substructure 92. The output of the existing operation 91 is Attn(Q, K, V), and the output of the newly introduced substructure 92 is Attn(Q, K_pro, V_pro). The output of the parallel structure as a whole is MHA′_pro in formula (4). The existing operation 91 and the newly introduced substructure 92 may be computed in parallel.
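The parallel form of formula (4) admits an analogous sketch, reusing the attn() helper from the Prefix sketch above; x_pro is the learnable prompt parameter, and the scalar λ is again a simplifying assumption.

```python
def prompt_parallel(x, x_pro, W_q, W_k, W_v, lam):
    """Formula (4): MHA'_pro = (1 - λ)·Attn(Q, K, V) + λ·Attn(Q, K_pro, V_pro)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    K_pro, V_pro = x_pro @ W_k, x_pro @ W_v   # prompt projected by the existing weights
    return (1 - lam) * attn(Q, K, V) + lam * attn(Q, K_pro, V_pro)
```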
Fig. 10 is a schematic diagram of a conventional Adapter. For example, an Adapter is typically a small module, acting on the FFN, that consists of two fully connected layers and an activation function. The output of the Adapter shown in fig. 10 can be expressed as the following formula (5):

Adapter(x) = φ(xW_down)W_up    (5)

where x represents the input, φ represents the activation function, i.e. the non-linear layer (non-linear) shown in fig. 10, and W_down and W_up represent the learnable parameters. The following formula (6) can be derived from formula (5) above:

FFN′_adapter = FFN(x) + φ(xW_down)W_up    (6)

where φ represents the activation function. Therefore, in the present embodiment, the parallel structure shown in fig. 11 can be derived from the structure shown in fig. 10. The parallel structure includes an existing operation 111 and a newly introduced substructure 112. The output of the existing operation 111 is FFN(x), and the output of the newly introduced substructure 112 is φ(xW_down)W_up. The output of the parallel structure as a whole is FFN′_adapter in formula (6). The existing operation 111 and the newly introduced substructure 112 may be computed in parallel. That is, a prior-art FFN with an Adapter can be equivalently expressed as the original FFN module connected in parallel with an MLP branch.
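Formula (6) can be sketched as a frozen FFN with a parallel bottleneck MLP branch; the activation choice (ReLU) and the dimension arguments below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ParallelAdapter(nn.Module):
    """Sketch of formula (6): FFN'_adapter(x) = FFN(x) + φ(x·W_down)·W_up."""

    def __init__(self, ffn: nn.Module, d_model: int, d_bottleneck: int):
        super().__init__()
        self.ffn = ffn                                              # frozen feed-forward network
        self.down = nn.Linear(d_model, d_bottleneck, bias=False)   # W_down
        self.up = nn.Linear(d_bottleneck, d_model, bias=False)     # W_up
        self.act = nn.ReLU()                                        # activation φ (assumed choice)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ffn(x) + self.up(self.act(self.down(x)))
```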
As can be seen from fig. 6-11, the existing tuning method can be summarized as a unified formula, as shown in the following formula (7):
x′=OP(x)+U-Tuner(x) (7)
The OP represents an existing operation in the Transformer, for example the existing operation 71, the existing operation 91 or the existing operation 111. The U-Tuner represents a unified tuning module, e.g. the newly introduced substructure 72, 92 or 112. The output of the U-Tuner, U-Tuner(x), is used to adjust the output distribution of the OP, i.e. OP(x). The adjusted result is denoted as x′.
In this embodiment, since the OP and the U-Tuner are connected in parallel, the instantiation of the OP and the instantiation of the U-Tuner may be independent of each other. For example, the OP may be instantiated as a sub-layer in the MHA or FFN, as a Transformer sub-block, or as a Transformer. The U-Tuner may be instantiated as the newly introduced substructure 72, the newly introduced substructure 92 or the newly introduced substructure 112, or as a parallel combination of at least two of the newly introduced substructures 72, 92 and 112. That is, the instantiation of the OP does not affect the instantiation of the U-Tuner, and the instantiation of the U-Tuner does not affect the instantiation of the OP. Therefore, the combinations obtained by instantiating the OP and the U-Tuner are not limited to the parallel structures shown in fig. 7, 9 and 11. In the parallel structures shown in fig. 7, 9 and 11, the instantiation of the OP is similar to the instantiation of the U-Tuner; for example, in fig. 7 the OP is instantiated as the MHA and the U-Tuner is instantiated as the newly introduced substructure 72, and the computation inside the newly introduced substructure 72 is similar to the computation inside the MHA. In other embodiments, the instantiation of the OP and the instantiation of the U-Tuner may be dissimilar, e.g., the OP is instantiated as an MHA and the U-Tuner is instantiated as the newly introduced substructure 112. In addition, the U-Tuner may be instantiated not only as a single newly introduced substructure, but also as a parallel combination of multiple newly introduced substructures. For example, as shown in fig. 12, the OP is instantiated as the MHA and the U-Tuner is instantiated as a parallel combination of the newly introduced substructure 72 and the newly introduced substructure 112. That is, various new tuning methods can be generated by combining the various instantiation forms of the OP with the various instantiation forms of the U-Tuner.
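A sketch of the fig. 12 case, where the U-Tuner is a parallel combination of several newly introduced substructures whose outputs are accumulated, might look as follows; the class name and the simple summation are assumptions.

```python
import torch
import torch.nn as nn

class CombinedTuner(nn.Module):
    """Sketch: U-Tuner instantiated as a parallel combination of several substructures."""

    def __init__(self, branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)   # e.g. a prefix-style and an adapter-style branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.branches[0](x)
        for branch in self.branches[1:]:
            out = out + branch(x)                 # outputs of all branches are accumulated
        return out
```

Such a combined tuner could then be attached to any OP (an MHA, an FFN, a sub-block, or the whole model) through a parallel wrapper like the one sketched earlier.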
In addition, it is understood that tuning training methods are not limited to Adapter, Prompt and Prefix. When a new tuning training method appears, new parallel structures may also appear, yielding newly introduced substructures other than the newly introduced substructures 72, 92 and 112. In this case, the instantiation of the U-Tuner can be further extended, making it more flexible and versatile. The combinations obtained after instantiating the OP and the U-Tuner therefore become even more flexible and diverse, realizing new parameter-efficient transfer learning methods. That is, by enriching the instantiations of the U-Tuner, the framework can easily be extended to new tuning methods.
Furthermore, in this embodiment the U-Tuner can be attached to structures at each level inside the pre-training model, for example not only to the MHA and the FFN, but also to sub-layers within the MHA or FFN, to sub-blocks in the Transformer, or even to the Transformer itself. This broadens the application range of the U-Tuner.
Since the OP may be a sub-layer, multi-head attention layer, feed forward network, sub-block, or pre-training model, the OP term may remain aligned with the design of the pre-training model in a manner that allows the OP to preserve the generalization capability of the pre-training model. In addition, since the OP may be a sub-layer, a multi-head attention layer, a feed-forward network, a sub-block, or a pre-training model, that is, the OP may be instantiated as any one of the structures in the pre-training model, the U-Tuner may tune the output distribution of any one of the structures in the pre-training model to accommodate the data distribution in various downstream tasks. Thereby increasing the flexibility of the U-Tuning framework and equation (7).
In addition, since the instantiation of the OP and the instantiation of the U-Tuner are decoupled, the structure obtained after instantiating the U-Tuner does not need to be consistent with the structure obtained after instantiating the OP. For example, when the OP is instantiated as an MHA, the U-Tuner is not limited to being instantiated as the newly introduced substructure 72 or the newly introduced substructure 92; it may, for example, be instantiated as the newly introduced substructure 112. This is in clear contrast to existing tuning methods: fig. 6, 8 and 10 are schematic diagrams of Prefix, Prompt and Adapter in sequence, and since the parameters in the dashed boxes are tightly coupled to the existing parameters in the MHA or FFN, the calculation involving the parameters in the dashed boxes is similar to the calculation inside the MHA or FFN. In the structure shown in fig. 5, because the instantiation of the OP and the instantiation of the U-Tuner are decoupled, the way the U-Tuner computes on the input may differ from the way the OP computes on the input. For example, in the Prefix shown in fig. 6, K_pre and V_pre participate in the computation inside the MHA, and K_pre and V_pre act on the MHA. In the newly introduced substructure 72 shown in fig. 7, the computation in which K_pre and V_pre participate, i.e. λAttn(Q, K_pre, V_pre), is similar to the computation (1 − λ)Attn(Q, K, V) inside the MHA; however, the newly introduced substructure 72 may act not only on the MHA but also on the FFN, i.e. K_pre and V_pre in the newly introduced substructure 72 can act not only on the MHA but also on the FFN. In addition, because the instantiation of the OP and the instantiation of the U-Tuner are decoupled, the U-Tuner can be inserted into or removed from the pre-training model more flexibly. The U-Tuner therefore has higher flexibility, which is beneficial when the model with the added U-Tuner is deployed to various downstream tasks.
The above embodiments describe that at least one structure to be trained may be added to the machine learning model to be trained, and optionally, the at least one structure to be trained includes at least one of a plurality of preset structures respectively; aiming at the structure to be trained and the target structure with corresponding relations, at least one preset structure included in the structure to be trained is connected with the target structure in parallel.
For example, the plurality of preset structures are the new incoming substructure 72, the new incoming substructure 92, and the new incoming substructure 112, respectively, as described above. Taking the machine learning model shown in fig. 4 as an example, some of the structures to be trained 42, 43, 44 may be empty, and other structures may be instantiated. Alternatively, the structure to be trained 42, the structure to be trained 43, and the structure to be trained 44 are instantiated respectively. When the structures to be trained 42, 43, 44 are instantiated, respectively, any one of the structures to be trained may be instantiated as a parallel combination of one or more of the new incoming substructure 72, 92, and 112. In addition, the instantiation process of each structure to be trained can be independent and not affected.
When the MHA in the machine learning model shown in fig. 4 is denoted as the target structure, the target structure may be 71 shown in fig. 12, and the structure to be trained corresponding to the target structure may be a parallel combination of the newly introduced substructure 72 and the newly introduced substructure 112 as shown in fig. 12. And the new incoming substructure 72 and the new incoming substructure 112 are connected in parallel with the target structure, i.e. the target structure (71 as shown in fig. 12), the new incoming substructure 72 and the new incoming substructure 112, respectively, are connected in parallel.
Optionally, a first preset structure of the plurality of preset structures includes a first parameter to be trained, a second parameter to be trained, and a first known parameter; when the structure to be trained comprises the first preset structure, the processing of the input data by the first preset structure comprises the following steps: calculating to obtain a first intermediate value according to the input data and the first known parameter; and calculating a processing result of the first preset structure on the input data according to the first parameter to be trained, the second parameter to be trained and the first intermediate value.
For example, when the existing operation 71 shown in fig. 7 is summarized and abstracted as the OP, the newly introduced substructure 72 may be summarized and abstracted as a preset structure 131 as shown in fig. 13, and the preset structure 131 is denoted as the first preset structure. As shown in fig. 13, the first preset structure includes a first parameter to be trained K_pre and a second parameter to be trained V_pre; in addition, the first preset structure may further include a first known parameter, which is W_q. The processing of the input data x by the first preset structure may specifically be: calculating a first intermediate value Q according to the input data x and the first known parameter W_q, e.g. multiplying x by W_q to obtain Q; and then calculating, according to the first parameter to be trained K_pre, the second parameter to be trained V_pre and the first intermediate value Q, the processing result of the first preset structure on the input data x, which may be Attn(Q, K_pre, V_pre).
Optionally, a second preset structure of the plurality of preset structures includes a third parameter to be trained, a first known parameter, a second known parameter, and a third known parameter, where a dimension of the parameter to be trained is the same as a dimension of the input data; when the structure to be trained comprises the second preset structure, the processing of the input data by the second preset structure comprises the following steps: calculating to obtain a first intermediate value according to the input data and the first known parameter; calculating a second intermediate value according to the third parameter to be trained and the second known parameter; calculating a third intermediate value according to the third parameter to be trained and the third known parameter; and calculating a processing result of the second preset structure on the input data according to the first intermediate value, the second intermediate value and the third intermediate value.
For example, when the existing operation 91 shown in fig. 9 is summarized and abstracted as OP, the newly introduced substructure 92 may be summarized and abstracted as the preset structure 141 shown in fig. 14, and the preset structure 141 is denoted as the second preset structure. As shown in fig. 14, the second preset structure includes a learnable parameter x_pro, which is recorded as the third to-be-trained parameter. In addition, the second preset structure may further include a first known parameter, a second known parameter and a third known parameter, where the first known parameter may be W_q, the second known parameter may be W_k, and the third known parameter may be W_v. The dimension of the learnable parameter x_pro is the same as the dimension of the input data x. The processing of the input data x by the second preset structure may specifically be: calculating a first intermediate value Q from the input data x and the first known parameter W_q, for example multiplying x by W_q to obtain Q; calculating a second intermediate value K_pro = x_pro·W_k from the learnable parameter x_pro and the second known parameter W_k; calculating a third intermediate value V_pro = x_pro·W_v from the learnable parameter x_pro and the third known parameter W_v; and further calculating, from the first intermediate value Q, the second intermediate value K_pro and the third intermediate value V_pro, the processing result of the second preset structure on the input data, which is recorded as Attn(Q, K_pro, V_pro).
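For illustration only, a matching sketch of the second preset structure is given below; the class name, the number of learnable tokens and the random initialization of the known projections are assumptions, while x_pro, W_q, W_k and W_v follow the notation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondPresetStructure(nn.Module):
    """Sketch of the second preset structure 141: a learnable tensor x_pro (same feature
    dimension as the input) is projected by the known W_k / W_v and attended by Q = x W_q."""

    def __init__(self, dim: int, num_tokens: int = 8):
        super().__init__()
        # Known (frozen) parameters shared with the target structure.
        self.W_q = nn.Parameter(torch.randn(dim, dim), requires_grad=False)
        self.W_k = nn.Parameter(torch.randn(dim, dim), requires_grad=False)
        self.W_v = nn.Parameter(torch.randn(dim, dim), requires_grad=False)
        # Third parameter to be trained.
        self.x_pro = nn.Parameter(torch.zeros(num_tokens, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        Q = x @ self.W_q                                 # first intermediate value
        K_pro = self.x_pro @ self.W_k                    # second intermediate value
        V_pro = self.x_pro @ self.W_v                    # third intermediate value
        scores = Q @ K_pro.t() / (Q.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ V_pro         # processing result Attn(Q, K_pro, V_pro)
```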
Optionally, a third preset structure in the plurality of preset structures includes a fourth parameter to be trained and a fifth parameter to be trained; when the structure to be trained comprises the third preset structure, the processing of the input data by the third preset structure comprises the following steps: calculating a fourth intermediate value according to the fourth parameter to be trained and the input data; performing nonlinear processing on the fourth intermediate value to obtain a nonlinear processing result; and calculating the processing result of the third preset structure on the input data according to the nonlinear processing result and the fifth parameter to be trained.
For example, when the existing operation 111 shown in fig. 11 is summarized and abstracted as OP, the newly introduced substructure 112 may be summarized and abstracted as the preset structure 151 shown in fig. 15, and the preset structure 151 is denoted as the third preset structure. As shown in fig. 15, the third preset structure includes a fourth to-be-trained parameter W_down and a fifth to-be-trained parameter W_up. The processing of the input data x by the third preset structure may specifically be: calculating a fourth intermediate value, for example x·W_down, from the fourth to-be-trained parameter W_down and the input data x; performing nonlinear processing on the fourth intermediate value to obtain a nonlinear processing result, for example φ(x·W_down), where φ denotes the nonlinear function; and calculating, from the nonlinear processing result φ(x·W_down) and the fifth to-be-trained parameter W_up, the processing result of the third preset structure on the input data, for example φ(x·W_down)·W_up.
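For illustration only, the third preset structure may be sketched as a bottleneck module as follows; the bottleneck width, the initialization and the choice of GELU as the nonlinearity φ are assumptions, while W_down and W_up follow the notation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThirdPresetStructure(nn.Module):
    """Sketch of the third preset structure 151: a low-rank bottleneck with a
    nonlinearity, computing phi(x W_down) W_up."""

    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.W_down = nn.Parameter(torch.randn(dim, bottleneck) * 0.02)  # fourth parameter to be trained
        self.W_up = nn.Parameter(torch.zeros(bottleneck, dim))           # fifth parameter to be trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x @ self.W_down          # fourth intermediate value x W_down
        h = F.gelu(h)                # nonlinear processing (GELU is an assumed choice of phi)
        return h @ self.W_up         # processing result phi(x W_down) W_up
```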
Optionally, the first processing result of the to-be-trained structure on the input data includes processing results of the at least one preset structure on the input data respectively; adjusting a second processing result of the target structure on the input data according to a first processing result of the structure to be trained on the input data, including: and adjusting a second processing result of the target structure on the input data according to the scaling factors respectively corresponding to the at least one preset structure and the processing results of the at least one preset structure on the input data respectively.
In addition, as shown in fig. 13, 14 and 15, a scaling factor Scale is included in each of the first preset structure 131, the second preset structure 141 and the third preset structure 151. The scaling factor represents the degree of adjustment when the processing result of the OP on the input data is adjusted according to the processing result of the corresponding preset structure on the input data; for example, the scaling factor Scale in the first preset structure 131 represents the degree of adjustment when the processing result of the OP on the input data is adjusted according to the processing result of the first preset structure 131 on the input data. For example, the result output by the operator 132 may be expressed as the following formula (8):
MHA_pre = (1 - λ)·Attn(Q, K, V) + Scale·λ·Attn(Q, K_pre, V_pre)    (8)
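As a hedged illustration, formula (8) can be written as a one-line function; attn_out and pre_out stand for Attn(Q, K, V) and Attn(Q, K_pre, V_pre) respectively, and lam denotes λ (the function name and argument names are assumptions).

```python
import torch

def adjusted_mha_output(attn_out: torch.Tensor,
                        pre_out: torch.Tensor,
                        scale: float,
                        lam: float) -> torch.Tensor:
    """Sketch of formula (8): blend the target structure's output with the
    first preset structure's output, weighted by lam and the scaling factor."""
    return (1.0 - lam) * attn_out + scale * lam * pre_out
```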
For example, as shown in fig. 16, the structure to be trained corresponding to the OP includes the second preset structure 141 and the third preset structure 151, and the first processing result of the structure to be trained on the input data includes the processing results of the second preset structure 141 and the third preset structure 151 on the input data respectively. Assuming that the processing result of the OP on the input data is recorded as the second processing result, when the second processing result is adjusted according to the first processing result, it may specifically be adjusted according to the scaling factor scale_1 corresponding to the third preset structure 151, the scaling factor scale_2 corresponding to the second preset structure 141, and the processing results of the second preset structure 141 and the third preset structure 151 on the input data respectively. For example, the second processing result, i.e. the processing result of the OP on the input data, is Attn(Q, K, V); the processing result of the second preset structure 141 on the input data is Attn(Q, K_pro, V_pro); and the processing result of the third preset structure 151 on the input data is φ(x·W_down)·W_up. The adjusted second processing result is the output of the operator 161, which may, for example, be expressed as the following formula (9):

Attn(Q, K, V) + scale_2·Attn(Q, K_pro, V_pro) + scale_1·φ(x·W_down)·W_up    (9)
in addition, after each structure to be trained is instantiated on the basis of fig. 4, a schematic diagram shown in fig. 17 may be obtained, specifically, the structure to be trained 42 is instantiated as a third preset structure 151, the structure to be trained 43 is instantiated as a first preset structure 131, and the structure to be trained 44 is instantiated as a second preset structure 141. The scaling factor corresponding to the third preset structure 151 is denoted as scale_1, the scaling factor corresponding to the second preset structure 141 is denoted as scale_2, and the scaling factor corresponding to the first preset structure 131 is denoted as scale_3.
This embodiment can further improve flexibility and the balance between different models through the scaling factors, and a channel-wise scaling factor can be introduced for the parallel U-Tuner. In particular, the scaling factor may be a fixed constant, a single learnable variable or parameter, or data derived from the input data (for example, data obtained by compressing the input data).
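As a sketch of these three options, a scaling factor could be implemented as the module below; the module name, the feature dimension, the mean-pooling compression and the sigmoid squashing are assumptions, not the embodiment's exact design.

```python
import torch
import torch.nn as nn

class ScaleFactor(nn.Module):
    """Sketch of the three scaling-factor variants: a fixed constant, a single
    learnable parameter, or a value derived from (a compressed view of) the input."""

    def __init__(self, mode: str = "learnable", dim: int = 768, constant: float = 1.0):
        super().__init__()
        self.mode = mode
        if mode == "constant":
            self.register_buffer("value", torch.tensor(constant))
        elif mode == "learnable":
            self.value = nn.Parameter(torch.tensor(constant))
        elif mode == "input":
            self.proj = nn.Linear(dim, dim)   # predicts a channel-wise scale from the input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.mode == "input":
            # Compress the input over the token dimension, then map to per-channel scales.
            return torch.sigmoid(self.proj(x.mean(dim=1, keepdim=True)))
        return self.value
```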
In addition, since the present embodiment freezes most of the parameters of the pre-training model and only adjusts a small portion of its parameters, or adds a small number of new parameters, better results can be achieved while introducing only a small fraction of trainable parameters.
The present embodiment proposes a unified paradigm, U-Tuning, whose two core parts, the OP and the U-Tuner, represent the sub-module to be optimized and the parallel tuning structure respectively; the U-Tuner may be independent of the main-branch OP. The two parts, OP and U-Tuner, are not limited to the methods described above and can be extended to other methods. In addition, the U-Tuner is connected to the original module, e.g. the OP, through a residual-style operator (e.g., operator 161) to achieve plug and play, so that the optimized pre-training model can be quickly adapted to various downstream tasks. Experiments show that this paradigm obtains good results in transfer learning and provides a new perspective for parameter-efficient tuning methods.
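For illustration, the parallel connection of an OP and its U-Tuners through a residual-style operator could look like the following sketch; the class name and the plain list of scale factors are assumptions, not the embodiment's exact interface.

```python
from typing import List
import torch
import torch.nn as nn

class UTuningBlock(nn.Module):
    """Sketch of the parallel U-Tuning pattern: a frozen target operation (OP) and one
    or more trainable U-Tuner structures process the same input, and the U-Tuner
    outputs are scaled and added to the OP output by a residual-style operator."""

    def __init__(self, op: nn.Module, tuners: List[nn.Module], scales: List[float]):
        super().__init__()
        self.op = op                          # target structure, e.g. an MHA or FFN
        self.tuners = nn.ModuleList(tuners)   # structures to be trained
        self.scales = scales                  # one scaling factor per U-Tuner

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.op(x)                      # second processing result
        for tuner, scale in zip(self.tuners, self.scales):
            out = out + scale * tuner(x)      # adjust with each first processing result
        return out
```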
For example, after the structure to be trained in the machine learning model shown in fig. 4 is trained, a trained machine learning model may be obtained, where the trained structure to be trained may be referred to as a trained structure. The process of performing downstream tasks using the trained machine learning model is the use phase or reasoning phase of the trained machine learning model.
Fig. 18 is a flowchart of a machine learning model reasoning method provided in another embodiment of the present disclosure. The machine learning model comprises a pre-training model and at least one trained structure, wherein the at least one trained structure is trained according to the training method, the pre-training model comprises at least one target structure, the at least one trained structure corresponds to the at least one target structure one by one, and the trained structure and the target structure which have corresponding relations are connected in parallel. In this embodiment, the method may be performed by a machine learning model inference device, which may be implemented in software and/or hardware, where the device may be configured in an electronic apparatus, for example, a server or a terminal, where the terminal specifically includes a mobile phone, a computer, a tablet computer, or the like. The server may be a cloud server, and the machine learning model reasoning method may be executed by the cloud. Specifically, the method comprises the following specific steps:
S1701, acquiring target data of a target task.
For example, the target task is an image classification task, and the target data is a target image to be classified in the image classification task. The server 22 may acquire the target image from the terminal 21 or other servers.
S1702, inputting the target data into the machine learning model, so that the trained structure and the target structure with corresponding relations process input data in parallel, wherein the input data is the target data or data obtained according to the target data.
For example, the trained machine learning model may be maintained local to the server 22, and after the server 22 obtains the target image, the target image is input into the trained machine learning model, so that the trained structure and the target structure having the correspondence in the machine learning model process input data in parallel, the input data being the data input to the trained structure and the data input to the target structure, and since the trained structure and the target structure having the correspondence are connected in parallel, the data input to the trained structure and the data input to the target structure are the same. Specifically, the input data may be a target image or data obtained from a target image. The data obtained from the target image is, for example, a representation vector of the target image or a result of processing the target image by other intermediate layers in the machine learning model.
S1703, adjusting a second processing result of the target structure on the input data according to the first processing result of the trained structure on the input data, and obtaining an adjusted processing result.
For example, according to the first processing result of the trained structure on the input data, the second processing result of the target structure on the input data is adjusted, and the adjusted processing result is obtained. The adjustment process is similar to the process of adjusting the second processing result of the target structure on the input data according to the first processing result of the structure to be trained on the input data as described above, and will not be repeated here.
S1704, determining an output result of the target data according to the adjusted processing result.
Specifically, the adjusted processing result may participate in the calculation from the input of the machine learning model to the output information of the machine learning model, i.e. the output of each trained structure may affect the output information of the machine learning model. The output information of the machine learning model is an image processing result of the machine learning model on the target image, and the image processing result may be a classification result, a detection result, a segmentation result, or the like for the target image.
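As a hedged usage example of the reasoning stage, where tuned_model and target_image are hypothetical names for the trained machine learning model and a preprocessed target image tensor:

```python
import torch

# Hypothetical names: tuned_model is the trained machine learning model and
# target_image is a preprocessed image tensor of the target task.
tuned_model.eval()
with torch.no_grad():
    output = tuned_model(target_image.unsqueeze(0))  # S1702/S1703 happen inside the forward pass
    result = output.argmax(dim=-1)                   # S1704: output result for the target data
```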
According to the embodiment, at least one trained structure is added on the basis of the pre-training model, so that at least one target structure in the pre-training model corresponds to the at least one trained structure one by one, the trained structures with corresponding relations and the target structures are connected in parallel, and the trained structures with corresponding relations and the target structures can process input data in parallel. Further, according to the first processing result of the trained structure on the input data, the second processing result of the target structure on the input data is adjusted, so that final output information of the machine learning model is more accurate.
Fig. 19 is a flowchart of a machine learning model construction method according to another embodiment of the present disclosure. The method can be executed by a machine learning model construction device, the device can be realized in a software and/or hardware mode, the device can be configured in an electronic device, such as a server or a terminal, wherein the terminal specifically comprises a mobile phone, a computer or a tablet computer, etc. The server may specifically be a cloud server, and the machine learning model building method may be executed by the cloud. In this embodiment, the method specifically includes the following steps:
S1801, determining at least one target structure to be adjusted in the pre-training model according to the target task.
For example, taking the pre-training model shown in fig. 3 as an example, when the pre-training model is applied to a target task, tuning training of the pre-training model is required. Specifically, at least one target structure to be adjusted in the pre-training model may be determined according to the target task, e.g., the at least one target structure to be adjusted is MHA, FFN and sub-block as shown in fig. 3.
S1802, at least one structure to be trained is obtained, so that the at least one structure to be trained corresponds to the at least one target structure one by one.
As shown in fig. 3, assuming that there are 3 target structures to be adjusted, in order to connect one structure to be trained to each target structure in parallel, the cloud end needs to acquire 3 structures to be trained, so that the 3 structures to be trained and the 3 target structures are in one-to-one correspondence.
S1803, connecting the structure to be trained with the target structure in a corresponding relation in parallel to obtain a machine learning model, wherein the machine learning model comprises the pre-training model and the at least one structure to be trained, and the training method for the at least one structure to be trained adopts the training method.
For example, each target structure is connected with a structure to be trained in parallel, so as to obtain a machine learning model shown in fig. 4. The machine learning model comprises a pre-training model and 3 structures to be trained, and the training method for the 3 structures to be trained can adopt the model training method as described above, and is not repeated here.
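For illustration only, the construction step could be sketched as follows, reusing the UTuningBlock sketch shown earlier; the function name, the attribute-based lookup of target structures and the make_tuner callback are assumptions rather than the embodiment's exact interface.

```python
import torch.nn as nn

def build_machine_learning_model(pretrained: nn.Module, target_names, make_tuner):
    """Sketch of the construction method: freeze the pre-training model, then connect
    one structure to be trained in parallel with each target structure."""
    for p in pretrained.parameters():
        p.requires_grad = False                          # keep the pre-training model frozen
    for name in target_names:                            # S1801: target structures to adjust
        target = getattr(pretrained, name)               # e.g. "mha", "ffn" (assumed attribute names)
        tuner = make_tuner(target)                       # S1802: obtain a structure to be trained
        setattr(pretrained, name, UTuningBlock(target, [tuner], [1.0]))  # S1803: connect in parallel
    return pretrained
```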
According to the embodiment, at least one target structure to be adjusted in the pre-training model is determined according to the target task. Further, at least one structure to be trained is obtained, so that the at least one structure to be trained corresponds to the at least one target structure one by one, the structure to be trained and the target structure with corresponding relations are connected in parallel, a machine learning model is obtained, and the structure of the machine learning model can be suitable for the requirements of the target task.
Since the process of performing downstream tasks using the trained machine learning model is the use or reasoning phase of the trained machine learning model. In the use phase or reasoning phase, the downstream task is not limited to an image processing task, but may also be, for example, a video processing task, a text processing task, a multimodal processing task, etc. That is, in the use phase or the inference phase of the trained machine learning model, the input of the trained machine learning model is not limited to the input of the modality of the image, but may be the input of other modalities, for example, video input, text input, multimodal input, and the like. Wherein the multi-modality may be at least two of a plurality of modalities such as image, video, text, audio, etc. In the following, an image processing task is taken as an example, and in the use stage or the reasoning stage, a method for processing an image by using a machine learning model completed through the training is described.
Fig. 20 is a flowchart of an image processing method according to another embodiment of the present disclosure. The method may be performed by a cloud server. The method comprises the following specific steps:
s1901, acquiring a target image to be processed.
For example, as shown in fig. 21, the cloud server 202 receives a target image to be processed from the terminal 201.
S1902, inputting the target image into a machine learning model, wherein the machine learning model comprises a pre-training model and at least one trained structure, the at least one trained structure is obtained by training according to the training method, the pre-training model comprises at least one target structure, the at least one trained structure and the at least one target structure are in one-to-one correspondence, the trained structure and the target structure with corresponding relations are connected in parallel, and the trained structure and the target structure with corresponding relations process input data in parallel, and the input data is the target image or data obtained according to the target image.
For example, cloud server 202 may input the target image into a machine learning model, which may be a trained machine learning model. The machine learning model includes a pre-trained model and at least one trained structure. The trained structure is obtained by training a structure to be trained similar to that in fig. 4 using the training method described above. The pre-training model comprises at least one target structure, the at least one trained structure and the at least one target structure are in one-to-one correspondence, the trained structure and the target structure which have a corresponding relationship are connected in parallel, and the trained structure and the target structure which have a corresponding relationship process input data in parallel, wherein the input data are data input to the trained structure and data input to the target structure, and the data input to the trained structure and the data input to the target structure are the same because the trained structure and the target structure which have the corresponding relationship are connected in parallel. Specifically, the input data may be a target image or data obtained from a target image. The data obtained from the target image is, for example, a representation vector of the target image or a result of processing the target image by other intermediate layers in the machine learning model.
S1903, adjusting a second processing result of the target structure on the input data according to the first processing result of the trained structure on the input data, and obtaining an adjusted processing result.
For example, according to the first processing result of the trained structure on the input data, the second processing result of the target structure on the input data is adjusted, and the adjusted processing result is obtained. The adjustment process is similar to the process of adjusting the second processing result of the target structure on the input data according to the first processing result of the structure to be trained on the input data as described above, and will not be repeated here.
S1904, determining an image processing result of the target image according to the adjusted processing result.
For example, the adjusted processing results may be involved in the calculation of output information from the input of the machine learning model to the machine learning model, i.e., the output of each trained structure may affect the output information of the machine learning model. The output information of the machine learning model is an image processing result of the machine learning model on the target image. For example, the target image is a scene image, and the image processing result of the machine learning model on the target image is a classification label of the target image, such as a nature landscape.
According to the embodiment, at least one trained structure is added on the basis of the pre-training model, so that at least one target structure in the pre-training model corresponds to the at least one trained structure one by one, the trained structures with corresponding relations and the target structures are connected in parallel, and the trained structures with corresponding relations and the target structures can process input data in parallel. Further, according to the first processing result of the trained structure on the input data, the second processing result of the target structure on the input data is adjusted, so that the image processing result of the machine learning model on the target image is more accurate.
Fig. 22 is a schematic structural diagram of a machine learning model training device according to an embodiment of the present disclosure. The machine learning model comprises a pre-training model and at least one structure to be trained, the pre-training model comprises at least one target structure, the at least one structure to be trained corresponds to the at least one target structure one by one, and the structure to be trained and the target structure with corresponding relations are connected in parallel. The machine learning model training apparatus provided in the embodiment of the present disclosure may execute the processing flow provided in the embodiment of the machine learning model training method, as shown in fig. 22, the machine learning model training apparatus 210 includes:
The acquiring module 211 is configured to acquire sample data of a target task, where the sample data includes data to be processed and a labeling result of the data to be processed;
the input module 212 is configured to input the data to be processed into the machine learning model, so that the structure to be trained and the target structure having a correspondence relationship process input data in parallel, where the input data is the data to be processed or data obtained according to the data to be processed;
a first adjustment module 213, configured to adjust a second processing result of the input data by the target structure according to a first processing result of the input data by the structure to be trained, to obtain an adjusted processing result, where output information of the machine learning model is related to the adjusted processing result;
and the second adjustment module 214 is configured to adjust parameters in the at least one structure to be trained according to the output information and the labeling result.
Optionally, the pre-training model comprises a plurality of sub-blocks, the sub-blocks comprising a multi-head attention layer and a feed-forward network, the multi-head attention layer or the feed-forward network comprising a plurality of sub-layers;
the at least one target structure is at least one of the sub-layer, the multi-headed attention layer, the feed forward network, the sub-block, the pre-trained model.
Optionally, the at least one structure to be trained includes at least one of a plurality of preset structures respectively;
aiming at the structure to be trained and the target structure with corresponding relations, at least one preset structure included in the structure to be trained is connected with the target structure in parallel.
Optionally, the first processing result of the to-be-trained structure on the input data includes processing results of the at least one preset structure on the input data respectively; the first adjustment module 213 is specifically configured to, when adjusting the second processing result of the input data by the target structure according to the first processing result of the input data by the structure to be trained:
and adjusting a second processing result of the target structure on the input data according to the scaling factors respectively corresponding to the at least one preset structure and the processing results of the at least one preset structure on the input data respectively.
Optionally, a first preset structure of the plurality of preset structures includes a first parameter to be trained, a second parameter to be trained, and a first known parameter; when the structure to be trained comprises the first preset structure, the processing of the input data by the first preset structure comprises the following steps:
Calculating to obtain a first intermediate value according to the input data and the first known parameter;
and calculating a processing result of the first preset structure on the input data according to the first parameter to be trained, the second parameter to be trained and the first intermediate value.
Optionally, a second preset structure of the plurality of preset structures includes a third parameter to be trained, a first known parameter, a second known parameter, and a third known parameter, where the dimension of the third parameter to be trained is the same as the dimension of the input data;
when the structure to be trained comprises the second preset structure, the processing of the input data by the second preset structure comprises the following steps:
calculating to obtain a first intermediate value according to the input data and the first known parameter;
calculating a second intermediate value according to the third parameter to be trained and the second known parameter;
calculating a third intermediate value according to the third parameter to be trained and the third known parameter;
and calculating a processing result of the second preset structure on the input data according to the first intermediate value, the second intermediate value and the third intermediate value.
Optionally, a third preset structure in the plurality of preset structures includes a fourth parameter to be trained and a fifth parameter to be trained;
When the structure to be trained comprises the third preset structure, the processing of the input data by the third preset structure comprises the following steps:
calculating a fourth intermediate value according to the fourth parameter to be trained and the input data;
performing nonlinear processing on the fourth intermediate value to obtain a nonlinear processing result;
and calculating the processing result of the third preset structure on the input data according to the nonlinear processing result and the fifth parameter to be trained.
The machine learning model training device of the embodiment shown in fig. 22 may be used to implement the technical solution of the above method embodiment, and its implementation principle and technical effects are similar, and will not be described herein again.
The foregoing describes the internal functions and structure of a machine learning model training apparatus that may be implemented as an electronic device. Fig. 23 is a schematic structural diagram of an embodiment of an electronic device according to an embodiment of the disclosure. As shown in fig. 23, the electronic device includes a memory 221 and a processor 222.
The memory 221 is used to store programs. In addition to the programs described above, the memory 221 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like.
The memory 221 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The processor 222 is coupled to the memory 221, and executes programs stored in the memory 221 for:
acquiring sample data of a target task, wherein the sample data comprises data to be processed and a labeling result of the data to be processed;
inputting the data to be processed into the machine learning model, so that the structure to be trained and the target structure with corresponding relations process input data in parallel, wherein the input data is the data to be processed or data obtained according to the data to be processed;
adjusting a second processing result of the target structure on the input data according to a first processing result of the structure to be trained on the input data to obtain an adjusted processing result, wherein the output information of the machine learning model is related to the adjusted processing result;
And adjusting parameters in the at least one structure to be trained according to the output information and the labeling result.
Further, as shown in fig. 23, the electronic device may further include: communication component 223, power supply component 224, audio component 225, display 226, and the like. Only some of the components are schematically shown in fig. 23, which does not mean that the electronic device only comprises the components shown in fig. 23.
The communication component 223 is configured to facilitate communication between the electronic device and other devices, either wired or wireless. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 223 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 223 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply component 224 provides power to the various components of the electronic device. The power components 224 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic devices.
The audio component 225 is configured to output and/or input audio signals. For example, the audio component 225 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 221 or transmitted via the communication component 223. In some embodiments, audio component 225 further includes a speaker for outputting audio signals.
The display 226 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
In addition, the embodiment of the present disclosure also provides a computer-readable storage medium having stored thereon a computer program that is executed by a processor to implement the machine learning model training method, the machine learning model reasoning method, the machine learning model construction method, or the image processing method described in the above embodiments.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. A machine learning model training method, wherein the machine learning model comprises a pre-training model and at least one structure to be trained, the pre-training model comprises at least one target structure, the at least one structure to be trained and the at least one target structure are in one-to-one correspondence, and the structure to be trained and the target structure which have a corresponding relationship are connected in parallel, the method comprising:
acquiring sample data of a target task, wherein the sample data comprises data to be processed and a labeling result of the data to be processed;
inputting the data to be processed into the machine learning model, so that the structure to be trained and the target structure with corresponding relations process input data in parallel, wherein the input data is the data to be processed or data obtained according to the data to be processed;
adjusting a second processing result of the target structure on the input data according to a first processing result of the structure to be trained on the input data to obtain an adjusted processing result, wherein the output information of the machine learning model is related to the adjusted processing result;
and adjusting parameters in the at least one structure to be trained according to the output information and the labeling result.
2. The method of claim 1, wherein the pre-training model comprises a plurality of sub-blocks, the sub-blocks comprising a multi-head attention layer and a feed-forward network, the multi-head attention layer or the feed-forward network comprising a plurality of sub-layers;
the at least one target structure is at least one of the sub-layer, the multi-headed attention layer, the feed forward network, the sub-block, the pre-trained model.
3. The method of claim 1, wherein the at least one structure to be trained comprises at least one of a plurality of preset structures, respectively;
aiming at the structure to be trained and the target structure with corresponding relations, at least one preset structure included in the structure to be trained is connected with the target structure in parallel.
4. A method according to claim 3, wherein the first processing result of the input data by the structure to be trained comprises the processing result of the input data by the at least one preset structure respectively;
adjusting a second processing result of the target structure on the input data according to a first processing result of the structure to be trained on the input data, including:
and adjusting a second processing result of the target structure on the input data according to the scaling factors respectively corresponding to the at least one preset structure and the processing results of the at least one preset structure on the input data respectively.
5. A method according to claim 3, wherein a first preset structure of the plurality of preset structures comprises a first parameter to be trained, a second parameter to be trained, and a first known parameter;
when the structure to be trained comprises the first preset structure, the processing of the input data by the first preset structure comprises the following steps:
calculating to obtain a first intermediate value according to the input data and the first known parameter;
and calculating a processing result of the first preset structure on the input data according to the first parameter to be trained, the second parameter to be trained and the first intermediate value.
6. A method according to claim 3, wherein a second preset structure of the plurality of preset structures comprises a third parameter to be trained, a first known parameter, a second known parameter and a third known parameter, the dimension of the third parameter to be trained being the same as the dimension of the input data;
when the structure to be trained comprises the second preset structure, the processing of the input data by the second preset structure comprises the following steps:
calculating to obtain a first intermediate value according to the input data and the first known parameter;
calculating a second intermediate value according to the third parameter to be trained and the second known parameter;
Calculating a third intermediate value according to the third parameter to be trained and the third known parameter;
and calculating a processing result of the second preset structure on the input data according to the first intermediate value, the second intermediate value and the third intermediate value.
7. A method according to claim 3, wherein a third preset structure of the plurality of preset structures comprises a fourth to-be-trained parameter and a fifth to-be-trained parameter;
when the structure to be trained comprises the third preset structure, the processing of the input data by the third preset structure comprises the following steps:
calculating a fourth intermediate value according to the fourth parameter to be trained and the input data;
performing nonlinear processing on the fourth intermediate value to obtain a nonlinear processing result;
and calculating the processing result of the third preset structure on the input data according to the nonlinear processing result and the fifth parameter to be trained.
8. A machine learning model reasoning method, wherein the machine learning model comprises a pre-trained model and at least one trained structure, the at least one trained structure being trained according to the method of any of claims 1-7, the pre-trained model comprising at least one target structure, the at least one trained structure and the at least one target structure being in one-to-one correspondence, the trained structure and target structure having a correspondence being connected in parallel, the method comprising:
Acquiring target data of a target task;
inputting the target data into the machine learning model, so that the trained structure and the target structure with corresponding relations process input data in parallel, wherein the input data is the target data or data obtained according to the target data;
adjusting a second processing result of the target structure on the input data according to a first processing result of the trained structure on the input data, and obtaining an adjusted processing result;
and determining an output result of the target data according to the adjusted processing result.
9. A machine learning model building method, wherein the method comprises:
determining at least one target structure to be adjusted in the pre-training model according to the target task;
acquiring at least one structure to be trained, so that the at least one structure to be trained corresponds to the at least one target structure one by one;
connecting the structure to be trained with the target structure in a corresponding relation in parallel to obtain a machine learning model, wherein the machine learning model comprises the pre-training model and the at least one structure to be trained, and the training method for the at least one structure to be trained adopts the method as set forth in any one of claims 1-7.
10. An image processing method, wherein the method comprises:
acquiring a target image to be processed;
inputting the target image into a machine learning model, the machine learning model comprising a pre-training model and at least one trained structure, the at least one trained structure being trained according to the method of any one of claims 1-7, the pre-training model comprising at least one target structure, the at least one trained structure and the at least one target structure being in one-to-one correspondence, the trained structure and the target structure having a correspondence being connected in parallel, and the trained structure and the target structure having a correspondence being processing input data in parallel, the input data being the target image or data derived from the target image;
adjusting a second processing result of the target structure on the input data according to a first processing result of the trained structure on the input data, and obtaining an adjusted processing result;
and determining an image processing result of the target image according to the adjusted processing result.
11. A machine learning model training apparatus, wherein the machine learning model includes a pre-training model and at least one structure to be trained, the pre-training model includes at least one target structure, the at least one structure to be trained and the at least one target structure are in one-to-one correspondence, and the structure to be trained and the target structure having a correspondence are connected in parallel, the apparatus comprising:
The acquisition module is used for acquiring sample data of a target task, wherein the sample data comprises data to be processed and a labeling result of the data to be processed;
the input module is used for inputting the data to be processed into the machine learning model, so that the structure to be trained and the target structure with corresponding relations process the input data in parallel, wherein the input data is the data to be processed or the data obtained according to the data to be processed;
the first adjusting module is used for adjusting a second processing result of the target structure on the input data according to a first processing result of the structure to be trained on the input data to obtain an adjusted processing result, and the output information of the machine learning model is related to the adjusted processing result;
and the second adjusting module is used for adjusting parameters in the at least one structure to be trained according to the output information and the labeling result.
12. An electronic device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-10.
13. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the method of any of claims 1-10.
CN202310208687.7A 2023-02-28 2023-02-28 Model training, reasoning and construction method, device, equipment and storage medium Active CN116432770B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310208687.7A CN116432770B (en) 2023-02-28 2023-02-28 Model training, reasoning and construction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310208687.7A CN116432770B (en) 2023-02-28 2023-02-28 Model training, reasoning and construction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116432770A true CN116432770A (en) 2023-07-14
CN116432770B CN116432770B (en) 2023-10-20

Family

ID=87080526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310208687.7A Active CN116432770B (en) 2023-02-28 2023-02-28 Model training, reasoning and construction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116432770B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782008A (en) * 2019-10-16 2020-02-11 北京百分点信息科技有限公司 Training method, prediction method and device of deep learning model
US20200104706A1 (en) * 2018-09-27 2020-04-02 Google Llc Parameter-Efficient Multi-Task and Transfer Learning
CN113065013A (en) * 2021-03-25 2021-07-02 携程计算机技术(上海)有限公司 Image annotation model training and image annotation method, system, device and medium
US20220012155A1 (en) * 2020-07-10 2022-01-13 Huazhong University Of Science And Technology Activity recognition model balanced between versatility and individuation and system thereof
CN114841282A (en) * 2022-05-20 2022-08-02 北京百度网讯科技有限公司 Training method of pre-training model, and generation method and device of solution model
CN115496970A (en) * 2022-10-21 2022-12-20 北京百度网讯科技有限公司 Training method of image task model, image recognition method and related device
CN115699041A (en) * 2020-06-05 2023-02-03 谷歌有限责任公司 Extensible transfer learning using expert models

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200104706A1 (en) * 2018-09-27 2020-04-02 Google Llc Parameter-Efficient Multi-Task and Transfer Learning
CN110782008A (en) * 2019-10-16 2020-02-11 北京百分点信息科技有限公司 Training method, prediction method and device of deep learning model
CN115699041A (en) * 2020-06-05 2023-02-03 谷歌有限责任公司 Extensible transfer learning using expert models
US20220012155A1 (en) * 2020-07-10 2022-01-13 Huazhong University Of Science And Technology Activity recognition model balanced between versatility and individuation and system thereof
CN113065013A (en) * 2021-03-25 2021-07-02 携程计算机技术(上海)有限公司 Image annotation model training and image annotation method, system, device and medium
CN114841282A (en) * 2022-05-20 2022-08-02 北京百度网讯科技有限公司 Training method of pre-training model, and generation method and device of solution model
CN115496970A (en) * 2022-10-21 2022-12-20 北京百度网讯科技有限公司 Training method of image task model, image recognition method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YU B X B, ET AL.: "Towards a unified view on visual parameter-efficient transfer learning", 《ARXIV》, pages 1 - 16 *

Also Published As

Publication number Publication date
CN116432770B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
KR102365890B1 (en) Method, apparatus and storage medium for training of neural network
US10650818B2 (en) Semi-supervised question answering machine
CN111242303B (en) Network training method and device, and image processing method and device
TWI721603B (en) Data processing method, data processing device, electronic equipment and computer readable storage medium
CN106547886A (en) Application program recommends method, device, terminal and server
CN112084344A (en) Knowledge graph reasoning method, device and storage medium
CN111968647B (en) Voice recognition method, device, medium and electronic equipment
US11900263B2 (en) Augmenting neural networks
US20190251440A1 (en) Method and system for reducing deep neural network architectures
JP2022500786A (en) Information processing methods and devices, electronic devices, storage media and computer programs
CN112786069A (en) Voice extraction method and device and electronic equipment
CN116737895A (en) Data processing method and related equipment
WO2023045137A1 (en) Data fusion method and apparatus, electronic device, and computer readable storage medium
CN111008213A (en) Method and apparatus for generating language conversion model
CN116432770B (en) Model training, reasoning and construction method, device, equipment and storage medium
US20230386470A1 (en) Speech instruction recognition method, electronic device, and non-transient computer readable storage medium
WO2021109867A1 (en) Image processing method and apparatus, computer readable storage medium and electronic device
CN107766944B (en) System and method for optimizing system function flow by utilizing API analysis
CN113486978B (en) Training method and device for text classification model, electronic equipment and storage medium
CN109766463B (en) Semi-supervised Hash learning method and device applied to image retrieval
CN111353585A (en) Structure searching method and device of neural network model
CN110020151A (en) A kind of data processing method, device, electronic equipment and storage medium
CN111626044A (en) Text generation method and device, electronic equipment and computer readable storage medium
CN115129838B (en) Model training and dialogue data processing method, device, equipment and storage medium
CN114219876B (en) Text merging method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant