CN116957006A - Training method, device, equipment, medium and program product of prediction model


Info

Publication number: CN116957006A
Application number: CN202310113390.2A
Authority: CN (China)
Prior art keywords: sub-network, ith, output, prediction
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 赵思杰, 宋奕兵
Current Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Original Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Application filed by Tencent Technology (Shenzhen) Co., Ltd.


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The application discloses a training method, apparatus, device, medium and program product for a prediction model, and relates to the technical field of artificial intelligence. The method comprises the following steps: acquiring a first prediction model; inputting second sample data of a second field into the first prediction model, and performing data prediction on the second sample data through n sub-networks to obtain a prediction result corresponding to the second sample data; determining a prediction loss value corresponding to the first prediction model based on the difference between the prediction result corresponding to the second sample data and the prediction label; obtaining an output gradient corresponding to the ith sub-network based on the prediction loss value and the output feature representation of the ith sub-network; and updating the ith bias parameter based on the output gradient corresponding to the ith sub-network to obtain a second prediction model. Adjusting the bias parameters with the output gradient avoids using the input feature representation/output feature representation for model parameter adjustment, which reduces the memory storage burden on the server.

Description

Training method, device, equipment, medium and program product of prediction model
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a training method, device, equipment, medium and program product of a prediction model.
Background
Transfer learning means that a model applied to a source data domain can, after its model parameters are adjusted, be applied to a target data domain, where the source data domain and the target data domain belong to two different types of domains. For example: a target prediction model originally suited to image classification in the image field is parameter-adjusted through transfer learning so that the adjusted target prediction model is suited to text analysis in the natural language processing field.
In the related art, when adjusting the parameters of a target prediction model, all parameters of the target prediction model are adjusted using a target data set corresponding to the target data domain, so that the target prediction model fully adapts to the target data domain.
However, this way of adjusting all parameters of the target prediction model occupies too much server memory, resulting in huge memory space overhead. As a result, the training process of the target prediction model cannot run normally on devices with limited memory resources, which both increases the storage burden on the computer and reduces the accessibility of model training.
Disclosure of Invention
The embodiment of the application provides a training method, apparatus, device, medium and program product for a prediction model, which can solve the problem that the training process of a prediction model occupies excessive server memory in a transfer learning scenario. The technical scheme is as follows:
in one aspect, a method for training a predictive model is provided, the method comprising:
obtaining a first prediction model, wherein model parameters of the first prediction model are obtained through training on first sample data of a first field, the first prediction model comprises n sub-networks, the ith sub-network is used for performing feature processing on an input feature representation through an ith bias parameter and an ith weight parameter to obtain an output feature representation, the ith bias parameter is used for performing offset adjustment on the input feature representation of the ith sub-network, the ith weight parameter is used for performing weight adjustment on the input feature representation of the ith sub-network, 1 ≤ i ≤ n, and i and n are integers;
inputting second sample data in a second field into the first prediction model, and carrying out data prediction on the second sample data through the n sub-networks to obtain a prediction result corresponding to the second sample data, wherein the second sample data is marked with a prediction label;
determining a prediction loss value corresponding to the first prediction model based on the difference between the prediction result corresponding to the second sample data and the prediction label;
obtaining an output gradient corresponding to the ith sub-network based on the predicted loss value and the output characteristic representation of the ith sub-network;
and updating the ith bias parameter based on the output gradient corresponding to the ith sub-network to obtain a second prediction model, wherein the second prediction model is used for performing data prediction on data of the second field.
In another aspect, there is provided a training apparatus for a predictive model, the apparatus comprising:
an acquisition module, configured to acquire a first prediction model, wherein model parameters of the first prediction model are obtained through training on first sample data of a first field, the first prediction model comprises n sub-networks, the ith sub-network is used for performing feature processing on an input feature representation through an ith bias parameter and an ith weight parameter to obtain an output feature representation, the ith bias parameter is used for performing offset adjustment on the input feature representation of the ith sub-network, the ith weight parameter is used for performing weight adjustment on the input feature representation of the ith sub-network, 1 ≤ i ≤ n, and i and n are integers;
an input module, configured to input second sample data of a second field into the first prediction model and perform data prediction on the second sample data through the n sub-networks to obtain a prediction result corresponding to the second sample data, the second sample data being labeled with a prediction label;
a determining module, configured to determine a prediction loss value corresponding to the first prediction model based on the difference between the prediction result corresponding to the second sample data and the prediction label;
the determining module is further configured to obtain an output gradient corresponding to the ith sub-network based on the prediction loss value and the output feature representation of the ith sub-network;
and an updating module, configured to update the ith bias parameter based on the output gradient corresponding to the ith sub-network to obtain a second prediction model, the second prediction model being used for performing data prediction on data of the second field.
In another aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement a method of training a predictive model as in any of the embodiments above.
In another aspect, a computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set loaded and executed by a processor to implement a method of training a predictive model as in any of the embodiments above is provided.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the training method of the predictive model according to any one of the above embodiments.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
A corresponding bias parameter and weight parameter are defined in each sub-network of the first prediction model applied to the first field; the bias parameter and the weight parameter are used to perform feature processing on the input feature representation of the sub-network to obtain its output feature representation. In the process of training the first prediction model with sample data of the second field, the output gradient corresponding to the ith sub-network is obtained from the prediction loss value of the first prediction model on the second field and the output feature representation of the ith sub-network, the ith bias parameter is updated accordingly, and a second prediction model applied to the second field is finally obtained. That is, when training the model in a transfer learning scenario, training is completed by using the output gradient of each sub-network to adjust the bias parameter of that sub-network. Adjusting the bias parameters with the output gradient avoids using the input feature representation/output feature representation for model parameter adjustment, so the input feature representation/output feature representation does not need to be stored for the parameter update. This reduces the memory storage burden on the server; moreover, updating only the bias parameters improves the training efficiency of the prediction model while reducing memory storage overhead.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a training process of a training method for a predictive model according to an exemplary embodiment of the application;
FIG. 2 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method of training a predictive model provided by an exemplary embodiment of the application;
FIG. 4 is a flow chart of a method of training a predictive model provided in another exemplary embodiment of the application;
FIG. 5 is a training process schematic of a training method for a predictive model according to another exemplary embodiment of the application;
FIG. 6 is a schematic diagram of a method of training a predictive model provided in accordance with yet another exemplary embodiment of the application;
FIG. 7 is a block diagram of a training apparatus for predictive models provided in accordance with an exemplary embodiment of the present application;
FIG. 8 is a block diagram of a training apparatus for predictive models provided in accordance with another exemplary embodiment of the present application;
fig. 9 is a block diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of promoting an understanding of the principles and advantages of the application, reference will now be made in detail to the embodiments of the application, some but not all of which are illustrated in the accompanying drawings. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," and the like in this disclosure are used for distinguishing between similar elements or items having substantially the same function and function, and it should be understood that there is no logical or chronological dependency between the terms "first," "second," and no limitation on the amount or order of execution.
Model-based transfer learning means that the model parameters of a prediction model are fine-tuned so that a prediction model obtained by training on a source data domain can be applied to a target data domain, where the data in the source data domain and the data in the target data domain belong to different domains.
In deep learning, a first prediction model is first obtained by training on large-scale sample data of a first field, and the model parameters of the first prediction model are then fine-tuned on sample data of a second field to adapt to the change of the data field from the first field to the second field. One widely used transfer learning method is to use the sample data of the second field to fully adjust all parameters of the first prediction model. However, this way of updating all parameters of the first prediction model is not friendly to the hardware resources of the server: updating all parameters increases the burden on the server's storage resources, and as the amount of sample data in the second field grows, the number of parameters involved in the training process grows as well. Moreover, for different downstream tasks in the second field, a separate storage space has to be set up in the server's storage resources for each downstream task to store all the parameters corresponding to that task, which results in huge storage space overhead.
In contrast, the training method of the prediction model provided by the embodiment of the application can greatly reduce the overhead burden on the server's storage resources. Referring to fig. 1, schematically, a first prediction model 110 is acquired, and the model parameters of the first prediction model 110 are obtained through training on first sample data of the first field. The first prediction model 110 includes n sub-networks, where the ith sub-network 101 corresponds to the ith bias parameter 1011 and the ith weight parameter 1012, 1 ≤ i ≤ n, and i and n are integers. In the ith sub-network 101, the ith bias parameter 1011 and the ith weight parameter 1012 are used to perform feature processing on the input feature representation of the ith sub-network 101 and output the output feature representation of the ith sub-network 101.
And inputting the second sample data 120 in the second field into the first prediction model 110, and performing data prediction on the second sample data 120 through n sub-networks, so as to obtain a prediction result corresponding to the second sample data 120. And determining a prediction loss value 130 corresponding to the first prediction model 110 according to the difference between the prediction label marked by the second sample data 120 and the prediction result. And obtaining an output gradient 140 corresponding to the ith sub-network according to the predicted loss value 130 and the output characteristic representation of the ith sub-network 101, and carrying out parameter updating on the ith bias parameter 1011 according to the output gradient 140 to finally obtain a second prediction model 150. The second prediction model 150 includes n sub-networks, and the bias parameter in each sub-network is updated by the output gradient corresponding to the sub-network, and the second prediction model 150 is used for performing data prediction on the data in the second field.
According to the training method of the prediction model, provided by the embodiment of the application, the storage cost of storage resources of a server can be reduced and the model training efficiency can be improved by only updating the bias parameters through the output gradient for each sub-network.
Fig. 2 is a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application, as shown in fig. 2, where the implementation environment includes a terminal 210, a server 220, and a communication network 230, where the terminal 210 and the server 220 are connected through the communication network 230. Alternatively, the communication network 230 may be a wired network or a wireless network, which is not limited herein.
In the embodiment of the present application, the terminal 210 is configured to send data to the server 220. Alternatively, the terminal 210 has installed therein a target application program having a data prediction function (e.g., image segmentation, text collation, text classification, etc.), which is not limited in this embodiment. The target application may be a conventional application, a cloud application, an applet or an application module in a host application, or a web platform, which is not limited in this embodiment.
The server 220 is configured to train the first prediction model to obtain a second prediction model, so as to provide the data prediction function to the terminal 210 through the second prediction model. Illustratively, the terminal 210 sends a first prediction model and second sample data of a second field to the server 220, where the first prediction model is a model obtained by training on first sample data of a first field, and the first prediction model includes n sub-networks, where the ith sub-network corresponds to an ith bias parameter and an ith weight parameter, and the ith bias parameter and the ith weight parameter perform feature processing on the input feature representation of the ith sub-network to obtain the output feature representation of the ith sub-network; the second sample data is labeled with a prediction label.
After receiving the first prediction model and the second sample data, the server 220 inputs the second sample data into the first prediction model, performs data prediction on the second sample data through n sub-networks to obtain a prediction result corresponding to the second sample data, and determines a prediction loss value of the first prediction model according to the difference between the prediction result corresponding to the second sample data and the prediction label. And obtaining an output gradient corresponding to the ith sub-network according to the predicted loss value and the output characteristic representation of the ith sub-network, so as to update parameters of the ith bias parameter according to the output gradient of the ith sub-network, and finally obtaining a second prediction model. Optionally, after the second prediction model is trained by the server 220, the second prediction model is sent to the terminal 210, or after the second prediction model is trained by the server 220, the second prediction model is stored for later use. Optionally, when the terminal 210 needs to predict the target data, the target data is sent to the server 220, the server 220 invokes the second prediction model to predict the target data, and finally obtains the predicted content related to the target data.
In some alternative embodiments, the first prediction model and the second sample data are data stored locally by the server, that is, the server may separately complete the training process of the second prediction model.
In some alternative embodiments, terminal 210 is, but is not limited to, a smart phone, tablet, notebook, desktop computer, smart home appliance, smart car terminal, smart speaker, smart voice interaction device, aircraft, etc.
It should be noted that the server 220 can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and can also be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), and basic cloud computing services such as big data and artificial intelligence platforms.
Cloud Technology refers to a hosting technology that unifies a series of resources such as hardware, software and network in a wide area network or a local area network to realize the computation, storage, processing and sharing of data. Cloud technology is the general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied on the basis of the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: background services of technical network systems, such as video websites, picture websites and other portal websites, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each item may carry its own identification mark in the future, which will need to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong back-end system support, which can only be realized through cloud computing. Optionally, server 220 may also be implemented as a node in a blockchain system.
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant countries and regions. For example, the first sample data, the first predictive model, and the second sample data referred to in the present application are all acquired with sufficient authorization.
With reference to the foregoing description and implementation environment, fig. 3 is a flowchart of a method for training a prediction model according to an embodiment of the present application, where the method may be applied to a terminal, a server, or both the terminal and the server, and the embodiment of the present application is described by taking the application of the method to the server as an example, where the method includes:
Step 310, acquiring a first prediction model.
The model parameters of the first prediction model are obtained through training on first sample data of the first field. The first prediction model comprises n sub-networks; the ith sub-network is used for performing feature processing on an input feature representation through an ith bias parameter and an ith weight parameter to obtain an output feature representation, the ith bias parameter is used for performing offset adjustment on the input feature representation of the ith sub-network, and the ith weight parameter is used for performing weight adjustment on the input feature representation of the ith sub-network, where 1 ≤ i ≤ n, and i and n are integers.
Optionally, the model parameters are all parameters in the first predictive model; alternatively, the model parameters are part of the parameters in the first predictive model.
Optionally, the fields include at least one of a computer vision field, a text classification field, a behavior recognition field, a natural language processing field, an indoor positioning field, a public opinion analysis field, a man-machine interaction field, and the like.
In some embodiments, the first predictive model is a neural network model for performing the first domain task because the first predictive model is trained from the first sample data of the first domain. Wherein the first domain task may be implemented as any one or more tasks in the natural language processing domain, including: entity recognition, semantic annotation, part-of-speech annotation, word segmentation, text classification, emotion analysis, natural language reasoning, question-answering, text semantic similarity analysis, machine translation, text summarization, image description generation, and the like, to which embodiments of the present application are not limited.
Illustratively, if the first domain is implemented as the medical domain, the first prediction model is implemented as a paper classification model for the medical domain. Accordingly, the first domain task may be implemented as classification of medical papers, that is, the input text of the first prediction model is a target paper, and the prediction result of the first prediction model is the class of the target paper, or the probability that the target paper belongs to a certain class.
Alternatively, the first sample data may be implemented as first task data, where the first task data is used for training on the first domain task. Optionally, the first task data is tag data associated with the first domain task, the tag data is used to indicate a reference execution result of the first sample data on the first domain task, and the first task data is associated with the first domain.
That is, the first prediction model is obtained by training a model that has already extracted features of first-domain data on the first domain task. For example: training a model that has extracted relevant vocabulary features of the medical field on a medical paper classification task yields a classification model for executing the medical paper classification task.
Optionally, the first prediction model includes n sub-networks connected in series in sequence, where "series" means that the input feature representation of the i-th sub-network is obtained from the output feature representation of the i-1 th sub-network, and the input feature representation of the i+1-th sub-network is obtained according to the output feature representation of the i-th sub-network; alternatively, the first prediction model includes n independent sub-networks, i.e., n sub-networks are not in communication with each other.
Illustratively, an ith sub-network in the first prediction model is taken as an example for explanation, and the ith sub-network comprises network layer structures such as a full connection layer, a convolution layer, a pooling layer, a normalization layer and the like.
Optionally, the ith bias parameter and the ith weight parameter corresponding to the ith sub-network are located in at least one network layer of the above multiple network layers, for example: a weight parameter and a bias parameter are defined in the full connection layer in the i-th sub-network, and a weight parameter and a bias parameter are also defined in the normalization layer in the i-th sub-network. When the weight parameter and the bias parameter are respectively defined in different network layers of the same sub-network, the weight parameter and the bias parameter in different network layers are the same or different.
Optionally, the ith bias parameter and the ith weight parameter corresponding to the ith sub-network are located in a designated network layer among the above multiple network layers, for example: only one weight parameter and one bias parameter are defined in the full connection layer in the ith sub-network.
Optionally, the bias parameter and the weight parameter are parameters additionally defined in the process of training the first prediction model on the task of the second field, that is, the weight parameter and the bias parameter do not exist in the first prediction model obtained by training on the first sample data of the first field; alternatively, the bias parameters and the weight parameters are parameters inherently present in the first prediction model.
Illustratively, the weight parameters and bias parameters in different sub-networks are the same or different.
In an alternative case, the weight parameters and the bias parameters are defined in a network layer structure at the same time in the same sub-network; alternatively, the weight parameter and the bias parameter are defined in different network layer structures, respectively.
In the process of defining the weight parameters and the bias parameters in the network layer of the sub-network, the initial parameter values corresponding to the weight parameters and the bias parameters in the network layer are preset.
Optionally, the weight parameter and the bias parameter are single parameters with specific parameter values; alternatively, the weight parameter and the bias parameter are parameter matrices, that is, the matrix entries are the parameter values, in which case the weight parameter may also be referred to as a weight parameter matrix and the bias parameter as a bias parameter matrix.
Illustratively, the input feature representation of the ith sub-network is the input of the ith sub-network, and the output feature representation of the ith sub-network is its output result. When the input feature representation is fed into the ith sub-network, it is weight-adjusted by the ith weight parameter corresponding to the ith sub-network, and the weight-adjusted input feature representation is then offset-adjusted by the ith bias parameter, yielding the output feature representation of the ith sub-network.
The ith weight parameter in the ith sub-network is used to weight-adjust the input feature representation, reflecting the importance of the input feature representation in the final output result of the first prediction model. The ith bias parameter in the ith sub-network is used to offset-adjust the input feature representation, which improves the fitting speed of the sub-network's neural network and thus the model accuracy of the sub-network. A sketch of this feature processing is given below.
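For illustration only (the embodiments do not specify a framework), this feature processing can be sketched in PyTorch as follows; the names SubNetwork, in_dim and out_dim are hypothetical:

```python
import torch
import torch.nn as nn

class SubNetwork(nn.Module):
    """Minimal sketch of one sub-network: weight adjustment followed by
    offset adjustment, i.e. y = Wx + b as described above."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # ith weight parameter: weight-adjusts the input feature representation
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)
        # ith bias parameter: offset-adjusts the weighted representation
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T + self.bias

x = torch.randn(4, 128)          # input feature representation (batch of 4)
y = SubNetwork(128, 128)(x)      # output feature representation
```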
Step 320, inputting the second sample data in the second field into the first prediction model, and performing data prediction on the second sample data through n sub-networks to obtain a prediction result corresponding to the second sample data.
Wherein the second sample data is labeled with a predictive tag.
In some embodiments, the second domain and the first domain belong to two different domains, for example: the first domain is a natural language processing domain, and the second domain is a visual processing domain.
Illustratively, the prediction label of the second sample data is the real result corresponding to the second sample data, for example: when the second sample data is an image containing a cat, the prediction label of the second sample data is "cat".
Optionally, the prediction labels of the second sample data are manually labeled in advance by a developer; alternatively, the prediction label of the second sample data is a prediction result output by another prediction model applied to the second field.
In some embodiments, the second sample data is implemented as sample data corresponding to a second domain task, where when the second domain task is implemented as a task in the visual processing domain, at least one of the tasks of image classification, object detection, semantic segmentation, and the like may be included.
That is, the second sample data is input into the first prediction model, and the prediction result of the second sample data obtained by performing data prediction on the second sample data through n sub-networks is the reference execution result of the first prediction model on the task in the second field.
Illustratively, the process of inputting the second sample data into the first prediction model and outputting the result is a forward propagation process of the first prediction model.
Step 330, determining a prediction loss value corresponding to the first prediction model based on the difference between the prediction result corresponding to the second sample data and the prediction label.
Illustratively, the prediction loss value is the overall loss value of the second sample data with respect to the first prediction model, that is, the loss value jointly corresponding to the n sub-networks. The magnitude of the prediction loss value indicates the degree of difference between the prediction result corresponding to the second sample data and the prediction label: the larger the prediction loss value, the larger the difference between the prediction result and the prediction label, and in that case the first prediction model fits the second-field task less well and processes it with lower accuracy.
In some embodiments, the difference value between the prediction result corresponding to the second sample data and the prediction label is calculated using a specified loss function, where the loss function comprises at least one of a number of different loss functions such as the mean square error loss function, the L1 distance loss function, the L2 distance loss function, the Smooth L1 (SmoothL1) loss function, the relative entropy loss function, the cross entropy loss function and the Softmax loss function.
Illustratively, the predictive loss value is a scalar.
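As a hedged illustration (the tensor shapes and the choice of cross-entropy are assumptions), computing such a scalar prediction loss value might look as follows:

```python
import torch
import torch.nn.functional as F

# hypothetical shapes: 8 samples of second-field data, 10 classes
logits = torch.randn(8, 10, requires_grad=True)   # prediction results
labels = torch.randint(0, 10, (8,))               # prediction labels

# cross-entropy, one of the loss functions listed above; the result is a scalar
loss = F.cross_entropy(logits, labels)
print(loss.item())
```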
Step 340, obtaining an output gradient corresponding to the ith sub-network based on the predicted loss value and the output characteristic representation of the ith sub-network.
Illustratively, the output gradient corresponding to the ith sub-network is the gradient of the prediction loss value with respect to the output feature representation of the ith sub-network, and it represents how strongly the prediction loss value varies when the output feature representation of the ith sub-network varies.
In some embodiments, the output gradient corresponding to the ith sub-network is obtained by taking the derivative of the prediction loss value with respect to the output feature representation of the ith sub-network.
Illustratively, the process of obtaining the output gradient corresponding to the ith sub-network according to the predicted loss value and the output characteristic representation of the ith sub-network is called a back propagation process of the first prediction model, wherein after the forward propagation processes corresponding to the n sub-networks in the first prediction model are sequentially completed, the back propagation process is sequentially performed from the nth sub-network.
Step 350, updating the ith bias parameter based on the output gradient corresponding to the ith sub-network to obtain a second prediction model.
The second prediction model is used for carrying out data prediction on the data in the second field.
Illustratively, after the output gradient corresponding to the ith sub-network is obtained, the parameter gradient corresponding to the ith bias parameter is obtained from that output gradient, gradient descent is performed on the ith bias parameter according to its parameter gradient, and the ith bias parameter is thereby updated.
In some embodiments, the second prediction model is obtained after the bias parameters corresponding to each of the n sub-networks have been updated, where the order in which the bias parameters of the n sub-networks are updated is opposite to the order in which the output feature representations of the n sub-networks were obtained. A sketch of such a bias-only training step is given below.
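A minimal sketch of one such training step, assuming PyTorch and assuming the model's bias parameters are the ones whose names end in "bias" (both are assumptions, not part of the embodiments):

```python
import torch

def bias_only_step(model, batch, labels, loss_fn, lr=1e-3):
    # freeze everything except the bias parameters; relying on parameter
    # names ending in "bias" is an assumption about the model definition
    bias_params = []
    for name, p in model.named_parameters():
        p.requires_grad = name.endswith("bias")
        if p.requires_grad:
            bias_params.append(p)

    preds = model(batch)           # forward propagation through the n sub-networks
    loss = loss_fn(preds, labels)  # prediction loss value
    loss.backward()                # back propagation, from the nth sub-network backwards

    with torch.no_grad():
        for p in bias_params:
            p -= lr * p.grad       # gradient descent on the ith bias parameter
            p.grad = None
```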
Schematically, after the second prediction model is obtained, target data of the second field is input into the second prediction model, and data prediction is performed on the target data through the n sub-networks in the second prediction model to obtain a prediction result corresponding to the target data of the second field, for example: a target image in the field of computer vision is input into the second prediction model for image classification prediction, and the output classification result of the target image is "landscape".
Optionally, the bias parameters corresponding to the n sub-networks in the second prediction model are different from the bias parameters corresponding to the n sub-networks in the first prediction model.
Optionally, besides the bias parameters, other parameters contained in the n sub-networks of the second prediction model may also differ from the corresponding parameters in the first prediction model, for example: the weight parameters corresponding to the n sub-networks in the second prediction model are different from the weight parameters corresponding to the n sub-networks in the first prediction model.
In summary, in the training method of the prediction model provided by the embodiment of the present application, a corresponding bias parameter and weight parameter are defined in each sub-network of the first prediction model applied to the first field; the bias parameter and the weight parameter are used to perform feature processing on the input feature representation of the sub-network to obtain its output feature representation. In the process of training the first prediction model with sample data of the second field, the output gradient corresponding to the ith sub-network is obtained from the prediction loss value of the first prediction model on the second field and the output feature representation of the ith sub-network, the ith bias parameter is updated accordingly, and a second prediction model applied to the second field is finally obtained. That is, when training the model in a transfer learning scenario, training is completed by using the output gradient of each sub-network to adjust the bias parameter of that sub-network. Adjusting the bias parameters with the output gradient avoids using the input feature representation/output feature representation for model parameter adjustment, so the input feature representation/output feature representation does not need to be stored for the parameter update. This reduces the memory storage burden on the server; moreover, updating only the bias parameters improves the training efficiency of the prediction model while reducing memory storage overhead.
In an alternative embodiment, the weight parameter is implemented as a weight parameter matrix, and the bias parameter is implemented as a bias parameter matrix, referring to fig. 4 schematically, which shows a flowchart of a training method of a prediction model provided by an exemplary embodiment of the present application, that is, step 340 further includes steps 341 to 343, schematically, as shown in fig. 4, the method includes the following steps.
Step 341, acquiring the input feature representation corresponding to the ith sub-network based on the output feature representation corresponding to the (i-1)th sub-network.
Optionally, the input feature representation corresponding to the ith sub-network and the output feature representation corresponding to the i-1 th sub-network are the same or different, which is not limited. That is, the input feature representation corresponding to the i-th sub-network has a feature correlation with the output feature representation corresponding to the i-1 th sub-network.
In an alternative case, after the output feature representation corresponding to the (i-1)th sub-network is obtained, it is directly used as the input feature representation corresponding to the ith sub-network; in this case, the (i-1)th sub-network and the ith sub-network are in a direct series connection relationship.
In another alternative case, after the output feature representation corresponding to the (i-1)th sub-network is obtained, the input feature representation corresponding to the ith sub-network is obtained by performing feature adjustment on that output feature representation; in this case, a further module lies between the (i-1)th sub-network and the ith sub-network and is used for the feature adjustment, and the (i-1)th sub-network and the ith sub-network are in an indirect series connection relationship.
The following describes the module for feature adjustment in detail.
In the embodiment of the present application, a module for performing feature adjustment is referred to as a feature promotion module (LRP).
In some embodiments, at least two parameter matrices preset in the ith sub-network are obtained; the matrix product of the at least two parameter matrices is taken as an intermediate variable corresponding to the ith sub-network, where the intermediate variable is used for performing feature adjustment on the output feature representation corresponding to the (i-1)th sub-network; and the input feature representation corresponding to the ith sub-network is obtained based on the matrix relation between the output feature representation corresponding to the (i-1)th sub-network and the intermediate variable.
Illustratively, each sub-network corresponds to one LRP, and the module structure of each LRP is the same or different, which is not limited thereto.
Optionally, the LRP is an additional module for assisting the training of the first prediction model in training the first prediction model using the second sample data of the second domain; alternatively, LRP is a module that exists fixed in a first predictive model obtained by training of first sample data in a first domain.
In this embodiment, at least two learnable parameter matrices (i.e., matrices whose parameters can be adjusted) are preset in the LRP for the ith sub-network; a first matrix and a second matrix are taken as an example for explanation. The first matrix is A ∈ R^(c×r) and the second matrix is B ∈ R^(r×d), where A is a c × r matrix (c rows, r columns), B is an r × d matrix (r rows, d columns), and r < c, r < d.
In this embodiment, the first matrix and the second matrix belong to two low-rank parameter matrices, where a low-rank parameter matrix refers to a matrix with a high parameter correlation between rows and columns of the matrix, that is, a part of rows/columns in the matrix may be used to represent rows/columns of another part of the matrix, so that the number of learnable parameters in the low-rank parameter matrix is small.
In this embodiment, the matrix product obtained by multiplying the first matrix and the second matrix in the ith sub-network is used as the intermediate variable AB corresponding to the ith sub-network; the LRP thus contains c×r + r×d learnable parameters. The intermediate variable has the same feature shape as the output feature representation of the (i-1)th sub-network, so it can perform feature adjustment on that output feature representation, enhancing its feature representation capacity; the enhanced feature representation is used as the input feature representation of the ith sub-network. It should be noted that the "output feature representation corresponding to the (i-1)th sub-network" is the input feature representation of the LRP, and the output feature representation of the LRP is the input feature representation corresponding to the ith sub-network. In the present application, the input feature representations and output feature representations of any sub-network/module are referred to as intermediate feature layers. A worked count of the learnable parameters is given below.
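As a worked illustration only (the dimensions below are hypothetical), the low-rank pair keeps the number of learnable parameters far below that of a full c × d matrix:

```python
c, d, r = 768, 768, 8     # hypothetical dimensions with r << c and r << d
full = c * d              # a full c x d adjustment matrix: 589,824 parameters
low_rank = c * r + r * d  # the LRP's two factors: c*r + r*d = 12,288 parameters
print(full // low_rank)   # 48, i.e. roughly 48x fewer learnable parameters
```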
Illustratively, the input feature representation of the ith sub-network is obtained from the matrix relation between the output feature representation of the (i-1)th sub-network and the intermediate variable, where the matrix relation may refer to formula one below.
Formula one: y = x + AB
where x represents the output feature representation of the (i-1)th sub-network, y represents the input feature representation of the ith sub-network, and AB represents the intermediate variable.
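A minimal PyTorch sketch of such a module, assuming the class name LRP and the initialization scheme (random A, zero B) purely for illustration:

```python
import torch
import torch.nn as nn

class LRP(nn.Module):
    """Sketch of the feature promotion module: two learnable low-rank
    matrices whose product AB adjusts the previous sub-network's output,
    y = x + AB (formula one)."""

    def __init__(self, c: int, d: int, r: int):
        super().__init__()
        assert r < c and r < d
        self.A = nn.Parameter(torch.randn(c, r) * 0.02)  # first matrix, c x r
        self.B = nn.Parameter(torch.zeros(r, d))         # second matrix, r x d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: output feature representation of the (i-1)th sub-network, shape (c, d)
        return x + self.A @ self.B

x = torch.randn(64, 32)
y = LRP(c=64, d=32, r=4)(x)   # input feature representation of the ith sub-network
```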
Step 342, inputting the input feature representation corresponding to the ith sub-network into the ith sub-network, and performing feature processing on that input feature representation through the ith bias parameter and the ith weight parameter to output the output feature representation corresponding to the ith sub-network.
Illustratively, the ith bias parameter and the ith weight parameter corresponding to the ith sub-network are learnable parameters additionally defined in the process of training the first prediction model through the second sample data.
In this embodiment, the fully connected layer in the ith sub-network is defined to contain the ith weight parameter W ∈ R^(c×d) and the ith bias parameter b ∈ R^c, that is, W is a c × d matrix and b is a c-dimensional vector.
Matrix multiplication is performed on the ith weight parameter and the input feature representation corresponding to the ith sub-network to obtain a matrix product result; matrix addition is then performed on the matrix product result and the ith bias parameter to obtain the output feature representation corresponding to the ith sub-network.
That is, for the ith sub-network, the ith weight parameter is first matrix-multiplied with the input feature representation corresponding to the ith sub-network to obtain the matrix product result corresponding to the input feature representation, and the matrix product result is then matrix-added with the ith bias parameter; the resulting matrix sum is used as the output feature representation corresponding to the ith sub-network. The specific process may refer to formula two.
Formula two: y = Wx + b
where y represents the output feature representation of the ith sub-network, x represents the input feature representation of the ith sub-network, W represents the ith weight parameter, and b represents the ith bias parameter.
Step 343, taking the derivative of the prediction loss value with respect to the output feature representation of the ith sub-network as the output gradient corresponding to the ith sub-network.
In this embodiment, after the output feature representation of the ith sub-network is obtained through formula two during the forward propagation of the first prediction model, the prediction loss value L is partially differentiated with respect to the output feature representation y of the ith sub-network during the backward propagation, and the resulting derivative ∂L/∂y is taken as the output gradient of the ith sub-network.
In some embodiments, the input gradient corresponding to the ith sub-network is obtained based on the output gradient corresponding to the ith sub-network and the ith weight parameter, where the input gradient corresponding to the ith sub-network is used for gradient return during the back propagation through the ith sub-network.
In this embodiment, during the back propagation of the first prediction model, the input gradient of each sub-network needs to be calculated for gradient return, where gradient return refers to passing the gradient back through the layers traversed in the forward propagation process. The input gradient corresponding to the ith sub-network is obtained through the matrix relation between the output gradient of the ith sub-network and the ith weight parameter, where the matrix relation may refer to formula three.
Formula three: ∂L/∂x = W^T (∂L/∂y)
where W^T is the transpose of the ith weight parameter and ∂L/∂x represents the input gradient of the ith sub-network.
In some embodiments, the input feature representation corresponding to the ith sub-network is deleted from the server memory.
In this embodiment, since the input gradient is determined by the output gradient and the weight parameter, and the output gradient is determined by the prediction loss value and the output feature representation, the input feature representation of the ith sub-network can be deleted from the server memory as soon as its output feature representation has been obtained through formula two during forward propagation; only the output feature representation of the ith sub-network is retained, thereby reducing the storage overhead of server memory resources. A sketch of this saving is given below.
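A hedged sketch of this memory saving as a custom PyTorch autograd function (the class name and the decision to keep the weight frozen are assumptions for illustration, not the embodiments' actual implementation):

```python
import torch

class MemoryEfficientLinear(torch.autograd.Function):
    """Because the input gradient is W^T(dL/dy) (formula three) and the
    bias gradient equals dL/dy (formula four), backward never needs the
    input feature representation, so forward does not save it."""

    @staticmethod
    def forward(ctx, x, W, b):
        ctx.save_for_backward(W)       # only the frozen weight is kept in memory
        return x @ W.T + b             # formula two: y = Wx + b

    @staticmethod
    def backward(ctx, grad_y):
        (W,) = ctx.saved_tensors
        grad_x = grad_y @ W            # formula three: dL/dx = W^T dL/dy
        grad_b = grad_y.sum(dim=0)     # formula four:  dL/db = dL/dy (summed over batch)
        return grad_x, None, grad_b    # the weight itself stays frozen

x = torch.randn(4, 16)
W = torch.randn(8, 16)                 # frozen ith weight parameter
b = torch.zeros(8, requires_grad=True) # ith bias parameter
y = MemoryEfficientLinear.apply(x, W, b)
y.sum().backward()                     # populates b.grad only
```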
Step 350, updating the ith bias parameter based on the output gradient corresponding to the ith sub-network to obtain a second prediction model.
The second prediction model is used for carrying out data prediction on the data in the second field.
In this embodiment, the output gradient corresponding to the ith sub-network is directly used as the parameter gradient corresponding to the ith bias parameter; see formula four.
Formula four: ∂L/∂b = ∂L/∂y
where ∂L/∂b refers to the parameter gradient corresponding to the ith bias parameter.
In this embodiment, after the parameter gradient corresponding to the ith bias parameter is obtained, gradient descent is performed on it to update the ith bias parameter; finally, after the bias parameters corresponding to the n sub-networks have all been updated, the second prediction model is obtained.
In this embodiment, during the back propagation of the first prediction model, the gradient of the first matrix and the gradient of the second matrix in the LRP are also required, in order to update the parameters of the first matrix and the second matrix, and the input gradient of the LRP is calculated for gradient return. The input gradient of the LRP may refer to formula five, the gradient of the first matrix to formula six, and the gradient of the second matrix to formula seven.
Formula five: ∂L/∂x = ∂L/∂y
Formula six: ∂L/∂A = (∂L/∂y) B^T
Formula seven: ∂L/∂B = A^T (∂L/∂y)
where, for the LRP corresponding to the ith sub-network, ∂L/∂x represents the input gradient of the LRP, ∂L/∂y represents the output gradient corresponding to the output feature representation of the LRP, B^T represents the transpose of the second matrix, and A^T represents the transpose of the first matrix.
As can be seen from formulas five, six and seven, the input gradient is equal to the output gradient, the gradient of the first matrix is determined by the output gradient and the second matrix, and the gradient of the second matrix is determined by the output gradient and the first matrix. Therefore, after the output feature representation of the LRP has been obtained from its input feature representation during the current forward propagation, the input feature representation or output feature representation can be deleted from the server memory accordingly. A sketch of this backward pass is given below.
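For illustration, a minimal PyTorch sketch of formulas five to seven as a custom autograd function (names and shapes are assumptions):

```python
import torch

class LRPFunction(torch.autograd.Function):
    """By formulas five to seven, the gradients need only A, B and the
    output gradient, so neither the input nor the output feature
    representation has to be kept in memory."""

    @staticmethod
    def forward(ctx, x, A, B):
        ctx.save_for_backward(A, B)   # feature representations are not saved
        return x + A @ B              # formula one: y = x + AB

    @staticmethod
    def backward(ctx, grad_y):
        A, B = ctx.saved_tensors
        return (
            grad_y,                   # formula five:  dL/dx = dL/dy
            grad_y @ B.T,             # formula six:   dL/dA = (dL/dy) B^T
            A.T @ grad_y,             # formula seven: dL/dB = A^T (dL/dy)
        )

x = torch.randn(64, 32)
A = torch.randn(64, 4, requires_grad=True)
B = torch.zeros(4, 32, requires_grad=True)
LRPFunction.apply(x, A, B).sum().backward()   # populates A.grad and B.grad
```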
It is noted that the LRP may be inserted into any position in the first prediction model during the application process.
It is further noted that x is only used to represent an input feature representation, and is not specific to a certain input feature representation, and y is only used to represent an output feature representation, and is not specific to a certain output feature representation.
In summary, in the training method of the prediction model provided by the embodiment of the present application, a corresponding bias parameter and weight parameter are defined in each sub-network of the first prediction model applied to the first field; the bias parameter and the weight parameter are used to perform feature processing on the input feature representation of the sub-network to obtain its output feature representation. In the process of training the first prediction model with sample data of the second field, the output gradient corresponding to the ith sub-network is obtained from the prediction loss value of the first prediction model on the second field and the output feature representation of the ith sub-network, the ith bias parameter is updated accordingly, and a second prediction model applied to the second field is finally obtained. That is, when training the model in a transfer learning scenario, training is completed by using the output gradient of each sub-network to adjust the bias parameter of that sub-network. Adjusting the bias parameters with the output gradient avoids using the input feature representation/output feature representation for model parameter adjustment, so the input feature representation/output feature representation does not need to be stored for the parameter update. This reduces the memory storage burden on the server; moreover, updating only the bias parameters improves the training efficiency of the prediction model while reducing memory storage overhead.
In this embodiment, after the input feature representation corresponding to the ith sub-network is obtained, the input feature representation is subjected to feature processing through the ith bias parameter and the ith weight parameter to obtain an output feature representation, and then the output gradient of the ith sub-network is obtained according to the derivative calculation result of the predicted loss value corresponding to the output feature representation, so that the input feature representation is avoided being used in the process of calculating the output gradient, the parameter calculation amount is reduced, and the model training efficiency is improved.
In this embodiment, the output feature representation corresponding to the sub-network is obtained through the matrix relation among the weight parameter, the input feature representation corresponding to the sub-network, and the bias parameter, which improves the feature accuracy of the output feature representation.
In this embodiment, in addition to the output gradient obtained by calculation, an input gradient corresponding to the sub-network is obtained from the output gradient and the weight parameter, ensuring that the gradient can be returned during back propagation and improving the model precision of the second prediction model.
In this embodiment, during back propagation, neither the output gradient nor the input gradient requires the input feature representation of the sub-network, so after the output feature representation is obtained from the input feature representation, the input feature representation can be deleted from the server memory, reducing the storage overhead of the server memory.
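The same point can be made concrete for a bias-tuned sub-network. Below is a minimal sketch, assuming a linear form $y = xW + b$ with a frozen weight $W$: the backward pass stores only $W$, the bias gradient is the output gradient summed over the batch dimension, and the input gradient for gradient return is $\partial L/\partial y \cdot W^T$, so the input feature representation can be freed right after the forward pass. The shapes and the batch-sum convention are assumptions of this illustration.

```python
import torch

class BiasTunedLinear(torch.autograd.Function):
    """Sketch: y = x @ W + b with W frozen and only b trainable."""

    @staticmethod
    def forward(ctx, x, W, b):
        ctx.save_for_backward(W)      # neither x nor y needs to be stored
        return x @ W + b

    @staticmethod
    def backward(ctx, grad_y):
        (W,) = ctx.saved_tensors
        grad_x = grad_y @ W.t()       # input gradient, used for gradient return
        grad_b = grad_y.sum(dim=0)    # bias gradient depends on the output gradient only
        return grad_x, None, grad_b   # no gradient for the frozen weight
```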
In this embodiment, at least two preset parameter matrices are multiplied together to obtain the intermediate variable corresponding to the ith sub-network, and the intermediate variable is used to perform feature adjustment on the output feature representation, which improves the expressiveness of the input feature representation of the ith sub-network.
In an alternative embodiment, each sub-network further corresponds to a branching module, where the ith sub-network corresponds to the ith branching module, and referring to fig. 5, a flowchart of a training method of a prediction model provided by an exemplary embodiment of the present application is shown schematically, and the method includes the following steps as shown in fig. 5.
Step 510, performing feature downsampling on the output feature representation corresponding to the i-1 th sub-network to obtain a downsampling result corresponding to the output feature representation of the i-1 th sub-network.
In some embodiments, each sub-network corresponds to a feature extraction module, and the ith feature extraction module receives the output feature representation of the i-1 th sub-network.
Illustratively, after the output feature representation corresponding to the i-1 th sub-network is obtained, it is used as the input feature representation of the ith feature extraction module and fed into that module for feature downsampling; the downsampling result corresponding to the output feature representation of the i-1 th sub-network is output and used as the input feature representation of the ith branching module.
That is, when the input feature representation corresponding to the ith feature extraction module is input into the ith feature extraction module, a channel average pooling (Channel Average Pooling, CAP) operation is performed on it, compressing the input feature representation $x \in \mathbb{R}^{n \times c}$ into $\mathbb{R}^{n \times c/r}$, where the downsampling multiple $r$ is preset (e.g., 8, 16, 32, etc.). The compressed feature is used as the output feature representation of the ith feature extraction module, that is, the downsampling result corresponding to the output feature representation of the i-1 th sub-network; see formula eight.
Formula eight: $y_k = \frac{1}{r}\sum_{j=(k-1)r+1}^{kr} x_j$

wherein $y_k$ represents the kth component of the output feature representation of the feature extraction module along the channel dimension $c$, and the other letter indices in formula eight likewise represent indices along the channel dimension.
In some embodiments, the output feature representation corresponding to the i-1 th sub-network is divided into k groups of feature representation groups according to a preset channel dimension, wherein the jth group of feature representation groups comprises r feature variables, j is less than or equal to k, and k and j are positive integers; the feature average result corresponding to the r feature variables in the jth group of feature representation groups, taken along the preset channel dimension, is used as the downsampling result corresponding to the output feature representation of the i-1 th sub-network.

In this embodiment, during the forward propagation of the first prediction model, the input feature representation corresponding to the feature extraction module is divided into k non-overlapping groups along the preset channel dimension, the r feature variables in each group are averaged along that dimension, and the resulting feature average is used as the downsampling result corresponding to the output feature representation of the i-1 th sub-network.
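The following is a minimal sketch of this channel average pooling, assuming a feature representation of shape (n, c) and a preset downsampling multiple r that divides c evenly; the function name is chosen for illustration.

```python
import torch

def channel_average_pooling(x: torch.Tensor, r: int) -> torch.Tensor:
    # x: (n, c) -> (n, c // r); each output channel is the mean of a
    # non-overlapping group of r consecutive input channels (formula eight)
    n, c = x.shape
    assert c % r == 0, "the downsampling multiple must divide the channel dimension"
    return x.view(n, c // r, r).mean(dim=-1)
```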
In addition, in the back propagation process of the first prediction model, the feature extraction module only needs to obtain the input gradient corresponding to the input feature representation for gradient return; see formula nine.
Formula nine: $\frac{\partial L}{\partial x_j} = \frac{1}{r}\,\frac{\partial L}{\partial y_k}$, for $(k-1)r < j \le kr$

wherein $\partial L/\partial x$ represents the input gradient corresponding to the input feature representation. It can be seen from formula nine that this input gradient depends only on the output gradient $\partial L/\partial y$ corresponding to the output feature representation, and not on the input feature representation itself; therefore, during the forward propagation of the first prediction model, the input feature representation corresponding to the feature extraction module can be deleted from the memory of the server once the downsampling result has been obtained through formula eight.
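Correspondingly, a sketch of the gradient return of formula nine: each input channel of group k receives that group's output gradient scaled by 1/r, so no stored input feature representation is touched. The function name and shapes are again assumptions of this illustration.

```python
import torch

def channel_average_pooling_backward(grad_y: torch.Tensor, r: int) -> torch.Tensor:
    # grad_y: (n, c // r) -> grad_x: (n, c); every input channel in group k
    # receives grad_y[:, k] / r, which depends only on the output gradient
    return grad_y.repeat_interleave(r, dim=-1) / r
```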
Step 520, inputting the downsampling result corresponding to the output feature representation of the i-1 th sub-network into the ith branching module, and outputting to obtain the convolution result corresponding to the ith branching module.
Illustratively, the downsampling result output by the ith feature extraction module is used as the input feature representation of the ith branching module; the ith branching module performs feature convolution on the input downsampling result and outputs the convolution result corresponding to the ith branching module.
In some embodiments, a convolution operation is performed on the downsampling result corresponding to the output feature representation of the i-1 th sub-network to obtain the convolution result corresponding to the ith branching module, wherein the convolution operation comprises at least one of point-by-point convolution and depth convolution.
In this embodiment, for the ith branching module, a point-by-point convolution layer and a depth convolution layer are adopted as the main structure of the branching module, and each convolution sub-module in the branching module further includes a normalization layer, a nonlinear activation layer, and the like. Thus, the convolution result output by the ith branching module may be expressed as formula ten.
Formula ten: $y_i = \mathrm{Conv}_{PW}(\mathrm{Conv}_{DW}(\mathrm{Conv}_{PW}(x_d + y_{i-1})))$

wherein $\mathrm{Conv}_{PW}$ represents a first convolution sub-module consisting of a point-by-point convolution layer, a normalization layer and a nonlinear activation layer; $\mathrm{Conv}_{DW}$ represents a second convolution sub-module consisting of a depth convolution layer, a normalization layer and a nonlinear activation layer; $x_d$ represents the output feature representation corresponding to the ith feature extraction module; and $y_{i-1}$ represents the output feature representation of the i-1 th branching module.
In this embodiment, the branching module is referred to as a lightweight branching module (LSB).
In another alternative, the point-by-point convolution layer and the depth convolution layer in the LSBs may use common convolution layers, and the number of convolution layers in each LSB may be adjusted accordingly according to the training requirements.
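Under the assumption of 2D convolutional features with batch normalization and GELU activation (the application fixes neither choice, nor the kernel sizes), a minimal sketch of the branching module of formula ten might look as follows.

```python
import torch
import torch.nn as nn

class LightweightSideBranch(nn.Module):
    """Sketch of formula ten: y_i = Conv_PW(Conv_DW(Conv_PW(x_d + y_prev)))."""

    def __init__(self, channels: int):
        super().__init__()

        def pw(c: int) -> nn.Sequential:
            # point-by-point (1x1) convolution sub-module with norm and activation
            return nn.Sequential(nn.Conv2d(c, c, kernel_size=1),
                                 nn.BatchNorm2d(c), nn.GELU())

        self.pw1 = pw(channels)
        self.dw = nn.Sequential(   # depth convolution sub-module (one filter per channel)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.BatchNorm2d(channels), nn.GELU())
        self.pw2 = pw(channels)

    def forward(self, x_d: torch.Tensor, y_prev: torch.Tensor) -> torch.Tensor:
        return self.pw2(self.dw(self.pw1(x_d + y_prev)))
```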
Step 530, performing feature splicing on the convolution result corresponding to the ith branching module and the downsampling result corresponding to the output feature representation of the ith sub-network, wherein the obtained spliced feature is used as the input feature representation of the (i+1)th branching module.
And performing feature splicing on the convolution result corresponding to the ith branch module and the output feature representation corresponding to the (i+1) th feature extraction module, wherein the obtained spliced feature is used as the input feature representation of the (i+1) th branch module. It is noted that, in the forward propagation process of the first prediction model, after the nth branch module outputs the convolution result, the convolution result of the nth branch module and the input feature representation corresponding to the nth sub-network are subjected to feature stitching, and the obtained stitching feature is used as the prediction result of the first prediction model.
Notably, at least one of the LRP and LSB described above is applied to train the first predictive model. That is, steps 510 through 530 may be implemented after step 320.
In summary, in the training method of the prediction model provided by the embodiment of the present application, a corresponding bias parameter and weight parameter are defined in each sub-network of the first prediction model applied to the first field, and the bias parameter and the weight parameter are used to perform feature processing on the input feature representation of the sub-network to obtain the output feature representation of the sub-network. In the process of training the first prediction model with sample data of the second field, the output gradient corresponding to the ith sub-network is obtained according to the prediction loss value of the first prediction model in the second field and the output feature representation of the ith sub-network, the ith bias parameter is updated accordingly, and the second prediction model applied to the second field is finally obtained. That is, when training the model in the transfer learning scenario, training is completed by using the output gradient in each sub-network to adjust the bias parameter of that sub-network. Adjusting the bias parameters with the output gradient avoids using the input feature representation/output feature representation for model parameter adjustment, so the input feature representation/output feature representation does not need to be stored for parameter updating; this reduces the memory storage burden of the server, and updating only the bias parameters both improves the training efficiency of the prediction model and reduces the memory storage cost.
In this embodiment, by setting the branching module, feature splicing can be performed on the convolution result output by the branching module and the downsampling result of the output feature representation, and the spliced feature is used as the input feature representation of the next branching module, improving the accuracy of the first prediction model in the training process.
In this embodiment, by performing feature downsampling on the output feature representation of the sub-network, the number of channels of the output feature representation can be reduced, and the model training efficiency can be improved.
In this embodiment, by performing convolution operation on the downsampling result corresponding to the output feature representation of the sub-network, the number of parameters in the output feature representation can be reduced, and the data overhead of the computer in the model training process can be reduced.
Illustratively, in the process of training the first prediction model, a weight parameter and a bias parameter are added to each sub-network, and each sub-network corresponds to one LRP and one LSB. Referring to fig. 6, which shows a flowchart of a training method of a prediction model provided by an exemplary embodiment of the present application, the training method includes the following processes.
First, a first prediction model 601 is acquired, wherein the first prediction model 601 is a classification prediction model applied to the field of image classification, that is, after an image containing a cat is input into the first prediction model 601, a probability result of the cat contained in the image is output. At this time, the first prediction model 601 includes only n sub-networks, and fig. 6 shows only the i-1 th sub-network 602 and the i-th sub-network 604 for illustration.
In the process of inputting a second sample text in a second field (the second field is a text classification field, and the second sample text is a prose) into the first prediction model 601, firstly, text prediction is performed on the second sample text through n sub-networks in the first prediction model 601, and a text classification result corresponding to the second sample text is obtained, namely, the prediction probability that the sample text belongs to the prose is obtained. Wherein the second sample text is labeled with a text classification tag.
The overall prediction loss value corresponding to the first prediction model 601 is then obtained according to the difference between the text classification label and the prediction probability.
The above procedure is a forward propagation procedure of the first predictive model 601.
In the forward propagation of the first prediction model 601, taking the ith sub-network as an example, the output feature representation of the i-1 th sub-network 602 serves as the i-1 th intermediate feature 603. The intermediate feature 603 is used as the input feature representation of the feature extraction module corresponding to the ith sub-network for feature downsampling, and the downsampling result of the i-1 th output feature representation is output. This downsampling result is input into the ith branching module 605 for convolution to obtain the convolution result corresponding to the ith branching module; after feature splicing with the downsampling result corresponding to the output feature representation of the ith sub-network, the spliced feature is used as the input feature representation of the (i+1)th branching module.
In addition, the i-1 th intermediate feature 603 is further input into the ith feature promotion module (the LRP described above), which outputs the input feature representation of the ith sub-network; this representation is input into the ith sub-network 604, and after the ith sub-network 604 performs feature processing through the weight parameter and the bias parameter 606, the output feature representation of the ith sub-network is output.
In the forward propagation process, after the output characteristic representation is obtained through the input characteristic representation, the input characteristic representation can be deleted correspondingly from the memory of the server.
In the back propagation process of the first prediction model 601, for the different modules, the gradients of the corresponding parameters are calculated by using the prediction loss value and the output gradients obtained from the corresponding output feature representations, so that the parameters are updated and the second prediction model is finally obtained.
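As a hedged sketch of this selective update, only the newly introduced parameters would be handed to the optimizer while the pre-trained backbone stays frozen; the name-based selection below (bias terms plus modules whose names contain "lrp" or "lsb") and the choice of AdamW are assumed conventions of this illustration.

```python
import torch

def build_optimizer(model: torch.nn.Module, lr: float = 1e-3) -> torch.optim.Optimizer:
    trainable = []
    for name, param in model.named_parameters():
        if name.endswith(".bias") or "lrp" in name or "lsb" in name:
            trainable.append(param)          # update only the added parameters
        else:
            param.requires_grad_(False)      # keep the pre-trained backbone frozen
    return torch.optim.AdamW(trainable, lr=lr)
```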
In summary, in the training method of the prediction model provided by the embodiment of the present application, a corresponding bias parameter and weight parameter are defined in each sub-network of the first prediction model applied to the first field, and the bias parameter and the weight parameter are used to perform feature processing on the input feature representation of the sub-network to obtain the output feature representation of the sub-network. In the process of training the first prediction model with sample data of the second field, the output gradient corresponding to the ith sub-network is obtained according to the prediction loss value of the first prediction model in the second field and the output feature representation of the ith sub-network, the ith bias parameter is updated accordingly, and the second prediction model applied to the second field is finally obtained. That is, when training the model in the transfer learning scenario, training is completed by using the output gradient in each sub-network to adjust the bias parameter of that sub-network. Adjusting the bias parameters with the output gradient avoids using the input feature representation/output feature representation for model parameter adjustment, so the input feature representation/output feature representation does not need to be stored for parameter updating; this reduces the memory storage burden of the server, and updating only the bias parameters both improves the training efficiency of the prediction model and reduces the memory storage cost.
The transfer learning method provided by the application aims at adapting an existing pre-trained model: in the downstream task learning process, most parameters of the original model are kept unchanged, and only the modules and parameters provided by the application (bias parameters, LRP, LSB and CAP) are updated. Compared with the common approach of fine-tuning all model parameters on downstream tasks, this greatly reduces the parameter storage: a pre-trained model with 100M parameters can save about 96% of storage space across 20 downstream tasks (with only about 1% of parameters fine-tuned). Meanwhile, in the module updates provided by the application, the gradients of the learnable parameters either do not depend on intermediate features at all or depend only on intermediate features of smaller size, so memory occupation can be greatly reduced; compared with fine-tuning all parameters, memory use can be reduced several-fold.
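A rough back-of-the-envelope check of this storage claim (a sketch under the stated assumptions, not the application's exact accounting): with one shared 100M-parameter backbone and roughly 1% task-specific parameters for each of 20 downstream tasks,

```latex
\underbrace{20 \times 100\,\mathrm{M}}_{\text{fine-tuning all parameters}} = 2000\,\mathrm{M},
\qquad
\underbrace{100\,\mathrm{M} + 20 \times 1\,\mathrm{M}}_{\text{shared backbone + per-task parameters}} = 120\,\mathrm{M},
\qquad
1 - \tfrac{120}{2000} = 94\%
```

which is consistent in magnitude with the roughly 96% saving reported above.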
FIG. 7 is a block diagram of a training apparatus for a predictive model according to an exemplary embodiment of the application, as shown in FIG. 7, the apparatus including the following:
the obtaining module 710 is configured to obtain a first prediction model, where model parameters of the first prediction model are obtained through training of first sample data in a first field, the first prediction model includes n sub-networks, where an i-th sub-network is configured to perform feature processing on an input feature representation through an i-th bias parameter and an i-th weight parameter to obtain an output feature representation, the i-th bias parameter is configured to perform offset adjustment on the input feature representation of the i-th sub-network, the i-th weight parameter is configured to perform weight adjustment on the input feature representation of the i-th sub-network, i is greater than or equal to 1 and less than or equal to n, and i and n are integers;
The input module 720 is configured to input second sample data in a second field into the first prediction model, perform data prediction on the second sample data through the n sub-networks, and obtain a prediction result corresponding to the second sample data, where the second sample data is labeled with a prediction tag;
a determining module 730, configured to determine a prediction loss value corresponding to the first prediction model based on a difference between the prediction result corresponding to the second sample data and the prediction label;
the determining module 730 is further configured to obtain an output gradient corresponding to the ith sub-network based on the predicted loss value and the output characteristic representation of the ith sub-network;
and the updating module 740 is configured to update parameters of the ith bias parameter based on an output gradient corresponding to the ith sub-network, so as to obtain a second prediction model, where the second prediction model is used for performing data prediction on data in the second field.
Referring to fig. 8, in some alternative embodiments, the determining module 730 includes:
an obtaining unit 731, configured to obtain an input feature representation corresponding to the ith sub-network based on the output feature representation corresponding to the ith-1 th sub-network;
An output unit 732, configured to input an input feature representation corresponding to an ith sub-network into the ith sub-network, perform feature processing on the input feature representation corresponding to the ith sub-network through the ith bias parameter and the ith weight parameter, and output an output feature representation corresponding to the ith sub-network;
the obtaining unit 731 is further configured to take a derivative calculation result of the prediction loss value with respect to the output feature representation corresponding to the ith sub-network as the output gradient corresponding to the ith sub-network.
In some embodiments, the output unit 732 is further configured to output the output feature representation corresponding to the ith sub-network based on a matrix relation among the ith weight parameter, the input feature representation corresponding to the ith sub-network, and the ith bias parameter.
In some embodiments, the obtaining unit 731 is further configured to obtain an input gradient corresponding to the ith sub-network based on the output gradient corresponding to the ith sub-network and the ith weight parameter, where the input gradient corresponding to the ith sub-network is used for gradient backhaul in a process of back propagation of the ith sub-network.
In some embodiments, the apparatus further comprises:
and the deleting module 750 is configured to delete the input feature representation corresponding to the ith sub-network from the server memory.
In some embodiments, the updating module 740 is further configured to determine a parameter gradient corresponding to the ith bias parameter based on the output gradient corresponding to the ith sub-network; and carrying out parameter updating on the ith bias parameter through a parameter gradient corresponding to the ith bias parameter to obtain the second prediction model.
In some embodiments, the obtaining unit 731 is further configured to obtain at least two parameter matrices preset in the ith sub-network; taking the matrix product result of the at least two parameter matrices as an intermediate variable corresponding to the ith sub-network, wherein the intermediate variable is used for carrying out characteristic adjustment on the output characteristic representation corresponding to the ith-1 sub-network; and obtaining the input characteristic representation corresponding to the ith sub-network based on the matrix relation between the output characteristic representation corresponding to the ith-1 sub-network and the intermediate variable.
In some embodiments, the ith sub-network corresponds to an ith branching module;
the apparatus further comprises:
The sampling module 760 is configured to perform feature downsampling on the output feature representation corresponding to the i-1 th sub-network to obtain a downsampling result corresponding to the output feature representation of the i-1 th sub-network;
the input module 720 is further configured to input the downsampling result corresponding to the output feature representation of the i-1 th sub-network into the ith branching module, and output the convolution result corresponding to the ith branching module.
In some embodiments, the sampling module 760 is further configured to divide the output feature representation corresponding to the i-1 th sub-network into k groups of feature representation groups according to a preset channel dimension, wherein the jth group of feature representation groups comprises r feature variables, j is less than or equal to k, and k and j are positive integers; and to take the feature average result corresponding to the r feature variables in the jth group of feature representation groups, along the preset channel dimension, as the downsampling result corresponding to the output feature representation of the i-1 th sub-network.
In some embodiments, the input module 720 is further configured to convolve the downsampled result corresponding to the output feature representation of the i-1 th sub-network to obtain the convolved result corresponding to the i-th branching module.
In summary, in the training device of the prediction model provided by the embodiment of the present application, a corresponding bias parameter and weight parameter are defined in each sub-network of the first prediction model applied to the first field, and the bias parameter and the weight parameter are used to perform feature processing on the input feature representation of the sub-network to obtain the output feature representation of the sub-network. In the process of training the first prediction model with sample data of the second field, the output gradient corresponding to the ith sub-network is obtained according to the prediction loss value of the first prediction model in the second field and the output feature representation of the ith sub-network, the ith bias parameter is updated accordingly, and the second prediction model applied to the second field is finally obtained. That is, when training the model in the transfer learning scenario, training is completed by using the output gradient in each sub-network to adjust the bias parameter of that sub-network. Adjusting the bias parameters with the output gradient avoids using the input feature representation/output feature representation for model parameter adjustment, so the input feature representation/output feature representation does not need to be stored for parameter updating; this reduces the memory storage burden of the server, and updating only the bias parameters both improves the training efficiency of the prediction model and reduces the memory storage cost.
It should be noted that: in the training device for a prediction model provided in the above embodiment, only the division of the above functional modules is used as an example, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the training device of the prediction model provided in the above embodiment and the training method embodiment of the prediction model belong to the same concept, and the specific implementation process is detailed in the method embodiment, which is not repeated here.
Fig. 9 shows a block diagram of a computer device 900 provided by an exemplary embodiment of the application. The computer device 900 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The computer device 900 may also be referred to by other names such as user device, portable terminal, laptop terminal, or desktop terminal.
In general, the computer device 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 901 may be implemented in at least one hardware form of DSP (Digital Signal Processing ), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array ). The processor 901 may also include a main processor and a coprocessor, the main processor being a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit ); a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 901 may integrate a GPU (Graphics Processing Unit, image processor) for taking care of rendering and drawing of content that the display screen needs to display. In some embodiments, the processor 901 may also include an AI (Artificial Intelligence ) processor for processing computing operations related to machine learning.
The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement the training method of the prediction model provided by the method embodiments of the present application.
In some embodiments, computer device 900 may optionally include other components, and those skilled in the art will appreciate that the structure illustrated in FIG. 9 is not limiting of computer device 900, and may include more or fewer components than shown, or may combine certain components, or employ a different arrangement of components.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, which may be a computer-readable storage medium included in the memory of the above embodiments, or a stand-alone computer-readable storage medium not incorporated into the terminal. The computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the training method of the prediction model according to any of the above embodiments.
Alternatively, the computer-readable storage medium may include: read Only Memory (ROM), random access Memory (RAM, random Access Memory), solid state disk (SSD, solid State Drives), or optical disk, etc. The random access memory may include resistive random access memory (ReRAM, resistance Random Access Memory) and dynamic random access memory (DRAM, dynamic Random Access Memory), among others. The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is of preferred embodiments of the present application and is not intended to limit the application; the scope of protection of the application is defined by the appended claims.

Claims (13)

1. A method of training a predictive model, the method comprising:
obtaining a first prediction model, wherein model parameters of the first prediction model are obtained through training with first sample data in a first field, the first prediction model comprises n sub-networks, an ith sub-network is used for performing feature processing on an input feature representation through an ith bias parameter and an ith weight parameter to obtain an output feature representation, the ith bias parameter is used for performing offset adjustment on the input feature representation of the ith sub-network, the ith weight parameter is used for performing weight adjustment on the input feature representation of the ith sub-network, i is not less than 1 and not more than n, and i and n are integers;
Inputting second sample data in a second field into the first prediction model, and carrying out data prediction on the second sample data through the n sub-networks to obtain a prediction result corresponding to the second sample data, wherein the second sample data is marked with a prediction label;
determining a prediction loss value corresponding to the first prediction model based on a difference between the prediction result corresponding to the second sample data and the prediction label;
obtaining an output gradient corresponding to the ith sub-network based on the predicted loss value and the output characteristic representation of the ith sub-network;
and updating parameters of the ith bias parameters based on the output gradient corresponding to the ith sub-network to obtain a second prediction model, wherein the second prediction model is used for carrying out data prediction on data in the second field.
2. The method according to claim 1, wherein the deriving the output gradient corresponding to the i-th sub-network based on the predicted loss value and the output characteristic representation of the i-th sub-network comprises:
acquiring an input feature representation corresponding to the ith sub-network based on the output feature representation corresponding to the i-1 th sub-network;
inputting an input characteristic representation corresponding to an ith sub-network into the ith sub-network, performing characteristic processing on the input characteristic representation corresponding to the ith sub-network through the ith bias parameter and the ith weight parameter, and outputting to obtain an output characteristic representation corresponding to the ith sub-network;
and taking a derivative calculation result of the prediction loss value with respect to the output feature representation of the ith sub-network as the output gradient corresponding to the ith sub-network.
3. The method according to claim 2, wherein the performing feature processing on the input feature representation corresponding to the ith sub-network through the ith bias parameter and the ith weight parameter, and outputting to obtain the output feature representation corresponding to the ith sub-network includes:
performing matrix multiplication on the ith weight parameter and the input characteristic representation corresponding to the ith sub-network to obtain a matrix product result;
and performing matrix addition on the matrix product result and the ith bias parameter to obtain the output feature representation corresponding to the ith sub-network.
4. A method according to claim 3, wherein after the outputting obtains the output characteristic representation corresponding to the ith sub-network, the method further comprises:
and obtaining an input gradient corresponding to the ith sub-network based on the output gradient corresponding to the ith sub-network and the ith weight parameter, wherein the input gradient corresponding to the ith sub-network is used for carrying out gradient feedback in the process of carrying out counter propagation on the ith sub-network.
5. The method according to any one of claims 2 to 4, wherein after the outputting obtains the output characteristic representation corresponding to the ith sub-network, the method further includes:
and deleting the input characteristic representation corresponding to the ith sub-network from the memory of the server.
6. The method according to any one of claims 2 to 4, wherein the obtaining the input feature representation corresponding to the i-th sub-network based on the output feature representation corresponding to the i-1-th sub-network includes:
acquiring at least two parameter matrixes preset in an ith sub-network;
taking the matrix product result of the at least two parameter matrices as an intermediate variable corresponding to the ith sub-network, wherein the intermediate variable is used for performing feature adjustment on the output feature representation corresponding to the i-1 th sub-network;

and obtaining the input feature representation corresponding to the ith sub-network based on the matrix relation between the output feature representation corresponding to the i-1 th sub-network and the intermediate variable.
7. The method according to any one of claims 2 to 4, wherein the ith sub-network corresponds to an ith branching module;
the method further comprises the steps of:
performing feature downsampling on the output feature representation corresponding to the i-1 th sub-network to obtain a downsampling result corresponding to the output feature representation of the i-1 th sub-network;
Inputting the downsampling result corresponding to the output characteristic representation corresponding to the i-1 sub-network into an i branch module, and outputting to obtain a convolution result corresponding to the i branch module;
and performing feature splicing on the convolution result corresponding to the ith branch module and the downsampling result corresponding to the output feature representation of the ith sub-network, wherein the obtained spliced feature is used as the input feature representation of the (i+1) th branch module.
8. The method according to claim 7, wherein the performing feature downsampling on the output feature representation corresponding to the i-1 th sub-network to obtain the downsampling result corresponding to the output feature representation of the i-1 th sub-network includes:
dividing the output feature representation corresponding to the i-1 th sub-network into k groups of feature representation groups according to a preset channel dimension, wherein the jth group of feature representation groups comprises r feature variables, j is less than or equal to k, and k and j are positive integers;

and taking the feature average result corresponding to the r feature variables in the jth group of feature representation groups, along the preset channel dimension, as the downsampling result corresponding to the output feature representation of the i-1 th sub-network.
9. The method of claim 7, wherein the inputting the downsampling result corresponding to the output feature representation of the i-1 th sub-network into the ith branching module, and outputting to obtain the convolution result corresponding to the ith branching module, comprises:
And carrying out convolution operation on the downsampling result corresponding to the output characteristic representation of the ith-1 sub-network to obtain a convolution result corresponding to the ith branch module, wherein the convolution operation comprises at least one of deep convolution and point-by-point convolution.
10. A training device for a predictive model, the device comprising:
the acquisition module is used for acquiring a first prediction model, wherein model parameters of the first prediction model are obtained through training with first sample data in a first field, the first prediction model comprises n sub-networks, an ith sub-network is used for performing feature processing on an input feature representation through an ith bias parameter and an ith weight parameter to obtain an output feature representation, the ith bias parameter is used for performing offset adjustment on the input feature representation of the ith sub-network, the ith weight parameter is used for performing weight adjustment on the input feature representation of the ith sub-network, i is greater than or equal to 1 and less than or equal to n, and i and n are integers;
the input module is used for inputting second sample data in a second field into the first prediction model, carrying out data prediction on the second sample data through the n sub-networks to obtain a prediction result corresponding to the second sample data, and marking the second sample data with a prediction label;
the determining module is used for determining a prediction loss value corresponding to the first prediction model based on a difference between the prediction result corresponding to the second sample data and the prediction label;
the determining module is further configured to obtain an output gradient corresponding to the ith sub-network based on the predicted loss value and the output characteristic representation of the ith sub-network;
and the updating module is used for carrying out parameter updating on the ith bias parameter based on the output gradient corresponding to the ith sub-network to obtain a second prediction model, and the second prediction model is used for carrying out data prediction on the data in the second field.
11. A computer device comprising a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to implement a method of training a predictive model as claimed in any one of claims 1 to 9.
12. A computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement a method of training a predictive model as claimed in any one of claims 1 to 9.
13. A computer program product comprising computer instructions which, when executed by a processor, implement a method of training a predictive model as claimed in any one of claims 1 to 9.
CN202310113390.2A 2023-01-31 2023-01-31 Training method, device, equipment, medium and program product of prediction model Pending CN116957006A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310113390.2A CN116957006A (en) 2023-01-31 2023-01-31 Training method, device, equipment, medium and program product of prediction model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310113390.2A CN116957006A (en) 2023-01-31 2023-01-31 Training method, device, equipment, medium and program product of prediction model

Publications (1)

Publication Number Publication Date
CN116957006A true CN116957006A (en) 2023-10-27

Family

ID=88441665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310113390.2A Pending CN116957006A (en) 2023-01-31 2023-01-31 Training method, device, equipment, medium and program product of prediction model

Country Status (1)

Country Link
CN (1) CN116957006A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117420807A (en) * 2023-12-14 2024-01-19 深圳市德镒盟电子有限公司 Method, system and production equipment for intelligently controlling thickness of anti-adhesion layer
CN117420807B (en) * 2023-12-14 2024-03-12 深圳市德镒盟电子有限公司 Method, system and production equipment for intelligently controlling thickness of anti-adhesion layer

Similar Documents

Publication Publication Date Title
WO2022007823A1 (en) Text data processing method and device
CN111368993B (en) Data processing method and related equipment
WO2021139191A1 (en) Method for data labeling and apparatus for data labeling
CN111709493B (en) Object classification method, training device, object classification equipment and storage medium
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
WO2021135455A1 (en) Semantic recall method, apparatus, computer device, and storage medium
WO2024041479A1 (en) Data processing method and apparatus
WO2022001724A1 (en) Data processing method and device
CN111259647A (en) Question and answer text matching method, device, medium and electronic equipment based on artificial intelligence
CN111611805A (en) Auxiliary writing method, device, medium and equipment based on image
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN116957006A (en) Training method, device, equipment, medium and program product of prediction model
CN113609337A (en) Pre-training method, device, equipment and medium of graph neural network
WO2023279921A1 (en) Neural network model training method, data processing method, and apparatuses
CN116958852A (en) Video and text matching method and device, electronic equipment and storage medium
CN115879508A (en) Data processing method and related device
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN114282058A (en) Method, device and equipment for model training and video theme prediction
CN113408702A (en) Music neural network model pre-training method, electronic device and storage medium
CN112101015A (en) Method and device for identifying multi-label object
CN111177493B (en) Data processing method, device, server and storage medium
CN117851967A (en) Multi-mode information fusion method, device, equipment and storage medium
CN117273185A (en) Training method, device, equipment, medium and program product of prediction model
CN117010334A (en) Text information generation method, device, computer equipment and storage medium
CN115129863A (en) Intention recognition method, device, equipment, storage medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication