WO2023197520A1

WO2023197520A1 - Data processing method and system, and device and readable storage medium

Info

Publication number: WO2023197520A1
Application number: PCT/CN2022/118104
Authority: WO
Inventors: 郭振华; 邱志勇; 赵雅倩; 李仁刚
Original assignee: 苏州浪潮智能科技有限公司
Priority date: 2022-04-14
Filing date: 2022-09-09
Publication date: 2023-10-19
Also published as: CN114461568B; CN114461568A

Abstract

Disclosed in the present application are a data processing method and system, and a device and a readable storage medium in the technical field of computers. In the present application, a host is connected to a hardware computing platform by means of a CXL protocol, so that the host and the hardware computing platform may share the memory, IO and cache of each other. In this way, training data does not need to be transmitted by means of storage mediums such as a host memory, a GPU cache and a GPU memory; instead, training data in the host memory is directly read by the hardware computing platform, thereby reducing the overhead of data transmission. Moreover, the hardware computing platform may adjust a learning rate on the basis of a moment moving average current value and then calculate new parameters of a model, so that model parameters can be stabilized, the model precision can be guaranteed, and the training efficiency can be improved.

Description

A data processing method, system, equipment and readable storage medium

Cross-references to related applications

This application requires the priority of the Chinese patent application submitted to the China Patent Office on April 14, 2022, with the application number 202210387060.8, and the application name is "A data processing method, system, equipment and readable storage medium", and its entire content incorporated herein by reference.

Technical field

The present application relates to the field of computer technology, and in particular, to a data processing method, system, equipment and non-volatile computer-readable storage medium.

Background technique

Currently, model training can be carried out with the help of hardware modules (such as GPU, Graphics Processing Unit). For example: the server as the host sends a large amount of training data to the hardware module, and the hardware module processes the training data for model training. After the model training is completed, the hardware module feeds back the trained model to the host. However, the inventor realized that due to the large amount of training data, and the data transmission between the host and the hardware module needs to go through storage media such as host memory, GPU cache, GPU memory, etc., the data transmission between the host and the hardware module is The overhead is large and will affect the model training efficiency.

Contents of the invention

On the one hand, this application provides a data processing method, which is applied to a hardware computing platform connected to a host through the CXL (Compute Express Link, high-speed interconnection communication protocol) protocol, including:

The training data used to train the target model is shared in the host based on the CXL protocol;

Call the target model to process the training data to obtain the training results, and calculate new parameters of the target model based on the training results; wherein, calculating the new parameters includes: determining the current value of the moment moving average, and adjusting the learning rate based on the current value of the moment moving average, based on the adjusted New parameters for learning rate calculation;

Update the target model with new parameters to obtain a new model; and

In response to the new model meeting the convergence conditions, the new model is retained and the host is enabled to share the new model based on the CXL protocol.

In one embodiment, determining the current value of the moment moving average and adjusting the learning rate based on the current value of the moment moving average includes:

Determine the current value of the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average; and

In response to the current value of the moment moving average being greater than the preset threshold, the warmup strategy is used to adjust the learning rate; in response to the current value of the moment moving average being less than or equal to the preset threshold, the stochastic gradient descent and momentum algorithms are used to adjust the learning rate.

In one embodiment, determining the current value of the moment moving average based on the preset target attenuation coefficient and the moment moving average maximum value includes:

Calculate the current value of the moment moving average according to the first formula; the first formula is:

Among them, ρt is the current value of the moment moving average, ρ∞ is the maximum value of the moment moving average, t represents the current training time, and β2 is the target attenuation coefficient.

In one embodiment, the warmup strategy is used to adjust the learning rate, including:

Calculate the update gradient at the current training moment based on the training data, training results, and model parameters output at the previous training moment;

Calculate a new first moving average based on the preset object attenuation coefficient, update gradient and the first moving average of the previous training moment;

Calculate a new second moving average based on the updated gradient, target attenuation coefficient, new first moving average, and second moving average at the previous training moment;

Calculate the learning rate at the current training time based on the new second moving average and the target attenuation coefficient; and

Accordingly, new parameters are calculated based on the adjusted learning rate, including:

Calculate new parameters based on the learning rate at the current training moment, the model parameters output at the previous training moment, the preset forward step length, the correction term of the new second moving average and the correction term of the new first moving average.

In one embodiment, stochastic gradient descent and momentum algorithms are used to adjust the learning rate, including:

Calculate the learning rate at the current training moment based on the preset iteration parameters and the preset forward step length, the target moving average of the previous training moment and the update gradient; and

Calculate new parameters based on the learning rate at the current training moment and the model parameters output at the previous training moment.

In one embodiment, the hardware computing platform includes multiple computing modules, and each computing module shares memory based on the CXL protocol.

In one embodiment, the computing module includes: any one or combination of CPU, GPU, FPGA, and ASIC.

In one embodiment, calculating the update gradient at the current training moment based on the training data, training results, and model parameters output at the previous training moment includes:

The calculation formula of the updated gradient gt at the current training time is:

Among them, gt is the updated gradient at the current training time, θt-1 represents the model parameters output at the previous training time,

represents the derivation of θ, X is the training data, and ft(θt-1; X) represents the training result for the training data.

In one embodiment, calculating a new first moving average based on the preset object attenuation coefficient, update gradient and the first moving average of the previous training moment includes:

The calculation formula of the new first moving average mt is:

m _t =β ₁ m _t-1 +(1-β ₁ )g _t

Among them, mt is the new first moving average, β1 is the object attenuation coefficient, mt-1 is the first moving average of the previous training moment, and gt is the updated gradient of the current training moment.

In one embodiment, calculating a new second moving average based on the updated gradient, the target attenuation coefficient, the new first moving average, and the second moving average of the previous training moment includes:

The calculation formula of the new second moving average vt is:

v _t =β ₂ v _t-1 +(1-β ₂ )(g _t -m _t ) ²

Among them, vt is the new second moving average, β2 is the target attenuation coefficient, mt is the new first moving average, vt-1 is the second moving average of the previous training moment, and gt is the updated gradient of the current training moment.

In one embodiment, calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient includes:

The calculation formula of the learning rate lt in response to the training time ρt>4 is:

ρt>4 means that the current value of the moment sliding average is greater than the preset threshold, the preset threshold value is 4, vt is the new second sliding average, and β2 is the target attenuation coefficient.

In one embodiment, calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient also includes:

The calculation formula of the learning rate lt in response to the training time ρt≤4 is:

ρt≤4 means that the current value of the moment moving average is not greater than the preset threshold, that is, the preset threshold value is 4; among them, lt-1 is the learning rate output at the previous training moment, αt is the forward step length, and gt is the current training moment. The update gradient of , ε is the preset iteration parameter.

In one embodiment, new parameters are calculated based on the adjusted learning rate, including:

The calculation formula of the new parameter θt is:

ρt>4 means that the current value of the moment moving average is greater than the preset threshold, the preset threshold value is 4, αt is the forward step length,

rt is the correction term of the new second moving average vt,

is the correction term of the new first moving average mt; ρt is the current value of the moment moving average, and ρ∞ is the maximum value of the moment moving average.

In one embodiment, calculating new parameters based on the adjusted learning rate also includes:

The calculation formula of the new parameter θt is:

ρt≤4 means that the current value of the moment moving average is not greater than the preset threshold, which is 4, and θt-1 means the model parameters output at the previous training moment.

In one embodiment, the new first moving average mt determines the descending direction of the gradient during the model training process, and vt and αt jointly determine the descending size of the gradient during the model training process.

In one of the embodiments, the new first moving average mt is calculated

Used to calculate new parameters to reduce calculation errors.

In one of the embodiments, in the early stage of model training,

Increase the original new first moving average mt;

In response to t becoming a larger value, β1t approaches 0, 1-β1t approaches 1, and finally

The value approaches the original new first moving average mt.

The second aspect of this application provides a data processing system, including: a host, and a hardware computing platform connected to the host through the CXL protocol;

Host, used to provide training data for training target models; new models trained based on the CXL protocol shared hardware computing platform; and

Hardware computing platform, used to share training data in the host based on the CXL protocol; call the target model to process the training data to obtain training results, and calculate new parameters of the target model based on the training results; use the new parameters to update the target model to obtain a new model; if new If the model meets the convergence conditions, the new model is retained; calculating new parameters includes: determining the current value of the moment moving average, adjusting the learning rate based on the current value of the moment moving average, and calculating new parameters based on the adjusted learning rate.

A third aspect of this application provides an electronic device, including:

One or more memories for storing computer-readable instructions; and

One or more processors, used to execute computer-readable instructions to implement the aforementioned disclosed data processing methods.

A fourth aspect of the application provides one or more non-volatile computer-readable storage media storing computer-readable instructions. When executed by one or more processors, the computer-readable instructions cause one or more processes to The processor executes the data processing method disclosed above.

Description of the drawings

In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings in the following description are only This is an embodiment of the present application. For those of ordinary skill in the art, other drawings can be obtained based on the provided drawings without exerting creative efforts.

Figure 1 is a flow chart of a data processing method provided in one or more embodiments of the present application;

Figure 2 is a schematic diagram of a system framework provided in one or more embodiments of the present application;

Figure 3 is a schematic diagram of a connection between devices provided in one or more embodiments of the present application;

Figure 4 is a schematic diagram of memory sharing based on the CXL protocol provided in one or more embodiments of the present application;

Figure 5 is a schematic diagram of an electronic device provided in one or more embodiments of the present application.

Detailed ways

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of this application.

At present, due to the large amount of training data, and the data transmission between the host and the hardware module needs to go through storage media such as host memory, GPU cache, and GPU memory, the data transmission overhead between the host and the hardware module is large and will Affects model training efficiency. To this end, this application provides a data processing solution that can reduce the data transmission overhead between the host and the hardware module and improve model training efficiency.

As shown in Figure 1, an embodiment of the present application discloses a data processing method, which is applied to a hardware computing platform connected to a host through the CXL protocol, including:

S101. Share the training data in the host for training the target model based on the CXL protocol.

In this embodiment, the hardware computing platform includes multiple computing modules, and each computing module shares memory based on the CXL protocol. Computing modules include: any one or combination of CPU, GPU, FPGA, and ASIC. The target model can be any model, such as CNN, natural language processing model, image classification model, etc.

S102. Call the target model to process the training data to obtain training results, and calculate new parameters of the target model based on the training results. Calculating the new parameters includes: determining the current value of the moment moving average, and adjusting the learning rate based on the current value of the moment moving average. The new parameters are calculated using the subsequent learning rate.

It should be noted that the model training process is the process of updating model parameters. Current optimization algorithms used to update model parameters include AdaGrad, RMSProp, Adam, etc. Improved algorithms for Adam such as Radam, Adabelief, etc.

This embodiment uses Adabelief to update model parameters. Specifically, based on Adabelief, parameters such as the forward step length, two attenuation coefficients, iteration parameter ε, and the maximum value of the moment moving average can be set. After each training result is obtained, new model parameters can be calculated based on these parameters at the previous training moment. In this embodiment, in order to avoid the impact of the learning rate on parameter calculation, the current value of the moment sliding average is first calculated, and based on the moment sliding The average current value adjusts the learning rate before calculating new parameters, so that the appropriate learning rate can be determined and the model parameters can be steadily updated. Among them, the calculated new parameters include the weight parameters and bias parameters of the model, that is, the new parameters of the model calculated each time are a collection of many parameters.

S103. Update the target model using new parameters to obtain a new model.

S104. If the new model meets the convergence conditions, the new model is retained and the host is allowed to share the new model based on the CXL protocol.

Specifically, in response to the current new model not meeting the convergence conditions, continue training the current model until the new model obtained by training meets the convergence conditions. Among them, the convergence conditions can be set with reference to existing related technologies, such as reaching the maximum number of iterations, etc.

It can be seen that in this embodiment, the host and the hardware computing platform are connected through the CXL protocol. Therefore, the host and the hardware computing platform can share each other's memory, IO and cache. Then the training data is transmitted from the host to the hardware computing platform without going through the host memory, Instead of storage media such as GPU cache and GPU memory, the hardware computing platform directly reads the training data in the host memory, thereby reducing data transmission overhead. At the same time, during the process of training the model, the hardware computing platform can adjust the learning rate based on the current value of the moment moving average, and calculate new parameters of the model based on the adjusted learning rate, thereby stabilizing the model parameters, avoiding falling into local optimality, and ensuring model accuracy. , improve training efficiency. It can be seen that this solution can reduce the data transmission overhead between the host and the hardware module and improve the efficiency of model training.

Based on the above embodiments, in a specific implementation, determining the current value of the moment moving average and adjusting the learning rate based on the current value of the moment moving average includes: determining the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average. Current value; in response to the current value of the moment moving average being greater than the preset threshold, use the warmup strategy to adjust the learning rate; corresponding to the current value of the moment moving average being greater than the preset threshold, use stochastic gradient descent and momentum algorithms to adjust the learning rate.

In a specific implementation, determining the current value of the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average includes: calculating the current value of the moment moving average according to a first formula; the first formula is:

Among them, ρt is the current value of the moment moving average, ρ∞ is the maximum value of the moment moving average, t represents the current training time, and β2 is the target attenuation coefficient. Among them, ρ∞=[1/(1-β2)]-1=β2/(1-β2).

In a specific implementation, the warmup strategy is used to adjust the learning rate, including: calculating the update gradient at the current training moment based on the training data, training results, and model parameters output at the previous training moment; based on the preset object attenuation coefficient, update gradient Calculate a new first moving average with the first moving average of the previous training moment; calculate a new second moving average based on the updated gradient, target attenuation coefficient, new first moving average and the second moving average of the previous training moment; calculate a new second moving average based on the new The sliding average and target attenuation coefficient are used to calculate the learning rate at the current training moment; accordingly, new parameters are calculated based on the adjusted learning rate, including: the learning rate based on the current training moment, the model parameters output at the previous training moment, and the preset The forward step length, the correction term of the new second moving average, and the correction term of the new first moving average calculate new parameters.

When the current value of the moment moving average is greater than the preset threshold, after adjusting the learning rate using the warmup strategy, the process of calculating new parameters includes:

(1) Assuming that the current training time is t, then the calculation formula of the update gradient gt at the current training time is:

(2) The calculation formula of the new first moving average mt at the current training time is: m _t =β ₁ m _t-1 +(1-β ₁ )g _t . Among them, mt is the new first moving average, β1 is the object attenuation coefficient, mt-1 is the first moving average of the previous training moment, and gt is the updated gradient of the current training moment.

(3) The calculation formula of the new second moving average vt at the current training time is: v _t =β ₂ v _t-1 +(1-β ₂ )(g _t -m _t ) ² . Among them, vt is the new second moving average, β2 is the target attenuation coefficient, mt is the new first moving average, vt-1 is the second moving average of the previous training moment, and gt is the updated gradient of the current training moment.

(4) The calculation formula of the learning rate lt when the current training time ρt>4 is:

ρt>4 means that the current value of the moment moving average is greater than the preset threshold, that is, the preset threshold value is 4. Among them, vt is the new second moving average, and β2 is the target attenuation coefficient.

(5) The calculation formula of the new parameter θt at the current training time is:

ρt>4 means that the current value of the moment moving average is greater than the preset threshold, that is, the preset threshold value is 4. Among them, αt is the forward step length,

rt is the correction term of the new second moving average vt,

is the correction term of the new first moving average mt. ρt is the current value of the moment moving average, and ρ∞ is the maximum value of the moment moving average.

Among them, mt determines the direction of gradient descent during model training, and vt and αt jointly determine the magnitude of gradient descent during model training. And for mt, we get

Then calculating new parameters can make the calculation error always relatively small. That is: passed in the early stage of model training

Increase the original mt; when t becomes larger, β1t approaches 0, so 1-β1t approaches 1, so the later

Close to the original mt. According to this, when ρt>4, the learning rate gradually and steadily increases, which helps to slow down the early over-fitting phenomenon of the model in the initial training stage and maintain the stability of the distribution.

In a specific implementation, using stochastic gradient descent and momentum algorithms to adjust the learning rate includes: calculating the updated gradient at the current training time based on training data, training results, and model parameters output at the previous training time; based on preset iteration parameters Calculate the learning rate at the current training moment with the preset forward step length, the target moving average of the previous training moment, and the update gradient; accordingly, calculate new parameters based on the adjusted learning rate, including: based on the learning rate at the current training moment and Calculate new parameters based on the model parameters output at the previous training moment.

When the current value of the moment moving average is not greater than the preset threshold, after adjusting the learning rate using stochastic gradient descent and momentum algorithms, the process of calculating new parameters includes:

(2) The calculation formula of the learning rate lt when the current training time ρt≤4 is:

ρt≤4 means that the current value of the moment moving average is not greater than the preset threshold, that is, the preset threshold value is 4. Among them, lt-1 is the learning rate output at the previous training moment, αt is the forward step length, gt is the update gradient at the current training moment, and ε is the preset iteration parameter.

(3) The calculation formula of the new parameter θt at the current training time is:

ρt≤4 means that the current value of the moment moving average is not greater than the preset threshold, that is, the preset threshold value is 4. Among them, θt-1 represents the model parameters output at the previous training moment.

When ρt≤4, the stochastic gradient descent plus momentum (SGD+Momentum) algorithm can be used to effectively avoid the negative learning rate and keep the learning rate in a relatively stable fluctuation state in the early stage.

The following embodiment builds a hardware interconnection system based on the CXL protocol for model training, which can effectively solve data transmission delay and bandwidth problems, and can support various mainstream communication topologies such as Parameter server and Ring-Allreduce.

Specifically, the hardware interconnection system provided by this embodiment includes computing devices CPU, GPU, FPGA, and ASIC. It can realize memory sharing of multiple heterogeneous computing devices through the CXL protocol, open up the communication delay barrier between heterogeneous devices, and significantly To increase the speed of data interaction, please see Figure 2 for the overall architecture of the system.

As shown in Figure 2, Python is used to implement the top-level deep learning framework, and OneAPI programming is used to implement the target operator. The target operator can be called by the top-level deep learning framework and runs on different underlying computing devices. The different underlying computing devices CPU, GPU, FPGA, and ASIC are interconnected through the CXL protocol, and each computing device and the host device are also connected through the CXL protocol. Among them, the target operator implementation includes: the model that needs to be trained, the Rectified-Adabelief optimization algorithm and its related parameters.

Specifically, the topological connection diagram between each device can be seen in Figure 3. In Figure 3, each computing device (CPU, GPU, FPGA, ASIC, etc.) is connected to the host device through an adapter device. According to the connection structure shown in Figure 3, each computing device can be shared between different host devices, that is, different hosts share all computing devices. Each connection line shown in Figure 3 uses the CXL protocol to realize interconnection sharing of IO, cache and memory.

Taking the memory sharing of each computing device as an example, the schematic diagram of memory sharing of each computing device is shown in Figure 4. When each host and each computing device accesses the memory of a certain computing device, it is like accessing its own memory.

It can be seen that this embodiment uses the Adabelief optimization algorithm to solve the problem of excessive learning rate variance caused by insufficient data in the early stage of training of the optimization algorithm, achieve faster convergence speed when completing various deep learning tasks, and avoid prematurely falling into local problems. Optimal solution. At the same time, a heterogeneous computing system that implements the distributed Rectified-Adabelief optimization algorithm is built based on the CXL communication protocol, and the Rectified-Adabelief optimization algorithm is implemented based on the OneAPI programming model, so that it can run on a variety of heterogeneous computing devices. Achieve memory consistency between heterogeneous computing devices, greatly increase data transmission bandwidth, and reduce data interaction delays between computing devices.

A data processing system provided by an embodiment of the present application is introduced below. The data processing system described below and the data processing method described above can be referred to each other.

The embodiment of the present application discloses a data processing system, including: a host, and a hardware computing platform connected to the host through the CXL protocol;

In a specific implementation, the hardware computing platform is specifically used for:

If the current value of the moment moving average is greater than the preset threshold, the warmup strategy is used to adjust the learning rate; otherwise, the stochastic gradient descent and momentum algorithms are used to adjust the learning rate.

Accordingly, the hardware computing platform is specifically used for:

In a specific implementation manner, the hardware computing platform includes multiple computing modules, and each computing module shares memory based on the CXL protocol.

In a specific implementation, the computing module includes: any one or combination of CPU, GPU, FPGA, and ASIC.

For more specific working processes of each module and unit in this embodiment, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be described again here.

It can be seen that this embodiment provides a data processing system that can reduce data transmission overhead between the host and the hardware module and improve model training efficiency.

An electronic device provided by an embodiment of the present application is introduced below. An electronic device described below and a data processing method and system described above may be referred to each other.

Referring to Figure 5, an embodiment of the present application discloses an electronic device, including:

One or more memories 501 for storing computer readable instructions;

One or more processors 502 are configured to execute computer-readable instructions to implement the methods disclosed in any of the above embodiments.

The following is an introduction to a non-volatile computer-readable storage medium provided by embodiments of the present application. The non-volatile computer-readable storage medium described below and the data processing method, system and device described above can be Cross-reference.

A non-volatile computer-readable storage medium used to store a computer program, wherein the computer-readable instructions implement the data processing method disclosed in the foregoing embodiment when executed by a processor. Regarding the specific steps of this method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be described again here.

"First", "second", "third", "fourth", etc. (if present) mentioned in this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein. Furthermore, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusions, e.g., a process, method or apparatus that encompasses a series of steps or units need not be limited to those steps or units expressly listed. , but may include other steps or elements not expressly listed or inherent to such processes, methods or apparatuses.

It should be noted that the descriptions involving “first”, “second”, etc. in this application are for descriptive purposes only and cannot be understood as indicating or implying their relative importance or implicitly indicating the number of indicated technical features. . Therefore, features defined as "first" and "second" may explicitly or implicitly include at least one of these features. In addition, the technical solutions in various embodiments can be combined with each other, but it must be based on the realization by those of ordinary skill in the art. When the combination of technical solutions is contradictory or cannot be realized, it should be considered that such a combination of technical solutions does not exist. , nor is it within the scope of protection required by this application.

Each embodiment in this specification is described in a progressive manner. Each embodiment focuses on its differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein may be implemented directly in hardware, in software modules executed by a processor, or in a combination of both. Software modules may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or anywhere in the field of technology. Any other form of non-volatile computer-readable storage medium known to the public.

This article uses specific examples to illustrate the principles and implementation methods of the present application. The description of the above embodiments is only used to help understand the method and the core idea of the present application; at the same time, for those of ordinary skill in the art, based on this application There will be changes in the specific implementation and scope of application of the ideas. In summary, the contents of this description should not be understood as limiting the present application.

Claims

A data processing method, characterized in that it is applied to a hardware computing platform connected to a host through the CXL protocol, including:

Share the training data used to train the target model in the host based on the CXL protocol;

Calling the target model to process the training data to obtain training results, and calculating new parameters of the target model based on the training results; wherein calculating the new parameters includes: determining the current value of the moving average of moments, and calculating the current value of the moving average of moments based on the training results. The sliding average current value adjusts the learning rate, and the new parameters are calculated based on the adjusted learning rate;

Update the target model using the new parameters to obtain a new model; and

In response to the new model meeting the convergence condition, the new model is retained and the host is allowed to share the new model based on the CXL protocol.
The method of claim 1, wherein determining the current value of the moving average of moments and adjusting the learning rate based on the current value of the moving average of moments includes:

Determine the current value of the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average; and

In response to the current value of the moment moving average being greater than the preset threshold, a warmup strategy is used to adjust the learning rate; or in response to the current value of the moment moving average being less than or equal to the preset threshold, stochastic gradient descent and momentum algorithms are used to adjust the learning rate. Describe the learning rate.
The method of claim 2, wherein determining the current value of the moment moving average based on the preset target attenuation coefficient and the moment moving average maximum value includes:

Calculate the current value of the moment moving average according to the first formula; the first formula is:

Wherein, ρt is the current value of the moment moving average, ρ∞ is the maximum value of the moment moving average, t represents the current training time, and β2 is the target attenuation coefficient.
The method of claim 2, wherein adjusting the learning rate using a warmup strategy includes:

Calculate the update gradient at the current training moment based on the training data, the training results and the model parameters output at the previous training moment;

Calculate a new first moving average based on the preset object attenuation coefficient, the updated gradient and the first moving average of the previous training moment;

Calculate a new second moving average based on the updated gradient, the target attenuation coefficient, the new first moving average and the second moving average of the previous training moment; and

Calculate the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient;

Correspondingly, calculating the new parameters based on the adjusted learning rate includes:

The calculation is based on the learning rate at the current training moment, the model parameters output at the previous training moment, the preset forward step length, the correction term of the new second moving average and the correction term of the new first moving average. New parameters.
The method of claim 2, wherein adjusting the learning rate using stochastic gradient descent and momentum algorithms includes:

Calculate the update gradient at the current training time based on the training data, the training results, and the model parameters output at the previous training time; and

The learning rate at the current training moment is calculated based on the preset iteration parameters and the preset forward step length, the target moving average of the previous training moment and the update gradient; accordingly, the new calculation is based on the adjusted learning rate. Parameters, including:

The new parameters are calculated based on the learning rate at the current training time and the model parameters output at the previous training time.
The method according to any one of claims 1 to 5, characterized in that the hardware computing platform includes multiple computing modules, and each computing module shares memory based on the CXL protocol.
The method of claim 4, wherein calculating the update gradient at the current training moment based on the training data, the training results and the model parameters output at the previous training moment includes:

The calculation formula of the updated gradient gt at the current training moment is:

Where, gt is the updated gradient at the current training moment, θt-1 represents the model parameters output at the previous training moment,
represents the derivation of θ, X is the training data, and ft(θt-1; X) represents the training result for the training data.
The method of claim 4, wherein calculating a new first moving average based on the preset object attenuation coefficient, the updated gradient and the first moving average of the previous training moment includes:

The calculation formula of the new first moving average mt is:

m t =β 1 m t-1 +(1-β 1 )g t

Where, mt is the new first moving average, β1 is the object attenuation coefficient, mt-1 is the first moving average of the previous training moment, and gt is the updated gradient of the current training moment.
The method according to claim 4, wherein the new second moving average is calculated based on the updated gradient, the target attenuation coefficient, the new first moving average and the second moving average of the previous training moment. ,include:

The calculation formula of the new second moving average vt is:

v t =β 2 v t-1 +(1-β 2 )(g t -m t ) 2

Where, vt is the new second moving average, β2 is the target attenuation coefficient, mt is the new first moving average, vt-1 is the second moving average of the previous training moment, and gt is the The updated gradient at the current training time.
The method of claim 4, wherein calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient includes:

The calculation formula of the learning rate lt in response to the training time ρt>4 is:

ρt>4 indicates that the current value of the moment moving average is greater than the preset threshold, and the preset threshold value is 4, vt is the new second moving average, and β2 is the target attenuation coefficient.
The method of claim 4, wherein calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient further includes:

The calculation formula of the learning rate lt in response to the training time ρt≤4 is:

ρt≤4 means that the current value of the moment moving average is not greater than the preset threshold, that is, the preset threshold value is 4; where, lt-1 is the learning rate output at the previous training moment, αt is the forward step length, and gt is the current The updated gradient at the training time, ε is the preset iteration parameter.
The method of claim 4, wherein calculating the new parameters based on the adjusted learning rate includes:

The calculation formula of the new parameter θt is:

ρt>4 means that the current value of the moment moving average is greater than the preset threshold, the preset threshold value is 4, αt is the forward step length,
rt is the correction term of the new second moving average vt,
is the correction term of the new first moving average mt; ρt is the current value of the moment moving average, and ρ∞ is the maximum value of the moment moving average.
The method of claim 4, wherein calculating the new parameters based on the adjusted learning rate further includes:

The calculation formula of the new parameter θt is:

ρt≤4 means that the current value of the moment moving average is not greater than the preset threshold, and the preset threshold value is 4, and θt-1 means the model parameters output at the previous training moment.
The method according to claim 12, characterized in that the new first moving average mt determines the descending direction of the gradient during the model training process, and vt and αt jointly determine the descending size of the gradient during the model training process.
The method according to claim 14, characterized in that: the new first moving average mt is calculated
Used to calculate new parameters to reduce calculation errors.
The method according to claim 15, characterized in that in the early stage of model training, through the
Increase the original new first moving average mt;

In response to t becoming a larger value, β1t approaches 0, 1-β1t approaches 1, and finally
The value approaches the original new first moving average mt.
The method according to claim 6, characterized in that the computing module includes: any one or combination of CPU, GPU, FPGA, and ASIC.
A data processing system, characterized by comprising: a host, and a hardware computing platform connected to the host through the CXL protocol;

The host is used to provide training data for training a target model; a new model obtained by training on a shared hardware computing platform based on the CXL protocol; and

The hardware computing platform is used to share the training data in the host based on the CXL protocol; call the target model to process the training data to obtain training results, and calculate new parameters of the target model based on the training results ;Use the new parameters to update the target model to obtain a new model; if the new model meets the convergence conditions, retain the new model; wherein, calculating the new parameters includes: determining the current value of the moment moving average, and based on The current value of the moment moving average is adjusted to the learning rate, and the new parameters are calculated based on the adjusted learning rate.
An electronic device, characterized by including:

One or more memories for storing computer-readable instructions; and

One or more processors, configured to execute the computer readable instructions to implement the method according to any one of claims 1 to 17.
One or more non-volatile computer-readable storage media storing computer-readable instructions, characterized in that, when executed by one or more processors, the computer-readable instructions cause the one or more processors to The steps of the method as claimed in any one of claims 1 to 17 are carried out.