WO2023197520A1 - Data processing method and system, and device and readable storage medium - Google Patents

Data processing method and system, and device and readable storage medium

Info

Publication number
WO2023197520A1
WO2023197520A1 (PCT/CN2022/118104)
Authority
WO
WIPO (PCT)
Prior art keywords
moving average
training
new
moment
model
Prior art date
Application number
PCT/CN2022/118104
Other languages
French (fr)
Chinese (zh)
Inventor
郭振华
邱志勇
赵雅倩
李仁刚
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Publication of WO2023197520A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38 Information transfer, e.g. on bus
    • G06F 13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/544 Buffers; Shared memory; Pipes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present application relates to the field of computer technology, and in particular to a data processing method, system, device and non-volatile computer-readable storage medium.
  • model training can be carried out with the help of hardware modules (such as a GPU, Graphics Processing Unit).
  • for example, a server acting as the host sends a large amount of training data to the hardware module, and the hardware module processes the training data for model training. After the model training is completed, the hardware module feeds the trained model back to the host.
  • the inventor realized that, because the amount of training data is large and data transmission between the host and the hardware module needs to pass through storage media such as host memory, GPU cache and GPU memory, the data transmission overhead between the host and the hardware module is large, which affects model training efficiency.
  • this application provides a data processing method, which is applied to a hardware computing platform connected to a host through the CXL (Compute Express Link, high-speed interconnection communication protocol) protocol, including:
  • the training data used to train the target model is shared in the host based on the CXL protocol;
  • calculating the new parameters includes: determining the current value of the moment moving average, adjusting the learning rate based on the current value of the moment moving average, and calculating the new parameters based on the adjusted learning rate;
  • in response to the new model meeting the convergence conditions, the new model is retained and the host is enabled to share the new model based on the CXL protocol.
  • determining the current value of the moment moving average and adjusting the learning rate based on the current value of the moment moving average includes:
  • in response to the current value of the moment moving average being greater than the preset threshold, the warmup strategy is used to adjust the learning rate; in response to the current value of the moment moving average being less than or equal to the preset threshold, stochastic gradient descent with momentum is used to adjust the learning rate.
  • determining the current value of the moment moving average based on the preset target attenuation coefficient and the moment moving average maximum value includes:
  • the first formula is: $\rho_t = \rho_\infty - \frac{2t\beta_2^t}{1-\beta_2^t}$, where $\rho_t$ is the current value of the moment moving average, $\rho_\infty$ is the maximum value of the moment moving average, $t$ denotes the current training time, and $\beta_2$ is the target attenuation coefficient.
  • the warmup strategy is used to adjust the learning rate, including:
  • new parameters are calculated based on the adjusted learning rate, including:
  • stochastic gradient descent and momentum algorithms are used to adjust the learning rate, including:
  • new parameters are calculated based on the adjusted learning rate, including:
  • the hardware computing platform includes multiple computing modules, and each computing module shares memory based on the CXL protocol.
  • the computing module includes: any one or combination of CPU, GPU, FPGA, and ASIC.
  • calculating the update gradient at the current training moment based on the training data, the training results, and the model parameters output at the previous training moment includes: computing $g_t = \nabla_\theta f_t(\theta_{t-1}; X)$, where $g_t$ is the update gradient at the current training moment, $\theta_{t-1}$ represents the model parameters output at the previous training moment, $\nabla_\theta$ denotes taking the derivative with respect to $\theta$, $X$ is the training data, and $f_t(\theta_{t-1}; X)$ represents the training result for the training data.
  • calculating a new first moving average based on the preset object attenuation coefficient, the update gradient and the first moving average of the previous training moment includes: computing $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$, where $m_t$ is the new first moving average, $\beta_1$ is the object attenuation coefficient, $m_{t-1}$ is the first moving average of the previous training moment, and $g_t$ is the update gradient at the current training moment.
  • calculating a new second moving average based on the update gradient, the target attenuation coefficient, the new first moving average, and the second moving average of the previous training moment includes: computing $v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t - m_t)^2$, where $v_t$ is the new second moving average, $\beta_2$ is the target attenuation coefficient, $m_t$ is the new first moving average, $v_{t-1}$ is the second moving average of the previous training moment, and $g_t$ is the update gradient at the current training moment.
  • calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient includes: in response to $\rho_t > 4$, computing $l_t = \sqrt{(1-\beta_2^t)/v_t}$; $\rho_t > 4$ indicates that the current value of the moment moving average is greater than the preset threshold, which takes the value 4, $v_t$ is the new second moving average, and $\beta_2$ is the target attenuation coefficient.
  • calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient further includes: in response to $\rho_t \le 4$, computing the learning rate from $l_{t-1}$, the learning rate output at the previous training moment, the forward step length $\alpha_t$, the update gradient $g_t$ at the current training moment, and the preset iteration parameter $\varepsilon$; $\rho_t \le 4$ indicates that the current value of the moment moving average is not greater than the preset threshold, which takes the value 4.
  • calculating new parameters based on the adjusted learning rate includes: in response to $\rho_t > 4$, computing $\theta_t = \theta_{t-1} - \alpha_t r_t \hat{m}_t l_t$, where $\alpha_t$ is the forward step length, $r_t$ is the correction term of the new second moving average $v_t$, $\hat{m}_t$ is the correction term of the new first moving average $m_t$, $\rho_t$ is the current value of the moment moving average, and $\rho_\infty$ is the maximum value of the moment moving average; $\rho_t > 4$ indicates that the current value of the moment moving average is greater than the preset threshold, which takes the value 4.
  • calculating new parameters based on the adjusted learning rate further includes: in response to $\rho_t \le 4$, computing $\theta_t = \theta_{t-1} - l_t$, where $\theta_{t-1}$ represents the model parameters output at the previous training moment; $\rho_t \le 4$ indicates that the current value of the moment moving average is not greater than the preset threshold, which takes the value 4.
  • the new first moving average $m_t$ determines the descent direction of the gradient during model training, while $v_t$ and $\alpha_t$ jointly determine the descent magnitude.
  • the correction term $\hat{m}_t = m_t/(1-\beta_1^t)$ obtained from the new first moving average $m_t$ is used to calculate the new parameters, so as to reduce calculation error.
  • the second aspect of this application provides a data processing system, including: a host, and a hardware computing platform connected to the host through the CXL protocol;
  • the host is configured to provide training data for training the target model, and to share, based on the CXL protocol, the new model trained by the hardware computing platform; and
  • the hardware computing platform is configured to share, based on the CXL protocol, the training data in the host; call the target model to process the training data to obtain training results, and calculate new parameters of the target model based on the training results; update the target model with the new parameters to obtain a new model; and retain the new model if the new model meets the convergence conditions; calculating the new parameters includes: determining the current value of the moment moving average, adjusting the learning rate based on the current value of the moment moving average, and calculating the new parameters based on the adjusted learning rate.
  • a third aspect of this application provides an electronic device, including:
  • one or more memories for storing computer-readable instructions; and
  • one or more processors configured to execute the computer-readable instructions to implement the data processing method disclosed above.
  • a fourth aspect of the application provides one or more non-volatile computer-readable storage media storing computer-readable instructions; when executed by one or more processors, the computer-readable instructions cause the one or more processors to execute the data processing method disclosed above.
  • Figure 1 is a flow chart of a data processing method provided in one or more embodiments of the present application.
  • Figure 2 is a schematic diagram of a system framework provided in one or more embodiments of the present application.
  • Figure 3 is a schematic diagram of a connection between devices provided in one or more embodiments of the present application.
  • Figure 4 is a schematic diagram of memory sharing based on the CXL protocol provided in one or more embodiments of the present application.
  • Figure 5 is a schematic diagram of an electronic device provided in one or more embodiments of the present application.
  • this application provides a data processing solution that can reduce the data transmission overhead between the host and the hardware module and improve model training efficiency.
  • an embodiment of the present application discloses a data processing method, which is applied to a hardware computing platform connected to a host through the CXL protocol, including:
  • the hardware computing platform includes multiple computing modules, and each computing module shares memory based on the CXL protocol.
  • Computing modules include: any one or combination of CPU, GPU, FPGA, and ASIC.
  • the target model can be any model, such as CNN, natural language processing model, image classification model, etc.
  • S102: Call the target model to process the training data to obtain training results, and calculate new parameters of the target model based on the training results. Calculating the new parameters includes: determining the current value of the moment moving average, adjusting the learning rate based on the current value of the moment moving average, and calculating the new parameters based on the adjusted learning rate.
  • it should be noted that the model training process is the process of updating the model parameters.
  • optimization algorithms currently used to update model parameters include AdaGrad, RMSProp and Adam, as well as improved variants of Adam such as RAdam and AdaBelief.
  • This embodiment uses AdaBelief to update the model parameters. Specifically, based on AdaBelief, parameters such as the forward step length, the two attenuation coefficients, the iteration parameter ε, and the maximum value of the moment moving average can be set. Each time a training result is obtained, new model parameters can be calculated based on these parameters and the values at the previous training moment. In this embodiment, in order to avoid the impact of an unsuitable learning rate on parameter calculation, the current value of the moment moving average is first calculated and the learning rate is adjusted based on that value before the new parameters are calculated, so that an appropriate learning rate can be determined and the model parameters can be updated steadily. The calculated new parameters include the weight parameters and bias parameters of the model; that is, the new model parameters calculated each time are a collection of many parameters.
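For concreteness, the following is a minimal, self-contained Python sketch of one parameter update of the kind described above. It is a reconstruction under stated assumptions, not the patent's reference implementation: the variance term follows published AdaBelief, the ρ_t rectification and the ρ_t > 4 branch follow the published RAdam recipe (note the description later defines ρ_∞ as β₂/(1-β₂) rather than RAdam's 2/(1-β₂)-1), and the ρ_t ≤ 4 fallback is simplified to a plain momentum step.

```python
import numpy as np

def rectified_adabelief_step(theta, grad, state, t,
                             alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One sketched Rectified-AdaBelief update (assumptions noted above)."""
    m = beta1 * state["m"] + (1 - beta1) * grad             # new first moving average m_t
    v = beta2 * state["v"] + (1 - beta2) * (grad - m) ** 2  # new second moving average v_t
    rho_inf = 2 / (1 - beta2) - 1                           # RAdam's rho_inf (assumed)
    rho_t = rho_inf - 2 * t * beta2**t / (1 - beta2**t)     # moment moving average, current value
    m_hat = m / (1 - beta1**t)                              # correction term of m_t
    if rho_t > 4:                                           # warmup branch
        l_t = np.sqrt((1 - beta2**t) / (v + eps))           # adjusted learning rate l_t
        r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                      / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        theta = theta - alpha * r_t * l_t * m_hat           # new parameters theta_t
    else:                                                   # SGD-with-momentum fallback
        theta = theta - alpha * m_hat
    state["m"], state["v"] = m, v
    return theta, state
```

Called once per step with t starting at 1, this reproduces the two phases: an early momentum-style phase while ρ_t ≤ 4, then the rectified adaptive phase.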
  • in response to the new model meeting the convergence conditions, the new model is retained and the host is enabled to share the new model based on the CXL protocol.
  • the convergence conditions can be set with reference to existing related technologies, for example reaching the maximum number of iterations.
  • the host and the hardware computing platform are connected through the CXL protocol, so the host and the hardware computing platform can share each other's memory, IO and cache; training data transmitted from the host to the hardware computing platform therefore does not need to pass through storage media such as host memory, GPU cache and GPU memory, and the hardware computing platform instead directly reads the training data in the host memory, thereby reducing data transmission overhead.
  • the hardware computing platform can adjust the learning rate based on the current value of the moment moving average and calculate the new model parameters based on the adjusted learning rate, thereby stabilizing the model parameters, avoiding falling into local optima, ensuring model accuracy, and improving training efficiency. It can be seen that this solution can reduce the data transmission overhead between the host and the hardware module and improve the efficiency of model training.
  • determining the current value of the moment moving average and adjusting the learning rate based on the current value of the moment moving average includes: determining the current value of the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average; in response to the current value of the moment moving average being greater than the preset threshold, using the warmup strategy to adjust the learning rate; and in response to the current value of the moment moving average being less than or equal to the preset threshold, using stochastic gradient descent with momentum to adjust the learning rate.
  • determining the current value of the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average includes: calculating the current value of the moment moving average according to a first formula; the first formula is $\rho_t = \rho_\infty - \frac{2t\beta_2^t}{1-\beta_2^t}$, where $\rho_t$ is the current value of the moment moving average, $\rho_\infty$ is the maximum value of the moment moving average, $t$ denotes the current training time, and $\beta_2$ is the target attenuation coefficient.
  • using the warmup strategy to adjust the learning rate includes: calculating the update gradient at the current training moment based on the training data, the training results, and the model parameters output at the previous training moment; calculating a new first moving average based on the preset object attenuation coefficient, the update gradient, and the first moving average of the previous training moment; calculating a new second moving average based on the update gradient, the target attenuation coefficient, the new first moving average, and the second moving average of the previous training moment; and calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient. Correspondingly, calculating new parameters based on the adjusted learning rate includes: calculating the new parameters based on the learning rate at the current training moment, the model parameters output at the previous training moment, the preset forward step length, the correction term of the new second moving average, and the correction term of the new first moving average.
  • the process of calculating the new parameters includes:
  • the new first moving average is computed as $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$, where $m_t$ is the new first moving average, $\beta_1$ is the object attenuation coefficient, $m_{t-1}$ is the first moving average of the previous training moment, and $g_t$ is the update gradient at the current training moment.
  • the new second moving average is computed as $v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t - m_t)^2$, where $v_t$ is the new second moving average, $\beta_2$ is the target attenuation coefficient, and $v_{t-1}$ is the second moving average of the previous training moment.
  • in response to $\rho_t > 4$, that is, the current value of the moment moving average being greater than the preset threshold of 4, the learning rate is computed as $l_t = \sqrt{(1-\beta_2^t)/v_t}$.
  • the new parameters are then computed as $\theta_t = \theta_{t-1} - \alpha_t r_t \hat{m}_t l_t$, where $\alpha_t$ is the forward step length, $r_t = \sqrt{\frac{(\rho_t-4)(\rho_t-2)\rho_\infty}{(\rho_\infty-4)(\rho_\infty-2)\rho_t}}$ is the correction term of the new second moving average $v_t$, $\hat{m}_t = m_t/(1-\beta_1^t)$ is the correction term of the new first moving average $m_t$, $\rho_t$ is the current value of the moment moving average, and $\rho_\infty$ is the maximum value of the moment moving average.
  • $m_t$ determines the descent direction of the gradient during model training, while $v_t$ and $\alpha_t$ jointly determine the descent magnitude.
  • computing $\hat{m}_t = m_t/(1-\beta_1^t)$ before calculating the new parameters keeps the calculation error consistently small: in the early stage of model training, dividing by $1-\beta_1^t$ enlarges the original $m_t$; as $t$ grows, $\beta_1^t$ approaches 0, so $1-\beta_1^t$ approaches 1 and $\hat{m}_t$ approaches the original $m_t$. Accordingly, when $\rho_t > 4$ the learning rate increases gradually and steadily, which helps alleviate premature over-fitting in the initial training stage and keeps the distribution stable.
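A quick numeric illustration of this correction factor (β₁ = 0.9 assumed for the sketch):

```python
# 1/(1 - beta1**t) enlarges m_t early in training and fades to 1 later.
beta1 = 0.9
for t in (1, 5, 10, 50, 100):
    print(t, round(1 / (1 - beta1**t), 3))  # 10.0, 2.442, 1.535, 1.005, 1.0
```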
  • using stochastic gradient descent with momentum to adjust the learning rate includes: calculating the update gradient at the current training moment based on the training data, the training results, and the model parameters output at the previous training moment; and calculating the learning rate at the current training moment based on the preset iteration parameter, the preset forward step length, the target moving average of the previous training moment, and the update gradient. Correspondingly, calculating new parameters based on the adjusted learning rate includes: calculating the new parameters based on the learning rate at the current training moment and the model parameters output at the previous training moment.
  • the process of calculating the new parameters includes:
  • in response to $\rho_t \le 4$, that is, the current value of the moment moving average being not greater than the preset threshold of 4, the learning rate $l_t$ is computed from the learning rate $l_{t-1}$ output at the previous training moment, the forward step length $\alpha_t$, the update gradient $g_t$ at the current training moment, and the preset iteration parameter $\varepsilon$.
  • the new parameters are then computed as $\theta_t = \theta_{t-1} - l_t$, where $\theta_{t-1}$ represents the model parameters output at the previous training moment.
  • using the stochastic gradient descent plus momentum (SGD+Momentum) algorithm effectively avoids negative learning rates and keeps the learning rate fluctuating within a relatively stable range in the early stage.
  • the following embodiment builds a hardware interconnection system based on the CXL protocol for model training, which can effectively mitigate data transmission latency and bandwidth problems and can support mainstream communication topologies such as Parameter Server and Ring-Allreduce.
  • the hardware interconnection system includes CPU, GPU, FPGA and ASIC computing devices. Through the CXL protocol it can realize memory sharing among multiple heterogeneous computing devices, break down the communication-latency barrier between heterogeneous devices, and significantly increase the speed of data interaction; see Figure 2 for the overall system architecture.
  • Python is used to implement the top-level deep learning framework
  • OneAPI programming is used to implement the target operator.
  • the target operator can be called by the top-level deep learning framework and runs on different underlying computing devices.
  • the different underlying computing devices CPU, GPU, FPGA, and ASIC are interconnected through the CXL protocol, and each computing device and the host device are also connected through the CXL protocol.
  • the target operator implementation includes: the model that needs to be trained, the Rectified-Adabelief optimization algorithm and its related parameters.
  • each computing device (CPU, GPU, FPGA, ASIC, etc.) is connected to the host device through an adapter device.
  • each computing device can be shared between different host devices, that is, different hosts share all computing devices.
  • Each connection line shown in Figure 3 uses the CXL protocol to realize interconnection sharing of IO, cache and memory.
  • taking memory sharing among the computing devices as an example, Figure 4 shows a schematic diagram of memory sharing for each computing device.
  • when any host or computing device accesses the memory of another computing device, it is as if it were accessing its own memory.
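To make the two data paths concrete, here is a hedged Python sketch contrasting the conventional staged-copy route with a CXL-style direct read. The zero-copy view stands in for the device loading straight from the host's coherent memory; no real CXL API is being invoked.

```python
import numpy as np

def staged_copy_path(host_buffer: np.ndarray) -> np.ndarray:
    """Conventional route: the data crosses several storage media."""
    staging = host_buffer.copy()   # host memory -> transfer buffer
    device_cache = staging.copy()  # transfer buffer -> device cache
    return device_cache.copy()     # device cache -> device memory

def cxl_shared_path(host_buffer: np.ndarray) -> np.ndarray:
    """CXL-style route: the device reads the host's memory in place."""
    return host_buffer.view()      # zero copies: same bytes, shared address space
```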
  • this embodiment uses the AdaBelief optimization algorithm to address the excessive learning-rate variance caused by insufficient data in the early stage of training, achieving faster convergence on various deep learning tasks and avoiding prematurely falling into a local optimum.
  • a heterogeneous computing system implementing the distributed Rectified-Adabelief optimization algorithm is built on the CXL communication protocol, and the Rectified-Adabelief optimization algorithm is implemented with the OneAPI programming model so that it can run on a variety of heterogeneous computing devices, achieving memory consistency across heterogeneous computing devices, greatly increasing data transmission bandwidth, and reducing data-interaction delays between computing devices.
  • a data processing system provided by an embodiment of the present application is introduced below.
  • the data processing system described below and the data processing method described above can be referred to each other.
  • the embodiment of the present application discloses a data processing system, including: a host, and a hardware computing platform connected to the host through the CXL protocol;
  • the host is configured to provide training data for training the target model, and to share, based on the CXL protocol, the new model trained by the hardware computing platform; and
  • the hardware computing platform is configured to share, based on the CXL protocol, the training data in the host; call the target model to process the training data to obtain training results, and calculate new parameters of the target model based on the training results; update the target model with the new parameters to obtain a new model; and retain the new model if the new model meets the convergence conditions; calculating the new parameters includes: determining the current value of the moment moving average, adjusting the learning rate based on the current value of the moment moving average, and calculating the new parameters based on the adjusted learning rate.
  • the hardware computing platform is specifically used for:
  • in response to the current value of the moment moving average being greater than the preset threshold, the warmup strategy is used to adjust the learning rate; otherwise, stochastic gradient descent with momentum is used to adjust the learning rate.
  • the hardware computing platform is specifically used for:
  • the first formula is: $\rho_t = \rho_\infty - \frac{2t\beta_2^t}{1-\beta_2^t}$, where $\rho_t$ is the current value of the moment moving average, $\rho_\infty$ is the maximum value of the moment moving average, $t$ denotes the current training time, and $\beta_2$ is the target attenuation coefficient.
  • the hardware computing platform is specifically used for:
  • the hardware computing platform includes multiple computing modules, and each computing module shares memory based on the CXL protocol.
  • the computing module includes: any one or combination of CPU, GPU, FPGA, and ASIC.
  • this embodiment provides a data processing system that can reduce data transmission overhead between the host and the hardware module and improve model training efficiency.
  • An electronic device provided by an embodiment of the present application is introduced below.
  • An electronic device described below and a data processing method and system described above may be referred to each other.
  • an electronic device, including:
  • one or more memories 501 for storing computer-readable instructions; and
  • one or more processors 502 configured to execute the computer-readable instructions to implement the methods disclosed in any of the above embodiments.
  • one or more non-volatile computer-readable storage media provided by embodiments of the present application are introduced below.
  • the non-volatile computer-readable storage medium described below and the data processing method, system and device described above can be cross-referenced.
  • for the specific steps of the method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be described again here.
  • RAM (random access memory)
  • ROM (read-only memory)
  • electrically programmable ROM, electrically erasable programmable ROM
  • registers, hard disks, removable disks, CD-ROMs, or any other form of non-volatile computer-readable storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Stored Programmes (AREA)

Abstract

Disclosed in the present application are a data processing method and system, a device, and a readable storage medium in the technical field of computers. In the present application, a host is connected to a hardware computing platform by means of the CXL protocol, so that the host and the hardware computing platform may share each other's memory, IO and cache. In this way, training data does not need to be transmitted through storage media such as host memory, a GPU cache and GPU memory; instead, the training data in the host memory is directly read by the hardware computing platform, thereby reducing the overhead of data transmission. Moreover, the hardware computing platform may adjust the learning rate on the basis of the current value of a moment moving average and then calculate new parameters of a model, so that the model parameters are stabilized, model precision is guaranteed, and training efficiency is improved.

Description

A data processing method, system, device and readable storage medium
Cross-reference to related applications
This application claims priority to the Chinese patent application No. 202210387060.8, filed with the China Patent Office on April 14, 2022 and entitled "A data processing method, system, device and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computer technology, and in particular to a data processing method, system, device and non-volatile computer-readable storage medium.
Background
At present, model training can be carried out with the help of hardware modules (such as a GPU, Graphics Processing Unit). For example, a server acting as the host sends a large amount of training data to the hardware module, the hardware module processes the training data for model training, and after training is completed the hardware module feeds the trained model back to the host. However, the inventor realized that, because the amount of training data is large and data transmission between the host and the hardware module needs to pass through storage media such as host memory, GPU cache and GPU memory, the data transmission overhead between the host and the hardware module is large, which affects model training efficiency.
Summary of the invention
In one aspect, the present application provides a data processing method, applied to a hardware computing platform connected to a host through the CXL (Compute Express Link) protocol, including:
sharing, based on the CXL protocol, the training data in the host for training a target model;
calling the target model to process the training data to obtain training results, and calculating new parameters of the target model based on the training results, where calculating the new parameters includes: determining the current value of the moment moving average, adjusting the learning rate based on the current value of the moment moving average, and calculating the new parameters based on the adjusted learning rate;
updating the target model with the new parameters to obtain a new model; and
in response to the new model meeting the convergence conditions, retaining the new model and enabling the host to share the new model based on the CXL protocol.
In one embodiment, determining the current value of the moment moving average and adjusting the learning rate based on the current value of the moment moving average includes:
determining the current value of the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average; and
in response to the current value of the moment moving average being greater than the preset threshold, adjusting the learning rate using the warmup strategy; in response to the current value of the moment moving average being less than or equal to the preset threshold, adjusting the learning rate using stochastic gradient descent with momentum.
In one embodiment, determining the current value of the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average includes:
calculating the current value of the moment moving average according to a first formula; the first formula is:
$$\rho_t = \rho_\infty - \frac{2t\beta_2^t}{1-\beta_2^t}$$
where $\rho_t$ is the current value of the moment moving average, $\rho_\infty$ is the maximum value of the moment moving average, $t$ denotes the current training time, and $\beta_2$ is the target attenuation coefficient.
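For intuition, a small numeric sketch of the first formula (β₂ = 0.999 is an assumed setting; ρ_∞ is taken as β₂/(1-β₂), following the definition given later in this description):

```python
# Numeric sketch of the first formula; beta2 = 0.999 is assumed.
beta2 = 0.999
rho_inf = beta2 / (1 - beta2)  # maximum value of the moment moving average (defined later)
for t in (1, 10, 100, 1000, 10000):
    rho_t = rho_inf - 2 * t * beta2**t / (1 - beta2**t)
    print(t, round(rho_t, 1), "warmup" if rho_t > 4 else "SGD+momentum")
```

Early in training ρ_t sits below the threshold of 4 (the momentum branch); it rises above 4 as t grows, switching to the warmup branch.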
In one embodiment, adjusting the learning rate using the warmup strategy includes:
calculating the update gradient at the current training moment based on the training data, the training results, and the model parameters output at the previous training moment;
calculating a new first moving average based on the preset object attenuation coefficient, the update gradient, and the first moving average of the previous training moment;
calculating a new second moving average based on the update gradient, the target attenuation coefficient, the new first moving average, and the second moving average of the previous training moment;
calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient; and
correspondingly, calculating the new parameters based on the adjusted learning rate includes:
calculating the new parameters based on the learning rate at the current training moment, the model parameters output at the previous training moment, the preset forward step length, the correction term of the new second moving average, and the correction term of the new first moving average.
In one embodiment, adjusting the learning rate using stochastic gradient descent with momentum includes:
calculating the update gradient at the current training moment based on the training data, the training results, and the model parameters output at the previous training moment;
calculating the learning rate at the current training moment based on the preset iteration parameter, the preset forward step length, the target moving average of the previous training moment, and the update gradient; and
correspondingly, calculating the new parameters based on the adjusted learning rate includes:
calculating the new parameters based on the learning rate at the current training moment and the model parameters output at the previous training moment.
In one embodiment, the hardware computing platform includes multiple computing modules, and the computing modules share memory with one another based on the CXL protocol.
In one embodiment, the computing modules include any one or a combination of a CPU, a GPU, an FPGA, and an ASIC.
In one embodiment, calculating the update gradient at the current training moment based on the training data, the training results, and the model parameters output at the previous training moment includes:
the update gradient $g_t$ at the current training moment is calculated as:
$$g_t = \nabla_\theta f_t(\theta_{t-1}; X)$$
where $g_t$ is the update gradient at the current training moment, $\theta_{t-1}$ represents the model parameters output at the previous training moment, $\nabla_\theta$ denotes taking the derivative with respect to $\theta$, $X$ is the training data, and $f_t(\theta_{t-1}; X)$ represents the training result for the training data.
In one embodiment, calculating the new first moving average based on the preset object attenuation coefficient, the update gradient, and the first moving average of the previous training moment includes:
the new first moving average $m_t$ is calculated as:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
where $m_t$ is the new first moving average, $\beta_1$ is the object attenuation coefficient, $m_{t-1}$ is the first moving average of the previous training moment, and $g_t$ is the update gradient at the current training moment.
In one embodiment, calculating the new second moving average based on the update gradient, the target attenuation coefficient, the new first moving average, and the second moving average of the previous training moment includes:
the new second moving average $v_t$ is calculated as:
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t - m_t)^2$$
where $v_t$ is the new second moving average, $\beta_2$ is the target attenuation coefficient, $m_t$ is the new first moving average, $v_{t-1}$ is the second moving average of the previous training moment, and $g_t$ is the update gradient at the current training moment.
In one embodiment, calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient includes:
in response to the training moment satisfying $\rho_t > 4$, the learning rate $l_t$ is calculated as:
$$l_t = \sqrt{\frac{1-\beta_2^t}{v_t}}$$
$\rho_t > 4$ indicates that the current value of the moment moving average is greater than the preset threshold, which takes the value 4; $v_t$ is the new second moving average and $\beta_2$ is the target attenuation coefficient.
In one embodiment, calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient further includes:
in response to the training moment satisfying $\rho_t \le 4$, the learning rate $l_t$ is calculated from the learning rate $l_{t-1}$ output at the previous training moment, the forward step length $\alpha_t$, the update gradient $g_t$ at the current training moment, and the preset iteration parameter $\varepsilon$; $\rho_t \le 4$ indicates that the current value of the moment moving average is not greater than the preset threshold, which takes the value 4.
In one embodiment, calculating the new parameters based on the adjusted learning rate includes:
the new parameter $\theta_t$ is calculated as:
$$\theta_t = \theta_{t-1} - \alpha_t r_t \hat{m}_t l_t$$
$\rho_t > 4$ indicates that the current value of the moment moving average is greater than the preset threshold, which takes the value 4; $\alpha_t$ is the forward step length, $r_t = \sqrt{\frac{(\rho_t-4)(\rho_t-2)\rho_\infty}{(\rho_\infty-4)(\rho_\infty-2)\rho_t}}$ is the correction term of the new second moving average $v_t$, and $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ is the correction term of the new first moving average $m_t$; $\rho_t$ is the current value of the moment moving average and $\rho_\infty$ is the maximum value of the moment moving average.
In one embodiment, calculating the new parameters based on the adjusted learning rate further includes:
the new parameter $\theta_t$ is calculated as:
$$\theta_t = \theta_{t-1} - l_t$$
$\rho_t \le 4$ indicates that the current value of the moment moving average is not greater than the preset threshold, which takes the value 4, and $\theta_{t-1}$ represents the model parameters output at the previous training moment.
In one embodiment, the new first moving average $m_t$ determines the descent direction of the gradient during model training, while $v_t$ and $\alpha_t$ jointly determine the descent magnitude.
In one embodiment, the correction term $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ obtained from the new first moving average $m_t$ is used to calculate the new parameters, so as to reduce calculation error.
In one embodiment, in the early stage of model training, dividing by $1-\beta_1^t$ enlarges the original new first moving average $m_t$; as $t$ becomes larger, $\beta_1^t$ approaches 0 and $1-\beta_1^t$ approaches 1, so the value of $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ finally approaches the original new first moving average $m_t$.
A second aspect of the present application provides a data processing system, including a host and a hardware computing platform connected to the host through the CXL protocol;
the host is configured to provide training data for training a target model, and to share, based on the CXL protocol, the new model trained by the hardware computing platform; and
the hardware computing platform is configured to share, based on the CXL protocol, the training data in the host; call the target model to process the training data to obtain training results, and calculate new parameters of the target model based on the training results; update the target model with the new parameters to obtain a new model; and retain the new model if the new model meets the convergence conditions; calculating the new parameters includes: determining the current value of the moment moving average, adjusting the learning rate based on the current value of the moment moving average, and calculating the new parameters based on the adjusted learning rate.
A third aspect of the present application provides an electronic device, including:
one or more memories for storing computer-readable instructions; and
one or more processors configured to execute the computer-readable instructions to implement the data processing method disclosed above.
A fourth aspect of the present application provides one or more non-volatile computer-readable storage media storing computer-readable instructions; when executed by one or more processors, the computer-readable instructions cause the one or more processors to execute the data processing method disclosed above.
Description of the drawings
In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Figure 1 is a flow chart of a data processing method provided in one or more embodiments of the present application;
Figure 2 is a schematic diagram of a system framework provided in one or more embodiments of the present application;
Figure 3 is a schematic diagram of connections between devices provided in one or more embodiments of the present application;
Figure 4 is a schematic diagram of memory sharing based on the CXL protocol provided in one or more embodiments of the present application;
Figure 5 is a schematic diagram of an electronic device provided in one or more embodiments of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of this application.
At present, because the amount of training data is large and data transmission between the host and the hardware module needs to pass through storage media such as host memory, GPU cache and GPU memory, the data transmission overhead between the host and the hardware module is large, which affects model training efficiency. To this end, the present application provides a data processing solution that can reduce the data transmission overhead between the host and the hardware module and improve model training efficiency.
As shown in Figure 1, an embodiment of the present application discloses a data processing method, applied to a hardware computing platform connected to a host through the CXL protocol, including:
S101. Share, based on the CXL protocol, the training data in the host for training the target model.
In this embodiment, the hardware computing platform includes multiple computing modules, and the computing modules share memory with one another based on the CXL protocol. The computing modules include any one or a combination of a CPU, a GPU, an FPGA, and an ASIC. The target model can be any model, such as a CNN, a natural language processing model, or an image classification model.
S102. Call the target model to process the training data to obtain training results, and calculate new parameters of the target model based on the training results; where calculating the new parameters includes: determining the current value of the moment moving average, adjusting the learning rate based on the current value of the moment moving average, and calculating the new parameters based on the adjusted learning rate.
It should be noted that the model training process is the process of updating the model parameters. Optimization algorithms currently used to update model parameters include AdaGrad, RMSProp and Adam, as well as improved variants of Adam such as RAdam and AdaBelief.
This embodiment uses AdaBelief to update the model parameters. Specifically, based on AdaBelief, parameters such as the forward step length, the two attenuation coefficients, the iteration parameter ε, and the maximum value of the moment moving average can be set. Each time a training result is obtained, new model parameters can be calculated based on these parameters and the values at the previous training moment. In this embodiment, in order to avoid the impact of an unsuitable learning rate on parameter calculation, the current value of the moment moving average is first calculated and the learning rate is adjusted based on that value before the new parameters are calculated, so that an appropriate learning rate can be determined and the model parameters can be updated steadily. The calculated new parameters include the weight parameters and bias parameters of the model; that is, the new model parameters calculated each time are a collection of many parameters.
S103. Update the target model with the new parameters to obtain a new model.
S104. If the new model meets the convergence conditions, retain the new model and enable the host to share the new model based on the CXL protocol.
Specifically, in response to the current new model not meeting the convergence conditions, the current model continues to be trained until the new model obtained by training meets the convergence conditions. The convergence conditions can be set with reference to existing related technologies, for example reaching the maximum number of iterations.
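As a schematic of the S101-S104 loop, here is a self-contained Python sketch with a toy least-squares model and plain gradient descent standing in for the Rectified-AdaBelief update; nothing in it is the patent's own code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))            # S101: training data read from shared host memory
y = X @ np.array([1.0, -2.0, 3.0, 0.5])
theta = np.zeros(4)                      # target model parameters

for t in range(1, 10_001):
    preds = X @ theta                    # S102: call the model -> training results
    grad = X.T @ (preds - y) / len(X)    # update gradient g_t
    theta = theta - 0.1 * grad           # S102/S103: new parameters -> new model
    if np.linalg.norm(grad) < 1e-8:      # S104: convergence condition
        break                            # retain the model; the host then shares it over CXL
```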
It can be seen that, in this embodiment, the host and the hardware computing platform are connected through the CXL protocol, so the host and the hardware computing platform can share each other's memory, IO and cache. Training data transmitted from the host to the hardware computing platform therefore does not need to pass through storage media such as host memory, GPU cache and GPU memory; instead, the hardware computing platform directly reads the training data in the host memory, which reduces data transmission overhead. At the same time, while training the model, the hardware computing platform can adjust the learning rate based on the current value of the moment moving average and calculate the new model parameters based on the adjusted learning rate, which stabilizes the model parameters, avoids falling into local optima, ensures model accuracy, and improves training efficiency. It can be seen that this solution can reduce the data transmission overhead between the host and the hardware module and improve model training efficiency.
Based on the above embodiments, in a specific implementation, determining the current value of the moment moving average and adjusting the learning rate based on it includes: determining the current value of the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average; in response to the current value of the moment moving average being greater than the preset threshold, adjusting the learning rate using the warmup strategy; and in response to the current value of the moment moving average being less than or equal to the preset threshold, adjusting the learning rate using stochastic gradient descent with momentum.
In a specific implementation, determining the current value of the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average includes: calculating the current value of the moment moving average according to the first formula; the first formula is:
$$\rho_t = \rho_\infty - \frac{2t\beta_2^t}{1-\beta_2^t}$$
where $\rho_t$ is the current value of the moment moving average, $\rho_\infty$ is the maximum value of the moment moving average, $t$ denotes the current training time, and $\beta_2$ is the target attenuation coefficient, with $\rho_\infty = [1/(1-\beta_2)] - 1 = \beta_2/(1-\beta_2)$.
In a specific implementation, adjusting the learning rate using the warmup strategy includes: calculating the update gradient at the current training moment based on the training data, the training results, and the model parameters output at the previous training moment; calculating a new first moving average based on the preset object attenuation coefficient, the update gradient, and the first moving average of the previous training moment; calculating a new second moving average based on the update gradient, the target attenuation coefficient, the new first moving average, and the second moving average of the previous training moment; and calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient. Correspondingly, calculating the new parameters based on the adjusted learning rate includes: calculating the new parameters based on the learning rate at the current training moment, the model parameters output at the previous training moment, the preset forward step length, the correction term of the new second moving average, and the correction term of the new first moving average.
When the current value of the moment moving average is greater than the preset threshold, after the learning rate is adjusted using the warmup strategy, the process of calculating the new parameters includes:
(1) Assuming the current training time is $t$, the update gradient $g_t$ at the current training moment is calculated as:
$$g_t = \nabla_\theta f_t(\theta_{t-1}; X)$$
where $g_t$ is the update gradient at the current training moment, $\theta_{t-1}$ represents the model parameters output at the previous training moment, $\nabla_\theta$ denotes taking the derivative with respect to $\theta$, $X$ is the training data, and $f_t(\theta_{t-1}; X)$ represents the training result for the training data.
(2) The new first moving average $m_t$ at the current training moment is calculated as $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, where $m_t$ is the new first moving average, $\beta_1$ is the object attenuation coefficient, $m_{t-1}$ is the first moving average of the previous training moment, and $g_t$ is the update gradient at the current training moment.
(3) The new second moving average $v_t$ at the current training moment is calculated as $v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t - m_t)^2$, where $v_t$ is the new second moving average, $\beta_2$ is the target attenuation coefficient, $m_t$ is the new first moving average, $v_{t-1}$ is the second moving average of the previous training moment, and $g_t$ is the update gradient at the current training moment.
(4) When $\rho_t > 4$ at the current training moment, the learning rate $l_t$ is calculated as:
$$l_t = \sqrt{\frac{1-\beta_2^t}{v_t}}$$
$\rho_t > 4$ indicates that the current value of the moment moving average is greater than the preset threshold, which takes the value 4; $v_t$ is the new second moving average and $\beta_2$ is the target attenuation coefficient.
(5) The new parameters θ_t at the current training moment are computed as:

θ_t = θ_{t-1} − α_t·r_t·m̂_t·l_t  (when ρ_t > 4)

Here ρ_t > 4 means the current value of the moment moving average is greater than the preset threshold, i.e., the preset threshold is 4; α_t is the forward step length;

r_t = √(((ρ_t − 4)·(ρ_t − 2)·ρ_∞) / ((ρ_∞ − 4)·(ρ_∞ − 2)·ρ_t))

is the correction term of the new second moving average v_t; and

m̂_t = m_t / (1 − β_1^t)

is the correction term of the new first moving average m_t. ρ_t is the current value of the moment moving average, and ρ_∞ is the maximum value of the moment moving average.
Here m_t determines the descent direction of the gradient during model training, while v_t and α_t jointly determine the descent magnitude. Computing the correction term m̂_t = m_t / (1 − β_1^t) from m_t before calculating the new parameters keeps the calculation error relatively small throughout training. That is, in the early stage of training, dividing by 1 − β_1^t enlarges the original m_t; as t becomes large, β_1^t approaches 0 and 1 − β_1^t approaches 1, so in the later stage m̂_t approaches the original m_t. Accordingly, when ρ_t > 4, the learning rate increases gradually and steadily, which helps mitigate premature overfitting of the model in the initial training stage and keeps the distribution stable.
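To make the ρ_t > 4 branch concrete, the following is a minimal NumPy sketch of one update step assembled from formulas (2)–(5) above. It assumes scalar hyperparameters β_1, β_2, α_t and takes the gradient g_t as given; deriving ρ_∞ from β_2 as 2/(1 − β_2) − 1 follows the RAdam convention and is an assumption here, as is the small eps added for numerical stability. All names are illustrative:

```python
import numpy as np

def radabelief_step(theta_prev, g_t, m_prev, v_prev, t,
                    beta1=0.9, beta2=0.999, alpha_t=1e-3, eps=1e-8):
    """One rectified update for the rho_t > 4 (warmup) branch, steps (2)-(5)."""
    # (2) new first moving average of the gradient
    m_t = beta1 * m_prev + (1 - beta1) * g_t
    # (3) new second moving average of the deviation (g_t - m_t), AdaBelief-style
    v_t = beta2 * v_prev + (1 - beta2) * (g_t - m_t) ** 2
    # moment moving average: assumed maximum rho_inf, then current value rho_t
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4:
        raise ValueError("this branch applies only when rho_t > 4")
    # (4) learning rate from the new second moving average
    l_t = np.sqrt((1 - beta2 ** t) / (v_t + eps))
    # correction term of v_t (rectification) and correction term of m_t
    r_t = np.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf) /
                  ((rho_inf - 4) * (rho_inf - 2) * rho_t))
    m_hat = m_t / (1 - beta1 ** t)
    # (5) new parameters
    theta_t = theta_prev - alpha_t * r_t * m_hat * l_t
    return theta_t, m_t, v_t
```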
In a specific implementation, using stochastic gradient descent with the momentum algorithm to adjust the learning rate includes: calculating the update gradient at the current training moment based on the training data, the training results, and the model parameters output at the previous training moment; and calculating the learning rate at the current training moment based on a preset iteration parameter, the preset forward step length, the target moving average of the previous training moment, and the update gradient. Correspondingly, calculating the new parameters based on the adjusted learning rate includes: calculating the new parameters based on the learning rate at the current training moment and the model parameters output at the previous training moment.
When the current value of the moment moving average is not greater than the preset threshold, the learning rate is adjusted with stochastic gradient descent and the momentum algorithm, and the new parameters are then calculated as follows:
(1) Assuming the current training moment is t, the update gradient g_t at the current training moment is computed as:

g_t = ∇_θ f_t(θ_{t-1}; X)

where g_t is the update gradient at the current training moment, θ_{t-1} denotes the model parameters output at the previous training moment, ∇_θ denotes differentiation with respect to θ, X is the training data, and f_t(θ_{t-1}; X) denotes the training result for the training data.
(2) When ρ_t ≤ 4 at the current training moment, the learning rate l_t is computed as:

l_t = ε·l_{t-1} + α_t·g_t

Here ρ_t ≤ 4 means the current value of the moment moving average is not greater than the preset threshold, i.e., the preset threshold is 4; l_{t-1} is the learning rate output at the previous training moment, α_t is the forward step length, g_t is the update gradient at the current training moment, and ε is the preset iteration parameter.
(3) The new parameters θ_t at the current training moment are computed as:

θ_t = θ_{t-1} − l_t  (when ρ_t ≤ 4)

Here ρ_t ≤ 4 means the current value of the moment moving average is not greater than the preset threshold, i.e., the preset threshold is 4, and θ_{t-1} denotes the model parameters output at the previous training moment.
When ρ_t ≤ 4, using the stochastic gradient descent plus momentum (SGD+Momentum) algorithm effectively avoids a negative learning rate and keeps the learning rate fluctuating within a relatively stable range in the early stage.
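A matching sketch of the ρ_t ≤ 4 branch, under the same assumptions as the sketch above; reading l_t as a momentum-style accumulator ε·l_{t-1} + α_t·g_t is our interpretation of the symbol definitions, not a formula quoted verbatim:

```python
def sgd_momentum_step(theta_prev, g_t, l_prev, alpha_t=1e-3, eps=0.9):
    """One update for the rho_t <= 4 branch: momentum accumulator, then step."""
    # (2) learning-rate (velocity) update from the previous value and gradient
    l_t = eps * l_prev + alpha_t * g_t
    # (3) new parameters
    theta_t = theta_prev - l_t
    return theta_t, l_t
```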
The following embodiment builds a hardware interconnection system based on the CXL protocol for model training. It effectively addresses data transmission latency and bandwidth problems, and supports mainstream communication topologies such as Parameter Server and Ring-Allreduce, as sketched below.
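As a rough illustration of the Ring-Allreduce topology named above (the concrete transport and scheduling over CXL are not specified here), the following sketch simulates its reduce-scatter and all-gather phases among N workers, with in-memory NumPy arrays standing in for device buffers:

```python
import numpy as np

def ring_allreduce(vectors):
    """Sum identical-length vectors across N workers via Ring-Allreduce."""
    n = len(vectors)
    # each worker splits its vector into n chunks
    data = [list(np.array_split(v.astype(float), n)) for v in vectors]
    # reduce-scatter: after n-1 steps, worker i holds the full sum of one chunk
    for step in range(n - 1):
        sends = [(src, (src - step) % n, data[src][(src - step) % n].copy())
                 for src in range(n)]
        for src, c, payload in sends:
            data[(src + 1) % n][c] += payload
    # all-gather: circulate the fully reduced chunks around the ring
    for step in range(n - 1):
        sends = [(src, (src + 1 - step) % n, data[src][(src + 1 - step) % n].copy())
                 for src in range(n)]
        for src, c, payload in sends:
            data[(src + 1) % n][c] = payload
    return [np.concatenate(chunks) for chunks in data]

# every worker ends up with the element-wise sum of all workers' vectors
out = ring_allreduce([np.arange(6), np.ones(6), 2 * np.ones(6)])
assert all(np.allclose(o, out[0]) for o in out)
```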
Specifically, the hardware interconnection system provided by this embodiment includes computing devices such as CPUs, GPUs, FPGAs, and ASICs. Memory sharing among these heterogeneous computing devices is realized through the CXL protocol, which breaks the communication latency barrier between heterogeneous devices and greatly increases the speed of data interaction. See Figure 2 for the overall architecture of the system.
As shown in Figure 2, the top-level deep learning framework is implemented in Python, and the target operators are implemented with OneAPI programming. The target operators can be called by the top-level deep learning framework and run on the different underlying computing devices. The underlying computing devices (CPU, GPU, FPGA, ASIC) are interconnected through the CXL protocol, and each computing device is also connected to the host device through the CXL protocol. The target operators implement the model to be trained, the Rectified-Adabelief optimization algorithm, and its related parameters.
Specifically, see Figure 3 for a schematic of the topological connections between the devices. In Figure 3, each computing device (CPU, GPU, FPGA, ASIC, etc.) is connected to the host devices through an adapter device. With the connection structure shown in Figure 3, each computing device can be shared among different host devices; that is, different hosts share all computing devices. Every connection line shown in Figure 3 uses the CXL protocol to realize interconnected sharing of IO, cache, and memory.
Taking the memory sharing of the computing devices as an example, Figure 4 shows a schematic diagram of this sharing: when any host or computing device accesses the memory of a given computing device, it does so as if accessing its own memory.
It can be seen that this embodiment uses the Adabelief optimization algorithm to solve the problem of excessive learning rate variance caused by insufficient data in the early stage of training, achieving faster convergence on various deep learning tasks and avoiding premature convergence to local optima. Meanwhile, a heterogeneous computing system implementing the distributed Rectified-Adabelief optimization algorithm is built on the CXL communication protocol, and the algorithm is implemented with the OneAPI programming model so that it can run on a variety of heterogeneous computing devices. This achieves memory consistency between heterogeneous computing devices, greatly increases data transmission bandwidth, and reduces data interaction latency between computing devices.
A data processing system provided by an embodiment of the present application is introduced below. The data processing system described below and the data processing method described above may be referred to in conjunction with each other.
An embodiment of the present application discloses a data processing system, including: a host, and a hardware computing platform connected to the host through the CXL protocol.
The host is configured to provide the training data used to train the target model, and to share, based on the CXL protocol, the new model trained by the hardware computing platform.
The hardware computing platform is configured to share the training data in the host based on the CXL protocol; call the target model to process the training data to obtain training results, and calculate new parameters of the target model based on the training results; update the target model with the new parameters to obtain a new model; and retain the new model if it meets the convergence condition. Calculating the new parameters includes: determining the current value of the moment moving average, adjusting the learning rate based on that current value, and calculating the new parameters based on the adjusted learning rate.
In a specific implementation, the hardware computing platform is specifically configured to:
determine the current value of the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average; and
adjust the learning rate with the warmup strategy if the current value of the moment moving average is greater than the preset threshold, or with stochastic gradient descent and the momentum algorithm otherwise.
In a specific implementation, the hardware computing platform is specifically configured to calculate the current value of the moment moving average according to a first formula:

ρ_t = ρ_∞ − 2t·β_2^t / (1 − β_2^t)

where ρ_t is the current value of the moment moving average, ρ_∞ is the maximum value of the moment moving average, t denotes the current training moment, and β_2 is the target attenuation coefficient.
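For reference, a small helper evaluating this first formula; deriving ρ_∞ from β_2 as 2/(1 − β_2) − 1 follows the RAdam convention and is our assumption rather than a value quoted from this passage:

```python
def moment_moving_average(t, beta2=0.999):
    """Current value rho_t of the moment moving average (first formula)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0          # assumed maximum value
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    return rho_t

# branch selection: warmup when rho_t > 4, SGD+momentum otherwise
use_warmup = moment_moving_average(t=10) > 4
```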
In a specific implementation, the hardware computing platform is specifically configured to:
calculate the update gradient at the current training moment based on the training data, the training results, and the model parameters output at the previous training moment;
calculate a new first moving average based on the preset object attenuation coefficient, the update gradient, and the first moving average of the previous training moment;
calculate a new second moving average based on the update gradient, the target attenuation coefficient, the new first moving average, and the second moving average of the previous training moment; and
calculate the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient.
Correspondingly, the hardware computing platform is specifically configured to:
calculate the new parameters based on the learning rate at the current training moment, the model parameters output at the previous training moment, the preset forward step length, the correction term of the new second moving average, and the correction term of the new first moving average.
In a specific implementation, the hardware computing platform is specifically configured to:
calculate the update gradient at the current training moment based on the training data, the training results, and the model parameters output at the previous training moment; and
calculate the learning rate at the current training moment based on the preset iteration parameter, the preset forward step length, the target moving average of the previous training moment, and the update gradient.
Correspondingly, the hardware computing platform is specifically configured to:
calculate the new parameters based on the learning rate at the current training moment and the model parameters output at the previous training moment.
In a specific implementation, the hardware computing platform includes multiple computing modules, and the computing modules share memory with one another based on the CXL protocol, which supports the distributed training step sketched below.
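Tying the pieces together, a hedged sketch of one synchronous data-parallel step built from the helpers defined in the earlier sketches (ring_allreduce, moment_moving_average, radabelief_step, sgd_momentum_step); the synchronous averaging layout is our illustrative assumption, not a scheme stated here:

```python
def distributed_step(theta, local_grads, m, v, l, t):
    """One synchronous data-parallel step across len(local_grads) workers."""
    n = len(local_grads)
    summed = ring_allreduce(local_grads)   # every worker obtains the sum
    g_t = summed[0] / n                    # averaged gradient, identical on all workers
    if moment_moving_average(t) > 4:       # warmup branch
        theta, m, v = radabelief_step(theta, g_t, m, v, t)
    else:                                  # SGD+momentum branch
        theta, l = sgd_momentum_step(theta, g_t, l)
    return theta, m, v, l
```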
In a specific implementation, each computing module includes any one or a combination of: a CPU, a GPU, an FPGA, and an ASIC.
For more specific working processes of the modules and units in this embodiment, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be repeated here.
It can be seen that this embodiment provides a data processing system that reduces the data transmission overhead between the host and the hardware module and improves model training efficiency.
An electronic device provided by an embodiment of the present application is introduced below. The electronic device described below and the data processing method and system described above may be referred to in conjunction with each other.
Referring to Figure 5, an embodiment of the present application discloses an electronic device, including:
one or more memories 501 for storing computer-readable instructions; and
one or more processors 502 for executing the computer-readable instructions to implement the method disclosed in any of the foregoing embodiments.
A non-volatile computer-readable storage medium provided by an embodiment of the present application is introduced below. The non-volatile computer-readable storage medium described below and the data processing method, system, and device described above may be referred to in conjunction with each other.
A non-volatile computer-readable storage medium stores a computer program, wherein the computer-readable instructions, when executed by a processor, implement the data processing method disclosed in the foregoing embodiments. For the specific steps of the method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which will not be repeated here.
The terms "first", "second", "third", "fourth", and the like (if present) in this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. Furthermore, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such a process, method, or device.
It should be noted that descriptions involving "first", "second", and the like in this application are for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of technical features so indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, but only on the basis that a person of ordinary skill in the art can realize the combination; when a combination of technical solutions is contradictory or cannot be realized, such a combination shall be deemed not to exist and falls outside the scope of protection claimed by this application.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to in conjunction with one another.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of non-volatile computer-readable storage medium known in the art.
Specific examples are used herein to illustrate the principles and implementations of the present application. The description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, for a person of ordinary skill in the art, there will be changes in the specific implementation and scope of application based on the idea of the present application. In summary, the content of this specification should not be understood as limiting the present application.

Claims (20)

  1. A data processing method, characterized in that it is applied to a hardware computing platform connected to a host through the CXL protocol, and comprises:
    sharing, based on the CXL protocol, training data in the host that is used for training a target model;
    calling the target model to process the training data to obtain training results, and calculating new parameters of the target model based on the training results, wherein calculating the new parameters comprises: determining a current value of a moment moving average, adjusting a learning rate based on the current value of the moment moving average, and calculating the new parameters based on the adjusted learning rate;
    updating the target model with the new parameters to obtain a new model; and
    in response to the new model meeting a convergence condition, retaining the new model and causing the host to share the new model based on the CXL protocol.
  2. The method according to claim 1, characterized in that determining the current value of the moment moving average and adjusting the learning rate based on the current value of the moment moving average comprises:
    determining the current value of the moment moving average based on a preset target attenuation coefficient and a maximum value of the moment moving average; and
    in response to the current value of the moment moving average being greater than a preset threshold, adjusting the learning rate with a warmup strategy; or, in response to the current value of the moment moving average being less than or equal to the preset threshold, adjusting the learning rate with stochastic gradient descent and a momentum algorithm.
  3. The method according to claim 2, characterized in that determining the current value of the moment moving average based on the preset target attenuation coefficient and the maximum value of the moment moving average comprises:
    calculating the current value of the moment moving average according to a first formula, the first formula being:
    ρ_t = ρ_∞ − 2t·β_2^t / (1 − β_2^t)
    where ρ_t is the current value of the moment moving average, ρ_∞ is the maximum value of the moment moving average, t denotes the current training moment, and β_2 is the target attenuation coefficient.
  4. The method according to claim 2, characterized in that adjusting the learning rate with the warmup strategy comprises:
    calculating an update gradient at the current training moment based on the training data, the training results, and model parameters output at the previous training moment;
    calculating a new first moving average based on a preset object attenuation coefficient, the update gradient, and a first moving average of the previous training moment;
    calculating a new second moving average based on the update gradient, the target attenuation coefficient, the new first moving average, and a second moving average of the previous training moment; and
    calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient;
    correspondingly, calculating the new parameters based on the adjusted learning rate comprises:
    calculating the new parameters based on the learning rate at the current training moment, the model parameters output at the previous training moment, a preset forward step length, a correction term of the new second moving average, and a correction term of the new first moving average.
  5. The method according to claim 2, characterized in that adjusting the learning rate with stochastic gradient descent and the momentum algorithm comprises:
    calculating an update gradient at the current training moment based on the training data, the training results, and model parameters output at the previous training moment; and
    calculating the learning rate at the current training moment based on a preset iteration parameter, a preset forward step length, a target moving average of the previous training moment, and the update gradient;
    correspondingly, calculating the new parameters based on the adjusted learning rate comprises:
    calculating the new parameters based on the learning rate at the current training moment and the model parameters output at the previous training moment.
  6. The method according to any one of claims 1 to 5, characterized in that the hardware computing platform comprises multiple computing modules, and the computing modules share memory with one another based on the CXL protocol.
  7. The method according to claim 4, characterized in that calculating the update gradient at the current training moment based on the training data, the training results, and the model parameters output at the previous training moment comprises:
    computing the update gradient g_t at the current training moment as:
    g_t = ∇_θ f_t(θ_{t-1}; X)
    where g_t is the update gradient at the current training moment, θ_{t-1} denotes the model parameters output at the previous training moment, ∇_θ denotes differentiation with respect to θ, X is the training data, and f_t(θ_{t-1}; X) denotes the training result for the training data.
  8. The method according to claim 4, characterized in that calculating the new first moving average based on the preset object attenuation coefficient, the update gradient, and the first moving average of the previous training moment comprises:
    computing the new first moving average m_t as:
    m_t = β_1·m_{t-1} + (1 − β_1)·g_t
    where m_t is the new first moving average, β_1 is the object attenuation coefficient, m_{t-1} is the first moving average of the previous training moment, and g_t is the update gradient at the current training moment.
  9. The method according to claim 4, characterized in that calculating the new second moving average based on the update gradient, the target attenuation coefficient, the new first moving average, and the second moving average of the previous training moment comprises:
    computing the new second moving average v_t as:
    v_t = β_2·v_{t-1} + (1 − β_2)·(g_t − m_t)²
    where v_t is the new second moving average, β_2 is the target attenuation coefficient, m_t is the new first moving average, v_{t-1} is the second moving average of the previous training moment, and g_t is the update gradient at the current training moment.
  10. The method according to claim 4, characterized in that calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient comprises:
    in response to the training moment satisfying ρ_t > 4, computing the learning rate l_t as:
    l_t = √((1 − β_2^t) / v_t)
    where ρ_t > 4 indicates that the current value of the moment moving average is greater than the preset threshold, the preset threshold being 4; v_t is the new second moving average, and β_2 is the target attenuation coefficient.
  11. The method according to claim 4, characterized in that calculating the learning rate at the current training moment based on the new second moving average and the target attenuation coefficient further comprises:
    in response to the training moment satisfying ρ_t ≤ 4, computing the learning rate l_t as:
    l_t = ε·l_{t-1} + α_t·g_t
    where ρ_t ≤ 4 indicates that the current value of the moment moving average is not greater than the preset threshold, the preset threshold being 4; l_{t-1} is the learning rate output at the previous training moment, α_t is the forward step length, g_t is the update gradient at the current training moment, and ε is the preset iteration parameter.
  12. The method according to claim 4, characterized in that calculating the new parameters based on the adjusted learning rate comprises:
    computing the new parameters θ_t as:
    θ_t = θ_{t-1} − α_t·r_t·m̂_t·l_t  (when ρ_t > 4)
    where ρ_t > 4 indicates that the current value of the moment moving average is greater than the preset threshold, the preset threshold being 4; α_t is the forward step length;
    r_t = √(((ρ_t − 4)·(ρ_t − 2)·ρ_∞) / ((ρ_∞ − 4)·(ρ_∞ − 2)·ρ_t))
    is the correction term of the new second moving average v_t;
    m̂_t = m_t / (1 − β_1^t)
    is the correction term of the new first moving average m_t; ρ_t is the current value of the moment moving average, and ρ_∞ is the maximum value of the moment moving average.
  13. The method according to claim 4, characterized in that calculating the new parameters based on the adjusted learning rate further comprises:
    computing the new parameters θ_t as:
    θ_t = θ_{t-1} − l_t  (when ρ_t ≤ 4)
    where ρ_t ≤ 4 indicates that the current value of the moment moving average is not greater than the preset threshold, the preset threshold being 4, and θ_{t-1} denotes the model parameters output at the previous training moment.
  14. The method according to claim 12, characterized in that the new first moving average m_t determines the descent direction of the gradient during model training, and v_t and α_t jointly determine the descent magnitude of the gradient during model training.
  15. The method according to claim 14, characterized in that the correction term m̂_t = m_t / (1 − β_1^t), obtained from the new first moving average m_t, is used to calculate the new parameters so as to reduce calculation errors.
  16. The method according to claim 15, characterized in that, in the early stage of model training, the original new first moving average m_t is enlarged through division by 1 − β_1^t; and
    in response to t becoming large, β_1^t approaches 0 and 1 − β_1^t approaches 1, so that the value of m̂_t = m_t / (1 − β_1^t) finally approaches the original new first moving average m_t.
  17. The method according to claim 6, characterized in that each computing module comprises any one or a combination of: a CPU, a GPU, an FPGA, and an ASIC.
  18. A data processing system, characterized by comprising: a host, and a hardware computing platform connected to the host through the CXL protocol;
    the host being configured to provide training data for training a target model, and to share, based on the CXL protocol, a new model trained by the hardware computing platform; and
    the hardware computing platform being configured to share the training data in the host based on the CXL protocol; call the target model to process the training data to obtain training results, and calculate new parameters of the target model based on the training results; update the target model with the new parameters to obtain the new model; and retain the new model if it meets a convergence condition; wherein calculating the new parameters comprises: determining a current value of a moment moving average, adjusting a learning rate based on the current value of the moment moving average, and calculating the new parameters based on the adjusted learning rate.
  19. An electronic device, characterized by comprising:
    one or more memories for storing computer-readable instructions; and
    one or more processors for executing the computer-readable instructions to implement the method according to any one of claims 1 to 17.
  20. One or more non-volatile computer-readable storage media storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1 to 17.
PCT/CN2022/118104 2022-04-14 2022-09-09 Data processing method and system, and device and readable storage medium WO2023197520A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210387060.8A CN114461568B (en) 2022-04-14 2022-04-14 Data processing method, system, equipment and readable storage medium
CN202210387060.8 2022-04-14

Publications (1)

Publication Number Publication Date
WO2023197520A1 true WO2023197520A1 (en) 2023-10-19

Family

ID=81418423

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/118104 WO2023197520A1 (en) 2022-04-14 2022-09-09 Data processing method and system, and device and readable storage medium

Country Status (2)

Country Link
CN (1) CN114461568B (en)
WO (1) WO2023197520A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112466A (en) * 2023-10-25 2023-11-24 浪潮(北京)电子信息产业有限公司 Data processing method, device, equipment, storage medium and distributed cluster
CN117785489A (en) * 2024-02-27 2024-03-29 苏州元脑智能科技有限公司 Server, task execution method and device and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114461568B (en) * 2022-04-14 2022-07-08 苏州浪潮智能科技有限公司 Data processing method, system, equipment and readable storage medium
CN114925829A (en) * 2022-07-18 2022-08-19 山东海量信息技术研究院 Neural network training method and device, electronic equipment and storage medium
CN115310566A (en) * 2022-10-12 2022-11-08 浪潮电子信息产业股份有限公司 Distributed training system, method, device, equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312415A (en) * 2020-02-27 2021-08-27 Sap欧洲公司 Near memory acceleration for database operations
US20210390414A1 (en) * 2020-06-10 2021-12-16 Nvidia Corporation Accelerated training for neural network models
CN114169534A (en) * 2021-12-09 2022-03-11 京东科技信息技术有限公司 Training method, device, equipment and medium for distributed machine learning model
CN114257386A (en) * 2020-09-10 2022-03-29 华为技术有限公司 Training method, system, equipment and storage medium for detection model
CN114461568A (en) * 2022-04-14 2022-05-10 苏州浪潮智能科技有限公司 Data processing method, system, equipment and readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991095B (en) * 2016-01-21 2021-09-28 阿里巴巴集团控股有限公司 Machine exception handling method, learning rate adjusting method and device
CN110033081A (en) * 2019-03-08 2019-07-19 华为技术有限公司 A kind of method and apparatus of determining learning rate
US20210142177A1 (en) * 2019-11-13 2021-05-13 Nvidia Corporation Synthesizing data for training one or more neural networks
CN113723692A (en) * 2021-09-02 2021-11-30 深圳前海微众银行股份有限公司 Data processing method, apparatus, device, medium, and program product

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312415A (en) * 2020-02-27 2021-08-27 Sap欧洲公司 Near memory acceleration for database operations
US20210390414A1 (en) * 2020-06-10 2021-12-16 Nvidia Corporation Accelerated training for neural network models
CN114257386A (en) * 2020-09-10 2022-03-29 华为技术有限公司 Training method, system, equipment and storage medium for detection model
CN114169534A (en) * 2021-12-09 2022-03-11 京东科技信息技术有限公司 Training method, device, equipment and medium for distributed machine learning model
CN114461568A (en) * 2022-04-14 2022-05-10 苏州浪潮智能科技有限公司 Data processing method, system, equipment and readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112466A (en) * 2023-10-25 2023-11-24 浪潮(北京)电子信息产业有限公司 Data processing method, device, equipment, storage medium and distributed cluster
CN117112466B (en) * 2023-10-25 2024-02-09 浪潮(北京)电子信息产业有限公司 Data processing method, device, equipment, storage medium and distributed cluster
CN117785489A (en) * 2024-02-27 2024-03-29 苏州元脑智能科技有限公司 Server, task execution method and device and storage medium
CN117785489B (en) * 2024-02-27 2024-05-10 苏州元脑智能科技有限公司 Server, task execution method and device and storage medium

Also Published As

Publication number Publication date
CN114461568B (en) 2022-07-08
CN114461568A (en) 2022-05-10

Similar Documents

Publication Publication Date Title
WO2023197520A1 (en) Data processing method and system, and device and readable storage medium
CN105827537B (en) A kind of congestion improved method based on QUIC agreement
US10679145B2 (en) System and method for balancing computation with communication in parallel learning
US7353339B2 (en) Adaptive caching
US9237107B2 (en) Fair quantized congestion notification (FQCN) to mitigate transport control protocol (TCP) throughput collapse in data center networks
JP5010739B2 (en) Method and system for aggregate bandwidth control
JP4791322B2 (en) Method and apparatus for adaptive bandwidth control with bandwidth guarantee
US9397938B2 (en) Packet scheduling in a network processor
WO2015096692A1 (en) Method and system for controlling data reception traffic and computer storage medium
WO2015130404A1 (en) Packet shaping in a network processor
US20150249603A1 (en) Packet output processing
US11663129B2 (en) Using a machine learning module to dynamically determine tracks to prestage from storage to cache
JP2018110387A (en) Method and system for bandwidth measurement and adaptive data transmission based on buffer in real time live environment
CN109818863A (en) Link priority setting method and device
US8402180B2 (en) Autonomous multi-packet transfer for universal serial bus
WO2021238274A1 (en) Gradient information updating method for distributed deep learning, and related apparatus
CN106648456A (en) Dynamic save file access method based on use page view and prediction mechanism
JP4616391B2 (en) System and method for dynamic data prefetching
JP4782082B2 (en) Packet processing apparatus, method, and program
CN112383485A (en) Network congestion control method and device
WO2017000684A1 (en) Data reading method, peer device, controller, and storage medium
WO2022252546A1 (en) Information adjusting method and device, and storage medium
CN113064907B (en) Content updating method based on deep reinforcement learning
WO2024098953A1 (en) Lane line splicing method and apparatus, and electronic device and storage medium
US10061726B2 (en) Precision time management (PTM) for USB retimers that accurately adjusts timestamp fields of isochronous timestamp packets (ITPS)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22937158

Country of ref document: EP

Kind code of ref document: A1