WO2017185411A1 - Apparatus and method for executing adagrad gradient descent training algorithm - Google Patents

Apparatus and method for executing AdaGrad gradient descent training algorithm

Info

Publication number
WO2017185411A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
instruction
vector
updated
gradient
Application number
PCT/CN2016/081836
Other languages
French (fr)
Chinese (zh)
Inventor
郭崎
刘少礼
陈天石
陈云霁
Original Assignee
北京中科寒武纪科技有限公司
Application filed by 北京中科寒武纪科技有限公司
Publication of WO2017185411A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations

Definitions

  • the present invention relates to the field of AdaGrad algorithm application technology, and more particularly to an apparatus and method for performing an AdaGrad gradient descent training algorithm.
  • the gradient descent optimization algorithm is widely used in the fields of function approximation, optimization calculation, pattern recognition and image processing.
  • the AdaGrad algorithm is widely used because it is easy to implement, requires little computation and storage space, and can adaptively assign a learning rate to each parameter. Implementing the AdaGrad algorithm with a dedicated device can significantly increase its execution speed.
  • a known method of performing the AdaGrad gradient descent algorithm is to use a general purpose processor.
  • the method supports the above algorithm by executing general-purpose instructions using a general-purpose register file and general-purpose functional units.
  • One of the disadvantages of this approach is that the performance of a single general purpose processor is low.
  • when multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck.
  • the general-purpose processor needs to decode the operations of the AdaGrad gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the front-end decoding of the processor brings a large power consumption overhead.
  • Another known method of performing the AdaGrad gradient descent algorithm is to use a graphics processing unit (GPU).
  • the method supports the above algorithm by executing generic SIMD instructions using a general-purpose register file and generic stream processing units. Since the GPU is a device dedicated to graphics operations and scientific computing, with no special support for the operations of the AdaGrad gradient descent algorithm, a large amount of front-end decoding work is still required to perform those operations, which brings a large amount of extra overhead.
  • the GPU has only a small on-chip buffer, and the data required for the operation (such as historical gradient values) needs to be repeatedly transferred from off-chip. The off-chip bandwidth becomes the main performance bottleneck, and brings huge power consumption overhead.
  • the present invention provides an apparatus for performing an AdaGrad gradient descent algorithm, comprising:
  • a controller unit, configured to decode a read instruction into microinstructions that control the corresponding modules, and to send them to the corresponding modules;
  • a data buffer unit, configured to store intermediate variables of the operation process, and to perform initialization and update operations on the intermediate variables;
  • a data processing module, configured to perform arithmetic operations under the control of the controller unit, including vector addition, vector multiplication, vector division, vector square root and basic operations, and to store the intermediate variables in the data buffer unit.
  • the data processing module includes an operation control sub-module, a parallel vector addition operation unit, a parallel vector multiplication operation unit, a parallel vector division operation unit, a parallel vector square root operation unit, and a basic operation sub-module.
  • when the device is initialized, the data buffer unit initializes the sum of squared historical gradient values ∑_{i=1}^{t}(ΔL(W_i))², setting its value to 0, and allocates two spaces to store the constants α and ε, which are kept until the entire gradient descent algorithm has finished executing.
  • during each data update, the data buffer unit reads the sum of squared historical gradient values ∑(ΔL(W_i))² into the data processing module, where its value is updated, that is, the square of the current gradient value is added to it, and the result is then written back to the data buffer unit;
  • the data processing module reads the sum of squared historical gradient values ∑(ΔL(W_i))² and the constants α, ε from the data buffer unit, updates the sum and sends its value back to the data buffer unit, uses the sum and the constants α, ε to calculate the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
  • the present invention also provides a method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
  • Step (1) initializing the data buffer unit, including setting initial values for the constants α, ε and zeroing the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))²;
  • Step (2) reading the parameter vector to be updated and the corresponding gradient vector from the external space;
  • Step (3) the data processing module reads, calculates and updates the sum of squared historical gradients ∑(ΔL(W_i))² in the data buffer unit, and calculates the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)) from the constants α, ε and the sum of squared historical gradients in the data buffer unit;
  • Step (4) the data processing module completes the update operation on the vector to be updated using the adaptive learning rate and the current gradient value, the update being calculated as W_{t+1} = W_t − (α/(ε + √(∑_{i=1}^{t}(ΔL(W_i))²)))·ΔL(W_t) (a small worked example is sketched after this list), where
  • W_t represents the current, i.e. the t-th, parameter to be updated
  • ΔL(W_t) represents the gradient value of the current parameter to be updated
  • W_{t+1} represents the updated parameter, i.e. the parameter to be updated in the next, (t+1)-th, iteration.
  • step (5) the data processing unit determines whether the updated parameter vector converges. If it converges, the operation ends. Otherwise, the process proceeds to step (2) to continue execution.
  • the present invention also provides a method for performing an AdaGrad gradient descent algorithm, comprising the steps of:
  • Step S1 pre-storing an IO instruction at the first address of the instruction cache unit
  • Step S2 the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the decoded microinstruction, the data access unit reads all instructions related to the AdaGrad gradient descent calculation from the external address space and caches them in the instruction cache unit;
  • Step S3 the controller unit reads an assignment instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))² in the data buffer unit is set to zero and the constants α, ε are initialized;
  • the constant α is the adaptive learning rate gain coefficient, used to adjust the range of the adaptive learning rate
  • the constant ε is a constant, used to guarantee that the denominator in the adaptive learning rate calculation is non-zero
  • t is the current number of iterations
  • W_i is the parameter to be updated when the i-th iteration is operated
  • ΔL(W_i) is the gradient value of the parameter to be updated when the i-th iteration is operated
  • ∑ represents the summation operation, whose range runs from i=1 to i=t, i.e. the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))² from the initial iteration to the current one are summed
  • Step S4 the controller unit reads an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the data access unit reads the parameter vector W_t to be updated and the corresponding gradient vector ΔL(W_t) from the external space into the data buffer unit;
  • Step S5 the controller unit reads a data transfer instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients ∑(ΔL(W_i))² and the constants α, ε in the data buffer unit are transmitted to the data processing unit;
  • Step S6 the controller unit reads a vector instruction from the instruction cache unit, and according to the decoded microinstruction, performs the update operation on the sum of squared historical gradients ∑(ΔL(W_i))²; in this operation, the instruction is sent to the operation control sub-module, which sends corresponding instructions to perform the following operations: the vector multiplication parallel operation sub-module computes (ΔL(W_t))², and the vector addition parallel operation sub-module adds (ΔL(W_t))² to the sum of squared historical gradients ∑(ΔL(W_i))²;
  • Step S7 the controller unit reads an instruction from the instruction cache unit, and according to the decoded microinstruction, the updated sum of squared historical gradients ∑(ΔL(W_i))² is transferred from the data processing unit back to the data buffer unit;
  • Step S8 the controller unit reads an adaptive learning rate operation instruction from the instruction cache unit, and according to the decoded microinstruction, the operation control sub-module controls the relevant operation modules to perform the following operations: the vector square root parallel operation sub-module computes √(∑(ΔL(W_i))²), and the vector division parallel operation sub-module computes the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²));
  • Step S9 the controller unit reads a parameter vector update instruction from the instruction cache unit, and according to the decoded microinstruction, drives the operation control sub-module to perform the following operations: the vector multiplication parallel operation sub-module computes the update amount −(α/(ε + √(∑(ΔL(W_i))²)))·ΔL(W_t), and the vector addition parallel operation sub-module adds it to W_t to obtain the updated parameter vector W_{t+1};
  • Step S10 the controller unit reads an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the updated parameter vector W_{t+1} is transmitted from the data processing unit through the data access unit to a specified address in the external address space;
  • Step S11 the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the decoded microinstruction, the data processing unit determines whether the updated parameter vector converges; if it converges, the operation ends, otherwise the process returns to step S5 and continues.
  • the device and method of the present invention have the following beneficial effects: the AdaGrad gradient descent algorithm can be implemented with this device, and the efficiency of data processing is greatly improved; by using a device dedicated to executing the AdaGrad gradient descent algorithm, the problems of insufficient computational performance of general-purpose processors and large front-end decoding overhead are solved, the execution speed of related applications is accelerated, and the efficiency of data processing is greatly improved.
  • the use of the data buffer unit avoids repeatedly reading data from memory, reducing the memory access bandwidth.
  • FIG. 1 is a block diagram showing an example of an overall structure of an apparatus for implementing an AdaGrad gradient descent algorithm related application, in accordance with an embodiment of the present invention
  • FIG. 2 is a block diagram showing an example of a data processing module in an apparatus for implementing an AdaGrad gradient descent algorithm related application, in accordance with an embodiment of the present invention
  • FIG. 3 is a flow diagram of operations for implementing an AdaGrad gradient descent algorithm related application, in accordance with an embodiment of the present invention.
  • the invention discloses an apparatus for executing an AdaGrad gradient descent algorithm, comprising a data access unit, an instruction cache unit, a controller unit, a data buffer unit and a data processing module.
  • the data access unit can access the external address space, can read and write data to each cache unit in the device, and completes the loading and storing of data, which specifically includes reading instructions into the instruction cache unit, reading the parameters to be updated and the corresponding gradient values from the specified storage units into the data processing unit, and writing the updated parameter vector from the data processing module directly to the externally designated space;
  • the instruction cache unit reads the instruction through the data access unit, and caches the read instruction;
  • the controller unit reads instructions from the instruction cache unit, decodes the instructions into micro-instructions that control the behavior of other modules and transmits them to other modules such as the data access unit, the data buffer unit and the data processing module; the data buffer unit stores the intermediate variables needed during operation, and these variables are initialized and updated;
  • the data processing module performs corresponding operations based on the instructions, including vector addition, vector multiplication, vector division, vector square root, and basic operations.
  • An apparatus for implementing an AdaGrad gradient descent algorithm can be used to support applications using the AdaGrad gradient descent algorithm.
  • a space is allocated in the data buffer unit to store the sum of the squares of the historical gradient values.
  • each time gradient descent is performed, a learning rate is calculated from this sum of squares and used as the learning rate of the gradient descent, and then the update operation on the vector to be updated is performed. The gradient descent operation is repeated until the vector to be updated converges.
  • the invention also discloses a method for executing an AdaGrad gradient descent algorithm, and the specific implementation steps are as follows:
  • Step (1) completes the initialization operation of the data buffer unit through an instruction, including setting initial values for the constants α, ε and zeroing the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))².
  • the constant α is an adaptive learning rate gain coefficient, which is used to adjust the range of the adaptive learning rate.
  • the constant ε is a small constant, which is used to ensure that the denominator in the adaptive learning rate calculation is non-zero, and t is the current number of iterations.
  • W_i is the parameter to be updated when the i-th iteration is operated
  • ΔL(W_i) is the gradient value of the parameter to be updated when the i-th iteration is operated
  • ∑ represents the summation operation, whose range runs from i=1 to i=t, i.e. the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))² from the initial iteration to the current one are summed
  • Step (2) by the IO instruction, completes an operation of the data access unit reading the parameter vector to be updated and the corresponding gradient vector from the external space.
  • Step (3) The data processing module reads, calculates and updates the sum of squared historical gradients ∑(ΔL(W_i))² in the data buffer unit according to the corresponding instructions, and calculates the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)) from the constants α, ε and the sum of squared historical gradients in the data buffer unit.
  • Step (4) The data processing module completes the update operation on the vector to be updated using the adaptive learning rate and the current gradient value according to the corresponding instruction, the update being calculated as W_{t+1} = W_t − (α/(ε + √(∑_{i=1}^{t}(ΔL(W_i))²)))·ΔL(W_t), where
  • W_t represents the current (t-th) parameter to be updated
  • ΔL(W_t) represents the gradient value of the current parameter to be updated
  • W_{t+1} represents the updated parameter, i.e. the parameter to be updated in the next ((t+1)-th) iteration.
  • Step (5) The data processing unit determines whether the updated parameter vector converges. If it converges, the operation ends. Otherwise, the process proceeds to step (2) to continue execution.
  • the device includes a data access unit 1, an instruction cache unit 2, a controller unit 3, a data buffer unit 4 and a data processing module 5, all of which can be implemented in hardware, including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits and memristors.
  • the data access unit 1 can access the external address space, can read and write data to each cache unit inside the device, and completes data loading and storage; specifically, this includes reading instructions into the instruction cache unit 2, reading the parameters to be updated from the specified storage units into the data processing unit 5, reading gradient values from the externally designated space into the data buffer unit 4, and writing the updated parameter vector from the data processing module 5 directly to the externally designated space.
  • the instruction cache unit 2 reads the instruction through the data access unit 1 and caches the read instruction.
  • the controller unit 3 reads the instructions from the instruction cache unit 2, decodes the instructions into micro-instructions that control the behavior of other modules, and transmits them to other modules such as the data access unit 1, the data buffer unit 4, the data processing module 5, and the like.
  • the data buffer unit 4 initializes the sum of squared historical gradient values ∑_{i=1}^{t}(ΔL(W_i))² at the time of device initialization, setting its value to 0, and allocates two spaces to store the constants α, ε, which are kept until the entire gradient descent iteration process ends.
  • during each data update, the sum of squared historical gradient values ∑(ΔL(W_i))² is read into the data processing module 5, its value is updated in the data processing module 5, i.e. the square of the current gradient value is added, and it is then written back to the data buffer unit 4.
  • the data processing module 5 reads the sum of squared historical gradient values ∑(ΔL(W_i))² and the constants α, ε from the data buffer unit 4, updates the sum and sends its value back to the data buffer unit 4, uses the sum and the constants α, ε to calculate the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
  • the data processing module includes an operation control sub-module 51, a parallel vector addition unit 52, a parallel vector multiplication unit 53, a parallel vector division unit 54, a parallel vector square root operation unit 55, and a basic operation sub-module 56. Since the vector operations in the AdaGrad gradient descent algorithm are element-wise operations, when an operation is performed on a vector, the elements at different positions can perform operations in parallel.
  • Figure 3 shows the overall flowchart of the apparatus performing the related operations of the AdaGrad gradient descent algorithm.
  • step S1 an IO instruction is pre-stored at the first address of the instruction cache unit 2.
  • step S2 the operation starts: the control unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the decoded microinstruction, the data access unit 1 reads all instructions related to the AdaGrad gradient descent calculation from the external address space and buffers them into the instruction cache unit 2.
  • step S3 the controller unit 3 reads the assignment instruction from the instruction cache unit 2, and according to the decoded microinstruction, the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))² in the data buffer unit is set to zero and the constants α, ε are initialized.
  • the constant α is an adaptive learning rate gain coefficient, which is used to adjust the range of the adaptive learning rate.
  • the constant ε is a small constant, which is used to ensure that the denominator in the adaptive learning rate calculation is non-zero, and t is the current number of iterations.
  • W_i is the parameter to be updated when the i-th iteration is operated
  • ΔL(W_i) is the gradient value of the parameter to be updated when the i-th iteration is operated
  • ∑ represents the summation operation, whose range runs from i=1 to i=t, i.e. the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))² from the initial iteration to the current one are summed
  • step S4 the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the data access unit 1 reads the parameter vector W_t to be updated and the corresponding gradient vector ΔL(W_t) from the external space into the data buffer unit 4.
  • step S5 the controller unit 3 reads a data transfer instruction from the instruction cache unit 2, and according to the decoded microinstruction, the sum of squared historical gradients ∑(ΔL(W_i))² and the constants α, ε in the data buffer unit 4 are transmitted to the data processing unit.
  • step S6 the controller unit reads a vector instruction from the instruction cache unit 2, and according to the decoded microinstruction, performs the update operation on the sum of squared historical gradients ∑(ΔL(W_i))²; in this operation, the instruction is sent to the operation control sub-module 51, which sends corresponding instructions to perform the following operations: the vector multiplication parallel operation sub-module 53 computes (ΔL(W_t))², and the vector addition parallel operation sub-module adds (ΔL(W_t))² to the sum of squared historical gradients ∑(ΔL(W_i))².
  • step S7 the controller unit reads an instruction from the instruction cache unit 2, and according to the decoded microinstruction, the updated sum of squared historical gradients ∑(ΔL(W_i))² is transferred from the data processing unit 5 back to the data buffer unit 4.
  • step S8 the controller unit reads an adaptive learning rate operation instruction from the instruction cache unit 2, and according to the decoded micro-instruction, the operation control sub-module 51 controls the relevant operation modules to perform the following operations: the vector square root parallel operation sub-module 55 computes √(∑(ΔL(W_i))²), and the vector division parallel operation sub-module 54 computes the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)).
  • step S9 the controller unit reads a parameter vector update instruction from the instruction cache unit 2, and according to the decoded micro-instruction, drives the operation control sub-module 51 to perform the following operations: the vector multiplication parallel operation sub-module computes the update amount −(α/(ε + √(∑(ΔL(W_i))²)))·ΔL(W_t), and the vector addition parallel operation sub-module 52 adds it to W_t to obtain the updated parameter vector W_{t+1}.
  • step S10 the controller unit reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the updated parameter vector W_{t+1} is transmitted from the data processing unit 5 through the data access unit 1 to a specified address in the external address space.
  • step S11 the controller unit reads a convergence determination instruction from the instruction cache unit 2, and according to the decoded microinstruction, the data processing unit determines whether the updated parameter vector converges; if it converges, the operation ends, otherwise the process returns to step S5 and continues.
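A minimal worked sketch of a single update as described above, for illustration only. The numeric values are hypothetical, and the closed form of the learning rate, α/(ε + √(∑(ΔL(W_i))²)), is assumed from the definitions of α and ε given in this document rather than quoted from the patent's formula images:

```python
import numpy as np

# Hypothetical values for one AdaGrad step (illustration only).
alpha, eps = 0.01, 1e-8          # gain coefficient and denominator guard
W_t  = np.array([0.50, -0.30])   # parameter vector to be updated
grad = np.array([0.20,  0.10])   # current gradient ΔL(W_t)
r    = np.array([0.04,  0.09])   # accumulated sum of squared historical gradients

r += grad ** 2                   # add the square of the current gradient
lr = alpha / (eps + np.sqrt(r))  # per-element adaptive learning rate
W_next = W_t - lr * grad         # updated parameter vector W_{t+1}
print(W_next)                    # approximately [ 0.4929289, -0.3031623]
```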

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

An apparatus and method for executing an AdaGrad gradient descent training algorithm, the apparatus comprising a controller unit (3), a data cache unit (4), and a data processing module (5). The apparatus first reads a gradient vector and a vector to be updated, and uses the current gradient value to update the historical gradient value in the cache region; at each iteration, the current gradient value and the historical gradient value are used to calculate an update amount and to perform an update operation on the vector to be updated; training continues until the parameter vector to be updated converges. Employing a device dedicated to executing the AdaGrad gradient descent algorithm solves the problems of insufficient computing performance of general-purpose processors and high front-end decoding overhead, and accelerates the execution of related applications; in addition, the use of the data cache unit (4) avoids repeatedly reading data from memory, reducing the memory-access bandwidth.

Description

Apparatus and method for performing AdaGrad gradient descent training algorithm
Technical Field
The present invention relates to the field of AdaGrad algorithm application technology, and more particularly to an apparatus and method for performing an AdaGrad gradient descent training algorithm.
Background Art
The gradient descent optimization algorithm is widely used in the fields of function approximation, optimization calculation, pattern recognition and image processing. The AdaGrad algorithm is widely used because it is easy to implement, requires little computation and storage space, and can adaptively assign a learning rate to each parameter. Implementing the AdaGrad algorithm with a dedicated device can significantly increase its execution speed.
Currently, one known method of performing the AdaGrad gradient descent algorithm is to use a general-purpose processor, which supports the algorithm by executing general-purpose instructions using a general-purpose register file and general-purpose functional units. One disadvantage of this approach is that the computational performance of a single general-purpose processor is low, and when multiple general-purpose processors execute in parallel, communication between them becomes a performance bottleneck. In addition, the general-purpose processor needs to decode the operations of the AdaGrad gradient descent algorithm into a long sequence of arithmetic and memory-access instructions, and the front-end decoding of the processor brings a large power consumption overhead.
Another known method of performing the AdaGrad gradient descent algorithm is to use a graphics processing unit (GPU), which supports the algorithm by executing generic SIMD instructions using a general-purpose register file and generic stream processing units. Since the GPU is a device dedicated to graphics operations and scientific computing, with no special support for the operations of the AdaGrad gradient descent algorithm, a large amount of front-end decoding work is still required to perform those operations, which brings a large amount of extra overhead. In addition, the GPU has only a small on-chip cache, and the data required for the operation (such as historical gradient values) must be repeatedly transferred from off-chip, so the off-chip bandwidth becomes the main performance bottleneck while incurring a huge power consumption overhead.
Summary of the Invention
In view of the above, it is an object of the present invention to provide an apparatus and method for performing an AdaGrad gradient descent algorithm, so as to solve at least one of the above technical problems.
In order to achieve the above object, as one aspect of the present invention, the present invention provides an apparatus for performing an AdaGrad gradient descent algorithm, comprising:
a controller unit, configured to decode a read instruction into microinstructions that control the corresponding modules, and to send them to the corresponding modules;
a data buffer unit, configured to store intermediate variables of the operation process, and to perform initialization and update operations on the intermediate variables;
a data processing module, configured to perform arithmetic operations under the control of the controller unit, including vector addition, vector multiplication, vector division, vector square root and basic operations, and to store intermediate variables in the data buffer unit.
The data processing module includes an operation control sub-module, a parallel vector addition unit, a parallel vector multiplication unit, a parallel vector division unit, a parallel vector square root unit and a basic operation sub-module.
When the data processing module performs an operation on the same vector, elements at different positions can be operated on in parallel.
When the apparatus is initialized, the data buffer unit initializes the sum of squared historical gradient values ∑_{i=1}^{t}(ΔL(W_i))², setting its value to 0, and allocates two spaces to store the constants α and ε; these two constant spaces are kept until the entire gradient descent algorithm has finished executing.
During each data update, the data buffer unit reads the sum of squared historical gradient values ∑(ΔL(W_i))² into the data processing module, where its value is updated, that is, the square of the current gradient value is added to it, and the result is then written back to the data buffer unit.
The data processing module reads the sum of squared historical gradient values ∑(ΔL(W_i))² and the constants α, ε from the data buffer unit, updates the sum and sends its value back to the data buffer unit, uses the sum and the constants α, ε to calculate the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
As another aspect of the present invention, the present invention also provides a method for performing an AdaGrad gradient descent algorithm, comprising the following steps:
Step (1): initializing the data buffer unit, including setting initial values for the constants α and ε and zeroing the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))², where the constant α is an adaptive learning rate gain coefficient used to adjust the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the adaptive learning rate calculation is non-zero, t is the current number of iterations, W_i is the parameter to be updated in the i-th iteration, ΔL(W_i) is the gradient value of the parameter to be updated in the i-th iteration, and ∑ denotes the summation operation over the range i=1 to i=t, i.e. the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))² from the initial iteration to the current one are summed;
Step (2): reading the parameter vector to be updated and the corresponding gradient vector from the external space;
Step (3): the data processing module reads, calculates and updates the sum of squared historical gradients ∑(ΔL(W_i))² in the data buffer unit, and calculates the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)) from the constants α, ε and the sum of squared historical gradients in the data buffer unit;
Step (4): the data processing module uses the adaptive learning rate and the current gradient value to complete the update operation on the vector to be updated, where the update operation is calculated as follows:
W_{t+1} = W_t − (α/(ε + √(∑_{i=1}^{t}(ΔL(W_i))²)))·ΔL(W_t)
where W_t denotes the current, i.e. t-th, parameter to be updated, ΔL(W_t) denotes the gradient value of the current parameter to be updated, and W_{t+1} denotes the updated parameter, i.e. the parameter to be updated in the next, (t+1)-th, iteration;
Step (5): the data processing unit determines whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise the process returns to step (2) and continues.
Also provided is an apparatus for performing the AdaGrad gradient descent algorithm, in whose controller a program for performing the method described above is solidified.
As still another aspect of the present invention, the present invention also provides a method for performing an AdaGrad gradient descent algorithm, comprising the following steps:
Step S1: an IO instruction is pre-stored at the first address of the instruction cache unit;
Step S2: the controller unit reads the IO instruction from the first address of the instruction cache unit, and according to the decoded microinstruction, the data access unit reads all instructions related to the AdaGrad gradient descent calculation from the external address space and caches them in the instruction cache unit;
Step S3: the controller unit reads an assignment instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))² in the data buffer unit is set to zero and the constants α and ε are initialized; the constant α is an adaptive learning rate gain coefficient used to adjust the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the adaptive learning rate calculation is non-zero, t is the current number of iterations, W_i is the parameter to be updated in the i-th iteration, ΔL(W_i) is the gradient value of the parameter to be updated in the i-th iteration, and ∑ denotes the summation operation over the range i=1 to i=t, i.e. over the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))²;
Step S4: the controller unit reads an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the data access unit reads the parameter vector W_t to be updated and the corresponding gradient vector ΔL(W_t) from the external space into the data buffer unit;
Step S5: the controller unit reads a data transfer instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients ∑(ΔL(W_i))² and the constants α, ε in the data buffer unit are transferred to the data processing unit;
Step S6: the controller unit reads a vector instruction from the instruction cache unit, and according to the decoded microinstruction, the update operation on the sum of squared historical gradients ∑(ΔL(W_i))² is performed; in this operation, the instruction is sent to the operation control sub-module, which issues corresponding instructions to perform the following operations: the parallel vector multiplication sub-module computes (ΔL(W_t))², and the parallel vector addition sub-module adds (ΔL(W_t))² to the sum of squared historical gradients ∑(ΔL(W_i))²;
Step S7: the controller unit reads an instruction from the instruction cache unit, and according to the decoded microinstruction, the updated sum of squared historical gradients ∑(ΔL(W_i))² is transferred from the data processing unit back to the data buffer unit;
Step S8: the controller unit reads an adaptive learning rate operation instruction from the instruction cache unit, and according to the decoded microinstruction, the operation control sub-module controls the relevant operation modules to perform the following operations: the parallel vector square root sub-module computes √(∑(ΔL(W_i))²), and the parallel vector division sub-module computes the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²));
Step S9: the controller unit reads a parameter vector update instruction from the instruction cache unit and, according to the decoded microinstruction, drives the operation control sub-module to perform the following operations: the parallel vector multiplication sub-module computes the update amount −(α/(ε + √(∑(ΔL(W_i))²)))·ΔL(W_t), and the parallel vector addition sub-module adds it to W_t to obtain the updated parameter vector W_{t+1};
Step S10: the controller unit reads an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the updated parameter vector W_{t+1} is transferred from the data processing unit through the data access unit to a specified address in the external address space;
Step S11: the controller unit reads a convergence judgment instruction from the instruction cache unit, and according to the decoded microinstruction, the data processing unit determines whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise the process returns to step S5 and continues.
Also provided is an apparatus for performing the AdaGrad gradient descent algorithm, in whose controller a program for performing the method described above is solidified.
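For illustration, the program solidified in the controller to drive steps S1 to S11 could be sketched as the following host-side pseudo-driver. The instruction mnemonics and the `device` interface are hypothetical and are not defined by the patent; the sketch only conveys the ordering of the instruction stream:

```python
# Hypothetical instruction mnemonics and device interface -- illustration only.
def run_adagrad_device(device, max_iters=10_000):
    device.issue("IO")             # S1/S2: load all AdaGrad-related instructions into the instruction cache
    device.issue("ASSIGN")         # S3: zero the squared-gradient sum, initialise alpha and epsilon
    for _ in range(max_iters):
        device.issue("IO")         # S4: read W_t and the gradient vector into the data buffer unit
        device.issue("TRANSFER")   # S5: move the squared-gradient sum and constants to the data processing unit
        device.issue("VEC_ACC")    # S6: square the current gradient and accumulate it into the sum
        device.issue("WRITEBACK")  # S7: return the updated sum to the data buffer unit
        device.issue("ADA_LR")     # S8: square root and division yield the adaptive learning rate
        device.issue("VEC_UPD")    # S9: multiply and add to obtain W_{t+1}
        device.issue("IO")         # S10: store W_{t+1} to the external address space
        if device.issue("CONVERGED"):  # S11: convergence judgment instruction ends the loop
            break
```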
Based on the above technical solutions, the apparatus and method of the present invention have the following beneficial effects: the AdaGrad gradient descent algorithm can be implemented with this apparatus, and the efficiency of data processing is greatly improved; by using a device dedicated to executing the AdaGrad gradient descent algorithm, the problems of insufficient computational performance of general-purpose processors and large front-end decoding overhead are solved, the execution speed of related applications is accelerated, and the efficiency of data processing is greatly improved; at the same time, the use of the data buffer unit avoids repeatedly reading data from memory and reduces the memory access bandwidth.
Brief Description of the Drawings
FIG. 1 is an example block diagram of the overall structure of an apparatus for implementing applications related to the AdaGrad gradient descent algorithm according to an embodiment of the present invention;
FIG. 2 is an example block diagram of the data processing module in an apparatus for implementing applications related to the AdaGrad gradient descent algorithm according to an embodiment of the present invention;
FIG. 3 is a flowchart of operations for implementing applications related to the AdaGrad gradient descent algorithm according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to specific embodiments and the accompanying drawings. Other aspects, advantages and salient features of the present invention will become apparent to those skilled in the art from the following detailed description.
In this specification, the various embodiments described below for explaining the principles of the present invention are merely illustrative and should not be construed in any way as limiting the scope of the invention. The following description with reference to the accompanying drawings is intended to assist in a comprehensive understanding of exemplary embodiments of the invention as defined by the claims and their equivalents. The description includes numerous specific details to assist understanding, but these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, the same reference numerals are used throughout the drawings for similar functions and operations.
The present invention discloses an apparatus for executing the AdaGrad gradient descent algorithm, comprising a data access unit, an instruction cache unit, a controller unit, a data buffer unit and a data processing module. The data access unit can access the external address space and can read and write data to each cache unit inside the apparatus to complete the loading and storing of data; specifically, it reads instructions into the instruction cache unit, reads the parameters to be updated and the corresponding gradient values from the specified storage units into the data processing unit, and writes the updated parameter vector from the data processing module directly to the externally specified space. The instruction cache unit reads instructions through the data access unit and caches the read instructions. The controller unit reads instructions from the instruction cache unit, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules such as the data access unit, the data buffer unit and the data processing module. The data buffer unit stores the intermediate variables needed during operation and initializes and updates these variables. The data processing module performs the corresponding arithmetic operations according to the instructions, including vector addition, vector multiplication, vector division, vector square root and basic operations.
The apparatus implementing the AdaGrad gradient descent algorithm according to an embodiment of the present invention can be used to support applications that use the AdaGrad gradient descent algorithm. A space is allocated in the data buffer unit to store the sum of the squares of the historical gradient values; each time gradient descent is performed, a learning rate is calculated from this sum of squares and used as the learning rate of the gradient descent, and then the update operation on the vector to be updated is performed. The gradient descent operation is repeated until the vector to be updated converges.
The present invention also discloses a method for executing the AdaGrad gradient descent algorithm, whose specific implementation steps are as follows:
Step (1): the initialization of the data buffer unit is completed by an instruction, including setting initial values for the constants α and ε and zeroing the sum of squared historical gradients ∑_{i=1}^{t}(ΔL(W_i))². Here the constant α is an adaptive learning rate gain coefficient used to adjust the range of the adaptive learning rate, the constant ε is a small constant used to ensure that the denominator in the adaptive learning rate calculation is non-zero, t is the current number of iterations, W_i is the parameter to be updated in the i-th iteration, ΔL(W_i) is the gradient value of the parameter to be updated in the i-th iteration, and ∑ denotes the summation operation over the range i=1 to i=t, i.e. the gradient squared values (ΔL(W_1))², (ΔL(W_2))², ..., (ΔL(W_t))² from the initial iteration to the current one are summed.
Step (2): through an IO instruction, the data access unit reads the parameter vector to be updated and the corresponding gradient vector from the external space.
Step (3): according to the corresponding instructions, the data processing module reads, calculates and updates the sum of squared historical gradients ∑(ΔL(W_i))² in the data buffer unit, and calculates the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)) from the constants α, ε and the sum of squared historical gradients in the data buffer unit.
Step (4): according to the corresponding instruction, the data processing module uses the adaptive learning rate and the current gradient value to complete the update operation on the vector to be updated, where the update operation is calculated as follows:
W_{t+1} = W_t − (α/(ε + √(∑_{i=1}^{t}(ΔL(W_i))²)))·ΔL(W_t)
where W_t denotes the current (t-th) parameter to be updated, ΔL(W_t) denotes the gradient value of the current parameter to be updated, and W_{t+1} denotes the updated parameter, i.e. the parameter to be updated in the next ((t+1)-th) iteration.
Step (5): the data processing unit determines whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise the process returns to step (2) and continues.
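Purely for illustration, the computation of steps (1) to (5) can be summarised by the following NumPy sketch of the same algorithm executed in software. The gradient function, the initial parameters and the convergence criterion are assumptions and are not specified by the patent:

```python
import numpy as np

def adagrad_descent(grad_fn, w0, alpha=0.01, eps=1e-8, tol=1e-6, max_iters=10_000):
    """Software sketch of steps (1)-(5): accumulate squared gradients and
    update the parameter vector with a per-element adaptive learning rate."""
    w = np.asarray(w0, dtype=float)
    r = np.zeros_like(w)                 # step (1): sum of squared historical gradients, zeroed
    for _ in range(max_iters):
        g = grad_fn(w)                   # step (2): obtain the current gradient ΔL(W_t)
        r += g ** 2                      # step (3): update the accumulated squared gradients
        lr = alpha / (eps + np.sqrt(r))  # step (3): adaptive learning rate
        w_new = w - lr * g               # step (4): W_{t+1} = W_t - lr * ΔL(W_t)
        if np.linalg.norm(w_new - w) < tol:  # step (5): convergence check (criterion assumed)
            return w_new
        w = w_new
    return w

# Example use on a simple quadratic objective (assumed for illustration):
w_star = adagrad_descent(lambda w: 2.0 * (w - 3.0), w0=np.zeros(4))
```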
The specific scheme of the present invention is further explained below with reference to the accompanying drawings.
FIG. 1 shows an example block diagram of the overall structure of an apparatus for implementing the AdaGrad gradient descent algorithm according to an embodiment of the present invention. As shown in FIG. 1, the apparatus includes a data access unit 1, an instruction cache unit 2, a controller unit 3, a data buffer unit 4 and a data processing module 5, all of which can be implemented in hardware, including but not limited to FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits and memristors.
The data access unit 1 can access the external address space and can read and write data to each cache unit inside the apparatus to complete the loading and storing of data. Specifically, it reads instructions into the instruction cache unit 2, reads the parameters to be updated from the specified storage units into the data processing unit 5, reads gradient values from the externally specified space into the data buffer unit 4, and writes the updated parameter vector from the data processing module 5 directly to the externally specified space.
The instruction cache unit 2 reads instructions through the data access unit 1 and caches the read instructions.
The controller unit 3 reads instructions from the instruction cache unit 2, decodes them into microinstructions that control the behavior of the other modules, and sends them to the other modules such as the data access unit 1, the data buffer unit 4 and the data processing module 5.
When the apparatus is initialized, the data buffer unit 4 initializes the sum of squared historical gradient values ∑_{i=1}^{t}(ΔL(W_i))², setting its value to 0, and allocates two spaces to store the constants α and ε; these two constant spaces are kept until the entire gradient descent iteration process ends. During each data update, the sum of squared historical gradient values ∑(ΔL(W_i))² is read into the data processing module 5, its value is updated there, i.e. the square of the current gradient value is added, and it is then written back to the data buffer unit 4.
The data processing module 5 reads the sum of squared historical gradient values ∑(ΔL(W_i))² and the constants α and ε from the data buffer unit 4, updates the sum and sends its value back to the data buffer unit 4, uses the sum and the constants α and ε to calculate the adaptive learning rate α/(ε + √(∑(ΔL(W_i))²)), and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
FIG. 2 shows an example block diagram of the data processing module in an apparatus for implementing applications related to the AdaGrad gradient descent algorithm according to an embodiment of the present invention. As shown in FIG. 2, the data processing module includes an operation control sub-module 51, a parallel vector addition unit 52, a parallel vector multiplication unit 53, a parallel vector division unit 54, a parallel vector square root unit 55 and a basic operation sub-module 56. Since the vector operations in the AdaGrad gradient descent algorithm are all element-wise operations, when an operation is performed on a vector, elements at different positions can be operated on in parallel.
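Because every vector operation here is element-wise, a single update step decomposes into independent per-element operations that map onto the parallel sub-modules 52 to 55. The following sketch is illustrative only; the function name and the per-element formulation are assumptions, not the patent's implementation:

```python
def adagrad_elementwise(w_i, g_i, r_i, alpha, eps):
    """Per-element AdaGrad kernel: each lane can evaluate this independently,
    which is what allows the vector sub-modules to operate in parallel."""
    sq   = g_i * g_i             # parallel vector multiplication unit 53
    r_i  = r_i + sq              # parallel vector addition unit 52
    root = r_i ** 0.5            # parallel vector square root unit 55
    lr   = alpha / (eps + root)  # parallel vector division unit 54 (with basic operations 56)
    w_i  = w_i - lr * g_i        # multiplication 53 and addition 52 again
    return w_i, r_i

# Applying the kernel independently to every element is equivalent to the
# whole-vector update, e.g.:
# updated = [adagrad_elementwise(w, g, r, 0.01, 1e-8) for w, g, r in zip(W, G, R)]
```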
FIG. 3 shows the overall flowchart of the apparatus performing the related operations of the AdaGrad gradient descent algorithm.
In step S1, an IO instruction is pre-stored at the first address of the instruction cache unit 2.
In step S2, the operation starts: the controller unit 3 reads the IO instruction from the first address of the instruction cache unit 2, and according to the decoded microinstruction, the data access unit 1 reads all instructions related to the AdaGrad gradient descent calculation from the external address space and caches them in the instruction cache unit 2.
在步骤S3,控制器单元4从指令缓存单元2读入赋值指令,根据译出的微指令,数据缓存单元中的历史梯度平方和
Figure PCTCN2016081836-appb-000032
置零,并初始化常数α,ε。其中,常数α为自适应学习率增益系数,用于调节控制自适应学习率的范围,常数ε为一个较小的常数,用于保证自适应学习率计算中的分母非零,t为当前迭代次数,Wi为第i次迭代运算时的待更新参数,ΔL(Wi)为第i次迭代运算时待更新参数的梯度值,∑表示求和操作,其求和范围从i=1至i=t,即对初始至当前的梯度平方值(ΔL(W1))2,(ΔL(W2))2,...,(ΔL(Wt))2求和。
In step S3, the controller unit 4 reads the assignment instruction from the instruction cache unit 2, and according to the translated microinstruction, the sum of the squares of the historical gradients in the data buffer unit.
Figure PCTCN2016081836-appb-000032
Set zero and initialize the constants α, ε. The constant α is an adaptive learning rate gain coefficient, which is used to adjust the range of the adaptive learning rate. The constant ε is a small constant, which is used to ensure that the denominator in the adaptive learning rate calculation is non-zero, and t is the current iteration. The number of times, W i is the parameter to be updated when the i-th iteration is operated, ΔL(W i ) is the gradient value of the parameter to be updated when the i-th iteration is operated, and ∑ represents the summation operation, and the summation range is from i=1 to i=t, that is, the initial to current gradient squared value (ΔL(W 1 )) 2 , (ΔL(W 2 )) 2 , . . . , (ΔL(W t )) 2 is summed.
In step S4, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the data access unit 1 reads the parameter vector to be updated $W_t$ and the corresponding gradient vector $\Delta L(W_t)$ from the external space into the data buffer unit 4.
In step S5, the controller unit 3 reads a data transfer instruction from the instruction cache unit 2, and according to the decoded microinstruction, the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε in the data buffer unit 4 are transferred to the data processing unit 5.
In step S6, the controller unit 3 reads a vector instruction from the instruction cache unit 2, and according to the decoded microinstruction, the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is updated. In this operation, the instruction is sent to the operation control sub-module 51, which issues the corresponding instructions to perform the following: the parallel vector multiplication unit 53 computes $(\Delta L(W_t))^2$, and the parallel vector addition unit 52 adds $(\Delta L(W_t))^2$ to the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$.
In step S7, the controller unit 3 reads an instruction from the instruction cache unit 2, and according to the decoded microinstruction, the updated sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ is transferred from the data processing unit 5 back to the data buffer unit 4.
In step S8, the controller unit 3 reads an adaptive learning rate operation instruction from the instruction cache unit 2, and according to the decoded microinstruction, the operation control sub-module 51 directs the relevant operation units to perform the following: the parallel vector square root unit 55 computes $\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}$, and the parallel vector division unit 54 computes the adaptive learning rate $\alpha\big/\bigl(\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon\bigr)$.
In step S9, the controller unit 3 reads a parameter vector update instruction from the instruction cache unit 2, and according to the decoded microinstruction, drives the operation control sub-module 51 to perform the following operations: the parallel vector multiplication unit 53 computes $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon}\,\Delta L(W_t)$, and the parallel vector addition unit 52 computes $W_t-\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon}\,\Delta L(W_t)$, giving the updated parameter vector $W_{t+1}$.
In step S10, the controller unit 3 reads an IO instruction from the instruction cache unit 2, and according to the decoded microinstruction, the updated parameter vector $W_{t+1}$ is transferred from the data processing unit 5 through the data access unit 1 to the specified address in the external address space.
In step S11, the controller unit 3 reads a convergence judgment instruction from the instruction cache unit 2, and according to the decoded microinstruction, the data processing unit 5 judges whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise execution continues from step S5.
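For orientation only, the following is a minimal host-side sketch of the loop that steps S3 through S11 describe, with the instruction fetching, the data transfers between units, and the hardware parallelism omitted. The gradient callback grad_fn, the norm-based convergence test, and the default values of α, ε, the tolerance, and the iteration cap are assumptions introduced for illustration; the patent leaves the concrete convergence criterion to the convergence judgment instruction.

    import numpy as np

    # Simplified software sketch of steps S3 to S11 (assumptions noted above); it
    # omits the instruction cache, the data access unit, and per-unit parallelism.
    def adagrad_train(w, grad_fn, alpha=0.01, eps=1e-8, tol=1e-6, max_iter=1000):
        hist_sq_sum = np.zeros_like(w)                 # step S3: zero the historical sum of squares
        for _ in range(max_iter):
            grad = grad_fn(w)                          # step S4: obtain the current gradient vector
            hist_sq_sum = hist_sq_sum + grad * grad    # step S6: accumulate squared gradients
            lr = alpha / (np.sqrt(hist_sq_sum) + eps)  # step S8: adaptive learning rate
            w_next = w - lr * grad                     # step S9: parameter update
            if np.linalg.norm(w_next - w) < tol:       # step S11: convergence check (assumed form)
                return w_next
            w = w_next
        return w

For example, adagrad_train(np.array([5.0, -3.0]), lambda v: 2.0 * v, alpha=0.5) walks the parameters of the quadratic toward its minimum at the origin, with each coordinate receiving its own effective step size as its gradient history grows.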
The processes or methods depicted in the preceding figures may be performed by processing logic that comprises hardware (for example, circuitry or dedicated logic), firmware, software (for example, software carried on a non-transitory computer-readable medium), or a combination of the two. Although the processes or methods have been described above in a certain order, it should be understood that some of the described operations can be performed in a different order. Moreover, some operations may be performed in parallel rather than sequentially.
In the foregoing specification, embodiments of the present invention have been described with reference to specific exemplary embodiments thereof. It is apparent that various modifications may be made to the embodiments without departing from the broader spirit and scope of the invention as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (9)

  1. An apparatus for performing an AdaGrad gradient descent algorithm, comprising:
    a controller unit, configured to decode read instructions into microinstructions that control the corresponding modules, and to send them to the corresponding modules;
    a data buffer unit, configured to store intermediate variables of the operation process and to perform initialization and update operations on the intermediate variables; and
    a data processing module, configured to perform arithmetic operations under the control of the controller unit, including vector addition, vector multiplication, vector division, vector square root, and basic operations, and to store intermediate variables in the data buffer unit.
  2. The apparatus for performing an AdaGrad gradient descent algorithm according to claim 1, wherein the data processing module comprises an operation control sub-module, a parallel vector addition unit, a parallel vector multiplication unit, a parallel vector division unit, a parallel vector square root unit, and a basic operation sub-module.
  3. The apparatus for performing an AdaGrad gradient descent algorithm according to claim 2, wherein, when the data processing module performs an operation on the same vector, elements at different positions can be operated on in parallel.
  4. The apparatus for performing an AdaGrad gradient descent algorithm according to claim 1, wherein, at apparatus initialization, the data buffer unit initializes the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ by setting its value to 0, and allocates two spaces to store the constants α and ε; these two constant spaces are retained until execution of the entire gradient descent algorithm is complete.
  5. The apparatus for performing an AdaGrad gradient descent algorithm according to claim 1, wherein, during each data update, the data buffer unit has the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ read into the data processing module, where its value is updated by adding the square of the current gradient value, and then written back into the data buffer unit;
    the data processing module reads the sum of squares of historical gradient values $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε from the data buffer unit, updates the value of $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and sends it back to the data buffer unit, computes the adaptive learning rate $\alpha\big/\bigl(\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon\bigr)$ from $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε, and finally updates the vector to be updated using the current gradient value and the adaptive learning rate.
  6. A method for performing an AdaGrad gradient descent algorithm, comprising the following steps:
    step (1), initializing the data buffer unit, including setting initial values for the constants α and ε and setting the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ to zero, where the constant α is the adaptive learning rate gain coefficient, used to adjust the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the adaptive learning rate calculation is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and ∑ denotes summation over the range i = 1 to i = t, that is, summation of the gradient squares from the first to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
    step (2), reading the parameter vector to be updated and the corresponding gradient vector from the external space;
    step (3), the data processing module reading and updating the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data buffer unit, and computing the adaptive learning rate $\alpha\big/\bigl(\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon\bigr)$ from the constants α, ε and the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data buffer unit;
    step (4), the data processing module completing the update of the vector to be updated using the adaptive learning rate and the current gradient value, the update being computed as
    $$W_{t+1} = W_t - \frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon}\,\Delta L(W_t),$$
    where $W_t$ denotes the current parameter to be updated, i.e. that of the t-th iteration, $\Delta L(W_t)$ denotes the gradient value of the current parameter to be updated, and $W_{t+1}$ denotes the updated parameter, which is also the parameter to be updated in the next, i.e. the (t+1)-th, iteration;
    step (5), the data processing unit judging whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise execution continues from step (2).
  7. An apparatus for performing an AdaGrad gradient descent algorithm, wherein a program for performing the method according to claim 6 is solidified in a controller of the apparatus.
  8. A method for performing an AdaGrad gradient descent algorithm, comprising the following steps:
    step S1, pre-storing an IO instruction at the first address of the instruction cache unit;
    step S2, the controller unit reading this IO instruction from the first address of the instruction cache unit, and according to the decoded microinstruction, the data access unit reading all instructions related to the AdaGrad gradient descent calculation from the external address space and caching them in the instruction cache unit;
    step S3, the controller unit reading an assignment instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ in the data buffer unit being set to zero and the constants α and ε being initialized, where the constant α is the adaptive learning rate gain coefficient, used to adjust the range of the adaptive learning rate, the constant ε is a constant used to ensure that the denominator in the adaptive learning rate calculation is non-zero, t is the current iteration number, $W_i$ is the parameter to be updated at the i-th iteration, $\Delta L(W_i)$ is the gradient value of the parameter to be updated at the i-th iteration, and ∑ denotes summation over the range i = 1 to i = t, that is, summation of the gradient squares from the first to the current iteration, $(\Delta L(W_1))^2, (\Delta L(W_2))^2, \ldots, (\Delta L(W_t))^2$;
    step S4, the controller unit reading an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the data access unit reading the parameter vector to be updated $W_t$ and the corresponding gradient vector $\Delta L(W_t)$ from the external space into the data buffer unit;
    step S5, the controller unit reading a data transfer instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ and the constants α and ε in the data buffer unit being transferred to the data processing unit;
    step S6, the controller unit reading a vector instruction from the instruction cache unit, and according to the decoded microinstruction, the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ being updated, in which operation the instruction is sent to the operation control sub-module, which issues the corresponding instructions to perform the following: the parallel vector multiplication sub-module computes $(\Delta L(W_t))^2$, and the parallel vector addition sub-module adds $(\Delta L(W_t))^2$ to the sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$;
    step S7, the controller unit reading an instruction from the instruction cache unit, and according to the decoded microinstruction, the updated sum of squared historical gradients $\sum_{i=1}^{t}(\Delta L(W_i))^2$ being transferred from the data processing unit back to the data buffer unit;
    step S8, the controller unit reading an adaptive learning rate operation instruction from the instruction cache unit, and according to the decoded microinstruction, the operation control sub-module directing the relevant operation modules to perform the following: the parallel vector square root sub-module computes $\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}$, and the parallel vector division sub-module computes the adaptive learning rate $\alpha\big/\bigl(\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon\bigr)$;
    step S9, the controller unit reading a parameter vector update instruction from the instruction cache unit, and according to the decoded microinstruction, driving the operation control sub-module to perform the following operations: the parallel vector multiplication sub-module computes $\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon}\,\Delta L(W_t)$, and the parallel vector addition sub-module computes $W_t-\frac{\alpha}{\sqrt{\sum_{i=1}^{t}(\Delta L(W_i))^2}+\varepsilon}\,\Delta L(W_t)$, giving the updated parameter vector $W_{t+1}$;
    step S10, the controller unit reading an IO instruction from the instruction cache unit, and according to the decoded microinstruction, the updated parameter vector $W_{t+1}$ being transferred from the data processing unit through the data access unit to the specified address in the external address space;
    step S11, the controller unit reading a convergence judgment instruction from the instruction cache unit, and according to the decoded microinstruction, the data processing unit judging whether the updated parameter vector has converged; if it has converged, the operation ends, otherwise execution continues from step S5.
  9. An apparatus for performing an AdaGrad gradient descent algorithm, wherein a program for performing the method according to claim 8 is solidified in a controller of the apparatus.
PCT/CN2016/081836 2016-04-29 2016-05-12 Apparatus and method for executing adagrad gradient descent training algorithm WO2017185411A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610280620.4A CN107341132B (en) 2016-04-29 2016-04-29 Device and method for executing AdaGrad gradient descent training algorithm
CN201610280620.4 2016-04-29

Publications (1)

Publication Number Publication Date
WO2017185411A1 true WO2017185411A1 (en) 2017-11-02

Family

ID=60161682

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/081836 WO2017185411A1 (en) 2016-04-29 2016-05-12 Apparatus and method for executing adagrad gradient descent training algorithm

Country Status (2)

Country Link
CN (1) CN107341132B (en)
WO (1) WO2017185411A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378480B (en) * 2019-06-14 2022-09-27 平安科技(深圳)有限公司 Model training method and device and computer readable storage medium
CN111626434B (en) * 2020-05-15 2022-06-07 浪潮电子信息产业股份有限公司 Distributed training parameter updating method, device, equipment and storage medium
CN113238975A (en) * 2021-06-08 2021-08-10 中科寒武纪科技股份有限公司 Memory, integrated circuit and board card for optimizing parameters of deep neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9477925B2 (en) * 2012-11-20 2016-10-25 Microsoft Technology Licensing, Llc Deep neural networks training for speech and pattern recognition

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101826142A (en) * 2010-04-19 2010-09-08 中国人民解放军信息工程大学 Reconfigurable elliptic curve cipher processor
CN102156637A (en) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 Vector crossing multithread processing method and vector crossing multithread microprocessor
CN104035751A (en) * 2014-06-20 2014-09-10 深圳市腾讯计算机系统有限公司 Graphics processing unit based parallel data processing method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DEAN, J. ET AL.: "Large Scale Distributed Deep Networks", NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems, 6 December 2012 (2012-12-06), XP055113684 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329941A (en) * 2020-11-04 2021-02-05 支付宝(杭州)信息技术有限公司 Deep learning model updating method and device
CN116128072A (en) * 2023-01-20 2023-05-16 支付宝(杭州)信息技术有限公司 Training method, device, equipment and storage medium of risk control model
CN116128072B (en) * 2023-01-20 2023-08-25 支付宝(杭州)信息技术有限公司 Training method, device, equipment and storage medium of risk control model

Also Published As

Publication number Publication date
CN107341132A (en) 2017-11-10
CN107341132B (en) 2021-06-11

Similar Documents

Publication Publication Date Title
WO2017185411A1 (en) Apparatus and method for executing adagrad gradient descent training algorithm
US11574195B2 (en) Operation method
WO2017124644A1 (en) Artificial neural network compression encoding device and method
WO2017185391A1 (en) Device and method for performing training of convolutional neural network
WO2017124642A1 (en) Device and method for executing forward calculation of artificial neural network
WO2017124641A1 (en) Device and method for executing reversal training of artificial neural network
WO2017185389A1 (en) Device and method for use in executing matrix multiplication operations
CN111260025B (en) Apparatus and method for performing LSTM neural network operation
WO2017124646A1 (en) Artificial neural network calculating device and method for sparse connection
WO2017185347A1 (en) Apparatus and method for executing recurrent neural network and lstm computations
WO2017185394A1 (en) Device and method for performing reversetraining of fully connected layers of neural network
WO2018120016A1 (en) Apparatus for executing lstm neural network operation, and operational method
WO2017185257A1 (en) Device and method for performing adam gradient descent training algorithm
WO2017185336A1 (en) Apparatus and method for executing pooling operation
WO2018113790A1 (en) Operation apparatus and method for artificial neural network
CN109754062B (en) Execution method of convolution expansion instruction and related product
WO2017185413A1 (en) Device and method for executing hessian-free training algorithm
KR20230109791A (en) Packed data alignment plus compute instructions, processors, methods, and systems
WO2017185248A1 (en) Apparatus and method for performing auto-learning operation of artificial neural network
WO2018112892A1 (en) Device and method for supporting fast artificial neural network operation
WO2017185335A1 (en) Apparatus and method for executing batch normalization operation
WO2017177446A1 (en) Discrete data representation-supporting apparatus and method for back-training of artificial neural network
CN111860814B (en) Apparatus and method for performing batch normalization operations
WO2017185256A1 (en) Rmsprop gradient descent algorithm execution apparatus and method
CN107315569B (en) Device and method for executing RMSprop gradient descent algorithm

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16899922

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16899922

Country of ref document: EP

Kind code of ref document: A1