WO2020249106A1 - Programmable device for processing a data group and method for processing a data group - Google Patents

Programmable device for processing a data group and method for processing a data group Download PDF

Info

Publication number
WO2020249106A1
WO2020249106A1 (PCT/CN2020/095907)
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature
bucket
programmable device
accumulation
Prior art date
Application number
PCT/CN2020/095907
Other languages
English (en)
French (fr)
Inventor
李嘉树
卢冕
季成
杨俊�
Original Assignee
第四范式(北京)技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 第四范式(北京)技术有限公司 filed Critical 第四范式(北京)技术有限公司
Priority to EP20821819.8A priority Critical patent/EP3985498B1/en
Priority to US17/619,142 priority patent/US11791822B2/en
Publication of WO2020249106A1 publication Critical patent/WO2020249106A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03KPULSE TECHNIQUE
    • H03K19/00Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits
    • H03K19/02Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits using specified components
    • H03K19/173Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits using specified components using elementary logic circuits as components
    • H03K19/177Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits using specified components using elementary logic circuits as components arranged in matrix form
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • G06F7/505Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination
    • G06F7/509Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination for multiple operands, e.g. digital integrators

Definitions

  • the present disclosure relates to a programmable device for processing data sets and a method for processing data sets.
  • node splitting is a step that consumes a large share of the running time, and the overall running time of the GBDT algorithm largely depends on it.
  • pipeline optimization is a common parallel optimization method in hardware acceleration. Pipeline optimization divides a complex processing operation into multiple steps; by overlapping operations on different steps, multiple operations can be executed in parallel, which greatly increases the running speed of the entire program and effectively improves the utilization efficiency of hardware resources.
  • in order to pipeline the accumulation operation in the GBDT histogram algorithm, an accumulator is usually used to solve the data-dependence (data-conflict) problem introduced by pipeline optimization.
  • due to limitations on resources (for example, more than 20,000 independent accumulation requirements) and precision (for example, 64-bit double-precision floating-point numbers), a dedicated accumulator cannot be used directly to perform the accumulation operation in a hardware-accelerated implementation of the GBDT histogram algorithm, so optimization methods relying on a dedicated accumulator are limited.
  • the purpose of the present disclosure is to provide a programmable device for processing data sets and a method for processing data sets.
  • An aspect of the present disclosure provides a programmable device for processing a data group.
  • the programmable device includes a plurality of accumulation circuits, wherein each accumulation circuit includes a pipeline adder and a buffer unit for storing the calculation results of the pipeline adder; and a multiplexer for sequentially receiving the data in the data group, dynamically determining the correspondence between the multiple features contained in the data and the multiple accumulation circuits, and sending the feature values of the multiple features in the received data to the corresponding accumulation circuits according to the correspondence.
  • Another aspect of the present disclosure provides a method for processing a data group based on a programmable device.
  • the method includes: setting a plurality of accumulation circuits in the programmable device, wherein each accumulation circuit includes a pipeline adder and a buffer for storing the calculation results of the pipeline adder; and setting a multiplexer in the programmable device, the multiplexer receiving each data item in the data group, dynamically determining the correspondence between the multiple features contained in the data and the multiple accumulation circuits, and, during each period, sending the feature value of each of the multiple features to the corresponding accumulation circuit according to the correspondence.
  • the multiplexer dynamically determines the correspondence between the multiple features contained in the received data and the multiple accumulation circuits, so as to avoid/reduce the situation in which an accumulation circuit is assigned the same feature again while it is still accumulating that feature's values, thereby avoiding/reducing data conflicts.
  • FIG. 1 shows a block diagram of an accumulation circuit generated by a pipeline adder and a buffer according to the present disclosure
  • FIG. 2 shows a schematic diagram of a timing diagram of an accumulation operation performed by an accumulation circuit according to the present disclosure
  • FIG. 3 shows a block diagram of a programmable device for processing a data group according to an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of the correspondence between the accumulation circuit of the programmable device and the characteristics of the data according to an embodiment of the present disclosure
  • FIG. 5 shows a flowchart of a method for processing a data group according to an embodiment of the present disclosure
  • FIG. 6 shows a flowchart of a method for processing a data group according to another embodiment of the present disclosure.
  • the inventor of the present disclosure uses an adder (for example, a single-precision adder or a double-precision adder) and a buffer (for example, Block RAM) to generate an accumulation circuit.
  • FIG. 1 shows a block diagram of an accumulation circuit built from an adder and a buffer, as used in hardware acceleration according to the present disclosure.
  • GBDT's histogram optimization algorithm works as follows: before training, the feature values are converted into bins; that is, a piecewise function is defined over the values of each feature, and the value of every sample on that feature is assigned to a segment (bin). The feature values are thereby converted from continuous values into discrete values. For example, the values of the age feature are divided into buckets, say 5 buckets: 0-20, 20-40, 40-60, 60-80, and 80-100 years old.
  • for any one of the buckets, for example the 20-40-year-old bucket, the age feature values of all data falling into 20-40 are accumulated to obtain the accumulated value x (or the average age x after accumulation); then, for the data whose true age value falls into the 20-40 bucket, the value of the age feature is replaced with x.
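As an illustration of the bucketing and per-bucket averaging described above, the following sketch bins an age feature into five buckets, accumulates the raw values per bucket, and replaces each value with its bucket average. The bucket boundaries, function names, and sample data are illustrative assumptions, not the patent's implementation:

```python
# Illustrative sketch of GBDT histogram bucketing (hypothetical names/data).
BOUNDARIES = [20, 40, 60, 80, 100]  # buckets: 0-20, 20-40, 40-60, 60-80, 80-100

def bucket_of(age):
    """Return the index of the bucket an age value falls into."""
    for i, upper in enumerate(BOUNDARIES):
        if age < upper:
            return i
    return len(BOUNDARIES) - 1

def bucket_averages(ages):
    """Accumulate per-bucket sums and counts, then return each bucket's average."""
    sums = [0.0] * len(BOUNDARIES)
    counts = [0] * len(BOUNDARIES)
    for age in ages:
        b = bucket_of(age)
        sums[b] += age
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in range(len(BOUNDARIES)) if counts[b]}

def discretize(ages):
    """Replace each raw age with the average of the bucket it falls into."""
    avg = bucket_averages(ages)
    return [avg[bucket_of(a)] for a in ages]
```

For instance, `discretize([25, 35, 30, 65])` maps the three 20-40 values to their bucket average 30.0 while 65 keeps its singleton bucket's average.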
  • since each data item contains multiple features and a histogram must be built separately for each feature, an accumulation circuit can be assigned to each feature to ensure that the histogram construction of all features can be processed in parallel.
  • owing to the characteristics of the adder and the buffer, each accumulation operation on certain hardware used for acceleration (for example, an FPGA device) may incur a delay of, for example, multiple clock cycles.
  • FIG. 2 shows a schematic diagram of a timing chart of the accumulation operation performed by the accumulation circuit.
  • the adder reads data from the buffer during the load period (load signal high), and performs the accumulation on the data during the following several clock cycles. After the accumulation completes, the adder stores the new accumulated data into the buffer in response to a high store signal. Because the next accumulation cannot start until the result of the previous one has been written into the buffer, a large amount of data dependence is inevitably introduced, causing the pipeline to stall. In some cases this delay can be as long as 14 clock cycles; in other words, for every clock cycle of useful work the adder is forced to stall for 13 clock cycles, which degrades the efficiency and throughput of the pipeline. In this regard, the inventors of the present disclosure further propose the solution shown in FIG. 3.
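The throughput loss can be sketched with a simple cycle-count model. The 14-cycle latency comes from the text; the model itself, including its function name and the independence assumption, is an illustrative simplification:

```python
def cycles_needed(n_ops, latency, pipelined_independent):
    """Cycle count for n accumulations on an adder with the given latency.

    If every accumulation targets the same buffer address, each must wait
    for the previous write-back, so the operations fully serialize.  If
    consecutive operations are guaranteed independent, a pipelined adder
    can issue a new one every cycle; only the final drain costs extra.
    """
    if pipelined_independent:
        return n_ops + latency - 1   # one issue per cycle, plus pipeline drain
    return n_ops * latency          # serialized read-add-write chains

# With a 14-cycle add latency, 1000 dependent accumulations cost 14000
# cycles, while 1000 independent pipelined ones cost only 1013.
```

This is the roughly 14x efficiency gap that motivates rearranging which feature each circuit handles on each cycle.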
  • FIG. 3 shows a block diagram of a programmable device for processing a data group according to an embodiment of the present disclosure.
  • the programmable device for processing a data group includes a plurality of accumulation circuits AC and a multiplexer MUX, wherein each accumulation circuit AC includes a pipeline adder SA and a buffer unit BUF for storing the calculation results of the pipeline adder SA.
  • the multiplexer MUX can be used to sequentially receive the data in the data group, dynamically determine the correspondence between the multiple features contained in the data and the multiple accumulation circuits AC, and send the feature values of the multiple features in the received data to the corresponding accumulation circuits AC according to the correspondence.
  • the programmable device may be a field programmable gate array (FPGA).
  • the data group may be a sample data set for machine learning in a specific application scenario.
  • the machine learning algorithm may be a machine learning algorithm that needs to process a large amount of data and has specific requirements for accuracy.
  • the programmable device can be used to perform gradient regression decision tree GBDT histogram algorithm processing on the sample data set.
  • the basic idea of the histogram algorithm is to pre-bin the feature values, so that only the histogram buckets need to be considered when computing the split to select the division point. Compared with the pre-sorting algorithm, the histogram algorithm significantly reduces memory consumption and helps increase training speed.
  • the pipeline adder SA may operate as a pipeline circuit.
  • in a pipeline circuit, an instruction-processing pipeline is composed of multiple circuit units with different functions; an instruction is divided into multiple steps (for example, 4-6 steps) that are then executed by these circuit units respectively, so that the pipeline circuit can accept a new input every clock cycle. After an initial delay, the pipeline circuit can produce a new output every clock cycle.
  • the pipeline circuit does not reduce the time of a single data operation, but greatly increases the throughput, which makes the hardware utilization rate high, thereby reducing the demand for hardware resources.
  • the pipeline adder SA in each accumulation circuit reads, from the corresponding buffer unit BUF, the accumulated value corresponding to the bucket to which the received feature value belongs, accumulates the received feature value onto the read value to obtain a new accumulated value, and updates the corresponding accumulated value in the corresponding buffer unit BUF with the new accumulated value (see FIG. 2).
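In software terms, each accumulation circuit performs the following read-accumulate-write cycle. The class, its method, and the dictionary-as-BUF layout are illustrative assumptions, not the patent's hardware description:

```python
class AccumulationCircuit:
    """Models one pipeline adder SA plus its buffer unit BUF.

    The buffer holds one running sum per (feature, bucket) address; each
    step loads the old sum, adds the incoming feature value, and stores
    the result back to the same address.
    """

    def __init__(self):
        self.buf = {}  # (feature_tag, bucket_tag) -> accumulated value

    def accumulate(self, feature_tag, bucket_tag, value):
        addr = (feature_tag, bucket_tag)
        old = self.buf.get(addr, 0.0)   # load from BUF
        new = old + value               # pipeline adder SA
        self.buf[addr] = new            # store back to BUF
        return new
```

In hardware, the load, add, and store of one call overlap with those of later calls, which is exactly why two calls hitting the same address within the latency window would conflict.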
  • the number of accumulation circuits can be determined by available hardware resources, and the number of features included in the data in the data group can be set differently according to the situation (for example, determined by at least one of the type of data and the user).
  • the programmable device can process at least one of multiple types of data and multiple types of user data through the same hardware resource (for example, the same number of accumulation circuits).
  • the number of accumulating circuits AC may be smaller than the number of features contained in the data in the data group; in this case, some of the accumulating circuits AC will be multiplexed. In another embodiment, the number of accumulating circuits AC may be the same as the number of features contained in the data in the data group, to ensure that all features can be processed in parallel at the same time. In yet another embodiment, the number of accumulating circuits AC may be greater than the number of features contained in the data in the data group.
  • the data in the data group may include a feature label indicating the feature corresponding to each included feature value and a bucketing tag indicating the bucket corresponding to each included feature value.
  • the pipeline adder SA in each accumulating circuit AC can read, from the corresponding buffer unit BUF, the accumulated value corresponding to the bucket to which the received feature value belongs, according to the feature tag and bucket tag corresponding to the received feature value.
  • the data in the data group may only include a bucket label indicating the bucket corresponding to each included feature value.
  • the pipeline adder SA in each accumulating circuit AC can read, from the corresponding buffer unit BUF, the accumulated value corresponding to the bucket to which the received feature value belongs, according to the control logic by which the multiplexer dynamically determines the correspondence and the bucket tag corresponding to the received feature value.
  • the pipeline adder SA may be a single precision adder or a double precision adder. It should be understood that various modifications can be made to the type of the pipeline adder SA according to resource and accuracy requirements without departing from the scope of the present disclosure.
  • the multiplexer dynamically determines the correspondence between the multiple features and the multiple accumulating circuits AC, which can, to the greatest extent, prevent consecutive occurrences of the same feature from landing on the same bucket, thereby avoiding/reducing data conflicts.
  • the multiplexer MUX can dynamically determine the correspondence between the multiple features contained in the received data and the multiple accumulation circuits AC according to the sequence number of the received data within the data group and the sequence number of each feature within the received data. This feature will be described in more detail with reference to FIG. 4 later.
  • the programmable device for processing the data group may further include an output unit (not shown).
  • the output unit may be used to sum accumulated values corresponding to the same bucket of the same feature in each buffer unit BUF in each accumulation circuit AC, and output each final accumulated value corresponding to each bucket of each feature.
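The output unit's final reduction can be sketched as a merge over the per-circuit buffers. It assumes, as an illustration only, that each buffer is represented as a map from a (feature, bucket) address to a partial sum:

```python
def merge_buffers(buffers):
    """Sum the partial accumulations held for the same (feature, bucket)
    address across all accumulation circuits' buffer units BUF, yielding
    the final accumulated value for every bucket of every feature."""
    final = {}
    for buf in buffers:  # one mapping per accumulation circuit
        for addr, partial in buf.items():
            final[addr] = final.get(addr, 0.0) + partial
    return final
```

Because the multiplexer rotates features across circuits, every circuit ends up holding partial sums for every feature, so this cross-circuit merge is what produces the per-bucket totals.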
  • FIG. 4 shows a schematic diagram of the correspondence between the accumulation circuit AC of the programmable device and the characteristics of the data according to an embodiment of the present disclosure.
  • a detailed description is given below, with reference to FIG. 4, of how the multiplexer MUX dynamically determines the correspondence between the multiple features contained in the received data and the multiple accumulating circuits AC according to the sequence number of the received data in the data group and the sequence number of each feature in the received data.
  • for ease of description, assume that the delay of an addition operation of the accumulating circuit AC is 4 clock cycles (including the time for the buffer unit BUF to be read, the addition to be performed, and the buffer unit BUF to be updated with the result of the addition);
  • each piece of data contains 4 features: feature a, feature b, feature c, and feature d (for example, data 1 contains features f1a, f1b, f1c, f1d, data 2 contains features f2a, f2b, f2c, f2d, and so on).
  • accumulation circuit 1 to accumulation circuit 4 each consist of one pipeline adder SA and one buffer unit BUF. Although completing one addition takes 4 clock cycles, since it is a pipeline circuit, the pipeline adder SA can start processing an addition every clock cycle in the absence of data dependence.
  • the buffer unit BUF may be a dual-port memory that performs at most one store and one load operation in each clock cycle.
  • since the number of features per data item is assumed to be 4, four accumulation circuits (accumulation circuit 1 to accumulation circuit 4) are provided; that is, the number of accumulation circuits equals the number of features contained in the data.
  • the multiplexer receives the data sequentially in chronological order, each piece of data including 4 features (for example, in FIG. 4 the features f1a, f1b, f1c, f1d of data 1 are received in the first clock cycle, the features f2a, f2b, f2c, f2d of data 2 in the second clock cycle, and so on).
  • the correspondence between the features and the accumulation circuits shown in FIG. 4 is realized by setting the control logic of the data-selection end of the multiplexer. In more detail, during the first clock cycle, accumulation circuit 1 corresponds to f1a, accumulation circuit 2 to f1b, accumulation circuit 3 to f1c, and accumulation circuit 4 to f1d; during the second clock cycle, accumulation circuit 1 corresponds to f2b, accumulation circuit 2 to f2c, accumulation circuit 3 to f2d, and accumulation circuit 4 to f2a; and so on. In other words, each time the data sequence number increases by 1, the feature indices assigned to the accumulation circuits rotate cyclically left by one position.
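The rotating assignment can be expressed as a modular index calculation. A sketch (the function names and the 0-based indexing are illustrative assumptions) that also checks the no-conflict property implied by the rotation:

```python
def feature_for(circuit, data_idx, n_features):
    """Feature index handled by `circuit` for data item `data_idx`
    (both 0-based): the assignment rotates left by one per data item,
    matching FIG. 4 (circuit 0 gets f1a for data 1, f2b for data 2, ...)."""
    return (circuit + data_idx) % n_features

def window_is_conflict_free(circuit, start, n_features):
    """Within any window of n_features consecutive data items, one circuit
    sees n_features distinct features, so it never revisits a
    (feature, bucket) address before its previous write-back completes
    (assuming the add latency equals n_features cycles)."""
    feats = {feature_for(circuit, start + i, n_features) for i in range(n_features)}
    return len(feats) == n_features
```

With 4 circuits and a 4-cycle add latency, every 4-cycle window is conflict-free for every circuit, which is exactly why the pipeline can issue one addition per cycle.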
  • it should be noted that the numbers of accumulation circuits and features and the correspondence between them described here are given only for convenience of description. It is easy to understand that, depending on the specific embodiment, the number of data features may exceed 4, for example 200-300 or more.
  • the number of accumulation circuits can be equal to the number of features. In another embodiment, the number of accumulation circuits may be greater or less than the number of features.
  • the corresponding relationship between the accumulation circuit and the feature can be modified in various ways according to specific embodiments.
  • the data may include a feature tag and a bucket tag
  • the accumulation circuit reads the accumulated value corresponding to the bucket to which the received feature value belongs from the buffer unit BUF according to the feature tag and the bucket tag.
  • the data may include only the bucket tag; the accumulation circuit then reads, from the buffer unit BUF, the accumulated value corresponding to the bucket to which the received feature value belongs, according to the logic of the data-selection end of the multiplexer and the bucket tag.
  • since every accumulation circuit participates in the accumulation for all buckets of all features, each buffer unit BUF (accumulation circuits correspond one-to-one to buffer units) holds all the features and their buckets. Therefore, the output unit (not shown) can sum the accumulated values corresponding to the same bucket of the same feature across the buffer units of the accumulation circuits to obtain the final accumulated values.
  • FIG. 5 shows a flowchart of a method for processing a data group according to an embodiment of the present disclosure.
  • the method for processing a data group based on a programmable device includes:
  • in step S100, multiple accumulation circuits are provided in the programmable device, where each accumulation circuit includes a pipeline adder and a buffer for storing the calculation results of the pipeline adder;
  • in step S200, a multiplexer is set in the programmable device; the multiplexer receives each data item in the data group, dynamically determines the correspondence between the multiple features contained in the data and the multiple accumulation circuits, and, during each period, sends the feature value of each of the multiple features to the corresponding accumulation circuit according to the correspondence.
  • the pipeline adder is a single precision adder or a double precision adder.
  • the programmable device may be a field programmable gate array (FPGA); the data group may be a sample data set used for machine learning in a specific application scenario; and the programmable device may be used to perform gradient regression decision tree (GBDT) histogram algorithm processing on the sample data set.
  • the method can set the number of accumulation circuits to be the same as the number of features contained in the data in the data group, or can set the number of accumulation circuits to be greater or less than the number of features contained in the data in the data group.
  • the data group, the programmable device, and the relationship between the number of accumulation circuits and the number of features contained in the data described here are the same as or similar to those described with reference to FIG. 3; redundant description is therefore omitted.
  • the data may include a feature tag indicating the feature corresponding to each contained feature value and a bucket tag indicating the bucket corresponding to each contained feature value; the pipeline adder in each accumulation circuit can read, from the corresponding buffer unit, the accumulated value corresponding to the bucket to which the received feature value belongs, according to the feature tag and bucket tag corresponding to the received feature value.
  • the data may include only a bucket tag indicating the bucket corresponding to each contained feature value; the pipeline adder in each accumulation circuit can read, from the corresponding buffer unit, the accumulated value corresponding to the bucket to which the received feature value belongs, according to the control logic by which the multiplexer dynamically determines the correspondence and the bucket tag corresponding to the received feature value.
  • the pipeline adder in each accumulation circuit reads, from the corresponding buffer unit, the accumulated value corresponding to the bucket to which the received feature value belongs, accumulates the received feature value onto the read value to obtain a new accumulated value, and updates the corresponding accumulated value in the corresponding buffer unit with the new accumulated value.
  • the pipeline adder and buffer unit described here are the same as or similar to the pipeline adder SA and buffer unit BUF described with reference to FIG. 3, and therefore, redundant description is omitted here.
  • the multiplexer dynamically determines a plurality of features and a plurality of accumulation circuits included in the received data according to the serial number of the received data in the data group and the serial number of each feature in the received data Correspondence between.
  • the multiplexer described here is the same as or similar to the multiplexer MUX described with reference to FIGS. 3 and 4, and therefore, redundant description is omitted here.
  • FIG. 6 shows a flowchart of a method for processing a data group according to another embodiment of the present disclosure.
  • except for step S300, the method shown in FIG. 6 is basically the same as or similar to the method shown in FIG. 5; redundant description is therefore omitted.
  • in step S300, an output unit is set in the programmable device; the accumulated values corresponding to the same bucket of the same feature in the buffer units of the accumulation circuits are summed, and the final accumulated values corresponding to the buckets of the features are output.
  • in hardware-accelerated development, the pipeline adder and the buffer unit can be precisely controlled and used.
  • based on the characteristics of the machine learning algorithm, the present disclosure designs buffer-usage logic suited to it, which reduces or eliminates the possibility of data conflicts and thereby greatly improves the execution efficiency of the pipeline.
  • the correspondence between the multiple features contained in the received data and the multiple accumulation circuits is dynamically determined by the multiplexer, which avoids/reduces the situation in which an accumulation circuit is assigned the same feature again while it is still accumulating that feature's values, thereby avoiding/reducing data conflicts.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Mathematical Optimization (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Advance Control (AREA)
  • Image Processing (AREA)
  • Logic Circuits (AREA)

Abstract

A programmable device for processing a data group and a method for processing a data group are provided. The programmable device includes: a plurality of accumulation circuits, wherein each accumulation circuit includes a pipeline adder and a buffer unit for storing the calculation results of the pipeline adder; and a multiplexer for sequentially receiving the data in the data group, dynamically determining the correspondence between the multiple features contained in the data and the plurality of accumulation circuits, and sending the feature values of the multiple features in the received data to the corresponding accumulation circuits according to the correspondence.

Description

Programmable device for processing a data group and method for processing a data group
This application claims priority to Chinese patent application No. 201910516213.2, filed on June 14, 2019 and entitled "Programmable device for processing a data group and method for processing a data group", the disclosure of which is incorporated herein by reference.
Technical Field
The present disclosure relates to a programmable device for processing a data group and a method for processing a data group.
Background
With the development of machine learning algorithms, in concrete implementations of machine learning algorithms (for example, gradient regression decision tree, GBDT), node splitting is a step that consumes a large share of the running time, and the overall running time of the GBDT algorithm depends on it. Among the many node-splitting algorithms such as the histogram algorithm, pipeline optimization is a common parallel optimization method in hardware acceleration. Pipeline optimization divides a complex processing operation into multiple steps; by overlapping operations on different steps, multiple operations can be executed in parallel, which greatly increases the running speed of the entire program and effectively improves the utilization efficiency of hardware resources.
In the prior art, in order to pipeline the accumulation operation in the GBDT histogram algorithm, an accumulator is usually used to solve the data-dependence (data-conflict) problem introduced by pipeline optimization. However, due to limitations on resources (for example, more than 20,000 independent accumulation requirements) and precision (for example, 64-bit double-precision floating-point numbers), a dedicated accumulator cannot be used directly to perform the accumulation operation in a hardware-accelerated implementation of the GBDT histogram algorithm. Optimization methods that rely on a dedicated accumulator are therefore limited.
Summary
An object of the present disclosure is to provide a programmable device for processing a data group and a method for processing a data group.
An aspect of the present disclosure provides a programmable device for processing a data group, the programmable device including: a plurality of accumulation circuits, wherein each accumulation circuit includes a pipeline adder and a buffer unit for storing the calculation results of the pipeline adder; and a multiplexer for sequentially receiving the data in the data group, dynamically determining the correspondence between the multiple features contained in the data and the plurality of accumulation circuits, and sending the feature values of the multiple features in the received data to the corresponding accumulation circuits according to the correspondence.
Another aspect of the present disclosure provides a method for processing a data group based on a programmable device, the method including: setting a plurality of accumulation circuits in the programmable device, wherein each accumulation circuit includes a pipeline adder and a buffer for storing the calculation results of the pipeline adder; and setting a multiplexer in the programmable device, the multiplexer receiving each data item in the data group, dynamically determining the correspondence between the multiple features contained in the data and the plurality of accumulation circuits, and, during each period, sending the feature value of each of the multiple features to the corresponding accumulation circuit according to the correspondence.
According to one or more aspects of the present disclosure, the multiplexer dynamically determines the correspondence between the multiple features contained in the received data and the plurality of accumulation circuits, which avoids/reduces the situation in which an accumulation circuit is assigned the same feature again while it is still accumulating that feature's values, thereby avoiding/reducing data conflicts.
Additional aspects and advantages of the present disclosure will be set forth in part in the following description, and in part will become apparent from the description or may be learned by practice of the present disclosure.
Brief Description of the Drawings
The above and other objects and features of the present disclosure will become clearer from the following description taken in conjunction with the accompanying drawings, which exemplarily illustrate an example, in which:
FIG. 1 shows a block diagram of an accumulation circuit built from a pipeline adder and a buffer according to the present disclosure;
FIG. 2 shows a schematic timing diagram of an accumulation operation performed by the accumulation circuit according to the present disclosure;
FIG. 3 shows a block diagram of a programmable device for processing a data group according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of the correspondence between the accumulation circuits of the programmable device and the features of the data according to an embodiment of the present disclosure;
FIG. 5 shows a flowchart of a method for processing a data group according to an embodiment of the present disclosure; and
FIG. 6 shows a flowchart of a method for processing a data group according to another embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be noted that, in the present disclosure, "at least one of several items" covers three parallel cases: any one of the several items, any combination of several of the items, and all of the several items. For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "performing at least one of step 1 and step 2" covers the following three parallel cases: (1) performing step 1; (2) performing step 2; (3) performing step 1 and step 2.
To overcome the resource and precision limitations of accumulators, the inventors of the present disclosure combine an adder (for example, a single-precision or double-precision adder) with a buffer (for example, Block RAM) to build an accumulation circuit.
FIG. 1 shows a block diagram of an accumulation circuit built from an adder and a buffer, as used in hardware acceleration according to the present disclosure. The histogram optimization algorithm of GBDT works as follows: before training, the feature values are converted into bins; that is, a piecewise function is defined over the values of each feature, and the value of every sample on that feature is assigned to a segment (bin). The feature values are thereby converted from continuous values into discrete values. For example, the values of the age feature are divided into buckets, say 5 buckets: 0-20, 20-40, 40-60, 60-80, and 80-100 years old. For any one of the buckets, for example the 20-40 bucket, the age feature values of all data falling into 20-40 are accumulated to obtain the accumulated value x (or the average age x after accumulation); then, for the data whose true age value falls into the 20-40 bucket, the value of the age feature is replaced with x. Since each data item contains multiple features and a histogram must be built separately for each feature, an accumulation circuit can be assigned to each feature to ensure that the histogram construction of all features can be processed in parallel. However, owing to the characteristics of the adder and the buffer, each accumulation operation on certain hardware used for acceleration (for example, an FPGA device) may incur a delay of, for example, multiple clock cycles.
FIG. 2 shows a schematic timing diagram of the accumulation operation performed by the accumulation circuit. Referring to FIG. 2, the adder reads data from the buffer during the load period (load signal high), and performs the accumulation on the data during the following several clock cycles; after the accumulation completes, the adder stores the new accumulated data into the buffer in response to a high store signal. Because the next accumulation cannot start until the result of the previous one has been written into the buffer, a large amount of data dependence is inevitably introduced, causing the pipeline to stall. In some cases this delay can be as long as 14 clock cycles; in other words, for every clock cycle the adder runs, it is forced to stall for 13 clock cycles, which degrades the efficiency and throughput of the pipeline. In this regard, the inventors of the present disclosure further propose the solution shown in FIG. 3.
FIG. 3 shows a block diagram of a programmable device for processing a data group according to an embodiment of the present disclosure.
As shown in FIG. 3, the programmable device for processing a data group according to the present disclosure includes a plurality of accumulation circuits AC and a multiplexer MUX, wherein each accumulation circuit AC includes a pipeline adder SA and a buffer unit BUF for storing the calculation results of the pipeline adder SA. The multiplexer MUX can be used to sequentially receive the data in the data group, dynamically determine the correspondence between the multiple features contained in the data and the plurality of accumulation circuits AC, and send the feature values of the multiple features in the received data to the corresponding accumulation circuits AC according to the correspondence. In a specific embodiment, the programmable device may be a field programmable gate array (FPGA).
In an embodiment, the data group may be a sample data set used for machine learning in a specific application scenario, where the machine learning algorithm may be one that needs to process a large amount of data and has specific precision requirements. For example, the programmable device may be used to perform gradient regression decision tree (GBDT) histogram algorithm processing on the sample data set. The basic idea of the histogram algorithm is to pre-bin the feature values, so that only the histogram buckets need to be considered when computing the split to select the division point. Compared with the pre-sorting algorithm, the histogram algorithm significantly reduces memory consumption and helps increase training speed.
In the programmable device according to the present disclosure, the pipeline adder SA may operate as a pipeline circuit. In a pipeline circuit, an instruction-processing pipeline is composed of multiple circuit units with different functions; an instruction is divided into multiple steps (for example, 4-6 steps) that are then executed by these circuit units respectively, so that the pipeline circuit can accept a new input every clock cycle. After an initial delay, the pipeline circuit can produce a new output every clock cycle. A pipeline circuit does not reduce the time of a single data operation, but greatly increases throughput, yielding high hardware utilization and thus reducing the demand for hardware resources.
The pipeline adder SA in each accumulation circuit reads, from the corresponding buffer unit BUF, the accumulated value corresponding to the bucket to which the received feature value belongs, accumulates the received feature value onto the read value to obtain a new accumulated value, and updates the corresponding accumulated value in the corresponding buffer unit BUF with the new accumulated value (see FIG. 2). The number of accumulation circuits can be determined by the available hardware resources, and the number of features contained in the data of the data group can be set differently according to the situation (for example, determined by at least one of the type of data and the user). The programmable device can process at least one of multiple types of data and multiple types of user data with the same hardware resources (for example, the same number of accumulation circuits). In an embodiment, the number of accumulation circuits AC may be smaller than the number of features contained in the data of the data group; in this case, some of the accumulation circuits AC are multiplexed. In another embodiment, the number of accumulation circuits AC may equal the number of features contained in the data of the data group, to ensure that all features can be processed in parallel at the same time. In yet another embodiment, the number of accumulation circuits AC may be greater than the number of features contained in the data of the data group.
In an embodiment of the present disclosure, the data in the data group may include feature tags indicating the feature corresponding to each contained feature value and bucket tags indicating the bucket corresponding to each contained feature value. In this case, the pipeline adder SA in each accumulation circuit AC can read, from the corresponding buffer unit BUF, the accumulated value corresponding to the bucket to which the received feature value belongs, according to the feature tag and bucket tag corresponding to the received feature value.
In another embodiment of the present disclosure, the data in the data group may include only bucket tags indicating the bucket corresponding to each contained feature value. In this case, the pipeline adder SA in each accumulation circuit AC can read, from the corresponding buffer unit BUF, the accumulated value corresponding to the bucket to which the received feature value belongs, according to the control logic by which the multiplexer dynamically determines the correspondence and the bucket tag corresponding to the received feature value.
In an embodiment, the pipeline adder SA may be a single-precision adder or a double-precision adder. It should be understood that various modifications can be made to the type of the pipeline adder SA according to resource and precision requirements without departing from the scope of the present disclosure.
By having the multiplexer dynamically determine the correspondence between the multiple features and the plurality of accumulation circuits AC, consecutive occurrences of the same feature landing on the same bucket can be avoided to the greatest extent, thereby avoiding/reducing data conflicts.
In an embodiment, the multiplexer MUX can dynamically determine the correspondence between the multiple features contained in the received data and the plurality of accumulation circuits AC according to the sequence number of the received data in the data group and the sequence number of each feature in the received data. This feature is described in more detail later with reference to FIG. 4.
According to another embodiment of the present disclosure, the programmable device for processing a data group may further include an output unit (not shown). The output unit may be used to sum the accumulated values corresponding to the same bucket of the same feature in the buffer units BUF of the accumulation circuits AC, and to output the final accumulated values corresponding to the buckets of the features.
图4示出了根据本公开的实施例的可编程器件的累加电路AC与数据的特征之间的对应关系的示意图。
下面将参照图4给出多路复用器MUX根据所接收的数据在数据组中的 序号及所接收数据中的各特征的序号动态地确定包含在所接收数据中的多个特征与多个累加电路AC之间的对应关系的具体描述。
为描述方便,不妨假设:(1)累加电路AC做加法操作的时间延迟是4个时钟周期(包括缓存单元BUF读取数据、做加法、再用加法的结果更新缓存单元BUF的时间段);(2)每条数据包含4个特征:特征a、特征b、特征c、特征d(例如,数据1包含特征f1a、f1b、f1c、f1d,数据2包含特征f2a、f2b、f2c、f2d,以此类推)。
参照图3和图4,累加电路1至累加电路4均由一个流水线加法器SA和一个缓存单元BUF构成。虽然完成一个加法操作需要4个时钟周期,但是因其为流水线电路,故流水线加法器SA每个时钟都可以在没有数据依赖的情况下开始处理一个加法操作。缓存单元BUF可以是在每个时钟周期最多执行一个存储和一个加载操作的双端口存储器。
参照图4,在本实施例中,由于假设数据的特征的数量为4个,因此设置了4个累加电路(累加电路1至累加电路4),即,累加电路的数量与数据所包含的特征的数量相同。
多路复用器按时间顺序依次接收数据,而每条数据包括4个特征(例如,在图4中的第一个时钟周期接收数据1的特征f1a、f1b、f1c、f1d,在第二时钟周期接收数据2的特征f2a、f2b、f2c、f2d……),通过设置多路复用器的数据选择端的控制逻辑实现如上图所示的特征与累加电路的对应关系。更详细地说,在第一时钟周期期间,累加电路1对应f1a、累加电路2对应f1b、累加电路3对应f1c、累加电路4对应f1d;在第二时钟周期期间,累加电路1对应f2b、累加电路2对应f2c、累加电路3对应f2d、累加电路4对应f2a……。换言之,每当数据的序号增加1时,与累加电路对应的特征的序号向左循环移动1个位置。
It should be noted that the numbers of accumulation circuits and features described here, and the correspondence between them, are given only for convenience of explanation. It is easy to see that, depending on the specific embodiment, the number of features per data item may exceed 4, for example 200 to 300 or more. The number of accumulation circuits may equal the number of features; in another embodiment, the number of accumulation circuits may be larger or smaller than the number of features. The correspondence between accumulation circuits and features can likewise be modified in various ways according to the specific embodiment.
Through the multiplexer MUX's control logic for dynamically determining the correspondence, the situation in which an accumulation circuit is assigned the same feature again while it is still accumulating a value of that feature is avoided or reduced, thereby preventing or reducing data conflicts.
In an embodiment, the data may include feature tags and bucket tags, and the accumulation circuit reads the accumulated value of the bucket to which the received feature value belongs from the buffer unit BUF according to the feature tag and bucket tag.
In another embodiment, the data may include only bucket tags, and the accumulation circuit reads the accumulated value of the matching bucket from the buffer unit BUF according to the logic of the multiplexer's data-select input and the bucket tag.
Referring to FIG. 4, since every accumulation circuit participates in the accumulation of all buckets of all features, each buffer unit BUF (accumulation circuits and buffer units correspond one to one) holds entries for all features and all of their buckets. The output unit (not shown) can therefore sum the accumulated values corresponding to the same bucket of the same feature across the buffer units of the accumulation circuits to obtain the final accumulated values.
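The merge performed by the output unit can be sketched as follows (a software analogue under our own naming, not the patent's implementation): each circuit's BUF holds partial sums keyed by (feature, bucket), and the final histogram is their element-wise sum.

```python
# Hypothetical sketch of the output unit: sum the partial accumulated
# values for the same (feature, bucket) key across all circuit buffers.

from collections import Counter

def merge_buffers(buffers):
    """Element-wise sum of per-circuit BUF contents into final accumulated values."""
    total = Counter()
    for buf in buffers:
        total.update(buf)  # Counter.update adds values for matching keys
    return dict(total)

bufs = [
    {("a", 0): 1.0, ("b", 2): 3.0},   # partial sums held by circuit 1's BUF
    {("a", 0): 2.0, ("b", 1): 5.0},   # partial sums held by circuit 2's BUF
]
final = merge_buffers(bufs)
```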
FIG. 5 is a flowchart of a method for processing a data set according to an embodiment of the present disclosure.
In an embodiment of the present disclosure, the method for processing a data set based on a programmable device includes:
In step S100, providing multiple accumulation circuits in the programmable device, where each accumulation circuit includes one pipeline adder and one buffer for storing the computation results of the pipeline adder;
In step S200, providing a multiplexer in the programmable device, where the multiplexer receives each data item of the data set, dynamically determines the correspondence between the multiple features contained in the data and the multiple accumulation circuits, and during each time period sends the feature value of each of the multiple features to the corresponding accumulation circuit according to the correspondence.
In the method according to the present disclosure, the pipeline adder is a single-precision adder or a double-precision adder.
In the method according to the present disclosure, the programmable device may be a field-programmable gate array (FPGA); the data set may be a sample data set for machine learning in a specific application scenario; and the programmable device may be used to perform gradient boosting decision tree (GBDT) histogram algorithm processing on the sample data set. The method may set the number of accumulation circuits equal to the number of features contained in the data of the data set, or set the number of accumulation circuits larger or smaller than that number. The data set and programmable device described here, and the relationship between the number of accumulation circuits and the number of features, are the same as or similar to those described with reference to FIG. 3, so redundant description is omitted.
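For context, the GBDT histogram step that this device accelerates can be sketched in a few lines of Python (an illustrative software analogue; the function name and data layout are ours, not the patent's): for each feature, the gradients of the samples are accumulated into per-bucket sums, yielding one histogram per feature.

```python
# Hypothetical illustration of GBDT histogram building: each sample's
# gradient is added to the bucket its feature value falls into, for every
# feature. The inner "+=" is the accumulate step mapped to hardware.

def build_histograms(bucket_ids, gradients, n_features, n_buckets):
    """bucket_ids[i][j] = bucket index of sample i for feature j."""
    hist = [[0.0] * n_buckets for _ in range(n_features)]
    for sample_buckets, g in zip(bucket_ids, gradients):
        for j, b in enumerate(sample_buckets):
            hist[j][b] += g
    return hist

hist = build_histograms(
    bucket_ids=[[0, 1], [0, 0], [1, 1]],  # 3 samples, 2 features
    gradients=[0.5, 1.5, 2.0],
    n_features=2, n_buckets=2,
)
```

Every sample contributes one addition per feature, which is why issuing one accumulation per circuit per clock cycle, without conflicts, dominates the throughput of the whole algorithm.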
In the method according to the present disclosure, the data may include feature tags indicating the feature to which each contained feature value corresponds and bucket tags indicating the bucket to which each contained feature value corresponds; the pipeline adder in each accumulation circuit may read, from the corresponding buffer unit, the accumulated value of the bucket to which the received feature value belongs, according to the feature tag and bucket tag of the received feature value.
In the method according to the present disclosure, the data may include only bucket tags indicating the bucket to which each contained feature value corresponds; the pipeline adder in each accumulation circuit may read the accumulated value of the matching bucket from the corresponding buffer unit according to the multiplexer's control logic that dynamically determines the correspondence and the bucket tag of the received feature value.
In an embodiment, the pipeline adder in each accumulation circuit reads from the corresponding buffer unit the accumulated value of the bucket to which the received feature value belongs, adds the received feature value to the read value to obtain a new accumulated value, and updates the corresponding entry in the buffer unit with the new value. The pipeline adder and buffer unit described here are the same as or similar to the pipeline adder SA and buffer unit BUF described with reference to FIG. 3, so redundant description is omitted.
In an embodiment, the multiplexer dynamically determines the correspondence between the multiple features contained in the received data and the multiple accumulation circuits according to the sequence number of the received data within the data set and the sequence numbers of the features within the received data. The multiplexer described here is the same as or similar to the multiplexer MUX described with reference to FIG. 3 and FIG. 4, so redundant description is omitted.
FIG. 6 is a flowchart of a method for processing a data set according to another embodiment of the present disclosure.
Except for step S300, the method shown in FIG. 6 is substantially the same as or similar to the method shown in FIG. 5, so redundant description is omitted.
In step S300, an output unit is provided in the programmable device; it sums the accumulated values corresponding to the same bucket of the same feature across the buffer units of the accumulation circuits, and outputs the final accumulated value corresponding to each bucket of each feature.
In the programmable device for processing a data set and the method for processing a data set according to the embodiments of the present disclosure, the pipeline adders and buffer units can be precisely controlled and used in hardware-accelerated development. Based on the characteristics of machine learning algorithms, the present disclosure designs buffer-usage logic suited to those algorithms, reducing or eliminating the possibility of data conflicts and thereby greatly improving the execution efficiency of the pipeline.
The exemplary embodiments of the present disclosure have been described above. It should be understood that the above description is merely exemplary rather than exhaustive, and the present disclosure is not limited to the disclosed exemplary embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present disclosure. Therefore, the scope of protection of the present disclosure shall be determined by the scope of the claims.
Industrial Applicability
In the programmable device for processing a data set and the method for processing a data set provided by the present disclosure, the multiplexer dynamically determines the correspondence between the multiple features contained in the received data and the multiple accumulation circuits, avoiding or reducing the situation in which an accumulation circuit is assigned the same feature again while it is still accumulating a value of that feature, thereby preventing or reducing data conflicts.

Claims (22)

  1. A programmable device for processing a data set, the programmable device comprising:
    multiple accumulation circuits, wherein each accumulation circuit includes one pipeline adder and one buffer unit for storing the computation results of the pipeline adder; and
    a multiplexer for receiving data of the data set in sequence, dynamically determining a correspondence between multiple features contained in the data and the multiple accumulation circuits, and sending the feature values of the multiple features in the received data to the corresponding accumulation circuits according to the correspondence.
  2. The programmable device according to claim 1, wherein
    the pipeline adder in each accumulation circuit reads, from the corresponding buffer unit, the accumulated value of the bucket to which the received feature value belongs, adds the received feature value to the read accumulated value to obtain a new accumulated value, and updates the corresponding accumulated value in the corresponding buffer unit with the new accumulated value.
  3. The programmable device according to claim 2, further comprising:
    an output unit for summing the accumulated values corresponding to the same bucket of the same feature across the buffer units of the accumulation circuits, and outputting the final accumulated value corresponding to each bucket of each feature.
  4. The programmable device according to claim 1, wherein the number of the accumulation circuits equals the number of features contained in the data of the data set.
  5. The programmable device according to claim 1, wherein the number of the accumulation circuits is smaller than the number of features contained in the data of the data set, or the number of the accumulation circuits is larger than the number of features contained in the data of the data set.
  6. The programmable device according to claim 1, wherein
    the multiplexer dynamically determines the correspondence between the multiple features contained in the received data and the multiple accumulation circuits according to the sequence number of the received data within the data set and the sequence numbers of the features within the received data.
  7. The programmable device according to claim 2, wherein the data includes feature tags indicating the feature to which each contained feature value corresponds and bucket tags indicating the bucket to which each contained feature value corresponds; and
    the pipeline adder in each accumulation circuit reads, from the corresponding buffer unit, the accumulated value of the bucket to which the received feature value belongs according to the feature tag and bucket tag of the received feature value.
  8. The programmable device according to claim 2, wherein the data includes bucket tags indicating the bucket to which each contained feature value corresponds; and
    the pipeline adder in each accumulation circuit reads, from the corresponding buffer unit, the accumulated value of the bucket to which the received feature value belongs according to the multiplexer's control logic that dynamically determines the correspondence and the bucket tag of the received feature value.
  9. The programmable device according to claim 1, wherein the pipeline adder is a single-precision adder or a double-precision adder.
  10. The programmable device according to any one of claims 1-9, wherein
    the data set is a sample data set for machine learning in a specific application scenario; and
    the programmable device is used to perform gradient boosting decision tree (GBDT) histogram algorithm processing on the sample data set.
  11. The programmable device according to any one of claims 1-9, wherein the programmable device is a field-programmable gate array (FPGA).
  12. A method for processing a data set based on a programmable device, the method comprising:
    providing multiple accumulation circuits in the programmable device, wherein each accumulation circuit includes one pipeline adder and one buffer for storing the computation results of the pipeline adder; and
    providing a multiplexer in the programmable device, wherein the multiplexer receives each data item of the data set, dynamically determines a correspondence between multiple features contained in the data and the multiple accumulation circuits, and during each time period sends the feature value of each of the multiple features to the corresponding accumulation circuit according to the correspondence.
  13. The method according to claim 12, wherein
    the pipeline adder in each accumulation circuit reads, from the corresponding buffer unit, the accumulated value of the bucket to which the received feature value belongs, adds the received feature value to the read accumulated value to obtain a new accumulated value, and updates the corresponding accumulated value in the corresponding buffer unit with the new accumulated value.
  14. The method according to claim 13, further comprising:
    providing an output unit in the programmable device, which sums the accumulated values corresponding to the same bucket of the same feature across the buffer units of the accumulation circuits, and outputs the final accumulated value corresponding to each bucket of each feature.
  15. The method according to claim 12, wherein
    the number of the accumulation circuits is set equal to the number of features contained in the data of the data set.
  16. The method according to claim 12, wherein
    the number of the accumulation circuits is set smaller than the number of features contained in the data of the data set, or the number of the accumulation circuits is set larger than the number of features contained in the data of the data set.
  17. The method according to claim 12, wherein
    the multiplexer dynamically determines the correspondence between the multiple features contained in the received data and the multiple accumulation circuits according to the sequence number of the received data within the data set and the sequence numbers of the features within the received data.
  18. The method according to claim 13, wherein the data includes feature tags indicating the feature to which each contained feature value corresponds and bucket tags indicating the bucket to which each contained feature value corresponds; and
    the pipeline adder in each accumulation circuit reads, from the corresponding buffer unit, the accumulated value of the bucket to which the received feature value belongs according to the feature tag and bucket tag of the received feature value.
  19. The method according to claim 13, wherein the data includes bucket tags indicating the bucket to which each contained feature value corresponds; and
    the pipeline adder in each accumulation circuit reads, from the corresponding buffer unit, the accumulated value of the bucket to which the received feature value belongs according to the multiplexer's control logic that dynamically determines the correspondence and the bucket tag of the received feature value.
  20. The method according to claim 12, wherein the pipeline adder is a single-precision adder or a double-precision adder.
  21. The method according to any one of claims 12-20, wherein
    the data set is a sample data set for machine learning in a specific application scenario; and
    the programmable device is used to perform gradient boosting decision tree (GBDT) histogram algorithm processing on the sample data set.
  22. The method according to any one of claims 12-20, wherein the programmable device is a field-programmable gate array (FPGA).
PCT/CN2020/095907 2019-06-14 2020-06-12 Programmable device for processing data set and method for processing data set WO2020249106A1 (zh)


Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910516213.2A CN110245756B (zh) 2019-06-14 2019-06-14 用于处理数据组的可编程器件及处理数据组的方法
CN201910516213.2 2019-06-14


Also Published As

Publication number Publication date
US11791822B2 (en) 2023-10-17
EP3985498B1 (en) 2023-04-05
EP3985498A1 (en) 2022-04-20
US20220149843A1 (en) 2022-05-12
CN110245756A (zh) 2019-09-17
EP3985498A4 (en) 2022-08-03
CN110245756B (zh) 2021-10-26

