CN114357371A - Matrix data processing method and device, electronic equipment and storage medium - Google Patents

Matrix data processing method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114357371A
Authority
CN
China
Prior art keywords
matrix data
vector
matrix
parameters
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011083298.9A
Other languages
Chinese (zh)
Inventor
牛昕宇
蔡权雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Corerain Technologies Co Ltd
Original Assignee
Shenzhen Corerain Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Corerain Technologies Co Ltd filed Critical Shenzhen Corerain Technologies Co Ltd
Priority to CN202011083298.9A priority Critical patent/CN114357371A/en
Publication of CN114357371A publication Critical patent/CN114357371A/en
Pending legal-status Critical Current

Links

Images

Abstract

The invention discloses a matrix data processing method, which comprises the following steps: acquiring matrix data to be processed and weight matrix data of a recurrent neural network; matching an element parallel parameter and a vector parallel parameter corresponding to the scale parameter of the recurrent neural network; partitioning the weight matrix data according to the element parallel parameter and the vector parallel parameter to obtain weight vector blocks; multiplying the weight vector blocks by the matrix elements to be processed to obtain a first processing result; and configuring an addition tree tail according to the element parallel parameter, accumulating the first processing result through an addition tree containing the addition tree tail to obtain a second processing result, and outputting the second processing result as the processing result of the matrix data to be processed. The utilization rate of hardware resources is improved, the slicing strategy is made more flexible, and the method can be applied to recurrent neural networks of various scales.

Description

Matrix data processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of machine learning, and in particular, to a method and an apparatus for processing matrix data, an electronic device, and a storage medium.
Background
With the rapid development of machine learning, Recurrent Neural Networks (RNNs) have proven to have useful characteristics for many important applications. Since they can record previous information to improve prediction, RNNs are applied to, for example, speech recognition, natural language processing, and video classification, and a variety of variants have been developed. Among the many RNN variants, the two most popular are the long short-term memory network (LSTM) and the gated recurrent unit (GRU). However, data dependencies in the RNN calculation stall the system until the required hidden vector returns from the full pipeline; as shown in FIG. 1a, it is necessary to wait for the hidden vector h_t to return before starting the next time-step calculation. Furthermore, a deeper pipeline is usually used to achieve a higher operating frequency, and since the system pipeline needs to be cleared, the stall loss is worsened and hardware resources are idle during the stall. Therefore, the conventional RNN has a low utilization rate of hardware resources, and an RNN of a specific scale also places fixed requirements on hardware resources, so the flexibility is low.
Summary
In view of the above-mentioned defects in the prior art, the present invention aims to provide a matrix data processing method that improves the utilization rate of hardware resources by the RNN, and further improves this utilization rate by allowing the hardware resources to be configured flexibly.
The purpose of the invention is realized by the following technical scheme:
in a first aspect, a method for processing matrix data is provided, which is used for a recurrent neural network, and the method includes:
acquiring matrix data to be processed and weight matrix data of the recurrent neural network, wherein the matrix data to be processed and the weight matrix data are both formed by matrix elements, and the matrix data comprises column vectors constructed by the matrix elements;
according to the scale parameters of the recurrent neural network, matching element parallel parameters and vector parallel parameters corresponding to the scale parameters;
partitioning the weight matrix data according to the element parallel parameters and the vector parallel parameters to obtain weight vector blocks;
multiplying the weight vector block and the matrix element to be processed to obtain a first processing result;
and configuring an addition tree tail according to the element parallel parameters, accumulating the first processing result through an addition tree containing the addition tree tail to obtain a second processing result, and outputting the second processing result as the processing result of the matrix data to be processed.
Optionally, the configuring the tail of the addition tree according to the element parallel parameter includes:
configuring the addition tree tail behind the addition tree;
and configuring the parallelism of the tail of the addition tree.
Optionally, the scale parameter of the recurrent neural network includes the number of processing units and the vector dimension, and the matching of the element parallel parameters and the vector parallel parameters corresponding to the scale parameters according to the scale parameters of the recurrent neural network comprises the following steps:
and matching element parallel parameters and vector parallel parameters corresponding to the number of the processing units and the vector dimensions according to the number of the processing units and the vector dimensions.
Optionally, before the accumulating the first processing result through the adder tree including the tail of the adder tree to obtain a second processing result, the method further includes:
and carrying out balance calculation on the first processing result so as to balance the parallelism of the weight vector block.
In a second aspect, an embodiment of the present invention further provides an apparatus for processing matrix data, where the apparatus is used in a recurrent neural network, and the apparatus includes:
the acquiring module is used for acquiring matrix data to be processed and weight matrix data of the recurrent neural network, wherein the matrix data to be processed and the weight matrix data are both composed of matrix elements, and the matrix data comprises column vectors constructed by the matrix elements;
the matching module is used for matching element parallel parameters and vector parallel parameters corresponding to the scale parameters according to the scale parameters of the recurrent neural network;
the processing module is used for partitioning the weight matrix data according to the element parallel parameters and the vector parallel parameters to obtain weight vector blocks;
the calculation module is used for multiplying the weight vector block and the matrix element to be processed to obtain a first processing result;
and the output module is used for configuring an addition tree tail according to the element parallel parameters, accumulating the first processing result through an addition tree containing the addition tree tail to obtain a second processing result, and outputting the second processing result as the processing result of the matrix data to be processed.
Optionally, the output module includes:
a first configuration unit, configured to configure the addition tree tail after the addition tree;
and the second configuration unit is used for configuring the parallelism of the tail of the addition tree.
Optionally, the scale parameter of the recurrent neural network includes the number of processing units and the vector dimension; the matching module is further used for matching element parallel parameters and vector parallel parameters corresponding to the number of processing units and the vector dimension according to the number of processing units and the vector dimension.
Optionally, the apparatus further comprises:
and the balancing module is used for carrying out balance calculation on the first processing result so as to balance the parallelism of the weight vector block.
In a third aspect, an electronic device is provided, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps in the matrix data processing method provided by the embodiment of the invention.
In a fourth aspect, a computer-readable storage medium is provided, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps in the matrix data processing method provided by the embodiment of the present invention.
The invention has the following beneficial effects: matrix data to be processed and weight matrix data of the recurrent neural network are acquired, wherein the matrix data to be processed and the weight matrix data are both composed of matrix elements, and the matrix data comprises column vectors constructed from the matrix elements; according to the scale parameter of the recurrent neural network, an element parallel parameter and a vector parallel parameter corresponding to the scale parameter are matched; the weight matrix data is partitioned according to the element parallel parameter and the vector parallel parameter to obtain weight vector blocks; the weight vector blocks are multiplied by the matrix elements to be processed to obtain a first processing result; an addition tree tail is configured according to the element parallel parameter, the first processing result is accumulated through an addition tree containing the addition tree tail to obtain a second processing result, and the second processing result is output as the processing result of the matrix data to be processed. Because the column vectors of the weight matrix data are multiplied by the matrix elements of the matrix to be processed and then accumulated, the vectors of the matrix data to be processed do not need to be copied completely; the calculation of the next time step can therefore start without waiting for the system pipeline to be emptied, and only a partial input vector is needed to start the calculation. A data pipeline is formed, stalls are avoided, idle hardware resources are reduced, and the utilization rate of hardware resources is improved. Meanwhile, according to the scale parameter of the recurrent neural network, the element parallel parameter and the vector parallel parameter corresponding to the scale parameter are matched for slicing, and the corresponding addition tree tail is configured, so that the slicing strategy is more flexible and the method can be applied to recurrent neural networks of various scales.
Drawings
Fig. 1 is a schematic flowchart of a matrix data processing method according to an embodiment of the present invention;
fig. 1a is a schematic diagram of a conventional matrix data processing method according to an embodiment of the present invention;
FIG. 1b is a schematic diagram of a long short-term memory network according to an embodiment of the present invention;
FIG. 1c is a diagram illustrating a combined weight matrix according to an embodiment of the present invention;
fig. 1d is a schematic flowchart of a matrix data processing method according to an embodiment of the present invention;
FIG. 1e is a diagram illustrating a row-based vector multiplication according to an embodiment of the present invention;
FIG. 1f is a schematic diagram of a column-wise vector multiplication according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of another matrix data processing method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a weight vector block according to an embodiment of the present invention;
FIG. 3a is a schematic diagram of an addition tree tail according to an embodiment of the present invention;
FIG. 3b is a diagram illustrating another addition tree tail according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a matrix data processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an output module according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of another matrix data processing apparatus according to an embodiment of the present invention.
Detailed Description
The following describes preferred embodiments of the present invention; by using the related art described below, those skilled in the art will be able to realize the invention and will more clearly understand its innovative features and the advantages it brings.
The invention provides a matrix data processing method. The purpose of the invention is realized by the following technical scheme:
referring to fig. 1, fig. 1 is a schematic flow chart of a method for processing matrix data according to an embodiment of the present invention, as shown in fig. 1, the method is applied to a recurrent neural network, and the method includes the following steps:
101. and acquiring matrix data to be processed and weight matrix data of the recurrent neural network.
The matrix data to be processed and the weight matrix data are both composed of matrix elements, and the matrix data comprise column vectors constructed by the matrix elements.
The matrix data to be processed may be voice matrix data, text matrix data, image matrix data, and the like. The voice matrix data may be obtained by encoding voice information into a matrix space, the text matrix data may be obtained by encoding text information into a matrix space, and the image matrix data may be a pixel matrix of an image itself or may be obtained by encoding a pixel matrix of an image itself into a matrix space.
The weight matrix data is a weight matrix obtained by training the recurrent neural network. In the process of processing the matrix data to be processed, the implicit information of the matrix to be processed is extracted through the weight matrix, and corresponding classification information is obtained according to the implicit information.
The recurrent neural network may be deployed in a hardware environment such as a CPU (central processing unit), a GPU (graphics processing unit), or an FPGA (field programmable gate array). In the embodiment of the invention, the recurrent neural network is preferably deployed in an FPGA-based hardware environment; compared with CPU and GPU hardware environments, a recurrent neural network running in an FPGA hardware environment has the advantages of low latency and low power consumption due to hardware support from logic gates.
The recurrent neural network can be, for example, a long short-term memory network or a gated recurrent unit (in which the input gate and the forget gate are combined into an update gate). The recurrent neural network in the embodiment of the present invention is preferably a long short-term memory network. Further, the recurrent neural network in the embodiment of the present invention is preferably a long short-term memory network deployed in an FPGA hardware environment. It should be noted that the embodiment of the present invention only uses the long short-term memory network as an example to illustrate the inventive intent of the present invention; the present invention is also applicable to other forms of recurrent neural networks, and the long short-term memory network should not be taken as a limitation of the scope of the present invention.
The weight matrix data is the weight matrix data of the gates in the long short-term memory network. Specifically, the long short-term memory network comprises four gates, namely an input gate, a forget gate, an input modulation gate and an output gate. The input modulation gate can be understood as a sub-part of the input gate used for combining the input tensor with the hidden tensor, so that the corresponding weight matrix of the input modulation gate represents the weight matrix of the input cell and the hidden cell. Each of the four gates has a corresponding weight matrix: the input gate weight is W_i(n), the forget gate weight is W_f(n), the input modulation gate weight is W_g(n), and the output gate weight is W_o(n), where W_i(n), W_f(n), W_g(n) and W_o(n) are all matrices of the same size.
In the long short-term memory network, the hidden state can be calculated through the four gates, as given by the following formulas:

i_t = σ(W_i[x_t, h_{t-1}] + b_i)    (1)

f_t = σ(W_f[x_t, h_{t-1}] + b_f)    (2)

g_t = tanh(W_g[x_t, h_{t-1}] + b_g)    (3)

o_t = σ(W_o[x_t, h_{t-1}] + b_o)    (4)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t    (5)

h_t = o_t ⊙ tanh(c_t)    (6)

where σ is the sigmoid function, x_t is the current input, h_{t-1} is the previous hidden state, W is the weight matrix of the respective gate, b is the bias, i_t is the calculation result of the input gate, f_t is the calculation result of the forget gate, g_t is the calculation result of the input modulation gate, o_t is the calculation result of the output gate, c_t is the memory cell state, and h_t is the hidden state of the current input. The hidden state can be expressed as a tensor, i.e., the hidden tensor, which can be used as the input of the next time step or the next computation layer. As shown in fig. 1b, the calculation of i_t, f_t, g_t, o_t may be referred to as gate calculation (LSTM-Gates), and the calculation of c_t, h_t may be referred to as tail calculation (LSTM-Tail).
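To make the gate and tail calculations concrete, the following is a minimal Python/NumPy sketch of one LSTM time step following formulas (1)-(6); the function and variable names (lstm_step, W_i, b_i, etc.) are illustrative and not taken from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_i, W_f, W_g, W_o, b_i, b_f, b_g, b_o):
    """One LSTM time step following formulas (1)-(6).

    Each W_* multiplies the concatenated vector [x_t, h_{t-1}]."""
    xh = np.concatenate([x_t, h_prev])      # [x_t, h_{t-1}]
    i_t = sigmoid(W_i @ xh + b_i)           # (1) input gate
    f_t = sigmoid(W_f @ xh + b_f)           # (2) forget gate
    g_t = np.tanh(W_g @ xh + b_g)           # (3) input modulation gate
    o_t = sigmoid(W_o @ xh + b_o)           # (4) output gate
    c_t = f_t * c_prev + i_t * g_t          # (5) memory cell state
    h_t = o_t * np.tanh(c_t)                # (6) hidden state
    return h_t, c_t
```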
Furthermore, in the embodiment of the present invention, the weight matrix data may be obtained by combining the weight matrix data corresponding to the four gates, i.e., combining W_i(n), W_f(n), W_g(n) and W_o(n) into W(n). Let the tensor of the weight matrix data of each of the four gates be L_h × L_k, and the tensor of the combined weight matrix data W(n) be H_w × L_w, where L_h is the number of rows and L_k is the number of columns of the weight matrix data of the four gates; likewise, H_w is the number of rows and L_w is the number of columns of the weight matrix data W(n). Then H_w = 4 × L_h and L_w = L_k. In one possible embodiment, since L_k is determined by the tensor size of the input matrix data, L_k = L_h + L_x, where L_x is the number of rows of the matrix data to be processed; in this case, L_w = L_k = L_h + L_x. As shown in FIG. 1c, W_i(0), W_f(0), W_g(0) and W_o(0) are the row vector of the first row of the input gate weights, the row vector of the first row of the forget gate weights, the row vector of the first row of the input modulation gate weights, and the row vector of the first row of the output gate weights, and are also the first four rows of the weight matrix data W(n).
Specifically, the combination of the weight matrix data corresponding to the four gates may be obtained by combining the corresponding row vectors; for example, the row vectors in the first row of the weight matrix data of each of the four gates are combined, and in the weight matrix data W(n) the first four rows correspond to the first rows of the weight matrix data of the four gates.
The weight matrix data corresponding to the four gates are combined to obtain the weight matrix data W(n) with a larger tensor, so that within a time step only one vector multiplication with W(n) is required instead of four separate vector multiplications with the four gate weight matrices, saving computation and time.
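As an illustration of the combination described above, the sketch below stacks the four gate weight matrices row by row in the interleaved order of FIG. 1c; it assumes NumPy and equal-sized gate matrices, and the function name combine_gate_weights is purely illustrative.

```python
import numpy as np

def combine_gate_weights(W_i, W_f, W_g, W_o):
    """Interleave the rows of the four L_h x L_k gate weight matrices into
    a single (4*L_h) x L_k matrix W(n): rows 0..3 are the first rows of the
    input, forget, input-modulation and output gates, and so on (FIG. 1c)."""
    L_h, L_k = W_i.shape
    W = np.empty((4 * L_h, L_k), dtype=W_i.dtype)
    W[0::4] = W_i
    W[1::4] = W_f
    W[2::4] = W_g
    W[3::4] = W_o
    return W  # H_w = 4*L_h rows, L_w = L_k columns
```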
102. Extracting the weight column vectors in the weight matrix data.
In this step, the weight matrix data is the weight matrix data W(n) described in step 101; the weight column vector may also be referred to as a weighted columnar vector, and each weight column vector expresses one column of data in the weight matrix data.
In the conventional approach described above, the multiplication of the weight matrix data and the matrix data to be processed is performed with row-based vectors; in this case, a complete column vector of the matrix data to be processed must be extracted before the vector multiplication can be computed. For example, if the tensor of the weight matrix data is n × m, the tensor of the matrix data to be processed should be j × k with j = m, so the vector multiplication can only be computed after a complete column of the matrix data to be processed has been extracted. As shown in FIG. 1d, in the Weights Matrix, W_0, W_1, ..., W_{Hw-2}, W_{Hw-1} are all weight row vectors; the weight matrix has H_w rows and L_w columns in total; rows 0, 1, ..., L_x-1 are the matrix data to be processed, with L_x rows in total; rows 0, 1, ..., L_h-1 are the previous hidden tensor, with L_h rows in total. At this time, L_w = L_x + L_h must be satisfied before the vector multiplication can start, i.e., L_x + L_h matrix elements must be read before the vector multiplication can start, and the hardware computing resources are idle until the reading is finished.
103. Extracting, from the matrix data to be processed, the matrix elements to be processed corresponding to the weight column vectors.
In this step, the above-mentioned matrix data to be processed is extracted in units of matrix elements rather than in units of vectors (one vector includes a plurality of matrix elements), thereby shortening the time before the calculation can start. Specifically, if extraction is performed in units of vectors, the calculation can start only after all matrix elements included in the vector have been read; if extraction is performed in units of matrix elements, the calculation can start as soon as one matrix element has been read, without waiting for the vector of the matrix data to be processed to be copied completely. Therefore, the calculation of the next time step can be started without waiting for the system pipeline to be emptied, and only a partial input vector is needed to start the calculation.
It should be noted that the matrix elements to be processed are matrix elements in the matrix data to be processed.
104. Performing multiplication calculation on the weight column vector and the matrix element to be processed to obtain a first processing result.
In this step, assume that the tensor of the weight matrix data is 3 × 3 and the matrix to be processed is 3 × 1. Let the first column of the weight matrix data be the first weight column vector, the second column be the second weight column vector, and the third column be the third weight column vector; similarly, let the first row of the matrix to be processed be the first matrix element, the second row be the second matrix element, and the third row be the third matrix element. According to step 104, the first weight column vector is multiplied by the first matrix element, the second weight column vector by the second matrix element, and the third weight column vector by the third matrix element. The tensors of the first, second and third weight column vectors are all 3 × 1, and the tensor of each matrix element can be regarded as 1 × 1, so each 3 × 1 column is multiplied by a 1 × 1 element, finally giving a first processing result consisting of three 3 × 1 tensors. Compared with multiplying the weight matrix data directly by the matrix to be processed, i.e. 3 × 3 by 3 × 1, there is no need to wait for the whole 3 × 1 tensor to be extracted before computing; the computation can proceed after each 1 × 1 tensor (matrix element) is extracted, so the data flow is closer to streaming and the idle time of hardware resources is reduced. As shown in fig. 1e, the column vector of the weight matrix is directly multiplied by one matrix element of the matrix data to be processed; since the column vector of the weight matrix has one column and the matrix element has one row, the requirement for vector multiplication is satisfied.
105. Accumulating the first processing result to obtain a second processing result, and outputting the second processing result as the processing result of the matrix data to be processed.
In this step, the tensor of the weight matrix data in step 104 is 3 × 3 and the matrix to be processed is 3 × 1, so the first processing result obtained consists of three 3 × 1 tensors, and these three 3 × 1 tensors are accumulated to obtain a second processing result that is a 3 × 1 tensor. Multiplying the weight matrix data directly by the matrix to be processed, i.e. 3 × 3 by 3 × 1, gives a 3 × 1 tensor, which may also be called the hidden tensor or hidden state; multiplying a weight column vector by a matrix element, i.e. 3 × 1 by 1 × 1, and then accumulating also yields a 3 × 1 hidden tensor, but the computation can proceed after each 1 × 1 tensor (matrix element) is extracted, so the data flow is closer to streaming and the idle time of hardware resources is reduced. As shown in FIG. 1e and FIG. 1f, in FIG. 1f there is no need to wait for the hidden tensor h_t; all data is computed as a pipeline, with no stall latency.
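The column-based scheme of steps 102-105 can be summarized by the following sketch (illustrative Python/NumPy, not production code): each weight column vector is multiplied by one matrix element as soon as that element is read, and the scaled columns are accumulated, which yields the same result as the conventional row-based matrix-vector product.

```python
import numpy as np

def column_wise_matvec(W, x):
    """Compute W @ x by accumulating element-scaled columns.

    Each incoming matrix element x[k] can be processed as soon as it is
    read (first processing result = W[:, k] * x[k]); the running sum is
    the accumulation that yields the second processing result."""
    acc = np.zeros(W.shape[0], dtype=W.dtype)
    for k, x_k in enumerate(x):      # elements arrive one at a time
        acc += W[:, k] * x_k         # multiply a weight column by one element, accumulate
    return acc

W = np.arange(9.0).reshape(3, 3)
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(column_wise_matvec(W, x), W @ x)  # matches the row-based product
```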
In this embodiment, matrix data to be processed and weight matrix data of the recurrent neural network are obtained, where the matrix data to be processed and the weight matrix data are both composed of matrix elements, and the matrix data includes column vectors constructed from the matrix elements; the weight column vectors in the weight matrix data are extracted; the matrix elements to be processed corresponding to the weight column vectors are extracted from the matrix data to be processed; the weight column vectors are multiplied by the matrix elements to be processed to obtain a first processing result; and the first processing result is accumulated to obtain a second processing result, which is output as the processing result of the matrix data to be processed. Because the column vectors of the weight matrix data are multiplied by the matrix elements of the matrix to be processed and then accumulated, the vectors of the matrix data to be processed do not need to be copied completely; the calculation of the next time step can therefore start without waiting for the system pipeline to be emptied, only a partial input vector is needed to start the calculation, a data pipeline is formed, stalls are avoided, idle hardware resources are reduced, and the utilization rate of hardware resources is improved.
Referring to fig. 2, fig. 2 is a flowchart of another matrix data processing method according to an embodiment of the present invention, as shown in fig. 2, the method is applied to a recurrent neural network, and based on the embodiment of fig. 1, the method includes the following steps:
201. and acquiring matrix data to be processed and weight matrix data of the recurrent neural network.
The matrix data to be processed and the weight matrix data are both composed of matrix elements, and the matrix data comprise column vectors constructed by the matrix elements.
The matrix data to be processed may be voice matrix data, text matrix data, image matrix data, and the like. The voice matrix data may be obtained by encoding voice information into a matrix space, the text matrix data may be obtained by encoding text information into a matrix space, and the image matrix data may be a pixel matrix of an image itself or may be obtained by encoding a pixel matrix of an image itself into a matrix space.
The weight matrix data is a weight matrix obtained by training the recurrent neural network. In the process of processing the matrix data to be processed, the implicit information of the matrix to be processed is extracted through the weight matrix, and corresponding classification information is obtained according to the implicit information.
202. Matching element parallel parameters and vector parallel parameters corresponding to the scale parameters according to the scale parameters of the recurrent neural network.
In an embodiment of the present invention, the scale parameter of the recurrent neural network may be determined by the number of processing units (NPE).
In this step, the element parallel parameter (EP) and the vector parallel parameter (VP) may further be used to exploit the available parallelism, so that the number of computation cycles in the process is greater than the pipeline delay. It should be noted that the element parallel parameter indicates the number of matrix elements processed in parallel, and the vector parallel parameter indicates the number of rows of the column-wise vector.
Further, the vector parallel parameter is constrained by the weight matrix data and the element parallel parameter. Specifically, the vector parallel parameter may be obtained by first acquiring the number of processing units (NPE); then acquiring the number of vector rows of the weight matrix in the weight matrix data; then constraining the vector parallel parameter by the ratio of the number of processing units to the element parallel parameter and by the number of rows of the weight matrix; and finally searching with a greedy algorithm to obtain the element parallel parameter and the vector parallel parameter. Specifically, the constraints may be:
VP ≤ H_w = 4 × L_h    (7)

VP ≤ NPE / EP    (8)
With the above two equations, a greedy algorithm is performed starting from an element parallel parameter EP of 1, using the element parallel parameter as the variable, and the optimal vector parallel parameter and element parallel parameter are searched for.
It can be seen that when EP is small, the number of processing cycles is high: since VP is constrained by equation (7), the number of active processing units PE (each processing unit processes EP elements) is less than the number of processing units NPE, resulting in severe underutilization. As EP increases, the number of processing cycles decreases until EP reaches one of the optimum points; when EP exceeds the optimum point, the processing period gradually increases again. For example, according to the preset design space search result, when NPE is 16382 the EP value of the optimal configuration is between 4 and 16, and when NPE is 65536 the EP value is between 16 and 64. At these optimal positions, a high degree of parallelism can be achieved, and thus the system throughput can be improved.
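A possible software sketch of the constrained greedy search is shown below. The constraints follow equations (7) and (8); the cycle-count model used to compare candidate EP values is an illustrative assumption and is not specified by the patent.

```python
import math

def search_ep_vp(npe, h_w, l_w, ep_candidates=(1, 2, 4, 8, 16, 32, 64)):
    """Greedy search over the element parallel parameter EP.

    Constraints (7) and (8): VP <= H_w and VP <= NPE / EP.
    The cycle count below is an illustrative assumption: each pass
    processes VP rows and EP elements per row in parallel."""
    best = None
    for ep in ep_candidates:                  # start from EP = 1 and grow
        vp = min(h_w, npe // ep)              # largest VP allowed by (7) and (8)
        if vp == 0:
            continue
        cycles = math.ceil(h_w / vp) * math.ceil(l_w / ep)
        if best is None or cycles < best[2]:
            best = (ep, vp, cycles)
    return best  # (EP, VP, estimated cycles)

print(search_ep_vp(npe=16384, h_w=4 * 1024, l_w=2048))
```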
Optionally, the scale parameter of the recurrent neural network may further include: number of processing units and vector dimensions. In this optional embodiment, the element parallel parameters and the vector parallel parameters corresponding to the number of processing units and the vector dimensions may be matched according to the number of processing units and the vector dimensions. Thus, the element parallel parameters and the vector parallel parameters which are more suitable for the scale of the current recurrent neural network can be obtained to carry out slice partitioning.
For example, for a given vector size, better performance and utilization may be obtained by adjusting (EP, VP) the design parameters. For example, for a 512-dimensional vector, the performance at (EP, VP) of (32, 2048) is better than that at (EP, VP) of (16, 4096), while for a 1024-dimensional vector, the performance at (EP, VP) of (16, 4096) is better than that at (EP, VP) of (32, 2048). When running RNN models of different sizes, different choices of EP and VP can impact the hardware utilization and performance of the architecture.
It should be noted that each PE is equivalent to a fully pipelined multiplier and is used for the multiplications of the column-wise multiplication; after passing through the PE, the results enter an accumulator for addition.
203. Partitioning the weight matrix data according to the element parallel parameters and the vector parallel parameters to obtain a weight vector block.
In this step, the weight vector block may be as shown in fig. 3, where it is to be noted that the element parallel parameter in the weight vector block and the element parallel parameter in the matrix data to be processed have the same matrix element number, that is, have the same EP. The weight vector block obtained by the element parallel parameters and the vector parallel parameters can improve the reasoning throughput of the recurrent neural network.
In the embodiment of the invention, because the element parallel parameters and the vector parallel parameters are obtained according to the scale parameters of the recurrent neural network, they can be flexibly configured according to the scale of the recurrent neural network. The weight matrix data can therefore be flexibly partitioned, the obtained weight vector blocks also conform to the scale of the current recurrent neural network, and the utilization rate of hardware is further improved.
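One possible way to partition the weight matrix into VP × EP weight vector blocks is sketched below (illustrative only; the traversal order of the blocks and the function name are assumptions).

```python
import numpy as np

def block_weight_matrix(W, ep, vp):
    """Split W (H_w x L_w) into weight vector blocks of VP rows x EP columns.

    Each block is multiplied against EP matrix elements of the input,
    so the EP of the block matches the EP of the data to be processed."""
    h_w, l_w = W.shape
    blocks = []
    for r in range(0, h_w, vp):
        for c in range(0, l_w, ep):
            blocks.append(W[r:r + vp, c:c + ep])  # edge blocks may be smaller
    return blocks
```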
204. Multiplying the weight vector block and the matrix element to be processed to obtain a first processing result.
The weight vector block is multiplied by the corresponding matrix elements to be processed, whose number equals the element parallel parameter used when blocking the weight matrix data (i.e., the same EP), to obtain the first processing result. This can improve the inference throughput of the recurrent neural network.
205. Configuring an addition tree tail according to the element parallel parameters, accumulating the first processing result through an addition tree containing the addition tree tail to obtain a second processing result, and outputting the second processing result as the processing result of the matrix data to be processed.
In the embodiment of the present invention, in order to support various EPs and VPs, a configurable addition tree tail (CAT) may be provided to reduce the adder configuration. The number of levels of the addition tree changes its configuration accordingly for different values of EP. For example, the fixed structure of the adder tree corresponds to the largest EP; with a smaller EP, the results of the last stages of the adder tree can be written directly into the accumulators without being fed to the next-stage adder, so the adder configuration can be reduced.
Further, the addition tree tail is configured after the addition tree, and the parallelism of the addition tree tail is configured.
In the addition tree tail (CAT) architecture, the adders of the addition tree tail can be reused as the required accumulators without additional adder components. For example, a CAT with N inputs (CAT-N) may be configured to update 1 to N accumulators when the data reaches the last log2(N) levels of the adder tree. As shown in FIG. 3a, the three addition trees each have 4 inputs and an addition tree tail CAT-4, and the results of the addition tree are used to update 1 to 4 accumulators. The structural details of another addition tree tail CAT-4 are shown in FIG. 3b.
In a large-scale recurrent neural network design, according to the design-space exploration for optimal system throughput, the set of EP values is {16, 32, 64}, so the addition tree tail CAT-4 is sufficient. Of course, a larger addition tree tail CAT-N can also be obtained by cascading smaller ones; for example, two cascaded CAT-2s can be used to obtain a CAT-4.
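The behaviour of a 4-input addition tree tail can be modelled in software as in the following sketch; grouping the four inputs into 1, 2 or 4 accumulators according to the configuration is an illustrative model of the hardware described above, not the patent's actual circuit.

```python
def cat4_update(accumulators, inputs, groups):
    """Configurable addition tree tail with 4 inputs (CAT-4).

    groups = 1: all four inputs are summed into one accumulator;
    groups = 2: inputs are summed pairwise into two accumulators;
    groups = 4: each input updates its own accumulator.
    The same adders therefore serve as the required accumulators."""
    assert len(inputs) == 4 and groups in (1, 2, 4)
    step = 4 // groups
    for g in range(groups):
        accumulators[g] += sum(inputs[g * step:(g + 1) * step])
    return accumulators

# Example: CAT-4 configured to update two accumulators.
print(cat4_update([0, 0], [1, 2, 3, 4], groups=2))  # -> [3, 7]
```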
In the embodiment of the present invention, dequantization (inverse quantization) may further be performed after the adapter. Since dequantization and quantization require a 32-bit multiplier unit and an adder, the total hardware cost of linear quantization at the addition tree tail is higher than that of 16-bit fixed-point quantization. Thus, the addition tree tail may be left to perform linear quantization, while dequantization is performed after the adapter, where the quantized values are dequantized into fixed-point values; the output of the addition tree tail then yields an output vector for which no dequantization hardware needs to be added to the addition tree tail, thereby reducing the total hardware cost.
Further, a balance calculation is performed on the first processing result to balance the parallelism of the weight vector blocks.
Specifically, the architecture of the recurrent neural network includes a multiplier and an accumulator, with the accumulator connected after the multiplier. The multiplier is used for the vector multiplication calculation, specifically for the vector multiplication between the above-mentioned column vector and matrix element, or for the vector multiplication between a weight vector block and the corresponding EP matrix elements in the matrix to be processed. The accumulator is used for accumulating the first processing result.
Optionally, a balanced adder tree may be further disposed between the multiplier and the accumulator to perform balanced calculation on the first processing result, specifically, to balance parallelism of the element parallel parameter and the vector parallel parameter, so as to further increase inference throughput of the recurrent neural network.
In this embodiment, matrix data to be processed and weight matrix data of the recurrent neural network are obtained, where the matrix data to be processed and the weight matrix data are both composed of matrix elements, and the matrix data includes column vectors constructed from the matrix elements; according to the scale parameter of the recurrent neural network, an element parallel parameter and a vector parallel parameter corresponding to the scale parameter are matched; the weight matrix data is partitioned according to the element parallel parameter and the vector parallel parameter to obtain weight vector blocks; the weight vector blocks are multiplied by the matrix elements to be processed to obtain a first processing result; an addition tree tail is configured according to the element parallel parameter, the first processing result is accumulated through an addition tree containing the addition tree tail to obtain a second processing result, and the second processing result is output as the processing result of the matrix data to be processed. Because the column vectors of the weight matrix data are multiplied by the matrix elements of the matrix to be processed and then accumulated, the vectors of the matrix data to be processed do not need to be copied completely; the calculation of the next time step can therefore start without waiting for the system pipeline to be emptied, and only a partial input vector is needed to start the calculation. A data pipeline is formed, stalls are avoided, idle hardware resources are reduced, and the utilization rate of hardware resources is improved. Meanwhile, according to the scale parameter of the recurrent neural network, the element parallel parameter and the vector parallel parameter corresponding to the scale parameter are matched for slicing, and the corresponding addition tree tail is configured, so that the slicing strategy is more flexible and the method can be applied to recurrent neural networks of various scales.
Referring to fig. 4, fig. 4 is a block diagram of a matrix data processing apparatus for a recurrent neural network according to an embodiment of the present invention, where the apparatus includes:
an obtaining module 401, configured to obtain to-be-processed matrix data and weight matrix data of the recurrent neural network, where the to-be-processed matrix data and the weight matrix data are both composed of matrix elements, and the matrix data includes a column vector constructed by the matrix elements;
a matching module 402, configured to match, according to a scale parameter of the recurrent neural network, an element parallel parameter and a vector parallel parameter corresponding to the scale parameter;
a processing module 403, configured to block the weight matrix data according to the element parallel parameter and the vector parallel parameter to obtain a weight vector block;
a calculating module 404, configured to perform multiplication calculation on the weight vector block and the matrix element to be processed to obtain a first processing result;
an output module 405, configured to configure an addition tree tail according to the element parallel parameter, accumulate the first processing result through an addition tree including the addition tree tail to obtain a second processing result, and output the second processing result as a processing result of the matrix data to be processed.
Optionally, as shown in fig. 5, the output module 405 includes:
a first configuration unit 4051, configured to configure the addition tree tail after the addition tree;
a second configuration unit 4052, configured to configure parallelism of the tail of the addition tree.
Optionally, the scale parameter of the recurrent neural network includes the number of processing units and the vector dimension; the matching module 402 is further configured to match element parallel parameters and vector parallel parameters corresponding to the number of processing units and the vector dimension according to the number of processing units and the vector dimension.
Optionally, as shown in fig. 6, the apparatus further includes:
a balancing module 406, configured to perform a balancing calculation on the first processing result to balance parallelism of the weight vector block.
In the embodiment of the invention, the column vectors of the weight matrix data are multiplied by the matrix elements of the matrix to be processed and then accumulated, so the vectors of the matrix data to be processed do not need to be copied completely; the calculation of the next time step can start without waiting for the system pipeline to be emptied, and only a partial input vector is needed to start the calculation. A data pipeline is thus formed, stalls are avoided, idle hardware resources are reduced, and the utilization rate of hardware resources is improved. Meanwhile, according to the scale parameter of the recurrent neural network, the element parallel parameter and the vector parallel parameter corresponding to the scale parameter are matched for slicing, and the corresponding addition tree tail is configured, so that the slicing strategy is more flexible and the method can be applied to recurrent neural networks of various scales.
An embodiment of the present invention provides an electronic device, including: the matrix data processing method comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the computer program to realize the steps in the matrix data processing method provided by the embodiment of the invention.
The embodiment of the invention provides a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program realizes the steps in the matrix data processing method provided by the embodiment of the invention.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules illustrated are not necessarily required to practice the invention.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus can be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative.
In addition, the processors and chips in the embodiments of the present invention may be integrated into one processing unit, may exist alone physically, or two or more of them may be integrated into one unit. The computer-readable storage medium or the computer-readable program may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing is a more detailed description of the present invention in connection with specific preferred embodiments thereof, and it is not intended that the specific embodiments of the present invention be limited to these descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A method for processing matrix data for a recurrent neural network, the method comprising:
acquiring matrix data to be processed and weight matrix data of the recurrent neural network, wherein the matrix data to be processed and the weight matrix data are both formed by matrix elements, and the matrix data comprises column vectors constructed by the matrix elements;
according to the scale parameters of the recurrent neural network, matching element parallel parameters and vector parallel parameters corresponding to the scale parameters;
partitioning the weight matrix data according to the element parallel parameters and the vector parallel parameters to obtain weight vector blocks;
multiplying the weight vector block and the matrix element to be processed to obtain a first processing result;
and configuring an addition tree tail according to the element parallel parameters, accumulating the first processing result through an addition tree containing the addition tree tail to obtain a second processing result, and outputting the second processing result as the processing result of the matrix data to be processed.
2. The method for processing matrix data according to claim 1, wherein the configuring the tail of the addition tree according to the element parallel parameters comprises:
configuring the addition tree tail behind the addition tree;
and configuring the parallelism of the tail of the addition tree.
3. The method of processing matrix data according to claim 1, wherein the scale parameter of the recurrent neural network comprises the number of processing units and the vector dimension, and the matching of the element parallel parameters and the vector parallel parameters corresponding to the scale parameters according to the scale parameters of the recurrent neural network comprises the following steps:
and matching element parallel parameters and vector parallel parameters corresponding to the number of the processing units and the vector dimensions according to the number of the processing units and the vector dimensions.
4. The method of processing matrix data according to claim 1, wherein before said accumulating said first processing result through an adder tree including a tail of said adder tree to obtain a second processing result, said method further comprises:
and carrying out balance calculation on the first processing result so as to balance the parallelism of the weight vector block.
5. An apparatus for processing matrix data for a recurrent neural network, the apparatus comprising:
the acquiring module is used for acquiring matrix data to be processed and weight matrix data of the recurrent neural network, wherein the matrix data to be processed and the weight matrix data are both composed of matrix elements, and the matrix data comprises column vectors constructed by the matrix elements;
the matching module is used for matching element parallel parameters and vector parallel parameters corresponding to the scale parameters according to the scale parameters of the recurrent neural network;
the processing module is used for partitioning the weight matrix data according to the element parallel parameters and the vector parallel parameters to obtain weight vector blocks;
the calculation module is used for multiplying the weight vector block and the matrix element to be processed to obtain a first processing result;
and the output module is used for configuring an addition tree tail according to the element parallel parameters, accumulating the first processing result through an addition tree containing the addition tree tail to obtain a second processing result, and outputting the second processing result as the processing result of the matrix data to be processed.
6. The apparatus for processing matrix data according to claim 5, wherein the output module comprises:
a first configuration unit, configured to configure the addition tree tail after the addition tree;
and the second configuration unit is used for configuring the parallelism of the tail of the addition tree.
7. The apparatus for processing matrix data according to claim 5, wherein the scale parameter of the recurrent neural network comprises the number of processing units and the vector dimension; the matching module is further used for matching element parallel parameters and vector parallel parameters corresponding to the number of processing units and the vector dimension according to the number of processing units and the vector dimension.
8. The apparatus for processing matrix data according to claim 5, wherein the apparatus further comprises:
and the balancing module is used for carrying out balance calculation on the first processing result so as to balance the parallelism of the weight vector block.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, which when executed by the processor implements the steps in the method for processing matrix data according to any one of claims 1 to 4.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps in the method of processing matrix data according to any one of claims 1 to 4.
CN202011083298.9A 2020-10-12 2020-10-12 Matrix data processing method and device, electronic equipment and storage medium Pending CN114357371A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011083298.9A CN114357371A (en) 2020-10-12 2020-10-12 Matrix data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011083298.9A CN114357371A (en) 2020-10-12 2020-10-12 Matrix data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114357371A (en) 2022-04-15

Family

ID=81089514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011083298.9A Pending CN114357371A (en) 2020-10-12 2020-10-12 Matrix data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114357371A (en)

Similar Documents

Publication Publication Date Title
US10691996B2 (en) Hardware accelerator for compressed LSTM
US20180260710A1 (en) Calculating device and method for a sparsely connected artificial neural network
US11574195B2 (en) Operation method
US10698657B2 (en) Hardware accelerator for compressed RNN on FPGA
US10810484B2 (en) Hardware accelerator for compressed GRU on FPGA
CN109543816B (en) Convolutional neural network calculation method and system based on weight kneading
CN109948036B (en) Method and device for calculating weight of participle term
US11775832B2 (en) Device and method for artificial neural network operation
KR102396447B1 (en) Deep learning apparatus for ANN with pipeline architecture
US11657262B2 (en) Processing matrix operations for rate limited systems
Mao et al. Energy-efficient machine learning accelerator for binary neural networks
CN113885941A (en) Singular value decomposition operation implementation method, device and related equipment
CN112528650B (en) Bert model pre-training method, system and computer equipment
CN111582444B (en) Matrix data processing method and device, electronic equipment and storage medium
Peres et al. Faster convolutional neural networks in low density fpgas using block pruning
US7945061B1 (en) Scalable architecture for subspace signal tracking
CN114357371A (en) Matrix data processing method and device, electronic equipment and storage medium
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium
CN114548352A (en) Matrix data processing method and device, electronic equipment and storage medium
CN110276448B (en) Model compression method and device
CN114298329A (en) Model training method, device, equipment and storage medium
EP4168943A1 (en) System and method for accelerating training of deep learning networks
US20240046098A1 (en) Computer implemented method for transforming a pre trained neural network and a device therefor
US20240046078A1 (en) Desparsified convolution for sparse activations
KR20240041036A (en) Method and apparatus for operating memory processor

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination