CN116805155B - LSTM network processing method, device, equipment and readable storage medium

Publication number: CN116805155B (grant); application number: CN202311077196.XA; earlier publication: CN116805155A
Authority: CN (China)
Prior art keywords: result, data, matrix, cell, multiplication operation
Legal status: Active
Inventors: 张欣杨, 孙道恒, 闫夏超, 苏明明
Applicant and current assignee: Taichu Wuxi Electronic Technology Co., Ltd.
Original language: Chinese (zh)
Classification: Y02D10/00, Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of data processing, and discloses an LSTM network processing method, apparatus, device, and readable storage medium. The method comprises the following steps: for each cell in the LSTM network, obtaining the matrix multiplication input data to be operated on by the current cell; dividing it according to the result categories the current cell needs to compute, to obtain a plurality of groups of matrices to be operated on; loading each group of matrices into the cache areas of the corresponding slave core group, performing, by that slave core group, the matrix multiplication of its group, and storing the calculation result of the matrix multiplication in the cache areas; calculating the cell state of the current cell from the forget gate result, the previous cell state, the input gate result, and the temporary cell state; and calculating the hidden state of the current cell from the cell state of the current cell and the output gate result. This scheme improves computational efficiency while implementing the LSTM network processing function.

Description

LSTM network processing method, device, equipment and readable storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for LSTM network processing.
Background
LSTM (Long Short-Term Memory) networks are a special kind of recurrent neural network that can analyze time-series inputs.
Because computations involving LSTM networks handle a large amount of data, a data processing framework with higher computational efficiency is needed. In the prior art, a heterogeneous many-core AI (Artificial Intelligence) acceleration processor is used to compute the LSTM network. The heterogeneous many-core AI acceleration processor comprises a master core and a slave core array; each slave core has an independent cache space, so computation data can be processed in parallel, and each slave core is provided with acceleration components dedicated to matrix multiplication and other operations, maximizing computation speed. When computing an LSTM network, the heterogeneous many-core AI acceleration processor usually starts one operator (kernel) to complete the matrix multiplication operations; after the computation, the matrix multiplication data are stored from the cache area to main memory, and after several groups of matrix multiplication operations have been executed in sequence, another operator is started specifically for the subsequent vector computation.
However, this scheme must first complete the matrix multiplication and save its result to main memory, and the subsequent computation must then access main memory, so efficiency is low.
Disclosure of Invention
In view of this, the present invention provides an LSTM network processing method, apparatus, device, and readable storage medium, to solve the problem in the related scheme that the matrix multiplication must be completed first and its result stored in main memory, with the subsequent computation accessing main memory, resulting in low efficiency.
In a first aspect, the present invention provides an LSTM network processing method applied to a heterogeneous many-core acceleration processor, where the heterogeneous many-core acceleration processor includes a plurality of groups of slave cores, each slave core includes a cache area, a data high-speed access channel is provided between the slave cores, and the LSTM network includes an array formed by a plurality of cells. The method includes: for each cell in the LSTM network, obtaining the matrix multiplication input data to be operated on by the current cell, where the matrix multiplication input data includes data corresponding to the hidden state of the previous cell; dividing the matrix multiplication input data according to the result categories the current cell needs to compute, to obtain a plurality of groups of matrices to be operated on, where each group of matrices to be operated on holds the data required by the matrix multiplication corresponding to one operation result of the current cell, and the result categories include the input gate result, the temporary cell state, the forget gate result, and the output gate result; loading each group of matrices to be operated on into the cache areas of the corresponding slave core group, performing, by that slave core group, the matrix multiplication of its group of matrices, and storing the calculation result of the matrix multiplication in the cache areas, where each group of matrices to be operated on corresponds to one slave core group, each slave core of a group executes a part of the matrix multiplication and stores its intermediate calculation result in its own cache area, and the slave cores fetch one another's intermediate calculation results through the data high-speed access channel; obtaining the bias data vector corresponding to each result category; superposing the calculation result of the matrix multiplication corresponding to the input gate result with the corresponding bias data vector to obtain an input gate result vector; superposing the calculation result of the matrix multiplication corresponding to the temporary cell state with the corresponding bias data vector to obtain a temporary cell state vector; superposing the calculation result of the matrix multiplication corresponding to the forget gate result with the corresponding bias data vector to obtain a forget gate result vector; superposing the calculation result of the matrix multiplication corresponding to the output gate result with the corresponding bias data vector to obtain an output gate result vector; activating the input gate result vector through a first activation function to obtain the input gate result of the current cell; activating the temporary cell state vector through a second activation function to obtain the temporary cell state of the current cell; activating the forget gate result vector through the first activation function to obtain the forget gate result of the current cell; activating the output gate result vector through the first activation function to obtain the output gate result of the current cell; calculating the cell state of the current cell according to the forget gate result, the previous cell state, the input gate result, and the temporary cell state, where the cell state of the current cell is used for calculating the next cell state; and calculating the hidden state of the current cell according to the cell state of the current cell and the output gate result, where the hidden state of the current cell participates in the operation of the next cell as a part of the matrix multiplication input data required by the next cell.
Thus, in this scheme, the LSTM network computation is distributed by the heterogeneous many-core acceleration processor to the slave cores; the slave cores exchange data through the data high-speed access channel, and every intermediate result of the computation is accessed through the cache area inside each slave core. Intermediate results never need to be stored to main memory, which removes the step of loading them back from main memory and reduces the number of memory accesses; every slave core participates in the computation, idling is avoided, computing resources are fully utilized, and computational efficiency is improved.
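For orientation, the per-cell computation enumerated above matches the standard LSTM cell equations. Below is a minimal single-cell NumPy sketch under that reading; the names (W, R, b and the gate keys) are hypothetical, and the gate-state products are element-wise, as in a standard LSTM. It shows only the mathematics, not the many-core distribution:

```python
import numpy as np

def sigmoid(v):                      # first activation function, range (0, 1)
    return 1.0 / (1.0 + np.exp(-v))

def lstm_cell(x, h_prev, c_prev, W, R, b):
    """One LSTM cell. W, R, b are dicts keyed by result category:
    'i' input gate, 'g' temporary cell state, 'f' forget gate, 'o' output gate."""
    i = sigmoid(x @ W['i'] + h_prev @ R['i'] + b['i'])   # input gate result
    g = np.tanh(x @ W['g'] + h_prev @ R['g'] + b['g'])   # temporary cell state
    f = sigmoid(x @ W['f'] + h_prev @ R['f'] + b['f'])   # forget gate result
    o = sigmoid(x @ W['o'] + h_prev @ R['o'] + b['o'])   # output gate result
    c = f * c_prev + i * g           # cell state: old memory plus new memory
    h = o * np.tanh(c)               # hidden state, input to the next cell
    return c, h
```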
In an alternative embodiment, the slave cores are arranged in A rows × B columns, where each row of slave cores forms one group. Dividing the matrix multiplication input data according to the result categories the current cell needs to compute, to obtain a plurality of groups of matrices to be operated on, includes: splitting the first left matrix data evenly by column and loading the blocks into the slave cores, where slave cores located in the same column load the same first left matrix data block; splitting the second left matrix data evenly by column and loading the blocks into the slave cores, where slave cores located in the same column load the same second left matrix data block; and loading, by the slave core groups respectively, the first right matrix data, the second right matrix data, the third right matrix data, the fourth right matrix data, the fifth right matrix data, the sixth right matrix data, the seventh right matrix data, and the eighth right matrix data, where each right matrix data is split evenly by column and loaded into the slave cores of its slave core group. The first right matrix data and the second right matrix data are used for calculating the input gate result, the third and fourth right matrix data for calculating the temporary cell state, the fifth and sixth right matrix data for calculating the forget gate result, and the seventh and eighth right matrix data for calculating the output gate result. The first left matrix data undergoes matrix multiplication with the first, third, fifth, and seventh right matrix data respectively; the second left matrix data undergoes matrix multiplication with the second, fourth, sixth, and eighth right matrix data respectively.
Thus, in accordance with the characteristics of the LSTM network, this scheme arranges the slave cores into an array and groups them by row, so that the groups compute the input gate result, temporary cell state, forget gate result, and output gate result in parallel; the computation process can be controlled by row or by column, increasing the degree of parallelism. Because data interaction between slave cores in adjacent rows is faster, the computing tasks of each row of slave cores can be allocated sensibly, improving computational efficiency.
In an alternative embodiment, loading each group of matrices to be operated on into the cache areas of the corresponding slave core group, performing the matrix multiplication of that group by the slave core group, and storing the calculation result of the matrix multiplication in the cache areas includes: multiplying each first left matrix data block with each first right matrix data block and accumulating to obtain a first intermediate result, multiplying each second left matrix data block with each second right matrix data block and accumulating to obtain a second intermediate result, and adding the first and second intermediate results to obtain the calculation result of the matrix multiplication corresponding to the input gate result; multiplying each first left matrix data block with each third right matrix data block and accumulating to obtain a third intermediate result, multiplying each second left matrix data block with each fourth right matrix data block and accumulating to obtain a fourth intermediate result, and adding the third and fourth intermediate results to obtain the calculation result of the matrix multiplication corresponding to the temporary cell state; multiplying each first left matrix data block with each fifth right matrix data block and accumulating to obtain a fifth intermediate result, multiplying each second left matrix data block with each sixth right matrix data block and accumulating to obtain a sixth intermediate result, and adding the fifth and sixth intermediate results to obtain the calculation result of the matrix multiplication corresponding to the forget gate result; and multiplying each first left matrix data block with each seventh right matrix data block and accumulating to obtain a seventh intermediate result, multiplying each second left matrix data block with each eighth right matrix data block and accumulating to obtain an eighth intermediate result, and adding the seventh and eighth intermediate results to obtain the calculation result of the matrix multiplication corresponding to the output gate result.
Thus, in this scheme, the matrices to be operated on are loaded into the slave core groups, with each slave core in a group loading a part of the data, so that every slave core participates in the computation and computational efficiency is improved.
In an alternative embodiment, calculating the cell state of the current cell according to the forget gate result, the previous cell state, the input gate result, and the temporary cell state includes: performing a matrix multiplication operation on the forget gate result and the previous cell state to obtain the old memory result; performing a matrix multiplication operation on the input gate result and the temporary cell state to obtain the new memory result; and adding the old memory result and the new memory result to obtain the cell state of the current cell.
Thus, this scheme calculates the old memory result and the new memory result of the current cell, completing the calculation of the cell state of the current cell.
In an alternative embodiment, the method further includes: each slave core obtains, through the data high-speed access channel, the forget gate partial result held in the slave cores of its column that calculate the forget gate result; each slave core obtains its block of the previous cell state; a matrix multiplication operation is performed on the forget gate partial result and the cell state block to obtain an old memory partial result, which is stored in the cache area; each slave core obtains, through the data high-speed access channel, the input gate partial result held in the slave cores of its column that calculate the input gate result; each slave core obtains, through the data high-speed access channel, the temporary cell state block held in the slave cores of its column that calculate the temporary cell state; a matrix multiplication operation is performed on the input gate partial result and the temporary cell state block to obtain a new memory partial result, which is stored in the cache area; and each slave core loads its old memory partial result and new memory partial result from its own cache area, adds them to obtain its block of the cell state of the current cell, and stores that block into the cache area.
Thus, in this scheme, data interaction among the slave cores takes place through the data high-speed access channel, so that each slave core performs a part of the computation, computing resources are used sensibly, and computational efficiency is improved.
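A minimal sketch of this distributed update, assuming a hypothetical grid with 8 slave core columns; the fetches over the data high-speed access channel are modeled as ordinary reads of the other rows' blocks, and the gate-state products are element-wise, as in a standard LSTM:

```python
import numpy as np

cols = 8                                   # slave cores in one row of the grid
n = 64                                     # columns of the full state matrices
rng = np.random.default_rng(0)
split = lambda m: np.split(m, cols, axis=1)   # one block per slave core column

f_blocks      = split(rng.random((1, n)))            # forget gate partial results
i_blocks      = split(rng.random((1, n)))            # input gate partial results
g_blocks      = split(rng.standard_normal((1, n)))   # temporary cell state blocks
c_prev_blocks = split(rng.standard_normal((1, n)))   # previous cell state blocks

c_blocks = []
for k in range(cols):                      # each slave core column computes its block
    old = f_blocks[k] * c_prev_blocks[k]   # old memory partial result (cache area)
    new = i_blocks[k] * g_blocks[k]        # new memory partial result (cache area)
    c_blocks.append(old + new)             # cell state block of the current cell
c = np.concatenate(c_blocks, axis=1)       # full cell state, distributed in practice
```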
In an alternative embodiment, calculating the hidden state of the current cell according to the cell state of the current cell and the output gate result includes: activating the cell state of the current cell through the second activation function to obtain an intermediate activation result; and performing a matrix multiplication operation on the output gate result and the intermediate activation result to obtain the hidden state of the current cell.
Thus, this scheme calculates the intermediate activation result, completing the calculation of the hidden state of the current cell.
In an alternative embodiment, the method further includes: each slave core loads its block of the cell state of the current cell from its own cache area; the cell state block of the current cell is activated through the second activation function to obtain an intermediate activation partial result; each slave core obtains, through the data high-speed access channel, the output gate partial result held in the slave cores of its column that calculate the output gate result; and a matrix multiplication operation is performed on the output gate partial result and the intermediate activation partial result to obtain the hidden state block of the current cell.
Thus, in this scheme too, data interaction among the slave cores takes place through the data high-speed access channel, so that each slave core performs a part of the computation, computing resources are used sensibly, and computational efficiency is improved.
In a second aspect, the present invention provides an LSTM network processing apparatus applied to a heterogeneous many-core acceleration processor, where the heterogeneous many-core acceleration processor includes a plurality of groups of slave cores, each slave core includes a cache area, a data high-speed access channel is provided between the slave cores, and the LSTM network includes an array formed by a plurality of cells. The apparatus comprises:
a first data acquisition module, used for obtaining, for each cell in the LSTM network, the matrix multiplication input data to be operated on by the current cell, where the matrix multiplication input data includes data corresponding to the hidden state of the previous cell;
a data grouping module, used for dividing the matrix multiplication input data according to the result categories the current cell needs to compute, to obtain a plurality of groups of matrices to be operated on, where each group of matrices to be operated on holds the data required by the matrix multiplication corresponding to one operation result of the current cell, and the result categories include the input gate result, the forget gate result, the output gate result, and the temporary cell state;
a matrix multiplication operation module, used for loading each group of matrices to be operated on into the cache areas of the corresponding slave core group, performing, by that slave core group, the matrix multiplication of its group, and storing the calculation result of the matrix multiplication in the cache areas, where each group of matrices to be operated on corresponds to one slave core group, each slave core of a group executes a part of the matrix multiplication and stores its intermediate calculation result in its own cache area, and the slave cores fetch one another's intermediate calculation results through the data high-speed access channel;
a second data acquisition module, used for obtaining the bias data vector corresponding to each result category;
a vector operation module, used for superposing the calculation result of the matrix multiplication corresponding to the input gate result with the corresponding bias data vector to obtain an input gate result vector; superposing the calculation result of the matrix multiplication corresponding to the forget gate result with the corresponding bias data vector to obtain a forget gate result vector; superposing the calculation result of the matrix multiplication corresponding to the output gate result with the corresponding bias data vector to obtain an output gate result vector; and superposing the calculation result of the matrix multiplication corresponding to the temporary cell state with the corresponding bias data vector to obtain a temporary cell state vector;
an activation module, used for activating the input gate result vector through a first activation function to obtain the input gate result of the current cell; activating the forget gate result vector through the first activation function to obtain the forget gate result of the current cell; activating the output gate result vector through the first activation function to obtain the output gate result of the current cell; and activating the temporary cell state vector through a second activation function to obtain the temporary cell state of the current cell;
a cell state operation module, used for calculating the cell state of the current cell according to the forget gate result, the previous cell state, the input gate result, and the temporary cell state, where the cell state of the current cell is used for calculating the next cell state;
a hidden state operation module, used for calculating the hidden state of the current cell according to the cell state of the current cell and the output gate result, where the hidden state of the current cell participates in the operation of the next cell as a part of the matrix multiplication input data required by the next cell.
In a third aspect, the present invention provides a computer device, comprising a memory and a processor in communicative connection with each other, where the memory stores computer instructions and the processor executes the computer instructions, so as to perform the LSTM network processing method of the first aspect or any of its corresponding embodiments.
In a fourth aspect, the present invention provides a computer-readable storage medium having computer instructions stored thereon, the computer instructions being configured to cause a computer to execute the LSTM network processing method of the first aspect or any of its corresponding embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below illustrate some embodiments of the present invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow diagram of an LSTM network processing method according to an embodiment of the invention;
FIG. 2 is a flow chart of another LSTM network processing method according to an embodiment of the invention;
FIG. 3 shows a schematic diagram of a heterogeneous many-core acceleration processor;
FIG. 4 is a schematic diagram of the structure of a systolic array according to an embodiment of the present invention;
FIG. 5 shows a schematic flow diagram of an accumulation operation;
FIG. 6 shows a calculation schematic of steps S207 to S208;
FIG. 7 is a schematic diagram of a calculation flow in the related art;
FIG. 8 is a schematic illustration of a calculation flow in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of the structure of a cell array according to an embodiment of the present invention;
FIG. 10 is a block diagram of an LSTM network processing device in accordance with an embodiment of the invention;
FIG. 11 is a schematic diagram of the hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
LSTM (Long Short-Term Memory) networks are a special kind of recurrent neural network that can analyze time-series inputs.
Because computations involving LSTM networks handle a large amount of data, a data processing framework with higher computational efficiency is needed. In the prior art, a heterogeneous many-core AI (Artificial Intelligence) acceleration processor is used to compute the LSTM network. The heterogeneous many-core AI acceleration processor comprises a master core and a slave core array; each slave core has an independent cache space, so computation data can be processed in parallel, and each slave core is provided with acceleration components dedicated to matrix multiplication and other operations, maximizing computation speed. When computing an LSTM network, the heterogeneous many-core AI acceleration processor usually starts one operator to complete the matrix multiplication operations; after the computation is finished, the matrix multiplication data are stored from the cache area to main memory, and after several groups of matrix multiplication operations have been executed in sequence, another operator is started specifically for the subsequent vector computation.
However, this scheme must first complete the matrix multiplication and save its result to main memory, and the subsequent computation must then access main memory, so efficiency is low.
Therefore, an embodiment of the present invention provides an LSTM network processing method that distributes the LSTM network computation to the slave cores in a heterogeneous many-core acceleration processor, thereby improving computational efficiency.
According to an embodiment of the present invention, an LSTM network processing method embodiment is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from the one shown here.
In this embodiment, an LSTM network processing method is provided, applied to a heterogeneous many-core acceleration processor (i.e., a heterogeneous many-core AI acceleration processor), where the heterogeneous many-core acceleration processor includes a plurality of groups of slave cores, each slave core includes a cache area (LDM, local device memory), a data high-speed access channel is provided between the slave cores, and the LSTM network includes an array formed by a plurality of cells. FIG. 1 is a flowchart of an LSTM network processing method according to an embodiment of the present invention; as shown in FIG. 1, the flow includes the following steps:
step S101, for each cell in the LSTM network, obtaining matrix multiplication input data of the current cell needing to be operated.
The matrix multiplication input data comprises data corresponding to the hidden state of the previous cell.
It should be noted that, as a kind of neural network, an LSTM network used to solve a specific problem also goes through a training process and an application process: during training, a large number of data sets are used, and during application, target data are input into the trained LSTM network to obtain a target result. The processing described here may take place in either the training process or the application process.
Cells are the basic units of the LSTM network; the cells form the LSTM network in time-series order. The input of each cell includes the current input matrix, the cell state of the previous cell, and the hidden state of the previous cell, and its output is the updated cell state and hidden state. Each cell decides which earlier information and states need to be kept/remembered and which are discarded, which enables LSTM networks to effectively preserve relevant information from long ago.
The LSTM network is processed by the heterogeneous many-core acceleration processor, where "heterogeneous" means that each slave core operates independently and the computations inside the slave cores do not affect one another. According to the structure of the LSTM network, the matrix multiplication input data of the LSTM network must first be obtained during processing; the matrix multiplication input data are selected by the practitioner according to the specific application scenario. One part of the matrix multiplication input data is the current input matrix of the LSTM network and the hidden state of the previous cell, and another part is the weights in the LSTM network corresponding to the current input matrix and to the hidden state of the previous cell.
Step S102, dividing the matrix multiplication input data according to the result categories the current cell needs to compute, to obtain a plurality of groups of matrices to be operated on.
Each group of matrices to be operated on holds the data required by the matrix multiplication corresponding to one operation result of the current cell, and the result categories include the input gate result, the temporary cell state, the forget gate result, and the output gate result.
Gating is a way of selectively passing information: the forget gate result determines which information the current cell discards from the cell state of the previous cell, the input gate result and the temporary cell state determine which information the current cell updates, and the output gate result determines which information the current cell outputs.
Step S103, loading each group of matrices to be operated on into the cache areas of the corresponding slave core group; the slave core group performs the matrix multiplication of its group of matrices, and the calculation result of the matrix multiplication is stored in the cache areas.
Each group of matrices to be operated on corresponds to one slave core group; each slave core of a group executes a part of the matrix multiplication and stores its intermediate calculation result in its own cache area, and the slave cores fetch one another's intermediate calculation results through the data high-speed access channel.
Since the steps for calculating the input gate result, the temporary cell state, the forget gate result, and the output gate result are similar in an LSTM network, the slave cores can be grouped so that the groups calculate the input gate result, the temporary cell state, the forget gate result, and the output gate result respectively. Furthermore, the slave cores within each group can evenly divide the group's matrices to be operated on, ensuring that every slave core computes, avoiding idle slave cores, and improving computational efficiency.
Step S104, obtaining offset data vectors corresponding to each result category.
A bias data vector is the bias applied when an input matrix passes through a neuron of the neural network: for example, an input matrix z is transformed by a neuron into G(Wz+b), where G is the transformation function, W is the weight matrix of z, and b is the bias matrix. The bias data vectors include the bias data vector corresponding to the current input matrix and the bias data vector corresponding to the hidden state of the previous cell.
The bias data vectors are loaded, according to result category, into the cache areas of the corresponding slave core groups.
Step S105, superposing the calculation result of the matrix multiplication corresponding to the input gate result with the corresponding bias data vector to obtain an input gate result vector; superposing the calculation result of the matrix multiplication corresponding to the temporary cell state with the corresponding bias data vector to obtain a temporary cell state vector; superposing the calculation result of the matrix multiplication corresponding to the forget gate result with the corresponding bias data vector to obtain a forget gate result vector; and superposing the calculation result of the matrix multiplication corresponding to the output gate result with the corresponding bias data vector to obtain an output gate result vector.
The calculation results of the matrix multiplications corresponding to the input gate result, the temporary cell state, the forget gate result, and the output gate result are the results of the matrix multiplications, in the LSTM network, of the current input matrix with its corresponding weights and of the hidden state of the previous cell with its corresponding weights.
Step S106, activating the input gate result vector through a first activation function to obtain the input gate result of the current cell; activating the temporary cell state vector through a second activation function to obtain the temporary cell state of the current cell; activating the forget gate result vector through the first activation function to obtain the forget gate result of the current cell; and activating the output gate result vector through the first activation function to obtain the output gate result of the current cell.
The activation functions introduce nonlinear factors, improving the expressive capacity of the neural network and solving problems that a linear model cannot.
Step S107, calculating the cell state of the current cell according to the forget gate result, the previous cell state, the input gate result, and the temporary cell state, where the cell state of the current cell is used for calculating the next cell state.
The forget gate result and the previous cell state determine which information the current cell discards from the cell state of the previous cell, and the input gate result and the temporary cell state determine which information the current cell updates; together they determine the cell state of the current cell.
It should be noted that the cell state of the LSTM network corresponds to the path along which information is transmitted, allowing relevant information to be carried throughout the processing of the sequence. It corresponds to the "long-term memory" of the LSTM network, so information from even very early time steps can be carried into the cells of later time steps, overcoming the limits of short-term memory.
Step S108, calculating the hidden state of the current cell according to the cell state of the current cell and the output gate result, where the hidden state of the current cell participates in the operation of the next cell as a part of the matrix multiplication input data required by the next cell.
The output gate result and the cell state of the current cell determine which information the current cell outputs.
The hidden state is a feature extracted from the cell state; it is more closely related to the current decision and mainly stores "short-term memory".
Further, the cell state and hidden state of the current cell serve as inputs of the next cell and participate in its operation. The cell state and hidden state of the cell at the end of the time series may be stored in main memory for subsequent calculation of the loss function of the LSTM network, etc.
It should be noted that steps S101 to S103 constitute the matrix multiplication operations, and steps S104 to S108 constitute the vector operations. In the related art, one operator performs the matrix multiplication and stores its calculation result in main memory; another operator then loads that result from main memory, performs the next vector computation, and stores the result of the vector computation back in main memory. In this scheme, the calculation result of the matrix multiplication is kept in the cache area, so main memory need not be involved and the number of memory accesses is reduced. Separating the overall computation into matrix multiplication operations and vector operations is only for convenience in explaining the technical effect; the specific steps of the vector operations also contain a "matrix multiplication operation" between two matrices of the same shape, by which is meant multiplying the two matrices element by element, as in the standard LSTM gate-state products.
According to the LSTM network processing method provided by this embodiment, the LSTM network computation is distributed by the heterogeneous many-core acceleration processor to the slave cores; the slave cores exchange data through the data high-speed access channel, and every intermediate result of the computation is accessed through the cache area inside each slave core. Intermediate results need not be stored to main memory, which removes the step of loading them from main memory and reduces the number of memory accesses; every slave core participates in the computation, idling is avoided, computing resources are fully utilized, and computational efficiency is improved.
In this embodiment, an LSTM network processing method is provided, applied to a heterogeneous many-core acceleration processor, where the heterogeneous many-core acceleration processor includes a plurality of groups of slave cores, each slave core includes a cache area, a data high-speed access channel is provided between the slave cores, and the LSTM network includes an array formed by a plurality of cells. FIG. 2 is a flowchart of the LSTM network processing method according to an embodiment of the present invention; as shown in FIG. 2, the flow includes the following steps:
Step S201, for each cell in the LSTM network, obtaining the matrix multiplication input data to be operated on by the current cell, where the matrix multiplication input data includes data corresponding to the hidden state of the previous cell.
Specifically, first left matrix data, second left matrix data, first right matrix data, second right matrix data, third right matrix data, fourth right matrix data, fifth right matrix data, sixth right matrix data, seventh right matrix data, and eighth right matrix data are acquired.
Step S202, dividing the matrix multiplication input data according to the result categories the current cell needs to compute, to obtain a plurality of groups of matrices to be operated on, where each group of matrices to be operated on holds the data required by the matrix multiplication corresponding to one operation result of the current cell, and the result categories include the input gate result, the temporary cell state, the forget gate result, and the output gate result.
Specifically, the first left matrix data is the current input matrix of the current cell, and the second left matrix data is the data corresponding to the hidden state of the previous cell. The first right matrix data is the weight matrix corresponding to the first left matrix data in the input gate, the second right matrix data is the weight matrix corresponding to the second left matrix data in the input gate, the third right matrix data is the weight matrix corresponding to the first left matrix data in the temporary cell state, the fourth right matrix data is the weight matrix corresponding to the second left matrix data in the temporary cell state, the fifth right matrix data is the weight matrix corresponding to the first left matrix data in the forget gate, the sixth right matrix data is the weight matrix corresponding to the second left matrix data in the forget gate, the seventh right matrix data is the weight matrix corresponding to the first left matrix data in the output gate, and the eighth right matrix data is the weight matrix corresponding to the second left matrix data in the output gate.
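In standard LSTM notation, this operand numbering corresponds to the four pairs of gate weights. A minimal sketch with hypothetical names and sizes (W for the input-side weights, R for the hidden-side weights):

```python
import numpy as np

rng = np.random.default_rng(1)
batch, in_dim, hid = 128, 64, 64                 # hypothetical sizes
x      = rng.standard_normal((batch, in_dim))    # first left matrix data
h_prev = rng.standard_normal((batch, hid))       # second left matrix data

# Right matrix data, numbered first..eighth in the text: W holds the
# odd-numbered (input-side) weights, R the even-numbered (hidden-side) ones.
gates = ['i', 'g', 'f', 'o']                     # input, temporary, forget, output
W = {g: rng.standard_normal((in_dim, hid)) for g in gates}
R = {g: rng.standard_normal((hid, hid)) for g in gates}

# One pair of matrix multiplications per result category.
pre = {g: x @ W[g] + h_prev @ R[g] for g in gates}
```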
If the four groups of computations (input gate result, temporary cell state, forget gate result, and output gate result) were not allocated to different slave core groups, then when partitioning the space of the cache area, more buffers would have to be allocated in order to compute the input gate result, temporary cell state, forget gate result, and output gate result serially, which would reduce the amount of data the cache area can handle. In addition, allocating them to different groups turns the serial logic in the operator's program code into four groups of parallel computation, greatly improving computation speed.
Step S203, loading each group of matrices to be operated on into the cache areas of the corresponding slave core group; the slave core group performs the matrix multiplication of its group of matrices, and the calculation result of the matrix multiplication is stored in the cache areas. Each group of matrices to be operated on corresponds to one slave core group; each slave core of a group executes a part of the matrix multiplication and stores its intermediate calculation result in its own cache area, and the slave cores fetch one another's intermediate calculation results through the data high-speed access channel.
Specifically, the slave cores are arranged in A rows × B columns, where each row of slave cores forms one group, and A and B are positive integers.
Optionally, the heterogeneous many-core acceleration processor includes 32 slave cores in total, arranged as 4 rows × 8 columns. By way of example, FIG. 3 shows a schematic diagram of a heterogeneous many-core acceleration processor. As shown in FIG. 3, the heterogeneous many-core acceleration processor includes a master core, a slave core array, and a memory (main memory); each slave core includes a cache area, a matrix multiplication acceleration component (systolic array), and a non-matrix-multiplication acceleration component.
Specifically, the step S203 includes:
step S2031, splitting first left matrix data according to column average, and loading the split first left matrix data into each slave core respectively; wherein slave cores located in the same column load the same first left matrix partition data; splitting the second left matrix data according to the column average, and loading the second left matrix data into each slave core respectively; wherein slave cores located in the same column load the same second left matrix data.
In an actual application scene, a matrix multiplication acceleration component (i.e. a systolic array) used for performing matrix multiplication calculation in each slave core has a limitation on the row and column size of matrix multiplication input data, so that the matrix multiplication input data can be split into smaller matrices to be respectively loaded into different slave cores, and each slave core calculates a part of results and then accumulates to obtain a total result. It should be noted that, in this embodiment, taking an example that the number of columns of matrix multiplied by input data is large, each matrix multiplied by input data is split according to the average of the columns, and if the number of rows of matrix multiplied by input data is large in an actual scene, the matrix multiplied by input data may also be split according to the average of the rows. By splitting matrix multiplication input data into each slave core for calculation, larger data can be processed, the application range is widened, each slave core can process less data and cannot idle, and the overall calculation efficiency is improved.
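A minimal NumPy sketch of this split-and-accumulate pattern, with hypothetical sizes and 8 slave cores in a group: the shared dimension of the multiplication is split evenly, each core forms a partial product, and the partial products are accumulated into the total result:

```python
import numpy as np

rng = np.random.default_rng(2)
cores = 8                                  # slave cores in one group (one row)
M, K, N = 128, 512, 64                     # hypothetical matrix sizes
x = rng.standard_normal((M, K))            # left matrix data
W = rng.standard_normal((K, N))            # right matrix data

# Split the shared (K) dimension evenly across the cores: core k holds
# one column block of x and the matching row block of W.
x_blocks = np.split(x, cores, axis=1)
W_blocks = np.split(W, cores, axis=0)

# Each core computes a partial product; accumulating the partial
# products reproduces the full matrix multiplication.
partial = [x_blocks[k] @ W_blocks[k] for k in range(cores)]
total = np.sum(partial, axis=0)
assert np.allclose(total, x @ W)
```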
Step S2032, loading, by the slave core groups respectively, the first right matrix data, the second right matrix data, the third right matrix data, the fourth right matrix data, the fifth right matrix data, the sixth right matrix data, the seventh right matrix data, and the eighth right matrix data.
Each right matrix data is split evenly by column and loaded into the slave cores of its slave core group. It should be noted that, because the row-column layout of each right matrix data in the main memory of the heterogeneous many-core acceleration processor is the opposite of its layout in actual computation (for example, C rows × D columns in main memory become D rows × C columns in actual computation), the first through eighth right matrix data must be loaded into the cache areas through an interface with a matrix transposition function.
The first right matrix data and the second right matrix data are used for calculating the input gate result; the third and fourth right matrix data are used for calculating the temporary cell state; the fifth and sixth right matrix data are used for calculating the forget gate result; and the seventh and eighth right matrix data are used for calculating the output gate result.
The first left matrix data undergoes matrix multiplication with the first, third, fifth, and seventh right matrix data respectively; the second left matrix data undergoes matrix multiplication with the second, fourth, sixth, and eighth right matrix data respectively.
That is, each slave core is loaded with a part of the current input matrix, a part of the hidden state of the previous cell, a part of the weights corresponding to the current input matrix, and a part of the weights corresponding to the hidden state of the previous cell.
Step S2033A, performing matrix multiplication on each first left matrix data block with each first right matrix data block and accumulating to obtain a first intermediate result; performing matrix multiplication on each second left matrix data block with each second right matrix data block and accumulating to obtain a second intermediate result; and adding the first intermediate result and the second intermediate result to obtain the calculation result of the matrix multiplication corresponding to the input gate result.
This step is performed by the matrix multiplication acceleration component in the slave core, i.e., the systolic array. A systolic array is an array structure consisting of many identical processing units connected according to an interconnection rule; data flow between the processing units of the array in rhythmic order, and during this flow all processing units process their data simultaneously in parallel. Intermediate results of the data processing can be stored in the temporary buffers of the processing units, which allows the result of the matrix multiplication to be output as a whole, reducing the number of accesses to the cache area and improving computational efficiency.
First, each slave core performs matrix multiplication on its first left matrix data block and first right matrix data block to obtain a first intermediate partial result, which is stored in the cache area; it then performs matrix multiplication on its second left matrix data block and second right matrix data block to obtain a second intermediate partial result; finally, it adds the first and second intermediate partial results to obtain its partial calculation result of the matrix multiplication corresponding to the input gate result.
Specifically, subject to the space limit of the cache area (for example, 220 KB), a part of the first right matrix data block is loaded to reside in the systolic array, and a part of the first left matrix data block is then multiplied with it, with the result stored in the temporary buffers of the systolic array's processing units. This loops until all parts of the first left matrix data block have been loaded, and loops again until all parts of the first right matrix data block have been loaded.
Illustratively, FIG. 4 is a schematic diagram of the structure of a systolic array according to an embodiment of the present invention. As shown in FIG. 4, in a given slave core, since the systolic array can load at most 128 rows × 32 columns, the cache area LDM loads the first left matrix data block x (64 rows × 32 columns) twice into the West side of the systolic array, so that the West data becomes 128 rows × 32 columns. The first x (64 rows × 32 columns) is multiplied with the first 32 columns (is_sub rows × 64 columns, transposed) of the first right matrix data block Wi (64 rows × is_sub columns) fed from the North side, and the second x (64 rows × 32 columns) is multiplied with the transposed first right matrix data block Wi in the same way, so a first intermediate partial result x×Wi (128 rows × 32 columns) can be output and stored in the temporary buffers of the processing units in the systolic array. The matrix multiplication of the second left matrix data block hx with the second right matrix data block Ri proceeds in the same way as that of x with Wi and is not repeated here; it yields the second intermediate partial result hx×Ri, after which the partial calculation result x×Wi + hx×Ri (128 rows × 32 columns) of the matrix multiplication corresponding to the input gate result is computed and written back to the cache area. Finally, the partial calculation result x×Wi + hx×Ri (128 rows × 32 columns) is stored as 64 rows × 64 columns, so the data stride becomes 64 (i.e., the lowest dimension of 64 rows × 64 columns); this avoids the stride-32 limitation (a lowest dimension of 32, from 128 rows × 32 columns, performs worse than a lowest dimension of 64) when the cache area stores (puts) data into the main memory HBM (High Bandwidth Memory).
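A simplified NumPy sketch of this computation, with hypothetical sizes standing in for the 128 × 32 tiling: the shared dimension is consumed in 32-column tiles of the left operand, partial products accumulate (modeling the processing units' temporary buffers), and x×Wi + hx×Ri is formed before write-back:

```python
import numpy as np

rng = np.random.default_rng(3)
is_sub = 32                                 # hypothetical output-column block
x  = rng.standard_normal((128, 64))         # first left matrix data block
hx = rng.standard_normal((128, 64))         # second left matrix data block
Wi = rng.standard_normal((64, is_sub))      # first right matrix data block
Ri = rng.standard_normal((64, is_sub))      # second right matrix data block

def tiled_matmul(left, right, tile_k=32):
    # The shared dimension is consumed in 32-column tiles of the left
    # operand; partial products accumulate, as in the processing units'
    # temporary buffers of the systolic array.
    acc = np.zeros((left.shape[0], right.shape[1]))
    for k in range(0, left.shape[1], tile_k):
        acc += left[:, k:k + tile_k] @ right[k:k + tile_k, :]
    return acc

# x*Wi + hx*Ri is formed before write-back to the cache area
out = tiled_matmul(x, Wi) + tiled_matmul(hx, Ri)
assert np.allclose(out, x @ Wi + hx @ Ri)
```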
Further, the partial calculation results obtained by the slave cores for the input gate result are accumulated, and the calculation result of the matrix multiplication corresponding to the input gate result is obtained and stored in the cache area. The accumulation is a reduce-add operation; by way of example, FIG. 5 shows a schematic flow of the accumulation operation. As shown in FIG. 5, for a row of eight slave cores core0 to core7, during the accumulation core0 fetches data items 0 to 7 from core1 to core7 and accumulates them onto core0, core1 fetches data items 8 to 15 from core0 and core2 to core7 and accumulates them onto core1, and so on.
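A minimal sketch of this segmented reduce-add, assuming 8 hypothetical cores that each hold a full-length partial vector; after the operation, core k owns the fully accumulated segment k:

```python
import numpy as np

cores, seg = 8, 8                       # 8 cores, 8 elements per segment
rng = np.random.default_rng(4)
partials = [rng.standard_normal(cores * seg) for _ in range(cores)]

# Core k gathers segment k from every core (itself included) and
# accumulates it locally, so the reduced result ends up spread row-wide.
reduced = {}
for k in range(cores):
    s = slice(k * seg, (k + 1) * seg)
    reduced[k] = sum(p[s] for p in partials)

full = np.concatenate([reduced[k] for k in range(cores)])
assert np.allclose(full, np.sum(partials, axis=0))
```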
It should be noted that the matrix multiplication steps of the slave cores in the same group are synchronized.
It should be noted that, when the first left matrix data and the second left matrix data have a large number of columns, step S2031 may split off and load just one column-wise part of the first and second left matrix data; steps S2031 to S2033A are then performed, the next part is loaded, steps S2031 to S2033A are performed again, and the loop continues until every split part has participated in the computation.
Step S2033B, performing matrix multiplication on each first left matrix data block with each third right matrix data block and accumulating to obtain a third intermediate result; performing matrix multiplication on each second left matrix data block with each fourth right matrix data block and accumulating to obtain a fourth intermediate result; and adding the third intermediate result and the fourth intermediate result to obtain the calculation result of the matrix multiplication corresponding to the temporary cell state.
This step is similar to step S2033A and will not be described here.
Step S2033C, performing matrix multiplication on each first left matrix data block with each fifth right matrix data block and accumulating to obtain a fifth intermediate result; performing matrix multiplication on each second left matrix data block with each sixth right matrix data block and accumulating to obtain a sixth intermediate result; and adding the fifth intermediate result and the sixth intermediate result to obtain the calculation result of the matrix multiplication corresponding to the forget gate result.
This step is similar to step S2033A and will not be described here.
Step S2033D, performing matrix multiplication operation on each first left matrix sub data and each seventh right matrix sub data and accumulating to obtain a seventh intermediate result, performing matrix multiplication operation on each second left matrix sub data and each eighth right matrix sub data and accumulating to obtain an eighth intermediate result, and adding the seventh intermediate result and the eighth intermediate result to obtain a calculation result of matrix multiplication operation corresponding to the output gate result.
This step is similar to step S2033A and will not be described here.
Step S204, obtaining offset data vectors corresponding to each result category.
This step is similar to step S104, and will not be described here.
Step S205, the calculated result of matrix multiplication operation corresponding to the input gate result and the corresponding offset data vector are overlapped to obtain an input gate result vector; superposing the calculated result of the matrix multiplication operation corresponding to the temporary cell state and the corresponding offset data vector to obtain a temporary cell state vector; superposing the calculation result of matrix multiplication operation corresponding to the forgetting gate result and the corresponding offset data vector to obtain a forgetting gate result vector; and superposing the calculated result of the matrix multiplication operation corresponding to the output gate result and the corresponding offset data vector to obtain an output gate result vector.
The input gate result vector, temporary cell state vector, forget gate result vector, and output gate result vector are all stored to the cache region of the corresponding slave core.
Step S206, activating the input gate result vector through a first activation function to obtain the input gate result of the current cell; activating the temporary cell state vector through a second activation function to obtain the temporary cell state of the current cell; activating the forgotten gate result vector through a first activation function to obtain the forgotten gate result of the current cell; and activating the output gate result vector through a first activation function to obtain the output gate result of the current cell.
Optionally, the first activation function is the sigmoid function, which outputs values within the (0, 1) interval; the second activation function is the hyperbolic tangent function (tanh), which maps its input non-linearly into the (-1, 1) interval.
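A minimal sketch of the two activation functions, assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    """First activation function: maps inputs into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Second activation function: maps inputs into the (-1, 1) interval."""
    return np.tanh(z)
```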
It should be noted that the active operations of the slave cores in the same column are synchronized.
The input gate result, temporary cell state, forgotten gate result, and output gate result are all stored to the cache area of the corresponding slave core. A single slave core may also asynchronously (i.e., without the slave cores affecting each other) output its computed input gate result, temporary cell state, forgotten gate result, or output gate result as an intermediate result to main memory, for subsequent computation of the loss function of the LSTM network, back-propagation updating of the LSTM network, and the like.
Step S207, calculating the cell state of the current cell according to the forgotten gate result, the previous cell state, the input gate result and the temporary cell state, wherein the cell state of the current cell is used for calculating the next cell state.
Specifically, the step S207 includes:
step S2071, performing matrix multiplication operation on the forgotten gate result and the previous cell state to obtain an old memory result.
Each slave core obtains, through the data high-speed access channel, its forgotten gate division result from the slave core of the column used for calculating the forgotten gate result.
Further, each slave core acquires its share of the previous cell division state, loading it into its respective cache area in sequence.
Further, the forgotten gate division result and the cell division state are subjected to a matrix multiplication operation to obtain an old memory division result, which is stored in the cache area.
Step S2072, the input gate result and the temporary cell state are subjected to matrix multiplication operation to obtain a new memory result.
Each slave core obtains, through the data high-speed access channel, its input gate division result from the slave core of the column used for calculating the input gate result.
Further, each slave core obtains, through the data high-speed access channel, its temporary cell division state from the slave core of the column used for calculating the temporary cell state.
Further, the input gate division result and the temporary cell division state are subjected to a matrix multiplication operation to obtain a new memory division result, which is stored in the cache area.
Step S2073, adding the old memory result and the new memory result to obtain the cell state of the current cell.
Each slave core loads the corresponding old memory division result and new memory division result from its respective cache area, adds them to obtain the cell division state of the current cell, and stores the cell division state into the cache area.
Further, the cell division states obtained by the slave cores are combined to obtain the cell state of the current cell.
Step S208, calculating the hidden state of the current cell according to the cell state of the current cell and the output gate result, wherein the hidden state of the current cell is used as a part of matrix multiplication input data required by the next cell to participate in the operation of the next cell.
Specifically, the step S208 includes:
step S2081, activating the cell state of the current cell by the second activation function to obtain an intermediate activation result.
Each slave core loads the cell division state of the corresponding current cell from its respective cache area.
Further, the cell division state of the current cell is activated through the second activation function to obtain an intermediate activation division result.
Step S2082, performing matrix multiplication operation on the output gate result and the intermediate activation result to obtain the hidden state of the current cell.
Each slave core obtains, through the data high-speed access channel, its output gate division result from the slave core of the column used for calculating the output gate result.
Further, the output gate division result and the intermediate activation division result are subjected to a matrix multiplication operation to obtain the hidden division state of the current cell.
Further, the hidden division states of the current cell obtained by the slave cores are combined to obtain the hidden state of the current cell.
Steps S207 to S208 may be calculated using the free space of each slave core, and slave cores in the same column can fetch data stored in one another's cache areas through the data high-speed access channel, so the data required for the next calculation can be spread evenly across the slave cores; every slave core participates in the calculation and none idles, which makes full use of the computing resources and improves calculation efficiency. Illustratively, FIG. 6 shows a schematic calculation of steps S207 to S208. As shown on the left side of FIG. 6, a column of four slave cores has already calculated the input gate result it, the temporary cell state c't, the forgotten gate result ft, and the output gate result ot, each stored in a cache area within its slave core. Taking core0, which calculates the input gate result it, as an example, the data stored in core0 is shown on the right side of FIG. 6, including the calculated input gate result it. In steps S207 to S208, core0 first obtains 1/4 of the forgotten gate result ft, through the data high-speed access channel, from the slave core that calculates ft, and loads 1/4 of the previous cell state ct-1; the 1/4 ft and the 1/4 ct-1 undergo a matrix multiplication operation to give the old memory result, which is stored in cache area temp1. Core0 then loads 1/4 of the it it computed itself, obtains 1/4 of the temporary cell state c't from the slave core that calculates c't through the data high-speed access channel, and the 1/4 it and the 1/4 c't undergo a matrix multiplication operation to give the new memory result. The old and new memory results are added to obtain 1/4 of the cell state ct of the current cell, which is activated by the second activation function tanh; core0 then obtains 1/4 of the output gate result ot, through the data high-speed access channel, from the slave core that calculates ot, and the 1/4 ot and the 1/4 tanh(ct) undergo a matrix multiplication operation to give 1/4 of the hidden state ht of the current cell. The calculation steps of the other three slave cores are similar to those of core0 and are not repeated here. After each slave core has calculated its 1/4 of the results (cell state and hidden state), the results are combined to obtain the complete result.
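Illustratively, the FIG. 6 scheme can be sketched as follows; the hidden size, the gate values, and the elementwise treatment of the gate combinations (as in the standard LSTM cell) are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H = 64                                   # hidden size (assumed)
rng = np.random.default_rng(1)
it, ft, ot = (sigmoid(rng.standard_normal(H)) for _ in range(3))
c_prev = rng.standard_normal(H)          # previous cell state ct-1
c_tmp = np.tanh(rng.standard_normal(H))  # temporary cell state c't

quarters = []
for k in range(4):                       # one iteration per slave core
    s = slice(k * H // 4, (k + 1) * H // 4)
    old = ft[s] * c_prev[s]              # old memory division (temp1)
    new = it[s] * c_tmp[s]               # new memory division
    ct_q = old + new                     # 1/4 of the current cell state
    ht_q = ot[s] * np.tanh(ct_q)         # 1/4 of the hidden state
    quarters.append((ct_q, ht_q))

ct = np.concatenate([q[0] for q in quarters])  # combine the 1/4 results
ht = np.concatenate([q[1] for q in quarters])
```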
It should be noted that, since the number of rows of the systolic array is limited, when the matrix multiplication input data is too large, it may be split by rows and steps S201 to S208 looped over the splits.
Because loading data into the cache area takes time, a double-buffer technique can be used to pipeline the loading, so that one part of the data is being loaded while another part is being computed on; the loading time is thereby hidden behind the computation, improving calculation efficiency. Optionally, with the double-buffer technique, each cache area first loads half of the input data, and the other half is loaded while the already-loaded half is being computed.
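A minimal sketch of the double-buffer (ping-pong) idea follows; load() and compute() are illustrative placeholders. On the processor, load() would be an asynchronous DMA GET into one half of the LDM, so it genuinely overlaps with compute() on the other half; plain Python runs them sequentially and only shows the structure.

```python
def load(tile):
    return list(tile)          # stand-in for a DMA GET into a buffer

def compute(buf):
    return sum(buf)            # stand-in for the matmul/vector work

def process_tiles(tiles):
    results = []
    buffers = [load(tiles[0]), None]   # prime buffer 0 with the first part
    for i in range(len(tiles)):
        if i + 1 < len(tiles):
            buffers[(i + 1) % 2] = load(tiles[i + 1])  # fetch next part...
        results.append(compute(buffers[i % 2]))        # ...while computing
    return results

print(process_tiles([[1, 2], [3, 4], [5, 6]]))  # [3, 7, 11]
```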
It should be noted that in the heterogeneous many-core acceleration processor, the computation in each slave core proceeds independently of the others. A SYNC instruction (an instruction for data synchronization) may therefore be applied at certain steps so that the slave cores can be synchronized when needed: for example, after a given computation step, the slave cores that have finished wait, via the SYNC instruction, for those that have not, and the next computation step begins only after all slave cores have completed. Optionally, the computation steps are synchronized between the slave cores by the SYNC instruction.
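A minimal sketch of the SYNC behavior using a thread barrier; the per-core work is a placeholder, and the barrier stands in for the SYNC instruction.

```python
import threading

NUM_CORES = 4
sync = threading.Barrier(NUM_CORES)      # stand-in for the SYNC instruction
results = [None] * NUM_CORES

def slave_core(rank):
    results[rank] = rank * rank          # stand-in for this core's matmul step
    sync.wait()                          # SYNC: every core reaches this point
    total = sum(results)                 # next step may now read peers' results

threads = [threading.Thread(target=slave_core, args=(r,)) for r in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```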
It should be noted that steps S201 to S203 are matrix multiplication operations, while steps S204 to S208 are vector operations. In the related art, one operator performs the matrix multiplication and stores its result in the main memory of the heterogeneous many-core acceleration processor; another operator then loads that result from main memory, performs the subsequent vector calculation, and stores the vector result back to main memory. In the present scheme, the result of the matrix multiplication is kept in the cache area within the slave core and need not be stored to main memory, which saves one round trip to main memory and one kernel launch (the kernel launch time being the time needed to start the slave cores participating in the calculation), thereby improving calculation efficiency.
Fig. 7 is a schematic diagram of the calculation flow in the related art. As shown in fig. 7, the HBM (High Bandwidth Memory) is the main memory of the heterogeneous many-core acceleration processor. The data required for the matrix multiplication operation is loaded (GET) from the HBM to the cache area LDM through DMA (Direct Memory Access), the calculation of one operator (kernel1) begins, the result of the matrix multiplication operation is obtained through the systolic array ACE and stored in the cache area, and that result is then stored (PUT) to main memory through DMA. Next, the calculation steps of the respective slave cores are synchronized by the SYNC instruction. Then, the data required for the vector operation is loaded from main memory, the calculation of another operator (kernel2) begins, the vector operation is performed through the non-matrix-multiplication calculation module SIMD (Single Instruction Multiple Data), and the result of the vector operation is stored (PUT) to main memory.
Fig. 8 is a schematic diagram illustrating the calculation flow according to an embodiment of the present invention. As shown in fig. 8, the data (including both the data required for the matrix multiplication operation and the data required for the vector operation) is first loaded (GET) from HBM to the cache area LDM through DMA, and the calculation of a single operator (kernel) begins. The result of the matrix multiplication operation is obtained through the systolic array ACE and stored in the cache area LDM; the calculation steps between cores are then synchronized through the SYNC instruction; the vector operation is performed through the non-matrix-multiplication calculation module SIMD (Single Instruction Multiple Data); and the result of the vector operation is stored (PUT) to main memory. Compared with fig. 7, one operator is eliminated and one round trip to main memory is saved, improving operation efficiency.
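The contrast between FIG. 7 and FIG. 8 can be sketched as follows; dma_get/dma_put and the dict standing in for HBM are illustrative placeholders, not a real accelerator API.

```python
import numpy as np

def dma_get(hbm, key):
    return hbm[key]            # stand-in for a DMA GET from main memory

def dma_put(hbm, key, value):
    hbm[key] = value           # stand-in for a DMA PUT to main memory

def related_art(hbm):
    a, b = dma_get(hbm, "matmul_inputs")
    dma_put(hbm, "intermediate", a @ b)   # kernel1: extra HBM write
    tmp = dma_get(hbm, "intermediate")    # kernel2: extra HBM read
    dma_put(hbm, "result", np.tanh(tmp))

def fused_kernel(hbm):
    a, b = dma_get(hbm, "matmul_inputs")  # one kernel, one load phase
    tmp = a @ b                           # intermediate stays in the LDM
    dma_put(hbm, "result", np.tanh(tmp))  # only the final result hits HBM

hbm = {"matmul_inputs": (np.eye(2), np.ones((2, 2)))}
fused_kernel(hbm)
```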
Alternatively, the calculation of one cell in the LSTM network is performed by the following formula:
i_t = σ(W_i × x_t + R_i × h_{t-1} + b_Wi + b_Ri)
c'_t = tanh(W_c × x_t + R_c × h_{t-1} + b_Wc + b_Rc)
f_t = σ(W_f × x_t + R_f × h_{t-1} + b_Wf + b_Rf)
o_t = σ(W_o × x_t + R_o × h_{t-1} + b_Wo + b_Ro)
c_t = f_t × c_{t-1} + i_t × c'_t
h_t = o_t × tanh(c_t)
wherein i_t is the input gate result, c'_t is the temporary cell state, f_t is the forgotten gate result, o_t is the output gate result, x_t is the first left matrix data, h_{t-1} is the second left matrix data, W_i is the first right matrix data, R_i is the second right matrix data, W_c is the third right matrix data, R_c is the fourth right matrix data, W_f is the fifth right matrix data, R_f is the sixth right matrix data, W_o is the seventh right matrix data, R_o is the eighth right matrix data, b_Wi is the offset data vector corresponding to W_i × x_t, b_Ri is the offset data vector corresponding to R_i × h_{t-1}, b_Wc is the offset data vector corresponding to W_c × x_t, b_Rc is the offset data vector corresponding to R_c × h_{t-1}, b_Wf is the offset data vector corresponding to W_f × x_t, b_Rf is the offset data vector corresponding to R_f × h_{t-1}, b_Wo is the offset data vector corresponding to W_o × x_t, b_Ro is the offset data vector corresponding to R_o × h_{t-1}, c_{t-1} is the previous cell state, c_t is the cell state of the current cell, and h_t is the hidden state of the current cell.
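A minimal NumPy sketch of the cell formulas above, using the row-vector convention x_t @ W (so each W has shape input-size × hidden-size) and treating the last two combinations as elementwise products, as in the standard LSTM cell; all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, R, bW, bR):
    """One LSTM cell step following the formulas above.

    W: input weights (IS, HS) per gate; R: recurrent weights (HS, HS)
    per gate; bW, bR: bias vectors (HS,) per gate, keyed "i","c","f","o".
    """
    i_t = sigmoid(x_t @ W["i"] + h_prev @ R["i"] + bW["i"] + bR["i"])
    c_tmp = np.tanh(x_t @ W["c"] + h_prev @ R["c"] + bW["c"] + bR["c"])
    f_t = sigmoid(x_t @ W["f"] + h_prev @ R["f"] + bW["f"] + bR["f"])
    o_t = sigmoid(x_t @ W["o"] + h_prev @ R["o"] + bW["o"] + bR["o"])
    c_t = f_t * c_prev + i_t * c_tmp      # elementwise, per the formulas
    h_t = o_t * np.tanh(c_t)
    return c_t, h_t
```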
All the steps above constitute the operation process of one cell in the LSTM network. Further, the LSTM network includes M rows by N columns of cells. In each row of cells, the current input matrix, the cell state of the previous cell, and the hidden state of the previous cell are taken as inputs of the next cell; the leftmost cell of each row takes a preset initial cell state and a preset initial hidden state as inputs, and these initial inputs are not necessarily the same for every row. In each column of cells, the hidden state of a cell serves as the current input matrix of the cell above it; the bottom cell of each column takes a preset initial current input matrix as input, and these initial inputs are not necessarily the same for every column. When the LSTM network is processed, the cells in the lowest row are operated on first, then the cells in the rows above. The output of the rightmost cell of each row and the output of the uppermost cell of each column can be stored to main memory for calculating the loss function of the LSTM network, which may in turn be used for back-propagation updating of the LSTM network.
Illustratively, FIG. 9 is a schematic diagram of a cell array according to an embodiment of the present invention. As shown in FIG. 9, each rectangle represents a cell: cX_0~cX_2 are the initial cell states of the rows, hX_0~hX_2 are the initial hidden states of the rows, and X_0~X_2 are the initial current input matrices of the bottom cells of the columns. Each cell has its own cell state, hidden state, and current input matrix; each cell outputs its hidden state to both the cell on its right and the cell above it, and for cells not in the bottom row the current input matrix equals the hidden state of the cell below. cY_0~cY_2 are the cell states of the last cell in each row (stored to main memory), hY_0~hY_2 are the hidden states of the last cell in each row (stored to main memory), and Y_0~Y_2 are the hidden states of the uppermost cell in each column (stored to main memory).
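A minimal sketch of this grid schedule, reusing the lstm_cell sketch above; the per-row parameter sets and the calling convention are assumptions for illustration.

```python
def run_grid(X_bottom, init_c, init_h, params, lstm_cell):
    """Process the M x N cell grid: bottom row first, left to right;
    each row's hidden states become the current inputs of the row above.

    X_bottom: the N initial current input matrices of the bottom row;
    init_c/init_h: per-row initial cell/hidden states; params: per-row
    (W, R, bW, bR) tuples as assumed in the lstm_cell sketch.
    """
    inputs = X_bottom
    row_outputs = []
    for m in range(len(params)):             # lowest row first
        c, h = init_c[m], init_h[m]          # leftmost inputs of this row
        hiddens = []
        for x_t in inputs:                   # left to right along the row
            c, h = lstm_cell(x_t, h, c, *params[m])
            hiddens.append(h)
        row_outputs.append((c, h))           # rightmost cell: cY_m, hY_m
        inputs = hiddens                     # hidden states feed the row above
    return inputs, row_outputs               # top-row hiddens Y_n; per-row ends
```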
According to the LSTM network processing method described above, the heterogeneous many-core acceleration processor distributes the LSTM network calculation across the slave cores; the slave cores exchange data through the data high-speed access channel, and every intermediate result of the calculation is accessed through the cache area within each slave core, without being stored to main memory. This removes the step of loading intermediate results from main memory, reduces the number of memory accesses, and keeps every slave core participating in the calculation without idling, thereby fully utilizing the computing resources and improving calculation efficiency.
Moreover, according to the characteristics of the LSTM network, the slave cores are arranged into an array and grouped by rows, so that the groups calculate the input gate result, temporary cell state, forgotten gate result, and output gate result in parallel, and the calculation process can be controlled by rows or by columns, improving the degree of parallelism. Because data interaction between slave cores in adjacent rows is faster, the calculation tasks of each row of slave cores can be distributed sensibly, improving calculation efficiency.
Bandwidth utilization is further improved by fusing the matrix multiplication and vector operations to reduce the number of memory accesses, hiding load time with the double-buffer technique, accelerating matrix multiplication with the systolic array, computing in parallel with SIMD instructions, and fetching data through the inter-core data high-speed access channel.
This embodiment also provides an LSTM network processing device, which implements the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
The present embodiment provides an LSTM network processing apparatus, as shown in fig. 10, including:
the first data obtaining module 1001 is configured to obtain, for each cell in the LSTM network, matrix multiplication input data that needs to be operated by the current cell, where the matrix multiplication input data includes data corresponding to a hidden state of a previous cell.
The data grouping module 1002 is configured to divide the matrix multiplication input data according to a result category to be operated by the current cell, to obtain a plurality of groups of matrices to be operated, where each group of matrices to be operated is used for data required by matrix multiplication operation corresponding to an operation result of the current cell, and the result category includes an input gate result, a forget gate result, an output gate result, and a temporary cell state.
The matrix multiplication operation module 1003 is configured to load each group of matrices to be operated to a cache area corresponding to a corresponding slave core group, perform matrix multiplication operation of the matrices to be operated of the corresponding group by the slave core group, and store a calculation result of the matrix multiplication operation in the cache area, where each group of matrices to be operated corresponds to a group of slave core groups, each slave core of each group performs a part of the matrix multiplication operation, stores an intermediate calculation result in a respective cache area, and invokes the intermediate calculation result between the slave cores through a data high-speed access channel.
A second data obtaining module 1004, configured to obtain a bias data vector corresponding to each result category;
the vector operation module 1005 is configured to superimpose a calculation result of the matrix multiplication operation corresponding to the input gate result and a corresponding offset data vector, so as to obtain an input gate result vector; superposing the calculation result of matrix multiplication operation corresponding to the forgetting gate result and the corresponding offset data vector to obtain a forgetting gate result vector; superposing the calculated result of matrix multiplication operation corresponding to the output gate result and the corresponding offset data vector to obtain an output gate result vector; superposing the calculated result of the matrix multiplication operation corresponding to the temporary cell state and the corresponding offset data vector to obtain a temporary cell state vector;
an activation module 1006, configured to activate the input gate result vector by using a first activation function to obtain an input gate result of the current cell; activating the forgotten gate result vector through a first activation function to obtain the forgotten gate result of the current cell; activating the output gate result vector through a first activation function to obtain an output gate result of the current cell; activating the temporary cell state vector through a second activation function to obtain the temporary cell state of the current cell;
The cell state operation module 1007 is configured to calculate a cell state of the current cell according to the forgotten gate result vector, the previous cell state, the input gate result vector, and the temporary cell state, where the cell state of the current cell is used for calculating the next cell state;
the hidden state operation module 1008 is configured to calculate, according to the cell state of the current cell and the output gate result vector, a hidden state of the current cell, where the hidden state of the current cell is used as a part of matrix multiplication input data required by the next cell to participate in operation of the next cell.
In an alternative embodiment, the slave cores are arranged in A rows by B columns, wherein each row of slave cores is a group; dividing the matrix multiplication input data according to the result category to be operated by the current cell to obtain a plurality of groups of matrices to be operated includes: splitting the first left matrix data evenly by columns and loading it into each slave core respectively, wherein slave cores located in the same column load the same first left matrix division data; splitting the second left matrix data evenly by columns and loading it into each slave core respectively, wherein slave cores located in the same column load the same second left matrix division data; loading the first right matrix data, second right matrix data, third right matrix data, fourth right matrix data, fifth right matrix data, sixth right matrix data, seventh right matrix data and eighth right matrix data into the slave cores respectively, wherein the first right matrix data, the second right matrix data, the third right matrix data, the fourth right matrix data, the fifth right matrix data, the sixth right matrix data, the seventh right matrix data and the eighth right matrix data are split evenly by columns and respectively loaded into each slave core of the corresponding slave core group; the first right matrix data and the second right matrix data are used for calculating an input gate result, the third right matrix data and the fourth right matrix data are used for calculating a temporary cell state, the fifth right matrix data and the sixth right matrix data are used for calculating a forgetting gate result, and the seventh right matrix data and the eighth right matrix data are used for calculating an output gate result; the first left matrix data is used for performing matrix multiplication operations with the first right matrix data, the third right matrix data, the fifth right matrix data and the seventh right matrix data respectively; the second left matrix data is used for performing matrix multiplication operations with the second right matrix data, the fourth right matrix data, the sixth right matrix data and the eighth right matrix data respectively.
In an alternative embodiment, loading each group of matrices to be operated on to a cache area corresponding to a corresponding slave core group, performing matrix multiplication operation on the matrices to be operated on by the slave core group, and storing a calculation result of the matrix multiplication operation in the cache area, where the method includes: performing matrix multiplication operation on each first left matrix sub data and each first right matrix sub data and accumulating to obtain a first intermediate result, performing matrix multiplication operation on each second left matrix sub data and each second right matrix sub data and accumulating to obtain a second intermediate result, and adding the first intermediate result and the second intermediate result to obtain a matrix multiplication operation calculation result corresponding to an input gate result; performing matrix multiplication operation on each first left matrix sub data and each third right matrix sub data and accumulating to obtain a third intermediate result, performing matrix multiplication operation on each second left matrix sub data and each fourth right matrix sub data and accumulating to obtain a fourth intermediate result, and adding the third intermediate result and the fourth intermediate result to obtain a matrix multiplication operation calculation result corresponding to the temporary cell state; performing matrix multiplication operation on each first left matrix sub data and each fifth right matrix sub data and accumulating to obtain a fifth intermediate result, performing matrix multiplication operation on each second left matrix sub data and each sixth right matrix sub data and accumulating to obtain a sixth intermediate result, and adding the fifth intermediate result and the sixth intermediate result to obtain a matrix multiplication operation calculation result corresponding to the forgetting gate result; and performing matrix multiplication operation on each first left matrix sub data and each seventh right matrix sub data and accumulating to obtain a seventh intermediate result, performing matrix multiplication operation on each second left matrix sub data and each eighth right matrix sub data and accumulating to obtain an eighth intermediate result, and adding the seventh intermediate result and the eighth intermediate result to obtain a matrix multiplication operation calculation result corresponding to the output gate result.
In an alternative embodiment, the calculating the cell state of the current cell according to the forgotten gate result, the previous cell state, the input gate result and the temporary cell state includes: performing matrix multiplication operation on the forgotten gate result and the previous cell state to obtain an old memory result; performing matrix multiplication operation on the input gate result and the temporary cell state to obtain a new memory result; adding the old memory result and the new memory result to obtain the cell state of the current cell.
In an alternative embodiment, each slave core respectively acquires the forgotten gate score result in the slave core of the column for calculating the forgotten gate result through the data high-speed access channel; each slave core acquires the previous cell division state respectively; performing matrix multiplication operation on the forgotten gate division result and the cell division state to obtain an old memory division result and storing the old memory division result in a cache area; each slave core respectively acquires input gate division results in the slave core of the column for calculating the input gate results through the data high-speed access channel; each slave core respectively acquires temporary cell fraction states in the slave core of the column for calculating the temporary cell states through the data high-speed access channel; performing matrix multiplication operation on the input gate division result and the temporary cell division state to obtain a new memory division result and storing the new memory division result in a cache area; each slave core loads the corresponding old memory score result and new memory score result from the respective cache region, adds the old memory score result and the new memory score result to obtain the cell score state of the current cell, and stores the cell score state into the cache region.
In an alternative embodiment, the calculating the hidden state of the current cell according to the cell state of the current cell and the output gate result includes: activating the cell state of the current cell through the second activation function to obtain an intermediate activation result; and performing matrix multiplication operation on the output gate result and the intermediate activation result to obtain the hidden state of the current cell.
In an alternative embodiment, each slave core loads the cell fraction status of the corresponding current cell from the respective cache region; activating the cell division state of the current cell through the second activation function to obtain an intermediate activation division result; each slave core respectively acquires output gate division results in the slave core of the column for calculating the output gate results through the data high-speed access channel; and performing matrix multiplication operation on the output gate division result and the intermediate activation division result to obtain the hidden division state of the current cell.
Further functional descriptions of the above respective modules are the same as those of the above corresponding embodiments, and are not repeated here.
The LSTM network processing device in this embodiment is presented in the form of functional units, where the units refer to ASIC (Application Specific Integrated Circuit) circuits, processors and memories executing one or more software or fixed programs, and/or other devices that can provide the above-described functions.
The embodiment of the invention also provides a computer device which is provided with the LSTM network processing device shown in the figure 10.
Referring to fig. 11, which is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention, the computer device includes: one or more processors 10, memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the computer device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 11.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods shown in implementing the above embodiments.
The memory 20 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device further comprises input means 30 and output means 40. The processor 10, memory 20, input device 30, and output device 40 may be connected by a bus or other means, for example in fig. 11.
The input device 30 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus, and may be, for example, a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointer stick, one or more mouse buttons, a trackball, a joystick, and the like. The output means 40 may include a display device, auxiliary lighting means (e.g., LEDs), tactile feedback means (e.g., vibration motors), and the like. Such display devices include, but are not limited to, liquid crystal displays, light emitting diode displays, and plasma displays. In some alternative implementations, the display device may be a touch screen.
The embodiments of the present invention also provide a computer-readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code recorded on a storage medium, or as computer code originally stored in a remote storage medium or a non-transitory machine-readable storage medium and downloaded through a network to be stored in a local storage medium, so that the method described herein may be processed by such software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk, or the like; further, the storage medium may also comprise a combination of memories of the kinds described above. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims (9)

1. An LSTM network processing method, applied to a heterogeneous many-core acceleration processor, where the heterogeneous many-core acceleration processor includes a plurality of groups of slave cores, each slave core includes a cache region, a data high-speed access channel is provided between the slave cores, and the LSTM network includes an array formed by a plurality of cells, where the method comprises:
for each cell in the LSTM network, obtaining matrix multiplication input data which needs to be operated by the current cell, wherein the matrix multiplication input data comprises data corresponding to the hidden state of the previous cell;
dividing the matrix multiplication input data according to the result category of the current cell to be operated to obtain a plurality of groups of matrices to be operated, wherein each group of matrices to be operated is used for data required by matrix multiplication operation corresponding to one operation result of the current cell, and the result category comprises an input gate result, a temporary cell state, a forget gate result and an output gate result;
loading each group of matrix to be operated into a cache area corresponding to a corresponding slave core group, carrying out matrix multiplication operation of the matrix to be operated of the corresponding group by the slave core group, and storing the calculation result of the matrix multiplication operation in the cache area, wherein each group of matrix to be operated corresponds to one group of slave core groups, each slave core of each group respectively executes a part of the matrix multiplication operation, and stores an intermediate calculation result in each cache area, and each slave core invokes the intermediate calculation result through a data high-speed access channel;
Obtaining bias data vectors corresponding to each result category;
superposing the calculated result of the matrix multiplication operation corresponding to the input gate result and the corresponding offset data vector to obtain an input gate result vector; superposing the calculation result of the matrix multiplication operation corresponding to the temporary cell state and the corresponding offset data vector to obtain a temporary cell state vector; superposing the calculation result of matrix multiplication operation corresponding to the forgetting gate result and the corresponding offset data vector to obtain a forgetting gate result vector; superposing the calculated result of the matrix multiplication operation corresponding to the output gate result and the corresponding offset data vector to obtain an output gate result vector;
activating the input gate result vector through a first activation function to obtain an input gate result of the current cell; activating the temporary cell state vector through a second activation function to obtain the temporary cell state of the current cell; activating the forgotten gate result vector through a first activation function to obtain the forgotten gate result of the current cell; activating the output gate result vector through a first activation function to obtain an output gate result of the current cell;
Calculating the cell state of the current cell according to the forgotten gate result, the previous cell state, the input gate result and the temporary cell state, wherein the cell state of the current cell is used for calculating the next cell state;
calculating the hidden state of the current cell according to the cell state of the current cell and the output gate result, wherein the hidden state of the current cell is used as a part of matrix multiplication input data required by the next cell to participate in the operation of the next cell;
the slave cores are arranged in A rows by B columns, wherein each row of slave cores is a group;
dividing the matrix multiplication input data according to the result category to be operated by the current cell to obtain a plurality of groups of matrices to be operated, wherein the method comprises the following steps:
splitting the first left matrix data evenly by columns and loading it into each slave core respectively, wherein slave cores located in the same column load the same first left matrix division data; splitting the second left matrix data evenly by columns and loading it into each slave core respectively, wherein slave cores located in the same column load the same second left matrix division data;
loading the first right matrix data, second right matrix data, third right matrix data, fourth right matrix data, fifth right matrix data, sixth right matrix data, seventh right matrix data and eighth right matrix data into the slave cores respectively, wherein the first right matrix data, the second right matrix data, the third right matrix data, the fourth right matrix data, the fifth right matrix data, the sixth right matrix data, the seventh right matrix data and the eighth right matrix data are split evenly by columns and respectively loaded into each slave core of the corresponding slave core group;
The first right matrix data and the second right matrix data are used for calculating an input gate result, the third right matrix data and the fourth right matrix data are used for calculating a temporary cell state, the fifth right matrix data and the sixth right matrix data are used for calculating a forgetting gate result, and the seventh right matrix data and the eighth right matrix data are used for calculating an output gate result;
the first left matrix data is used for performing matrix multiplication operation with the first right matrix data, the third right matrix data, the fifth right matrix data and the seventh right matrix data respectively; the second left matrix data is used for performing matrix multiplication operation with the second right matrix data, the fourth right matrix data, the sixth right matrix data and the eighth right matrix data respectively.
2. The method of claim 1, wherein loading each group of matrices to be operated on to a corresponding cache region of a corresponding slave core group, performing a matrix multiplication operation of the matrices to be operated on by the slave core group of the corresponding group, and storing a calculation result of the matrix multiplication operation in the cache region, comprises:
performing matrix multiplication operation on each first left matrix sub data and each first right matrix sub data and accumulating to obtain a first intermediate result, performing matrix multiplication operation on each second left matrix sub data and each second right matrix sub data and accumulating to obtain a second intermediate result, and adding the first intermediate result and the second intermediate result to obtain a matrix multiplication operation calculation result corresponding to an input gate result;
Performing matrix multiplication operation on each first left matrix sub data and each third right matrix sub data and accumulating to obtain a third intermediate result, performing matrix multiplication operation on each second left matrix sub data and each fourth right matrix sub data and accumulating to obtain a fourth intermediate result, and adding the third intermediate result and the fourth intermediate result to obtain a matrix multiplication operation calculation result corresponding to the temporary cell state;
performing matrix multiplication operation on each first left matrix sub data and each fifth right matrix sub data and accumulating to obtain a fifth intermediate result, performing matrix multiplication operation on each second left matrix sub data and each sixth right matrix sub data and accumulating to obtain a sixth intermediate result, and adding the fifth intermediate result and the sixth intermediate result to obtain a matrix multiplication operation calculation result corresponding to the forgetting gate result;
and performing matrix multiplication operation on each first left matrix sub data and each seventh right matrix sub data and accumulating to obtain a seventh intermediate result, performing matrix multiplication operation on each second left matrix sub data and each eighth right matrix sub data and accumulating to obtain an eighth intermediate result, and adding the seventh intermediate result and the eighth intermediate result to obtain a matrix multiplication operation calculation result corresponding to the output gate result.
3. The method of claim 2, wherein said calculating the cell state of the current cell based on the forgotten gate result, a previous cell state, the input gate result, and a temporary cell state comprises:
performing matrix multiplication operation on the forgotten gate result and the previous cell state to obtain an old memory result; performing matrix multiplication operation on the input gate result and the temporary cell state to obtain a new memory result; and adding the old memory result and the new memory result to obtain the cell state of the current cell.
4. A method according to claim 3, characterized in that the method further comprises:
each slave core respectively acquires the forgotten gate division result in the slave core of the column for calculating the forgotten gate result through the data high-speed access channel; each slave core acquires the previous cell division state respectively; performing matrix multiplication operation on the forgotten gate division result and the cell division state to obtain an old memory division result and storing the old memory division result in a cache area;
each slave core respectively acquires input gate division results in the slave core of the column for calculating the input gate results through the data high-speed access channel; each slave core respectively acquires temporary cell fraction states in the slave core of the column for calculating the temporary cell states through the data high-speed access channel; performing matrix multiplication operation on the input gate division result and the temporary cell division state to obtain a new memory division result and storing the new memory division result in a cache area;
Each slave core loads the corresponding old memory score result and new memory score result from the respective cache region, adds the old memory score result and the new memory score result to obtain the cell score state of the current cell, and stores the cell score state into the cache region.
5. The method of claim 4, wherein calculating the hidden state of the current cell based on the cell state of the current cell and the output gate result comprises:
activating the cell state of the current cell through the second activation function to obtain an intermediate activation result; and performing matrix multiplication operation on the output gate result and the intermediate activation result to obtain the hidden state of the current cell.
6. The method of claim 5, wherein the method further comprises:
each slave core loads the cell fraction status of the corresponding current cell from the respective cache region;
activating the cell division state of the current cell through the second activation function to obtain an intermediate activation division result;
each slave core respectively acquires output gate division results in the slave core of the column for calculating the output gate results through the data high-speed access channel;
And performing matrix multiplication operation on the output gate division result and the intermediate activation division result to obtain the hidden division state of the current cell.
7. An LSTM network processing apparatus, applied to a heterogeneous many-core acceleration processor, the heterogeneous many-core acceleration processor including a plurality of groups of slave cores, each slave core including a cache region, with a data high-speed access channel between the slave cores, the LSTM network comprising an array of a plurality of cells, the apparatus comprising:
the first data acquisition module is used for acquiring matrix multiplication input data which need to be operated by the current cell for each cell in the LSTM network, wherein the matrix multiplication input data comprises data corresponding to the hidden state of the previous cell;
the data grouping module is used for dividing the matrix multiplication input data according to the result category of the operation required by the current cell to obtain a plurality of groups of matrices to be operated, wherein each group of matrices to be operated is used for the data required by matrix multiplication operation corresponding to one operation result of the current cell, and the result category comprises an input gate result, a forgetting gate result, an output gate result and a temporary cell state;
The matrix multiplication operation module is used for loading each group of matrix to be operated to a cache area corresponding to the corresponding slave core group, carrying out matrix multiplication operation of the matrix to be operated of the corresponding group by the slave core group, and storing the calculation result of the matrix multiplication operation in the cache area, wherein each group of matrix to be operated corresponds to one group of slave core group, each slave core of each group respectively executes a part of the matrix multiplication operation, stores the intermediate calculation result in the respective cache area, and calls the intermediate calculation result through a data high-speed access channel among the slave cores;
the second data acquisition module is used for acquiring offset data vectors corresponding to each result category;
the vector operation module is used for superposing the calculated result of the matrix multiplication operation corresponding to the input gate result and the corresponding offset data vector to obtain an input gate result vector; superposing the calculation result of matrix multiplication operation corresponding to the forgetting gate result and the corresponding offset data vector to obtain a forgetting gate result vector; superposing the calculated result of the matrix multiplication operation corresponding to the output gate result and the corresponding offset data vector to obtain an output gate result vector; superposing the calculation result of the matrix multiplication operation corresponding to the temporary cell state and the corresponding offset data vector to obtain a temporary cell state vector;
The activation module is used for activating the input gate result vector through a first activation function to obtain the input gate result of the current cell; activating the forgotten gate result vector through a first activation function to obtain the forgotten gate result of the current cell; activating the output gate result vector through a first activation function to obtain an output gate result of the current cell; activating the temporary cell state vector through a second activation function to obtain the temporary cell state of the current cell;
the cell state operation module is used for calculating the cell state of the current cell according to the forgotten gate result, the previous cell state, the input gate result and the temporary cell state, wherein the cell state of the current cell is used for calculating the next cell state;
the hidden state operation module is used for calculating the hidden state of the current cell according to the cell state of the current cell and the output gate result, wherein the hidden state of the current cell is used for taking part in the operation of the next cell as a part of matrix multiplication input data required by the next cell;
the slave cores are arranged in A rows by B columns, wherein each row of slave cores is a group;
Dividing the matrix multiplication input data according to the result category to be operated by the current cell to obtain a plurality of groups of matrices to be operated, wherein the method comprises the following steps:
splitting the first left matrix data evenly by columns and loading it into each slave core respectively, wherein slave cores located in the same column load the same first left matrix division data; splitting the second left matrix data evenly by columns and loading it into each slave core respectively, wherein slave cores located in the same column load the same second left matrix division data;
loading the first right matrix data, second right matrix data, third right matrix data, fourth right matrix data, fifth right matrix data, sixth right matrix data, seventh right matrix data and eighth right matrix data into the slave cores respectively, wherein the first right matrix data, the second right matrix data, the third right matrix data, the fourth right matrix data, the fifth right matrix data, the sixth right matrix data, the seventh right matrix data and the eighth right matrix data are split evenly by columns and respectively loaded into each slave core of the corresponding slave core group;
the first right matrix data and the second right matrix data are used for calculating an input gate result, the third right matrix data and the fourth right matrix data are used for calculating a temporary cell state, the fifth right matrix data and the sixth right matrix data are used for calculating a forgetting gate result, and the seventh right matrix data and the eighth right matrix data are used for calculating an output gate result;
The first left matrix data is used for performing matrix multiplication operation with the first right matrix data, the third right matrix data, the fifth right matrix data and the seventh right matrix data respectively; the second left matrix data is used for performing matrix multiplication operation with the second right matrix data, the fourth right matrix data, the sixth right matrix data and the eighth right matrix data respectively.
8. A computer device, comprising:
a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the LSTM network processing method of any of claims 1 to 6.
9. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the LSTM network processing method of any one of claims 1 to 6.
CN202311077196.XA 2023-08-25 2023-08-25 LSTM network processing method, device, equipment and readable storage medium Active CN116805155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311077196.XA CN116805155B (en) 2023-08-25 2023-08-25 LSTM network processing method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN116805155A CN116805155A (en) 2023-09-26
CN116805155B true CN116805155B (en) 2024-01-19

Family

ID=88079769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311077196.XA Active CN116805155B (en) 2023-08-25 2023-08-25 LSTM network processing method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116805155B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108446253A (en) * 2018-03-28 2018-08-24 Parallel computing method for sparse matrix-vector multiplication on the Sunway (Shenwei) architecture
CN108446761A (en) * 2018-03-23 2018-08-24 Neural network accelerator and data processing method
CN110689123A (en) * 2019-09-27 2020-01-14 Long short-term memory neural network forward acceleration system and method based on systolic array
WO2020142973A1 (en) * 2019-01-10 2020-07-16 Alibaba Group Holding Limited Matrix-based instruction set architecture for neural network
CN113191494A (en) * 2021-05-24 2021-07-30 南京航空航天大学 Efficient LSTM accelerator based on FPGA
WO2021243839A1 (en) * 2020-06-04 2021-12-09 南京博芯电子技术有限公司 Composite-granularity, near-storage and approximation-based acceleration structure and method for long short-term memory network
CN115526307A (en) * 2022-08-24 2022-12-27 中国科学院半导体研究所 Network model compression method and device, electronic equipment and storage medium
CN116431562A (en) * 2023-06-12 2023-07-14 太初(无锡)电子科技有限公司 Multi-head attention mechanism fusion calculation distribution method based on acceleration processor

Also Published As

Publication number Publication date
CN116805155A (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN110210610B (en) Convolution calculation accelerator, convolution calculation method and convolution calculation device
KR101959376B1 (en) Systems and methods for a multi-core optimized recurrent neural network
CN111242289B (en) Convolutional neural network acceleration system and method with expandable scale
CN111667051B (en) Neural network accelerator applicable to edge equipment and neural network acceleration calculation method
CN106951926A (en) The deep learning systems approach and device of a kind of mixed architecture
CN112084038B (en) Memory allocation method and device of neural network
KR20130090147A (en) Neural network computing apparatus and system, and method thereof
CN109284824B (en) Reconfigurable technology-based device for accelerating convolution and pooling operation
CN110058882B (en) OPU instruction set definition method for CNN acceleration
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
EP2738675B1 (en) System and method for efficient resource management of a signal flow programmed digital signal processor code
CN110580519B (en) Convolution operation device and method thereof
CN111325332A (en) Convolutional neural network processing method and device
CN112149047A (en) Data processing method and device, storage medium and electronic device
CN116805155B (en) LSTM network processing method, device, equipment and readable storage medium
CN111722923A (en) Heterogeneous resource calling method and device and computer readable storage medium
CN116303219A (en) Grid file acquisition method and device and electronic equipment
CN112732638B (en) Heterogeneous acceleration system and method based on CTPN network
CN115346099A (en) Image convolution method, chip, equipment and medium based on accelerator chip
CN114519425A (en) Convolution neural network acceleration system with expandable scale
KR20210014897A (en) Matrix operator and matrix operation method for artificial neural network
CN113554162B (en) Axon input extension method, device, equipment and storage medium
US20220207323A1 (en) Architecture and cluster of processing elements and operating method for convolution
CN116980277B (en) Data processing method, device, computer equipment and storage medium
CN117112145B (en) Training model distribution method, training model distribution device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant