CN111723913A - Data processing method, device and equipment and readable storage medium - Google Patents

Data processing method, device and equipment and readable storage medium

Info

Publication number
CN111723913A
CN111723913A
Authority
CN
China
Prior art keywords
data
processing
matrix
fpga
media object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010567702.3A
Other languages
Chinese (zh)
Inventor
刘海威
董刚
赵雅倩
李仁刚
杨宏斌
蒋东东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN202010567702.3A priority Critical patent/CN111723913A/en
Publication of CN111723913A publication Critical patent/CN111723913A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data processing method, a device, equipment and a readable storage medium, wherein the method comprises the following steps: acquiring a media object to be processed, and inputting the media object to an LSTM network; acquiring target data which is generated in the process of processing the media object by the LSTM network and needs to be processed by using a gate structure; rearranging the target data by using the parallelism parameter of the FPGA to obtain parallel data; performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result; and feeding back the processing result to the LSTM network for continuous processing to obtain an output result of the media object. The method can accelerate the LSTM network by utilizing the FPGA, so that the LSTM network can be applied to the embedded equipment to process the media object, and the service function of the embedded equipment is enhanced.

Description

Data processing method, device and equipment and readable storage medium
Technical Field
The present invention relates to the field of computer application technologies, and in particular, to a data processing method, apparatus, device, and readable storage medium.
Background
LSTM (Long Short-Term Memory network, a time-recursive neural network) is, thanks to its unique design structure, well suited to processing and predicting important events with very long intervals and delays in a time series. LSTM is widely used for processing media objects in the fields of speech recognition, machine translation, language modeling, emotion analysis and text prediction. Since these fields are very close to daily life, the use of LSTM on mobile terminals is becoming more and more widespread, and how to efficiently use an LSTM network to meet application needs has become a hot research topic.
The most important structure in the LSTM network is the gate structure (comprising an input gate, an output gate, a forgetting gate and a cell gate), so the LSTM network can be run on a CPU, a GPU or an ASIC (Application-Specific Integrated Circuit) capable of supporting gate structure arithmetic. However, for reasons of cost and power consumption, CPUs and GPUs are difficult to use for LSTM computation in embedded devices, while an ASIC has the advantages of high performance and low power consumption when used to accelerate LSTM network computation but the disadvantage of insufficient flexibility.
In summary, how to effectively solve the problem of supporting the LSTM network in the embedded device to provide the media object processing service is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a data processing method, a data processing device, data processing equipment and a readable storage medium, which can accelerate an LSTM network by utilizing an FPGA (Field-Programmable Gate Array), so that the LSTM network can be applied to embedded equipment to process a media object, enhancing the service function of the embedded equipment.
In order to solve the technical problems, the invention provides the following technical scheme:
a method of data processing, comprising:
acquiring a media object to be processed, and inputting the media object to an LSTM network;
acquiring target data which is generated in the process of processing the media object by the LSTM network and needs to be processed by using a gate structure;
rearranging the target data by using the parallelism parameter of the FPGA to obtain parallel data;
performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result;
and feeding back the processing result to the LSTM network for continuous processing to obtain an output result of the media object.
Preferably, performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result includes:
and performing parallel matrix vector multiplication processing on the parallel data by using the matrix vector multiplication unit group matched with the parallelism parameter in the FPGA to obtain the processing result.
Preferably, the matrix vector multiplication unit group includes: the device comprises a multiplication unit, an addition unit, an accumulation unit and an offset addition unit, wherein the number of multipliers in the multiplication unit corresponds to the parallelism parameter.
Preferably, the parallel data comprises a weight data matrix, an input data matrix and an offset data matrix, and performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result includes the following steps:
reading, from the input weight data matrix and the input data matrix respectively, the to-be-multiplied groups corresponding to the parallelism parameter, wherein each to-be-multiplied group comprises weight data and input data;
inputting each to-be-multiplied group into a corresponding multiplier in the multiplication unit for multiplication to obtain the corresponding multiplication results;
inputting the multiplication results into the addition unit for summation, and accumulating the outputs of the addition unit with the accumulation unit to obtain a multiply-accumulate result;
and inputting the multiply-accumulate result and the corresponding offset data in the offset data matrix into the offset addition unit for superposition to obtain the processing result.
Preferably, the addition unit is a pipeline adder.
Preferably, rearranging the target data by using a parallelism parameter of the FPGA to obtain parallel data includes:
determining a data storage width by using the parallelism parameter;
and rearranging the target data according to the data storage width to obtain the parallel data.
Preferably, the method further comprises the following steps:
receiving and analyzing a delay control request;
and modifying the parallelism parameter by using the analysis result.
A data processing apparatus comprising:
the media object acquisition module is used for acquiring a media object to be processed and inputting the media object to the LSTM network;
a target data acquisition module, configured to acquire target data that needs to be processed by using a gate structure and is generated in the process of processing the media object by the LSTM network;
the data arrangement module is used for rearranging the target data by utilizing the parallelism parameter of the FPGA to obtain parallel data;
the parallel acceleration processing module is used for performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result;
and the result acquisition module is used for feeding back the processing result to the LSTM network for continuous processing to obtain the output result of the media object.
A data processing apparatus comprising:
a memory for storing a computer program;
a processor for implementing the steps of the data processing method when executing the computer program.
A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned data processing method.
By applying the method provided by the embodiment of the invention, the media object to be processed is obtained and is input into the LSTM network; acquiring target data which is generated in the process of processing the media object by the LSTM network and needs to be processed by using a gate structure; rearranging the target data by using the parallelism parameter of the FPGA to obtain parallel data; performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result; and feeding back the processing result to the LSTM network for continuous processing to obtain an output result of the media object.
It has been found that the calculation of matrix and vector multiplication in the gate structure of the LSTM network occupies a significant portion of the computational effort. Therefore, in the method, the matrix vector multiplication unit realized in the FPGA can be used for carrying out parallel accelerated processing on the target data which is generated and needs to be processed by using the gate structure in the process of processing the media object by the LSTM network. Specifically, in order to utilize the parallel characteristic of the FPGA, firstly, the target data is rearranged by using the parallelism parameter of the FPGA to obtain parallel data. Then, the matrix vector multiplication unit in the FPGA is used for processing the parallel data, and the gate structure processing can be completed quickly. The LSTM network may then obtain the output of the media object.
Traditional hardware such as CPUs and GPUs is, for cost and power reasons, difficult to use for LSTM network acceleration on the embedded side, while an ASIC offers high performance and low power consumption but lacks flexibility and adapts poorly to varied algorithm requirements. By contrast, the method provides an FPGA-based accelerated LSTM network and speeds up the gate structure calculation in the LSTM network with parallel computation, thereby accelerating the LSTM network, supporting it in embedded devices, and effectively processing the target media, i.e., realizing functions such as speech recognition, machine translation, language modeling, emotion analysis and text prediction in embedded devices.
Accordingly, embodiments of the present invention further provide a data processing apparatus, a device and a readable storage medium corresponding to the data processing method, which have the above technical effects and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of an embodiment of a data processing method;
FIG. 2 is a schematic diagram of the structure of an LSTM;
FIG. 3 is a diagram illustrating a data arrangement according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an embodiment of a data processing method according to the present invention;
FIG. 5 is a schematic block diagram of an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a data processing method in an embodiment of the present invention, where the method may be applied to an embedded device, and the method includes the following steps:
s101, acquiring a media object to be processed, and inputting the media object to an LSTM network.
The media object can be any one of: a speech segment for speech recognition, a text to be translated for machine translation, a word or paragraph for language modeling, a face image for emotion analysis, or an input text for text prediction. Of course, the media object may also be input data corresponding to other specific application fields of the LSTM network.
It should be noted that the LSTM network is an LSTM network capable of processing media objects. That is, in this embodiment, the LSTM network may be a network that is trained and can be used for any specific application of speech recognition, machine translation, language modeling, emotion analysis, and text prediction.
S102, target data which is generated in the process of processing the media object by the LSTM network and needs to be processed by the gate structure is obtained.
To facilitate understanding of what is the target data, a detailed description is given below in conjunction with the specific structure of the LSTM network.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an LSTM. It can be seen that the LSTM network preserves important content, and removes the rest, by removing or adding information to the "cell state" through the gate structure. A sigmoid layer outputs a probability value between 0 and 1 to describe how much of each part may pass through, where 0 means nothing is allowed through and 1 means everything passes through. The gate structure comprises a forget gate (f_t), an input gate (i_t), an output gate (o_t) and a cell gate (c̃_t).
As can be seen from the structure of the LSTM network, the bulk of the computation lies in the gate structure calculation, and within the gate structure the multiplications of matrices and vectors account for most of the computational load. The four gate structures in the LSTM network are computationally similar, with most operations identical; the four gate calculation formulas are as follows:

Forget gate: f_t = σ(W_xf · x_t + W_hf · h_(t-1) + b_f)  (1)

Input gate: i_t = σ(W_xi · x_t + W_hi · h_(t-1) + b_i)  (2)

Output gate: o_t = σ(W_xo · x_t + W_ho · h_(t-1) + b_o)  (3)

Cell gate: c̃_t = tanh(W_xc · x_t + W_hc · h_(t-1) + b_c)  (4)

The operations inside the parentheses of equations (1) to (4) are similar and take exactly the same inputs x_t and h_(t-1); outside the parentheses an activation function is applied, where σ denotes the sigmoid function. Let the input data x_t have dimension Nx × 1, the hidden-state data h_(t-1) (and h_t) dimension Nh × 1, the input weight data matrix Wx (meaning W_xf, W_xi, W_xo, W_xc, the same below) dimension Nh × Nx, the hidden-state weight data matrix Wh (meaning W_hf, W_hi, W_ho, W_hc, the same below) dimension Nh × Nh, and the offset data matrix B (meaning b_f, b_i, b_o, b_c, the same below) dimension Nh × 1.
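To make the shared structure of the four gate formulas concrete, the following NumPy sketch computes the parenthesized parts of equations (1) to (4); the dimensions and random weights are illustrative assumptions, not values from the patent:

```python
import numpy as np

Nx, Nh = 4, 4                              # illustrative dimensions (assumed)
rng = np.random.default_rng(0)

x_t   = rng.standard_normal((Nx, 1))       # input data, Nx x 1
h_tm1 = rng.standard_normal((Nh, 1))       # previous hidden state, Nh x 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (Wx, Wh, b) triple per gate; every gate performs the same
# Wx.x_t + Wh.h_(t-1) + b matrix-vector computation before its activation.
gates = {}
for name, act in (("f", sigmoid), ("i", sigmoid), ("o", sigmoid), ("c", np.tanh)):
    Wx = rng.standard_normal((Nh, Nx))     # input weight matrix, Nh x Nx
    Wh = rng.standard_normal((Nh, Nh))     # hidden-state weight matrix, Nh x Nh
    b  = rng.standard_normal((Nh, 1))      # offset (bias), Nh x 1
    pre = Wx @ x_t + Wh @ h_tm1 + b        # the part offloaded to the FPGA
    gates[name] = act(pre)                 # activation applied outside the parentheses
```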
If, according to the traditional calculation flow, Wx · x_t and Wh · h_(t-1) are computed separately, the matrix-and-vector multiplication (abbreviated herein as matrix vector multiplication) has to be completed step by step; when Nx or Nh is large, not only is the calculation slow, but a hardware implementation also consumes a large amount of resources.
In order to accelerate the matrix vector multiplication and reduce resource consumption so that the method can be applied to embedded devices, the embodiment of the invention adopts an FPGA to accelerate the calculation.
That is, the data that needs to be processed using a gate structure (i.e., by matrix vector multiplication) is referred to as target data. Specifically, the target data may include: Wx (the input weight data matrix), Wh (the hidden-state weight data matrix), X (the input data), H (the hidden-state data) and B (the offset data matrix).
S103, rearranging the target data by using the parallelism parameter of the FPGA to obtain parallel data.
In order to effectively utilize the parallel processing characteristics of the FPGA, in this embodiment, before the matrix vector multiplication is performed by using the FPGA, the target data needs to be rearranged so that the FPGA can perform the accelerated processing.
Considering that the requirements for delay and resource occupation are different due to different scales of FPGAs and different specific application scenarios, in this embodiment, different requirements can be met by setting a parallelism parameter. That is, if the delay requirement is high, the calculation process can be accelerated by setting a larger parallelism parameter; if the delay requirement is low, a smaller parallelism parameter can be set.
In order to match specific parallelism parameters, the target data needs to be rearranged to obtain parallel data suitable for the FPGA parallel processing mode. Specifically, the process of rearranging the target data includes:
determining data storage width by using a parallelism parameter;
and step two, rearranging the target data according to the data storage width to obtain parallel data.
That is, the target data originally suitable for single-line processing is rearranged, so that the target data can be applied to the parallel computation of the FPGA.
By way of example (for convenience of description), set Nx = 4, Nh = 4 and the parallelism parameter P = 2 (meaning that P multiplications are performed simultaneously per cycle). Referring to fig. 3, fig. 3 is a schematic diagram of a data arrangement according to an embodiment of the invention, where Wx and Wh are both 4 × 4, X and H are both 4 × 1, and B is also 4 × 1. The left part of FIG. 3 shows the rearrangement of the input data X and H: each is rearranged from 4 × 1 to 2 × 2, i.e., the data output width becomes P = 2. Column 1 shows the original storage of Wx and Wh. Taking Wx as an example, Wxf, Wxi, Wxc and Wxo are interleaved in a staggered order every 4 lines, each weight row occupying Nx / P = 2 lines of width P = 2; the Wh data are arranged in the same way as Wx. The rearranged weight data are stored as shown in column 2 of fig. 3.
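A small Python sketch of this rearrangement may help; the exact interleaving order of the four gate matrices is an assumption made for illustration, following the description of FIG. 3:

```python
import numpy as np

P, Nx, Nh = 2, 4, 4                        # parallelism and dimensions from the example

X = np.arange(Nx).reshape(Nx, 1)           # a 4x1 input vector

# Left of FIG. 3: the 4x1 vector becomes Nx/P rows of width P,
# so each memory read delivers P operands at once.
X_par = X.reshape(-1, P)                   # shape (2, 2)

# Right of FIG. 3: the four gate weight matrices are interleaved so the
# FPGA streams one row group per gate in turn (gate order assumed here).
Wxf, Wxi, Wxc, Wxo = [np.arange(Nh * Nx).reshape(Nh, Nx) + g for g in range(4)]
lines = []
for r in range(Nh):                        # stagger the gates, output row by row
    for W in (Wxf, Wxi, Wxc, Wxo):
        lines.append(W[r].reshape(-1, P))  # each weight row -> Nx/P lines of width P
W_storage = np.vstack(lines)               # every storage line has width P = 2
```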
Preferably, the parallelism parameter can be set and adjusted according to actual conditions. Specifically, a delay control request may be received and parsed; and modifying the parallelism parameter by using the analysis result. For example, if the resource occupation is requested to be reduced in the delay control request, the parallelism parameter can be reduced (e.g. from 3 to 2); if the delay control request requests a reduction in delay to increase the degree of acceleration, then the parallelism parameter may be increased (e.g., from 2 to 4).
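As a sketch of how such a request might be handled (the request field below is hypothetical; the patent only specifies receiving, parsing, then modifying the parameter):

```python
def adjust_parallelism(request: dict, parallelism: int) -> int:
    """Modify the parallelism parameter P from a parsed delay-control request.

    The 'goal' field is a hypothetical request format used only for this
    sketch; halving/doubling is likewise one possible policy, not the
    patent's prescribed one.
    """
    if request.get("goal") == "reduce_resources":
        return max(1, parallelism // 2)      # smaller P -> fewer multipliers
    if request.get("goal") == "reduce_delay":
        return parallelism * 2               # larger P -> more parallel multiplies
    return parallelism

# e.g. adjust_parallelism({"goal": "reduce_delay"}, 2) -> 4
```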
And S104, performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result.
After the data arrangement conversion is completed, parallel matrix vector multiplication processing can be performed on the target data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result.
It should be noted that the parallel matrix vector multiplication processing performed herein refers to performing at least 2 multiplication calculations in parallel in the same clock cycle of the FPGA.
And the matrix vector multiplication unit group is matched with the parallelism parameter. For example, if the parallelism parameter is 2, then 2 multipliers are included in the multiplication unit in the matrix-vector multiplication unit group. That is to say, using the FPGA to perform parallel matrix vector multiplication processing on the parallel data to obtain a processing result, may specifically include: and performing parallel matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group matched with the parallelism parameter in the FPGA to obtain a processing result.
Each matrix vector multiplication unit group comprises: a multiplication unit, an addition unit, an accumulation unit and an offset addition unit, where the number of multipliers in the multiplication unit corresponds to the parallelism parameter. The multiplication unit includes at least 2 multipliers implemented in the FPGA, the addition unit is an adder implemented in the FPGA, the accumulation unit performs the accumulation calculation, and the offset addition unit is an offset adder implemented in the FPGA. A matrix vector multiplication unit group may comprise one multiplication unit, one or more addition units and one offset addition unit. Specifically, the structure of the matrix vector multiplication unit group may also be as shown in fig. 5, i.e., it comprises: a multiplication unit, an addition unit, an accumulation unit and an offset addition unit.
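The following behavioral software model of one such unit group is a sketch only; the actual unit is an FPGA circuit, not Python:

```python
import numpy as np

def matvec_unit_group(w_row, x, bias, P):
    """Behavioral model of one matrix vector multiplication unit group.

    w_row: one row of the weight matrix (length N, N divisible by P)
    x:     the input vector (length N)
    bias:  the matching element of the offset data matrix
    P:     parallelism, i.e. the number of multipliers in the unit
    """
    acc = 0.0                                  # accumulation unit register
    for k in range(0, len(x), P):              # one iteration per clock "cycle"
        products = w_row[k:k + P] * x[k:k + P] # multiplication unit: P multipliers
        partial = products.sum()               # addition unit: sums the P products
        acc += partial                         # accumulation unit: N/P partial sums
    return acc + bias                          # offset addition unit

# Sanity check against a direct dot product:
w, v = np.arange(4.0), np.ones(4)
assert matvec_unit_group(w, v, 1.0, 2) == w @ v + 1.0
```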
Specifically, the parallel data includes a weight data matrix, an input data matrix, and an offset data matrix; the following describes in detail a specific processing procedure of target data by taking an example of a matrix vector multiplication processing procedure performed by a matrix vector multiplication unit group. The specific treatment process comprises the following steps:
step one, reading, from the input weight data matrix and the input data matrix respectively, the to-be-multiplied groups corresponding to the parallelism parameter, wherein each to-be-multiplied group comprises weight data and input data;
step two, inputting each to-be-multiplied group into a corresponding multiplier in the multiplication unit for multiplication to obtain the corresponding multiplication results;
step three, inputting the multiplication results into the addition unit for summation, and accumulating the outputs of the addition unit with the accumulation unit to obtain a multiply-accumulate result;
and step four, inputting the multiply-accumulate result and the corresponding offset data in the offset data matrix into the offset addition unit for superposition to obtain the processing result.
For convenience of description, the above four steps will be described in combination.
For the matrix vector multiplication unit group, as many to-be-multiplied groups as the parallelism parameter indicates can be read from the input weight data matrix and the input data matrix; that is, the number of to-be-multiplied groups matches the parallelism parameter. For example, with a parallelism parameter of 3, 3 to-be-multiplied groups are obtained. One weight datum and one input datum together form a to-be-multiplied group.
Each to-be-multiplied group is input into the corresponding multiplier of the multiplication unit for multiplication to obtain a multiplication result. The multiplication results are then input into the addition unit for summation, and the accumulation unit accumulates the outputs of the addition unit to obtain a multiply-accumulate result. Finally, the offset addition unit superimposes the corresponding offset data from the offset data matrix onto the multiply-accumulate result to obtain the processing result.
By way of example: as shown in fig. 4, which is a schematic diagram of an embodiment of a data processing method according to the present invention, during calculation the first cycle (an FPGA clock cycle, i.e., a unit of time) reads Wxf1 and Wxf2 from the Wx matrix and Xf1 and Xf2 from the X matrix to participate in the calculation, and the two multipliers compute Wxf1 × Xf1 and Wxf2 × Xf2 simultaneously; the second cycle computes Wxf3 × Xf3 and Wxf4 × Xf4 simultaneously, and so on. The multiplication unit completes P multiplications at a time and outputs the results to the addition unit, which adds the P multiplication results to obtain a partial result. The stage following the accumulation unit is the offset addition unit (a simple adder), which completes the addition of the multiply-accumulate result and the corresponding offset data within 1 clock cycle. This completes the parenthesized parts of the above formulas (1) to (4), i.e., the parts accelerated by the FPGA.
Preferably, to increase the accumulation speed, the addition unit is a pipeline adder. That is, each time the multiplication unit completes P multiplications it outputs the results to the addition unit, which adds the P multiplication results; implemented as a pipeline adder, this takes log2(P) stages in total, i.e., it completes in log2(P) cycles. The accumulation unit then accumulates the output results of the addition unit; since the parallelism is P, the accumulation unit accumulates Nh/P data, which yields the multiply-accumulate result of one row of the weight matrix with X (or H), and this result is output to the next stage.
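The log2(P) behavior can be seen in a few lines (a sketch assuming P is a power of two):

```python
def pipeline_adder_tree(values):
    """Pairwise adder tree over P inputs: log2(P) levels, and when each
    level is a pipeline stage, log2(P) cycles of latency (P a power of 2)."""
    level, depth = list(values), 0
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1                        # one pipeline stage per tree level
    return level[0], depth

# e.g. pipeline_adder_tree([1, 2, 3, 4]) -> (10, 2), since log2(4) = 2
```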
Preferably, in addition to parallel computing, a pipeline design may be employed throughout the computation; fig. 4 also illustrates the pipelined calculation in the embodiment of the present invention. Referring to the schematic block diagram of fig. 5, in the gate structure calculation the steps of the multiplication unit, the addition unit and so on cannot all be completed within one cycle. Besides the multi-stage pipelined adder used inside the addition unit, the calculations of different units are also pipelined. For example, the first step requires calculating X · Wxf and the second step H · Whf; in fact, while the addition for X · Wxf is being calculated, the fetching of H and Whf has already started, so the two steps overlap in time. Extending this to the layer above, the calculations of different gate structures also overlap in time. With pipelining, although some computation delay is added, the consumption of hardware resources is reduced while still achieving good computation performance.
And S105, feeding back the processing result to the LSTM network for continuous processing to obtain an output result of the media object.
After the matrix vector multiplication is completed in the FPGA, the processing result (each calculation yields one result datum) is fed back to the LSTM network; after the activation function is applied, the result is stored in the corresponding line of the designated cache module of the LSTM network, giving the output h_t. After serial-to-parallel conversion, the data width of h_t is converted to the parallelism P and written into the Wh cache module for the calculation of the next time step. Finally, the LSTM module outputs the output result corresponding to the media object. In particular, for the LSTM network, not only the gate structure processing can be performed directly in the FPGA; the remaining calculation processes can also be performed in the FPGA.
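For reference, the element-wise "continue processing" that the LSTM network performs on the fed-back gate results follows the standard LSTM cell update; a NumPy sketch:

```python
import numpy as np

def lstm_step(pre_f, pre_i, pre_o, pre_c, c_prev):
    """Finish one LSTM time step from the gate pre-activations, i.e. the
    parenthesized results of equations (1)-(4) fed back by the FPGA."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    f_t = sigmoid(pre_f)                  # forget gate
    i_t = sigmoid(pre_i)                  # input gate
    o_t = sigmoid(pre_o)                  # output gate
    c_tilde = np.tanh(pre_c)              # cell gate
    c_t = f_t * c_prev + i_t * c_tilde    # update the cell state
    h_t = o_t * np.tanh(c_t)              # h_t is written back (width P) for t+1
    return h_t, c_t
```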
If the media object is a speech segment, the output result may specifically be the speech content of the segment, or a speech feature (such as the identity of the speaker); if the media object is a text to be translated, the output result is the translated text; if the media object is emotion-analysis input data, the output result is an emotion analysis conclusion, such as happy or sad; and if the media object is a text input, the output result is the predicted text.
By applying the method provided by the embodiment of the invention, the media object to be processed is obtained and is input into the LSTM network; acquiring target data which is generated in the process of processing the media object by the LSTM network and needs to be processed by using a gate structure; rearranging the target data by using the parallelism parameter of the FPGA to obtain parallel data; performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result; and feeding back the processing result to the LSTM network for continuous processing to obtain an output result of the media object.
It has been found that the calculation of matrix and vector multiplication in the gate structure of the LSTM network occupies a significant portion of the computational effort. Therefore, in the method, the matrix vector multiplication unit realized in the FPGA can be used for carrying out parallel accelerated processing on the target data which is generated and needs to be processed by using the gate structure in the process of processing the media object by the LSTM network. Specifically, in order to utilize the parallel characteristic of the FPGA, firstly, the target data is rearranged by using the parallelism parameter of the FPGA to obtain parallel data. Then, the matrix vector multiplication unit in the FPGA is used for processing the parallel data, and the gate structure processing can be completed quickly. The LSTM network may then obtain the output of the media object.
Traditional hardware such as CPUs and GPUs is, for cost and power reasons, difficult to use for LSTM network acceleration on the embedded side, while an ASIC offers high performance and low power consumption but lacks flexibility and adapts poorly to varied algorithm requirements. By contrast, the method provides an FPGA-based accelerated LSTM network and speeds up the gate structure calculation in the LSTM network with parallel computation, thereby accelerating the LSTM network, supporting it in embedded devices, and effectively processing the target media, i.e., realizing functions such as speech recognition, machine translation, language modeling, emotion analysis and text prediction in embedded devices.
Corresponding to the above method embodiments, the embodiments of the present invention further provide a data processing apparatus, and the data processing apparatus described below and the data processing method described above may be referred to in correspondence with each other.
Referring to fig. 6, the apparatus includes the following modules:
a media object obtaining module 101, configured to obtain a media object to be processed, and input the media object to an LSTM network;
a target data obtaining module 102, configured to obtain target data that needs to be processed by using a gate structure and is generated in a process of processing a media object by an LSTM network;
the data arrangement module 103 is configured to rearrange the target data by using the parallelism parameter of the FPGA to obtain parallel data;
the parallel acceleration processing module 104 is configured to perform matrix-vector multiplication processing on parallel data by using a matrix-vector multiplication unit group in the FPGA to obtain a processing result;
and the result obtaining module 105 is configured to feed back the processing result to the LSTM network for further processing, so as to obtain an output result of the media object.
By applying the device provided by the embodiment of the invention, the media object to be processed is obtained and is input into the LSTM network; acquiring target data which is generated in the process of processing the media object by the LSTM network and needs to be processed by using a gate structure; rearranging the target data by using the parallelism parameter of the FPGA to obtain parallel data; performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result; and feeding back the processing result to the LSTM network for continuous processing to obtain an output result of the media object.
It has been found that the calculation of matrix and vector multiplication in the gate structure of the LSTM network occupies a significant portion of the computational effort. Therefore, in the device, the matrix vector multiplication unit realized in the FPGA can be used for carrying out parallel acceleration processing on the target data which is generated and needs to be processed by using the gate structure in the process of processing the media object by the LSTM network. Specifically, in order to utilize the parallel characteristic of the FPGA, firstly, the target data is rearranged by using the parallelism parameter of the FPGA to obtain parallel data. Then, the matrix vector multiplication unit in the FPGA is used for processing the parallel data, and the gate structure processing can be completed quickly. The LSTM network may then obtain the output of the media object.
Traditional hardware such as CPUs and GPUs is, for cost and power reasons, difficult to use for LSTM network acceleration on the embedded side, while an ASIC offers high performance and low power consumption but lacks flexibility and adapts poorly to varied algorithm requirements. By contrast, the device provides an FPGA-based accelerated LSTM network and speeds up the gate structure calculation in the LSTM network with parallel computation, thereby accelerating the LSTM network, supporting it in embedded devices, effectively processing the target media, and realizing functions such as speech recognition, machine translation, language modeling, emotion analysis and text prediction in embedded devices.
In a specific embodiment of the present invention, the parallel acceleration processing module 104 is specifically configured to perform parallel matrix vector multiplication processing on parallel data by using a matrix vector multiplication unit group matched with a parallelism parameter in an FPGA, so as to obtain a processing result.
In one embodiment of the present invention, the matrix vector multiplication unit group comprises: the device comprises a multiplication unit, an addition unit, an accumulation unit and an offset addition unit; the number of multipliers in the multiplication unit corresponds to the parallelism parameter.
In one embodiment of the present invention, the parallel data comprises a weight data matrix, an input data matrix and an offset data matrix; the parallel acceleration processing module 104 is specifically configured to: read, from the input weight data matrix and the input data matrix respectively, the to-be-multiplied groups corresponding to the parallelism parameter, wherein each to-be-multiplied group comprises weight data and input data; input each to-be-multiplied group into a corresponding multiplier in the multiplication unit for multiplication to obtain the corresponding multiplication results; input the multiplication results into the addition unit for summation, and accumulate the outputs of the addition unit with the accumulation unit to obtain a multiply-accumulate result; and input the multiply-accumulate result and the corresponding offset data in the offset data matrix into the offset addition unit for superposition to obtain the processing result.
In one embodiment of the invention, the addition unit is a pipeline adder.
In an embodiment of the present invention, the data arrangement module 103 is specifically configured to determine a data storage width by using a parallelism parameter; and rearranging the target data according to the data storage width to obtain parallel data.
In one embodiment of the present invention, the apparatus further comprises:
the parallelism parameter adjusting module is used for receiving and analyzing the delay control request; and modifying the parallelism parameter by using the analysis result.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a data processing device, and a data processing device described below and a data processing method described above may be referred to in correspondence with each other.
Referring to fig. 7, the data processing apparatus includes:
a memory 332 for storing a computer program;
a processor 322 for implementing the steps of the data processing method of the above-described method embodiments when executing the computer program.
Specifically, referring to fig. 8, fig. 8 is a schematic diagram of a specific structure of a data processing apparatus provided in this embodiment; the apparatus may vary considerably with configuration or performance and may include one or more processors (CPUs) 322 (e.g., one or more processors) and a memory 332 storing one or more computer applications 342 or data 344. The memory 332 may be transient or persistent storage. The program stored in memory 332 may include one or more modules (not shown), each of which may include a series of instruction operations on the data processing device. Further, the central processor 322 may be configured to communicate with the memory 332 and execute a series of instruction operations from the memory 332 on the data processing device 301.
The data processing apparatus 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input-output interfaces 358, and/or one or more operating systems 341.
The steps in the data processing method described above may be implemented by the structure of a data processing apparatus.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a readable storage medium, and a readable storage medium described below and a data processing method described above may be referred to in correspondence with each other.
A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the data processing method of the above-mentioned method embodiment.
The readable storage medium may be a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any of various other readable storage media capable of storing program code.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

Claims (10)

1. A data processing method, comprising:
acquiring a media object to be processed, and inputting the media object to an LSTM network;
acquiring target data which is generated in the process of processing the media object by the LSTM network and needs to be processed by using a gate structure;
rearranging the target data by using the parallelism parameter of the FPGA to obtain parallel data;
performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result;
and feeding back the processing result to the LSTM network for continuous processing to obtain an output result of the media object.
2. The data processing method according to claim 1, wherein performing matrix-vector multiplication processing on the parallel data by using a matrix-vector multiplication unit group in the FPGA to obtain a processing result comprises:
and performing parallel matrix vector multiplication processing on the parallel data by using the matrix vector multiplication unit group matched with the parallelism parameter in the FPGA to obtain the processing result.
3. The data processing method of claim 1, wherein the set of matrix vector multiplication units comprises: the device comprises a multiplication unit, an addition unit, an accumulation unit and an offset addition unit, wherein the number of multipliers in the multiplication unit corresponds to the parallelism parameter.
4. The data processing method of claim 3, wherein the parallel data comprises a weight data matrix, an input data matrix and an offset data matrix, and performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result comprises the following steps:
reading, from the input weight data matrix and the input data matrix respectively, the to-be-multiplied groups corresponding to the parallelism parameter, wherein each to-be-multiplied group comprises weight data and input data;
inputting each to-be-multiplied group into a corresponding multiplier in the multiplication unit for multiplication to obtain the corresponding multiplication results;
inputting the multiplication results into the addition unit for summation, and accumulating the outputs of the addition unit with the accumulation unit to obtain a multiply-accumulate result;
and inputting the multiply-accumulate result and the corresponding offset data in the offset data matrix into the offset addition unit for superposition to obtain the processing result.
5. The data processing method of claim 4, wherein the addition unit is a pipeline adder.
6. The data processing method of claim 1, wherein rearranging the target data by using a parallelism parameter of the FPGA to obtain parallel data comprises:
determining a data storage width by using the parallelism parameter;
and rearranging the target data according to the data storage width to obtain the parallel data.
7. The data processing method according to any one of claims 1 to 6, further comprising:
receiving and analyzing a delay control request;
and modifying the parallelism parameter by using the analysis result.
8. A data processing apparatus, comprising:
the media object acquisition module is used for acquiring a media object to be processed and inputting the media object to the LSTM network;
a target data acquisition module, configured to acquire target data that needs to be processed by using a gate structure and is generated in the process of processing the media object by the LSTM network;
the data arrangement module is used for rearranging the target data by utilizing the parallelism parameter of the FPGA to obtain parallel data;
the parallel acceleration processing module is used for performing matrix vector multiplication processing on the parallel data by using a matrix vector multiplication unit group in the FPGA to obtain a processing result;
and the result acquisition module is used for feeding back the processing result to the LSTM network for continuous processing to obtain the output result of the media object.
9. A data processing apparatus, characterized by comprising:
a memory for storing a computer program;
a processor for implementing the steps of the data processing method according to any one of claims 1 to 7 when executing the computer program.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the data processing method according to any one of claims 1 to 7.
CN202010567702.3A 2020-06-19 2020-06-19 Data processing method, device and equipment and readable storage medium Withdrawn CN111723913A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010567702.3A CN111723913A (en) 2020-06-19 2020-06-19 Data processing method, device and equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010567702.3A CN111723913A (en) 2020-06-19 2020-06-19 Data processing method, device and equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111723913A true CN111723913A (en) 2020-09-29

Family

ID=72568182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010567702.3A Withdrawn CN111723913A (en) 2020-06-19 2020-06-19 Data processing method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111723913A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115660035A (en) * 2022-12-28 2023-01-31 南京南瑞信息通信科技有限公司 Hardware accelerator for LSTM network and LSTM model
WO2023202352A1 (en) * 2022-04-21 2023-10-26 北京字跳网络技术有限公司 Speech recognition method and apparatus, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046901A1 (en) * 2016-08-12 2018-02-15 Beijing Deephi Intelligence Technology Co., Ltd. Hardware accelerator for compressed gru on fpga
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA
US20180174036A1 (en) * 2016-12-15 2018-06-21 DeePhi Technology Co., Ltd. Hardware Accelerator for Compressed LSTM
CN109542830A (en) * 2018-11-21 2019-03-29 北京灵汐科技有限公司 A kind of data processing system and data processing method
CN111045727A (en) * 2018-10-14 2020-04-21 天津大学青岛海洋技术研究院 Processing unit array based on nonvolatile memory calculation and calculation method thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046901A1 (en) * 2016-08-12 2018-02-15 Beijing Deephi Intelligence Technology Co., Ltd. Hardware accelerator for compressed gru on fpga
US20180174036A1 (en) * 2016-12-15 2018-06-21 DeePhi Technology Co., Ltd. Hardware Accelerator for Compressed LSTM
CN108090560A (en) * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 The design method of LSTM recurrent neural network hardware accelerators based on FPGA
CN111045727A (en) * 2018-10-14 2020-04-21 天津大学青岛海洋技术研究院 Processing unit array based on nonvolatile memory calculation and calculation method thereof
CN109542830A (en) * 2018-11-21 2019-03-29 北京灵汐科技有限公司 A kind of data processing system and data processing method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023202352A1 (en) * 2022-04-21 2023-10-26 北京字跳网络技术有限公司 Speech recognition method and apparatus, electronic device, and storage medium
CN115660035A (en) * 2022-12-28 2023-01-31 南京南瑞信息通信科技有限公司 Hardware accelerator for LSTM network and LSTM model
CN115660035B (en) * 2022-12-28 2023-08-11 南京南瑞信息通信科技有限公司 Hardware accelerator for LSTM network and LSTM model

Similar Documents

Publication Publication Date Title
CN107862374B (en) Neural network processing system and processing method based on assembly line
CN110326003B (en) Hardware node with location dependent memory for neural network processing
JP6507271B2 (en) CNN processing method and device
US10691996B2 (en) Hardware accelerator for compressed LSTM
Yepez et al. Stride 2 1-D, 2-D, and 3-D Winograd for convolutional neural networks
JP2022071015A (en) Batch processing in neural network processor
CN111310904A (en) Apparatus and method for performing convolutional neural network training
JP3228927B2 (en) Processor element, processing unit, processor, and arithmetic processing method thereof
CN109144469B (en) Pipeline structure neural network matrix operation architecture and method
CN109284824B (en) Reconfigurable technology-based device for accelerating convolution and pooling operation
CN111723913A (en) Data processing method, device and equipment and readable storage medium
CN110766128A (en) Convolution calculation unit, calculation method and neural network calculation platform
KR102396447B1 (en) Deep learning apparatus for ANN with pipeline architecture
Conte et al. GPU-acceleration of waveform relaxation methods for large differential systems
Arredondo-Velazquez et al. A streaming architecture for Convolutional Neural Networks based on layer operations chaining
Domingos et al. An efficient and scalable architecture for neural networks with backpropagation learning
CN113655986B9 (en) FFT convolution algorithm parallel implementation method and system based on NUMA affinity
CN111639701A (en) Method, system and equipment for extracting image features and readable storage medium
CN114519425A (en) Convolution neural network acceleration system with expandable scale
CN111985626B (en) System, method and storage medium for accelerating RNN (radio network node)
CN114127689A (en) Method for interfacing with a hardware accelerator
Ivutin et al. Design efficient schemes of applied algorithms parallelization based on semantic Petri-Markov net
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
CN115222028A (en) One-dimensional CNN-LSTM acceleration platform based on FPGA and implementation method
CN113869517A (en) Inference method based on deep learning model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200929