CN111105023B - Data stream reconstruction method and reconfigurable data stream processor - Google Patents
- Publication number
- CN111105023B (application CN201911087000.9A)
- Authority
- CN
- China
- Prior art keywords
- neural network
- data stream
- target neural
- network layer
- chip
- Prior art date
- 2019-11-08
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a data stream reconstruction method and a reconfigurable data stream processor, and in particular data stream reconstruction oriented to hybrid artificial neural networks. According to the requirements of different neural network layers, the method dynamically changes the function configuration of resources such as computing units, storage units and data flow units, and reuses hardware on a large scale to realize neural network layers with different functions. For a hybrid neural network structure composed of multiple neural network layers, this improves hardware utilization, increases operation speed and reduces power consumption. In particular, reusable configurations are identified by acquiring the characteristic information of new kinds of neural network layers, so the method also provides a resource-reuse foundation for constructing such layers, and for realizing hybrid neural networks based on them in subsequent research, which gives the method strong generality.
Description
Technical Field
The present invention relates to the field of data stream technology of neural networks, and in particular, to a data stream reconstruction method and a reconfigurable data stream processor.
Background
Neural networks are widely used in fields such as computer vision, natural language processing and game engines. With the rapid development of neural network structures, the demands that neural networks place on the computing power of different data streams keep increasing, so hybrid neural networks are the future trend: a compact set of algorithm kernels can support end-to-end tasks spanning perception, control and even driving. Meanwhile, dedicated hardware accelerator architectures such as Eyeriss, Google TPU and DaDianNao have been proposed to accelerate the inference phase of neural networks. These achieve high performance and high resource utilization through algorithm-architecture co-design techniques such as dedicated dataflows and systolic multiplier arrays, but they are tightly coupled to particular neural networks and cannot accelerate arbitrary ones. Corresponding data stream schemes therefore need to be designed for different neural networks, and a suitable data stream reconstruction method is the key design point for hybrid artificial neural networks.
The prior art lacks a scheme that performs resource multiplexing through data stream reconstruction for a hybrid neural network structure composed of different neural network layers such as pooling layers, fully-connected layers, recurrent LSTM layers, deep reinforcement learning layers and residual layers. Prior-art schemes therefore often suffer from high hardware cost, complex structure, low operation speed and high operation power consumption.
Disclosure of Invention
In view of the above, the present invention provides a data stream reconstruction method and a reconfigurable data stream processor to solve the above problems.
To achieve this purpose, the invention adopts the following technical scheme:
The invention provides a data stream reconstruction method, which comprises the following steps: acquiring characteristic information of a target neural network layer; determining, according to the characteristic information, the data flow mode, the function configuration of the processing units and the function configuration of the system on chip corresponding to the target neural network layer; applying the function configurations corresponding to the target neural network layer to the reusable processing units and system on chip, and performing the network configuration corresponding to the target neural network layer according to its data flow mode, so as to construct the target neural network layer; and obtaining an output result using the constructed target neural network layer.
Preferably, when the target neural network layer is a convolutional layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads; the input and output data streams are serial transmissions in thread-level parallel, a static memory of the system on chip is configured to buffer the input feature map activations for a thread, weights and activations are shared among the threads, and the serial output of each thread is output in parallel after output buffering.
Preferably, when the target neural network layer is a pooling layer, the processing units are configured as comparators; the input and output data streams are parallel transmissions.
Preferably, when the target neural network layer is a fully-connected layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads; the input and output data streams are serial transmissions in thread-level parallel, a static memory of the system on chip is configured as a weight buffer, and the activations stream serially through the threads.
Preferably, when the target neural network layer is a residual layer, the processing units are configured as adders; the input and output data streams are parallel transmissions, and the input and output shift registers of the system on chip are used to store the operands.
Preferably, when the target neural network layer is a long short-term memory (LSTM) layer, the processing units are divided into four groups, each group instantiating a sigmoid function and a tanh function, and the input and output data streams are serial transmissions.
Preferably, when the target neural network layer is a reinforcement learning layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads; the input and output data streams are serial transmissions in thread-level parallel, and the cache of the system on chip is used for state activations and iterative operations.
The invention further provides a reconfigurable data stream processor for executing the above data stream reconstruction method. The reconfigurable data stream processor comprises a system on chip, hardware threads and multiple groups of processing units. The system on chip controls each group of processing units to cooperate with its corresponding hardware thread, and adjusts the processing units to match the function configuration of the target neural network layer, thereby constructing the target neural network layer.
Preferably, the system on chip comprises an execution controller, a direct memory access controller, execution threads and a buffer area. The execution controller extracts the network instructions of the target neural network layer from the external off-chip memory, configures them into static memory, and decodes and parses them one by one to drive the execution threads; the direct memory access controller controls reads and writes between the system on chip and the off-chip memory; the execution threads run under the control of the execution controller to realize the functions of the target neural network layer; and the buffer area comprises a static memory pool formed of multiple static memories.
Preferably, the hardware thread comprises a core state machine and a shift register, the core state machine is used for controlling data input and output, activation function allocation and weight allocation of the processing units on the same thread, and the shift register is used for constructing input and output of an activation function.
The data stream reconstruction method and reconfigurable data stream processor provided by the invention support the operators of different neural network layers by dynamically changing the functions of the computing, storage and data flow units, reuse hardware resources on a large scale, and adapt to various neural networks, in particular novel hybrid neural networks, thereby improving hardware utilization, increasing operation speed and reducing power consumption.
Drawings
FIG. 1 is an exemplary block diagram of a hybrid neural network architecture;
fig. 2 is a flowchart of a data stream reconstruction method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the structure of convolutional layers as target neural network layers;
FIG. 4 is a schematic diagram of the structure of a pooling layer as a target neural network layer;
FIG. 5 is a schematic diagram of a fully-connected layer as a target neural network layer;
FIG. 6 is a schematic diagram of a structure of a residual layer as a target neural network layer;
FIG. 7 is a schematic structural diagram of a long-short term memory layer as a target neural network layer;
FIG. 8 is a schematic diagram of a reinforcement learning layer as a target neural network layer;
FIG. 9 is a block diagram of a reconfigurable data stream processor according to an embodiment of the present invention;
fig. 10 is a comparison graph of Q iteration time between the architecture designed in the verification experiment of example 2 and the host.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings. Examples of these preferred embodiments are illustrated in the accompanying drawings. The embodiments of the invention shown in the drawings and described in accordance with the drawings are exemplary only, and the invention is not limited to these embodiments.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps that are closely related to the solution according to the present invention are shown in the drawings, and other details that are not relevant are omitted.
Example 1
Referring to fig. 1, the important network layers and their interconnections in a hybrid neural network architecture are depicted, forming an end-to-end network targeting perception and control. Specifically, for image inputs, cascaded convolutional and pooling layers serve as the perception module for visual feature extraction. Model networks such as Yolo-v3 and Resnet-50 can reach tens of layers to mimic the human visual system. For applications such as video context understanding and language processing, a time-related feature sequence is fed as input to the LSTM long short-term memory layer, which extracts feature outputs related to the sequence. Unlike the preceding layers, the LSTM network is a special recurrent neural network structure formed of four basic gates: an input gate (I), an output gate (O), a cell state gate (C) and a forget gate (F). While the I, O and F gates compute layer outputs by vector operations, the C gate holds the current layer state and serves as the recursive input for the next time step.
The control network layers follow feature extraction: the extracted feature parameters are treated as state nodes of the deep reinforcement learning network DQN, and the optimal decision is selected through action nodes. The method traverses all possible actions in the current state and executes a regression according to the reinforcement learning policy to find the maximum or minimum output value (the Q value). Since the action nodes must be iterated, all computations in the subsequent layers must also be iterated, as indicated by the dashed box. Multilayer perceptrons employ the most common fully-connected layers. Shortcut connections are also commonly used in residual networks, improving the accuracy of classification and regression by providing key elements of an earlier layer before the current input.
For artificial neural networks, the neural network layers differ not only in network structure but also in operands, operators and nonlinear functions. In different neural network layers, the data stream attributes are reconfigurable according to the different network structures. By analyzing the characteristics of the various neural network layers, such as their data stream access modes and the functions of their computing resources, the common points among them reveal the resources that can be reused in constructing a hybrid neural network structure, for example the processing units (PE), the data input/output, the static memory (SRAM) used as buffers, and the interface to the off-chip DRAM. Data stream reconstruction according to this idea solves the prior-art problem. Referring to table 1, the characteristic information of several standard neural network layers is summarized.
TABLE 1. Characteristic information of multiple standard neural network layers
It can be seen that the pooling and shortcut (residual) layers are vector operations, while the other kernels are matrix operations; among these, the convolution process is sparse while the remaining network layers are dense. Each network layer uses a different activation function for its nonlinearity: the LSTM network uses both sigmoid and tanh, while the remaining matrix kernels use ReLU or sigmoid.
Network data in the convolutional and fully-connected layers must be shared between the nodes of the output feature map. The LSTM layer employs a similar serial flow, except that its activation flow must be shared among multiple gates. The state-action layer must generate its data stream quickly, driven by the iteration over action nodes. The pooling and residual layers, which operate on vectors, do not need to share activations across the feature map; such vector-type activations can therefore be transmitted in parallel.
Furthermore, analyzing the role of intermediate data in the various network layers shows that the convolutional and pooling layers are dominated by activations, while the FC fully-connected and LSTM layers are dominated by weights owing to data sparsity. In the residual layer, a pointer to the previous layer's activations must be kept so that the network can process the earlier data.
Combining the above analysis, as shown in fig. 2, the present invention provides a data stream reconstruction method, which comprises:
s1, acquiring characteristic information of a target neural network layer;
s2, determining a data stream mode corresponding to the target neural network layer, the functional configuration of the processing unit and the functional configuration of the system on chip (SoC) according to the characteristic information of the target neural network layer;
s3, performing function configuration of the processing unit and the system on chip corresponding to the target neural network layer on the reusable processing unit and the system on chip, and performing network configuration corresponding to the target neural network layer according to a data flow mode of the target neural network layer to construct the target neural network layer;
and S4, obtaining an output result by adopting the constructed target neural network layer.
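The mapping from a layer's characteristic information to its resource configuration can be summarized as a dispatch table. The following is a minimal, illustrative Python sketch of steps S1-S4; the names (LayerConfig, CONFIG_TABLE, reconstruct) and the configuration strings are assumptions made for illustration, not part of the claimed hardware.

```python
# Sketch of S1-S4: look up the dataflow mode and function configuration that
# the characteristic information of a target layer calls for (S1/S2), then
# return it for application to the reusable PEs and SoC (S3/S4).
from dataclasses import dataclass

@dataclass
class LayerConfig:
    pe_function: str   # operator instantiated in each processing element
    dataflow: str      # "serial" (thread-level parallel) or "parallel"
    sram_role: str     # what the on-chip SRAM pool buffers

CONFIG_TABLE = {
    "conv":     LayerConfig("mac+relu",     "serial",   "input activations"),
    "pool":     LayerConfig("comparator",   "parallel", "none (stream from DRAM)"),
    "fc":       LayerConfig("mac+relu",     "serial",   "weights"),
    "residual": LayerConfig("adder",        "parallel", "operand pointers"),
    "lstm":     LayerConfig("sigmoid/tanh", "serial",   "cell state"),
    "dqn":      LayerConfig("mac+relu",     "serial",   "state activations"),
}

def reconstruct(layer_type: str) -> LayerConfig:
    """S1/S2: select the configuration matching the layer's characteristics."""
    return CONFIG_TABLE[layer_type]

print(reconstruct("lstm"))  # LayerConfig(pe_function='sigmoid/tanh', ...)
```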
The data stream reconstruction method provided by the invention dynamically changes the function configuration of resources such as computing units, storage units and data flow units according to the different neural network layers, realizes layers with different functions by reusing hardware on a large scale, and, for a hybrid neural network structure composed of multiple layers, achieves improved hardware utilization, higher operation speed and lower power consumption, while also providing a resource-reuse foundation for constructing other novel neural network layers in subsequent research. Compared with prior-art data stream reuse schemes that address only fine-grained data reuse for standard convolution operators, such as weight-stationary, output-stationary and row-stationary dataflows, the method is far more general and achieves better results.
The following describes the data stream management and resource sharing method with reference to figs. 3 to 8 (in which dashed lines indicate resources that need not be multiplexed), taking the important neural network layers as examples:
referring to fig. 3, when the target neural network layer is a convolutional layer, the processing unit includes a multiply-accumulate operation unit and a modified linear unit which are grouped and configured in a plurality of threads, wherein each thread processes data using the same row and column on a plurality of channels of an output feature map; the input or output of the data stream is thread-level parallel serial transmission, a static memory of the system on chip is configured to be used for buffering an activation function of an input feature graph on a thread, a weight and the activation function are shared among a plurality of threads, the activation function is in serial streaming transmission from a single buffer area to realize sharing among processing units, and serial output of each thread is output in parallel through a serial deserializer SERDES and a DRAM controller after being output and buffered.
As shown in fig. 4, when the target neural network layer is a pooling layer, the processing units are configured as comparators to implement the max and min operators. The input and output data streams are parallel transmissions. Because the pooling layer operates directly on vectors, the activations fetched from DRAM are supplied to the processing unit array without buffering, which greatly saves dynamic power; the activations are traversed for comparison by modifying the DRAM access address.
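A minimal sketch of this comparator dataflow follows, assuming a flat row-major feature map and illustrative names; the window is walked by stepping the access address, mirroring the unbuffered DRAM streaming described above.

```python
# Sketch of the pooling configuration: a comparator PE reduces one pooling
# window by stepping the DRAM access address, with no SRAM buffering.
def max_pool_stream(dram, base_addr, window_addrs):
    """dram: addressable sequence; window_addrs: offsets of one window."""
    result = dram[base_addr + window_addrs[0]]
    for off in window_addrs[1:]:
        v = dram[base_addr + off]
        result = v if v > result else result   # comparator PE (max operator)
    return result

feature = list(range(16))                          # toy 4x4 map, row-major
print(max_pool_stream(feature, 0, [0, 1, 4, 5]))   # 2x2 window -> 5
```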
Referring to fig. 5, when the target neural network layer is a fully-connected layer, the output and processing units are configured like those of the convolutional layer: the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads, and the input and output data streams are serial transmissions in thread-level parallel. For this weight-dominated kernel, the static memory of the system on chip is configured as a weight buffer, and the activations stream serially through the threads.
As shown in fig. 6, when the target neural network layer is a residual layer, the kernel, like that of the pooling layer, works directly on the parameters, and the processing units are configured as adders. The input and output data streams are parallel transmissions. Because two vectors are added, the input and output shift registers of the system on chip are used to store the operands; the result is written to the output shift register and then to DRAM in parallel, and a pointer buffer is instantiated to address the two operands in DRAM.
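The following sketch illustrates, under assumed names, how the instantiated pointer buffer addresses the two operand vectors in DRAM for the element-wise addition.

```python
# Sketch of the residual configuration: the pointer buffer holds the DRAM
# addresses of the current and earlier-layer activations; adder PEs combine
# them elementwise and the result is written back in parallel.
def residual_add(dram, ptr_buffer, out_addr, length):
    a_ptr, b_ptr = ptr_buffer              # pointers to the two operands
    for i in range(length):                # vector handled by parallel adders
        dram[out_addr + i] = dram[a_ptr + i] + dram[b_ptr + i]

mem = [1, 2, 3, 4] + [10, 20, 30, 40] + [0, 0, 0, 0]   # toy DRAM image
residual_add(mem, (0, 4), 8, 4)
print(mem[8:])                             # [11, 22, 33, 44]
```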
Referring to fig. 7, when the target neural network layer is a long short-term memory layer, the network layer multiplexes the processing units in four groups, each group instantiating a sigmoid function and a tanh function, followed by the vector addition and tanh operations. The input and output data streams are serial transmissions, and a mixed input mode is adopted to supply data quickly, sharing activations within each group of gates while distinguishing between groups. A cell-state cache is instantiated to retain the intermediate state information.
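As a scalar toy model of the four gate groups, not the vectorized hardware, the sketch below computes one LSTM step; W is an assumed dictionary of per-gate weight pairs, and the cell state c plays the role of the state cache.

```python
# Sketch of the LSTM configuration: four PE groups (I, F, O gates and the
# candidate state) instantiate sigmoid or tanh; the cell state is retained
# in a cache between time steps.
import math

def lstm_step(x, h_prev, c_prev, W):
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    i = sig(W["i"][0] * x + W["i"][1] * h_prev)        # input-gate group
    f = sig(W["f"][0] * x + W["f"][1] * h_prev)        # forget-gate group
    o = sig(W["o"][0] * x + W["o"][1] * h_prev)        # output-gate group
    g = math.tanh(W["c"][0] * x + W["c"][1] * h_prev)  # candidate-state group
    c = f * c_prev + i * g     # vector addition; c held in the state cache
    h = o * math.tanh(c)       # later tanh stage
    return h, c

W = {k: (0.5, 0.5) for k in ("i", "f", "o", "c")}      # toy shared weights
print(lstm_step(1.0, 0.0, 0.0, W))
```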
Referring to fig. 8, when the target neural network layer is a reinforcement learning layer, the input, output and processing units are configured like those of a fully-connected layer, including the various activation sources such as DRAM for conventional activations. The processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads, the input and output data streams are serial transmissions in thread-level parallel, and the cache of the system on chip is used for state activations and the iterative operations.
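A minimal sketch of the action-node iteration, with assumed names: the state activations stay cached on chip while the fully-connected dataflow is re-run once per candidate action to find the best Q value.

```python
# Sketch of the reinforcement-learning configuration: only the action input
# changes between iterations; the state activations are held in the on-chip
# cache and the FC-style pipeline is reused for each Q evaluation.
def q_iterate(q_net, state, actions):
    cached_state = state                     # state activations kept on chip
    best_action, best_q = None, float("-inf")
    for a in actions:                        # action-node iteration
        q = q_net(cached_state, a)           # reuse of the FC dataflow
        if q > best_q:
            best_action, best_q = a, q
    return best_action, best_q

# toy Q function: prefers the action closest to the state's mean
toy_q = lambda s, a: -abs(sum(s) / len(s) - a)
print(q_iterate(toy_q, [0.2, 0.8, 0.5], [0, 1, 2]))    # (0, -0.5)
```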
Besides the neural network layers described above, the target neural network layer may also be another novel neural network layer: in the same way, once the characteristic information of the novel layer has been analyzed, the reusable resources can be identified from it and configured accordingly, which is significant for the construction of novel hybrid neural network structures in the future.
Example 2
Referring to fig. 9, based on the data stream reconstruction method described in embodiment 1, the invention further provides a reconfigurable data stream processor for executing that method. The reconfigurable data stream processor adopts a hierarchical design and comprises a system on chip 1, hardware threads 2 and multiple groups of processing units 3.
The system on chip 1 controls each group of processing units 3 to cooperate with its corresponding hardware thread, and adjusts the processing units to match the function configuration of the target neural network layer, thereby constructing the target neural network layer.
Further, the system on chip 1 comprises an execution controller (with a PCI-e interface), a direct memory access (DMA) controller, execution threads and a buffer area. The execution controller coordinates the processing units 3 and the buffer area according to the network instructions: it extracts the network instructions of the target neural network layer from the external off-chip memory 4, configures them into static memory, and decodes and parses them one by one to drive the execution threads. This centralized control helps reduce logic overhead and improve performance.
The direct memory access controller controls reads and writes between the system on chip 1 and the off-chip DRAM memory 4, realizing multiple read/write modes between them so that network configurations, weights, activations and results can be transferred smoothly. DDR burst mode is used extensively to supply data quickly and reduce DRAM access power. Since memory bandwidth can limit computational throughput, the DMA is configured according to the algorithm attributes so that the memory bandwidth matches the corresponding amount of data; for example, the element size of the data bundle used for PW and DW convolution is made equal to the number of bytes per transfer under the given DRAM protocol. Consecutive burst reads and writes can thus be performed without further data buffering.
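The bundle-to-burst matching can be sketched as a small sizing calculation; the bus width, burst length and element size below are assumed example parameters, not values fixed by the invention.

```python
# Sketch of matching DMA bundles to DDR bursts: the data-bundle element count
# is chosen so that a bundle exactly tiles the bytes moved per burst, letting
# consecutive bursts proceed without intermediate re-buffering.
def dma_bundle_elements(bus_width_bits=64, burst_length=8, bytes_per_elem=2):
    bytes_per_burst = (bus_width_bits // 8) * burst_length
    assert bytes_per_burst % bytes_per_elem == 0, "bundle must tile the burst"
    return bytes_per_burst // bytes_per_elem

# e.g. an assumed 64-bit DDR bus with burst length 8 moves 64 bytes per
# burst, i.e. 32 two-byte activations per bundle for PW/DW convolution.
print(dma_bundle_elements())  # 32
```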
The execution thread is used for running under the control of the execution controller to realize the function of a target neural network layer.
The buffer area comprises a static memory pool formed by a plurality of static memories, wherein the size of each SRAM is 8KB, and different algorithm kernels are configured with different buffer schemes. With the assistance of the execution controller, the SRAM can be instantiated on the fly with various buffer functions, which are determined by the algorithm kernel.
Further, the hardware threads facilitate resource sharing of data flows and weights. Each comprises a core state machine, which controls the data input and output, activation allocation and weight allocation of the processing units on the same thread, and shift registers, which build the inputs and outputs of the activations so as to enable data sharing and reduce power overhead thanks to single fan-out and reduced load capacitance; the shift registers can be dynamically configured in cascade or in parallel. Since some target neural network layers involve computations on vectors, the output data stream is bidirectional, in contrast to the unidirectional input data stream, which facilitates vector computation in kernels such as the residual layer. The multiple processing units are coordinated by the thread-level core finite state machines (FSMs) to process activations and weights in a pipelined manner. The weights stream in from the static memory pool of the system on chip 1, and the individual processing units may receive different weight streams.
Further, to compute the kernel-dependent functions efficiently, the processing unit 3 is designed compactly to implement the required operators. A data input port and a weight input port facilitate matrix and vector calculations. The sigmoid and tangent modules are designed on a piecewise linear approximation technique. The control input receives opcodes from the thread-level FSM, configuring the multiplexers to implement the operators associated with the kernels.
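A behavioural model of one such processing element, with assumed opcode names, shows how the control input selects among the kernel operators; the hardware sigmoid would use the piecewise linear approximation rather than the exact exponential used here for brevity.

```python
# Sketch of a PE datapath: a control opcode from the thread-level FSM selects
# which operator the multiplexer routes over the data and weight input ports.
def pe(opcode, data, weight, acc=0.0):
    if opcode == "MAC":            # convolution / FC / DQN kernels
        return acc + data * weight
    if opcode == "CMP":            # pooling kernel (max operator)
        return data if data > acc else acc
    if opcode == "ADD":            # residual kernel
        return data + weight
    if opcode == "SIG":            # LSTM gates; piecewise linear in hardware
        return 1.0 / (1.0 + 2.718281828 ** (-data))
    raise ValueError(opcode)

print(pe("MAC", 3.0, 2.0, acc=1.0))  # 7.0
```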
The feasibility of the proposed reconfigurable data stream processor was verified experimentally. In the tested architecture, a thread is formed of 108 KB of on-chip SRAM and 16 processing elements (PEs). The experiments were implemented in Verilog HDL, and the Modelsim simulation tool was used to verify the feasibility and run time of the design. Network performance analysis was carried out with the MATLAB neural network toolbox on an NVIDIA GTX GPU. Three network architectures are examined below to analyze the performance of the proposed architecture.
MobileNet is a hybrid-kernel network of standard, pointwise (PW) and depthwise (DW) convolutions, pooling and full connectivity; it employs iterated compact convolution kernels, which account for 97.91% of the MAC operations. Table 2 shows the execution latency of the proposed design on the layers of MobileNet, benchmarked between the multi-threaded and single-threaded architectures using an FPGA prototype with 256 PEs and DRAM support.
TABLE 2 Performance analysis based on MobileNet architecture
Deep reinforcement learning: a typical use of DQN is maze walking, where an intelligent agent learns to reach a destination by choosing the right direction at intersections and avoiding obstacles. As shown in fig. 10, the reinforcement learning action space was tested on 2-, 5- and 10-layer networks with 1, 2, 4 and 6 action nodes, while the state space was chosen between 128 and 256 nodes. For all tested action spaces, the on-chip Q iteration time of all three network structures is less than 2 ms, increasing only slightly with the size of the action space and of the network.
Sequence classification: this test uses sensor data obtained from a body-worn smartphone. An LSTM network is trained on the data to identify the wearer's activity from a time series of accelerometer readings in three different directions. Referring to table 3, and setting aside data transfer between disk storage and DRAM in the simulation results, the proposed LSTM network design outperforms both the CPU and the GPU. The MATLAB measurements, however, include the large latency of data transfers between disk, main memory and the operating system, whereas the present design is currently set up as a stand-alone system. Future LSTM networks tend to be deployed on sensors and to fetch data directly from DRAM for processing, which is very close to the design principle of the invention. Compared with the CPU and GPU, the power consumption of the ASIC hybrid neural network is three orders of magnitude lower, demonstrating its excellent efficiency and proving the feasibility of the invention.
TABLE 3 performance benchmarking of LSTM networks in three processing architectures
In summary, the data stream reconstruction method and reconfigurable data stream processor provided by the invention systematically reuse hardware through dynamic configuration, adjusting the data stream mode and the function modes of the processing units and on-chip storage. For hybrid neural networks they improve hardware utilization, increase operation speed and reduce power consumption, and they provide a resource-reuse foundation for subsequent research into constructing other novel neural network layers and realizing hybrid neural networks based on them.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between them. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to it. Without further limitation, an element preceded by "comprising a …" does not exclude the presence of additional like elements in the process, method, article or apparatus that comprises the element.
The foregoing is directed to embodiments of the present application and it is noted that numerous modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application and are intended to be within the scope of the present application.
Claims (4)
1. A method for reconstructing a data stream, comprising:
acquiring characteristic information of a target neural network layer;
determining a data flow mode corresponding to the target neural network layer, the functional configuration of the processing unit and the functional configuration of the system on chip according to the characteristic information of the target neural network layer;
applying the function configurations corresponding to the target neural network layer to the reusable processing unit and system on chip, and performing the network configuration corresponding to the target neural network layer according to its data flow mode, so as to construct the target neural network layer;
obtaining an output result by adopting the constructed target neural network layer;
when the target neural network layer is a residual layer, the processing unit is configured as an adder; the input or output of the data stream is parallel transmission, and the input and output shift registers of the system on chip are used for storing operands;
when the target neural network layer is a long short-term memory layer, the processing units are divided into four groups, each group of processing units instantiating a sigmoid function and a tanh function, and the input or output of the data stream is serial transmission.
2. A reconfigurable data stream processor for performing the data stream reconstruction method of claim 1, the reconfigurable data stream processor comprising a system on chip, hardware threads and a plurality of groups of processing units, wherein
the system on chip is used for controlling each group of processing units to cooperate with the corresponding hardware thread, adjusting the processing units to match the function configuration of the target neural network layer, and constructing the target neural network layer.
3. The reconfigurable data flow processor of claim 2, wherein the system on chip comprises an execution controller, a direct memory access controller, a thread of execution, and a buffer,
the execution controller is used for extracting network instructions of a target neural network layer from an external off-chip memory, configuring the network instructions into a static memory, and decoding and analyzing the network instructions one by one to drive an execution thread;
the direct memory access controller is used for controlling reading and writing between the system on chip and the off-chip memory;
the execution thread is used for running under the control of the execution controller to realize the function of a target neural network layer;
the buffer area comprises a static memory pool formed by a plurality of static memories.
4. The reconfigurable data stream processor according to claim 2 or 3, characterized in that the hardware threads comprise a core state machine for controlling data input and output, activation function assignment and weight assignment of processing units on the same thread and a shift register for building the input and output of activation functions.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911087000.9A CN111105023B (en) | 2019-11-08 | 2019-11-08 | Data stream reconstruction method and reconfigurable data stream processor |
PCT/CN2020/127250 WO2021089009A1 (en) | 2019-11-08 | 2020-11-06 | Data stream reconstruction method and reconstructable data stream processor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911087000.9A CN111105023B (en) | 2019-11-08 | 2019-11-08 | Data stream reconstruction method and reconfigurable data stream processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111105023A CN111105023A (en) | 2020-05-05 |
CN111105023B (en) | 2023-03-31
Family
ID=70420571
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911087000.9A Active CN111105023B (en) | 2019-11-08 | 2019-11-08 | Data stream reconstruction method and reconfigurable data stream processor |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111105023B (en) |
WO (1) | WO2021089009A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111105023B (en) * | 2019-11-08 | 2023-03-31 | 深圳市中科元物芯科技有限公司 | Data stream reconstruction method and reconfigurable data stream processor |
CN111783971B (en) * | 2020-07-02 | 2024-04-09 | 上海赛昉科技有限公司 | Highly flexibly configurable data post-processor for deep neural network |
CN112560173B (en) * | 2020-12-08 | 2021-08-17 | 北京京航计算通讯研究所 | Vehicle weather resistance temperature prediction method and device based on deep learning |
CN112540950B (en) * | 2020-12-18 | 2023-03-28 | 清华大学 | Reconfigurable processor based on configuration information shared storage and shared storage method thereof |
CN113240084B (en) * | 2021-05-11 | 2024-02-02 | 北京搜狗科技发展有限公司 | Data processing method and device, electronic equipment and readable medium |
CN116702852B (en) * | 2023-08-02 | 2023-10-20 | 电子科技大学 | Dynamic reconfiguration neural network acceleration circuit and system based on multistage event driving |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9390369B1 (en) * | 2011-09-21 | 2016-07-12 | Brain Corporation | Multithreaded apparatus and methods for implementing parallel networks |
CN203204615U (en) * | 2013-03-15 | 2013-09-18 | 上海安路信息科技有限公司 | Dynamic reconfigurable system adaptable to various data flow calculation modes |
CN103218345A (en) * | 2013-03-15 | 2013-07-24 | 上海安路信息科技有限公司 | Dynamic reconfigurable system adaptable to plurality of dataflow computation modes and operating method |
US12061990B2 (en) * | 2017-10-17 | 2024-08-13 | Xilinx, Inc. | Static block scheduling in massively parallel software defined hardware systems |
CN107783840B (en) * | 2017-10-27 | 2020-08-21 | 瑞芯微电子股份有限公司 | Distributed multi-layer deep learning resource allocation method and device |
US11636327B2 (en) * | 2017-12-29 | 2023-04-25 | Intel Corporation | Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism |
CN111105023B (en) * | 2019-11-08 | 2023-03-31 | 深圳市中科元物芯科技有限公司 | Data stream reconstruction method and reconfigurable data stream processor |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409510A (en) * | 2018-09-14 | 2019-03-01 | 中国科学院深圳先进技术研究院 | Neuron circuit, chip, system and method, storage medium |
CN109472356A (en) * | 2018-12-29 | 2019-03-15 | 南京宁麒智能计算芯片研究院有限公司 | A kind of accelerator and method of restructural neural network algorithm |
Non-Patent Citations (2)
Title |
---|
Weiguang Chen et al.; "Accelerating Compact Convolutional Neural Networks with Multi-threaded Data Streaming"; 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI); 2019-09-19; Sections 3-4 *
Liang Minglan et al.; "Reinforcement learning computing engine based on a reconfigurable array architecture"; Integration Technology; 2018-11-30; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111105023A (en) | 2020-05-05 |
WO2021089009A1 (en) | 2021-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111105023B (en) | Data stream reconstruction method and reconfigurable data stream processor | |
US20220050683A1 (en) | Apparatuses, methods, and systems for neural networks | |
CN106940815B (en) | Programmable convolutional neural network coprocessor IP core | |
KR102637735B1 (en) | Neural network processing unit including approximate multiplier and system on chip including the same | |
US11669443B2 (en) | Data layout optimization on processing in memory architecture for executing neural network model | |
US20190026626A1 (en) | Neural network accelerator and operation method thereof | |
Li et al. | Dynamic dataflow scheduling and computation mapping techniques for efficient depthwise separable convolution acceleration | |
KR20230084449A (en) | Neural processing unit | |
Huang et al. | IECA: An in-execution configuration CNN accelerator with 30.55 GOPS/mm² area efficiency | |
EP3971787A1 (en) | Spatial tiling of compute arrays with shared control | |
Chen et al. | Towards efficient allocation of graph convolutional networks on hybrid computation-in-memory architecture | |
Zhang et al. | η-lstm: Co-designing highly-efficient large lstm training via exploiting memory-saving and architectural design opportunities | |
Krishna et al. | Raman: A re-configurable and sparse tinyML accelerator for inference on edge | |
US11704562B1 (en) | Architecture for virtual instructions | |
CN112051981B (en) | Data pipeline calculation path structure and single-thread data pipeline system | |
WO2022047802A1 (en) | Processing-in-memory device and data processing method thereof | |
Liu et al. | A cloud server oriented FPGA accelerator for LSTM recurrent neural network | |
US11922306B2 (en) | Tensor controller architecture | |
CN111178492A (en) | Computing device, related product and computing method for executing artificial neural network model | |
Zeng et al. | Toward a high-performance emulation platformfor brain-inspired intelligent systemsexploring dataflow-based execution model and beyond | |
Bai et al. | An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks | |
CN112906877A (en) | Data layout conscious processing in memory architectures for executing neural network models | |
Qiu et al. | An FPGA‐Based Convolutional Neural Network Coprocessor | |
CN113705800A (en) | Processing unit, related device and method | |
Chen et al. | Dataflow optimization with layer-wise design variables estimation method for enflame CNN accelerators |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
2022-09-27 | TA01 | Transfer of patent application right | Applicant changed from SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY, CHINESE ACADEMY OF SCIENCES (1068 Xueyuan Avenue, Shenzhen University Town, Nanshan District, Shenzhen 518055, Guangdong) to Shenzhen Zhongke Yuanwuxin Technology Co., Ltd. (Room 201, Building A, No. 1, Qianwan 1st Road, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen 518000, Guangdong; located in Shenzhen Qianhai Road Commercial Secretary Co., Ltd.) |
| GR01 | Patent grant | |