CN111105023B - Data stream reconstruction method and reconfigurable data stream processor - Google Patents

Data stream reconstruction method and reconfigurable data stream processor

Info

Publication number
CN111105023B
CN111105023B (application CN201911087000.9A)
Authority
CN
China
Prior art keywords
neural network
data stream
target neural
network layer
chip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911087000.9A
Other languages
Chinese (zh)
Other versions
CN111105023A (en)
Inventor
王峥
周丽冰
陈伟光
谢文婷
粟金源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhongke Yuanwuxin Technology Co ltd
Original Assignee
Shenzhen Zhongke Yuanwuxin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhongke Yuanwuxin Technology Co ltd filed Critical Shenzhen Zhongke Yuanwuxin Technology Co ltd
Priority to CN201911087000.9A priority Critical patent/CN111105023B/en
Publication of CN111105023A publication Critical patent/CN111105023A/en
Priority to PCT/CN2020/127250 priority patent/WO2021089009A1/en
Application granted granted Critical
Publication of CN111105023B publication Critical patent/CN111105023B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation using electronic means
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Memory System (AREA)

Abstract

The invention discloses a data stream reconstruction method and a reconfigurable data stream processor oriented to hybrid artificial neural networks. According to the neural network layer at hand, the method dynamically changes the function configuration of resources such as the computing units, storage units, and data flow units, and reuses hardware on a large scale to realize neural network layers with different functions; for a hybrid neural network structure composed of multiple such layers, this improves hardware utilization, raises operation speed, and reduces power consumption. In particular, by acquiring the characteristic information of other, novel neural network layers, the reusable configurations can be confirmed, providing a resource-reuse foundation for constructing such layers, and hybrid neural networks based on them, in subsequent research; the method is therefore highly general.

Description

Data stream reconstruction method and reconfigurable data stream processor
Technical Field
The present invention relates to the field of data stream technology of neural networks, and in particular, to a data stream reconstruction method and a reconfigurable data stream processor.
Background
Neural networks are widely used in fields such as computer vision, natural language processing, and game engines. With the rapid development of neural network structures, their demand for computing power over different data streams keeps growing, so hybrid neural networks are the trend of the future: a compact set of algorithm kernels can support end-to-end tasks in perception, control, and even driving. Meanwhile, dedicated hardware accelerator architectures such as Eyeriss, Google TPU-I, and DaDianNao have been proposed to accelerate the inference phase of neural networks. They achieve high performance and high resource utilization through algorithm-architecture co-design techniques such as dedicated data streams and systolic-array multipliers, but these architectures are tightly coupled to particular neural networks and cannot accelerate other ones. Corresponding data stream schemes therefore need to be designed for different neural networks, and a data stream reconstruction method is the key design point for hybrid artificial neural networks.
In the prior art, there is no scheme that performs resource multiplexing through data stream reconstruction for a hybrid neural network structure composed of different neural network layers such as pooling layers, fully-connected layers, recurrent LSTM layers, deep reinforcement learning layers, and residual layers. As a result, prior-art schemes often suffer from high hardware cost, complex structure, low operation speed, and high operation power consumption.
Disclosure of Invention
In view of the above, the present invention provides a data stream reconstruction method and a reconfigurable data stream processor to solve the above problems.
To achieve the above purpose, the invention adopts the following technical scheme:
the invention provides a data stream reconstruction method, which comprises the following steps: acquiring characteristic information of a target neural network layer; determining, according to that characteristic information, the data flow mode, the processing-unit function configuration, and the system-on-chip function configuration corresponding to the target neural network layer; applying the function configurations corresponding to the target neural network layer to the reusable processing units and system on chip, and performing the network configuration corresponding to the target neural network layer according to its data flow mode, so as to construct the target neural network layer; and obtaining an output result using the constructed target neural network layer.
Preferably, when the target neural network layer is a convolutional layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into a plurality of threads; the input or output of the data stream is thread-level-parallel serial transmission, a static memory of the system on chip is configured to buffer the input-feature-map activations on a thread, weights and activations are shared among the threads, and the serial output of each thread is buffered and then output in parallel.
Preferably, when the target neural network layer is a pooling layer, the processing units are configured as comparators; the input or output of the data stream is parallel transmission.
Preferably, when the target neural network layer is a fully-connected layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into a plurality of threads; the input or output of the data stream is thread-level-parallel serial transmission, a static memory of the system on chip is configured as a weight buffer, and the activations stream serially through the threads.
Preferably, when the target neural network layer is a residual layer, the processing unit is configured as an adder; the input or output of the data stream is parallel transmission, and the input and output shift registers of the system on chip are used for storing operands.
Preferably, when the target neural network layer is a long short-term memory layer, the processing units are divided into four groups, each group instantiating the sigmoid and tanh functions, and the input or output of the data stream is serial transmission.
Preferably, when the target neural network layer is a reinforcement learning layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into a plurality of threads; the input or output of the data stream is thread-level-parallel serial transmission, and the on-chip cache is used for state activations and the iterative operations.
The invention further provides a reconfigurable data stream processor for executing the above data stream reconstruction method, comprising a system on chip, hardware threads, and multiple groups of processing units. The system on chip is used for controlling each group of processing units to cooperate with its corresponding hardware thread and for adjusting the processing units to match the function configuration of the target neural network layer, so as to construct the target neural network layer.
Preferably, the system on chip comprises an execution controller, a direct memory access controller, execution threads, and a buffer area. The execution controller is used for extracting the network instructions of the target neural network layer from the external off-chip memory, configuring them into static memory, and decoding and parsing them one by one to drive the execution threads; the direct memory access controller is used for controlling reads and writes between the system on chip and the off-chip memory; the execution threads run under the control of the execution controller to realize the function of the target neural network layer; and the buffer area comprises a static memory pool formed by a plurality of static memories.
Preferably, the hardware thread comprises a core state machine and a shift register, the core state machine is used for controlling data input and output, activation function allocation and weight allocation of the processing units on the same thread, and the shift register is used for constructing input and output of an activation function.
The data stream reconstruction method and reconfigurable data stream processor provided by the invention support the operators of different neural network layers by dynamically changing the functions of the computing, storage, and data flow units, reuse hardware resources on a large scale, adapt to a variety of neural networks, particularly novel hybrid neural networks, and achieve higher hardware utilization, higher operation speed, and lower power consumption.
Drawings
FIG. 1 is an exemplary block diagram of a hybrid neural network architecture;
fig. 2 is a flowchart of a data stream reconstruction method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the structure of convolutional layers as target neural network layers;
FIG. 4 is a schematic diagram of the structure of a pooling layer as a target neural network layer;
FIG. 5 is a schematic diagram of a fully-connected layer as a target neural network layer;
FIG. 6 is a schematic diagram of a structure of a residual layer as a target neural network layer;
FIG. 7 is a schematic structural diagram of a long short-term memory layer as a target neural network layer;
FIG. 8 is a schematic diagram of a reinforcement learning layer as a target neural network layer;
FIG. 9 is a block diagram of a reconfigurable data stream processor according to an embodiment of the present invention;
fig. 10 is a comparison graph of Q iteration time between the architecture designed in the verification experiment of example 2 and the host.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings. Examples of these preferred embodiments are illustrated in the accompanying drawings. The embodiments of the invention shown in the drawings and described in accordance with the drawings are exemplary only, and the invention is not limited to these embodiments.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps that are closely related to the solution according to the present invention are shown in the drawings, and other details that are not relevant are omitted.
Example 1
Referring to fig. 1, which depicts the important network layers and their interconnections in a hybrid neural network architecture forming an end-to-end network aimed at sensing and control. For graphical inputs, cascaded convolutional and pooling layers serve as the perception module for visual feature extraction; networks such as Yolo-v3 and Resnet-50 can reach tens of layers to mimic the human visual system. For video context understanding and language processing, time-related feature sequences are fed into the LSTM long short-term memory layer, which re-extracts the sequence-related feature outputs. Unlike the preceding layers, the LSTM network is a special recurrent-neural-network structure built from four basic gates: an input gate (I), an output gate (O), a forget gate (F), and a cell state gate (C). While the I, O and F gates compute layer outputs through vector operations, the C gate holds the current cell state and serves as the recursive input for the next time step.
The control network layer follows feature extraction: the extracted feature parameters are regarded as state nodes in the deep reinforcement learning network DQN, and the optimal decision is selected through action nodes. The method traverses all possible actions in the current state and, according to the reinforcement learning policy, executes a regression to find the maximum or minimum output value (the Q value). Since the action nodes must be iterated over, all computations in the subsequent layers must also be iterated, as indicated by the dashed box. The multilayer perceptron employs the most common fully-connected layers. Shortcut connections are also commonly used in residual networks, where providing key elements of an earlier layer ahead of the current input improves classification and regression accuracy.
For an artificial neural network, the layers differ not only in network structure but also in operands, operators, and nonlinear functions. Because the data stream attributes are reconfigurable for different network structures, analyzing the characteristics of the various neural network layers, for example their data stream access modes and the functions of their computing resources, reveals the common points among them and hence the resources that can be reused while constructing a hybrid neural network structure, such as the processing units (PEs), the data inputs/outputs, the static memory (SRAM) used as buffers, and the interface to the off-chip DRAM. Performing data stream reconstruction along these lines solves the prior-art problems. Table 1 summarizes the characteristic information of several standard neural network layers.
TABLE 1. Characteristic information of multiple standard neural network layers
[Table 1 is reproduced as an image in the original publication.]
It can be seen that the pooling and shortcut (residual) layers are vector operations, while the other kernels are matrix operations; among these, the convolution process is sparse and the remaining network layers are dense. Each network layer uses a different activation function: as its nonlinearity, the LSTM network uses both sigmoid and tanh, while the remaining matrix kernels use the ReLU function or sigmoid.
Network data in the convolutional and fully-connected layers must be shared among the nodes of the output feature map. The LSTM layer employs a similar serial flow, except that its activation stream must be shared among multiple gates, and the state-action layer must generate its data stream quickly from the iteration over action nodes. The pooling and residual layers, which operate on vectors, need not share activations across the feature map, so vector-type activations can be transmitted in parallel.
Furthermore, analyzing the function of the intermediate data across the network layers shows that the convolutional and pooling layers are dominated mainly by activations, while the FC fully-connected and LSTM layers are dominated by weights owing to the nature of their data sparsity. In the residual layer, a pointer to the activations of an earlier layer must be kept so that the network can process the earlier data.
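To make the above analysis concrete, the characteristic information can be pictured as a small record per layer type. The following is a minimal Python sketch; the field names and entries are reconstructed from the prose above (table 1 itself survives only as an image), so they are illustrative rather than the patent's own notation.

```python
# Hypothetical per-layer characteristic record: operation type, operand
# density, nonlinearity, transfer mode, and the dominant intermediate data.
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerTraits:
    op_type: str       # "matrix" or "vector" kernel
    density: str       # "sparse" or "dense"
    nonlinearity: str  # "relu", "sigmoid", "sigmoid+tanh", or "none"
    transfer: str      # "serial" (shared activations) or "parallel"
    dominant: str      # dominant intermediate data: "activation" or "weight"

TRAITS = {
    "conv":     LayerTraits("matrix", "sparse", "relu",         "serial",   "activation"),
    "pool":     LayerTraits("vector", "dense",  "none",         "parallel", "activation"),
    "fc":       LayerTraits("matrix", "dense",  "relu",         "serial",   "weight"),
    "residual": LayerTraits("vector", "dense",  "none",         "parallel", "activation"),
    "lstm":     LayerTraits("matrix", "dense",  "sigmoid+tanh", "serial",   "weight"),
    "dqn":      LayerTraits("matrix", "dense",  "relu",         "serial",   "activation"),
}
```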
Combining the above analysis, as shown in fig. 2, the present invention provides a data stream reconstruction method, which includes:
s1, acquiring characteristic information of a target neural network layer;
s2, determining a data stream mode corresponding to the target neural network layer, the functional configuration of the processing unit and the functional configuration of the system on chip (SoC) according to the characteristic information of the target neural network layer;
s3, performing function configuration of the processing unit and the system on chip corresponding to the target neural network layer on the reusable processing unit and the system on chip, and performing network configuration corresponding to the target neural network layer according to a data flow mode of the target neural network layer to construct the target neural network layer;
and S4, obtaining an output result by adopting the constructed target neural network layer.
The data stream reconstruction method provided by the invention dynamically changes the function configuration of resources such as the computing units, storage units, and data flow units according to the neural network layer at hand, realizes layers with different functions by reusing hardware on a large scale, improves hardware utilization, operation speed, and power consumption for hybrid neural network structures composed of multiple layers, and provides a resource-reuse basis for constructing other novel neural network layers in subsequent research. Compared with prior-art data stream reuse schemes that target only the fine-grained data of standard convolution operators, such as weight-stationary, output-stationary, and row-stationary schemes, it is more general and obtains better results.
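As an illustration of steps S1 to S4, the following minimal sketch assumes a characteristic-info record like the TRAITS table sketched earlier; all names are hypothetical, and on real hardware "configuring" means programming the PE array, the SRAM pool, and the DMA rather than building a dictionary.

```python
def reconstruct_dataflow(traits, layer_params: dict) -> dict:
    # S1: the caller obtained `traits`, the layer's characteristic information.
    # S2: derive the data flow mode and the PE / system-on-chip configurations.
    config = {
        "dataflow_mode": traits.transfer,       # serial vs. parallel
        "pe_function":   traits.nonlinearity,   # MAC+ReLU, comparator, ...
        "soc_buffers":   "weights" if traits.dominant == "weight" else "activations",
    }
    # S3: apply the function configuration to the reusable PEs and system on
    # chip, and wire the network per the data flow mode (stubbed here).
    layer = {"config": config, **layer_params}
    return layer  # S4: the constructed layer is then run to obtain outputs
```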
The following describes, with reference to fig. 3 to fig. 8 (in which the dashed lines indicate resources that do not need to be multiplexed), a method for managing data streams and sharing resources by taking an important neural network layer as an example, specifically as follows:
referring to fig. 3, when the target neural network layer is a convolutional layer, the processing unit includes a multiply-accumulate operation unit and a modified linear unit which are grouped and configured in a plurality of threads, wherein each thread processes data using the same row and column on a plurality of channels of an output feature map; the input or output of the data stream is thread-level parallel serial transmission, a static memory of the system on chip is configured to be used for buffering an activation function of an input feature graph on a thread, a weight and the activation function are shared among a plurality of threads, the activation function is in serial streaming transmission from a single buffer area to realize sharing among processing units, and serial output of each thread is output in parallel through a serial deserializer SERDES and a DRAM controller after being output and buffered.
As shown in fig. 4, when the target neural network layer is a pooling layer, the processing units are configured as comparators to implement the max and min operators. The input or output of the data stream is parallel transmission; because the pooling layer operates directly on vectors, the activations fetched from DRAM are provided to the processing-unit array without buffering, which greatly saves dynamic power, and the activations are streamed past the comparators by modifying the DRAM access address.
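A sketch of this buffer-less, address-driven pooling, using max pooling over a 1-D window for brevity (the window layout and stride are illustrative):

```python
import numpy as np

def max_pool_1d(feature: np.ndarray, window: int, stride: int) -> np.ndarray:
    # the DRAM address generator selects each window; the comparator-
    # configured PE array reduces it with no intermediate SRAM buffering
    starts = range(0, len(feature) - window + 1, stride)
    return np.array([feature[s:s + window].max() for s in starts])
```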
Referring to fig. 5, when the target neural network layer is a fully-connected layer, the output and processing units are configured like those of the convolutional layer: the processing units comprise multiply-accumulate units and rectified linear units grouped into a plurality of threads, and the input or output of the data stream is thread-level-parallel serial transmission. For this weight-dominant kernel, the static memory of the system on chip is configured as a weight buffer, and the activations stream serially through the threads.
As shown in fig. 6, when the target neural network layer is a residual layer, the kernel, like the pooling layer, works directly on its operands, and the processing units are configured as adders. The input or output of the data stream is parallel transmission; because two vectors are added, the input and output shift registers of the system on chip store the operands, the result is written to the output shift register and then to DRAM in parallel, and a pointer buffer is instantiated to address the two operands in DRAM.
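A sketch of the residual-layer dataflow, with a dict standing in for DRAM and the pointer buffer holding the two operand addresses (all names hypothetical):

```python
import numpy as np

def residual_add(dram: dict, ptr_a: str, ptr_b: str, ptr_out: str) -> None:
    a = dram[ptr_a]        # operand latched into one input shift register
    b = dram[ptr_b]        # second operand, addressed via the pointer buffer
    dram[ptr_out] = a + b  # PEs configured as adders; the result is written
                           # back to DRAM in parallel via the output register

dram = {"layer3_out": np.ones(4), "layer5_out": np.full(4, 2.0)}
residual_add(dram, "layer3_out", "layer5_out", "res_out")  # res_out = [3,3,3,3]
```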
Referring to fig. 7, when the target neural network layer is a long short-term memory layer, the processing units are multiplexed into four groups, each group instantiating the sigmoid and tanh functions, followed by the vector addition and tanh operations. The input or output of the data stream is serial transmission, and a mixed input mode provides fast data both for the activations shared within each group of gates and between the different groups. A state-unit cache is instantiated to retain the intermediate state information.
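A plain-NumPy reference for one LSTM step, organized to mirror the four PE groups (three sigmoid gates plus the tanh cell path); it computes the standard LSTM equations, not the hardware pipeline itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """W, U, b: dicts keyed by gate name: 'i', 'f', 'o', 'g' (cell path)."""
    gates = {k: W[k] @ x + U[k] @ h + b[k] for k in ("i", "f", "o", "g")}
    i, f, o = (sigmoid(gates[k]) for k in ("i", "f", "o"))  # three PE groups
    g = np.tanh(gates["g"])                                 # fourth PE group
    c_next = f * c + i * g        # the state-unit cache keeps c between steps
    h_next = o * np.tanh(c_next)  # vector add and tanh after the gate groups
    return h_next, c_next
```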
Referring to fig. 8, when the target neural network layer is a reinforcement learning layer, the input, output, and processing units are configured similarly to the fully-connected layer, including the various activation sources such as DRAM for conventional activations. The processing units comprise multiply-accumulate units and rectified linear units grouped into a plurality of threads, the input or output of the data stream is thread-level-parallel serial transmission, and the on-chip cache is used for state activations and the iterative operations.
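A sketch of the Q iteration over the action space; q_network is a hypothetical stand-in for the fully-connected pipeline constructed above, and the state vector stays cached on chip while only the action nodes change per iteration.

```python
import numpy as np

def best_action(state: np.ndarray, actions: list, q_network) -> tuple:
    # re-run the constructed pipeline once per candidate action (the dashed
    # iteration box of fig. 1); keep the action with the extremal Q value
    q_values = [q_network(np.concatenate([state, a])) for a in actions]
    best = int(np.argmax(q_values))
    return actions[best], q_values[best]
```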
Besides the neural network layers above, the method also applies to other, novel neural network layers: as long as the characteristic information of the new layer is analyzed, the reusable resources can be identified from it and configured accordingly, which matters for constructing future novel hybrid neural network structures.
Example 2
Referring to fig. 9, based on the data stream reconstruction method described in embodiment 1, the present invention further provides a reconfigurable data stream processor for executing that method. The reconfigurable data stream processor adopts a hierarchical design and comprises a system on chip 1, hardware threads 2, and multiple groups of processing units 3.
The system on chip 1 is configured to control each group of processing units 3 to cooperate with its corresponding hardware thread, and to adjust the processing units to match the function configuration of the target neural network layer, so as to construct the target neural network layer.
Further, the system on chip 1 includes an execution controller (attached through a PCI-e interface), a direct memory access (DMA) controller, execution threads, and a buffer area. The execution controller coordinates the processing units 3 and the buffer area according to the network instructions: it extracts the network instructions of the target neural network layer from the external off-chip memory 4, configures them into static memory, and decodes and parses them one by one to drive the execution threads. This centralized control helps reduce logic overhead and improve performance.
The direct memory access controller controls reads and writes between the system on chip 1 and the off-chip DRAM memory 4, realizing multiple read/write modes between them so that network configurations, weights, activations, and results can be transferred smoothly. The DDR burst mode is used heavily to supply data quickly and reduce DRAM access power. Because memory bandwidth can limit computational throughput, the DMA is configured according to the algorithm attributes, matching the memory bandwidth to the corresponding amount of data; for example, the element size of the data bundle used for pointwise (PW) and depthwise (DW) convolution equals the number of bytes per transfer under the given DRAM protocol, so that consecutive burst reads and writes can proceed without further data buffering.
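For example, a minimal sketch of matching the DMA burst configuration to a kernel's data-bundle size; the 64-byte burst is an illustrative protocol figure, not one taken from the patent.

```python
def dma_burst_config(bundle_elems: int, elem_bytes: int,
                     bytes_per_burst: int = 64) -> dict:
    bundle_bytes = bundle_elems * elem_bytes
    # size bursts so each bundle maps onto whole bursts, enabling
    # back-to-back burst reads/writes with no extra data buffering
    bursts = -(-bundle_bytes // bytes_per_burst)  # ceiling division
    return {"burst_bytes": bytes_per_burst, "bursts_per_bundle": bursts}
```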
The execution thread is used for running under the control of the execution controller to realize the function of a target neural network layer.
The buffer area comprises a static memory pool formed by multiple static memories, each SRAM being 8 KB in size, and different algorithm kernels are configured with different buffering schemes. With the assistance of the execution controller, the SRAMs can be instantiated on the fly with various buffer functions, as determined by the algorithm kernel.
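A sketch of this kernel-driven buffer instantiation; the role names are illustrative, and the pooling/residual kernels, which bypass the SRAM pool, get no buffer.

```python
SRAM_BANK_BYTES = 8 * 1024  # each SRAM in the pool is 8 KB

def allocate_buffers(kernel: str, n_banks: int) -> dict:
    roles = {
        "conv": "activation_buffer",  # buffer input-feature-map activations
        "fc":   "weight_buffer",      # weight-dominant kernel
        "lstm": "state_cache",        # retain intermediate state
        "dqn":  "state_cache",        # state activations for Q iteration
    }
    role = roles.get(kernel, "unused")
    return {f"bank{i}": role for i in range(n_banks)}
```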
Further, the hardware thread facilitates resource sharing of the data flow and the weights. It comprises a core state machine, which controls the data input/output, activation allocation, and weight allocation of the processing units on the same thread, and shift registers, which build the inputs and outputs of the activations; thanks to the single fan-out and reduced load capacitance, this enables data sharing at reduced power overhead, and the registers can be dynamically configured in cascade or in parallel. Since some target neural network layers involve vector computation, the output data stream is bidirectional, in contrast to the unidirectional input data stream, which facilitates vector computation in, for example, the residual-layer kernel. The multiple processing units are coordinated by the thread-level core finite state machines (FSMs) to process activations and weights in a pipelined manner. The weights are streamed in from the static memory pool of the system on chip 1, and each processing unit may receive a different weight stream.
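The shift-register chain can be pictured as below, a behavioral sketch only; each tap drives a single PE (single fan-out), and a new value enters as the oldest leaves.

```python
from collections import deque

class ShiftRegister:
    """Behavioral model of one cascaded shift-register chain."""
    def __init__(self, depth: int):
        self.regs = deque([0.0] * depth, maxlen=depth)

    def shift(self, value: float) -> float:
        out = self.regs[0]       # value leaving the far end of the chain
        self.regs.append(value)  # new value enters; everything moves one tap
        return out
```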
Further, to compute the kernel-dependent functions efficiently, the processing unit 3 is designed compactly to implement the required operators. A data input port and a weight input port facilitate matrix and vector computation. The sigmoid and tangent (tanh) modules are designed using a piecewise linear approximation technique. The control input receives opcodes from the thread-level FSM, configuring the multiplexers to implement the operators associated with the kernels.
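As an example of such an approximation, the following uses a well-known three-segment piecewise-linear sigmoid (the PLAN scheme); the breakpoints and slopes are a published textbook choice, not the patent's actual coefficients, and tanh is derived from sigmoid by identity.

```python
def pl_sigmoid(x: float) -> float:
    ax = abs(x)                     # exploit sigmoid(-x) = 1 - sigmoid(x)
    if ax >= 5.0:
        y = 1.0
    elif ax >= 2.375:
        y = 0.03125 * ax + 0.84375  # slope 1/32: shift-and-add friendly
    elif ax >= 1.0:
        y = 0.125 * ax + 0.625      # slope 1/8
    else:
        y = 0.25 * ax + 0.5         # slope 1/4
    return y if x >= 0 else 1.0 - y

def pl_tanh(x: float) -> float:
    return 2.0 * pl_sigmoid(2.0 * x) - 1.0  # tanh(x) = 2*sigmoid(2x) - 1
```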
The feasibility of the proposed reconfigurable data stream processor was verified experimentally. In the tested architecture, a thread is formed by 16 processing elements (PEs) together with 108 KB of on-chip SRAM. The experiments were implemented in the Verilog hardware description language, and the Modelsim simulation tool was used to verify the feasibility and run time of the design. Network performance analysis was carried out on an NVIDIA GTX GPU using MATLAB's neural network toolbox. Three network architectures are examined below to analyze the performance of the proposed architecture.
MobileNet is a hybrid-kernel network of standard, pointwise (PW) and depthwise (DW) convolutions, pooling, and fully-connected layers; its iterated compact convolution kernels account for 97.91% of the MAC operations. Table 2 shows the execution latency of the proposed design on the layers of MobileNet, benchmarked between the multi-threaded and single-threaded architectures using an FPGA prototype with 256 PEs and DRAM support.
TABLE 2 Performance analysis based on MobileNet architecture
[Table 2 is reproduced as an image in the original publication.]
Deep reinforcement learning: a typical DQN use case is maze walking, where an agent learns to reach a destination by choosing the right direction at intersections and avoiding obstacles. As shown in fig. 10, the reinforcement-learning action space was tested on 2-, 5- and 10-layer networks with 1, 2, 4 and 6 action nodes, while the state space was chosen between 128 and 256 nodes. For all tested action spaces, the on-chip Q iteration time of all three network structures is below 2 ms; it increases slightly with the size of the action space and of the network.
Sequence classification: this test uses sensor data from a body-worn smartphone; an LSTM network is trained on time series of accelerometer readings in three directions to identify the wearer's activity. Referring to table 3, and setting aside the simulated cost of data transfer between disk storage and DRAM, the proposed LSTM design achieves better performance than both the CPU and the GPU. The MATLAB measurements include the large latency of data transfers between disk, main memory, and the operating system, whereas the present design is currently a stand-alone system; however, future LSTM networks tend to be deployed on sensors and to fetch data directly from DRAM for processing, which closely matches the design principles of the invention. Compared with the CPU and the GPU, the power consumption of the ASIC hybrid neural network is three orders of magnitude lower, demonstrating its excellent efficiency and the feasibility of the invention.
TABLE 3 performance benchmarking of LSTM networks in three processing architectures
[Table 3 is reproduced as an image in the original publication.]
In summary, the data stream reconstruction method and reconfigurable data stream processor provided by the invention systematically reuse hardware through dynamic configuration, adjusting the data flow mode and the function modes of the processing units and the on-chip storage. For hybrid neural networks they improve hardware utilization and operation speed and reduce power consumption, and they provide a resource-reuse basis for subsequent research constructing other novel neural network layers and hybrid neural networks based on them.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between those entities or actions. Likewise, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises it.
The foregoing is directed to embodiments of the present application and it is noted that numerous modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application and are intended to be within the scope of the present application.

Claims (4)

1. A method for reconstructing a data stream, comprising:
acquiring characteristic information of a target neural network layer;
determining, according to the characteristic information of the target neural network layer, the data flow mode, the processing-unit function configuration, and the system-on-chip function configuration corresponding to the target neural network layer;
applying the function configurations corresponding to the target neural network layer to the reusable processing units and system on chip, and performing the network configuration corresponding to the target neural network layer according to its data flow mode, to construct the target neural network layer;
obtaining an output result by adopting the constructed target neural network layer;
when the target neural network layer is a residual layer, the processing unit is configured as an adder; the input or output of the data stream is parallel transmission, and the input and output shift registers of the system on chip are used for storing operands;
when the target neural network layer is a long short-term memory layer, the processing units are divided into four groups, each group of processing units being used for instantiating the sigmoid and tanh functions, and the input or output of the data stream is serial transmission.
2. A reconfigurable data stream processor for performing the data stream reconstruction method of claim 1, the reconfigurable data stream processor comprising a system on a chip, hardware threads and a plurality of sets of processing units,
the system on chip is used for controlling each group of processing units to cooperate with its corresponding hardware thread, and for adjusting the processing units to match the function configuration of the target neural network layer, so as to construct the target neural network layer.
3. The reconfigurable data flow processor of claim 2, wherein the system on chip comprises an execution controller, a direct memory access controller, a thread of execution, and a buffer,
the execution controller is used for extracting network instructions of a target neural network layer from an external off-chip memory, configuring the network instructions into a static memory, and decoding and analyzing the network instructions one by one to drive an execution thread;
the direct memory access controller is used for controlling reads and writes between the system on chip and the off-chip memory;
the execution thread is used for running under the control of the execution controller to realize the function of a target neural network layer;
the buffer area comprises a static memory pool formed by a plurality of static memories.
4. The reconfigurable data stream processor according to claim 2 or 3, characterized in that the hardware threads comprise a core state machine for controlling data input and output, activation function assignment and weight assignment of processing units on the same thread and a shift register for building the input and output of activation functions.
CN201911087000.9A 2019-11-08 2019-11-08 Data stream reconstruction method and reconfigurable data stream processor Active CN111105023B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911087000.9A CN111105023B (en) 2019-11-08 2019-11-08 Data stream reconstruction method and reconfigurable data stream processor
PCT/CN2020/127250 WO2021089009A1 (en) 2019-11-08 2020-11-06 Data stream reconstruction method and reconstructable data stream processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911087000.9A CN111105023B (en) 2019-11-08 2019-11-08 Data stream reconstruction method and reconfigurable data stream processor

Publications (2)

Publication Number Publication Date
CN111105023A CN111105023A (en) 2020-05-05
CN111105023B true CN111105023B (en) 2023-03-31

Family

ID=70420571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911087000.9A Active CN111105023B (en) 2019-11-08 2019-11-08 Data stream reconstruction method and reconfigurable data stream processor

Country Status (2)

Country Link
CN (1) CN111105023B (en)
WO (1) WO2021089009A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105023B (en) * 2019-11-08 2023-03-31 深圳市中科元物芯科技有限公司 Data stream reconstruction method and reconfigurable data stream processor
CN111783971B (en) * 2020-07-02 2024-04-09 上海赛昉科技有限公司 Highly flexibly configurable data post-processor for deep neural network
CN112560173B (en) * 2020-12-08 2021-08-17 北京京航计算通讯研究所 Vehicle weather resistance temperature prediction method and device based on deep learning
CN112540950B (en) * 2020-12-18 2023-03-28 清华大学 Reconfigurable processor based on configuration information shared storage and shared storage method thereof
CN113240084B (en) * 2021-05-11 2024-02-02 北京搜狗科技发展有限公司 Data processing method and device, electronic equipment and readable medium
CN116702852B (en) * 2023-08-02 2023-10-20 电子科技大学 Dynamic reconfiguration neural network acceleration circuit and system based on multistage event driving

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409510A (en) * 2018-09-14 2019-03-01 中国科学院深圳先进技术研究院 Neuron circuit, chip, system and method, storage medium
CN109472356A (en) * 2018-12-29 2019-03-15 南京宁麒智能计算芯片研究院有限公司 A kind of accelerator and method of restructural neural network algorithm

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9390369B1 (en) * 2011-09-21 2016-07-12 Brain Corporation Multithreaded apparatus and methods for implementing parallel networks
CN203204615U (en) * 2013-03-15 2013-09-18 上海安路信息科技有限公司 Dynamic reconfigurable system adaptable to various data flow calculation modes
CN103218345A (en) * 2013-03-15 2013-07-24 上海安路信息科技有限公司 Dynamic reconfigurable system adaptable to plurality of dataflow computation modes and operating method
US12061990B2 (en) * 2017-10-17 2024-08-13 Xilinx, Inc. Static block scheduling in massively parallel software defined hardware systems
CN107783840B (en) * 2017-10-27 2020-08-21 瑞芯微电子股份有限公司 Distributed multi-layer deep learning resource allocation method and device
US11636327B2 (en) * 2017-12-29 2023-04-25 Intel Corporation Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism
CN111105023B (en) * 2019-11-08 2023-03-31 深圳市中科元物芯科技有限公司 Data stream reconstruction method and reconfigurable data stream processor

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409510A (en) * 2018-09-14 2019-03-01 中国科学院深圳先进技术研究院 Neuron circuit, chip, system and method, storage medium
CN109472356A (en) * 2018-12-29 2019-03-15 南京宁麒智能计算芯片研究院有限公司 A kind of accelerator and method of restructural neural network algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Accelerating Compact Convolutional Neural Networks with Multi-threaded Data Streaming; Weiguang Chen et al.; 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI); 2019-09-19; sections 3-4 *
Reinforcement learning computing engine based on a reconfigurable array architecture; Liang Minglan et al.; Journal of Integration Technology; November 2018; full text *

Also Published As

Publication number Publication date
CN111105023A (en) 2020-05-05
WO2021089009A1 (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
US20220050683A1 (en) Apparatuses, methods, and systems for neural networks
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
KR102637735B1 (en) Neural network processing unit including approximate multiplier and system on chip including the same
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
US20190026626A1 (en) Neural network accelerator and operation method thereof
Li et al. Dynamic dataflow scheduling and computation mapping techniques for efficient depthwise separable convolution acceleration
KR20230084449A (en) Neural processing unit
Huang et al. IECA: An in-execution configuration CNN accelerator with 30.55 GOPS/mm² area efficiency
EP3971787A1 (en) Spatial tiling of compute arrays with shared control
Chen et al. Towards efficient allocation of graph convolutional networks on hybrid computation-in-memory architecture
Zhang et al. η-lstm: Co-designing highly-efficient large lstm training via exploiting memory-saving and architectural design opportunities
Krishna et al. Raman: A re-configurable and sparse tinyML accelerator for inference on edge
US11704562B1 (en) Architecture for virtual instructions
CN112051981B (en) Data pipeline calculation path structure and single-thread data pipeline system
WO2022047802A1 (en) Processing-in-memory device and data processing method thereof
Liu et al. A cloud server oriented FPGA accelerator for LSTM recurrent neural network
US11922306B2 (en) Tensor controller architecture
CN111178492A (en) Computing device, related product and computing method for executing artificial neural network model
Zeng et al. Toward a high-performance emulation platformfor brain-inspired intelligent systemsexploring dataflow-based execution model and beyond
Bai et al. An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks
CN112906877A (en) Data layout conscious processing in memory architectures for executing neural network models
Qiu et al. An FPGA‐Based Convolutional Neural Network Coprocessor
CN113705800A (en) Processing unit, related device and method
Chen et al. Dataflow optimization with layer-wise design variables estimation method for enflame CNN accelerators

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220927

Address after: Room 201, Building A, No. 1, Qianwan 1st Road, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong, 518000 (located in Shenzhen Qianhai Road Commercial Secretary Co., Ltd.)

Applicant after: Shenzhen Zhongke Yuanwuxin Technology Co.,Ltd.

Address before: 1068 No. 518055 Guangdong city of Shenzhen province Nanshan District Shenzhen University city academy Avenue

Applicant before: SHENZHEN INSTITUTES OF ADVANCED TECHNOLOGY CHINESE ACADEMY OF SCIENCES

GR01 Patent grant