WO2021089009A1 - Data stream reconstruction method and reconstructable data stream processor


Info

Publication number
WO2021089009A1
WO2021089009A1 · PCT/CN2020/127250 · CN2020127250W
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
data stream
target neural network layer
layer
Application number
PCT/CN2020/127250
Other languages
French (fr)
Chinese (zh)
Inventor
王峥
周丽冰
陈伟光
谢文婷
粟金源
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2021089009A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation using electronic means
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

A data stream reconstruction method and a reconstructable data stream processor. According to the different neural network layers, the functional configuration of resources such as the computing units, the storage units and the data-movement units is changed dynamically, and neural network layers with different functions are realized by reusing hardware on a large scale. For a hybrid neural network structure formed from a plurality of neural network layers, this improves hardware utilization, increases operation speed and reduces power consumption.

Description

Data stream reconstruction method and reconfigurable data stream processor

Technical field
The present invention relates to the technical field of neural-network data flow, and in particular to a data stream reconstruction method and a reconfigurable data stream processor.
Background
Neural networks are widely used in computer vision, natural language processing, game engines and other fields. As neural network structures evolve rapidly, the computing-power demands of their different data streams keep increasing, so hybrid neural networks are the general trend: their compact algorithm kernels can support end-to-end tasks in perception, control and even driving. Meanwhile, dedicated hardware accelerator architectures such as Eyeriss, Google TPU-I and DaDianNao have been proposed to accelerate the inference stage of neural networks. Through algorithm-architecture co-design techniques such as dedicated data flows and systolic-array multipliers, they achieve high performance and high resource utilization, but these architectures are tightly coupled to particular neural networks and cannot accelerate different ones. A corresponding data flow scheme therefore has to be designed for each neural network, and the underlying data stream reconstruction method is the design focus of hybrid artificial neural networks.
In the prior art there is no scheme that reuses resources through data stream reconstruction for hybrid neural network structures composed of different neural network layers, such as pooling layers, fully connected layers, recurrent LSTM layers, deep reinforcement learning layers and residual layers. As a result, prior-art solutions often suffer from high hardware cost, complex structure, slow operation and high power consumption.
Summary of the invention
In view of this, the purpose of the present invention is to provide a data stream reconstruction method and a reconfigurable data stream processor that solve the above problems.

To achieve the above objectives, the present invention adopts the following technical solutions:
The present invention provides a data stream reconstruction method, which includes: acquiring characteristic information of a target neural network layer; determining, according to the characteristic information, the data flow mode, the functional configuration of the processing units and the functional configuration of the system-on-chip corresponding to the target neural network layer; applying those functional configurations to the reusable processing units and system-on-chip, and performing the network configuration corresponding to the target neural network layer according to its data flow mode, so as to construct the target neural network layer; and using the constructed target neural network layer to obtain the output result.
Preferably, when the target neural network layer is a convolutional layer, the processing units comprise multiply-accumulate (MAC) units and rectified linear units grouped into multiple threads; the data stream is input and output serially within each thread, with the threads operating in parallel; the static memory of the system-on-chip is configured to buffer the input-feature-map activations of each thread; weights and activations are shared among the threads; and the serial output of each thread is output-buffered and then emitted in parallel.

Preferably, when the target neural network layer is a pooling layer, the processing units are configured as comparators, and the data stream is input and output in parallel.

Preferably, when the target neural network layer is a fully connected layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads; the data stream is input and output serially within each thread, with the threads operating in parallel; the static memory of the system-on-chip is configured as a weight buffer; and the activations are streamed serially through the threads.

Preferably, when the target neural network layer is a residual layer, the processing units are configured as adders; the data stream is input and output in parallel; and the input and output shift registers of the system-on-chip store the operands.

Preferably, when the target neural network layer is a long short-term memory (LSTM) layer, the processing units are divided into four groups, each group instantiating the sigmoid and tanh functions, and the data stream is input and output serially.

Preferably, when the target neural network layer is a reinforcement learning layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads; the data stream is input and output serially within each thread, with the threads operating in parallel; and the system-on-chip buffers are used for state activations and iterative operations.
The present invention further provides a reconfigurable data stream processor for executing the data stream reconstruction method described above. The reconfigurable data stream processor comprises a system-on-chip, hardware threads and multiple groups of processing units, wherein the system-on-chip controls each group of processing units to cooperate with its corresponding hardware thread, adjusts them to match the functional configuration of the target neural network layer, and constructs the target neural network layer.
Preferably, the system-on-chip comprises an execution controller, a direct memory access controller, execution threads and a buffer. The execution controller fetches the network instructions of the target neural network layer from an external off-chip memory, places them in static memory, and decodes and analyzes them one by one to drive the execution threads; the direct memory access controller controls reads and writes between the system-on-chip and the off-chip memory; the execution threads run under the control of the execution controller to realize the function of the target neural network layer; and the buffer is a static memory pool composed of a plurality of the static memories.

Preferably, the hardware thread comprises a core state machine and shift registers; the core state machine controls the data input/output, activation distribution and weight distribution of the processing units on the same thread, and the shift registers construct the activation inputs and outputs.

Through dynamic functional changes of the computing, storage and data-movement units, the data stream reconstruction method and reconfigurable data stream processor provided by the present invention support the operators of different neural networks and reuse hardware resources on a large scale, adapting to a variety of neural networks, especially new hybrid neural networks, and achieving higher hardware utilization, higher computing speed and lower power consumption.
Description of the drawings
Figure 1 is an exemplary structure diagram of a hybrid neural network structure;

Figure 2 is a flowchart of the data stream reconstruction method provided by an embodiment of the present invention;

Figure 3 is a schematic structural diagram of a convolutional layer as the target neural network layer;

Figure 4 is a schematic structural diagram of a pooling layer as the target neural network layer;

Figure 5 is a schematic structural diagram of a fully connected layer as the target neural network layer;

Figure 6 is a schematic structural diagram of a residual layer as the target neural network layer;

Figure 7 is a schematic structural diagram of a long short-term memory layer as the target neural network layer;

Figure 8 is a schematic structural diagram of a reinforcement learning layer as the target neural network layer;

Figure 9 is a schematic structural diagram of the reconfigurable data stream processor provided by an embodiment of the present invention;

Figure 10 is a comparison of Q-iteration time between the architecture designed in the verification experiment of Embodiment 2 and the host.
Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. Examples of these preferred embodiments are illustrated in the drawings. The embodiments of the present invention shown in and described with reference to the drawings are merely exemplary, and the present invention is not limited to them.

It should also be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the structures and/or processing steps closely related to the solution of the present invention, and omit other details of little relevance.
Embodiment 1
Figure 1 depicts the important network layers in a hybrid neural network structure and the ways they interconnect, forming an end-to-end network aimed at perception and control. For image input in particular, cascaded convolution and pooling layers serve as the perception module for visual feature extraction. Networks such as Yolo-v3 and Resnet-50 can reach dozens of layers and are used to imitate the human visual system. For sequence-related applications such as video context understanding and language processing, the feature sequence extracted over time is fed to an LSTM (long short-term memory) layer, which then extracts sequence-related feature outputs. Unlike the previous layers, the LSTM network is a special recurrent neural network structure built from four basic gates: the input gate (I), the output gate (O), the cell state gate (C) and the forget gate (F). While the I, O and F gates compute the layer output through vector operations, the C gate holds the current layer state and serves as the recurrent input of the next time step.
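For reference, the four gates can be written in the standard textbook form below. The patent does not spell out these equations, so the weight matrices, biases and symbols here are the conventional ones, not taken from the source.

```latex
\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), &
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), &
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, &
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
\]
```

Here c_t is the cell state held by the C gate between time steps, matching its recurrent role described above, and σ and tanh are exactly the two nonlinearities the hardware must provide for this layer.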
The control network layers come after feature extraction. In a deep reinforcement learning network (DQN), the extracted feature parameters are treated as state nodes, and the best decision is selected through action nodes. The method traverses all possible actions in the current state and performs a regression according to the reinforcement learning policy to find the maximum or minimum output value (the Q value). Since the action nodes must be iterated, all computations in the subsequent layers must be iterated as well, which is indicated by a dashed box. The multi-layer perceptron uses the most common fully connected layers. The shortcut pattern shown there is also often used in residual networks, improving classification and regression accuracy by feeding key elements of earlier layers into the current input.
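The regression over actions described above corresponds to the standard Q-learning iteration; the form below is the common textbook version with learning rate α and discount factor γ, shown only to make the Q-value search concrete (it is not quoted from the patent):

```latex
\[
a^{*} = \arg\max_{a} Q(s,a), \qquad
Q(s,a) \leftarrow Q(s,a) + \alpha\Bigl[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\Bigr].
\]
```

Each evaluation of the max over a' requires one forward pass per candidate action, which is the iteration over action nodes that the dashed box in Figure 1 denotes.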
In artificial neural networks, the layers differ not only in network structure but also in operands, operators and nonlinear functions. Across different layers the data flow attributes are reconfigurable according to the network structure, so the characteristics of the various layers, such as their data flow access patterns and the functions demanded of the computing resources, can be analyzed, and the common points among these characteristics can be used to find the resources that are reusable when constructing a hybrid neural network structure: for example the processing elements (PEs), the data input/output, the static memory (SRAM) used as buffers, and the interface to the off-chip DRAM. Reconstructing the data flow along these lines solves the existing technical problems. Table 1 summarizes the characteristic information of several standard neural network layers.

Table 1. Characteristic information of multiple standard neural network layers
[Table 1 is reproduced only as images in the original publication (PCTCN2020127250-appb-000001 and -000002); its contents cannot be recovered from this text.]
As the table shows, the pooling and shortcut layers perform vector operations, while the other kernels operate on matrices; the convolution process involves sparse matrices and the remaining layers dense matrices. Different activation functions are used in the different layers: for the nonlinearity, the LSTM network uses both the sigmoid and the tangent, while the remaining matrix kernels use the ReLU function or the sigmoid.

The network data in the convolutional and fully connected layers must be shared among the nodes of the output feature map. The LSTM layer uses a similar serial stream, except that its activation stream must be shared among multiple gates. The state-action layer must generate data streams quickly, driven by the iteration over action nodes. The pooling and residual layers, which operate on vectors, do not need to share activations across the feature map, so their vector-type activations can be transmitted in parallel.

In addition, analyzing the roles of the intermediate data in the various layers shows that, owing to data sparsity, the convolutional and pooling layers are dominated by activations, whereas the FC (fully connected) and LSTM layers are dominated by weights. In the residual layer, a pointer to the activations of an earlier layer must be kept so that the network can process the previous data.
Based on the above analysis, and as shown in Figure 2, the present invention provides a data stream reconstruction method comprising:

S1. Acquire the characteristic information of the target neural network layer.

S2. According to the characteristic information, determine the data flow mode, the functional configuration of the processing units and the functional configuration of the system-on-chip (SoC) corresponding to the target neural network layer (a configuration sketch follows the steps below).

S3. Apply these functional configurations to the reusable processing units and system-on-chip, and perform the network configuration corresponding to the target neural network layer according to its data flow mode, thereby constructing the target neural network layer.

S4. Use the constructed target neural network layer to obtain the output result.
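As a minimal sketch of step S2, the combinational decoder below maps a layer type onto a PE opcode, a dataflow mode and an SRAM role, following Table 1 and Figures 3 to 8. The 3-bit layer encoding and the field widths are invented for illustration; the patent does not disclose its instruction format.

```verilog
// Hypothetical decoder for step S2: layer characteristics in, resource
// configuration out. Encodings are illustrative, not from the patent.
module layer_config_decoder (
  input      [2:0] layer_type,   // assumed encoding, see localparams below
  output reg [1:0] pe_opcode,    // 00 MAC+ReLU, 01 compare, 10 add, 11 sigmoid/tanh
  output reg       io_parallel,  // 1: parallel vector I/O, 0: thread-level serial
  output reg [1:0] sram_role     // 00 act buf, 01 weight buf, 10 state buf, 11 bypass
);
  localparam CONV = 3'd0, POOL = 3'd1, FC = 3'd2,
             RES  = 3'd3, LSTM = 3'd4, RL = 3'd5;
  always @* begin
    case (layer_type)
      CONV:    begin pe_opcode = 2'b00; io_parallel = 1'b0; sram_role = 2'b00; end
      POOL:    begin pe_opcode = 2'b01; io_parallel = 1'b1; sram_role = 2'b11; end
      FC:      begin pe_opcode = 2'b00; io_parallel = 1'b0; sram_role = 2'b01; end
      RES:     begin pe_opcode = 2'b10; io_parallel = 1'b1; sram_role = 2'b11; end
      LSTM:    begin pe_opcode = 2'b11; io_parallel = 1'b0; sram_role = 2'b10; end
      RL:      begin pe_opcode = 2'b00; io_parallel = 1'b0; sram_role = 2'b10; end
      default: begin pe_opcode = 2'b00; io_parallel = 1'b0; sram_role = 2'b00; end
    endcase
  end
endmodule
```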
The above data stream reconstruction method can dynamically change the functional configuration of resources such as the computing units, storage units and data-movement units for each neural network layer, and reuse hardware on a large scale to realize layers with different functions. For a hybrid neural network structure composed of multiple layers it improves hardware utilization, increases computing speed and reduces power consumption, and it provides a resource-reuse basis for constructing other new neural network layers in later research. Compared with prior-art data flow reuse schemes such as weight-stationary, output-stationary and row-stationary, which target only fine-grained data reuse within standard convolution operators, it is more general and achieves better results.

Taking the important neural network layers as examples, the data flow management and resource sharing methods are explained below with reference to Figures 3 to 8 (in which dotted lines indicate resources that need not be reused):
Referring to Figure 3, when the target neural network layer is a convolutional layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads, where each thread processes data with the same rows and columns across the channels of the output feature map. The data stream is input and output serially within each thread, with the threads operating in parallel. The static memory of the system-on-chip is configured to buffer the input-feature-map activations of each thread; weights and activations are shared among the threads, with the activations streamed serially from a single buffer so that they are shared between the processing units. The serial output of each thread is output-buffered and then emitted in parallel through the serializer/deserializer (SERDES) and the DRAM controller.
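A behavioural sketch of one such thread follows, with illustrative bit width and PE count (the patent fixes neither in this passage): one activation stream is broadcast serially to every PE of the thread, each PE consumes its own weight stream, and a flush pulse at the end of each output element applies ReLU and truncates the accumulator.

```verilog
// Sketch of a convolution-mode thread: shared serial activations,
// per-PE weights, MAC accumulation, ReLU on flush. Widths are illustrative.
module conv_thread #(parameter W = 16, PES = 4) (
  input                  clk,
  input                  rst,
  input                  act_valid,
  input  signed [W-1:0]  act_in,      // serial activation stream shared by all PEs
  input  [PES*W-1:0]     weight_bus,  // one weight per PE per cycle
  input                  flush,       // marks the end of one output element
  output reg [PES*W-1:0] relu_bus     // per-PE results, later serialized out
);
  reg signed [2*W-1:0] acc [0:PES-1];
  reg signed [2*W-1:0] t;
  integer i;
  always @(posedge clk) begin
    for (i = 0; i < PES; i = i + 1) begin
      if (rst) begin
        acc[i] <= 0;
      end else begin
        t = acc[i];  // temp: Verilog-2001 cannot bit-select a memory word directly
        if (act_valid)
          acc[i] <= t + act_in * $signed(weight_bus[i*W +: W]);
        if (flush) begin
          relu_bus[i*W +: W] <= t[2*W-1] ? {W{1'b0}} : t[W-1:0];  // ReLU, truncate
          acc[i] <= 0;  // overrides the accumulation above on flush cycles
        end
      end
    end
  end
endmodule
```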
As shown in Figure 4, when the target neural network layer is a pooling layer, the processing units are configured as comparators to implement the max and min operators, and the data stream is input and output in parallel. Because the pooling layer operates directly on vectors, the activations fetched from DRAM are fed to the processing unit array without buffering, which greatly reduces dynamic power; the activations are compared over time by modifying the DRAM access addresses.
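Under the same illustrative assumptions, the comparator configuration of a single PE might look as follows; the max/min select line would be driven by the thread-level state machine.

```verilog
// Sketch of a pooling-mode PE: a running signed comparator over
// activations streamed straight from DRAM, with no SRAM buffering.
module pool_pe #(parameter W = 16) (
  input                     clk,
  input                     window_rst,  // start of a new pooling window
  input                     valid,
  input                     sel_max,     // 1: max pooling, 0: min pooling
  input  signed [W-1:0]     act_in,
  output reg signed [W-1:0] result
);
  always @(posedge clk)
    if (window_rst)
      result <= sel_max ? {1'b1, {(W-1){1'b0}}}   // most negative value
                        : {1'b0, {(W-1){1'b1}}};  // most positive value
    else if (valid &&
             (( sel_max && act_in > result) ||
              (!sel_max && act_in < result)))
      result <= act_in;
endmodule
```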
Referring to Figure 5, when the target neural network layer is a fully connected layer, its output and processing-unit configuration resemble those of the convolutional layer: the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads, and the data stream is input and output serially within each thread, with the threads operating in parallel. For this weight-dominated kernel, the static memory of the system-on-chip is configured as a weight buffer, and the activations are streamed serially through the threads.
As shown in Figure 6, when the target neural network layer is a residual layer, which resembles the pooling layer in that the kernel works directly on the parameters, the processing units are configured as adders and the data stream is input and output in parallel. Because two vectors are added, the input and output shift registers of the system-on-chip store the operands; the result is written to the output shift register and then to DRAM in parallel, and a pointer buffer is instantiated to address the two operands in DRAM.
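Under the same illustrative assumptions (the widths and vector length are invented here), the residual configuration reduces each PE to an adder fed from two parallel-loaded operand registers:

```verilog
// Sketch of the residual-layer configuration: element-wise addition of two
// operand vectors held in parallel-loaded registers, written back in parallel.
module residual_add #(parameter W = 16, N = 8) (
  input                clk,
  input                load,     // parallel load of both operand vectors
  input  [N*W-1:0]     opa_bus,  // activations fetched via the shortcut pointer
  input  [N*W-1:0]     opb_bus,  // activations of the current layer
  output reg [N*W-1:0] sum_bus   // result, written to DRAM in parallel
);
  integer i;
  always @(posedge clk)
    if (load)
      for (i = 0; i < N; i = i + 1)
        sum_bus[i*W +: W] <= opa_bus[i*W +: W] + opb_bus[i*W +: W];
endmodule
```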
Referring to Figure 7, when the target neural network layer is a long short-term memory layer, the layer four-way multiplexes the processing units: they are divided into four groups, each group instantiating the sigmoid and tanh functions, with the additive vector operation and the tanh operation performed afterwards. The data stream is input and output serially, and a mixed input mode is adopted both for the activations shared within each gate group and for supplying fast data between the groups. A state-cell buffer is instantiated to retain the intermediate state information.

Referring to Figure 8, when the target neural network layer is a reinforcement learning layer, its input, output and processing-unit configuration resemble those of the fully connected layer, with multiple activation sources including the DRAM used for regular activations. The processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads, the data stream is input and output serially within each thread with the threads operating in parallel, and the system-on-chip buffers are used for state activations and the iterative operations.

Beyond the layers shown above, the approach applies to other new neural network layers as well: it suffices to analyze the characteristic information of the new layer, identify from that information the resources that can be reused, and apply the corresponding configuration. This is of great significance for constructing new hybrid neural network structures in the future.
Embodiment 2

Referring to Figure 9, based on the data stream reconstruction method of Embodiment 1, the present invention further provides a reconfigurable data stream processor for executing that method. The processor adopts a hierarchical design comprising a system-on-chip 1, hardware threads 2 and multiple groups of processing units 3.

The system-on-chip 1 controls each group of processing units 3 to cooperate with its corresponding hardware thread, adjusts them to match the functional configuration of the target neural network layer, and constructs the target neural network layer.
Further, the system-on-chip 1 comprises an execution controller (PCI-e), a direct memory access (DMA) controller, execution threads and a buffer. The execution controller coordinates the processing units 3 and the buffer according to the network instructions: it fetches the network instructions of the target neural network layer from the external off-chip memory 4, places them in static memory, and decodes and analyzes them one by one to drive the execution threads. The execution controller thus provides centralized control, which helps reduce logic overhead and improve performance.
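The controller's fetch-decode-dispatch role can be pictured with the toy state machine below. The 32-bit instruction word, its field layout and the start/done handshake are assumptions made for illustration, since the patent does not publish its instruction set.

```verilog
// Toy sketch of the execution controller: fetch an instruction word,
// decode it into a thread configuration, run, wait for completion.
module exec_ctrl (
  input             clk,
  input             rst,
  input      [31:0] instr,         // network instruction word from SRAM
  input             instr_valid,
  input             thread_done,   // raised by the hardware threads
  output reg        fetch_req,     // request the next instruction
  output reg        thread_start,  // drives the execution threads
  output reg [29:0] thread_cfg     // decoded configuration payload
);
  localparam FETCH = 2'd0, DECODE = 2'd1, RUN = 2'd2;
  reg [1:0] state;
  always @(posedge clk) begin
    if (rst) begin
      state <= FETCH; fetch_req <= 1'b1; thread_start <= 1'b0;
    end else case (state)
      FETCH:  if (instr_valid) begin fetch_req <= 1'b0; state <= DECODE; end
      DECODE: begin
        thread_cfg   <= instr[29:0];  // assumed layout: [31:30] kind, [29:0] payload
        thread_start <= 1'b1;
        state        <= RUN;
      end
      RUN:    if (thread_done) begin
        thread_start <= 1'b0; fetch_req <= 1'b1; state <= FETCH;
      end
      default: state <= FETCH;
    endcase
  end
endmodule
```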
The direct memory access controller controls reads and writes between the system-on-chip 1 and the off-chip DRAM 4 and implements multiple read/write modes between them; it smoothly transfers network configurations, weights, activations and results. The DDR burst mode is used extensively to supply data quickly and reduce DRAM access power, since memory bandwidth would otherwise limit computing throughput. The DMA is therefore configured according to the properties of the algorithm so that the memory bandwidth matches the corresponding data volume; for example, the element size of the data bundles used for PW and DW convolution equals the number of bytes transferred per burst under the given DRAM protocol, so continuous burst reads and writes can be performed without further data buffering.
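As a worked example with assumed figures (the patent names no specific DRAM protocol here): for a DDR interface with a 64-bit data bus and burst length 8,

```latex
\[
\text{bytes per burst}
  = \underbrace{64/8}_{\text{bytes per beat}} \times \underbrace{8}_{\text{beats}}
  = 64,
\]
```

so a PW/DW data bundle of 64 one-byte elements fills exactly one burst and can stream continuously without intermediate buffering.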
The execution threads run under the control of the execution controller to realize the function of the target neural network layer.

The buffer is a static memory pool composed of a plurality of the static memories, each SRAM being 8 KB in size, and different algorithm kernels are configured with different buffering schemes. With the assistance of the execution controller, the SRAMs can be instantiated on the fly into various buffering functions, which are determined by the algorithm kernel.
Further, to facilitate resource sharing of data streams and weights, the hardware thread comprises a core state machine and shift registers. The core state machine controls the data input/output, activation distribution and weight distribution of the processing units on the same thread. The shift registers construct the activation inputs and outputs, enabling data sharing and reducing power through single fan-out and lower load capacitance; they can be dynamically configured in cascaded or parallel form. Because some target neural network layers involve vector computation, the output data stream is bidirectional, in contrast to the single direction of the input data stream, which facilitates vector computation in, for example, the residual-layer kernel. The multiple processing units are coordinated by the thread-level core finite state machine (FSM) and process activations and weights in a pipelined manner. The weights flow in from the static memory pool of the system-on-chip 1, and each processing unit can receive a different weight stream.
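A sketch of such a reconfigurable shift register follows, with illustrative widths: in parallel mode it loads or exposes a whole vector at once (vector kernels such as the residual layer), while in serial mode it behaves as a cascaded shift chain (matrix kernels with single fan-out).

```verilog
// Sketch of a thread-level shift register switchable between parallel
// load/read and serial cascade operation. Widths are illustrative.
module cfg_shift_reg #(parameter W = 16, N = 8) (
  input            clk,
  input            en,
  input            mode_parallel,  // 1: parallel load, 0: serial shift
  input  [N*W-1:0] par_in,
  input  [W-1:0]   ser_in,         // from the previous register in the cascade
  output [N*W-1:0] par_out,
  output [W-1:0]   ser_out         // to the next register in the cascade
);
  reg [N*W-1:0] r;
  always @(posedge clk)
    if (en)
      r <= mode_parallel ? par_in
                         : {r[(N-1)*W-1:0], ser_in};  // shift one word up
  assign par_out = r;
  assign ser_out = r[N*W-1 -: W];                     // top word leaves first
endmodule
```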
Further, to compute kernel-dependent functions efficiently, the processing unit 3 is compactly designed to implement the required operators. Providing data input ports together with a single weight input port facilitates both matrix and vector computation. The Sigmoid and Tangent modules are designed based on linear-approximation techniques. The control input receives opcodes from the thread-level FSM and configures the multiplexers to implement the kernel-related operators.
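A compact processing-unit datapath of this kind might look as follows. The opcode values, the Q8.8 fixed-point format, and the choice of three operators (multiply-accumulate, comparator, adder) are assumptions made for the sketch, and the piecewise-linear sigmoid/tangent stages are omitted for brevity.

```verilog
// One data input, one weight input, one opcode from the thread-level FSM;
// the case statement is the multiplexer that reconfigures the operator.
module pe #(
    parameter W = 16                        // Q8.8 fixed-point word width (assumed)
)(
    input  wire                clk,
    input  wire                clear,       // pulse at the start of a new output element
    input  wire [1:0]          opcode,      // 00: MAC, 01: max, 10: add (assumed encoding)
    input  wire signed [W-1:0] data_in,
    input  wire signed [W-1:0] weight_in,
    output reg  signed [W-1:0] result
);
    wire signed [2*W-1:0] prod = data_in * weight_in;  // full-precision product

    always @(posedge clk) begin
        if (clear)
            result <= {W{1'b0}};
        else case (opcode)
            2'b00: result <= result + prod[W+7:8];        // multiply-accumulate (conv / FC), Q8.8 rescale
            2'b01: result <= (data_in > result) ? data_in // comparator (max pooling)
                                                : result;
            2'b10: result <= data_in + weight_in;         // adder (residual layer)
            default: result <= result;
        endcase
    end
endmodule
```

Driving the opcode from the thread-level FSM rather than storing per-unit state keeps each PE small, which is what makes it practical to replicate many of them per thread.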
The feasibility of the reconfigurable data stream processor provided by the present invention is verified experimentally below. The architecture uses 108 KB of on-chip SRAM and 16 PEs to form one thread. The experiments are implemented in the Verilog HDL language, and the Modelsim simulation tool is used to verify the feasibility and run time of the design. Network performance analysis is carried out on an NVIDIA GTX GPU in MATLAB using its neural network toolbox. Three network structures are used below to analyze the performance of the proposed architecture.
MobileNet is a hybrid-kernel network with standard PW and DW convolutions, pooling, and fully connected layers; its iterative, compact convolution kernels account for 97.91% of the MAC operations. Table 2 shows the execution latency of the proposed design for each MobileNet layer, benchmarked between the multi-threaded and single-threaded architectures using an FPGA prototype with 256 PEs and DRAM support.
Table 2. Performance analysis based on the MobileNet architecture
Figure PCTCN2020127250-appb-000003
Deep reinforcement learning: A typical use of a DQN is maze walking, in which the agent learns to reach a destination by choosing the correct direction at intersections and avoiding obstacles. As shown in Figure 10, reinforcement-learning action spaces of 1, 2, 4, and 6 nodes were tested on 2-layer, 5-layer, and 10-layer networks, with the state space chosen between 128 and 256 nodes. For all tested action spaces, the on-chip Q-iteration time of all three network structures is below 2 ms, and this iteration time increases only slightly with the size of the action space and the size of the network.
Sequence classification: The test example uses sensor data obtained from a body-worn smartphone. An LSTM network is trained on the data to recognize the wearer's activity from a given time series, which represents accelerometer readings in three different directions. The simulation results in Table 3, which exclude data transfers between disk storage and DRAM, show that the proposed LSTM network design achieves improved performance compared with both the CPU and the GPU. The MATLAB measurements, however, include the large latency of data transfers among disk, main memory, and the operating system, whereas the present design is currently set up as a standalone system. Future LSTM networks nevertheless tend to be deployed on sensors and to fetch data directly from DRAM for processing, which is close to the design principle of the present invention. The power consumption of the CPU and GPU is three orders of magnitude higher than that of the proposed design, demonstrating the efficiency of the ASIC hybrid neural network and confirming the feasibility of the present invention.
Table 3. Performance benchmark analysis of the LSTM network on three processing architectures
Figure PCTCN2020127250-appb-000004
In summary, the data stream reconstruction method and the reconfigurable data stream processor provided by the present invention systematically reuse hardware through dynamic configuration, by adjusting the data stream mode and the functional modes of the processing units and the on-chip storage. For hybrid neural networks this improves hardware utilization, increases computing speed, and reduces power consumption, and it provides a resource-reuse foundation for subsequent research on constructing other new types of neural network layers and on implementing hybrid neural networks based on those layers.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
The above are only specific embodiments of the present application. It should be pointed out that a person of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also fall within the scope of protection of the present application.

Claims (10)

  1. A data stream reconstruction method, characterized by comprising:
    obtaining characteristic information of a target neural network layer;
    determining, according to the characteristic information of the target neural network layer, a data stream mode, a functional configuration of processing units, and a functional configuration of a system-on-chip corresponding to the target neural network layer;
    applying to reusable processing units and a system-on-chip the functional configurations of the processing units and the system-on-chip corresponding to the target neural network layer, and performing the network configuration corresponding to the target neural network layer according to its data stream mode, thereby constructing the target neural network layer; and
    obtaining an output result using the constructed target neural network layer.
  2. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a convolutional layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into a plurality of threads; the input or output of the data stream is thread-level parallel serial transmission; the static memory of the system-on-chip is configured to buffer the activations of the input feature maps on the threads; weights and activations are shared among the plurality of threads; and the serial output of each thread is output in parallel after output buffering.
  3. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a pooling layer, the processing units are configured as comparators, and the input or output of the data stream is parallel transmission.
  4. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a fully connected layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into a plurality of threads; the input or output of the data stream is thread-level parallel serial transmission; the static memory of the system-on-chip is configured as a weight buffer; and the activations are serially streamed through the plurality of threads.
  5. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a residual layer, the processing units are configured as adders; the input or output of the data stream is parallel transmission; and the input and output shift registers of the system-on-chip are used to store operands.
  6. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a long short-term memory layer, the processing units are divided into four groups, each group of processing units instantiating a sigmoid function and a tanh function, and the input or output of the data stream is serial transmission.
  7. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a reinforcement learning layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into a plurality of threads; the input or output of the data stream is thread-level parallel serial transmission; and the cache of the system-on-chip is used for state activation and iterative operations.
  8. A reconfigurable data stream processor, characterized in that it is configured to execute the data stream reconstruction method according to any one of claims 1 to 7, the reconfigurable data stream processor comprising a system-on-chip, hardware threads, and a plurality of groups of processing units,
    wherein the system-on-chip is configured to control each group of processing units to cooperate with the corresponding hardware thread and to be adjusted to the functional configuration matching the target neural network layer, so as to construct the target neural network layer.
  9. The reconfigurable data stream processor according to claim 8, wherein the system-on-chip comprises an execution controller, a direct memory access controller, execution threads, and a buffer,
    the execution controller being configured to fetch the network instructions of the target neural network layer from an external off-chip memory, load the network instructions into static memory, and decode and analyze the network instructions one by one to drive the execution threads;
    the direct memory access controller being configured to control reads and writes between the system-on-chip and the off-chip memory;
    the execution threads being configured to run under the control of the execution controller to implement the functions of the target neural network layer; and
    the buffer comprising a static memory pool composed of a plurality of the static memories.
  10. The reconfigurable data stream processor according to claim 8 or 9, wherein the hardware thread comprises a core state machine and shift registers, the core state machine being configured to control the data input/output, activation-function allocation, and weight allocation of the processing units on the same thread, and the shift registers being configured to construct the inputs and outputs of the activation functions.
PCT/CN2020/127250 2019-11-08 2020-11-06 Data stream reconstruction method and reconstructable data stream processor WO2021089009A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911087000.9 2019-11-08
CN201911087000.9A CN111105023B (en) 2019-11-08 2019-11-08 Data stream reconstruction method and reconfigurable data stream processor

Publications (1)

Publication Number Publication Date
WO2021089009A1 true WO2021089009A1 (en) 2021-05-14

Family

ID=70420571

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/127250 WO2021089009A1 (en) 2019-11-08 2020-11-06 Data stream reconstruction method and reconstructable data stream processor

Country Status (2)

Country Link
CN (1) CN111105023B (en)
WO (1) WO2021089009A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105023B (en) * 2019-11-08 2023-03-31 深圳市中科元物芯科技有限公司 Data stream reconstruction method and reconfigurable data stream processor
CN111783971B (en) * 2020-07-02 2024-04-09 上海赛昉科技有限公司 Highly flexibly configurable data post-processor for deep neural network
CN112560173B (en) * 2020-12-08 2021-08-17 北京京航计算通讯研究所 Vehicle weather resistance temperature prediction method and device based on deep learning
CN112540950B (en) * 2020-12-18 2023-03-28 清华大学 Reconfigurable processor based on configuration information shared storage and shared storage method thereof
CN113240084B (en) * 2021-05-11 2024-02-02 北京搜狗科技发展有限公司 Data processing method and device, electronic equipment and readable medium
CN116702852B (en) * 2023-08-02 2023-10-20 电子科技大学 Dynamic reconfiguration neural network acceleration circuit and system based on multistage event driving

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218345A (en) * 2013-03-15 2013-07-24 上海安路信息科技有限公司 Dynamic reconfigurable system adaptable to plurality of dataflow computation modes and operating method
CN203204615U (en) * 2013-03-15 2013-09-18 上海安路信息科技有限公司 Dynamic reconfigurable system adaptable to various data flow calculation modes
US9390369B1 (en) * 2011-09-21 2016-07-12 Brain Corporation Multithreaded apparatus and methods for implementing parallel networks
CN107783840A (en) * 2017-10-27 2018-03-09 福州瑞芯微电子股份有限公司 A kind of Distributed-tier deep learning resource allocation methods and device
CN109409510A (en) * 2018-09-14 2019-03-01 中国科学院深圳先进技术研究院 Neuron circuit, chip, system and method, storage medium
CN111105023A (en) * 2019-11-08 2020-05-05 中国科学院深圳先进技术研究院 Data stream reconstruction method and reconfigurable data stream processor

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190114548A1 (en) * 2017-10-17 2019-04-18 Xilinx, Inc. Static block scheduling in massively parallel software defined hardware systems
US11636327B2 (en) * 2017-12-29 2023-04-25 Intel Corporation Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism
CN109472356A (en) * 2018-12-29 2019-03-15 南京宁麒智能计算芯片研究院有限公司 A kind of accelerator and method of restructural neural network algorithm

Also Published As

Publication number Publication date
CN111105023B (en) 2023-03-31
CN111105023A (en) 2020-05-05


Legal Events

121  Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20885647; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: DE)
122  Ep: pct application non-entry in european phase (Ref document number: 20885647; Country of ref document: EP; Kind code of ref document: A1)