WO2021089009A1 - Data stream reconstruction method and reconstructable data stream processor


Info

Publication number
WO2021089009A1
WO2021089009A1 · PCT/CN2020/127250 · CN2020127250W
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
data stream
target neural network layer
layer
Application number
PCT/CN2020/127250
Other languages
French (fr)
Chinese (zh)
Inventor
王峥
周丽冰
陈伟光
谢文婷
粟金源
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2021089009A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation using electronic means
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

A data stream reconstruction method and a reconstructable data stream processor. According to the different neural network layers, the functional configuration of resources such as the computing units, the storage units and the data-movement units is changed dynamically, and neural network layers with different functions are realized by reusing hardware on a large scale. For a hybrid neural network structure formed from a plurality of neural network layers, this improves hardware utilization, increases operation speed and reduces power consumption.

Description

Data stream reconstruction method and reconfigurable data stream processor

Technical field
The present invention relates to the technical field of neural-network data flow, and in particular to a data stream reconstruction method and a reconfigurable data stream processor.
Background
Neural networks are widely used in computer vision, natural language processing, game engines and other fields. As neural network structures evolve rapidly, the computing-power demands of their different data streams keep increasing, so hybrid neural networks are the general trend: their compact algorithm kernels can support end-to-end tasks in perception, control and even driving. Meanwhile, dedicated hardware accelerator architectures such as Eyeriss, Google TPU-I and DaDianNao have been proposed to accelerate the inference stage of neural networks. Through algorithm-architecture co-design techniques such as dedicated data flows and systolic-array multipliers, they achieve high performance and high resource utilization, but these architectures are tightly coupled to particular neural networks and cannot accelerate different ones. A corresponding data flow scheme therefore has to be designed for each neural network, and the underlying data stream reconstruction method is the design focus of hybrid artificial neural networks.
In the prior art there is no scheme that reuses resources through data stream reconstruction for hybrid neural network structures composed of different neural network layers, such as pooling layers, fully connected layers, recurrent LSTM layers, deep reinforcement learning layers and residual layers. As a result, prior-art solutions often suffer from high hardware cost, complex structure, slow operation and high power consumption.
Summary of the invention
In view of this, the purpose of the present invention is to provide a data stream reconstruction method and a reconfigurable data stream processor that solve the above problems.

To achieve the above objectives, the present invention adopts the following technical solutions:
The present invention provides a data stream reconstruction method, which includes: acquiring characteristic information of a target neural network layer; determining, according to the characteristic information, the data flow mode, the functional configuration of the processing units and the functional configuration of the system-on-chip corresponding to the target neural network layer; applying those functional configurations to the reusable processing units and system-on-chip, and performing the network configuration corresponding to the target neural network layer according to its data flow mode, so as to construct the target neural network layer; and using the constructed target neural network layer to obtain the output result.
Preferably, when the target neural network layer is a convolutional layer, the processing units comprise multiply-accumulate (MAC) units and rectified linear units grouped into multiple threads; the data stream is input and output serially within each thread, with the threads operating in parallel; the static memory of the system-on-chip is configured to buffer the input-feature-map activations of each thread; weights and activations are shared among the threads; and the serial output of each thread is output-buffered and then emitted in parallel.

Preferably, when the target neural network layer is a pooling layer, the processing units are configured as comparators, and the data stream is input and output in parallel.

Preferably, when the target neural network layer is a fully connected layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads; the data stream is input and output serially within each thread, with the threads operating in parallel; the static memory of the system-on-chip is configured as a weight buffer; and the activations are streamed serially through the threads.

Preferably, when the target neural network layer is a residual layer, the processing units are configured as adders; the data stream is input and output in parallel; and the input and output shift registers of the system-on-chip store the operands.

Preferably, when the target neural network layer is a long short-term memory (LSTM) layer, the processing units are divided into four groups, each group instantiating the sigmoid and tanh functions, and the data stream is input and output serially.

Preferably, when the target neural network layer is a reinforcement learning layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads; the data stream is input and output serially within each thread, with the threads operating in parallel; and the system-on-chip buffers are used for state activations and iterative operations.
The present invention further provides a reconfigurable data stream processor for executing the data stream reconstruction method described above. The reconfigurable data stream processor comprises a system-on-chip, hardware threads and multiple groups of processing units, wherein the system-on-chip controls each group of processing units to cooperate with its corresponding hardware thread, adjusts them to match the functional configuration of the target neural network layer, and constructs the target neural network layer.
Preferably, the system-on-chip comprises an execution controller, a direct memory access controller, execution threads and a buffer. The execution controller fetches the network instructions of the target neural network layer from an external off-chip memory, places them in static memory, and decodes and analyzes them one by one to drive the execution threads; the direct memory access controller controls reads and writes between the system-on-chip and the off-chip memory; the execution threads run under the control of the execution controller to realize the function of the target neural network layer; and the buffer is a static memory pool composed of a plurality of the static memories.

Preferably, the hardware thread comprises a core state machine and shift registers; the core state machine controls the data input/output, activation distribution and weight distribution of the processing units on the same thread, and the shift registers construct the activation inputs and outputs.

Through dynamic functional changes of the computing, storage and data-movement units, the data stream reconstruction method and reconfigurable data stream processor provided by the present invention support the operators of different neural networks and reuse hardware resources on a large scale, adapting to a variety of neural networks, especially new hybrid neural networks, and achieving higher hardware utilization, higher computing speed and lower power consumption.
Description of the drawings
Figure 1 is an exemplary structure diagram of a hybrid neural network structure;

Figure 2 is a flowchart of the data stream reconstruction method provided by an embodiment of the present invention;

Figure 3 is a schematic structural diagram of a convolutional layer as the target neural network layer;

Figure 4 is a schematic structural diagram of a pooling layer as the target neural network layer;

Figure 5 is a schematic structural diagram of a fully connected layer as the target neural network layer;

Figure 6 is a schematic structural diagram of a residual layer as the target neural network layer;

Figure 7 is a schematic structural diagram of a long short-term memory layer as the target neural network layer;

Figure 8 is a schematic structural diagram of a reinforcement learning layer as the target neural network layer;

Figure 9 is a schematic structural diagram of the reconfigurable data stream processor provided by an embodiment of the present invention;

Figure 10 is a comparison of Q-iteration time between the architecture designed in the verification experiment of Embodiment 2 and the host.
Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. Examples of these preferred embodiments are illustrated in the drawings. The embodiments of the present invention shown in and described with reference to the drawings are merely exemplary, and the present invention is not limited to them.

It should also be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the structures and/or processing steps closely related to the solution of the present invention, and omit other details of little relevance.
Embodiment 1
Figure 1 depicts the important network layers in a hybrid neural network structure and the ways they interconnect, forming an end-to-end network aimed at perception and control. For image input in particular, cascaded convolution and pooling layers serve as the perception module for visual feature extraction. Networks such as Yolo-v3 and Resnet-50 can reach dozens of layers and are used to imitate the human visual system. For sequence-related applications such as video context understanding and language processing, the feature sequence extracted over time is fed to an LSTM (long short-term memory) layer, which then extracts sequence-related feature outputs. Unlike the previous layers, the LSTM network is a special recurrent neural network structure built from four basic gates: the input gate (I), the output gate (O), the cell state gate (C) and the forget gate (F). While the I, O and F gates compute the layer output through vector operations, the C gate holds the current layer state and serves as the recurrent input of the next time step.
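For reference, the four gates can be written in the standard textbook form below. The patent does not spell out these equations, so the weight matrices, biases and symbols here are the conventional ones, not taken from the source.

```latex
\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), &
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), &
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, &
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
\]
```

Here c_t is the cell state held by the C gate between time steps, matching its recurrent role described above, and σ and tanh are exactly the two nonlinearities the hardware must provide for this layer.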
The control network layers come after feature extraction. In a deep reinforcement learning network (DQN), the extracted feature parameters are treated as state nodes, and the best decision is selected through action nodes. The method traverses all possible actions in the current state and performs a regression according to the reinforcement learning policy to find the maximum or minimum output value (the Q value). Since the action nodes must be iterated, all computations in the subsequent layers must be iterated as well, which is indicated by a dashed box. The multi-layer perceptron uses the most common fully connected layers. The shortcut pattern shown there is also often used in residual networks, improving classification and regression accuracy by feeding key elements of earlier layers into the current input.
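The regression over actions described above corresponds to the standard Q-learning iteration; the form below is the common textbook version with learning rate α and discount factor γ, shown only to make the Q-value search concrete (it is not quoted from the patent):

```latex
\[
a^{*} = \arg\max_{a} Q(s,a), \qquad
Q(s,a) \leftarrow Q(s,a) + \alpha\Bigl[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\Bigr].
\]
```

Each evaluation of the max over a' requires one forward pass per candidate action, which is the iteration over action nodes that the dashed box in Figure 1 denotes.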
In artificial neural networks, the layers differ not only in network structure but also in operands, operators and nonlinear functions. Across different layers the data flow attributes are reconfigurable according to the network structure, so the characteristics of the various layers, such as their data flow access patterns and the functions demanded of the computing resources, can be analyzed, and the common points among these characteristics can be used to find the resources that are reusable when constructing a hybrid neural network structure: for example the processing elements (PEs), the data input/output, the static memory (SRAM) used as buffers, and the interface to the off-chip DRAM. Reconstructing the data flow along these lines solves the existing technical problems. Table 1 summarizes the characteristic information of several standard neural network layers.

Table 1. Characteristic information of multiple standard neural network layers
[Table 1 is reproduced only as images in the original publication (PCTCN2020127250-appb-000001 and -000002); its contents cannot be recovered from this text.]
As the table shows, the pooling and shortcut layers perform vector operations, while the other kernels operate on matrices; the convolution process involves sparse matrices and the remaining layers dense matrices. Different activation functions are used in the different layers: for the nonlinearity, the LSTM network uses both the sigmoid and the tangent, while the remaining matrix kernels use the ReLU function or the sigmoid.

The network data in the convolutional and fully connected layers must be shared among the nodes of the output feature map. The LSTM layer uses a similar serial stream, except that its activation stream must be shared among multiple gates. The state-action layer must generate data streams quickly, driven by the iteration over action nodes. The pooling and residual layers, which operate on vectors, do not need to share activations across the feature map, so their vector-type activations can be transmitted in parallel.

In addition, analyzing the roles of the intermediate data in the various layers shows that, owing to data sparsity, the convolutional and pooling layers are dominated by activations, whereas the FC (fully connected) and LSTM layers are dominated by weights. In the residual layer, a pointer to the activations of an earlier layer must be kept so that the network can process the previous data.
Based on the above analysis, and as shown in Figure 2, the present invention provides a data stream reconstruction method comprising:

S1. Acquire the characteristic information of the target neural network layer.

S2. According to the characteristic information, determine the data flow mode, the functional configuration of the processing units and the functional configuration of the system-on-chip (SoC) corresponding to the target neural network layer (a configuration sketch follows the steps below).

S3. Apply these functional configurations to the reusable processing units and system-on-chip, and perform the network configuration corresponding to the target neural network layer according to its data flow mode, thereby constructing the target neural network layer.

S4. Use the constructed target neural network layer to obtain the output result.
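As a minimal sketch of step S2, the combinational decoder below maps a layer type onto a PE opcode, a dataflow mode and an SRAM role, following Table 1 and Figures 3 to 8. The 3-bit layer encoding and the field widths are invented for illustration; the patent does not disclose its instruction format.

```verilog
// Hypothetical decoder for step S2: layer characteristics in, resource
// configuration out. Encodings are illustrative, not from the patent.
module layer_config_decoder (
  input      [2:0] layer_type,   // assumed encoding, see localparams below
  output reg [1:0] pe_opcode,    // 00 MAC+ReLU, 01 compare, 10 add, 11 sigmoid/tanh
  output reg       io_parallel,  // 1: parallel vector I/O, 0: thread-level serial
  output reg [1:0] sram_role     // 00 act buf, 01 weight buf, 10 state buf, 11 bypass
);
  localparam CONV = 3'd0, POOL = 3'd1, FC = 3'd2,
             RES  = 3'd3, LSTM = 3'd4, RL = 3'd5;
  always @* begin
    case (layer_type)
      CONV:    begin pe_opcode = 2'b00; io_parallel = 1'b0; sram_role = 2'b00; end
      POOL:    begin pe_opcode = 2'b01; io_parallel = 1'b1; sram_role = 2'b11; end
      FC:      begin pe_opcode = 2'b00; io_parallel = 1'b0; sram_role = 2'b01; end
      RES:     begin pe_opcode = 2'b10; io_parallel = 1'b1; sram_role = 2'b11; end
      LSTM:    begin pe_opcode = 2'b11; io_parallel = 1'b0; sram_role = 2'b10; end
      RL:      begin pe_opcode = 2'b00; io_parallel = 1'b0; sram_role = 2'b10; end
      default: begin pe_opcode = 2'b00; io_parallel = 1'b0; sram_role = 2'b00; end
    endcase
  end
endmodule
```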
The above data stream reconstruction method can dynamically change the functional configuration of resources such as the computing units, storage units and data-movement units for each neural network layer, and reuse hardware on a large scale to realize layers with different functions. For a hybrid neural network structure composed of multiple layers it improves hardware utilization, increases computing speed and reduces power consumption, and it provides a resource-reuse basis for constructing other new neural network layers in later research. Compared with prior-art data flow reuse schemes such as weight-stationary, output-stationary and row-stationary, which target only fine-grained data reuse within standard convolution operators, it is more general and achieves better results.

Taking the important neural network layers as examples, the data flow management and resource sharing methods are explained below with reference to Figures 3 to 8 (in which dotted lines indicate resources that need not be reused):
Referring to Figure 3, when the target neural network layer is a convolutional layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads, where each thread processes data with the same rows and columns across the channels of the output feature map. The data stream is input and output serially within each thread, with the threads operating in parallel. The static memory of the system-on-chip is configured to buffer the input-feature-map activations of each thread; weights and activations are shared among the threads, with the activations streamed serially from a single buffer so that they are shared between the processing units. The serial output of each thread is output-buffered and then emitted in parallel through the serializer/deserializer (SERDES) and the DRAM controller.
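A behavioural sketch of one such thread follows, with illustrative bit width and PE count (the patent fixes neither in this passage): one activation stream is broadcast serially to every PE of the thread, each PE consumes its own weight stream, and a flush pulse at the end of each output element applies ReLU and truncates the accumulator.

```verilog
// Sketch of a convolution-mode thread: shared serial activations,
// per-PE weights, MAC accumulation, ReLU on flush. Widths are illustrative.
module conv_thread #(parameter W = 16, PES = 4) (
  input                  clk,
  input                  rst,
  input                  act_valid,
  input  signed [W-1:0]  act_in,      // serial activation stream shared by all PEs
  input  [PES*W-1:0]     weight_bus,  // one weight per PE per cycle
  input                  flush,       // marks the end of one output element
  output reg [PES*W-1:0] relu_bus     // per-PE results, later serialized out
);
  reg signed [2*W-1:0] acc [0:PES-1];
  reg signed [2*W-1:0] t;
  integer i;
  always @(posedge clk) begin
    for (i = 0; i < PES; i = i + 1) begin
      if (rst) begin
        acc[i] <= 0;
      end else begin
        t = acc[i];  // temp: Verilog-2001 cannot bit-select a memory word directly
        if (act_valid)
          acc[i] <= t + act_in * $signed(weight_bus[i*W +: W]);
        if (flush) begin
          relu_bus[i*W +: W] <= t[2*W-1] ? {W{1'b0}} : t[W-1:0];  // ReLU, truncate
          acc[i] <= 0;  // overrides the accumulation above on flush cycles
        end
      end
    end
  end
endmodule
```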
As shown in Figure 4, when the target neural network layer is a pooling layer, the processing units are configured as comparators to implement the max and min operators, and the data stream is input and output in parallel. Because the pooling layer operates directly on vectors, the activations fetched from DRAM are fed to the processing unit array without buffering, which greatly reduces dynamic power; the activations are compared over time by modifying the DRAM access addresses.
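Under the same illustrative assumptions, the comparator configuration of a single PE might look as follows; the max/min select line would be driven by the thread-level state machine.

```verilog
// Sketch of a pooling-mode PE: a running signed comparator over
// activations streamed straight from DRAM, with no SRAM buffering.
module pool_pe #(parameter W = 16) (
  input                     clk,
  input                     window_rst,  // start of a new pooling window
  input                     valid,
  input                     sel_max,     // 1: max pooling, 0: min pooling
  input  signed [W-1:0]     act_in,
  output reg signed [W-1:0] result
);
  always @(posedge clk)
    if (window_rst)
      result <= sel_max ? {1'b1, {(W-1){1'b0}}}   // most negative value
                        : {1'b0, {(W-1){1'b1}}};  // most positive value
    else if (valid &&
             (( sel_max && act_in > result) ||
              (!sel_max && act_in < result)))
      result <= act_in;
endmodule
```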
Referring to Figure 5, when the target neural network layer is a fully connected layer, its output and processing-unit configuration resemble those of the convolutional layer: the processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads, and the data stream is input and output serially within each thread, with the threads operating in parallel. For this weight-dominated kernel, the static memory of the system-on-chip is configured as a weight buffer, and the activations are streamed serially through the threads.
As shown in Figure 6, when the target neural network layer is a residual layer, which resembles the pooling layer in that the kernel works directly on the parameters, the processing units are configured as adders and the data stream is input and output in parallel. Because two vectors are added, the input and output shift registers of the system-on-chip store the operands; the result is written to the output shift register and then to DRAM in parallel, and a pointer buffer is instantiated to address the two operands in DRAM.
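Under the same illustrative assumptions (the widths and vector length are invented here), the residual configuration reduces each PE to an adder fed from two parallel-loaded operand registers:

```verilog
// Sketch of the residual-layer configuration: element-wise addition of two
// operand vectors held in parallel-loaded registers, written back in parallel.
module residual_add #(parameter W = 16, N = 8) (
  input                clk,
  input                load,     // parallel load of both operand vectors
  input  [N*W-1:0]     opa_bus,  // activations fetched via the shortcut pointer
  input  [N*W-1:0]     opb_bus,  // activations of the current layer
  output reg [N*W-1:0] sum_bus   // result, written to DRAM in parallel
);
  integer i;
  always @(posedge clk)
    if (load)
      for (i = 0; i < N; i = i + 1)
        sum_bus[i*W +: W] <= opa_bus[i*W +: W] + opb_bus[i*W +: W];
endmodule
```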
Referring to Figure 7, when the target neural network layer is a long short-term memory layer, the layer four-way multiplexes the processing units: they are divided into four groups, each group instantiating the sigmoid and tanh functions, with the additive vector operation and the tanh operation performed afterwards. The data stream is input and output serially, and a mixed input mode is adopted both for the activations shared within each gate group and for supplying fast data between the groups. A state-cell buffer is instantiated to retain the intermediate state information.

Referring to Figure 8, when the target neural network layer is a reinforcement learning layer, its input, output and processing-unit configuration resemble those of the fully connected layer, with multiple activation sources including the DRAM used for regular activations. The processing units comprise multiply-accumulate units and rectified linear units grouped into multiple threads, the data stream is input and output serially within each thread with the threads operating in parallel, and the system-on-chip buffers are used for state activations and the iterative operations.

Beyond the layers shown above, the approach applies to other new neural network layers as well: it suffices to analyze the characteristic information of the new layer, identify from that information the resources that can be reused, and apply the corresponding configuration. This is of great significance for constructing new hybrid neural network structures in the future.
Embodiment 2

Referring to Figure 9, based on the data stream reconstruction method of Embodiment 1, the present invention further provides a reconfigurable data stream processor for executing that method. The processor adopts a hierarchical design comprising a system-on-chip 1, hardware threads 2 and multiple groups of processing units 3.

The system-on-chip 1 controls each group of processing units 3 to cooperate with its corresponding hardware thread, adjusts them to match the functional configuration of the target neural network layer, and constructs the target neural network layer.
Further, the system-on-chip 1 comprises an execution controller (PCI-e), a direct memory access (DMA) controller, execution threads and a buffer. The execution controller coordinates the processing units 3 and the buffer according to the network instructions: it fetches the network instructions of the target neural network layer from the external off-chip memory 4, places them in static memory, and decodes and analyzes them one by one to drive the execution threads. The execution controller thus provides centralized control, which helps reduce logic overhead and improve performance.
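The controller's fetch-decode-dispatch role can be pictured with the toy state machine below. The 32-bit instruction word, its field layout and the start/done handshake are assumptions made for illustration, since the patent does not publish its instruction set.

```verilog
// Toy sketch of the execution controller: fetch an instruction word,
// decode it into a thread configuration, run, wait for completion.
module exec_ctrl (
  input             clk,
  input             rst,
  input      [31:0] instr,         // network instruction word from SRAM
  input             instr_valid,
  input             thread_done,   // raised by the hardware threads
  output reg        fetch_req,     // request the next instruction
  output reg        thread_start,  // drives the execution threads
  output reg [29:0] thread_cfg     // decoded configuration payload
);
  localparam FETCH = 2'd0, DECODE = 2'd1, RUN = 2'd2;
  reg [1:0] state;
  always @(posedge clk) begin
    if (rst) begin
      state <= FETCH; fetch_req <= 1'b1; thread_start <= 1'b0;
    end else case (state)
      FETCH:  if (instr_valid) begin fetch_req <= 1'b0; state <= DECODE; end
      DECODE: begin
        thread_cfg   <= instr[29:0];  // assumed layout: [31:30] kind, [29:0] payload
        thread_start <= 1'b1;
        state        <= RUN;
      end
      RUN:    if (thread_done) begin
        thread_start <= 1'b0; fetch_req <= 1'b1; state <= FETCH;
      end
      default: state <= FETCH;
    endcase
  end
endmodule
```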
The direct memory access controller controls reads and writes between the system-on-chip 1 and the off-chip DRAM 4 and implements multiple read/write modes between them; it smoothly transfers network configurations, weights, activations and results. The DDR burst mode is used extensively to supply data quickly and reduce DRAM access power, since memory bandwidth would otherwise limit computing throughput. The DMA is therefore configured according to the properties of the algorithm so that the memory bandwidth matches the corresponding data volume; for example, the element size of the data bundles used for PW and DW convolution equals the number of bytes transferred per burst under the given DRAM protocol, so continuous burst reads and writes can be performed without further data buffering.
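As a worked example with assumed figures (the patent names no specific DRAM protocol here): for a DDR interface with a 64-bit data bus and burst length 8,

```latex
\[
\text{bytes per burst}
  = \underbrace{64/8}_{\text{bytes per beat}} \times \underbrace{8}_{\text{beats}}
  = 64,
\]
```

so a PW/DW data bundle of 64 one-byte elements fills exactly one burst and can stream continuously without intermediate buffering.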
The execution threads run under the control of the execution controller to realize the function of the target neural network layer.

The buffer is a static memory pool composed of a plurality of the static memories, each SRAM being 8 KB in size, and different algorithm kernels are configured with different buffering schemes. With the assistance of the execution controller, the SRAMs can be instantiated on the fly into various buffering functions, which are determined by the algorithm kernel.
Further, to facilitate resource sharing of data streams and weights, the hardware thread comprises a core state machine and shift registers. The core state machine controls the data input/output, activation distribution and weight distribution of the processing units on the same thread. The shift registers construct the activation inputs and outputs, enabling data sharing and reducing power through single fan-out and lower load capacitance; they can be dynamically configured in cascaded or parallel form. Because some target neural network layers involve vector computation, the output data stream is bidirectional, in contrast to the single direction of the input data stream, which facilitates vector computation in, for example, the residual-layer kernel. The multiple processing units are coordinated by the thread-level core finite state machine (FSM) and process activations and weights in a pipelined manner. The weights flow in from the static memory pool of the system-on-chip 1, and each processing unit can receive a different weight stream.
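A sketch of such a reconfigurable shift register follows, with illustrative widths: in parallel mode it loads or exposes a whole vector at once (vector kernels such as the residual layer), while in serial mode it behaves as a cascaded shift chain (matrix kernels with single fan-out).

```verilog
// Sketch of a thread-level shift register switchable between parallel
// load/read and serial cascade operation. Widths are illustrative.
module cfg_shift_reg #(parameter W = 16, N = 8) (
  input            clk,
  input            en,
  input            mode_parallel,  // 1: parallel load, 0: serial shift
  input  [N*W-1:0] par_in,
  input  [W-1:0]   ser_in,         // from the previous register in the cascade
  output [N*W-1:0] par_out,
  output [W-1:0]   ser_out         // to the next register in the cascade
);
  reg [N*W-1:0] r;
  always @(posedge clk)
    if (en)
      r <= mode_parallel ? par_in
                         : {r[(N-1)*W-1:0], ser_in};  // shift one word up
  assign par_out = r;
  assign ser_out = r[N*W-1 -: W];                     // top word leaves first
endmodule
```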
Further, to compute kernel-dependent functions efficiently, the processing unit 3 is compactly designed to implement the required operators. Providing data input ports together with a single weight input port facilitates both matrix and vector computation. The Sigmoid and Tangent modules are designed based on linear-approximation techniques. The control input receives opcodes from the thread-level FSM and configures the multiplexers to implement the kernel-related operators.
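A compact processing-unit datapath of this kind might look as follows. The opcode values, the Q8.8 fixed-point format, and the choice of three operators (multiply-accumulate, comparator, adder) are assumptions made for the sketch, and the piecewise-linear sigmoid/tangent stages are omitted for brevity.

```verilog
// One data input, one weight input, one opcode from the thread-level FSM;
// the case statement is the multiplexer that reconfigures the operator.
module pe #(
    parameter W = 16                        // Q8.8 fixed-point word width (assumed)
)(
    input  wire                clk,
    input  wire                clear,       // pulse at the start of a new output element
    input  wire [1:0]          opcode,      // 00: MAC, 01: max, 10: add (assumed encoding)
    input  wire signed [W-1:0] data_in,
    input  wire signed [W-1:0] weight_in,
    output reg  signed [W-1:0] result
);
    wire signed [2*W-1:0] prod = data_in * weight_in;  // full-precision product

    always @(posedge clk) begin
        if (clear)
            result <= {W{1'b0}};
        else case (opcode)
            2'b00: result <= result + prod[W+7:8];        // multiply-accumulate (conv / FC), Q8.8 rescale
            2'b01: result <= (data_in > result) ? data_in // comparator (max pooling)
                                                : result;
            2'b10: result <= data_in + weight_in;         // adder (residual layer)
            default: result <= result;
        endcase
    end
endmodule
```

Driving the opcode from the thread-level FSM rather than storing per-unit state keeps each PE small, which is what makes it practical to replicate many of them per thread.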
The feasibility of the reconfigurable data stream processor provided by the present invention is verified experimentally below. The architecture uses 108 KB of on-chip SRAM and 16 PEs to form one thread. The experiments are implemented in the Verilog HDL language, and the Modelsim simulation tool is used to verify the feasibility and run time of the design. Network performance analysis is carried out on an NVIDIA GTX GPU in MATLAB using its neural network toolbox. Three network structures are used below to analyze the performance of the proposed architecture.
MobileNet is a hybrid-kernel network with standard PW and DW convolutions, pooling, and fully connected layers; its iterative, compact convolution kernels account for 97.91% of the MAC operations. Table 2 shows the execution latency of the proposed design for each MobileNet layer, benchmarked between the multi-threaded and single-threaded architectures using an FPGA prototype with 256 PEs and DRAM support.
Table 2. Performance analysis based on the MobileNet architecture
Figure PCTCN2020127250-appb-000003
Deep reinforcement learning: A typical use of a DQN is maze walking, in which the agent learns to reach a destination by choosing the correct direction at intersections and avoiding obstacles. As shown in Figure 10, reinforcement-learning action spaces of 1, 2, 4, and 6 nodes were tested on 2-layer, 5-layer, and 10-layer networks, with the state space chosen between 128 and 256 nodes. For all tested action spaces, the on-chip Q-iteration time of all three network structures is below 2 ms, and this iteration time increases only slightly with the size of the action space and the size of the network.
Sequence classification: The test example uses sensor data obtained from a body-worn smartphone. An LSTM network is trained on the data to recognize the wearer's activity from a given time series, which represents accelerometer readings in three different directions. The simulation results in Table 3, which exclude data transfers between disk storage and DRAM, show that the proposed LSTM network design achieves improved performance compared with both the CPU and the GPU. The MATLAB measurements, however, include the large latency of data transfers among disk, main memory, and the operating system, whereas the present design is currently set up as a standalone system. Future LSTM networks nevertheless tend to be deployed on sensors and to fetch data directly from DRAM for processing, which is close to the design principle of the present invention. The power consumption of the CPU and GPU is three orders of magnitude higher than that of the proposed design, demonstrating the efficiency of the ASIC hybrid neural network and confirming the feasibility of the present invention.
Table 3. Performance benchmark analysis of the LSTM network on three processing architectures
Figure PCTCN2020127250-appb-000004
In summary, the data stream reconstruction method and the reconfigurable data stream processor provided by the present invention systematically reuse hardware through dynamic configuration, by adjusting the data stream mode and the functional modes of the processing units and the on-chip storage. For hybrid neural networks this improves hardware utilization, increases computing speed, and reduces power consumption, and it provides a resource-reuse foundation for subsequent research on constructing other new types of neural network layers and on implementing hybrid neural networks based on those layers.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
The above are only specific embodiments of the present application. It should be pointed out that a person of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also fall within the scope of protection of the present application.

Claims (10)

  1. A data stream reconstruction method, characterized by comprising:
    obtaining characteristic information of a target neural network layer;
    determining, according to the characteristic information of the target neural network layer, a data stream mode, a functional configuration of processing units, and a functional configuration of a system-on-chip corresponding to the target neural network layer;
    applying to reusable processing units and a system-on-chip the functional configurations of the processing units and the system-on-chip corresponding to the target neural network layer, and performing the network configuration corresponding to the target neural network layer according to its data stream mode, thereby constructing the target neural network layer; and
    obtaining an output result using the constructed target neural network layer.
  2. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a convolutional layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into a plurality of threads; the input or output of the data stream is thread-level parallel serial transmission; the static memory of the system-on-chip is configured to buffer the activations of the input feature maps on the threads; weights and activations are shared among the plurality of threads; and the serial output of each thread is output in parallel after output buffering.
  3. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a pooling layer, the processing units are configured as comparators, and the input or output of the data stream is parallel transmission.
  4. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a fully connected layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into a plurality of threads; the input or output of the data stream is thread-level parallel serial transmission; the static memory of the system-on-chip is configured as a weight buffer; and the activations are serially streamed through the plurality of threads.
  5. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a residual layer, the processing units are configured as adders; the input or output of the data stream is parallel transmission; and the input and output shift registers of the system-on-chip are used to store operands.
  6. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a long short-term memory layer, the processing units are divided into four groups, each group of processing units instantiating a sigmoid function and a tanh function, and the input or output of the data stream is serial transmission.
  7. The data stream reconstruction method according to claim 1, wherein when the target neural network layer is a reinforcement learning layer, the processing units comprise multiply-accumulate units and rectified linear units grouped into a plurality of threads; the input or output of the data stream is thread-level parallel serial transmission; and the cache of the system-on-chip is used for state activation and iterative operations.
  8. A reconfigurable data stream processor, characterized in that it is configured to execute the data stream reconstruction method according to any one of claims 1 to 7, the reconfigurable data stream processor comprising a system-on-chip, hardware threads, and a plurality of groups of processing units,
    wherein the system-on-chip is configured to control each group of processing units to cooperate with the corresponding hardware thread and to be adjusted to the functional configuration matching the target neural network layer, so as to construct the target neural network layer.
  9. The reconfigurable data stream processor according to claim 8, wherein the system-on-chip comprises an execution controller, a direct memory access controller, execution threads, and a buffer,
    the execution controller being configured to fetch the network instructions of the target neural network layer from an external off-chip memory, load the network instructions into static memory, and decode and analyze the network instructions one by one to drive the execution threads;
    the direct memory access controller being configured to control reads and writes between the system-on-chip and the off-chip memory;
    the execution threads being configured to run under the control of the execution controller to implement the functions of the target neural network layer; and
    the buffer comprising a static memory pool composed of a plurality of the static memories.
  10. The reconfigurable data stream processor according to claim 8 or 9, wherein the hardware thread comprises a core state machine and shift registers, the core state machine being configured to control the data input/output, activation-function allocation, and weight allocation of the processing units on the same thread, and the shift registers being configured to construct the inputs and outputs of the activation functions.
PCT/CN2020/127250 2019-11-08 2020-11-06 Data stream reconstruction method and reconstructable data stream processor WO2021089009A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911087000.9 2019-11-08
CN201911087000.9A CN111105023B (en) 2019-11-08 2019-11-08 Data stream reconstruction method and reconfigurable data stream processor

Publications (1)

Publication Number Publication Date
WO2021089009A1 true WO2021089009A1 (en) 2021-05-14

Family

ID=70420571

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/127250 WO2021089009A1 (en) 2019-11-08 2020-11-06 Data stream reconstruction method and reconstructable data stream processor

Country Status (2)

Country Link
CN (1) CN111105023B (en)
WO (1) WO2021089009A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105023B (en) * 2019-11-08 2023-03-31 深圳市中科元物芯科技有限公司 Data stream reconstruction method and reconfigurable data stream processor
CN111783971B (en) * 2020-07-02 2024-04-09 上海赛昉科技有限公司 Highly flexibly configurable data post-processor for deep neural network
CN112560173B (en) * 2020-12-08 2021-08-17 北京京航计算通讯研究所 Vehicle weather resistance temperature prediction method and device based on deep learning
CN112540950B (en) * 2020-12-18 2023-03-28 清华大学 Reconfigurable processor based on configuration information shared storage and shared storage method thereof
CN113240084B (en) * 2021-05-11 2024-02-02 北京搜狗科技发展有限公司 Data processing method and device, electronic equipment and readable medium
CN116702852B (en) * 2023-08-02 2023-10-20 电子科技大学 Dynamic reconfiguration neural network acceleration circuit and system based on multistage event driving

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218345A (en) * 2013-03-15 2013-07-24 上海安路信息科技有限公司 Dynamic reconfigurable system adaptable to plurality of dataflow computation modes and operating method
CN203204615U (en) * 2013-03-15 2013-09-18 上海安路信息科技有限公司 Dynamic reconfigurable system adaptable to various data flow calculation modes
US9390369B1 (en) * 2011-09-21 2016-07-12 Brain Corporation Multithreaded apparatus and methods for implementing parallel networks
CN107783840A (en) * 2017-10-27 2018-03-09 福州瑞芯微电子股份有限公司 A kind of Distributed-tier deep learning resource allocation methods and device
CN109409510A (en) * 2018-09-14 2019-03-01 中国科学院深圳先进技术研究院 Neuron circuit, chip, system and method, storage medium
CN111105023A (en) * 2019-11-08 2020-05-05 中国科学院深圳先进技术研究院 Data stream reconstruction method and reconfigurable data stream processor

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190114548A1 (en) * 2017-10-17 2019-04-18 Xilinx, Inc. Static block scheduling in massively parallel software defined hardware systems
US11636327B2 (en) * 2017-12-29 2023-04-25 Intel Corporation Machine learning sparse computation mechanism for arbitrary neural networks, arithmetic compute microarchitecture, and sparsity for training mechanism
CN109472356A (en) * 2018-12-29 2019-03-15 南京宁麒智能计算芯片研究院有限公司 A kind of accelerator and method of restructural neural network algorithm

Also Published As

Publication number Publication date
CN111105023B (en) 2023-03-31
CN111105023A (en) 2020-05-05


Legal Events

121  Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20885647; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: DE)
122  Ep: pct application non-entry in european phase (Ref document number: 20885647; Country of ref document: EP; Kind code of ref document: A1)