WO2022027818A1 - Data batch processing method and batch processing apparatus thereof, and storage medium - Google Patents

Data batch processing method and batch processing apparatus thereof, and storage medium Download PDF

Info

Publication number
WO2022027818A1
WO2022027818A1 (PCT/CN2020/120177)
Authority
WO
WIPO (PCT)
Prior art keywords
data
batch processing
strips
reorganized
channel data
Prior art date
Application number
PCT/CN2020/120177
Other languages
French (fr)
Chinese (zh)
Inventor
王峥
雷明
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 filed Critical 深圳先进技术研究院
Publication of WO2022027818A1 publication Critical patent/WO2022027818A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation using electronic means
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781: On-chip cache; Off-chip memory
    • G06F15/7817: Specially adapted for signal processing, e.g. Harvard architectures
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention belongs to the technical field of data processing, and in particular, relates to a data batch processing method for neural networks, a batch processing device thereof, and a computer-readable storage medium.
  • With the promotion of big data and artificial intelligence technologies, deep learning algorithms based on artificial neural networks, relying on their powerful feature extraction capabilities, have achieved remarkable results in computer vision, natural language processing, and autonomous decision-making by agents.
  • however, neural network structures are becoming increasingly complex, accompanied by a sharp increase in parameter count and computation, which places higher demands on the data bandwidth and computing power of the hardware platform.
  • the technical problem solved by the present invention is how to reduce the number of times data is read from memory.
  • a data batch processing method for neural networks comprising:
  • the original channel data of the N consecutive frame images are spliced to form multiple reorganized data strips, where each reorganized data strip includes the original channel data of the N consecutive frame images at the same pixel position;
  • the multiple reorganized data strips are sequentially input to the parallel computing unit array for convolution, where all the original channel data of the same reorganized data strip enter the computing unit at the same time.
  • each reorganized data strip further includes zero-padding data, and the data bit width of each reorganized data strip is equal to the memory bandwidth.
  • the data batch processing method further includes:
  • the multiple reorganized data strips are stored in the memory.
  • the method of sequentially inputting multiple reorganized data strips into the parallel computing unit array for convolution includes:
  • the memory bandwidth is 128 bits, N is 5, and the original channel data on each pixel position of each consecutive frame image includes red channel data, green channel data and blue channel data.
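As a hedged illustration of the parameters just stated (128-bit memory bandwidth, N = 5, and 8-bit R/G/B channels), the packing of one reorganized data strip can be sketched as follows. The function name and the integer representation of the strip are assumptions made for this sketch, not part of the disclosure.

```python
# Sketch: pack the RGB values of one pixel position from N = 5 consecutive
# frames into a single 128-bit reorganized data strip (illustrative only).

MEMORY_BANDWIDTH_BITS = 128
N_FRAMES = 5
CHANNEL_BITS = 8  # each of R, G, B occupies 8 bits

def pack_strip(pixels):
    """pixels: list of N (r, g, b) tuples, one per consecutive frame.
    Returns the strip as an int plus its bit width after zero-padding."""
    strip = 0
    bits_used = 0
    for r, g, b in pixels:
        for channel in (r, g, b):
            strip = (strip << CHANNEL_BITS) | (channel & 0xFF)
            bits_used += CHANNEL_BITS
    # zero-pad so the strip's bit width equals the memory bandwidth
    pad = MEMORY_BANDWIDTH_BITS - bits_used  # 128 - 120 = 8 bits
    strip <<= pad
    return strip, bits_used + pad

pixels = [(10, 20, 30)] * N_FRAMES  # same pixel position in 5 frames
strip, width = pack_strip(pixels)
print(width)  # 128: one strip fills the memory bandwidth exactly
```

The 5 frames contribute 5 × 24 = 120 data bits, and the 8 padding bits bring the strip to the full 128-bit bandwidth, so each memory access transfers one complete strip.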
  • the present application also discloses a data batch processing device for a neural network, the data batch processing device comprising:
  • a data acquisition module for acquiring memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth
  • a data reorganization module for splicing the original channel data of the N consecutive frame images to form multiple reorganized data strips, where each reorganized data strip includes the original channel data of the N consecutive frame images at the same pixel position;
  • a convolution calculation module for reading multiple reorganized data strips and performing convolution on them in sequence, where all the original channel data of the same reorganized data strip are read by the convolution calculation module at the same time.
  • the data batch processing device further includes a memory, and the memory is used for receiving and storing a plurality of reorganized data strips formed by the data reorganization module.
  • the convolution calculation module includes:
  • a multiplier-adder unit for multiplying and adding the original channel data of each consecutive frame image in each reorganized data strip with the same weight data respectively;
  • a storage unit for storing the multiply-add results of the consecutive frame images in each reorganized data strip.
  • the present invention also discloses a computer-readable storage medium storing a data batch processing program for neural networks; when the program is executed by a processor, the above data batch processing method for neural networks is implemented.
  • the invention discloses a data batch processing method for neural networks which, compared with traditional calculation methods, has the following technical effects:
  • the data structure optimized for three-dimensional arrays enables fast data buffering and avoids repeated weight reads between different frame images, thereby greatly reducing the number of off-chip memory accesses;
  • FIG. 1 is a flowchart of a data batch processing method for a neural network according to Embodiment 1 of the present invention
  • FIG. 2 is a flowchart of the convolution calculation according to Embodiment 1 of the present invention.
  • FIG. 3 is a schematic diagram of a data splicing process according to Embodiment 1 of the present invention.
  • FIG. 4 is a schematic diagram of a data batch processing apparatus according to Embodiment 2 of the present invention.
  • FIG. 5 is a schematic diagram of a parallel computing unit array according to Embodiment 2 of the present invention.
  • FIG. 6 is a schematic diagram of a computer device according to an embodiment of the present invention.
  • in the prior art, convolution is computed for each frame in turn, which requires repeatedly reading the weight data and reading the image data multiple times, wasting computing resources.
  • the present application exploits the fact that, in the convolution calculation, the weight data corresponding to the same pixel position is the same: the image channel data of adjacent frames at the same pixel position is reorganized, the reorganized data is input into the computing unit at the same time, and the convolution is performed with the same weight data, which reduces the number of reads of weight data and image data and greatly reduces computing energy consumption.
  • the data batch processing method for a neural network in the first embodiment includes the following steps:
  • Step S10 obtaining the memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth;
  • Step S20 splicing the original channel data of the N continuous frame images to form multiple reorganized data strips, wherein each reorganized data strip includes the original channel data of the N continuous frame images at the same pixel position;
  • Step S30 Inputting the multiple reorganized data strips into the parallel computing unit array in sequence for convolution, where all the original channel data of the same reorganized data strip enter the computing unit at the same time.
  • in step S10, taking a memory bandwidth of 128 bits as an example: in the prior art, each read from memory fetches only the original channel data of one pixel, namely the red channel data R, green channel data G, and blue channel data B, each occupying 8 bits, for a total of 24 bits; thus only 24 bits of data are read per access, wasting memory bandwidth.
  • in the present embodiment, the original channel data of N consecutive frame images are selected and spliced, so that each read from memory fetches more channel data, improving the efficiency of memory bandwidth usage.
  • in step S20, the original channel data of the 5 consecutive frame images at the same pixel position are spliced to form reorganized data strips. For example, splicing the original channel data of the 5 consecutive frame images at the first pixel forms a reorganized data strip with a data bit width of 120 bits. As a preferred embodiment, the formed reorganized data strip is zero-padded so that its data bit width equals the memory bandwidth, yielding a 128-bit reorganized data strip.
  • a block memory from Xilinx, model 128-32, is used to read the original channel data of each image; however, the block memory can only read four color channel values (32 bits of data) at a time, while only 24 bits are actually needed, so the data read from the block memory must be further reorganized.
  • the R 0 G 0 B 0 read for the first time is spliced and zero-padded to form the reorganized data strip of the first pixel, namely R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 0, and a second register is set to store the reorganized data strip, completing the reorganization of the original channel data of the first pixel.
  • the B 2 read for the third time and the R 2 G 2 stored in the first register are spliced to form the reorganized data strip of the third pixel, namely R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 0, and the R 3 G 3 B 3 of each image read for the third time is spliced to form the reorganized data strip of the fourth pixel, namely R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 0, so that after three reads the reorganization of the original channel data of the four pixels is completed, forming four reorganized data strips.
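The three-read regrouping described above (32-bit block-memory words recombined into 24-bit R,G,B groups, with leftover bytes carried in the first register) can be sketched as follows. The byte-stream representation and helper name are assumptions for illustration only.

```python
# Sketch: recombine 32-bit (4-byte) block-memory reads into 24-bit (3-byte)
# RGB groups. Leftover bytes from each read are held in a carry variable,
# mirroring the role of the first register in the text. Illustrative only.

def regroup_reads(words):
    """words: list of 4-byte reads, e.g. [R0G0B0R1, G1B1R2G2, B2R3G3B3]."""
    carry = b""          # plays the role of the first register
    pixels = []
    for word in words:
        buf = carry + word
        while len(buf) >= 3:
            pixels.append(buf[:3])   # one complete R,G,B group
            buf = buf[3:]
        carry = buf                  # 0-2 leftover bytes for the next read
    return pixels

reads = [bytes([0, 0, 0, 1]),   # R0 G0 B0 R1
         bytes([1, 1, 2, 2]),   # G1 B1 R2 G2
         bytes([2, 3, 3, 3])]   # B2 R3 G3 B3
print(regroup_reads(reads))     # four 3-byte groups after three reads
```

Three 32-bit reads supply 12 bytes, exactly the four 24-bit pixel groups of the text, which is why four reorganized strips complete after three reads.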
  • a 32*64 parallel computing unit array is taken as an example, including 32*64 computing units, 64 data caches TB and 32 weight caches WB, where each data cache TB stores multiple reorganized data strips and the weight data stored in each weight cache WB is shared by the 64 data caches.
  • each data cache stores the reorganized data strips of four adjacent pixels, that is, R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 0, R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 0, R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 0 and R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 0.
  • writing the same reorganized data strip into the computing unit at the same time improves the efficiency of memory bandwidth utilization on the one hand and reduces the number of memory reads on the other.
  • a method of sequentially inputting multiple reorganized data strips into a parallel computing unit array for convolution includes:
  • Step S31 Multiplying and adding the original channel data of each consecutive frame image in each reorganized data strip with the same weight data respectively;
  • Step S32 Storing the multiply-add result of each consecutive frame image in each reorganized data strip into a different register.
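Steps S31 and S32 above can be sketched as follows: the same weight is applied to every frame's channel data within one strip, and each frame's result lands in its own register. All names and the toy values are assumptions for illustration.

```python
# Sketch of steps S31-S32: one reorganized strip holds the same pixel's
# channel data from N frames; each frame's data is multiplied by the SAME
# weight (read once), and each frame's result goes to its own register.

def convolve_strip(strip, weight):
    """strip: list of N (r, g, b) tuples; weight: (wr, wg, wb) shared
    across all N frames. Returns one partial sum per frame ("register")."""
    registers = []
    for r, g, b in strip:            # one entry per consecutive frame
        wr, wg, wb = weight          # same weight data for every frame
        registers.append(r * wr + g * wg + b * wb)
    return registers

strip = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (1, 0, 1), (2, 2, 2)]
weight = (1, 10, 100)
print(convolve_strip(strip, weight))  # [321, 654, 987, 101, 222]
```

The weight tuple is fetched once per strip rather than once per frame, which is the source of the reduced weight-read count claimed in the text.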
  • each reorganized data strip includes the original channel data of the same pixel of the 5 consecutive frame images, and 5 third registers are set, respectively used to store the multiply-add results of the 5 consecutive frames.
  • each multiply-add result is stored in the corresponding third register, and so on, so that each calculation result is stored in a different third register.
  • in this way, the sharing of weight data is realized: there is no need to repeatedly read the weight data into the computing unit, repeated reading of image data is also avoided, and the number of memory accesses is reduced.
  • the apparatus for batch processing data for neural networks includes a data acquisition module 100, a data reorganization module 200 and a convolution calculation module 300.
  • the data acquisition module 100 is used for acquiring the memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth;
  • the data reorganization module 200 is used for splicing the original channel data of the N consecutive frame images to form multiple reorganized data strips, where each reorganized data strip includes the original channel data of the N consecutive frame images at the same pixel position;
  • the convolution calculation module 300 is used to read multiple reorganized data strips and perform convolution on them in sequence, where all the original channel data of the same reorganized data strip are read by the convolution calculation module at the same time.
  • the data batch processing apparatus further includes a memory 400, and the memory 400 is used for receiving and storing a plurality of reorganized data strips formed by the data reorganization module 200.
  • the data acquisition module 100 includes a plurality of buffers, which are used to read and temporarily store the original channel data of the corresponding images from the memory 400 according to the memory bandwidth.
  • take the memory bandwidth equal to 128 bits and N equal to 5 as an example.
  • 5 different buffers are used to read and store the original channel data of the 5 consecutive frame images from the memory, with the color channel data arranged in sequence by pixel position, that is R 0 G 0 B 0 R 1 G 1 B 1 R 2 G 2 B 2 R 3 G 3 B 3 .
  • the data reorganization module 200 includes a block memory, a first register, a second register and a counter.
  • the block memory is a Xilinx Block Memory, model 128-32.
  • the block memory can only read four color channel values (32 bits of data) at a time, while only 24 bits are actually needed, so the data read from the block memory must be further reorganized.
  • in the first read, the block memory reads from each buffer the data R 0 G 0 B 0 R 1 ; at this time, R 1 is stored in the first register.
  • the R 0 G 0 B 0 read for the first time is spliced and zero-padded to form the reorganized data strip of the first pixel, namely R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 R 0 G 0 B 0 0, which is stored in the second register, while the value of the counter is set to 0.
  • the data read from the respective buffers by the block memory is G 1 B 1 R 2 G 2 .
  • R 2 G 2 is stored in the first register, and the R 1 pre-stored in the first register is spliced with the G 1 B 1 read for the second time and zero-padded to form the reorganized data strip of the second pixel, namely R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 R 1 G 1 B 1 0, which is stored in the second register, while the value of the counter is set to 1.
  • the data read from the respective buffers by the block memory is B 2 R 3 G 3 B 3 .
  • the R 2 G 2 pre-stored in the first register is spliced with the B 2 read for the third time and zero-padded to form the reorganized data strip of the third pixel, namely R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 R 2 G 2 B 2 0, which is stored in the second register, while the value of the counter is set to 2.
  • the R 3 G 3 B 3 read for the third time is spliced and zero-padded to form the reorganized data strip of the fourth pixel, namely R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 R 3 G 3 B 3 0, which is stored in the second register, while the value of the counter is set to 3.
  • thus, after three reads, the reorganization of the original channel data of the four pixels is completed, forming four reorganized data strips.
  • the above steps are repeated until the reorganization of the original channel data of all pixels of the 5 consecutive frame images is completed, and all the reorganized data strips obtained are stored in the memory for use in subsequent calculations.
  • each data cache TB stores multiple reorganized data strips, the weight data stored in each weight cache WB is shared by the 64 data caches, and the convolution calculation module is the computing unit PE.
  • the convolution calculation module includes a multiplier-adder unit and a storage unit; the multiplier-adder unit is used for multiplying and adding the original channel data of each consecutive frame image in each reorganized data strip with the same weight data respectively,
  • and the storage unit is used to store the multiply-add result of each consecutive frame image in each reorganized data strip.
  • the multiplier-adder unit includes a multiplier and an adder
  • the storage unit includes a data selector 301 , a data distributor 302 and five third registers 303 .
  • the multiplier calculates W 00 *R 0 , and the data selector 301 reads the current data from the corresponding third register 303 ;
  • after the adder's calculation, since the initial value of the third register is zero, the calculation result of the adder is W 00 *R 0 , and the calculation result W 00 *R 0 is then stored in the third register 303 through the data distributor 302 .
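A hedged sketch of the selector/distributor accumulation just described: the data selector reads the current partial sum from a frame's third register, the adder accumulates the new product, and the data distributor writes the result back. The class and method names are assumptions, not the disclosed hardware.

```python
# Sketch: multiply-accumulate into per-frame "third registers". The data
# selector reads the current value, the adder adds weight*channel, and the
# data distributor writes the result back (illustrative only).

class ThirdRegisters:
    def __init__(self, n=5):
        self.regs = [0] * n          # initial value of each register is zero

    def mac(self, frame_idx, weight, channel):
        current = self.regs[frame_idx]          # data selector: read
        result = current + weight * channel     # multiplier + adder
        self.regs[frame_idx] = result           # data distributor: write back
        return result

regs = ThirdRegisters()
# First MAC for frame 0: the register starts at zero, so the result is W00*R0.
print(regs.mac(0, 3, 7))   # 21
print(regs.mac(0, 2, 5))   # 21 + 10 = 31
```

Because each frame owns one register, the five partial sums of a strip accumulate independently while sharing a single weight read.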
  • the present application also discloses a computer-readable storage medium storing a data batch processing program for neural networks; when the program is executed by a processor, the above data batch processing method for neural networks is implemented.
  • the present application also discloses a computer device.
  • the computer device includes a processor 12 , an internal bus 13 , a network interface 14 , and a computer-readable storage medium 11 .
  • the processor 12 reads the corresponding computer program from the computer-readable storage medium and then executes it, forming a request processing device on a logical level.
  • the computer-readable storage medium 11 stores a data batch processing program for a neural network; when the program is executed by a processor, the above data batch processing method for a neural network is implemented.
  • computer-readable storage media include both persistent and non-persistent, removable and non-removable media; information storage can be implemented by any method or technology.
  • Information may be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by computing devices.

Abstract

A data batch processing method and a batch processing apparatus thereof, and a storage medium. The data batch processing method comprises: obtaining memory bandwidth and selecting, according to the memory bandwidth, raw channel data of N continuous image frames (S10); joining the raw channel data of the N continuous image frames to form multiple reorganized data strips, each reorganized data strip comprising the raw channel data of the N continuous image frames at the same pixel position (S20); and sequentially inputting the multiple reorganized data strips into a parallel computing unit array for convolution operation, all of the raw channel data of the same reorganized data strip entering a computing unit at the same time (S30). Image channel data of multiple adjacent frames at the same pixel position is reorganized, the reorganized data is input into a computing unit at the same time, and the convolution operation is performed with the same weight data. As such, the number of times weight data and image data are read can be reduced, and computational energy consumption can be greatly reduced.

Description

数据批处理方法及其批处理装置、存储介质Data batch processing method, batch processing device and storage medium 技术领域technical field
本发明属于数据处理技术领域,具体地讲,涉及用于神经网络的数据批处理方法及其批处理装置、计算机可读存储介质。The present invention belongs to the technical field of data processing, and in particular, relates to a data batch processing method for neural networks, a batch processing device thereof, and a computer-readable storage medium.
背景技术Background technique
随着大数据与人工智能技术的推广,基于人工神经网络的深度学习算法依靠其强大的特征提取能力,在计算机视觉、自然语言处理以及智能体自主决策等领域取得了显著的成果。但神经网络结构日趋复杂,伴随而来的是参数量和计算量的急剧增加,这对硬件平台的数据带宽和计算能力有较高的要求。With the promotion of big data and artificial intelligence technologies, deep learning algorithms based on artificial neural networks have achieved remarkable results in the fields of computer vision, natural language processing, and autonomous decision-making by agents, relying on their powerful feature extraction capabilities. However, the structure of neural network is becoming more and more complex, which is accompanied by a sharp increase in the amount of parameters and calculation, which has higher requirements on the data bandwidth and computing power of the hardware platform.
其中,连续图像处理技术,例如视频流中的目标识别、跟踪、超分辨重建等,在智能应用中占据举足轻重的地位。现今主流深度学习加速器对单帧图像的智能处理具备很好的提速效果,然而对于视频应用,单帧加速技术的直接运用会造成极大的计算资源浪费,特别是造成大量重复的片下存储器读写操作。其核心原因在于不同帧图像间的重复权重读取,离散数据读取等未经优化的内存操作。Among them, continuous image processing technologies, such as target recognition, tracking, and super-resolution reconstruction in video streams, play an important role in intelligent applications. Today's mainstream deep learning accelerators have a very good speed-up effect on the intelligent processing of single-frame images. However, for video applications, the direct application of single-frame acceleration technology will cause a huge waste of computing resources, especially a large number of repeated off-chip memory reads. write operation. The core reason is unoptimized memory operations such as repeated weight reading between different frame images and discrete data reading.
发明内容SUMMARY OF THE INVENTION
(一)本发明所要解决的技术问题(1) Technical problem to be solved by the present invention
本发明解决的技术问题是:如何减少从内存读取数据的次数。The technical problem solved by the present invention is: how to reduce the number of times of reading data from the memory.
(二)本发明所采用的技术方案(2) Technical scheme adopted in the present invention
一种用于神经网络的数据批处理方法,所述数据批处理方法包括:A data batch processing method for neural networks, the data batch processing method comprising:
获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;Obtain memory bandwidth and select the original channel data of N continuous frame images according to the memory bandwidth;
将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道 数据;The original channel data of described N continuous frame images is spliced, and multiple reorganization data strips are formed, wherein each part of the reorganized data strip includes the original channel data of described N continuous frame images on the same pixel position;
将多份重组数据条依序输入至并行计算单元阵列进行卷积运算,其中同一份重组数据条的全部原始通道数据在同一时刻进入至计算单元。A plurality of reconstituted data strips are sequentially input to the parallel computing unit array for convolution operation, wherein all the original channel data of the same reconstituted data strip enter the computing unit at the same time.
可选择地,每份所述重组数据条还包括补零数据,且每份所述重组数据条的数据位宽等于所述内存带宽。Optionally, each piece of the restructured data strip further includes zero-padding data, and the data bit width of each piece of the restructured data strip is equal to the memory bandwidth.
可选择地,所述数据批处理方法还包括:Optionally, the data batch processing method further includes:
将所述多份重组数据条存储至内存中。The plurality of reconstituted data strips are stored in the memory.
可选择地,将多份重组数据条依序输入至并行计算单元阵列进行卷积运算的方法包括:Optionally, the method for sequentially inputting multiple pieces of recombined data strips into the parallel computing unit array for convolution operation includes:
将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算;Multiply and add the original channel data of each continuous frame image in each of the reorganized data strips with the same weight data respectively;
将每份所述重组数据条中的各张连续帧图像的乘加运算的结果存储至不同的寄存器中。The results of the multiplication and addition operations of the successive frame images in each of the recombined data strips are stored in different registers.
可选择地,所述内存带宽为128比特,N为5,每张所述连续帧图像的每个像素位置上的原始通道数据包括红色通道数据、绿色通道数据和蓝色通道数据。Optionally, the memory bandwidth is 128 bits, N is 5, and the original channel data on each pixel position of each consecutive frame image includes red channel data, green channel data and blue channel data.
本申请还公开一种用于神经网络的数据批处理装置,所述数据批处理装置包括:The present application also discloses a data batch processing device for a neural network, the data batch processing device comprising:
数据获取模块,用于获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;a data acquisition module for acquiring memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth;
数据重组模块,用于将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道数据;A data reorganization module for splicing the original channel data of the N continuous frame images to form multiple reorganized data strips, wherein each reorganized data strip includes the original channels of the N continuous frame images at the same pixel position data;
卷积计算模块,用于读取多份重组数据条并依序对多份重组数据条进行卷积运算,其中同一份重组数据条的全部原始通道数据在同一时刻被所述卷积计算模块读取。The convolution calculation module is used to read multiple reorganized data strips and perform convolution operation on the multiple reorganized data strips in sequence, wherein all the original channel data of the same reorganized data strip are read by the convolution calculation module at the same time Pick.
可选择地,所述数据批处理装置还包括内存,所述内存用于接收并存储所述数据重组模块形成的多份重组数据条。Optionally, the data batch processing device further includes a memory, and the memory is used for receiving and storing a plurality of reorganized data strips formed by the data reorganization module.
可选择地,所述卷积计算模块包括:Optionally, the convolution calculation module includes:
乘加器单元,用于将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算;A multiplier-adder unit, used for multiplying and adding the original channel data of each continuous frame image in each of the recombined data strips with the same weight data respectively;
存储单元,用于存储每份所述重组数据条中的各张连续帧图像的乘加运算的结果。The storage unit is used for storing the result of the multiplication and addition operation of each successive frame image in each piece of the recombined data strip.
本发明还公开了一种计算机可读存储介质,所述计算机可读存储介质存储有用于神经网络的数据批处理程序,所述用于神经网络的数据批处理程序被处理器执行时实现上述的用于神经网络的数据批处理方法。The present invention also discloses a computer-readable storage medium, where the computer-readable storage medium stores a data batch processing program for neural networks, and when the data batch processing program for neural networks is executed by a processor, the above-mentioned Data batching methods for neural networks.
(三)有益效果(3) Beneficial effects
本发明公开了一种用于神经网络的数据批处理方法,相对于传统的计算方法,具有如下技术效果:The invention discloses a data batch processing method for neural network, which has the following technical effects compared with the traditional calculation method:
(1)面向三维阵列的优化数据结构,能够实现数据快速缓冲,避免不同帧图像间重复权重的读取,从而大幅度降低了片下存储器访问次数;(1) The optimized data structure for three-dimensional arrays can realize fast data buffering and avoid repeated weight reading between different frame images, thereby greatly reducing the number of off-chip memory accesses;
(2)本方法思路新颖，从输入数据本身特性出发：对于首层卷积层，在输入数据（如背景图像及监控录像等）图像单一不变的情况下可以发挥很大效果，对深度神经网络中卷积核深度较大的情况也具有很大潜力。(2) This method takes a novel approach that starts from the characteristics of the input data itself: for the first convolutional layer, it is highly effective when the input images are largely static (e.g., background images or surveillance footage), and it also shows great potential when the convolution kernel depth in a deep neural network is large.
附图说明Description of drawings
图1为本发明的实施例一的用于神经网络的数据批处理方法的流程图;FIG. 1 is a flowchart of a data batch processing method for a neural network according to Embodiment 1 of the present invention;
图2为本发明的实施例一的卷积计算的流程图;Fig. 2 is the flow chart of the convolution calculation of Embodiment 1 of the present invention;
图3为本发明的实施例一的数据拼接过程示意图;3 is a schematic diagram of a data splicing process according to Embodiment 1 of the present invention;
图4为本发明的实施例二的数据批处理装置的示意图;4 is a schematic diagram of a data batch processing apparatus according to Embodiment 2 of the present invention;
图5为本发明的实施例二的并行计算单元阵列的示意图;5 is a schematic diagram of a parallel computing unit array according to Embodiment 2 of the present invention;
图6为本发明的实施例的计算机设备示意图。FIG. 6 is a schematic diagram of a computer device according to an embodiment of the present invention.
具体实施方式detailed description
为了使本发明的目的、技术方案及优点更加清楚明白,以下结合附图及实 施例,对本发明进一步详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本发明,并不用于限定本发明。In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.
在详细描述本申请的各个实施例之前，首先简单描述本申请的发明构思：现有技术中，依次对每一帧图片进行卷积计算，需要重复读取权重数据以及多次读取图像数据，会造成计算资源浪费。本申请利用卷积计算过程中相同像素位置对应的权重数据相同的特点，将相邻多帧图片在相同像素位置上的图像通道数据进行重组，并将重组数据在同一时刻输入到计算单元中，与相同的权重数据进行卷积运算，这样可以减少权重数据和图像数据的读取次数，大大降低计算能耗。Before describing the embodiments of the present application in detail, the inventive concept is briefly summarized. In the prior art, the convolution is computed on each frame in turn, which requires repeatedly reading the weight data and reading the image data multiple times, wasting computing resources. The present application exploits the fact that, during convolution, the same pixel position corresponds to the same weight data: the image channel data of multiple adjacent frames at the same pixel position are reorganized, and the reorganized data are input into the computing unit at the same time and convolved with the same weight data. This reduces the number of reads of both the weight data and the image data, greatly lowering computing energy consumption.
实施例一Example 1
具体地,如图1所示,本实施例一的用于神经网络的数据批处理方法包括如下步骤:Specifically, as shown in FIG. 1 , the data batch processing method for a neural network in the first embodiment includes the following steps:
步骤S10:获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;Step S10: obtaining the memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth;
步骤S20:将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道数据;Step S20: splicing the original channel data of the N continuous frame images to form multiple reorganized data strips, wherein each reorganized data strip includes the original channel data of the N continuous frame images at the same pixel position;
步骤S30:将多份重组数据条依序输入至并行计算单元阵列进行卷积运算,其中同一份重组数据条的全部原始通道数据在同一时刻进入至计算单元。Step S30: Inputting multiple pieces of recombined data strips into the parallel computing unit array in sequence for convolution operation, wherein all the original channel data of the same piece of recombined data strips enter the computing unit at the same time.
在步骤S10中，以内存带宽等于128比特为例，在现有技术中，从内存读取数据时，每次只读到一个像素点的原始通道数据，包括红色通道数据R、绿色通道数据G和蓝色通道数据B，每个颜色通道数据占8比特，总共24比特，这样每次只读取了24比特数据，浪费了内存带宽。本实施例根据卷积计算过程的特点，结合图像数据的特性，根据实际使用的内存带宽的大小，选择N张连续帧图像的原始通道数据进行拼接，使得每次从内存读取数据时可读取到更多的通道数据，提高内存带宽的使用效率。In step S10, taking a memory bandwidth of 128 bits as an example: in the prior art, each read from memory fetches only the original channel data of a single pixel, namely red channel data R, green channel data G and blue channel data B. Each color channel occupies 8 bits, 24 bits in total, so only 24 bits are read per access, wasting memory bandwidth. Based on the characteristics of the convolution calculation and of the image data, this embodiment selects the original channel data of N consecutive frame images according to the memory bandwidth actually in use and splices them, so that more channel data can be read per memory access, improving the utilization efficiency of the memory bandwidth.
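To make the bandwidth arithmetic above concrete, the following is a small illustrative sketch (not part of the patent; the variable names are ours) showing why N = 5 frames fit a 128-bit bus and how the bus utilization improves compared with reading one pixel at a time:

```python
# Hypothetical illustration of the N = 5 example in this embodiment:
# how many pixels' worth of 24-bit RGB data fit into one 128-bit
# memory transaction, and the resulting bus utilization.

MEMORY_BANDWIDTH_BITS = 128
BITS_PER_PIXEL = 3 * 8  # R, G, B channels, 8 bits each

# Reading one pixel per transaction (the prior-art behaviour):
single_pixel_utilization = BITS_PER_PIXEL / MEMORY_BANDWIDTH_BITS  # 24/128

# Packing N frames' data for the same pixel position into one strip:
n_frames = MEMORY_BANDWIDTH_BITS // BITS_PER_PIXEL  # floor(128 / 24) = 5
packed_bits = n_frames * BITS_PER_PIXEL             # 120 bits
padding_bits = MEMORY_BANDWIDTH_BITS - packed_bits  # 8 zero bits appended
packed_utilization = packed_bits / MEMORY_BANDWIDTH_BITS  # 120/128

print(n_frames, packed_bits, padding_bits)           # 5 120 8
print(single_pixel_utilization, packed_utilization)  # 0.1875 0.9375
```

Utilization thus rises from 24/128 to 120/128 of the bus width per access under these assumptions.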
示例性地，以内存带宽等于128比特、N等于5为例，对步骤S20中的拼接过程进行详细描述。将5张连续帧图像在同一像素位置上的原始通道数据进行拼接，形成重组数据条，例如将5张连续帧图像在第一像素点的原始通道数据进行拼接，形成数据位宽为120比特的重组数据条。作为优选实施例，对形成的重组数据条进行补零处理，使得重组数据条的数据位宽等于内存带宽，例如在数据位宽为120比特的重组数据条的末尾补上8个0，形成128比特的重组数据条。Exemplarily, taking a memory bandwidth of 128 bits and N equal to 5 as an example, the splicing process in step S20 is described in detail. The original channel data of the 5 consecutive frame images at the same pixel position are spliced to form a reorganized data strip; for example, splicing their original channel data at the first pixel forms a reorganized data strip with a data bit width of 120 bits. As a preferred embodiment, the formed reorganized data strip is zero-padded so that its data bit width equals the memory bandwidth; for example, 8 zeros are appended to the end of the 120-bit strip to form a 128-bit reorganized data strip.
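A minimal sketch of the splicing-and-padding step described above, assuming 8-bit channel bytes; `build_strip` is a hypothetical helper of ours, not a function named in the patent:

```python
# Splice the RGB bytes of N = 5 frames at one pixel position into a
# single 120-bit strip, then zero-pad it to the 128-bit memory bandwidth.

def build_strip(rgb_per_frame, bandwidth_bits=128):
    """rgb_per_frame: list of (R, G, B) byte tuples, one per frame."""
    payload = bytearray()
    for r, g, b in rgb_per_frame:
        payload += bytes([r, g, b])        # 24 bits per frame
    pad_bits = bandwidth_bits - 8 * len(payload)
    assert pad_bits >= 0, "too many frames for this bandwidth"
    payload += bytes(pad_bits // 8)        # zero padding (8 bits for N = 5)
    return bytes(payload)

strip = build_strip([(10, 20, 30)] * 5)   # same pixel in 5 frames
print(len(strip) * 8)                     # 128
```

For N = 5 the payload is 15 bytes (120 bits) and one zero byte brings the strip to the full 128-bit bus width.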
As a preferred embodiment, a Xilinx 128-32 Block Memory is used to read the original channel data of each image. This block memory can only read four channel values (32 bits) per access, while each pixel actually needs only 24 bits, so the data must be further reorganized after each read. Exemplarily, the original channel data of the 5 consecutive frame images are temporarily stored in 5 buffers. As shown in Fig. 3, the channel data are arranged in pixel order, i.e., R0G0B0R1G1B1R2G2B2R3G3B3.... When reading with the block memory, the first buffer yields, in sequence, R0G0B0R1, G1B1R2G2 and B2R3G3B3. A first register is provided: the R1 obtained in the first read is stored in the first register for use in the next splicing, while the R0G0B0 of the 5 images are spliced and zero-padded to form the reorganized data strip of the first pixel, i.e., R0G0B0 R0G0B0 R0G0B0 R0G0B0 R0G0B0 0, and a second register is provided to store the reorganized data strip, completing the reorganization of the original channel data of the first pixel. Similarly, when reorganizing the original channel data of the second pixel, the G1B1 obtained in the second read is combined with the R1 held in the first register to form the reorganized data strip of the second pixel, i.e., R1G1B1 R1G1B1 R1G1B1 R1G1B1 R1G1B1 0, and the R2G2 obtained in the second read is stored in the first register for the next reorganization. Similarly, when reorganizing the original channel data of the third pixel, the B2 obtained in the third read is spliced with the R2G2 held in the first register to form the reorganized data strip of the third pixel, i.e., R2G2B2 R2G2B2 R2G2B2 R2G2B2 R2G2B2 0, and the R3G3B3 of each image obtained in the third read are spliced to form the reorganized data strip of the fourth pixel, i.e., R3G3B3 R3G3B3 R3G3B3 R3G3B3 R3G3B3 0. In this way, every three reads complete the reorganization of the original channel data of four pixels, forming four reorganized data strips.
重复上述步骤,直至完成5张连续帧图像的全部像素点的原始通道数据的重组,并将得到的全部重组数据条存储至内存中,以便于后续计算使用。The above steps are repeated until the recombination of the original channel data of all the pixels of the 5 consecutive frame images is completed, and all the recombined data strips obtained are stored in the memory for use in subsequent calculations.
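The read-and-carry procedure above (four reorganized strips completed per three 32-bit reads) can be simulated as follows. The function and variable names are our own illustration of the first-register/second-register mechanism, not the patent's hardware implementation:

```python
# Simulation sketch: each of the 5 frame buffers is read 4 channel values
# (32 bits) at a time; a carry list per buffer plays the role of the
# "first register", holding channel values that do not yet complete a
# pixel, so every 3 reads yield the strips for 4 pixel positions.

def reorganize(frames, reads=3, values_per_read=4):
    carry = [[] for _ in frames]       # leftover values per frame buffer
    strips = []                        # completed per-pixel strips
    pos = [0] * len(frames)
    for _ in range(reads):
        pending = []
        for i, frame in enumerate(frames):
            chunk = list(frame[pos[i]:pos[i] + values_per_read])
            pos[i] += values_per_read
            pending.append(carry[i] + chunk)
        # every complete (R, G, B) triple across all frames forms one strip
        while all(len(p) >= 3 for p in pending):
            strip = []
            for p in pending:
                strip += p[:3]
                del p[:3]
            strips.append(strip + [0])  # zero padding to the bus width
        carry = pending
    return strips

# 5 identical frames, channels laid out R0 G0 B0 R1 G1 B1 ...
frame = [f"{c}{k}" for k in range(4) for c in "RGB"]
out = reorganize([frame] * 5)
print(len(out))        # 4 strips after 3 reads
print(out[0][:3])      # ['R0', 'G0', 'B0']
```

Each emitted strip holds 5 × 3 channel values plus one padding element, matching the 120-bit payload plus 8 padding bits of the 128-bit strip.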
在步骤S30中，以32*64的并行计算单元阵列为例，包括32*64个计算单元，64个数据缓存TB和32个权重缓存WB，其中，每个数据缓存TB中存储有多份重组数据条，每个权重缓存WB存储的权重数据由64个数据缓存共享。In step S30, taking a 32*64 parallel computing unit array as an example, it includes 32*64 computing units, 64 data buffers TB and 32 weight buffers WB, wherein each data buffer TB stores multiple reorganized data strips, and the weight data stored in each weight buffer WB is shared by the 64 data buffers.
As a preferred embodiment, taking a sliding window of size 2*2 in the convolution calculation as an example, each data buffer stores the reorganized data strips of four adjacent pixels, namely R0G0B0 R0G0B0 R0G0B0 R0G0B0 R0G0B0 0, R1G1B1 R1G1B1 R1G1B1 R1G1B1 R1G1B1 0, R2G2B2 R2G2B2 R2G2B2 R2G2B2 R2G2B2 0 and R3G3B3 R3G3B3 R3G3B3 R3G3B3 R3G3B3 0. During convolution, the whole of a reorganized data strip is written into the computing unit at the same time, which on the one hand improves memory bandwidth utilization and on the other hand reduces the number of memory reads.
具体地,如图2所示,多份重组数据条依序输入至并行计算单元阵列进行卷积运算的方法包括:Specifically, as shown in FIG. 2 , a method for sequentially inputting multiple pieces of recombined data strips into a parallel computing unit array for convolution operation includes:
步骤S31:将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算;Step S31: Multiply and add the original channel data of each continuous frame image in each of the recombined data strips with the same weight data respectively;
步骤S32:将每份所述重组数据条中的各张连续帧图像的乘加运算的结果存储至不同的寄存器中。Step S32: Store the result of the multiplication and addition operation of each successive frame image in each piece of the recombined data strip into different registers.
Exemplarily, taking 5 consecutive frame images as an example, each reorganized data strip includes the original channel data of 5 pixels, one from each of the 5 consecutive frames at the same pixel position, and 5 third registers are provided to store the multiply-add results of the 5 frames respectively. As shown in Fig. 5, for the original channel data of the first pixel of the first frame image, the multiply-add result is F0 = W00*R0 + W01*G0 + W02*B0; this result is stored in the corresponding third register until the multiply-add results of all pixels within the sliding window of the first frame have been obtained, after which the individual results are summed. Similarly, the multiply-add result for the first pixel of the second frame image is F1 = W00*R0 + W01*G0 + W02*B0 (where R0, G0 and B0 now denote the second frame's channel values at that pixel), and this result is stored in its own third register; by analogy, each result is stored in a different third register. During convolution, since the same reorganized data strip corresponds to the same weight data, the weight data can be shared and need not be read repeatedly; and since all the original channel data of the same strip are read into the computing unit at once, repeated reads of the image data are also avoided, reducing the number of memory accesses.
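As a sketch of the shared-weight computation just described (our own formulation, not the patent's circuit): the weights W00, W01 and W02 are fetched a single time and applied to every frame's channel data in the strip, producing one partial sum per third register:

```python
# One strip carries the (R, G, B) values of N frames at the same pixel;
# all N partial sums F_k = W00*R + W01*G + W02*B are formed against the
# SAME weights, so the weights are fetched once instead of once per frame.

def mac_strip(strip, weights):
    """strip: [(R, G, B), ...], one tuple per frame; weights: (W00, W01, W02)."""
    w00, w01, w02 = weights          # read the weight data a single time
    registers = []                   # one "third register" per frame
    for r, g, b in strip:
        registers.append(w00 * r + w01 * g + w02 * b)
    return registers

partials = mac_strip([(1, 2, 3)] * 5, weights=(2, 3, 4))
print(partials)  # [20, 20, 20, 20, 20]
```

With identical frames every register holds the same partial sum; with real video the frames differ but still share one weight fetch per strip.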
实施例二Embodiment 2
如图4所示,本实施例二的用于神经网络的数据批处理装置包括数据获取模块100、数据重组模块200和卷积计算模块300,数据获取模块100用于获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;数据重组模块200用于将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道数据;卷积计算模块300用于读取多份重组数据条并依序对多份重组 数据条进行卷积运算,其中同一份重组数据条的全部原始通道数据在同一时刻被所述卷积计算模块读取。其中数据批处理装置还包括内存400,所述内存400用于接收并存储所述数据重组模块200形成的多份重组数据条。As shown in FIG. 4 , the apparatus for batch processing data for neural networks according to the second embodiment includes a data acquisition module 100, a data reorganization module 200 and a convolution calculation module 300. The data acquisition module 100 is used for acquiring memory bandwidth and according to the The memory bandwidth selects the original channel data of N continuous frame images; the data reorganization module 200 is used for splicing the original channel data of the N continuous frame images to form multiple reorganized data strips, wherein each reorganized data strip includes the described The original channel data of N consecutive frame images at the same pixel position; the convolution calculation module 300 is used to read multiple pieces of reconstructed data strips and perform convolution operation on the multiple pieces of reconstructed data strips in sequence, wherein the same piece of reconstructed data strips All raw channel data are read by the convolution calculation module at the same time. The data batch processing apparatus further includes a memory 400, and the memory 400 is used for receiving and storing a plurality of reorganized data strips formed by the data reorganization module 200.
具体来说,数据获取模块100包括多个缓存器,多个缓存器用于根据内存带宽的数据从内存模块400中读取并暂存相应图像的原始通道数据。以内存带宽等于128比特,N等于5为例,采用5个不同的缓存器从内存中读取并存储5张连续帧图像的原始通道数据,其中按照像素位置顺序,依次排列颜色通道数据,即R 0G 0B 0R 1G 1B 1R 2G 2B 2R 3G 3B 3Specifically, the data acquisition module 100 includes a plurality of buffers, and the plurality of buffers are used to read and temporarily store the original channel data of the corresponding image from the memory module 400 according to the data of the memory bandwidth. Taking the memory bandwidth equal to 128 bits and N equal to 5 as an example, 5 different buffers are used to read and store the original channel data of 5 consecutive frame images from the memory, and the color channel data are arranged in sequence according to the pixel position, that is R 0 G 0 B 0 R 1 G 1 B 1 R 2 G 2 B 2 R 3 G 3 B 3 .
The data reorganization module 200 includes a block memory, a first register, a second register and a counter; exemplarily, the block memory is a Xilinx 128-32 Block Memory. This block memory can only read four channel values (32 bits) per access, while only 24 bits are actually needed, so the data must be further reorganized after each read. On the first read, the block memory obtains R0G0B0R1 from each buffer; R1 is stored in the first register, and the R0G0B0 of the images are spliced and zero-padded to form the reorganized data strip of the first pixel, i.e., R0G0B0 R0G0B0 R0G0B0 R0G0B0 R0G0B0 0, which is stored in the second register while the counter is set to 0. Similarly, on the second read, the block memory obtains G1B1R2G2 from each buffer; R2G2 is stored in the first register, and the R1 previously held in the first register is spliced with the newly read G1B1 and zero-padded to form the reorganized data strip of the second pixel, i.e., R1G1B1 R1G1B1 R1G1B1 R1G1B1 R1G1B1 0, which is stored in the second register while the counter is set to 1. On the third read, the block memory obtains B2R3G3B3 from each buffer; the R2G2 previously held in the first register is spliced with the newly read B2 and zero-padded to form the reorganized data strip of the third pixel, i.e., R2G2B2 R2G2B2 R2G2B2 R2G2B2 R2G2B2 0, which is stored in the second register while the counter is set to 2. The newly read R3G3B3 are then spliced and zero-padded to form the reorganized data strip of the fourth pixel, i.e., R3G3B3 R3G3B3 R3G3B3 R3G3B3 R3G3B3 0, which is stored in the second register while the counter is set to 3. In this way, every three reads complete the reorganization of the original channel data of four pixels, forming four reorganized data strips. The above steps are repeated until the original channel data of all pixels of the 5 consecutive frame images have been reorganized, and all the resulting reorganized data strips are stored in the memory for use in subsequent calculations.
进一步地，如图5所示，以32*64的并行计算单元阵列为例，包括32*64个计算单元PE，64个数据缓存TB和32个权重缓存WB，其中，每个数据缓存TB中存储有多份重组数据条，每个权重缓存WB存储的权重数据由64个数据缓存共享，卷积计算模块即为计算单元PE。Further, as shown in Fig. 5, taking a 32*64 parallel computing unit array as an example, it includes 32*64 computing units PE, 64 data buffers TB and 32 weight buffers WB, wherein each data buffer TB stores multiple reorganized data strips, the weight data stored in each weight buffer WB is shared by the 64 data buffers, and the convolution calculation module is the computing unit PE.
其中，卷积计算模块包括乘加器单元和存储单元，乘加器单元用于将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算，存储单元用于存储每份所述重组数据条中的各张连续帧图像的乘加运算的结果。The convolution calculation module includes a multiplier-adder unit and a storage unit. The multiplier-adder unit is configured to multiply-add the original channel data of each continuous frame image in each reorganized data strip with the same weight data, and the storage unit is configured to store the multiply-add result of each continuous frame image in each reorganized data strip.
Exemplarily, the multiplier-adder unit includes a multiplier and an adder, and the storage unit includes a data selector 301, a data distributor 302 and 5 third registers 303. For example, for the original channel data of the first pixel of the first frame image, the multiplier computes W00*R0, the data selector 301 reads the data from the corresponding third register 303, and the adder performs the addition; since the initial value of the third register is zero, the adder's result is W00*R0, which the data distributor 302 then stores into the third register 303. The multiplier next computes W01*G0, the data selector 301 reads W00*R0 from the corresponding third register 303, and the adder produces W00*R0 + W01*G0, which the data distributor 302 stores into the third register 303. Finally, the multiplier computes W02*B0, the data selector 301 reads W00*R0 + W01*G0 from the corresponding third register 303, and the adder produces F0 = W00*R0 + W01*G0 + W02*B0, which the data distributor 302 stores into the third register 303. By analogy, the convolution calculation of each piece of original channel data is completed. Since the same reorganized data strip corresponds to the same weight data, W00, W01 and W02, for example, must each be reused five times; an additional address pointer and counter can therefore be provided, and as long as a group of weight data has been used fewer than five times, the address pointer is controlled so that the weight data are reused.
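The selector/adder/distributor datapath and the weight-reuse counter described above can be sketched as follows. The class and method names are our own illustration, and the control logic is simplified relative to a real PE:

```python
# Sketch of one PE: per-frame "third registers" are read (data selector),
# updated by a multiply-add (multiplier + adder), and written back (data
# distributor). A reuse counter tracks that each weight group serves all
# 5 frames in the strip before the address pointer may advance.

class ProcessingElement:
    def __init__(self, n_frames=5):
        self.n_frames = n_frames
        self.third_regs = [0] * n_frames   # per-frame accumulators

    def mac_step(self, frame_idx, weight, value):
        acc = self.third_regs[frame_idx]          # data selector read
        acc += weight * value                     # multiplier + adder
        self.third_regs[frame_idx] = acc          # data distributor write

def run_strip(pe, strip, weights):
    reuse_count = 0
    for frame_idx, channels in enumerate(strip):
        for w, x in zip(weights, channels):       # W00*R + W01*G + W02*B
            pe.mac_step(frame_idx, w, x)
        reuse_count += 1                          # weight group reused once
    assert reuse_count == pe.n_frames             # used exactly 5 times
    return pe.third_regs

pe = ProcessingElement()
print(run_strip(pe, [(1, 1, 1)] * 5, weights=(2, 3, 4)))  # [9, 9, 9, 9, 9]
```

Only after the reuse counter reaches 5 would the address pointer be allowed to fetch the next weight group, mirroring the control described in the text.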
本申请还公开了一种计算机可读存储介质，所述计算机可读存储介质存储有用于神经网络的数据批处理程序，所述用于神经网络的数据批处理程序被处理器执行时实现上述的用于神经网络的数据批处理方法。The present application also discloses a computer-readable storage medium storing a data batch processing program for a neural network, and when the data batch processing program is executed by a processor, it implements the above data batch processing method for a neural network.
本申请还公开了一种计算机设备,在硬件层面,如图6所示,该终端包括处理器12、内部总线13、网络接口14、计算机可读存储介质11。处理器12从计算机可读存储介质中读取对应的计算机程序然后运行,在逻辑层面上形成请求处理装置。当然,除了软件实现方式之外,本说明书一个或多个实施例并 不排除其他实现方式,比如逻辑器件抑或软硬件结合的方式等等,也就是说以下处理流程的执行主体并不限定于各个逻辑单元,也可以是硬件或逻辑器件。所述计算机可读存储介质11上存储有用于神经网络的数据批处理程序,所述用于神经网络的数据批处理程序被处理器执行时实现上述的用于神经网络的数据批处理方法。The present application also discloses a computer device. At the hardware level, as shown in FIG. 6 , the terminal includes a processor 12 , an internal bus 13 , a network interface 14 , and a computer-readable storage medium 11 . The processor 12 reads the corresponding computer program from the computer-readable storage medium and then executes it, forming a request processing device on a logical level. Of course, in addition to software implementations, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware, etc., that is to say, the execution subjects of the following processing procedures are not limited to each Logic unit, which can also be hardware or logic device. The computer-readable storage medium 11 stores a data batch program for a neural network, and when the data batch program for a neural network is executed by a processor, implements the above-mentioned data batch method for a neural network.
计算机可读存储介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机可读存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带、磁盘存储、量子存储器、基于石墨烯的存储介质或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。Computer-readable storage media includes both persistent and non-permanent, removable and non-removable media, and storage of information can be implemented by any method or technology. Information may be computer readable instructions, data structures, modules of programs, or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Flash Memory or other memory technology, Compact Disc Read Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage , magnetic cassettes, disk storage, quantum memory, graphene-based storage media or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by computing devices.
上面对本发明的具体实施方式进行了详细描述，虽然已表示和描述了一些实施例，但本领域技术人员应该理解，在不脱离由权利要求及其等同物限定其范围的本发明的原理和精神的情况下，可以对这些实施例进行修改和完善，这些修改和完善也应在本发明的保护范围内。The specific embodiments of the present invention have been described in detail above. Although some embodiments have been shown and described, those skilled in the art should understand that these embodiments may be modified and refined without departing from the principle and spirit of the present invention, whose scope is defined by the claims and their equivalents, and that such modifications and refinements shall also fall within the protection scope of the present invention.

Claims (13)

  1. 一种用于神经网络的数据批处理方法,其中,所述数据批处理方法包括:A data batch processing method for a neural network, wherein the data batch processing method comprises:
    获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;Obtain memory bandwidth and select the original channel data of N continuous frame images according to the memory bandwidth;
    将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道数据;The original channel data of the N continuous frame images are spliced to form a plurality of reorganized data strips, wherein each reorganized data strip includes the original channel data of the N continuous frame images at the same pixel position;
    将多份重组数据条依序输入至并行计算单元阵列进行卷积运算,其中同一份重组数据条的全部原始通道数据在同一时刻进入至计算单元。A plurality of reconstituted data strips are sequentially input to the parallel computing unit array for convolution operation, wherein all the original channel data of the same reconstituted data strip enter the computing unit at the same time.
  2. 根据权利要求1所述的用于神经网络的数据批处理方法,其中,每份所述重组数据条还包括补零数据,且每份所述重组数据条的数据位宽等于所述内存带宽。The data batch processing method for neural networks according to claim 1, wherein each piece of the restructured data strip further includes zero-padding data, and the data bit width of each piece of the restructured data strip is equal to the memory bandwidth.
  3. 根据权利要求2所述的用于神经网络的数据批处理方法,其中,所述数据批处理方法还包括:The data batch processing method for neural networks according to claim 2, wherein the data batch processing method further comprises:
    将所述多份重组数据条存储至内存中。The plurality of reconstituted data strips are stored in the memory.
  4. 根据权利要求1所述的用于神经网络的数据批处理方法,其中,将多份重组数据条依序输入至并行计算单元阵列进行卷积运算的方法包括:The data batch processing method for neural networks according to claim 1, wherein the method for sequentially inputting multiple pieces of recombined data strips into a parallel computing unit array for convolution operation comprises:
    将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算;Multiply and add the original channel data of each continuous frame image in each of the reorganized data strips with the same weight data respectively;
    将每份所述重组数据条中的各张连续帧图像的乘加运算的结果存储至不同的寄存器中。The results of the multiplication and addition operations of the successive frame images in each of the recombined data strips are stored in different registers.
  5. 根据权利要求1所述的用于神经网络的数据批处理方法，其中，所述内存带宽为128比特，N为5，每张所述连续帧图像的每个像素位置上的原始通道数据包括红色通道数据、绿色通道数据和蓝色通道数据。The data batch processing method for a neural network according to claim 1, wherein the memory bandwidth is 128 bits, N is 5, and the original channel data at each pixel position of each of the continuous frame images includes red channel data, green channel data and blue channel data.
  6. 一种用于神经网络的数据批处理装置,其中,所述数据批处理装置包括:A data batch processing device for a neural network, wherein the data batch processing device comprises:
    数据获取模块,用于获取内存带宽并根据所述内存带宽选取N张连续帧图像的原始通道数据;a data acquisition module for acquiring memory bandwidth and selecting the original channel data of N consecutive frame images according to the memory bandwidth;
    数据重组模块,用于将所述N张连续帧图像的原始通道数据进行拼接,形成多份重组数据条,其中每份重组数据条包括所述N张连续帧图像在同一像素位置上的原始通道数据;A data reorganization module for splicing the original channel data of the N continuous frame images to form multiple reorganized data strips, wherein each reorganized data strip includes the original channels of the N continuous frame images at the same pixel position data;
    卷积计算模块，用于读取多份重组数据条并依序对多份重组数据条进行卷积运算，其中同一份重组数据条的全部原始通道数据在同一时刻被所述卷积计算模块读取。The convolution calculation module is configured to read a plurality of reorganized data strips and perform the convolution operation on them in sequence, wherein all the original channel data of the same reorganized data strip are read by the convolution calculation module at the same time.
  7. 根据权利要求6所述的用于神经网络的数据批处理装置,其中,所述数据批处理装置还包括内存,所述内存用于接收并存储所述数据重组模块形成的多份重组数据条。The data batch processing device for neural networks according to claim 6, wherein the data batch processing device further comprises a memory, and the memory is used for receiving and storing a plurality of reorganized data strips formed by the data reorganization module.
  8. 根据权利要求6所述的用于神经网络的数据批处理装置,其中,所述卷积计算模块包括:The data batch processing apparatus for neural networks according to claim 6, wherein the convolution calculation module comprises:
    乘加器单元,用于将每份所述重组数据条中的各张连续帧图像的原始通道数据分别与同一权重数据进行乘加运算;A multiplier-adder unit, used for multiplying and adding the original channel data of each continuous frame image in each of the recombined data strips with the same weight data respectively;
    存储单元,用于存储每份所述重组数据条中的各张连续帧图像的乘加运算的结果。The storage unit is used for storing the result of the multiplication and addition operation of each successive frame image in each piece of the recombined data strip.
  9. 一种计算机可读存储介质,其中,所述计算机可读存储介质存储有用于神经网络的数据批处理程序,所述用于神经网络的数据批处理程序被处理器执行时实现权利要求1所述的用于神经网络的数据批处理方法。A computer-readable storage medium, wherein the computer-readable storage medium stores a data batch processing program for a neural network, and the data batch processing program for a neural network implements the method of claim 1 when executed by a processor Data batching methods for neural networks.
  10. 根据权利要求9所述的计算机可读存储介质,其中,每份所述重组数据条还包括补零数据,且每份所述重组数据条的数据位宽等于所述内存带宽。The computer-readable storage medium of claim 9, wherein each of the reconstituted data stripes further includes zero-padded data, and a data bit width of each of the reconstituted data strips is equal to the memory bandwidth.
  11. The computer-readable storage medium according to claim 10, wherein the data batch processing method further comprises:
    storing the plurality of reorganized data strips in a memory.
  12. The computer-readable storage medium according to claim 9, wherein sequentially inputting the plurality of reorganized data strips into the parallel computing unit array for convolution operation comprises:
    performing a multiply-accumulate operation on the original channel data of each consecutive frame image in each of the reorganized data strips with the same weight data respectively;
    storing the results of the multiply-accumulate operations of the consecutive frame images in each of the reorganized data strips into different registers.
  13. The computer-readable storage medium according to claim 9, wherein the memory bandwidth is 128 bits, N is 5, and the original channel data at each pixel position of each consecutive frame image comprises red channel data, green channel data and blue channel data.
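The figures in claim 13 can be checked with a short calculation, assuming 8-bit channel data (the bit depth is not stated in this claim): five consecutive frames times three channels times 8 bits leaves a small zero-padded remainder within one 128-bit memory word.

```python
# Hypothetical packing check for claim 13: five consecutive frames, three
# channels (R, G, B) at one pixel position, assumed 8 bits per channel,
# zero-padded up to the 128-bit memory bandwidth (claim 10).
N_FRAMES = 5          # N = 5 consecutive frame images
CHANNELS = 3          # red, green, blue channel data
BITS_PER_CHANNEL = 8  # assumption: 8-bit channel values
BANDWIDTH = 128       # memory bandwidth in bits

payload = N_FRAMES * CHANNELS * BITS_PER_CHANNEL  # 120 bits of channel data
padding = BANDWIDTH - payload                     # 8 bits of zero padding
assert payload == 120 and padding == 8
```

Under this assumption one memory word holds exactly one reorganized data strip, so the data bit width of each strip equals the memory bandwidth as claim 10 requires.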
PCT/CN2020/120177 2020-08-07 2020-10-10 Data batch processing method and batch processing apparatus thereof, and storage medium WO2022027818A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010791617.5 2020-08-07
CN202010791617.5A CN114065905A (en) 2020-08-07 2020-08-07 Data batch processing method and batch processing device thereof, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
WO2022027818A1 true WO2022027818A1 (en) 2022-02-10

Family

ID=80119905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/120177 WO2022027818A1 (en) 2020-08-07 2020-10-10 Data batch processing method and batch processing apparatus thereof, and storage medium

Country Status (2)

Country Link
CN (1) CN114065905A (en)
WO (1) WO2022027818A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388537A * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 Convolutional neural network accelerator and method
US20190147299A1 (en) * 2016-10-31 2019-05-16 Tencent Technology (Shenzhen) Company Limited Data processing method and apparatus for convolutional neural network
CN110136066A * 2019-05-23 2019-08-16 北京百度网讯科技有限公司 Video-oriented super-resolution method, device, equipment and storage medium
CN110211205A (en) * 2019-06-14 2019-09-06 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN110782393A (en) * 2019-10-10 2020-02-11 江南大学 Image resolution compression and reconstruction method based on reversible network
CN110895801A (en) * 2019-11-15 2020-03-20 北京金山云网络技术有限公司 Image processing method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108876813B (en) * 2017-11-01 2021-01-26 北京旷视科技有限公司 Image processing method, device and equipment for detecting object in video
CN110009102B (en) * 2019-04-12 2023-03-24 南京吉相传感成像技术研究院有限公司 Depth residual error network acceleration method based on photoelectric computing array
CN111199273B (en) * 2019-12-31 2024-03-26 深圳云天励飞技术有限公司 Convolution calculation method, device, equipment and storage medium
CN111459856B (en) * 2020-03-20 2022-02-18 中国科学院计算技术研究所 Data transmission device and transmission method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190147299A1 (en) * 2016-10-31 2019-05-16 Tencent Technology (Shenzhen) Company Limited Data processing method and apparatus for convolutional neural network
CN108388537A * 2018-03-06 2018-08-10 上海熠知电子科技有限公司 Convolutional neural network accelerator and method
CN110136066A * 2019-05-23 2019-08-16 北京百度网讯科技有限公司 Video-oriented super-resolution method, device, equipment and storage medium
CN110211205A (en) * 2019-06-14 2019-09-06 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN110782393A (en) * 2019-10-10 2020-02-11 江南大学 Image resolution compression and reconstruction method based on reversible network
CN110895801A (en) * 2019-11-15 2020-03-20 北京金山云网络技术有限公司 Image processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114065905A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
US11775430B1 (en) Memory access for multiple circuit components
US10936937B2 (en) Convolution operation device and convolution operation method
US10545559B2 (en) Data processing system and method
WO2020062284A1 (en) Convolutional neural network-based image processing method and device, and unmanned aerial vehicle
WO2022110386A1 (en) Data processing method and artificial intelligence processor
CN113792621B (en) FPGA-based target detection accelerator design method
WO2020233709A1 (en) Model compression method, and device
WO2022007265A1 (en) Dilated convolution acceleration calculation method and apparatus
CN112799599A (en) Data storage method, computing core, chip and electronic equipment
CN116720549A (en) FPGA multi-core two-dimensional convolution acceleration optimization method based on CNN input full cache
CN113301221B (en) Image processing method of depth network camera and terminal
Cadenas et al. Parallel pipelined array architectures for real-time histogram computation in consumer devices
WO2022027818A1 (en) Data batch processing method and batch processing apparatus thereof, and storage medium
CN107085827B (en) Super-resolution image restoration method based on hardware platform
CN109416743A Three-dimensional convolution device for recognizing human actions
WO2020029181A1 (en) Three-dimensional convolutional neural network-based computation device and related product
US6771271B2 (en) Apparatus and method of processing image data
CN116051345A (en) Image data processing method, device, computer equipment and readable storage medium
CN113160321B (en) Geometric mapping method and device for real-time image sequence
CN112001492B (en) Mixed running water type acceleration architecture and acceleration method for binary weight DenseNet model
CN115456858B (en) Image processing method, device, computer equipment and computer readable storage medium
CN113222831B (en) Feature memory forgetting unit, network and system for removing image stripe noise
WO2022000456A1 (en) Image processing method and apparatus, integrated circuit, and device
CN109509218A (en) The method, apparatus of disparity map is obtained based on FPGA
CN116681588A (en) Super-resolution implementation method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20948605

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20948605

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 030723)
