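WO2020073801A1 - Method and system for reading and writing data in 3D image processing, storage medium, and terminal - Google Patents

Method and system for reading and writing data in 3D image processing, storage medium, and terminal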

Publication number
WO2020073801A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
sub
image processing
image
picture
Prior art date
Application number
PCT/CN2019/107678
Other languages
English (en)
French (fr)
Inventor
崔中浩
罗文杰
张珂
张慧明
Original Assignee
芯原微电子(上海)股份有限公司
芯原控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 芯原微电子(上海)股份有限公司, 芯原控股有限公司
Priority to EP19871524.5A (EP3816867A4)
Priority to US17/257,859 (US11455781B2)
Priority to JP2021520315A (JP7201802B2)
Priority to KR1020217014106A (KR20210070369A)
Publication of WO2020073801A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 - Manipulating 3D models or images for computer graphics
    • G06T19/20 - Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/60 - Memory management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 - Geometric image transformations in the plane of the image
    • G06T3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 - Image mosaicing, e.g. composing plane images from plane sub-images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to the technical field of cache application, in particular to a method and system for reading and writing data in 3D image processing, a storage medium and a terminal.
  • Digital Image Processing is a method and technology for removing noise, enhancing, restoring, segmenting, and extracting features of an image through a computer.
  • 3D image processing algorithms are often divided into multiple layers and processed layer by layer. Each layer has an input image and an output image. Therefore, in the specific implementation process of 3D image processing, huge storage bandwidth is required.
  • For example, in the AlexNet neural network, 3000M data accesses are required for a computation of 724M MACs.
  • In the prior art, an effective way to reduce DDR bandwidth is to add multiple levels of local storage between the double data rate (DDR) memory and the arithmetic logic unit (ALU), and to cache and reuse cached content as much as possible. For example, a global buffer is placed between DRAM and the ALUs, local shared storage that the ALUs can access mutually is added between them, and a register file is added inside each ALU.
  • the bandwidth is reduced by reducing the bit width of the data.
  • the data is expressed in low bit numbers through quantization, the amount of data to be processed is reduced, and then the output result is dequantized. This method makes ALU simpler, but as the data bit width decreases, the calculation accuracy will inevitably decrease. For neural networks, the data also needs to be retrained.
  • the image processing algorithm processes images in a certain order. Therefore, the data flow can be analyzed and controlled, and buffers can be used reasonably for caching.
  • the image is divided into smaller tiles, which are processed sequentially. This method reduces the memory read span.
  • the cache works in units of tiles, so the cache unit becomes smaller, and a smaller memory management unit (MMU) or cache unit can be used.
  • the data that needs to be processed between tiles is called overlap data. If the tiles are cached, the overlap data also needs to be cached.
  • the object of the present invention is to provide a method and system for reading and writing data in 3D image processing, a storage medium and a terminal which, based on 3D vertical sliding and a circular buffer, greatly improve cache utilization in 3D image processing when the cache is limited and reduce the processing of overlapping parts, thereby alleviating bandwidth consumption and read/write latency in image processing as a whole.
  • the present invention provides a method for reading and writing data in 3D image processing, including the following steps: dividing a 3D image in the horizontal direction based on a vertical sliding technique, so that the 3D image is divided into at least two sub-pictures; for each sub-picture, storing the processing data of the sub-picture in a circular buffer; after processing the sub-picture, retaining in the circular buffer the overlap data required by the next sub-picture; and dividing the multi-layer network of the image processing algorithm into at least two segments, so that data between adjacent layers in each segment is exchanged only through the cache and not through DDR.
  • the size of the ring buffer occupied by each sub-picture is SubImageXsize * (SubImageYsize + OverlapSize) * SubImageZSize, where SubImageXsize, SubImageYsize, SubImageZSize, and OverlapSize are the sub-image X-direction size, Y-direction size, Z size and overlap size.
  • in each segment, the output data of every layer except the last is written into the cache, and every layer except the first reads its data from the cache.
  • the present invention provides a data reading and writing system in 3D image processing, including a circular buffer module and a segmented buffer module;
  • the circular buffer module is used to divide the 3D image in the horizontal direction based on the vertical sliding technology and divide the 3D image into at least two sub-pictures; for each sub-picture, store the processing data of the sub-pictures to a circular buffer ; After processing the sub-picture, retain the overlapping part of the data required for the next sub-picture in the circular buffer;
  • the segment cache module is used to divide the multi-layer network of the image processing algorithm into at least two segments, so that the data between adjacent layers in each segment only interacts through the cache and does not undergo DDR interaction.
  • the size of the ring buffer occupied by each sub-picture is SubImageXsize * (SubImageYsize + OverlapSize) * SubImageZSize, where SubImageXsize, SubImageYsize, SubImageZSize, and OverlapSize are the sub-image X-direction size, Y-direction size, Z size and overlap size.
  • in each segment, the output data of every layer except the last is written into the cache, and every layer except the first reads its data from the cache.
  • the present invention provides a storage medium on which a computer program is stored; when the program is executed by a processor, the above method for reading and writing data in 3D image processing is implemented.
  • the present invention provides a terminal, including: a processor and a memory;
  • the memory is used to store computer programs
  • the processor is used to execute a computer program stored in the memory, so that the terminal executes the data reading and writing method in the 3D image processing described above.
  • the data reading and writing method and system, storage medium and terminal in 3D image processing according to the present invention have the following beneficial effects:
  • FIG. 1 shows a flowchart of an embodiment of a method for reading and writing data in 3D image processing according to the present invention
  • Figure 2 shows a schematic diagram of the data structure of the image processing algorithm
  • FIG. 3(a) is a schematic diagram of dividing a 3D image into sub-pictures by vertical sliding in an embodiment
  • FIG. 3(b) is a schematic diagram of dividing a 3D image into sub-pictures by vertical sliding in another embodiment
  • FIG. 4 is a schematic diagram showing the correspondence between sub-pictures in an embodiment
  • FIG. 5 is a schematic diagram of a circular buffer of 3D images in an embodiment
  • FIG. 6 is a schematic structural diagram of an embodiment of a data reading and writing system for 3D image processing according to the present invention.
  • FIG. 7 is a schematic structural diagram of the terminal of the present invention in an embodiment.
  • the method and system for reading and writing data in 3D image processing of the present invention are based on 3D vertical sliding and a circular buffer. When the cache is limited, they greatly improve cache utilization in 3D image processing and reduce both the processing of overlapping parts and accesses to DDR, thereby alleviating bandwidth consumption and read/write latency in image processing as a whole and greatly increasing the speed of 3D image processing.
  • the method for reading and writing data in 3D image processing of the present invention includes the following steps:
  • Step S1: Divide the 3D image in the horizontal direction based on the vertical sliding technique, so that the 3D image is divided into at least two sub-pictures; for each sub-picture, store the processing data of the sub-picture in a circular buffer; after the sub-picture has been processed, retain in the circular buffer the overlap data required by the next sub-picture.
  • this technique divides the original 3D image into upper and lower layers, and the data contained in each layer does not overlap.
  • the size of the 3D sliding block is fixed during the division process.
  • the first or last layer is adjusted according to the actual 3D image size and 3D sliding square size.
  • this example divides the 3D image into four sub-pictures, which are respectively recorded as subImage0, subImage1, subImage2, and subImage3.
  • ALU accesses DDR through the bus and can directly access the SRAM cache.
  • the first request fetches data from DDR, and the data that needs caching is stored in SRAM.
  • when the ALU requests data again, the data is read directly from the SRAM cache if it is present there.
  • preferably, each sub-image is made as flat and wide as possible.
  • assuming each sub-image is as wide as the original image, the maximum sub-image height can be calculated from the size of the available SRAM.
  • Figure 3(a) shows a typical division. The sub-image has the same extent as the original image in the X and Z directions, but a reduced height in the Y direction. If the calculated sub-image height is negative or zero, the 3D image must also be split left and right.
  • Figure 3 (b) shows a left and right division, which divides the original 3D image into 3x4 3D sub images.
  • the present invention introduces a circular buffer into the processing of sub-images. After one sub-image has been processed and processing continues with the sub-image below it, the cached overlap rows of the previous sub-image are temporarily kept rather than discarded, which reduces the overlap data read from DDR. On each pass, the data overwritten in the circular buffer is data of the previous sub-image that has already been consumed and will not be used again, which both saves space and reduces repeated reads and writes of the overlap. In an image convolution, the size of the overlap is closely related to the convolution kernel. Sub-images along the vertical division direction share the circular buffer; horizontally adjacent sub-images need to handle the overlap data.
  • if the height of each sliding window is N and the height of the convolution kernel is M, the first row of the second layer needs to reuse M-1 rows of the first layer.
  • in the circular buffer, the second layer is written starting from the end of the first layer and wraps around to the head of the circular buffer on reaching its bottom.
  • the part of the first layer that is not overwritten is exactly the last few rows of the first layer that the first row of the second layer needs, which saves cache and improves cache utilization.
  • the sub images between different layers have a corresponding relationship.
  • the two layers are divided into three sub images.
  • the Z direction representation is omitted.
  • assuming both convolution kernels are 3x3, SubImage00 and SubImage10 correspond to the input of SubImage10, SubImage10 is the input of SubImage20, and the other dependencies follow by analogy.
  • when SubImage11 is used as input, content of SubImage10 is needed, and the rows needed are the overlap rows. With the circular buffer technique, only the overlap rows and the newly generated results need to be stored in SRAM; the entire output of the original 3D image no longer needs to be stored.
  • the realization of circular buffer takes the whole 3D image as a cycle unit.
  • Each Z plane reserves space for overlap lines.
  • a 3D image has two faces in the Z direction, denoted as Z0 and Z1.
  • the 3D image is divided up and down into two subimages, called subImage0 and subImage1, subImage0 contains R0 to R3, and subImage1 contains R4 to R7.
  • the size of the convolution kernel is 3x3x2, and the overlap between sub images is two lines.
  • the size of Circular buffer is SubImageXsize * (SubImageYsize + OverlapSize) * SubImageZSize.
  • SubImageXsize, SubImageYsize, SubImageZSize, and OverlapSize are the sub-image X-direction size, Y-direction size, Z-direction size, and overlapping part size, respectively.
  • each Z plane of subImage1 is stored sequentially starting from the empty space of the corresponding Z plane of subImage0, or from the last position of the corresponding Z plane.
  • the overwritten part is exactly the part of subImage0 that has already been consumed.
  • when a Z plane reaches the tail of the circular buffer, writing wraps around and overwrites from the head of the buffer.
  • the rows of each Z plane that are not overwritten are exactly the rows required for the overlap.
  • the heights of multiple sub-images divided by the same 3D image are not necessarily the same.
  • Step S2 Divide the multi-layer network of the image processing algorithm into at least two segments, so that the data between adjacent layers in each segment only interacts through the cache and does not undergo DDR interaction.
  • image processing models often include multiple layers, each layer completes the corresponding task, and there is a data dependency relationship between adjacent layers. Therefore, if DDR is used to complete data exchange between two adjacent layers, there will be a large DDR bandwidth and delay. If the intermediate results are all cached in the Buffer, it will occupy a huge cache. After being divided into sub images, the intermediate results between layers take sub image as the cache unit, and it is no longer necessary to cache all the intermediate results of the entire layer. Therefore, according to the size of the cache buffer, the present invention determines how many layers can use the cache to interact.
  • the characteristics of these layers are that the first layer reads data from DDR and buffers the output to the buffer, and the middle layer reads from buffer and writes to the buffer until the last layer of data is written back to DDR.
  • the layers that satisfy the above conditions form a segment. That is, the result of every layer in the segment except the last is written into the SRAM cache, and every layer except the first reads its data from SRAM.
  • the data reading and writing method in the 3D image processing of the present invention is applied to the 3D image processing of the neural network.
  • the data reading and writing system in the 3D image processing of the present invention includes a circular buffer module 61 and a segment buffer module 62.
  • the circular buffer module 61 is used to divide the 3D image in the horizontal direction based on the vertical sliding technology and divide the 3D image into at least two sub-pictures; for each sub-picture, store the processing data of the sub-picture to the circular buffer Area; after processing the sub-picture, retain the overlapping part of the data required for the next sub-picture in the circular buffer.
  • the segment cache module 62 is used to divide the multi-layer network of the image processing algorithm into at least two segments, so that the data between adjacent layers in each segment only interacts through the cache and does not undergo DDR interaction.
  • each module of the above device is only a division of logical functions, and in actual implementation, it may be integrated in whole or part into a physical entity or may be physically separated.
  • these modules can be implemented in the form of software invocation through processing elements, or in the form of hardware, and some modules can be implemented in the form of software invocation through processing elements, and some modules can be implemented in the form of hardware.
  • the x module may be a separately established processing element, or may be integrated in a chip of the above device.
  • the x module may also be stored in the memory of the above-mentioned device in the form of a program code, and be called and executed by a processing element of the above-mentioned device to perform the function of the above x-module.
  • the implementation of other modules is similar. All or part of these modules can be integrated together or can be implemented independently.
  • the processing element described herein may be an integrated circuit with signal processing capabilities. In the implementation process, each step of the above method or each of the above modules may be completed by an integrated logic circuit of hardware in a processor element or instructions in the form of software.
  • the above modules may be one or more integrated circuits configured to implement the above method, for example one or more application-specific integrated circuits (ASICs), one or more microprocessors (digital signal processors, DSPs), one or more field-programmable gate arrays (FPGAs), and so on.
  • when one of the above modules is implemented as program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can call program code. The modules may be integrated together and implemented as a system-on-a-chip (SoC).
  • a computer program is stored on the storage medium of the present invention, and when the program is executed by a processor, the above-mentioned 3D image processing data reading and writing method is realized.
  • the storage medium includes various media that can store program codes, such as ROM, RAM, magnetic disk, U disk, memory card, or optical disk.
  • the terminal of the present invention includes: a processor 71 and a memory 72.
  • the memory 72 is used to store computer programs.
  • the memory 72 includes various media that can store program codes, such as ROM, RAM, magnetic disk, U disk, memory card, or optical disk.
  • the processor 71 is connected to the memory 72, and is used to execute a computer program stored in the memory 72, so that the terminal executes the above-mentioned 3D image processing data reading and writing method.
  • the processor 71 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • in summary, the method and system for reading and writing data in 3D image processing, the storage medium and the terminal of the present invention are based on 3D vertical sliding and a circular buffer, which reduces the processing of overlapping parts and greatly improves cache utilization in 3D image processing when the cache is limited. By analyzing the entire network, results between layers no longer have to be exchanged through DDR under a limited cache, which reduces accesses to DDR, lowers the bandwidth requirement of the image processing algorithm, and reduces read/write latency and power consumption. In hardware design, a smaller buffer area can be used. The present invention therefore effectively overcomes various shortcomings of the prior art and has high industrial value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Architecture (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Image Input (AREA)
  • Image Processing (AREA)
  • Image Generation (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Memory System (AREA)

Abstract

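The present invention provides a method and system for reading and writing data in 3D image processing, a storage medium, and a terminal. The method includes the following steps: dividing a 3D image in the horizontal direction based on a vertical sliding technique, so that the 3D image is divided into at least two sub-images; for each sub-image, storing the processing data of the sub-image in a circular buffer; after the sub-image has been processed, retaining in the circular buffer the overlap data required by the next sub-image; and dividing the multi-layer network of the image processing algorithm into at least two segments, so that data between adjacent layers in each segment is exchanged only through the cache and not through DDR. Based on 3D vertical sliding and a circular buffer, the method and system, storage medium, and terminal greatly improve cache utilization in 3D image processing when the cache is limited and reduce the processing of overlapping parts, thereby alleviating bandwidth consumption and read/write latency in image processing as a whole.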

Description

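Method and system for reading and writing data in 3D image processing, storage medium, and terminal
Technical Field
The present invention relates to the technical field of cache applications, and in particular to a method and system for reading and writing data in 3D image processing, a storage medium, and a terminal.
Background
Digital image processing is the set of methods and techniques by which a computer removes noise from, enhances, restores, segments, and extracts features from images. 3D image processing algorithms are often divided into multiple layers that are processed layer by layer. Each layer has an input image and an output image. Implementing 3D image processing therefore requires a very large memory bandwidth. For example, in the neural network AlexNet, 3000M data accesses are required for a computation of 724M MACs. When all storage is off-chip DRAM, the huge bandwidth brings high power consumption and high latency, which severely affects system performance. Data reading and writing thus becomes the bottleneck of 3D image processing.
In the prior art, adding multiple levels of local storage between the double data rate (DDR) memory and the arithmetic logic unit (ALU), and caching and reusing cached content as much as possible, is an effective way to reduce DDR bandwidth. For example, a global buffer is placed between DRAM and the ALUs, local shared storage that the ALUs can access mutually is added between them, and a register file is added inside each ALU. As the buffer level decreases layer by layer, the power consumption and access latency per unit of data at each storage level also decrease exponentially. At the same time, the hardware tends to become more complex and its area increases.
In addition, bandwidth can be reduced by reducing the bit width of the data. Specifically, the data is represented with a small number of bits through quantization, which reduces the amount of data to be processed, and the output result is then dequantized. This approach makes the ALU simpler, but reducing the data bit width inevitably reduces computational accuracy. For neural networks, retraining on the data is also required.
An image processing algorithm processes an image in a certain order. The data flow can therefore be analyzed and controlled, and buffers can be used sensibly for caching. The image is divided into smaller tiles, which are processed in turn. This reduces the memory read span. The cache works in units of tiles, so the cache unit becomes smaller, and a smaller memory management unit (MMU) or cache unit can be used. However, adjacent tiles share overlapping data. When a point on the boundary of a tile needs to be processed, data of the adjacent tile must be accessed again. The data that tiles need to process in common is called overlap data. If tiles are cached, the overlap data must be cached as well. Furthermore, the next layer cannot start before the current layer has finished; storing the results between layers in DDR requires a huge bandwidth, while storing them in the cache requires a huge cache area. How to improve cache utilization is therefore an important research direction.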
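Summary of the Invention
In view of the above shortcomings of the prior art, the object of the present invention is to provide a method and system for reading and writing data in 3D image processing, a storage medium, and a terminal which, based on 3D vertical sliding and a circular buffer, greatly improve cache utilization in 3D image processing when the cache is limited and reduce the processing of overlapping parts, thereby alleviating bandwidth consumption and read/write latency in image processing as a whole.
To achieve the above and other related objects, the present invention provides a method for reading and writing data in 3D image processing, including the following steps: dividing a 3D image in the horizontal direction based on a vertical sliding technique, so that the 3D image is divided into at least two sub-images; for each sub-image, storing the processing data of the sub-image in a circular buffer; after the sub-image has been processed, retaining in the circular buffer the overlap data required by the next sub-image; and dividing the multi-layer network of the image processing algorithm into at least two segments, so that data between adjacent layers in each segment is exchanged only through the cache and not through DDR.
In an embodiment of the present invention, the size of the circular buffer occupied by each sub-image is SubImageXsize*(SubImageYsize+OverlapSize)*SubImageZSize, where SubImageXsize, SubImageYsize, SubImageZSize, and OverlapSize are respectively the X-direction size, Y-direction size, Z-direction size, and overlap size of the sub-image.
In an embodiment of the present invention, in each segment, the output data of every layer except the last is written into the cache, and every layer except the first reads its data from the cache.
In an embodiment of the present invention, the method is applied to 3D image processing in neural networks.
Correspondingly, the present invention provides a system for reading and writing data in 3D image processing, including a circular cache module and a segment cache module;
the circular cache module is used to divide a 3D image in the horizontal direction based on a vertical sliding technique, so that the 3D image is divided into at least two sub-images; for each sub-image, to store the processing data of the sub-image in a circular buffer; and, after the sub-image has been processed, to retain in the circular buffer the overlap data required by the next sub-image;
the segment cache module is used to divide the multi-layer network of the image processing algorithm into at least two segments, so that data between adjacent layers in each segment is exchanged only through the cache and not through DDR.
In an embodiment of the present invention, the size of the circular buffer occupied by each sub-image is SubImageXsize*(SubImageYsize+OverlapSize)*SubImageZSize, where SubImageXsize, SubImageYsize, SubImageZSize, and OverlapSize are respectively the X-direction size, Y-direction size, Z-direction size, and overlap size of the sub-image.
In an embodiment of the present invention, in each segment, the output data of every layer except the last is written into the cache, and every layer except the first reads its data from the cache.
In an embodiment of the present invention, the system is applied to 3D image processing in neural networks.
The present invention provides a storage medium on which a computer program is stored; when the program is executed by a processor, the above method for reading and writing data in 3D image processing is implemented.
The present invention provides a terminal, including a processor and a memory;
the memory is used to store a computer program;
the processor is used to execute the computer program stored in the memory, so that the terminal performs the above method for reading and writing data in 3D image processing.
As described above, the method and system for reading and writing data in 3D image processing, the storage medium, and the terminal of the present invention have the following beneficial effects:
(1) based on 3D vertical sliding and a circular buffer, the processing of overlapping parts is reduced, and cache utilization in 3D image processing is greatly improved when the cache is limited;
(2) by analyzing the entire network, results between layers no longer have to be exchanged through DDR under a limited cache, which reduces accesses to DDR, lowers the bandwidth requirement of the image processing algorithm, and reduces read/write latency and power consumption;
(3) in hardware design, a smaller buffer area can be used.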
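Brief Description of the Drawings
FIG. 1 is a flowchart of the method for reading and writing data in 3D image processing of the present invention in an embodiment;
FIG. 2 is a schematic diagram of the data structure of the image processing algorithm;
FIG. 3(a) is a schematic diagram of dividing a 3D image into sub-images by vertical sliding in an embodiment;
FIG. 3(b) is a schematic diagram of dividing a 3D image into sub-images by vertical sliding in another embodiment;
FIG. 4 is a schematic diagram of the correspondence between sub-images in an embodiment;
FIG. 5 is a schematic diagram of the circular buffer of a 3D image in an embodiment;
FIG. 6 is a schematic structural diagram of the system for reading and writing data in 3D image processing of the present invention in an embodiment;
FIG. 7 is a schematic structural diagram of the terminal of the present invention in an embodiment.
Description of Reference Numerals
61       circular cache module
62       segment cache module
71       processor
72       memory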
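Detailed Description
The following describes embodiments of the present invention by way of specific examples; those skilled in the art can readily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or changed in various ways, based on different viewpoints and applications, without departing from the spirit of the present invention. It should be noted that, where no conflict arises, the following embodiments and the features in them may be combined with one another.
It should be noted that the drawings provided in the following embodiments illustrate the basic concept of the present invention only schematically. The drawings therefore show only the components related to the present invention, rather than being drawn according to the number, shape, and size of the components in an actual implementation. In an actual implementation the type, number, and proportion of the components may be changed freely, and the component layout may also be more complex.
Based on 3D vertical sliding and a circular buffer, the method and system for reading and writing data in 3D image processing, the storage medium, and the terminal of the present invention greatly improve cache utilization in 3D image processing when the cache is limited and reduce both the processing of overlapping parts and accesses to DDR, thereby alleviating bandwidth consumption and read/write latency in image processing as a whole and greatly increasing the speed of 3D image processing.
As shown in FIG. 1, in an embodiment, the method for reading and writing data in 3D image processing of the present invention includes the following steps:
Step S1: Divide the 3D image in the horizontal direction based on the vertical sliding technique, so that the 3D image is divided into at least two sub-images; for each sub-image, store the processing data of the sub-image in a circular buffer; after the sub-image has been processed, retain in the circular buffer the overlap data required by the next sub-image.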
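Specifically, when the 3D image is divided, a 3D block of a given size slides step by step from top to bottom in the vertical direction; this is called the vertical sliding technique. Vertical sliding divides the original 3D image into multiple layers from top to bottom, and the data contained in the layers does not overlap. The size of the 3D sliding block is fixed during the division. The first or the last layer is adjusted according to the actual size of the 3D image and the size of the 3D sliding block. As shown in FIG. 3(a), this example divides the 3D image into four sub-images, denoted subImage0, subImage1, subImage2, and subImage3.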
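The vertical sliding division can be illustrated with a minimal Python sketch. The function name `split_rows` and the choice to shorten the last layer (rather than the first) are assumptions made for illustration; the description only says that the first or last layer is adjusted to fit the image.

```python
def split_rows(image_height, block_height):
    """Return non-overlapping [start, end) row ranges, top to bottom.

    Hypothetical helper: the sliding block has a fixed height, and the
    last sub-image is shortened when the image height is not a multiple
    of the block height.
    """
    if image_height <= 0 or block_height <= 0:
        raise ValueError("sizes must be positive")
    ranges = []
    start = 0
    while start < image_height:
        end = min(start + block_height, image_height)  # last block may be shorter
        ranges.append((start, end))
        start = end
    return ranges


four_sub_images = split_rows(16, 4)   # four equal sub-images
uneven = split_rows(10, 4)            # last sub-image has only two rows
```

Only the Y direction is split here; each range keeps the full X and Z extent of the image.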
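As shown in FIG. 2, the ALU accesses DDR through the bus and can access the SRAM cache directly. The first request fetches data from DDR, and the data that needs caching is stored in SRAM. When the ALU requests data again, the data is read directly from the SRAM cache if it is present there.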
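The read path described above can be modeled in a few lines of Python. This is a software sketch only: the class `CachedReader`, the dictionary standing in for DDR, and the policy of caching every fetched value are assumptions, not the hardware behavior specified by the patent.

```python
class CachedReader:
    """Toy model of an ALU read path with an SRAM cache in front of DDR."""

    def __init__(self, ddr):
        self.ddr = ddr          # dict: address -> value, stands in for DDR
        self.sram = {}          # cache contents
        self.ddr_reads = 0      # number of reads that had to go to DDR

    def read(self, address):
        if address in self.sram:        # repeated request: served from SRAM
            return self.sram[address]
        self.ddr_reads += 1             # first request: fetch from DDR
        value = self.ddr[address]
        self.sram[address] = value      # keep a copy for later requests
        return value


reader = CachedReader({0: "row0", 1: "row1"})
first = reader.read(0)
second = reader.read(0)
```

After the two reads of address 0, only one DDR access has been counted; the second read is a cache hit.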
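To reduce repeated processing of the overlap between tiles, the present invention uses the vertical sliding technique to divide the 3D image in the horizontal direction, and each resulting block is called a sub-image. Preferably, each sub-image is made as flat and wide as possible. Assuming each sub-image is as wide as the original image, the maximum sub-image height can be calculated from the size of the available SRAM. FIG. 3(a) shows a typical division. The sub-image has the same extent as the original image in the X and Z directions, but a reduced height in the Y direction. If the calculated sub-image height is negative or zero, the 3D image must also be split left and right. FIG. 3(b) shows such a left-right division, which splits the original 3D image into 3x4 3D sub-images.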
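One way to compute the maximum sub-image height is to invert the circular buffer size formula X*(Y+overlap)*Z given elsewhere in this description. The sketch below does that under stated assumptions: the function name, the `bytes_per_element` parameter, and the idea that a single circular buffer must fit in the available SRAM are all illustrative, not taken from the patent.

```python
def max_sub_image_height(sram_bytes, width, depth, overlap_rows, bytes_per_element=1):
    """Largest Y size such that width * (Y + overlap_rows) * depth fits in SRAM.

    A result of zero or less means a full-width sub-image does not fit,
    so the image would also need a left-right split.
    """
    bytes_per_row = width * depth * bytes_per_element
    rows_that_fit = sram_bytes // bytes_per_row
    return rows_that_fit - overlap_rows


fits = max_sub_image_height(sram_bytes=4096, width=16, depth=8, overlap_rows=2)
too_small = max_sub_image_height(sram_bytes=128, width=16, depth=8, overlap_rows=2)
needs_left_right_split = too_small <= 0
```

With 4096 bytes, 32 full rows fit and 30 remain after reserving two overlap rows; with 128 bytes only one row fits, so the result is negative and the width would have to be reduced.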
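Specifically, the present invention introduces a circular buffer into the processing of sub-images. After one sub-image has been processed and processing continues with the sub-image below it, the cached overlap rows of the previous sub-image are temporarily kept rather than discarded, which reduces the overlap data read from DDR. On each pass, the data overwritten in the circular buffer is data of the previous sub-image that has already been consumed and will not be used again, which both saves space and reduces repeated reads and writes of the overlap. In an image convolution, the size of the overlap is closely related to the convolution kernel. Sub-images along the vertical division direction share the circular buffer; horizontally adjacent sub-images need to handle the overlap data. Specifically, if the height of each sliding window is N and the height of the convolution kernel is M, the first row of the second layer needs to reuse M-1 rows of the first layer. In the circular buffer, when the first layer has been processed and the layer below it is processed next, the second layer is written starting from the end of the first layer and wraps around to the head of the circular buffer on reaching its bottom. The part of the first layer that is not overwritten is exactly the last few rows of the first layer that the first row of the second layer needs, which saves cache and improves cache utilization.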
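The M-1 rule can be written down directly. The following sketch assumes a convolution with vertical stride 1 (the description does not state the stride), and `rows_to_keep` is a hypothetical helper name.

```python
def rows_to_keep(previous_rows, kernel_height):
    """Rows of the previous sub-image that the next one reuses.

    previous_rows is a [start, end) row range. With kernel height M and
    an assumed stride of 1, the last M - 1 rows must stay in the buffer.
    """
    start, end = previous_rows
    overlap = kernel_height - 1
    if overlap > end - start:
        raise ValueError("kernel is taller than the previous sub-image")
    return (end - overlap, end)


# A 4-row sub-image followed by a 3-row-high kernel keeps its last 2 rows.
kept = rows_to_keep((0, 4), kernel_height=3)
```

For a kernel of height 1 the returned range is empty, which matches the intuition that a 1x1 convolution needs no overlap.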
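In the sub-image division, sub-images in different layers correspond to one another. As shown in FIG. 4, two layers are divided into three sub-images. Note that, for simplicity, the Z direction is not shown. Assume SubImage00 and SubImage20 have a height of 2 and the other sub-images have a height of 4. Suppose that, for the convolution, both convolution kernels are 3x3; then SubImage00 and SubImage10 correspond to the input of SubImage10, SubImage10 is the input of SubImage20, and the other dependencies follow by analogy. Specifically, when SubImage11 is used as input, content of SubImage10 is needed, and the rows needed are the overlap rows. With the circular buffer technique, only the overlap rows and the newly generated results need to be stored in SRAM; the entire output of the original 3D image no longer needs to be stored.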
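The circular buffer is implemented with the entire 3D image as one cyclic unit. Space for the overlap rows is reserved in every Z plane. Suppose a 3D image has two planes in the Z direction, denoted Z0 and Z1, and eight rows in the Y direction, denoted R0 to R7. The 3D image is divided top and bottom into two sub-images, called subImage0 and subImage1; subImage0 contains R0 to R3, and subImage1 contains R4 to R7. Suppose the convolution kernel is 3x3x2, so the overlap between the sub-images is two rows. The size of the circular buffer is SubImageXsize*(SubImageYsize+OverlapSize)*SubImageZSize, where SubImageXsize, SubImageYsize, SubImageZSize, and OverlapSize are respectively the X-direction size, Y-direction size, Z-direction size, and overlap size of the sub-image.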
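As a worked instance of the size formula, the sketch below plugs in the example above: four rows per sub-image, two overlap rows, and two Z planes. The row width is not given in the description, so the value 16 is an assumption, as is the function name.

```python
def circular_buffer_size(x_size, y_size, z_size, overlap_size):
    """SubImageXsize * (SubImageYsize + OverlapSize) * SubImageZSize, in elements."""
    return x_size * (y_size + overlap_size) * z_size


# Example from the text: Y = 4 rows, overlap = 2 rows, Z = 2 planes.
# x_size = 16 is a hypothetical width chosen only to get a concrete number.
size_in_elements = circular_buffer_size(x_size=16, y_size=4, z_size=2, overlap_size=2)
full_image_elements = 16 * 8 * 2   # all eight rows of both planes, for comparison
```

Under these assumptions the circular buffer holds 192 elements, compared with 256 for the whole 8-row image.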
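As shown in FIG. 5, when subImage0 is cached, space for the overlap is reserved in its layout in the circular buffer; in this embodiment this is two rows, marked "empty". After being cached, subImage0 is consumed by the next network layer. When subImage1 is cached next, each Z plane of subImage1 is stored sequentially starting from the empty space of the corresponding Z plane of subImage0, or from the last position of the corresponding Z plane, and the part it overwrites is exactly the part of subImage0 that has already been consumed. When a Z plane reaches the tail of the circular buffer, writing wraps around and overwrites from the head of the buffer. The rows of each Z plane that are not overwritten are exactly the rows required for the overlap.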
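The wrap-around behavior for a single Z plane can be traced with a small simulation. This is a sketch, not the patent's implementation: `RowRing` is a hypothetical class that tracks only which row index occupies each slot, using the R0 to R7 example with a capacity of 4 + 2 rows.

```python
class RowRing:
    """Row-granular circular buffer for one Z plane (stores row ids only)."""

    def __init__(self, capacity_rows):
        self.slots = [None] * capacity_rows   # None marks an "empty" slot
        self.next_slot = 0

    def write_rows(self, row_ids):
        for row_id in row_ids:
            self.slots[self.next_slot] = row_id
            # Wrap to the head after the last slot has been written.
            self.next_slot = (self.next_slot + 1) % len(self.slots)


ring = RowRing(capacity_rows=4 + 2)     # sub-image height + overlap rows
ring.write_rows(range(0, 4))            # subImage0: R0..R3
after_sub_image0 = list(ring.slots)
ring.write_rows(range(4, 8))            # subImage1: R4..R7
after_sub_image1 = list(ring.slots)
```

After subImage0 the two reserved slots are still empty. After subImage1, R4 and R5 fill the reserved slots, R6 and R7 overwrite the consumed rows R0 and R1, and R2 and R3 survive as the two overlap rows.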
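It should be noted that the sub-images into which one 3D image is divided do not necessarily all have the same height. Processing a sub-image requires the start address, width, height, stride, and output address of the 3D sub-image. After the sub-images have been divided, these parameters must be configured correctly.
Step S2: Divide the multi-layer network of the image processing algorithm into at least two segments, so that data between adjacent layers in each segment is exchanged only through the cache and not through DDR.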
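Specifically, an image processing model often contains multiple layers, each of which performs its own task, and adjacent layers have data dependencies. If two adjacent layers exchange data through DDR, the DDR bandwidth and latency are therefore large. If all intermediate results are cached in the buffer, a huge cache is occupied. After the division into sub-images, the intermediate results between layers are cached in units of sub-images, and it is no longer necessary to cache all intermediate results of an entire layer. The present invention therefore determines, from the size of the cache buffer, how many layers can exchange data through the cache. These layers are characterized as follows: the first layer reads its data from DDR and caches its output in the buffer, the middle layers read from the buffer and write to the buffer, and the last layer writes its data back to DDR. The layers that satisfy these conditions form a segment. That is, the result of every layer in the segment except the last is written into the SRAM cache, and every layer except the first reads its data from SRAM.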
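The segment rule can be sketched in Python. The description fixes where each layer of a segment reads and writes, which `io_plan` reproduces; it does not give an algorithm for choosing the segments, so `segment_layers` is a hypothetical greedy grouping that simply packs consecutive layers while their per-layer buffer needs fit in SRAM. Both function names and the cost model are assumptions.

```python
def segment_layers(buffer_bytes_per_layer, sram_bytes):
    """Greedily group consecutive layers whose buffers fit in SRAM together.

    Simplified cost model (an assumption): every layer in a segment is
    charged its buffer size, and a segment is closed when adding the
    next layer would exceed sram_bytes.
    """
    segments, current, used = [], [], 0
    for index, needed in enumerate(buffer_bytes_per_layer):
        if current and used + needed > sram_bytes:
            segments.append(current)
            current, used = [], 0
        current.append(index)
        used += needed
    if current:
        segments.append(current)
    return segments


def io_plan(segment):
    """(layer, source, destination) for each layer of one segment."""
    plan = []
    for position, layer in enumerate(segment):
        source = "DDR" if position == 0 else "SRAM"
        destination = "DDR" if position == len(segment) - 1 else "SRAM"
        plan.append((layer, source, destination))
    return plan


segments = segment_layers([40, 40, 40, 40], sram_bytes=100)
plan = io_plan([0, 1, 2])
```

The greedy choice is only one possibility; the description notes that different segmentations have different costs, so a real implementation would likely search over alternatives.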
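The smaller the sub-images, the less SRAM the circular buffer occupies, but also the larger the share taken by the overlap between sub-images, so the proportion of useful data falls. Exchanging data through DDR and exchanging it through SRAM therefore have different costs. Different segmentations yield different performance costs and SRAM utilization, so the best-performing solution must be found. The division into layers is also tied to the division into sub-images.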
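The trade-off can be made concrete by computing what fraction of a circular buffer is spent on overlap rows. The measure below, overlap divided by height plus overlap, is an illustrative assumption derived from the buffer size formula, not a metric defined in the patent.

```python
def overlap_share(sub_image_height, overlap_rows):
    """Fraction of the circular buffer rows that hold overlap data."""
    return overlap_rows / (sub_image_height + overlap_rows)


tall = overlap_share(sub_image_height=8, overlap_rows=2)
short = overlap_share(sub_image_height=2, overlap_rows=2)
```

With two overlap rows, an 8-row sub-image spends a fifth of its buffer on overlap, while a 2-row sub-image spends half, which is why shrinking sub-images saves SRAM at the price of a lower share of useful data.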
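Preferably, the method for reading and writing data in 3D image processing of the present invention is applied to 3D image processing in neural networks.
As shown in FIG. 6, in an embodiment, the system for reading and writing data in 3D image processing of the present invention includes a circular cache module 61 and a segment cache module 62.
The circular cache module 61 is used to divide a 3D image in the horizontal direction based on the vertical sliding technique, so that the 3D image is divided into at least two sub-images; for each sub-image, to store the processing data of the sub-image in a circular buffer; and, after the sub-image has been processed, to retain in the circular buffer the overlap data required by the next sub-image.
The segment cache module 62 is used to divide the multi-layer network of the image processing algorithm into at least two segments, so that data between adjacent layers in each segment is exchanged only through the cache and not through DDR.
It should be noted that the structure and principle of the circular cache module 61 and the segment cache module 62 correspond one to one to the steps of the above method for reading and writing data in 3D image processing, so they are not described again here.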
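It should be noted that the division of the above apparatus into modules is merely a division of logical functions; in an actual implementation the modules may be fully or partially integrated into one physical entity or may be physically separate. The modules may all be implemented as software called by a processing element, or all as hardware, or some modules may be implemented as software called by a processing element and others as hardware. For example, module x may be a separately provided processing element, or may be integrated into a chip of the above apparatus. Module x may also be stored as program code in the memory of the above apparatus and be called by a processing element of the apparatus to perform the function of module x. The other modules are implemented similarly. All or some of the modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. During implementation, each step of the above method, or each of the above modules, may be carried out by integrated logic circuits in hardware in the processor element or by instructions in the form of software. The above modules may be one or more integrated circuits configured to implement the above method, for example one or more application-specific integrated circuits (ASICs), one or more microprocessors (digital signal processors, DSPs), one or more field-programmable gate arrays (FPGAs), and so on. When one of the above modules is implemented as program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor that can call program code. The modules may be integrated together and implemented as a system-on-a-chip (SoC).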
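A computer program is stored on the storage medium of the present invention, and when executed by a processor the program implements the above data reading/writing method in 3D image processing.
Preferably, the storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, a USB flash drive, a memory card or an optical disc.
As shown in FIG. 7, in an embodiment, the terminal of the present invention includes a processor 71 and a memory 72.
The memory 72 is configured to store a computer program.
The memory 72 includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, a USB flash drive, a memory card or an optical disc.
The processor 71 is connected to the memory 72 and is configured to execute the computer program stored in the memory 72, so that the terminal performs the above data reading/writing method in 3D image processing.
Preferably, the processor 71 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP) and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.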
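In summary, the data reading/writing method and system in 3D image processing, the storage medium and the terminal of the present invention are based on the 3D vertical sliding technique and a circular buffer, which reduces the processing of overlapping parts and greatly improves cache utilization in 3D image processing under a limited cache; by analyzing the entire network, results between layers no longer have to be exchanged through the DDR under a limited cache, which reduces DDR accesses, lowers the bandwidth demanded by the image processing algorithm, and reduces read/write latency and power consumption; and in hardware design a smaller buffer area can be used. The present invention therefore effectively overcomes the various shortcomings of the prior art and has high industrial utility value.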
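The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.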

Claims (10)

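  1. A data reading/writing method in 3D image processing, characterized by comprising the following steps:
    dividing a 3D image in the horizontal direction based on a vertical sliding technique, so that the 3D image is divided into at least two sub-images; for each sub-image, storing the processing data of the sub-image in a circular buffer; after the sub-image has been processed, retaining in the circular buffer the overlap data required by the next sub-image;
    dividing the multi-layer network of an image processing algorithm into at least two segments, such that data between adjacent layers within each segment is exchanged only through the cache and not through the DDR.
  2. The data reading/writing method in 3D image processing according to claim 1, characterized in that the size of the circular buffer occupied by each sub-image is SubImageXsize*(SubImageYsize+OverlapSize)*SubImageZSize, where SubImageXsize, SubImageYsize, SubImageZSize and OverlapSize are respectively the X-direction size, the Y-direction size, the Z-direction size and the overlap size of the sub-image.
  3. The data reading/writing method in 3D image processing according to claim 1, characterized in that, in each segment, the output data of every layer except the last is written to the cache, and every layer except the first reads its data from the cache.
  4. The data reading/writing method in 3D image processing according to claim 1, characterized in that the method is applied to 3D image processing by a neural network.
  5. A data reading/writing system in 3D image processing, characterized by comprising a circular buffer module and a segment cache module;
    the circular buffer module is configured to divide a 3D image in the horizontal direction based on a vertical sliding technique, so that the 3D image is divided into at least two sub-images; for each sub-image, to store the processing data of the sub-image in a circular buffer; and, after the sub-image has been processed, to retain in the circular buffer the overlap data required by the next sub-image;
    the segment cache module is configured to divide the multi-layer network of an image processing algorithm into at least two segments, such that data between adjacent layers within each segment is exchanged only through the cache and not through the DDR.
  6. The data reading/writing system in 3D image processing according to claim 5, characterized in that the size of the circular buffer occupied by each sub-image is SubImageXsize*(SubImageYsize+OverlapSize)*SubImageZSize, where SubImageXsize, SubImageYsize, SubImageZSize and OverlapSize are respectively the X-direction size, the Y-direction size, the Z-direction size and the overlap size of the sub-image.
  7. The data reading/writing system in 3D image processing according to claim 5, characterized in that, in each segment, the output data of every layer except the last is written to the cache, and every layer except the first reads its data from the cache.
  8. The data reading/writing system in 3D image processing according to claim 5, characterized in that the system is applied to 3D image processing by a neural network.
  9. A storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the data reading/writing method in 3D image processing according to any one of claims 1 to 4.
  10. A terminal, characterized by comprising a processor and a memory;
    the memory is configured to store a computer program;
    the processor is configured to execute the computer program stored in the memory, so that the terminal performs the data reading/writing method in 3D image processing according to any one of claims 1 to 4.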
PCT/CN2019/107678 2018-10-10 2019-09-25 Data reading/writing method and system in 3D image processing, storage medium and terminal WO2020073801A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP19871524.5A EP3816867A4 (en) 2018-10-10 2019-09-25 PROCESS AND SYSTEM FOR READING / WRITING DATA IN THREE-DIMENSIONAL (3D) IMAGE PROCESSING, STORAGE MEDIA AND TERMINAL
US17/257,859 US11455781B2 (en) 2018-10-10 2019-09-25 Data reading/writing method and system in 3D image processing, storage medium and terminal
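JP2021520315A JP7201802B2 (ja) 2018-10-10 2019-09-25 Data reading/writing method and system in 3D image processing, storage medium and terminal
KR1020217014106A KR20210070369A (ko) 2018-10-10 2019-09-25 Data reading/writing method and system in 3D image processing, storage medium and terminal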

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811179323.6 2018-10-10
CN201811179323.6A CN111028360B (zh) 2018-10-10 2018-10-10 Data reading/writing method and system in 3D image processing, storage medium and terminal

Publications (1)

Publication Number Publication Date
WO2020073801A1 true WO2020073801A1 (zh) 2020-04-16

Family

ID=70164275

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/107678 WO2020073801A1 (zh) 2018-10-10 2019-09-25 Data reading/writing method and system in 3D image processing, storage medium and terminal

Country Status (6)

Country Link
US (1) US11455781B2 (zh)
EP (1) EP3816867A4 (zh)
JP (1) JP7201802B2 (zh)
KR (1) KR20210070369A (zh)
CN (1) CN111028360B (zh)
WO (1) WO2020073801A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036149A (zh) * 2020-12-01 2023-11-10 华为技术有限公司 Image processing method and chip
CN112541929A (zh) * 2021-01-25 2021-03-23 翱捷科技股份有限公司 Image processing method and system for a convolutional neural network
WO2023033759A1 (en) * 2021-09-03 2023-03-09 Aselsan Elektroni̇k Sanayi̇ Ve Ti̇caret Anoni̇m Şi̇rketi̇ A method to accelerate deep learning applications for embedded environments
US11972504B2 (en) 2022-08-10 2024-04-30 Zhejiang Lab Method and system for overlapping sliding window segmentation of image based on FPGA
CN115035128B (zh) * 2022-08-10 2022-11-08 之江实验室 FPGA-based method and system for overlapping sliding window segmentation of images

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101859280A (zh) * 2010-06-03 2010-10-13 杭州海康威视软件有限公司 Parallel transmission and computation method and system for two-dimensional image data
US20150228106A1 (en) * 2014-02-13 2015-08-13 Vixs Systems Inc. Low latency video texture mapping via tight integration of codec engine with 3d graphics engine
CN108475347A (zh) * 2017-11-30 2018-08-31 深圳市大疆创新科技有限公司 Neural network processing method, apparatus, accelerator, system and movable device
CN108629734A (zh) * 2017-03-23 2018-10-09 展讯通信(上海)有限公司 Image geometric transformation method, apparatus and terminal

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007102116A1 (en) * 2006-03-06 2007-09-13 Nxp B.V. Addressing on chip memory for block operations
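CN101080009B (zh) * 2007-07-17 2011-02-23 智原科技股份有限公司 Deblocking filtering method and apparatus for use in an image codec
CN104281543B (zh) * 2013-07-01 2017-12-26 图芯芯片技术(上海)有限公司 Architecture method supporting simultaneous memory access by a display controller and a graphics accelerator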
US10944911B2 (en) * 2014-10-24 2021-03-09 Texas Instruments Incorporated Image data processing for digital overlap wide dynamic range sensors
JP6766557B2 (ja) * 2016-09-29 2020-10-14 アイシン精機株式会社 周辺監視装置
JP6936592B2 (ja) * 2017-03-03 2021-09-15 キヤノン株式会社 演算処理装置およびその制御方法
CN107679621B (zh) * 2017-04-19 2020-12-08 赛灵思公司 Artificial neural network processing device
US11373266B2 (en) * 2017-05-05 2022-06-28 Intel Corporation Data parallelism and halo exchange for distributed machine learning
CN107454364B (zh) * 2017-06-16 2020-04-24 国电南瑞科技股份有限公司 Distributed real-time image acquisition and processing system for video surveillance
US20190057060A1 (en) * 2017-08-19 2019-02-21 Wave Computing, Inc. Reconfigurable fabric data routing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LEONARDO, MARTINS: "Accelerating Curvature Estimate in 3D Seismic Data Using GPGPU", 2014 IEEE 26TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING, 30 November 2014 (2014-11-30), XP032696183, ISSN: 1550-6533 *

Also Published As

Publication number Publication date
US11455781B2 (en) 2022-09-27
JP7201802B2 (ja) 2023-01-10
KR20210070369A (ko) 2021-06-14
JP2022508028A (ja) 2022-01-19
EP3816867A1 (en) 2021-05-05
EP3816867A4 (en) 2021-09-15
CN111028360A (zh) 2020-04-17
CN111028360B (zh) 2022-06-14
US20210295607A1 (en) 2021-09-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19871524; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 2019871524; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2019871524; Country of ref document: EP; Effective date: 20210128)
ENP Entry into the national phase (Ref document number: 2021520315; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 20217014106; Country of ref document: KR; Kind code of ref document: A)